[jsr369-experts] Re: PR Feedback: 173-RequestDataEncoding

From: Stuart Douglas <sdouglas_at_redhat.com>
Date: Thu, 27 Apr 2017 07:46:45 +1000

On Wed, Apr 26, 2017 at 8:08 PM, Mark Thomas <markt_at_apache.org> wrote:
> On 25/04/17 23:52, Stuart Douglas wrote:
>> I don't think this is misleading at all. When we talk about 'parsing
>> the data stream' we do not explicitly state that the incoming data
>> stream must be treated as a series of code points in the specified
>> charset.
>>
>> What this actually means for a form encoded parser is that it will
>> parse the form encoded data as US-ASCII, but decode all the resulting
>> name/value pairs into the specified charset. I think this is well
>> understood by developers, as this is how setCharacterEncoding works.
>>
>> If we do want to clarify anything here (which I am not convinced is
>> necessary) IMHO we should state exactly what this affects, namely:
>> - The reader returned from getReader() will decode into this charset
>> - Request parameters from a post body will be decoded into this
>> charset after they have been parsed from the request
>
> +1
>
> The problem is that without the character encoding the server is left to
> guess which encoding was used to convert the non US-ASCII characters
> into %nn values.
>
> How about something along these lines:
>
> "Currently, many browsers do not send a char encoding
> qualifier with the Content-Type header, leaving open the
> determination of the character encoding that should be used
> to decode any %nn sequences in an
> "application/x-www-form-urlencoded" encoded request body.
> The default encoding the container
> uses to create the request reader and parse POST data must be
> ISO-8859-1 if none has been specified by the client request,
> web application or container vendor specific configuration
> (for all web applications in the container). However, in order
> to indicate to the developer, in this case, the failure of the
> client to send a character encoding, the container returns
> null from the getCharacterEncoding method."
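
To make the %nn point concrete, here is a minimal Java sketch. The
class and method names are made up for illustration, and it handles a
single name=value pair rather than being a full form parser. The pair
is split on the ASCII '=' delimiter first, and only then are the %nn
sequences decoded with the chosen charset, so the same US-ASCII bytes
on the wire can yield different strings:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class FormDecodeSketch {

    // Hypothetical helper: splits one "name=value" pair on the ASCII '='
    // delimiter, then percent-decodes each half using the given charset.
    public static String[] decodePair(String rawPair, String charsetName) {
        int eq = rawPair.indexOf('=');
        String rawName = eq < 0 ? rawPair : rawPair.substring(0, eq);
        String rawValue = eq < 0 ? "" : rawPair.substring(eq + 1);
        try {
            return new String[] {
                URLDecoder.decode(rawName, charsetName),
                URLDecoder.decode(rawValue, charsetName)
            };
        } catch (UnsupportedEncodingException e) {
            // Unknown charset name supplied by the caller.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String body = "name=caf%C3%A9"; // pure US-ASCII on the wire
        // Same input, two different charsets for the %nn sequences.
        System.out.println(decodePair(body, "UTF-8")[1]);
        System.out.println(decodePair(body, "ISO-8859-1")[1]);
    }
}
```

With UTF-8 the two octets C3 A9 decode to the single character U+00E9;
with ISO-8859-1 they decode to the two characters U+00C3 U+00A9. That
is the guess the server is left with when no encoding is given.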

I think the last sentence of the proposed text needs to be clarified.
Does getCharacterEncoding() return null:

1) If the encoding defaults to ISO-8859-1 because nothing was specified
or
2) If the client did not send a character encoding

"the failure of the client to send a character encoding, the container
returns null" implies option 2). However, I don't think this is made
explicit, because the "in this case" appears to refer to the previous
sentence, which talks about defaulting to ISO-8859-1.
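
A rough Java sketch of where the two readings diverge. Everything here
is hypothetical and for illustration only: the class, the field names
and the lookup order are assumptions that simply follow the order in
which the proposed text lists the sources, not spec text or any
container's actual behaviour.

```java
public class NullEncodingReadings {

    // Hypothetical model of where an encoding may come from; null = not set.
    final String fromClient;     // charset parameter on Content-Type
    final String fromWebApp;     // web application configuration
    final String fromContainer;  // vendor-specific container configuration

    public NullEncodingReadings(String fromClient, String fromWebApp,
                                String fromContainer) {
        this.fromClient = fromClient;
        this.fromWebApp = fromWebApp;
        this.fromContainer = fromContainer;
    }

    // Encoding actually used for the reader and for POST parameters.
    public String effectiveEncoding() {
        if (fromClient != null) return fromClient;
        if (fromWebApp != null) return fromWebApp;
        if (fromContainer != null) return fromContainer;
        return "ISO-8859-1"; // nothing specified anywhere
    }

    // Reading 1: null only when the ISO-8859-1 fallback was reached.
    public String reading1() {
        boolean nothingSpecified =
            fromClient == null && fromWebApp == null && fromContainer == null;
        return nothingSpecified ? null : effectiveEncoding();
    }

    // Reading 2: null whenever the client sent no charset, even if the
    // web application or container configured one.
    public String reading2() {
        return fromClient == null ? null : effectiveEncoding();
    }
}
```

The two readings only differ when the client sends no charset but the
web application or container has one configured: with
new NullEncodingReadings(null, "UTF-8", null), reading1() gives "UTF-8"
while reading2() gives null.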

Stuart

>
> Mark
>
>
>>
>> Stuart
>>
>> On Wed, Apr 26, 2017 at 7:52 AM, Edward Burns <edward.burns_at_oracle.com> wrote:
>>> Hello Volunteers,
>>>
>>> Julian Reschke, one of the authors of RFC 7231, filed two JIRAs today
>>> against the Public Review. One of them was trivial and I fixed it.
>>>
>>> The other one I'd like to run by you before fixing.
>>>
>>>>>>>> On Tue, 25 Apr 2017 15:25:32 +0000 (UTC), "reschke (JIRA)" <jira-no-reply_at_java.net> said:
>>>
>>> JR> URL: https://java.net/jira/browse/SERVLET_SPEC-173
>>>
>>> He quotes some text from 3.12 Request data encoding:
>>>
>>> Spec3.12> "Currently, many browsers do not send a char encoding
>>> Spec3.12> qualifier with the Content-Type header, leaving open the
>>> Spec3.12> determination of the character encoding for reading HTTP
>>> Spec3.12> requests. The default encoding of a request the container
>>> Spec3.12> uses to create the request reader and parse POST data must be
>>> Spec3.12> ISO-8859-1 if none has been specified by the client request,
>>> Spec3.12> web application or container vendor specific configuration
>>> Spec3.12> (for all web applications in the container). However, in order
>>> Spec3.12> to indicate to the developer, in this case, the failure of the
>>> Spec3.12> client to send a character encoding, the container returns
>>> Spec3.12> null from the getCharacterEncoding method."
>>>
>>> JR> That is very misleading.
>>>
>>> JR> From an HTTP payload point of view, the actual character encoding
>>> JR> for "application/x-www-form-urlencoded", as defined in
>>> JR> <https://www.w3.org/TR/html5/forms.html#application/x-www-form-urlencoded-encoding-algorithm>
>>> JR> is *always* US-ASCII. Period.
>>>
>>> Indeed, step 5 of the encoding algorithm is
>>>
>>> HTML5> 5. Encode result as US-ASCII and return the resulting byte stream.
>>>
>>> JR> The octet representation of non-US-ASCII characters is *always*
>>> JR> percent-encoded - this means that whatever the HTTP payload header
>>> JR> fields describes is totally irrelevant for this content type (as
>>> JR> long as it is an USASCII-compatible encoding).
>>>
>>> JR> It may not be possible to change the ISO-8859-1 default, but note
>>> JR> that the HTTP spec never ever said that this actually is the default
>>> JR> (I believe earlier versions of the servlet spec pretended that this
>>> JR> was the case).
>>>
>>> Though it's not exactly clear what he wants us to do, I propose the
>>> following.
>>>
>>> PROPOSAL:
>>>
>>> Modify the "very misleading" text to be the following:
>>>
>>> Spec3.12> "Currently, many browsers do not send a char encoding
>>> Spec3.12> qualifier with the Content-Type header, leaving open the
>>> Spec3.12> determination of the character encoding for reading HTTP
>>> Spec3.12> requests.
>>>
>>> In this case, if the Content-Type is application/x-www-form-urlencoded,
>>> the default encoding the container uses to create the request reader and
>>> parse POST data must be US-ASCII. For any other Content-Type, if none
>>> has been specified by the client request, web application or container
>>> vendor specific configuration (for all web applications in the
>>> container), the
>>>
>>> Spec3.12> default encoding of a request the container uses to create the
>>> Spec3.12> request reader and parse POST data must be ISO-8859-1.
>>> Spec3.12> However, in order to indicate to the developer, in this
>>> Spec3.12> case, the failure of the client to send a character encoding,
>>> Spec3.12> the container returns null from the getCharacterEncoding
>>> Spec3.12> method."
>>>
>>> ------------
>>>
>>> So basically the operative change is to explicitly call out the
>>> Content-Type of application/x-www-form-urlencoded and say that US-ASCII
>>> must be used to parse the request reader and parse the POST data.
>>>
>>> ACTION: Please let me know your thoughts on this by start of business
>>> PDT Friday 28 April 2017. In the absence of a response I'll change the
>>> text of 3.12.
>>>
>>> Thanks,
>>>
>>> Ed
>>>
>>> --
>>> | edward.burns_at_oracle.com | office: +1 407 458 0017
>