[jsr369-experts] Re: PR Feedback: 173-RequestDataEncoding

From: Mark Thomas <markt_at_apache.org>
Date: Wed, 26 Apr 2017 11:08:19 +0100

On 25/04/17 23:52, Stuart Douglas wrote:
> I don't think this is misleading at all. When we talk about 'parsing
> the data stream' we do not explicitly state that the incoming data
> stream must be treated as a series of code points in the specified
> charset.
>
> What this actually means for a form encoded parser is that it will
> parse the form encoded data as US-ASCII, but decode all the resulting
> name/value pairs into the specified charset. I think this is well
> understood by developers, as this is how setCharacterEncoding works.
>
> If we do want to clarify anything here (which I am not convinced is
> necessary) IMHO we should state exactly what this affects, namely:
> - The reader returned from getReader() will decode into this charset
> - Request parameters from a post body will be decoded into this
> charset after they have been parsed from the request

+1

The problem is that, without the character encoding, the server is left
to guess which encoding was used to convert the non-US-ASCII characters
into %nn values.
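
To make the guesswork concrete, here is a minimal sketch of the two-step
handling Stuart describes: split the body on the ASCII delimiters, then
decode the %nn sequences using a chosen charset. FormBodySketch and its
parse method are hypothetical names used only for illustration; they are
not part of the Servlet API, and a real container's parser will differ.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Servlet API.
class FormBodySketch {

    // Splits an application/x-www-form-urlencoded body on the ASCII
    // delimiters '&' and '=', then percent-decodes each name and value
    // using the given charset. A repeated name keeps only its last value.
    static Map<String, String> parse(String asciiBody, Charset charset) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : asciiBody.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String rawName = eq < 0 ? pair : pair.substring(0, eq);
            String rawValue = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(rawName, charset), decode(rawValue, charset));
        }
        return params;
    }

    // URLDecoder turns '+' into a space and %nn into bytes, then
    // interprets those bytes in the named charset.
    private static String decode(String raw, Charset charset) {
        try {
            return URLDecoder.decode(raw, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Cannot happen for a Charset instance that already exists.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The client percent-encoded the UTF-8 bytes of "café".
        String body = "name=caf%C3%A9";
        System.out.println(parse(body, StandardCharsets.UTF_8).get("name"));
        System.out.println(parse(body, StandardCharsets.ISO_8859_1).get("name"));
    }
}
```

The body is identical in both calls and is pure US-ASCII on the wire.
Only the charset used for the %nn step changes the result: UTF-8 yields
"café", while ISO-8859-1 turns the same two bytes into the two
characters "Ã©".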

How about something along these lines:

"Currently, many browsers do not send a char encoding
qualifier with the Content-Type header, leaving open the
determination of the character encoding that should be used
to decode any %nn sequences in an
"application/x-www-form-urlencoded" encoded request body.
The default encoding the container uses to create the request
reader and parse POST data must be ISO-8859-1 if none has been
specified by the client request, web application or container
vendor specific configuration (for all web applications in the
container). However, in order to indicate to the developer, in
this case, the failure of the client to send a character
encoding, the container returns null from the
getCharacterEncoding method."
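
A rough sketch of the rule in that text, to show how the reported value
and the value actually used for decoding differ. RequestEncodingSketch
and both method names are hypothetical, and the precedence order
(request, then web application, then container) is an assumption; the
text above lists the three sources without ordering them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the Servlet API.
class RequestEncodingSketch {

    // What getCharacterEncoding would report: the first configured value,
    // or null when nothing was specified anywhere.
    // Assumed precedence: request, then web application, then container.
    static String reported(String fromRequest, String fromWebApp, String fromContainer) {
        if (fromRequest != null) {
            return fromRequest;
        }
        if (fromWebApp != null) {
            return fromWebApp;
        }
        return fromContainer;
    }

    // What the container would actually use to build the reader and to
    // decode %nn sequences: the reported value, else ISO-8859-1.
    static Charset effective(String fromRequest, String fromWebApp, String fromContainer) {
        String name = reported(fromRequest, fromWebApp, fromContainer);
        return name == null ? StandardCharsets.ISO_8859_1 : Charset.forName(name);
    }

    public static void main(String[] args) {
        // Nothing specified: reported is null, yet decoding uses ISO-8859-1.
        System.out.println(reported(null, null, null));
        System.out.println(effective(null, null, null));
    }
}
```

The point of the split is that a null from the first method tells the
developer the client sent nothing, while the second method still gives
the parser a definite charset to work with.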

Mark

>
> Stuart
>
> On Wed, Apr 26, 2017 at 7:52 AM, Edward Burns <edward.burns_at_oracle.com> wrote:
>> Hello Volunteers,
>>
>> Julian Reschke, one of the authors of RFC 7231, filed two JIRAs today
>> against the Public Review. One of them was trivial and I fixed it.
>>
>> The other one I'd like to run by you before fixing.
>>
>>>>>>> On Tue, 25 Apr 2017 15:25:32 +0000 (UTC), "reschke (JIRA)" <jira-no-reply_at_java.net> said:
>>
>> JR> URL: https://java.net/jira/browse/SERVLET_SPEC-173
>>
>> He quotes some text from 3.12 Request data encoding:
>>
>> Spec3.12> "Currently, many browsers do not send a char encoding
>> Spec3.12> qualifier with the Content-Type header, leaving open the
>> Spec3.12> determination of the character encoding for reading HTTP
>> Spec3.12> requests. The default encoding of a request the container
>> Spec3.12> uses to create the request reader and parse POST data must be
>> Spec3.12> ISO-8859-1 if none has been specified by the client request,
>> Spec3.12> web application or container vendor specific configuration
>> Spec3.12> (for all web applications in the container). However, in order
>> Spec3.12> to indicate to the developer, in this case, the failure of the
>> Spec3.12> client to send a character encoding, the container returns
>> Spec3.12> null from the getCharacterEncoding method."
>>
>> JR> That is very misleading.
>>
>> JR> From an HTTP payload point of view, the actual character encoding
>> JR> for "application/x-www-form-urlencoded", as defined in
>> JR> <https://www.w3.org/TR/html5/forms.html#application/x-www-form-urlencoded-encoding-algorithm>
>> JR> is *always* US-ASCII. Period.
>>
>> Indeed, step 5 of the encoding algorithm is
>>
>> HTML5> 5. Encode result as US-ASCII and return the resulting byte stream.
>>
>> JR> The octet representation of non-US-ASCII characters is *always*
>> JR> percent-encoded - this means that whatever the HTTP payload header
>> JR> fields describe is totally irrelevant for this content type (as
>> JR> long as it is a US-ASCII-compatible encoding).
>>
>> JR> It may not be possible to change the ISO-8859-1 default, but note
>> JR> that the HTTP spec never ever said that this actually is the default
>> JR> (I believe earlier versions of the servlet spec pretended that this
>> JR> was the case).
>>
>> Though it's not exactly clear what he wants us to do, I propose the
>> following.
>>
>> PROPOSAL:
>>
>> Modify the "very misleading" text to be the following:
>>
>> Spec3.12> "Currently, many browsers do not send a char encoding
>> Spec3.12> qualifier with the Content-Type header, leaving open the
>> Spec3.12> determination of the character encoding for reading HTTP
>> Spec3.12> requests.
>>
>> In this case, if the Content-Type is application/x-www-form-urlencoded,
>> the default encoding the container uses to create the request reader and
>> parse POST data must be US-ASCII. For any other Content-Type, if none
>> has been specified by the client request, web application or container
>> vendor specific configuration (for all web applications in the
>> container), the
>>
>> Spec3.12> default encoding of a request the container uses to create the
>> Spec3.12> request reader and parse POST data must be ISO-8859-1.
>> Spec3.12> However, in order to indicate to the developer, in this
>> Spec3.12> case, the failure of the client to send a character encoding,
>> Spec3.12> the container returns null from the getCharacterEncoding
>> Spec3.12> method."
>>
>> ------------
>>
>> So basically the operative change is to explicitly call out the
>> Content-Type of application/x-www-form-urlencoded and say that US-ASCII
>> must be used to create the request reader and parse the POST data.
>>
>> ACTION: Please let me know your thoughts on this by start of business
>> PDT Friday 28 April 2017. In the absence of a response I'll change the
>> text of 3.12.
>>
>> Thanks,
>>
>> Ed
>>
>> --
>> | edward.burns_at_oracle.com | office: +1 407 458 0017