[jsr369-experts] Re: PR Feedback: 173-RequestDataEncoding

From: Stuart Douglas <sdouglas_at_redhat.com>
Date: Wed, 26 Apr 2017 08:52:13 +1000

I don't think this is misleading at all. When we talk about 'parsing
the data stream' we do not explicitly state that the incoming data
stream must be treated as a series of code points in the specified
charset.

What this actually means for a form encoded parser is that it will
parse the form encoded data as US-ASCII, but decode all the resulting
name/value pairs into the specified charset. I think this is well
understood by developers, as this is how setCharacterEncoding works.
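
To make the two steps concrete, here is a minimal sketch in plain Java,
not taken from any container. The class name FormDecodeSketch and its
parse() helper are hypothetical. It assumes one value per name (a real
container keeps all values for a repeated name) and uses
java.net.URLDecoder for the percent-decoding step.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not a container implementation.
class FormDecodeSketch {

    // Parse a form-encoded body, then decode each name/value pair
    // using the given charset.
    static Map<String, String> parse(byte[] body, Charset charset) {
        // Step 1: the wire format is US-ASCII, so the '&' and '='
        // delimiters are located without reference to the charset.
        String raw = new String(body, StandardCharsets.US_ASCII);

        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : raw.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            // Step 2: only now does the charset matter, when the
            // percent-encoded octets are turned into characters.
            // Simplification: a repeated name overwrites the earlier value.
            params.put(decode(name, charset), decode(value, charset));
        }
        return params;
    }

    private static String decode(String s, Charset charset) {
        try {
            return URLDecoder.decode(s, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Cannot happen for a Charset the JVM already resolved.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = "name=caf%C3%A9&city=K%C3%B6ln"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(parse(body, StandardCharsets.UTF_8));
        System.out.println(parse(body, StandardCharsets.ISO_8859_1));
    }
}
```

The same US-ASCII bytes yield different parameter values depending on
the charset passed in, which is the only place the request encoding
has an effect in this sketch.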

If we do want to clarify anything here (which I am not convinced is
necessary) IMHO we should state exactly what this affects, namely:
- The reader returned from getReader() will decode into this charset
- Request parameters from a post body will be decoded into this
charset after they have been parsed from the request
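
For the first bullet, a rough sketch of what "the reader will decode
into this charset" amounts to, again in plain Java without the servlet
API. ReaderDecodeSketch and readBody() are hypothetical names, and the
fallback to ISO-8859-1 when no charset is given is taken from the
3.12 text quoted below.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not a container implementation.
class ReaderDecodeSketch {

    // Read the whole body through a Reader that decodes with the
    // request charset, or ISO-8859-1 if none was specified.
    static String readBody(InputStream in, Charset charset) {
        Charset effective =
                charset != null ? charset : StandardCharsets.ISO_8859_1;
        StringBuilder sb = new StringBuilder();
        try (Reader reader =
                new BufferedReader(new InputStreamReader(in, effective))) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two octets below are the UTF-8 encoding of U+00E9.
        byte[] body = {(byte) 0xC3, (byte) 0xA9};
        System.out.println(
                readBody(new ByteArrayInputStream(body), StandardCharsets.UTF_8));
        System.out.println(
                readBody(new ByteArrayInputStream(body), null));
    }
}
```

With UTF-8 the two octets decode to a single character; with the
ISO-8859-1 fallback they decode to two characters.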

Stuart

On Wed, Apr 26, 2017 at 7:52 AM, Edward Burns <edward.burns_at_oracle.com> wrote:
> Hello Volunteers,
>
> Julian Reschke, one of the authors of RFC 7231, filed two JIRAs today
> against the Public Review. One of them was trivial and I fixed it.
>
> The other one I'd like to run by you before fixing.
>
>>>>>> On Tue, 25 Apr 2017 15:25:32 +0000 (UTC), "reschke (JIRA)" <jira-no-reply_at_java.net> said:
>
> JR> URL: https://java.net/jira/browse/SERVLET_SPEC-173
>
> He quotes some text from 3.12 Request data encoding:
>
> Spec3.12> "Currently, many browsers do not send a char encoding
> Spec3.12> qualifier with the Content-Type header, leaving open the
> Spec3.12> determination of the character encoding for reading HTTP
> Spec3.12> requests. The default encoding of a request the container
> Spec3.12> uses to create the request reader and parse POST data must be
> Spec3.12> ISO-8859-1 if none has been specified by the client request,
> Spec3.12> web application or container vendor specific configuration
> Spec3.12> (for all web applications in the container). However, in order
> Spec3.12> to indicate to the developer, in this case, the failure of the
> Spec3.12> client to send a character encoding, the container returns
> Spec3.12> null from the getCharacterEncoding method."
>
> JR> That is very misleading.
>
> JR> From an HTTP payload point of view, the actual character encoding
> JR> for "application/x-www-form-urlencoded", as defined in
> JR> <https://www.w3.org/TR/html5/forms.html#application/x-www-form-urlencoded-encoding-algorithm>
> JR> is *always* US-ASCII. Period.
>
> Indeed, step 5 of the encoding algorithm is
>
> HTML5> 5. Encode result as US-ASCII and return the resulting byte stream.
>
> JR> The octet representation of non-US-ASCII characters is *always*
> JR> percent-encoded - this means that whatever the HTTP payload header
> JR> fields describes is totally irrelevant for this content type (as
> JR> long as it is an USASCII-compatible encoding).
>
> JR> It may not be possible to change the ISO-8859-1 default, but note
> JR> that the HTTP spec never ever said that this actually is the default
> JR> (I believe earlier versions of the servlet spec pretended that this
> JR> was the case).
>
> Though it's not exactly clear what he wants us to do, I propose the
> following.
>
> PROPOSAL:
>
> Modify the "very misleading" text to be the following:
>
> Spec3.12> "Currently, many browsers do not send a char encoding
> Spec3.12> qualifier with the Content-Type header, leaving open the
> Spec3.12> determination of the character encoding for reading HTTP
> Spec3.12> requests.
>
> In this case, if the Content-Type is application/x-www-form-urlencoded,
> the default encoding the container uses to create the request reader and
> parse POST data must be US-ASCII. For any other Content-Type, if none
> has been specified by the client request, web application or container
> vendor specific configuration (for all web applications in the
> container), the
>
> Spec3.12> default encoding of a request the container uses to create the
> Spec3.12> request reader and parse POST data must be ISO-8859-1.
> Spec3.12> However, in order to indicate to the developer, in this
> Spec3.12> case, the failure of the client to send a character encoding,
> Spec3.12> the container returns null from the getCharacterEncoding
> Spec3.12> method."
>
> ------------
>
> So basically the operative change is to explicitly call out the
> Content-Type of application/x-www-form-urlencoded and say that US-ASCII
> must be used to create the request reader and parse the POST data.
>
> ACTION: Please let me know your thoughts on this by start of business
> PDT Friday 28 April 2017. In the absence of a response I'll change the
> text of 3.12.
>
> Thanks,
>
> Ed
>
> --
> | edward.burns_at_oracle.com | office: +1 407 458 0017