[jsr369-experts] PR Feedback: 173-RequestDataEncoding

From: Edward Burns <edward.burns_at_oracle.com>
Date: Tue, 25 Apr 2017 14:52:41 -0700

Hello Volunteers,

Julian Reschke, one of the authors of RFC 7231, filed two JIRAs today
against the Public Review. One of them was trivial and I fixed it.

The other one I'd like to run by you before fixing.

>>>>> On Tue, 25 Apr 2017 15:25:32 +0000 (UTC), "reschke (JIRA)" <jira-no-reply_at_java.net> said:

JR> URL: https://java.net/jira/browse/SERVLET_SPEC-173

He quotes some text from section 3.12, "Request data encoding":

Spec3.12> "Currently, many browsers do not send a char encoding
Spec3.12> qualifier with the Content-Type header, leaving open the
Spec3.12> determination of the character encoding for reading HTTP
Spec3.12> requests. The default encoding of a request the container
Spec3.12> uses to create the request reader and parse POST data must be
Spec3.12> ISO-8859-1 if none has been specified by the client request,
Spec3.12> web application or container vendor specific configuration
Spec3.12> (for all web applications in the container). However, in order
Spec3.12> to indicate to the developer, in this case, the failure of the
Spec3.12> client to send a character encoding, the container returns
Spec3.12> null from the getCharacterEncoding method."

JR> That is very misleading.

JR> From an HTTP payload point of view, the actual character encoding
JR> for "application/x-www-form-urlencoded", as defined in
JR> <https://www.w3.org/TR/html5/forms.html#application/x-www-form-urlencoded-encoding-algorithm>
JR> is *always* US-ASCII. Period.

Indeed, step 5 of the encoding algorithm is

HTML5> 5. Encode result as US-ASCII and return the resulting byte stream.

JR> The octet representation of non-US-ASCII characters is *always*
JR> percent-encoded - this means that whatever the HTTP payload header
JR> fields describes is totally irrelevant for this content type (as
JR> long as it is an USASCII-compatible encoding).
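
To make the point above concrete, here is a minimal sketch (not part of
the JIRA or the spec). The class name FormPayloadDemo and its helpers are
hypothetical; URLEncoder.encode(String, Charset) assumes Java 10 or
later. It shows that the form-encoded payload is pure US-ASCII whichever
charset the client used, and that only the percent-escaped octets differ.

```java
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FormPayloadDemo {
    // Percent-encode a form value; the charset only decides which octets
    // stand behind the %XX escapes for non-ASCII characters.
    static String encode(String value, Charset charset) {
        return URLEncoder.encode(value, charset);
    }

    // True if every character of the payload is in the US-ASCII range.
    static boolean isAscii(String payload) {
        return payload.chars().allMatch(c -> c < 0x80);
    }

    public static void main(String[] args) {
        String value = "caf\u00e9"; // "cafe" with an acute accent on the e

        String utf8 = encode(value, StandardCharsets.UTF_8);
        String latin1 = encode(value, StandardCharsets.ISO_8859_1);

        System.out.println(utf8 + " ascii=" + isAscii(utf8));     // caf%C3%A9 ascii=true
        System.out.println(latin1 + " ascii=" + isAscii(latin1)); // caf%E9 ascii=true
    }
}
```

Both payloads can be read as US-ASCII bytes; what the receiver still has
to decide is which charset to apply to the octets after percent-decoding.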

JR> It may not be possible to change the ISO-8859-1 default, but note
JR> that the HTTP spec never ever said that this actually is the default
JR> (I believe earlier versions of the servlet spec pretended that this
JR> was the case).

Though it's not exactly clear what he wants us to do, I propose the
following.

PROPOSAL:

Modify the "very misleading" text to be the following:

Spec3.12> "Currently, many browsers do not send a char encoding
Spec3.12> qualifier with the Content-Type header, leaving open the
Spec3.12> determination of the character encoding for reading HTTP
Spec3.12> requests.

In this case, if the Content-Type is application/x-www-form-urlencoded,
the default encoding the container uses to create the request reader and
parse POST data must be US-ASCII. For any other Content-Type, if none
has been specified by the client request, web application or container
vendor specific configuration (for all web applications in the
container), the

Spec3.12> default encoding of a request the container uses to create the
Spec3.12> request reader and parse POST data must be ISO-8859-1.
Spec3.12> However, in order to indicate to the developer, in this
Spec3.12> case, the failure of the client to send a character encoding,
Spec3.12> the container returns null from the getCharacterEncoding
Spec3.12> method."

------------

So basically the operative change is to explicitly call out the
Content-Type of application/x-www-form-urlencoded and say that, when the
client sends no character encoding, US-ASCII must be used to create the
request reader and parse the POST data.
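
For discussion, one possible reading of the proposed rule is sketched
below. This is not container code: DefaultEncodingSketch,
charsetParam and effectiveEncoding are hypothetical names, and the
"configured" argument stands in for whatever the web application or
container configuration supplies. It also assumes the form-urlencoded
case takes precedence over configuration whenever the client sent no
charset, which the proposed wording does not spell out.

```java
import java.util.Locale;

public class DefaultEncodingSketch {
    // Return the charset parameter of a Content-Type value, or null if absent.
    static String charsetParam(String contentType) {
        if (contentType == null) return null;
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String v = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (v.length() >= 2 && v.startsWith("\"") && v.endsWith("\"")) {
                    v = v.substring(1, v.length() - 1);
                }
                return v.isEmpty() ? null : v;
            }
        }
        return null;
    }

    // Encoding used to create the reader and parse POST data under the
    // proposal (assumed reading). getCharacterEncoding() would still return
    // null when the client sent no charset; that is not modeled here.
    static String effectiveEncoding(String contentType, String configured) {
        String fromClient = charsetParam(contentType);
        if (fromClient != null) return fromClient;

        String ct = contentType == null ? "" : contentType;
        int semi = ct.indexOf(';');
        String mediaType = (semi >= 0 ? ct.substring(0, semi) : ct)
                .trim().toLowerCase(Locale.ROOT);

        if (mediaType.equals("application/x-www-form-urlencoded")) {
            return "US-ASCII";      // new case called out by the proposal
        }
        if (configured != null) return configured;
        return "ISO-8859-1";        // existing default for everything else
    }
}
```

Under this reading, effectiveEncoding("application/x-www-form-urlencoded",
null) gives US-ASCII, while effectiveEncoding("text/plain", null) keeps the
existing ISO-8859-1 default.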

ACTION: Please let me know your thoughts on this by start of business
PDT Friday 28 April 2017. In the absence of a response I'll change the
text of 3.12 as proposed above.

Thanks,

Ed

-- 
| edward.burns_at_oracle.com | office: +1 407 458 0017