[jsr369-experts] Re: PR Feedback: 173-RequestDataEncoding

From: Stuart Douglas <sdouglas_at_redhat.com>
Date: Sat, 29 Apr 2017 07:07:34 +1000

On Sat, Apr 29, 2017 at 6:44 AM, Edward Burns <edward.burns_at_oracle.com> wrote:
>>>>>> On Wed, 26 Apr 2017 08:52:13 +1000, Stuart Douglas <sdouglas_at_redhat.com> said:
>
> SD> If we do want to clarify anything here (which I am not convinced is
> SD> necessary) IMHO we should state exactly what this affects, namely:
> SD> - The reader returned from getReader() will decode into this charset
> SD> - Request parameters from a post body will be decoded into this
> SD> charset after they have been parsed from the request
>
>>>>>> On Wed, 26 Apr 2017 11:08:19 +0100, Mark Thomas <markt_at_apache.org> said:
>
> MT> +1
>
> MT> The problem is that without the character encoding the server is left to
> MT> guess which encoding was used to convert the non US-ASCII characters
> MT> into %nn values.
>
> I know I said this was fine, but after re-reading the HTML5 section
> Mr. Reschke quotes
> <https://www.w3.org/TR/html5/forms.html#application/x-www-form-urlencoded-encoding-algorithm>
> I don't agree with your saying that the server is left to guess. The
> whole application/x-www-form-urlencoded encoding algorithm, with items
> 4.5, 4.5.2 and 5 in particular, very clearly states that everything will
> be in US-ASCII, including the %nn. There will be no non-US-ASCII
> characters if that algorithm is correctly used to produce the bytes sent
> to the server.

Yes, however the form data parser still needs to handle the
percent-encoded values, and to do that we need to know which charset to
decode them to. The alternative would be to return the percent-encoded
values to the user, which would be a breaking change that would cause a
lot of problems for existing applications.
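
To illustrate the point, here is a minimal sketch using the JDK's
java.net.URLDecoder rather than any container's actual form parser. The
class name PercentDecodeSketch and the sample value are made up for
illustration. The same US-ASCII input yields different strings depending
on which charset the %nn bytes are decoded with:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PercentDecodeSketch {
    // Decode a percent-encoded form value, interpreting the %nn bytes
    // in the given charset. Hypothetical helper, not a servlet API.
    static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Every character on the wire is US-ASCII.
        String raw = "caf%C3%A9";

        // Decoded as UTF-8, the two bytes C3 A9 form one character (U+00E9).
        System.out.println(decode(raw, "UTF-8").equals("caf\u00e9"));

        // Decoded as ISO-8859-1, each byte becomes its own character.
        System.out.println(decode(raw, "ISO-8859-1").equals("caf\u00c3\u00a9"));
    }
}
```

So the parser cannot produce a parameter value without picking a charset,
even though the body it reads contains nothing outside US-ASCII.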

>
> MT> How about something along these lines:
>
> MT> "Currently, many browsers do not send a char encoding
>
> [...]
>
> MT> null from the getCharacterEncoding method."
>
>>>>>> On Wed, 26 Apr 2017 14:17:11 -0700, Edward Burns <edward.burns_at_oracle.com> said:
>
> EB> This is fine with me.
>
> Yes, I know I said it was fine, but I have changed my position. I am no
> longer fine with it.
>
>>>>>> On Thu, 27 Apr 2017 07:46:45 +1000, Stuart Douglas <sdouglas_at_redhat.com> said:
>
> SD> I think this needs to be clarified. Does it return null:
>
> SD> 1) If the encoding defaults to ISO-8859-1 because nothing was specified
> SD> or
> SD> 2) If the client did not send a character encoding
>
> My interpretation of the existing text is 2).
>
> SD> "the failure of the client to send a character encoding, the container
> SD> returns null" implies that this is option 2), however I don't think
> SD> this is explicitly made clear, as the "in this case" appears to be
> SD> referring to the previous sentence which talks about defaulting to
> SD> ISO-8859-1.
>
> Given my reconsideration of Mark's proposal, I'm going to take another
> stab at the text, based on my initial attempt from Tuesday and trying to
> incorporate something from Mark's.
>
> PROPOSAL:
>
> Modify the "very misleading" text to be the following:
>
> Spec3.12> "Currently, many browsers do not send a char encoding
> Spec3.12> qualifier with the Content-Type header, leaving open the
> Spec3.12> determination of the character encoding for reading HTTP
> Spec3.12> requests.
>
> In the absence of a char encoding qualifier, if the Content-Type is
> application/x-www-form-urlencoded, the default encoding the container
> uses to create the request reader and parse POST data must be US-ASCII.
> For any other Content-Type, if none has been specified by the client
> request, web application or container vendor specific configuration (for
> all web applications in the container), the
>
> Spec3.12> default encoding of a request the container uses to create the
> Spec3.12> request reader and parse POST data must be ISO-8859-1.
>
> However, in order to indicate to the developer the absence of a char
> encoding qualifier, the container must return null from the
> getCharacterEncoding method."

This is wrong. Just because the request body itself is US-ASCII does not
mean the values cannot be decoded into a different charset, which AFAIK
all servlet containers currently do and users expect to work.

If you want to say anything about it then it should be something like:

In the case of application/x-www-form-urlencoded and multipart
requests accessed through ServletRequest.getParameter and similar
methods, this charset represents the charset that the parameters
will be decoded to, not the actual encoding of the request body (which
is ALWAYS US-ASCII as per RFC).
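
As a rough sketch of that distinction from the client side, assuming the
JDK's java.net.URLEncoder as a stand-in for a browser's form encoder
(the class name FormBodySketch and the helper names are hypothetical):
the charset changes which %nn escapes are produced, but the resulting
body is US-ASCII either way.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormBodySketch {
    // Percent-encode a value, converting non-ASCII characters to bytes
    // in the given charset first. Hypothetical helper for illustration.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // True if every character of the string is in the US-ASCII range.
    static boolean isUsAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    public static void main(String[] args) {
        String value = "caf\u00e9";

        String utf8Body = encode(value, "UTF-8");        // caf%C3%A9
        String latin1Body = encode(value, "ISO-8859-1"); // caf%E9

        // Both bodies are plain US-ASCII on the wire...
        System.out.println(isUsAscii(utf8Body) && isUsAscii(latin1Body));
        // ...yet they differ, so the server must know which charset was used.
        System.out.println(!utf8Body.equals(latin1Body));
    }
}
```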

Stuart

>
> ------------
>
> ACTION: Please respond by start of business PDT Wednesday 3 May 2017.
> In the absence of a response, we will go with the above proposal.
>
> Thanks,
>
> Ed
>
> --
> | edward.burns_at_oracle.com | office: +1 407 458 0017