>>>>> On Wed, 26 Apr 2017 11:08:19 +0100, Mark Thomas <markt_at_apache.org> said:
MT> On 25/04/17 23:52, Stuart Douglas wrote:
>> I don't think this is misleading at all. When we talk about 'parsing
>> the data stream' we do not explicitly state that the incoming data
>> stream must be treated as a series of code points in the specified
>> charset.
>>
>> What this actually means for a form encoded parser is that it will
>> parse the form encoded data as US-ASCII, but decode all the resulting
>> name/value pairs into the specified charset. I think this is well
>> understood by developers, as this is how setCharacterEncoding works.
>>
>> If we do want to clarify anything here (which I am not convinced is
>> necessary) IMHO we should state exactly what this affects, namely:
>> - The reader returned from getReader() will decode into this charset
>> - Request parameters from a post body will be decoded into this
>> charset after they have been parsed from the request
MT> +1
MT> The problem is that without the character encoding the server is left to
MT> guess which encoding was used to convert the non US-ASCII characters
MT> into %nn values.
MT> How about something along these lines:
MT> "Currently, many browsers do not send a char encoding
MT> qualifier with the Content-Type header, leaving open the
MT> determination of the character encoding that should be used
MT> to decode any %nn sequences in an
MT> "application/x-www-form-urlencoded" encoded request body.
MT> The default encoding the container
MT> uses to create the request reader and parse POST data must be
MT> ISO-8859-1 if none has been specified by the client request,
MT> web application or container vendor specific configuration
MT> (for all web applications in the container). However, in order
MT> to indicate to the developer, in this case, the failure of the
MT> client to send a character encoding, the container returns
MT> null from the getCharacterEncoding method."
This is fine with me.
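
For anyone following along, here is a minimal sketch of the ambiguity
Mark describes: the same %nn sequences produce different strings
depending on which charset the server assumes. It is not how any
particular container implements this. The class name FormDecodeSketch
and the parse() helper are hypothetical names made up for the
illustration, and it assumes Java 10 or later for the
URLDecoder.decode(String, Charset) overload.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormDecodeSketch {

    // Hypothetical helper, not part of the Servlet API.
    // Step 1: split the body on the ASCII delimiters '&' and '='.
    // Step 2: decode '+' and %nn sequences of each name and value
    //         using the supplied charset.
    static Map<String, String> parse(String body, Charset charset) {
        Map<String, String> params = new LinkedHashMap<>();
        if (body.isEmpty()) {
            return params;
        }
        for (String pair : body.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(name, charset),
                       URLDecoder.decode(value, charset));
        }
        return params;
    }

    public static void main(String[] args) {
        // %C3%A9 is the UTF-8 byte sequence for U+00E9 (e with acute).
        String body = "name=caf%C3%A9";

        String asUtf8 = parse(body, StandardCharsets.UTF_8).get("name");
        String asLatin1 = parse(body, StandardCharsets.ISO_8859_1).get("name");

        // Decoded as UTF-8, the two bytes form one character.
        System.out.println(asUtf8.length());   // prints 4
        // Decoded as ISO-8859-1, each byte becomes its own character
        // (U+00C3 followed by U+00A9), so the value is mangled.
        System.out.println(asLatin1.length()); // prints 5
    }
}
```

The body itself contains only US-ASCII, so splitting it is unambiguous.
Only the second step depends on the charset, which is why the fallback
default matters when the client sends none.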
Ed
--
| edward.burns_at_oracle.com | office: +1 407 458 0017