[jsr369-experts] Re: [servlet-spec users] [SPEC-161] Encoding in Deployment Descriptor

From: Stuart Douglas <sdouglas_at_redhat.com>
Date: Fri, 9 Sep 2016 00:08:32 +1000

On Thu, Sep 8, 2016 at 1:50 AM, Mark Thomas <markt_at_apache.org> wrote:
> On 07/09/2016 16:21, Edward Burns wrote:
>>>>>>> On Wed, 7 Sep 2016 09:18:42 +1000, Stuart Douglas <sdouglas_at_redhat.com> said:
>>
>> SD> Hmm, re-reading it this is not as clear cut as I thought it was. UTF-8
>> SD> gets mentioned in the following places:
>>
>> SD> 2.5 Identifying Data
>>
>> SD> When a new URI scheme defines a component that represents textual
>> SD> data consisting of characters from the Universal Character Set [UCS],
>> SD> the data should first be encoded as octets according to the UTF-8
>> SD> character encoding [STD63]; then only those octets that do not
>> SD> correspond to characters in the unreserved set should be percent-
>> SD> encoded.
>>
>> SD> 3.2.2. Host
>>
>> SD> Non-ASCII
>> SD> characters must first be encoded according to UTF-8 [STD63], and then
>> SD> each octet of the corresponding UTF-8 sequence must be percent-
>> SD> encoded to be represented as URI characters. URI producing
>> SD> applications must not use percent-encoding in host unless it is used
>> SD> to represent a UTF-8 character sequence.
>>
>> SD> So the 'host' part of the URI is definitely UTF-8, but it is not made
>> SD> super clear if this applies to the path component as well. I am pretty
>> SD> sure it does (and that seems to be the general consensus around the
>> SD> internet). I read section 2.5 as applying to all components, which
>> SD> includes the path.
>>
>> If the URI RFC is not itself clear, then I think we should not say
>> anything about using UTF-8 as the default encoding in the request.
>
> I strongly disagree.
>
> The web is moving (some might argue has moved) towards using UTF-8. We
> should be moving with "the general consensus around the internet" and
> using UTF-8 by default.
>
> Tomcat has been using UTF-8 by default for URIs since early 2014 and I
> don't recall a single issue being reported because of it.
>
> Mark
>

Just to make my position clear, I also think we should be focusing on
UTF-8. It has become the standard, and from what I have seen the
ISO-8859-1 default just causes problems for users.
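
As an illustration of the kind of problem meant here, the following is
a minimal sketch and not anything taken from the specification. The
class name PercentDecodeDemo and the helper decode() are made up for
this example, and the input is assumed to be plain ASCII with
well-formed %XX escapes. It percent-decodes the same path segment once
as UTF-8 and once as ISO-8859-1.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PercentDecodeDemo {

    // Hypothetical helper: turns each %XX escape back into one octet,
    // then interprets the resulting octets in the given charset.
    // Assumes ASCII-only input and well-formed escapes.
    public static String decode(String s, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                // Two hex digits follow the percent sign.
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write((byte) c);
            }
        }
        return new String(out.toByteArray(), cs);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, encoded as UTF-8 and
        // then percent-encoded as described in RFC 3986 section 2.5.
        String encoded = "caf%C3%A9";

        String asUtf8 = decode(encoded, StandardCharsets.UTF_8);
        String asLatin1 = decode(encoded, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8:      " + asUtf8.length() + " chars");
        System.out.println("ISO-8859-1: " + asLatin1.length() + " chars");
    }
}
```

Decoded as UTF-8 the two octets C3 A9 become the single character
U+00E9, giving a four-character string. Decoded as ISO-8859-1 each
octet becomes its own character (U+00C3 and U+00A9), giving a
five-character string that no longer matches what the client sent.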

Even if we cannot change the default in this version of the
specification due to backwards compatibility concerns, I think at the
very least we should have something in place to change the default
in a future version. ISO-8859-1 will almost certainly be irrelevant in
a few years' time.

Stuart

>
>>
>>>>>>>>> On Mon, 5 Sep 2016 15:47:34 +1000, Greg Wilkins <gregw_at_webtide.com> said:
>>>>
>> GW> So they could just be <request-encoding> and <response-encoding>, with
>> GW> documentation that says that the encoding set by these is overridden by the
>> GW> programmatic methods: setCharacterEncoding, setContentType and/or
>> GW> setLocale.
>>
>> I have filed SERVLET_SPEC-161 for this.
>>
>> Ed
>>
>