users@servlet-spec.java.net

[servlet-spec users] [jsr369-experts] Re: [SPEC-161] Encoding in Deployment Descriptor

From: Mark Thomas <markt_at_apache.org>
Date: Wed, 7 Sep 2016 16:50:15 +0100

On 07/09/2016 16:21, Edward Burns wrote:
>>>>>> On Wed, 7 Sep 2016 09:18:42 +1000, Stuart Douglas <sdouglas_at_redhat.com> said:
>
> SD> Hmm, re-reading it this is not as clear cut as I thought it was. UTF-8
> SD> gets mentioned in the following places:
>
> SD> 2.5 Identifying Data
>
> SD> When a new URI scheme defines a component that represents textual
> SD> data consisting of characters from the Universal Character Set [UCS],
> SD> the data should first be encoded as octets according to the UTF-8
> SD> character encoding [STD63]; then only those octets that do not
> SD> correspond to characters in the unreserved set should be percent-
> SD> encoded.
>
> SD> 3.2.2. Host
>
> SD> Non-ASCII
> SD> characters must first be encoded according to UTF-8 [STD63], and then
> SD> each octet of the corresponding UTF-8 sequence must be percent-
> SD> encoded to be represented as URI characters. URI producing
> SD> applications must not use percent-encoding in host unless it is used
> SD> to represent a UTF-8 character sequence.
>
> SD> So the 'host' part of the URI is definitely UTF-8, but it is not made
> SD> super clear if this applies to the path component as well. I am pretty
> SD> sure it does (and that seems to be the general consensus around the
> SD> internet). I read section 2.5 as applying to all components, which
> SD> includes the path.
>
> If the URI RFC is not itself clear, then I think we should not say
> anything about using UTF-8 as the default encoding in the request.

I strongly disagree.

The web is moving (some might argue has moved) towards using UTF-8. We
should be moving with "the general consensus around the internet" and
using UTF-8 by default.

Tomcat has been using UTF-8 by default for URIs since early 2014 and I
don't recall a single issue being reported because of it.
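
For illustration, here is a minimal Java sketch of the section 2.5 rule
quoted above: encode the text as UTF-8 octets, then percent-encode every
octet that is not in the unreserved set. The class and method names are
hypothetical, and this is not Tomcat's implementation, only a sketch of
the rule as it would apply to a single path segment.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
public class PercentEncodeSketch {

    // RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static final String UNRESERVED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            + "abcdefghijklmnopqrstuvwxyz"
            + "0123456789-._~";

    // Encode text as UTF-8 octets, then percent-encode each octet
    // that does not correspond to an unreserved character.
    static String percentEncode(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            if (octet < 0x80 && UNRESERVED.indexOf((char) octet) >= 0) {
                out.append((char) octet);
            } else {
                out.append('%');
                out.append(Character.toUpperCase(Character.forDigit(octet >> 4, 16)));
                out.append(Character.toUpperCase(Character.forDigit(octet & 0xF, 16)));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00E9 is the two UTF-8 octets C3 A9; the space is octet 20.
        System.out.println(percentEncode("caf\u00e9 menu")); // prints caf%C3%A9%20menu
    }
}
```

A server that decodes request URIs as UTF-8 by default reverses these
steps: it first turns each %XX triplet back into an octet, then decodes
the resulting octet sequence as UTF-8.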

Mark


>
>>>>>>>> On Mon, 5 Sep 2016 15:47:34 +1000, Greg Wilkins <gregw_at_webtide.com> said:
>>>
> GW> So they could just be <request-encoding> and <response-encoding>, with
> GW> documentation that says that the encoding set by these is overridden by the
> GW> programmatic methods: setCharacterEncoding, setContentType and/or
> GW> setLocale.
>
> I have filed SERVLET_SPEC-161 for this.
>
> Ed
>