[jsr369-experts] Re: UTF-8 Again

From: Stuart Douglas <sdouglas_at_redhat.com>
Date: Wed, 7 Sep 2016 09:18:42 +1000

On Wed, Sep 7, 2016 at 8:29 AM, Edward Burns <edward.burns_at_oracle.com> wrote:
>>>>>> On Mon, 5 Sep 2016 11:18:19 +1000, Stuart Douglas <sdouglas_at_redhat.com> said:
>
> SD> URL Encoding
>
> SD> At the moment the spec does not really mention URL encoding at all, so
> SD> it is not really clear what the default should be. I think we should
> SD> explicitly mention in the spec that the recommended default URL
> SD> encoding should be UTF-8 as per RFC-3986.
>
> I skimmed RFC-3986 but could not find a definitive statement that UTF-8
> should be used. Did I miss it? It seems to favor US-ASCII:
>
> In local or regional contexts and with improving technology, users
> might benefit from being able to use a wider range of characters;
> such use is not defined by this specification. Percent-encoded
> octets (Section 2.1) may be used within a URI to represent characters
> outside the range of the US-ASCII coded character set if this
> representation is allowed by the scheme or by the protocol element in
> which the URI is referenced. Such a definition should specify the
> character encoding used to map those characters to octets prior to
> being percent-encoded for the URI.
>
> If a reserved character is found in a URI component and
> no delimiting role is known for that character, then it must be
> interpreted as representing the data octet corresponding to that
> character's encoding in US-ASCII.

Hmm, re-reading it, this is not as clear-cut as I thought it was.
UTF-8 gets mentioned in the following places:

2.5 Identifying Data

   When a new URI scheme defines a component that represents textual
   data consisting of characters from the Universal Character Set [UCS],
   the data should first be encoded as octets according to the UTF-8
   character encoding [STD63]; then only those octets that do not
   correspond to characters in the unreserved set should be percent-
   encoded.

3.2.2. Host

   Non-ASCII
   characters must first be encoded according to UTF-8 [STD63], and then
   each octet of the corresponding UTF-8 sequence must be percent-
   encoded to be represented as URI characters. URI producing
   applications must not use percent-encoding in host unless it is used
   to represent a UTF-8 character sequence.

So the 'host' part of the URI is definitely UTF-8, but the RFC does
not make it clear whether this applies to the path component as well.
I am pretty sure it does (and that seems to be the general consensus
around the internet). I read section 2.5 as applying to all
components, which includes the path.
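
To make the section 2.5 rule concrete, here is a minimal sketch of
percent-encoding a path segment: turn the text into octets with a
charset, then percent-encode every octet that is not in the unreserved
set. The class and method names are made up for illustration and are
not part of any Servlet API, and the decoder assumes well-formed input.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class PercentEncodingSketch {
    // RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    // Encode text to octets with the given charset, then percent-encode
    // every octet that does not correspond to an unreserved character.
    static String percentEncode(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            int octet = b & 0xFF;
            if (octet < 0x80 && UNRESERVED.indexOf((char) octet) >= 0) {
                out.append((char) octet);
            } else {
                out.append('%').append(String.format("%02X", octet));
            }
        }
        return out.toString();
    }

    // Reverse step: collect the raw octets, then decode them with the
    // charset the receiver assumes. Assumes every '%' is followed by
    // two hex digits.
    static String percentDecode(String encoded, Charset charset) {
        ByteArrayOutputStream octets = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%') {
                octets.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                octets.write(c);
            }
        }
        return new String(octets.toByteArray(), charset);
    }

    public static void main(String[] args) {
        String segment = "caf\u00e9"; // "cafe" with an acute accent on the e
        System.out.println(percentEncode(segment, StandardCharsets.UTF_8));
        System.out.println(percentEncode(segment, StandardCharsets.ISO_8859_1));
    }
}
```

The same text yields different octets depending on the charset, so a
client that encodes with UTF-8 and a container that decodes with
ISO-8859-1 will not agree on the path.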

>
> SD> Request/Response Encoding
>
> SD> At the moment the spec explicitly states that these default to
> SD> ISO-8859-1, which made sense at the time as this was the default
> SD> character encoding for HTML4. HTML5 has changed this, however, and
> SD> now defaults to UTF-8.
>
> SD> To address this I think we need to allow the default to be controlled
> SD> in web.xml via a <default-encoding> element. This element will only
> SD> affect the request and response encoding, and will override any spec
> SD> mandated default. Obviously if the encoding is explicitly specified
> SD> the default will not be used.
>
> I think it should definitely be opt-in. Regarding the name, we have
> "locale-encoding-mapping", "encodingType" and JSP has "page-encoding".
>
>>>>>> On Mon, 5 Sep 2016 15:47:34 +1000, Greg Wilkins <gregw_at_webtide.com> said:
>
> GW> So they could just be <request-encoding> and <response-encoding>, with
> GW> documentation that says that the encoding set by these is overridden by the
> GW> programmatic methods: setCharacterEncoding, setContentType and/or
> GW> setLocale.
>
> Yes, this is good.
>
>>>>>> On Mon, 5 Sep 2016 09:01:00 +0100, Mark Thomas <markt_at_apache.org> said:
>
> MT> +1.
>
> Yes, I agree with Greg here.
>
> SD> We could also look at changing the default to UTF-8, although this may
> SD> break existing applications (although they can be fixed by explicitly
>
> I don't think it's worth the risk. I'll say no to that one.

I think there is actually some risk both ways. As HTML5 becomes more
pervasive, the majority of new apps will use UTF-8, and having to
explicitly set this in a deployment descriptor is just one more hoop a
user has to jump through to get started.

Is now the right time to change it? Probably not, but eventually it
may make sense. As long as users are aware of the issue, backwards
compatibility should not be too much of a problem, as most containers
already have ways to override the default encoding.
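
For reference, the precedence being discussed can be sketched roughly
as follows: an explicitly specified encoding wins, then a default from
the deployment descriptor, then the spec default of ISO-8859-1. The
class, the resolve() helper and its parameters are hypothetical and
only stand in for whatever the container does internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not a Servlet API class.
class DefaultEncodingSketch {
    // The default the spec currently mandates for request/response bodies.
    static final Charset SPEC_DEFAULT = StandardCharsets.ISO_8859_1;

    // explicit: encoding set programmatically or via Content-Type, or null.
    // descriptorDefault: encoding configured in web.xml, or null.
    static Charset resolve(String explicit, String descriptorDefault) {
        if (explicit != null) {
            return Charset.forName(explicit);
        }
        if (descriptorDefault != null) {
            return Charset.forName(descriptorDefault);
        }
        return SPEC_DEFAULT;
    }

    // Decode a request body using the resolved charset.
    static String decodeBody(byte[] body, String explicit, String descriptorDefault) {
        return new String(body, resolve(explicit, descriptorDefault));
    }

    public static void main(String[] args) {
        // A body as an HTML5 page would typically submit it: UTF-8 octets.
        byte[] body = "na\u00efve".getBytes(StandardCharsets.UTF_8);
        // With no explicit encoding and no descriptor default, the
        // ISO-8859-1 spec default does not round-trip the text.
        System.out.println(decodeBody(body, null, null).equals("na\u00efve"));
        // With a UTF-8 descriptor default the text round-trips.
        System.out.println(decodeBody(body, null, "UTF-8").equals("na\u00efve"));
    }
}
```

An app that relies on the current default simply never sets the
descriptor value, which is what keeps the opt-in approach backwards
compatible.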

Stuart

>
> So are we good with this? If so, I'll file a JIRA.
>
> Ed
>
> --
> | edward.burns_at_oracle.com | office: +1 407 458 0017