[jsr369-experts] Re: UTF-8 Again

From: Edward Burns <edward.burns_at_oracle.com>
Date: Tue, 6 Sep 2016 15:29:40 -0700

>>>>> On Mon, 5 Sep 2016 11:18:19 +1000, Stuart Douglas <sdouglas_at_redhat.com> said:

SD> URL Encoding

SD> At the moment the spec does not really mention URL encoding at all, so
SD> it is not really clear what the default should be. I think we should
SD> explicitly mention in the spec that the recommended default URL
SD> encoding should be UTF-8 as per RFC-3986.

I skimmed RFC-3986 but could not find a definitive statement that UTF-8
should be used. Did I miss it? It seems to favor US-ASCII:

   In local or regional contexts and with improving technology, users
   might benefit from being able to use a wider range of characters;
   such use is not defined by this specification. Percent-encoded
   octets (Section 2.1) may be used within a URI to represent characters
   outside the range of the US-ASCII coded character set if this
   representation is allowed by the scheme or by the protocol element in
   which the URI is referenced. Such a definition should specify the
   character encoding used to map those characters to octets prior to
   being percent-encoded for the URI.

   If a reserved character is found in a URI component and
   no delimiting role is known for that character, then it must be
   interpreted as representing the data octet corresponding to that
   character's encoding in US-ASCII.
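
The practical stake here is that the same percent-encoded octets decode to
different characters depending on the charset the container assumes. Below
is a minimal sketch using only java.net.URLDecoder; the class name and the
sample string are made up for illustration. Note that URLDecoder implements
form decoding (it also turns '+' into a space), so it only approximates
what a container does with a request URI.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PercentDecodeDemo {

    // Decode percent-encoded octets using the named charset.
    public static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // %C3%A9 is the two-octet UTF-8 encoding of U+00E9 (e with acute).
        String encoded = "caf%C3%A9";

        // Interpreted as UTF-8: one character, U+00E9.
        System.out.println(decode(encoded, "UTF-8"));

        // Interpreted as ISO-8859-1: two characters, U+00C3 and U+00A9.
        System.out.println(decode(encoded, "ISO-8859-1"));
    }
}
```

Whichever default the spec recommends, the client and the container have to
agree on it, or the second result is what the application sees.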

SD> Request/Response Encoding

SD> At the moment the spec explicitly states that these default to
SD> ISO-8859-1, which made sense at the time as this was the default
SD> character encoding for HTML4. HTML5 has changed this, however, and now
SD> defaults to UTF-8.

SD> To address this I think we need to allow the default to be controlled
SD> in web.xml via a <default-encoding> element. This element will only
SD> affect the request and response encoding, and will override any spec
SD> mandated default. Obviously if the encoding is explicitly specified
SD> the default will not be used.

I think it should definitely be opt-in. Regarding the name, we have
"locale-encoding-mapping", "encodingType" and JSP has "page-encoding".

>>>>> On Mon, 5 Sep 2016 15:47:34 +1000, Greg Wilkins <gregw_at_webtide.com> said:

GW> So they could just be <request-encoding> and <response-encoding>, with
GW> documentation that says that the encoding set by these is overridden by the
GW> programmatic methods: setCharacterEncoding, setContentType and/or
GW> setLocale.

Yes, this is good.
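
A rough sketch of the precedence being discussed, assuming the descriptor
value is only a fallback: an encoding set programmatically (or otherwise
stated explicitly) wins, then the opt-in descriptor default, then the
existing spec default of ISO-8859-1. The class and method names are
hypothetical and are not proposed API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingResolution {

    // The default the spec currently mandates for request and response bodies.
    public static final Charset SPEC_DEFAULT = StandardCharsets.ISO_8859_1;

    /**
     * Pick the effective encoding.
     *
     * @param explicit          encoding set explicitly, e.g. via
     *                          setCharacterEncoding; null if none
     * @param descriptorDefault opt-in default from the deployment descriptor;
     *                          null if the application did not configure one
     */
    public static Charset resolve(Charset explicit, Charset descriptorDefault) {
        if (explicit != null) {
            return explicit;            // explicit setting always wins
        }
        if (descriptorDefault != null) {
            return descriptorDefault;   // opt-in default from web.xml
        }
        return SPEC_DEFAULT;            // unchanged behavior for existing apps
    }

    public static void main(String[] args) {
        // Existing application, nothing configured.
        System.out.println(resolve(null, null));
        // Application opted in to UTF-8 in its descriptor.
        System.out.println(resolve(null, StandardCharsets.UTF_8));
        // Explicit setting overrides the descriptor default.
        System.out.println(resolve(StandardCharsets.US_ASCII, StandardCharsets.UTF_8));
    }
}
```

With this ordering, applications that configure nothing keep today's
behavior, which is the point of making it opt-in.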

>>>>> On Mon, 5 Sep 2016 09:01:00 +0100, Mark Thomas <markt_at_apache.org> said:

MT> +1.

Yes, I agree with Greg here.

SD> We could also look at changing the default to UTF-8, although this may
SD> break existing applications (although they can be fixed by explicitly

I don't think it's worth the risk. I'll say no to that one.

So are we good with this? If so, I'll file a JIRA.


| edward.burns_at_oracle.com | office: +1 407 458 0017