[jsr369-experts] Re: [servlet-spec users] Re: [SPEC-161] Encoding in Deployment Descriptor

From: Edward Burns <edward.burns_at_oracle.com>
Date: Thu, 8 Sep 2016 09:06:16 -0700

>>>>> On Thu, 8 Sep 2016 14:38:43 +0200, Bauke Scholtz <balusc_at_gmail.com> said:

BalusC> - W3C recommends UTF-8
BalusC> https://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars
BalusC> - HTML5 spec defaults to UTF-8
BalusC> https://www.w3.org/TR/html5/document-metadata.html#charset
BalusC> - java.net.URLEncoder recommends UTF-8
BalusC> https://docs.oracle.com/javase/8/docs/api/java/net/URLEncoder.html
BalusC> - I myself have advocated UTF-8 for almost a decade
BalusC> http://balusc.omnifaces.org/2009/05/unicode-how-to-get-characters-right.html
BalusC> - JSF/Facelets defaults to UTF-8
BalusC> - Everyone keeps homebrewing servlet filters to force
BalusC> request.setCharacterEncoding("UTF-8")
BalusC> - And, as Mark mentioned, Tomcat switched to UTF-8 in 2014 and no one
BalusC> complained

Ah, the old "avalanche of evidence" attack! But yes, you make a great point.

>>>>> On Fri, 9 Sep 2016 00:08:32 +1000, Stuart Douglas <sdouglas_at_redhat.com> said:

SD> Just to make my position clear, I also think we should be focusing on
SD> UTF-8. It has become the standard, and from what I have seen the
SD> ISO-8859-1 default just causes problems for users.

Well, we will get part of the way there with the proposed
<request-encoding> and <response-encoding> elements.

SD> Even if we cannot change the default in this version of the
SD> specification due to backwards compatibility concerns, I think at the
SD> very least we should have something in place to change the default
SD> in a future version. ISO-8859-1 will almost certainly be irrelevant in
SD> a few years' time.

Can we have some discussion here to explicitly spell out the exact
circumstances where changing Spec Section 3.11 to say UTF-8 is the
default would cause backward compatibility problems? If the set of
circumstances is sufficiently small, we can make a judgement call to
make the change. Here is what I see as the set of circumstances:

* a client writes its request in ISO-8859-1

* it does not include any character encoding information in the request

* there are octets in the request with values greater than 127, that is,
  outside the ASCII range

Looking at a table of Latin-1 [1], that range does cover quite a lot of
commonly used non-English characters. But one could argue that if you
are going to have non-ASCII code points in your request, it's likely you
would take the time to include character encoding information in the
request.
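
To make the three conditions concrete, here is a minimal Java sketch of
what would happen to such a request body if the default moved from
ISO-8859-1 to UTF-8. The class and method names (DefaultCharsetDemo,
decode, hasNonAscii) are made up for illustration and are not part of
the Servlet API; a real container does the decoding internally.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {

    // Decode raw request octets using whatever charset the container
    // assumes when the request carries no encoding information.
    static String decode(byte[] octets, Charset assumedDefault) {
        return new String(octets, assumedDefault);
    }

    // True if any octet lies outside the 7-bit ASCII range (value > 127).
    static boolean hasNonAscii(byte[] octets) {
        for (byte b : octets) {
            if ((b & 0xFF) > 0x7F) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an e-acute; in ISO-8859-1 the
        // accented letter is the single octet 0xE9.
        byte[] latin1Body = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] asciiBody = "cafe".getBytes(StandardCharsets.US_ASCII);

        // Current default: the text round-trips intact.
        String oldDefault = decode(latin1Body, StandardCharsets.ISO_8859_1);

        // Hypothetical new default: 0xE9 starts a multi-byte UTF-8
        // sequence that never completes, so it decodes to U+FFFD.
        String newDefault = decode(latin1Body, StandardCharsets.UTF_8);

        System.out.println("affected request: " + hasNonAscii(latin1Body));
        System.out.println("old default ok:   " + oldDefault.equals("caf\u00e9"));
        System.out.println("new default ok:   " + newDefault.equals("caf\u00e9"));

        // ASCII-only requests decode identically under both defaults.
        System.out.println("ascii unaffected: "
                + decode(asciiBody, StandardCharsets.ISO_8859_1)
                        .equals(decode(asciiBody, StandardCharsets.UTF_8)));
    }
}
```

Only requests meeting all three conditions change behavior: an
ASCII-only body decodes the same either way, and a request that declares
its charset never falls back to the default in the first place.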

Thoughts?

Ed

-- 
| edward.burns_at_oracle.com | office: +1 407 458 0017
[1] http://publib.boulder.ibm.com/cgi-bin/bookmgr/BOOKS/QB3AQ501/F.22?SHELF=&DT=19971201194621