jsr369-experts@servlet-spec.java.net

[jsr369-experts] UTF-8 Again

From: Stuart Douglas <sdouglas_at_redhat.com>
Date: Mon, 5 Sep 2016 11:18:19 +1000

Hello everyone,

I know this was discussed before on the users list, but the discussion
kind of died out without anything being decided.

As I am sure everyone is aware, HTML5 changes the default encoding from
ISO-8859-1 to UTF-8. Most modern web applications will be written to
use UTF-8, and as time goes on ISO-8859-1 will become less and less
relevant.

At the moment there is no easy, standard way to use UTF-8. The only
standard way is to set it programmatically, using the
setCharacterEncoding methods on the request and response objects. Most
containers offer non-standard ways of setting the default, but the
spec itself provides nothing.

I really think this is something we need to address in the spec.

There are really two different parts to this issue, URL encoding and
request/response encoding. I will talk about each of them separately.

URL Encoding

At the moment the spec does not really mention URL encoding at all, so
it is not clear what the default should be. I think we should
explicitly state in the spec that the recommended default URL
encoding is UTF-8, as per RFC 3986.
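
To illustrate why the choice matters, here is a minimal Java sketch
that decodes the same percent-encoded string under both charsets. It
uses java.net.URLDecoder, which implements HTML form decoding rather
than any container's actual URL handling, so treat it only as a
stand-in. The class name UrlDecodingDemo and the sample string are
made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UrlDecodingDemo {

    // Decodes a percent-encoded string using the named charset.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // "caf\u00e9" with the final letter sent as its two UTF-8 bytes, 0xC3 0xA9.
        String encoded = "caf%C3%A9";

        String asUtf8 = decode(encoded, "UTF-8");
        String asLatin1 = decode(encoded, "ISO-8859-1");

        // UTF-8 yields one character for the two bytes; ISO-8859-1 yields two.
        System.out.println(asUtf8.length());   // 4
        System.out.println(asLatin1.length()); // 5
    }
}
```

A container that assumes ISO-8859-1 would hand the application the
five-character string, even though the client meant four characters.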

The URL encoding really needs to be determined container-wide, as the
URL must be decoded before it is mapped to a webapp, so I don't think
this is something that we can control on a per-app basis.

Request/Response Encoding

At the moment the spec explicitly states that these default to
ISO-8859-1, which made sense at the time as this was the default
character encoding for HTML4. HTML5 has changed this, however, and now
defaults to UTF-8.

To address this, I think we need to allow the default to be controlled
in web.xml via a <default-encoding> element. This element will only
affect the request and response encoding, and will override any
spec-mandated default. Obviously, if the encoding is explicitly
specified, the default will not be used.
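
The precedence described above could be sketched roughly as follows.
This only illustrates the proposal and is not existing container
behaviour. The class DefaultEncodingResolver and its resolve method
are hypothetical, and <default-encoding> is the element proposed in
this mail, not something the current spec defines.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultEncodingResolver {

    // The default the spec currently mandates for request and response bodies.
    static final Charset SPEC_DEFAULT = StandardCharsets.ISO_8859_1;

    /**
     * Hypothetical lookup order for the effective encoding.
     *
     * @param explicit      charset given explicitly (for example via a
     *                      Content-Type header or setCharacterEncoding), or null
     * @param webXmlDefault value of the proposed default-encoding element, or null
     */
    static Charset resolve(String explicit, String webXmlDefault) {
        if (explicit != null) {
            return Charset.forName(explicit);      // an explicit setting always wins
        }
        if (webXmlDefault != null) {
            return Charset.forName(webXmlDefault); // per-app default from web.xml
        }
        return SPEC_DEFAULT;                       // fall back to the spec default
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, null));       // ISO-8859-1
        System.out.println(resolve(null, "UTF-8"));    // UTF-8
        System.out.println(resolve("UTF-16", "UTF-8")); // UTF-16
    }
}
```

Under this sketch an application that declares nothing keeps today's
behaviour, so the element on its own would not break anyone.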

We could also look at changing the default to UTF-8. This may break
existing applications, although they can be fixed by explicitly
setting the old default, either in container-specific config or via
the new web.xml element. Even though breaking compatibility may cause
some short-term pain, I think it is probably worth it.

Stuart