users@servlet-spec.java.net

[servlet-spec users] Re: Easy UTF-8

From: Stuart Douglas <sdouglas_at_redhat.com>
Date: Mon, 31 Aug 2015 19:23:13 -0400 (EDT)

----- Original Message -----
> From: "Yannick Majoros" <yannick.majoros_at_gmail.com>
> To: users_at_servlet-spec.java.net
> Sent: Tuesday, 1 September, 2015 6:37:32 AM
> Subject: [servlet-spec users] Re: Easy UTF-8
>
> Hi,
>
> Mark, nothing personal, but I think that we should disagree properly ;-)
>
> Quoting the OP:
> *> configuring a Java EE web application to use UTF-8 has historically
> not been easy or doable in a portable manner*
>
> I still think this is not about defaults, and I still think using UTF-8 in
> Java EE apps is both easy and portable ;-) Can the OP comment on his original
> request?

I think the use of the word "Configuring" implies that defaults are in use. Yes
it is possible to install a global request listener that calls setCharacterEncoding,
but this is not as simple as an entry in web.xml.

There have also been cases that I have seen where setCharacterEncoding() is broken
by a 3rd party library that reads data in a listener before the application has had a
chance to do anything. In particular for CDI, in order to make the conversation scope
work, Weld was calling getParameter() in a listener looking for the CID parameter, which
forced the data to be read and rendered setCharacterEncoding() ineffective. If there were
some way to set the encoding via configuration this would not have been such a problem.
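
To make the failure mode above concrete, here is a toy model in plain Java. It is
only a sketch: LazyRequestModel is a hypothetical stand-in, not the real
ServletRequest and not Weld's code, and it assumes a container that decodes the
request data lazily on the first getParameter() call and caches the result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a request object; NOT the real ServletRequest.
class LazyRequestModel {
    private final byte[] body;                               // raw bytes as sent by the client
    private Charset encoding = StandardCharsets.ISO_8859_1;  // default mandated by the spec
    private String parsed;                                   // cached after the first read

    LazyRequestModel(byte[] body) {
        this.body = body;
    }

    // Mirrors the spec rule: no effect once data has been read.
    void setCharacterEncoding(Charset charset) {
        if (parsed == null) {
            encoding = charset;
        }
    }

    // Decodes on the first call and caches the result.
    String getParameter() {
        if (parsed == null) {
            parsed = new String(body, encoding);
        }
        return parsed;
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute, encoded by the client as UTF-8.
        byte[] utf8Body = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // A listener reads first, then the application sets the encoding: too late.
        LazyRequestModel early = new LazyRequestModel(utf8Body);
        early.getParameter();
        early.setCharacterEncoding(StandardCharsets.UTF_8);
        System.out.println(early.getParameter().equals("caf\u00e9")); // false

        // The application sets the encoding before anything reads the data.
        LazyRequestModel ok = new LazyRequestModel(utf8Body);
        ok.setCharacterEncoding(StandardCharsets.UTF_8);
        System.out.println(ok.getParameter().equals("caf\u00e9"));    // true
    }
}
```

In the first case the application's call is silently ignored, which is why a
configuration-level setting would sidestep the ordering problem.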

>
> I do think that the RFC I mentioned should be used for URL decoding, and it
> seems to indicate that URLs should be percent-encoded, to a UTF-8 encoded
> octet stream. Why would URL encoding for Java EE resources be any
> different? I can't see that in any specification.

This was my interpretation as well. Undertow uses UTF-8 by default, unless it
is explicitly changed.
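
As a rough illustration of why the decoding charset matters, the sketch below
percent-decodes the same value twice using the JDK's java.net.URLDecoder.
UrlDecodingDemo and its decode helper are hypothetical names, and URLDecoder
targets form encoding (it also maps '+' to a space), so this only approximates
what a container does with the URL path.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UrlDecodingDemo {
    // Percent-decodes a value, interpreting the resulting octets in the given charset.
    static String decode(String encoded, Charset charset) {
        try {
            return URLDecoder.decode(encoded, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Cannot happen for a Charset the JVM already resolved.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute, percent-encoded from its two UTF-8 octets.
        String encoded = "caf%C3%A9";

        String asUtf8 = decode(encoded, StandardCharsets.UTF_8);
        String asLatin1 = decode(encoded, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.length());   // 4: the two octets form one character
        System.out.println(asLatin1.length()); // 5: each octet becomes its own character
    }
}
```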

>
> Checking the spec, it seems that ISO-8859-1 is for request data (POST), not
> for the URL part (which should be ascii-%-utf-8). Here is the relevant part
> from Servlet 3.0:
>
> *> 3.10 Request data encoding Currently, many browsers do not send a char
> encoding qualifier with the Content-Type header, leaving open the
> determination of the character encoding for reading HTTP requests. The
> default encoding of a request the container uses to create the request
> reader and parse POST data must be “ISO-8859-1” if none has been specified
> by the client request. However, in order to indicate to the developer, in
> this case, the failure of the client to send a character encoding, the
> container returns null from the getCharacterEncoding method. *
>
> *If the client hasn’t set character encoding and the request data is
> encoded with a different encoding than the default as described above,
> breakage can occur. To remedy this situation, a new method
> setCharacterEncoding(String enc) has been added to the ServletRequest
> interface. Developers can override the character encoding supplied by the
> container by calling this method. It must be called prior to parsing any
> post data or reading any input from the request. Calling this method once
> data has been read will not affect the encoding.*
>
> BTW, this is an easy and portable way of overriding the default.

Although not as easy as a config item in web.xml.

>
> I guess we do agree on one thing: UTF-8 would be a better default.

I agree, although if we were to change it we would definitely need a way to
override this in web.xml, so apps that rely on the current behaviour can
be made to work. This would potentially cause a lot of apps to break so it is
not something we should do lightly.

Stuart

>
> Best regards,
>
> Yannick
>
> On Mon, 31 Aug 2015 at 21:19, Mark Thomas <markt_at_apache.org> wrote:
>
> > On 31/08/2015 13:23, Yannick Majoros wrote:
> > > Hi Mark,
> > >
> > > Maybe there is some misunderstanding here.
> > >
> > > There is a big difference between what you're saying, and what the OP is
> > > asking.
> >
> > I disagree. I think you and I have interpreted the OP's request
> > differently.
> >
> > > The OP is talking about "using UTF-8".
> > >
> > > I'm saying this is quite easy, and shouldn't be defined in web.xml. You
> > > could even accept multiple encodings in an application, with content
> > > negotiation, per resource.
> > >
> > > You're using the word "default" in every single paragraph of your
> > > answer. I therefore understand that you're talking about defaults, which
> > > are container-specific and can surely be improved.
> >
> > No, these are not container specific. The defaults are mandated by the
> > Servlet spec and are currently ISO-8859-1. There are container specific
> > mechanisms for changing these defaults.
> >
> > > Personally, I insist that you shouldn't rely on defaults anyway, so I
> > > don't really care.
> >
> > I think it is perfectly reasonable for an application to depend on
> > specification defined defaults.
> >
> > I do think, as a minimum, we should change the specification to use
> > UTF-8 by default rather than ISO-8859-1.
> >
> > > As far as I can tell from a very quick check, the URL part of the
> > > input already has to be UTF-8
> > > ( http://tools.ietf.org/html/rfc3987#section-6.4 ), so encoding defaults
> > > shouldn't have any influence on that.
> >
> > I don't believe that that specification applies to Servlet containers.
> > I'd be more than happy if it did since I'm in favour of UTF-8 by default.
> >
> > Note that from Tomcat 8, Tomcat does use UTF-8 by default unless the
> > 'strict adherence to the servlet spec' option is enabled in which case
> > it uses ISO-8859-1.
> >
> > > Still puzzled by what this should solve, besides a default that
> > > shouldn't be relied upon in most cases.
> >
> > We appear to disagree on whether or not an application depending on a
> > specification defined default is a reasonable thing to do.
> >
> > My view is that changing the default from ISO-8859-1 to UTF-8 throughout
> > the Servlet spec would be a beneficial change to users and should be a
> > compatible change for any application relying on a default of ISO-8859-1.
> >
> > I can see some merit in providing specification defined options for
> > changing the defaults but I don't view that as being as important or useful as
> > simply changing the current defaults to UTF-8.
> >
> > Cheers,
> >
> > Mark
> >
> > >
> > > Cheers,
> > >
> > > Yannick
> > >
> > > On Mon, 31 Aug 2015 at 13:02, Mark Thomas <markt_at_apache.org
> > > <mailto:markt_at_apache.org>> wrote:
> > >
> > > On 30/08/2015 20:19, Yannick Majoros wrote:
> > > > Hi,
> > > >
> > > > Uh, it's always been quite easy. Why do you think it isn't?
> > > >
> > > > You're citing Tomcat, which isn't Java EE btw.
> > >
> > > No, Tomcat isn't a full Java EE implementation but Tomcat implements the
> > > Servlet specification and this is the Servlet EG. Pointing out (using
> > > one of the many available Servlet implementations) that changing the
> > > default character encoding requires container specific configuration and
> > > asking for the specification to provide something doesn't seem
> > > unreasonable.
> > >
> > > The OP could have made the same point with Glassfish, WebSphere,
> > > WebLogic etc.
> > >
> > > > For Servlet, it's up to you. As long as you don't rely on defaults, you
> > > > should be fine. JSPs, if you still use them, have it quite clear too.
> > >
> > > And that is the point. If you want the default to be something other
> > > than ISO-8859-1 then it has to be changed in multiple places and you
> > > almost certainly need to use container specific configuration as well.
> > >
> > > > Every time I've seen someone struggle with this, he used a framework that
> > > > made dumb assumptions (Struts anyone? That's not Java EE btw). Or the
> > > > developer himself was confused, relied on defaults or converted multiple
> > > > times...
> > >
> > > That is a little unfair. While I have also seen those sorts of errors
> > > there are also issues (covered in the Tomcat FAQ linked below) with
> > > non-spec compliant browser behaviour that contribute to the problem.
> > >
> > > > I'm curious, what do you want an "encoding" element in web.xml to do?
> > >
> > > That is a fair question. There are multiple things that you might want
> > > to change.
> > >
> > > 1. URI decoding
> > > You can't define this per web application since the URI needs to be decoded
> > > before it is mapped to the web application. Therefore this has to be a
> > > container wide setting which means this pretty much has to use container
> > > specific configuration.
> > > What we could do is make UTF-8 rather than ISO-8859-1 the default.
> > >
> > > 2. Response bodies
> > > A web.xml setting could be used to change from the current ISO-8859-1
> > > default to a default of UTF-8.
> > >
> > > 3. Request bodies
> > > A web.xml setting (the same as 2?) could be used to change from the
> > > current ISO-8859-1 default to a default of UTF-8.
> > >
> > > Any changes in defaults would need to be reflected in the JSP
> > > specification.
> > >
> > > Mark
> > >
> > >
> > > > On 8/30/2015 2:18 PM, Philippe Marschall wrote:
> > > >>
> > > >> Hi
> > > >>
> > > >> UTF-8 is the most popular encoding on the web [1], [2], [3]. However
> > > >> configuring a Java EE web application to use UTF-8 has historically
> > > >> not been easy or doable in a portable manner [4]. Are there any plans
> > > >> to change this, for example by adding a <encoding> element to web.xml?
> > > >>
> > > >> [1] http://w3techs.com/technologies/overview/character_encoding/all
> > > >> [2] http://googleblog.blogspot.ch/2010/01/unicode-nearing-50-of-web.html
> > > >> [3] http://www.w3.org/QA/2008/05/utf8-web-growth#c139948
> > > >> [4] http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
> > > >>
> > > >> Cheers
> > > >> Philippe
> > > >
> > >
> > > --
> > > Yannick Majoros
> >
> > --
> Yannick Majoros
>