users@servlet-spec.java.net

[servlet-spec users] Re: Easy UTF-8

From: Yannick Majoros <yannick.majoros_at_gmail.com>
Date: Mon, 31 Aug 2015 20:37:32 +0000

Hi,

Mark, nothing personal, but I think that we should disagree properly ;-)

Quoting the OP:
> configuring a Java EE web application to use UTF-8 has historically
> not been easy or doable in a portable manner

I still think this is not about defaults, and I still think using UTF-8 in
Java EE apps is both easy and portable ;-) Could the OP clarify the original
request?

I do think that the RFC I mentioned should be used for URL decoding, and it
seems to indicate that non-ASCII characters in URLs should be converted to
UTF-8 octets, which are then percent-encoded. Why would URL encoding for
Java EE resources be any different? I can't see that in any specification.
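
To make the URL point concrete, here is a minimal sketch using only
java.net.URI from the JDK, not any container API. The class name, host and
path are made up for illustration. It shows a non-ASCII path character
turning into percent-encoded UTF-8 octets and back:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; host and path are illustration values only.
class Utf8UrlDemo {

    // Builds a URI from unescaped components and returns its ASCII-only form.
    // java.net.URI percent-encodes non-ASCII characters as UTF-8 octets.
    static String toAscii(String host, String path) {
        try {
            return new URI("http", host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Parses an already-escaped URI and returns the decoded path.
    // getPath() interprets the escaped octets as UTF-8.
    static String decodedPath(String escapedUri) {
        return URI.create(escapedUri).getPath();
    }

    public static void main(String[] args) {
        // "\u00e9" is e-acute, written as an escape so the source stays ASCII.
        String ascii = toAscii("example.org", "/caf\u00e9");
        System.out.println(ascii);              // http://example.org/caf%C3%A9
        System.out.println(decodedPath(ascii)); // the original path, decoded
    }
}
```

This only shows what the JDK does on the client side; whether a given
container decodes the request URI the same way is exactly what is being
debated in this thread.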

Checking the spec, it seems that ISO-8859-1 is for request data (POST), not
for the URL part (which should be ASCII with percent-encoded UTF-8). Here is
the relevant part from Servlet 3.0:

> 3.10 Request data encoding
>
> Currently, many browsers do not send a char encoding qualifier with the
> Content-Type header, leaving open the determination of the character
> encoding for reading HTTP requests. The default encoding of a request the
> container uses to create the request reader and parse POST data must be
> "ISO-8859-1" if none has been specified by the client request. However, in
> order to indicate to the developer, in this case, the failure of the
> client to send a character encoding, the container returns null from the
> getCharacterEncoding method.
>
> If the client hasn't set character encoding and the request data is
> encoded with a different encoding than the default as described above,
> breakage can occur. To remedy this situation, a new method
> setCharacterEncoding(String enc) has been added to the ServletRequest
> interface. Developers can override the character encoding supplied by the
> container by calling this method. It must be called prior to parsing any
> post data or reading any input from the request. Calling this method once
> data has been read will not affect the encoding.

BTW, this is an easy and portable way of overriding the default.
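
As a rough illustration of the "breakage" the spec mentions, the sketch
below decodes the same form-encoded value twice using only the JDK's
java.net.URLDecoder: once with the ISO-8859-1 default and once with UTF-8.
It simulates the effect and is not what any particular container does
internally. The class name and sample value are made up. In a real servlet
the override would be a call to request.setCharacterEncoding("UTF-8")
before the first getParameter call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class simulating how a form value is decoded.
class FormDecodingDemo {

    // Decodes a percent-encoded form value with the given charset name.
    static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // What a browser sends for "caf" + e-acute from a UTF-8 page.
        String raw = "caf%C3%A9";

        // Spec default: each octet becomes one Latin-1 character,
        // so the value ends up one character too long and garbled.
        String withDefault = decode(raw, "ISO-8859-1");

        // Explicit override: the two octets form a single character.
        String withUtf8 = decode(raw, "UTF-8");

        System.out.println(withDefault.length()); // 5
        System.out.println(withUtf8.length());    // 4
    }
}
```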

I guess we do agree on one thing: UTF-8 would be a better default.

Best regards,

Yannick

On Mon, 31 Aug 2015 at 21:19, Mark Thomas <markt_at_apache.org> wrote:

> On 31/08/2015 13:23, Yannick Majoros wrote:
> > Hi Mark,
> >
> > Maybe there is some misunderstanding here.
> >
> > There is a big difference between what you're saying, and what the OP is
> > asking.
>
> I disagree. I think you and I have interpreted the OP's request
> differently.
>
> > The OP is talking about "using UTF-8".
> >
> > I'm saying this is quite easy, and shouldn't be defined in web.xml. You
> > could even accept multiple encodings in an application, with content
> > negotiation, per resource.
> >
> > You're using the word "default" in every single paragraph of your
> > answer. I therefore understand that you're talking about defaults, which
> > are container-specific and can surely be improved.
>
> No, these are not container specific. The defaults are mandated by the
> Servlet spec and are currently ISO-8859-1. There are container specific
> mechanisms for changing these defaults.
>
> > Personally, I insist that you shouldn't rely on defaults anyway, so I
> > don't really care.
>
> I think it is perfectly reasonable for an application to depend on
> specification defined defaults.
>
> I do think, as a minimum, we should change the specification to use
> UTF-8 by default rather than ISO-8859-1.
>
> > As far as I can tell from a very quick check, the URL part of the
> > input already has to be UTF-8
> > ( http://tools.ietf.org/html/rfc3987#section-6.4 ), so encoding defaults
> > shouldn't have any influence on that.
>
> I don't believe that that specification applies to Servlet containers.
> I'd be more than happy if it did since I'm in favour of UTF-8 by default.
>
> Note that from Tomcat 8, Tomcat does use UTF-8 by default unless the
> 'strict adherence to the servlet spec' option is enabled in which case
> it uses ISO-8859-1.
>
> > Still puzzled by what this should solve, besides a default that
> > shouldn't be relied upon in most cases.
>
> We appear to disagree on whether or not an application depending on a
> specification defined default is a reasonable thing to do.
>
> My view is that changing the default from ISO-8859-1 to UTF-8 throughout
> the Servlet spec would be a beneficial change to users and should be a
> compatible change for any application relying on a default of ISO-8859-1.
>
> I can see some merit in providing specification defined options for
> changing the defaults but I don't view that as important or useful as
> simply changing the current defaults to UTF-8.
>
> Cheers,
>
> Mark
>
> >
> > Cheers,
> >
> > Yannick
> >
> > On Mon, 31 Aug 2015 at 13:02, Mark Thomas <markt_at_apache.org> wrote:
> >
> > On 30/08/2015 20:19, Yannick Majoros wrote:
> > > Hi,
> > >
> > > Uh, it's always been quite easy. Why do you think it isn't?
> > >
> > > You're citing Tomcat, which isn't Java EE btw.
> >
> > No, Tomcat isn't a full Java EE implementation but Tomcat implements the
> > Servlet specification and this is the Servlet EG. Pointing out (using
> > one of the many available Servlet implementations) that changing the
> > default character encoding requires container specific configuration and
> > asking for the specification to provide something doesn't seem
> > unreasonable.
> >
> > The OP could have made the same point with Glassfish, WebSphere,
> > WebLogic etc.
> >
> > > For Servlet, it's up to you. As long as you don't rely on defaults, you
> > > should be fine. JSPs, if you still use them, have it quite clear too.
> >
> > And that is the point. If you want the default to be something other
> > than ISO-8859-1 then it has to be changed in multiple places and you
> > almost certainly need to use container specific configuration as well.
> >
> > > Every time I've seen someone struggle with this, he used a framework
> > > that made dumb assumptions (Struts anyone? That's not Java EE btw). Or
> > > the developer himself was confused, relied on defaults or converted
> > > multiple times...
> >
> > That is a little unfair. While I have also seen those sorts of errors
> > there are also issues (covered in the Tomcat FAQ linked below) with
> > non-spec compliant browser behaviour that contribute to the problem.
> >
> > > I'm curious, what do you want an "encoding" element in web.xml to do?
> >
> > That is a fair question. There are multiple things that you might want
> > to change.
> >
> > 1. URI decoding
> > You can't define this per web application since the URI needs to be
> > decoded before it is mapped to the web application. Therefore this has
> > to be a container wide setting which means this pretty much has to use
> > container specific configuration.
> > What we could do is make UTF-8 rather than ISO-8859-1 the default.
> >
> > 2. Response bodies
> > A web.xml setting could be used to change from the current ISO-8859-1
> > default to a default of UTF-8.
> >
> > 3. Request bodies
> > A web.xml setting (the same as 2?) could be used to change from the
> > current ISO-8859-1 default to a default of UTF-8.
> >
> > Any changes in defaults would need to be reflected in the JSP
> > specification.
> >
> > Mark
> >
> >
> > > On 8/30/2015 2:18 PM, Philippe Marschall wrote:
> > >>
> > >> Hi
> > >>
> > >> UTF-8 is the most popular encoding on the web [1], [2], [3]. However
> > >> configuring a Java EE web application to use UTF-8 has historically
> > >> not been easy or doable in a portable manner [4]. Are there any plans
> > >> to change this, for example by adding an <encoding> element to
> > >> web.xml?
> > >>
> > >> [1] http://w3techs.com/technologies/overview/character_encoding/all
> > >> [2] http://googleblog.blogspot.ch/2010/01/unicode-nearing-50-of-web.html
> > >> [3] http://www.w3.org/QA/2008/05/utf8-web-growth#c139948
> > >> [4] http://wiki.apache.org/tomcat/FAQ/CharacterEncoding
> > >>
> > >> Cheers
> > >> Philippe
> > >
> >