users@jaxb.java.net

RE: possible JAXB bug with non-ASCII characters

From: Robert Lowe <rmlowe_at_rmlowe.com>
Date: Thu, 21 Aug 2003 17:31:36 +0800

Matt, I think you're confusing two separate things. Character references like &#x00C6; are part of XML, and have nothing to do with UTF-8 encoding per se. It's possible to have perfectly correct UTF-8 output with or without character references. In fact, character references are mainly useful for encodings such as ISO-8859-1 that can't represent the full range of Unicode characters. Since UTF-8 can encode the complete range of Unicode characters, character references are never necessary in UTF-8-encoded output. The only characters that still need escaping are the XML markup characters <, > and &, which is why your &amp; comes back out as &amp;.
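
To make the distinction concrete, here is a minimal sketch using only the JDK (no JAXB involved). The class and method names are made up for illustration. It shows that Æ has a direct byte representation in both UTF-8 and ISO-8859-1, while a character such as € cannot be encoded in ISO-8859-1 at all, which is the case where a serializer has to fall back on a character reference.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    /** Hex dump of the bytes produced when text is encoded in the given charset. */
    public static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    /** True if every character of text can be represented in the charset. */
    public static boolean canEncode(String text, Charset cs) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String ae = "\u00C6";   // Æ, the character that &#x00C6; refers to
        String euro = "\u20AC"; // €, not part of ISO-8859-1

        System.out.println(hex(ae, StandardCharsets.UTF_8));      // C3 86
        System.out.println(hex(ae, StandardCharsets.ISO_8859_1)); // C6

        System.out.println(canEncode(euro, StandardCharsets.UTF_8));      // true
        System.out.println(canEncode(euro, StandardCharsets.ISO_8859_1)); // false
    }
}
```

A parser hands the application the same character whether the file contained the raw bytes C3 86 or the text &#x00C6;, so both forms are equally correct UTF-8 output.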

Best,
Rob

  -----Original Message-----
  From: Geis, Matt [mailto:Matt.Geis_at_schwab.com]
  Sent: Thursday, August 21, 2003 5:01 AM
  To: 'users_at_jaxb.dev.java.net'
  Subject: possible JAXB bug with non-ASCII characters


  Re my last email (included below), I have some more information. I tried setting the following property on the Marshaller.



  m.setProperty(Marshaller.JAXB_ENCODING, "DEFAULT");



  This change DID produce the correct, escaped output. However, my input document is encoded UTF-8, and is specified as such. The default output is UTF-8. However, the characters are not escaped unless I specify DEFAULT encoding. This is clearly not a workable solution, as I want my output file to be UTF-8.



  Why doesn’t JAXB correctly escape the characters, and how can I get it to do that? Is this a bug?



  Matt



  -----Original Message-----
  From: Geis, Matt
  Sent: Wednesday, August 20, 2003 12:35 PM
  To: users_at_jaxb.dev.java.net
  Subject: question about UTF-8 characters



  Hi,

  I’m running into a problem with JAXB. I have an XML document which contains the character Æ. More accurately, I have a document which contains the character entity reference &#x00C6;, which dereferences to Æ. When I unmarshal the document into a JAXB object, I can call the getter for the given property, and it correctly displays Æ.



  However, when I marshal the document back into XML, it becomes “Æ “.



  The ampersand is handled correctly. My XML document has ‘&amp;’. The getter method shows ‘&’. The marshaled version shows ‘&amp;’.



  I messed around and changed the output encoding to ISO-8859-1, and the marshaled XML for the Unicode character was Æ. However, that’s not what I want. What I want is for the output to be UTF-8 encoded, and for it to have the text ‘&#x00C6;’
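
One way to get character references in the output is to serialize as US-ASCII, since every US-ASCII byte sequence is also valid UTF-8. The sketch below is not JAXB; it uses the JDK's own javax.xml.transform serializer, and the class name, method names, and sample document are made up for illustration. Whether a given JAXB implementation behaves the same way when Marshaller.JAXB_ENCODING is set to "US-ASCII" is an assumption to verify. Note that the XML declaration will then name US-ASCII rather than UTF-8, and the reference may come out in decimal rather than hexadecimal form.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class AsciiXmlDemo {
    /** Parses XML bytes into a DOM tree; character references are resolved here. */
    public static Document parse(byte[] xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
    }

    /** Serializes the DOM tree using the requested output encoding. */
    public static byte[] serialize(Document doc, String encoding) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] input = "<name>&#x00C6;sop &amp; Co</name>"
                .getBytes(StandardCharsets.UTF_8);
        Document doc = parse(input);

        // Characters that US-ASCII cannot represent are written as
        // numeric character references by the serializer.
        byte[] ascii = serialize(doc, "US-ASCII");
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));
    }
}
```

Reparsing the US-ASCII output yields the same text content as the original document, so consumers see no difference apart from the file's byte representation.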



  I found a possibly related bug in JAXR, where getBytes() is called on a String object without a charset argument. If the bytes are expected to be UTF-8 and the JVM's default charset is not UTF-8, an error will occur, because getBytes() with no argument uses the JVM's default charset.
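
The getBytes() pitfall described above can be shown with the JDK alone. This is a minimal sketch with made-up class and method names. The no-argument call depends on the JVM's default charset (which has been UTF-8 by default only since JDK 18), while the explicit call always produces UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    /** Encodes with the JVM's default charset; the result varies by platform and JDK. */
    public static byte[] platformDefault(String s) {
        return s.getBytes();
    }

    /** Encodes as UTF-8 regardless of the JVM's default charset. */
    public static byte[] explicitUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ae = "\u00C6"; // Æ

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default bytes:   " + Arrays.toString(platformDefault(ae)));
        System.out.println("UTF-8 bytes:     " + Arrays.toString(explicitUtf8(ae)));

        // The two arrays match only when the default charset happens to be UTF-8.
        System.out.println("same: " + Arrays.equals(platformDefault(ae), explicitUtf8(ae)));
    }
}
```

Passing the charset explicitly on both the encoding and the decoding side removes the dependency on how the JVM was started.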



  What do I need to do here? Is this a bug? If not, how do I correctly marshal the data?



  Thanks,

  Matt