Matt, I think you're confusing two separate things. Character references like &#198; are part of XML, and have nothing to do with UTF-8 encoding per se. It's possible to have perfectly correct UTF-8 output with or without them. In fact, character references are mainly useful for encodings such as ISO-8859-1 that can't represent the full range of Unicode characters. Since UTF-8 can encode the complete range of Unicode characters, character references are never necessary in UTF-8-encoded output. The only exceptions are XML reserved characters like <, > and &, which are written with the predefined entities &lt;, &gt; and &amp;.
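To make that concrete, here is a minimal sketch using only the JDK's built-in DOM parser (not JAXB itself). The class name CharRefDemo and the element name <name> are made up for illustration. It parses two UTF-8 documents, one containing the character reference and one containing the literal character, and checks that an XML parser sees the same text in both.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class CharRefDemo {
    // Encode the XML string as UTF-8, parse it, and return the root element's text.
    static String rootText(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
        // Same content, written once as a character reference and once literally.
        String withReference = header + "<name>&#198;</name>";
        String withLiteral = header + "<name>\u00C6</name>";

        // Both documents yield the single character U+00C6 after parsing.
        System.out.println(rootText(withReference).equals(rootText(withLiteral))); // prints "true"
        System.out.println(rootText(withReference).length()); // prints "1"
    }
}
```

Any conforming XML consumer should treat the two forms as equivalent, so unescaped UTF-8 output is not in itself a sign of a marshalling bug.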
Best,
Rob
-----Original Message-----
From: Geis, Matt [mailto:Matt.Geis_at_schwab.com]
Sent: Thursday, August 21, 2003 5:01 AM
To: 'users_at_jaxb.dev.java.net'
Subject: possible JAXB bug with non-ASCII characters
Re my last email (included below), I have some more information. I tried setting the following property on the Marshaller.
m.setProperty(m.JAXB_ENCODING, "DEFAULT");
This change DID produce the correct, escaped output. However, my input document is encoded UTF-8, and is specified as such. The default output is UTF-8. However, the characters are not escaped unless I specify DEFAULT encoding. This is clearly not a workable solution, as I want my output file to be UTF-8.
Why doesn’t JAXB correctly escape the characters, and how can I get it to do that? Is this a bug?
Matt
-----Original Message-----
From: Geis, Matt
Sent: Wednesday, August 20, 2003 12:35 PM
To: users_at_jaxb.dev.java.net
Subject: question about UTF-8 characters
Hi,
I’m running into a problem with JAXB. I have an XML document which contains the character Æ. More accurately, I have a document which contains the character reference &#198;, which dereferences to Æ. When I unmarshal the document into a JAXB object, I can call the getter for the given property, and it correctly displays Æ.
However, when I marshal the document back into XML, it becomes “Æ “.
The ampersand is handled correctly. My XML document has ‘&amp;’. The getter method shows ‘&’. The marshaled version shows ‘&amp;’.
I messed around and changed the output encoding to ISO-8859-1, and the marshaled XML for the Unicode character was &#198;. However, that’s not what I want. What I want is for the output to be UTF-8 encoded, and for it to have the text ‘&#198;’.
I found a bug in JAXR which may be related: getBytes() is called on a String object, but if the data is meant to be UTF-8 and the default charset is not UTF-8, an error will occur (as the no-argument getBytes() uses the JVM's default charset).
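As a rough illustration of that kind of charset mismatch (a sketch only, not the JAXR code itself; the class and method names are made up), the snippet below encodes Æ as UTF-8 and then decodes the bytes as ISO-8859-1. The charsets are passed explicitly so the result does not depend on the JVM default.

```java
import java.nio.charset.StandardCharsets;

class GetBytesDemo {
    // Encode as UTF-8, then (wrongly) decode the same bytes as ISO-8859-1.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String ae = "\u00C6"; // the character Æ

        // Æ is two bytes (0xC3 0x86) in UTF-8 but one byte (0xC6) in ISO-8859-1.
        System.out.println(ae.getBytes(StandardCharsets.UTF_8).length);      // prints "2"
        System.out.println(ae.getBytes(StandardCharsets.ISO_8859_1).length); // prints "1"

        // Decoding with the wrong charset turns one character into two.
        System.out.println(misdecode(ae).length());                          // prints "2"
        System.out.println(misdecode(ae).equals("\u00C3\u0086"));            // prints "true"
    }
}
```

Passing the charset explicitly to getBytes() and to the String constructor avoids any dependence on the platform default.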
What do I need to do here? Is this a bug? If not, how do I correctly marshal the data?
Thanks,
Matt