users@jaxb.java.net

Re: Escaping illegal characters during marshalling

From: Erik van Zijst <erik.van.zijst_at_gmail.com>
Date: Wed, 22 Oct 2008 13:46:55 +1100

I think I figured it out.

While UTF-8 can encode every ASCII control character (e.g. 0x10), the
XML 1.0 spec forbids nearly all of them (everything below 0x20 except
tab, LF and CR), both raw and as character references:

http://lists.xml.org/archives/xml-dev/199804/msg00502.html
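
To make the rule concrete, here is a minimal Java sketch of the Char
production from the XML 1.0 spec. The class and method names (XmlChars,
isLegal) are made up for illustration; they are not part of JAXB,
Xerces or XOM.

```java
// Hypothetical helper, not part of any library: checks a Unicode code
// point against the Char production of the XML 1.0 specification.
final class XmlChars {
    private XmlChars() {}

    static boolean isLegal(int codePoint) {
        return codePoint == 0x9                                  // tab
            || codePoint == 0xA                                  // line feed
            || codePoint == 0xD                                  // carriage return
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)        // excludes surrogates
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)      // excludes U+FFFE, U+FFFF
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);  // supplementary planes
    }

    public static void main(String[] args) {
        System.out.println(isLegal(0x10)); // false
        System.out.println(isLegal(0x0A)); // true
        System.out.println(isLegal('A'));  // true
    }
}
```

Since a character reference has to expand to a character matching the
same production, writing 0x10 as &#x10; does not make it legal either.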

Hence, it seems that Xerces is in error here when it accepts the
control characters and writes them into the serialized XML.
Incidentally, nu.xom refuses to serialize illegal characters: it
raises an IllegalCharacterDataException (a RuntimeException) on
Element.appendChild("\u0010"), preventing invalid XML from being
generated.

In my situation, the data comes from a database that is fed through a
web interface that is happy to accept any UTF-8, including ASCII
control chars. I suppose all I can do is remove or replace the illegal
control chars before they reach the marshaller.
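
A rough sketch of that clean-up step, assuming it is acceptable to drop
the offending characters silently. XmlSanitizer and strip are
hypothetical names of my own, not a JAXB or Xerces API; the regex simply
negates the set of characters XML 1.0 allows.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of JAXB: removes every character that
// XML 1.0 does not allow, so the string can be marshalled safely.
final class XmlSanitizer {
    // Matches any code point outside the XML 1.0 Char production.
    private static final Pattern ILLEGAL = Pattern.compile(
        "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    private XmlSanitizer() {}

    static String strip(String input) {
        return ILLEGAL.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("a\u0010b")); // prints "ab"
    }
}
```

One would call strip() on each string field before handing the object
to the marshaller. Replacing with a placeholder such as "?" instead of
"" keeps a visible trace of the removed data, if that matters.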

cheers,
Erik


On Wed, Oct 22, 2008 at 12:13 AM, Erik van Zijst
<erik.van.zijst_at_gmail.com> wrote:
> Hi folks,
>
> I'm running into a problem where a string that contains valid UTF-8
> characters that are illegal in XML (e.g. 0x10), gets serialized by
> jaxb without escaping/encoding these bytes, effectively producing
> illegal XML.
>
> When I later try to unmarshal these objects, the unmarshaller crashes with:
>
> javax.xml.bind.UnmarshalException
> - with linked exception:
> [org.xml.sax.SAXParseException: An invalid XML character (Unicode:
> 0x10) was found in the element content of the document.]
> at javax.xml.bind.helpers.AbstractUnmarshallerImpl.createUnmarshalException(AbstractUnmarshallerImpl.java:315)
> ...
>
> I've attached a very small unit test that reproduces this problem. I
> was under the impression that the serializer would escape illegal
> characters by encoding them like: &#x10; but instead the test produces
> invalid xml at line 31 and then crashes on line 35.
> What am I overlooking?
>
> cheers,
> Erik
>