users@jaxb.java.net

Re: Escaping illegal characters during marshalling

From: Erik van Zijst <erik.van.zijst_at_gmail.com>
Date: Wed, 22 Oct 2008 23:20:15 +1100

On Wed, Oct 22, 2008 at 10:19 PM, Wolfgang Laun <wolfgang.laun_at_gmail.com> wrote:
> The ASCII control characters TAB, CR and LF are permitted.

Yes, but the other C0 control characters are not: 0x00 - 0x08, 0x0B,
0x0C and 0x0E - 0x1F (http://www.w3.org/TR/REC-xml/#charsets). They
cannot be written as numeric character references (e.g. &#x10;) either.
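
The point about character references can be checked with the parser that ships with the JDK. This is only a minimal sketch: the class and method names (CharRefCheck, parses) are made up for illustration, and it assumes the default JAXP DocumentBuilder. A reference to TAB should parse, while a reference to 0x10 should be rejected as not well-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of JAXB or of the attached unit test.
final class CharRefCheck {

    // Returns true if the default JAXP parser accepts the document,
    // false if it reports a well-formedness error.
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            return false;
        } catch (Exception e) {
            // Configuration or I/O problems are not what we are testing.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<a>&#x9;</a>"));   // reference to TAB
        System.out.println(parses("<a>&#x10;</a>"));  // reference to 0x10
    }
}
```

The default error handler may also print a "[Fatal Error]" line to stderr for the second document.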

To fix my problem of exposing our UTF-8 database through HTTP/XML, I
now replace every illegal character with U+FFFD (the Unicode
replacement character, conventionally used to represent a character
that couldn't be converted) before I pass it to the XML parser.
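
A rough sketch of that substitution, not the exact code I run: the class and method names (XmlCharSanitizer, isLegalXmlChar, sanitize) are illustrative, and the ranges are taken from the Char production in the XML 1.0 spec linked above.

```java
// Hypothetical helper illustrating the U+FFFD substitution.
final class XmlCharSanitizer {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every code point that XML 1.0 forbids with U+FFFD.
    // Iterates by code point so surrogate pairs stay intact; an unpaired
    // surrogate is itself illegal and is replaced as well.
    static String sanitize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            out.appendCodePoint(isLegalXmlChar(cp) ? cp : 0xFFFD);
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Calling XmlCharSanitizer.sanitize(value) on each string before it is set on the JAXB object leaves TAB, CR and LF alone and turns the other control characters into U+FFFD.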

cheers,
Erik


> On Wed, Oct 22, 2008 at 4:46 AM, Erik van Zijst <erik.van.zijst_at_gmail.com>
> wrote:
>>
>> I think I figured it out.
>>
>> While UTF-8 allows all ASCII control characters (e.g. 0x10), the XML
>> spec explicitly forbids these characters, both in raw and escaped
>> form:
>>
>> http://lists.xml.org/archives/xml-dev/199804/msg00502.html
>>
>> Hence, it seems that Xerces is in error here when it accepts the
>> control characters and writes them into the serialized XML.
>> Incidentally, nu.xom refuses to serialize illegal characters: it
>> raises an IllegalCharacterDataException (a RuntimeException) on
>> Element.appendChild("\u0010"), which prevents invalid XML from being
>> generated.
>>
>> In my situation, the data comes from a database that is fed through a
>> web interface that is happy to accept any UTF-8, including ASCII
>> control chars. I suppose all I can do is remove or replace all control
>> chars before they hit the parser.
>>
>> cheers,
>> Erik
>>
>>
>> On Wed, Oct 22, 2008 at 12:13 AM, Erik van Zijst
>> <erik.van.zijst_at_gmail.com> wrote:
>> > Hi folks,
>> >
>> > I'm running into a problem where a string containing valid UTF-8
>> > characters that are illegal in XML (e.g. 0x10) gets serialized by
>> > JAXB without these characters being escaped or encoded, effectively
>> > producing illegal XML.
>> >
>> > When I later try to unmarshal these objects, the unmarshaller crashes
>> > with:
>> >
>> > javax.xml.bind.UnmarshalException
>> > - with linked exception:
>> > [org.xml.sax.SAXParseException: An invalid XML character (Unicode:
>> > 0x10) was found in the element content of the document.]
>> > at
>> > javax.xml.bind.helpers.AbstractUnmarshallerImpl.createUnmarshalException(AbstractUnmarshallerImpl.java:315)
>> > ...
>> >
>> > I've attached a very small unit test that reproduces this problem. I
>> > was under the impression that the serializer would escape illegal
>> > characters by encoding them like: &#x10; but instead the test produces
>> > invalid XML at line 31 and then crashes on line 35.
>> > What am I overlooking?
>> >
>> > cheers,
>> > Erik
>> >