users@jaxb.java.net

Re: Escaping illegal characters during marshalling

From: Erik van Zijst <erik.van.zijst_at_gmail.com>
Date: Mon, 03 Nov 2008 17:48:05 +1100

aashishpatil wrote:
> Hi Erik,
>
> Would you be willing to share your code to replace the characters?

Certainly.

I solved the problem by writing a custom XMLStreamWriter class that can
be used to wrap an existing writer. It simply delegates all calls and
adds a bit of logic to filter out the ASCII control characters before
passing the call to the underlying writer. I've attached my source file.
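
In case the attachment is not at hand, here is a minimal sketch of the
delegate-and-filter idea. It is not the attached EscapingXMLStreamWriter:
rather than a hand-written delegating class it uses java.lang.reflect.Proxy
so the many XMLStreamWriter methods need not be spelled out, and it
replaces each forbidden control character with U+FFFD. The names
FilteringWriters, wrap and clean are made up for this example.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical helper; a sketch of the idea, not the attached class. */
final class FilteringWriters {

    private FilteringWriters() {}

    /** Control characters below 0x20 other than TAB, LF and CR. */
    static boolean isForbiddenControl(char c) {
        return c < 0x20 && c != '\t' && c != '\n' && c != '\r';
    }

    /** Returns a copy of s with each forbidden control char replaced by U+FFFD. */
    static String clean(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (isForbiddenControl(chars[i])) {
                chars[i] = '\uFFFD';
            }
        }
        return new String(chars);
    }

    /** Wraps a writer so every String and char[] argument is cleaned first. */
    static XMLStreamWriter wrap(XMLStreamWriter delegate) {
        return (XMLStreamWriter) Proxy.newProxyInstance(
                FilteringWriters.class.getClassLoader(),
                new Class<?>[] {XMLStreamWriter.class},
                (proxy, method, args) -> {
                    if (args != null) {
                        for (int i = 0; i < args.length; i++) {
                            if (args[i] instanceof String) {
                                args[i] = clean((String) args[i]);
                            } else if (args[i] instanceof char[]) {
                                // One-for-one replacement keeps the length,
                                // so start/len offsets stay valid.
                                args[i] = clean(new String((char[]) args[i]))
                                        .toCharArray();
                            }
                        }
                    }
                    try {
                        return method.invoke(delegate, args);
                    } catch (InvocationTargetException e) {
                        // Rethrow the delegate's own exception,
                        // e.g. an XMLStreamException.
                        throw e.getCause();
                    }
                });
    }
}
```

FilteringWriters.wrap(xmlStreamWriter) would then stand in for the
hand-written wrapper. Note that this sketch filters every String
argument, element and attribute names included, which is cruder than
filtering character data only.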

Note that I'm using JAXB implicitly from Jersey, so I also created a
custom javax.ws.rs.ext.MessageBodyWriter that uses
XMLOutputFactory.createXMLStreamWriter() to create a normal
XMLStreamWriter and then wraps it with my custom writer.

In your case, if you're not using Jersey, I think you should be able to
use the stream writer like this
(http://java.sun.com/javase/6/docs/api/index.html?javax/xml/bind/Marshaller.html):

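// Create a normal stream writer, then wrap it with the filtering writer.
XMLStreamWriter xmlStreamWriter =
            XMLOutputFactory.newInstance().createXMLStreamWriter( ... );
EscapingXMLStreamWriter filter =
            new EscapingXMLStreamWriter(xmlStreamWriter);

// Marshal through the wrapper instead of directly to the stream.
Marshaller m = jaxbContext.createMarshaller();
m.marshal(element, filter);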

P.S.
I'm using the CharOpenHashSet from the fastutil package
(http://fastutil.dsi.unimi.it) to do the filtering. If you don't want
that dependency you can of course adopt a different strategy.
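
One such strategy, as a minimal sketch, is a plain range check against
the Char production of the XML 1.0 spec
(http://www.w3.org/TR/REC-xml/#charsets), with no set lookup at all.
This is not what my attached class does; XmlChars, isLegal and sanitize
are names invented for this example, and it works on code points so that
surrogate pairs are kept intact.

```java
/** Hypothetical helper: range check based on the XML 1.0 Char production. */
final class XmlChars {

    private XmlChars() {}

    /**
     * Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
     *        | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     */
    static boolean isLegal(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Replaces every code point that XML 1.0 forbids with U+FFFD. */
    static String sanitize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // An unpaired surrogate comes back as its own value
            // (0xD800-0xDFFF), which isLegal rejects.
            int cp = s.codePointAt(i);
            sb.appendCodePoint(isLegal(cp) ? cp : 0xFFFD);
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

For example, XmlChars.sanitize("a\u0010b") leaves 'a' and 'b' alone and
turns the 0x10 in between into U+FFFD.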

cheers,
Erik


> Thanks,
> Aashish
>
>
> Erik van Zijst-7 wrote:
>> On Wed, Oct 22, 2008 at 10:19 PM, Wolfgang Laun <wolfgang.laun_at_gmail.com>
>> wrote:
>>> The ASCII control characters TAB, CR and LF are permitted.
>> Yes, but several others are not, such as 0x00 - 0x08, 0x0B, 0x0C, etc.
>> (http://www.w3.org/TR/REC-xml/#charsets), and you cannot encode them
>> as character references (using ampersands) either.
>>
>> To fix my problem of exposing our UTF-8 database through http/xml, I
>> now substitute every illegal character by 0xFFFD (conventionally used
>> to represent a character that couldn't be converted) before I pass it
>> to the xml parser.
>>
>> cheers,
>> Erik
>>
>>
>>> On Wed, Oct 22, 2008 at 4:46 AM, Erik van Zijst
>>> <erik.van.zijst_at_gmail.com>
>>> wrote:
>>>> I think I figured it out.
>>>>
>>>> While UTF-8 allows all ascii control characters (e.g. 0x10), the XML
>>>> spec explicitly forbids these characters, both in raw and escaped
>>>> format:
>>>>
>>>> http://lists.xml.org/archives/xml-dev/199804/msg00502.html
>>>>
>>>> Hence, it seems that Xerces is in error here when it accepts the
>>>> control characters and writes them into the serialized XML.
>>>> Incidentally, nu.xom refuses serialization of illegal characters,
>>>> raising an IllegalCharacterDataException (a RuntimeException) on
>>>> Element.appendChild("\u0010"), preventing invalid XML from being
>>>> generated.
>>>>
>>>> In my situation, the data comes from a database that is fed through a
>>>> web interface that is happy to accept any UTF-8, including ascii
>>>> control chars. I suppose all I can do is remove/replace all control
>>>> chars before they hit the parser.
>>>>
>>>> cheers,
>>>> Erik
>>>>
>>>>
>>>> On Wed, Oct 22, 2008 at 12:13 AM, Erik van Zijst
>>>> <erik.van.zijst_at_gmail.com> wrote:
>>>>> Hi folks,
>>>>>
>>>>> I'm running into a problem where a string containing valid UTF-8
>>>>> characters that are illegal in XML (e.g. 0x10) gets serialized by
>>>>> JAXB without escaping/encoding these bytes, effectively producing
>>>>> illegal XML.
>>>>>
>>>>> When I later try to unmarshal these objects, the unmarshaller crashes
>>>>> with:
>>>>>
>>>>> javax.xml.bind.UnmarshalException
>>>>> - with linked exception:
>>>>> [org.xml.sax.SAXParseException: An invalid XML character (Unicode:
>>>>> 0x10) was found in the element content of the document.]
>>>>> at javax.xml.bind.helpers.AbstractUnmarshallerImpl.createUnmarshalException(AbstractUnmarshallerImpl.java:315)
>>>>> ...
>>>>>
>>>>> I've attached a very small unit test that reproduces this problem. I
>>>>> was under the impression that the serializer would escape illegal
>>>>> characters by encoding them like &#x10;, but instead the test produces
>>>>> invalid XML at line 31 and then crashes on line 35.
>>>>> What am I overlooking?
>>>>>
>>>>> cheers,
>>>>> Erik
>>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe_at_jaxb.dev.java.net
>>>> For additional commands, e-mail: users-help_at_jaxb.dev.java.net
>>>>
>>>