users@jaxb.java.net

Re: Handling Special Chars

From: Kohsuke Kawaguchi <Kohsuke.Kawaguchi_at_Sun.COM>
Date: Thu, 29 Sep 2005 10:15:10 -0700

Ranjith,R wrote:
> Hi,
> I have input coming in as XML; sometimes it may contain
> string messages such as:
> <msg>The good,the bad & the ugly</msg>
> Now, we know any parser would reject this because of the "&", and JAXB
> unmarshalling fails too.
> I wanted to know what should be the ideal approach or good practice in such
> a scenario.

The ideal approach is to make an angry phone call to whoever is
sending you that message, and tell them to read the XML 1.0
recommendation.

If you can't do so, you first have to find an XML parser that is willing
to read that broken XML. I know of no such parser off the top of my
head, but there are some parsers that read God-awful HTML documents as
SAX events (like NekoHTML), so they might be able to cope with this, too.

Once you find a SAX parser that can do it, you can wrap it in a
SAXSource and pass that to the JAXB unmarshaller.
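
A minimal sketch of that wrapping step, using only JAXP classes. The class
name LenientSource and both method names are made up for illustration, and
standInReader() returns the JDK's ordinary strict parser, which will still
reject a bare "&"; you would substitute whatever tolerant XMLReader you
manage to find. The JAXB call is left as a comment because it assumes JAXB
is on the classpath and that you already have an Unmarshaller.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class LenientSource {
    // Pairs an XMLReader with the document text it should parse.
    public static SAXSource wrap(XMLReader reader, String xml) {
        return new SAXSource(reader, new InputSource(new StringReader(xml)));
    }

    // Stand-in only: the JDK's strict parser. Replace this with the
    // tolerant XMLReader of your choice.
    public static XMLReader standInReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    // With JAXB available, the source is handed over like this:
    //   Object result = unmarshaller.unmarshal(LenientSource.wrap(reader, xml));
}
```

JAXB then receives its SAX events from the reader inside the SAXSource
instead of creating a parser of its own.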

> Scan the XML input and replace such special characters with entity refs?

That may work, if you can make more assumptions about the input
documents. In general, however, it's as difficult as parsing it as XML,
as you need to be able to handle things like

<foo><![CDATA[ not this & ]]> but this & </foo>
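
If you do go the scan-and-replace route, a rough sketch might look like the
following. It rests on assumptions that may not hold for your input: a stray
"&" only ever appears in text or attribute values, never in comments,
processing instructions, or a DTD; CDATA sections are properly closed; and
anything shaped like "&name;" or "&#38;" is an intentional reference. The
class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AmpersandFixer {
    // Matches either a whole CDATA section (group 1), which is copied
    // through untouched, or an '&' that does not begin something shaped
    // like an entity or character reference.
    private static final Pattern TOKEN = Pattern.compile(
        "(<!\\[CDATA\\[.*?\\]\\]>)"
        + "|&(?!(?:[A-Za-z_:][A-Za-z0-9_:.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)",
        Pattern.DOTALL);

    public static String escapeBareAmpersands(String xml) {
        Matcher m = TOKEN.matcher(xml);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one.
            out.append(xml, last, m.start());
            // CDATA section: keep as-is. Bare ampersand: escape it.
            out.append(m.group(1) != null ? m.group(1) : "&amp;");
            last = m.end();
        }
        out.append(xml, last, xml.length());
        return out.toString();
    }
}
```

On the example above, this leaves the "&" inside the CDATA section alone and
turns the second one into "&amp;". It does not check that a named reference
is actually declared, so something like "AT&T;" would slip through unescaped.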


-- 
Kohsuke Kawaguchi
Sun Microsystems                   kohsuke.kawaguchi_at_sun.com