Re: Escaping illegal characters during marshalling

From: aashishpatil <aashishpatil_at_acm.org>
Date: Sun, 2 Nov 2008 21:35:40 -0800 (PST)

Hi Erik,

Would you be willing to share your code to replace the characters?

Thanks,
Aashish

Erik van Zijst-7 wrote:
>
> On Wed, Oct 22, 2008 at 10:19 PM, Wolfgang Laun <wolfgang.laun_at_gmail.com>
> wrote:
>> The ASCII control characters TAB, CR and LF are permitted.
>
> Yes, but several others are not, such as 0x00 - 0x08, 0x0B, 0x0C and
> 0x0E - 0x1F (http://www.w3.org/TR/REC-xml/#charsets), and you cannot
> encode them as character references (e.g. &#x10;) either.
>
> To fix my problem of exposing our UTF-8 database through http/xml, I
> now replace every illegal character with U+FFFD (the Unicode
> replacement character, conventionally used to represent a character
> that couldn't be converted) before I pass it to the xml parser.
>
> cheers,
> Erik
>
>
>> On Wed, Oct 22, 2008 at 4:46 AM, Erik van Zijst
>> <erik.van.zijst_at_gmail.com>
>> wrote:
>>>
>>> I think I figured it out.
>>>
>>> While UTF-8 allows all ascii control characters (e.g. 0x10), the XML
>>> spec explicitly forbids these characters, both in raw and escaped
>>> format:
>>>
>>> http://lists.xml.org/archives/xml-dev/199804/msg00502.html
>>>
>>> Hence, it seems that xerces is in error here when it accepts the
>>> control characters and writes them into the serialized xml.
>>> Incidentally, nu.xom refuses serialization of illegal characters,
>>> raising an IllegalCharacterDataException (a RuntimeException) on
>>> Element.appendChild("\u0010"), preventing invalid xml from being
>>> generated.
>>>
>>> In my situation, the data comes from a database that is fed through a
>>> web interface that is happy to accept any UTF-8, including ascii
>>> control chars. I suppose all I can do is remove/replace all control
>>> chars before they hit the parser.
>>>
>>> cheers,
>>> Erik
>>>
>>>
>>> On Wed, Oct 22, 2008 at 12:13 AM, Erik van Zijst
>>> <erik.van.zijst_at_gmail.com> wrote:
>>> > Hi folks,
>>> >
>>> > I'm running into a problem where a string that contains valid UTF-8
>>> > characters that are illegal in XML (e.g. 0x10), gets serialized by
>>> > jaxb without escaping/encoding these bytes, effectively producing
>>> > illegal XML.
>>> >
>>> > When I later try to unmarshal these objects, the unmarshaller crashes
>>> > with:
>>> >
>>> > javax.xml.bind.UnmarshalException
>>> > - with linked exception:
>>> > [org.xml.sax.SAXParseException: An invalid XML character (Unicode:
>>> > 0x10) was found in the element content of the document.]
>>> > at javax.xml.bind.helpers.AbstractUnmarshallerImpl.createUnmarshalException(AbstractUnmarshallerImpl.java:315)
>>> > ...
>>> >
>>> > I've attached a very small unit test that reproduces this problem. I
>>> > was under the impression that the serializer would escape illegal
>>> > characters by encoding them like: &#x10; but instead the test produces
>>> > invalid xml at line 31 and then crashes on line 35.
>>> > What am I overlooking?
>>> >
>>> > cheers,
>>> > Erik
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe_at_jaxb.dev.java.net
>>> For additional commands, e-mail: users-help_at_jaxb.dev.java.net
>>>
>>
>>
>
>
>
>
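
For reference, the substitution Erik describes above (replacing every
character that XML 1.0 forbids with U+FFFD before the data reaches the
marshaller) could look roughly like the sketch below. This is not Erik's
actual code: the class name XmlCharSanitizer and its two methods are made up
for illustration, and the set of legal characters is taken from the Char
production at http://www.w3.org/TR/REC-xml/#charsets.

```java
/**
 * Hypothetical helper (not from the original thread): replaces every code
 * point that is not a legal XML 1.0 character with U+FFFD.
 */
public final class XmlCharSanitizer {

    /** Unicode replacement character, as suggested in the thread. */
    private static final int REPLACEMENT = 0xFFFD;

    private XmlCharSanitizer() {
    }

    /**
     * XML 1.0 Char production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     */
    static boolean isLegalXmlChar(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Returns a copy of input with illegal XML characters replaced. */
    public static String sanitize(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // Work on code points so that valid surrogate pairs survive;
            // an unpaired surrogate falls in #xD800-#xDFFF and is replaced.
            int codePoint = input.codePointAt(i);
            out.appendCodePoint(isLegalXmlChar(codePoint) ? codePoint : REPLACEMENT);
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

One way to use it, assuming you control the code that fills the bound
objects, is to call XmlCharSanitizer.sanitize(value) on each String before
setting it on the object you hand to the marshaller. TAB, CR and LF pass
through unchanged, since the spec permits them.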

-- 
View this message in context: http://www.nabble.com/Escaping-illegal-characters-during-marshalling-tp20090044p20297583.html
Sent from the java.net - jaxb users mailing list archive at Nabble.com.