users@jaxb.java.net

Re: Escaping, or removing Invalid XML Characters.

From: Nick Pellow <nick.pellow_at_mindmatics.de>
Date: Thu, 17 Feb 2005 18:11:45 +0100

Hi Kohsuke,

Thanks for the info.

I wrote this method to clean such characters out of
any String I pass to JAXB for XML marshalling.

It replaces every character in the range 0x0000 to 0x001F, except
0x0009, 0x000A and 0x000D (i.e. the illegal control characters), with a
replacement character of your choice.

Cheers,
Nick


    // Requires: import java.util.Arrays;

    /** Holder of all illegal XML 1.0 control chars, sorted for binary search. */
    private static final char[] ILLEGAL_XML_1_0_CHARS;

    static {
        final StringBuffer buff = new StringBuffer();
        for (char i = 0x0000; i < 0x0020; i++) {
            if (i != 0x0009 &&
                    i != 0x000A &&
                    i != 0x000D) {
                buff.append(i);
            }
        }
        // Work on chars, not bytes: getBytes() depends on the platform encoding.
        ILLEGAL_XML_1_0_CHARS = buff.toString().toCharArray();
        Arrays.sort(ILLEGAL_XML_1_0_CHARS);
    }

    /**
     * Cleans a given String, so that it can be safely used in XML.
     * All illegal control characters (below 0x0020, except tab, line feed
     * and carriage return) are replaced with the given replacement character.
     * Valid XML characters are described here:
     * {@link "http://www.w3c.org/TR/2000/REC-xml-20001006#dt-character"}
     *
     * @param pString the string to clean
     * @param pReplacement the char to use to replace the invalid characters
     * @return the string, cleaned for XML.
     */
    public static String cleanStringForXml(String pString, char pReplacement) {
        final char[] chars = pString.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            // A non-negative index means the char is in the illegal set.
            if (Arrays.binarySearch(ILLEGAL_XML_1_0_CHARS, chars[i]) >= 0) {
                chars[i] = pReplacement;
            }
        }
        return new String(chars);
    }
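
The method above only covers the control characters below 0x0020. The
Char production it links to also excludes 0xFFFE, 0xFFFF and unpaired
surrogates, so here is a rough sketch of a variant that checks the whole
range. The class name XmlCharCleaner and the helper isLegalXmlChar are
made up for illustration, and it assumes a JDK with code point support
(Java 5 or later).

```java
/** Hypothetical helper class; the names here are for illustration only. */
final class XmlCharCleaner {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point that is not a legal XML 1.0 character. */
    static String cleanStringForXml(String s, char replacement) {
        final StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // codePointAt returns a lone surrogate as-is, which is then rejected.
            final int cp = s.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Calling XmlCharCleaner.cleanStringForXml(text, ' ') before setting the
value on the JAXB object should keep the marshaller from ever seeing the
bad characters.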





>-----Original Message-----
>From: Kohsuke Kawaguchi [mailto:Kohsuke.Kawaguchi_at_Sun.COM]
>Sent: Wednesday, 16 February 2005 22:17
>To: users_at_jaxb.dev.java.net
>Subject: Re: Escaping, or removing Invalid XML Characters.
>
>
>Nick Pellow wrote:
>> Then I get the following error when marshalling:
>>
>> java.io.IOException: The character '^C' is an invalid XML character
>>     at org.apache.xml.serialize.BaseMarkupSerializer.characters(Unknown Source)
>>
>> What is the cleanest way to remove such invalid control characters from a
>> content String when marshalling using XML version 1.0 ?
>
>The easiest way is probably not to put them into JAXB objects in the
>first place :-)
>
>That said, if you really want to just remove those characters, what you
>can do is to write a SAX XMLFilterImpl. You can intercept the characters
>and startElement methods to modify the text and attribute values by
>removing those illegal chars.
>
>Then you can forward it to some kind of XMLWriter to print out. Search
>the archive for 'XMLWriter' for more about how to turn SAX events to
>Unicode and angle brackets.
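
For the archive, the filter approach described above could look roughly
like the sketch below. The class name IllegalCharFilter and its helpers
isIllegal and strip are made up for illustration; only XMLFilterImpl,
AttributesImpl and the SAX callback signatures come from the standard
org.xml.sax API. It drops the illegal control characters from text nodes
and attribute values and forwards everything else unchanged.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: drops illegal XML 1.0 control chars from SAX events. */
class IllegalCharFilter extends XMLFilterImpl {

    /** True for control chars below 0x20 other than tab, LF and CR. */
    static boolean isIllegal(char c) {
        return c < 0x20 && c != 0x9 && c != 0xA && c != 0xD;
    }

    /** Returns s without the illegal control characters. */
    static String strip(String s) {
        final StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            if (!isIllegal(s.charAt(i))) {
                out.append(s.charAt(i));
            }
        }
        return out.toString();
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Copy the legal chars into a fresh buffer; the caller owns ch.
        final char[] clean = new char[length];
        int n = 0;
        for (int i = start; i < start + length; i++) {
            if (!isIllegal(ch[i])) {
                clean[n++] = ch[i];
            }
        }
        super.characters(clean, 0, n);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        // Attributes may be read-only, so clean a copy of them.
        final AttributesImpl copy = new AttributesImpl(atts);
        for (int i = 0; i < copy.getLength(); i++) {
            copy.setValue(i, strip(copy.getValue(i)));
        }
        super.startElement(uri, localName, qName, copy);
    }
}
```

Since XMLFilterImpl is itself a ContentHandler, the idea would be to call
setContentHandler on the filter with whatever writer you use, and then
pass the filter to Marshaller.marshal(Object, ContentHandler).
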
>
>--
>Kohsuke Kawaguchi
>Sun Microsystems kohsuke.kawaguchi_at_sun.com
>