dev@fi.java.net

Re: Support required with resulting dependency on Xerces <was> Re: XML Names and FI parsing

From: Eduardo Pelegri-Llopart <Eduardo.Pelegrillopart_at_Sun.COM>
Date: Thu, 03 Mar 2005 09:38:41 -0800

Yep, having a dependency on the JWSDP version of the packages is not
ideal, but if we document this clearly it should be OK. And maybe
somebody will want to contribute by maintaining a version that uses the ASF
package names - that is what OSS projects are all about, right? :-)

        - eduard/o

Paul Sandoz wrote:
> Hi,
>
> After further thought we should implement this so as to be credible with
> the XML community.
>
> For decoding and checking UTF-8 encoded characters we can use a
> 256-entry state table indexed by the first byte decoded.
>
> For Basic Latin characters it is easy to determine whether a byte
> corresponds to a valid character for NCName and NCNameChar definitions
> in O(1) time.
>
> In addition the table can also be used to determine if the UTF-8 encoded
> character is encoded in 11 bits or 16 bits. To check for valid NCName
> and NCNameChar characters for such larger characters we really need to
> reuse the Xerces XMLChar class [1]. That class builds a 2^16-entry character
> table and we do not want to have to duplicate it.
>
> This is unfortunately going to create a dependency. Since the Xerces
> classes are renamed in JWSDP and JDK 5.0, we should probably depend on the
> renamed classes. (I think using reflection to call such a method is out
> of the question.) The JavaDoc states that it is possible to
> get access to the CHARS array, but this is no longer the case (otherwise
> our problem would be solved).
>
> Distributing the renamed jar is not a problem since it is already
> distributed in the JWSDP dir of Japex.
>
> However, now we are creating a specific dependency on a renamed Xerces
> which makes me uncomfortable.
>
> Paul.
>
> [1]
> http://xml.apache.org/xerces2-j/javadocs/xerces2/org/apache/xerces/util/XMLChar.html
>
>
> NCName (first character)
>
> [#x0041-#x005A] 01000001 - 01011010
> [#x005F] 01011111
> [#x0061-#x007A] 01100001 - 01111010
>
> NCNameChar
>
> [#x002D-#x002E] 00101101 - 00101110
> [#x0030-#x0039] 00110000 - 00111001
> [#x0041-#x005A] 01000001 - 01011010
> [#x005F] 01011111
> [#x0061-#x007A] 01100001 - 01111010
>
> Paul Sandoz wrote:
>
>> Hi,
>>
>> The XML specification defines the lexical representation of a Name for
>> an element or attribute [1]:
>>
>> [5] Name ::= (Letter | '_' | ':') (NameChar)*
>> [6] Names ::= Name (#x20 Name)*
>> [7] Nmtoken ::= (NameChar)+
>> [8] Nmtokens ::= Nmtoken (#x20 Nmtoken)*
>>
>>
>> The Namespaces in XML specification modifies Name accordingly [2]:
>>
>> [4] NCName ::= (Letter | '_') (NCNameChar)*
>> [5] NCNameChar ::= Letter | Digit | '.' | '-' | '_' |
>> CombiningChar | Extender
>>
>>
>> The FI Decoder currently does not check names against the Name or
>> NCName productions.
>>
>> This should be fairly easy to implement and checking can be performed
>> at the level of the UTF-8 encoded bytes since such strings will always
>> be encoded in UTF-8.
>>
>>
>> However, I am wondering about the consequences of not implementing
>> such checking.
>>
>> Seems to me that round-tripping is the major issue since a
>> non-well-formed XML document may result:
>>
>> 'Synthetic' XML infoset ->
>> fast infoset document ->
>> 'Synthetic' XML infoset ->
>> Non-well-formed XML document
>>
>>
>>
>> but there also seem to be valid reasons for not requiring such checks:
>>
>> - FI is not designed to be human readable, so who cares if there is a
>> numeric character at the start? (In fact, what if people want a numeric
>> character at the start, just as people may want five decimal zeros
>> after a number? Tongue in cheek :-) ).
>>
>> - usage in systems that involve WSDL, XSD and XSLTC will keep in check
>> any 'leakage' as the tags that matter will be defined in such
>> documents.
>>
>>
>> Still, I am not entirely convinced.
>>
>> I suspect that most XML serializers do not bother to do such checks as
>> the parser will do them anyway. This leaves an interesting situation
>> where to interoperate successfully all serializers must be producing
>> well-formed documents. So to what extent would FI perturb the
>> system, and to what extent would deliberate misuse cause damage?
>>
>>
>> Seems like the best approach may be to have a property that enables
>> such checking, which could be turned off by default.
>>
>> What do people think?
>>
>> Paul.
>>
>> [1] http://www.w3.org/TR/REC-xml/#NT-Name
>> [2] http://www.w3.org/TR/REC-xml-names/#NT-NCName
>
>
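
For illustration, the 256-entry first-byte table discussed in the quoted
message might look roughly like the sketch below. It is only a sketch under
stated assumptions: the class name, state constants and helper methods are
hypothetical and are not taken from the FI code base. It classifies Basic
Latin bytes using the NCName and NCNameChar ranges listed above, and marks
UTF-8 lead bytes by sequence length (2 bytes for up to 11 bits, 3 bytes for
up to 16 bits). Characters outside Basic Latin would still need a separate
lookup, such as the one in Xerces XMLChar, which is not shown here.

```java
// Hypothetical sketch; names and state codes are assumptions for illustration.
final class FirstByteTable {
    static final int INVALID = 0;          // continuation byte or illegal lead byte
    static final int ASCII_OTHER = 1;      // Basic Latin, not allowed in an NCName
    static final int ASCII_NAME_CHAR = 2;  // Basic Latin, allowed only after the first character
    static final int ASCII_NAME_START = 3; // Basic Latin, allowed as first or later character
    static final int LEAD_2 = 4;           // 110xxxxx: 2-byte sequence, up to 11 bits
    static final int LEAD_3 = 5;           // 1110xxxx: 3-byte sequence, up to 16 bits
    static final int LEAD_4 = 6;           // 11110xxx: 4-byte sequence, beyond 16 bits

    // One entry per possible value of the first byte.
    private static final byte[] TABLE = new byte[256];

    static {
        for (int b = 0; b < 256; b++) {
            TABLE[b] = (byte) classify(b);
        }
    }

    private static int classify(int b) {
        if (b < 0x80) {
            // [#x0041-#x005A] | #x005F | [#x0061-#x007A]
            if ((b >= 'A' && b <= 'Z') || b == '_' || (b >= 'a' && b <= 'z')) {
                return ASCII_NAME_START;
            }
            // [#x002D-#x002E] | [#x0030-#x0039]
            if (b == '-' || b == '.' || (b >= '0' && b <= '9')) {
                return ASCII_NAME_CHAR;
            }
            return ASCII_OTHER;
        }
        if (b >= 0xC2 && b <= 0xDF) return LEAD_2;
        if (b >= 0xE0 && b <= 0xEF) return LEAD_3;
        if (b >= 0xF0 && b <= 0xF4) return LEAD_4;
        // 0x80-0xBF are continuation bytes, 0xC0/0xC1 would be overlong,
        // 0xF5-0xFF cannot start a valid UTF-8 sequence.
        return INVALID;
    }

    // O(1) lookup of the state for the first byte of an encoded character.
    static int state(byte first) {
        return TABLE[first & 0xFF];
    }

    static boolean isAsciiNCNameStart(byte b) {
        return state(b) == ASCII_NAME_START;
    }

    static boolean isAsciiNCNameChar(byte b) {
        int s = state(b);
        return s == ASCII_NAME_START || s == ASCII_NAME_CHAR;
    }
}
```

A decoder could switch on state(firstByte): accept or reject Basic Latin
bytes directly, and for LEAD_2 or LEAD_3 decode the remaining bytes and pass
the resulting character to whichever larger lookup is chosen.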