Hi,
After further thought we should implement this so as to be credible with
the XML community.
For decoding and checking UTF-8 encoded characters we can use a
256-entry table indexed by the first byte decoded.
For Basic Latin characters this makes it possible to determine in O(1)
time whether a byte corresponds to a valid character according to the
NCName and NCNameChar definitions.
In addition, the table can be used to determine whether the UTF-8
encoded character is encoded in 11 bits (a two-byte sequence) or 16
bits (a three-byte sequence). To check such larger characters for
validity against NCName and NCNameChar we really need to reuse the
Xerces XMLChar class [1]. This creates a 2^16-entry character table and
we do not want to duplicate it.
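A minimal sketch of how such a first-byte table might look (the flag
names and layout here are my own assumptions, not actual FI decoder
code):

```java
// Hypothetical 256-entry table indexed by the first UTF-8 byte.
// One lookup classifies the byte as a valid NCName start character,
// a valid NCName subsequent character, a multi-byte lead byte, or invalid.
public class Utf8FirstByteTable {
    static final byte INVALID      = 0;
    static final byte NCNAME_START = 1; // valid first character of an NCName
    static final byte NCNAME_CHAR  = 2; // valid subsequent character only
    static final byte TWO_BYTES    = 3; // lead byte of a 2-byte (11-bit) sequence
    static final byte THREE_BYTES  = 4; // lead byte of a 3-byte (16-bit) sequence

    static final byte[] TABLE = new byte[256];
    static {
        for (int b = 'A'; b <= 'Z'; b++) TABLE[b] = NCNAME_START;
        for (int b = 'a'; b <= 'z'; b++) TABLE[b] = NCNAME_START;
        TABLE['_'] = NCNAME_START;
        for (int b = '0'; b <= '9'; b++) TABLE[b] = NCNAME_CHAR;
        TABLE['-'] = NCNAME_CHAR;
        TABLE['.'] = NCNAME_CHAR;
        // 0xC2-0xDF: valid lead bytes of 2-byte sequences
        // (0xC0 and 0xC1 would be overlong encodings, so stay INVALID)
        for (int b = 0xC2; b <= 0xDF; b++) TABLE[b] = TWO_BYTES;
        // 0xE0-0xEF: lead bytes of 3-byte sequences (up to 16 bits of code point)
        for (int b = 0xE0; b <= 0xEF; b++) TABLE[b] = THREE_BYTES;
    }
}
```

Characters beyond the 3-byte table entries would fall through to the
INVALID/default handling, or to a full XMLChar-style lookup.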
This is unfortunately going to create a dependency. Since the Xerces
classes are renamed for JWSDP and JDK 5.0, we should probably depend on
the renamed classes. (I think the use of reflection for calling such a
method is out of the question.) The JavaDoc states that it is possible
to get access to the CHARS array, but this is no longer the case
(otherwise our problem would be solved).
Distributing the renamed jar is not a problem since it is already
distributed in the JWSDP dir of Japex.
However, this creates a specific dependency on a renamed Xerces, which
makes me uncomfortable.
Paul.
[1]
http://xml.apache.org/xerces2-j/javadocs/xerces2/org/apache/xerces/util/XMLChar.html
NCName (first character)
[#x0041-#x005A] 01000001 - 01011010
[#x005F] 01011111
[#x0061-#x007A] 01100001 - 01111010
NCNameChar
[#x002D-#x002E] 00101101 - 00101110
[#x0030-#x0039] 00110000 - 00111001
[#x0041-#x005A] 01000001 - 01011010
[#x005F] 01011111
[#x0061-#x007A] 01100001 - 01111010
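For the ASCII ranges listed above, a byte-level NCName check could be
sketched roughly as follows (names are my own; characters outside Basic
Latin would still need the full XMLChar-style table):

```java
// Hypothetical ASCII-only NCName validation over raw UTF-8 bytes,
// using exactly the ranges listed above.
public final class NCNameAscii {
    // NCName (first character): [#x41-#x5A] | #x5F | [#x61-#x7A]
    static boolean isStart(int b) {
        return (b >= 0x41 && b <= 0x5A)   // A-Z
            || b == 0x5F                  // _
            || (b >= 0x61 && b <= 0x7A);  // a-z
    }

    // NCNameChar: start chars plus [#x2D-#x2E] and [#x30-#x39]
    static boolean isChar(int b) {
        return isStart(b)
            || (b >= 0x2D && b <= 0x2E)   // - .
            || (b >= 0x30 && b <= 0x39);  // 0-9
    }

    // Checks a UTF-8 byte array; rejects any non-ASCII lead byte
    // (those would need the Xerces XMLChar table instead).
    static boolean isNCName(byte[] utf8) {
        if (utf8.length == 0 || !isStart(utf8[0] & 0xFF)) return false;
        for (int i = 1; i < utf8.length; i++) {
            if (!isChar(utf8[i] & 0xFF)) return false;
        }
        return true;
    }
}
```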
Paul Sandoz wrote:
> Hi,
>
> The XML specification defines the lexical representation of a Name for
> an element or attribute [1]:
>
> [5] Name ::= (Letter | '_' | ':') (NameChar)*
> [6] Names ::= Name (#x20 Name)*
> [7] Nmtoken ::= (NameChar)+
> [8] Nmtokens ::= Nmtoken (#x20 Nmtoken)*
>
>
> The Namespaces in XML specification modifies Name accordingly [2]:
>
> [4] NCName ::= (Letter | '_') (NCNameChar)*
> [5] NCNameChar ::= Letter | Digit | '.' | '-' | '_' |
> CombiningChar | Extender
>
>
> The FI Decoder currently does no such checking according to Name or NCName.
>
> This should be fairly easy to implement and checking can be performed at
> the level of the UTF-8 encoded bytes since such strings will always be
> encoded in UTF-8.
>
>
> However, i am wondering about the consequences of not implementing such
> checking.
>
> Seems to me that round-tripping is the major issue since a
> non-well-formed XML document may result:
>
> 'Synthetic' XML infoset ->
> fast infoset document ->
> 'Synthetic' XML infoset ->
> Non-well-formed XML document
>
>
>
> but there also seem to be valid reasons for not requiring such checks:
>
> - FI is not designed to be human readable, so who cares if there is a
> numeric character at the start? (In fact, what if people want a numeric
> character at the start, just as people may want five decimal zeros
> after a number. Tongue in cheek :-) ).
>
> - usage in systems that involve WSDL, XSD and XSLT will keep in check
> any 'leakage' as the tags that matter will be defined in such
> documents.
>
>
> Still, i am not entirely convinced.
>
> I suspect that most XML serializers do not bother to do such checks as
> the parser will do them anyway. This leaves an interesting situation
> where, to interoperate successfully, all serializers must produce
> well-formed documents. So to what extent would FI perturb the system?
> And to what extent would deliberate misuse cause damage?
>
>
> Seems like the best approach may be to have a property that allows for
> such checking, which by default could be turned off.
>
> What do people think?
>
> Paul.
>
> [1] http://www.w3.org/TR/REC-xml/#NT-Name
> [2] http://www.w3.org/TR/REC-xml-names/#NT-NCName
--
| ? + ? = To question
----------------\
Paul Sandoz
x38109
+33-4-76188109