dev@fi.java.net

XML Names and FI parsing

From: Paul Sandoz <Paul.Sandoz_at_Sun.COM>
Date: Fri, 25 Feb 2005 13:01:20 +0100

Hi,

The XML specification defines the lexical representation of a Name for
an element or attribute [1]:

[5] Name ::= (Letter | '_' | ':') (NameChar)*
[6] Names ::= Name (#x20 Name)*
[7] Nmtoken ::= (NameChar)+
[8] Nmtokens ::= Nmtoken (#x20 Nmtoken)*


The Namespaces in XML specification modifies Name accordingly [2]:

[4] NCName ::= (Letter | '_') (NCNameChar)*
[5] NCNameChar ::= Letter | Digit | '.' | '-' | '_' |
                                 CombiningChar | Extender


The FI Decoder currently performs no checking of names against Name or
NCName.

This should be fairly easy to implement and checking can be performed at
the level of the UTF-8 encoded bytes since such strings will always be
encoded in UTF-8.
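
As a rough sketch of what byte-level checking could look like (not the
actual FI Decoder code): the class and method names below are
hypothetical, and the XML 1.0 Letter, Digit and CombiningChar tables are
only approximated with java.lang.Character categories. ASCII bytes are
tested directly; multi-byte sequences are decoded to a code point first.
Overlong UTF-8 encodings are not rejected here.

```java
final class Utf8NameCheck {
    // ASCII subset of NCNameChar: letters, digits, '.', '-', '_'.
    static boolean isAsciiNameChar(int b) {
        return isAsciiNameStart(b) || (b >= '0' && b <= '9')
            || b == '.' || b == '-';
    }

    // ASCII subset of the NCName start set: letters and '_'.
    static boolean isAsciiNameStart(int b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || b == '_';
    }

    // Assumption: Character categories stand in for the XML 1.0 tables.
    static boolean isNonAsciiNameChar(int cp, boolean first) {
        if (first) return Character.isLetter(cp);
        return Character.isLetterOrDigit(cp)
            || Character.getType(cp) == Character.NON_SPACING_MARK;
    }

    // Checks buf[off .. off+len) as a UTF-8 encoded NCName.
    static boolean isNCName(byte[] buf, int off, int len) {
        if (len <= 0) return false;
        int end = off + len;
        int i = off;
        boolean first = true;
        while (i < end) {
            int b = buf[i] & 0xFF;
            int cp, n;
            if (b < 0x80)                { cp = b;        n = 1; }
            else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; n = 2; }
            else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; n = 3; }
            else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; n = 4; }
            else return false;            // invalid lead byte
            if (i + n > end) return false; // truncated sequence
            for (int k = 1; k < n; k++) {
                int c = buf[i + k] & 0xFF;
                if ((c & 0xC0) != 0x80) return false; // not a continuation
                cp = (cp << 6) | (c & 0x3F);
            }
            boolean ok;
            if (n == 1) {
                ok = first ? isAsciiNameStart(cp) : isAsciiNameChar(cp);
            } else {
                ok = isNonAsciiNameChar(cp, first);
            }
            if (!ok) return false;
            first = false;
            i += n;
        }
        return true;
    }
}
```

For names that are pure ASCII, which is probably the common case, this
never leaves the single-byte branch.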


However, I am wondering about the consequences of not implementing such
checking.

Seems to me that round-tripping is the major issue since a
non-well-formed XML document may result:

'Synthetic' XML infoset ->
     fast infoset document ->
         'Synthetic' XML infoset ->
             Non-well-formed XML document



But there also seem to be valid reasons for not requiring such checks:

- FI is not designed to be human readable, so who cares if there is a
   numeric character at the start? (In fact, what if people want a
   numeric character at the start, just like people may want five
   decimal zeros after a number. Tongue in cheek :-) ).

- Usage in systems that involve WSDL, XSD and XSLTC will keep any
   'leakage' in check, as the tags that matter will be defined in such
   documents.


Still, I am not entirely convinced.

I suspect that most XML serializers do not bother to do such checks as
the parser will do them anyway. This leaves an interesting situation
where, to interoperate successfully, all serializers must be producing
well-formed documents. So to what extent would FI perturb the system?
And to what extent would deliberate misuse cause damage?
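
A small sketch of that situation using the standard StAX API
(javax.xml.stream). The class and method names are hypothetical, and
whether a given XMLStreamWriter implementation checks names is
implementation-dependent, so the code only reports which side, if any,
rejected the name.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

final class RoundTripDemo {
    // Returns "writer" if the serializer rejected the element name,
    // "parser" if the parser rejected the serialized document,
    // and "none" if the name survived the round trip.
    static String whoRejects(String elementName) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement(elementName);
            w.writeEndElement();
            w.flush();
            w.close();
        } catch (XMLStreamException | RuntimeException e) {
            return "writer";
        }
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(out.toString()));
            while (r.hasNext()) {
                r.next(); // throws if the document is not well-formed
            }
            r.close();
        } catch (XMLStreamException | RuntimeException e) {
            return "parser";
        }
        return "none";
    }
}
```

Calling whoRejects("1abc") should never give "none" with a conformant
parser. If it gives "parser", the serializer emitted a non-well-formed
document and left the check to the other end.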


Seems like the best approach may be to have a property that enables
such checking, which could be turned off by default.
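
Roughly what I have in mind, as a sketch only: the property name, the
class and its methods are all hypothetical and not part of any existing
FI API, and the name test approximates the XML 1.0 character classes
with java.lang.Character.

```java
final class NameCheckingOption {
    // Hypothetical property name, for illustration only.
    static final String NAME_CHECKING_PROPERTY =
        "hypothetical.fi.decoder.name-checking";

    // Off by default, as proposed.
    private boolean nameChecking = false;

    void setProperty(String name, Object value) {
        if (NAME_CHECKING_PROPERTY.equals(name)) {
            nameChecking = Boolean.TRUE.equals(value);
        } else {
            throw new IllegalArgumentException("Unknown property: " + name);
        }
    }

    // Stands in for the point where a decoder hands out a decoded
    // local name. With checking off the name passes through untouched.
    String acceptLocalName(String localName) {
        if (nameChecking && !looksLikeNCName(localName)) {
            throw new IllegalArgumentException(
                "Not an NCName: " + localName);
        }
        return localName;
    }

    // Approximate NCName test: (Letter | '_') followed by NCNameChar*.
    static boolean looksLikeNCName(String s) {
        if (s.isEmpty()) return false;
        int cp = s.codePointAt(0);
        if (cp != '_' && !Character.isLetter(cp)) return false;
        for (int i = Character.charCount(cp); i < s.length();
                 i += Character.charCount(cp)) {
            cp = s.codePointAt(i);
            boolean ok = cp == '.' || cp == '-' || cp == '_'
                || Character.isLetterOrDigit(cp)
                || Character.getType(cp) == Character.NON_SPACING_MARK;
            if (!ok) return false;
        }
        return true;
    }
}
```

With the default setting acceptLocalName("1abc") returns the name
unchanged; after setting the property to Boolean.TRUE the same call
throws.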

What do people think?

Paul.

[1] http://www.w3.org/TR/REC-xml/#NT-Name
[2] http://www.w3.org/TR/REC-xml-names/#NT-NCName
-- 
| ? + ? = To question
----------------\
    Paul Sandoz
         x38109
+33-4-76188109