Re: TODO for FastInfoset

From: Joe Wang <joe.wang_at_sun.com>
Date: Thu, 13 Jan 2005 11:34:26 -0800

Paul Sandoz wrote:

>Paul Sandoz wrote:
>
>
>>Generic
>>-------
>>
>>
>>
>
>Two further generic items are:
>
>- Duplicate attributes
>
>To comply fully with well-formed XML it is necessary to ensure that
>there are no duplicate attributes among the [attributes] property of an EII.
>
>For efficient checking an integer array the size of the attribute name
>table is required and an integer that represents the current parser
>state. Each time an EII with AIIs occurs the integer is incremented. For
>each AII the index that corresponds to the AII is used to obtain the
>value in the integer array. If this value is equal to the integer then
>the attribute has occurred before. If this value is not equal to the
>integer then the attribute has not occurred before and the value in the
>integer array is set to the integer.
>
>
>
I noticed the attribute list array. Is there any reason we are not using
a map to hold the attributes?

>For this solution it needs to be ensured that integer wrap around does
>not occur, thus the integer array needs to be reset when the integer
>reaches the maximum value. This probably needs to be checked for each
>EII with AIIs rather than at the beginning of a parse. It appears at
>first inconceivable that a document could have 2^32 - 1 elements with
>attributes, however some documents, reputedly in the airline industry, are
>meant to be huge, and there is the case of XMPP (IIRC) where an XML
>document is transmitted for the life time of a network connection.
>
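The duplicate-attribute check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical (not taken from the FI code base), the attribute name table is assumed to have a fixed size, and the reset is done just before a signed Java int would overflow.

```java
import java.util.Arrays;

/** Hypothetical helper; names are illustrative only. */
final class DuplicateAttributeChecker {
    // One slot per entry in the attribute name table (assumed fixed size here).
    private final int[] lastSeen;
    // Stamp of the element currently being parsed; 0 means "never seen".
    private int currentElement = 0;

    DuplicateAttributeChecker(int attributeNameTableSize) {
        lastSeen = new int[attributeNameTableSize];
    }

    /** Call once for each EII that has AIIs, before its attributes are checked. */
    void startElementWithAttributes() {
        if (currentElement == Integer.MAX_VALUE) {
            // Reset the array before the stamp would wrap around.
            Arrays.fill(lastSeen, 0);
            currentElement = 0;
        }
        currentElement++;
    }

    /** Returns true if this attribute name index already occurred on the current element. */
    boolean isDuplicate(int attributeNameIndex) {
        if (lastSeen[attributeNameIndex] == currentElement) {
            return true;
        }
        lastSeen[attributeNameIndex] = currentElement;
        return false;
    }
}
```

A parser would call startElementWithAttributes() once per element and isDuplicate(index) once per attribute. If the name table can grow during a parse, the array would have to grow with it.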
>
>- In-scope checking of indexed qualified names
>
>To comply fully with well-formed XML it may be necessary to ensure that
>the namespace of an indexed qualified name is in scope.
>
>It is possible that once a qualified name is indexed that it could be
>referred to again when the qualified name is out of scope. For a
>conformant FI serializer this cannot occur. Thus I am in two minds
>whether this is completely necessary. Malicious FI serializers could
>potentially cause strange things to happen to a parser for the
>production of an EII or an AII whose namespace is not in scope.
>However, for SAX such EIIs and AIIs can still be returned faithfully
>(the start and end prefix events will be missing).
>
>This has made me ponder a bit on the nature of XML namespaces and the
>concept of scope and whether scope is really necessary for such formats
>as FI.... however that is a different story and it is good to integrate
>as closely to the XML 1.x model as possible.
>
>To check whether a qualified name is in scope we could have an integer
>array of size one plus the size of the namespace name table. When a
>namespace goes into scope, the index for the namespace name is used to
>increment the entry at index + 1 in the integer array. When a namespace
>goes out of scope, the index for the namespace name is used to decrement
>the entry at index + 1 in the integer array.
>
>
>
BEA's RI uses a stack. When a StartElement is encountered, the depth is
incremented by 1 and each namespace, if any, is pushed onto the stack
together with the depth. When an EndElement is encountered, the depth is
decremented by 1. Before the decrement, the stack is peeked to see if
the top namespace is at the same depth (i.e. if the namespace is in
scope); if it is, it is popped. Also, the values are stored in a map,
eliminating duplicate data for multiple references.
Is this an idea we could borrow? What do you think?
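
For discussion, the stack idea might look something like the sketch below. It is not BEA's code: the class and method names are made up, and using the map as a per-namespace in-scope counter is only one reading of "the values are stored in a map".

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of depth-tagged namespace scoping; not taken from any RI. */
final class NamespaceScopeStack {
    private static final class Entry {
        final String namespaceName;
        final int depth;

        Entry(String namespaceName, int depth) {
            this.namespaceName = namespaceName;
            this.depth = depth;
        }
    }

    private final Deque<Entry> stack = new ArrayDeque<>();
    // Assumption: the map counts how many in-scope declarations exist per namespace.
    private final Map<String, Integer> inScope = new HashMap<>();
    private int depth = 0;

    /** StartElement: increment the depth, then push the namespaces declared on it. */
    void startElement(String... declaredNamespaces) {
        depth++;
        for (String ns : declaredNamespaces) {
            stack.push(new Entry(ns, depth));
            inScope.merge(ns, 1, Integer::sum);
        }
    }

    /** EndElement: pop every namespace declared at this depth, then decrement. */
    void endElement() {
        while (!stack.isEmpty() && stack.peek().depth == depth) {
            String ns = stack.pop().namespaceName;
            // Returning null from the remapping function removes the entry.
            inScope.computeIfPresent(ns, (k, n) -> n == 1 ? null : n - 1);
        }
        depth--;
    }

    boolean isInScope(String namespaceName) {
        return inScope.containsKey(namespaceName);
    }
}
```

Compared with a plain counter array, this keeps track of where each declaration was made, at the cost of one stack entry per declaration.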

Joe

>For an indexed qualified name the associated namespace name index
>needs to be checked against the integer table to see if its entry is
>greater than zero.
>
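The counter-array approach to in-scope checking could be sketched as below. The names are hypothetical. The text does not say what the extra slot 0 is for; the sketch assumes an index of -1 stands for "no namespace" and lands in slot 0.

```java
/** Hypothetical sketch of the counter array; names are illustrative only. */
final class NamespaceScopeCounter {
    // Size is one plus the size of the namespace name table.
    // Assumption: slot 0 is for index -1 ("no namespace").
    private final int[] inScopeCount;

    NamespaceScopeCounter(int namespaceNameTableSize) {
        inScopeCount = new int[namespaceNameTableSize + 1];
    }

    /** A namespace declaration goes into scope. */
    void enterScope(int namespaceNameIndex) {
        inScopeCount[namespaceNameIndex + 1]++;
    }

    /** A namespace declaration goes out of scope. */
    void leaveScope(int namespaceNameIndex) {
        inScopeCount[namespaceNameIndex + 1]--;
    }

    /** Check for an indexed qualified name: is its namespace currently declared? */
    boolean isInScope(int namespaceNameIndex) {
        return inScopeCount[namespaceNameIndex + 1] > 0;
    }
}
```

Because the entries are counts rather than flags, a namespace declared again on a nested element stays in scope until the outermost declaration ends.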
>
>Both these solutions will use more space but I think the time costs are
>quite minimal.
>
>Paul.
>
>
>
>
>>- Complete the support for the unimplemented information items
>> - Unparsed Entity information item
>> - Notation information item
>> - Document Type Declaration information item
>> This should be fairly straightforward given most of the core
>>encoding structure is in place.
>>
>>
>>- Review vocabulary implementation
>> It would be useful to review the current design for the support of
>>vocabularies and see if we need to make improvements or changes.
>>Currently I am toying with the idea of supporting vocabularies directly
>>in a parser and have an external vocabulary instance per parser. This
>>means only one array is required with an index that states at what
>>position the external vocabulary ends. This will boost the performance
>>of de-referencing indexes, since this will be a local operation with no
>>method calls.
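One way to picture the single-array idea is the sketch below. It is an assumption-laden illustration, not the current design: the class name is made up, indexes are zero-based here, and the external vocabulary is copied into the front of the parser's own array once, with a marker for where it ends.

```java
import java.util.Arrays;

/** Hypothetical per-parser table; external entries sit in front of parsed ones. */
final class IndexedStringTable {
    private String[] entries;
    // Entries [0, externalEnd) come from the external vocabulary.
    private final int externalEnd;
    private int size;

    IndexedStringTable(String[] externalVocabulary) {
        externalEnd = externalVocabulary.length;
        entries = Arrays.copyOf(externalVocabulary, Math.max(16, externalEnd * 2));
        size = externalEnd;
    }

    /** Adds a string found while parsing and returns its index. */
    int add(String value) {
        if (size == entries.length) {
            entries = Arrays.copyOf(entries, size * 2);
        }
        entries[size] = value;
        return size++;
    }

    /** De-referencing an index is a plain array read, external or not. */
    String get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return entries[index];
    }

    boolean isExternal(int index) {
        return index < externalEnd;
    }

    int size() {
        return size;
    }

    /** Before the next document: drop parsed entries, keep the external ones. */
    void reset() {
        Arrays.fill(entries, externalEnd, size, null);
        size = externalEnd;
    }
}
```

The cost is one copy of the external vocabulary per parser instance; the gain is that lookups never have to decide which table to consult.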
>>
>>
>>- Initial vocabularies
>> An initial vocabulary will occur at the head of the fast infoset
>>document. Essentially a bunch of strings and indexes need to be decoded
>>and added to tables before the first element information item is decoded.
>>
>>
>>- Built-in restricted alphabets
>> - "numeric" restricted alphabet
>> - "date and time" restricted alphabet
>>
>>
>>- Restricted alphabets
>> Restricted alphabets whether built-in or defined in a vocabulary
>>(initial or external) basically consist of a set of characters that are
>>sequentially indexed with an integer value. The integer values are
>>encoded instead of using a character encoding scheme (UTF-8, UTF-16) or
>>an encoding algorithm. Thus whether the restricted alphabet is built-in
>>or defined in a vocabulary the algorithm is the same.
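As a rough sketch of why one algorithm serves both cases: a restricted alphabet is just an ordered set of characters, and each character is replaced by its position in that set. The class below is hypothetical, and it stops at the index values; how those integers are packed into bits is left out.

```java
/** Hypothetical sketch; the same code serves built-in and vocabulary-defined alphabets. */
final class RestrictedAlphabet {
    // The position of a character in this string is its index value.
    private final String characters;

    RestrictedAlphabet(String characters) {
        this.characters = characters;
    }

    /** Maps each character of the text to its index in the alphabet. */
    int[] toIndexes(String text) {
        int[] indexes = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            int index = characters.indexOf(text.charAt(i));
            if (index < 0) {
                throw new IllegalArgumentException(
                        "character not in alphabet: " + text.charAt(i));
            }
            indexes[i] = index;
        }
        return indexes;
    }

    /** Maps index values back to the characters they stand for. */
    String fromIndexes(int[] indexes) {
        StringBuilder text = new StringBuilder(indexes.length);
        for (int index : indexes) {
            text.append(characters.charAt(index));
        }
        return text.toString();
    }
}
```

For example, with an alphabet of digits followed by a few punctuation characters, such as new RestrictedAlphabet("0123456789-+.e ") (an assumed character set, to be checked against the specification), the text "-12" maps to the indexes 10, 1, 2.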
>>
>>
>>- Built-in encoding algorithms
>> - "hexadecimal" encoding algorithm
>> - "base64" encoding algorithm
>> - "short" encoding algorithm
>> - "int" encoding algorithm
>> - "long" encoding algorithm
>> - "boolean" encoding algorithm
>> - "float" encoding algorithm
>> - "double" encoding algorithm
>> - "uuid" encoding algorithm
>> - "cdata" encoding algorithm
>>
>>
>>- Encoding algorithms
>> Encoding algorithms specify a binary encoding to be used instead of
>>the corresponding string representation. Such algorithms can be used for
>>size and/or processing efficiency. A number of built-in encoding
>>algorithms have been defined. It is possible to specify further
>>algorithms by adding URIs to the vocabulary (initial or external). An
>>encoding algorithm may be used for text content or an attribute value
>>and is identified in the encoding as a small integer (1 to 256; there is
>>a maximum of 256 encoding algorithms allowed per fast infoset document).
>> A pluggable registry of encoding algorithms needs to be defined so
>>that it is possible to add them for use by the FI serializer/parser. An
>>open question is how the parser/serializer API, e.g. SAX or StAX, can
>>return such binary information through the API. Such binary information
>>could be converted to a string by the algorithm but this would increase
>>processing. For the built-in algorithms specific extensions could be
>>defined. For the additional algorithms a generic method could be used
>>returning an instance of the data as an object and the URI of the algorithm.
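A pluggable registry along these lines might be sketched as follows. Everything here is hypothetical: the interface, the class names and the example URI are invented for illustration, and the sample algorithm simply treats the octets as big-endian 32-bit integers, which is an assumption rather than a statement about any built-in algorithm.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical plug-in contract: octets in the document <-> a Java object. */
interface EncodingAlgorithm {
    Object decode(byte[] octets);

    byte[] encode(Object value);
}

/** Example algorithm (assumption): a sequence of big-endian 32-bit integers. */
final class IntListAlgorithm implements EncodingAlgorithm {
    public Object decode(byte[] octets) {
        if (octets.length % 4 != 0) {
            throw new IllegalArgumentException("length is not a multiple of 4");
        }
        ByteBuffer buffer = ByteBuffer.wrap(octets); // big-endian by default
        int[] values = new int[octets.length / 4];
        for (int i = 0; i < values.length; i++) {
            values[i] = buffer.getInt();
        }
        return values;
    }

    public byte[] encode(Object value) {
        int[] values = (int[]) value;
        ByteBuffer buffer = ByteBuffer.allocate(values.length * 4);
        for (int v : values) {
            buffer.putInt(v);
        }
        return buffer.array();
    }
}

/** Hypothetical registry keyed by the algorithm URI from the vocabulary. */
final class EncodingAlgorithmRegistry {
    private final Map<String, EncodingAlgorithm> byUri = new HashMap<>();

    void register(String uri, EncodingAlgorithm algorithm) {
        byUri.put(uri, algorithm);
    }

    EncodingAlgorithm lookup(String uri) {
        EncodingAlgorithm algorithm = byUri.get(uri);
        if (algorithm == null) {
            throw new IllegalArgumentException("no algorithm registered for " + uri);
        }
        return algorithm;
    }
}
```

With this shape, the generic API method mentioned above could hand the application the URI plus whatever decode returns, and the application casts the object according to the URI. An application would register its algorithm under its own URI, for example registry.register("urn:example:int-list", new IntListAlgorithm()), where the URI is a placeholder.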
>>
>>
>>SAX specific
>>------------
>>
>>- Support the interning of identifying strings using the SAX
>>http://xml.org/sax/features/string-interning feature.
>>
>>- Proper SAX error reporting to the application.
>>
>>- Performance measurements and optimizations of SAX serializer
>>
>>
>>StAX specific
>>-------------
>>
>>- Support the interning of identifying strings
>>
>>- Performance measurements and optimizations of StAX parser and serializer
>>
>>
>>JAXB
>>----
>>
>>Investigate how FI can be plugged into JAXB. Given JAXB's schema
>>knowledge it might be possible to:
>>
>>- speed up the process of serialization since JAXB may be able to retain
>>local name and namespace association for faster look up of indexed
>>qualified names, and also because strings will be interned thus only requiring
>>reference equality
>>
>>- external vocabularies may be used directly thus only integer values
>>need be written for elements and attributes. This has the potential to
>>speed up serialization even more since no lookup is required for indexing.
>>
>>- support for the built-in encoding algorithms and restricted alphabets
>>that map to corresponding XSD data types. This will require that we
>>design corresponding extensions to the appropriate XML API such that
>>binary data can be passed or received.
>>
>>
>>
>>
>
>
>
>
