Re: TODO for FastInfoset

From: Paul Sandoz <Paul.Sandoz_at_Sun.COM>
Date: Thu, 13 Jan 2005 15:53:36 +0100

Paul Sandoz wrote:
> Generic
> -------
>

Two further generic items are:

- Duplicate attributes

To comply fully with well-formed XML it is necessary to ensure that
there are no duplicate attributes among the [attributes] property of an EII.

For efficient checking, an integer array the size of the attribute name
table is required, together with an integer that represents the current
parser state. Each time an EII with AIIs occurs the integer is incremented.
For each AII, the index that corresponds to the AII is used to obtain the
value in the integer array. If this value is equal to the integer then
the attribute has occurred before on this EII. If this value is not equal
to the integer then the attribute has not occurred before and the value in
the integer array is set to the integer.

For this solution it needs to be ensured that integer wrap around does
not occur, thus the integer array needs to be reset when the integer
reaches the maximum value. This probably needs to be checked for each
EII with AIIs rather than at the beginning of a parse. It appears at
first inconceivable that a document could have 2^32 - 1 elements with
attributes, however some documents in the airline industry are reputedly
huge, and there is the case of XMPP (IIRC) where an XML document is
transmitted for the lifetime of a network connection.
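
The check described above can be sketched roughly as follows. This is only
a minimal sketch: the class and method names are hypothetical, and it
assumes a fixed-size attribute name table and a Java int counter (a real
parser would have to grow the array as the table grows).

```java
import java.util.Arrays;

/** Hypothetical helper; names are illustrative and not part of any FI API. */
final class DuplicateAttributeChecker {
    // One slot per entry in the attribute name table (fixed size assumed here).
    private final int[] lastSeen;
    // The integer that identifies the EII currently being parsed.
    private int stamp = 0;

    DuplicateAttributeChecker(int attributeNameTableSize) {
        lastSeen = new int[attributeNameTableSize];
    }

    /** Call once for each EII with AIIs, before checking its attributes. */
    void startElement() {
        if (stamp == Integer.MAX_VALUE) {
            // Avoid wrap around: clear every slot and restart the count.
            Arrays.fill(lastSeen, 0);
            stamp = 0;
        }
        stamp++;
    }

    /** Returns true if this attribute name index already occurred on the current EII. */
    boolean isDuplicate(int attributeNameIndex) {
        if (lastSeen[attributeNameIndex] == stamp) {
            return true;
        }
        lastSeen[attributeNameIndex] = stamp;
        return false;
    }
}
```

The stamp is incremented before first use, so the zero-initialised array can
never match it by accident, and no per-element clearing of the array is needed.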

- In-scope checking of indexed qualified names

To comply fully with well-formed XML it may be necessary to ensure that
the namespace of an indexed qualified name is in scope.

It is possible that once a qualified name is indexed it could be
referred to again when its namespace is out of scope. For a
conformant FI serializer this cannot occur, thus I am in two minds
whether this check is completely necessary. Malicious FI serializers could
potentially cause strange things to happen to a parser by producing
an EII or an AII whose namespace is not in scope.
However, for SAX such EIIs and AIIs can still be returned faithfully
(the start and end prefix mapping events will be missing).

This has made me ponder a bit on the nature of XML namespaces and the
concept of scope, and whether scope is really necessary for such formats
as FI. However, that is a different story, and it is good to stay as
close to the XML 1.x model as possible.

To check whether a qualified name is in scope we could have an integer
array whose size is one plus the size of the namespace name table. When a
namespace goes into scope, the index for the namespace name is used to
increment the entry at index + 1 in the integer array. When a namespace
goes out of scope, the index for the namespace name is used to decrement
the entry at index + 1 in the integer array.

For an indexed qualified name, the associated namespace name index
needs to be checked against the integer array to see if its entry is
greater than zero.
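
A minimal sketch of this counting scheme is below. The class and method
names are hypothetical. Slot 0 of the array is simply left unused here; what
it would stand for (for example, names with no namespace) is an assumption
and not something settled above.

```java
/** Hypothetical helper; names are illustrative and not part of any FI API. */
final class NamespaceScopeTracker {
    // One counter per namespace name, shifted by one; slot 0 is left unused.
    private final int[] inScopeCount;

    NamespaceScopeTracker(int namespaceNameTableSize) {
        inScopeCount = new int[namespaceNameTableSize + 1];
    }

    /** Call when a declaration for this namespace name goes into scope. */
    void enterScope(int namespaceNameIndex) {
        inScopeCount[namespaceNameIndex + 1]++;
    }

    /** Call when a declaration for this namespace name goes out of scope. */
    void leaveScope(int namespaceNameIndex) {
        inScopeCount[namespaceNameIndex + 1]--;
    }

    /** True if at least one declaration of this namespace name is in scope. */
    boolean isInScope(int namespaceNameIndex) {
        return inScopeCount[namespaceNameIndex + 1] > 0;
    }
}
```

A counter is used rather than a boolean because the same namespace name can be
declared again on a nested element, so it may be in scope more than once.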

Both these solutions will use more space, but I think the time costs are
quite minimal.

Paul.

> - Complete the support for the unimplemented information items
>   - Unparsed Entity information item
>   - Notation information item
>   - Document Type Declaration information item
>   This should be fairly straightforward given most of the core
>   encoding structure is in place.
>
>
> - Review vocabulary implementation
> It would be useful to review the current design for the support of
> vocabularies and see if we need to make improvements or changes.
> Currently I am toying with the idea of supporting vocabularies directly
> in a parser and have an external vocabulary instance per parser. This
> means only one array is required with an index that states at what
> position the external vocabulary ends. This will boost the performance
> of de-referencing indexes, since this will be a local operation with no
> method calls.
>
>
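
The single-array idea could look roughly like the sketch below. It is only
an illustration under assumptions: the class name, the zero-based indexing
and the reset behaviour are all hypothetical and not taken from the current
code.

```java
import java.util.Arrays;

/** Hypothetical table holding external and document-local entries in one array. */
final class VocabularyTable {
    private String[] entries;
    // Entries [0, externalEnd) come from the external vocabulary.
    private final int externalEnd;
    private int size;

    VocabularyTable(String[] externalEntries) {
        entries = Arrays.copyOf(externalEntries, Math.max(16, externalEntries.length * 2));
        externalEnd = externalEntries.length;
        size = externalEnd;
    }

    /** Adds a document-local entry and returns its index. */
    int add(String value) {
        if (size == entries.length) {
            entries = Arrays.copyOf(entries, size * 2);
        }
        entries[size] = value;
        return size++;
    }

    /** De-references an index with a plain array access. */
    String get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("No entry at index " + index);
        }
        return entries[index];
    }

    boolean isExternal(int index) {
        return index < externalEnd;
    }

    /** Drops document-local entries but keeps the external vocabulary for the next parse. */
    void reset() {
        Arrays.fill(entries, externalEnd, size, null);
        size = externalEnd;
    }
}
```
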
> - Initial vocabularies
> An initial vocabulary will occur at the head of the fast infoset
> document. Essentially a bunch of strings and indexes need to be decoded
> and added to tables before the first element information item is decoded.
>
>
> - Built-in restricted alphabets
>   - "numeric" restricted alphabet
>   - "date and time" restricted alphabet
>
>
> - Restricted alphabets
> Restricted alphabets whether built-in or defined in a vocabulary
> (initial or external) basically consist of a set of characters that are
> sequentially indexed with an integer value. The integer values are
> encoded instead of using a character encoding scheme (UTF-8, UTF-16) or
> an encoding algorithm. Thus whether the restricted alphabet is built-in
> or defined in a vocabulary the algorithm is the same.
>
>
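
The character-to-index mapping can be sketched as below. The class name is
hypothetical, and the sketch stops at the integer values: how those values
are packed into bits in the encoding is not shown. The alphabet string is
whatever the caller passes in, so built-in and vocabulary-defined alphabets
go through the same code.

```java
/** Hypothetical helper; names are illustrative and not part of any FI API. */
final class RestrictedAlphabet {
    // Characters of the alphabet; a character's position is its integer value.
    private final String alphabet;

    RestrictedAlphabet(String alphabet) {
        this.alphabet = alphabet;
    }

    /** Maps each character of the text to its index in the alphabet. */
    int[] encode(String text) {
        int[] indexes = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            int index = alphabet.indexOf(text.charAt(i));
            if (index < 0) {
                throw new IllegalArgumentException(
                        "Character not in alphabet: " + text.charAt(i));
            }
            indexes[i] = index;
        }
        return indexes;
    }

    /** Maps alphabet indexes back to the characters they stand for. */
    String decode(int[] indexes) {
        StringBuilder text = new StringBuilder(indexes.length);
        for (int index : indexes) {
            text.append(alphabet.charAt(index));
        }
        return text.toString();
    }
}
```
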
> - Built-in encoding algorithms
>   - "hexadecimal" encoding algorithm
>   - "base64" encoding algorithm
>   - "short" encoding algorithm
>   - "int" encoding algorithm
>   - "long" encoding algorithm
>   - "boolean" encoding algorithm
>   - "float" encoding algorithm
>   - "double" encoding algorithm
>   - "uuid" encoding algorithm
>   - "cdata" encoding algorithm
>
>
> - Encoding algorithms
> Encoding algorithms specify a binary encoding to be used instead of
> the corresponding string representation. Such algorithms can be used for
> size and/or processing efficiency. A number of built-in encoding
> algorithms have been defined. It is possible to specify further
> algorithms by adding URIs to the vocabulary (initial or external). An
> encoding algorithm may be used for text content or an attribute value
> and is identified in the encoding as a small integer (1 to 256, there is
> a maximum of 256 encoding algorithms allowed per fast infoset document).
> A pluggable registry of encoding algorithms needs to be defined so
> that it is possible to add them for use by the FI serializer/parser. An
> open question is how the parser/serializer API, e.g. SAX or StAX, can
> return such binary information through the API. Such binary information
> could be converted to a string by the algorithm but this would increase
> processing. For the built-in algorithms specific extensions could be
> defined. For the additional algorithms a generic method could be used
> returning an instance of the data as an object and the URI of the algorithm.
>
>
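
One possible shape for such a registry is sketched below. Everything here is
an assumption made for illustration: the interface, the class names, the
example URI used to register an algorithm, and the big-endian int layout of
the sample algorithm.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical contract for a pluggable encoding algorithm. */
interface EncodingAlgorithm {
    Object decode(byte[] data);
    byte[] encode(Object value);
}

/** Hypothetical registry that maps algorithm URIs to implementations. */
final class EncodingAlgorithmRegistry {
    private final Map<String, EncodingAlgorithm> byUri =
            new HashMap<String, EncodingAlgorithm>();

    void register(String uri, EncodingAlgorithm algorithm) {
        byUri.put(uri, algorithm);
    }

    EncodingAlgorithm lookup(String uri) {
        EncodingAlgorithm algorithm = byUri.get(uri);
        if (algorithm == null) {
            throw new IllegalArgumentException("No encoding algorithm for " + uri);
        }
        return algorithm;
    }
}

/** Sample algorithm: an int[] as consecutive 4-byte big-endian values (assumed layout). */
final class IntArrayAlgorithm implements EncodingAlgorithm {
    public Object decode(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data); // big-endian by default
        int[] values = new int[data.length / 4];
        for (int i = 0; i < values.length; i++) {
            values[i] = buffer.getInt();
        }
        return values;
    }

    public byte[] encode(Object value) {
        int[] values = (int[]) value;
        ByteBuffer buffer = ByteBuffer.allocate(values.length * 4);
        for (int v : values) {
            buffer.putInt(v);
        }
        return buffer.array();
    }
}
```

Returning the decoded data as an Object alongside the algorithm URI matches
the generic method suggested above for additional algorithms; the built-in
ones could still get typed extensions.
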
> SAX specific
> ------------
>
> - Support the interning of identifying strings using the SAX
> http://xml.org/sax/features/string-interning feature.
>
> - Proper SAX error reporting to the application.
>
> - Performance measurements and optimizations of SAX serializer
>
>
> StAX specific
> -------------
>
> - Support the interning of identifying strings
>
> - Performance measurements and optimizations of StAX parser and serializer
>
>
> JAXB
> ----
>
> Investigate how FI can be plugged into JAXB. Given JAXB's schema
> knowledge it might be possible to:
>
> - speed up the process of serialization since JAXB may be able to retain
> local name and namespace association for faster lookup of indexed
> qualified names, and also because strings will be interned thus only requiring
> reference equality
>
> - external vocabularies may be used directly thus only integer values
> need be written for elements and attributes. This has the potential to
> speed up serialization even more since no lookup is required for indexing.
>
> - support for the built-in encoding algorithms and restricted alphabets
> that map to corresponding XSD data types. This will require that we
> design corresponding extensions to the appropriate XML API such that
> binary data can be passed or received.
>
>

-- 
| ? + ? = To question
----------------\
   Paul Sandoz
        x38109
+33-4-76188109