TODO for FastInfoset

From: Paul Sandoz <Paul.Sandoz_at_Sun.COM>
Date: Tue, 11 Jan 2005 19:05:14 +0100

Hi,

Below is a TODO list of what needs to be done to progress the Fast
Infoset implementation. I think these should probably be added as issues
so that we can easily discuss and expand on each.

Paul.

Generic
-------

- Complete the support for the unimplemented information items
   - Unparsed Entity information item
   - Notation information item
   - Document Type Declaration information item
   This should be fairly straightforward given that most of the core
encoding structure is in place.

- Review vocabulary implementation
   It would be useful to review the current design for the support of
vocabularies and see if we need to make improvements or changes.
Currently I am toying with the idea of supporting vocabularies directly
in the parser, with one external vocabulary instance per parser. This
means only one array is required, together with an index that states the
position at which the external vocabulary ends. This should boost the
performance of de-referencing indexes, since it becomes a local
operation with no method calls.
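
   As a rough sketch of the single-array idea (not the current
implementation): the class name ParserStringTable and all of its methods
are hypothetical, and indexes are zero-based here for simplicity. The
external vocabulary occupies the front of the array, document entries
are appended after it, and a stored boundary marks where the external
part ends.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch: one table per parser. External-vocabulary entries
 * come first, followed by entries added while parsing a document.
 */
final class ParserStringTable {
    private String[] entries;
    private final int externalEnd; // one past the last external entry
    private int size;

    ParserStringTable(String[] externalVocabulary) {
        externalEnd = externalVocabulary.length;
        // Copy the external entries and leave room for document entries.
        entries = Arrays.copyOf(externalVocabulary, Math.max(16, externalEnd * 2));
        size = externalEnd;
    }

    /** Adds a string seen in the current document and returns its index. */
    int add(String s) {
        if (size == entries.length) {
            entries = Arrays.copyOf(entries, size * 2);
        }
        entries[size] = s;
        return size++;
    }

    /** De-references an index with a plain array access. */
    String get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return entries[index];
    }

    /** True if the index refers to an external-vocabulary entry. */
    boolean isExternal(int index) {
        return index >= 0 && index < externalEnd;
    }

    /** Drops document entries so the parser can be reused. */
    void reset() {
        Arrays.fill(entries, externalEnd, size, null);
        size = externalEnd;
    }
}
```

   Resetting between documents only clears the tail of the array, so the
external vocabulary never has to be copied again for the same parser.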

- Initial vocabularies
   An initial vocabulary will occur at the head of the fast infoset
document. Essentially a bunch of strings and indexes need to be decoded
and added to tables before the first element information item is decoded.

- Built-in restricted alphabets
        - "numeric" restricted alphabet
        - "date and time" restricted alphabet

- Restricted alphabets
   Restricted alphabets, whether built-in or defined in a vocabulary
(initial or external), basically consist of a set of characters that are
sequentially indexed with an integer value. The integer values are
encoded instead of using a character encoding scheme (UTF-8, UTF-16) or
an encoding algorithm. Thus, whether the restricted alphabet is built-in
or defined in a vocabulary, the algorithm is the same.
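
   A minimal sketch of the character-to-index mapping described above.
The class name RestrictedAlphabetSketch is hypothetical, the alphabet
passed in is up to the caller (the example in the test uses an
illustrative one, not the definition of a built-in alphabet), and the
packing of index values into bits on the wire is deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: maps characters to their position in an alphabet. */
final class RestrictedAlphabetSketch {
    private final char[] alphabet;
    private final Map<Character, Integer> indexOf = new HashMap<>();

    RestrictedAlphabetSketch(String characters) {
        alphabet = characters.toCharArray();
        for (int i = 0; i < alphabet.length; i++) {
            // Each character must appear once so that indexes are unambiguous.
            if (indexOf.put(alphabet[i], i) != null) {
                throw new IllegalArgumentException("duplicate character: " + alphabet[i]);
            }
        }
    }

    /** Replaces each character with its index in the alphabet. */
    int[] encode(String text) {
        int[] out = new int[text.length()];
        for (int i = 0; i < out.length; i++) {
            Integer idx = indexOf.get(text.charAt(i));
            if (idx == null) {
                throw new IllegalArgumentException("not in alphabet: " + text.charAt(i));
            }
            out[i] = idx;
        }
        return out;
    }

    /** Reverses encode(): looks each index up in the alphabet. */
    String decode(int[] indexes) {
        StringBuilder sb = new StringBuilder(indexes.length);
        for (int idx : indexes) {
            sb.append(alphabet[idx]);
        }
        return sb.toString();
    }
}
```

   The same class serves a built-in alphabet and one read from a
vocabulary, since only the character set passed to the constructor
differs.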

- Built-in encoding algorithms
        - "hexadecimal" encoding algorithm
        - "base64" encoding algorithm
        - "short" encoding algorithm
        - "int" encoding algorithm
        - "long" encoding algorithm
        - "boolean" encoding algorithm
        - "float" encoding algorithm
        - "double" encoding algorithm
        - "uuid" encoding algorithm
        - "cdata" encoding algorithm

- Encoding algorithms
   Encoding algorithms specify a binary encoding to be used instead of
the corresponding string representation. Such algorithms can be used for
size and/or processing efficiency. A number of built-in encoding
algorithms have been defined. It is possible to specify further
algorithms by adding URIs to the vocabulary (initial or external). An
encoding algorithm may be used for text content or an attribute value
and is identified in the encoding as a small integer (1 to 256; a
maximum of 256 encoding algorithms is allowed per fast infoset document).
   A pluggable registry of encoding algorithms needs to be defined so
that it is possible to add them for use by the FI serializer/parser. An
open question is how the parser/serializer API, e.g. SAX or StAX, can
return such binary information through the API. Such binary information
could be converted to a string by the algorithm, but this would increase
processing. For the built-in algorithms, specific extensions could be
defined. For the additional algorithms, a generic method could be used
that returns an instance of the data as an object together with the URI
of the algorithm.
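
   One possible shape for such a registry, as a sketch only. Every name
below (EncodingAlgorithmSketch, DecodedValue, EncodingAlgorithmRegistry,
BigEndianIntAlgorithm) is hypothetical, and the example algorithm is a
plain 4-byte big-endian integer chosen for illustration, not a claim
about the octet layout of any built-in algorithm.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical plug-in contract: converts between octets and an object. */
interface EncodingAlgorithmSketch {
    Object decode(byte[] octets);
    byte[] encode(Object value);
}

/** Generic result: the decoded object plus the URI of the algorithm used. */
final class DecodedValue {
    final String algorithmUri;
    final Object value;

    DecodedValue(String algorithmUri, Object value) {
        this.algorithmUri = algorithmUri;
        this.value = value;
    }
}

/** Hypothetical registry keyed by algorithm URI. */
final class EncodingAlgorithmRegistry {
    private final Map<String, EncodingAlgorithmSketch> byUri = new HashMap<>();

    void register(String uri, EncodingAlgorithmSketch algorithm) {
        if (byUri.containsKey(uri)) {
            throw new IllegalArgumentException("already registered: " + uri);
        }
        byUri.put(uri, algorithm);
    }

    /** Decodes without going through a string representation. */
    DecodedValue decode(String uri, byte[] octets) {
        EncodingAlgorithmSketch algorithm = byUri.get(uri);
        if (algorithm == null) {
            throw new IllegalArgumentException("no algorithm registered for " + uri);
        }
        return new DecodedValue(uri, algorithm.decode(octets));
    }
}

/** Illustrative algorithm: a 4-byte big-endian integer. */
final class BigEndianIntAlgorithm implements EncodingAlgorithmSketch {
    public Object decode(byte[] octets) {
        return ByteBuffer.wrap(octets).getInt(); // ByteBuffer defaults to big-endian
    }

    public byte[] encode(Object value) {
        return ByteBuffer.allocate(4).putInt((Integer) value).array();
    }
}
```

   Returning the URI alongside the object lets an application decide
how to treat a value without the parser converting it to a string first.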

SAX specific
------------

- Support the interning of identifying strings using the SAX
http://xml.org/sax/features/string-interning feature.

- Proper SAX error reporting to the application.

- Performance measurements and optimizations of SAX serializer
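
For the string-interning item above, a minimal sketch of what honouring
the feature could look like inside a parser. The class NameReporter and
its method are hypothetical; the only library call relied on is
java.lang.String.intern().

```java
/**
 * Hypothetical helper: builds the name strings a parser would report to
 * a handler, interning them when the feature is switched on.
 */
final class NameReporter {
    private final boolean interning;

    NameReporter(boolean interning) {
        this.interning = interning;
    }

    /** Creates the name from decoded characters, interned if requested. */
    String name(char[] buffer, int offset, int length) {
        String s = new String(buffer, offset, length);
        return interning ? s.intern() : s;
    }
}
```

With interning on, an application can compare a reported name against
its own interned constants using reference equality instead of equals().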

StAX specific
-------------

- Support the interning of identifying strings

- Performance measurements and optimizations of StAX parser and serializer

JAXB
----

Investigate how FI can be plugged into JAXB. Given JAXB's schema
knowledge it might be possible to:
- speed up the process of serialization, since JAXB may be able to retain
the local name and namespace association for faster lookup of indexed
qualified names, and also because strings will be interned, thus only
requiring reference equality
- use external vocabularies directly, so that only integer values need
be written for elements and attributes. This has the potential to speed
up serialization even more since no lookup is required for indexing.
- support the built-in encoding algorithms and restricted alphabets
that map to corresponding XSD data types. This will require that we
design corresponding extensions to the appropriate XML API such that
binary data can be passed or received.
-- 
| ? + ? = To question
----------------\
    Paul Sandoz
         x38109
+33-4-76188109