Re: JAXB serialization

From: Paul Sandoz <Paul.Sandoz_at_Sun.COM>
Date: Thu, 02 Jun 2005 10:38:17 +0200

[Moving to dev_at_fi]

Kohsuke Kawaguchi wrote:
> Paul Sandoz wrote:
>> Got some cycles to look into FI and JAXB serialization now!
>>
>> Looking at the abstract class XmlOutput i see two different types of
>> method:
>>
>> - using a Name; and
>>
>> - using an integer for the prefix and a String
>>
>> e.g.
>>
>> public void beginStartTag(Name name)
>> public abstract void beginStartTag(int prefix, String localName)
>>
>> Are the methods using the Name parameter used when JAXB knows about an
>> EII or AII from the schema and the other forms of methods are used
>> when serializing non-bound infoset e.g. via xsd:any?
>
>
> Generally yes, but as far as the contract of the interface goes, the
> caller can always use (int,String) version, and it may use the Name
> version if it can. IOW, the XmlOutput cannot assume anything about when
> the Name version is used and when the (int,String) version is used.
>

OK.

Since the 'literal' output is likely to be the exception i think it
should be possible to produce the required data on the first 'literal'
call. However, to do this the FI serializer would require that Name
objects state whether they are used for EIIs, AIIs, or both.

One way is to have two separate index spaces for EIIs and AIIs. This
would also make the table sizes more efficient as the FI serializer
would not need to keep two tables for all Name objects for EIIs and
AIIs. Is this possible?

>
>> Having a unique integer with a Name would be useful so that efficient
>> table lookup can be achieved e.g. if there are 10 Names then each Name
>> would be assigned a unique integer in the interval [0, 9] say.
>> Although FI has separate tables for AIIs and EIIs i do not think it
>> makes a difference for JAXB to have one space for Name objects if
>> O(1) access is possible.
>
>
> Yes, this is on my TODO list. I can easily give them sequence number
> starting 0.
>

Great. I think we will need to differentiate between Names associated
with EIIs, AIIs or both for efficient support.

Will Name objects be unique to a JAXBContext?

>> If it can be ascertained that a Name will always use the same prefix
>> then lookup of a Name to an index in an FI document is very efficient.
>
>
> Unfortunately this assumption does not hold. One example is when JAXB
> starts by xmlns="foo", and then later found out that it needs to print
> out xmlns="" (because you can't assign URI "" to any prefix.)
>

Can you explain a bit more about this? Not sure i understand fully. I
thought in this case different Name objects will be used because of
different URIs.

In this example:

<a xmlns="foo">
   <a xmlns=""/>
</a>

would there be two Name objects? The first 'a' will have a Name object
with a URI of 'foo' and the second 'a' will have a Name object with no
URI defined.

The following edge case seems to me the problem:

<n1:a xmlns:n1="foo">
    <n2:a xmlns:n2="foo">n2:evil-qname-in-content</n2:a>
</n1:a>

But i hope for the most common cases this will not occur i.e. with JAX-RPC.

When i first looked at XML namespaces i thought "nice solution" but the
more i think about it now the less i like it, but i cannot think of any
better textually compact alternatives (unless of course one uses a
binary serialization :-) ).

> Another example is when the portion of a document is bound to DOM. DOM
> can declare namespaces in any way it wants.
>

In that case Name objects will not be used and the 'literal' approach is
used instead? so we would fall back to the slower solution (and ensure
consistency).

> XmlOutput has an access to a map which allows it to convert prefix index
> to the namespace URI.
>

Yes. I like this approach.

> In normal case, one namespace URI is bound to one prefix. So hopefully
> we can code FI such that it works fast as long as this assumption holds,
> and if it breaks, it takes the slow route.
>

Exactly my thoughts too. The question is how can one know when this
assumption holds? Perhaps the solution is to have an array of arrays.

int fiNameIndex;
int[] items = nameIndexToFINameIndex[nameIndex];
if (items[0] == prefixIndex) {
     // fast path: the name is used with the expected prefix
     fiNameIndex = items[1];
} else {
     // slow path: search through items for the required prefix
}

Ideally it would be nice to do just:

fiNameIndex = nameIndexToFINameIndex[nameIndex];
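
A minimal sketch of the array-of-arrays lookup, under some assumptions of
mine: the class name NameIndexTable, the NOT_FOUND constant and the put
method are hypothetical and not part of JAXB or FI, and each row is taken
to hold flattened {prefixIndex, fiNameIndex} pairs with the first pair as
the fast path.

```java
import java.util.Arrays;

// Hypothetical helper, not part of JAXB or FI.
// rows[nameIndex] holds flattened {prefixIndex, fiNameIndex} pairs.
final class NameIndexTable {
    static final int NOT_FOUND = -1;

    private final int[][] rows;

    NameIndexTable(int nameCount) {
        // Every name starts with an empty row.
        rows = new int[nameCount][0];
    }

    int lookup(int nameIndex, int prefixIndex) {
        int[] items = rows[nameIndex];
        // Fast path: the first prefix registered for this name matches.
        if (items.length > 0 && items[0] == prefixIndex) {
            return items[1];
        }
        // Slow path: scan the remaining pairs for the required prefix.
        for (int i = 2; i < items.length; i += 2) {
            if (items[i] == prefixIndex) {
                return items[i + 1];
            }
        }
        return NOT_FOUND;
    }

    // Appends a pair; the caller is assumed to call lookup first,
    // so duplicates are not checked here.
    void put(int nameIndex, int prefixIndex, int fiNameIndex) {
        int[] items = rows[nameIndex];
        int[] grown = Arrays.copyOf(items, items.length + 2);
        grown[items.length] = prefixIndex;
        grown[items.length + 1] = fiNameIndex;
        rows[nameIndex] = grown;
    }
}
```

As long as one namespace URI stays bound to one prefix, every row has a
single pair and lookup never leaves the fast path.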

It should also be possible to retain arrays and not have to clear them
for every serialization if a per-serialization counter is used. When the
counter reaches the maximum integer value the array is reset.
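
The counter idea could look roughly like the sketch below. The names
StampedIndexTable, beginSerialization and NOT_FOUND are hypothetical, and
i assume one slot per Name index: an entry only counts as present if its
stamp equals the counter of the current serialization.

```java
import java.util.Arrays;

// Hypothetical helper, not part of JAXB or FI.
final class StampedIndexTable {
    static final int NOT_FOUND = -1;

    private final int[] values;
    private final int[] stamps;   // stamp 0 means "never written"
    private int current = 1;      // counter of the current serialization

    StampedIndexTable(int size) {
        values = new int[size];
        stamps = new int[size];
    }

    // Invalidates all entries in O(1), except when the counter wraps.
    void beginSerialization() {
        if (current == Integer.MAX_VALUE) {
            // Counter exhausted: really clear the stamps once.
            Arrays.fill(stamps, 0);
            current = 1;
        } else {
            current++;
        }
    }

    void put(int nameIndex, int fiNameIndex) {
        values[nameIndex] = fiNameIndex;
        stamps[nameIndex] = current;
    }

    int get(int nameIndex) {
        // Entries written during an earlier serialization are ignored.
        return stamps[nameIndex] == current ? values[nameIndex] : NOT_FOUND;
    }
}
```

The cost is a second int array of the same size, in exchange for not
touching the whole table at the start of every serialization.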

>
>> It would help if it can be known in advance if serialization may use
>> the non Name methods. If this is the case then it is not necessary to
>> maintain synchronized hash tables for the different forms of data.
>
>
> This isn't really possible without traversing the whole object tree
> beforehand.
>

I suppose there could be a hint by analysing the static definition.
However, as i said earlier, i think it may be possible to avoid this
using a lazy calculation approach if the Name object has the required
information on where it is used.

>
>
>> Knowing when encoding an EII that attributes are present would be
>> beneficial to FI so that a small amount of buffering is not required to
>> store the octets of the EII and then modify the first octet of the EII
>> if attributes are present.
>
>
> Sometimes the marshaller knows the element being written will not have
> any attribute nor namespace declaration. I can add a boolean parameter
> to beginStartTag to indicate this.
>
> If the hint says "absolutely no attribute/xmlns", you can skip buffering.
>
> Another possibility is to define the "writeLeafElement", which writes
> something like
>
> <foo>xxxx</foo>
>
> I found that this happens very often in many schemas, and this might
> allow better optimizations for FI.
>

I think both these optimizations would be useful for FI and XML
serialization as long as the higher layer does not have to do a load of
work.

For XML it means that there can be UTF-8 encoded strings "<foo>" and
"</foo>" that could also be reused for when there are attributes present
i.e. the former could write up to the last but one octet. For FI we can
still use the first UTF-8 encoded string by ignoring the first and last
octet.
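
A rough sketch of that reuse, assuming an unprefixed element name so that
the octets between '<' and '>' are exactly the local name. The class
EncodedTag and its method names are hypothetical and only illustrate the
offsets involved.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JAXB or FI.
final class EncodedTag {
    private final byte[] start; // UTF-8 octets of "<foo>"
    private final byte[] end;   // UTF-8 octets of "</foo>"

    EncodedTag(String name) {
        start = ("<" + name + ">").getBytes(StandardCharsets.UTF_8);
        end = ("</" + name + ">").getBytes(StandardCharsets.UTF_8);
    }

    // XML, no attributes: "<foo>" in a single write.
    void writeStart(OutputStream out) throws IOException {
        out.write(start, 0, start.length);
    }

    // XML, attributes follow: "<foo", i.e. up to the last but one octet.
    void writeStartOpen(OutputStream out) throws IOException {
        out.write(start, 0, start.length - 1);
    }

    // XML: "</foo>" in a single write.
    void writeEnd(OutputStream out) throws IOException {
        out.write(end, 0, end.length);
    }

    // FI: the name octets only, ignoring the first and last octet.
    void writeNameOnly(OutputStream out) throws IOException {
        out.write(start, 1, start.length - 2);
    }
}
```

Each tag then costs one write call on the stream regardless of which of
the three forms is needed.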

Reducing the number of OutputStream.write calls can speed things up as
some of the implementations e.g. BufferedOutputStream and
ByteArrayOutputStream use synchronized methods.

>
>> I notice that there is a specific method:
>>
>> public void text( int value )
>>
>> Are you experimenting with having specific typed methods?
>
>
> Yes. Adding this isn't very cheap (in terms of the code size), so I'm
> not sure if I should retain it. I need to see if this is really making
> an improvement --- for XML, the only gain I get is that I can write an
> integer without creating a String.
>
> I still like the custom CharSequence implementation better.
>

Yes, it is cleaner. And for the writing of an integer a buffer could be
reused as opposed to creating a new String every time. It would be nice
however to get access to an underlying buffer without copying or doing a
loop over a method call for each character.
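
To illustrate the reusable buffer for integers, here is a small sketch.
The class IntAsciiBuffer and its accessors are hypothetical; the point is
only that the digits land in a byte array that the output can read
directly, with no String and no per-character method call.

```java
// Hypothetical helper, not part of JAXB or FI.
final class IntAsciiBuffer {
    // Long enough for "-2147483648" (11 characters).
    private final byte[] buf = new byte[11];
    private int start = buf.length;

    // Formats value as decimal ASCII, right-aligned in the buffer.
    void set(int value) {
        int pos = buf.length;
        // Work on the non-positive value so Integer.MIN_VALUE cannot overflow.
        int n = value > 0 ? -value : value;
        do {
            // n % 10 is in [-9, 0] here, so this yields '0'..'9'.
            buf[--pos] = (byte) ('0' - (n % 10));
            n /= 10;
        } while (n != 0);
        if (value < 0) {
            buf[--pos] = '-';
        }
        start = pos;
    }

    // Direct access to the underlying buffer, no copying.
    byte[] array() { return buf; }
    int offset() { return start; }
    int length() { return buf.length - start; }
}
```

A caller would do set(value) and then something like
out.write(b.array(), b.offset(), b.length()), reusing the same instance
for every integer.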

>> Having typed methods rather than doing an if/else instanceof would be
>> more efficient for FI when encoding a single value or an array of
>> values. Or alternatively an integer indicating the type of data so
>> that a switch statement can be used.
>
>
> Even with the custom CharSequences, you wouldn't have to do
> "instanceof". We'll define a visitor. (granted it still involves virtual
> method invocation, so it might be still slow)
>

OK. It may be a higher fixed cost for a single value but if it can be
made to work for arrays then i think it is worth it.

>
>> I wonder if it is possible to reduce the cost of checking for
>> namespaces on each element and push this out to the higher layer
>> which may better determine how namespaces are used? For example the
>> common case with JAX-RPC would be to define all required namespaces up
>> front on the SOAP envelope or on the root element fragment.
>>
>> Maybe another method:
>>
>> beginStartTagWithNamespaceDeclarations
>>
>> would be appropriate?
>
>
> NamespaceContextImpl keeps track of what namespaces need to be declared
> when. The only thing XmlOutput needs to do is to check the current
> NamespaceContext.Element and declare new elements.
>
> Today you can check if an element has any namespace declaration or not by:
>
> if(nsContext.getCurrent().count()==0)
>
> and I think this is cheap enough. If you think this is too expensive, I
> can pass in the value of nsContext.getCurrent(). But it just saves one
> memory look up --- given the access frequency, chances are, that this
> memory is in a processor cache.
>
>

OK, if the previous optimizations you proposed for attribute hints and
leaf elements are possible then i think it covers a lot of cases
efficiently already.

Paul.

-- 
| ? + ? = To question
----------------\
    Paul Sandoz
         x38109
+33-4-76188109