Re: JAXB serialization

From: Paul Sandoz <Paul.Sandoz_at_Sun.COM>
Date: Mon, 06 Jun 2005 11:33:49 +0200

Kohsuke Kawaguchi wrote:
> Paul Sandoz wrote:
>
>> Since the 'literal' output is likely to be the exception i think it
>> should be possible to produce the required data on the first 'literal'
>> call. However, to do this the FI serializer would require that Name
>> objects state whether they are used for EIIs, AIIs, or both.
>
>
> JAXB assigns indices to all Names, while FI assigns indices to used
> ones. So I think you'll have to maintain the index conversion table anyway.
>
> And I think from this index conversion table you can tell which names
> are used already when you hit the first 'literal'.
>

What i meant by 'literal' (not the best term!) was the case where the
Name objects were not used for serialization.

>>>> Having a unique integer with a Name would be useful so that
>>>> efficient table lookup can be achieved e.g. if there are 10 Names
>>>> then each Name would be assigned a unique integer in the interval
>>>> [0, 9] say. Although FI has separate tables for AIIs and EIIs i do
>>>> not think it makes a difference for JAXB to have one space for Name
>>>> objects if O(1) access is possible.
>>>
>>>
>>>
>>> Yes, this is on my TODO list. I can easily give them sequence number
>>> starting 0.
>>>
>>
>> Great. I think we will need to differentiate between Names associated
>> with EIIs, AIIs or both for efficient support.
>
>
> I can compute if Name is used in EII, AII, or both, and perhaps add
> flags on Name, but I'm not sure how they are used.
>
> I thought you'll assign FI indices to Names as you see them during
> marshalling.

Yes.

> So do you really have use for this flag?
>

We need to maintain two types of look up: one for when Names are used;
and one for when Names are not used.

It should be possible to lazily create the look up information for when
Names are not used, on the first occurrence of such serialization. To
create such information it is necessary to know if a Name was used for
an EII, AII or both.

If, say, there were a total of 100 Names, with 75 associated with EIIs
and 50 associated with AIIs, then i think the easiest solution would be
to give each Name one unique integer for its EII use and a separate one
for its AII use. This is also more efficient since it is not necessary
to maintain two tables of 100 entries each, one for the EII table and
one for the AII table.
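
A minimal sketch of what i mean by separate index spaces. The class and
field names here (IndexedName, NameIndexer) are made up for illustration
and are not part of JAXB or the FI code; the flags are assumed to be
computed by JAXB when the context is built.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a JAXB Name; all names here are assumptions.
final class IndexedName {
    final String uri;
    final String local;
    final int elementIndex;   // index in the EII space, or -1 if never used as an element
    final int attributeIndex; // index in the AII space, or -1 if never used as an attribute

    IndexedName(String uri, String local, int elementIndex, int attributeIndex) {
        this.uri = uri;
        this.local = local;
        this.elementIndex = elementIndex;
        this.attributeIndex = attributeIndex;
    }
}

// Hands out two independent, dense index sequences, one per kind of use.
final class NameIndexer {
    private final List<IndexedName> names = new ArrayList<>();
    private int elementCount;
    private int attributeCount;

    IndexedName add(String uri, String local, boolean usedAsElement, boolean usedAsAttribute) {
        int e = usedAsElement ? elementCount++ : -1;
        int a = usedAsAttribute ? attributeCount++ : -1;
        IndexedName name = new IndexedName(uri, local, e, a);
        names.add(name);
        return name;
    }

    int size() { return names.size(); }
    int elementCount() { return elementCount; }
    int attributeCount() { return attributeCount; }
}
```

With the numbers above, the EII table then needs 75 entries and the AII
table 50, rather than 100 each.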

>
>> Will Name objects be unique to a JAXBContext?
>
>
> Yes. That is,
>
> - A Name object always belongs to a JAXBContext
> - Two Names that belong to the same JAXBContext are always different
>

OK.

>
>
>>>> If it can be ascertained that a Name will always use the same prefix
>>>> then lookup of a Name to an index in an FI document is very efficient.
>>>
>>>
>>>
>>> Unfortunately this assumption does not hold. One example is when JAXB
>>> starts with xmlns="foo", and then later finds out that it needs to
>>> print out xmlns="" (because you can't assign URI "" to any prefix.)
>>>
>>
>> Can you explain a bit more about this? Not sure i understand fully. I
>> thought in this case different Name objects will be used because of
>> different URIs.
>
>
> The question was whether JAXB can guarantee that a given Name object
> uses the same prefix throughout the marshalling. But it can't. Taking
> your QName-in-content example...
>
> Consider the following example.
>
> class Foo {
> @XmlValue
> QName n;
> }
>
> @XmlRootElement(ns="foo",name="foo")
> class Bar {
> Foo foo;
> }
>
> And imagine
>
> Foo f = new Foo();
> f.n = new QName("","zot");
> Bar b = new Bar();
> b.foo = f;
>
> JAXB marshals it as:
>
> <foo xmlns="foo">
> <ns1:foo xmlns="" xmlns:ns1="foo">zot</ns1:foo>
> </foo>
>
> So the Name object for {foo}foo uses two different namespace prefixes.
>

Nasty! I would have thought unprefixed qualified names in content would
have behaved like unprefixed attribute names, i.e. they would not be
associated with any namespace.

>
>> But i hope for the most common cases this will not occur i.e. with
>> JAX-RPC.
>
>
> I agree.
>
>> When i first looked at XML namespaces i thought "nice solution" but
>> the more i think about it now the less i like it, but i cannot think
>> of any better textually compact alternatives (unless of course one
>> uses a binary serialization :-) ).
>
>
> If only they'd let us bind a prefix to the default "" namespace...
>
> And oh so evil QName in content...
>

:-)

>>> Another example is when the portion of a document is bound to DOM.
>>> DOM can declare namespaces in any way it wants.
>>>
>>
>> In that case Name objects will not be used and the 'literal' approach
>> is used instead? so we would fall back to the slower solution (and
>> ensure consistency).
>
>
> That's true.
>
>
>>> In normal case, one namespace URI is bound to one prefix. So
>>> hopefully we can code FI such that it works fast as long as this
>>> assumption holds, and if it breaks, it takes the slow route.
>>>
>>
>> Exactly my thoughts too. The question is how can one know when this
>> assumption holds? Perhaps the solution is to have an array of arrays.
>>
>> int[] items = nameIndexToFINameIndex[nameIndex];
>> if (items[0] == prefixIndex) {
>> int fiNameIndex = items[1];
>> } else {
>> // search through items for required prefix
>> }
>
>
> Something like that.
>
> If the prefixIndex isn't the "typical" index (if your test in the "if"
> statement fails), I assumed that you can look that up from the name
> table you wrote.
>

Fallback to the slower solution should be possible, or have further
prefix entries in the items array.
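
A rough sketch of the "further prefix entries" variant, assuming each
row stores flat (prefixIndex, fiNameIndex) pairs with the typical prefix
first. The class name and the NOT_FOUND convention are assumptions for
illustration only, not existing FI code.

```java
import java.util.Arrays;

// Maps a JAXB-side name index to FI name indices, one per prefix seen.
final class NameToFiIndexTable {
    static final int NOT_FOUND = -1;

    // items[nameIndex] holds pairs: prefixIndex, fiNameIndex, prefixIndex, fiNameIndex, ...
    private final int[][] items;

    NameToFiIndexTable(int nameCount) {
        items = new int[nameCount][];
    }

    int lookup(int nameIndex, int prefixIndex) {
        int[] pairs = items[nameIndex];
        if (pairs == null) {
            return NOT_FOUND;
        }
        if (pairs[0] == prefixIndex) {
            return pairs[1]; // fast path: the typical prefix
        }
        // Slow path: scan the remaining pairs for the required prefix.
        for (int i = 2; i < pairs.length; i += 2) {
            if (pairs[i] == prefixIndex) {
                return pairs[i + 1];
            }
        }
        return NOT_FOUND;
    }

    // Caller is expected to call this only after lookup returned NOT_FOUND.
    void put(int nameIndex, int prefixIndex, int fiNameIndex) {
        int[] pairs = items[nameIndex];
        if (pairs == null) {
            items[nameIndex] = new int[] { prefixIndex, fiNameIndex };
            return;
        }
        int n = pairs.length;
        int[] grown = Arrays.copyOf(pairs, n + 2);
        grown[n] = prefixIndex;
        grown[n + 1] = fiNameIndex;
        items[nameIndex] = grown;
    }
}
```

As long as one Name sticks to one prefix, only the first comparison is
ever executed; the loop is the price paid for cases like the
QName-in-content example.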

>> It should also be possible to retain arrays and not have to clear them
>> for every serialization if a per serialize counter is used. When the
>> counter reaches the maximum integer value the array is reset.
>
>
> Right.
>
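
For what it is worth, a small sketch of the counter idea, assuming a
parallel array of stamps alongside the values. The class name and the
use of -1 for "not set" are assumptions for illustration only.

```java
import java.util.Arrays;

// An int table whose entries are only valid for the current serialization.
final class GenerationStampedTable {
    private final int[] values;
    private final int[] stamps;  // generation in which values[i] was last written
    private int generation = 1;  // 0 is reserved to mean "never written"

    GenerationStampedTable(int size) {
        values = new int[size];
        stamps = new int[size];
    }

    // Call at the start of each serialization instead of clearing the arrays.
    void nextSerialization() {
        if (generation == Integer.MAX_VALUE) {
            // Counter exhausted: reset the stamps once and start over.
            Arrays.fill(stamps, 0);
            generation = 1;
        } else {
            generation++;
        }
    }

    boolean contains(int i) {
        return stamps[i] == generation;
    }

    // Returns -1 when the entry was not written during this serialization.
    int get(int i) {
        return stamps[i] == generation ? values[i] : -1;
    }

    void put(int i, int value) {
        values[i] = value;
        stamps[i] = generation;
    }
}
```
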
>
>>>> It would help if it can be known in advance whether serialization
>>>> may use the non-Name methods. If it is known that it will not, then
>>>> it is not necessary to maintain synchronized hash tables for the
>>>> different forms of data.
>>>
>>> This isn't really possible without traversing the whole object tree
>>> beforehand.
>>
>>
>> I suppose there could be a hint from analysing the static definition.
>> However, as i said earlier, i think it may be possible to avoid this
>> using a lazy calculation approach if the Name object has the required
>> information on where it is used.
>
>
> I think corner cases make it somewhat difficult for JAXB to say "this
> JAXBContext will never ever use the non-Name version."
>
> For example, with any JAXBContext, currently the user is allowed to
> create an instance of JAXBElement with arbitrary QName and marshal it.
>

OK.

>
>>> Sometimes the marshaller knows the element being written will not
>>> have any attribute nor namespace declaration. I can add a boolean
>>> parameter to beginStartTag to indicate this.
>>>
>>> If the hint says "absolutely no attribute/xmlns", you can skip
>>> buffering.
>>>
>>> Another possibility is to define the "writeLeafElement", which writes
>>> something like
>>>
>>> <foo>xxxx</foo>
>>>
>>> I found that this happens very often in many schemas, and this might
>>> allow better optimizations for FI.
>>>
>>
>> I think both these optimizations would be useful for FI and XML
>> serialization as long as the higher layer does not have to do a load
>> of work.
>
>
> Another pressure for us to make the runtime smaller. So picking the
> right optimization is tricky. I think I'm inclined to do the leaf
> optimization.
>

Is leaf optimization a specialization of "absolutely no attributes and
namespaces"? i.e. is it necessary to check this first before checking
leaf status?

>> For XML it means that there can be UTF-8 encoded strings "<foo>" and
>> "</foo>" that could also be reused for when there are attributes
>> present i.e. the former could write up to the last but one octet. For
>> FI we can still use the first UTF-8 encoded string by ignoring the
>> first and last octet.
>>
>> Reducing the number of OutputStream.write can speed things up as some
>> of the implementations e.g. BufferedOutputStream and
>> ByteArrayOutputStream use synchronized methods.
>
>
> 8-O
>
> I didn't know that. Given that the OutputStream uses the decorator
> pattern excessively, why they combined the synchronization with
> buffering into one class is really beyond me!
>

Yes. I would have thought that multiple threads writing to one stream
would require some application level coordination, thus implying that
synchronized is not going to help much.

It should be more efficient for JAXB to use its own byte buffering.
Setting a buffer size property on the context might be appropriate.
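
Something along these lines is what i have in mind: a plain,
unsynchronized buffer in front of the OutputStream, intended for use by
a single thread. The class name and constructor are assumptions for
illustration, not an existing JAXB or FI class.

```java
import java.io.IOException;
import java.io.OutputStream;

// Unsynchronized byte buffer that batches many small writes into few large ones.
final class UnsyncBuffer {
    private final OutputStream out;
    private final byte[] buf;
    private int count;

    UnsyncBuffer(OutputStream out, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.out = out;
        this.buf = new byte[size];
    }

    void write(byte[] src, int off, int len) throws IOException {
        if (len >= buf.length) {
            // Too large to buffer: push out what we have, then write through.
            flush();
            out.write(src, off, len);
            return;
        }
        if (len > buf.length - count) {
            flush();
        }
        System.arraycopy(src, off, buf, count, len);
        count += len;
    }

    // Pushes buffered bytes to the underlying stream in a single write call.
    void flush() throws IOException {
        if (count > 0) {
            out.write(buf, 0, count);
            count = 0;
        }
    }
}
```

A pre-encoded "<foo>" byte array can then be written whole for the leaf
case, or with length - 1 when attributes follow, without touching the
underlying stream on every call.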

Paul.

-- 
| ? + ? = To question
----------------\
    Paul Sandoz
         x38109
+33-4-76188109