Re: VTD speed

From: Jimmy Zhang <crackeur_at_comcast.net>
Date: Mon, 23 Jan 2006 11:35:33 -0800

----- Original Message -----
From: "Paul Sandoz" <Paul.Sandoz_at_Sun.COM>
To: <users_at_fi.dev.java.net>
Sent: Monday, January 23, 2006 5:12 AM
Subject: Re: VTD speed

> Mark Swanson wrote:
>>> a lot of objects; VTD-XML doesn't. So VTD-XML
>>> should hold the edge both in memory and performance...
>>> The best way is to try it...
>>
>>
>> Tried it. VTD wins by a wide margin in memory and performance (throughput
>> as well as memory pressure placed on the GC by object creation) over
>> XmlBeans. I used the VTD Java API and the XmlBeans API to loop through
>> parse, get some element attribute value as String 20 times in loops of
>> 5000.
>>
>
> (just been browsing the VTD parsing code, not a String in sight!)
>
> VTD is a very efficient parser that performs no instantiation of String
> objects when parsing. This is fantastic for the scenarios where you only
> need access to a certain part of the document e.g. routing decisions using
> XPath come to mind.

VTD-XML, like many technologies, isn't perfect; it is designed
to provide an option and new possibilities...

>
> (IIRC Xerces avoids instantiation of String objects for tags that have
> previously occurred by using a symbol table of interned Strings.)
>
> As far as i can tell from the code it looks like the Java VTD parser is
> not performing any namespace validation when the document is parsed e.g.
> no checking if a prefix of an element or attribute is in-scope. As a
> consequence duplicate attributes are not fully checked at parsing (the
> local name and namespace URI need to be checked in addition to checking
> for attributes with the same qualified name).
>
> VTD parsing avoids a lot of work, some related to instantiation of objects
> and some related to checking of the XML, when performed by other APIs and
> implementations. For the latter it may not be known if a non-well-formed
> document is being parsed, and in some cases it will never be known because
> the non-well-formed parts of the document will never be navigated.
>
The spirit of VTD is that one simply doesn't have to, and has every
incentive not to, create a lot of objects, which are not only slow to
allocate but, even worse, eventually need to be garbage collected.

Avoiding most of that work is *overwhelmingly* the right thing to do; that
work is what has stymied the performance and memory usage of DOM and SAX.

Then there is namespace checking.

The first thing about namespaces is that they are designed to disambiguate
between different vocabulary sets. VTD-XML performs late binding between
prefixes and namespace URIs by doing a lookup using the context object (an
integer array).
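
To make the late-binding idea concrete, here is a minimal Java sketch. It is
not the VTD-XML API: Element, resolve and matches are hypothetical names, and
a map per element stands in for the integer-array context object. The point
is only that the prefix is bound to a URI when someone asks, not at parse
time.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the VTD-XML API: an element records only its
// prefix at parse time; the prefix is bound to a URI on demand.
final class LateBindingSketch {

    static final class Element {
        final String prefix;      // "" means no prefix
        final String localName;
        final Element parent;     // null for the root
        // xmlns declarations that appear on this element: prefix -> URI
        final Map<String, String> declared = new HashMap<>();

        Element(String prefix, String localName, Element parent) {
            this.prefix = prefix;
            this.localName = localName;
            this.parent = parent;
        }
    }

    // Walk from the element up to the root and return the first URI declared
    // for the prefix, or null if the prefix is not in scope.
    static String resolve(Element e, String prefix) {
        for (Element cur = e; cur != null; cur = cur.parent) {
            String uri = cur.declared.get(prefix);
            if (uri != null) {
                return uri;
            }
        }
        return null;
    }

    // Namespace-aware match, done at navigation time rather than parse time.
    static boolean matches(Element e, String uri, String localName) {
        return localName.equals(e.localName) && uri.equals(resolve(e, e.prefix));
    }

    public static void main(String[] args) {
        Element root = new Element("s", "Envelope", null);
        root.declared.put("s", "http://example.org/soap");
        Element ok = new Element("s", "Body", root);
        Element bad = new Element("x", "Body", root); // prefix x is never declared

        System.out.println(matches(ok, "http://example.org/soap", "Body"));  // true
        System.out.println(matches(bad, "http://example.org/soap", "Body")); // false
    }
}
```

The element with the undeclared prefix is still there and can be walked over;
it just never matches a namespace-aware lookup.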

> no checking if a prefix of an element or attribute is in-scope. As a
> consequence duplicate attributes are not fully checked at parsing (the
> local name and namespace URI need to be checked in addition to checking
> for attributes with the same qualified name).

We actually made a conscious decision on VTD-XML's handling of namespaces
during the design process. Some of the questions we thought about:

What if the namespace has an error?
SAX would simply throw an error/exception during parsing. VTD-XML handles
this differently: you can still navigate the document, but you won't be able
to locate the elements/attributes containing errors during navigation,
because the binding is late.

What about the non-wellformedness of URI:localname not being unique?

Some well-formedness errors are due to careless mistakes, but not this one.
We feel this will happen so rarely that the only reasons for it are that
(1) someone is abusing namespaces or (2) someone intentionally wants to
produce bad/confusing XML. So, in a way, this part of the checking defined
in the namespace spec is overly rigid, and quite problematic...
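
For reference, the check under discussion can be sketched as below. This is
an illustration only, not code from VTD-XML or any other parser; the class
and method names are made up, and the prefix-to-URI map is assumed to hold
the bindings in scope for the element.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the expanded-name uniqueness check on attributes.
final class DuplicateAttributeSketch {

    // Returns true if two attributes share the same (namespace URI, local name).
    // qNames are "prefix:local" or "local"; inScope maps prefix -> URI.
    static boolean hasDuplicateExpandedName(List<String> qNames,
                                            Map<String, String> inScope) {
        Set<String> seen = new HashSet<>();
        for (String qName : qNames) {
            int colon = qName.indexOf(':');
            // Unprefixed attributes are in no namespace.
            String uri = colon < 0 ? "" : inScope.get(qName.substring(0, colon));
            String local = qName.substring(colon + 1);
            if (!seen.add("{" + uri + "}" + local)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> inScope = new HashMap<>();
        inScope.put("a", "urn:n");
        inScope.put("b", "urn:n"); // second prefix bound to the same URI

        // The qualified names differ, so a qname-only check passes,
        // but the expanded names collide.
        System.out.println(hasDuplicateExpandedName(Arrays.asList("a:x", "b:x"), inScope)); // true
    }
}
```

Note that the collision only shows up when two different prefixes are bound
to the same URI on one element, which is the case described as rare above.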

The bottom line is that VTD-XML conforms to the XML spec (with the exception
of the entity part).

> If you need to access a significant portion of the document, e.g. for
> processing SOAP header blocks and data binding the payload of a SOAP
> message, then from the code at least i think it likely that, for a UTF-8
> encoded SOAP message, UTF-8 decoding will be performed twice on nearly all
> relevant characters (once by parsing to determine offsets, the second when
> iterating through the document). In addition string equality is performed
> on a per character basis, whereas if a string is interned a binding tool
> can check for equality using '==' (especially useful for namespace URIs).

That is not a problem at all; decoding cost is minimal because the text is
mostly ASCII, and each ASCII character is a single byte in UTF-8.
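
A rough sketch of why the ASCII case is cheap, and why a per-character
comparison needs no String at all. This is not taken from the VTD-XML
source; matchesAscii is a hypothetical helper, and it assumes the name being
looked for is plain ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: compare a token in the raw UTF-8 buffer against an
// ASCII name without allocating a String.
final class AsciiCompareSketch {

    // Compares doc[offset, offset + length) with name, byte by byte.
    static boolean matchesAscii(byte[] doc, int offset, int length, String name) {
        if (length != name.length()) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            byte b = doc[offset + i];
            // A non-negative byte is below 0x80, i.e. ASCII in UTF-8, so the
            // byte value is the code point and "decoding" is a plain read.
            if (b < 0 || b != name.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] doc = "<route to=\"east\"/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(matchesAscii(doc, 1, 5, "route")); // true
        System.out.println(matchesAscii(doc, 11, 4, "west")); // false
    }
}
```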

>
> So it is swings and roundabouts, ya choose ya model to best suit ya needs.
> VTD looks like a great model for some XML processing scenarios but
> definitely not for all.

VTD-XML should get even better going forward...

>
>
>> I'm now even more intrigued by an FI/VTD combo.
>>
>
> This may be possible, although the VTD representation of the document
> in memory may require some changes. It is certainly possible to reference
> the FI document for literal strings. Indexed qualified names and strings
> may be more problematic.
>
> Paul.
>
> --
> | ? + ? = To question
> ----------------\
> Paul Sandoz
> x38109
> +33-4-76188109