Re: VTD speed

From: Paul Sandoz <Paul.Sandoz_at_Sun.COM>
Date: Mon, 23 Jan 2006 14:12:13 +0100

Mark Swanson wrote:
>> a lot of objects; VTD-XML doesn't. So VTD-XML
>> should hold the edge both in memory and performance...
>> The best way is to try it...
>
>
> Tried it. VTD wins by a wide margin in memory and performance
> (throughput as well as memory pressure placed on the GC by object
> creation) over XmlBeans. I used the VTD Java API and the XmlBeans API to
> loop through parse, get some element attribute value as String 20 times
> in loops of 5000.
>

(just been browsing the VTD parsing code, not a String in sight!)

VTD is a very efficient parser that performs no instantiation of String
objects when parsing. This is fantastic for the scenarios where you only
need access to a certain part of the document e.g. routing decisions
using XPath come to mind.

(IIRC Xerces avoids instantiation of String objects for tags that have
previously occurred by using a symbol table of interned Strings.)
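
A minimal sketch of the symbol-table idea, to make the saving concrete.
This is not Xerces source code: the class name SymbolTableSketch, the
fixed bucket count and the allocations counter are all assumptions made
for illustration. The point is that a name seen before is matched
directly against the parser's character buffer, so no new String is
created for it.

```java
// Hypothetical sketch of a parser symbol table; not taken from Xerces.
final class SymbolTableSketch {
    private static final class Entry {
        final String symbol;
        final int hash;
        final Entry next;

        Entry(String symbol, int hash, Entry next) {
            this.symbol = symbol;
            this.hash = hash;
            this.next = next;
        }
    }

    private final Entry[] buckets = new Entry[64];

    // Illustration only: counts how many Strings this table has created.
    int allocations = 0;

    // Returns the canonical String for buf[offset, offset + length).
    // A String is allocated only the first time a given name is seen.
    String addSymbol(char[] buf, int offset, int length) {
        int hash = 0;
        for (int i = 0; i < length; i++) {
            hash = 31 * hash + buf[offset + i];
        }
        int index = (hash & 0x7fffffff) % buckets.length;
        for (Entry e = buckets[index]; e != null; e = e.next) {
            if (e.hash == hash && matches(e.symbol, buf, offset, length)) {
                return e.symbol; // seen before: no allocation
            }
        }
        String symbol = new String(buf, offset, length).intern();
        allocations++;
        buckets[index] = new Entry(symbol, hash, buckets[index]);
        return symbol;
    }

    // Compares a stored symbol against the buffer without building a String.
    private static boolean matches(String s, char[] buf, int offset, int length) {
        if (s.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (s.charAt(i) != buf[offset + i]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling addSymbol twice for the same tag name at two different buffer
offsets returns the same String instance, so callers can later compare
names by reference.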

As far as I can tell from the code, the Java VTD parser does not
perform any namespace validation when the document is parsed, e.g. it
does not check whether the prefix of an element or attribute is in
scope. As a consequence, duplicate attributes are not fully checked at
parse time (the local name and namespace URI need to be checked in
addition to checking for attributes with the same qualified name).
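
To illustrate the gap between the two checks, here is a minimal sketch,
assuming the in-scope prefix bindings are already available as a map.
The class and method names are hypothetical and this is not VTD-XML
code. An element such as
<e xmlns:a="urn:x" xmlns:b="urn:x" a:id="1" b:id="2"/> passes the
qualified-name check but fails the expanded-name check.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and structure are assumptions for illustration.
final class AttributeCheckSketch {

    // The cheaper check: are any two qualified names textually identical?
    static boolean hasDuplicateQNames(List<String> qnames) {
        return new HashSet<>(qnames).size() != qnames.size();
    }

    // The namespace-aware check: are any two attributes equal once each
    // prefix is resolved to its namespace URI?
    static boolean hasDuplicateExpandedNames(List<String> qnames,
                                             Map<String, String> inScope) {
        Set<String> seen = new HashSet<>();
        for (String qname : qnames) {
            int colon = qname.indexOf(':');
            String uri;
            String local;
            if (colon < 0) {
                uri = "";      // unprefixed attributes are in no namespace
                local = qname;
            } else {
                String prefix = qname.substring(0, colon);
                uri = inScope.get(prefix);
                if (uri == null) {
                    // This is the in-scope prefix check mentioned above.
                    throw new IllegalArgumentException("unbound prefix: " + prefix);
                }
                local = qname.substring(colon + 1);
            }
            if (!seen.add("{" + uri + "}" + local)) {
                return true;
            }
        }
        return false;
    }
}
```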

VTD parsing avoids a lot of work that other APIs and implementations
perform, some related to instantiation of objects and some related to
checking of the XML. Because of the latter, it may not be known that a
non-well-formed document is being parsed, and in some cases it will
never be known because the non-well-formed parts of the document
will never be navigated.

If you need to access a significant portion of the document, e.g. for
processing SOAP header blocks and data binding the payload of a SOAP
message, then from the code at least I think it likely that, for a UTF-8
encoded SOAP message, UTF-8 decoding will be performed twice on nearly
all relevant characters (once when parsing to determine offsets, and
again when iterating through the document). In addition, string equality
is performed on a per-character basis, whereas if strings are interned a
binding tool can check for equality using '==' (especially useful for
namespace URIs).
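
The difference between the two kinds of comparison can be sketched as
follows. The class and method names are hypothetical, and the char[]
buffer is only a stand-in for a parser's internal buffer; it is not how
VTD-XML stores the document. The namespace URI used is the SOAP 1.1
envelope namespace.

```java
// Hypothetical sketch contrasting per-character and reference comparison.
final class InternedCompareSketch {

    // String literals are interned by the JVM, so this constant is canonical.
    static final String SOAP_ENV_NS = "http://schemas.xmlsoap.org/soap/envelope/";

    // Per-character comparison against a region of a buffer: the cost
    // grows with the length of the name or URI being compared.
    static boolean equalsByChars(char[] buf, int offset, int length, String expected) {
        if (expected.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (buf[offset + i] != expected.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // Reference comparison: a single pointer check, but only valid when
    // the caller guarantees that the argument has been interned.
    static boolean isSoapEnvelopeNs(String internedUri) {
        return internedUri == SOAP_ENV_NS;
    }
}
```

The reference check is only safe when every URI handed to the binding
tool has been interned; a String built from the buffer without
intern() would compare unequal under '==' even when the characters
match.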

So it is swings and roundabouts, ya choose ya model to best suit ya
needs. VTD looks like a great model for some XML processing scenarios
but definitely not for all.

> I'm now even more intrigued by an FI/VTD combo.
>

This may be possible, although the VTD representation of the document
in memory may require some changes. It is certainly possible to
reference the FI document for literal strings. Indexed qualified names
and strings may be more problematic.

Paul.

-- 
| ? + ? = To question
----------------\
    Paul Sandoz
         x38109
+33-4-76188109