users@fi.java.net

Re: VTD speed

From: Tatu Saloranta <cowtowncoder_at_yahoo.com>
Date: Mon, 23 Jan 2006 15:22:14 -0800 (PST)

--- Jimmy Zhang <crackeur_at_comcast.net> wrote:

> > As far as I can tell from the code it looks like the Java VTD
> > parser is not performing any namespace validation when the
> > document is parsed, e.g. no checking whether a prefix of an
> > element or attribute is in scope. As a consequence duplicate
> > attributes are not fully checked at parsing (the local name and
> > namespace URI need to be checked in addition to checking for
> > attributes with the same qualified name).
> >
> > VTD parsing avoids a lot of work, some related to instantiation
> > of objects and some related to checking of the XML, when
> > performed by other APIs and implementations. For the latter it
> > may not be known if a non-well-formed document is being parsed,
> > and in some cases it will never be known because the
> > non-well-formed parts of the document will never be navigated.
> >
> The spirit of VTD is that one simply doesn't have to, and has
> every incentive not to, create a lot of objects, which are not
> only slow to allocate, but even worse, eventually need to be
> garbage collected.

There are 2 separate issues though: lack of
well-formedness checks (which many people would
consider bugs), and trying to avoid object creation.
More about the former later on.

In the latter case, one really should profile things.
Quite frankly, at least with respect to various names
(attribute, element, ns, prefix), all modern xml
parsers do name intern'ing (or a similar approach), and
the vast majority of String objects are shared (either
within a single document, or globally).
Further, with current JVMs, the cost of allocating
short-lived objects is very low; and even
garbage-collecting them is rather cheap. So even
though there are still benefits to trying to avoid
object creation, the benefits are generally smaller
than people think they are. What this comes down to is
this: just avoiding all object allocations is no
guarantee of fast parsing. Actual measurements are, as
long as they are apples-to-apples.
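To illustrate the name-intern'ing point: here is a minimal,
hypothetical sketch of the symbol-table technique (names and details
are mine, not taken from any particular parser). Names parsed out of
the input buffer are canonicalized, so each distinct element or
attribute name results in only one long-lived String:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a symbol table that canonicalizes names parsed
// from a char[] buffer. Real parsers typically hash the char[] slice
// directly to avoid even the short-lived candidate String below.
public class SymbolTable {
    private final Map<String, String> symbols = new HashMap<>();

    // Returns the canonical String for the name in buf[start..start+len).
    public String intern(char[] buf, int start, int len) {
        String candidate = new String(buf, start, len);
        String canonical = symbols.putIfAbsent(candidate, candidate);
        return (canonical != null) ? canonical : candidate;
    }

    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        char[] doc = "<item><item>".toCharArray();
        String first = table.intern(doc, 1, 4);
        String second = table.intern(doc, 7, 4);
        // Both lookups return the very same String object
        System.out.println(first == second); // true
    }
}
```

So a document with a million <item> elements allocates one "item"
String, not a million, which is why commenting out String creation
buys less than one might expect.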

> Most of the work VTD-XML avoids are *overwhelmingly*
> the right things to do,

"right" as in, based on what? The well-formedness
checks it skips?

> and
> what has stymied the performance and memory usage of
> DOM and SAX.

Can you point to examples of this, esp. regarding SAX?
I am really not convinced by the statement: SAX
presents textual content as character arrays, which
only need charset decoding (which, granted, could be
avoided -- but then you usually end up DOUBLE decoding
if and when you actually do need to process the
content).
Have you profiled the performance bottlenecks of SAX
parsers to reach this conclusion?

I mean, with DOM I can agree with the statement (since
the specs require so many odd things to be supported,
adding some unavoidable overhead), but with SAX I'm
not sure either memory usage or performance really has
that much to do with object allocations these days. If
I went ahead and commented out the String allocation
code in, say, Xerces, I wouldn't expect to see a major
speed improvement (based on profiling other xml
parsers): at most maybe a 15-20% speedup.
The only remaining obvious thing would be how
attribute values are accessed: SAX does require
constructing a map of attribute names to String
values. But even this does not seem like a gigantic
performance concern.
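To make the character-array point concrete, here is a small sketch
using the standard JAXP/SAX API (the helper method name is mine): the
parser hands the handler slices of its internal char[] buffer, and any
String creation for text content is the application's choice, not the
parser's:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: SAX delivers character content as (buffer, offset, length),
// so no String is allocated for text unless the handler makes one.
public class CharArrayDemo {
    static String parseText(String xml) throws Exception {
        final StringBuilder seen = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Slice of the parser's own buffer; copying it out
                // (here, into a StringBuilder) is our decision.
                seen.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return seen.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseText("<a>some text</a>"));
    }
}
```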

> Then there is namespace checking.
>
...
>
> What if the namespace has an error?
> SAX would simply spit out an error/exception during parsing.
> VTD-XML handles this differently: you can still navigate the
> document but won't be able to locate the elements/attributes
> containing errors during navigation. The binding is late?
>
> What about the non-well-formedness of URI:localname not being
> unique?
>
> Some well-formedness errors are due to careless errors, but not
> this one. We feel that this will so rarely happen that the only
> reason it will happen is because (1) someone is abusing the
> namespace (2) he intentionally wants to make

How about the usual case of (3) someone's XML output
code has bugs?

> a bad/confusing XML. So in a way, this part of the checking
> defined in the namespace spec is overly rigid, and quite
> problematic...

No. That you cannot (or choose not to) implement a
mandatory part of the namespace specification does not
imply that the specification itself is wrong.
I may not be a big fan of the NS specs, but to me the
choice is clear: either you implement it, or you
don't. But in the latter case you cannot claim to
implement it while only supporting the parts you
happen to like.
But even without namespaces, one really, really should
check for duplicate attributes.
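For reference, the check in question is small. This is a hypothetical
sketch (class and method names are mine): a namespace-aware parser
must reject two attributes on the same element that share an expanded
name (namespace URI + local name), even when their prefixes, and thus
qualified names, differ:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the duplicate-attribute check required by the Namespaces
// spec: uniqueness is by expanded name, not by qualified name.
public class DuplicateAttributeCheck {
    private final Set<String> seen = new HashSet<>();

    // uri is "" for an attribute in no namespace.
    // Returns false if this (uri, localName) pair was already seen.
    public boolean addAttribute(String uri, String localName) {
        // '{' and '}' must be percent-encoded in URIs and cannot occur
        // in local names, so this "Clark notation" key is unambiguous.
        return seen.add("{" + uri + "}" + localName);
    }

    public static void main(String[] args) {
        DuplicateAttributeCheck attrs = new DuplicateAttributeCheck();
        // <e xmlns:a="urn:x" xmlns:b="urn:x" a:id="1" b:id="2"/>
        System.out.println(attrs.addAttribute("urn:x", "id")); // true
        System.out.println(attrs.addAttribute("urn:x", "id")); // false
    }
}
```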

Additionally, by not doing these necessary parts of
namespace resolution (etc.), you are doing
optimizations that give an unrealistic picture of
performance; so comparisons between SAX parsers and
VTD-XML are apples-to-oranges comparisons. A somewhat
significant part of the overhead in xml parsing does
indeed come from having to keep dynamic namespace
binding mappings, as well as an attribute name map to
prevent duplicate attributes.
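The "dynamic namespace binding mappings" above amount to a stack of
prefix-to-URI maps, pushed and popped with start/end tags. A minimal
sketch (copy-on-push for simplicity; real parsers optimize this, and
all names here are mine):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of a namespace binding stack: each element scope can add or
// override prefix bindings; resolving an unbound prefix is a
// well-formedness error the parser must report.
public class NamespaceScopes {
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    public void pushScope() { // call on each start tag
        Map<String, String> scope = scopes.isEmpty()
            ? new HashMap<>() : new HashMap<>(scopes.peek());
        scopes.push(scope);
    }

    public void popScope() { scopes.pop(); } // call on each end tag

    public void declare(String prefix, String uri) { // xmlns:prefix="uri"
        scopes.peek().put(prefix, uri);
    }

    // Returns null for an unbound prefix.
    public String resolve(String prefix) {
        return scopes.peek().get(prefix);
    }

    public static void main(String[] args) {
        NamespaceScopes ctx = new NamespaceScopes();
        ctx.pushScope();                      // <root xmlns:a="urn:x">
        ctx.declare("a", "urn:x");
        ctx.pushScope();                      //   <a:child>
        System.out.println(ctx.resolve("a")); // urn:x
        System.out.println(ctx.resolve("b")); // null: unbound prefix
        ctx.popScope();
        ctx.popScope();
    }
}
```

This bookkeeping is cheap per element but it is real work, which is
why skipping it inflates benchmark numbers.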

> The bottom line is that VTD-XML conforms to XML
> spec (with the exception of
> entity part).

Uhh... are you serious? Based on the above description,
it does no such thing.
Handling only well-formed documents properly is not
being compliant. The majority of the spec specifically
deals with what it means for something to be
non-well-formed, i.e. non-XML.

I'm not an XML purist, and implementing the complete
xml 1.0/1.1 spec is a lot of work. But I would never
have thought that obvious well-formedness error checks
(prefix/URI bindings, duplicate attributes) would be
ignored by something that is called an xml parser.

-+ Tatu +-

