Re: VTD speed

From: Jimmy Zhang <crackeur_at_comcast.net>
Date: Mon, 23 Jan 2006 16:05:51 -0800

Thanks for the comment.

VTD-XML outperforms SAX (namespace-aware or not) because it
minimizes object allocation.
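
To make "minimizes object allocation" concrete, here is a rough Java
sketch of the general idea: record each token as an offset/length pair
packed into a long that points back into the original buffer, instead of
creating a String per token. The class TokenIndex, its methods, and the
32/32-bit packing are made up for this illustration and are not
VTD-XML's actual record layout or API. The scanner is a toy that only
finds start-tag names and does no well-formedness checking.

```java
import java.util.Arrays;

class TokenIndex {
    // Hypothetical packing: high 32 bits hold the offset, low 32 bits the length.
    static long pack(int offset, int length) {
        return ((long) offset << 32) | (length & 0xFFFFFFFFL);
    }

    static int offset(long rec) { return (int) (rec >>> 32); }

    static int length(long rec) { return (int) rec; }

    /** Records where each start-tag name sits in doc; no String is created per name. */
    static long[] indexStartTagNames(char[] doc) {
        long[] recs = new long[16];
        int n = 0;
        for (int i = 0; i < doc.length - 1; i++) {
            if (doc[i] != '<') continue;
            char c = doc[i + 1];
            // Skip end tags, comments/declarations and processing instructions.
            if (c == '/' || c == '!' || c == '?') continue;
            int start = i + 1;
            int end = start;
            while (end < doc.length && doc[end] != '>' && doc[end] != '/'
                    && !Character.isWhitespace(doc[end])) {
                end++;
            }
            if (n == recs.length) recs = Arrays.copyOf(recs, n * 2);
            recs[n++] = pack(start, end - start);
            i = end - 1; // resume scanning right after the name
        }
        return Arrays.copyOf(recs, n);
    }

    /** Compares a recorded token with a name directly against the buffer. */
    static boolean matches(char[] doc, long rec, String name) {
        int off = offset(rec);
        int len = length(rec);
        if (len != name.length()) return false;
        for (int k = 0; k < len; k++) {
            if (doc[off + k] != name.charAt(k)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        char[] doc = "<root><item id='1'/><item/></root>".toCharArray();
        for (long rec : indexStartTagNames(doc)) {
            System.out.println(offset(rec) + "+" + length(rec)
                    + " item? " + matches(doc, rec, "item"));
        }
    }
}
```

The only allocation that grows with the document is the long[] itself;
a String is built only if the application asks for one.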

XML data is often hierarchical, which means the best
way to process it is to allow random access.

Another issue with SAX: suppose the doc is not well-formed, say,
the last character is not '>'. If the code making use of SAX only
processes part of the XML and stops, the well-formedness error is
never actually detected, and it probably won't matter ...
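
A small Java sketch of that scenario, using the standard JAXP SAX API.
The class SaxLateError, the Found exception and the parse helper are
names invented for this example. The handler aborts the parse by
throwing as soon as it sees the element it wants, a common way to stop
a SAX parse early.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class SaxLateError {
    /** Hypothetical marker exception: thrown by the handler to stop parsing early. */
    static final class Found extends SAXException {
        Found(String name) { super(name); }
    }

    /**
     * Returns "found" if the handler stopped at the wanted element,
     * "error" if the parser reported a fatal error first,
     * "complete" if the whole document was parsed without either.
     */
    static String parse(String xml, final String wanted) {
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) throws SAXException {
                if (qName.equals(wanted)) throw new Found(qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return "complete";
        } catch (Found f) {
            return "found";   // stopped before the parser reached the broken tail
        } catch (SAXParseException e) {
            return "error";   // parser read far enough to hit the malformed end
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String broken = "<root><a/><b/></root"; // last character is not '>'
        System.out.println(parse(broken, "a"));
        System.out.println(parse(broken, "nosuch"));
    }
}
```

With the broken document, a handler that stops at the first <a> never
learns the input was malformed, while a handler that reads to the end
gets the parser's fatal error.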

I guess VTD-XML and SAX each have their pros and cons, so a
direct apples-to-apples comparison is hard ...

----- Original Message -----
From: "Tatu Saloranta" <cowtowncoder_at_yahoo.com>
To: <users_at_fi.dev.java.net>
Sent: Monday, January 23, 2006 3:22 PM
Subject: Re: VTD speed

> --- Jimmy Zhang <crackeur_at_comcast.net> wrote:
>
>> > As far as i can tell from the code it looks like
>> the Java VTD parser is
>> > not performing any namespace validation when the
>> document is parsed e.g.
>> > no checking if a prefix of an element or attribute
>> is in-scope. As a
>> > consequence duplicate attributes are not fully
>> checked at parsing (the
>> > local name and namespace URI need to be checked in
>> addition to checking
>> > for attributes with the same qualified name).
>> >
>> > VTD parsing avoids a lot of work, some related to
>> instantiation of objects
>> > and some related to checking of the XML, when
>> performed by other APIs and
>> > implementations. For the latter it may not be known
>> if a non-well-formed
>> > document is being parsed, and in some cases it
>> will never be known because
>> > the non-well-formed parts of the document will
>> never be navigated.
>> >
>> The spirit of VTD is that one simply doesn't have
>> to, and has every
>> incentive
>> not to, create a lot of objects, which are not only
>> slow to allocate, even
>> worse they
>> need to eventually be garbage collected.
>
> There are 2 separate issues though: lack of
> well-formedness checks (which many people would
> consider bugs), and trying to avoid object creation.
> More about former case later on.
>
> In latter case, one really should profile things.
> Quite frankly, at least with respect to various names
> (attribute, element, ns, prefix), all modern xml
> parsers do name intern'ing (or similar approach), and
> vast majority of String objects are shared (either
> within a single document, or globally).
> Further, with current JVMs, cost of allocating
> short-lived objects is very low; and even
> garbage-collecting them is rather cheap. So even
> though there are still benefits with trying to avoid
> object creation, benefits are generally smaller than
> what people think they are. What this comes down to is
> this: just avoiding all object allocations is no
> guarantee of fast parsing. Actual measurements are, as
> long as they are apples-to-apples.
>
>> Most of the work VTD-XML avoids are *overwhelmingly*
>> the right things to do,
>
> "right" as in and based on... ? Well-formedness checks
> it skips?
>
>> and
>> what has stymied the performance and memory usage of
>> DOM and SAX.
>
> Can you point to examples of this, esp. regarding SAX? I
> am really not convinced about the statement: SAX
> presents textual content as character arrays, which
> only needs charset decoding (which, granted, could be
> avoided -- but then you usually end up DOUBLE decoding
> if and when you actually do need to process the
> content).
> Have you profiled performance bottlenecks of SAX
> parsers to reach this conclusion?
>
> I mean, with DOM I can agree with the statement (since
> specs require so many odd things to be supported,
> adding some unavoidable overhead), but with SAX I'm
> not sure either memory usage or performance really has
> that much to do with object allocations these days. If
> I went ahead, and commented out String allocation code
> in, say, Xerces, I wouldn't expect to see a major
> speed improvement (based on profiling other xml
> parsers): at most maybe 15-20% speedup.
> The only remaining obvious thing would be how
> attribute values are accessed: SAX does require
> constructing of a Map with String values. But even
> this does not seem like a gigantic performance
> concern.
>
>> Then there is namespace checking.
>>
> ...
>>
>> What if the namespace has an error?
>> SAX would simply spit an error/exception during
>> parsing. VTD-XML handles
>> this differently: you can still navigate the
>> document but won't be able to
>> locate the
>> elements/attributes containing errors during
>> navigation. The binding is
>> late?
>>
>> What about the non-wellformedness of URI:localname
>> not being unique?
>>
>> Some well-formedness errors are due to careless
>> errors, but not this one.
>> We
>> feel that so rare will this actually happen that the
>> only reason it will
>> happen is
>> because (1) someone is abusing the namespace (2) he
>> intentionally wants to
>> make
>
> How about the usual case of (3) someone's XML output
> code has bugs?
>
>> a bad/confusing XML. So in a way, this part of
>> checking defined in namespace
>> spec is overly rigid, and quite problematic...
>
> No. That you can not (or chose not to) implement a
> mandatory part of namespace specification does not
> imply that specification itself is wrong.
> I may not be big fan of NS specs, but to me choice is
> clear: either you implement it, or you don't. But in
> latter case you can not claim you implement it but
> only parts you happen to like.
> But even without namespaces, one really really should
> check for duplicate attributes.
>
> Additionally, not doing these necessary parts of
> namespace resolution (etc), you are doing
> optimizations that give unrealistic picture of
> performance; so comparison between SAX parsers and
> VTD-XML are apples-to-oranges comparisons. A somewhat
> significant overhead in xml parsing does indeed come
> from having to keep dynamic namespace binding
> mappings, as well as attribute name map to prevent
> duplicate attributes.
>
>> The bottom line is that VTD-XML conforms to XML
>> spec (with the exception of
>> entity part).
>
> Uhh... are you serious? Based on above description, it
> does no such thing.
> Handling properly just well-formed document is not
> being compliant. Majority of the spec specifically
> deals with what it means for something to be
> non-well-formed, ie. non-XML.
>
> I'm not an XML purist, and implementing complete xml
> 1.0/1.1 spec is lots of work. But I would never have
> thought that obvious well-formedness error checks
> (prefix/URI bindings, duplicate attributes) would be
> ignored by what is called an xml parser.
>
> -+ Tatu +-