Re: VTD speed

From: Jimmy Zhang <crackeur_at_comcast.net>
Date: Wed, 21 Feb 2007 16:12:38 -0800

I have noticed a pretty good speedup in JDK 6's XPath performance...
What did you guys do?
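
For anyone who wants to check this on their own JDK, here is a minimal
sketch of timing an XPath evaluation through the standard javax.xml.xpath
API. The class name XPathTiming, the helper countMatches, the sample
document and the expression are all made up for illustration; it is a crude
wall-clock measurement with no warm-up, not a proper benchmark harness.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathTiming {
    // Hypothetical helper: parse the XML into a DOM, compile the expression,
    // and return how many nodes it selects.
    static int countMatches(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPathExpression compiled =
                    XPathFactory.newInstance().newXPath().compile(expression);
            NodeList nodes = (NodeList) compiled.evaluate(doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample document; a real test would load a large file.
        String xml = "<site><item id='1'/><item id='2'/><other/></site>";
        long start = System.nanoTime();
        int total = 0;
        for (int i = 0; i < 1000; i++) {
            // Each iteration includes parsing and compiling, not only evaluation.
            total += countMatches(xml, "/site/item");
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(total + " matches in " + elapsedMs + " ms");
    }
}
```

Running the same class under JDK 1.5 and JDK 6 gives a rough side-by-side
number; separating parse time from evaluation time would need two timers.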
----- Original Message -----
From: "Santiago Pericas-Geertsen" <Santiago.Pericasgeertsen_at_Sun.COM>
To: <users_at_fi.dev.java.net>
Cc: <users_at_jaxp.dev.java.net>
Sent: Friday, January 05, 2007 1:34 PM
Subject: Re: VTD speed

> Hi Jimmy,
>
> I'm really interested in learning more about VTD-XML and its XPath
> support (I'll get to that next week, I hope). Here is a blog I just
> finished about XPath in the RI [1]. In a nutshell, I believe the main
> issue is DTMs as explained in that blog. XPath is an area that we'd
> like to improve in JAXP.next, so I'd like to look at what people have
> done in the last couple of years.
>
> As for benchmarking, I've created a simple test suite based on
> Japex [2] called XPathpex. I haven't published it yet (but I can send
> it to you privately if you want). It uses documents from the XMark suite
> and it is based on XPathMark, but because it uses Japex, you get nice
> reports, multi-threading, etc. A simple driver is all you'd need to
> write.
>
> Thanks for sharing your findings.
>
> -- Santiago
>
> [1] http://weblogs.java.net/blog/spericas/archive/2007/01/whats_next_for_1.html
> [2] https://japex.dev.java.net
>
> On Jan 5, 2007, at 2:29 AM, Jimmy Zhang wrote:
>
>> Paul, do you know which XPath implementation JDK 1.5 bundles?
>> We benchmarked it and found that the performance is not very
>> good (http://www.ximpleware.com/benchmark_xpath.html). Assuming
>> DOM offers good random access, the results suggest that
>> something is wrong. Is this a known issue?
>> Best regards,
>> Jimmy Zhang
>>
>> ----- Original Message -----
>> From: "Paul Sandoz" <Paul.Sandoz_at_Sun.COM>
>> To: <users_at_fi.dev.java.net>
>> Sent: Monday, January 23, 2006 5:12 AM
>> Subject: Re: VTD speed
>>
>>
>>> Mark Swanson wrote:
>>>>> a lot of objects; VTD-XML doesn't. So VTD-XML
>>>>> should hold the edge both in memory and performance...
>>>>> The best way is to try it...
>>>>
>>>>
>>>> Tried it. VTD wins by a wide margin in memory and performance
>>>> (throughput as well as memory pressure placed on the GC by object
>>>> creation) over XmlBeans. I used the VTD Java API and the XmlBeans
>>>> API to loop through parse, get some element attribute value as
>>>> String 20 times in loops of 5000.
>>>>
>>>
>>> (just been browsing the VTD parsing code, not a String in sight!)
>>>
>>> VTD is a very efficient parser that performs no instantiation of
>>> String objects when parsing. This is fantastic for the scenarios
>>> where you only need access to a certain part of the document e.g.
>>> routing decisions using XPath come to mind.
>>>
>>> (IIRC Xerces avoids instantiation of String objects for tags that
>>> have previously occurred by using a symbol table of interned Strings.)
>>>
>>> As far as I can tell from the code, it looks like the Java VTD
>>> parser is not performing any namespace validation when the
>>> document is parsed, e.g. no checking if a prefix of an element or
>>> attribute is in-scope. As a consequence, duplicate attributes are
>>> not fully checked at parsing (the local name and namespace URI
>>> need to be checked in addition to checking for attributes with the
>>> same qualified name).
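
To make the duplicate-attribute point concrete: two attributes can have
different qualified names and still clash once their prefixes are resolved.
The sketch below is only an illustration of that check, not code from VTD
or any other parser; the class DuplicateAttributeCheck, its method and the
prefix map are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DuplicateAttributeCheck {
    // Returns true if two attributes share the same (namespace URI, local name).
    // qNames are attribute qualified names such as "a:id" or "id";
    // prefixToUri holds the namespace bindings in scope for the element.
    static boolean hasDuplicateExpandedNames(List<String> qNames,
                                             Map<String, String> prefixToUri) {
        Set<String> seen = new HashSet<>();
        for (String qName : qNames) {
            int colon = qName.indexOf(':');
            String uri = "";          // unprefixed attributes are in no namespace
            String local = qName;
            if (colon >= 0) {
                String prefix = qName.substring(0, colon);
                local = qName.substring(colon + 1);
                uri = prefixToUri.get(prefix);
                if (uri == null) {
                    // The in-scope prefix check mentioned above.
                    throw new IllegalArgumentException("Prefix not in scope: " + prefix);
                }
            }
            // add() returns false when the expanded name was already present.
            if (!seen.add("{" + uri + "}" + local)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // <e xmlns:a="urn:x" xmlns:b="urn:x" a:id="1" b:id="2"/>
        Map<String, String> bindings = new HashMap<>();
        bindings.put("a", "urn:x");
        bindings.put("b", "urn:x");
        // Different qualified names, same expanded name: prints "true".
        System.out.println(hasDuplicateExpandedNames(Arrays.asList("a:id", "b:id"), bindings));
    }
}
```

A check on qualified names alone would accept that element, which is why
the namespace URI and local name have to be compared as well.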
>>>
>>> VTD parsing avoids a lot of work that other APIs and
>>> implementations perform, some related to instantiation of
>>> objects and some related to checking of the XML. For the latter,
>>> it may not be known whether a non-well-formed document is being
>>> parsed, and in some cases it will never be known because the
>>> non-well-formed parts of the document will never be navigated.
>>>
>>> If you need to access a significant portion of the document, e.g.
>>> for processing SOAP header blocks and data binding the payload of
>>> a SOAP message, then from the code at least I think it likely
>>> that, for a UTF-8 encoded SOAP message, UTF-8 decoding will be
>>> performed twice on nearly all relevant characters (once by parsing
>>> to determine offsets, the second when iterating through the
>>> document). In addition, string equality is performed on a per
>>> character basis, whereas if a string is interned a binding tool can
>>> check for equality using '==' (especially useful for namespace URIs).
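
The '==' remark can be shown with plain java.lang.String. This is a small
sketch of the general technique only; the class InternedEquality and the
helper sameInterned are made-up names, and nothing here is taken from a
particular parser or binding tool.

```java
public class InternedEquality {
    // Reference comparison: only meaningful if both arguments were interned.
    static boolean sameInterned(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        // new String(...) forces two distinct objects with equal contents,
        // much as decoding the same bytes twice would.
        String first = new String("http://schemas.xmlsoap.org/soap/envelope/");
        String second = new String("http://schemas.xmlsoap.org/soap/envelope/");

        // Character-by-character comparison: prints "true".
        System.out.println(first.equals(second));
        // Distinct objects, so the reference comparison fails: prints "false".
        System.out.println(sameInterned(first, second));
        // intern() returns one canonical instance per value: prints "true".
        System.out.println(sameInterned(first.intern(), second.intern()));
    }
}
```

The trade-off is that interning costs a table lookup up front, so it pays
off mainly for values compared many times, such as namespace URIs.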
>>>
>>> So it is swings and roundabouts, ya choose ya model to best suit
>>> ya needs. VTD looks like a great model for some XML processing
>>> scenarios but definitely not for all.
>>>
>>>
>>>> I'm now even more intrigued by an FI/VTD combo.
>>>>
>>>
>>> This may be possible, although the VTD representation of the
>>> document in memory may require some changes. It is certainly
>>> possible to reference the FI document for literal strings. Indexed
>>> qualified names and strings may be more problematic.
>>>
>>> Paul.
>>>
>>> --
>>> | ? + ? = To question
>>> ----------------\
>>> Paul Sandoz
>>> x38109
>>> +33-4-76188109
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe_at_fi.dev.java.net
>>> For additional commands, e-mail: users-help_at_fi.dev.java.net
>>>
>>
>