users@fi.java.net

Re: VTD speed

From: Santiago Pericas-Geertsen <Santiago.Pericasgeertsen_at_Sun.COM>
Date: Fri, 05 Jan 2007 16:34:22 -0500

Hi Jimmy,

  I'm really interested in learning more about VTD-XML and its XPath
support (I'll get to that next week, I hope). Here is a blog I just
finished about XPath in the RI [1]. In a nutshell, I believe the main
issue is DTMs as explained in that blog. XPath is an area that we'd
like to improve in JAXP.next, so I'd like to look at what people have
done in the last couple of years.

  As for benchmarking, I've created a simple test suite based on
Japex [2] called XPathpex. I haven't published it yet (but I can send
it to you privately if you want). It uses documents from the XMark suite
and it is based on XPathMark, but because it uses Japex, you get nice
reports, multi-threading, etc. A simple driver is all you'd need to
write.

  Thanks for sharing your findings.

-- Santiago

[1] http://weblogs.java.net/blog/spericas/archive/2007/01/whats_next_for_1.html
[2] https://japex.dev.java.net

On Jan 5, 2007, at 2:29 AM, Jimmy Zhang wrote:

> Paul, do you know which XPath implementation JDK 1.5 bundles?
> We benchmarked it and found that the performance is not very good
> (http://www.ximpleware.com/benchmark_xpath.html). Assuming DOM offers
> good random access, the results suggest that something is wrong.
> Is this a known issue?
> Best regards,
> Jimmy Zhang
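
A minimal sketch of how the JDK's bundled XPath could be timed over a DOM tree, using only the standard javax.xml APIs. The class name XPathTiming, the sample document, the expression, and the iteration count are all made up for illustration; this is not the benchmark referenced above, and a serious measurement would need warm-up runs and a harness such as Japex.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any benchmark suite mentioned in the thread.
final class XPathTiming {

    // Parse an XML string into a DOM tree with the JDK's default parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluate the expression once and return the number of matching nodes.
    static int countMatches(Document doc, String expression) {
        try {
            XPathExpression compiled =
                    XPathFactory.newInstance().newXPath().compile(expression);
            NodeList nodes = (NodeList) compiled.evaluate(doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Compile once, evaluate repeatedly, and return the elapsed nanoseconds.
    static long timeNanos(Document doc, String expression, int iterations) {
        try {
            XPathExpression compiled =
                    XPathFactory.newInstance().newXPath().compile(expression);
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                compiled.evaluate(doc, XPathConstants.NODESET);
            }
            return System.nanoTime() - start;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<site><item id='1'/><item id='2'/></site>");
        System.out.println("matches: " + countMatches(doc, "/site/item"));
        System.out.println("nanos for 1000 runs: " + timeNanos(doc, "/site/item", 1000));
    }
}
```

Compiling the expression once and reusing it keeps compilation cost out of the timed loop, so the figure reflects evaluation over the DOM only.
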
>
> ----- Original Message ----- From: "Paul Sandoz" <Paul.Sandoz_at_Sun.COM>
> To: <users_at_fi.dev.java.net>
> Sent: Monday, January 23, 2006 5:12 AM
> Subject: Re: VTD speed
>
>
>> Mark Swanson wrote:
>>>> a lot of objects; VTD-XML doesn't. So VTD-XML
>>>> should hold the edge both in memory and performance...
>>>> The best way is to try it...
>>>
>>>
>>> Tried it. VTD wins by a wide margin in memory and performance
>>> (throughput as well as memory pressure placed on the GC by object
>>> creation) over XmlBeans. I used the VTD Java API and the XmlBeans
>>> API to loop through parse, get some element attribute value as
>>> String 20 times in loops of 5000.
>>>
>>
>> (just been browsing the VTD parsing code, not a String in sight!)
>>
>> VTD is a very efficient parser that performs no instantiation of
>> String objects when parsing. This is fantastic for the scenarios
>> where you only need access to a certain part of the document e.g.
>> routing decisions using XPath come to mind.
>>
>> (IIRC Xerces avoids instantiation of String objects for tags that
>> have previously occurred by using a symbol table of interned Strings.)
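
A minimal sketch of the symbol-table idea mentioned above, assuming a hypothetical SymbolTable class with a fixed number of hash buckets. It is not Xerces's actual implementation; it only shows how a tag name that has been seen before can be returned without allocating a new String.

```java
// Hypothetical symbol table: one shared String per distinct name.
final class SymbolTable {

    // Singly linked chain of symbols that hash to the same bucket.
    private static final class Entry {
        final String symbol;
        final Entry next;

        Entry(String symbol, Entry next) {
            this.symbol = symbol;
            this.next = next;
        }
    }

    // Fixed power-of-two size so the hash can be masked instead of divided.
    private final Entry[] buckets = new Entry[256];

    // Return the shared String for buf[offset .. offset+length), creating it
    // only the first time this character sequence is seen.
    String addSymbol(char[] buf, int offset, int length) {
        int index = hash(buf, offset, length) & (buckets.length - 1);
        for (Entry e = buckets[index]; e != null; e = e.next) {
            if (matches(e.symbol, buf, offset, length)) {
                return e.symbol; // seen before: no allocation
            }
        }
        String symbol = new String(buf, offset, length);
        buckets[index] = new Entry(symbol, buckets[index]);
        return symbol;
    }

    private static int hash(char[] buf, int offset, int length) {
        int h = 0;
        for (int i = 0; i < length; i++) {
            h = 31 * h + buf[offset + i];
        }
        return h;
    }

    private static boolean matches(String s, char[] buf, int offset, int length) {
        if (s.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (s.charAt(i) != buf[offset + i]) {
                return false;
            }
        }
        return true;
    }
}
```

Because the lookup compares against the raw character buffer, a repeated tag costs a hash and a comparison rather than a new object, and callers can later compare the returned names by reference.
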
>>
>> As far as I can tell from the code, it looks like the Java VTD
>> parser is not performing any namespace validation when the
>> document is parsed e.g. no checking if a prefix of an element or
>> attribute is in-scope. As a consequence duplicate attributes are
>> not fully checked at parsing (the local name and namespace URI
>> need to be checked in addition to checking for attributes with the
>> same qualified name).
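
To illustrate the point about duplicate attributes, here is a minimal sketch assuming a hypothetical AttributeChecks helper that receives the attribute qualified names of one element plus the in-scope prefix bindings. Two attributes such as a:x and b:x have different qualified names but are duplicates if both prefixes are bound to the same namespace URI. This is not code from VTD-XML or any other parser.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, for illustration only.
final class AttributeChecks {

    // Returns true if two attributes share the same (namespace URI, local name)
    // pair. Throws if a prefix is not in scope.
    static boolean hasDuplicateExpandedNames(List<String> qualifiedNames,
                                             Map<String, String> inScopePrefixes) {
        Set<String> seen = new HashSet<>();
        for (String qname : qualifiedNames) {
            int colon = qname.indexOf(':');
            String uri;
            String localName;
            if (colon < 0) {
                // Unprefixed attributes are in no namespace.
                uri = "";
                localName = qname;
            } else {
                String prefix = qname.substring(0, colon);
                uri = inScopePrefixes.get(prefix);
                if (uri == null) {
                    throw new IllegalArgumentException("unbound prefix: " + prefix);
                }
                localName = qname.substring(colon + 1);
            }
            // "{uri}local" is used here as a simple key for the expanded name.
            if (!seen.add("{" + uri + "}" + localName)) {
                return true;
            }
        }
        return false;
    }
}
```

A check on qualified names alone would accept a:x next to b:x; the expanded-name check catches it, but only once prefix bindings have been resolved.
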
>>
>> VTD parsing avoids a lot of the work performed by other APIs and
>> implementations, some related to instantiation of objects and some
>> related to checking of the XML. As a result of the latter, it may
>> not be known whether a non-well-formed document is being parsed,
>> and in some cases it will never be known because the
>> non-well-formed parts of the document will never be navigated.
>>
>> If you need to access a significant portion of the document, e.g.
>> for processing SOAP header blocks and data binding the payload of
>> a SOAP message, then from the code at least I think it likely
>> that, for a UTF-8 encoded SOAP message, UTF-8 decoding will be
>> performed twice on nearly all relevant characters (once when parsing
>> to determine offsets, and again when iterating through the
>> document). In addition, string equality is checked on a per
>> character basis, whereas if strings are interned a binding tool can
>> check for equality using '==' (especially useful for namespace URIs).
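
A minimal sketch contrasting the two kinds of comparison described above. The NameComparison class and its method names are hypothetical and are not taken from VTD-XML or any binding tool; matchesAscii also assumes the name contains only ASCII characters, so that one byte corresponds to one character.

```java
// Hypothetical helper, for illustration only.
final class NameComparison {

    // SOAP 1.1 envelope namespace. String literals are interned by the JVM.
    static final String SOAP_11_ENVELOPE_NS =
            "http://schemas.xmlsoap.org/soap/envelope/";

    // Reference comparison: only valid if the caller guarantees that
    // internedUri came from String.intern() or a shared symbol table.
    static boolean isSoapEnvelope(String internedUri) {
        return internedUri == SOAP_11_ENVELOPE_NS;
    }

    // Per-character comparison against a byte range of the raw document.
    // ASCII-only assumption: each byte is treated as one character.
    static boolean matchesAscii(byte[] doc, int offset, int length, String expected) {
        if (length != expected.length()) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if ((doc[offset + i] & 0xFF) != expected.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```

The reference check is a single pointer comparison regardless of URI length, while the byte-range check walks every character each time it is called.
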
>>
>> So it is swings and roundabouts, ya choose ya model to best suit
>> ya needs. VTD looks like a great model for some XML processing
>> scenarios but definitely not for all.
>>
>>
>>> I'm now even more intrigued by an FI/VTD combo.
>>>
>>
>> This may be possible, although the VTD representation of the
>> document in memory may require some changes. It is certainly
>> possible to reference the FI document for literal strings. Indexed
>> qualified names and strings may be more problematic.
>>
>> Paul.
>>
>> --
>> | ? + ? = To question
>> ----------------\
>> Paul Sandoz
>> x38109
>> +33-4-76188109
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe_at_fi.dev.java.net
>> For additional commands, e-mail: users-help_at_fi.dev.java.net
>>