Re: FI parser buffer

From: Paul Sandoz <Paul.Sandoz_at_Sun.COM>
Date: Wed, 07 Sep 2005 15:35:48 +0200

Hi Brian,

Apologies for the delay in responding; I am just back from holiday.

Brian Pontarelli wrote:
>
> I'm working on a project that is using FI documents in a stateful
> protocol. Currently we are using the standard request/response paradigm
> for the protocol, but the socket and the streams aren't closed because
> the protocol is stateful.
>

OK. That is interesting. Are you ensuring that the vocabularies of the
parsers and serializers are also shared between the peers, i.e. that
there is no need to reset vocabularies for each parse and
serialization? The parsers/serializers were designed to support such
functionality, but it is not so obvious how to enable it. I can send a
further email explaining this if you like.

This has the advantage that message size is reduced and
parsing/serializing should be faster. However, I would not recommend
this for the case where the vocabulary can vary quite a lot. It is good
for small messages whose vocabulary is similar across multiple
requests/responses.

> Currently, it appears that the code for the FI parser uses by default a
> buffer size of 1024 although there appears to be a configuration
> parameter that can change this. This causes some problems because our
> streams never reach the end (i.e. aren't closed) and therefore the
> stream being used by the parser never receives a -1 result from a read
> operation. In addition, if our documents are small, which most are, the
> parser blocks, waiting for the buffer to read the full 1024.
>
> My questions are what was the design decision behind this (i.e. to
> support buffered streams I would assume)?

Yes, buffering. If the parser returns a string of X characters then
there needs to be some buffer holding the encoding of those X
characters. This is for performance reasons. With a buffered input
stream, method calls can be expensive when reading bytes, and there are
optimizations that can be achieved when reading length-prefixed data by
combining parsing and buffering. For example, methods get compiled and
inlined by HotSpot when using such techniques.
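
As a rough illustration of combining parsing and buffering, the sketch
below decodes length-prefixed strings straight out of a byte array the
decoder owns, so there is one bulk read and no per-byte method call on
the stream. This is not the FI decoder: the class name, the one-byte
length prefix and the UTF-8 encoding are all assumptions made for the
example.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical decoder, for illustration only.
final class BufferedStringDecoder {
    private final InputStream in;
    private byte[] buf;
    private int pos; // index of the next unread byte in buf
    private int end; // one past the last valid byte in buf

    BufferedStringDecoder(InputStream in, int bufferSize) {
        this.in = in;
        this.buf = new byte[bufferSize];
    }

    // Ensure at least n unread bytes are buffered, growing the buffer
    // when a single item is larger than the current capacity.
    private void ensure(int n) throws IOException {
        if (end - pos >= n) {
            return;
        }
        if (n > buf.length) {
            byte[] bigger = new byte[n];
            System.arraycopy(buf, pos, bigger, 0, end - pos);
            buf = bigger;
        } else {
            // Slide the unread bytes to the front to make room.
            System.arraycopy(buf, pos, buf, 0, end - pos);
        }
        end -= pos;
        pos = 0;
        while (end < n) {
            int r = in.read(buf, end, buf.length - end);
            if (r < 0) {
                throw new EOFException("stream ended inside an item");
            }
            end += r;
        }
    }

    // Assumed wire format: one length byte (0..255), then UTF-8 bytes.
    String readString() throws IOException {
        ensure(1);
        int length = buf[pos++] & 0xFF;
        ensure(length);
        // Decode directly from the buffer; no intermediate copy per byte.
        String s = new String(buf, pos, length, StandardCharsets.UTF_8);
        pos += length;
        return s;
    }
}
```

Note that `ensure` only loops until the bytes it actually needs have
arrived; whether the real parser behaves the same way is not something
this sketch can confirm.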

> What overall impact would
> reducing the size to something like 16 have? I can't seem to get the
> system property to take hold, even when used command line, any ideas
> why? and finally, wouldn't it make more sense to have a resizeable
> buffer rather than a fixed sized buffer?
>

Using the system property in the following manner should work:

-Dcom.sun.xml.fastinfoset.parser.buffer-size=16

However, the buffer may still get resized, as there might be encoded
strings longer than 16 bytes, e.g. namespace names.
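
One way to check whether the property is actually reaching the JVM is
to read it back from inside the application. The sketch below does only
that. The property name is the one given above; the helper class and
its fallback to 1024 are assumptions for illustration and say nothing
about how or when the parser itself reads the value.

```java
// Hypothetical helper, for illustration only.
final class BufferSizeProperty {
    static final String NAME = "com.sun.xml.fastinfoset.parser.buffer-size";
    static final int DEFAULT_SIZE = 1024; // default mentioned in this thread

    static int resolve() {
        // Integer.getInteger reads the named system property and parses
        // it as an int, returning the default if it is unset or malformed.
        int size = Integer.getInteger(NAME, DEFAULT_SIZE);
        return size > 0 ? size : DEFAULT_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(NAME + " = " + resolve());
    }
}
```

On the command line the -D option has to appear before the main class
name (or before -jar); anything placed after it is passed to the
program as an ordinary argument and never becomes a system property.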

The implementation of the decoder is very much dependent on reuse of the
buffer for the decoding of strings. It relies on the property that the
input stream is self-contained and contains one or more fast infoset
documents.

I wonder if other XML parsers do similar things. IIRC Sun's JAXB depends
on there being one XML document per input stream, at least when using
the SAX parser.

> The reason I suggest the resizeable buffer is that it seems more
> ideal to parse what you can immediately, rather than blocking. Besides,
> when parsing the header, the buffer in some cases need be only 4 bytes
> long.
>

The tricky thing is how to implement this without losing the current
efficiencies. I am not sure it can be done. Avoiding as many reads as
possible on the underlying stream is important for performance.

> Because our solution is relying on the FI parser to determine where the
> end of the incoming message is so that the message can then be
> processed, any blocking operation comes at a large price.

Understood.

> One possible
> solution I've thought of is to make my own InputStream that can
> determine when the end of the FI document is encountered and can then
> pretend as though it has reached the end of stream and return -1 from
> the read operations. This seems slightly orthogonal to what the FI
> parser is doing for the most part (besides obviously calling to the
> ContentHandler). It really seems as though the FI implementation should
> either provide this type of InputStream, or be able to provide a good
> non-blocking solution to this type of scenario.
>

I would prefer the latter if possible. I think it will be more efficient
and provide a clean layering.

This is very much related to what type of transport is used to
communicate FI messages. Using an open TCP/IP connection directly for
communicating multiple messages is generally not recommended unless a
transport is layered on top that provides a framing mechanism (i.e. an
InputStream of the content). HTTP and BEEP [1] transports provide such a
mechanism.
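
To make the idea of a framing layer concrete, here is a minimal sketch
of the simplest possible scheme: each message is preceded by its length
as a 4-byte big-endian integer, and the reader hands back an
InputStream that ends exactly where the message ends. This is not HTTP,
BEEP or DIME, and the class and method names are invented for the
example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical framing helper, for illustration only.
final class LengthPrefixedFraming {

    // Write one frame: a 4-byte length followed by the message bytes.
    static void writeFrame(OutputStream out, byte[] message) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(message.length);
        data.write(message);
        data.flush();
    }

    // Read one frame and return it as a self-contained stream. The
    // returned stream reports -1 at the end of the message even though
    // the underlying connection stays open.
    static InputStream readFrame(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int length = data.readInt();
        if (length < 0) {
            throw new IOException("negative frame length: " + length);
        }
        // A real protocol would also cap the length it is willing to accept.
        byte[] message = new byte[length];
        data.readFully(message); // blocks until the whole frame has arrived
        return new ByteArrayInputStream(message);
    }
}
```

The stream returned by `readFrame` could then be given to a parser,
which sees a normal end of stream after each message. The cost is one
copy of the message plus 4 bytes of overhead per frame.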

Microsoft proposed using DIME [2] as a lightweight framing alternative
to HTTP for transporting SOAP messages over TCP/IP or UDP. You may want
to copy this approach if HTTP or BEEP is too heavy for you. Note that
DIME is not a standard.

It says in [1] that BEEP adds "about 60 octets per exchange, and is
designed to be simple to parse". I am not sure what the average for HTTP
would be, but for the basic header fields alone this could add up to
about 45 bytes so BEEP is probably more efficient in this respect.

Paul.

[1] http://www.beepcore.org/
[2] http://msdn.microsoft.com/library/en-us/dnglobspec/html/draft-nielsen-dime-02.txt

> Thoughts or suggestions would help immensely.
>
> Thanks,
> Brian Pontarelli
>

-- 
| ? + ? = To question
----------------\
    Paul Sandoz
         x38109
+33-4-76188109