dev@fi.java.net

Re: FI parser buffer

From: Brian Pontarelli <brian_at_pontarelli.com>
Date: Wed, 07 Sep 2005 21:57:57 -0500

>
> OK, that is interesting. Are you ensuring that the vocabularies of
> the parsers and serializers between the peers are also shared? I.e.,
> there is no need to reset vocabularies per parse and serialization.
> The parsers/serializers were designed to support such functionality,
> but it is not so obvious how to enable it. I can send a further email
> explaining this if you like.
>
> This has the advantage that message size is reduced and
> parsing/serializing should be faster. However, I would not recommend
> this for the case where the vocabulary can vary quite a lot. It is
> good for small messages whose vocabulary is similar across multiple
> requests/responses.

I think you are asking whether I'm using the same serializer and parser
for the entire communication. If that is the question, I'm not doing
that. I have a thread pool from which I grab a thread to handle each
request/response. That thread constructs a new parser/serializer.
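The pattern above can be sketched roughly as follows. This is a minimal illustration, not my actual server code: the `Parser` class here is a stand-in for the FI parser (the real class would come from the FI jar), and the pool sizes are arbitrary.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PerRequestParser {
    // Stand-in for the FI parser; the real class would be the SAX
    // document parser from the FI implementation (assumption for this sketch).
    static class Parser {
        static final AtomicInteger created = new AtomicInteger();
        Parser() { created.incrementAndGet(); }
        void parse(byte[] doc) { /* would feed doc to a ContentHandler */ }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 8; i++) {
            pool.execute(() -> {
                // A fresh parser per request: nothing (vocabularies included)
                // is shared between requests, so nothing needs resetting.
                new Parser().parse(new byte[0]);
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(Parser.created.get()); // 8 parsers, one per request
    }
}
```

The trade-off, as you note, is that no vocabulary is carried over between messages, so each message is self-describing and somewhat larger.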

> Yes, buffering: if the parser returns a string of X characters, then
> there needs to be some buffer holding the encoding of those X
> characters. This is for performance reasons. With a buffered input
> stream, method calls can be expensive when reading bytes, and there
> are optimizations that can be achieved when reading length-prefixed
> data by combining parsing and buffering. Methods get HotSpot-compiled
> and inlined, for example, when using such techniques.
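To make the "combining parsing and buffering" idea concrete, here is a minimal sketch (not the FI Decoder itself; the record format and buffer sizes are invented for illustration): the decoder fills its own buffer and decodes length-prefixed strings straight out of it, so the fast path never calls the underlying stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LengthPrefixedDecoder {
    private final InputStream in;
    private byte[] buf = new byte[64];
    private int pos, limit;

    LengthPrefixedDecoder(InputStream in) { this.in = in; }

    // Ensure at least n bytes are buffered, growing the buffer if needed.
    private void require(int n) throws IOException {
        if (limit - pos >= n) return;              // fast path: already buffered
        System.arraycopy(buf, pos, buf, 0, limit - pos);
        limit -= pos;
        pos = 0;
        if (buf.length < n) buf = Arrays.copyOf(buf, n);
        while (limit < n) {                        // loop: one read may be short
            int r = in.read(buf, limit, buf.length - limit);
            if (r < 0) throw new IOException("unexpected end of stream");
            limit += r;
        }
    }

    // Read one [length byte][UTF-8 bytes] record directly from the buffer.
    String readString() throws IOException {
        require(1);
        int len = buf[pos++] & 0xFF;
        require(len);
        String s = new String(buf, pos, len, StandardCharsets.UTF_8);
        pos += len;
        return s;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {5, 'h', 'e', 'l', 'l', 'o', 2, 'h', 'i'};
        LengthPrefixedDecoder d = new LengthPrefixedDecoder(new ByteArrayInputStream(data));
        System.out.println(d.readString() + " " + d.readString()); // hello hi
    }
}
```

The point is that once the buffer holds the data, `readString` touches only the array, and the small monomorphic methods are good candidates for HotSpot inlining.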

Hmmmm... I'm interested in a couple of things about this statement,
neither of which has to do with FI. How do you know that the methods
are inlined by HotSpot, and which methods are you talking about: the
BufferedInputStream methods or the FI parser methods?

> Using the system property in the following manner should work:
>
> -Dcom.sun.xml.fastinfoset.parser.buffer-size=16
>
> However, the buffer could be getting resized, as there might be
> encoded strings greater than 16 bytes, e.g. namespace names.

I think it is being resized for larger parts of the document like the
XML document declaration, but I found a way to use the FI parser without
changing this flag.

> The implementation of the decoder is very much dependent on reuse of
> the buffer for the decoding of strings. It relies on the property that
> the input stream is self-contained and contains one or more fast
> infoset documents.
>
> I wonder if other XML parsers do similar things. IIRC, Sun's JAXB
> depends on there being one XML document per input stream, at least
> when using the SAX parser.

Oh yeah, JAXB was quite a trick. We had a complete implementation for
JAXB, and JAXB was quite picky, so I had to write a quick XML stream
reader that knew when the end of the XML document was hit and could
handle the end of the stream and all that magic properly. I think most
parsers and tools not only rely on there being a single document and an
end of stream, but they also do bad things like close streams,
especially in error cases. One of the HUGE reasons I picked this FI
implementation rather than writing my own was that it doesn't manhandle
the InputStream and can reasonably parse multiple FI documents on the
same stream. Great work in that area, by the way.
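The kind of wrapper I wrote for the JAXB case looks roughly like this. It is a sketch, not my actual code: it presents exactly one document's bytes to the parser, reports end-of-stream after them, and never closes the underlying stream. Here the document boundary is a known byte count, which is an assumption for the sketch; in the real reader the boundary is found by watching the document content itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DocumentBoundedStream extends InputStream {
    private final InputStream in;
    private long remaining;

    DocumentBoundedStream(InputStream in, long documentLength) {
        this.in = in;
        this.remaining = documentLength;
    }

    @Override public int read() throws IOException {
        if (remaining == 0) return -1;        // pretend we hit end of stream
        int b = in.read();
        if (b >= 0) remaining--;
        return b;
    }

    @Override public int read(byte[] b, int off, int len) throws IOException {
        if (remaining == 0) return -1;
        int n = in.read(b, off, (int) Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }

    // Deliberately do NOT close the shared underlying stream.
    @Override public void close() { }

    public static void main(String[] args) throws IOException {
        InputStream shared = new ByteArrayInputStream("doc1doc2".getBytes());
        byte[] buf = new byte[4];
        InputStream first = new DocumentBoundedStream(shared, 4);
        int n = first.read(buf);
        System.out.print(new String(buf, 0, n));
        System.out.println(" then eof=" + (first.read() == -1)); // doc1 then eof=true
    }
}
```

A picky parser driven through such a wrapper sees a clean end-of-stream at the document boundary, while the shared connection stays open for the next document.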

> The tricky thing is how to implement this without losing the current
> efficiencies. I am not sure it can be done. Avoiding as many reads as
> possible to the underlying stream is important for performance.

Really? I'm not totally convinced that this is true for all
implementations. I read in using NIO, then buffer that data into my own
InputStream and allow reads from it in blocks or byte by byte. I'm not
convinced that lots of method invocations to read byte by byte really
slow it down that much. My only performance concern is my use of a
volatile variable for the current buffer I read from NIO. Access to it
might be slowed down, but once the NIO thread is finished and the
parser is working, there is no contention. If the NIO thread and the
parsing thread are working concurrently, then it should perform better
under heavier load: access is spread between the thread performing the
NIO and each of my "execute threads" parsing the FI documents via my
InputStream implementation. At most two threads contend for the
volatile variable of each InputStream, and most likely the NIO thread
will be working on an InputStream for a thread that is not being put on
the CPU next.
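The volatile handoff I'm describing is essentially a single-producer/single-consumer publication. Here is a stripped-down sketch of just that mechanism (not my actual InputStream; the spin-wait and single-buffer slot are simplifications): the volatile write by the NIO thread and the volatile read by the parser thread form the happens-before edge, and only those two threads ever touch the field.

```java
public class VolatileHandoff {
    // One NIO (producer) thread publishes a filled buffer; one parser
    // (consumer) thread takes it. The volatile write/read pair makes the
    // buffer contents visible to the consumer.
    private volatile byte[] current;

    void publish(byte[] filled) {          // called by the NIO thread
        current = filled;
    }

    byte[] take() {                        // called by the parser thread
        byte[] b;
        while ((b = current) == null) {
            Thread.onSpinWait();           // a real impl would block, not spin
        }
        current = null;
        return b;
    }

    public static void main(String[] args) throws InterruptedException {
        VolatileHandoff h = new VolatileHandoff();
        Thread nio = new Thread(() -> h.publish("a fast infoset block".getBytes()));
        nio.start();
        System.out.println(new String(h.take())); // a fast infoset block
        nio.join();
    }
}
```

With only two threads per stream, the volatile access is cheap when uncontended, which matches the case where the NIO thread has already finished filling the buffer.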

>
>> One possible solution I've thought of is to make my own InputStream
>> that can determine when the end of the FI document is encountered and
>> can then pretend as though it has reached the end of stream and
>> return -1 from the read operations. This seems slightly orthogonal to
>> what the FI parser is doing for the most part (besides obviously
>> calling to the ContentHandler). It really seems as though the FI
>> implementation should either provide this type of InputStream, or be
>> able to provide a good non-blocking solution to this type of scenario.
>>
>
> I would prefer the latter if possible. I think it will be more
> efficient and provide a clean layering.
>
> This is very much related to what type of transport is used to
> communicate FI messages. Using an open TCP/IP connection directly for
> communicating multiple messages is generally not recommended unless a
> transport is layered on top that provides a framing mechanism (i.e. an
> InputStream of the content). HTTP and BEEP [1] transports provide such
> a mechanism.
>
> Microsoft proposed using DIME [2] as a lightweight framing mechanism,
> an alternative to HTTP, for the transportation of SOAP messages over
> TCP/IP or UDP. You may want to copy this approach if the HTTP protocol
> or BEEP is too heavy for you. Note that DIME is not a standard.
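For reference, the simplest possible framing in the spirit of what is being suggested (this is not DIME or BEEP, just a minimal length-prefix scheme invented for illustration) would prefix each message with a 4-byte big-endian length, so the receiver always knows where one document ends and the next begins:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SimpleFraming {
    // Frame format: [4-byte big-endian length][payload bytes].
    static void writeFrame(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    static byte[] readFrame(DataInputStream in) throws IOException {
        int len = in.readInt();
        byte[] payload = new byte[len];
        in.readFully(payload);        // loops internally until len bytes arrive
        return payload;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(wire);
        writeFrame(out, "request1".getBytes(StandardCharsets.UTF_8));
        writeFrame(out, "request2".getBytes(StandardCharsets.UTF_8));

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(wire.toByteArray()));
        System.out.println(new String(readFrame(in), StandardCharsets.UTF_8) + ","
                + new String(readFrame(in), StandardCharsets.UTF_8)); // request1,request2
    }
}
```

The overhead is 4 bytes per message, far below the figures quoted for BEEP or HTTP, at the cost of providing none of their other semantics.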
>
> It says in [1] that BEEP adds "about 60 octets per exchange, and is
> designed to be simple to parse". I am not sure what the average for
> HTTP would be, but the basic header fields alone could add up to
> about 45 bytes, so BEEP is probably more efficient in this respect.
>

I've actually gotten it working since the message I sent, but at random
it fails because of the "TODO" at line 1245 of the Decoder:

                // TODO keep reading until require bytes have been obtained

The way I managed to get this to work was by using the NIO and custom
InputStream I mentioned above and allowing the FI parser to read in
blocks as it wants. I don't return all the bytes the parser requests
unless I know for certain that I have that many bytes. This works well,
since the FI parser handles the majority of these cases well (sans the
TODO I mentioned).
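For what it's worth, the fix that TODO calls for is the classic read-until-obtained loop: a single `read(byte[], int, int)` may legally return fewer bytes than requested, so the decoder has to keep reading. A minimal sketch (the `TrickleStream` here is an invented test double that simulates a network stream delivering one byte per read):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullyDemo {
    // Simulates a socket stream that returns at most one byte per read().
    static class TrickleStream extends FilterInputStream {
        TrickleStream(InputStream in) { super(in); }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, 1));
        }
    }

    // Keep reading until the required bytes have been obtained
    // (or the stream genuinely ends).
    static int readFully(InputStream in, byte[] b, int off, int len) throws IOException {
        int total = 0;
        while (total < len) {
            int r = in.read(b, off + total, len - total);
            if (r < 0) break;          // genuine end of stream
            total += r;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] src = "fast infoset".getBytes();
        byte[] dst = new byte[src.length];
        int n = readFully(new TrickleStream(new ByteArrayInputStream(src)), dst, 0, dst.length);
        System.out.println(n + " " + new String(dst)); // 12 fast infoset
    }
}
```

With a loop like this in place of the single read, short reads from my InputStream would no longer cause the random failures.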

Therefore, most of what you've mentioned is unnecessary, considering
that the FI parser handles most cases where a block read returns fewer
bytes than requested, and that the parser does a good job of
determining the end of the document, stopping parsing, and not closing
the stream or doing anything else to it. Once the TODO is finished, I
should be able to stream documents in both directions over a TCP/IP
connection without any special protocol around the FI documents.

One question: why isn't it recommended to have an open connection that
streams multiple FI documents in both directions without additional
protocol semantics such as HTTP? I honestly can't see a reason, unless
the end of the FI document were non-deterministic, which it isn't (to
my knowledge). The FI document effectively forms the protocol itself,
in that it defines a fully encapsulated array of bytes that constitutes
a single message. The streams look like this in my case:

            [request4][request3][request2][request1]
            --------------------------------------->
    client                                           server
            <---------------------------------------
        [response4][response3][response2][response1]

Each message is stacked directly after the last byte of the previous
message regardless of direction.

Anyway, your thoughts and experience on this would be really helpful,
because we are putting a large investment into the FI protocol I've
described and I want to ensure that it will work well and makes good
sense.

Thanks,
Brian Pontarelli