dev@grizzly.java.net

Re: [Fwd: grizzly versus mina]

From: charlie hunt <charlie.hunt_at_sun.com>
Date: Fri, 25 May 2007 10:22:03 -0500

Robert Greig wrote:
> On 25/05/07, charlie hunt <charlie.hunt_at_sun.com> wrote:
>
>> An approach that I find has been working very well on the reading side
>> is a general approach where, upon receiving a read event notification,
>> you read as much data as can be read into a ByteBuffer. Then, you ask a
>> message parser that knows the protocol to parse the data just read into
>> messages. As messages are parsed, give those messages to a protocol
>> processor. If you are left with a partial message as the last message
>> in your ByteBuffer, you continue to try to read more data. This is a
>> condition I call "expecting more data".
>
> This is essentially what we do in Qpid, in MINA terminology this is
> the "CumulativeProtocolDecoder".
>
> In our broker we have certain cases where messages coming in can go
> straight out to one or more consumers so we take care in that decoder
> not to compact() the ByteBuffer in the event you get a number of
> unprocessed bytes at the end. Avoiding the compact means we can just
> take a slice() of the data when handing it off to the next stage.

Makes perfect sense.

Obviously you want to avoid unnecessary work. I suppose "message parser"
isn't the best term for the initial parsing of the raw bytes? Perhaps I
should have described it as something other than parsing the bytes into
messages. In your particular case it sounds like you want to parse just
enough information to figure out whether the bytes should go straight
out before doing any kind of compacting.

What you've done sounds like a very good decision.
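For anyone following along, the slice-instead-of-compact hand-off looks
roughly like the sketch below. It's just an illustration, not Qpid's
decoder; the end-of-last-complete-message offset would come from the
protocol parser, and because the slice shares the underlying bytes, the
buffer must not be compacted while a downstream consumer still holds the
view.

import java.nio.ByteBuffer;

public final class SliceHandOff {

    // Returns a view over the complete messages, i.e. the bytes from the
    // buffer's position up to endOfLastCompleteMessage, and advances the
    // position past them. The partial message at the end stays in place.
    public static ByteBuffer sliceCompleteMessages(ByteBuffer buffer,
                                                   int endOfLastCompleteMessage) {
        int savedLimit = buffer.limit();
        buffer.limit(endOfLastCompleteMessage);
        ByteBuffer handOff = buffer.slice();   // shares the bytes, no copy
        buffer.position(endOfLastCompleteMessage);
        buffer.limit(savedLimit);
        return handOff;
    }
}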

>
>> As long as you are "expecting more data", you
>> use a temporary Selector to wait for more data. When you are no longer
>> "expecting more data", you can consider the overall read event done.
>
> What is the motivation for that? In Qpid, the SocketIOProcessor is
> constantly reading off the socket(s) it is responsible for.
> What is the cost of creating a selector and tearing it down?

So, if I understand correctly, the SocketIOProcessor is dedicated not
only to a socket, but also to a Thread, correct?

As for the cost of creating a temporary selector and tearing it down
.... here's a little secret ;-) Don't constantly create and tear down a
Selector. Either keep a pool of temporary Selectors and get one from
the pool when you need one, or dedicate a temporary Selector to each
connection. For a large number of connections you'd want to go with the
former approach; for a small number of connections the latter approach
may work better.
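To make that concrete, here's a rough sketch (not actual Grizzly code)
of a small pool of temporary Selectors plus a read helper that keeps
reading while "expecting more data". The isMessageIncomplete() check is
a placeholder for whatever the protocol's message parser provides, and
the SocketChannel is assumed to be non-blocking.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class TemporarySelectorReader {

    private final BlockingQueue<Selector> pool;

    public TemporarySelectorReader(int poolSize) throws IOException {
        pool = new ArrayBlockingQueue<Selector>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            pool.add(Selector.open());   // created once, reused, never torn down per read
        }
    }

    public int read(SocketChannel channel, ByteBuffer buffer, long timeoutMillis)
            throws IOException, InterruptedException {
        int total = 0;
        int n;
        while ((n = channel.read(buffer)) > 0) {   // drain what is immediately available
            total += n;
        }
        if (n != -1 && isMessageIncomplete(buffer)) {
            // Partial message left: borrow a temporary Selector and wait briefly
            // for the rest instead of bouncing back to the main Selector thread.
            Selector tmp = pool.take();
            SelectionKey key = channel.register(tmp, SelectionKey.OP_READ);
            try {
                while (isMessageIncomplete(buffer) && tmp.select(timeoutMillis) > 0) {
                    while ((n = channel.read(buffer)) > 0) {
                        total += n;
                    }
                    if (n == -1) {
                        break;                     // peer closed the connection
                    }
                }
            } finally {
                key.cancel();
                tmp.selectNow();                   // flush the cancelled key
                pool.put(tmp);
            }
        }
        return total;
    }

    // Placeholder: a real implementation asks the protocol's message parser.
    private boolean isMessageIncomplete(ByteBuffer buffer) {
        return false;
    }
}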

This brings me to an architecture I've considered for applications that
have a small number of connections. For example, consider RMI-IIOP in a
typical application server. The number of connections it would serve
would be very small compared to the number of HTTP clients it would
serve. For a small number of connections, one might consider using
non-blocking SocketChannels and configuring a Selector per
SocketChannel. Then you do non-blocking reads and read as much data as
can be read per read event. You might be thinking, "what's the benefit
of a Selector per SocketChannel?" The benefits are: you avoid expensive
thread context switches between a Selector thread and worker threads as
read events occur, and you don't have to enable and disable interest
ops on the Selector, which avoids expensive system calls. (Note: if the
current thread never leaves the Selector and goes back to
Selector.select(), you don't have to disable and re-enable interest
ops.) Minimizing the number of thread context switches and minimizing
the number of times you enable / disable interest ops can have a huge
impact on your performance. I'm currently considering how to integrate
such a model into Grizzly as a possible configuration, along with how
to transition from that configuration, in a small-number-of-connections
environment, to the more traditional one-Selector-to-many-SocketChannels
configuration as some threshold number of connections is exceeded.
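A rough sketch of what I mean by a Selector per SocketChannel, all on
one thread (again just an illustration, not Grizzly code; processBytes()
stands in for the message parser / protocol processor):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class DedicatedConnectionLoop implements Runnable {

    private final SocketChannel channel;
    private final Selector selector;
    private final ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);

    public DedicatedConnectionLoop(SocketChannel channel) throws IOException {
        this.channel = channel;
        this.channel.configureBlocking(false);
        this.selector = Selector.open();                 // one Selector for this one channel
        this.channel.register(selector, SelectionKey.OP_READ);
    }

    public void run() {
        try {
            while (channel.isOpen()) {
                if (selector.select() == 0) {
                    continue;
                }
                selector.selectedKeys().clear();
                int n;
                while ((n = channel.read(buffer)) > 0) { // read as much as is available
                    buffer.flip();
                    processBytes(buffer);
                    buffer.compact();
                }
                if (n == -1) {
                    channel.close();                     // peer closed the connection
                }
                // No interest op changes and no hand-off to a worker thread:
                // the same thread simply returns to select().
            }
        } catch (IOException ignored) {
            // sketch: a real implementation would log and clean up
        }
    }

    private void processBytes(ByteBuffer data) {
        // placeholder for the message parser / protocol processor
    }
}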

>
>> There are some additional variations one can incorporate too. For
>> example, distributed network applications tend to be bursty; for that
>> reason you might consider adding to the definition of "expecting more
>> data" the notion of waiting a few more seconds before considering the
>> overall read event done.
>
> Do you build anything in to ensure fairness/avoid starvation?

Note that the "expecting more data" state, and waiting a few more
seconds for data to arrive, is handled in a worker thread. So, it would
not block other worker threads or the selector thread. The potential
starvation could come from the fact that you are willing to wait for
more data in one worker thread while there is another read event
waiting to be processed and no worker thread available to handle the
work. The solution to that issue is to have more worker threads. So,
the trade-off is potentially more worker threads in exchange for
avoiding some thread context switching.

>
>> The writing side is a little more interesting. One could consider
>> putting outbound messages into a queue structure and having a writing
>> thread wait for data on the queue to be written, doing gathering
>> writes when more than one entry is on the queue at a given time. This
>> approach has its advantages and challenges, as you well know.
>
> Yes, the queue approach is what we do in Qpid.

I thought that might be the case.
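For the archives, the shape of that queue approach is roughly the
sketch below; it's not Qpid's or Grizzly's code, and on a non-blocking
channel a real implementation would wait for OP_WRITE (e.g. on a
temporary Selector) instead of spinning when the kernel can't accept
more bytes.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueingWriter implements Runnable {

    private final SocketChannel channel;
    private final BlockingQueue<ByteBuffer> queue = new LinkedBlockingQueue<ByteBuffer>();

    public QueueingWriter(SocketChannel channel) {
        this.channel = channel;
    }

    // Called by any thread that has a message ready to go out.
    public void enqueue(ByteBuffer message) {
        queue.add(message);
    }

    // The dedicated writer thread.
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                ByteBuffer first = queue.take();          // block until there is work
                List<ByteBuffer> batch = new ArrayList<ByteBuffer>();
                batch.add(first);
                queue.drainTo(batch);                     // grab anything else already queued
                ByteBuffer[] srcs = batch.toArray(new ByteBuffer[batch.size()]);
                while (hasRemaining(srcs)) {
                    channel.write(srcs);                  // gathering write of the whole batch
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (IOException e) {
            // sketch: a real implementation would close the connection here
        }
    }

    private static boolean hasRemaining(ByteBuffer[] srcs) {
        for (ByteBuffer b : srcs) {
            if (b.hasRemaining()) {
                return true;
            }
        }
        return false;
    }
}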

>
> I actually spent a reasonable amount of time trying to get a
> measurable improvement out of gathering writes (and scattering reads
> too), but I was unable to.

We haven't done a lot in this area. So, any additional information
beyond what you've mentioned below would be useful. :-)

When I look at the implementation of gathering writes in the JDK, I
know that I wouldn't attempt a gathering write using non-direct
ByteBuffers. But gathering writes on Solaris or Linux should perform
well, based on the implementation.

As I said, it's something we want to investigate further.

>
> We found some interesting characteristics of the ConcurrentLinkedQueue
> when looking at this. It didn't seem to perform too well when the
> queue was often empty so in the end I believe we actually just used a
> standard LinkedList with synchronized blocks around the accessors.

Interesting observation! It may be that the CAS (compare and set) was a
more expensive operation than the synchronized blocks. We've done some
experimenting with this too and have seen similar results.
Unfortunately we haven't had a chance to do enough analysis to figure
out exactly what the root cause might be. It's on my (lengthy) "TODO"
list ;-)
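A trivially simple starting point might look like the sketch below. It
only exercises the often-empty queue case with a single thread, so any
numbers from it would be indicative at best; a real comparison would
need multiple producer/consumer threads and proper warm-up.

import java.util.LinkedList;
import java.util.concurrent.ConcurrentLinkedQueue;

public class QueueMicroBenchmark {

    private static final int OPS = 2000000;

    public static void main(String[] args) {
        // Run a few times and look at the later runs (after JIT warm-up).
        for (int run = 0; run < 3; run++) {
            System.out.println("ConcurrentLinkedQueue:   " + timeClq() + " ms");
            System.out.println("synchronized LinkedList: " + timeSyncLinkedList() + " ms");
        }
    }

    // Offer/poll pairs keep the queue empty most of the time, which is the
    // case described above.
    private static long timeClq() {
        ConcurrentLinkedQueue<Integer> q = new ConcurrentLinkedQueue<Integer>();
        long start = System.nanoTime();
        for (int i = 0; i < OPS; i++) {
            q.offer(i);
            q.poll();
        }
        return (System.nanoTime() - start) / 1000000;
    }

    private static long timeSyncLinkedList() {
        LinkedList<Integer> q = new LinkedList<Integer>();
        long start = System.nanoTime();
        for (int i = 0; i < OPS; i++) {
            synchronized (q) {
                q.addLast(i);
            }
            synchronized (q) {
                q.removeFirst();
            }
        }
        return (System.nanoTime() - start) / 1000000;
    }
}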

>
>> And,
>> there's also the approach where a thread that has formulated a response
>> or constructed an outbound message simply invokes the connection write
>> itself.
>
> Presumably it would then have to deal with issues like the kernel
> buffer being full etc.

Yes, it needs to check whether all bytes have been written to the
SocketChannel and potentially use a temporary Selector to tell us when
the channel is ready for more data to be written. That's what Grizzly
does as of now. I'd like to integrate some additional flexibility that
could do the queue approach as you've done.
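The shape of that direct-write path is roughly the sketch below (not
Grizzly's actual code). The temporary Selector would come from the pool
mentioned earlier, and the SocketChannel is assumed to be non-blocking,
so a write() that returns 0 means the kernel buffer is full.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public final class DirectWriter {

    public static void writeFully(SocketChannel channel, ByteBuffer data,
                                  Selector tmp, long timeoutMillis) throws IOException {
        while (data.hasRemaining()) {
            if (channel.write(data) > 0) {
                continue;                           // made progress, keep writing
            }
            // Kernel send buffer is full: wait for OP_WRITE on the temporary Selector.
            SelectionKey key = channel.register(tmp, SelectionKey.OP_WRITE);
            try {
                if (tmp.select(timeoutMillis) == 0) {
                    throw new IOException("write timed out");
                }
            } finally {
                key.cancel();
                tmp.selectNow();                    // flush the cancelled key
            }
        }
    }
}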

>
>> Performance testing message brokers is a little different ;-)
>> Throughput and scalability are both equally crucial.
>
> Yes, and for some applications latency can be an issue and getting
> that right can be a challenge. Other message brokers we tested came
> close to Qpid in some throughput tests but had horrific latency.
>
> One other thing we found with Qpid was that direct buffers were
> significantly slower than heap buffers and that pooling buffers
> (something that MINA can do) was counterproductive if you use heap
> buffers. Do you use heap or direct buffers?

We've found that it depends largely on the use case whether direct or
heap ByteBuffers perform better. What we have found with Grizzly HTTP
is that pooled heap ByteBuffers sliced into views perform the best.
However, Grizzly HTTP does a lot of copying of bytes to a ByteBuffer.
As a result of all that activity, heap ByteBuffers are currently
showing the best performance on Grizzly HTTP. But I also know that Java
SE (which is the group I actually belong to) is making some
improvements in the performance of copying data to ByteBuffers, not
only for heap ByteBuffers but also for direct ByteBuffers. So, our
observation may change in an upcoming Java SE release.

In general, I tend to favor pooling large direct ByteBuffers, slicing
them into views as I need them, and then recycling the views back into
the large direct ByteBuffers. I'm currently working on an
implementation that will do this in a situation where the thread that
gets the ByteBuffer view is different from the thread that releases the
ByteBuffer view (an interesting challenge).
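Roughly, the idea looks like the sketch below: fixed-size views sliced
out of one large direct ByteBuffer, with a concurrent free list so the
thread that releases a view can be different from the thread that
acquired it. It's only a sketch; a real implementation has to deal with
variable sizes and pool exhaustion more carefully.

import java.nio.ByteBuffer;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class SlicedDirectBufferPool {

    private final Queue<ByteBuffer> freeViews = new ConcurrentLinkedQueue<ByteBuffer>();
    private final int viewSize;

    public SlicedDirectBufferPool(int viewSize, int viewCount) {
        this.viewSize = viewSize;
        ByteBuffer large = ByteBuffer.allocateDirect(viewSize * viewCount);
        for (int i = 0; i < viewCount; i++) {
            large.position(i * viewSize);
            large.limit((i + 1) * viewSize);
            freeViews.add(large.slice());   // each view shares the large buffer's memory
        }
    }

    // May be called from an I/O thread or a worker thread.
    public ByteBuffer acquire() {
        ByteBuffer view = freeViews.poll();
        if (view == null) {
            // Pool exhausted: hand out a throw-away heap buffer in this sketch.
            return ByteBuffer.allocate(viewSize);
        }
        view.clear();
        return view;
    }

    // May be called from a different thread than acquire(); the concurrent
    // queue is what makes that safe.
    public void release(ByteBuffer view) {
        if (view.isDirect()) {              // only direct views came from the pooled buffer
            freeViews.offer(view);
        }
    }
}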

The choice of whether we use heap or direct ByteBuffers in a given
configuration (e.g. Grizzly HTTP in GlassFish) is based on the
throughput and scalability we see for each. On the JDKs supported by
GlassFish V2, pooled heap ByteBuffers are performing the best. That
being said, I know of different configurations of Grizzly where pooled
direct ByteBuffers will perform better. Unfortunately it's not as
simple as saying one is always faster than the other. It varies with
the application, and it also varies from JDK release to JDK release.

Thanks for the interesting discussion. I'd love to hear more about your
observations on ConcurrentLinkedQueue versus a synchronized LinkedList.
You wouldn't happen to have a micro-benchmark that illustrates the
performance difference, would you?

charlie ...

>
> RG
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe_at_grizzly.dev.java.net
> For additional commands, e-mail: dev-help_at_grizzly.dev.java.net
>


-- 
Charlie Hunt
Java Performance Engineer
630.285.7708 x47708 (Internal)
<http://java.sun.com/docs/performance/>