Robert Greig wrote:
> On 25/05/07, charlie hunt <charlie.hunt_at_sun.com> wrote:
>
>> So, if I understand correctly, the SocketIOProcessor is dedicated to not
>> only a socket, but also a Thread, correct ?
>
> Each socket processor is a thread, but each socket processor can
> handle multiple sockets. We typically run with one socket processor
> per CPU core.
Understood.
>
>> As for the cost of creating a temporary selector and tearing it down
>> .... here's a little secret ;-) Don't constantly create and tear down a
>> Selector. Either keep a pool of temporary Selectors and get one from
>> the pool when you need one, or dedicate a temporary Selector to each
>> connection. For a large number of connections you'd want to go with the
>> former approach; for a small number of connections the latter approach
>> may work better.
>
> I suppose the advantage this brings is the reduction in context
> switching. Although it must depend on where (i.e which thread) you are
> interpreting bytes. With Qpid, the socket IO processors just read the
> data and have no understanding of how many bytes are still to be read
> to complete an individual command. When bytes are read up to maximum
> frame size they are just passed to a per-connection queue, from where
> worker threads pull them for decoding.
The advantage is not only avoiding a context switch but also avoiding
the enabling & disabling of interest ops.
What you are doing is similar to the model we currently use in the
GlassFish IIOP / ORB, with a slight difference.
In the GlassFish IIOP / ORB, we allocate a really large ByteBuffer. When
we get a read event for a connection, we do a non-blocking read and read
as much data into that really large ByteBuffer as we can possibly read
(in some cases we read as much as 64k bytes). Once that data is read, we
stay in the same thread and scan the bytes looking for the beginning and
ending of messages in the data we just read. This is where I think we
deviate from your implementation. I think you give the data to another
thread pool to scan the bytes, right? As we scan the bytes for the
beginning and ending of (GIOP) messages, we slice the really large
ByteBuffer into GIOP PDUs (protocol data units). But the PDU is not
decoded at that point. Each GIOP PDU is put on a worker thread pool
where it is processed on its own worker thread (actually it will further
dispatch to another thread pool too). Once we've scanned the bytes we've
read into the really big ByteBuffer, we know whether we have a partial
message left in the really big ByteBuffer. If there's a partial message,
we know that we are expecting more data (there are other conditions for
IIOP that tell us we're expecting more data too, but they're not
important for this discussion). If we are expecting more data we'll try
to do an immediate non-blocking read again. If that non-blocking read
results in 0 bytes read, then we'll go into a "pseudo" blocking read
using a temporary Selector. We'll stay in the "pseudo" blocking read
until we are no longer expecting more data. Only after we've determined,
by scanning the bytes read for GIOP PDUs, that no more data is expected
do we return to using the "main" Selector.
So, I think where your implementation deviates from what we do in the
GlassFish IIOP / ORB is in where the bytes just read get scanned.
I should also point out that Grizzly currently does not provide this
directly behind its APIs; you'd have to build some of this yourself.
But we hope to change that in the near future as we begin to utilize
Grizzly in the GlassFish IIOP / ORB.
In Grizzly 1.5, you'd probably have to build the additional structure to
scan the bytes in another thread. In other words, it's doable, but
probably not too straightforward.
>
>> SocketChannel. The benefits are: avoiding expensive thread context
>> switches between a Selector thread and worker threads as read events
>> occur, and not having to enable and disable interest ops on the Selector,
>> which avoids expensive system calls. (Note: if the current thread never
>> leaves the Selector and goes back to a Selector.select(), you don't have
>> to disable & re-enable interest ops.) Minimizing the number of thread
>> context switches and minimizing the number of times you enable / disable
>> interest ops can have a huge impact on your performance.
>
> Does the grizzly model currently do the reads on a separate thread
> (what you call the worker thread here?) -- i.e. the select thread
> literally only does selects -- or is worker thread for actual
> processing of the data?
Actually, you can configure things however you want. The default model
is to accept new connections on the same thread the main Selector is
running on, and to dispatch to worker threads for handling read events.
That being said, once you've dispatched to a worker thread to handle a
read event, there's nothing to prevent you from creating filters that
read the data and then pass it to a different worker thread or thread
pool for scanning, or whatever kind of work you want. You are not forced
into a model where you have to read & process the data in the same
thread.
When we get to integrating Grizzly with the GlassFish IIOP / ORB, we're
going to need to be able to delegate GIOP PDUs to a different thread
than the one that read the data. One could do this with Grizzly 1.5; it
just may not be too obvious how. You just have to realize that you are
free to do whatever you want in the worker thread that handles a read
event, i.e. you could further delegate to other threads or thread pools
as you see fit.
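For example, using nothing but plain java.util.concurrent (not the
Grizzly 1.5 API; the class and method names here are made up), the
worker thread that handles the read event could simply hand the filled
buffer off to a separate pool:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    final class ReadThenDelegate {
        // separate pool that does the scanning / decoding; size is arbitrary here
        private final ExecutorService decodePool = Executors.newFixedThreadPool(4);

        // runs on the worker thread that handles the read event
        void onReadEvent(SocketChannel channel, ByteBuffer buffer) throws IOException {
            int n = channel.read(buffer);
            if (n <= 0) {
                return;
            }
            buffer.flip();
            final ByteBuffer data = buffer;    // or a slice of it
            decodePool.execute(new Runnable() {
                public void run() {
                    decode(data);              // happens off the read thread
                }
            });
        }

        private void decode(ByteBuffer pdu) {
            // protocol-specific work would go here
        }
    }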
>
> We did try having a model with Qpid where we do decoding of bytes in
> the same thread as the reading. However this was significantly worse.
> The reason I came up with was that with a separate thread for
> decoding, in the case where you have decoded only to discover that you
> need more data, by the time you've figured that out the reader thread
> will already have read the data off the socket. If you are only using
> a single thread then you cannot read and decode in parallel.
I would expect that decoding and reading in the same thread would
result in poor performance. As I mentioned above, that's something we
need to make sure we get right in the GlassFish IIOP / ORB too. And
that's what we do ... we decode in a different thread than the one we
read in. We do, however, scan the bytes we read to isolate the beginning
and ending bytes of a GIOP PDU. Identifying those boundaries is pretty
easy; it's a matter of a 12-byte header, 4 bytes of which contain the
overall message length. So the byte scanner that produces a GIOP PDU is
very fast and very simple. Once those PDUs are identified and sliced,
they are dispatched to a worker thread where they are processed further
(they actually don't get decoded in that thread either; they get decoded
in yet another thread).
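As a rough illustration of that boundary scan (this is not the ORB's
actual scanner; it ignores error checking and GIOP fragments), the
slicing boils down to something like:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.util.ArrayList;
    import java.util.List;

    final class GiopScanner {

        // Walk the big buffer, read the 4-byte message length out of each
        // 12-byte GIOP header, and slice off complete PDUs.  Anything left
        // over is a partial message.  Fragmented messages and error checks
        // (e.g. verifying the "GIOP" magic) are omitted.
        static List<ByteBuffer> slicePdus(ByteBuffer big) {
            List<ByteBuffer> pdus = new ArrayList<ByteBuffer>();
            while (big.remaining() >= 12) {
                // byte 6 (flags): bit 0 set means little-endian encoding
                boolean littleEndian = (big.get(big.position() + 6) & 0x01) != 0;
                big.order(littleEndian ? ByteOrder.LITTLE_ENDIAN
                                       : ByteOrder.BIG_ENDIAN);
                int total = 12 + big.getInt(big.position() + 8);
                if (big.remaining() < total) {
                    break;                     // partial message, expect more data
                }
                ByteBuffer pdu = big.slice();  // view over the big buffer
                pdu.limit(total);              // one complete, undecoded GIOP PDU
                pdus.add(pdu);
                big.position(big.position() + total);
            }
            return pdus;
        }
    }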
>
>> > I actually spent a reasonable amount of time trying to get a
>> > measurable improvement using scattering writes (and gathering reads
>> > too) but I was unable to get any measurable improvement.
>>
>> We haven't done a lot in this area. So, any additional information
>> beyond what you've mentioned below would be useful. :-)
>
> Part of the issue we had was that it is hard to decide how much to
> attempt to scatter particularly when you are dealing with a queue,
> i.e. on the queue of items each item has an upper bound but not a
> lower bound. How many items to attempt to scatter? The best thing is
> to be able to write data without filling up the kernel buffer but when
> doing the scattering write we found that we often just filled the
> kernel buffer (and hence had the expensive selector manipulation to
> do).
+1, it's a difficult problem with no easy solution.
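Just to make the failure mode concrete, a simple-minded batched
(gathering) write looks something like the sketch below; the hard part
you describe is choosing maxBatch and reacting when the write comes up
short. The names here are made up for illustration:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;
    import java.util.Queue;

    final class BatchWriter {

        // Write up to 'maxBatch' queued buffers with a single gathering write.
        // Returns false when the write came up short, i.e. the kernel send
        // buffer filled up and the caller has to fall back to OP_WRITE
        // interest on the selector.
        static boolean writeBatch(SocketChannel channel, Queue<ByteBuffer> queue,
                                  int maxBatch) throws IOException {
            int n = Math.min(maxBatch, queue.size());
            if (n == 0) {
                return true;
            }
            ByteBuffer[] batch = new ByteBuffer[n];
            long attempted = 0;
            int i = 0;
            for (ByteBuffer b : queue) {
                if (i == n) {
                    break;
                }
                attempted += b.remaining();
                batch[i++] = b;
            }
            long written = channel.write(batch, 0, i);

            // drop fully written buffers from the head of the queue
            while (!queue.isEmpty() && !queue.peek().hasRemaining()) {
                queue.poll();
            }
            return written == attempted;
        }
    }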
>
> In fact, one of the really hard things we found with a message broker
> is getting the case where the client is sending and receiving data at
> the same time (something that is extremely common). It was while
> optimising that case that we had to introduce separate threads for
> reading and writing.
+1 (again). The GlassFish ORB has a similar problem, as does any
protocol that multiplexes multiple clients over the same connection and
additionally uses that connection for both reads & writes. You can
imagine the challenge of having multiple clients multiplexed over the
same connection while all of them also share that connection for reads
& writes.
>
>> When I look at the implementation of scatter writes in the JDK, I know
>> that I wouldn't attempt to do a scatter write using non-direct
>> ByteBuffers. But scatter writes on Solaris or Linux should perform
>> well based on their implementation.
>
> Ah, I'm pretty sure I would have been using heap buffers at the time
> when I tested that.
That's good to know.
There are so many nasty little "pot holes" to navigate around to make
Java NIO perform really well. But I suppose that's why we're so
aggressively trying to make Grizzly available to anyone who'd like to
use Java NIO and realize its potential.
Sounds like you've had quite a few experiences with it as well.
Hope you don't mind me asking some additional questions on scatter
writes when the day comes that I start looking at them again?
>
>> In general, I tend to favor pooling large direct ByteBuffers and slicing
>> them into views as I need them. Then, recycling the views back into the
>> large direct ByteBuffers. I'm currently working on an implementation
>> that will do this in a situation where the thread that gets the
>> ByteBuffer view is different from the thread that releases the
>> ByteBuffer view, (an interesting challenge).
>
> That would be interesting. It would work well with AMQP where the
> maximum size of each protocol message is negotiated on connection. For
> example, with AMQP you can know that you'll never get more than say
> 32k in a protocol message so we could just allocate a 1MB direct
> buffer for each socket io processor to use by taking slices.
It would be nice to have a maximum; not all protocols define one.
Your approach makes perfect sense, and what you've done is very much
what I would recommend to someone working with your kind of
constraints. :-)
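A bare-bones version of that slicing approach might look like the
following; the sizes and names are just for illustration, and it glosses
over most of the bookkeeping, though the ConcurrentLinkedQueue free list
at least makes it safe to release a view from a different thread than
the one that acquired it:

    import java.nio.ByteBuffer;
    import java.util.concurrent.ConcurrentLinkedQueue;

    final class SlicedBufferPool {
        // free list; views can be released from a different thread than the
        // one that acquired them
        private final ConcurrentLinkedQueue<ByteBuffer> free =
                new ConcurrentLinkedQueue<ByteBuffer>();

        // e.g. new SlicedBufferPool(1024 * 1024, 32 * 1024)
        SlicedBufferPool(int poolSize, int sliceSize) {
            ByteBuffer big = ByteBuffer.allocateDirect(poolSize);
            while (big.remaining() >= sliceSize) {
                big.limit(big.position() + sliceSize);
                free.offer(big.slice());          // a sliceSize-byte view
                big.position(big.limit());
                big.limit(big.capacity());
            }
        }

        ByteBuffer acquire() {
            return free.poll();       // null means the pool is exhausted
        }

        void release(ByteBuffer view) {
            view.clear();             // reset position/limit before reuse
            free.offer(view);
        }
    }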
>
>> The choice on whether we use heap or direct bytebuffers in a given
>> configuration, (i.e. Grizzly HTTP in GlassFish for example), is based on
>> performance throughput and scalability we see for each. On the JDKs
>> supported by GlassFish V2, pooled heap bytebuffers are performing the
>> best.
>
> Which garbage collection options did you use for those tests?
We tend to use the throughput collector with aggressive opts. More
specifically, a combination of -XX:+AggressiveOpts and
-XX:+UseParallelOldGC (note: parallel old GC also enables
-XX:+UseParallelGC).
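In other words, something along the lines of (the rest of the command
line depends on the installation):

    java -XX:+AggressiveOpts -XX:+UseParallelOldGC <rest of the app server command line>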
I should also mention two additional things: first, we see pooled heap
ByteBuffers that are sliced as performing the best; and second, the
reason heap ByteBuffers are currently performing the best comes down to
"copying bytes into the ByteBuffer" performance. What I mean by the
latter is that ByteBuffer performance in GlassFish depends more on the
throughput of copying bytes into a ByteBuffer than it does on reading
or writing the ByteBuffer. As a result, the copy performance difference
between a direct and a heap ByteBuffer dictates the choice of
ByteBuffer. However, we've addressed the "copying bytes" performance
issue with direct ByteBuffers, and that fix will become available in an
update release of the JDK/JVM. So it's likely we'll migrate to pooled
direct ByteBuffers in the future. That's also the reason for the
recycling direct ByteBuffer work I described earlier.
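The effect is easy to see even with a crude copy test; this isn't our
actual benchmark, just an illustration of the copy path I'm talking
about:

    import java.nio.ByteBuffer;

    public final class CopyCost {
        public static void main(String[] args) {
            byte[] src = new byte[8 * 1024];
            // run each case a few times so the numbers aren't dominated
            // by JIT warm-up
            for (int run = 0; run < 3; run++) {
                time("heap  ", ByteBuffer.allocate(64 * 1024), src);
                time("direct", ByteBuffer.allocateDirect(64 * 1024), src);
            }
        }

        static void time(String label, ByteBuffer dst, byte[] src) {
            long start = System.nanoTime();
            for (int i = 0; i < 100000; i++) {
                dst.clear();
                while (dst.remaining() >= src.length) {
                    dst.put(src);           // the copy being measured
                }
            }
            System.out.println(label + ": "
                    + (System.nanoTime() - start) / 1000000 + " ms");
        }
    }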
>
>> Thanks for the interesting discussion. I'd love to hear more about your
>> observations about ConcurrentLinkedQueue versus a synchronized
>> LinkedList. You wouldn't happen to have a micro-benchmark that
>> illustrates the performance difference?
>
> We did actually develop a benchmark and tested across various
> platforms and JDKs. However I am not sure if I still have the code. I
> will take a look though, maybe I can find it on an internal subversion
> somewhere.
If you happen to have any benchmarks you'd feel comfortable sharing or
contributing, we'd be glad to look at them. If they turn out to be good
candidates for testing Java SE builds against, and you'd be OK with us
using them, I could probably get them integrated into the performance
test suite that tests Java SE releases.
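In case it's a useful starting point while you look, a bare-bones
version of such a benchmark could be as simple as the sketch below
(obviously not the Qpid benchmark):

    import java.util.LinkedList;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public final class QueueBench {

        interface Q { void put(Integer i); Integer take(); }

        public static void main(String[] args) throws Exception {
            final Queue<Integer> clq = new ConcurrentLinkedQueue<Integer>();
            run("ConcurrentLinkedQueue  ", new Q() {
                public void put(Integer i) { clq.offer(i); }
                public Integer take()      { return clq.poll(); }
            });

            final LinkedList<Integer> list = new LinkedList<Integer>();
            run("synchronized LinkedList", new Q() {
                public void put(Integer i) { synchronized (list) { list.addLast(i); } }
                public Integer take()      { synchronized (list) { return list.poll(); } }
            });
        }

        // half the threads offer, half poll, as fast as they can
        static void run(String label, final Q q) throws InterruptedException {
            Thread[] threads = new Thread[4];
            long start = System.nanoTime();
            for (int t = 0; t < threads.length; t++) {
                final boolean producer = (t % 2 == 0);
                threads[t] = new Thread(new Runnable() {
                    public void run() {
                        for (int i = 0; i < 1000000; i++) {
                            if (producer) { q.put(i); } else { q.take(); }
                        }
                    }
                });
                threads[t].start();
            }
            for (Thread t : threads) {
                t.join();
            }
            System.out.println(label + ": "
                    + (System.nanoTime() - start) / 1000000 + " ms");
        }
    }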
charlie ...
>
> RG
>
--
Charlie Hunt
Java Performance Engineer
630.285.7708 x47708 (Internal)
<http://java.sun.com/docs/performance/>