On 25/05/07, charlie hunt <charlie.hunt_at_sun.com> wrote:
> So, if I understand correctly, the SocketIOProcessor is dedicated to not
> only a socket, but also a Thread, correct ?
Each socket processor is a thread, but each socket processor can
handle multiple sockets. We typically run with one socket processor
per CPU core.
> As for the cost of creating a temporary selector and tearing it down
> .... here's a little secret ;-) Don't constantly create and tear down a
> Selector. Either keep a pool of temporary Selectors and get one from
> the pool when you need one, or dedicate a temporary Selector to each
> connection. For a large number of connections you'd wanna go with the
> former approach; for a small number of connections the latter approach may
> work better.
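(For my own understanding, I take the pooled approach to be something like
the sketch below -- the names and the pool strategy are mine, not anything
from Grizzly.)

import java.io.IOException;
import java.nio.channels.Selector;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Minimal sketch of a pool of temporary Selectors: borrow one when needed,
    hand it back rather than closing it. */
public class TemporarySelectorPool
{
    private final ConcurrentLinkedQueue<Selector> _pool =
            new ConcurrentLinkedQueue<Selector>();

    public TemporarySelectorPool(int size) throws IOException
    {
        for (int i = 0; i < size; i++)
        {
            _pool.offer(Selector.open());
        }
    }

    /** Borrow a Selector; fall back to opening a new one if the pool is empty. */
    public Selector acquire() throws IOException
    {
        Selector selector = _pool.poll();
        return (selector != null) ? selector : Selector.open();
    }

    /** Return a Selector to the pool; callers must cancel any keys they
        registered on it first. */
    public void release(Selector selector)
    {
        _pool.offer(selector);
    }
}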
I suppose the advantage the temporary-Selector approach brings is the
reduction in context switching. Although it must depend on where (i.e.
which thread) you are interpreting the bytes. With Qpid, the socket IO
processors just read the data and have no understanding of how many
bytes are still to be read to complete an individual command. When bytes
are read, up to the maximum frame size, they are just passed to a
per-connection queue, from where worker threads pull them for decoding.
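In (heavily) simplified form, that side of Qpid looks something like the
sketch below -- illustrative only, the real class and method names differ:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch: the IO processor reads raw bytes and hands them to a per-connection
    queue; a worker thread pulls buffers off the queue and does the decoding. */
public class ConnectionSketch
{
    private final SocketChannel _channel;
    private final BlockingQueue<ByteBuffer> _inbound =
            new LinkedBlockingQueue<ByteBuffer>();
    private final int _maxFrameSize;

    public ConnectionSketch(SocketChannel channel, int maxFrameSize)
    {
        _channel = channel;
        _maxFrameSize = maxFrameSize;
    }

    /** Called on the socket IO processor thread when the channel is readable. */
    public void readable() throws IOException
    {
        ByteBuffer buffer = ByteBuffer.allocate(_maxFrameSize);
        if (_channel.read(buffer) > 0)
        {
            buffer.flip();
            _inbound.offer(buffer);   // no decoding here, just hand the bytes off
        }
    }

    /** Run on a worker thread: pull raw bytes off the queue and decode commands. */
    public void decodeLoop() throws InterruptedException
    {
        while (true)
        {
            ByteBuffer buffer = _inbound.take();
            // decode(buffer) -- protocol decoding happens here, off the IO thread
        }
    }
}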
> SocketChannel. The benefits are: avoiding expensive thread context
> switches between a Selector thread and Worker Threads as read events
> occur, and you don't have to enable and disable interest ops on the
> Selector, which avoids expensive system calls (note, if the current
> thread never leaves the Selector and goes back to a Selector.select()
> you don't have to disable & re-enable interest ops). Minimizing the
> number of thread context switches and minimizing the number of times
> you enable / disable interest ops can have a huge impact on your
> performance.
Does the Grizzly model currently do the reads on a separate thread
(what you call the worker thread here?) -- i.e. the select thread
literally only does selects -- or is the worker thread for the actual
processing of the data?
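i.e. I'm picturing something like the inline-read loop below (my own
sketch of what you describe, not actual Grizzly code):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

/** Sketch of a select loop that reads on the selector thread itself, so OP_READ
    interest never has to be disabled and re-enabled around a hand-off. */
public class InlineReadLoop implements Runnable
{
    private final Selector _selector;
    private final ByteBuffer _readBuffer = ByteBuffer.allocateDirect(64 * 1024);

    public InlineReadLoop(Selector selector)
    {
        _selector = selector;
    }

    public void run()
    {
        try
        {
            while (true)
            {
                _selector.select();
                Iterator<SelectionKey> keys = _selector.selectedKeys().iterator();
                while (keys.hasNext())
                {
                    SelectionKey key = keys.next();
                    keys.remove();
                    if (key.isReadable())
                    {
                        _readBuffer.clear();
                        ((SocketChannel) key.channel()).read(_readBuffer);
                        // process the bytes here, on this thread, then go straight
                        // back to select() -- no interest-op changes, no context switch
                    }
                }
            }
        }
        catch (IOException e)
        {
            // sketch only: real code needs proper error handling and key cancellation
        }
    }
}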
We did try a model in Qpid where we did the decoding of bytes on the
same thread as the reading. However, this was significantly worse. The
explanation I came up with was that, with a separate thread for
decoding, in the case where you decode only to discover that you need
more data, by the time you've figured that out the reader thread will
already have read that data off the socket. If you are only using a
single thread then you cannot read and decode in parallel.
> > I actually spent a reasonable amount of time trying to get a
> > measurable improvement using gathering writes (and scattering reads
> > too), but I was unable to see any benefit.
>
> We haven't done a lot in this area. So, any additional information
> beyond what you've mentioned below would be useful. :-)
Part of the issue we had was that it is hard to decide how much data to
gather into a single write, particularly when you are dealing with a
queue: each item on the queue has an upper bound on its size but no
lower bound, so how many items should you attempt to batch? The ideal
is to write data without filling up the kernel buffer, but when doing
the gathering write we found that we often just filled the kernel
buffer (and hence had the expensive selector manipulation to do
anyway).
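Concretely, the write path was along these lines (a simplified sketch;
the batch limit is an arbitrary number picked for illustration, and it
assumes a single thread is draining the queue):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.GatheringByteChannel;
import java.util.Queue;

/** Sketch of draining a queue of outbound frames with one gathering write. The
    awkward part is picking the batch size: each frame has an upper size bound
    but no lower bound, so you can't tell how much will fit in the kernel's
    socket buffer. */
public class GatheringWriter
{
    private static final int MAX_BATCH = 8;   // arbitrary limit for the sketch

    /** Returns true if the whole batch was accepted by the kernel; false means
        the socket buffer filled up and OP_WRITE interest would be needed. */
    public static boolean drain(GatheringByteChannel channel, Queue<ByteBuffer> outbound)
        throws IOException
    {
        if (outbound.isEmpty())
        {
            return true;
        }

        // take a batch from the head of the queue without removing anything yet
        ByteBuffer[] batch = new ByteBuffer[Math.min(MAX_BATCH, outbound.size())];
        int count = 0;
        for (ByteBuffer buffer : outbound)
        {
            if (count == batch.length)
            {
                break;
            }
            batch[count++] = buffer;
        }

        channel.write(batch, 0, count);   // one gathering write (writev) for the batch

        // drop buffers that were written completely; a partially written buffer
        // stays at the head with its position advanced by write()
        while (!outbound.isEmpty() && !outbound.peek().hasRemaining())
        {
            outbound.poll();
        }

        return !batch[count - 1].hasRemaining();
    }
}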
In fact, one of the really hard things we found with a message broker
is handling the case where the client is sending and receiving data at
the same time (something that is extremely common). It was while
optimising that case that we had to introduce separate threads for
reading and writing.
> When I look at the implementation of gathering writes in the JDK, I know
> that I wouldn't attempt to do a gathering write using non-direct
> ByteBuffers. But gathering writes on Solaris or Linux should perform
> well, based on their implementation.
Ah, I'm pretty sure I would have been using heap buffers at the time
when I tested that.
> In general, I tend to favor pooling large direct ByteBuffers and slicing
> them into views as I need them, then recycling the views back into the
> large direct ByteBuffers. I'm currently working on an implementation
> that will do this in a situation where the thread that gets the
> ByteBuffer view is different from the thread that releases the
> ByteBuffer view (an interesting challenge).
That would be interesting. It would work well with AMQP, where the
maximum size of each protocol message is negotiated on connection. For
example, with AMQP you can know that you'll never get more than, say,
32k in a protocol message, so we could just allocate a 1MB direct
buffer for each socket IO processor to take slices from.
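Something like the sketch below, though the sizes are just illustrative
and this naive version never recycles the views, which is of course the
hard part you mention:

import java.nio.ByteBuffer;

/** Sketch: carve fixed-size views out of one large direct ByteBuffer instead of
    allocating a direct buffer per read. Sizes are illustrative only, and this
    naive version doesn't recycle views back into the parent buffer. */
public class SlicingBufferPool
{
    private final ByteBuffer _parent;
    private final int _sliceSize;

    public SlicingBufferPool(int totalSize, int sliceSize)
    {
        _parent = ByteBuffer.allocateDirect(totalSize);   // e.g. 1MB per IO processor
        _sliceSize = sliceSize;                           // e.g. the 32k max frame size
    }

    /** Hand out the next view; a real implementation would block or grow when
        the parent buffer is exhausted, and take returned views back. */
    public synchronized ByteBuffer acquireSlice()
    {
        if (_parent.remaining() < _sliceSize)
        {
            throw new IllegalStateException("parent buffer exhausted (sketch only)");
        }
        _parent.limit(_parent.position() + _sliceSize);
        ByteBuffer slice = _parent.slice();               // view sharing the direct memory
        _parent.position(_parent.limit());
        _parent.limit(_parent.capacity());
        return slice;
    }
}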
> The choice of whether we use heap or direct ByteBuffers in a given
> configuration (e.g. Grizzly HTTP in GlassFish) is based on the
> performance, throughput and scalability we see for each. On the JDKs
> supported by GlassFish V2, pooled heap ByteBuffers are performing the
> best.
Which garbage collection options did you use for those tests?
> Thanks for the interesting discussion. I'd love to hear more about your
> observations about ConcurrentLinkedQueue versus a synchronized
> LinkedList. You wouldn't happen to have a micro-benchmark that
> illustrates the performance difference?
We did actually develop a benchmark and tested it across various
platforms and JDKs. However, I am not sure if I still have the code. I
will take a look though; maybe I can find it in an internal Subversion
repository somewhere.
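If I can't dig the original out, it was broadly of the shape below --
just a sketch of that style of test, not the code we actually ran, and
with all the usual micro-benchmark caveats (warm-up, JIT, GC noise):

import java.util.LinkedList;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;

/** Rough sketch: several threads hammer offer()/poll() on a
    ConcurrentLinkedQueue and then on a LinkedList guarded by synchronized
    blocks, timing each run. */
public class QueueContentionBench
{
    private static final int THREADS = 4;
    private static final int OPS_PER_THREAD = 1000000;

    public static void main(String[] args) throws InterruptedException
    {
        System.out.println("ConcurrentLinkedQueue: "
                + run(new ConcurrentLinkedQueue<Integer>(), false) + "ms");
        System.out.println("synchronized LinkedList: "
                + run(new LinkedList<Integer>(), true) + "ms");
    }

    private static long run(final Queue<Integer> queue, final boolean lock)
        throws InterruptedException
    {
        final CountDownLatch done = new CountDownLatch(THREADS);
        long start = System.currentTimeMillis();
        for (int t = 0; t < THREADS; t++)
        {
            new Thread(new Runnable()
            {
                public void run()
                {
                    for (int i = 0; i < OPS_PER_THREAD; i++)
                    {
                        if (lock)
                        {
                            // LinkedList is not thread-safe, so guard it explicitly
                            synchronized (queue) { queue.offer(i); queue.poll(); }
                        }
                        else
                        {
                            queue.offer(i);
                            queue.poll();
                        }
                    }
                    done.countDown();
                }
            }).start();
        }
        done.await();
        return System.currentTimeMillis() - start;
    }
}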
RG