users@grizzly.java.net

Parallel writes revisited: bug or me?

From: Matthew Swift <matthew.swift_at_gmail.com>
Date: Mon, 6 Feb 2012 17:46:07 +0100

Hi Grizzly devs,

I'm currently using Grizzly 2.1.7 as the IO framework for our LDAP SDK.
I've just noticed a serious problem in our client which occurs when
multiple client threads attempt to write large requests simultaneously to
the same connection (unlike my previous emails, I am not using SSL this
time, just plain TCP).

My client code looks a bit like the code below. I've simplified the encoding
a bit to make it easier to follow, but the write to the connection is identical
to what we are doing:

// Serialize the LDAP request to a Grizzly Buffer.
LDAPSearchRequest request = ...;
LDAPWriter writer = new LDAPWriter();
writer.writeSearchRequest(request);
Buffer buffer = writer.getBuffer();

// Write the buffer content to the org.glassfish.grizzly.Connection
org.glassfish.grizzly.Connection connection = ...;
connection.write(buffer, null);
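
For context, the interleaving only appears when several of our client threads
run the code above concurrently against the same connection. A rough sketch of
the calling pattern (the executor and thread count here are illustrative, not
our real code):

// Several application threads each encode and write a ~10KB request to the
// same connection, with no synchronization around connection.write().
final org.glassfish.grizzly.Connection connection = ...;
final LDAPSearchRequest request = ...;
ExecutorService callers = Executors.newFixedThreadPool(4);
for (int i = 0; i < 4; i++) {
    callers.execute(new Runnable() {
        public void run() {
            LDAPWriter writer = new LDAPWriter();
            writer.writeSearchRequest(request);
            connection.write(writer.getBuffer(), null);
        }
    });
}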

Our assumption was that the call to connection.write() does not need external
synchronization. However, we are seeing messages become interleaved on the wire
when the buffer is relatively big (around 10KB). I enabled FINE logging, which
shows what's going on (the client requests are always 10,732 bytes):

FINE: TCPNIOConnection (TCPNIOConnection{localSocketAddress=/10.10.0.113:43007, peerSocketAddress=my-server.local/10.10.0.114:1389}) (plain) write 10,732 bytes
Feb 6, 2012 3:29:06 PM org.glassfish.grizzly.nio.transport.TCPNIOTransport write
FINE: TCPNIOConnection (TCPNIOConnection{localSocketAddress=/10.10.0.113:43007, peerSocketAddress=my-server.local/10.10.0.114:1389}) (plain) write 10,136 bytes
Feb 6, 2012 3:29:06 PM org.glassfish.grizzly.nio.transport.TCPNIOTransport write
FINE: TCPNIOConnection (TCPNIOConnection{localSocketAddress=/10.10.0.113:43007, peerSocketAddress=my-server.local/10.10.0.114:1389}) (plain) write 0 bytes
Feb 6, 2012 3:29:06 PM org.glassfish.grizzly.nio.transport.TCPNIOTransport write
FINE: TCPNIOConnection (TCPNIOConnection{localSocketAddress=/10.10.0.113:43007, peerSocketAddress=my-server.local/10.10.0.114:1389}) (plain) write 0 bytes
Feb 6, 2012 3:29:06 PM org.glassfish.grizzly.nio.transport.TCPNIOTransport write
FINE: TCPNIOConnection (TCPNIOConnection{localSocketAddress=/10.10.0.113:43007, peerSocketAddress=my-server.local/10.10.0.114:1389}) (plain) write 0 bytes
Feb 6, 2012 3:29:06 PM org.glassfish.grizzly.nio.transport.TCPNIOTransport write
FINE: TCPNIOConnection (TCPNIOConnection{localSocketAddress=/10.10.0.113:43007, peerSocketAddress=my-server.local/10.10.0.114:1389}) (plain) write 10,732 bytes
Feb 6, 2012 3:29:06 PM org.glassfish.grizzly.nio.transport.TCPNIOTransport write
FINE: TCPNIOConnection (TCPNIOConnection{localSocketAddress=/10.10.0.113:43007, peerSocketAddress=my-server.local/10.10.0.114:1389}) (plain) write 10,732 bytes
Feb 6, 2012 3:29:06 PM org.glassfish.grizzly.nio.transport.TCPNIOTransport write
FINE: TCPNIOConnection (TCPNIOConnection{localSocketAddress=/10.10.0.113:43007, peerSocketAddress=my-server.local/10.10.0.114:1389}) (plain) write 596 bytes
Feb 6, 2012 3:29:06 PM org.glassfish.grizzly.nio.transport.TCPNIOTransport write
FINE: TCPNIOConnection (TCPNIOConnection{localSocketAddress=/10.10.0.113:43007, peerSocketAddress=my-server.local/10.10.0.114:1389}) (plain) write 10,732 bytes


You can see that the first request is sent in one go, but the second request
requires several write attempts (a partial write of 10,136 bytes followed by
three writes of 0 bytes) before the remaining 596 bytes are written
(10,136 + 596 = 10,732). However, those remaining 596 bytes go out *after* two
other complete requests, so the second message is interleaved with them on the
wire.

It's easy to see what is going wrong from the call stack:

TCPNIOTemporarySelectorWriter(TemporarySelectorWriter).write0(NIOConnection,
SocketAddress, Buffer, WriteResult<Buffer,SocketAddress>, long, TimeUnit)
line: 167
TCPNIOTemporarySelectorWriter(TemporarySelectorWriter).write(Connection,
SocketAddress, Buffer,
CompletionHandler<WriteResult<Buffer,SocketAddress>>,
Interceptor<WriteResult<Buffer,SocketAddress>>, long, TimeUnit) line: 130
TCPNIOTemporarySelectorWriter(TemporarySelectorWriter).write(Connection,
SocketAddress, Buffer,
CompletionHandler<WriteResult<Buffer,SocketAddress>>,
Interceptor<WriteResult<Buffer,SocketAddress>>) line: 87
TCPNIOTemporarySelectorWriter(TemporarySelectorWriter).write(Connection,
Object, Buffer, CompletionHandler, Interceptor) line: 66
TCPNIOTemporarySelectorWriter(AbstractWriter<L>).write(Connection, L,
Buffer, CompletionHandler<WriteResult<Buffer,L>>) line: 90
TCPNIOTransportFilter.handleWrite(FilterChainContext) line: 151
TransportFilter.handleWrite(FilterChainContext) line: 200
ExecutorResolver$8.execute(Filter, FilterChainContext) line: 111
DefaultFilterChain.executeFilter(FilterExecutor, Filter,
FilterChainContext) line: 286
DefaultFilterChain.executeChainPart(FilterChainContext, FilterExecutor,
int, int) line: 223
DefaultFilterChain.execute(FilterChainContext) line: 155
DefaultFilterChain.process(Context) line: 134
ProcessorExecutor.execute(Context) line: 78
DefaultFilterChain.write(Connection, Object, Object, CompletionHandler)
line: 408
TCPNIOConnection(NIOConnection).write(SocketAddress, M,
CompletionHandler<WriteResult<M,SocketAddress>>) line: 338
TCPNIOConnection(NIOConnection).write(M,
CompletionHandler<WriteResult<M,SocketAddress>>) line: 330


Where TemporarySelectorWriter.write0() is looping without any
synchronization:

            while (buffer.hasRemaining()) {
                int len = writeNow0(connection, dstAddress, buffer,
                        currentResult);

                ...
            }
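
If callers are expected to serialize writes themselves (see my first question
below), then a workaround on our side would presumably look something like
this (just a sketch, and the writeBuffer() helper is mine, not part of our SDK;
it assumes the temporary-selector writer completes the whole write on the
calling thread, as the stack above suggests):

// Guard all writes to a given connection with a per-connection lock so that
// only one thread at a time can be inside TemporarySelectorWriter.write0()'s
// loop for that connection.
private final Object writeLock = new Object();

void writeBuffer(org.glassfish.grizzly.Connection connection, Buffer buffer) {
    synchronized (writeLock) {
        connection.write(buffer, null);
    }
}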


I have a few questions:

   - Is this a bug, or should we be synchronizing around the call to
   connection.write() (along the lines of the sketch above)? If we should
   synchronize, I find that a bit surprising, since the method's declaration
   looks thread safe: it returns a Future and takes a CompletionHandler as a
   parameter.

   - If this is a bug, why hasn't anyone else hit it? Are we using unusual
   or deprecated APIs? I can understand why HTTP client implementations would
   not run into this issue since the protocol is synchronous, but I'd imagine
   other protocol implementations could hit this.

   - I tried Grizzly 2.1.8 with the same results. I also ran into a separate
   minor issue with 2.1.8: it always creates a worker thread pool even if
   you're using the SameThreadIOStrategy (roughly the setup sketched after
   this list) - the builder is initialized with a WorkerThreadIOStrategy and
   setting the IOStrategy does not clear the ThreadPoolConfig. In addition,
   forcefully setting the thread pool size to 0 causes an exception later on
   when the FixedThreadPool is created.
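
For reference, the transport setup that shows the worker-pool behaviour in
2.1.8 is roughly the following (a simplified sketch, not our exact code):

TCPNIOTransportBuilder builder = TCPNIOTransportBuilder.newInstance();
builder.setIOStrategy(SameThreadIOStrategy.getInstance());
// At this point the builder still carries the ThreadPoolConfig that was set
// up for the default WorkerThreadIOStrategy, so the built transport creates
// a worker pool even though SameThreadIOStrategy never uses it.
TCPNIOTransport transport = builder.build();
transport.start();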


Cheers :-)

Matt