dev@grizzly.java.net

Re: closed connection under high load

From: Mkrtchyan, Tigran <tigran.mkrtchyan_at_desy.de>
Date: Wed, 12 Mar 2014 09:17:07 +0100 (CET)

----- Original Message -----
> From: "Oleksiy Stashok" <oleksiy.stashok_at_oracle.com>
> To: dev_at_grizzly.java.net
> Sent: Wednesday, March 12, 2014 1:40:20 AM
> Subject: Re: closed connection under high load
>
> Hi Tigran,
>
> the only explanation I have is that the underlying Socket output buffer is
> overloaded, so Grizzly waits for a signal from the OS that the Socket is ready
> to accept another chunk of output data.
> Looks like you're hitting some OS/Hardware limits.
>
> By making writes blocking, IMHO you limit the number of clients you
> process simultaneously, so it looks like the error is gone, but in
> reality you just handle fewer clients than with non-blocking writes.

Hi Alexey,


I will keep looking for the source of the problem.

>
> What's the CPU load you observe?

The CPU load is ~0.4. There are 24 cores and only 2 of them
get some load: one at ~14% (the selector loop?) and another at ~5%.

This is with 6 selector threads and 48 worker threads. The number of
clients is 32. I get 120 MB/s, which is the network limit.
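
For reference, this corresponds to a transport configured roughly as follows
(a sketch using the Grizzly 2.3 builder API; the exact wiring in oncrpc4j may
differ):

    import org.glassfish.grizzly.nio.transport.TCPNIOTransport;
    import org.glassfish.grizzly.nio.transport.TCPNIOTransportBuilder;
    import org.glassfish.grizzly.threadpool.ThreadPoolConfig;

    // 6 selector threads and 48 worker threads, as in the test above
    ThreadPoolConfig selectorPool = ThreadPoolConfig.defaultConfig()
            .setCorePoolSize(6)
            .setMaxPoolSize(6);
    ThreadPoolConfig workerPool = ThreadPoolConfig.defaultConfig()
            .setCorePoolSize(48)
            .setMaxPoolSize(48);

    TCPNIOTransport transport = TCPNIOTransportBuilder.newInstance()
            .setSelectorThreadPoolConfig(selectorPool)
            .setWorkerThreadPoolConfig(workerPool)
            .build();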

BTW, I believe you need to apply the patch that fixes the selector runner count.

Thanks,
  Tigran.
 
>
> Thanks.
>
> WBR,
> Alexey.
>
> On 11.03.14 08:00, Mkrtchyan, Tigran wrote:
> > It looks like the reply is stored in the async write queue.
> > This is what I can see from the logging of my requests:
> >
> > [root_at_dcache-lab002 ~]# grep 0xda61ff90
> > /var/log/dcache/dcache-lab002Domain.log
> > 11 Mar 2014 15:25:16 (dcache-lab002-A) processing RPC message:
> > xid=0xda61ff90 remote=/131.169.161.125:847
> > 11 Mar 2014 15:25:16 (dcache-lab002-A) dispatching message 0xda61ff90
> > 11 Mar 2014 15:25:16 (dcache-lab002-A) ready to reply Done message
> > 0xda61ff90
> > 11 Mar 2014 15:25:16 (dcache-lab002-A) dispatching Done message 0xda61ff90
> >
> > this is the point after FilterChainContext#write
> >
> > 11 Mar 2014 15:25:23 (dcache-lab002-A) Failed to send RPC message:
> > xid=0xda61ff90 remote=/131.169.161.125:847 : Connection closed
> >
> > and this is the point where CompletionHandler#failed is called
> >
> > The reply was in the queue for 7 seconds.
> > For now I have changed the write call to be blocking:
> >
> > _context.write(_context.getAddress(), buffer,
> >         new EmptyCompletionHandler<WriteResult>() {
> >             @Override
> >             public void failed(Throwable throwable) {
> >                 _log.error("Failed to send RPC message: xid=0x{} remote={} : {}",
> >                         Integer.toHexString(buffer.getInt(0)),
> >                         _connection.getPeerAddress(),
> >                         throwable.getMessage());
> >             }
> >         }, true);
> >
> > The error is gone. But I want to understand why the reply was queued for so long.
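> >
> > If I go back to non-blocking writes, one thing to try (a sketch only; it
> > assumes the Connection#canWrite()/notifyCanWrite(WriteHandler) API of
> > Grizzly 2.3, and 'handler' stands for the completion handler shown above)
> > is to respect the async write queue state before writing:
> >
> >     if (_connection.canWrite()) {
> >         // the queue still has room: write non-blocking as before
> >         _context.write(_context.getAddress(), buffer, handler, false);
> >     } else {
> >         // the queue is saturated: wait until Grizzly signals it can take more data
> >         _connection.notifyCanWrite(new WriteHandler() {
> >             @Override
> >             public void onWritePossible() {
> >                 _context.write(_context.getAddress(), buffer, handler, false);
> >             }
> >
> >             @Override
> >             public void onError(Throwable t) {
> >                 _log.error("Write failed: {}", t.getMessage());
> >             }
> >         });
> >     }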
> >
> > Tigran.
> >
> > ----- Original Message -----
> >> From: "Tigran Mkrtchyan" <tigran.mkrtchyan_at_desy.de>
> >> To: dev_at_grizzly.java.net
> >> Sent: Tuesday, March 11, 2014 8:57:32 AM
> >> Subject: Re: closed connection under high load
> >>
> >>
> >>
> >> Unfortunately, this does not fix the delayed reply issue.
> >> So I keep digging....
> >>
> >> Tigran.
> >>
> >> ----- Original Message -----
> >>> From: "Tigran Mkrtchyan" <tigran.mkrtchyan_at_desy.de>
> >>> To: dev_at_grizzly.java.net
> >>> Sent: Monday, March 10, 2014 4:49:38 PM
> >>> Subject: Re: closed connection under high load
> >>>
> >>>
> >>> Ok, it looks like I found it.
> >>> NIOTransport#startSelectorRunners() starts as many
> >>> SelectorRunners as the node has cores. Nevertheless,
> >>> the size of the thread pool that runs them is limited to
> >>> the number of threads configured with
> >>> NIOTransportBuilder#setSelectorThreadPoolConfig.
> >>> If the numbers do not match, or setSelectorThreadPoolConfig
> >>> configures fewer threads than the SelectorRunners already created,
> >>> some of them will never be executed.
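> >>>
> >>> As a workaround until the fix is applied, one can keep the two numbers
> >>> explicitly in sync. A sketch only (it assumes
> >>> NIOTransportBuilder#setSelectorRunnersCount is available and chainable
> >>> in the 2.3 builder):
> >>>
> >>>     int selectors = 6;  // use the same value for the runners and the selector pool
> >>>     TCPNIOTransport transport = TCPNIOTransportBuilder.newInstance()
> >>>             .setSelectorRunnersCount(selectors)
> >>>             .setSelectorThreadPoolConfig(ThreadPoolConfig.defaultConfig()
> >>>                     .setCorePoolSize(selectors)
> >>>                     .setMaxPoolSize(selectors))
> >>>             .build();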
> >>>
> >>>
> >>> Two small fixes attached.
> >>>
> >>> Tigran.
> >>>
> >>> ----- Original Message -----
> >>>> From: "Tigran Mkrtchyan" <tigran.mkrtchyan_at_desy.de>
> >>>> To: dev_at_grizzly.java.net
> >>>> Sent: Monday, March 10, 2014 2:51:57 PM
> >>>> Subject: Re: closed connection under high load
> >>>>
> >>>>
> >>>>
> >>>> Hi Alexey,
> >>>>
> >>>> probably I do not really understand how grizzly works,
> >>>> but this is what I observe: a selector thread has only
> >>>> one connection in its selector. As a result, only events from a
> >>>> single connection get processed.
> >>>>
> >>>> Is it expected behavior?
> >>>>
> >>>> Tigran.
> >>>>
> >>>> ----- Original Message -----
> >>>>> From: "Tigran Mkrtchyan" <tigran.mkrtchyan_at_desy.de>
> >>>>> To: dev_at_grizzly.java.net
> >>>>> Sent: Monday, March 10, 2014 9:10:07 AM
> >>>>> Subject: Re: closed connection under high load
> >>>>>
> >>>>>
> >>>>>
> >>>>> ----- Original Message -----
> >>>>>> From: "Oleksiy Stashok" <oleksiy.stashok_at_oracle.com>
> >>>>>> To: dev_at_grizzly.java.net
> >>>>>> Sent: Friday, March 7, 2014 10:50:15 PM
> >>>>>> Subject: Re: closed connection under high load
> >>>>>>
> >>>>>> Hi Tigran,
> >>>>>>
> >>>>>>
> >>>>>> On 07.03.14 07:36, Mkrtchyan, Tigran wrote:
> >>>>>>> It looks like the IO is dominated by a single connection.
> >>>>>>> I monitor three probes: onWriteEvent, onReadEvent and
> >>>>>>> onIOEventReadyEvent. While I see ACCEPT from all clients,
> >>>>>>> READ and WRITE come only from a single client.
> >>>>>> Looks like your app is made in a way that it can't use
> >>>>>> SameThreadStrategy. If you can share your code and prepare the test,
> >>>>>> I can take a look and try to find out why it doesn't work with the
> >>>>>> SameThreadStrategy.
> >>>>> Hi Alexey,
> >>>>>
> >>>>> the code is available on the github:
> >>>>>
> >>>>> https://github.com/kofemann/oncrpc4j
> >>>>>
> >>>>> You probably need to have a look at
> >>>>>
> >>>>> https://github.com/kofemann/oncrpc4j/blob/master/oncrpc4j-core/src/main/java/org/dcache/xdr/OncRpcSvc.java
> >>>>>
> >>>>> and
> >>>>>
> >>>>> https://github.com/kofemann/oncrpc4j/blob/master/oncrpc4j-core/src/main/java/org/dcache/xdr/GrizzlyXdrTransport.java
> >>>>>
> >>>>> possibly
> >>>>>
> >>>>> https://github.com/kofemann/oncrpc4j/blob/master/oncrpc4j-core/src/main/java/org/dcache/xdr/RpcDispatcher.java
> >>>>>
> >>>>> I will try to prepare a simple client to produce load without running
> >>>>> a full NFS server.
> >>>>>
> >>>>> Thanks,
> >>>>> Tigran.
> >>>>>
> >>>>>> WBR,
> >>>>>> Alexey.
> >>>>>>
> >>>>>>> Tigran.
> >>>>>>>
> >>>>>>> ----- Original Message -----
> >>>>>>>> From: "Oleksiy Stashok" <oleksiy.stashok_at_oracle.com>
> >>>>>>>> To: dev_at_grizzly.java.net
> >>>>>>>> Sent: Friday, March 7, 2014 2:39:22 AM
> >>>>>>>> Subject: Re: closed connection under high load
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 06.03.14 01:29, Mkrtchyan, Tigran wrote:
> >>>>>>>>> ----- Original Message -----
> >>>>>>>>>> From: "Oleksiy Stashok" <oleksiy.stashok_at_oracle.com>
> >>>>>>>>>> To: dev_at_grizzly.java.net
> >>>>>>>>>> Sent: Thursday, March 6, 2014 2:35:00 AM
> >>>>>>>>>> Subject: Re: closed connection under high load
> >>>>>>>>>>
> >>>>>>>>>> Hi Tigran,
> >>>>>>>>>>
> >>>>>>>>>>>> Hi Tigran,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Looks like the connection is getting closed because of an
> >>>>>>>>>>>> IOException or -1 returned during channel.read(), which is expected.
> >>>>>>>>>>>> So IMO it's nothing decided by Grizzly; either the other side or
> >>>>>>>>>>>> the local OS closed the connection.
> >>>>>>>>>>> Yes, indeed, the client closes the connection. But it looks
> >>>>>>>>>>> like the reason is that the reply from the server is sent 15 sec
> >>>>>>>>>>> after the request arrived. The filter itself took at most 800 ms.
> >>>>>>>>>>> Any ideas how to trace where the rest of the time was spent?
> >>>>>>>>>> You can register ConnectionProbe for each Connection:
> >>>>>>>>>>
> >>>>>>>>>>     connection.getMonitoringConfig().addProbes(new ConnectionProbe() {
> >>>>>>>>>>         .....
> >>>>>>>>>>     });
> >>>>>>>>>>
> >>>>>>>>>> and monitor the different Connection events and the time they
> >>>>>>>>>> happened.
> >>>>>>>>>> It might be interesting to monitor onWriteEvent() and see whether the
> >>>>>>>>>> write delay occurs at the Grizzly/Java or the OS level. For example,
> >>>>>>>>>> Grizzly may report that socketChannel.write(ByteBuffer) has accepted
> >>>>>>>>>> the ByteBuffer, but the peer didn't get the data, which means the
> >>>>>>>>>> delay happens in the OS layer.
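> >>>>>>>>>>
> >>>>>>>>>> A minimal probe that timestamps write and close events could look
> >>>>>>>>>> roughly like this (a sketch only; it assumes the
> >>>>>>>>>> ConnectionProbe.Adapter base class with onWriteEvent(Connection,
> >>>>>>>>>> Buffer, long) and onCloseEvent(Connection) callbacks, and 'log'
> >>>>>>>>>> stands for whatever logger you use):
> >>>>>>>>>>
> >>>>>>>>>>     connection.getMonitoringConfig().addProbes(new ConnectionProbe.Adapter() {
> >>>>>>>>>>         @Override
> >>>>>>>>>>         public void onWriteEvent(Connection connection, Buffer data, long size) {
> >>>>>>>>>>             // called when Grizzly reports 'size' bytes written on this connection
> >>>>>>>>>>             log.info("{}: wrote {} bytes at {}",
> >>>>>>>>>>                     connection.getPeerAddress(), size, System.currentTimeMillis());
> >>>>>>>>>>         }
> >>>>>>>>>>
> >>>>>>>>>>         @Override
> >>>>>>>>>>         public void onCloseEvent(Connection connection) {
> >>>>>>>>>>             log.info("{}: closed at {}",
> >>>>>>>>>>                     connection.getPeerAddress(), System.currentTimeMillis());
> >>>>>>>>>>         }
> >>>>>>>>>>     });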
> >>>>>>>>> I was logging in my code where I start to process the request, and
> >>>>>>>>> that already happened with quite a delay. Unfortunately, my workload
> >>>>>>>>> nodes are turned off till the end of the week as we update the cooling
> >>>>>>>>> infrastructure in our data center, but I will continue the
> >>>>>>>>> investigation next week. Is there any best practice for the ratio
> >>>>>>>>> between the number of selector and worker threads?
> >>>>>>>> By default we'd suggest selector-threads-count=CPU-count and
> >>>>>>>> worker-threads-count=CPU-count*2.
> >>>>>>>> I'd also recommend using SameThreadStrategy (no worker threads)
> >>>>>>>> where it's possible. For example, if the app logic doesn't have any
> >>>>>>>> blocking or waiting kind of code, there is no reason to pay for a
> >>>>>>>> thread context switch.
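> >>>>>>>>
> >>>>>>>> In code that is a single call on the transport builder (a sketch;
> >>>>>>>> 'builder' stands for the TCPNIOTransportBuilder used to create the
> >>>>>>>> transport):
> >>>>>>>>
> >>>>>>>>     builder.setIOStrategy(SameThreadIOStrategy.getInstance());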
> >>>>>>>>
> >>>>>>>>> BTW, what do you think about proposed change:
> >>>>>>>>> https://github.com/GrizzlyNIO/grizzly-mirror/pull/1
> >>>>>>>> Looks good.
> >>>>>>>> Will apply it.
> >>>>>>>>
> >>>>>>>> Thank you!
> >>>>>>>>
> >>>>>>>> WBR,
> >>>>>>>> Alexey.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Tigran.
> >>>>>>>>>
> >>>>>>>>>> Thanks.
> >>>>>>>>>>
> >>>>>>>>>> WBR,
> >>>>>>>>>> Alexey.
> >>>>>>>>>>
> >>>>>>>>>>> Looking at the request log I can see that request processing is
> >>>>>>>>>>> not equally distributed.
> >>>>>>>>>>> Some connections are processed quite actively, while others are not.
> >>>>>>>>>>> Is there a way I can see the selector threads' activity?
> >>>>>>>>>>>
> >>>>>>>>>>> Tigran.
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks.
> >>>>>>>>>>>>
> >>>>>>>>>>>> WBR,
> >>>>>>>>>>>> Alexey.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 26.02.14 14:06, Mkrtchyan, Tigran wrote:
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> we observe that grizzly decides to close connections under
> >>>>>>>>>>>>> high load.
> >>>>>>>>>>>>> Any idea how to find the reason?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Tigran.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> java.io.IOException: Connection closed
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.asyncqueue.TaskQueue.onClose(TaskQueue.java:302)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.nio.AbstractNIOAsyncQueueWriter.onClose(AbstractNIOAsyncQueueWriter.java:591)
> >>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.nio.transport.TCPNIOTransport.closeConnection(TCPNIOTransport.java:389)
> >>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.nio.NIOConnection$2.run(NIOConnection.java:439)
> >>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.nio.DefaultSelectorHandler$RunnableTask.run(DefaultSelectorHandler.java:489)
> >>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.nio.DefaultSelectorHandler.processPendingTaskQueue(DefaultSelectorHandler.java:295)
> >>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.nio.DefaultSelectorHandler.processPendingTasks(DefaultSelectorHandler.java:286)
> >>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.nio.DefaultSelectorHandler.preSelect(DefaultSelectorHandler.java:100)
> >>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.nio.SelectorRunner.doSelect(SelectorRunner.java:326)
> >>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.nio.SelectorRunner.run(SelectorRunner.java:271)
> >>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:565)
> >>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:545)
> >>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
> >>>>>>>>>>>>> Caused by: java.lang.Exception: Close stack trace
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.nio.NIOConnection.close0(NIOConnection.java:426)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.nio.transport.TCPNIOConnection.close0(TCPNIOConnection.java:292)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.nio.transport.TCPNIOTransport.read(TCPNIOTransport.java:648)
> >>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.nio.transport.TCPNIOTransportFilter.handleRead(TCPNIOTransportFilter.java:75)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.filterchain.TransportFilter.handleRead(TransportFilter.java:173)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:119)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:288)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:206)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:136)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:114)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:77)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:546)
> >>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:113)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:115)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:55)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>> at
> >>>>>>>>>>>>> org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:135)
> >>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>
>
>