dev@grizzly.java.net

Re: closed connection under high load

From: Mkrtchyan, Tigran <tigran.mkrtchyan_at_desy.de>
Date: Wed, 12 Mar 2014 22:28:03 +0100 (CET)

----- Original Message -----
> From: "Oleksiy Stashok" <oleksiy.stashok_at_oracle.com>
> To: dev_at_grizzly.java.net
> Sent: Wednesday, March 12, 2014 9:48:59 PM
> Subject: Re: closed connection under high load
>
> Hi Tigran,
>
> On 12.03.14 01:17, Mkrtchyan, Tigran wrote:
> > I will keep looking for the source of the problem.
> Ok.
>
> >> What's the CPU load you observe?
> > The CPU load is ~0.4. There are 24 cores and only 2 of them
> > get some load, one at 14% (selector loop?) and another one at ~5%.
> > This is with 6 selector threads and 48 worker threads. The number of
> > clients is 32. I get 120 MB/s, which is the network limit.
> Yeah, the CPU load is very small. Try to monitor the ConnectionProbe events
> onIOEventReadyEvent, onIOEventEnableEvent and onIOEventDisableEvent
> for IOEvent.WRITE. It's interesting to see which one of these probe
> events gets called last before the delay.
>
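
I will wire in a probe for exactly those events. Roughly what I plan to add
(only a sketch; I am assuming the ConnectionProbe.Adapter base class and the
Connection#getMonitoringConfig().addProbes(...) call behave as in
grizzly-framework 2.3, and the logger is just a placeholder):

    import org.glassfish.grizzly.Connection;
    import org.glassfish.grizzly.ConnectionProbe;
    import org.glassfish.grizzly.IOEvent;
    import org.slf4j.Logger;

    // Log a timestamped line for every WRITE-related probe event, so we can
    // see which of the three events fires last before the delay starts.
    static void traceWriteEvents(final Connection<?> connection, final Logger log) {
        connection.getMonitoringConfig().addProbes(new ConnectionProbe.Adapter() {
            @Override
            public void onIOEventReadyEvent(Connection c, IOEvent event) {
                if (event == IOEvent.WRITE) {
                    log.info("{} WRITE ready on {}", System.currentTimeMillis(), c);
                }
            }

            @Override
            public void onIOEventEnableEvent(Connection c, IOEvent event) {
                if (event == IOEvent.WRITE) {
                    log.info("{} WRITE enabled on {}", System.currentTimeMillis(), c);
                }
            }

            @Override
            public void onIOEventDisableEvent(Connection c, IOEvent event) {
                if (event == IOEvent.WRITE) {
                    log.info("{} WRITE disabled on {}", System.currentTimeMillis(), c);
                }
            }
        });
    }
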
Another observation: when I watch the threads with VisualVM, I can see that
with async writes the threads have long sleep phases, while with sync writes
the running phases dominate.


>
> > BTW, I believe you need to apply the patch which fixes the selector runners
> > count.
> IMO we have to rework the patch a bit, because the SelectorThreadPool might
> be set by the user and also used to run the user's tasks (not a good idea, but
> still), so we can't rely on its max threads count, because it can be a
> very large value. What I propose is to log a warning at Transport start
> in case the SelectorThreadPool's min threads count is less than the
> selector runners count.
>

I will look at it. But a warning is too weak for my taste, as it is a clearly
broken situation: connections get assigned to a SelectorRunner that will never
get a thread. As an option, we could set the selector runners count to the
smaller of the current value and the SelectorThreadPool's min thread count.
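
Until that is fixed, a caller can at least avoid the mismatch by hand. This is
roughly what I mean (a sketch only; I am assuming the NIOTransportBuilder
setters setSelectorThreadPoolConfig/setSelectorRunnersCount and the
ThreadPoolConfig accessors from 2.3, and the pool sizes are made-up numbers):

    import org.glassfish.grizzly.nio.transport.TCPNIOTransport;
    import org.glassfish.grizzly.nio.transport.TCPNIOTransportBuilder;
    import org.glassfish.grizzly.threadpool.ThreadPoolConfig;

    // Selector pool with 6 threads (example value).
    ThreadPoolConfig selectorPool = ThreadPoolConfig.defaultConfig().copy()
            .setCorePoolSize(6)
            .setMaxPoolSize(6);

    // Never ask for more selector runners than the selector pool can actually
    // run: clamp the default (one runner per core) to the pool's core size.
    int selectorRunners = Math.min(
            Runtime.getRuntime().availableProcessors(),
            selectorPool.getCorePoolSize());

    TCPNIOTransport transport = TCPNIOTransportBuilder.newInstance()
            .setSelectorThreadPoolConfig(selectorPool)
            .setSelectorRunnersCount(selectorRunners)
            .build();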

Tigran.

> What do you think?
>
> Thanks.
>
> WBR,
> Alexey.
> >
> > Thanks,
> > Tigran.
> >
> >> Thanks.
> >>
> >> WBR,
> >> Alexey.
> >>
> >> On 11.03.14 08:00, Mkrtchyan, Tigran wrote:
> >>> It looks like the reply is stored in the async write queue.
> >>> This is what I can see from the logging of my requests:
> >>>
> >>> [root_at_dcache-lab002 ~]# grep 0xda61ff90
> >>> /var/log/dcache/dcache-lab002Domain.log
> >>> 11 Mar 2014 15:25:16 (dcache-lab002-A) processing RPC message:
> >>> xid=0xda61ff90 remote=/131.169.161.125:847
> >>> 11 Mar 2014 15:25:16 (dcache-lab002-A) dispatching message 0xda61ff90
> >>> 11 Mar 2014 15:25:16 (dcache-lab002-A) ready to reply Done message
> >>> 0xda61ff90
> >>> 11 Mar 2014 15:25:16 (dcache-lab002-A) dispatching Done message
> >>> 0xda61ff90
> >>>
> >>> this is the point after FilterChainContext#write
> >>>
> >>> 11 Mar 2014 15:25:23 (dcache-lab002-A) Failed to send RPC message:
> >>> xid=0xda61ff90 remote=/131.169.161.125:847 : Connection closed
> >>>
> >>> and this is the point where CompletionHandler#failed is called
> >>>
> >>> The request was 7 sec in the queue.
> >>> For now I have changed the write call to be blocking:
> >>>
> >>> _context.write(_context.getAddress(), buffer,
> >>>         new EmptyCompletionHandler<WriteResult>() {
> >>>             @Override
> >>>             public void failed(Throwable throwable) {
> >>>                 _log.error("Failed to send RPC message: xid=0x{} remote={} : {}",
> >>>                         Integer.toHexString(buffer.getInt(0)),
> >>>                         _connection.getPeerAddress(),
> >>>                         throwable.getMessage());
> >>>             }
> >>>         }, true);
> >>>
> >>> The error is gone. But I want to understand why the reply was queued for so long.
> >>>
> >>> Tigran.
> >>>
> >>> ----- Original Message -----
> >>>> From: "Tigran Mkrtchyan" <tigran.mkrtchyan_at_desy.de>
> >>>> To: dev_at_grizzly.java.net
> >>>> Sent: Tuesday, March 11, 2014 8:57:32 AM
> >>>> Subject: Re: closed connection under high load
> >>>>
> >>>>
> >>>>
> >>>> Unfortunately, this does not fix the delayed reply issue.
> >>>> So I keep digging....
> >>>>
> >>>> Tigran.
> >>>>
> >>>> ----- Original Message -----
> >>>>> From: "Tigran Mkrtchyan" <tigran.mkrtchyan_at_desy.de>
> >>>>> To: dev_at_grizzly.java.net
> >>>>> Sent: Monday, March 10, 2014 4:49:38 PM
> >>>>> Subject: Re: closed connection under high load
> >>>>>
> >>>>>
> >>>>> Ok, it looks like I found it.
> >>>>> NIOTransport#startSelectorRunners() starts as many
> >>>>> SelectorRunners as the node has cores. However,
> >>>>> the thread pool that runs them is limited to the
> >>>>> number of threads configured with
> >>>>> NIOTransportBuilder#setSelectorThreadPoolConfig.
> >>>>> If the numbers do not match, or setSelectorThreadPoolConfig
> >>>>> configures fewer threads after the SelectorRunners were already created,
> >>>>> some of them will never be executed.
> >>>>>
> >>>>>
> >>>>> Two small fixes attached.
> >>>>>
> >>>>> Tigran.
> >>>>>
> >>>>> ----- Original Message -----
> >>>>>> From: "Tigran Mkrtchyan" <tigran.mkrtchyan_at_desy.de>
> >>>>>> To: dev_at_grizzly.java.net
> >>>>>> Sent: Monday, March 10, 2014 2:51:57 PM
> >>>>>> Subject: Re: closed connection under high load
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Hi Alexey,
> >>>>>>
> >>>>>> probably I do not really understand how Grizzly works,
> >>>>>> but this is what I observe: a selector thread has only
> >>>>>> one connection in its selector. As a result, only events from a
> >>>>>> single connection get processed.
> >>>>>>
> >>>>>> Is it expected behavior?
> >>>>>>
> >>>>>> Tigran.
> >>>>>>
> >>>>>> ----- Original Message -----
> >>>>>>> From: "Tigran Mkrtchyan" <tigran.mkrtchyan_at_desy.de>
> >>>>>>> To: dev_at_grizzly.java.net
> >>>>>>> Sent: Monday, March 10, 2014 9:10:07 AM
> >>>>>>> Subject: Re: closed connection under high load
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> ----- Original Message -----
> >>>>>>>> From: "Oleksiy Stashok" <oleksiy.stashok_at_oracle.com>
> >>>>>>>> To: dev_at_grizzly.java.net
> >>>>>>>> Sent: Friday, March 7, 2014 10:50:15 PM
> >>>>>>>> Subject: Re: closed connection under high load
> >>>>>>>>
> >>>>>>>> Hi Tigran,
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 07.03.14 07:36, Mkrtchyan, Tigran wrote:
> >>>>>>>>> It looks like the IO is dominated by a single connection.
> >>>>>>>>> I monitor three probes: onWriteEvent, onReadEvent and
> >>>>>>>>> onIOEventReadyEvent. While I see ACCEPT from all clients,
> >>>>>>>>> READ and WRITE come only from a single client.
> >>>>>>>> Looks like your app is made in a way that it can't use
> >>>>>>>> SameThreadStrategy. If you can share your code and prepare the test,
> >>>>>>>> I can take a look and try to find out why it doesn't work with the
> >>>>>>>> SameThreadStrategy.
> >>>>>>> Hi Alexey,
> >>>>>>>
> >>>>>>> the code is available on GitHub:
> >>>>>>>
> >>>>>>> https://github.com/kofemann/oncrpc4j
> >>>>>>>
> >>>>>>> You probably need to have a look at
> >>>>>>>
> >>>>>>> https://github.com/kofemann/oncrpc4j/blob/master/oncrpc4j-core/src/main/java/org/dcache/xdr/OncRpcSvc.java
> >>>>>>>
> >>>>>>> and
> >>>>>>>
> >>>>>>> https://github.com/kofemann/oncrpc4j/blob/master/oncrpc4j-core/src/main/java/org/dcache/xdr/GrizzlyXdrTransport.java
> >>>>>>>
> >>>>>>> possibly
> >>>>>>>
> >>>>>>> https://github.com/kofemann/oncrpc4j/blob/master/oncrpc4j-core/src/main/java/org/dcache/xdr/RpcDispatcher.java
> >>>>>>>
> >>>>>>> I will try to prepare a simple client to produce load without running
> >>>>>>> the full NFS server.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Tigran.
> >>>>>>>
> >>>>>>>> WBR,
> >>>>>>>> Alexey.
> >>>>>>>>
> >>>>>>>>> Tigran.
> >>>>>>>>>
> >>>>>>>>> ----- Original Message -----
> >>>>>>>>>> From: "Oleksiy Stashok" <oleksiy.stashok_at_oracle.com>
> >>>>>>>>>> To: dev_at_grizzly.java.net
> >>>>>>>>>> Sent: Friday, March 7, 2014 2:39:22 AM
> >>>>>>>>>> Subject: Re: closed connection under high load
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 06.03.14 01:29, Mkrtchyan, Tigran wrote:
> >>>>>>>>>>> ----- Original Message -----
> >>>>>>>>>>>> From: "Oleksiy Stashok" <oleksiy.stashok_at_oracle.com>
> >>>>>>>>>>>> To: dev_at_grizzly.java.net
> >>>>>>>>>>>> Sent: Thursday, March 6, 2014 2:35:00 AM
> >>>>>>>>>>>> Subject: Re: closed connection under high load
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Tigran,
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Tigran,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> looks like the connection is getting closed because of an IOException
> >>>>>>>>>>>>>> or -1 returned during channel.read(), which is expected.
> >>>>>>>>>>>>>> So IMO it's nothing decided by Grizzly; it's either the other side
> >>>>>>>>>>>>>> or the local OS that closed the connection.
> >>>>>>>>>>>>> Yes, indeed, the client closes the connection. But it looks like the
> >>>>>>>>>>>>> reason is that the reply from the server was sent 15 sec after the
> >>>>>>>>>>>>> request arrived. The filter itself took at most 800 ms. Any ideas how
> >>>>>>>>>>>>> to trace where the rest of the time was spent?
> >>>>>>>>>>>> You can register a ConnectionProbe for each Connection:
> >>>>>>>>>>>>
> >>>>>>>>>>>>     connection.getMonitoringConfig().addProbes(new ConnectionProbe() {
> >>>>>>>>>>>>         .....
> >>>>>>>>>>>>     });
> >>>>>>>>>>>>
> >>>>>>>>>>>> and monitor the different Connection events and the times when they
> >>>>>>>>>>>> happened.
> >>>>>>>>>>>> It might be interesting to monitor onWriteEvent() and see if the write
> >>>>>>>>>>>> delay occurs at the Grizzly/Java or the OS level. For example, Grizzly
> >>>>>>>>>>>> may report that socketChannel.write(ByteBuffer) accepted the ByteBuffer
> >>>>>>>>>>>> while the peer didn't get the data, which means the delay happens at
> >>>>>>>>>>>> the OS layer.
> >>>>>>>>>>> I was logging in my code where I start to process the request, and that
> >>>>>>>>>>> point was reached with quite a delay. Unfortunately, my workload nodes
> >>>>>>>>>>> are turned off till the end of the week while we update the cooling
> >>>>>>>>>>> infrastructure in our data center, but I will continue the investigation
> >>>>>>>>>>> next week. Is there any best practice for the ratio between the number
> >>>>>>>>>>> of selector threads and worker threads?
> >>>>>>>>>> By default we'd suggest selector-threads-count=CPU-count and
> >>>>>>>>>> worker-threads-count=CPU-count*2.
> >>>>>>>>>> I'd also recommend using SameThreadStrategy (no worker threads) where
> >>>>>>>>>> it's possible. For example, if the app logic doesn't have any blocking
> >>>>>>>>>> or waiting kind of code, there is no reason to pay for a thread context
> >>>>>>>>>> switch.
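
For reference, this is how I read that recommendation as a transport setup (a
sketch only; I am assuming SameThreadIOStrategy and the TCPNIOTransportBuilder
methods from grizzly-framework 2.3):

    import org.glassfish.grizzly.nio.transport.TCPNIOTransport;
    import org.glassfish.grizzly.nio.transport.TCPNIOTransportBuilder;
    import org.glassfish.grizzly.strategies.SameThreadIOStrategy;

    // One selector runner per CPU and no worker pool: all filter-chain work
    // runs on the selector threads, so the application logic must never block.
    int cpus = Runtime.getRuntime().availableProcessors();
    TCPNIOTransport transport = TCPNIOTransportBuilder.newInstance()
            .setSelectorRunnersCount(cpus)
            .setIOStrategy(SameThreadIOStrategy.getInstance())
            .build();
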
> >>>>>>>>>>
> >>>>>>>>>>> BTW, what do you think about the proposed change:
> >>>>>>>>>>> https://github.com/GrizzlyNIO/grizzly-mirror/pull/1
> >>>>>>>>>> Looks good.
> >>>>>>>>>> Will apply it.
> >>>>>>>>>>
> >>>>>>>>>> Thank you!
> >>>>>>>>>>
> >>>>>>>>>> WBR,
> >>>>>>>>>> Alexey.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> Tigran.
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks.
> >>>>>>>>>>>>
> >>>>>>>>>>>> WBR,
> >>>>>>>>>>>> Alexey.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Looking at the requests log I can see that request processing is not
> >>>>>>>>>>>>> equally distributed.
> >>>>>>>>>>>>> Some connections are processed quite actively, while others are not.
> >>>>>>>>>>>>> Is there a way I can see the selector threads' activity?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Tigran.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> WBR,
> >>>>>>>>>>>>>> Alexey.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 26.02.14 14:06, Mkrtchyan, Tigran wrote:
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> we observe that Grizzly decides to close connections under high
> >>>>>>>>>>>>>>> load.
> >>>>>>>>>>>>>>> Any idea how to find the reason?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>> Tigran.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> java.io.IOException: Connection closed
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.asyncqueue.TaskQueue.onClose(TaskQueue.java:302) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.nio.AbstractNIOAsyncQueueWriter.onClose(AbstractNIOAsyncQueueWriter.java:591) [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.nio.transport.TCPNIOTransport.closeConnection(TCPNIOTransport.java:389) [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.nio.NIOConnection$2.run(NIOConnection.java:439) [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.nio.DefaultSelectorHandler$RunnableTask.run(DefaultSelectorHandler.java:489) [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.nio.DefaultSelectorHandler.processPendingTaskQueue(DefaultSelectorHandler.java:295) [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.nio.DefaultSelectorHandler.processPendingTasks(DefaultSelectorHandler.java:286) [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.nio.DefaultSelectorHandler.preSelect(DefaultSelectorHandler.java:100) [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.nio.SelectorRunner.doSelect(SelectorRunner.java:326) [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.nio.SelectorRunner.run(SelectorRunner.java:271) [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:565) [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:545) [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
> >>>>>>>>>>>>>>> Caused by: java.lang.Exception: Close stack trace
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.nio.NIOConnection.close0(NIOConnection.java:426) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.nio.transport.TCPNIOConnection.close0(TCPNIOConnection.java:292) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.nio.transport.TCPNIOTransport.read(TCPNIOTransport.java:648) [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.nio.transport.TCPNIOTransportFilter.handleRead(TCPNIOTransportFilter.java:75) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.filterchain.TransportFilter.handleRead(TransportFilter.java:173) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:119) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:288) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:206) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:136) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:114) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:77) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:546) [grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:113) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:115) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:55) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>>>>>>>>>>>>>>     at org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:135) ~[grizzly-framework-2.3.5.jar:2.3.5]
> >>
>
>