dev@grizzly.java.net

Re: closed connection under high load

From: Oleksiy Stashok <oleksiy.stashok_at_oracle.com>
Date: Wed, 12 Mar 2014 13:48:59 -0700

Hi Tigran,

On 12.03.14 01:17, Mkrtchyan, Tigran wrote:
> I will keep looking for the source of the problem.
Ok.

>> What's the CPU load you observe?
> The CPU load is ~0.4. There are 24 cores and only 2 of them
> get some load: one at ~14% (selector loop?) and another at ~5%.
> This is with 6 selector threads and 48 worker threads. The number of
> clients is 32. I get 120 MB/s, which is the network limit.
Yeah, the CPU load is very small. Try to monitor the ConnectionProbe events
onIOEventReadyEvent, onIOEventEnableEvent, and onIOEventDisableEvent for
IOEvent.WRITE. It would be interesting to see which of these probe events
gets called last before the delay.
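Something along these lines, for example (just a sketch, assuming the 2.3.x
ConnectionProbe.Adapter API; the helper name is made up, and you'd replace the
System.out calls with your own logger):

    import org.glassfish.grizzly.Connection;
    import org.glassfish.grizzly.ConnectionProbe;
    import org.glassfish.grizzly.IOEvent;

    final class WriteIOEventProbe {
        // Sketch: timestamp every WRITE-related probe event so it is visible
        // which of them fires last before the delayed reply.
        static void install(final Connection<?> connection) {
            connection.getMonitoringConfig().addProbes(new ConnectionProbe.Adapter() {
                @Override
                public void onIOEventReadyEvent(Connection c, IOEvent ioEvent) {
                    if (ioEvent == IOEvent.WRITE) {
                        System.out.println(System.currentTimeMillis() + " WRITE ready: " + c);
                    }
                }

                @Override
                public void onIOEventEnableEvent(Connection c, IOEvent ioEvent) {
                    if (ioEvent == IOEvent.WRITE) {
                        System.out.println(System.currentTimeMillis() + " WRITE enabled: " + c);
                    }
                }

                @Override
                public void onIOEventDisableEvent(Connection c, IOEvent ioEvent) {
                    if (ioEvent == IOEvent.WRITE) {
                        System.out.println(System.currentTimeMillis() + " WRITE disabled: " + c);
                    }
                }
            });
        }
    }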


> BTW, I believe you need to apply the patch which fixes the selector runners count.
IMO we have to rework the patch a bit, because the SelectorThreadPool might
be set by the user and also used to run the user's tasks (not a good idea,
but still possible), so we can't rely on its max thread count, which can be
a very large value. What I propose instead is to log a warning at the
Transport start phase if the SelectorThreadPool's min thread count is less
than the selector runners count.
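Roughly something like this, run during the Transport start phase (just a
sketch; I'm assuming getSelectorRunnersCount() and getKernelThreadPoolConfig()
are the right accessors to compare, and the class name and exact message are
only placeholders):

    import java.util.logging.Logger;

    import org.glassfish.grizzly.nio.NIOTransport;
    import org.glassfish.grizzly.threadpool.ThreadPoolConfig;

    final class SelectorPoolCheck {
        private static final Logger LOGGER =
                Logger.getLogger(SelectorPoolCheck.class.getName());

        // Warn if the selector thread pool cannot host one thread per SelectorRunner.
        static void warnIfTooSmall(final NIOTransport transport) {
            final ThreadPoolConfig selectorPoolConfig = transport.getKernelThreadPoolConfig();
            final int runnersCount = transport.getSelectorRunnersCount();

            if (selectorPoolConfig != null
                    && selectorPoolConfig.getCorePoolSize() < runnersCount) {
                LOGGER.warning("SelectorThreadPool core size ("
                        + selectorPoolConfig.getCorePoolSize()
                        + ") is less than the selector runners count ("
                        + runnersCount + "); some SelectorRunners will never run");
            }
        }
    }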

What do you think?

Thanks.

WBR,
Alexey.
>
> Thanks,
> Tigran.
>
>> Thanks.
>>
>> WBR,
>> Alexey.
>>
>> On 11.03.14 08:00, Mkrtchyan, Tigran wrote:
>>> It looks like the reply is stored in the async write queue.
>>> This is what I can see from the logging of my requests:
>>>
>>> [root_at_dcache-lab002 ~]# grep 0xda61ff90
>>> /var/log/dcache/dcache-lab002Domain.log
>>> 11 Mar 2014 15:25:16 (dcache-lab002-A) processing RPC message:
>>> xid=0xda61ff90 remote=/131.169.161.125:847
>>> 11 Mar 2014 15:25:16 (dcache-lab002-A) dispatching message 0xda61ff90
>>> 11 Mar 2014 15:25:16 (dcache-lab002-A) ready to reply Done message
>>> 0xda61ff90
>>> 11 Mar 2014 15:25:16 (dcache-lab002-A) dispatching Done message 0xda61ff90
>>>
>>> this is the point after FilterChainContext#write
>>>
>>> 11 Mar 2014 15:25:23 (dcache-lab002-A) Failed to send RPC message:
>>> xid=0xda61ff90 remote=/131.169.161.125:847 : Connection closed
>>>
>>> and this is the point where CompletionHandler#failed is called
>>>
>>> The request spent 7 seconds in the queue.
>>> For now I have changed the write call to be blocking:
>>>
>>>     _context.write(_context.getAddress(), buffer,
>>>             new EmptyCompletionHandler<WriteResult>() {
>>>                 @Override
>>>                 public void failed(Throwable throwable) {
>>>                     _log.error("Failed to send RPC message: xid=0x{} remote={} : {}",
>>>                             Integer.toHexString(buffer.getInt(0)),
>>>                             _connection.getPeerAddress(),
>>>                             throwable.getMessage());
>>>                 }
>>>             }, true);
>>>
>>> The error is gone. But I want to understand why the reply was queued for so long.
>>>
>>> Tigran.
>>>
>>> ----- Original Message -----
>>>> From: "Tigran Mkrtchyan" <tigran.mkrtchyan_at_desy.de>
>>>> To: dev_at_grizzly.java.net
>>>> Sent: Tuesday, March 11, 2014 8:57:32 AM
>>>> Subject: Re: closed connection under high load
>>>>
>>>>
>>>>
>>>> Unfortunately, this does not fix the delayed reply issue.
>>>> So I keep digging....
>>>>
>>>> Tigran.
>>>>
>>>> ----- Original Message -----
>>>>> From: "Tigran Mkrtchyan" <tigran.mkrtchyan_at_desy.de>
>>>>> To: dev_at_grizzly.java.net
>>>>> Sent: Monday, March 10, 2014 4:49:38 PM
>>>>> Subject: Re: closed connection under high load
>>>>>
>>>>>
>>>>> Ok, it looks like I found it.
>>>>> NIOTransport#startSelectorRunners() starts as many SelectorRunners
>>>>> as the node has cores. However, the thread pool used to run them is
>>>>> limited to the number of threads configured with
>>>>> NIOTransportBuilder#setSelectorThreadPoolConfig.
>>>>> If the numbers do not match, or setSelectorThreadPoolConfig configures
>>>>> fewer threads than the SelectorRunners already created, some of them
>>>>> will never be executed.
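(For reference, a configuration sketch that keeps the two counts aligned,
assuming the standard 2.3.x TCPNIOTransportBuilder API; the class name and the
thread count are only examples:)

    import org.glassfish.grizzly.nio.transport.TCPNIOTransport;
    import org.glassfish.grizzly.nio.transport.TCPNIOTransportBuilder;
    import org.glassfish.grizzly.threadpool.ThreadPoolConfig;

    public final class AlignedSelectorConfig {
        public static void main(String[] args) throws Exception {
            // Give the selector thread pool exactly one thread per SelectorRunner.
            final int selectorRunners = Runtime.getRuntime().availableProcessors();

            final TCPNIOTransport transport = TCPNIOTransportBuilder.newInstance()
                    .setSelectorRunnersCount(selectorRunners)
                    .setSelectorThreadPoolConfig(ThreadPoolConfig.defaultConfig()
                            .setCorePoolSize(selectorRunners)
                            .setMaxPoolSize(selectorRunners))
                    .build();

            transport.start();
        }
    }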
>>>>>
>>>>>
>>>>> Two small fixes attached.
>>>>>
>>>>> Tigran.
>>>>>
>>>>> ----- Original Message -----
>>>>>> From: "Tigran Mkrtchyan" <tigran.mkrtchyan_at_desy.de>
>>>>>> To: dev_at_grizzly.java.net
>>>>>> Sent: Monday, March 10, 2014 2:51:57 PM
>>>>>> Subject: Re: closed connection under high load
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Alexey,
>>>>>>
>>>>>> probably I do not really understand how Grizzly works,
>>>>>> but this is what I observe: a selector thread has only
>>>>>> one connection in its selector. As a result, only events from a
>>>>>> single connection get processed.
>>>>>>
>>>>>> Is this the expected behavior?
>>>>>>
>>>>>> Tigran.
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Tigran Mkrtchyan" <tigran.mkrtchyan_at_desy.de>
>>>>>>> To: dev_at_grizzly.java.net
>>>>>>> Sent: Monday, March 10, 2014 9:10:07 AM
>>>>>>> Subject: Re: closed connection under high load
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Oleksiy Stashok" <oleksiy.stashok_at_oracle.com>
>>>>>>>> To: dev_at_grizzly.java.net
>>>>>>>> Sent: Friday, March 7, 2014 10:50:15 PM
>>>>>>>> Subject: Re: closed connection under high load
>>>>>>>>
>>>>>>>> Hi Tigran,
>>>>>>>>
>>>>>>>>
>>>>>>>> On 07.03.14 07:36, Mkrtchyan, Tigran wrote:
>>>>>>>>> It looks like the IO is dominated by a single connection.
>>>>>>>>> I monitor three probes: onWriteEvent, onReadEvent and
>>>>>>>>> onIOEventReadyEvent. While I see ACCEPT from all clients,
>>>>>>>>> READ and WRITE come only from a single client.
>>>>>>>> Looks like your app is made in a way that it can't use
>>>>>>>> SameThreadStrategy. If you can share your code and prepare the test,
>>>>>>>> I can take a look and try to find out why it doesn't work with the
>>>>>>>> SameThreadStrategy.
>>>>>>> Hi Alexey,
>>>>>>>
>>>>>>> the code is available on the github:
>>>>>>>
>>>>>>> https://github.com/kofemann/oncrpc4j
>>>>>>>
>>>>>>> You probably need to have a look at
>>>>>>>
>>>>>>> https://github.com/kofemann/oncrpc4j/blob/master/oncrpc4j-core/src/main/java/org/dcache/xdr/OncRpcSvc.java
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>> https://github.com/kofemann/oncrpc4j/blob/master/oncrpc4j-core/src/main/java/org/dcache/xdr/GrizzlyXdrTransport.java
>>>>>>>
>>>>>>> possibly
>>>>>>>
>>>>>>> https://github.com/kofemann/oncrpc4j/blob/master/oncrpc4j-core/src/main/java/org/dcache/xdr/RpcDispatcher.java
>>>>>>>
>>>>>>> I will try to prepare a simple client to produce load without
>>>>>>> running the full NFS server.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Tigran.
>>>>>>>
>>>>>>>> WBR,
>>>>>>>> Alexey.
>>>>>>>>
>>>>>>>>> Tigran.
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> From: "Oleksiy Stashok" <oleksiy.stashok_at_oracle.com>
>>>>>>>>>> To: dev_at_grizzly.java.net
>>>>>>>>>> Sent: Friday, March 7, 2014 2:39:22 AM
>>>>>>>>>> Subject: Re: closed connection under high load
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 06.03.14 01:29, Mkrtchyan, Tigran wrote:
>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>> From: "Oleksiy Stashok" <oleksiy.stashok_at_oracle.com>
>>>>>>>>>>>> To: dev_at_grizzly.java.net
>>>>>>>>>>>> Sent: Thursday, March 6, 2014 2:35:00 AM
>>>>>>>>>>>> Subject: Re: closed connection under high load
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Tigran,
>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Tigran,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> looks like the connection is getting closed because of an
>>>>>>>>>>>>>> IOException or -1 returned during channel.read(), which is
>>>>>>>>>>>>>> expected. So IMO it's nothing decided by Grizzly; either the
>>>>>>>>>>>>>> other side or the local OS closed the connection.
>>>>>>>>>>>>> Yes, indeed, the client closes the connection. But it looks
>>>>>>>>>>>>> like the reason for it is that the reply from the server was
>>>>>>>>>>>>> sent 15 seconds after the request arrived. The filter itself
>>>>>>>>>>>>> took at most 800 ms. Any ideas how to trace where the rest of
>>>>>>>>>>>>> the time was spent?
>>>>>>>>>>>> You can register a ConnectionProbe for each Connection:
>>>>>>>>>>>>
>>>>>>>>>>>>     connection.getMonitoringConfig().addProbes(new ConnectionProbe() {
>>>>>>>>>>>>         .....
>>>>>>>>>>>>     });
>>>>>>>>>>>>
>>>>>>>>>>>> and monitor the different Connection events and the time when
>>>>>>>>>>>> they happened. It might be interesting to monitor onWriteEvent()
>>>>>>>>>>>> and see whether the write delay occurs at the Grizzly/Java or OS
>>>>>>>>>>>> level. For example, if Grizzly reports that
>>>>>>>>>>>> socketChannel.write(ByteBuffer) has accepted the ByteBuffer but
>>>>>>>>>>>> the peer didn't get this data, the delay happens in the OS layer.
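For illustration, an onWriteEvent probe can timestamp the moment Grizzly hands
data to the channel (only a sketch; the class name is made up and the
onWriteEvent(Connection, Buffer, long) signature is assumed from the 2.3.x
ConnectionProbe API):

    import org.glassfish.grizzly.Buffer;
    import org.glassfish.grizzly.Connection;
    import org.glassfish.grizzly.ConnectionProbe;

    final class WriteTimingProbe {
        // If these timestamps are close to the request time but the client still
        // sees the reply late, the delay is below Grizzly (OS/network); if they
        // are late themselves, the data sat in the async write queue.
        static void install(final Connection<?> connection) {
            connection.getMonitoringConfig().addProbes(new ConnectionProbe.Adapter() {
                @Override
                public void onWriteEvent(Connection c, Buffer data, long size) {
                    System.out.println(System.currentTimeMillis()
                            + " wrote " + size + " bytes on " + c);
                }
            });
        }
    }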
>>>>>>>>>>> I was logging in my code where I start to process the request,
>>>>>>>>>>> and this happened with quite a delay. Unfortunately, my workload
>>>>>>>>>>> nodes are turned off until the end of the week while we update
>>>>>>>>>>> the cooling infrastructure in our data center, but I will
>>>>>>>>>>> continue the investigation next week. Is there any best practice
>>>>>>>>>>> for the ratio between the number of selector and worker threads?
>>>>>>>>>> By default we'd suggest selector-threads-count=CPU-count and
>>>>>>>>>> worker-threads-count=CPU-count*2.
>>>>>>>>>> I'd also recommend using SameThreadStrategy (no worker threads)
>>>>>>>>>> where possible. For example, if the app logic doesn't have any
>>>>>>>>>> blocking or waiting kind of code, there is no reason to pay for a
>>>>>>>>>> thread context switch.
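For illustration, wiring that up could look roughly like this (a sketch,
assuming the 2.3.x builder API; the class name is made up and the selector
count follows the rule of thumb above):

    import org.glassfish.grizzly.nio.transport.TCPNIOTransport;
    import org.glassfish.grizzly.nio.transport.TCPNIOTransportBuilder;
    import org.glassfish.grizzly.strategies.SameThreadIOStrategy;

    public final class SameThreadConfig {
        public static void main(String[] args) throws Exception {
            final int cpus = Runtime.getRuntime().availableProcessors();

            // Run the filter chain directly on the selector threads; this is only
            // safe if the application logic never blocks inside the filters.
            final TCPNIOTransport transport = TCPNIOTransportBuilder.newInstance()
                    .setIOStrategy(SameThreadIOStrategy.getInstance())
                    .setSelectorRunnersCount(cpus)
                    .build();

            transport.start();
        }
    }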
>>>>>>>>>>
>>>>>>>>>>> BTW, what do you think about the proposed change:
>>>>>>>>>>> https://github.com/GrizzlyNIO/grizzly-mirror/pull/1
>>>>>>>>>> Looks good.
>>>>>>>>>> Will apply it.
>>>>>>>>>>
>>>>>>>>>> Thank you!
>>>>>>>>>>
>>>>>>>>>> WBR,
>>>>>>>>>> Alexey.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Tigran.
>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> WBR,
>>>>>>>>>>>> Alexey.
>>>>>>>>>>>>
>>>>>>>>>>>>> Looking at the request log I can see that request processing
>>>>>>>>>>>>> is not equally distributed. Some connections are processed
>>>>>>>>>>>>> quite actively, while others are not. Is there a way I can see
>>>>>>>>>>>>> the selector threads' activity?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tigran.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> WBR,
>>>>>>>>>>>>>> Alexey.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 26.02.14 14:06, Mkrtchyan, Tigran wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> we observe that Grizzly decides to close connections under
>>>>>>>>>>>>>>> high load. Any idea how to find the reason?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Tigran.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> java.io.IOException: Connection closed
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.asyncqueue.TaskQueue.onClose(TaskQueue.java:302)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.nio.AbstractNIOAsyncQueueWriter.onClose(AbstractNIOAsyncQueueWriter.java:591)
>>>>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.nio.transport.TCPNIOTransport.closeConnection(TCPNIOTransport.java:389)
>>>>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.nio.NIOConnection$2.run(NIOConnection.java:439)
>>>>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.nio.DefaultSelectorHandler$RunnableTask.run(DefaultSelectorHandler.java:489)
>>>>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.nio.DefaultSelectorHandler.processPendingTaskQueue(DefaultSelectorHandler.java:295)
>>>>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.nio.DefaultSelectorHandler.processPendingTasks(DefaultSelectorHandler.java:286)
>>>>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.nio.DefaultSelectorHandler.preSelect(DefaultSelectorHandler.java:100)
>>>>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.nio.SelectorRunner.doSelect(SelectorRunner.java:326)
>>>>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.nio.SelectorRunner.run(SelectorRunner.java:271)
>>>>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:565)
>>>>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:545)
>>>>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
>>>>>>>>>>>>>>> Caused by: java.lang.Exception: Close stack trace
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.nio.NIOConnection.close0(NIOConnection.java:426)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.nio.transport.TCPNIOConnection.close0(TCPNIOConnection.java:292)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.nio.transport.TCPNIOTransport.read(TCPNIOTransport.java:648)
>>>>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.nio.transport.TCPNIOTransportFilter.handleRead(TCPNIOTransportFilter.java:75)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.filterchain.TransportFilter.handleRead(TransportFilter.java:173)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.filterchain.ExecutorResolver$9.execute(ExecutorResolver.java:119)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.filterchain.DefaultFilterChain.executeFilter(DefaultFilterChain.java:288)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.filterchain.DefaultFilterChain.executeChainPart(DefaultFilterChain.java:206)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.filterchain.DefaultFilterChain.execute(DefaultFilterChain.java:136)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.filterchain.DefaultFilterChain.process(DefaultFilterChain.java:114)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.ProcessorExecutor.execute(ProcessorExecutor.java:77)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:546)
>>>>>>>>>>>>>>> [grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:113)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.run0(WorkerThreadIOStrategy.java:115)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.strategies.WorkerThreadIOStrategy.access$100(WorkerThreadIOStrategy.java:55)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.glassfish.grizzly.strategies.WorkerThreadIOStrategy$WorkerThreadRunnable.run(WorkerThreadIOStrategy.java:135)
>>>>>>>>>>>>>>> ~[grizzly-framework-2.3.5.jar:2.3.5]
>>