users@grizzly.java.net

Re: SelectorRunner uses 100% CPU

From: Oleksiy Stashok <oleksiy.stashok_at_oracle.com>
Date: Sun, 27 Sep 2015 22:12:16 -0700

Hi Marc,


On 23.09.15 02:13, Marc Arens wrote:
> We are still collecting debug infos ourself to analyze what exactly is going
> wrong. Your Fix will probably help us to narrow it down some more. We can't
> share memory dumps until we managed to reproduce it in a testing environment
> without any private data.
I recently found that JProfiler can put together nice GC roots summary
based on java memory dump, maybe other profilers also support that
(unfortunately NB/VisualVM doesn't support this feature). So instead of
sending the heap dump itself you can send the report.

Thanks.

WBR,
Alexey.

>> On 23 September 2015 at 02:38 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>
>> wrote:
>>
>>
>> Hi Marc,
>>
>> I just backported [1] and [2] to 2.2.x branch and published
>> 2.2.23-SNAPSHOT, can you pls try it?
>> You can enable a system property as mentioned in [2] and see if it helps
>> to prevent 100% CPU consumption, but it could be also a consequence of OOM.
>> Once you hit OOM, could you take a memory dump, so we can investigate if
>> it's a leak, or we just hit machine limit.
>>
>> Thanks.
>>
>> WBR,
>> Alexey.
>>
>> [1] https://java.net/jira/browse/GRIZZLY-1712
>> [2] https://java.net/jira/browse/GRIZZLY-1801
>>
>>
>>
>> On 22.09.15 09:58, Marc Arens wrote:
>>> Hey Ryan,
>>>
>>> thanks for the fast reply. Complete system outage(OOM not able to create new
>>> threads) happens ~15 minutes after server start in the worst case. But i
>>> can't
>>> even say when exactly the SelectorRunner starts using so much CPU within
>>> that
>>> time frame. I'd be very thankful for backports of those fixes!
>>>
>>>> On 22 September 2015 at 18:21 Ryan Lubke <ryan.lubke_at_oracle.com> wrote:
>>>>
>>>>
>>>> It's hard to say if you're hitting the issue or not based on the
>>>> provided information.
>>>>
>>>> We can try back-porting the fix as well as changes to force selector
>>>> spin detection.
>>>>
>>>> How often is this occurring?
>>>>
>>>> Thanks,
>>>> -rl
>>>>
>>>>> Marc Arens <mailto:marc.arens_at_open-xchange.com>
>>>>> September 22, 2015 at 10:51
>>>>> Moin,
>>>>>
>>>>> we are seeing system outages at one deployment where one repeating
>>>>> point of the
>>>>> outages is a
>>>>>
>>>>> "Grizzly(1) SelectorRunner" daemon prio=10 tid=0x00007fec31068800
>>>>> nid=0x23e6
>>>>> runnable [0x00007fec6e3aa000]
>>>>> java.lang.Thread.State: RUNNABLE
>>>>> at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>>> at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>>>> at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
>>>>> at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>>>>> - locked <0x00000002c9c4e4a8> (a sun.nio.ch.Util$2)
>>>>> - locked <0x00000002ca5e08e0> (a java.util.Collections$UnmodifiableSet)
>>>>> - locked <0x00000002c9bdc060> (a sun.nio.ch.EPollSelectorImpl)
>>>>> at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>>>>> at
>>>>> org.glassfish.grizzly.nio.DefaultSelectorHandler.select(DefaultSelectorHandler.java:112)
>>>>> at
>>>>> org.glassfish.grizzly.nio.SelectorRunner.doSelect(SelectorRunner.java:325)
>>>>> at org.glassfish.grizzly.nio.SelectorRunner.run(SelectorRunner.java:264)
>>>>> at
>>>>> org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:551)
>>>>> at
>>>>> org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:531)
>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>
>>>>> which keeps a CPU busy at 100% for at least 15 seconds and probably
>>>>> even longer
>>>>> but those 15s is all our debug data covers at the moment.
>>>>>
>>>>> This is a Grizzly 2.2.x on java version "1.7.0_79", OpenJDK Runtime
>>>>> Environment
>>>>> (rhel-2.5.5.3.el6_6-x86_64 u79-b14), OpenJDK 64-Bit Server VM (build
>>>>> 24.79-b02,
>>>>> mixed mode).
>>>>>
>>>>> I found a 2.3.x related bug at
>>>>> https://java.net/jira/browse/GRIZZLY-1712 could
>>>>> it be this one we are hitting although it's 2.2.x? If yes, would it be
>>>>> possible
>>>>> to backport it to 2.2.x? Do you know about any other issues we might
>>>>> be hitting
>>>>> here?
>>>>>
>>>>> Best regards
>>>>> Marc
>>>>>
>>>>> --
>>>>> ...it was evening and it was morning and there were already two ways
>>>>> to store
>>>>> Unicode...