Re: SelectorRunner uses 100% CPU

From: Marc Arens <marc.arens_at_open-xchange.com>
Date: Wed, 23 Sep 2015 11:13:17 +0200 (CEST)

Hello Alexey,

thanks a lot for the backport. 1801 doesn't look like it applies here because
the used jvm is 1.7.0_79, still useful for other scenarios though.
We are still collecting debug infos ourself to analyze what exactly is going
wrong. Your Fix will probably help us to narrow it down some more. We can't
share memory dumps until we managed to reproduce it in a testing environment
without any private data.

> On 23 September 2015 at 02:38 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>
> wrote:
>
>
> Hi Marc,
>
> I just backported [1] and [2] to 2.2.x branch and published
> 2.2.23-SNAPSHOT, can you pls try it?
> You can enable a system property as mentioned in [2] and see if it helps
> to prevent 100% CPU consumption, but it could be also a consequence of OOM.
> Once you hit OOM, could you take a memory dump, so we can investigate if
> it's a leak, or we just hit machine limit.
>
> Thanks.
>
> WBR,
> Alexey.
>
> [1] https://java.net/jira/browse/GRIZZLY-1712
> [2] https://java.net/jira/browse/GRIZZLY-1801
>
>
>
> On 22.09.15 09:58, Marc Arens wrote:
> > Hey Ryan,
> >
> > thanks for the fast reply. Complete system outage(OOM not able to create new
> > threads) happens ~15 minutes after server start in the worst case. But i
> > can't
> > even say when exactly the SelectorRunner starts using so much CPU within
> > that
> > time frame. I'd be very thankful for backports of those fixes!
> >
> >> On 22 September 2015 at 18:21 Ryan Lubke <ryan.lubke_at_oracle.com> wrote:
> >>
> >>
> >> It's hard to say if you're hitting the issue or not based on the
> >> provided information.
> >>
> >> We can try back-porting the fix as well as changes to force selector
> >> spin detection.
> >>
> >> How often is this occurring?
> >>
> >> Thanks,
> >> -rl
> >>
> >>> Marc Arens <mailto:marc.arens_at_open-xchange.com>
> >>> September 22, 2015 at 10:51
> >>> Moin,
> >>>
> >>> we are seeing system outages at one deployment where one repeating
> >>> point of the
> >>> outages is a
> >>>
> >>> "Grizzly(1) SelectorRunner" daemon prio=10 tid=0x00007fec31068800
> >>> nid=0x23e6
> >>> runnable [0x00007fec6e3aa000]
> >>> java.lang.Thread.State: RUNNABLE
> >>> at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>> at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
> >>> at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
> >>> at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
> >>> - locked <0x00000002c9c4e4a8> (a sun.nio.ch.Util$2)
> >>> - locked <0x00000002ca5e08e0> (a java.util.Collections$UnmodifiableSet)
> >>> - locked <0x00000002c9bdc060> (a sun.nio.ch.EPollSelectorImpl)
> >>> at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
> >>> at
> >>> org.glassfish.grizzly.nio.DefaultSelectorHandler.select(DefaultSelectorHandler.java:112)
> >>> at
> >>> org.glassfish.grizzly.nio.SelectorRunner.doSelect(SelectorRunner.java:325)
> >>> at org.glassfish.grizzly.nio.SelectorRunner.run(SelectorRunner.java:264)
> >>> at
> >>> org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:551)
> >>> at
> >>> org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:531)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>>
> >>> which keeps a CPU busy at 100% for at least 15 seconds and probably
> >>> even longer
> >>> but those 15s is all our debug data covers at the moment.
> >>>
> >>> This is a Grizzly 2.2.x on java version "1.7.0_79", OpenJDK Runtime
> >>> Environment
> >>> (rhel-2.5.5.3.el6_6-x86_64 u79-b14), OpenJDK 64-Bit Server VM (build
> >>> 24.79-b02,
> >>> mixed mode).
> >>>
> >>> I found a 2.3.x related bug at
> >>> https://java.net/jira/browse/GRIZZLY-1712 could
> >>> it be this one we are hitting although it's 2.2.x? If yes, would it be
> >>> possible
> >>> to backport it to 2.2.x? Do you know about any other issues we might
> >>> be hitting
> >>> here?
> >>>
> >>> Best regards
> >>> Marc
> >>>
> >>> --
> >>> ...it was evening and it was morning and there were already two ways
> >>> to store
> >>> Unicode...
>