users@grizzly.java.net

Re: SelectorRunner uses 100% CPU

From: Marc Arens <marc.arens_at_open-xchange.com>
Date: Tue, 29 Sep 2015 10:30:38 +0200 (CEST)

Just a little feedback from my side: "... but it could be also a consequence of
OOM..." seems to have been the case. After several days we finally got the
confirmation that the admins didn't manage to set proper ulimits for the
user/server process which led to the OOM. As Grizzly performs pretty solid in
many many other deployments i was already wondering what strange edge case we
have been hitting here.

Thanks again for your help.

> On 28 September 2015 at 07:12 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>
> wrote:
>
>
> Hi Marc,
>
>
> On 23.09.15 02:13, Marc Arens wrote:
> > We are still collecting debug infos ourself to analyze what exactly is going
> > wrong. Your Fix will probably help us to narrow it down some more. We can't
> > share memory dumps until we managed to reproduce it in a testing environment
> > without any private data.
> I recently found that JProfiler can put together nice GC roots summary
> based on java memory dump, maybe other profilers also support that
> (unfortunately NB/VisualVM doesn't support this feature). So instead of
> sending the heap dump itself you can send the report.
>
> Thanks.
>
> WBR,
> Alexey.
>
> >> On 23 September 2015 at 02:38 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>
> >> wrote:
> >>
> >>
> >> Hi Marc,
> >>
> >> I just backported [1] and [2] to 2.2.x branch and published
> >> 2.2.23-SNAPSHOT, can you pls try it?
> >> You can enable a system property as mentioned in [2] and see if it helps
> >> to prevent 100% CPU consumption, but it could be also a consequence of OOM.
> >> Once you hit OOM, could you take a memory dump, so we can investigate if
> >> it's a leak, or we just hit machine limit.
> >>
> >> Thanks.
> >>
> >> WBR,
> >> Alexey.
> >>
> >> [1] https://java.net/jira/browse/GRIZZLY-1712
> >> [2] https://java.net/jira/browse/GRIZZLY-1801
> >>
> >>
> >>
> >> On 22.09.15 09:58, Marc Arens wrote:
> >>> Hey Ryan,
> >>>
> >>> thanks for the fast reply. Complete system outage(OOM not able to create
> >>> new
> >>> threads) happens ~15 minutes after server start in the worst case. But i
> >>> can't
> >>> even say when exactly the SelectorRunner starts using so much CPU within
> >>> that
> >>> time frame. I'd be very thankful for backports of those fixes!
> >>>
> >>>> On 22 September 2015 at 18:21 Ryan Lubke <ryan.lubke_at_oracle.com> wrote:
> >>>>
> >>>>
> >>>> It's hard to say if you're hitting the issue or not based on the
> >>>> provided information.
> >>>>
> >>>> We can try back-porting the fix as well as changes to force selector
> >>>> spin detection.
> >>>>
> >>>> How often is this occurring?
> >>>>
> >>>> Thanks,
> >>>> -rl
> >>>>
> >>>>> Marc Arens <mailto:marc.arens_at_open-xchange.com>
> >>>>> September 22, 2015 at 10:51
> >>>>> Moin,
> >>>>>
> >>>>> we are seeing system outages at one deployment where one repeating
> >>>>> point of the
> >>>>> outages is a
> >>>>>
> >>>>> "Grizzly(1) SelectorRunner" daemon prio=10 tid=0x00007fec31068800
> >>>>> nid=0x23e6
> >>>>> runnable [0x00007fec6e3aa000]
> >>>>> java.lang.Thread.State: RUNNABLE
> >>>>> at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>>>> at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
> >>>>> at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
> >>>>> at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
> >>>>> - locked <0x00000002c9c4e4a8> (a sun.nio.ch.Util$2)
> >>>>> - locked <0x00000002ca5e08e0> (a java.util.Collections$UnmodifiableSet)
> >>>>> - locked <0x00000002c9bdc060> (a sun.nio.ch.EPollSelectorImpl)
> >>>>> at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
> >>>>> at
> >>>>> org.glassfish.grizzly.nio.DefaultSelectorHandler.select(DefaultSelectorHandler.java:112)
> >>>>> at
> >>>>> org.glassfish.grizzly.nio.SelectorRunner.doSelect(SelectorRunner.java:325)
> >>>>> at org.glassfish.grizzly.nio.SelectorRunner.run(SelectorRunner.java:264)
> >>>>> at
> >>>>> org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:551)
> >>>>> at
> >>>>> org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:531)
> >>>>> at java.lang.Thread.run(Thread.java:745)
> >>>>>
> >>>>> which keeps a CPU busy at 100% for at least 15 seconds and probably
> >>>>> even longer
> >>>>> but those 15s is all our debug data covers at the moment.
> >>>>>
> >>>>> This is a Grizzly 2.2.x on java version "1.7.0_79", OpenJDK Runtime
> >>>>> Environment
> >>>>> (rhel-2.5.5.3.el6_6-x86_64 u79-b14), OpenJDK 64-Bit Server VM (build
> >>>>> 24.79-b02,
> >>>>> mixed mode).
> >>>>>
> >>>>> I found a 2.3.x related bug at
> >>>>> https://java.net/jira/browse/GRIZZLY-1712 could
> >>>>> it be this one we are hitting although it's 2.2.x? If yes, would it be
> >>>>> possible
> >>>>> to backport it to 2.2.x? Do you know about any other issues we might
> >>>>> be hitting
> >>>>> here?
> >>>>>
> >>>>> Best regards
> >>>>> Marc
> >>>>>
> >>>>> --
> >>>>> ...it was evening and it was morning and there were already two ways
> >>>>> to store
> >>>>> Unicode...
>