users@grizzly.java.net

Re: Comet handler starts terminating TCP connections with RST?

From: Jeanfrancois Arcand <Jeanfrancois.Arcand_at_Sun.COM>
Date: Tue, 16 Dec 2008 12:10:06 -0500

Salut,

Jussi Kuosa wrote:
> Hi again,
> it took a while before I got to tackle this issue... I've been trying to
> solve this for a week now and here's what I've come up with: the TCP resets
> are caused by a Comet-enabled selector thread consuming 100% CPU. As the CPU
> is hot, the Linux networking stack begins to drop connections from that CPU's
> packet queue (confirmed by /proc/net/softnet_stat).

Grrrr, I've reported such JDK issues many, many times!!


>
> After a varying period of time (20s - 4h), one core (of 2) goes to and stays
> at 100% spinning wildly in the (epoll) selection loop. I used the JTop
> plugin in JConsole to track down the CPU hogging thread
> (SelectorThread-8282). From jstack output I found out that the thread seems
> to sit in sun.nio.ch.EPollArrayWrapper.epollWait all the time. I attached
> the NB6.1 debugger to GF and found that the SelectorThread.doSelect for all
> listeners keeps returning with 0 state and there are no ready keys.
> Evaluating selector.keys() returns only one key with OP_ACCEPT interestOps
> (i.e. the server socket in port 8282), so I guess all client keys have been
> cleared out? I have no idea why epoll returns immediately in this case.
>
> I used the jstack trace to find [1] and a JDK defect 6595055 [2] that led
> me to 6403933 [3]. The defect is flagged as 10-Fix Delivered, but I guess
> the problem hasn't really been fixed yet (at least in jdk-6u7)... I also run
> a second server on top of Windows Server 2003 SP2 and it hasn't had any
> problems, so that is currently our backup plan (that we really wouldn't like
> to use due to Windows NTFS (jar) file locking problems during
> redeployments).
>
> I've tried desperately to reproduce the problem systematically, but
> without any luck. All the test clients that I've found in JDK defects (and a
> similar defect in Twisted framework) run fine, as do my own attempts to
> break the disconnect cycle. The only time I even noticed a TCP reset
> happening before the CPU thrashing is
> http://www.nabble.com/file/p21037201/reset_sequence.txt
> (the checksums are offloaded). That was caused by a service shutdown on the
> client side, but the problem occurs also with the clients doing normal
> CONNECTs to a subscribed channel.
>

Thanks for the info. I've forwarded this to the NIO lead as this is
clearly a JDK bug.


> Is anyone on the list running 2.6 amd64 Linux cometd/bayeux servers and have
> you experienced similar behavior?
>
> Does someone have any other suggestions as to what I could try next?
>
> Can I provide more information to help solve this problem?

One workaround for you is to call
CometContext.setExpirationDelay(-1). This will not enable that extra
Selector.

Would you be able to produce a test case? Let me try to find a
workaround (again) by trying to detect the scenario and trash the
Selector. Meanwhile, can you file an issue here (as a p2):

https://glassfish.dev.java.net/servlets/ProjectIssues

Make sure you state it is a JDK issue. I'm asking because I might be
able to commit the workaround so GF 2.1 isn't suffering the issue. The
window is short but I should be able to send you a patch by the end of today.
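
To give an idea of what "detect the scenario and trash the Selector"
could mean, here is a rough sketch using plain java.nio only. It is not
the Grizzly patch. The class name SpinGuardSelector, the threshold of
512, and the "returned in less than half the timeout" heuristic are all
assumptions made for illustration. The idea is to count consecutive
select() calls that come back early with zero ready keys and, past the
threshold, move every valid key to a fresh Selector and close the old one.

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Illustrative only: the class name, threshold, and heuristic are
// assumptions, not Grizzly code.
class SpinGuardSelector {
    // Hypothetical limit on consecutive premature empty returns.
    private static final int SPIN_THRESHOLD = 512;

    private Selector selector;
    private int emptyReturns;
    private int rebuildCount;

    SpinGuardSelector(Selector selector) {
        this.selector = selector;
    }

    Selector selector() { return selector; }

    int rebuildCount() { return rebuildCount; }

    // Call from the selector thread only.
    // A timeout of 0 blocks indefinitely, as with Selector.select(long).
    int select(long timeoutMillis) throws IOException {
        long start = System.nanoTime();
        int ready = selector.select(timeoutMillis);
        long elapsedNanos = System.nanoTime() - start;

        // "Premature": nothing is ready, yet select() came back well before
        // the timeout. A wakeup() or an interrupt looks the same, which is
        // why a single occurrence is not enough to trigger a rebuild.
        boolean premature = ready == 0
                && (timeoutMillis == 0 || elapsedNanos < timeoutMillis * 500000L);
        if (!premature) {
            emptyReturns = 0;
        } else if (++emptyReturns >= SPIN_THRESHOLD) {
            rebuild();
        }
        return ready;
    }

    // Re-register every valid key on a new Selector, then drop the old one.
    void rebuild() throws IOException {
        Selector fresh = Selector.open();
        for (SelectionKey key : selector.keys()) {
            if (key.isValid()) {
                key.channel().register(fresh, key.interestOps(), key.attachment());
            }
        }
        selector.close(); // cancels all keys still held by the old selector
        selector = fresh;
        emptyReturns = 0;
        rebuildCount++;
    }
}
```

The caller would use guard.select(timeout) in place of
selector.select(timeout) and always fetch the current Selector through
guard.selector() afterwards, since it may have been swapped. A threshold
that is too low would rebuild on ordinary wakeups, so the value needs
tuning against real traffic.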

Thanks

-- Jeanfrancois




>
> Best regards,
>
> Jussi Kuosa
>
>
> [1] http://forums.java.net/jive/thread.jspa?messageID=255525
> [2] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6595055
> [3] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6403933
>
>> Can you do a run using
>> -Dcom.sun.enterprise.web.connector.grizzly.enableSnoop=true
>> (and also make sure you always run with):
>> <jvm-options>-Dcom.sun.enterprise.server.ss.ASQuickStartup=false</jvm-options>
>
> Yes, I put those in.
>
>> Also, when that happens, can you grab a jstack PID to see if threads are
>> available.
>
> Looks ok, all HttpWorkerThreads for 8282 are waiting in the pipeline?
> http://www.nabble.com/file/p21037201/jstack_l.txt
>
>> One thing you might want to try is to update your v2 installation to use
>> grizzly 1.0.22.jar
>
> Yes, I bumped our GF to v2ur2-b04 (with 1.0.22) and we use JDK 1.6.0_07-b06.
>