Re: Comet handler starts terminating TCP connections with RST?

From: Jussi Kuosa <jussi.kuosa_at_f-secure.com>
Date: Tue, 16 Dec 2008 09:03:45 -0800 (PST)

Hi again,
it took a while before I got to tackle this issue. I've been trying to
solve it for a week now, and here's what I've come up with: the TCP resets
are caused by a Comet-enabled selector thread consuming 100% CPU. As the CPU
is hot, the Linux networking stack begins to drop connections from that CPU's
packet queue (confirmed by /proc/net/softnet_stat).
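
For anyone who wants to watch the same counter, here is a minimal sketch of
how it can be read. Each row of /proc/net/softnet_stat holds hexadecimal
columns for one CPU, and the second column counts packets dropped because
that CPU's backlog queue was full. The class and method names are my own
(hypothetical), not part of any library.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: prints the per-row "dropped" counter from softnet_stat.
public class SoftnetDrops {

    /** Returns the dropped counter (second hex column) of every well-formed row. */
    static List<Long> droppedPerCpu(List<String> lines) {
        List<Long> dropped = new ArrayList<Long>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 2) {
                continue; // skip blank or malformed rows
            }
            dropped.add(Long.parseLong(cols[1], 16)); // columns are hexadecimal
        }
        return dropped;
    }

    /** Reads a text file into a list of lines. */
    static List<String> readLines(String path) throws IOException {
        List<String> lines = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            in.close();
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        List<Long> dropped = droppedPerCpu(readLines("/proc/net/softnet_stat"));
        for (int row = 0; row < dropped.size(); row++) {
            System.out.println("row " + row + " dropped=" + dropped.get(row));
        }
    }
}
```

Running it twice a few seconds apart and comparing the values shows whether
the drop counter is still growing while the selector thread is spinning.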

After a varying period of time (20s - 4h), one core (of 2) goes to 100% and
stays there, spinning wildly in the (epoll) selection loop. I used the JTop
plugin in JConsole to track down the CPU-hogging thread
(SelectorThread-8282). From the jstack output I found that the thread seems
to sit in sun.nio.ch.EPollArrayWrapper.epollWait all the time. I attached
the NB6.1 debugger to GF and found that SelectorThread.doSelect keeps
returning with state 0 for all listeners and there are no ready keys.
Evaluating selector.keys() returns only one key with OP_ACCEPT interestOps
(i.e. the server socket on port 8282), so I guess all client keys have been
cleared out? I have no idea why epoll returns immediately in this case.
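
A commonly used application-level workaround for this kind of spin is to
count consecutive select() calls that return 0 almost immediately and, past
a threshold, move all valid keys to a freshly opened Selector. The sketch
below illustrates that idea only; it is not Grizzly's code, and the class
name, method names, and thresholds are my own assumptions.

```java
import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical helper: detects a spinning selector and rebuilds it.
public class SpinGuard {
    private final long minBlockNanos; // a shorter empty select() counts as suspicious
    private final int threshold;      // consecutive suspicious selects before rebuilding
    private int suspicious;

    public SpinGuard(long minBlockNanos, int threshold) {
        this.minBlockNanos = minBlockNanos;
        this.threshold = threshold;
    }

    /** Records one select() outcome; returns true when a rebuild looks warranted. */
    public boolean record(int readyKeys, long elapsedNanos) {
        if (readyKeys == 0 && elapsedNanos < minBlockNanos) {
            suspicious++;
        } else {
            suspicious = 0; // a normal select breaks the streak
        }
        if (suspicious >= threshold) {
            suspicious = 0;
            return true;
        }
        return false;
    }

    /** Re-registers every valid key on a new Selector and closes the old one. */
    public static Selector rebuild(Selector old) throws IOException {
        Selector fresh = Selector.open();
        for (SelectionKey key : old.keys()) {
            if (!key.isValid()) {
                continue; // already cancelled or channel closed
            }
            SelectableChannel channel = key.channel();
            int ops = key.interestOps();
            Object attachment = key.attachment();
            key.cancel();
            channel.register(fresh, ops, attachment);
        }
        old.close();
        return fresh;
    }
}
```

In a selection loop this would be called with the result of
selector.select(timeout) and the time measured around it with
System.nanoTime(), swapping in the Selector returned by rebuild() whenever
record() returns true. Note that a legitimate wakeup() also makes select()
return 0 early, so the threshold should be generous.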

I used the jstack trace to find [1] and JDK defect 6595055 [2], which led
me to 6403933 [3]. The defect is flagged as 10-Fix Delivered, but I guess
the problem hasn't really been fixed yet (at least in jdk-6u7)... I also run
a second server on top of Windows Server 2003 SP2 and it hasn't had any
problems, so that is currently our backup plan (one that we really wouldn't
like to use due to Windows NTFS (jar) file locking problems during
redeployments).

I've tried desperately to make the problem reproduce systematically, but
without any luck. All the test clients that I've found in JDK defects (and a
similar defect in the Twisted framework) run fine, as do my own attempts to
break the disconnect cycle. The only time I even noticed a TCP reset
happening before the CPU thrashing is in reset_sequence.txt:
http://www.nabble.com/file/p21037201/reset_sequence.txt
(the checksums are offloaded). That was caused by a service shutdown on the
client side, but the problem also occurs with clients doing normal
CONNECTs to a subscribed channel.

Is anyone on the list running cometd/Bayeux servers on 2.6 amd64 Linux, and
have you experienced similar behavior?

Does someone have any other suggestions as to what I could try next?

Can I provide more information to help solve this problem?

Best regards,

Jussi Kuosa

[1] http://forums.java.net/jive/thread.jspa?messageID=255525
[2] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6595055
[3] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6403933

> Can you do a run using
> -Dcom.sun.enterprise.web.connector.grizzly.enableSnoop=true
> (and also make sure you always run with):
> <jvm-options>-Dcom.sun.enterprise.server.ss.ASQuickStartup=false</jvm-options>

Yes, I put those in.

> Also, when that happens, can you grab a jstack PID to see if threads are
> available.

Looks OK, all HttpWorkerThreads for 8282 are waiting in the pipeline?
See jstack_l.txt: http://www.nabble.com/file/p21037201/jstack_l.txt

> One thing you might want to try is to update your v2 installation to use
> grizzly 1.0.22.jar

Yes, I bumped our GF to v2ur2-b04 (with 1.0.22) and we use JDK 1.6.0_07-b06.

-- 
View this message in context: http://www.nabble.com/Comet-handler-starts-terminating-TCP-connections-with-RST--tp20337445p21037201.html
Sent from the Grizzly - Users mailing list archive at Nabble.com.