users@grizzly.java.net

Re: Comet context doesn't expire

From: Jussi Kuosa <jussi.kuosa_at_f-secure.com>
Date: Thu, 23 Jul 2009 08:23:10 -0700 (PDT)

> > comet selector spin problem...
> OK that one is now fixed with grizzly-1.0.30-SNAPSHOT:

Hi J-F,
back from my first holiday stretch. The original problem seems to have
become clearer (see below), but the 1.0.30 patch causes some odd symptoms.

We patched our Linux and Windows servers with 1.0.30. Now our Windows
cluster and our Linux single-node system test environment have started to
accumulate TCP connections in CLOSE_WAIT state that are not cleared even
though the client processes went away ages ago. They seem to appear in
batches and at irregular intervals... My colleague reported that after a
few days one cluster node had about 8k connections in CLOSE_WAIT. I get
similar results on my own computer:

Running "netstat -n -v -p tcp -b" gives:
...
TCP 127.0.0.1:8282 127.0.0.1:4093 CLOSE_WAIT 7832
  C:\Program Files\F-Secure\FSPS\program\FSLSP.DLL
  C:\WINDOWS\system32\WS2_32.dll
  C:\Program Files\Java\jdk1.6.0_07\jre\bin\net.dll
  -- unknown component(s) --
  [java.exe]
...
where 7832 is the GF 2.1 pid. Our fslsp.dll is not the cause, as the test
images do not have our anti-malware products installed ATM.

Do you have any idea as to what might cause this? As I understand it,
CLOSE_WAIT on the server means the client has closed the connection with a
FIN and the server has ACKed it, but the server application has not yet
closed the socket from its side. So even when a client dies (is killed)
and its network stack closes the connection, the socket will sit in
CLOSE_WAIT until the server application closes it. Is Grizzly failing to
close these sockets after the 1.0.30 patch???
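
A minimal stand-alone sketch (plain java.net sockets on an arbitrary port,
nothing Grizzly-specific) that reproduces the state, in case it helps:

    import java.net.ServerSocket;
    import java.net.Socket;

    // A socket stays in CLOSE_WAIT until the application closes it,
    // regardless of what the peer or its network stack does.
    public class CloseWaitDemo {
        public static void main(String[] args) throws Exception {
            ServerSocket server = new ServerSocket(8282);
            Socket client = new Socket("127.0.0.1", 8282);
            Socket accepted = server.accept();

            client.close();       // peer sends FIN; our stack ACKs it
            Thread.sleep(60000);  // netstat now shows 8282 in CLOSE_WAIT
            accepted.close();     // only this releases the server side
        }
    }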

> However, on our production system our native clients (C++ and Python) are
> occasionally experiencing very long lockups (thousands of seconds, if
> timeout is disabled), as bayeux CONNECTs are not properly terminated when
> there is nothing to send. We also frequently call onEvent from a JMS
> onMessage handler that in turn is our cluster event distribution
> mechanism.

Ok, based on your later reply, we now know a way to cause this.
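
For reference, the JMS push path mentioned in the quote above looks
roughly like the sketch below (the class name and the "/events" context
path are illustrative, assuming the Grizzly 1.x comet API; not our actual
code):

    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;
    import com.sun.grizzly.comet.CometContext;
    import com.sun.grizzly.comet.CometEngine;

    // Rough shape of our cluster event distribution: a JMS listener
    // pushes each cluster event to the suspended comet clients.
    // Assumes the context was registered at init time.
    public class ClusterEventListener implements MessageListener {
        public void onMessage(Message message) {
            try {
                CometContext context =
                        CometEngine.getEngine().getCometContext("/events");
                context.notify(((TextMessage) message).getText());
            } catch (Exception e) {
                e.printStackTrace();  // real code would log properly
            }
        }
    }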

> The idea with expiration is the following:
> (1) If the connection gets suspended and no activities happens (no
> push), then the connection will be forcibly resumed after the delay
> expires.
> (2) If there is a push, the expiration delay will be resettled to
> default value.

We were unaware of (2) and had assumed that a client's expiration delay
would not be extended on every push. In addition, we do not send the push
data to every connected client within a channel. As a result, we have
identified a scenario where pushing data to a few active clients causes
them to reconnect and receive additional push data within the expiration
delay. This causes the other connected clients to constantly have their
expiration delays reset, so onInterrupt never gets called for them.
Eventually these clients hit a client-side timeout. The situation clears
once the few clients stop receiving push data on every CONNECT.
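
To make sure we read (1) and (2) right, here is the expiration
bookkeeping as we understand it (just an illustration of the semantics,
not Grizzly internals):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;

    // Semantics as we understand them: every push re-arms the
    // expiration timer, so a steady trickle of pushes keeps
    // onInterrupt from ever firing for the suspended clients.
    public class ExpirationSemantics {
        private final ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();
        private final long delayMs;
        private ScheduledFuture<?> expiration;

        ExpirationSemantics(long delayMs) {
            this.delayMs = delayMs;
            arm();  // (1) forcibly resume after the delay if idle
        }

        private void arm() {
            expiration = timer.schedule(new Runnable() {
                public void run() {
                    System.out.println("onInterrupt");
                }
            }, delayMs, TimeUnit.MILLISECONDS);
        }

        // (2) a push cancels and re-arms the timer, moving the
        // deadline out again for the suspended connection
        void onPush() {
            expiration.cancel(false);
            arm();
        }
    }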

Clearly this is a problem on our side, as we use bayeux slightly
differently than what it was originally designed for. On the other hand,
every data push now synchronizes the clients so that they all have
onInterrupt called at the same time and then reconnect at the same time,
which leads to very spiky network and CPU load.
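
One client-side mitigation we are considering (our own idea, not
something bayeux mandates) is to add random jitter before each re-CONNECT
so that simultaneously interrupted clients stop reconnecting in lockstep.
Our clients are C++ and Python, but in Java it would look something like
this, with illustrative delay values:

    import java.util.Random;

    // Hypothetical reconnect helper: adds random jitter so clients
    // interrupted at the same time do not re-CONNECT in lockstep.
    public class JitteredReconnect {
        private static final Random RANDOM = new Random();

        static void sleepBeforeReconnect() throws InterruptedException {
            long baseDelayMs = 1000;
            long jitterMs = RANDOM.nextInt(5000);
            Thread.sleep(baseDelayMs + jitterMs);
            // ... then issue the next bayeux CONNECT
        }
    }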

Best regards,

    Jussi Kuosa


-- 
View this message in context: http://www.nabble.com/Comet-context-doesn%27t-expire-tp24072882p24628197.html
Sent from the Grizzly - Users mailing list archive at Nabble.com.