Hi Marc,
makes sense. I committed the fix on the 2.2.x branch and released a new
2.2.22-SNAPSHOT; could you please give it a try?
Thanks.
WBR,
Alexey.
On 16.01.14 14:06, Marc Arens wrote:
> Hello Alexey,
>
>> On 13 January 2014 at 22:28 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>
>> wrote:
>>
>>
>> Hi Marc,
>>
>> On 13.01.14 08:37, Marc Arens wrote:
>>> Hello,
>>>
>>> we're currently seeing a problem for a single customer with
>>> org.glassfish.grizzly.utils.IdleTimeoutFilter.queue leaking/not getting
>>> cleaned up properly, so the server runs out of memory within ~48 hours. As
>>> far as I understand, Connections aren't removed from the queue until
>>> org.glassfish.grizzly.utils.DelayedExecutor.DelayedRunnable instructs
>>> org.glassfish.grizzly.utils.IdleTimeoutFilter.DefaultWorker.doWork(Connection)
>>> to close the connection via closeSilently(), which results in
>>> org.glassfish.grizzly.utils.IdleTimeoutFilter.handleClose(FilterChainContext)
>>> being called.
>>>
>>> Are/were there known bugs I missed, or is there an easy explanation for the
>>> behaviour we're seeing? We don't see this on other installations, and I
>>> could reproduce a similar behaviour by making a Servlet's service() method
>>> hang endlessly (see the sketch below). While the connections from the
>>> Apache in front of Grizzly were closed after an Apache timeout of 100
>>> seconds, the queue didn't shrink at all.
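>>>
>>> (For reference, the reproduction looked roughly like the following sketch;
>>> the class name and the way the method blocks are just illustrative:)
>>>
>>> import java.io.IOException;
>>> import javax.servlet.ServletException;
>>> import javax.servlet.http.HttpServlet;
>>> import javax.servlet.http.HttpServletRequest;
>>> import javax.servlet.http.HttpServletResponse;
>>>
>>> public class HangingServlet extends HttpServlet {
>>>     @Override
>>>     protected void service(HttpServletRequest req, HttpServletResponse resp)
>>>             throws ServletException, IOException {
>>>         try {
>>>             // Never return, so the request stays active "endlessly".
>>>             Thread.sleep(Long.MAX_VALUE);
>>>         } catch (InterruptedException e) {
>>>             Thread.currentThread().interrupt();
>>>         }
>>>     }
>>> }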
>> IdleTimeoutFilter keeps track of all FilterChainContexts associated with
>> a Connection and only allows the Connection to be closed when it no longer
>> has any active FilterChainContext associated with it.
>> In your case, when the Servlet.service() method never returns, the
>> FilterChainContext stays active, so IdleTimeoutFilter doesn't treat the
>> Connection as idle.
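>>
>> Just to illustrate, a minimal sketch of how the filter is typically wired
>> into a chain (the 30-second timeout and the surrounding chain layout are
>> just an example):
>>
>> import java.util.concurrent.TimeUnit;
>> import org.glassfish.grizzly.filterchain.FilterChainBuilder;
>> import org.glassfish.grizzly.filterchain.TransportFilter;
>> import org.glassfish.grizzly.utils.DelayedExecutor;
>> import org.glassfish.grizzly.utils.IdleTimeoutFilter;
>>
>> // Shared executor that periodically scans the idle queue.
>> final DelayedExecutor idleExecutor =
>>         IdleTimeoutFilter.createDefaultIdleDelayedExecutor();
>> idleExecutor.start();
>>
>> final FilterChainBuilder chainBuilder = FilterChainBuilder.stateless();
>> chainBuilder.add(new TransportFilter());
>> // Consider the Connection idle (and close it) after 30s with no active
>> // FilterChainContext.
>> chainBuilder.add(new IdleTimeoutFilter(idleExecutor, 30, TimeUnit.SECONDS));
>> // ... protocol filters (HTTP codec, etc.) would follow here ...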
>>
>> I'm not aware of any bugs related to IdleTimeoutFilter, but if you have
>> a test case, we'll definitely take a look and fix the problem.
> No test case, but we finally managed to reproduce/watch it via remote debugging
> on a special Debian Squeeze-derived distribution some customers are using. I
> assumed it might have been caused by an old OpenJDK 6 runtime, but it could be
> seen with Sun 6 and even Oracle 7, though only on this system.
>
> We had configured an idleTimeout of FOREVER during Grizzly startup:
>
> final KeepAlive keepAlive = networkListener.getKeepAlive();
> keepAlive.setIdleTimeoutInSeconds(-1);
> keepAlive.setMaxRequestsCount(-1);
>
> Now, watching a request being handled showed the following:
>
> During handleClose the IdleTimeoutFilter sets timeoutMillis to 0, as observed
> in the debugger:
>
> Daemon Thread [OXWorker-0000036] (Suspended (breakpoint at line 325 in
> IdleTimeoutFilter$Resolver))
> IdleTimeoutFilter$Resolver.removeTimeout(Connection) line: 325
> IdleTimeoutFilter$Resolver.removeTimeout(Object) line: 321
> DelayedExecutor$DelayQueue<E>.remove(E) line: 196
> IdleTimeoutFilter.handleClose(FilterChainContext) line: 18
>
> public boolean removeTimeout(final Connection connection) {
>     IDLE_ATTR.get(connection).timeoutMillis.set(0);
>     return true;
> }
>
> resulting in
>
> (org.glassfish.grizzly.nio.transport.TCPNIOConnection)
> TCPNIOConnection{localSocketAddress=/127.0.0.1:8009,
> peerSocketAddress=/127.0.0.1:57573}
> IDLE_ATTR.get(connection)
> (org.glassfish.grizzly.utils.IdleTimeoutFilter$IdleRecord)
> org.glassfish.grizzly.utils.IdleTimeoutFilter$IdleRecord_at_6ce13334
> IDLE_ATTR.get(connection).timeoutMillis
> (java.util.concurrent.atomic.AtomicLong) 0
>
> This should normally cause the next run of DelayedExecutor.DelayedRunnable to
> remove the element from the DelayQueue, since currentTimeMillis - 0 >= 0:
>
> final Object element = it.next();
> final Long timeoutMillis = resolver.getTimeoutMillis(element);
> ...
> } else if (currentTimeMillis - timeoutMillis >= 0) {
>     it.remove();
>     ...
>
> But when examining this part in the debugger again, we see that timeoutMillis
> was reset from 0 to FOREVER (9223372036854775807):
>
> (org.glassfish.grizzly.nio.transport.TCPNIOConnection)
> TCPNIOConnection{localSocketAddress=/127.0.0.1:8009,
> peerSocketAddress=/127.0.0.1:57573}
> IDLE_ATTR.get(connection)
> (org.glassfish.grizzly.utils.IdleTimeoutFilter$IdleRecord)
> org.glassfish.grizzly.utils.IdleTimeoutFilter$IdleRecord_at_6ce13334
> IDLE_ATTR.get(connection).timeoutMillis
> (java.util.concurrent.atomic.AtomicLong) 9223372036854775807
>
> This seems to be a concurrency issue, as it doesn't always happen when I add
> an additional breakpoint in
> IdleTimeoutFilter.ContextCompletionListener.onComplete().
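>
> A stripped-down sketch of the ordering we think we're seeing (the thread
> labels and variable setup are just our interpretation, not the actual code):
>
> import java.util.concurrent.atomic.AtomicLong;
>
> final long FOREVER = Long.MAX_VALUE;
> final AtomicLong timeoutMillis = new AtomicLong(FOREVER);
>
> // Thread A: IdleTimeoutFilter.handleClose() -> Resolver.removeTimeout()
> timeoutMillis.set(0);
>
> // Thread B: ContextCompletionListener.onComplete() re-arms the idle timeout
> timeoutMillis.set(FOREVER); // last writer wins -> 9223372036854775807
>
> // DelayedExecutor.DelayedRunnable then checks
> //     currentTimeMillis - timeoutMillis >= 0
> // which is never true for FOREVER, so the entry is never removed.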
>
> The same code/config just works on a default Debian. Any thoughts on this one?
>
>>> I found org.glassfish.grizzly.utils.ActivityCheckFilter, which sounds
>>> useful for this kind of situation, but how does it work in combination with
>>> suspended requests, e.g. Comet?
>> ActivityCheckFilter is more lax: it doesn't track live FilterChainContexts,
>> but monitors Connection activity (reads, writes, other events), and if no
>> activity was registered for a given period of time, it closes the Connection.
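>>
>> A minimal usage sketch (assuming the same (executor, timeout, unit)
>> constructor style as IdleTimeoutFilter; the builder name and the 100-second
>> value are just an example):
>>
>> import java.util.concurrent.TimeUnit;
>> import org.glassfish.grizzly.utils.ActivityCheckFilter;
>> import org.glassfish.grizzly.utils.DelayedExecutor;
>> import org.glassfish.grizzly.utils.IdleTimeoutFilter;
>>
>> final DelayedExecutor activityExecutor =
>>         IdleTimeoutFilter.createDefaultIdleDelayedExecutor();
>> activityExecutor.start();
>>
>> // Close the Connection if no read/write/other activity is seen for 100s.
>> filterChainBuilder.add(
>>         new ActivityCheckFilter(activityExecutor, 100, TimeUnit.SECONDS));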
>>
>> Both filters can work with Comet; both are meant to detect idle Connections,
>> but they use different approaches. So it's up to you to decide which one is
>> better for your specific use case.
>>
>> Thanks.
>>
>> WBR,
>> Alexey.
> Best regards,
> Marc