users@glassfish.java.net

Re: File Descriptors leaking

From: Lachezar Dobrev <l.dobrev_at_gmail.com>
Date: Fri, 5 Oct 2012 18:19:01 +0300

  Hi Oleksiy...

  Things were quiet for a while: the server was operating without
running out of descriptors. Admittedly, however, we had to restart
Glassfish every 4-5 days due to OS upgrades.
  That lasted until 2012-10-03, when it all started over again. The
server stopped responding to requests. The last restart before that
was on 2012-09-28.

  I noticed a substantial gap in the logging:
  Last log entry before the crash: 2012-10-02T23:48:06.319+0300
  First log entry once the crash began: 2012-10-03T18:20:05.627+0300

  I also noticed that one of the rolled log files is empty:
> glassfish_at_glassfish:~/glassfish/domains/domain1/logs$ ls -al server.log_2012-10*
> -rw-r--r-- 1 glassfish glassfish 2000206 Oct 1 19:13 server.log_2012-10-01T19-13-25
> -rw-r--r-- 1 glassfish glassfish 2000131 Oct 2 23:48 server.log_2012-10-02T23-48-06
> -rw-r--r-- 1 glassfish glassfish 2004750 Oct 3 18:20 server.log_2012-10-03T18-20-05
> -rw-r--r-- 1 glassfish glassfish 0 Oct 3 18:20 server.log_2012-10-03T18-20-41
> -rw-r--r-- 1 glassfish glassfish 2000301 Oct 3 18:32 server.log_2012-10-03T18-32-36
> -rw-r--r-- 1 glassfish glassfish 2000242 Oct 3 18:44 server.log_2012-10-03T18-44-11
> -rw-r--r-- 1 glassfish glassfish 2000213 Oct 3 18:55 server.log_2012-10-03T18-55-47
> -rw-r--r-- 1 glassfish glassfish 2000242 Oct 3 19:07 server.log_2012-10-03T19-07-23
> -rw-r--r-- 1 glassfish glassfish 2000242 Oct 3 19:18 server.log_2012-10-03T19-18-59
> -rw-r--r-- 1 glassfish glassfish 2000292 Oct 4 03:04 server.log_2012-10-04T03-04-27
> -rw-r--r-- 1 glassfish glassfish 2000235 Oct 5 08:24 server.log_2012-10-05T08-24-35

  Is it possible that some bug/defect in the logging subsystem leads
to a selector leak? Or maybe the logging subsystem simply falls
victim to the same selector deficiency.

  I have a nagging feeling that this error happens when I use the
administrative web application to check for Glassfish updates. I
checked for updates some time in the afternoon of 2012-10-03.

  I decided to take your debug selector provider and do some testing
with it. Since I cannot be certain that the logs will help, I rewrote
the selector provider to write to a dedicated file that is opened on
start-up. I also added logging for the openPipe() method. A rough
sketch of the idea is below.
  For obvious reasons I cannot do that on the production server, so I
tried it on a development/testing environment. The development
environment is a Debian Wheezy.
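
  Roughly, the rewrite looks like the sketch below. The class name
matches the debug.DebugSelectorProvider used for the jvm-options line;
the delegate lookup via sun.nio.ch.DefaultSelectorProvider and the
/tmp/selector-debug.log path are my own choices, not necessarily what
your patch does:

package debug;

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.ProtocolFamily;
import java.nio.channels.DatagramChannel;
import java.nio.channels.Pipe;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.channels.spi.AbstractSelector;
import java.nio.channels.spi.SelectorProvider;

// Wraps the JDK's default provider and appends a stack trace to a dedicated
// file for every openSelector()/openPipe() call, bypassing System.out and
// java.util.logging entirely.
public class DebugSelectorProvider extends SelectorProvider {

    private final SelectorProvider delegate;
    private final PrintWriter out; // opened once, on start-up

    public DebugSelectorProvider() {
        try {
            // JDK-internal factory; present in OpenJDK 7, may differ elsewhere.
            delegate = (SelectorProvider) Class
                    .forName("sun.nio.ch.DefaultSelectorProvider")
                    .getMethod("create").invoke(null);
            out = new PrintWriter(
                    new FileWriter("/tmp/selector-debug.log", true), true);
        } catch (Exception e) {
            throw new IllegalStateException("debug provider init failed", e);
        }
    }

    private synchronized void trace(String what) {
        out.println(what + " at " + System.currentTimeMillis());
        for (StackTraceElement e : new Throwable().getStackTrace()) {
            out.println("    at " + e);
        }
    }

    @Override
    public AbstractSelector openSelector() throws IOException {
        trace("Selector.open()");
        return delegate.openSelector();
    }

    @Override
    public Pipe openPipe() throws IOException {
        trace("Pipe.open()");
        return delegate.openPipe();
    }

    // The remaining factory methods are delegated unchanged.
    @Override
    public DatagramChannel openDatagramChannel() throws IOException {
        return delegate.openDatagramChannel();
    }

    @Override
    public DatagramChannel openDatagramChannel(ProtocolFamily family)
            throws IOException {
        return delegate.openDatagramChannel(family);
    }

    @Override
    public ServerSocketChannel openServerSocketChannel() throws IOException {
        return delegate.openServerSocketChannel();
    }

    @Override
    public SocketChannel openSocketChannel() throws IOException {
        return delegate.openSocketChannel();
    }
}

  It gets picked up via the same
-Djava.nio.channels.spi.SelectorProvider=debug.DebugSelectorProvider
jvm-option as your original patch.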

  I've had no luck with that so far... The OS tools do not give me
consistent numbers for open files/descriptors. Plain 'lsof' shows 60K+
open file descriptors, 'lsof -p <pid>' shows around 800, and 'ls -l
/proc/<pid>/fd' lists around 400 (the numbers on the production system
agree better, but are still inconclusive). I should also mention that
I've never seen this issue on the development server, only on the
production (Ubuntu Server 12.04) one.
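
  To get a number I can actually trust I may just count the entries in
/proc/<pid>/fd from inside the JVM itself. A throwaway sketch (the
class name is mine):

import java.io.File;
import java.lang.management.ManagementFactory;

// Counts this JVM's own open descriptors by listing /proc/self/fd (Linux),
// so the result is not skewed by how lsof aggregates tasks/threads.
public class FdCount {
    public static void main(String[] args) {
        String[] entries = new File("/proc/self/fd").list();
        int count = (entries == null) ? 0 : entries.length;
        System.out.println(ManagementFactory.getRuntimeMXBean().getName()
                + " has ~" + count + " open descriptors");
    }
}

  Run periodically, that would show whether the count actually creeps
up between restarts.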

  The debug selector provider shows only 6 selectors open. That seems
reasonable, since 'lsof' also shows only 6 'anon_inode' entries. I'll
have to come back later and check what is happening, but for the time
being the selectors do not show the runaway behaviour that I expect is
killing the server.

  Honestly... I am puzzled and uncomfortable: the decision to use
Glassfish & Ubuntu Server was mine, and I will have to take
responsibility for the ineffective combination.

  I have seen a few threads on the forums asking whether this
behaviour can be fixed by switching from Grizzly to an alternative.
I'll have to dig through the forums again to find which alternative it
was and how to make the switch.

  Regards,
  Lachezar

  P.S. Is it possible to update the server without installing the
updatetool into the OS? I have a vague memory of having done that in
the past...

2012/9/12 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>:
> Hey,
>
>
> On 09/12/2012 10:56 AM, Lachezar Dobrev wrote:
>>
>> Hello,
>> Today I was preparing to take your patch on-line. However, I noticed
>> that (more weird timing) there was an update for the JVM.
>>
>>> Upgrade: openjdk-7-jdk:amd64 (7~u3-2.1.1~pre1-1ubuntu3,
>>> 7u7-2.3.2-1ubuntu0.12.04.1),
>>
>> The OpenJDK version is now 2.3.2, and I suspect that it might
>> mitigate the problem, since the changelog notes work on this
>> subject.
>>
>> So I am going to wait for a bit without your patch, hoping that the
>> updated JVM will fix the problem.
>
> Sure.
>
>
>> I decompiled your patch (sorry, but I feel uncomfortable running
>> unaudited code). I saw that you're dumping the stack traces to
>> System.out, which in Glassfish is replaced with a logger that also
>> uses locking. I have a hunch that when the problem happens the
>> system will be unable to log. I am going to attempt to create a
>> patch that writes to a dedicated file, skipping the whole System.out
>> -> java.util.logging -> server.log chain.
>>
>> I hope you will not consider your efforts wasted, and thank you for
>> the time spent.
>
> No problem at all (sources attached) :))
>
> Thanks.
>
> WBR,
> Alexey.
>
>
>>
>> 2012/9/11 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>:
>>>
>>> Hi,
>>>
>>>
>>> 1. Can I "upgrade" to 3.1.2.2 using the update manager?
>>> If not, where can I read about the procedure for upgrading a
>>> production server?
>>>
>>> trying to figure that out, will let you know. Glassfish 3.1.2.2 has
>>> fixes for the JK connector [1], so it would really make sense for you,
>>> even though it doesn't have any file-descriptor related fixes.
>>>
>>> I took a look at my server. I cannot find any reference to the JK
>>> Connector, but found that a number of the Installed Components
>>> are at version 3.1.2.2-5. Grizzly, however, has an odd version:
>>> glassfish-grizzly 1.9.50-1
>>>
>>> There are no available updates.
>>>
>>> That's ok, you have the latest Glassfish 3.1.2.2
>>>
>>>
>>> When you see the problem, pls. run netstat, it will help us to understand
>>> the type of connections, which consume descriptors.
>>>
>>> Done that!
>>> When the server hangs there are very few network connections: fewer
>>> than 10 in any case, and I believe all of them are hot spares or
>>> keep-alive.
>>> When descriptors leak, the overwhelming majority is consumed by PIPE
>>> and 0000 descriptors.
>>> I am attaching an excerpt of the lsof output (obfuscated).
>>> I took the liberty of posting only 1K lines, because the full lsof
>>> output is more than 5MB and would probably be dropped by the mail
>>> agent. Just imagine the last three lines (1 anon_inode + 2 pipe)
>>> repeating for another 64967 lines.
>>>
>>> There is only one JK request, invoked from the Nagios system (it
>>> pokes the port every 5 minutes or so).
>>>
>>> Ok, I've created a patch, which should help us to understand if the
>>> problem is related to NIO Selectors leaking.
>>> Pls. copy the attached file to {domain-folder}/lib/ext (on my machine
>>> it's glassfish3/glassfish/domains/domain1/lib/ext).
>>> In the GF domain.xml pls. add the following line:
>>>
>>> <jvm-options>-Djava.nio.channels.spi.SelectorProvider=debug.DebugSelectorProvider</jvm-options>
>>>
>>> Under the <java-config> element (sending back your domain.xml file
>>> w/ the line added).
>>>
>>> After you restart the GF you'll see lots of stack traces in the
>>> server.log recording all the Selector.open() operations.
>>> Once you see the problem we can check whether the number of allocated
>>> selectors is really high and look at the stack traces of these
>>> allocations. So once you see the problem, pls. share the server.log*
>>> files (compress them before sending :))
>>>
>>> Thanks.
>>>
>>> WBR,
>>> Alexey.
>>>
>>>
>>>
>>> Thanks.
>>>
>>> No, no! Thank you!
>>>
>>> WBR,
>>> Alexey.
>>>
>>> [1] http://java.net/jira/browse/GLASSFISH-18446
>>>
>>> 2012/9/10 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>:
>>>
>>> Hi,
>>>
>>> can you pls. try GF version 3.1.2.2?
>>> Also, if it's possible, pls. attach GF domain.xml.
>>>
>>> Thanks.
>>>
>>> WBR,
>>> Alexey.
>>>
>>>
>>> On 09/10/2012 10:45 AM, Lachezar Dobrev wrote:
>>>
>>> Hello all...
>>> I have received no responses on this problem.
>>>
>>> I am still having this issue once or twice a week.
>>>
>>> After a number of searches over the past weeks I've gained little
>>> understanding of what is happening.
>>>
>>> In my search I found a defect report against Oracle's JVM that
>>> might be connected to the issue:
>>>
>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7118373
>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=2223521
>>>
>>> I also came across a mention in a blog:
>>>
>>> http://blog.fuseyism.com/index.php/category/openjdk/
>>> (sorry, I could not find a more authoritative source).
>>>
>>> From the blog post I can see that the mentioned defect is noted under
>>> 'Release 2.3.0 (2012-08-15)', which is, funnily enough, two days after
>>> my post. May I have your comments? Does this sound like an OpenJDK
>>> defect? Is it possible that it has been fixed in the meantime? From the
>>> looks of it, my machine still uses OpenJDK 2.1.1 (openjdk-7-jdk
>>> 7~u3-2.1.1~pre1-1ubuntu3).
>>>
>>> Please advise!
>>>
>>> 2012/8/29 Lachezar Dobrev <l.dobrev_at_gmail.com>:
>>>
>>> Hello colleagues,
>>>
>>> Recently we switched from Tomcat to Glassfish.
>>> However, I noticed that at a certain point (unknown as of yet) the
>>> Glassfish server stops responding. I can't even stop it cleanly
>>> (asadmin stop-domain hangs!).
>>>
>>> - Ubuntu Server - 12.04 (precise)
>>> - Intel Xeon (x64 arch)
>>> - java version "1.7.0_03"
>>> - OpenJDK 64-Bit Server VM (build 22.0-b10, mixed mode)
>>> - Glassfish 3.1.2 (no upgrades pending)
>>>
>>> The server is fronted by a façade Apache server and serves via a
>>> JK Connector using mod_jk.
>>>
>>> The server runs only three applications (plus the admin interface).
>>> All applications use the Spring Framework. One uses JPA against a
>>> PostgreSQL database on the local host, one uses ObjectDB via JPA, and
>>> two use JDBC pool connections to a remote Microsoft SQL Server.
>>>
>>> The culprit seems to be some kind of file descriptor leak.
>>> Initially the server died within a day or two. I increased the
>>> open-files limit from (s1024/h4096) to (s65536/h65536), thinking the
>>> problem might simply be that too many files need to be opened.
>>> However, that just postponed the server's death to about one week of
>>> uptime.
>>>
>>> I was able to make some checks during the latest crash, since I was
>>> awake at 3 AM. What I found was an unbelievable number of lost
>>> (unclosed) pipes:
>>>
>>> java 30142 glassfish 467r FIFO 0,8 0t0 4659245 pipe
>>> java 30142 glassfish 468w FIFO 0,8 0t0 4659245 pipe
>>> java 30142 glassfish 469u 0000 0,9 0 6821 anon_inode
>>> java 30142 glassfish 487r FIFO 0,8 0t0 4676297 pipe
>>> java 30142 glassfish 488w FIFO 0,8 0t0 4676297 pipe
>>> java 30142 glassfish 489u 0000 0,9 0 6821 anon_inode
>>>
>>> The logs show a very long quiet period; just before the failure the
>>> log shows a normal line from the server actually working (one of the
>>> applications).
>>> Then the log rolls over and keeps rolling every second. The failures
>>> start with (attached error_one.txt)
>>>
>>> The only line that has been obfuscated is the one with .... in it.
>>> The com.planetj... entry is a filter used to implement gzip
>>> compression (input and output), since I could not find out how to
>>> configure that in Glassfish.
>>> The org.springframework... entries are obviously the Spring Framework.
>>>
>>> The log has an enormous number of those messages (2835 in 19
>>> seconds). The messages are logged from within the same thread (same
>>> _ThreadID and _ThreadName), which leads me to believe they are all
>>> the result of processing a single request.
>>> Afterwards the server begins dumping a lot of messages like the
>>> attached error_two.txt.
>>> The server is effectively blocked from that point on.
>>>
>>> At that point lsof shows 64K open files for Glassfish, the
>>> overwhelming majority being open pipes (three descriptors each).
>>>
>>> I am at a loss here... The server currently needs either periodic
>>> restarts, or I have to 'kill' it when it blocks.
>>>
>>> I've been digging around the Internet for this error, and the
>>> closest match I've found was due to not closing (leaking) Selectors.
>>> Please advise!
>>>
>>>
>