users@glassfish.java.net

Re: File Descriptors leaking

From: Oleksiy Stashok <oleksiy.stashok_at_oracle.com>
Date: Tue, 11 Sep 2012 14:23:25 +0200

Hi,

>>> 1. Can I "upgrade" to 3.1.2.2 using the update manager?
>>> If not, where can I read about the procedure for upgrading a
>>> production server?
>> I'm trying to figure that out and will let you know. Glassfish 3.1.2.2 has
>> fixes for the JK connector [1], so upgrading would really make sense for you,
>> even though it doesn't have any file-descriptor related fixes.
> I took a look at my server. I cannot find any reference to the JK
> Connector, but found that a number of the Installed Components
> are version 3.1.2.2-5. Grizzly, however, has an odd version:
> glassfish-grizzly 1.9.50-1
>
> There are no available updates.
That's ok, you have the latest Glassfish 3.1.2.2

>> When you see the problem, pls. run netstat; it will help us understand
>> which types of connections consume the descriptors.
> Done that!
> When the server hangs there are very few network connections. Fewer
> than 10 in any case, and I believe all of them are hot spares or
> keep-alives.
> When descriptors leak the overwhelming majority is consumed by PIPE
> and 0000 descriptors.
> I am attaching an excerpt of the lsof output (obfuscated).
> I took the liberty of posting only 1K lines, because the lsof output is
> more than 5MB and would probably be dropped by the mail agent. Just
> imagine the last three lines (1 anon_inode + 2 pipe) repeating for
> another 64967 lines.
>
> There is only one JK request, that has been invoked from the Nagios
> system (pokes the port every 5 minutes or so).
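[A quick way to summarize which descriptor types dominate an lsof dump like the excerpt above is to count the TYPE column (field 5). This is my illustration, not part of the original exchange; the sample lines are taken from the excerpt, and in practice you would capture the real dump with `lsof -p <glassfish-pid> > lsof.txt` first.]

```shell
# Sample lines from the lsof excerpt above; replace with a real capture:
#   lsof -p <glassfish-pid> > lsof.txt
cat > lsof.txt <<'EOF'
java 30142 glassfish 467r FIFO 0,8 0t0 4659245 pipe
java 30142 glassfish 468w FIFO 0,8 0t0 4659245 pipe
java 30142 glassfish 469u 0000 0,9 0 6821 anon_inode
EOF
# Field 5 of lsof output is the descriptor TYPE (FIFO, 0000, IPv4, ...);
# count each type and print the most frequent first.
awk '{count[$5]++} END {for (t in count) printf "%7d %s\n", count[t], t}' lsof.txt | sort -rn
```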
Ok, I've created a patch which should help us understand whether the
problem is related to leaking NIO Selectors.
Pls. copy the attached file to {domain-folder}/lib/ext (on my machine
it's glassfish3/glassfish/domains/domain1/lib/ext).
In the GF domain.xml, pls. add the following line under the <java-config>
element:
<jvm-options>-Djava.nio.channels.spi.SelectorProvider=debug.DebugSelectorProvider</jvm-options>

(I'm sending back your domain.xml file w/ the line added.)
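[For orientation, the added line would sit in domain.xml roughly as below. This placement sketch is mine, not from the original mail; the surrounding jvm-options vary per installation.]

```xml
<java-config classpath-suffix="" debug-enabled="false" system-classpath="">
  <!-- ...existing jvm-options stay as they are... -->
  <jvm-options>-Djava.nio.channels.spi.SelectorProvider=debug.DebugSelectorProvider</jvm-options>
</java-config>
```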

After you restart the GF you'll see lots of stacktraces in the
server.log recording all the Selector.open() operations.
Once you see the problem we can check whether the number of allocated
selectors is really high and inspect the stacktraces of those
allocations, so at that point pls. share the server.log* files
(compress them before sending :))
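[The attached patch itself is not preserved in this archive. Based on the instructions above, a debug provider of this kind might look roughly like the sketch below. This is my reconstruction, not Oleksiy's actual code: the class would be declared in a `debug` package to match the jvm-options line, and the delegate lookup via `sun.nio.ch.DefaultSelectorProvider` is a JDK-internal assumption (its factory method is `create` on JDK 7/8).]

```java
import java.io.IOException;
import java.net.ProtocolFamily;
import java.nio.channels.DatagramChannel;
import java.nio.channels.Pipe;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.channels.spi.AbstractSelector;
import java.nio.channels.spi.SelectorProvider;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a SelectorProvider wrapper that records a stack trace for every
// Selector.open(), so leaked selectors can be traced to their allocation site.
// (Would be declared in package "debug" to match the -D line above.)
public class DebugSelectorProvider extends SelectorProvider {

    private static final AtomicLong OPEN_COUNT = new AtomicLong();
    private final SelectorProvider delegate;

    // No-arg constructor used when installed via the system property.
    // It must delegate to the *platform* default provider, not to
    // SelectorProvider.provider(), which would resolve to this class again.
    public DebugSelectorProvider() {
        this(platformDefault());
    }

    public DebugSelectorProvider(SelectorProvider delegate) {
        this.delegate = delegate;
    }

    private static SelectorProvider platformDefault() {
        try {
            // JDK-internal API; the factory method is "create" on JDK 7/8.
            return (SelectorProvider) Class.forName("sun.nio.ch.DefaultSelectorProvider")
                    .getMethod("create").invoke(null);
        } catch (Exception e) {
            throw new IllegalStateException("cannot load default SelectorProvider", e);
        }
    }

    @Override
    public AbstractSelector openSelector() throws IOException {
        long n = OPEN_COUNT.incrementAndGet();
        // Every open() is logged with a stack trace; a leak shows up in
        // server.log as an ever-growing count with identical traces.
        new Throwable("Selector.open() #" + n).printStackTrace();
        return delegate.openSelector();
    }

    public static long openCount() {
        return OPEN_COUNT.get();
    }

    // Remaining operations pass straight through to the real provider.
    @Override public DatagramChannel openDatagramChannel() throws IOException {
        return delegate.openDatagramChannel();
    }
    @Override public DatagramChannel openDatagramChannel(ProtocolFamily family) throws IOException {
        return delegate.openDatagramChannel(family);
    }
    @Override public Pipe openPipe() throws IOException {
        return delegate.openPipe();
    }
    @Override public ServerSocketChannel openServerSocketChannel() throws IOException {
        return delegate.openServerSocketChannel();
    }
    @Override public SocketChannel openSocketChannel() throws IOException {
        return delegate.openSocketChannel();
    }
}
```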

Thanks.

WBR,
Alexey.

>
>> Thanks.
> No, no! Thank you!
>
>> WBR,
>> Alexey.
>>
>> [1] http://java.net/jira/browse/GLASSFISH-18446
>>
>>> 2012/9/10 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>:
>>>> Hi,
>>>>
>>>> can you pls. try GF version 3.1.2.2?
>>>> Also, if it's possible, pls. attach GF domain.xml.
>>>>
>>>> Thanks.
>>>>
>>>> WBR,
>>>> Alexey.
>>>>
>>>>
>>>> On 09/10/2012 10:45 AM, Lachezar Dobrev wrote:
>>>>> Hello all...
>>>>> I have received no responses on this problem.
>>>>>
>>>>> I am still having this issue once or twice a week.
>>>>>
>>>>> After a number of searches in the past weeks I've gained little in
>>>>> terms of understanding what happens.
>>>>>
>>>>> In my search I found out a defect report against Oracle's JVM, that
>>>>> might be connected to the issue:
>>>>>
>>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7118373
>>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=2223521
>>>>>
>>>>> I also came up with a mention in some blog:
>>>>>
>>>>> http://blog.fuseyism.com/index.php/category/openjdk/
>>>>> (sorry, could not come up with a more legit source).
>>>>>
>>>>> From the blog post I can see that the mentioned defect is noted in
>>>>> 'Release 2.3.0 (2012-08-15)', which is, funnily, two days after my post.
>>>>> May I have your comments? Does this sound like an OpenJDK defect? Is it
>>>>> possible that it has been fixed in the meantime? From the looks of my
>>>>> machine it seems it still uses OpenJDK 2.1.1 (openjdk-7-jdk
>>>>> 7~u3-2.1.1~pre1-1ubuntu3).
>>>>>
>>>>> Please advise!
>>>>>
>>>>> 2012/8/29 Lachezar Dobrev <l.dobrev_at_gmail.com>:
>>>>>> Hello colleagues,
>>>>>>
>>>>>> Recently we switched from Tomcat to Glassfish.
>>>>>> However, I noticed that at a certain point (unknown as of yet) the
>>>>>> Glassfish server stops responding. I can't even stop it correctly
>>>>>> (asadmin stop-domain hangs!).
>>>>>>
>>>>>> - Ubuntu Server - 12.04 (precise)
>>>>>> - Intel Xeon (x64 arch)
>>>>>> - java version "1.7.0_03"
>>>>>> - OpenJDK 64-Bit Server VM (build 22.0-b10, mixed mode)
>>>>>> - Glassfish 3.1.2 (no upgrades pending)
>>>>>>
>>>>>> The server serves via a JK Connector behind a façade Apache server
>>>>>> using mod_jk.
>>>>>>
>>>>>> The server runs only three applications (and the admin interface).
>>>>>> All applications use Spring Framework. One uses JPA to a PostgreSQL on
>>>>>> the local host, one uses an ObjectDB JPA, two use JDBC pool
>>>>>> connections to a remote Microsoft SQL Server.
>>>>>>
>>>>>> The culprit seems to be some kind of File Descriptor leak.
>>>>>> Initially the server died within a day or two. I increased the
>>>>>> open-files limit from (s1024/h4096) to (s65536/h65536), thinking this
>>>>>> might just be because too many files needed to be open. However, that
>>>>>> only postponed the server's death to about one week of uptime.
>>>>>>
>>>>>> I was able to make some checks at the latest crash, since I was
>>>>>> awake at 3AM. What I found was an unbelievable number of lost
>>>>>> (unclosed) pipes:
>>>>>>
>>>>>>> java 30142 glassfish 467r FIFO 0,8 0t0 4659245 pipe
>>>>>>> java 30142 glassfish 468w FIFO 0,8 0t0 4659245 pipe
>>>>>>> java 30142 glassfish 469u 0000 0,9 0 6821 anon_inode
>>>>>>> java 30142 glassfish 487r FIFO 0,8 0t0 4676297 pipe
>>>>>>> java 30142 glassfish 488w FIFO 0,8 0t0 4676297 pipe
>>>>>>> java 30142 glassfish 489u 0000 0,9 0 6821 anon_inode
>>>>>> The logs show a very long quiet period; just before the failure the
>>>>>> log shows a normal line from the actual server working (one of the
>>>>>> applications).
>>>>>> Then the log rolls and starts rolling every second. The failures
>>>>>> start with (attached error_one.txt).
>>>>>>
>>>>>> The only line that has been obfuscated is the one with .... in it.
>>>>>> The com.planetj... is a filter used to implement gzip compression
>>>>>> (input and output) since I could not find how to configure that in
>>>>>> Glassfish.
>>>>>> The org.springframework... is obviously the Spring Framework.
>>>>>>
>>>>>> The log has an enormous number (2835 in 19 seconds) of those
>>>>>> messages. The messages are logged from within the same thread (same
>>>>>> _ThreadID and _ThreadName), which leads me to believe all messages are
>>>>>> the result of processing a single request.
>>>>>> Afterwards the server begins dumping a lot of messages like
>>>>>> (attached error_two.txt).
>>>>>> The server is effectively blocked from that time on.
>>>>>>
>>>>>> At that point lsof shows 64K open files from Glassfish, the
>>>>>> enormous majority being open pipes (three descriptors each).
>>>>>>
>>>>>> I am at a loss here... The server currently needs either a periodic
>>>>>> restart, or I need to 'kill' it when it blocks.
>>>>>>
>>>>>> I've been digging for this error around the Internet, and the
>>>>>> closest match I've seen was due to not closing (leaking) Selectors.
>>>>>> Please advise!
>>>>