users@glassfish.java.net

Re: File Descriptors leaking

From: Lachezar Dobrev <l.dobrev_at_gmail.com>
Date: Tue, 11 Sep 2012 02:37:58 +0300

2012/9/10 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>:
> Hi,
>
>
> On 09/10/2012 01:48 PM, Lachezar Dobrev wrote:
>>
>> 1. Can I "upgrade" to 3.1.2.2 using the update manager?
>> If not, where can I read about the procedure for upgrading a
>> production server?
>
> Trying to figure that out, will let you know. Glassfish 3.1.2.2 has fixes
> for the JK connector [1], so upgrading would really make sense for you,
> even though it doesn't include any file-descriptor-related fixes.

  I took a look at my server. I cannot find any reference to the JK
Connector, but found that a number of the Installed Components are
version 3.1.2.2-5. Grizzly, however, has an odd version:
  glassfish-grizzly 1.9.50-1

  There are no available updates.

>> 2. domain.xml attached.
>> Please be advised that sensitive information inside has been
>> obfuscated. Only values have been obfuscated; the structure has not
>> been altered.
>
> Looks fine, not sure if you really need 200 threads for thread-pool-1, but
> it shouldn't cause any problems.

  The big pool size is required because some actions take a long time
to execute while putting very little load on the machine: they mostly
wait on a stored procedure running on the SQL Server. With a smaller
pool the threads are exhausted very quickly, even at very little
overall load.
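
  (For illustration, with made-up numbers: if each such request spends
roughly 20 seconds waiting on the stored procedure and roughly 10 of
them arrive per second, then by Little's law about 20 x 10 = 200
threads are busy just waiting, which matches the configured pool
size.)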

> Does your server machine run only the Glassfish server, or is the Apache
> front-end also installed on the same machine?

  The Apache front-end is at the Gate. Glassfish runs on a machine in
the DMZ. Physically different machines, not VMs.

> When you see the problem, pls. run netstat; it will help us understand
> the type of connections that consume the descriptors.

  Done that!
  When the server hangs there are very few network connections: fewer
than 10 in any case, and I believe all of them are hot spares or
keep-alive connections.
  When descriptors leak, the overwhelming majority are consumed by pipe
and 0000 descriptors.
  I am attaching an excerpt of the lsof output (obfuscated).
  I took the liberty of posting only 1K lines, because the full lsof
output is more than 5 MB and would probably be dropped by the mail
agent. Just imagine the last three lines (1 anon_inode + 2 pipe)
repeating for another 64967 lines.
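
  Side note: on this generation of OpenJDK on Linux, each java.nio
Selector pins exactly such a trio: one epoll anon_inode descriptor
plus a wakeup pipe pair, so the numbers suggest tens of thousands of
Selectors opened and never closed. A quick throw-away sketch to see
the effect (assuming a Linux /proc filesystem; this is not code from
the deployed applications):

    import java.io.File;
    import java.nio.channels.Selector;

    public class SelectorFdDemo {
        // Count this JVM's open descriptors by listing /proc/self/fd.
        static int openFds() {
            return new File("/proc/self/fd").list().length;
        }

        public static void main(String[] args) throws Exception {
            int before = openFds();
            Selector sel = Selector.open();  // epoll fd + wakeup pipe
            // Typically prints +3: one anon_inode and two pipe ends.
            System.out.println("after open:  +" + (openFds() - before));
            sel.close();
            // Back to +0; skipping close() leaks all three descriptors.
            System.out.println("after close: +" + (openFds() - before));
        }
    }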

  There is only one JK request, which was invoked by the Nagios
system (it pokes the port every 5 minutes or so).

> Thanks.

  No, no! Thank you!

> WBR,
> Alexey.
>
> [1] http://java.net/jira/browse/GLASSFISH-18446
>
>>
>> 2012/9/10 Oleksiy Stashok <oleksiy.stashok_at_oracle.com>:
>>>
>>> Hi,
>>>
>>> can you pls. try GF version 3.1.2.2?
>>> Also, if it's possible, pls. attach GF domain.xml.
>>>
>>> Thanks.
>>>
>>> WBR,
>>> Alexey.
>>>
>>>
>>> On 09/10/2012 10:45 AM, Lachezar Dobrev wrote:
>>>>
>>>> Hello all...
>>>> I have received no responses on this problem.
>>>>
>>>> I am still having this issue once or twice a week.
>>>>
>>>> After a number of searches in the past weeks, I've gained little
>>>> understanding of what actually happens.
>>>>
>>>> In my search I found a defect report against Oracle's JVM that
>>>> might be connected to the issue:
>>>>
>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7118373
>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=2223521
>>>>
>>>> I also came across a mention in a blog:
>>>>
>>>> http://blog.fuseyism.com/index.php/category/openjdk/
>>>> (sorry, I could not find a more authoritative source).
>>>>
>>>> From the blog post I can see that the mentioned defect is noted in
>>>> 'Release 2.3.0 (2012-08-15)', funnily just two days after my post.
>>>> May I have your comments? Does this sound like an OpenJDK defect? Is
>>>> it possible that it has been fixed in the meantime? From the looks of
>>>> it, my machine still uses OpenJDK 2.1.1 (openjdk-7-jdk
>>>> 7~u3-2.1.1~pre1-1ubuntu3).
>>>>
>>>> Please advise!
>>>>
>>>> 2012/8/29 Lachezar Dobrev <l.dobrev_at_gmail.com>:
>>>>>
>>>>> Hello colleagues,
>>>>>
>>>>> Recently we switched from Tomcat to Glassfish.
>>>>> However, I noticed that at a certain point (unknown as of yet) the
>>>>> Glassfish server stops responding. I can't even stop it correctly
>>>>> (asadmin stop-domain hangs!).
>>>>>
>>>>> - Ubuntu Server - 12.04 (precise)
>>>>> - Intel Xeon (x64 arch)
>>>>> - java version "1.7.0_03"
>>>>> - OpenJDK 64-Bit Server VM (build 22.0-b10, mixed mode)
>>>>> - Glassfish 3.1.2 (no upgrades pending)
>>>>>
>>>>> The server serves requests via a JK Connector behind a façade
>>>>> Apache server running mod_jk.
>>>>>
>>>>> The server runs only three applications (and the admin interface).
>>>>> All applications use the Spring Framework. One uses JPA against a
>>>>> PostgreSQL database on the local host, one uses ObjectDB via JPA,
>>>>> and two use pooled JDBC connections to a remote Microsoft SQL Server.
>>>>>
>>>>> The culprit seems to be some kind of file descriptor leak.
>>>>> Initially the server died within a day or two. I had to increase
>>>>> the open-files limit from (s1024/h4096) to (s65536/h65536), thinking
>>>>> this might simply be because too many files need to be opened.
>>>>> However, that just postponed the server's death to about one week
>>>>> of uptime.
>>>>>
>>>>> I was able to make some checks at the latest crash, since I was
>>>>> awake at 3 AM. What I found was an unbelievable number of lost
>>>>> (unclosed) pipes:
>>>>>
>>>>>> java 30142 glassfish 467r FIFO 0,8 0t0 4659245 pipe
>>>>>> java 30142 glassfish 468w FIFO 0,8 0t0 4659245 pipe
>>>>>> java 30142 glassfish 469u 0000 0,9 0 6821 anon_inode
>>>>>> java 30142 glassfish 487r FIFO 0,8 0t0 4676297 pipe
>>>>>> java 30142 glassfish 488w FIFO 0,8 0t0 4676297 pipe
>>>>>> java 30142 glassfish 489u 0000 0,9 0 6821 anon_inode
>>>>>
>>>>> The logs show a very long quiet period; just before the failure
>>>>> there is a normal log line from one of the applications actually
>>>>> working. Then the log rolls, and keeps rolling every second. The
>>>>> failures start with the messages in the attached error_one.txt.
>>>>>
>>>>> The only line that has been obfuscated is the one with .... in it.
>>>>> The com.planetj... entry is a filter used to implement gzip
>>>>> compression (input and output), since I could not find how to
>>>>> configure that in Glassfish.
>>>>> The org.springframework... entries are obviously the Spring Framework.
>>>>>
>>>>> The log has an enormous number of those messages (2835 in 19
>>>>> seconds). They are logged from within the same thread (same
>>>>> _ThreadID and _ThreadName), which leads me to believe they all
>>>>> result from the processing of a single request.
>>>>> Afterwards the server begins dumping a lot of messages like those
>>>>> in the attached error_two.txt.
>>>>> The server is effectively blocked from that time on.
>>>>>
>>>>> At that point lsof shows 64K open files from Glassfish, the
>>>>> enormous majority being open pipes (three descriptors each).
>>>>>
>>>>> I am at a loss here... The server currently needs either a periodic
>>>>> restart, or I need to 'kill' it when it blocks.
>>>>>
>>>>> I've been digging around the Internet for this error, and the
>>>>> closest match I've seen was caused by not closing (leaking)
>>>>> Selectors.
>>>>> Please advise!
>>>
>>>
>
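
P.S. For anyone else chasing this: a small watchdog sketch (my own, not
from the deployed code) that logs the JVM's descriptor usage over time,
using the JDK-specific com.sun.management API available on OpenJDK on
Linux:

    import java.lang.management.ManagementFactory;
    import com.sun.management.UnixOperatingSystemMXBean;

    public class FdWatch {
        public static void main(String[] args) throws Exception {
            // The cast works on Unix HotSpot/OpenJDK JVMs only.
            UnixOperatingSystemMXBean os = (UnixOperatingSystemMXBean)
                    ManagementFactory.getOperatingSystemMXBean();
            while (true) {
                System.out.printf("open fds: %d / max %d%n",
                        os.getOpenFileDescriptorCount(),
                        os.getMaxFileDescriptorCount());
                Thread.sleep(60000);  // sample once a minute
            }
        }
    }

Graphing that count over a few days should show whether the leak is
steady or tied to specific requests.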