users@glassfish.java.net

File Descriptors leaking

From: Lachezar Dobrev <l.dobrev_at_gmail.com>
Date: Wed, 29 Aug 2012 11:40:14 +0300

  Hello colleagues,

  Recently we switched from Tomcat to Glassfish.
  However, I noticed that at a certain point (unknown as of yet) the
Glassfish server stops responding. I can't even stop it cleanly
(asadmin stop-domain hangs!).

  - Ubuntu Server - 12.04 (precise)
  - Intel Xeon (x64 arch)
  - java version "1.7.0_03"
  - OpenJDK 64-Bit Server VM (build 22.0-b10, mixed mode)
  - Glassfish 3.1.2 (no upgrades pending)

  The server sits behind an Apache façade and receives requests via a
JK Connector (mod_jk).

  The server runs only three applications (and the admin interface).
All applications use the Spring Framework. One uses JPA against a
PostgreSQL database on the local host, one uses JPA with ObjectDB, and
two use JDBC pool connections to a remote Microsoft SQL Server.

  The culprit seems to be some kind of file descriptor leak.
  Initially the server died within a day or two. I increased the open
files limit from (s1024/h4096) to (s65536/h65536), thinking the process
might simply need to keep that many files open. However, that only
postponed the server's death to about one week of uptime.
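
  A small watcher along the following lines could be run next to the
domain to track the descriptor count over time. This is only a sketch
(the class name FdWatch is mine, not part of any of the applications);
it relies on the com.sun.management UnixOperatingSystemMXBean, which is
exposed by the Unix OpenJDK builds:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdWatch {
    public static void main(String[] args) throws InterruptedException {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (!(os instanceof UnixOperatingSystemMXBean)) {
            System.err.println("Descriptor counters are only exposed on Unix JVMs.");
            return;
        }
        UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
        while (true) {
            // Print current vs. maximum descriptor count once a minute.
            System.out.printf("open fds: %d / limit: %d%n",
                    unix.getOpenFileDescriptorCount(),
                    unix.getMaxFileDescriptorCount());
            Thread.sleep(60000L);
        }
    }
}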

  I was able to run some checks during the latest crash, since I was
awake at 3 AM. What I found was an unbelievable number of lost
(unclosed) pipes:

> java 30142 glassfish 467r FIFO 0,8 0t0 4659245 pipe
> java 30142 glassfish 468w FIFO 0,8 0t0 4659245 pipe
> java 30142 glassfish 469u 0000 0,9 0 6821 anon_inode
> java 30142 glassfish 487r FIFO 0,8 0t0 4676297 pipe
> java 30142 glassfish 488w FIFO 0,8 0t0 4676297 pipe
> java 30142 glassfish 489u 0000 0,9 0 6821 anon_inode
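
  As far as I can tell, each such triple (two FIFO descriptors sharing
an inode plus one anon_inode) is what a single unclosed
java.nio.channels.Selector allocates on Linux: an epoll instance and
its wakeup pipe. A minimal sketch that reproduces the same lsof picture
(the class name SelectorLeakDemo is mine, purely for illustration):

import java.io.IOException;
import java.nio.channels.Selector;
import java.util.ArrayList;
import java.util.List;

public class SelectorLeakDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        List<Selector> leaked = new ArrayList<Selector>();
        for (int i = 0; i < 1000; i++) {
            // On Linux every Selector.open() allocates an epoll descriptor
            // (anon_inode) plus a wakeup pipe (two FIFO descriptors sharing
            // one inode). Never calling close() keeps all three alive.
            leaked.add(Selector.open());
        }
        System.out.println("Leaked " + leaked.size()
                + " selectors; inspect with: lsof -p <pid>");
        Thread.sleep(600000L); // keep the JVM alive so lsof can be run
    }
}

  Running that and pointing lsof at the process shows the same
FIFO/anon_inode triples as in the excerpt above.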

  The logs show a very long quiet period; just before the failure
there is a normal log line from one of the applications doing its
usual work.
  Then the log rolls, and keeps rolling every second. The failures
start with the messages in the attached error_one.txt.

  The only line that has been obfuscated is the one with .... in it.
  The com.planetj... entries come from a filter used to implement gzip
compression (input and output), since I could not find how to configure
that in Glassfish.
  The org.springframework... is obviously the Spring Framework.

  The log contains an enormous number of those messages (2835 within
19 seconds). They are all logged from the same thread (same _ThreadID
and _ThreadName), which leads me to believe they all result from the
processing of a single request.
  Afterwards the server begins dumping a lot of messages like the ones
in the attached error_two.txt.
  The server is effectively blocked from that point on.

  At that point lsof shows 64K open files for the Glassfish process,
the overwhelming majority being open pipes (three descriptors each).

  I am at a loss here... The server currently needs either periodic
restarts, or I have to 'kill' it when it blocks.

  I've been digging around the Internet for this error, and the
closest match I've found was caused by not closing (leaking) Selectors.
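
  If leaking Selectors really is the problem, the fix on the
application side would presumably be to make sure every Selector is
closed on all paths, e.g. with try-with-resources (we are on Java 7).
This is a rough pattern only; the method pollOnce and its body are
mine, just to show the shape:

import java.io.IOException;
import java.nio.channels.Selector;

public class SelectorUsage {
    // Opens a selector, uses it, and guarantees the three underlying
    // descriptors (epoll fd + wakeup pipe) are released when done.
    static void pollOnce() throws IOException {
        try (Selector selector = Selector.open()) {
            // ... register channels and call select() here ...
            selector.selectNow();
        } // close() releases the pipe pair and the anon_inode descriptor
    }
}
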
  Please advise!