users@glassfish.java.net

Re: glassfish 3.1.1, VPS (max sockets 1024 + 1024 local sockets), config suggestions to avoid Too many open files problem

From: Oleksiy Stashok <oleksiy.stashok_at_oracle.com>
Date: Tue, 24 Jul 2012 18:33:44 +0200

Hi Mladen,

> increased the number of connections in time_wait state.
> 110 CLOSE_WAIT
> 251 ESTABLISHED
> 2 FIN_WAIT2
> 10 LISTEN
> 1 SYN_RECV
> 404 TIME_WAIT
>
>
> The connection queue of 500 was full and could not accommodate further
> requests. I had to lower it so the webserver doesn't run out of available
> sockets (1024, a common value with Virtuozzo).
I doubt the connection queue size will help you prevent more sockets from
being accepted. IMO you can keep the default value.
You can write a simple app like:

import java.net.Socket;

// Open connections until the server (or the local socket limit) refuses one.
Socket[] sockets = new Socket[1024];
int i = 0;
try {
     for (; i < 1024; i++) {
          sockets[i] = new Socket(yourServerHost, yourServerPort);
     }
} catch (Exception e) {
       System.out.println("Failed on connection #" + i);
}

to double-check whether the connection queue size makes any difference.


> If I understand correctly, TIME_WAIT is when the server, after closing a
> connection, waits to be sure the client received the data, and it can
> last up to 240 seconds. This time should not be lowered in Linux; it can
> have big consequences.
Well, if you're so limited in resources, maybe you can try to decrease
this timeout.
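For reference, the usual Linux knobs here are sysctls; a minimal sketch,
assuming a kernel where these exist (note that the TIME_WAIT interval itself
is compiled into the Linux kernel and is not directly tunable):

```shell
# Inspect the current values
sysctl net.ipv4.tcp_fin_timeout
sysctl net.ipv4.tcp_tw_reuse

# Allow reuse of TIME_WAIT sockets for new outgoing connections,
# and shorten how long orphaned connections sit in FIN_WAIT2.
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=30
```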
Anyway, agree w/ your conclusion below :))
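As a back-of-the-envelope check on that conclusion, Little's law says the
number of sockets held at any moment is roughly the connection-close rate
times the hold time. A minimal sketch with illustrative numbers (the close
rate is an assumption, not a measurement from this thread):

```java
// Little's law sanity check: sockets held ~= close rate * hold time.
public class TimeWaitEstimate {
    public static void main(String[] args) {
        double closesPerSecond = 40.0; // assumption: ~40 connections closed/sec under load
        double timeWaitSeconds = 60.0; // Linux TIME_WAIT is a compiled-in 60s (2*MSL)
        long held = Math.round(closesPerSecond * timeWaitSeconds);
        // With a 1024-socket budget on the VPS, this alone exceeds the limit.
        System.out.println("Expected sockets in TIME_WAIT: " + held);
    }
}
```

So even a modest close rate, rather than CPU load, is what exhausts a
1024-socket budget.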

Thanks.

WBR,
Alexey.

>
> Now after turning the server down and up, it starts to increase again:
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
> 35 ESTABLISHED
> 20 FIN_WAIT1
> 4 FIN_WAIT2
> 74 LAST_ACK
> 10 LISTEN
> 1 SYN_RECV
> 119 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
> 43 ESTABLISHED
> 1 FIN_WAIT1
> 23 FIN_WAIT2
> 10 LISTEN
> 2 SYN_RECV
> 228 TIME_WAIT
>
> OK, I decided to try to crash it from my localhost:
>
> I ran from my localhost: # ab -n 10000 -c 1000
> http://www.numbeo.com/cost-of-living/
> to try to crash my own webserver; this is how it looks:
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
> 7 CLOSE_WAIT
> 337 ESTABLISHED
> 25 FIN_WAIT1
> 12 FIN_WAIT2
> 10 LISTEN
> 74 SYN_RECV
> 610 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
> 343 ESTABLISHED
> 28 FIN_WAIT1
> 16 FIN_WAIT2
> 10 LISTEN
> 119 SYN_RECV
> 1051 TIME_WAIT
>
> ab finished with
> Completed 1000 requests
> Completed 2000 requests
> apr_socket_recv: Connection reset by peer (104)
> Total of 2520 requests completed
>
> and now the sockets seem to be coming back to normal values:
> # netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
> 3 CLOSE_WAIT
> 119 ESTABLISHED
> 2 FIN_WAIT1
> 21 FIN_WAIT2
> 10 LISTEN
> 34 SYN_RECV
> 179 TIME_WAIT
>
> During that period I can see it ran out of sockets in the log:
> [#|2012-07-24T10:08:40.855+0000|WARNING|glassfish3.1.2|com.sun.grizzly.config.GrizzlyServiceListener|_ThreadID=11;_ThreadName=Thread-2;|GRIZZLY0006:
> Exception accepting channel
> java.io.IOException: Too many open files
>
>
> and server load is not bad at all:
> # uptime
> 12:12:29 up 15:47, 1 user, load average: 0.24, 0.84, 0.55
>
>
> *Conclusion*: it looks to me that although the webserver processes a
> request in 3 ms on average, the socket remains open much longer to be
> sure the response has been received by the client. So at the moment of
> heavy load testing, 120 requests were being processed but 1600 sockets
> were held in the TIME_WAIT and ESTABLISHED phases of TCP.
> The problem I'm facing might not be a web server issue but rather a
> lack of available sockets under virtualization for TIME_WAIT.
> Perhaps the time has come for me to migrate to a dedicated server and
> raise the available sockets to 65K.
>
>
>
>
> On Tue, Jul 24, 2012 at 11:58 AM, Mladen Adamovic
> <mladen.adamovic_at_gmail.com <mailto:mladen.adamovic_at_gmail.com>> wrote:
>
> Hi guys, currently with max 1 keep-alive request per connection, the
> webserver started to refuse connections; the connection queue (max
> 500) is full.
>
> server.log looks clean and this is summary of the netstat:
>
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 96 CLOSE_WAIT
> 143 ESTABLISHED
> 1 FIN_WAIT1
> 1 FIN_WAIT2
> 10 LISTEN
> 29 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 96 CLOSE_WAIT
> 138 ESTABLISHED
> 6 FIN_WAIT2
> 10 LISTEN
> 20 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 99 CLOSE_WAIT
> 134 ESTABLISHED
> 1 FIN_WAIT2
> 10 LISTEN
> 28 TIME_WAIT
> root_at_lvps176-28-13-94:~# uptime
> 11:52:46 up 15:27, 1 user, load average: 0.30, 0.06, 0.02
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 111 CLOSE_WAIT
> 223 ESTABLISHED
> 1 FIN_WAIT2
> 10 LISTEN
> 2 SYN_RECV
> 17 TIME_WAIT
>
>
> and the number of connections seems to increase:
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 110 CLOSE_WAIT
> 251 ESTABLISHED
> 2 FIN_WAIT2
> 10 LISTEN
> 1 SYN_RECV
> 404 TIME_WAIT
>
>
> Connection queue seems to be full although processing time is 3ms:
>
>
> Monitor (34 Statistics)
> File Cache Statistics : http-listener-1
> (format: Name: Value (Start Time / Last Sample Time) Description)
> ContentMissesCount: 0 count (Jul 23, 2012 9:57:57 PM / --) Number of misses on cached file content
> OpenCacheEntriesCount: 0 count (Jul 23, 2012 9:57:57 PM / --) Number of current open cache entries
> MaxMappedMemorySize: 0 byte(s) (Jul 24, 2012 9:54:48 AM / Jul 24, 2012 9:54:48 AM) Maximum memory map size used for caching
> HitsCount: 0 count (Jul 23, 2012 9:57:57 PM / --) Number of cache lookup hits
> InfoMissesCount: 0 count (Jul 23, 2012 9:57:57 PM / --) Number of misses on cached file info
> ContentHitsCount: 0 count (Jul 23, 2012 9:57:57 PM / --) Number of hits on cached file content
> HeapSize: 0 byte(s) (Jul 24, 2012 9:54:48 AM / Jul 24, 2012 9:54:48 AM) Current cache size in bytes
> MaxHeapSize: 0 byte(s) (Jul 24, 2012 9:54:48 AM / Jul 24, 2012 9:54:48 AM) Maximum heap space used for cache
> InfoHitsCount: 0 count (Jul 23, 2012 9:57:57 PM / --) Number of hits on cached file info
> MappedMemorySize: 0 byte(s) (Jul 24, 2012 9:54:48 AM / Jul 24, 2012 9:54:48 AM) Size of mapped memory used for caching
> MissesCount: 0 count (Jul 23, 2012 9:57:57 PM / --) Number of cache lookup misses
> Keep Alive Statistics : http-listener-1
> CountTimeouts: 48452 count (Jul 23, 2012 9:57:57 PM / Jul 24, 2012 9:47:58 AM) Number of keep-alive connections that timed out
> MaxRequests: 1 count (Jul 23, 2012 9:57:57 PM / Jul 23, 2012 9:57:58 PM) Maximum number of requests allowed on a single keep-alive connection
> CountHits: 217839 count (Jul 23, 2012 9:57:57 PM / Jul 24, 2012 9:47:59 AM) Number of requests received by connections in keep-alive mode
> SecondsTimeouts: 5 seconds (Jul 23, 2012 9:57:57 PM / Jul 23, 2012 9:57:58 PM) Keep-alive timeout value in seconds
> CountConnections: 188 count (Jul 23, 2012 9:57:57 PM / Jul 24, 2012 9:47:59 AM) Number of connections in keep-alive mode
> CountFlushes: 96184 count (Jul 23, 2012 9:57:57 PM / Jul 24, 2012 9:47:52 AM) Number of keep-alive connections that were closed
> CountRefusals: 73011 count (Jul 23, 2012 9:57:57 PM / Jul 24, 2012 9:47:43 AM) Number of keep-alive connections that were rejected
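(Editorial note: MaxRequests=1 above means every keep-alive connection is
torn down after a single request, which matches the large CountRefusals. In
GlassFish 3.1 this is governed by the http element's max-connections and
timeout-seconds attributes; a hedged sketch of raising them, where the exact
dotted names are assumed and should be confirmed with asadmin get:)

```shell
# Raise keep-alive max requests and timeout on http-listener-1.
# Dotted names assumed; confirm with: asadmin get "server.network-config.*"
asadmin set server.network-config.protocols.protocol.http-listener-1.http.max-connections=20
asadmin set server.network-config.protocols.protocol.http-listener-1.http.timeout-seconds=15
```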
> Connection Queue Statistics : http-listener-1
> CountQueued15MinutesAverage: 3365 count (Jul 24, 2012 9:54:48 AM / Jul 24, 2012 9:54:48 AM) Average number of connections queued in the last 15 minutes
> CountQueued5MinutesAverage: 203 count (Jul 24, 2012 9:54:48 AM / Jul 24, 2012 9:54:48 AM) Average number of connections queued in the last 5 minutes
> CountQueued1MinuteAverage: 0 count (Jul 24, 2012 9:54:48 AM / Jul 24, 2012 9:54:48 AM) Average number of connections queued in the last 1 minute
> TicksTotalQueued: 0 count (Jul 23, 2012 9:57:57 PM / Jul 23, 2012 9:57:57 PM) (Unsupported) Total number of ticks that connections have spent in the queue
> PeakQueued: 500 count (Jul 23, 2012 9:57:57 PM / Jul 24, 2012 9:51:45 AM) Largest number of connections that were in the queue simultaneously
> CountOpenConnections: 202 count (Jul 24, 2012 9:54:48 AM / Jul 24, 2012 9:54:48 AM) The number of open/active connections
> CountTotalConnections: 163494 count (Jul 23, 2012 9:57:57 PM / Jul 24, 2012 9:54:48 AM) Total number of connections that have been accepted
> CountTotalQueued: 248658 count (Jul 23, 2012 9:57:57 PM / Jul 24, 2012 9:51:45 AM) Total number of connections that have been queued
> CountQueued: 500 count (Jul 23, 2012 9:57:57 PM / Jul 24, 2012 9:51:45 AM) Number of connections currently in the queue
> CountOverflows: 888 count (Jul 23, 2012 9:57:57 PM / Jul 24, 2012 9:54:48 AM) Number of times the queue has been too full to accommodate a connection
> MaxQueued: 500 count (Jul 23, 2012 9:57:57 PM / Jul 23, 2012 9:57:58 PM) Maximum size of the connection queue
> Thread Pool Statistics : http-listener-1
> CoreThreads: 5 count (Jul 23, 2012 9:57:57 PM / Jul 23, 2012 9:57:58 PM) Core number of threads in the thread pool
> CurrentThreadsBusy: 200 count (Jul 23, 2012 9:57:57 PM / Jul 24, 2012 9:47:59 AM) Number of request processing threads currently in use in the listener thread pool serving requests
> TotalExecutedTasksCount: 247958 count (Jul 23, 2012 9:57:57 PM / Jul 24, 2012 9:47:52 AM) Total number of tasks executed by the thread pool
> MaxThreads: 200 count (Jul 23, 2012 9:57:57 PM / Jul 23, 2012 9:57:58 PM) Maximum number of threads allowed in the thread pool
> CurrentThreadCount: 200 count (Jul 23, 2012 9:57:57 PM / Jul 24, 2012 9:47:59 AM) Number of request processing threads currently in the listener thread pool
>
>
> Monitor (18 Statistics)
> Web Container: Servlet Statistics : LivingCost
> ActiveServletsLoaded: 12 count (Jul 23, 2012 9:58:08 PM / Jul 24, 2012 8:50:52 AM; High Water Mark: 12, Low Water Mark: 0) Number of Servlets loaded
> TotalServletsLoaded: 12 count (Jul 23, 2012 9:58:08 PM / Jul 24, 2012 8:50:52 AM) Total number of Servlets ever loaded
> ServletProcessingTimes: 867166 millisecond (Jul 23, 2012 9:58:08 PM / Jul 24, 2012 9:47:58 AM) Cumulative Servlet processing times
> Web Container: Session Statistics : LivingCost
> PersistedSessionsTotal: 0 count (Jul 23, 2012 9:58:08 PM / --) Total number of sessions ever persisted
> RejectedSessionsTotal: 0 count (Jul 23, 2012 9:58:08 PM / --) Total number of sessions ever rejected
> SessionsTotal: 51635 count (Jul 23, 2012 9:58:08 PM / Jul 24, 2012 9:47:58 AM) Total number of sessions ever created
> ActiveSessions: 2074 count (Jul 23, 2012 9:58:08 PM / Jul 24, 2012 9:55:31 AM; High Water Mark: 4468, Low Water Mark: 0) Number of active sessions
> ExpiredSessionsTotal: 49561 count (Jul 23, 2012 9:58:08 PM / Jul 24, 2012 9:55:31 AM) Total number of sessions ever expired
> ActivatedSessionsTotal: 0 count (Jul 23, 2012 9:58:08 PM / --) Total number of sessions ever activated
> PassivatedSessionsTotal: 0 count (Jul 23, 2012 9:58:08 PM / --) Total number of sessions ever passivated
> Web Container: Request Statistics : LivingCost
> RequestCount: 222664 count (Jul 23, 2012 9:58:08 PM / Jul 24, 2012 9:56:20 AM) Cumulative number of requests processed so far
> MaxTime: 3896 millisecond (Jul 23, 2012 9:58:08 PM / Jul 24, 2012 9:56:20 AM) Longest response time for a request; not a cumulative value, but the largest single response time
> ProcessingTime: 2 millisecond (Jul 23, 2012 9:58:08 PM / Jul 24, 2012 9:56:20 AM) Average request processing time
> ErrorCount: 958 count (Jul 23, 2012 9:58:08 PM / Jul 24, 2012 9:47:50 AM) Cumulative error count (responses with status code >= 400)
> Web Container: JSP Statistics : LivingCost
> JspReloadedCount: 166 count (Jul 23, 2012 9:58:08 PM / Jul 24, 2012 9:01:43 AM) Total number of JSP pages that were reloaded
> TotalJspCount: 166 count (Jul 23, 2012 9:58:08 PM / Jul 24, 2012 9:01:43 AM) Total number of JSP pages ever loaded
> JspCount: 166 count (Jul 23, 2012 9:58:08 PM / Jul 24, 2012 9:01:43 AM; High Water Mark: 166, Low Water Mark: 0) Number of active JSP pages
> JspErrorCount: 0 count (Jul 23, 2012 9:58:08 PM / --) Total number of errors triggered by JSP page invocations
>
>
>
>
>
> On Tue, Jul 24, 2012 at 12:08 AM, Mladen Adamovic
> <mladen.adamovic_at_gmail.com <mailto:mladen.adamovic_at_gmail.com>> wrote:
>
>
> When you disable keep-alive connections, does it change
> anything w.r.t. the number of CLOSE_WAIT connections?
>
>
> Well, I cannot always reproduce the problem; the last time it
> ran for 12 hours before crashing. Note that this is a production
> server, so it doesn't make sense for me to run "experiments
> against the real user base".
>
> Currently the webserver is running with a modified
> configuration (keep-alive max requests set to 20, connection
> timeout to 15 seconds), and this is how the number of
> connections looked recently:
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 105 ESTABLISHED
> 1 FIN_WAIT1
> 42 FIN_WAIT2
> 2 LAST_ACK
> 10 LISTEN
> 4 SYN_RECV
> 112 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 102 ESTABLISHED
> 1 FIN_WAIT1
> 33 FIN_WAIT2
> 2 LAST_ACK
> 10 LISTEN
> 4 SYN_RECV
> 123 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 88 ESTABLISHED
> 1 FIN_WAIT1
> 20 FIN_WAIT2
> 2 LAST_ACK
> 10 LISTEN
> 8 SYN_RECV
> 137 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -a | grep "CLOSE_WAIT" | wc -l
> 0
> root_at_lvps176-28-13-94:~# netstat -a | grep "CLOSE_WAIT" | wc -l
> 0
> root_at_lvps176-28-13-94:~# netstat -a | grep "CLOSE_WAIT" | wc -l
> 2
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 2 CLOSE_WAIT
> 72 ESTABLISHED
> 2 FIN_WAIT1
> 39 FIN_WAIT2
> 10 LISTEN
> 2 SYN_RECV
> 110 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 35 ESTABLISHED
> 7 FIN_WAIT2
> 10 LISTEN
> 31 SYN_RECV
> 88 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 1 CLOSE_WAIT
> 70 ESTABLISHED
> 26 FIN_WAIT2
> 1 LAST_ACK
> 10 LISTEN
> 18 SYN_RECV
> 88 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 71 ESTABLISHED
> 2 FIN_WAIT1
> 28 FIN_WAIT2
> 10 LISTEN
> 6 SYN_RECV
> 128 TIME_WAIT
> root_at_lvps176-28-13-94:~#
>
>
> OK, now checking with max requests per keep-alive connection
> set to 1:
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 28 ESTABLISHED
> 1 FIN_WAIT1
> 3 FIN_WAIT2
> 10 LISTEN
> 5 SYN_RECV
> 241 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 17 ESTABLISHED
> 2 FIN_WAIT1
> 5 FIN_WAIT2
> 10 LISTEN
> 5 SYN_RECV
> 250 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 1 CLOSING
> 30 ESTABLISHED
> 9 FIN_WAIT2
> 10 LISTEN
> 5 SYN_RECV
> 267 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 29 ESTABLISHED
> 6 FIN_WAIT1
> 15 FIN_WAIT2
> 10 LISTEN
> 5 SYN_RECV
> 221 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 26 ESTABLISHED
> 8 FIN_WAIT1
> 16 FIN_WAIT2
> 10 LISTEN
> 6 SYN_RECV
> 227 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 27 ESTABLISHED
> 3 FIN_WAIT1
> 19 FIN_WAIT2
> 10 LISTEN
> 10 SYN_RECV
> 183 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 1 CLOSE_WAIT
> 21 ESTABLISHED
> 4 FIN_WAIT1
> 21 FIN_WAIT2
> 10 LISTEN
> 10 SYN_RECV
> 196 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print
> $6}'|sort|uniq -c
> 1 CLOSE_WAIT
> 32 ESTABLISHED
> 4 FIN_WAIT1
> 15 FIN_WAIT2
> 10 LISTEN
> 7 SYN_RECV
> 239 TIME_WAIT
>
>
> Looks worse. Around ~200 (oscillating) connections in the
> TIME_WAIT state. When not using keep-alive, many connections
> are created and dropped, and a connection can stay in
> TIME_WAIT for up to 240 seconds. It's easy to run out of
> network sockets, it seems.
>
>
>
>
> Thanks.
>
> WBR,
> Alexey.
>
>
>>
>> Hm, I've seen one problem with the Virtuozzo server in
>> the log:
>> quotaugidlimit
>> Number of user/group IDs allowed for the Container
>> internal disk quota. If set to 0, UID/GID quota will not
>> be enabled.
>>
>> This is set to a limit of 2000, but I don't understand what it
>> has to do with files; it should be the number of UID/GIDs, and
>> at the moment it has been steady at 43 on the VPS.
>> Perhaps this is a problem with the Virtuozzo setup, but it has
>> to be proven somehow.
>>
>>
>>
>> --
>> Mladen Adamovic
>> Numbeo
>> Drziceva 9, 11120 Belgrade-Zvezdara, Serbia
>> (Business Registration Number 62612240)
>> Tel. +381-66-058-595 <tel:%2B381-66-058-595>
>> email: mladen.adamovic_at_gmail.com
>> <mailto:mladen.adamovic_at_gmail.com>
>> web: http://www.numbeo.com
>
>
>
>
>
>
>
>
>
>
>
>