users@glassfish.java.net

Re: glassfish 3.1.1, VPS (max sockets 1024 + 1024 local sockets), config suggestions to avoid Too many open files problem

From: Mladen Adamovic <mladen.adamovic_at_gmail.com>
Date: Tue, 24 Jul 2012 12:21:35 +0200

News: my ISP provider claims that one downtime yesterday were caused by
DDoS attack!.

OK, let me try to summarize the last website downtime with max requests per
keep alive connection 1:
increased the number of connections in time_wait state.
110 CLOSE_WAIT
    251 ESTABLISHED
      2 FIN_WAIT2
     10 LISTEN
      1 SYN_RECV
    404 TIME_WAIT


Connection queue of 500 were full to accomodate further requests. I had to
lower it so webserver doesn't run out of available sockets (1024, common
value with Virtuozzo).

If I understand correctly TIME_WAIT is when server waits for response from
client it received the data and it can last up to 240ms. This time shall
not be lowered in Linux, it can have big consequences.

Now after turning the server down and up, it starts to increase again:
root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
     35 ESTABLISHED
     20 FIN_WAIT1
      4 FIN_WAIT2
     74 LAST_ACK
     10 LISTEN
      1 SYN_RECV
    119 TIME_WAIT
root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
     43 ESTABLISHED
      1 FIN_WAIT1
     23 FIN_WAIT2
     10 LISTEN
      2 SYN_RECV
    228 TIME_WAIT

OK, I decided to try to crash it from my localhost:

I run from my localhost: # ab -n 10000 -c 1000
http://www.numbeo.com/cost-of-living/
to try to crash my own webserver, this is how it looks like:
root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
      7 CLOSE_WAIT
    337 ESTABLISHED
     25 FIN_WAIT1
     12 FIN_WAIT2
     10 LISTEN
     74 SYN_RECV
    610 TIME_WAIT
root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
    343 ESTABLISHED
     28 FIN_WAIT1
     16 FIN_WAIT2
     10 LISTEN
    119 SYN_RECV
   1051 TIME_WAIT

ab finished with
Completed 1000 requests
Completed 2000 requests
apr_socket_recv: Connection reset by peer (104)
Total of 2520 requests completed

and now sockets seems to come back to normal value:
# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
      3 CLOSE_WAIT
    119 ESTABLISHED
      2 FIN_WAIT1
     21 FIN_WAIT2
     10 LISTEN
     34 SYN_RECV
    179 TIME_WAIT

During the period I can see it run out of sockets in the log:
[#|2012-07-24T10:08:40.855+0000|WARNING|glassfish3.1.2|com.sun.grizzly.config.GrizzlyServiceListener|_ThreadID=11;_ThreadName=Thread-2;|GRIZZLY0006:
Exception accepting channel
java.io.IOException: Too many open files


and server load is not bad at all:
# uptime
 12:12:29 up 15:47, 1 user, load average: 0.24, 0.84, 0.55


*Conclusion*: it looks to me that although webserver process the request
for 3ms in average, the socket remains open much longer to be sure the
response has been received by the client. So at the moment of heavy load
testing, 120 request were processing but 1600 sockets were kept for
TIME_WAIT and ESTABLISHED phase of tcp.
This problem I might face it not be web server related issue but rather
lack of available sockets under virtualization for TIME_WAIT. Perhaps for
me time has come to migrate to dedicated server and to put available
sockets to 65K.




On Tue, Jul 24, 2012 at 11:58 AM, Mladen Adamovic <mladen.adamovic_at_gmail.com
> wrote:

> Hi guys, currently with 1 max alive request per connection, the webserver
> started to refuse connections, connection queue with max 500 is full.
>
> server.log looks clean and this is summary of the netstat:
>
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
> 96 CLOSE_WAIT
> 143 ESTABLISHED
> 1 FIN_WAIT1
> 1 FIN_WAIT2
> 10 LISTEN
> 29 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
> 96 CLOSE_WAIT
> 138 ESTABLISHED
> 6 FIN_WAIT2
> 10 LISTEN
> 20 TIME_WAIT
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
> 99 CLOSE_WAIT
> 134 ESTABLISHED
> 1 FIN_WAIT2
> 10 LISTEN
> 28 TIME_WAIT
> root_at_lvps176-28-13-94:~# uptime
> 11:52:46 up 15:27, 1 user, load average: 0.30, 0.06, 0.02
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
> 111 CLOSE_WAIT
> 223 ESTABLISHED
> 1 FIN_WAIT2
> 10 LISTEN
> 2 SYN_RECV
> 17 TIME_WAIT
>
>
> and the number of connections seems to increase:
> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
> 110 CLOSE_WAIT
> 251 ESTABLISHED
> 2 FIN_WAIT2
> 10 LISTEN
> 1 SYN_RECV
> 404 TIME_WAIT
>
>
> Connection queue seems to be full although processing time is 3ms:
>
>
> onitor (34 Statistics)
> File Cache Statistics : http-listener-1
> Name
> Value
> Start Time
> Last Sample Time
> Details
> Description
> ContentMissesCount 0 count Jul 23, 2012 9:57:57 PM -- -- Number of misses
> on cached file content
> OpenCacheEntriesCount 0 count Jul 23, 2012 9:57:57 PM -- -- Number of
> current open cache entries
> MaxMappedMemorySize 0 byte(s) Jul 24, 2012 9:54:48 AM Jul 24, 2012
> 9:54:48 AM -- Maximum memory map size used for caching
> HitsCount 0 count Jul 23, 2012 9:57:57 PM -- -- Number of cache lookup
> hits
> InfoMissesCount 0 count Jul 23, 2012 9:57:57 PM -- -- Number of misses on
> cached file info
> ContentHitsCount 0 count Jul 23, 2012 9:57:57 PM -- -- Number of hits on
> cached file content
> HeapSize 0 byte(s) Jul 24, 2012 9:54:48 AM Jul 24, 2012 9:54:48 AM -- Current
> cache size in bytes
> MaxHeapSize 0 byte(s) Jul 24, 2012 9:54:48 AM Jul 24, 2012 9:54:48 AM -- Maximum
> heap space used for cache
> InfoHitsCount 0 count Jul 23, 2012 9:57:57 PM -- -- Number of hits on
> cached file info
> MappedMemorySize 0 byte(s) Jul 24, 2012 9:54:48 AM Jul 24, 2012 9:54:48 AM
> -- Size of mapped memory used for caching
> MissesCount 0 count Jul 23, 2012 9:57:57 PM -- -- Number of cache lookup
> misses
> Keep Alive Statistics : http-listener-1
> Name
> Value
> Start Time
> Last Sample Time
> Details
> Description
> CountTimeouts 48452 count Jul 23, 2012 9:57:57 PM Jul 24, 2012 9:47:58 AM
> -- Number of keep-alive connections that timed out
> MaxRequests 1 count Jul 23, 2012 9:57:57 PM Jul 23, 2012 9:57:58 PM -- Maximum
> number of requests allowed on a single keep-alive connection
> CountHits 217839 count Jul 23, 2012 9:57:57 PM Jul 24, 2012 9:47:59 AM -- Number
> of requests received by connections in keep-alive mode
> SecondsTimeouts 5 seconds Jul 23, 2012 9:57:57 PM Jul 23, 2012 9:57:58 PM
> -- Keep-alive timeout value in seconds
> CountConnections 188 count Jul 23, 2012 9:57:57 PM Jul 24, 2012 9:47:59 AM
> -- Number of connections in keep-alive mode
> CountFlushes 96184 count Jul 23, 2012 9:57:57 PM Jul 24, 2012 9:47:52 AM
> -- Number of keep-alive connections that were closed
> CountRefusals 73011 count Jul 23, 2012 9:57:57 PM Jul 24, 2012 9:47:43 AM
> -- Number of keep-alive connections that were rejected
> Connection Queue Statistics : http-listener-1
> Name
> Value
> Start Time
> Last Sample Time
> Details
> Description
> CountQueued15MinutesAverage 3365 count Jul 24, 2012 9:54:48 AM Jul 24,
> 2012 9:54:48 AM -- Average number of connections queued in the last 15
> minutes
> CountQueued5MinutesAverage 203 count Jul 24, 2012 9:54:48 AM Jul 24, 2012
> 9:54:48 AM -- Average number of connections queued in the last 5 minutes
> CountQueued1MinuteAverage 0 count Jul 24, 2012 9:54:48 AM Jul 24, 2012
> 9:54:48 AM -- Average number of connections queued in the last 1 minute
> TicksTotalQueued 0 count Jul 23, 2012 9:57:57 PM Jul 23, 2012 9:57:57 PM
> -- (Unsupported) Total number of ticks that connections have spent in the
> queue
> PeakQueued 500 count Jul 23, 2012 9:57:57 PM Jul 24, 2012 9:51:45 AM -- Largest
> number of connections that were in the queue simultaneously
> CountOpenConnections 202 count Jul 24, 2012 9:54:48 AM Jul 24, 2012
> 9:54:48 AM -- The number of open/active connections
> CountTotalConnections 163494 count Jul 23, 2012 9:57:57 PM Jul 24, 2012
> 9:54:48 AM -- Total number of connections that have been accepted
> CountTotalQueued 248658 count Jul 23, 2012 9:57:57 PM Jul 24, 2012
> 9:51:45 AM -- Total number of connections that have been queued
> CountQueued 500 count Jul 23, 2012 9:57:57 PM Jul 24, 2012 9:51:45 AM -- Number
> of connections currently in the queue
> CountOverflows 888 count Jul 23, 2012 9:57:57 PM Jul 24, 2012 9:54:48 AM
> -- Number of times the queue has been too full to accommodate a connection
> MaxQueued 500 count Jul 23, 2012 9:57:57 PM Jul 23, 2012 9:57:58 PM -- Maximum
> size of the connection queue
> Thread Pool Statistics : http-listener-1
> Name
> Value
> Start Time
> Last Sample Time
> Details
> Description
> CoreThreads 5 count Jul 23, 2012 9:57:57 PM Jul 23, 2012 9:57:58 PM -- Core
> number of threads in the thread pool
> CurrentThreadsBusy 200 count Jul 23, 2012 9:57:57 PM Jul 24, 2012 9:47:59
> AM -- Provides the number of request processing threads currently in use
> in the listener thread pool serving requests
> TotalExecutedTasksCount 247958 count Jul 23, 2012 9:57:57 PM Jul 24, 2012
> 9:47:52 AM -- Provides the total number of tasks, which were executed by
> the thread pool
> MaxThreads 200 count Jul 23, 2012 9:57:57 PM Jul 23, 2012 9:57:58 PM -- Maximum
> number of threads allowed in the thread pool
> CurrentThreadCount 200 count Jul 23, 2012 9:57:57 PM Jul 24, 2012 9:47:59
> AM -- Provides the number of request processing threads currently in the
> listener thread pool
>
>
>
> Monitor (18 Statistics)
> Web Container: Servlet Statistics : LivingCost
> Name
> Value
> Start Time
> Last Sample Time
> Details
> Description
> ActiveServletsLoaded 12count Jul 23, 2012 9:58:08 PM Jul 24, 2012 8:50:52
> AM High Water Mark: 12 count
> Low Water Mark: 0 count
> Number of Servlets loaded
> TotalServletsLoaded 12 count Jul 23, 2012 9:58:08 PM Jul 24, 2012 8:50:52
> AM -- Total number of Servlets ever loaded
> ServletProcessingTimes 867166 millisecond Jul 23, 2012 9:58:08 PM Jul 24,
> 2012 9:47:58 AM -- Cumulative Servlet processing times
> Web Container: Session Statistics : LivingCost
> Name
> Value
> Start Time
> Last Sample Time
> Details
> Description
> PersistedSessionsTotal 0 count Jul 23, 2012 9:58:08 PM -- -- Total number
> of sessions ever persisted
> RejectedSessionsTotal 0 count Jul 23, 2012 9:58:08 PM -- -- Total number
> of sessions ever rejected
> SessionsTotal 51635 count Jul 23, 2012 9:58:08 PM Jul 24, 2012 9:47:58 AM
> -- Total number of sessions ever created
> ActiveSessions 2074count Jul 23, 2012 9:58:08 PM Jul 24, 2012 9:55:31 AM High
> Water Mark: 4468 count
> Low Water Mark: 0 count
> Number of active sessions
> ExpiredSessionsTotal 49561 count Jul 23, 2012 9:58:08 PM Jul 24, 2012
> 9:55:31 AM -- Total number of sessions ever expired
> ActivatedSessionsTotal 0 count Jul 23, 2012 9:58:08 PM -- -- Total number
> of sessions ever activated
> PassivatedSessionsTotal 0 count Jul 23, 2012 9:58:08 PM -- -- Total
> number of sessions ever passivated
> Web Container: Request Statistics : LivingCost
> Name
> Value
> Start Time
> Last Sample Time
> Details
> Description
> RequestCount 222664 count Jul 23, 2012 9:58:08 PM Jul 24, 2012 9:56:20 AM
> -- Cumulative number of requests processed so far
> MaxTime 3896 millisecond Jul 23, 2012 9:58:08 PM Jul 24, 2012 9:56:20 AM
> -- Longest response time for a request; not a cumulative value, but the
> largest response time from among the response times
> ProcessingTime 2 millisecond Jul 23, 2012 9:58:08 PM Jul 24, 2012 9:56:20
> AM -- Average request processing time
> ErrorCount 958 count Jul 23, 2012 9:58:08 PM Jul 24, 2012 9:47:50 AM -- Cumulative
> value of the error count, with error count representing the number of cases
> where the response code was greater than or equal to 400
> Web Container: JSP Statistics : LivingCost
> Name
> Value
> Start Time
> Last Sample Time
> Details
> Description
> JspReloadedCount 166 count Jul 23, 2012 9:58:08 PM Jul 24, 2012 9:01:43 AM
> -- Total number of JSP pages that were reloaded
> TotalJspCount 166 count Jul 23, 2012 9:58:08 PM Jul 24, 2012 9:01:43 AM -- Total
> number of JSP pages ever loaded
> JspCount 166count Jul 23, 2012 9:58:08 PM Jul 24, 2012 9:01:43 AM High
> Water Mark: 166 count
> Low Water Mark: 0 count
> Number of active JSP pages
> JspErrorCount 0 count Jul 23, 2012 9:58:08 PM -- -- Total number of
> errors triggered by JSP page invocations
>
>
>
>
>
> On Tue, Jul 24, 2012 at 12:08 AM, Mladen Adamovic <
> mladen.adamovic_at_gmail.com> wrote:
>
>>
>>> When you disable keep-alive connections, does it change anything w.r.t.
>>> a number of CLOSE_WAIT connections?
>>>
>>>
>> Well, I cannot always reproduce the problem, the last time it ran for 12
>> hours before crashing. Note that this is production server so it doesn't
>> make sense for me to run "experiments against the real user base".
>>
>> Currently the webserver is running with modified configuration, which
>> includes keep-alive max requests to 20, connection time out to 15 seconds
>> and this is how number of connections did look like recently:
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 105 ESTABLISHED
>> 1 FIN_WAIT1
>> 42 FIN_WAIT2
>> 2 LAST_ACK
>> 10 LISTEN
>> 4 SYN_RECV
>> 112 TIME_WAIT
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 102 ESTABLISHED
>> 1 FIN_WAIT1
>> 33 FIN_WAIT2
>> 2 LAST_ACK
>> 10 LISTEN
>> 4 SYN_RECV
>> 123 TIME_WAIT
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 88 ESTABLISHED
>> 1 FIN_WAIT1
>> 20 FIN_WAIT2
>> 2 LAST_ACK
>> 10 LISTEN
>> 8 SYN_RECV
>> 137 TIME_WAIT
>> root_at_lvps176-28-13-94:~# netstat -a | grep "CLOSE_WAIT" | wc -l
>> 0
>> root_at_lvps176-28-13-94:~# netstat -a | grep "CLOSE_WAIT" | wc -l
>> 0
>> root_at_lvps176-28-13-94:~# netstat -a | grep "CLOSE_WAIT" | wc -l
>> 2
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 2 CLOSE_WAIT
>> 72 ESTABLISHED
>> 2 FIN_WAIT1
>> 39 FIN_WAIT2
>> 10 LISTEN
>> 2 SYN_RECV
>> 110 TIME_WAIT
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 35 ESTABLISHED
>> 7 FIN_WAIT2
>> 10 LISTEN
>> 31 SYN_RECV
>> 88 TIME_WAIT
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 1 CLOSE_WAIT
>> 70 ESTABLISHED
>> 26 FIN_WAIT2
>> 1 LAST_ACK
>> 10 LISTEN
>> 18 SYN_RECV
>> 88 TIME_WAIT
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 71 ESTABLISHED
>> 2 FIN_WAIT1
>> 28 FIN_WAIT2
>> 10 LISTEN
>> 6 SYN_RECV
>> 128 TIME_WAIT
>> root_at_lvps176-28-13-94:~#
>>
>>
>> OK, now checking with max request per keep alive connection set to 1:
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 28 ESTABLISHED
>> 1 FIN_WAIT1
>> 3 FIN_WAIT2
>> 10 LISTEN
>> 5 SYN_RECV
>> 241 TIME_WAIT
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 17 ESTABLISHED
>> 2 FIN_WAIT1
>> 5 FIN_WAIT2
>> 10 LISTEN
>> 5 SYN_RECV
>> 250 TIME_WAIT
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 1 CLOSING
>> 30 ESTABLISHED
>> 9 FIN_WAIT2
>> 10 LISTEN
>> 5 SYN_RECV
>> 267 TIME_WAIT
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 29 ESTABLISHED
>> 6 FIN_WAIT1
>> 15 FIN_WAIT2
>> 10 LISTEN
>> 5 SYN_RECV
>> 221 TIME_WAIT
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 26 ESTABLISHED
>> 8 FIN_WAIT1
>> 16 FIN_WAIT2
>> 10 LISTEN
>> 6 SYN_RECV
>> 227 TIME_WAIT
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 27 ESTABLISHED
>> 3 FIN_WAIT1
>> 19 FIN_WAIT2
>> 10 LISTEN
>> 10 SYN_RECV
>> 183 TIME_WAIT
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 1 CLOSE_WAIT
>> 21 ESTABLISHED
>> 4 FIN_WAIT1
>> 21 FIN_WAIT2
>> 10 LISTEN
>> 10 SYN_RECV
>> 196 TIME_WAIT
>> root_at_lvps176-28-13-94:~# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
>> 1 CLOSE_WAIT
>> 32 ESTABLISHED
>> 4 FIN_WAIT1
>> 15 FIN_WAIT2
>> 10 LISTEN
>> 7 SYN_RECV
>> 239 TIME_WAIT
>>
>>
>> Looks worser. Around ~200 (oscillating) connections in TIME_WAIT state.
>> When not using keep-alive many connections are created and dropped and
>> sometimes connection can stay in TIME_WAIT for 240ms. It's easy to go out
>> of network sockets, it seems.
>>
>>
>>
>>
>> Thanks.
>>>
>>> WBR,
>>> Alexey.
>>>
>>>
>>>
>>> Hm, I've seen one problem with the Virtuozzo server in the log:
>>> quotaugidlimit
>>> Number of user/group IDs allowed for the Container internal disk quota.
>>> If set to 0, UID/GID quota will not be enabled.
>>>
>>> This is set to limit 2000, but I don't understand what it has with
>>> files, it shall be number of UID/GIDs, and at the moment is has been steady
>>> at 43 at VPS.
>>> Perhaps these are problem with Virtuozzo setup but it has to be proven
>>> somehow.
>>>
>>>
>>>
>>> --
>>> Mladen Adamovic
>>> Numbeo
>>> Drziceva 9, 11120 Belgrade-Zvezdara, Serbia
>>> (Business Registration Number 62612240)
>>> Tel. +381-66-058-595
>>> email: mladen.adamovic_at_gmail.com
>>> web: http://www.numbeo.com
>>>
>>>
>>>
>>
>>
>> --
>> Mladen Adamovic
>> Numbeo
>> Drziceva 9, 11120 Belgrade-Zvezdara, Serbia
>> (Business Registration Number 62612240)
>> Tel. +381-66-058-595
>> email: mladen.adamovic_at_gmail.com
>> web: http://www.numbeo.com
>>
>
>
>
> --
> Mladen Adamovic
> Numbeo
> Drziceva 9, 11120 Belgrade-Zvezdara, Serbia
> (Business Registration Number 62612240)
> Tel. +381-66-058-595
> email: mladen.adamovic_at_gmail.com
> web: http://www.numbeo.com
>



-- 
Mladen Adamovic
Numbeo
Drziceva 9, 11120 Belgrade-Zvezdara, Serbia
(Business Registration Number 62612240)
Tel. +381-66-058-595
email: mladen.adamovic_at_gmail.com
web: http://www.numbeo.com