users@glassfish.java.net

Re: Server stops responding due to Glassfish

From: Jeanfrancois Arcand <Jeanfrancois.Arcand_at_Sun.COM>
Date: Tue, 29 Apr 2008 23:11:14 -0400

Salut,

Ryan de Laplante wrote:
> When the client (browsers) would attempt to access a page, it would sit
> in "Waiting for response from web server..." forever, and people would
> just close the browser. That is probably why we see those errors.

Right. If that's the case, it seems all the threads are consumed/deadlocked.
Most of the time this has nothing to do with the WebContainer, but with
a db connection that is locked/slow. All the Servlets try to get a db
connection and they all wait on a lock. It usually happens when the db is
slow...
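
To illustrate what I mean, here is a standalone sketch (not your
application's code -- the class name, the pool size and the lock are all
made up) of the pattern a thread dump makes visible when every worker
needs the same slow db resource:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DbStarvationSketch {
    // Stands in for the single slow/locked db connection.
    private static final Object DB_LOCK = new Object();

    public static void main(String[] args) {
        // Plays the role of the HTTP worker thread pool.
        ExecutorService workers = Executors.newFixedThreadPool(5);
        for (int i = 0; i < 5; i++) {
            workers.submit(new Runnable() {
                public void run() {
                    synchronized (DB_LOCK) { // every "servlet" needs the db
                        try {
                            // The db is slow: everyone else queues up here.
                            Thread.sleep(60000);
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                }
            });
        }
        // While this runs, a thread dump shows four threads BLOCKED on
        // DB_LOCK -- that is the signature to look for in your dump.
        workers.shutdown();
    }
}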

For sure, getting a thread dump will tell us where all the threads
deadlock. I don't think this is related to the nbpool... I bet it's a
deadlock on a db connection :-) I will pay @ JavaOne if I'm wrong :-) :-)

>
>
> Do you have a tool we can install on our test server to put the kind of
> load you described on the system? I do not like the idea of doing that
> on our production system. When you said it breaks, does the app server
> not return any data when trying to access pages?

No, it throws an IOException. When the nbpool problem happens, the entire
TCP stack is down on Windows. Anything you try to do on it will not
work (a reboot is the only option). But I'm convinced this is not the
nbpool.


> Does restarting the
> app server solve the problem, or do you have to reboot Windows? For us,
> just restarting the service solves the problem. We only reboot once per
> month when installing Windows updates.

That means you aren't suffering from the nbpool problem.

>
> I wasn't able to do a jstack of the PID even once, so I doubt I can do
> it every two hours for you.

I hate Windows :-) Can you do a run without running it as a service? I
suspect using asadmin start-domain --verbose will allow you to issue
a Ctrl-Break (or Ctrl-\ on Unix), producing a thread dump.
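
Something like this (assuming the default domain named domain1; adjust for
your install):

  asadmin stop-domain domain1
  asadmin start-domain --verbose domain1

then, once it hangs, press Ctrl-Break in that console window (Ctrl-\ on
Unix); the thread dump should be printed to that console. jps -l will give
you the PID again if you want to retry jstack against it.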

> Also, it runs as a Windows service so I can't see the
> console to interact with it.
> I like your suggestion of disabling Grizzly from another email:
>
> -Dcom.sun.enterprise.web.useCoyoteConnector=true

You are going to face the same problem, I suspect, or some requests will
be dropped by Coyote if all the threads get deadlocked, as it looks like
they are. The difference between Grizzly & Coyote is that Coyote drops
requests (404), while Grizzly queues them until it reaches the max (4096
connections in the queue) and then closes connections as well. The problem
is that if all the Grizzly threads deadlock, issuing a request will produce
exactly what you are seeing, which is a browser spin. Now, based on your
exception, the queue is executed and Grizzly tries to write the response
on a closed connection. That means a thread has been released and Grizzly
is using it. So your Servlet seems to start executing really slowly. We
need to find out why :-)
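
One cheap way to see how slowly the Servlet side is responding: a logging
Filter mapped in web.xml in front of your pages. This is just a rough
sketch (the filter name and the 5-second threshold are mine, not something
in GlassFish):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class SlowRequestLoggingFilter implements Filter {
    // Log anything slower than 5 seconds.
    private static final long THRESHOLD_MS = 5000;

    public void init(FilterConfig config) {
    }

    public void doFilter(ServletRequest req, ServletResponse res,
                         FilterChain chain)
            throws IOException, ServletException {
        long start = System.currentTimeMillis();
        try {
            chain.doFilter(req, res);
        } finally {
            long elapsed = System.currentTimeMillis() - start;
            if (elapsed > THRESHOLD_MS) {
                // A request this slow is a good moment to take a thread dump.
                System.err.println("Slow request (" + elapsed + " ms) on thread "
                        + Thread.currentThread().getName());
            }
        }
    }

    public void destroy() {
    }
}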

>
> I'm going to ask for permission to try this setting in production. I
> did a full transaction on my development computer using this setting. I
> don't know how to confirm that Coyote was running instead of Grizzly,
> but I did see that JVM parameter in the startup log messages.

Then it is there.

>
> Were there any changes in UR1 or UR2 that you think would affect this?

Difficult to say. One thing you may want to try is to increase the number
of threads (<request-processing ... thread-count="XXX"/>). With more
threads the problem will still happen, but it will take more time.
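
In domain.xml that looks something like the fragment below (the value 128
is only an example -- tune it for your machine; the other attributes of
the element stay as they are):

<http-service>
    ...
    <request-processing thread-count="128" ... />
    ...
</http-service>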


> We're using the FCS + a patch you gave me in November that would
> eventually be released as part of UR1.

I think the issue is not with GlassFish :-) Let's focus on trying to get a
thread dump first :-)

A+

--Jeanfrancois



>
> Thanks,
> Ryan
>
>
> Jeanfrancois Arcand wrote:
>> Hi Ryan,
>>
>> thanks for the info... so far, the exceptions are expected. They just
>> mean the client closed the connection before the server had a chance
>> to write a response:
>>
>>> Caused by: ClientAbortException: java.nio.channels.ClosedChannelException
>>>   at org.apache.coyote.tomcat5.OutputBuffer.realWriteBytes(OutputBuffer.java:409)
>>>   at org.apache.tomcat.util.buf.ByteChunk.flushBuffer(ByteChunk.java:417)
>>>   at org.apache.coyote.tomcat5.OutputBuffer.doFlush(OutputBuffer.java:357)
>>>   at org.apache.coyote.tomcat5.OutputBuffer.flush(OutputBuffer.java:335)
>>>   at org.apache.coyote.tomcat5.CoyoteResponse.flushBuffer(CoyoteResponse.java:638)
>>>   at org.apache.coyote.tomcat5.CoyoteResponseFacade.flushBuffer(CoyoteResponseFacade.java:291)
>>>   at com.sun.faces.application.ViewHandlerImpl.renderView(ViewHandlerImpl.java:203)
>>>   at com.sun.rave.web.ui.appbase.faces.ViewHandlerImpl.renderView(ViewHandlerImpl.java:320)
>>>   at com.sun.faces.lifecycle.RenderResponsePhase.execute(RenderResponsePhase.java:106)
>>>   ... 34 more
>>> Caused by: java.nio.channels.ClosedChannelException
>>>   at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
>>>   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>>   at com.sun.enterprise.web.connector.grizzly.OutputWriter.flushChannel(OutputWriter.java:94)
>>>   at com.sun.enterprise.web.connector.grizzly.OutputWriter.flushChannel(OutputWriter.java:67)
>>>   at com.sun.enterprise.web.connector.grizzly.SocketChannelOutputBuffer.flushChannel(SocketChannelOutputBuffer.java:167)
>>>   at com.sun.enterprise.web.connector.grizzly.SocketChannelOutputBuffer.flushBuffer(SocketChannelOutputBuffer.java:202)
>>>   at com.sun.enterprise.web.connector.grizzly.SocketChannelOutputBuffer.flush(SocketChannelOutputBuffer.java:178)
>>>   at com.sun.enterprise.web.connector.grizzly.SocketChannelOutputBuffer.realWriteBytes(SocketChannelOutputBuffer.java:145)
>>>   at org.apache.coyote.http11.InternalOutputBuffer$OutputStreamOutputBuffer.doWrite(InternalOutputBuffer.java:851)
>>>   at org.apache.coyote.http11.filters.ChunkedOutputFilter.doWrite(ChunkedOutputFilter.java:149)
>>>   at org.apache.coyote.http11.InternalOutputBuffer.doWrite(InternalOutputBuffer.java:626)
>>>   at org.apache.coyote.Response.doWrite(Response.java:599)
>>>   at org.apache.coyote.tomcat5.OutputBuffer.realWriteBytes(OutputBuffer.java:404)
>>>   ... 42 more
>>> |#]
>>
>> This exception isn't the cause of the hangs. Can you try something?
>> Can you get a jstack every 2 hours to see how it goes? Also, the nbpool
>> problem will happen faster if more http requests are made. I'm able
>> to reproduce it quite fast with a load of 300 users making requests
>> every second... it takes less than 2 hours to break win32 :-)
>>
>> Thanks
>>
>> -- Jeanfrancois
>>
>>
>> Ryan de Laplante wrote:
>>> It went down again today! 5.5 hours since it last went down. This is
>>> a new record. It also has a comparatively low NP Pool count of 383K
>>> (I've seen it up to 2200K before), and is using only 304,504K of
>>> memory. I forgot to try tweaking a setting in the HTTP listener to see
>>> if it comes back to life or not. I did try to do a stack dump:
>>>
>>> > jstack 5180
>>> 5180: Not enough storage is available to process this command
>>>
>>> Then I tried using this tool to get a stack dump:
>>>
>>> http://www.adaptj.com/main/download
>>>
>>> 5180 java.exe session:0 threads:131 parent:5744
>>> The current version does not support processes running in a different
>>> session.
>>> Try any of the following options:
>>> 1) Run the StackTrace service in the same session with the target
>>> process.
>>> 2) Start the terminal client with "mstsc.exe /console"
>>> 3) Use VNC from http://www.realvnc.com/ as a remote client.
>>>
>>> Attached are some Grizzly and NIO channels related exceptions from
>>> server.log
>>>
>>> We've had to write a program that checks the server every 10 minutes
>>> and emails us when it goes down. We're also now going to restart
>>> GlassFish three times a week. Based on the discussions on this
>>> mailing list today about linux users having these same problems, we
>>> are no longer convinced that it can be blamed on the Windows 2003 NP
>>> Pool leak. Yes, there is a leak, but I think GlassFish has a serious
>>> problem too. We did not have this problem with JBoss on the same
>>> server and OS a year ago.
>>>
>>> Hopefully Sun will put more resources into this issue immediately.
>>> It is the only issue we've had to use our support contract for, and
>>> we seem to be getting nowhere with it after 6 months. My employer is
>>> not satisfied and I'm wondering if he will renew the contract, or
>>> switch app server vendors. This is a production server and it goes
>>> down all the time.
>>>
>>>
>>> Ryan
>>>
>>>
>>> Ryan de Laplante wrote:
>>>> glassfish_at_javadesktop.org wrote:
>>>>>> HTTP requests consistently stop reaching the web application
>>>>>>
>>>>>
>>>>> I see the same on my server (linux), but not consistently and very
>>>>> rarely.
>>>>> In those cases none of my web applications are reachable, not even
>>>>> the admin gui. Nothing to see in the log files.
>>>>>
>>>>> I think this must be an "abnormal" issue.
>>>>> I'm not familiar with that stuff, just guessing: could it be a problem
>>>>> with broken connections, meaning if the client/user aborts?
>>>>> [Message sent by forum member 'hammoud' (hammoud)]
>>>>>
>>>>> http://forums.java.net/jive/thread.jspa?messageID=272085
>>>>>
>>>> This is concerning. Up until now I thought this problem was
>>>> specific to the Windows 2003 NP Pool leak. That might explain why I
>>>> experience two similar but different issues:
>>>>
>>>> 1) Every week or two the web container would stop serving requests.
>>>> Sometimes it would say "Maximum connections reached: 4096" even when
>>>> there were only a couple of hundred transactions a day. Other times
>>>> it would show nothing in the browser or not respond at all. My
>>>> other http listener for web services also stops working. Usually
>>>> the web admin console is the only http listener that is
>>>> working. Restarting the SJSAS 9.1 Windows service solves the
>>>> problem.
>>>>
>>>> 2) Every few months I find that restarting SJSAS 9.1 Windows service
>>>> makes no difference. PostgreSQL also dies and you can't connect to
>>>> it anymore. The only solution is to reboot Windows.
>>>>
>>>> I think issue #2 is related to the Windows 2003 Server NP Pool leak,
>>>> which may have been fixed by now with Microsoft patches, but I doubt
>>>> it since we have had to restart SJSAS 9.1 more often after installing
>>>> the patches. I think issue #1 is a GlassFish problem, since you
>>>> experience it on linux and so does another poster in this forum.
>>>>
>>>>
>>>> Ryan
>>>
>>>