users@glassfish.java.net

Strange timeouts : webserver7 - loadbalancer plugin - glassfish2.1

From: <glassfish_at_javadesktop.org>
Date: Fri, 26 Jun 2009 10:35:01 PDT

I installed glassfish 2.1 and Sun webserver 7u5 on Solaris (5.10).<br/>
I got the load balancer plugin (b10) to work, after resolving some library issues (of course). <br/>
So the entire setup is working just fine.<p/>

Changes made on the app server are pushed to the webserver; they are being picked up nicely and put into place as they should be.<br/>
I did some testing with the loadbalancer to make sure it keeps session affinity and that it was really doing round robin. That seemed to be just fine. <br/>
I also experimented with the heartbeating, so that when an instance goes down new requests are no longer forwarded to that instance, etc. All good.<p/>

Then we continued with a load test of a web application we were running on the appserver.<p/>

This went pretty well, until we noticed that about 20% of our requests got the 'timeout' screen. <br/>It's the screen which is in errorpages by default (the purple screen telling you the server is too busy and has timed out, asking you to refresh the page).<p/>

The strange thing was that when you pressed F5, the page loaded immediately.<p/>

OK; eventually we could identify a certain combination (OS, browser, request) that seemed to trigger this behaviour.<br/> It was Windows XP (running in VMware) with IE6.<p/>

The timeout seems to be triggered when an ajax reply is received with a request to do a redirect.<br/>
The javascript 'engine' in the client's browser would parse that request from the response and issue a GET.<br/> This GET seemed to trigger the timeout. We timed it and it was exactly 30 seconds.<p/>

So the scenario is like this:<p/>

[code]
C1 - 1. Client sends a POST
C1 - 2. Client receives HTTP 200 body containing xml requesting to do a redirect
C1 - 3. Javascript engine on the client parses the xml, and issues a GET following the redirect
C1 - 4. <waiting...30 seconds>
C1 - 5. Client receives HTTP 200 containing the timeout html
[/code]

(C1 meaning connection number '1'. It's keep-alive, so all HTTP requests/responses go through the same connection.)<p/>
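The normal (working) sequence above can be sketched with stdlib Python only; this is a minimal illustration of keep-alive reuse, not the real application — the handler, paths, and XML body are made up:<p/>

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 responses keep the connection alive

    def _reply(self, body):
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # step 2: HTTP 200 with an XML body asking the client to redirect
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self._reply(b"<redirect url='/next'/>")

    def do_GET(self):
        # step 5 (when everything works): the requested page
        self._reply(b"requested page")

    def log_message(self, *args):
        pass  # keep the demo quiet

server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("POST", "/form", body=b"data")   # C1 - 1: client sends a POST
xml = conn.getresponse().read()               # C1 - 2: 200 with redirect XML
sock_after_post = conn.sock
conn.request("GET", "/next")                  # C1 - 3: GET on the SAME connection
page = conn.getresponse().read()
reused = conn.sock is sock_after_post         # keep-alive: C1 was reused
server.shutdown()
print(xml, page, reused)
```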

Next we reduced every timeout we could find (both on the webserver and in the plugin) to 5 seconds to see if any of these had an impact on this.<br/> No luck; whatever we did, the timeout remained 30 seconds.<p/>

Then I reverted the loadbalancer plugin back to the original version which is supplied with glassfish 2.1 (a custom build, so it seems, undocumented of course).<p/>

With this plugin the behaviour was different.<br/>
The timeout screen disappeared, but the 30-second wait time was still there.
So the client waited 30 seconds and then suddenly got the requested screen (instead of the timeout, as with the latest version of the plugin).<p/>

It seemed that for some reason the pattern was now like this:<p/>

[code]
C1 - 1. Client sends a POST
C1 - 2. Client receives HTTP 200 body containing xml requesting to do a redirect
C1 - 3. Javascript engine on the client parses the xml, and issues a GET following the redirect
C1 - 4. <waiting...30 seconds>
<c1 closed>
C2 - 1. Client RE-ISSUES get from step 3
C2 - 2. Client gets HTTP 200 with the request page
[/code]

So for some reason the connection got closed, and no timeout page was sent by the server.<br/>
The browser (for some reason) decided to do the last GET again on a new connection and immediately received the response.<p/>
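What the browser appears to be doing can be imitated with stdlib Python: the server silently drops C1 without answering the GET, the client notices the closed connection and re-issues the request on a fresh connection C2. The handler and paths below are made up for illustration only:<p/>

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"
    gets = 0  # class-level counter across requests

    def do_GET(self):
        type(self).gets += 1
        if type(self).gets == 1:
            # first GET: close the connection without sending any response
            self.close_connection = True
            return
        # second GET (on a new connection): serve the page
        body = b"requested page"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("GET", "/next")                 # C1: the GET that gets no answer
try:
    page = conn.getresponse().read()
    retried = False
except http.client.RemoteDisconnected:       # <c1 closed>
    conn.close()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    conn.request("GET", "/next")             # C2 - 1: client RE-ISSUES the GET
    page = conn.getresponse().read()         # C2 - 2: HTTP 200 with the page
    retried = True
server.shutdown()
print(page, retried)
```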

When looking in the log files of the webserver, it appears that the GET is never 'received'.<br/>
So in the first scenario (with the latest version of the plugin) it's like this:<p/>

[code]
POST
<nothing>
[/code]

In the second scenario it's like this:

[code]
POST
<30seconds passed>
GET
[/code]
<p/>
But of course, in the latter case, the GET is logged, but it's the SECOND GET (the one being issued on a new connection). The first GET, issued on the initial connection, just triggered the 30-second wait time and then disappeared.<p/>

We now have a temporary workaround by turning keep-alive off on the server.<br/>
This way each request is sent using a new connection (killing performance, of course) and it seems to avoid this problem. However, the load test showed that still about 2% of the requests are getting the timeout screen, so it isn't completely solved.<p/>
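The workaround can be sketched the same way: below, the server answering with HTTP/1.0 (no keep-alive) stands in for turning keep-alive off on the webserver, so every response closes the connection and the client has to open a new one per request:<p/>

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    # protocol_version stays at the default "HTTP/1.0":
    # no keep-alive, the server closes the connection after every response
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass

server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("GET", "/a")
first = conn.getresponse().read()
closed_after_first = conn.sock is None    # server closed C1 after replying
conn.request("GET", "/b")                 # http.client opens a fresh connection
second = conn.getresponse().read()
server.shutdown()
print(first, closed_after_first, second)
```

Every request paying the connection-setup cost is exactly the performance hit mentioned above.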

I also used wireshark to sniff the actual packets, and they are indeed sent to the webserver.<p/>
So the first GET is received, but is triggering a wait time somehow.<p/>

Any hints on this would be greatly appreciated!<p/>
Also some hints on how I could debug this deeper would be welcome; I already set the log level to finest, but nothing gets printed...
[Message sent by forum member 'errorken' (errorken)]

http://forums.java.net/jive/thread.jspa?messageID=353199