users@glassfish.java.net

Re: Debugging Glassfish CPU usage

From: <glassfish_at_javadesktop.org>
Date: Thu, 26 Mar 2009 01:15:30 PDT

Here's an update of our progress. We tried to isolate various parts of the setup today to find anything that would help. We tried creating an entirely new app server from scratch, using a different ec2 availability zone, the latest jdk6 and latest glassfish, but the results were the same.

So, I started investigating gc. I added the JVM options Xloggc and XX:+PrintGCDetails and then used apache benchmark to put the server under higher load to try to force memory to grow more quickly. After a while of doing this and watching the gc log, I concluded that gc was not a problem. For the duration, young gen collections took only a few milliseconds and full GC took 2-3 seconds (when it happened, which was rarely). Another lesson from this was that it occurred when the network variable was removed, as ab hit the same slow request loads when running locally on the app server.

I wanted to know more about where the slow requests were spending their time, so I added a filter to our app on /* to benchmark the processing time that each request spent in our app's code. I ran the server with this logging enabled until I'd collected 30K requests, then wrote a little util to parse that log and insert the results into my local DB. I ran some queries on the data and discovered that only 4 of the 30K requests had a processing time of >15s (1 at >20s, and 1 at >30s), of which 3 or those 4 requests were (oddly) to a HttpServlet that did nothing but call request.getSession() and then set two String constants to the session. Overall, 99.9% of requests were <2s.

This was confusing because I had data to show that our code was processing requests very quickly, but I was watching the counter on our down page hover around at 200+ users all afternoon (the down page is served by haproxy when it receives a 504 from glassfish). So, requests were processing slowly enough to timeout frequently, but they were not spending that time in our application code.

To be doubly sure that haproxy wasn't misbehaving, we directed traffic straight to glassfish for an hour or so, but that didn't change anything. On a whim, I cut the http acceptor threads down to 1 with an interesting effect: the site was still slow, but less erratic. Whereas, before, some requests were very fast and others very slow, moving to 1 acceptor just made every request in the range of sort-of-slow to more-slow. Also, with 1 acceptor, the server is noticeably snappier immediately after restart and becomes progressively slower thereafter.

Lacking other ideas, we're considering trying solaris instead of ubuntu, but are hoping to get access to an ec2 ami of OpenSolaris 2008.11 64-bit (if we can't, we might try the 64-bit xsce glassfish ami that we found).

Any other advice? Things to try? Data to collect?
[Message sent by forum member 'rwillie6' (rwillie6)]

http://forums.java.net/jive/thread.jspa?messageID=339103