Clustering issue with Glassfish 2.1.1 (NodeAgent loses contact with instance and starts new instance)

From: Barry van Someren <barry_at_bvansomeren.com>
Date: Wed, 17 Mar 2010 10:21:10 +0100

Hi there,

We are running a modest Glassfish cluster for a JSF 2.0/EJB 3
application and have run into a strange issue in our acceptance test
environment.
For some reason, at the end of the day one of the Glassfish nodeagents
decides that it has lost contact with its instance and starts a new
VM.
Unfortunately, the old instance is still there and takes up nearly all
of the physical memory, so the machine starts swapping. This slows the
second instance to the point where the nodeagent decides to start yet
another instance, which ends in a big OOM.

The machine in question was not processing any requests when this
happened; in fact, the last log entry before the restart was written
20 minutes earlier.
The problem is not load related, as that same morning I ran a
successful performance test against the machines (to verify cluster
stability).

My questions are:
1 Does anybody have a clue what could be happening?
2 How does the nodeagent monitor the instance? (I would like to figure
out why it decided to start a new instance.)
3 What information should I post to help? (Admittedly, the only thing
we see in the logs is that the cluster breaks, but this is around the
time the machine started swapping. It does seem that the cluster was
fine while the nodeagent thought the instance was down.)

Thank you!

Regards,

Barry

-- 
Barry van Someren
---------------------------------------
LinkedIn: http://www.linkedin.com/in/barryvansomeren
Skype: BvsomerenSprout
Blog: http://blog.bvansomeren.com
KvK: 27317624
irc: BarryNL @ FreeNode