After about two weeks of work, we have finally found the problem. It seems
that when an instance in a cluster goes down, the recovery instance checks
whether it is still up by trying to access "node-host":"admin-node-port" of
the downed instance. If you are using the node created by default on the DAS
(as we were), its node-host is set to "localhost". So instance-2 was checking
whether instance-1 was down by connecting to "localhost" instead of
"instance-1-ip", as it should have. Since the connection to localhost
succeeded, instance-1 was falsely marked as running and the recovery did not
go ahead. To fix this, we had to change the node-host of the instance-1 node
directly in the domain's configuration file, domain.xml, since the
configuration of the default localhost- node cannot be changed through
asadmin or the admin console.
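
To illustrate why the check gave a false positive, here is a minimal Java
sketch. It is not GlassFish's actual code: NodeProbe and isReachable are
hypothetical names, and it assumes the check amounts to a plain TCP connect to
node-host:admin-node-port. Run on instance-2's machine with node-host
"localhost", such a probe only ever tests instance-2's own machine, so it can
succeed even when instance-1 is down.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, NOT GlassFish code: a plain TCP connect used as a liveness check.
final class NodeProbe {

    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host not resolvable: treat as down.
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("usage: java NodeProbe <node-host> <admin-node-port>");
            return;
        }
        String nodeHost = args[0];
        int adminNodePort = Integer.parseInt(args[1]);
        // "localhost" resolves on the machine running the probe, not on the
        // machine that hosts the instance being checked.
        System.out.println(nodeHost + ":" + adminNodePort + " reachable: "
                + isReachable(nodeHost, adminNodePort, 2000));
    }
}
```

Running this on instance-2's machine, once with "localhost" and once with the
real address of instance-1, should show the difference: the first result says
nothing about instance-1.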
--
[Message sent by forum member 'ameyc']
View Post: http://forums.java.net/node/894128