Stale views in master node

From: Cameron Rochester <cameron_at_harvestroad.com.au>
Date: Mon, 04 Jan 2010 13:02:08 +0800

Hi all,

I have been using Shoal for some time now and have discovered an issue
that crops up every now and then.

Basically, when doing a DSC (DistributedStateCache) update to all peers
I am seeing the following exception:

WARNING: ClusterManager.send : sending of message
net.jxta.endpoint.Message@11882231(2){270} failed. Unable to create an
OutputPipe for
urn:jxta:uuid-59616261646162614A787461503250335FDDDB9470DA4390A3E692268159961303
route = null
java.io.IOException: Unable to create a messenger to
jxta://uuid-59616261646162614A787461503250335FDDDB9470DA4390A3E692268159961303/PipeService/urn:jxta:uuid-63B5938B46F147609C1C998286EA5F3B6E0638B5DF604AEEAC09A3FAE829FBE804
         at
net.jxta.impl.pipe.BlockingWireOutputPipe.checkMessenger(BlockingWireOutputPipe.java:238)
         at
net.jxta.impl.pipe.BlockingWireOutputPipe.<init>(BlockingWireOutputPipe.java:154)
         at
net.jxta.impl.pipe.BlockingWireOutputPipe.<init>(BlockingWireOutputPipe.java:135)
         at
net.jxta.impl.pipe.PipeServiceImpl.createOutputPipe(PipeServiceImpl.java:503)
         at
net.jxta.impl.pipe.PipeServiceImpl.createOutputPipe(PipeServiceImpl.java:435)
         at
net.jxta.impl.pipe.PipeServiceInterface.createOutputPipe(PipeServiceInterface.java:170)
         at
com.sun.enterprise.jxtamgmt.ClusterManager.send(ClusterManager.java:505)
         at
com.sun.enterprise.ee.cms.impl.jxta.GroupCommunicationProviderImpl.sendMessage(GroupCommunicationProviderImpl.java:254)
         at
com.sun.enterprise.ee.cms.impl.jxta.DistributedStateCacheImpl.sendMessage(DistributedStateCacheImpl.java:500)
         at
com.sun.enterprise.ee.cms.impl.jxta.DistributedStateCacheImpl.addToRemoteCache(DistributedStateCacheImpl.java:234)

This led me to the HealthMonitor and ClusterViewManager, where I found
the following:

1) The HealthMonitor does not seem to get a list of advertisements from
the ClusterViewManager to monitor. As far as I can tell they are built
up via heartbeat messages.
2) Occasionally the master node can hold onto a stale advertisement.
When a new client received the list of advertisements from the master at
startup, I saw a node in the list (in STARTING state) that didn't exist.
3) Once the master has a stale advertisement it never removes it (see
point 1).
4) This was then causing a problem (and long timeouts) when sending the
DSC update as it does a unicast to each advertisement, including the
failed one.
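
To make points 1 to 3 concrete, here is a toy model of the behaviour as
I understand it. It is a sketch only: StaleViewModel, its fields and its
method names are hypothetical stand-ins and not Shoal's actual classes.
It assumes the monitor's cache is filled only by heartbeats and that
in-doubt detection scans only that cache.

```java
import java.util.*;

// Toy model only: hypothetical stand-ins, not Shoal's API.
class StaleViewModel {
    // Peers the monitor knows about, with the time (ms) of their last heartbeat.
    final Map<String, Long> monitorCache = new HashMap<>();
    // Peers listed in the cluster view (stand-in for the advertisements).
    final Set<String> view = new LinkedHashSet<>();

    // Assumption: the monitor cache is populated by heartbeats alone.
    void onHeartbeat(String peer, long now) {
        monitorCache.put(peer, now);
    }

    // In-doubt detection looks at the monitor cache, never at the view.
    List<String> inDoubt(long now, long timeoutMs) {
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, Long> e : monitorCache.entrySet()) {
            if (now - e.getValue() > timeoutMs) {
                suspects.add(e.getKey());
            }
        }
        Collections.sort(suspects);
        return suspects;
    }

    // Peers in the view that detection can never reach.
    Set<String> unmonitored() {
        Set<String> result = new TreeSet<>(view);
        result.removeAll(monitorCache.keySet());
        return result;
    }

    public static void main(String[] args) {
        StaleViewModel m = new StaleViewModel();
        m.view.addAll(Arrays.asList("A", "B", "ghost"));
        m.onHeartbeat("A", 0L);      // A stops sending heartbeats after this
        m.onHeartbeat("B", 9_000L);  // B is still alive
        System.out.println("in doubt: " + m.inDoubt(10_000L, 5_000L));
        System.out.println("never checked: " + m.unmonitored());
    }
}
```

In this model, A is flagged because its heartbeats stopped, but "ghost"
never sent one, so it is never examined and stays in the view.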

I am not sure why the master node has a stale reference; it only happens
occasionally and is very hard to track down.

To get around this I propose another fix. Basically the HealthMonitor
will compare the list of PeerIDs in its cache to the list of peers
known by the ClusterViewManager. If there are peers in the view that are
not in the HealthMonitor cache then I simply add them to the cache so
the InDoubtPeerDetector will do its thing.
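
Stripped of the Shoal types, the idea is a one-way reconciliation from
the view into the cache. The following is a minimal standalone sketch,
not the patch itself: String IDs stand in for PeerID, the nested Entry
class stands in for HealthMessage.Entry, and the placeholder state
passed in is an assumption for illustration.

```java
import java.util.*;

// Sketch only: hypothetical stand-ins for PeerID and HealthMessage.Entry.
class ViewReconciler {
    static final class Entry {
        final String state;
        final long seqId;

        Entry(String state, long seqId) {
            this.state = state;
            this.seqId = seqId;
        }
    }

    // Adds a placeholder entry for every peer that is in the view but
    // missing from the cache. Existing entries are left untouched.
    // Returns the IDs that were added, in view order.
    static List<String> reconcile(Map<String, Entry> cache,
                                  Collection<String> view,
                                  String placeholderState) {
        List<String> added = new ArrayList<>();
        for (String id : view) {
            if (!cache.containsKey(id)) {
                // Sequence ID 0 mirrors the choice made in the patch.
                cache.put(id, new Entry(placeholderState, 0L));
                added.add(id);
            }
        }
        return added;
    }

    public static void main(String[] args) {
        Map<String, Entry> cache = new HashMap<>();
        cache.put("A", new Entry("alive", 7L));
        List<String> view = Arrays.asList("A", "ghost");
        System.out.println("added: " + reconcile(cache, view, "unknown"));
        System.out.println("added on second pass: " + reconcile(cache, view, "unknown"));
    }
}
```

Because only missing entries are inserted, running the check on every
monitor cycle is harmless: peers that are already tracked keep their
current state and sequence ID.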

The patch is attached. Could someone please review and let me know if it
makes sense? The main thing I am unsure about is the sequence ID. It
doesn't seem to be used by the in doubt detection so I have just set it
to 0.

Thanks for looking
Cameron

Index: HealthMonitor.java
===================================================================
--- HealthMonitor.java (revision 16417)
+++ HealthMonitor.java (working copy)
@@ -836,6 +836,21 @@
                     } else {
                         reportMyState(ALIVE, null);
                     }
+
+                    // Check that the cache matches the view. Any peer in the
+                    // view that is missing from the HealthMonitor's cache is
+                    // inserted so the InDoubtPeerDetector will check it.
+                    for (SystemAdvertisement adv : manager.getClusterViewManager().getLocalView().getView())
+                    {
+                        // Add any view member that is missing from the cache
+                        // to the health monitor.
+                        PeerID id = (PeerID)adv.getID();
+                        if (!cache.containsKey(id))
+                        {
+                            cache.put(id,
+                                new HealthMessage.Entry(adv, getStateFromCache(id), 0));
+                        }
+                    }
                 }
             } catch (InterruptedException e) {
                 stop = true;