Re: [Shoal-Dev] About HealthMonitor's cache

From: Joseph Fialli <Joseph.Fialli_at_Sun.COM>
Date: Mon, 06 Jul 2009 12:21:34 -0400

Bongjae Chang wrote:
> Hi,
> HealthMonitor stores the cache with members' states.
> But once a member's state is stored, the value is never removed.
> Assume that A was a member of the group and that A has now failed.
> Then the following FINE-level log message appears continuously.
Bongjae,

I did not design the entry to stay in the cache, but I have been taking
advantage of it recently.
Below are the two cases I am aware of that benefit from the entry
staying in the cache.

My first impression is that it is preferable to leave the DEAD state
entry in the HealthMonitor cache because of the method
GroupHandle.getMemberState(). getMemberState() is a pull API that lets
a GMS client poll the state of a member. If there is no entry for a
member in the cache, GMS would need to try to contact the instance.
If an application requests the state of a member that has recently
died, it is best to have remembered that state.
If the instance restarts, its state is replaced in the cache.
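
To illustrate the point about getMemberState(), here is a minimal sketch
of a state cache that keeps DEAD entries. It is not Shoal's actual code:
SketchState, StateCacheSketch, record() and the remoteProbe callback are
hypothetical names, and the probe only stands in for "GMS tries to
contact the instance".

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical state set; the real Shoal states may differ.
enum SketchState { ALIVE, INDOUBT, DEAD, UNKNOWN }

final class StateCacheSketch {
    private final Map<String, SketchState> cache = new ConcurrentHashMap<>();
    // Stand-in for the costly step of contacting the instance (an assumption).
    private final Function<String, SketchState> remoteProbe;

    StateCacheSketch(Function<String, SketchState> remoteProbe) {
        this.remoteProbe = remoteProbe;
    }

    // Stores or replaces the state; a restart simply overwrites DEAD.
    void record(String member, SketchState state) {
        cache.put(member, state);
    }

    // Answers from the cache when an entry exists, DEAD entries included.
    // Only a member with no entry at all triggers the remote probe.
    SketchState getMemberState(String member) {
        SketchState cached = cache.get(member);
        return cached != null ? cached : remoteProbe.apply(member);
    }
}
```

With the DEAD entry retained, a poll for a recently failed member is
answered locally; removing the entry would turn every such poll into a
contact attempt against an instance that is known to be gone.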

The method HealthMonitor.cleanAllCaches() would be the place to clear
the entry, but I would prefer not to.
Retaining the state ensures that we do not report an instance as failed
twice. When an instance is dead, calling cleanAllCaches() already
clears it from every other cache that we want it removed from.

I propose to fix the event log message and the processing code in
processCacheUpdate so that they skip entries for DEAD instances (and
for other states that do not make sense to process in that method).
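
The proposed skip could look roughly like the sketch below. The names
PeerState, CacheUpdateSketch and membersToProcess are hypothetical, and
the choice of STOPPED as a second skipped state is only an assumption
made for illustration; this is not the actual processCacheUpdate code.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical state set; the real Shoal states may differ.
enum PeerState { STARTING, ALIVE, INDOUBT, DEAD, STOPPED }

final class CacheUpdateSketch {
    // States that need no further in-doubt processing (an assumption).
    private static final Set<PeerState> SKIPPED =
            EnumSet.of(PeerState.DEAD, PeerState.STOPPED);

    // Returns the members that still need processing, in cache iteration order.
    // Skipped entries stay in the cache; they just produce no work and no log line.
    static List<String> membersToProcess(Map<String, PeerState> cache) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, PeerState> entry : cache.entrySet()) {
            if (SKIPPED.contains(entry.getValue())) {
                continue;
            }
            result.add(entry.getKey());
        }
        return result;
    }
}
```

The entry for a dead member is never removed here, so the repeated
"state is dead" FINE message goes away without giving up the cached state.
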
However, even the WATCHDOG API benefits from the entry remaining in the
health cache, since the entry provides a mapping from the instance name
within a group to the jxta entry id. The existence of a dead entry also
prevents the WATCHDOG mechanism from reporting an instance as failed
twice. There is a possible race condition between GMS heartbeat failure
detection reporting that an instance has failed and NA reporting the
same failure. The current implementation relies on the HealthMonitor
cache entry as the central location that maintains the state of an
instance and prevents double reporting that an instance is DEAD.
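
As a rough sketch of how a retained entry can guard against double
reporting when two detectors race, consider the following. Liveness,
FailureReportGuard, markAlive() and markDead() are hypothetical names,
not Shoal's API; the sketch relies only on ConcurrentHashMap.put()
being atomic and returning the previous value.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical two-state model, just enough to show the guard.
enum Liveness { ALIVE, DEAD }

final class FailureReportGuard {
    private final ConcurrentMap<String, Liveness> cache = new ConcurrentHashMap<>();

    // Called when a member joins or restarts; re-arms failure reporting.
    void markAlive(String member) {
        cache.put(member, Liveness.ALIVE);
    }

    // Returns true only for the caller whose put() moved the member to DEAD.
    // A second detector sees DEAD as the previous value and gets false,
    // so it knows the failure has already been reported.
    boolean markDead(String member) {
        Liveness previous = cache.put(member, Liveness.DEAD);
        return previous != Liveness.DEAD;
    }
}
```

If the DEAD entry were removed after the first report, the second
detector would find no previous value and would report the same failure
again.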

-Joe

> --
> [#|2009-07-03T21:42:45.930+0900|FINE|Shoal|ShoalLogger|_ThreadID=30;_ThreadName=InDoubtPeerDetector Thread for Group:test;ClassName=HealthMonitor$InDoubtPeerDetector;MethodName=processCacheUpdate;|processCacheUpdate : A 's state is dead|#]
> [#|2009-07-03T21:42:48.930+0900|FINE|Shoal|ShoalLogger|_ThreadID=30;_ThreadName=InDoubtPeerDetector Thread for Group:test;ClassName=HealthMonitor$InDoubtPeerDetector;MethodName=processCacheUpdate;|processCacheUpdate : A 's state is dead|#]
> [#|2009-07-03T21:42:51.930+0900|FINE|Shoal|ShoalLogger|_ThreadID=30;_ThreadName=InDoubtPeerDetector Thread for Group:test;ClassName=HealthMonitor$InDoubtPeerDetector;MethodName=processCacheUpdate;|processCacheUpdate : A 's state is dead|#]
> [#|2009-07-03T21:42:54.930+0900|FINE|Shoal|ShoalLogger|_ThreadID=30;_ThreadName=InDoubtPeerDetector Thread for Group:test;ClassName=HealthMonitor$InDoubtPeerDetector;MethodName=processCacheUpdate;|processCacheUpdate : A 's state is dead|#]
> [#|2009-07-03T21:42:57.930+0900|FINE|Shoal|ShoalLogger|_ThreadID=30;_ThreadName=InDoubtPeerDetector Thread for Group:test;ClassName=HealthMonitor$InDoubtPeerDetector;MethodName=processCacheUpdate;|processCacheUpdate : A 's state is dead|#]
> [#|2009-07-03T21:43:00.945+0900|FINE|Shoal|ShoalLogger|_ThreadID=30;_ThreadName=InDoubtPeerDetector Thread for Group:test;ClassName=HealthMonitor$InDoubtPeerDetector;MethodName=processCacheUpdate;|processCacheUpdate : A 's state is dead|#]
> --
> Is this expected, for example to monitor an old member's state or the
> members' history?
> Please advise.
> Thanks.
> --
> Bongjae Chang