users@shoal.java.net

Re: [Shoal-Users] Fast failover

From: Mohamed Abdelaziz <Mohamed.Abdelaziz_at_Sun.COM>
Date: Thu, 28 Feb 2008 10:28:56 -0800

Gary,

Thanks for your interest in and feedback on Shoal. Your assessment of
Shoal's health monitoring is correct, and you're probably pushing its
limits, as it is currently bound by the limitations of the underlying
Java TCP/UDP interfaces.

As you may already know, failure detection is not an easy problem to
address, because message delivery can be delayed or fail for a wide
range of reasons. Failing to differentiate between a failure and mere
slowness can lead to false failures, an undesirable condition.
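
To make the trade-off concrete, here is a minimal sketch of a
heartbeat-based detector that separates "slow" from "failed" by
counting missed heartbeat intervals. This is not Shoal's HealthMonitor:
the class name HeartbeatDetector, its methods, and the
ALIVE/SUSPECT/FAILED states are all hypothetical and only illustrate
the idea.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Shoal or JXTA.
class HeartbeatDetector {
    enum State { ALIVE, SUSPECT, FAILED }

    private final long timeoutMillis;  // expected heartbeat interval
    private final int maxMissed;       // missed intervals before declaring failure
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatDetector(long timeoutMillis, int maxMissed) {
        if (timeoutMillis <= 0 || maxMissed <= 0) {
            throw new IllegalArgumentException("timeout and maxMissed must be positive");
        }
        this.timeoutMillis = timeoutMillis;
        this.maxMissed = maxMissed;
    }

    // Record that a heartbeat from the member arrived at the given time.
    void heartbeat(String member, long nowMillis) {
        lastSeen.put(member, nowMillis);
    }

    // Classify a member by how many whole intervals have passed
    // since its last heartbeat.
    State check(String member, long nowMillis) {
        Long seen = lastSeen.get(member);
        if (seen == null) {
            throw new IllegalArgumentException("unknown member: " + member);
        }
        long missed = (nowMillis - seen) / timeoutMillis;
        if (missed == 0) {
            return State.ALIVE;
        }
        // A late heartbeat may only mean slowness, so suspect first.
        return missed < maxMissed ? State.SUSPECT : State.FAILED;
    }
}
```

With a 100 ms interval and maxMissed of 3, a member last heard from at
time 0 is ALIVE at 50 ms, SUSPECT at 150 ms, and FAILED only at 300 ms.
Lowering either value shortens detection time but makes it more likely
that a merely slow member is reported as failed.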

Currently we are trying to categorize failures and their causes, then
look for a generic cross-platform way of detecting such conditions,
which should lead to improved detection times.

If you have experience with failure detection systems, we encourage
your participation with input, coding, and testing.

Thanks
Mohamed

p.s. 100ms may only be achievable in very few situations (e.g. local
NIC down, gateway unreachable), as there are many variables outside the
control of Shoal, Java, and the underlying OS. Such or better detection
times have been achieved through the use of dedicated high speed
interfaces (with kernel drivers) between nodes.
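
A rough way to see why 100ms is hard to reach is to add up the timers
involved. The sketch below is a simplified model, not Shoal's actual
algorithm: it assumes the worst case is the detection timeout times the
number of retries, plus the verification timeout, plus a fixed skew
such as the 500ms Gary reports below. The class DetectionBudget and its
method are hypothetical.

```java
// Hypothetical back-of-the-envelope model; not taken from Shoal's source.
class DetectionBudget {
    // Assumed worst case: every retry waits the full detection timeout,
    // then verification waits its full timeout, plus a fixed skew.
    static long worstCaseMillis(long detectionTimeoutMillis,
                                int retries,
                                long verificationTimeoutMillis,
                                long fixedSkewMillis) {
        return detectionTimeoutMillis * retries
                + verificationTimeoutMillis
                + fixedSkewMillis;
    }
}
```

Under this model, even a 10ms detection timeout with 3 retries and a
10ms verification timeout gives 10 * 3 + 10 + 500 = 540ms, so a fixed
skew of that size alone rules out 100ms detection regardless of how low
the configurable values are set.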



Gary Fry wrote:
>
> Hi there.
>
> I have been playing with Shoal this week and I’m finding it really
> easy to pick up and run with.
>
> I have a specific problem I would like to solve, and that is lightning
> fast failover detection (less than 100ms; less would be better) – for
> reliable Leader Election. I’ve looked at the code and found that I can
> pass in some properties via GMSFactory.startGMSModule. The properties
> I’ve found that are relevant are:
>
> (JxtaConfigConstants): FAILURE_DETECTION_TIMEOUT,
> FAILURE_DETECTION_RETRIES and FAILURE_VERIFICATION_TIMEOUT.
>
> I’ve tried setting the values to low amounts, without sufficient
> success. I have noticed that the HealthMonitor skews the failure
> detection timeout by adding 500ms (in the FailureVerifier private
> inner class of HealthMonitor).
>
> When running a test app, I can’t seem to get failover to occur within
> less than about three seconds. Am I doing something wrong, or am I
> simply trying to push Shoal/Jxta too much?
>
> Thanks for your attention :)
>
> Gary Fry