Re: [Shoal-Dev] About sailfin issue #484

From: Bongjae Chang <carryel_at_korea.com>
Date: Tue, 2 Jun 2009 23:34:15 +0900

Hi Joe,

Thank you so much for your kindness.

Bongjae wrote:
> The changes select a new master and notify a join notification about
> the old master in only new master.

I am sorry, I assumed that without testing it.

Joe wrote:
> All other instances of the cluster will receive this join notification.

You are right. I could see that all other instances of the cluster, as well as the new master, received the old master's join notification.

When a new master is selected, it removes the old master's adv (id) from the current view and sends that view to all other instances with MASTER_CHANGE_EVENT before the join notification.

So all other instances first receive the new master's view, from which the old master's adv has already been removed, and can then receive the old master's join notification.
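
To check my understanding of this order, here is a minimal sketch in Java. It is only my model of the sequence, not Shoal's actual code: the class, the method and the JOIN_NOTIFICATION label are hypothetical names, and only MASTER_CHANGE_EVENT comes from this discussion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical model of the event order described above; not Shoal's real classes. */
final class MasterChangeOrderSketch {

    /** Returns the events a surviving member would observe, in order. */
    static List<String> eventsSeenBySurvivor(TreeSet<String> viewBeforeFailure, String oldMaster) {
        List<String> events = new ArrayList<>();

        // Step 1: the new master drops the old master's adv (id) from its view.
        TreeSet<String> view = new TreeSet<>(viewBeforeFailure);
        view.remove(oldMaster);

        // Step 2: the reduced view is sent to everyone with MASTER_CHANGE_EVENT.
        events.add("MASTER_CHANGE_EVENT " + view);

        // Step 3: only afterwards is the restarted old master announced as a join.
        // "JOIN_NOTIFICATION" is a placeholder label, not a confirmed Shoal event name.
        view.add(oldMaster);
        events.add("JOIN_NOTIFICATION " + oldMaster + " " + view);
        return events;
    }
}
```

With members A, B and C and A as the old master, a survivor would first see a view containing only B and C, and then the join of A.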

I understand it. Thank you again!

Joe wrote:
> Only the surviving instances of the cluster have been keeping that
> information and are qualified to be new Master.

I agree. So now I understand why the changes give the master special treatment. I could also understand why WATCH_DOG was needed after I read glassfish issue #8308. :-)
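
To make sure I read this correctly, here is a minimal sketch of how I imagine the surviving instances agreeing on a new master. The rule used here (the lowest member id wins) and all names are my assumptions for illustration only; I have not checked which rule Shoal actually applies.

```java
import java.util.Collection;
import java.util.TreeSet;

/** Hypothetical sketch: every survivor applies the same rule to its own member list. */
final class NewMasterSketch {

    /**
     * Picks the new master from the members this instance knows about,
     * excluding the restarted old master, which has lost its cluster state.
     * Assumed rule: the lexicographically smallest surviving id wins.
     */
    static String electNewMaster(Collection<String> knownMembers, String restartedOldMaster) {
        TreeSet<String> survivors = new TreeSet<>(knownMembers);
        survivors.remove(restartedOldMaster);
        if (survivors.isEmpty()) {
            throw new IllegalStateException("no surviving instance can become master");
        }
        return survivors.first();
    }
}
```

Because B and C hold the same member list and apply the same rule, both would pick the same instance without exchanging any extra messages.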

About the glassfish issue #8308, I have a question.

If the server that uses Shoal doesn't have a node agent that supports WATCH_DOG, the FAILURE event could be lost in this case, couldn't it?

Then the new master ought to send the notification of the old master's FAILURE. Is that right?
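
To state the case I am asking about more concretely, here is a simplified sketch. The 7.5 second value is the GlassFish default Joe mentioned; the class and method names are hypothetical, and the real heartbeat detection is certainly more involved than a single comparison.

```java
/** Hypothetical, simplified model of when a FAILURE event would be lost. */
final class FastRestartSketch {

    // Default GMS heartbeat failure detection time in GlassFish, per this thread.
    static final long FAILURE_DETECTION_MILLIS = 7_500L;

    /** Heartbeat detection only notices an instance that stays down long enough. */
    static boolean heartbeatDetectsFailure(long downtimeMillis) {
        return downtimeMillis >= FAILURE_DETECTION_MILLIS;
    }

    /**
     * The FAILURE event is lost when the instance restarts faster than heartbeat
     * detection and no external agent (such as a WATCH_DOG) reports the failure.
     */
    static boolean failureEventLost(long downtimeMillis, boolean watchdogReportsFailure) {
        return !heartbeatDetectsFailure(downtimeMillis) && !watchdogReportsFailure;
    }
}
```

For example, an instance that is down for 3 seconds with no WATCH_DOG gives failureEventLost(3000, false) == true, which is the situation I am worried about.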

Thanks in advance.

P.S. Did you attend the JavaOne events with Shreedhar? Unfortunately, I couldn't attend this year, but I hope to attend the next JavaOne and meet you and many Shoal users and devs there!

--
Bongjae Chang

----- Original Message -----
From: "Joseph Fialli" <Joseph.Fialli_at_Sun.COM>
To: <dev_at_shoal.dev.java.net>
Sent: Tuesday, June 02, 2009 6:03 AM
Subject: Re: [Shoal-Dev] About sailfin issue #484

> Bongjae,
>
> See my comments inline below.
>
>
> Bongjae Chang wrote:
>> Hi,
>> I have a question about sailfin issue #484 relating to
>> MasterNode#processMasterNodeQuery()'s changes.
>> I tried to test the master's failure.
>> This test is like sailfin issue #484.
>> e.g. the master dies and comes back up quickly.
>> It seems that the policy and behavior about the failed master has been
>> changed from sailfin issue #484.
>> The changes select a new master and notify a join notification about
>> the old master in only new master.
>> This result was not what I expected, because the old master was not in a
>> failure state at the other members.
> Please see the following glassfish issue concerning fast restart of a
> failed instance.
> https://glassfish.dev.java.net/issues/show_bug.cgi?id=8308
>
> To summarize, GMS heartbeat detection (default of 7.5 seconds in
> Glassfish) is not able to detect and report a FAILURE event when the
> glassfish NodeAgent automatically restarts an instance in less than
> 7.5 seconds. The instance has truly failed regardless of whether it is
> reported by a GMS FAILURE event.
> It is not possible to send out a GMS FAILURE event once the instance has
> already restarted.
> That is what is discussed in much detail in glassfish issue 8308, along
> with the ability to augment GMS failure detection when an external agent
> is restarting failed instances faster than GMS heartbeat detection.
>
> The restarted instance is missing all state that the previous Master
> instance did have. It was a bug in sailfin 484 that the failure went
> undetected.
> It was not a policy change but a bug fix.
>
> Here is how GMS failure detection works at a high level.
> - The MasterNode monitors all other instance heartbeats in a cluster for
> failure.
> - All other instances in the cluster monitor the MasterNode heartbeats
> to check if it failed.
>
> Once the MasterNode is killed and comes back up quickly, ALL other
> instances in the cluster
> (not just the master node) will see a MasterNodeQuery. ALL OTHER
> INSTANCES recognize the
> former master node has restarted and that there is a need to recalculate
> who is the new Master from the surviving cluster instances since the
> newly restarted former master is missing all state
> (which instances make up the cluster).
> Only the surviving instances of the cluster have been keeping that
> information and are qualified to be new Master.
> Whichever instance is made the new Master (based on an algorithm that
> all instances apply to their list of instances making up the cluster),
> all instances will agree on the new Master.
>
> Only the newly elected Master sends out the join notification of the
> restarted old Master instance. That was the fix that
> was checked in for sailfin 484. All other instances of the cluster will
> receive this join notification.
>
> I hope this explains the motivation behind the fix for sailfin 484.
> It was not intended to be a policy change.
>
> -Joe
>
>> I thought that the old master should keep the master's role if the old
>> master came back up quickly before others were aware of the old
>> master's failure.
>> And the changes are only notifying the old master's join notification
>> in a new master.
>> Assume that A, B and C are members and A is the master.
>> When A dies and comes back quickly, B becomes the new master and B
>> receives A's join notification. Maybe C doesn't receive A's join
>> notification because A is not only failure member but also indoubt
>> member. I think that C's behavior is right.
>> Assume that A, B and C are members and A is the master again.
>> When B dies and comes back quickly, neither A nor C receives join
>> notifications because B is not indoubt member as well as failure
>> member. I think that this behavior is also right.
>> When the old master dies and rejoins the group quickly, the old master
>> perhaps discovers the group's master. But the group doesn't have the
>> master because the old master itself has been the group master. Then
>> the old master which rejoins the group will wait for discovery time.
>> During discovery time, maybe all members can't receive the group's
>> event adequately.
>> So is the new master selected in order to save discovery time instead
>> of the old master?
>> And should we give the old master's join notification special
>> treatment when the old master dies and comes back?
>> What do you think?
>> Thanks!
>> --
>> Bongjae Cha