dev@shoal.java.net

Re: [Shoal-Dev] When group leader failed, any member couldn't receive FailureRecovery notification

From: Bongjae Chang <carryel_at_korea.com>
Date: Thu, 13 Nov 2008 11:03:32 +0900

Here is a sample that appoints the recovery server without relying on the failed member.

In RecoveryTargetSelector,

private static boolean resolveWithEasySelectionAlgorithm( final List<GMSMember> oldViewCache, final String failedMember, final String groupName ) {
    boolean recover = false;
    String recoverer = null;
    final GMSContext ctx = GMSContextFactory.getGMSContext( groupName );
    final String self = ctx.getServerIdentityToken();
    // candidate members of the given view, checked against the suspect list
    final List<String> liveCache = getMemberTokens( oldViewCache, ctx.getSuspectList() );
    logger.log( Level.FINE, "LiveCache = " + liveCache );
    final List<String> coreCache = getCoreMembers( oldViewCache );
    logger.log( Level.FINE, "CoreCache = " + coreCache );
    // pick the first core member that is in the live cache;
    // the failed member is never looked up in the view
    for( String coreMember : coreCache ) {
        if( liveCache.contains( coreMember ) ) {
            recoverer = coreMember;
            break;
        }
    }
    if( recoverer != null ) {
        // only the appointed member itself reports that it should recover
        if( recoverer.equals( self ) ) {
            recover = true;
        }
        // record the appointment whoever the recoverer is
        setRecoverySelectionState( recoverer, failedMember, groupName );
    }
    return recover;
}
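
For anyone who wants to try the idea outside of Shoal, below is a minimal, self-contained sketch of the same selection rule. It is only an illustration under assumptions: the Member class, the selectRecoverer() method and the tokens "A" and "B" are hypothetical names and not part of the Shoal API, and it assumes that the live cache simply means "members of the view that are not on the suspect list".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class FirstLiveCoreSelectionSketch {

    /** Hypothetical stand-in for a view entry: a member token plus a CORE flag. */
    static final class Member {
        final String token;
        final boolean core;

        Member( final String token, final boolean core ) {
            this.token = token;
            this.core = core;
        }
    }

    /**
     * Returns the token of the first CORE member of the view that is not
     * suspected, or null if there is none. The failed member is not needed
     * for the selection, so it may be absent from the view.
     */
    static String selectRecoverer( final List<Member> view, final List<String> suspects ) {
        for( Member member : view ) {
            if( member.core && !suspects.contains( member.token ) ) {
                return member.token;
            }
        }
        return null;
    }

    public static void main( final String[] args ) {
        // Case 1 seen from "B": the previous view no longer contains the failed master "A".
        final List<Member> previousView = Arrays.asList( new Member( "B", true ) );
        final String recoverer = selectRecoverer( previousView, Collections.<String>emptyList() );
        System.out.println( "recoverer = " + recoverer ); // prints "recoverer = B"
    }
}
```

Because every member evaluates the same rule on the same view, they should all arrive at the same recoverer, as long as the order of the view is the same on every member (that ordering is an assumption of this sketch).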

Thanks.

--
Bongjae Chang


----- Original Message -----
From: "Bongjae Chang" <carryel_at_korea.com>
To: <dev_at_shoal.dev.java.net>
Sent: Thursday, November 13, 2008 10:53 AM
Subject: Re: [Shoal-Dev] When group leader failed, any member couldn't receive FailureRecovery notification


> Hi Joe.
> This issue is exactly the same as issue 6764333.
> And Sheetal's analysis is correct.
>
> Sheetal wrote:
>>I am not sure how views.get(views.size() - 2) was arrived at. But this looks like the likely problem.
>
> In addition,
> if the master fails, the other members' HealthMonitor calls assignAndReportFailure() and appoints a new master.
> Appointing the new master fires MASTER_CHANGE_EVENT with the current view.
>
> I think that appointing a new master and firing MASTER_CHANGE_EVENT are correct behavior when the master fails, so they are not at fault.
> So I would like to propose a simpler algorithm for appointing the recovery server.
>
> Thanks.
>
> --
> Bongjae Chang
>
>
> ----- Original Message -----
> From: "Joseph Fialli" <Joseph.Fialli_at_Sun.COM>
> To: <dev_at_shoal.dev.java.net>
> Sent: Thursday, November 13, 2008 6:35 AM
> Subject: Re: [Shoal-Dev] When group leader failed, any member couldn't receive FailureRecovery notification
>
>
>> Bongjae,
>>
>> Thanks for reporting this issue.
>>
>> An issue similar to this was recently filed in Sun's internal bug
>> database, Bugster, as issue 6764333 by the internal Shoal QA team.
>> While some of these issues are publicly visible via the Sun Bug Database,
>> I could not find an external
>> link to this specific bug.
>>
>> Summary of issue and analysis follows:
>>> Scenario that recreates this failure:
>>> ================
>>> - start das/NAs
>>> - start cluster
>>> - stop DAS
>>> - wait 20 sec
>>> - kill the master node
>>> - wait restart
>>> - wait 20 sec
>>
>> Sheetal's analysis of why the issue is occurring:
>>> Looking at the logs and the code, it looks like
>>> RecoveryTargetSelector.setRecoverySelectionState() which is
>>> responsible for appointing the recovery server does not get called in
>>> the run pointed out.
>>>
>>> ViewWindow.addFailureSignals() calls
>>> ViewWindow.generateFailureRecoverySignals(views.get(views.size() - 2),
>>> token,
>>> advert.getCustomTagValue(CustomTagNames.GROUP_NAME.toString()),
>>> Long.valueOf(advert.getCustomTagValue(CustomTagNames.START_TIME.toString())));
>>>
>>> I am not sure how views.get(views.size() - 2) was arrived at. But this
>>> looks like the likely problem. It basically passes the views.size()-2
>>> arraylist from the views vector into the above method. The control
>>> then goes to RecoveryTargetSelector.resolveRecoveryTarget() with the
>>> above arraylist. The arraylist does not contain the failed member and
>>> the recoverer variable is never set. Hence
>>> RecoveryTargetSelector.setRecoverySelectionState() is never called.
>> We have not had a chance to address this issue any further than the above
>> analysis, but it corresponds well with your findings from analyzing the log.
>>
>> -Joe
>>
>>
>> Bongjae Chang wrote:
>>> Hi.
>>> I found another issue.
>>> When the group leader fails, no member receives a FailureRecovery
>>> notification,
>>> even though the members have added FailureRecoveryActionFactoryImpl and
>>> their callbacks to GMS.
>>> But if the failed member is not the group leader, the other members
>>> receive the FailureRecovery notification successfully.
>>> Here are two logs.
>>> --------------------
>>> case 1) When the failed member is the group leader.
>>> 2008. 11. 12 PM 9:43:28
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>>> INFO: GMS View Change Received for group DemoGroup : Members in view
>>> for (before change analysis) are :
>>> 1: MemberId: dd4897f5-2383-420e-8d3e-87f77407da41, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A787461503250332E9EB1D0D35742638E5B9CF78B8253EE03
>>> 2008. 11. 12 PM 9:43:28
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>>> INFO: Analyzing new membership snapshot received as part of event :
>>> MASTER_CHANGE_EVENT
>>> 2008. 11. 12 PM 9:43:28
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>>> INFO: GMS View Change Received for group DemoGroup : Members in view
>>> for (before change analysis) are :
>>> 1: MemberId: b6663a51-9b79-43e2-92dd-41899c907383, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A787461503250331DA08A66D0554F138E75E74AA363FC9E03
>>> 2: MemberId: dd4897f5-2383-420e-8d3e-87f77407da41, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A787461503250332E9EB1D0D35742638E5B9CF78B8253EE03
>>> 2008. 11. 12 PM 9:43:28
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>>> INFO: Analyzing new membership snapshot received as part of event :
>>> MASTER_CHANGE_EVENT
>>> 2008. 11. 12 PM 9:43:28
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>>> INFO: GMS View Change Received for group DemoGroup : Members in view
>>> for (before change analysis) are :
>>> 1: MemberId: b6663a51-9b79-43e2-92dd-41899c907383, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A787461503250331DA08A66D0554F138E75E74AA363FC9E03
>>> 2: MemberId: dd4897f5-2383-420e-8d3e-87f77407da41, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A787461503250332E9EB1D0D35742638E5B9CF78B8253EE03
>>> 2008. 11. 12 PM 9:43:28
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>>> INFO: Analyzing new membership snapshot received as part of event :
>>> ADD_EVENT
>>> 2008. 11. 12 PM 9:43:53
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>>> INFO: GMS View Change Received for group DemoGroup : Members in view
>>> for (before change analysis) are :
>>> 1: MemberId: b6663a51-9b79-43e2-92dd-41899c907383, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A787461503250331DA08A66D0554F138E75E74AA363FC9E03
>>> 2: MemberId: dd4897f5-2383-420e-8d3e-87f77407da41, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A787461503250332E9EB1D0D35742638E5B9CF78B8253EE03
>>> 2008. 11. 12 PM 9:43:53
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>>> INFO: Analyzing new membership snapshot received as part of event :
>>> *IN_DOUBT_EVENT*
>>> 2008. 11. 12 PM 9:43:53
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow addInDoubtMemberSignals
>>> INFO: gms.failureSuspectedEventReceived
>>> 2008. 11. 12 PM 9:43:53 com.sun.enterprise.ee.cms.impl.common.Router
>>> notifyFailureSuspectedAction
>>> INFO: Sending FailureSuspectedSignals to registered Actions.
>>> Member:b6663a51-9b79-43e2-92dd-41899c907383...
>>> 2008. 11. 12 PM 9:43:57
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>>> INFO: GMS View Change Received for group DemoGroup : Members in view
>>> for (before change analysis) are :
>>> 1: MemberId: dd4897f5-2383-420e-8d3e-87f77407da41, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A787461503250332E9EB1D0D35742638E5B9CF78B8253EE03
>>> 2008. 11. 12 PM 9:43:57
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>>> INFO: Analyzing new membership snapshot received as part of event :
>>> *MASTER_CHANGE_EVENT*
>>> 2008. 11. 12 PM 9:43:57
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>>> INFO: GMS View Change Received for group DemoGroup : Members in view
>>> for (before change analysis) are :
>>> 1: MemberId: dd4897f5-2383-420e-8d3e-87f77407da41, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A787461503250332E9EB1D0D35742638E5B9CF78B8253EE03
>>> 2008. 11. 12 PM 9:43:57
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>>> INFO: Analyzing new membership snapshot received as part of event :
>>> *FAILURE_EVENT*
>>> 2008. 11. 12 PM 9:43:57
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow addFailureSignals
>>> INFO: The following member has failed:
>>> b6663a51-9b79-43e2-92dd-41899c907383
>>> case 2) When the failed member is not the group leader.
>>> 2008. 11. 12 PM 9:40:03
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>>> INFO: GMS View Change Received for group DemoGroup : Members in view
>>> for (before change analysis) are :
>>> 1: MemberId: 96438e75-740c-4613-af8d-6b2ab8ea4727, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A78746150325033376CC0C6DAB74C2BA6FAF9C6648D77BC03
>>> 2008. 11. 12 PM 9:40:03
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>>> INFO: Analyzing new membership snapshot received as part of event :
>>> MASTER_CHANGE_EVENT
>>> 2008. 11. 12 PM 9:40:14
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>>> INFO: GMS View Change Received for group DemoGroup : Members in view
>>> for (before change analysis) are :
>>> 1: MemberId: 96438e75-740c-4613-af8d-6b2ab8ea4727, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A78746150325033376CC0C6DAB74C2BA6FAF9C6648D77BC03
>>> 2: MemberId: b77af0d3-581c-4392-89cf-6a06d736c90f, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A78746150325033EBEBAC9321A742D0B319D3F89446E0B103
>>> 2008. 11. 12 PM 9:40:14
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>>> INFO: Analyzing new membership snapshot received as part of event :
>>> ADD_EVENT
>>> 2008. 11. 12 PM 9:40:43
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>>> INFO: GMS View Change Received for group DemoGroup : Members in view
>>> for (before change analysis) are :
>>> 1: MemberId: 96438e75-740c-4613-af8d-6b2ab8ea4727, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A78746150325033376CC0C6DAB74C2BA6FAF9C6648D77BC03
>>> 2: MemberId: b77af0d3-581c-4392-89cf-6a06d736c90f, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A78746150325033EBEBAC9321A742D0B319D3F89446E0B103
>>> 2008. 11. 12 PM 9:40:49
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>>> INFO: Analyzing new membership snapshot received as part of event :
>>> *IN_DOUBT_EVENT*
>>> 2008. 11. 12 PM 9:41:07
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow addInDoubtMemberSignals
>>> INFO: gms.failureSuspectedEventReceived
>>> 2008. 11. 12 PM 9:41:12 com.sun.enterprise.ee.cms.impl.common.Router
>>> notifyFailureSuspectedAction
>>> INFO: Sending FailureSuspectedSignals to registered Actions.
>>> Member:b77af0d3-581c-4392-89cf-6a06d736c90f...
>>> 2008. 11. 12 PM 9:41:29
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>>> INFO: GMS View Change Received for group DemoGroup : Members in view
>>> for (before change analysis) are :
>>> 1: MemberId: 96438e75-740c-4613-af8d-6b2ab8ea4727, MemberType: CORE,
>>> Address:
>>> urn:jxta:uuid-59616261646162614A78746150325033376CC0C6DAB74C2BA6FAF9C6648D77BC03
>>> 2008. 11. 12 PM 9:41:41
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>>> INFO: Analyzing new membership snapshot received as part of event :
>>> *FAILURE_EVENT*
>>> 2008. 11. 12 PM 9:41:42
>>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow addFailureSignals
>>> INFO: The following member has failed:
>>> b77af0d3-581c-4392-89cf-6a06d736c90f
>>> *2008. 11. 12 PM 9:42:19
>>> com.sun.enterprise.ee.cms.impl.common.RecoveryTargetSelector
>>> setRecoverySelectionState
>>> INFO: Appointed Recovery
>>> Server:96438e75-740c-4613-af8d-6b2ab8ea4727:for failed
>>> member:b77af0d3-581c-4392-89cf-6a06d736c90f:for group:DemoGroup
>>> 2008. 11. 12 PM 9:42:19 com.sun.enterprise.ee.cms.impl.common.Router
>>> notifyFailureRecoveryAction
>>> INFO: Sending FailureRecoveryNotification to component service*
>>> --------------------
>>> In case 1 (abnormal case),
>>> group leader failed -> IN_DOUBT_EVENT -> MASTER_CHANGE_EVENT (because
>>> a new master was selected) -> FAILURE_EVENT
>>> In case 2 (normal case),
>>> member failed -> IN_DOUBT_EVENT -> FAILURE_EVENT
>>> To receive a FailureRecovery notification, the recovery target must be
>>> resolved. The selection algorithm for the recovery target uses the
>>> previous view of the members.
>>> Assume that "A" and "B" are members of the same group and "A" is the
>>> group leader.
>>> [case 1: "B"'s view history]
>>> ... --> *(A, B)* --> A failed -> B became the new master with a master
>>> change event -> *(B)[previous view]* -> failure event -> *(B)[current
>>> view]*
>>> [case 2: "A"'s view history]
>>> ... --> *(A, B)[previous view]* --> B failed -> failure event ->
>>> *(A)[current view]*
>>> In other words,
>>> in case 1 the previous view doesn't contain "A" (the failed member), so
>>> the default algorithm (SimpleSelectionAlgorithm) can't find a proper
>>> recovery target.
>>> In case 2 the previous view contains "B" (the failed member), so the
>>> default algorithm can select "A" as the recovery target.
>>> (I assume that you already know SimpleSelectionAlgorithm.)
>>> So I think the problem lies in the selection algorithm for the
>>> recovery target.
>>> Another simple algorithm could resolve this issue,
>>> e.g. always selecting the first core member in the live cache.
>>> Thanks.
>>> --
>>> Bongjae Chang