dev@shoal.java.net

Re: [Shoal-Dev] When group leader failed, any member couldn't receive FailureRecovery notification

From: Bongjae Chang <carryel_at_korea.com>
Date: Thu, 13 Nov 2008 10:53:11 +0900

Hi Joe.
This is exactly the same issue as 6764333,
and Sheetal's analysis is correct.

Sheetal wrote:
>I am not sure how views.get(views.size() - 2) was arrived at. But this looks like the likely problem.

In addition,
if the master fails, the other members' HealthMonitor calls assignAndReportFailure() and appoints a new master.
Appointing the new master fires a MASTER_CHANGE_EVENT with the current view.
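
To make that sequence concrete, here is a minimal sketch of the view history as the surviving member would see it. This is not Shoal code: the class ViewHistorySketch, the string member names "A" and "B", and the plain lists standing in for views are assumptions for illustration only, and the event sequence is simplified (IN_DOUBT_EVENT is left out). It only shows that once the MASTER_CHANGE_EVENT view has been recorded, the entry at views.size() - 2 no longer contains the failed master.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the view history kept by ViewWindow; not Shoal code.
class ViewHistorySketch {
    private final List<List<String>> views = new ArrayList<>();

    // Record the membership snapshot that arrives with each view change event.
    void newViewObserved(String... members) {
        views.add(Arrays.asList(members));
    }

    // Mirrors the views.get(views.size() - 2) lookup quoted above.
    List<String> previousView() {
        return views.get(views.size() - 2);
    }

    // Survivor "B" when master "A" fails.
    static ViewHistorySketch masterFailure() {
        ViewHistorySketch history = new ViewHistorySketch();
        history.newViewObserved("A", "B"); // ADD_EVENT
        history.newViewObserved("B");      // MASTER_CHANGE_EVENT: B is appointed master
        history.newViewObserved("B");      // FAILURE_EVENT for A
        return history;
    }

    // Master "A" when non-master "B" fails.
    static ViewHistorySketch memberFailure() {
        ViewHistorySketch history = new ViewHistorySketch();
        history.newViewObserved("A", "B"); // ADD_EVENT
        history.newViewObserved("A");      // FAILURE_EVENT for B
        return history;
    }

    public static void main(String[] args) {
        // Does the "previous" view still contain the failed member?
        System.out.println(masterFailure().previousView().contains("A"));
        System.out.println(memberFailure().previousView().contains("B"));
    }
}
```

In the master-failure sequence the extra view recorded for the master change pushes the last view that still held "A" one slot further back, which matches the behaviour Sheetal describes.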

I think that appointing a new master and sending the MASTER_CHANGE_EVENT are not at fault when the master fails.
So I would like to propose a simpler algorithm for appointing the recovery server.
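
A minimal sketch of one such algorithm, following the "always select the first core member in the live cache" idea from the original report quoted below. The class and method names are hypothetical and do not exist in Shoal, and the live cache is modelled as a plain list of member tokens. It also assumes every member sees the live core members in the same order, otherwise different members could pick different recovery servers.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; not the Shoal RecoveryTargetSelector.
class FirstLiveCoreMemberSelector {

    // Pick the first live core member, skipping the failed member in case it
    // has not yet been dropped from the list. Only the current live view is
    // consulted, so the result does not depend on any previous view.
    static Optional<String> selectRecoveryTarget(List<String> liveCoreMembers,
                                                 String failedMember) {
        return liveCoreMembers.stream()
                .filter(member -> !member.equals(failedMember))
                .findFirst();
    }

    public static void main(String[] args) {
        // Master "A" failed and only "B" is still alive.
        System.out.println(selectRecoveryTarget(Arrays.asList("B"), "A"));
        // Non-master "B" failed and master "A" is still alive.
        System.out.println(selectRecoveryTarget(Arrays.asList("A"), "B"));
        // Nobody left to recover for the failed member.
        System.out.println(selectRecoveryTarget(Collections.<String>emptyList(), "A"));
    }
}
```

Because the choice is made from the current view alone, an intervening MASTER_CHANGE_EVENT would not affect it. The trade-off is that the first member always takes the recovery work, so load is not spread across the group.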

Thanks.

--
Bongjae Chang


----- Original Message -----
From: "Joseph Fialli" <Joseph.Fialli_at_Sun.COM>
To: <dev_at_shoal.dev.java.net>
Sent: Thursday, November 13, 2008 6:35 AM
Subject: Re: [Shoal-Dev] When group leader failed, any member couldn't receive FailureRecovery notification


> Bongjae,
>
> Thanks for reporting this issue.
>
> An issue similar to this was recently filed in Sun's internal bug
> database, Bugster, as issue 6764333 by the internal Shoal QA team.
> While some of these issues are publicly visible via the Sun Bug Database,
> I could not find an external
> link to this specific bug.
>
> Summary of issue and analysis follows:
>> Scenario that recreates this failure:
>> ================
>> - start das/NAs
>> - start cluster
>> - stop DAS
>> - wait 20 sec
>> - kill the master node
>> - wait restart
>> - wait 20 sec
>
> Sheetal's analysis of why the issue is occurring:
>> Looking at the logs and the code, it looks like
>> RecoveryTargetSelector.setRecoverySelectionState() which is
>> responsible for appointing the recovery server does not get called in
>> the run pointed out.
>>
>> ViewWindow.addFailureSignals() calls
>> ViewWindow.generateFailureRecoverySignals(views.get(views.size() - 2),
>> token,
>> advert.getCustomTagValue(CustomTagNames.GROUP_NAME.toString()),
>> Long.valueOf(advert.getCustomTagValue(CustomTagNames.START_TIME.toString())));
>>
>> I am not sure how views.get(views.size() - 2) was arrived at. But this
>> looks like the likely problem. It basically passes the views.size()-2
>> arraylist from the views vector into the above method. The control
>> then goes to RecoveryTargetSelector.resolveRecoveryTarget() with the
>> above arraylist. The arraylist does not contain the failed member and
>> the recoverer variable is never set. Hence
>> RecoveryTargetSelector.setRecoverySelectionState() is never called.
> We have not had a chance to address this issue any further than the above
> analysis, but it
> corresponds well with your findings in analyzing the log.
>
> -Joe
>
>
> Bongjae Chang wrote:
>> Hi.
>> I found another issue.
>> When the group leader failed, no member received a FailureRecovery
>> notification.
>> Of course, the members had added FailureRecoveryActionFactoryImpl and their
>> callbacks to GMS.
>> But when the failed member was not the group leader, the other member received
>> the FailureRecovery notification successfully.
>> Here are two logs.
>> --------------------
>> case 1) When the failed member is the group leader.
>> 2008. 11. 12 PM 9:43:28
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>> INFO: GMS View Change Received for group DemoGroup : Members in view
>> for (before change analysis) are :
>> 1: MemberId: dd4897f5-2383-420e-8d3e-87f77407da41, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A787461503250332E9EB1D0D35742638E5B9CF78B8253EE03
>> 2008. 11. 12 PM 9:43:28
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>> INFO: Analyzing new membership snapshot received as part of event :
>> MASTER_CHANGE_EVENT
>> 2008. 11. 12 PM 9:43:28
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>> INFO: GMS View Change Received for group DemoGroup : Members in view
>> for (before change analysis) are :
>> 1: MemberId: b6663a51-9b79-43e2-92dd-41899c907383, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A787461503250331DA08A66D0554F138E75E74AA363FC9E03
>> 2: MemberId: dd4897f5-2383-420e-8d3e-87f77407da41, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A787461503250332E9EB1D0D35742638E5B9CF78B8253EE03
>> 2008. 11. 12 PM 9:43:28
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>> INFO: Analyzing new membership snapshot received as part of event :
>> MASTER_CHANGE_EVENT
>> 2008. 11. 12 PM 9:43:28
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>> INFO: GMS View Change Received for group DemoGroup : Members in view
>> for (before change analysis) are :
>> 1: MemberId: b6663a51-9b79-43e2-92dd-41899c907383, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A787461503250331DA08A66D0554F138E75E74AA363FC9E03
>> 2: MemberId: dd4897f5-2383-420e-8d3e-87f77407da41, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A787461503250332E9EB1D0D35742638E5B9CF78B8253EE03
>> 2008. 11. 12 PM 9:43:28
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>> INFO: Analyzing new membership snapshot received as part of event :
>> ADD_EVENT
>> 2008. 11. 12 PM 9:43:53
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>> INFO: GMS View Change Received for group DemoGroup : Members in view
>> for (before change analysis) are :
>> 1: MemberId: b6663a51-9b79-43e2-92dd-41899c907383, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A787461503250331DA08A66D0554F138E75E74AA363FC9E03
>> 2: MemberId: dd4897f5-2383-420e-8d3e-87f77407da41, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A787461503250332E9EB1D0D35742638E5B9CF78B8253EE03
>> 2008. 11. 12 PM 9:43:53
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>> INFO: Analyzing new membership snapshot received as part of event :
>> *IN_DOUBT_EVENT*
>> 2008. 11. 12 PM 9:43:53
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow addInDoubtMemberSignals
>> INFO: gms.failureSuspectedEventReceived
>> 2008. 11. 12 PM 9:43:53 com.sun.enterprise.ee.cms.impl.common.Router
>> notifyFailureSuspectedAction
>> INFO: Sending FailureSuspectedSignals to registered Actions.
>> Member:b6663a51-9b79-43e2-92dd-41899c907383...
>> 2008. 11. 12 PM 9:43:57
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>> INFO: GMS View Change Received for group DemoGroup : Members in view
>> for (before change analysis) are :
>> 1: MemberId: dd4897f5-2383-420e-8d3e-87f77407da41, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A787461503250332E9EB1D0D35742638E5B9CF78B8253EE03
>> 2008. 11. 12 PM 9:43:57
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>> INFO: Analyzing new membership snapshot received as part of event :
>> *MASTER_CHANGE_EVENT*
>> 2008. 11. 12 PM 9:43:57
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>> INFO: GMS View Change Received for group DemoGroup : Members in view
>> for (before change analysis) are :
>> 1: MemberId: dd4897f5-2383-420e-8d3e-87f77407da41, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A787461503250332E9EB1D0D35742638E5B9CF78B8253EE03
>> 2008. 11. 12 PM 9:43:57
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>> INFO: Analyzing new membership snapshot received as part of event :
>> *FAILURE_EVENT*
>> 2008. 11. 12 PM 9:43:57
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow addFailureSignals
>> INFO: The following member has failed:
>> b6663a51-9b79-43e2-92dd-41899c907383
>> case 2) When the failed member is not the group leader.
>> 2008. 11. 12 PM 9:40:03
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>> INFO: GMS View Change Received for group DemoGroup : Members in view
>> for (before change analysis) are :
>> 1: MemberId: 96438e75-740c-4613-af8d-6b2ab8ea4727, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A78746150325033376CC0C6DAB74C2BA6FAF9C6648D77BC03
>> 2008. 11. 12 PM 9:40:03
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>> INFO: Analyzing new membership snapshot received as part of event :
>> MASTER_CHANGE_EVENT
>> 2008. 11. 12 PM 9:40:14
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>> INFO: GMS View Change Received for group DemoGroup : Members in view
>> for (before change analysis) are :
>> 1: MemberId: 96438e75-740c-4613-af8d-6b2ab8ea4727, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A78746150325033376CC0C6DAB74C2BA6FAF9C6648D77BC03
>> 2: MemberId: b77af0d3-581c-4392-89cf-6a06d736c90f, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A78746150325033EBEBAC9321A742D0B319D3F89446E0B103
>> 2008. 11. 12 PM 9:40:14
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>> INFO: Analyzing new membership snapshot received as part of event :
>> ADD_EVENT
>> 2008. 11. 12 PM 9:40:43
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>> INFO: GMS View Change Received for group DemoGroup : Members in view
>> for (before change analysis) are :
>> 1: MemberId: 96438e75-740c-4613-af8d-6b2ab8ea4727, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A78746150325033376CC0C6DAB74C2BA6FAF9C6648D77BC03
>> 2: MemberId: b77af0d3-581c-4392-89cf-6a06d736c90f, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A78746150325033EBEBAC9321A742D0B319D3F89446E0B103
>> 2008. 11. 12 PM 9:40:49
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>> INFO: Analyzing new membership snapshot received as part of event :
>> *IN_DOUBT_EVENT*
>> 2008. 11. 12 PM 9:41:07
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow addInDoubtMemberSignals
>> INFO: gms.failureSuspectedEventReceived
>> 2008. 11. 12 PM 9:41:12 com.sun.enterprise.ee.cms.impl.common.Router
>> notifyFailureSuspectedAction
>> INFO: Sending FailureSuspectedSignals to registered Actions.
>> Member:b77af0d3-581c-4392-89cf-6a06d736c90f...
>> 2008. 11. 12 PM 9:41:29
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow getMemberTokens
>> INFO: GMS View Change Received for group DemoGroup : Members in view
>> for (before change analysis) are :
>> 1: MemberId: 96438e75-740c-4613-af8d-6b2ab8ea4727, MemberType: CORE,
>> Address:
>> urn:jxta:uuid-59616261646162614A78746150325033376CC0C6DAB74C2BA6FAF9C6648D77BC03
>> 2008. 11. 12 PM 9:41:41
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow newViewObserved
>> INFO: Analyzing new membership snapshot received as part of event :
>> *FAILURE_EVENT*
>> 2008. 11. 12 PM 9:41:42
>> com.sun.enterprise.ee.cms.impl.jxta.ViewWindow addFailureSignals
>> INFO: The following member has failed:
>> b77af0d3-581c-4392-89cf-6a06d736c90f
>> *2008. 11. 12 PM 9:42:19
>> com.sun.enterprise.ee.cms.impl.common.RecoveryTargetSelector
>> setRecoverySelectionState
>> INFO: Appointed Recovery
>> Server:96438e75-740c-4613-af8d-6b2ab8ea4727:for failed
>> member:b77af0d3-581c-4392-89cf-6a06d736c90f:for group:DemoGroup
>> 2008. 11. 12 PM 9:42:19 com.sun.enterprise.ee.cms.impl.common.Router
>> notifyFailureRecoveryAction
>> INFO: Sending FailureRecoveryNotification to component service*
>> --------------------
>> In case 1 (the abnormal case):
>> group leader failed -> IN_DOUBT_EVENT -> MASTER_CHANGE_EVENT (because
>> a new master was selected) -> FAILURE_EVENT
>> In case 2 (the normal case):
>> member failed -> IN_DOUBT_EVENT -> FAILURE_EVENT
>> To deliver a FailureRecovery notification, a recovery target must be
>> resolved. The selection algorithm for the recovery target uses the
>> members' previous view.
>> Assume that "A" and "B" are members of the same group and "A" is the group
>> leader.
>> [case 1: "B"'s view history]
>> ... --> *(A, B)* --> A failed -> B became the new master with a master
>> change event -> *(B)[previous view]* -> failure event -> *(B)[current
>> view]*
>> [case 2: "A"'s view history]
>> ... --> *(A, B)[previous view]* --> B failed -> failure event ->
>> *(A)[current view]*
>> In other words,
>> case 1's previous view doesn't contain "A" (the failed member), so the default
>> algorithm (SimpleSelectionAlgorithm) can't find a proper recovery target.
>> Case 2's previous view contains "B" (the failed member), so the default algorithm
>> can select "A" as the recovery target.
>> (I assume that you already know SimpleSelectionAlgorithm.)
>> So I think the cause of this issue lies in the selection algorithm for the
>> recovery target.
>> I think a different, simple algorithm could resolve this issue,
>> e.g. always selecting the first core member in the live cache.
>> Thanks.
>> --
>> Bongjae Chang