Re: [Shoal-Dev] About failure recovery

From: Bongjae Chang <carryel_at_korea.com>
Date: Sat, 31 Jan 2009 19:18:40 +0900

Hi Shreedhar.
I think it would be useful if Shoal supported this case.

Here is the use case.
Member "A" has the Transaction Manager (TM) component and the TM's FRAF.
Member "B" has neither the Transaction Manager component nor the TM's FRAF, because "B" doesn't support the transaction service.
But "B" can have other services and their FRAFs.

Both "A" and "B" have joined the same group and are clustered.
When another member that supports the TM service and has joined the same group as "A" and "B" fails,
only members that support the transaction service, like "A", should recover the failed member; "B" should not be selected as the recoverer.
In this case, the members do not all need to support the same set of services.
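
To make the idea concrete, here is a rough sketch of the selection I have in mind. It is not Shoal code: the class RecovererSelection, the method selectRecoverer, and the map of registered components are all hypothetical names I made up for illustration, and the rule for picking among candidates (next name in sorted order, wrapping around) is only a placeholder. How each member would learn which FRAFs the others have registered is left open here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch; none of these names are part of the Shoal API.
final class RecovererSelection {

    /**
     * Picks a recoverer for failedMember among the live members that have
     * registered a FRAF for the given component.
     *
     * @param registeredComponents live member name -> names of the components
     *                             for which that member registered a FRAF
     *                             (assumed to be known to every member)
     */
    static Optional<String> selectRecoverer(String failedMember,
                                            String component,
                                            Map<String, Set<String>> registeredComponents) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : registeredComponents.entrySet()) {
            boolean isFailedMember = entry.getKey().equals(failedMember);
            boolean hasFraf = entry.getValue().contains(component);
            if (!isFailedMember && hasFraf) {
                candidates.add(entry.getKey());
            }
        }
        if (candidates.isEmpty()) {
            return Optional.empty(); // nobody is able to recover this component
        }
        // Sort so that every member computes the same ordering.
        Collections.sort(candidates);
        // Placeholder rule: first candidate after the failed member, wrapping around.
        for (String candidate : candidates) {
            if (candidate.compareTo(failedMember) > 0) {
                return Optional.of(candidate);
            }
        }
        return Optional.of(candidates.get(0));
    }

    public static void main(String[] args) {
        // "A" registered a FRAF for "TM"; "B" only for some other service.
        Map<String, Set<String>> live = Map.of(
                "A", Set.of("TM"),
                "B", Set.of("OTHER"));
        // A third member "C" that supported "TM" has failed.
        System.out.println(selectRecoverer("C", "TM", live).orElse("none")); // prints A
    }
}
```

With this filter, "B" is never chosen for TM recovery because it is not a candidate, and if no live member has the FRAF the result is empty instead of silently picking a member that cannot recover.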

In JEUS (our web application server product), server instances in a cluster don't need to have homogeneous services.
Of course, a cluster whose members don't have homogeneous services is not recommended.

So, to begin with, I will try to write cluster logic in JEUS that avoids this situation.

Thanks for your reply.

--
Bongjae Chang

  ----- Original Message -----
  From: Shreedhar Ganapathy
  To: dev_at_shoal.dev.java.net
  Sent: Saturday, January 31, 2009 12:26 AM
  Subject: Re: [Shoal-Dev] About failure recovery

  It could be, but the original intention is to require all members to have the FailureRecoveryActionFactory. Is a check for registration of this FRAF really required, given that it would involve additional messaging overhead to keep all members in sync about this information?

  Bongjae Chang wrote:
    Hi.

    I have a problem with failure recovery.

    When a member fails, another member that has a FailureRecoveryActionFactory can recover the failed member.

    But I found a limitation that prevents the failed member from being recovered correctly.

    The restriction is that all members must have a FailureRecoveryActionFactory.

    Assume that "A", "B", "C" and "D" are members of the same group.

    "A" is the failed member; both "B" and "C" have a FailureRecoveryActionFactory, and "D" doesn't.

    When "A" fails, only "B" and "C" can recover "A".

    But if "D" is selected as the recoverer by the recovery selection algorithm on "B" and "C", no one can recover "A".

    So I think that only members that have a FailureRecoveryActionFactory should qualify as recoverers.

    In other words, I think that "D" should be excluded from the recoverer candidates.

    Unfortunately, the current algorithm qualifies all members as recoverers if they are alive and are CORE members.

    Could this case be supported?

    --
    Bongjae Chang