dev@shoal.java.net

Re: [Shoal-Dev] GroupHandle.raiseFence() issue

From: Bongjae Chang <carryel_at_korea.com>
Date: Mon, 21 Jul 2008 16:35:26 +0900

Hi Shreedhar.
Thanks for your reply.

This is not a critical requirement for JEUS yet.
I only reviewed and tested this to find out whether Shoal currently supports it.

But I am very interested in this issue, so if you look into a total order solution such as a Totem-based protocol later,
I would like to participate if possible, although my knowledge is limited and I would need to study, e.g. by sharing algorithms and materials, or by testing.

Thanks.

--
Bongjae Chang


  ----- Original Message -----
  From: Shreedhar Ganapathy
  To: dev_at_shoal.dev.java.net
  Sent: Monday, July 21, 2008 3:22 PM
  Subject: Re: [Shoal-Dev] GroupHandle.raiseFence() issue


  Hi Bongjae
  Yes, this is a synchronization hole. It is indeed one that can be addressed with a synchronized global lock mechanism. So far we have not been exposed to this issue, as the recovery selection algorithm typically results in only one member being selected to raise a fence.
  To solve the general purpose case though, it might be worth addressing this. Is this a critical requirement for JEUS?

  Relying on the Master for global lock allocation might be a way to do this, but it would come at a performance cost. We may
  have to look at a total order solution such as a Totem-based protocol for this.
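
As a rough illustration of the master-based global lock idea mentioned above, here is a minimal sketch, assuming a single master process arbitrates fence requests. MasterFenceRegistry, tryRaiseFence() and its lowerFence() are hypothetical names and are not part of the Shoal API; the network round trip to the master, which is where the performance cost would come from, is left out.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of a master-side fence registry; not part of Shoal.
public class MasterFenceRegistry {
    // Key: component plus failed member; value: the member that holds the fence.
    private final ConcurrentMap<String, String> fences = new ConcurrentHashMap<>();

    private static String key(String componentName, String failedMemberToken) {
        return componentName + "/" + failedMemberToken;
    }

    /** Returns true only for the first member that requests this fence. */
    public boolean tryRaiseFence(String componentName, String failedMemberToken, String requester) {
        // putIfAbsent checks and inserts atomically, so two requesters cannot both win.
        return fences.putIfAbsent(key(componentName, failedMemberToken), requester) == null;
    }

    /** Releases the fence only if the caller is the current holder. */
    public boolean lowerFence(String componentName, String failedMemberToken, String requester) {
        return fences.remove(key(componentName, failedMemberToken), requester);
    }

    public static void main(String[] args) {
        MasterFenceRegistry master = new MasterFenceRegistry();
        for (String member : new String[] {"A", "B", "C"}) {
            boolean granted = master.tryRaiseFence("component", "D", member);
            System.out.println(member + " granted=" + granted);
        }
    }
}
```

In this sketch only the first request for a given component and failed member is granted, because the check and the insert happen as one atomic step on the master instead of as two separate steps on each member.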

  Thanks
  Shreedhar

  Bongjae Chang wrote:
    Hi.
    I am testing the GroupHandle.raiseFence() API for recovery selection.
    If I understand the purpose of the API and source code correctly, GroupHandle.raiseFence() should allow only one member at a time to raise a fence for a given component and failed member. Is that right?
    But I saw a race condition and some interesting results with this.

    The following is GroupHandle's code.
    ------------------------------------
    [GroupHandleImpl.java]
    public void raiseFence(final String componentName, final String failedMemberToken) throws GMSException {
        if(!isFenced(componentName, failedMemberToken)){
            ...
            dsc.addToCache(componentName, getGMSContext().getServerIdentityToken(), failedMemberToken, setStateAndTime() );
            ...
        } else {
            throw new GMSException(...);
        }
    }

    public boolean isFenced(final String componentName, final String memberToken) {
        ...
        entries = dsc.getFromCache(memberToken);
        for(GMSCacheable c:entries.keySet()) {
            if(componentName.equals(c.getComponentName())) {
                if(memberToken.equals(c.getKey())) {
                    if(!memberToken.equals(c.getMemberTokenId())) {
                        if(((String)entries.get(c)).startsWith(REC_PROGRESS_STATE)) {
                            logger.log(...);
                            retval = true;
                            break;
                        }
                    }
                }
            }
        }
        return retval;
    }
    ------------------------------------

    In raiseFence(), if isFenced() returns false, the RECOVERY_IN_PROGRESS state is added to the DSC.
    I think the code checks the result of isFenced() before adding the state to the DSC in order to prevent multiple members from raising a fence for the same failed member token.
    But I think isFenced() may not be enough, and there can be a race condition, e.g. because of network traffic, system overload, etc.
    Assume that "A", "B", "C" and "D" are members of the same group and "D" has failed.
    "A", "B" and "C" try to raise a fence for "D" concurrently, as follows.

    ------------------------------------
    [In "A", "B" and "C"]
    GroupHandle gh = gms.getGroupHandle();
    gh.raiseFence( component, "D" );
    ------------------------------------

    If isFenced() returns false in "A", "B" and "C" at the same time, each member adds its own state to the DSC, so more than one member can have the RECOVERY_IN_PROGRESS state at the same time.
    Because isFenced() checks the state only in the member's own local cache, it does not give raiseFence() a global lock; it acts only as a local lock for raising a fence. So I think this situation can occur.
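
The interleaving described above can be reproduced with a small, deterministic simulation. This is only a sketch under simplifying assumptions: FenceRaceDemo and LocalMember are hypothetical names, not Shoal classes, and each member's local cache is modelled as a plain set of fence-holder tokens that is updated only after every member has already performed its local check.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of the check-then-add race; LocalMember is not a Shoal class.
public class FenceRaceDemo {
    static class LocalMember {
        final String token;
        // Local replica: tokens of the members believed to hold the fence.
        final Set<String> localFenceHolders = new HashSet<>();

        LocalMember(String token) { this.token = token; }

        // Like isFenced(), this consults only the member's own local cache.
        boolean isFenced() { return !localFenceHolders.isEmpty(); }
    }

    /** Returns how many members end up holding the fence at the same time. */
    static int simulate(String... tokens) {
        List<LocalMember> members = new ArrayList<>();
        for (String t : tokens) members.add(new LocalMember(t));

        // Step 1: every member checks its local replica before any update arrives.
        List<LocalMember> passedCheck = new ArrayList<>();
        for (LocalMember m : members) {
            if (!m.isFenced()) passedCheck.add(m);
        }

        // Step 2: each member that passed the check adds its entry,
        // and the entry is replicated to every local cache.
        for (LocalMember w : passedCheck) {
            for (LocalMember m : members) m.localFenceHolders.add(w.token);
        }
        return members.get(0).localFenceHolders.size();
    }

    public static void main(String[] args) {
        System.out.println("fence holders: " + simulate("A", "B", "C")); // prints "fence holders: 3"
    }
}
```

All three members pass the local check in step 1 because no replicated update has reached them yet, so all three end up recorded as fence holders.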

    e.g. I think isFenced() should check the state on the master node to get a global lock. Of course this adds overhead because of the extra network packets, but since raiseFence() is called rarely, that is not a big problem.

    I wrote some test code for this issue.
    After join(), each member repeatedly raises a fence for a failed member, counts how many fences are raised, lowers the fence, and sleeps.
    Here is the checking logic that runs after raising the fence. It is similar to the code of isFenced().
    ------------------------------------
    private void runSimpleSample() {
        ...
        while( true ) {
            try {
                gh.raiseFence( COMPONENT_NAME, FAILED_MEMBER_TOKEN );
                ...
                checkRaiseFence( gh, COMPONENT_NAME, FAILED_MEMBER_TOKEN );
            } catch( GMSException gmse ) {
                gmse.printStackTrace();
            } catch( RuntimeException re ) {
                re.printStackTrace();
                ...
                System.exit( 0 );
            } finally {
                gh.lowerFence( COMPONENT_NAME, FAILED_MEMBER_TOKEN );
                ...
            }
            try {
                Thread.sleep( getRandomSleep() );
            } catch( InterruptedException e ) {
            }
        }
    }

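    private void checkRaiseFence( GroupHandle gh, String componentName, String memberToken ) {
        DistributedStateCache dsc = gh.getDistributedStateCache();
        // Assumed lookup, mirroring isFenced() above; the original message left entries unassigned
        final Map<GMSCacheable, Object> entries = dsc.getFromCache( memberToken );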
        int raisedCount = 0;
        for(GMSCacheable c:entries.keySet()) {
            if(componentName.equals(c.getComponentName())) {
                if(memberToken.equals(c.getKey())) {
                    if(!memberToken.equals(c.getMemberTokenId())) {
                        raisedCount++;
                    }
                }
            }
        }
        if( raisedCount > 1 )
            throw new RuntimeException( "raised count should not exceed 1" );
    }
    ------------------------------------

    When I ran my sample code in 4~5 processes concurrently, I sometimes got the above exception ("raised count should not exceed 1").
    I have attached my sample.

    Thanks.

    --
    Bongjae Chang