dev@shoal.java.net

GroupHandle.raiseFence() issue

From: Bongjae Chang <carryel_at_korea.com>
Date: Mon, 21 Jul 2008 13:20:09 +0900

Hi.
I am testing the GroupHandle.raiseFence() API for recovery selection.
If I understand the purpose of the API and the source code correctly, GroupHandle.raiseFence() should allow only one member at a time to raise a fence for a given component and failed member. Is that right?
However, I saw a race condition and some interesting results around this.

The following is GroupHandle's code.
------------------------------------
[GroupHandleImpl.java]
public void raiseFence(final String componentName, final String failedMemberToken) throws GMSException {
    if(!isFenced(componentName, failedMemberToken)){
        ...
        dsc.addToCache(componentName, getGMSContext().getServerIdentityToken(), failedMemberToken, setStateAndTime() );
        ...
    } else {
        throw new GMSException(...);
    }
}

public boolean isFenced(final String componentName, final String memberToken) {
    ...
    entries = dsc.getFromCache(memberToken);
    for(GMSCacheable c:entries.keySet()) {
        if(componentName.equals(c.getComponentName())) {
            if(memberToken.equals(c.getKey())) {
                if(!memberToken.equals(c.getMemberTokenId())) {
                    if(((String)entries.get(c)).startsWith(REC_PROGRESS_STATE)) {
                        logger.log(...);
                        retval = true;
                        break;
                    }
                }
            }
        }
    }
    return retval;
}
------------------------------------

In raiseFence(), if isFenced() returns false, the RECOVERY_IN_PROGRESS state is added to the DSC (DistributedStateCache).
I think the code checks the result of isFenced() before adding the state to the DSC in order to prevent multiple members from raising a fence for the same failed member token.
But I think isFenced() alone is not enough and a race condition is possible, e.g. because of network latency or system overload.
Assume that "A", "B", "C" and "D" are members of the same group and "D" has failed.
"A", "B" and "C" try to raise a fence for "D" concurrently, as follows.

------------------------------------
[In "A", "B" and "C"]
GroupHandle gh = gms.getGroupHandle();
gh.raiseFence( component, "D" );
------------------------------------

If isFenced() returns false in "A", "B" and "C" at the same time, each member adds its own state to the DSC, so more than one member can be in the RECOVERY_IN_PROGRESS state at the same time.
isFenced() checks the state only in the member's own local cache, so it gives raiseFence() a local lock rather than a global lock. That is why I think this situation can occur.
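To make the interleaving concrete, here is a minimal, deterministic model of it. It is not Shoal code: the Member class, the string-keyed local cache and the replication loop are all assumptions made for illustration. It only shows that when every member checks its own replica before any update arrives, every member passes the check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Shoal code: each member keeps its own replica of the cache.
class LocalFenceRaceDemo {
    static final String REC_PROGRESS_STATE = "RECOVERY_IN_PROGRESS";

    static class Member {
        final String name;
        // Local replica: key = "<raiser>-><failedMember>", value = state.
        final Map<String, String> localCache = new HashMap<>();

        Member(String name) { this.name = name; }

        // Like isFenced(), this consults only the local replica.
        boolean isFencedLocally(String failed) {
            for (Map.Entry<String, String> e : localCache.entrySet()) {
                if (e.getKey().endsWith("->" + failed)
                        && e.getValue().startsWith(REC_PROGRESS_STATE)) {
                    return true;
                }
            }
            return false;
        }
    }

    // Returns how many members end up holding a fence for the failed member.
    static int simulate() {
        List<Member> members =
                Arrays.asList(new Member("A"), new Member("B"), new Member("C"));
        String failed = "D";

        // Step 1: every member checks before any update has been replicated.
        List<Member> passedCheck = new ArrayList<>();
        for (Member m : members) {
            if (!m.isFencedLocally(failed)) {
                passedCheck.add(m);
            }
        }

        // Step 2: each member that passed writes its state; the write reaches all replicas.
        for (Member raiser : passedCheck) {
            for (Member replica : members) {
                replica.localCache.put(raiser.name + "->" + failed, REC_PROGRESS_STATE);
            }
        }

        // Step 3: count the fence entries for the failed member in one replica.
        int raised = 0;
        for (String key : members.get(0).localCache.keySet()) {
            if (key.endsWith("->" + failed)) {
                raised++;
            }
        }
        return raised;
    }

    public static void main(String[] args) {
        System.out.println("members holding the fence for D: " + simulate());
    }
}
```

In this model all three members pass the local check, so three RECOVERY_IN_PROGRESS entries exist for "D" afterwards.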

For example, I think isFenced() should check the state with the master node to get a global lock. Of course this adds overhead from the extra network traffic, but because raiseFence() is called rarely, that should not be a big problem.
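A rough sketch of what I mean by checking with the master node is below. It is only an in-process illustration under my own assumptions: MasterFenceSketch, tryRaiseFence() and lowerFence() are hypothetical names, not existing Shoal API, and the network round trip to the master is replaced by a direct method call. The point is that the check and the update happen as one atomic step in a single place.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Shoal API: one arbiter (standing in for the master node)
// owns the fence table, so check and update are a single atomic operation.
class MasterFenceSketch {
    // key = "<component>:<failedMember>", value = member that holds the fence
    private final ConcurrentMap<String, String> fences = new ConcurrentHashMap<>();

    // Returns true only for the first requester; later requesters are rejected.
    boolean tryRaiseFence(String component, String failed, String requester) {
        return fences.putIfAbsent(component + ":" + failed, requester) == null;
    }

    // Only the member that holds the fence can lower it.
    boolean lowerFence(String component, String failed, String requester) {
        return fences.remove(component + ":" + failed, requester);
    }

    // Lets several threads (standing in for members) request the same fence at once
    // and returns how many of them were granted it.
    static int countWinners(int requesters) {
        MasterFenceSketch master = new MasterFenceSketch();
        AtomicInteger winners = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] threads = new Thread[requesters];
        for (int i = 0; i < requesters; i++) {
            String name = "member-" + i;
            threads[i] = new Thread(() -> {
                try {
                    start.await(); // release all requesters together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (master.tryRaiseFence("component", "D", name)) {
                    winners.incrementAndGet();
                }
            });
            threads[i].start();
        }
        start.countDown();
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for requesters", e);
        }
        return winners.get();
    }

    public static void main(String[] args) {
        System.out.println("members granted the fence: " + countWinners(3));
    }
}
```

However many members ask at the same time, only one of them is granted the fence, because putIfAbsent() does the check and the write atomically. In a real group the master would also have to handle its own failure, which this sketch ignores.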

I wrote some test code for this issue.
After join(), the member repeatedly raises a fence for a failed member, checks how many fences have been raised, lowers the fence, and sleeps.
Here is the checking logic that runs after raising the fence. It is similar to the code of isFenced().
------------------------------------
private void runSimpleSample() {
    ...
    while( true ) {
        try {
            gh.raiseFence( COMPONENT_NAME, FAILED_MEMBER_TOKEN );
            ...
            checkRaiseFence( gh, COMPONENT_NAME, FAILED_MEMBER_TOKEN );
        } catch( GMSException gmse ) {
            gmse.printStackTrace();
        } catch( RuntimeException re ) {
            re.printStackTrace();
            ...
            System.exit( 0 );
        } finally {
            gh.lowerFence( COMPONENT_NAME, FAILED_MEMBER_TOKEN );
            ...
        }
        try {
            Thread.sleep( getRandomSleep() );
        } catch( InterruptedException e ) {
        }
    }
}

private void checkRaiseFence( GroupHandle gh, String componentName, String memberToken ) {
    DistributedStateCache dsc = gh.getDistributedStateCache();
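    // Same lookup as in isFenced(): all cache entries keyed by the failed member token.
    final Map<GMSCacheable, Object> entries = dsc.getFromCache( memberToken );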
    int raisedCount = 0;
    for(GMSCacheable c:entries.keySet()) {
        if(componentName.equals(c.getComponentName())) {
            if(memberToken.equals(c.getKey())) {
                if(!memberToken.equals(c.getMemberTokenId())) {
                    raisedCount++;
                }
            }
        }
    }
    if( raisedCount > 1 )
        throw new RuntimeException( "raised count should not exceed 1" );
}
------------------------------------

When I ran my sample code in 4 to 5 processes concurrently, I sometimes got the above exception ("raised count should not exceed 1").
I have attached my sample.

Thanks.

--
Bongjae Chang