Hi Bongjae
Yes, this is a synchronization hole, and it is one that can be
addressed with a synchronized global lock mechanism. So far we have not
been exposed to this issue because the recovery selection algorithm typically
results in only one member being selected to raise a fence.
To solve the general-purpose case, though, it might be worth addressing
this. Is this a critical requirement for JEUS?
Relying on the Master for global lock allocation might be a way to do
this, but it would have a performance cost. We may
have to look at a total-order solution, such as a Totem-based protocol,
for this.
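
To make the Master-based option concrete, here is a minimal sketch of an
atomic check-and-set done at a single arbiter. It is only an illustration:
MasterFenceRegistry, tryRaiseFence() and MasterFenceDemo are hypothetical
names, not part of Shoal, and the network round trip to the Master is
modelled as a plain method call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: these types do not exist in Shoal.
final class MasterFenceRegistry {
    // Key: component + failed member; value: the member that holds the fence.
    private final Map<String, String> fences = new ConcurrentHashMap<>();

    // Assumes tokens never contain '|'; a real key would be a proper type.
    private static String key(String componentName, String failedMemberToken) {
        return componentName + "|" + failedMemberToken;
    }

    /** Atomic check-and-set: returns true only for the first requester. */
    boolean tryRaiseFence(String componentName, String failedMemberToken,
                          String requester) {
        return fences.putIfAbsent(key(componentName, failedMemberToken),
                                  requester) == null;
    }

    /** Only the current holder can lower the fence. */
    boolean lowerFence(String componentName, String failedMemberToken,
                       String requester) {
        return fences.remove(key(componentName, failedMemberToken), requester);
    }

    /** Returns the member holding the fence, or null if none. */
    String holder(String componentName, String failedMemberToken) {
        return fences.get(key(componentName, failedMemberToken));
    }
}

final class MasterFenceDemo {
    public static void main(String[] args) throws InterruptedException {
        MasterFenceRegistry master = new MasterFenceRegistry();
        AtomicInteger granted = new AtomicInteger();
        List<Thread> members = new ArrayList<>();
        // "A", "B" and "C" all ask the master to fence "D" at the same time.
        for (String name : new String[] {"A", "B", "C"}) {
            Thread t = new Thread(() -> {
                if (master.tryRaiseFence("component", "D", name)) {
                    granted.incrementAndGet();
                }
            });
            members.add(t);
            t.start();
        }
        for (Thread t : members) {
            t.join();
        }
        // Prints "fences granted: 1" regardless of thread scheduling.
        System.out.println("fences granted: " + granted.get());
    }
}
```

The point of the sketch is that the check and the update happen as one
step in one place, so it does not matter which request arrives first. The
cost is the extra hop to the Master on every raiseFence() call, plus
handling of the case where the Master itself fails.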
Thanks
Shreedhar
Bongjae Chang wrote:
> Hi.
> I am testing the GroupHandle.raiseFence() API for recovery selection.
> If I understand the purpose of the API and source code correctly,
> GroupHandle.raiseFence() should allow only one member to raise a fence
> for the same component and failed member at a time. Is that right?
> However, I saw a race condition and some interesting results.
> The following is GroupHandle's code.
> ------------------------------------
> [GroupHandleImpl.java]
> public void raiseFence(final String componentName,
>         final String failedMemberToken) throws GMSException {
>     if(!isFenced(componentName, failedMemberToken)){
>         ...
>         dsc.addToCache(componentName,
>                 getGMSContext().getServerIdentityToken(), failedMemberToken,
>                 setStateAndTime() );
>         ...
>     } else {
>         throw new GMSException(...);
>     }
> }
> public boolean isFenced(final String componentName,
>         final String memberToken) {
>     ...
>     entries = dsc.getFromCache(memberToken);
>     for(GMSCacheable c:entries.keySet()) {
>         if(componentName.equals(c.getComponentName())) {
>             if(memberToken.equals(c.getKey())) {
>                 if(!memberToken.equals(c.getMemberTokenId())) {
>                     if(((String)entries.get(c)).startsWith(REC_PROGRESS_STATE)) {
>                         logger.log(...);
>                         retval = true;
>                         break;
>                     }
>                 }
>             }
>         }
>     }
>     return retval;
> }
> ------------------------------------
> In raiseFence(), if isFenced() is false, the RECOVERY_IN_PROGRESS state
> will be added to the DSC.
> I think the code checks isFenced()'s result before adding the
> state to the DSC in order to prevent multiple members from raising a fence
> for the same failed member token.
> But I think isFenced() may not be enough, and there can be a race
> condition (e.g., because of network traffic or system overload).
> Assume that "A", "B", "C", and "D" are members of the same group and "D"
> has failed.
> "A", "B", and "C" try to raise a fence for "D" concurrently, like the
> following.
> ------------------------------------
> [In "A", "B" and "C"]
> GroupHandle gh = gms.getGroupHandle();
> gh.raiseFence( component, "D" );
> ------------------------------------
> If isFenced() is false in "A", "B", and "C" at the same time, each member
> adds its own state to the DSC, so more than one member can have the
> RECOVERY_IN_PROGRESS state at the same time.
> Because isFenced() checks the state only in the member's own local cache,
> it does not give raiseFence() a global lock; it acts
> only as a local lock for raising a fence. So I think this
> situation can occur.
> I think isFenced() should check the state on the master node to get a
> global lock. Of course, the extra network packets add some overhead,
> but because raiseFence() is called rarely, that is not a big
> problem.
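
The check-then-act race described above can be reproduced without any
networking. The following is a minimal, deterministic sketch and not Shoal
code: LocalCacheMember and FenceRaceDemo are hypothetical names, each
member's replica of the DSC is modelled as a plain map, and replication is
modelled as a loop that delivers each update to every replica.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: each member holds its own local replica of the cache.
final class LocalCacheMember {
    final String name;
    // failed member token -> members this replica believes are recovering it
    final Map<String, List<String>> localCache = new HashMap<>();

    LocalCacheMember(String name) {
        this.name = name;
    }

    // Like isFenced(): consults only this member's local replica.
    boolean isFenced(String failedMemberToken) {
        List<String> owners = localCache.get(failedMemberToken);
        return owners != null && !owners.isEmpty();
    }

    // Applies a replicated "recovery in progress" entry to this replica.
    void apply(String failedMemberToken, String owner) {
        localCache.computeIfAbsent(failedMemberToken, k -> new ArrayList<>())
                  .add(owner);
    }
}

final class FenceRaceDemo {
    /** Returns how many members end up holding a fence for the failed member. */
    static int raiseConcurrently(List<LocalCacheMember> members,
                                 String failedMemberToken) {
        // Step 1: every member checks its own replica before any update arrives.
        List<LocalCacheMember> passedCheck = new ArrayList<>();
        for (LocalCacheMember m : members) {
            if (!m.isFenced(failedMemberToken)) {
                passedCheck.add(m);
            }
        }
        // Step 2: each member that passed the check publishes its entry,
        // and replication delivers it to every replica.
        for (LocalCacheMember raiser : passedCheck) {
            for (LocalCacheMember replica : members) {
                replica.apply(failedMemberToken, raiser.name);
            }
        }
        return members.get(0).localCache
                .getOrDefault(failedMemberToken, Collections.emptyList())
                .size();
    }

    public static void main(String[] args) {
        List<LocalCacheMember> members = Arrays.asList(
                new LocalCacheMember("A"),
                new LocalCacheMember("B"),
                new LocalCacheMember("C"));
        // Prints "fence holders for D: 3": all three passed the local check.
        System.out.println("fence holders for D: "
                + raiseConcurrently(members, "D"));
    }
}
```

Because the check and the update are separate steps and the check only
sees local state, every member that checks before the first update is
delivered will go on to raise a fence.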
> To check this, I wrote some test code.
> After join(), the member repeatedly raises a fence for a failed member,
> checks the number of raised fences, lowers the fence, and sleeps.
> Here is the checking logic that runs after raising the fence. It is
> similar to isFenced()'s code.
> ------------------------------------
> private void runSimpleSample() {
>     ...
>     while( true ) {
>         try {
>             gh.raiseFence( COMPONENT_NAME, FAILED_MEMBER_TOKEN );
>             ...
>             checkRaiseFence( gh, COMPONENT_NAME, FAILED_MEMBER_TOKEN );
>         } catch( GMSException gmse ) {
>             gmse.printStackTrace();
>         } catch( RuntimeException re ) {
>             re.printStackTrace();
>             ...
>             System.exit( 0 );
>         } finally {
>             gh.lowerFence( COMPONENT_NAME, FAILED_MEMBER_TOKEN );
>             ...
>         }
>         try {
>             Thread.sleep( getRandomSleep() );
>         } catch( InterruptedException e ) {
>         }
>     }
> }
> private void checkRaiseFence( GroupHandle gh, String componentName,
>         String memberToken ) {
>     DistributedStateCache dsc = gh.getDistributedStateCache();
>     final Map<GMSCacheable, Object> entries = dsc.getFromCache( memberToken );
>     int raisedCount = 0;
>     for(GMSCacheable c:entries.keySet()) {
>         if(componentName.equals(c.getComponentName())) {
>             if(memberToken.equals(c.getKey())) {
>                 if(!memberToken.equals(c.getMemberTokenId())) {
>                     raisedCount++;
>                 }
>             }
>         }
>     }
>     if( raisedCount > 1 )
>         throw new RuntimeException( "raised count should not exceed 1" );
> }
> ------------------------------------
> When I ran my sample code in 4-5 processes concurrently,
> I sometimes saw the above exception ("raised count should not exceed 1").
> I attached my sample.
> Thanks.
> --
> Bongjae Chang