Sure thing. We will certainly need your help here, and I will share
possible ways to solve this.
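To make the idea concrete, here is a rough sketch of what a master-arbitrated check-and-set could look like. Note that FenceMaster and its method names are hypothetical, not existing Shoal APIs; in a real implementation the check-and-set would run on the Master in response to a remote request from each member.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the Master arbitrates fence acquisition so the
// check and the set happen atomically in one place. At most one
// requester can win for a given (component, failed member) pair.
public class FenceMaster {
    // key: componentName + ":" + failedMemberToken, value: winning member token
    private final ConcurrentHashMap<String, String> fences =
            new ConcurrentHashMap<String, String>();

    /** Returns true iff the requester acquired the fence. */
    public boolean tryRaiseFence(String componentName,
                                 String failedMemberToken,
                                 String requesterToken) {
        String key = componentName + ":" + failedMemberToken;
        // putIfAbsent is atomic: only the first caller stores its token,
        // closing the check-then-act window that the local isFenced() leaves open.
        return fences.putIfAbsent(key, requesterToken) == null;
    }

    public void lowerFence(String componentName, String failedMemberToken) {
        fences.remove(componentName + ":" + failedMemberToken);
    }

    public static void main(String[] args) {
        FenceMaster master = new FenceMaster();
        boolean a = master.tryRaiseFence("RecoveryService", "D", "A");
        boolean b = master.tryRaiseFence("RecoveryService", "D", "B");
        System.out.println(a + " " + b); // first requester wins, second is rejected
    }
}
```

The cost is one round trip to the Master per raiseFence() call, which seems acceptable given how rarely fences are raised.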
Bongjae Chang wrote:
> Hi Shreedhar.
> Thanks for your reply.
> This is not a critical requirement for JEUS yet.
> I just reviewed and tested this in order to find out whether Shoal
> supports it yet.
> But I am very interested in this issue, so if you look at a total
> order solution such as a totem-based protocol later, I would like to
> participate and study it if possible, e.g. by sharing algorithms and
> materials, or by testing, though my knowledge is trivial.
> Thanks.
> --
> Bongjae Chang
>
> ----- Original Message -----
> *From:* Shreedhar Ganapathy <mailto:Shreedhar.Ganapathy_at_Sun.COM>
> *To:* dev_at_shoal.dev.java.net <mailto:dev_at_shoal.dev.java.net>
> *Sent:* Monday, July 21, 2008 3:22 PM
> *Subject:* Re: [Shoal-Dev] GroupHandle.raiseFence() issue
>
> Hi Bongjae
> Yes this is a synchronization hole. It is indeed one that can be
> addressed with a synchronized global lock mechanism. So far we
> have not been exposed to this issue, as the recovery selection
> algorithm typically results in only one member being selected to
> raise a fence.
> To solve the general-purpose case, though, it might be worth
> addressing this. Is this a critical requirement for JEUS?
>
> Relying on the Master for global lock allocation might be a way
> to do this but would have a performance cost. We may
> have to look at a total order solution such as a totem-based
> protocol for this.
>
> Thanks
> Shreedhar
>
> Bongjae Chang wrote:
>> Hi.
>> I am testing the GroupHandle.raiseFence() API for recovery selection.
>> If I understood the API and the source code's purpose rightly,
>> GroupHandle.raiseFence() should allow only one member to raise a
>> fence for the same component and failed member at the same time. Is
>> that right?
>> But I saw a race condition and some interesting results related to
>> this.
>> The following is GroupHandle's code.
>> ------------------------------------
>> [GroupHandleImpl.java]
>> public void raiseFence(final String componentName, final String
>> failedMemberToken) throws GMSException {
>>     if( !isFenced( componentName, failedMemberToken ) ) {
>>         ...
>>         dsc.addToCache( componentName,
>>                 getGMSContext().getServerIdentityToken(),
>>                 failedMemberToken, setStateAndTime() );
>>         ...
>>     } else {
>>         throw new GMSException(...);
>>     }
>> }
>> public boolean isFenced(final String componentName, final String
>> memberToken) {
>>     ...
>>     entries = dsc.getFromCache( memberToken );
>>     for( GMSCacheable c : entries.keySet() ) {
>>         if( componentName.equals( c.getComponentName() ) ) {
>>             if( memberToken.equals( c.getKey() ) ) {
>>                 if( !memberToken.equals( c.getMemberTokenId() ) ) {
>>                     if( ((String)entries.get( c )).startsWith( REC_PROGRESS_STATE ) ) {
>>                         logger.log(...);
>>                         retval = true;
>>                         break;
>>                     }
>>                 }
>>             }
>>         }
>>     }
>>     return retval;
>> }
>> ------------------------------------
>> In raiseFence(), if isFenced() is false, the RECOVERY_IN_PROGRESS
>> state will be added to the DSC.
>> I think the code checks isFenced()'s result before adding the
>> state to the DSC in order to prevent multiple members from raising
>> a fence for the same failed member token.
>> But I think isFenced() may not be enough and there can be a race
>> condition, e.g. under network traffic, system overload, etc.
>> Assume that "A", "B", "C" and "D" are members of the same group and
>> "D" has failed.
>> "A", "B" and "C" try to raise a fence for "D" concurrently, like
>> the following.
>> ------------------------------------
>> [In "A", "B" and "C"]
>> GroupHandle gh = gms.getGroupHandle();
>> gh.raiseFence( component, "D" );
>> ------------------------------------
>> If isFenced() is false in "A", "B" and "C" at the same time, each
>> member adds its own state to the DSC, so more than one member can
>> have the RECOVERY_IN_PROGRESS state at the same time.
>> Because isFenced() checks the state only in its own local cache, it
>> doesn't provide raiseFence() with a global lock; it acts only as a
>> local lock for raising a fence. So I think this situation can occur.
>> For example, I think isFenced() should check the state from the
>> master node to get a global lock. Of course this would add some
>> overhead because of the extra network traffic, but since the
>> raiseFence() case is rare, I don't think that is a big problem.
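>> The interleaving I mean can be sketched like this (a standalone
>> illustration with hypothetical names and plain HashMaps standing in
>> for each member's local DSC view; this is not Shoal code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the race: each member's local cache says "not fenced"
// before either update propagates, so both members raise the fence.
public class FenceRaceDemo {
    public static void main(String[] args) {
        // Each member has its own local copy of the distributed cache.
        Map<String, String> cacheA = new HashMap<String, String>();
        Map<String, String> cacheB = new HashMap<String, String>();

        // Step 1: both members run their isFenced()-style check locally.
        boolean fencedSeenByA = cacheA.containsKey("D");
        boolean fencedSeenByB = cacheB.containsKey("D");

        // Step 2: both saw "not fenced", so both add their own state.
        if (!fencedSeenByA) cacheA.put("D", "A:RECOVERY_IN_PROGRESS");
        if (!fencedSeenByB) cacheB.put("D", "B:RECOVERY_IN_PROGRESS");

        // Step 3: once the caches sync, the merged view has two raisers.
        int raisedCount = cacheA.size() + cacheB.size();
        System.out.println("raisedCount = " + raisedCount); // 2, not 1
    }
}
```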
>> About this issue, I wrote some test code.
>> After join(), the member repeatedly raises a fence for a failed
>> member, checks the count of raised fences, lowers the fence, and
>> sleeps.
>> Here is the checking logic after raising the fence. It is similar
>> to isFenced()'s code.
>> ------------------------------------
>> private void runSimpleSample() {
>>     ...
>>     while( true ) {
>>         try {
>>             gh.raiseFence( COMPONENT_NAME, FAILED_MEMBER_TOKEN );
>>             ...
>>             checkRaiseFence( gh, COMPONENT_NAME, FAILED_MEMBER_TOKEN );
>>         } catch( GMSException gmse ) {
>>             gmse.printStackTrace();
>>         } catch( RuntimeException re ) {
>>             re.printStackTrace();
>>             ...
>>             System.exit( 0 );
>>         } finally {
>>             gh.lowerFence( COMPONENT_NAME, FAILED_MEMBER_TOKEN );
>>             ...
>>         }
>>         try {
>>             Thread.sleep( getRandomSleep() );
>>         } catch( InterruptedException e ) {
>>         }
>>     }
>> }
>> private void checkRaiseFence( GroupHandle gh, String componentName,
>>         String memberToken ) {
>>     DistributedStateCache dsc = gh.getDistributedStateCache();
>>     final Map<GMSCacheable, Object> entries = dsc.getFromCache( memberToken );
>>     int raisedCount = 0;
>>     for( GMSCacheable c : entries.keySet() ) {
>>         if( componentName.equals( c.getComponentName() ) ) {
>>             if( memberToken.equals( c.getKey() ) ) {
>>                 if( !memberToken.equals( c.getMemberTokenId() ) ) {
>>                     raisedCount++;
>>                 }
>>             }
>>         }
>>     }
>>     if( raisedCount > 1 )
>>         throw new RuntimeException( "raised count should not exceed 1" );
>> }
>> ------------------------------------
>> When I executed my sample code in 4~5 processes concurrently, I
>> could sometimes find the above exception ("raised count should not
>> exceed 1").
>> I attached my sample.
>> Thanks.
>> --
>> Bongjae Chang
>