users@shoal.java.net

Re: Merging group issue

From: Tim <tim.shiu_at_ssc-ltd.com>
Date: Wed, 30 Nov 2011 10:05:20 +0800

Dear Joe,

Thanks a lot for your help. I will try implementing the custom
DSC approach.

Many Thanks.

Regards,
Tim.Shiu

On 30/11/2011 0:20, Joseph Fialli wrote:
> Tim,
>
> Comments inline below.
>
> -Joe
>
> On 11/27/11 9:05 PM, Tim wrote:
>> Dear Joe,
>>
>> Thanks for your explanation. Let me describe how I use Shoal in my system.
>>
>> In my use case, I put the states of the machines into the DSC to
>> handle failover.
>>
>> For example, when a running machine goes down, a backup machine is
>> notified and takes over the tasks that the failed machine was
>> performing. The states (running, failure, recovering, etc.) are stored
>> in the DSC and shared among the nodes in the group. There are m + n
>> machines in the group, where m is the number of machines running tasks
>> and n is the number of machines ready to act as backups.
>>
>> There are two pieces of information I would like to get from the DSC:
>> 1) the states of all machines in the group, and 2) which backup machine
>> has taken over for which normal machine.
>>
>> So, when a running machine A fails, a failure recovery notification
>> is sent to one of the backup machines, B, which takes over the tasks.
>> On receiving the failure notification, the leader stores the states of
>> the failed machine and the recovering machine in the DSC. These states
>> are used by the subsequent resume action once the failed machine
>> returns to normal.
>>
>> It's great that Shoal can natively handle this failure case even
>> when the failed machine is the leader node.
>>
>> Unfortunately, if the failure is caused by a network problem, both
>> machine A and machine B will act as leader, and each will receive a
>> failure notification and mark the other machine as failed in the DSC.
>> When the network returns to normal, the groups are merged and the DSC
>> is overwritten by that of one of the nodes in the group. (This assumes
>> that there are only 2 machines in the group and neither is a backup
>> machine.)
>>
>> So, I would like to ask:
>>
>> 1) Is any specific notification signal sent after the groups are
>> merged? (I can see that a join notification is sent, but is it
>> possible to distinguish a normal join from a merge?)
> When the isolated groups re-merge, there will be a master collision.
> A GroupLeadershipNotification denotes that there is a change in
> group leadership.
> In the case of two isolated groups that are rediscovering each other,
> a master collision will occur, followed by a
> GroupLeadershipNotification indicating the one master of the unified
> group.
>>
>> 2) Can the native merging of the DSC be stopped programmatically so
>> that it is handled by my custom merging logic?
>>
> The current DistributedStateCacheImpl does not appear to handle the
> use case that you have identified, namely that the instances of the
> group are initially isolated AND the isolated members then find each
> other sometime much later.
>
> If you would like to experiment with an alternative implementation of
> DistributedStateCacheImpl with your changes, I recommend changing the
> one call to DistributedStateCacheImpl.getInstance() into a more
> generic ServiceLoader load of possible alternative implementations of
> DistributedStateCache.
> The only call to DistributedStateCacheImpl.getInstance() is in
> GMSContextImpl.createDistributedStateCache().
>
> An example of a ServiceLoader load is in
> AbstractNetworkManager.findByServiceLoader.
> This selects between the jxta, grizzly 1.9 or grizzly 2.0 transport for GMS.
> See
> shoal/gms/etc/META-INF/services/com.sun.enterprise.mgmt.transport.NetworkManager
> for how to enumerate multiple implementations for a service. You can
> introduce a file called
> shoal/gms/etc/META-INF/services/com.sun.enterprise.ee.cms.core.DistributedStateCache
> with references to both your replacement DistributedStateCache
> implementation and the original one.
>
> -Joe
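
The ServiceLoader approach described above could look roughly like the following. This is only a minimal sketch under stated assumptions: StateCache and DefaultStateCache are hypothetical stand-ins for Shoal's DistributedStateCache and DistributedStateCacheImpl so that the example compiles without the Shoal jars, and the fallback-to-default behaviour is an illustrative choice, not what GMSContextImpl actually does.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.ServiceLoader;
import java.util.concurrent.ConcurrentHashMap;

public class StateCacheLoader {

    /** Hypothetical stand-in for Shoal's DistributedStateCache interface. */
    public interface StateCache {
        void put(String key, String value);
        String get(String key);
    }

    /** Hypothetical stand-in for the built-in DistributedStateCacheImpl. */
    public static class DefaultStateCache implements StateCache {
        private final Map<String, String> entries = new ConcurrentHashMap<>();

        @Override
        public void put(String key, String value) {
            entries.put(key, value);
        }

        @Override
        public String get(String key) {
            return entries.get(key);
        }
    }

    /**
     * Returns the first provider registered under META-INF/services for
     * StateCache, or the default implementation when none is registered.
     */
    public static StateCache load() {
        Iterator<StateCache> providers =
                ServiceLoader.load(StateCache.class).iterator();
        return providers.hasNext() ? providers.next() : new DefaultStateCache();
    }

    public static void main(String[] args) {
        StateCache cache = load();
        cache.put("MACHINEA", "running");
        System.out.println(cache.getClass().getSimpleName()
                + " -> " + cache.get("MACHINEA"));
    }
}
```

With no provider file on the classpath, load() falls back to the default; listing a custom class first in the services file would make it the one selected.
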
>
>> 3) Is there any exposed notification for updates to the DSC, so that
>> I can add custom logic to determine whether an update may proceed?
>>
>> Sorry for the long message. Thanks a lot.
>>
>> Regards,
>> Tim.Shiu
>>
>> On 23/11/2011 0:10, Joseph Fialli wrote:
>>>
>>>
>>> On 11/22/11 9:28 AM, Tim Shiu wrote:
>>>> Dear Joe,
>>>>
>>>> Thanks a lot.
>>>> I checked out the project from SVN and confirmed that the nodes can
>>>> merge together now.
>>>>
>>>> I would like to ask one more question.
>>>> Is there any mechanism to merge the DSC between these nodes so that
>>>> the data in the DSC stays up to date?
>>> The Master of the GMS group synchronizes its distributed state
>>> cache with the other members.
>>> In your case of isolation at startup, both instances are
>>> initially masters of their own one-member groups.
>>> When the isolated instances find each other, a master collision
>>> resolution algorithm decides which one will be master. The instance
>>> that is not the master should synchronize its DSC with the master.
>>> The master will distribute its latest DSC to all other members.
>>>
>>> -Joe
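
For contrast with the master-wins synchronization described above, a custom cache could reconcile two snapshots entry by entry. The sketch below is not Shoal behaviour: the Entry type, the per-entry timestamp, and the "newest entry wins, ties go to the master" rule are all assumptions made for illustration, and timestamps from different machines are only comparable if their clocks are reasonably in sync.

```java
import java.util.HashMap;
import java.util.Map;

public class StateMerge {

    /** Hypothetical cache entry: a machine state plus the time it was recorded. */
    public static final class Entry {
        public final String state;
        public final long updatedAtMillis;

        public Entry(String state, long updatedAtMillis) {
            this.state = state;
            this.updatedAtMillis = updatedAtMillis;
        }
    }

    /**
     * Merges two snapshots key by key. For keys present in both, the entry
     * with the larger timestamp is kept; on a tie the master's entry wins.
     */
    public static Map<String, Entry> merge(Map<String, Entry> master,
                                           Map<String, Entry> other) {
        Map<String, Entry> merged = new HashMap<>(master);
        for (Map.Entry<String, Entry> e : other.entrySet()) {
            merged.merge(e.getKey(), e.getValue(),
                    (mine, theirs) ->
                            theirs.updatedAtMillis > mine.updatedAtMillis ? theirs : mine);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Entry> master = new HashMap<>();
        master.put("A", new Entry("running", 100L));
        master.put("B", new Entry("recovering", 300L));

        Map<String, Entry> other = new HashMap<>();
        other.put("A", new Entry("failed", 200L));
        other.put("C", new Entry("backup", 50L));

        for (Map.Entry<String, Entry> e : merge(master, other).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue().state);
        }
    }
}
```

In a split-brain case where each side has marked the other as failed, a timestamp rule alone will not decide which view is correct, so application-specific logic would still be needed at that point.
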
>>>> Or will it just pick the DSC of one node and distribute it among the nodes?
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>> Tim.Shiu
>>>>
>>>> Quoting Joseph Fialli <joe.fialli_at_oracle.com>:
>>>>
>>>> > Tim,
>>>> >
>>>> > There were recent bug fixes for instances rejoining cluster checked
>>>> > in last Thursday.
>>>> > Using the gms-transport-module branch (that is latest shoal branch),
>>>> > I was able to confirm two instances
>>>> > on different machines finding each other after being started up on
>>>> > isolated network and then having
>>>> > the network reconnected after startup.
>>>> > (Simulated loss of network connectivity by running ifconfig
>>>> > <networkinterface> down followed
>>>> > 80 seconds later by an ifconfig up to the same network interface.)
>>>> >
>>>> > The same fix was checked into the shoal trunk last Thursday.
>>>> >
>>>> > -Joe Fialli, Oracle Corp.
>>>> >
>>>> > P.S.
>>>> > Just in case you do not know how to check out and build the shoal
>>>> > workspace, there are instructions on how to check out the trunk or
>>>> > a branch and build it at the following link:
>>>> > http://shoal.java.net/HowToBuildSource.html.
>>>> >
>>>> > On 11/17/11 8:42 PM, Tim wrote:
>>>> >> Dear Joe,
>>>> >>
>>>> >> Thanks for your reply.
>>>> >>
>>>> >> I have already set both machines to use the same multicast address
>>>> >> and port (via the properties passed to
>>>> >> GMSFactory.startGMSModule), and they are on the same
>>>> >> subnet. Unfortunately, they still cannot detect each other after
>>>> >> the network is connected. Am I missing a setting?
>>>> >>
>>>> >> The following is the program fragment that joins the group.
>>>> >>
>>>> >> Properties props = new Properties();
>>>> >> props.put(ServiceProviderConfigurationKeys.LOOPBACK.toString(),
>>>> >>     "true");
>>>> >> props.put(ServiceProviderConfigurationKeys.FAILURE_DETECTION_TIMEOUT.toString(),
>>>> >>     "500");
>>>> >> props.put(ServiceProviderConfigurationKeys.FAILURE_VERIFICATION_TIMEOUT.toString(),
>>>> >>     "500");
>>>> >> props.put(ServiceProviderConfigurationKeys.FAILURE_DETECTION_RETRIES.toString(),
>>>> >>     "2");
>>>> >> props.put(ServiceProviderConfigurationKeys.MULTICASTADDRESS.toString(),
>>>> >>     "228.0.0.1");
>>>> >> props.put(ServiceProviderConfigurationKeys.MULTICASTPORT.toString(),
>>>> >>     "9800");
>>>> >> GroupManagementService gms = (GroupManagementService)
>>>> >>     GMSFactory.startGMSModule("MACHINEA", "TESTGROUP",
>>>> >>         MemberType.CORE, props);
>>>> >> gms.join();
>>>> >>
>>>> >> Thanks for your help.
>>>> >>
>>>> >> Regards,
>>>> >> Tim.Shiu
>>>> >>
>>>> >> On 18/11/2011 3:22, Joseph Fialli wrote:
>>>> >>> Tim,
>>>> >>>
>>>> >>> In addition to the same group name, the GMS clients would also
>>>> >>> have to be using the same multicast group address and multicast
>>>> >>> port.
>>>> >>> Lastly, Machine A and B would have to be on the same subnet, and
>>>> >>> UDP multicast needs to be enabled for the network and any
>>>> >>> switches/routers involved.
>>>> >>>
>>>> >>> They would find each other over UDP multicast and form a group
>>>> >>> when network connectivity returns.
>>>> >>>
>>>> >>> -Joe Fialli
>>>> >>>
>>>> >>> On 11/17/11 5:46 AM, tim.shiu_at_ssc-ltd.com wrote:
>>>> >>>> Dear All,
>>>> >>>>
>>>> >>>> I would like to ask if there is any mechanism in Shoal that
>>>> >>>> can merge 2 separate groups (with the same group name) into 1
>>>> >>>> after they join the same network?
>>>> >>>>
>>>> >>>> e.g.
>>>> >>>> Machine A and B each join a group separately without being
>>>> >>>> connected to the network. After they have created their own
>>>> >>>> groups with the same name, the network cable is plugged in to
>>>> >>>> connect them. Will they merge into the same group?
>>>> >>>>
>>>> >>>> Thanks.
>>>> >>>>
>>>> >>>> Regards,
>>>> >>>> Tim.Shiu
>>>> >>>
>>>> >>>
>>>> >
>>>>
>>>
>>>