users@shoal.java.net

Re: Merging group issue

From: Joseph Fialli <joe.fialli_at_oracle.com>
Date: Tue, 29 Nov 2011 11:20:12 -0500

Tim,

Comments inline below.

-Joe

On 11/27/11 9:05 PM, Tim wrote:
> Dear Joe,
>
> Thanks for your explanation. Let me describe how I use Shoal in my
> system.
>
> In my use case, I store the states of the machines in the DSC to
> handle failover.
>
> For example, when a running machine goes down, a backup machine is
> notified and takes over the tasks that the failed machine was
> performing. The states (running, failure, recovering, etc.) are stored
> in the DSC and shared among the nodes in the group. There are m + n
> machines in the group, where m is the number of machines running tasks
> and n is the number of machines ready as backups.
>
> There are two pieces of information I would like to get from the DSC:
> 1) the states of all machines in the group, and 2) which backup machine
> is taking over for which normal machine.
>
> So, when a running machine A fails, a failure recovery notification
> is sent to one of the backup machines, B, to take over the tasks. The
> states of the failed machine and the recovering machine are stored in
> the DSC by the leader when it receives the failure notification. These
> states are used by the subsequent resume action after the failed
> machine returns to normal.
>
> It's great that Shoal can natively handle this failure case even when
> the failed machine is the leader node.
>
> Unfortunately, if the failure is triggered by a network problem, both
> machine A and machine B will act as leader, and each will receive a
> failure notification and mark the opposite machine as failed in the
> DSC. When the network returns to normal, the groups are merged and the
> DSC is overwritten by one of the nodes in the group. (Assume that there
> are only 2 machines in the group and that neither is a backup machine.)
>
> So, I would like to ask:
>
> 1) Is any specific notification signal sent after the groups are
> merged? (I can see that a join notification is sent, but is it possible
> to distinguish between a normal join and a merge?)
When the isolated groups re-merge, there will be a master collision. A
GroupLeadershipNotification denotes a change in group leadership.
When two isolated groups rediscover each other, a master collision
occurs, followed by a GroupLeadershipNotification indicating the single
master of the unified group.
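
One way an application might tell a merge apart from an ordinary
failover is sketched below. This is a heuristic only, not part of Shoal:
the MergeDetector class and its method are hypothetical names, and the
sketch assumes the application's leadership-notification callback can
supply the new leader's name and the current member list. The idea is
that in a failover the previous leader has left the group, while after a
master collision the previous leader is still a member.

```java
import java.util.Set;

/** Hypothetical helper (not a Shoal class): infers a merge from leadership changes. */
public class MergeDetector {
    private String lastLeader;

    public MergeDetector(String initialLeader) {
        this.lastLeader = initialLeader;
    }

    /**
     * Intended to be called from the application's leadership-notification
     * callback (assumption: the callback knows the new leader and the
     * current members of the group).
     *
     * @return true if the leader changed while the previous leader is still
     *         a member, which suggests a master collision rather than a failure
     */
    public boolean onLeadershipChange(String newLeader, Set<String> currentMembers) {
        boolean leaderChanged = !newLeader.equals(lastLeader);
        boolean oldLeaderStillMember = currentMembers.contains(lastLeader);
        lastLeader = newLeader;
        return leaderChanged && oldLeaderStillMember;
    }
}
```

For example, if A was master of its own one-member group and is then
told that B is the master of a group containing both A and B, the
detector reports a probable merge; if B later leaves and A becomes
master again, it does not.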
>
> 2) Can the native merging of the DSC be disabled programmatically so
> that it is handled by my custom merging logic?
>
The current DistributedStateCacheImpl does not appear to handle the use
case that you have identified, namely that the instances of the group
start out isolated and the isolated members only find each other
much later.

If you would like to experiment with an alternative implementation of
DistributedStateCacheImpl that contains your changes, I recommend
replacing the single call to DistributedStateCacheImpl.getInstance()
with a more generic ServiceLoader lookup of possible alternative
implementations of DistributedStateCache.
The only call to DistributedStateCacheImpl.getInstance() is in
GMSContextImpl.createDistributedStateCache().

An example of ServiceLoader use is in
AbstractNetworkManager.findByServiceLoader, which selects between the
jxta, grizzly 1.9 and grizzly 2.0 transports for GMS.
See
shoal/gms/etc/META-INF/services/com.sun.enterprise.mgmt.transport.NetworkManager
for how to enumerate multiple implementations of a service. You can
introduce a file called
shoal/gms/etc/META-INF/services/com.sun.enterprise.ee.cms.core.DistributedStateCache
that references both your replacement DistributedStateCache
implementation and the original DistributedStateCache implementation.
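
A minimal sketch of such a ServiceLoader lookup with a fallback to the
built-in implementation follows. It is self-contained and does not use
the Shoal classes: StateCache, DefaultStateCache and CacheLocator are
hypothetical stand-ins for DistributedStateCache,
DistributedStateCacheImpl and the code in
GMSContextImpl.createDistributedStateCache(). Only java.util.ServiceLoader
is a real API here.

```java
import java.util.Iterator;
import java.util.ServiceLoader;

public class CacheLocator {
    /** Stand-in for the DistributedStateCache interface, reduced to one method. */
    public interface StateCache {
        String name();
    }

    /** Stand-in for the built-in DistributedStateCacheImpl. */
    public static class DefaultStateCache implements StateCache {
        @Override
        public String name() {
            return "default";
        }
    }

    /**
     * Returns the first implementation listed in a META-INF/services
     * provider file on the classpath, or the default implementation
     * when no provider is registered.
     */
    public static StateCache findByServiceLoader() {
        Iterator<StateCache> providers = ServiceLoader.load(StateCache.class).iterator();
        return providers.hasNext() ? providers.next() : new DefaultStateCache();
    }
}
```

A provider file is a plain-text file named after the fully qualified
interface, containing one fully qualified implementation class name per
line. Because this sketch takes the first entry, the replacement
implementation would need to be listed before the original.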

-Joe

> 3) Is there any exposed notification for DSC updates to which I can
> add custom logic to decide whether the update may proceed?
>
> Sorry, the essay is quite long. Thanks a lot.
>
> Regards,
> Tim.Shiu
>
> On 23/11/2011 0:10, Joseph Fialli wrote:
>>
>>
>> On 11/22/11 9:28 AM, Tim Shiu wrote:
>>> Dear Joe,
>>>
>>> Thanks a lot.
>>> I checked out the project from SVN and confirmed that they can
>>> merge now.
>>>
>>> After that, I would like to ask one more question.
>>> Is there any mechanism to merge the DSC between these nodes so that
>>> the data in the DSC stays up to date?
>> The master of the GMS group synchronizes its distributed state
>> cache (DSC) with the other members.
>> In your case of isolation at startup, both instances are
>> initially masters of their own one-member groups.
>> When the isolated instances find each other, a master collision
>> resolution algorithm decides which one will be master. The instance
>> that is not the master should synchronize its DSC with the master.
>> The master will distribute its latest DSC to all other members.
>>
>> -Joe
>>> Or will it just pick the DSC of one node and distribute it among
>>> the nodes?
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Tim.Shiu
>>>
>>> Quoting Joseph Fialli <joe.fialli_at_oracle.com>:
>>>
>>> > Tim,
>>> >
>>> > There were recent bug fixes for instances rejoining the cluster,
>>> > checked in last Thursday.
>>> > Using the gms-transport-module branch (the latest shoal branch),
>>> > I was able to confirm that two instances on different machines
>>> > find each other after being started up on an isolated network
>>> > and then having the network reconnected after startup.
>>> > (I simulated loss of network connectivity by running ifconfig
>>> > <networkinterface> down, followed 80 seconds later by an
>>> > ifconfig up on the same network interface.)
>>> >
>>> > The same fix was checked into the shoal trunk last Thursday.
>>> >
>>> > -Joe Fialli, Oracle Corp.
>>> >
>>> > P.S.
>>> > Just in case you do not know how to check out and build the shoal
>>> > workspace, there are instructions on how to check out the trunk
>>> > or a branch and build it at the following link:
>>> > http://shoal.java.net/HowToBuildSource.html.
>>> >
>>> > On 11/17/11 8:42 PM, Tim wrote:
>>> >> Dear Joe,
>>> >>
>>> >> Thanks for your reply.
>>> >>
>>> >> I have already set both machines to the same multicast address
>>> >> and port (via the properties parameter passed to
>>> >> GMSFactory.startGMSModule), and they are already on the same
>>> >> subnet. Unfortunately, they still cannot detect each other after
>>> >> the network is connected. Am I missing a setting?
>>> >>
>>> >> The following is the program fragment used to join the group.
>>> >>
>>> >> Properties props = new Properties();
>>> >> props.put(ServiceProviderConfigurationKeys.LOOPBACK.toString(), "true");
>>> >> props.put(ServiceProviderConfigurationKeys.FAILURE_DETECTION_TIMEOUT.toString(), "500");
>>> >> props.put(ServiceProviderConfigurationKeys.FAILURE_VERIFICATION_TIMEOUT.toString(), "500");
>>> >> props.put(ServiceProviderConfigurationKeys.FAILURE_DETECTION_RETRIES.toString(), "2");
>>> >> props.put(ServiceProviderConfigurationKeys.MULTICASTADDRESS.toString(), "228.0.0.1");
>>> >> props.put(ServiceProviderConfigurationKeys.MULTICASTPORT.toString(), "9800");
>>> >> GroupManagementService gms = (GroupManagementService)
>>> >>     GMSFactory.startGMSModule("MACHINEA", "TESTGROUP",
>>> >>         MemberType.CORE, props);
>>> >> gms.join();
>>> >>
>>> >> Thanks for your help.
>>> >>
>>> >> Regards,
>>> >> Tim.Shiu
>>> >>
>>> >> On 18/11/2011 3:22, Joseph Fialli wrote:
>>> >>> Tim,
>>> >>>
>>> >>> In addition to the same group name, the GMS clients would also have
>>> >>> to be using the
>>> >>> same multicast group address and multicast port.
>>> >>> Lastly, Machine A and B would have to be on the same subnet and UDP
>>> >>> multicast
>>> >>> needs to be enabled for the network and possible switches/routers.
>>> >>>
>>> >>> They would find each other over UDP multicast and form a group when
>>> >>> network connectivity returns.
>>> >>>
>>> >>> -Joe Fialli
>>> >>>
>>> >>> On 11/17/11 5:46 AM, tim.shiu_at_ssc-ltd.com wrote:
>>> >>>> Dear All,
>>> >>>>
>>> >>>> I would like to ask whether there is any mechanism in Shoal
>>> >>>> that can merge 2 separate groups (with the same group name)
>>> >>>> into 1 after they join the same network.
>>> >>>>
>>> >>>> e.g.
>>> >>>> Machines A and B each join the group separately without being
>>> >>>> connected to the network. After each has created its own group
>>> >>>> with the same name, the network cable is plugged in and they
>>> >>>> are connected together. Will they merge into the same group?
>>> >>>>
>>> >>>> Thanks.
>>> >>>>
>>> >>>> Regards,
>>> >>>> Tim.Shiu
>>> >>>
>>> >>>
>>> >
>>>
>>
>>