Thanks Joe, Shreedhar,
I'll have plenty of reading to do this weekend and will post a few
updates in the next few days as I go to town on this documentation.
Regards,
Barry
On Thu, Apr 8, 2010 at 8:39 PM, Joseph Fialli <joe.fialli_at_oracle.com> wrote:
> Barry,
>
> See my comments inline below.
>
> -Joe
>
> On 4/8/10 9:55 AM, Barry van Someren wrote:
>>
>> Hi Joseph, Devs,
>>
>> Please feel free to correct me as needed; I'm still very much new to
>> Shoal, though I'm starting to understand how it works thanks to your
>> guidance (and excellent test code).
>> My comments inline:
>>
>>
>> On Thu, Apr 8, 2010 at 3:12 PM, Joseph Fialli<joe.fialli_at_oracle.com>
>> wrote:
>>
>>>
>>> Barry,
>>>
>>> Some pointers for your task of implementing virtual multicast for shoal
>>> over
>>> grizzly. (includes discovery)
>>>
>>> 1. When one broadcasts, only the members that have joined see the
>>> message. So in boundary cases where members are joining or leaving the
>>> multicast group, one cannot be certain whether an instance missed a UDP
>>> message because it had not joined the group yet or because of a UDP
>>> message drop. There are OS-level commands that report UDP message drops
>>> and can verify such things.
>>>
>>> With UDP, simply joining the multicast socket is all that is needed to
>>> receive a broadcast. With virtual broadcast in current shoal, the window
>>> for receiving a broadcast under boundary conditions is larger than with
>>> UDP. It takes time for a member of the cluster to receive a view from the
>>> Shoal master that includes a newly joined member. During that window,
>>> broadcast messages from an instance to the entire group will not reach a
>>> member that is up.
>>>
>>>
>>
>> Understood, my assumption is that a member has "joined" once it has
>> been added to the "router";
>>
>
> A member officially "joins" the group in the view of the master when its
> first message is received in MasterNode.receivedMessageEvent(). (See the
> call to clusterViewManager().add in MasterNode.processNextMessage() for
> the precise place.)
>>
>> Is the join event synchronized across the main and backup router?
>> (router being the node that handles the virtual multicast traffic)
>>
>>
>
> A key to understanding shoal is that at any point in time, there is one
> and only one group leader who sends out GMS event notifications. GMS event
> notifications carry with them the Master's view of the current group
> members. That is the only view that all members of the cluster use. So the
> Master synchronizes its view of the cluster with all other members. The
> window of time it takes for this to take place is where UDP broadcast and
> virtual broadcast over TCP differ in which members receive a broadcast.
>
> Of course, your next question is probably what happens if the Master
> fails. All other members monitor the master to verify that it is alive. If
> the master fails or stops, the surviving members have a deterministic
> algorithm to select the successor. All instances will compute the same
> master and agree on the new master. Also, all members of the cluster
> maintain a health monitor just as the master does, so that the cache is up
> to date when a member is promoted to master due to the previous master
> failing or being stopped.
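> To make that concrete, here is a minimal sketch of the kind of
> deterministic selection described above (illustrative only, not Shoal's
> actual code; the class and method names are made up):
>
>     import java.util.List;
>     import java.util.TreeSet;
>
>     final class MasterSelectionSketch {
>         // Every surviving member evaluates this over the same cached view
>         // of alive members, so all members independently compute the same
>         // successor and no extra election round is needed.
>         static String selectMaster(List<String> aliveMemberIds) {
>             return new TreeSet<String>(aliveMemberIds).first();
>         }
>     }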
>
>>> 2. In shoal's virtual synchronous multicast in sendMessage (an attempt
>>> to be more robust), we get the list of members that we know about and
>>> iterate over the members, sending a synchronous TCP message to each
>>> member. (See GroupCommunicationProviderImpl.sendMessage(final String
>>> targetMemberIdentityToken, final Serializable message, final boolean
>>> synchronous) when targetMemberIdentityToken is null.)
>>> This technique was implemented to have a less lossy broadcast, but the
>>> current implementation is insufficient for initial discovery of a group
>>> w/o UDP. (The implementation takes for granted that one already knows
>>> all group members because discovery is implemented with UDP broadcast.)
>>>
>>> Points on how this virtual broadcast differs from UDP.
>>> 1. The first writing of this method failed on the first send to a member
>>> whose state had changed from when we got the list of current core
>>> members. The code used to throw an exception AND not send to the rest of
>>> the members. To make the code more predictable, we changed that to
>>> attempt to send to ALL members regardless of difficulties sending to any
>>> particular member. Otherwise, depending on the first failure, the REST
>>> of the members did not get the message. Too unpredictable. (A sketch of
>>> this best-effort loop follows below.)
>>>
>>
>> Agreed, as you point out in your next section, this job can be
>> parallelized
>>
>>
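>> To pin down that "attempt ALL members" behaviour, here is a rough sketch
>> (the PeerSender interface is a made-up abstraction, not the actual
>> GroupCommunicationProviderImpl code):
>>
>>     import java.io.Serializable;
>>     import java.util.List;
>>     import java.util.logging.Logger;
>>
>>     final class BestEffortBroadcastSketch {
>>         private static final Logger LOG =
>>                 Logger.getLogger("BestEffortBroadcastSketch");
>>
>>         interface PeerSender { // hypothetical abstraction over the transport
>>             void sendSynchronous(String memberToken, Serializable msg)
>>                     throws Exception;
>>         }
>>
>>         // Try every member and collect failures instead of aborting on
>>         // the first one, so a single bad peer cannot block the rest.
>>         static boolean broadcast(PeerSender sender, List<String> members,
>>                                  Serializable msg) {
>>             boolean allSucceeded = true;
>>             for (String member : members) {
>>                 try {
>>                     sender.sendSynchronous(member, msg);
>>                 } catch (Exception e) {
>>                     allSucceeded = false;
>>                     LOG.warning("send to " + member + " failed: " + e);
>>                 }
>>             }
>>             return allSucceeded;
>>         }
>>     }
>>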
>>>
>>> 2. The virtual broadcast should be done with threads to speed the
>>> virtual TCP broadcast up some.
>>>
>>
>> Yes, this should be perfectly acceptable for smaller clusters. I'm
>> going to assume that bigger clusters and subrouting are out of scope
>> ;-)
>>
>>
>>>
>>> 3. Attempting to create a TCP connection to a machine that has had its
>>> network plug pulled or been turned off results in a hang. The default
>>> TCP timeout is OS specific and is typically 2 minutes. If you look at
>>> the shoal code for pinging a potentially failed instance, we are careful
>>> to only wait a configurable amount of time. (See
>>> HealthMonitor.checkConnectionToPeerMachine.call().)
>>> So virtual multicast using TCP should also be careful not to HANG in
>>> these situations. (This is more a feature that is important to get
>>> correct at some point, not necessarily on the first attempt.) But your
>>> design should account for the fact that a send over virtual multicast is
>>> not the same as a synchronous send to only one instance.
>>>
>>
>> Understood, the threaded design will also help prevent locking up the
>> entire stack.
>> It sounds to me like each "broadcaster-thread" should have a simple
>> message queue; we should probably add logic that declares a node dead
>> if its messages can't be sent before the timeout (this can be done
>> asynchronously without exceptions; the node would just be removed from
>> the list when this happens). A sketch of what I mean is below.
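>>
>> Roughly (made-up names and arbitrary timeouts, just to illustrate the
>> bounded-wait idea):
>>
>>     import java.net.InetSocketAddress;
>>     import java.net.Socket;
>>     import java.util.List;
>>     import java.util.concurrent.ExecutorService;
>>     import java.util.concurrent.Executors;
>>     import java.util.concurrent.TimeUnit;
>>
>>     final class ParallelPingSketch {
>>         // Contact each peer on its own thread and never wait longer than
>>         // connectTimeoutMillis, so a pulled network cable cannot stall
>>         // the whole broadcast for the OS default of ~2 minutes.
>>         static void pingAll(List<InetSocketAddress> peers,
>>                             final int connectTimeoutMillis)
>>                 throws InterruptedException {
>>             ExecutorService pool =
>>                     Executors.newFixedThreadPool(Math.max(1, peers.size()));
>>             for (final InetSocketAddress peer : peers) {
>>                 pool.execute(new Runnable() {
>>                     public void run() {
>>                         Socket s = new Socket();
>>                         try {
>>                             s.connect(peer, connectTimeoutMillis); // bounded wait
>>                             s.close();
>>                         } catch (Exception e) {
>>                             // timed out or unreachable; handle per policy
>>                         }
>>                     }
>>                 });
>>             }
>>             pool.shutdown();
>>             pool.awaitTermination(1, TimeUnit.MINUTES);
>>         }
>>     }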
>>
>
> GMS is already monitoring the health of all members using heartbeat
> failure detection. The heartbeat failure detection is fully parameterized
> (heartbeat frequency, max missed heartbeats, wait time before the last TCP
> ping to a suspected failed instance).
> When a member is shut down or is detected as failed, the system removes
> all references to the member that left.
> No other mechanism should be used to declare a member as dead. A signal
> send could time out because the receiver is busy due to a full GC, high
> system load, a router glitch, etc. (Obviously, the shorter the timeout,
> the higher the chance of it happening without indicating a fatal failure.)
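>
> For illustration, the kind of tuning knobs meant here (the key names below
> are placeholders from memory, not verified against the actual Shoal
> configuration constants, so treat them as assumptions):
>
>     import java.util.Properties;
>
>     final class HeartbeatTuningSketch {
>         // Placeholder keys to show the shape of the tuning only; the real
>         // Shoal property names should be taken from the GMS docs/source.
>         static Properties heartbeatProperties() {
>             Properties props = new Properties();
>             props.put("FAILURE_DETECTION_TIMEOUT", "2000");    // ms between heartbeats
>             props.put("FAILURE_DETECTION_RETRIES", "3");       // missed beats before SUSPECT
>             props.put("FAILURE_VERIFICATION_TIMEOUT", "1500"); // final TCP ping wait, ms
>             return props;
>         }
>     }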
>>
>>
>>>
>>> 4. Obviously, virtual multicast is not going to scale as well as UDP as
>>> the number of members in the group grows.
>>>
>>
>> No, hence retaining both strategies is preferable, with multicast
>> being preferred and virtual multicast as a bandaid option for those
>> not lucky enough to have the right hardware.
>>
>>
>>>
>>> 5. Jxta had a rendezvous service that assisted with group discovery w/o
>>> UDP. The INITIAL_BOOTSTRAP node is probably quite similar to the
>>> WellKnownAddresses that Shreedhar mentioned. We are completely missing
>>> that type of service in shoal over grizzly. I do not know that much
>>> about the jxta rendezvous service, just that it assisted in initial
>>> discovery.
>>>
>>
>> Alright, discovery was one of the things I've been focusing on.
>> My KIS approach would be to assign a system property within Shoal that
>> triggers the "virtual multicast" functionality.
>> For example, defining the following property for each node:
>> -Dshoal.bootstrapnode=IP_ADDRESS1,IP_ADDRESS2
>> Setting this property is the least invasive to existing applications
>> and requires no API changes.
>>
>
> Shreedhar's email referenced the existing INITIAL_VIRTUAL_URI_LIST
> parameter that provides a way to input these values for shoal over jxta.
> The same can be used for the grizzly transport.
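>
> For example, something along these lines when configuring a member (the
> parameter name is the one referenced above; the URI format and addresses
> shown are my assumption, so check the transport docs for the exact
> syntax):
>
>     import java.util.Properties;
>
>     final class VirtualUriListSketch {
>         // Seed the member with well-known addresses instead of relying on
>         // UDP multicast discovery.
>         static Properties discoveryProperties() {
>             Properties props = new Properties();
>             props.put("INITIAL_VIRTUAL_URI_LIST",
>                       "tcp://192.168.1.10:9090,tcp://192.168.1.11:9090"); // assumed format
>             return props;
>         }
>     }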
>
> -Joe
>>
>> This places a very small administrative burden on those who do not run
>> the default (multicast) setup, but with simple documentation it should
>> not be hard to set up. Adding system properties to Glassfish is trivial,
>> and using the DAS setup to propagate these settings would be easy. The
>> nodes would basically connect to the first system in the list (and try
>> another if the master does not reply within the timeout).
>>
>> This would leave the question of how to handle failure of the
>> "Master", but maybe it's a good idea to get started on working out a
>> single-master approach. I still have a lot to learn about the Shoal
>> lifecycles etc.
>>
>>
>>>
>>> *****
>>>
>>> Here is how initial group discovery and the group leader (Master of the
>>> group) work in the shoal gms module. Hope this description assists your
>>> initial understanding of how the group leader is chosen.
>>>
>>
>> It does, thanks
>>
>>
>>>
>>> Each shoal instance sends out a first message of MemberQueryRequest in
>>> "discoveryMode" when joining a shoal group. (See
>>> MasterNode.startMasterNodeDiscovery().)
>>> This message is a UDP broadcast to all members of the group. The Master
>>> is supposed to respond to this query with a MasterNodeResponse within a
>>> discovery timeout (the time defaults to 5 seconds). If no member
>>> responds within the discovery timeout, that instance makes itself the
>>> MASTER. Tracking GroupLeadershipNotifications is a good way to make sure
>>> you know that this is working correctly. All instances make themselves
>>> MASTER when they first join a group, but with the special state that
>>> they are in discovery mode. Once discovery mode times out, that is when
>>> a self-appointed master is promoted to the true master of the cluster.
>>> In all shoal dev tests, we just start an admin instance in the SPECTATOR
>>> shoal role and wait 5 seconds before starting the CORE members of the
>>> cluster. This ensures there is no battle over who is master and assists
>>> in automating our analysis of the log file results of test runs (by
>>> knowing which instance is MASTER).
>>>
>>> -Joe
>>>
>>>
>>
>> With my suggestion of using a property there would be no need for a
>> clash over who is master.
>> The non-multicast version could (and this is just a layman's
>> suggestion) do the following:
>>
>> Each shoal instance sends out a first message of MemberQueryRequest in
>> "discoveryMode" when joining a shoal group.
>> This message is a UDP/TCP unicast to the first listed IP address in the
>> "shoal.bootstrapnode" property. The Master is supposed to respond to this
>> query with a MasterNodeResponse within a discovery timeout (the time
>> defaults to 5 seconds). If the master does not respond, the node tries
>> the IP addresses further down the list (and when it reaches the bottom it
>> starts at the top again, retrying every 5 seconds). A rough sketch of
>> that loop follows below.
>>
>> I'm not sure if this change in functionality is acceptable, but it
>> would definitely simplify joining the cluster.
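>>
>> Roughly (made-up helper names and a placeholder port; a real version
>> would do the MemberQueryRequest/MasterNodeResponse exchange instead of a
>> bare connect):
>>
>>     import java.net.InetSocketAddress;
>>     import java.net.Socket;
>>
>>     final class BootstrapDiscoverySketch {
>>         // Walk the shoal.bootstrapnode list until one address answers,
>>         // wrapping around and retrying every discoveryTimeoutMillis.
>>         static InetSocketAddress findMaster(int discoveryTimeoutMillis)
>>                 throws InterruptedException {
>>             String[] hosts =
>>                     System.getProperty("shoal.bootstrapnode", "").split(",");
>>             while (true) {
>>                 for (String host : hosts) {
>>                     InetSocketAddress addr =
>>                             new InetSocketAddress(host.trim(), 9090); // placeholder port
>>                     Socket s = new Socket();
>>                     try {
>>                         s.connect(addr, discoveryTimeoutMillis);
>>                         s.close();
>>                         return addr; // found a responsive bootstrap node
>>                     } catch (Exception e) {
>>                         // no answer within the timeout; try the next address
>>                     }
>>                 }
>>                 Thread.sleep(discoveryTimeoutMillis); // bottom of list; retry
>>             }
>>         }
>>     }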
>>
--
Barry van Someren
---------------------------------------
LinkedIn: http://www.linkedin.com/in/barryvansomeren
Skype: BvsomerenSprout
Blog: http://blog.bvansomeren.com
KvK: 27317624
M: +31 (0)6-81464338
T: +31 (0)70-2500450