Barry,
See my comments inline below.
-Joe
On 4/8/10 9:55 AM, Barry van Someren wrote:
> Hi Joseph, Devs,
>
> Please feel free to correct me as needed, I'm still very much new to
> Shoal, though I'm starting to understand how it works due to your
> guidance (and excellent test code)
> My comments inline:
>
>
> On Thu, Apr 8, 2010 at 3:12 PM, Joseph Fialli<joe.fialli_at_oracle.com> wrote:
>
>> Barry,
>>
>> Some pointers for your task of implementing virtual multicast for shoal over
>> grizzly. (includes discovery)
>>
>> 1. When one broadcasts, only the members that have already joined see the message,
>> so for boundary cases where members are joining or leaving the multicast
>> group, one cannot be certain whether an instance missed a UDP message
>> because it had not joined the group yet or because of a UDP message drop.
>> There are OS-level commands that capture UDP message drops and can
>> verify such things.
>>
>> With UDP, simply joining the multicast socket is all that is needed to
>> receive a broadcast.
>> With virtual broadcast in current Shoal, the window for receiving a
>> broadcast under boundary conditions is larger than with UDP. It takes
>> time for a member of the cluster to receive a view from the Shoal master
>> that includes a newly joined member. During that window, broadcast
>> messages from an instance to the entire group will not reach a member
>> that is up.
>>
>>
> Understood, my assumption is that a member has "joined" once it has
> been added to the "router";
>
A member officially "joins" the group, in the view of the master, when its
first message is received in MasterNode.receivedMessageEvent(). (See the
call to clusterViewManager().add in MasterNode.processNextMessage() for the
precise place.)
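Roughly, the idea is the following (this is just an illustration, not the
actual MasterNode code):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only (not MasterNode's real code): the master treats the first
// message it sees from an unknown sender as that sender's join, adds it to the
// view (roughly what clusterViewManager().add does), and announces the change.
public class MasterJoinSketch {
    private final Set<String> clusterView = ConcurrentHashMap.newKeySet();

    void receivedMessageEvent(String senderToken, Object message) {
        if (clusterView.add(senderToken)) {
            announceViewChange(clusterView);   // every member learns of the join from the master
        }
        // ... then process the message itself
    }

    void announceViewChange(Set<String> view) {
        System.out.println("new view: " + view);
    }
}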
> Is the join event synchronized across the main and backup router?
> (router being the node that handles the virtual multicast traffic)
>
>
A key to understanding Shoal is that at any point in time there is one and
only one group leader (the Master) who sends out GMS event notifications.
GMS event notifications carry with them the Master's view of the current
group members. That is the only view that all members of the cluster use,
so the Master synchronizes its view of the cluster with all other members.
The window of time this takes is where UDP broadcast and virtual broadcast
over TCP differ in which members receive a broadcast.
Of course, your next question is probably what happens if the Master fails.
All other members monitor the Master to verify that it is alive. If the
Master fails or is stopped, the surviving members use a deterministic
algorithm to select the successor. All instances compute the same Master
and agree on the new Master. Also, all members of the cluster maintain a
health monitor cache just as the Master does, so that the cache is up to
date when a member is promoted to Master after the previous Master fails
or is stopped.
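To illustrate the deterministic selection (just a sketch of the idea, not
Shoal's actual algorithm; it assumes every survivor holds the same set of
live members):

import java.util.List;
import java.util.TreeSet;

// Illustrative sketch only: if every member holds the same set of live
// members, each can independently pick the same successor when the current
// master disappears, without any extra messaging.
public class MasterSelectionSketch {
    /**
     * @param liveMembers names of members still believed alive, in any order
     * @return the member every instance will independently agree on as master
     */
    public static String selectMaster(List<String> liveMembers) {
        // A deterministic rule (e.g. lexicographically smallest id) guarantees
        // all survivors converge on the same choice.
        return new TreeSet<>(liveMembers).first();
    }

    public static void main(String[] args) {
        System.out.println(selectMaster(List.of("instance3", "instance1", "instance2")));
        // prints "instance1" on every member that runs the same computation
    }
}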
>> 2. In Shoal's virtual synchronous multicast sendMessage (an attempt to be
>> more robust), we get the list of members that we know about and iterate
>> over the members, sending a synchronous TCP message to each member.
>> (See GroupCommunicationProviderImpl.sendMessage(final String targetMemberIdentityToken,
>>                                                 final Serializable message,
>>                                                 final boolean synchronous)
>> when targetMemberIdentityToken is null.)
>> This technique was implemented to provide a less lossy broadcast, but the
>> current implementation is insufficient for initial discovery of a group
>> without UDP. (The implementation takes for granted that one already knows
>> all group members because discovery is implemented with UDP broadcast.)
>>
>> Points on how this virtual broadcast differs from UDP:
>> 1. The first version of this method failed on the first send to a member
>> whose state had changed since we got the list of current core members.
>> The code used to throw an exception AND not send to the rest of the
>> members. To make the code more predictable, we changed that to attempt to
>> send to ALL members regardless of difficulties sending to any particular
>> member. Otherwise, depending on the first failure, the REST of the members
>> did not get the message. Too unpredictable.
>>
> Agreed, as you point out in your next section, this job can be parallelized
>
>
>> 2. The virtual broadcast should be done with threads to speed the virtual
>> TCP broadcast up some.
>>
> Yes, this should be perfectly acceptable for smaller clusters. I'm
> going to assume that bigger clusters and subrouting are out of scope
> ;-)
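Right. For illustration, a rough sketch of that parallel, failure-tolerant
fan-out (the MemberSender interface and all other names here are made up,
not the GroupCommunicationProviderImpl API):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of a threaded "virtual broadcast": send to every
// member in parallel and never let one failing member stop the rest.
public class VirtualBroadcastSketch {
    interface MemberSender {
        void send(String memberToken, byte[] payload) throws Exception;
    }

    public static void broadcast(List<String> members, byte[] payload, MemberSender sender)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, Math.min(members.size(), 8)));
        for (String member : members) {
            pool.submit(() -> {
                try {
                    sender.send(member, payload);           // synchronous TCP send to one member
                } catch (Exception e) {
                    // Log and continue: a failure to reach one member must not
                    // prevent the message from reaching the others.
                    System.err.println("send to " + member + " failed: " + e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);        // bound how long the broadcast may take
    }
}

The two key points are that each send runs on its own thread and that a
failure to reach one member never aborts the sends to the others.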
>
>
>> 3. Attempting to create a TCP connection to a machine that has had its
>> network plug pulled, or that has been turned off, results in a hang. The
>> default TCP timeout is OS-specific and is typically 2 minutes. If you look
>> at the Shoal code for pinging a potentially failed instance, we are careful
>> to only wait a configurable amount of time. (See
>> HealthMonitor.checkConnectionToPeerMachine.call().)
>> So virtual multicast using TCP should also be careful not to HANG in
>> these situations. (This is more a feature that is important to get correct
>> at some point, not necessarily in the first attempt.) But your design
>> should account for the fact that a send over virtual multicast is not the
>> same as a synchronous send to only one instance.
>>
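For reference, the bounded connect itself is just a matter of using the
connect overload that takes a timeout (a sketch only; host, port, and
timeout values below are made up):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch only: open the TCP connection with an explicit connect timeout so a
// powered-off peer or pulled network cable fails fast instead of hanging for
// the OS default (often ~2 minutes).
public class BoundedConnectSketch {
    public static Socket connectWithTimeout(String host, int port, int timeoutMillis)
            throws IOException {
        Socket socket = new Socket();                       // unconnected socket
        socket.connect(new InetSocketAddress(host, port), timeoutMillis);
        return socket;
    }

    public static void main(String[] args) {
        try (Socket s = connectWithTimeout("10.0.0.42", 9090, 3_000)) {
            System.out.println("connected: " + s.getRemoteSocketAddress());
        } catch (IOException e) {
            System.out.println("peer unreachable within 3 seconds: " + e);
        }
    }
}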
> Understood, the threaded design will also help prevent locking up the
> entire stack.
> It sounds to me like each "broadcaster-thread" should have a simple
> message queue; we should probably add logic that declares a node dead
> if the messages can't be sent before the timeout (this can be done
> asynchronously without exceptions, the node would just be removed from
> the list when this happens)
>
GMS is already monitoring the health of all members using heartbeat
failure detection.
The heartbeat failure detection is fully parameterized (heartbeat
frequency, max missed heartbeats, and wait time before the last TCP ping
to a suspected failed instance).
When a member is shut down or is detected as failed, the system removes
all references to the member that left.
No other mechanism should be used to declare a member dead. A signal
send could time out because the receiver is busy due to a full GC, high
system load, a router glitch, etc. (Obviously, the shorter the timeout,
the higher the chance of this happening without it indicating a fatal
failure.)
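To make the timing concrete, a rough sketch with made-up values (these are
not Shoal's actual configuration keys or defaults) of how those parameters
bound the time it takes to declare a failure:

// Sketch only: hypothetical parameter names and defaults, purely to show how
// the heartbeat settings listed above combine into a worst-case detection time.
public class FailureDetectionTimingSketch {
    static final long HEARTBEAT_FREQUENCY_MILLIS = 2_000;  // how often each member sends a heartbeat
    static final int  MAX_MISSED_HEARTBEATS      = 3;      // missed beats before the member is "suspected"
    static final long FINAL_PING_TIMEOUT_MILLIS  = 10_000; // bounded TCP ping to confirm the suspicion

    public static void main(String[] args) {
        long worstCase = HEARTBEAT_FREQUENCY_MILLIS * MAX_MISSED_HEARTBEATS + FINAL_PING_TIMEOUT_MILLIS;
        // With these illustrative values: 2000 * 3 + 10000 = 16000 ms before FAILURE is reported.
        System.out.println("worst-case time to declare failure: " + worstCase + " ms");
    }
}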
>
>
>> 4. Obviously, virtual multicast is not going to scale as well as UDP as
>> the number of members in the group grows.
>>
> No, hence retaining both strategies is preferable, with multicast
> being preferred and virtual multicast as a bandaid option for those
> not lucky enough to have the right hardware.
>
>
>> 5. Jxta had a rendezvous service that assisted with group discovery without
>> UDP. The INITIAL_BOOTSTRAP node is probably quite similar to the
>> WellKnownAddresses that Shreedhar mentioned. We are completely missing that
>> type of service in shoal over grizzly. I do not know that much about the
>> jxta rendezvous service, just that it assisted in initial discovery.
>>
> Alright, discovery was one of the things I've been focusing on.
> My KIS approach would be to assign a system property within Shoal that
> triggers the "virtual multicast" functionality.
> For example, defining the following property for each node:
> -Dshoal.bootstrapnode=IP_ADDRESS1,IP_ADDRESS2
> Setting this property is the least invasive to existing applications
> and requires no API changes.
>
Shreedhar's email referenced the existing INITIAL_VIRTUAL_URI_LIST parameter
that provides a way to input these values for shoal over jxta. The same can
be used for the grizzly transport.
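For illustration only, parsing such a comma-separated list out of a system
property could look like the sketch below (the property name is your
suggestion, not an existing Shoal key):

import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Sketch only: parse a comma-separated bootstrap list supplied as a system
// property, as in the proposed "-Dshoal.bootstrapnode=IP_ADDRESS1,IP_ADDRESS2".
public class BootstrapListSketch {
    public static List<URI> parseBootstrapList(int defaultPort) {
        String raw = System.getProperty("shoal.bootstrapnode", "");
        List<URI> uris = new ArrayList<>();
        for (String host : raw.split(",")) {
            if (!host.isBlank()) {
                uris.add(URI.create("tcp://" + host.trim() + ":" + defaultPort));
            }
        }
        return uris;                                        // empty list => fall back to UDP multicast
    }

    public static void main(String[] args) {
        System.out.println(parseBootstrapList(9090));
    }
}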
-Joe
> This places a very small administrative burden on those who do not run
> the default (multicast) setup, but with simple documentation it should
> not be hard to set up. Adding system properties to Glassfish is trivial,
> and using the DAS setup to propagate these settings would be easy. The
> nodes would basically connect to the first system in the list (and try
> another if the master does not reply within the timeout).
>
> This would leave the question of how to handle failure of the
> "Master", but maybe it's a good idea to get started on working out a
> single-master approach. I still have a lot to learn about the Shoal
> lifecycles etc.
>
>
>> *****
>>
>> Here is how initial group discovery and the group leader (Master of the
>> group) work in the shoal gms module. Hope this description assists your
>> initial understanding of how the group leader is chosen.
>>
> It does, thanks
>
>
>> Each Shoal instance sends out a first message of MemberQueryRequest in
>> "discoveryMode" when joining a shoal group. (See
>> MasterNode.startMasterNodeDiscovery().)
>> This message is a UDP broadcast to all members of the group. The Master is
>> supposed to respond to this query with a MasterNodeResponse within a
>> discovery timeout (defaulted to 5 seconds). If no member responds within
>> the discovery timeout, that instance makes itself the MASTER. Tracking
>> GroupLeadershipNotifications is a good way to make sure that this is
>> working correctly. All instances make themselves MASTER when they first
>> join a group, but with the special state that they are in discovery mode.
>> Once discovery mode times out, that is when a self-appointed master is
>> promoted to the true master of the cluster. In all shoal dev tests, we
>> just start an admin instance in the SPECTATOR shoal role and wait 5
>> seconds before starting the CORE members of the cluster. This ensures
>> there is not a battle over who is master and assists in automating our
>> analysis of the log file results of test runs (by knowing which instance
>> is MASTER).
>>
>> -Joe
>>
>>
> With my suggestion of using a property there would be no need for a
> clash over who is master.
> The non-multicast version could (and this is just a layman's
> suggestion) do the following:
>
> Each Shoal instance sends out a first message of MemberQueryRequest in
> "discoveryMode" when joining a shoal group.
> This message is a UDP/TCP unicast to the first listed IP address in
> the "shoal.bootstrapnode" property. The Master is supposed to respond
> to this query with a MasterNodeResponse within a discovery timeout
> (defaulted to 5 seconds). If the master does not respond, the node
> tries the IP addresses further down the list (and when it reaches the
> bottom it will start at the top again, retrying every 5 seconds).
>
> I'm not sure if this change in functionality is acceptable, but it
> would definitely simplify joining the cluster.
>
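For what it's worth, here is a rough sketch of that retry loop
(MemberQueryRequest and MasterNodeResponse are the real message names from
the description above; the DiscoveryTransport interface and everything else
is made up for illustration):

import java.util.List;
import java.util.Optional;

// Sketch of the bootstrap-list discovery loop described above: unicast the
// MemberQueryRequest to each configured address in turn, wait up to the
// discovery timeout for a MasterNodeResponse, and wrap around until someone
// answers. DiscoveryTransport and its methods are hypothetical.
public class BootstrapDiscoverySketch {
    interface DiscoveryTransport {
        /** Sends a MemberQueryRequest and waits up to timeoutMillis for a MasterNodeResponse. */
        Optional<String> queryForMaster(String bootstrapAddress, long timeoutMillis);
    }

    public static String discoverMaster(List<String> bootstrapAddresses, DiscoveryTransport transport) {
        final long discoveryTimeoutMillis = 5_000;          // 5 second default from the description above
        while (true) {
            for (String address : bootstrapAddresses) {
                Optional<String> master = transport.queryForMaster(address, discoveryTimeoutMillis);
                if (master.isPresent()) {
                    return master.get();                    // a Master answered; join its view
                }
            }
            // Reached the bottom of the list without an answer: start at the top again.
        }
    }
}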