Re: thoughts on virtual multicast

From: Barry van Someren <barry_at_coffeesprout.com>
Date: Thu, 8 Apr 2010 15:55:58 +0200

Hi Joseph, Devs,

Please feel free to correct me as needed, I'm still very much new to
Shoal, though I'm starting to understand how it works due to your
guidance (and excellent test code)
My comments inline:

On Thu, Apr 8, 2010 at 3:12 PM, Joseph Fialli <joe.fialli_at_oracle.com> wrote:
> Barry,
>
> Some pointers for your task of implementing virtual multicast for shoal over
> grizzly. �(includes discovery)
>
> 1. when one broadcasts, �only the members that have joined see the message.
> � so boundary cases where members are joining or leaving the multicast
> group, �one can not
> �be certain if an instance misses a UDP message just because it has not
> joined group yet
> �or due UDP message drop. �There are OS level commands that capture UDP
> message drops
> �that can verify such things.
>
> �With UDP, �simply joining the multicast socket is all that is needed to
> receive broadcast.
> �With virtual broadcast in current shoal, �the window for receiving
> broadcast on boundary conditions
> � is larger than UDP. �It takes time for a member of cluster to receive a
> view from Shoal master including
> � a newly joined member. �During that window, �broadcast messages from an
> instance to entire group
> � will not reach a member that is up.
>
Understood, my assumption is that a member has "joined" once it has
been added to the "router";
Is the join event synchronized across the main and backup router?
(router being the node that handles the virtual multicast traffic)

>
> 2. �in shoal virtual synchronous multicast of sendMessage(attempt to be more
> robust), �we get the list of members that we know about and
> � �iterate over the members sending a synchronous TCP message to each
> member.
> � �(see GroupCommunicationProviderImpl.sendMessage(final String
> targetMemberIdentityToken,
> � � � � � � � � � � � � � �final Serializable message,
> � � � � � � � � � � � � � �final boolean synchronous) when
> targetMemberIdentifierToken is null.
> � This technique was implemented to have a less lossy broadcast, �but the
> current implementation is insufficient for initial discovery of a group w/o
> UDP.
> � (implementation takes for granted one already knows all group members due
> to discovery implemnted with UDP broadcast.)
>
> � �Points on how this virtual broadcast differs from UDP.
> � �1. first writing of this method failed on first send to a member whose
> state changed from
> � � � when we get the list of current core members. �Code used to throw an
> exception AND
> � � � not send to rest of members. �To make the code more predictable, �we
> changed that
> � � � to atttempt to send to ALL members regardless of difficulties sending
> to any particular member.
> � � � Otherwise, depending on first failure, �REST of members did not get
> message. �Too unpredictable.

Agreed, as you point out in your next section, this job can be parallelized

> � 2. The virtual broadcast should be done with threads so the virtual TCP
> broadcast to speed it up some.

Yes, this should be perfectly acceptable for smaller clusters. I'm
going to assume that bigger clusters and subrouting is out of scope
;-)

> � 3. When you attempt to create a TCP connection to a machine that has had
> its network plug pulled or machine turned off
> � � � results in a hang. �default TCP timeout is OS specific and is
> typically 2 minutes. �If you look at shoal code for pinging
> � � �a potentially failed instance, we are careful to only wait a
> configurable amount of time. �(see
> HealthMonitor.checkConnectionToPeerMachine.call())
> � � So virtual multicast using TCP should also be careful to not HANG for
> these situations. �(More a feature that is important to get correct at some
> point,
> � � not necessarily first attempt.) �But your design should account for fact
> that a send over virutal multicast is not same as a synchronous send to only
> one instance.

Understood, the threaded design will also help prevent locking up the
entire stack.
Sounds to me that each "broadcaster-thread" should have a simple
message queue; We should probably add logic that declares a node dead
if it the messages can't be sent before the timeout (this can be done
asynchronously without exceptions, the node would just be removed from
the list when this happens)

> � � 4. Obviously, �virtual multicast is not going to scale as well as UDP as
> number of members are added to group.

No, hence retaining both strategies is preferable, with multicast
being preferred and virtual multicast as a bandaid option for those
not lucky enough to have the right hardware.

> � �5. �Jxta had a rendezous service that assisted with group discovery w/o
> UDP. �The INITIAL_BOOTSTRAP node is probably quite similar to
> WellKnownAddresses that
> � � � �Shreedhar mentioned. �We are completely missing that type of service
> in shoal over grizzly. �I do not know that much about the jxta rendezvous
> service just that
> � � � it assisted in initial discovery.

Alright, discovery was one of the things I've been focusing on.
My KIS approach would be to assign a system property within Shoal that
triggers the "virtual multicast" functionality.
For example, defining the following property for each node:
-Dshoal.bootstrapnode=IP_ADDRESS1,IP_ADDRESS2
Setting this property is the least invasive to existing applications
and requires no API changes.

This places a very small administrative burden on those who do not run
the default (multicast) setup, but with simple documentation it should
not be hard to setup. Adding system properties to Glassfish is trivial
and using the DAS setup to propagate these settings would be easy. The
nodes would basically connect to the first system in the list (and try
another if the master does not reply within the timeout)

This would leave the question on how to handle failure of the
"Master", but maybe it's a good idea to get started on working out a
single master approach. I still have a lot to learn about the Shoal
lifecycles etc.

>
> *****
>
> Here is how initial group discovery and groupleader (Master of group) works
> in shoal gms module. �Hope this description
> assists you on initial understanding how groupleader is chosen.

It does, thanks

>
> Each shoal instances send out first message of MemberQueryRequest in
> "discoveryMode" when joining a shoal group. �(See
> MasterNode.startMasterNodeDiscovery()).
> This message is a UDP broadcast to all members of group. �The Master is
> suppose to respond to this query with a MasterNodeResponse
> within a discovery timeout time. �(time is defaulted to 5 seconds.) � If no
> member responds in discovery timeout, that instance makes itself
> the MASTER. �Tracking GroupLeadershipNotifications is go way to make sure
> you know that is working correctly. �All instances
> make themselves MASTER when they first join a group but with the special
> state that they are in discovery mode. �Once discovery
> mode timesout, �that is when a self-appointed master is promoted to the true
> master of cluster. �In all shoal dev tests, �we just start an
> admin instance in SPECTATOR shoal role and wait 5 seconds before starting
> CORE member of cluster. �This ensures there is not
> a battle over who is master and assists in automating our automatic analysis
> of log file results of test runs. (by knowing which instance
> is MASTER)
>
> -Joe
>
With my suggestion on using a property there would be no need for a
clash on who is master.
The non multicast version could (and this is just a layman's
suggestion) do the following:

Each shoal instances send out first message of MemberQueryRequest in
"discoveryMode" when joining a shoal group.
This message is a UDP/TCP unicast to the first listed IP address for
the "shoal.bootstrapnode" property . �The Master is suppose to respond
to this query with a MasterNodeResponse within a discovery timeout
time. �(time is defaulted to 5 seconds.) �if the master does not
respond, the node tries IP addresses further down the list. (and when
it reaches the bottom it will start at the top again, retrying every 5
seconds)

I'm not sure if this change in functionality is acceptable, but it
would definitely simplify joining the cluster.

>
>
>
>
>
>
>
>
>
>
>

-- 
Barry van Someren
---------------------------------------
LinkedIn: http://www.linkedin.com/in/barryvansomeren
Skype: BvsomerenSprout
Blog: http://blog.bvansomeren.com
KvK: 27317624
M: +31 (0)6-81464338
T: +31 (0)70-2500450