dev@shoal.java.net

thoughts on virtual multicast

From: Joseph Fialli <joe.fialli_at_ORACLE.COM>
Date: Thu, 08 Apr 2010 09:12:37 -0400

Barry,

Some pointers for your task of implementing virtual multicast (including
discovery) for shoal over grizzly.

1. When one broadcasts, only the members that have already joined see the
   message. So in boundary cases where members are joining or leaving the
   multicast group, one cannot be certain whether an instance missed a UDP
   message because it had not joined the group yet or because of a UDP
   message drop. There are OS-level commands that report UDP message drops
   and can be used to verify such things.

   With UDP, simply joining the multicast socket is all that is needed to
   receive broadcasts. With virtual broadcast in current shoal, the window
   for missing broadcasts at these boundary conditions is larger than with
   UDP: it takes time for a member of the cluster to receive a view from
   the Shoal master that includes a newly joined member, and during that
   window, broadcast messages from an instance to the entire group will
   not reach a member that is up.
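
   As a point of reference, here is a minimal sketch (not Shoal code) of
   the UDP behavior described above: joining the multicast group is all a
   socket needs to do to start receiving broadcasts. The group address and
   port are placeholders.

       import java.net.DatagramPacket;
       import java.net.InetAddress;
       import java.net.MulticastSocket;

       public class UdpJoinExample {
           public static void main(String[] args) throws Exception {
               InetAddress group = InetAddress.getByName("230.30.1.1"); // placeholder group address
               MulticastSocket socket = new MulticastSocket(9090);      // placeholder port
               socket.joinGroup(group);                 // joining is all that is needed
               byte[] buf = new byte[8192];
               DatagramPacket packet = new DatagramPacket(buf, buf.length);
               socket.receive(packet);                  // blocks until a broadcast arrives
               System.out.println("received " + packet.getLength() + " bytes");
               socket.leaveGroup(group);
               socket.close();
           }
       }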


2. In shoal's virtual synchronous multicast sendMessage (an attempt to be
   more robust), we get the list of members that we know about and iterate
   over them, sending a synchronous TCP message to each member.
   (See GroupCommunicationProviderImpl.sendMessage(final String targetMemberIdentityToken,
                            final Serializable message,
                            final boolean synchronous)
   when targetMemberIdentityToken is null.)
   This technique was implemented to be a less lossy broadcast, but the
   current implementation is insufficient for initial discovery of a group
   without UDP. (The implementation takes for granted that one already
   knows all group members, because discovery is implemented with UDP
   broadcast.)
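
   A rough sketch of that iterate-and-send approach follows. The
   MemberSender interface and method names here are placeholders for
   illustration, not the actual Shoal/Grizzly API; the real code lives in
   GroupCommunicationProviderImpl.

       import java.io.Serializable;
       import java.util.List;

       // Placeholder abstraction for a synchronous TCP send to one member;
       // not the actual Shoal/Grizzly API.
       interface MemberSender {
           void sendSynchronousTcp(String memberToken, Serializable message) throws Exception;
       }

       class VirtualBroadcastSketch {
           private final MemberSender sender;

           VirtualBroadcastSketch(MemberSender sender) {
               this.sender = sender;
           }

           // Send the message to every member we currently know about, one
           // synchronous TCP send per member. This only reaches members that
           // have already been discovered, which is why the technique is
           // insufficient for initial discovery without UDP.
           void broadcast(List<String> currentCoreMembers, Serializable message) {
               for (String memberToken : currentCoreMembers) {
                   try {
                       sender.sendSynchronousTcp(memberToken, message);
                   } catch (Exception e) {
                       // Keep going: a failure sending to one member must not
                       // prevent the rest of the group from seeing the message
                       // (see point 1 in the list below).
                   }
               }
           }
       }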

     Points on how this virtual broadcast differs from UDP.
   1. The first version of this method failed on the first send to a member
      whose state had changed since we got the list of current core members.
      The code used to throw an exception AND not send to the rest of the
      members. To make the code more predictable, we changed it to attempt
      to send to ALL members regardless of difficulties sending to any
      particular member. Otherwise, depending on where the first failure
      occurred, the REST of the members did not get the message. Too
      unpredictable.
   2. The virtual TCP broadcast should be done with threads to speed it up
      some (see the sketch after this list).
   3. Attempting to create a TCP connection to a machine that has had its
      network plug pulled, or that has been turned off, results in a hang;
      the default TCP timeout is OS specific and is typically 2 minutes.
      If you look at the shoal code for pinging a potentially failed
      instance, we are careful to only wait a configurable amount of time
      (see HealthMonitor.checkConnectionToPeerMachine.call()). So virtual
      multicast using TCP should also be careful not to HANG in these
      situations; the sketch after this list bounds each send with a
      timeout. (This is more a feature that is important to get correct at
      some point, not necessarily on the first attempt.) But your design
      should account for the fact that a send over virtual multicast is not
      the same as a synchronous send to only one instance.
   4. Obviously, virtual multicast is not going to scale as well as UDP as
      the number of members in the group grows.
   5. Jxta had a rendezvous service that assisted with group discovery
      without UDP. The INITIAL_BOOTSTRAP node is probably quite similar to
      the WellKnownAddresses that Shreedhar mentioned. We are completely
      missing that type of service in shoal over grizzly. I do not know
      much about the jxta rendezvous service, just that it assisted in
      initial discovery.
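
   Here is a sketch of how points 2 and 3 above might be combined: fan the
   sends out over a thread pool and bound them with an explicit timeout
   rather than waiting for the OS-default TCP timeout. The class name,
   pool size, and timeout value are assumptions for illustration (it
   reuses the placeholder MemberSender from the earlier sketch), not the
   actual Shoal/Grizzly implementation.

       import java.io.Serializable;
       import java.util.ArrayList;
       import java.util.List;
       import java.util.concurrent.Callable;
       import java.util.concurrent.ExecutorService;
       import java.util.concurrent.Executors;
       import java.util.concurrent.TimeUnit;

       class ThreadedBroadcastSketch {
           private final MemberSender sender;                 // placeholder from earlier sketch
           private final ExecutorService pool = Executors.newFixedThreadPool(8); // pool size is arbitrary
           private final long broadcastTimeoutMillis = 10000L;                   // configurable, not the OS default

           ThreadedBroadcastSketch(MemberSender sender) {
               this.sender = sender;
           }

           // Send to all members in parallel. The invokeAll timeout puts an
           // upper bound on how long this call blocks, so one dead or
           // unplugged machine cannot stall the whole broadcast.
           void broadcast(List<String> members, final Serializable message) throws InterruptedException {
               List<Callable<Void>> tasks = new ArrayList<Callable<Void>>();
               for (final String member : members) {
                   tasks.add(new Callable<Void>() {
                       public Void call() throws Exception {
                           sender.sendSynchronousTcp(member, message);
                           return null;
                       }
                   });
               }
               pool.invokeAll(tasks, broadcastTimeoutMillis, TimeUnit.MILLISECONDS);
           }
       }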

*****

Here is how initial group discovery and group leader (Master of group)
selection work in the shoal gms module. Hope this description helps you
get an initial understanding of how the group leader is chosen.

Each shoal instance sends out a first message of MemberQueryRequest in
"discoveryMode" when joining a shoal group (see
MasterNode.startMasterNodeDiscovery()). This message is a UDP broadcast
to all members of the group. The Master is supposed to respond to this
query with a MasterNodeResponse within a discovery timeout (defaulted to
5 seconds). If no member responds within the discovery timeout, that
instance makes itself the MASTER. Tracking GroupLeadershipNotifications
is a good way to make sure you know that this is working correctly. All
instances make themselves MASTER when they first join a group, but with
the special state that they are in discovery mode. Once discovery mode
times out, that is when a self-appointed master is promoted to the true
master of the cluster. In all shoal dev tests, we just start an admin
instance in the SPECTATOR shoal role and wait 5 seconds before starting
the CORE members of the cluster. This ensures there is not a battle over
who is master and assists in automating our analysis of the log files
from test runs (by knowing which instance is MASTER).
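
In code, the flow above boils down to roughly the following. This is a
condensed, paraphrased sketch with the messaging stubbed out, not the
actual MasterNode implementation; see MasterNode.startMasterNodeDiscovery()
for the real logic.

    // Condensed sketch of the discovery flow described above; the class,
    // method names, and stubs are paraphrased, not the actual Shoal code.
    class MasterDiscoverySketch {
        private static final long DISCOVERY_TIMEOUT_MILLIS = 5000; // default discovery timeout
        private boolean masterResponded = false;

        void discoverMaster() throws InterruptedException {
            broadcastMemberQueryRequest();              // UDP broadcast to the whole group
            synchronized (this) {
                if (!masterResponded) {
                    wait(DISCOVERY_TIMEOUT_MILLIS);     // wait for a MasterNodeResponse
                }
                if (!masterResponded) {
                    // No master answered within the discovery timeout: promote this
                    // self-appointed (discovery-mode) master to the true group master.
                    promoteSelfToMaster();
                }
            }
        }

        // Called by the message listener when a MasterNodeResponse arrives.
        synchronized void onMasterNodeResponse() {
            masterResponded = true;
            notifyAll();
        }

        private void broadcastMemberQueryRequest() { /* messaging stubbed out */ }
        private void promoteSelfToMaster()         { /* state change stubbed out */ }
    }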

-Joe