Re: [Shoal-Dev] Re: CVS update: /shoal/gms/src/java/com/sun/enterprise/ee/cms/impl/jxta/GroupCommunicationProviderImpl.java

From: Joseph Fialli <Joseph.Fialli_at_Sun.COM>
Date: Wed, 20 Aug 2008 12:13:20 -0400

Bongjae Chang,

Thanks for your feedback.

My comments are inline below.
I hope that it help explains why the change was made more.

Bongjae Chang wrote:

Hi.
I have a question about this change.
GroupCommunicationProvider.sendMessage() can be used in DistributedStateCache.addToCache API

Assume that A, B and C are members in group.
When A adds data to DSC, first A's local cache is updated then A tries to send data to B and C.
B receives the data correctly but C doesn't receive the data because the connection between A and C has some problems.
At this case, DSC's data is not synchronized because C doesn't have A's data.
Before the following change, I thought if DistributedStateCache.addToCache() threw an exception, caller should do rollback about the data for data's synchronization among members.

The sending of messages is not transacted so I am uncertain how a rollback could be done.
Given that it takes 10 seconds to identify that an instance has failed, it is quite likely in scenarios that multiple instances
fail at same time that IOException occurs in PTP transmission between an instance and a newly failed instance.

In other words, I understood caller had synchronization role because API threw an exception. But API doesn't throw an exception like the following change, I think it is better that SHOAL should have rollback's logic.
cf) Before codes, only MemberNotInViewException was ignored. But though MemberNotInViewException was ignored, SHOAL could synchronize DSC's data with syncCache(). In other words, when the member that was not in view joins the group later, the master will be aware of the member in view, so the master will call DSC.synCache() method. But IOException is different from MemberNotInViewException because the member is already in view, so DSC.synCache() can't be called. Of course, if the member will be failed, there is no problem. But IOException can be occurred without member failure. ex) in the firewall

There is an outstanding shoal issue to handle IOException better in future when trying to send a message.
(See https://shoal.dev.java.net/issues/show_bug.cgi?id=72) However, it is best when one is implementing
udp broadcast using ptp messaging to all instances in the cluster, that the implementation logic should closely
simulate the functionality it is trying to implement via an alternative mechanism. That is why this change was made.

There is no transaction semantics associated with sending of the message to all instances in the cluster.
The use of point-to-point communication with all instances was to improve upon the reliability of group-wide (udp) based message.
But that reliability should not compromise attempting to send the message at least once to ALL instances in the cluster.
The functionality should closely approximate the udp broadcast. Allowing the first failed send to abort iterating over all instances in the
cluster was incorrect. If the exception did get thrown, how can the handler of the exception recover from the aborted broadcast to
all members of the cluster. There is no record of which instances received the message and which ones did not.

If it truely is interesting to know when not all instances of the cluster (based on the current local view of instances in the cluster which is out of
date in the scenario I made this change for since the scenario killed 4 instances at one time.) receive a message, one could collect the names
of the instances that did not receive the message while sending the message to all the other instances of the cluster.
After completing attempting to send message to all instances of the cluster, then one could throw a GMSException that it was unable to
send messages to some instances and those instances names are the following.

This fix was made for a scenario where 4 instances of an 8 instance cluster where killed at same time.
Thus, there was many cases where a sendMessage failed to one of the killed instances with an
IOException: unable to create messenger. All instances in the iteration after the failure never were sent the
message. This was an incorrect approximation of implementing udp broadcast with ptp to each instance in the cluster.
Thus, the following change was made.

public void sendMessage(final String targetMemberIdentityToken,
                            final Serializable message,
                            final boolean synchronous) throws GMSException, MemberNotInViewException {
        try {
            if (targetMemberIdentityToken == null) { // indicates send "message" to ALL instances in the cluster.
                if (synchronous) {
                    /*
                    Use point-to-point communication with all instances instead of the group-wide (udp) based message.
                    Since we don't have reliable multicast yet, this approach will ensure reliability.
                    Ideally, when a message is sent to the group via point-to-point,
                    the message to each member should be on a separate thread to get concurrency.
                     */
                    List<SystemAdvertisement> currentMemberAdvs = clusterManager.getClusterViewManager().
                            getLocalView().getView();

                    for (SystemAdvertisement currentMemberAdv : currentMemberAdvs) {
                        final ID id = currentMemberAdv.getID();
                        // deleted some code
                        try {
                            // I do agree with your analysis in another email that the return value of sendMessage should be checked and it false, that the send should be retried.
                            clusterManager.send(id, message);
                        } catch (MemberNotInViewException e) {
                            if (logger.isLoggable(Level.FINE)) {
                                logger.fine("MemberNotInViewException : " + e.toString());
                            }
                        } catch (IOException ioe) {
                            // don't allow an exception sending to one instance of the cluster to prevent ptp multicast to all other instances of
                            // of the cluster. Catch this exception, record it and continue sending to rest of instances in the cluster.
                            if (logger.isLoggable(Level.FINE)) {
                                logger.log(Level.FINE,
                                        "IOException in reliable synchronous ptp multicast sending to instance " + currentMemberAdv.getName() +
                                        ". Perhaps this instance has failed but that has not been detected yet. Peer id=" +
                                        id.toString(),
                                        ioe);
                            }
                        } catch (Throwable t) {
                           // don't allow an exception sending to one instance of the cluster prevent ptp broadcast to all other instances of
                           // of the cluster. Catch this exception, record it and continue sending to rest of instances in the cluster.
                           if (logger.isLoggable(Level.FINE)) {
                                logger.log(Level.FINE,
                                        "Exception in reliable synchronous ptp multicast sending to instance " + currentMemberAdv.getName() +
                                        ", peer id=" + id.toString(),
                                        t);
                            }
                        }
                    }
                } else {
                    // send "message" to all members of the group that are listening.
                    // NOTE: Failed instances DO NOT receive the message.   There is no record of which instances receive message and which do not.
                    clusterManager.send(null, message);//sends to whole group
                }

To address your concern of notifying caller that send to all instances in cluster failed,
a future revision of this method could throw a GMSException if unable to send a message to one of the instances of the cluster could be thrown.
I am still uncertain on how a caller would handle the proposed exception other than to print it out. Amounting to same as logging individually when the sends failed.
Thus, there would need to be a use case identified to justify making the following type of change. (New code highlighted in italics)

public void sendMessage(final String targetMemberIdentityToken,
                            final Serializable message,
                            final boolean synchronous) throws GMSException, MemberNotInViewException {
        try {
            if (targetMemberIdentityToken == null) { // indicates send "message" to ALL instances in the cluster.
                if (synchronous) {
                    /*
                    Use point-to-point communication with all instances instead of the group-wide (udp) based message.
                    Since we don't have reliable multicast yet, this approach will ensure reliability.
                    Ideally, when a message is sent to the group via point-to-point,
                    the message to each member should be on a separate thread to get concurrency.
                     */
                    List<SystemAdvertisement> currentMemberAdvs = clusterManager.getClusterViewManager().
                            getLocalView().getView();
                    boolean oneOrMoreSendsFailed = false;
                   Throwable firstException = null;

                    for (SystemAdvertisement currentMemberAdv : currentMemberAdvs) {
                        final ID id = currentMemberAdv.getID();
                        // deleted some code
                        try {
                            clusterManager.send(id, message);
                        } catch (MemberNotInViewException e) {
                            if (logger.isLoggable(Level.FINE)) {
                                logger.fine("MemberNotInViewException : " + e.toString());
                            }
                        } catch (IOException ioe) {

                            oneOrMoreSendsFailed = true;
                            if (firstException != null) {
                                firstException = ioe;
                            }

                            // don't allow an exception sending to one instance of the cluster to prevent ptp multicast to all other instances of
                            // of the cluster. Catch this exception, record it and continue sending to rest of instances in the cluster.
                            if (logger.isLoggable(Level.FINE)) {
                                logger.log(Level.FINE,
                                        "IOException in reliable synchronous ptp multicast sending to instance " + currentMemberAdv.getName() +
                                        ". Perhaps this instance has failed but that has not been detected yet. Peer id=" +
                                        id.toString(),
                                        ioe);
                            }
                        } catch (Throwable t) {
                            oneOrMoreSendsFailed = true;
                            if (firstException != null) {
                                firstException = ioe;
                            }

                           // don't allow an exception sending to one instance of the cluster prevent ptp broadcast to all other instances of
                           // of the cluster. Catch this exception, record it and continue sending to rest of instances in the cluster.
                           if (logger.isLoggable(Level.FINE)) {
                                logger.log(Level.FINE,
                                        "Exception in reliable synchronous ptp multicast sending to instance " + currentMemberAdv.getName() +
                                        ", peer id=" + id.toString(),
                                        t);
                            }
                        }
                    }
                    if (oneOrMoreSendsFailed) {
                            GMSException gmsException = new GMSException("one or more sends failed.");
                            gmsException.noteFailure(firstException);
                            throw gmsException;
                   }
                } else {
                    // send "message" to all members of the group that are listening.
                    // NOTE: Failed instances DO NOT receive the message.   There is no record of which instances receive message and which do not.
                    clusterManager.send(null, message);//sends to whole group
                }

-Joe Fialli, Sun Microsystems

Thanks.
--
Bongjae Chang


----- Original Message ----- 
From: <jfialli@dev.java.net>
To: <cvs@shoal.dev.java.net>
Sent: Wednesday, August 20, 2008 6:37 AM
Subject: CVS update: /shoal/gms/src/java/com/sun/enterprise/ee/cms/impl/jxta/GroupCommunicationProviderImpl.java

User: jfialli 
Date: 2008-08-19 21:37:07+0000
Modified:
  shoal/gms/src/java/com/sun/enterprise/ee/cms/impl/jxta/GroupCommunicationProviderImpl.java

Log:
Fix for sendMessage when it is using reliable multicast mode.  This mode is selected when sendMessage parameter
 target is null (indicating to broadcast the msg parameter to instances in the cluster) and synchronous parameter is
 true.
 Benefit of this fix is that when one a send to one of the instances in the cluster fails, the code will now log that
 failure but it will proceed to send messages to rest of instances in the cluster. Prior to this fix, failing to send to
 one instance in the cluster meant that ptp multicast would abruptly stop due to the exception, and potentially
 one or more instances in the cluster would not receive the synchronous ptp multicast message.


Reviewed by:   Sheetal Vartak

File Changes:

Directory: /shoal/gms/src/java/com/sun/enterprise/ee/cms/impl/jxta/
===================================================================

File [changed]: GroupCommunicationProviderImpl.java
Url: https://shoal.dev.java.net/source/browse/shoal/gms/src/java/com/sun/enterprise/ee/cms/impl/jxta/GroupCommunicationProviderImpl.java?r1=1.26&r2=1.27
Delta lines:  +24 -3
--------------------
--- GroupCommunicationProviderImpl.java 2008-08-13 20:36:25+0000 1.26
+++ GroupCommunicationProviderImpl.java 2008-08-19 21:37:05+0000 1.27
@@ -66,7 +66,7 @@
 *
 * @author Shreedhar Ganapathy
 *         Date: Jun 26, 2006
- * @version $Revision: 1.26 $
+ * @version $Revision: 1.27 $
 */
public class GroupCommunicationProviderImpl implements
        GroupCommunicationProvider,
@@ -212,11 +212,32 @@
                        task.setMessage(message);
                        msgSendPool.submit(task);
                        */
-                        logger.log(Level.FINE, "sending message to member: " + currentMemberAdv.getName());
+                        logger.log(Level.FINER, "sending message to member: " + currentMemberAdv.getName());
                        try {
                            clusterManager.send(id, message);
                        } catch (MemberNotInViewException e) {
-                            logger.warning("MemberNotInViewException : " + e.toString());
+                            if (logger.isLoggable(Level.FINE)) {
+                                logger.fine("MemberNotInViewException : " + e.toString());
+                            }
+                        } catch (IOException ioe) {
+                            // don't allow an exception sending to one instance of the cluster to prevent ptp multicast to all other instances of
+                            // of the cluster.  Catch this exception, record it and continue sending to rest of instances in the cluster.
+                            if (logger.isLoggable(Level.FINE)) {
+                                logger.log(Level.FINE, 
+                                        "IOException in reliable synchronous ptp multicast sending to instance " + currentMemberAdv.getName() + 
+                                        ". Perhaps this instance has failed but that has not been detected yet. Peer id=" +
+                                        id.toString(), 
+                                        ioe);
+                            }
+                        } catch (Throwable t) {
+                           // don't allow an exception sending to one instance of the cluster prevent ptp broadcast to all other instances of
+                           // of the cluster.  Catch this exception, record it and continue sending to rest of instances in the cluster.
+                           if (logger.isLoggable(Level.FINE)) {
+                                logger.log(Level.FINE, 
+                                        "Exception in reliable synchronous ptp multicast sending to instance " + currentMemberAdv.getName() + 
+                                        ", peer id=" + id.toString(), 
+                                        t);
+                            }
                        }
                    }
                } else {




---------------------------------------------------------------------
To unsubscribe, e-mail: cvs-unsubscribe@shoal.dev.java.net
For additional commands, e-mail: cvs-help@shoal.dev.java.net