users@shoal.java.net

Re: [Shoal-Users] Still not sure it's working

From: Shreedhar Ganapathy <Shreedhar.Ganapathy_at_Sun.COM>
Date: Wed, 02 Jul 2008 09:07:03 -0700

Mike,
Thanks for the information. I have at times seen better results when the
antivirus and/or firewall software is turned off, as it seems to hold
state with respect to ports. We cannot, of course, rule out bugs in our
code as the cause of the behavior you have reported. At this point it is
an exercise in isolating the issue.
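
One way to check whether something on the machine is holding a port,
with all instances stopped, is to try binding it. The sketch below is a
hypothetical stand-alone helper (PortCheck and canBind are made-up
names, not part of Shoal or JXTA); the default address and port are
simply the ones that appear in the tcp://10.6.2.89:9701 log line quoted
further down. It only reports whether a listener can be bound, not
which program is in the way.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical diagnostic helper; not part of Shoal or JXTA.
class PortCheck {

    /** Returns true if a TCP listener can be bound to host:port right now. */
    static boolean canBind(String host, int port) {
        ServerSocket socket = null;
        try {
            socket = new ServerSocket();          // unbound socket
            socket.setReuseAddress(false);        // do not mask an existing holder
            socket.bind(new InetSocketAddress(InetAddress.getByName(host), port));
            return true;
        } catch (IOException e) {
            // Covers "address already in use" as well as an unknown host.
            return false;
        } finally {
            if (socket != null) {
                try {
                    socket.close();
                } catch (IOException ignored) {
                    // nothing useful to do while closing a probe socket
                }
            }
        }
    }

    public static void main(String[] args) {
        // Defaults are taken from the log output in this thread.
        String host = args.length > 0 ? args[0] : "10.6.2.89";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9701;
        System.out.println(host + ":" + port
                + (canBind(host, port) ? " can be bound" : " cannot be bound"));
    }
}
```

Run it as "java PortCheck 10.6.2.89 9701" while no instance is running.
A failed bind would suggest that another process still holds the port;
a successful bind does not rule out filtering by antivirus or firewall
software.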

It might be worthwhile running this outside the IDE to rule out the IDE
as a factor.

What I meant by asking about concurrently starting each process was
whether these three processes were started by a script or other
automated tool that would start them at the same time as separate
processes. That is not the case, as you pointed out. The behavior of
Shoal should be consistent whether the processes are started
concurrently or one after the other.
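
To take the IDE out of the picture and also try the concurrent-start
case, the three instances could be started from a small stand-alone
launcher. This is only a sketch under assumptions: LaunchCluster and
command are hypothetical names, the main class and instance names are
taken from the command line quoted below, the launcher is assumed to
run with the same classpath as the sample, and redirectOutput needs
Java 7 or later for the launcher JVM.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher; not part of Shoal.
class LaunchCluster {

    /** Builds the command line for one instance; the instance name is the only program argument. */
    static List<String> command(String javaExe, String classpath,
                                String mainClass, String instanceName) {
        return new ArrayList<String>(
                Arrays.asList(javaExe, "-classpath", classpath, mainClass, instanceName));
    }

    public static void main(String[] args) throws Exception {
        // Optional first argument: path to the java executable to use for the
        // instances. Defaults to the JVM that runs this launcher.
        String javaExe = args.length > 0
                ? args[0]
                : System.getProperty("java.home") + File.separator + "bin" + File.separator + "java";
        // Assumption: the launcher is started with the sample's classpath.
        String classpath = System.getProperty("java.class.path");
        String mainClass = "com.opentext.shoal.SendMessageSample";

        List<Process> started = new ArrayList<Process>();
        for (String name : new String[] {"SERVER-1", "SERVER-2", "SERVER-3"}) {
            ProcessBuilder pb = new ProcessBuilder(command(javaExe, classpath, mainClass, name));
            pb.redirectErrorStream(true);               // merge stderr into stdout
            pb.redirectOutput(new File(name + ".log")); // one log file per instance
            started.add(pb.start());                    // start without waiting for the others
        }
        for (Process p : started) {
            p.waitFor();                                // block until every instance exits
        }
    }
}
```

Each instance writes to its own log file (SERVER-1.log and so on), so
the suspect and failure events from the three members can be compared
side by side.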

Joe Fialli, a member of our team, yesterday isolated a possible bug in
the health monitor. It is being tested, but it may not be related to
this unexpected failure suspicion. We need some time to evaluate this,
and I would appreciate your patience.

Thanks
Shreedhar

Mike Wannamaker wrote:
>
> Sorry for the late reply; yesterday was Canada Day and thus a holiday.
>
>
>
> Yes all instances are running on a single machine.
>
> Machine is Windows XP 64 bit.
>
> JVM is Sun JVM 1.5.0_15
>
> I am running this through IntelliJ IDEA.
>
> Windows Firewall is turned off.
>
> I do have Symantec Antivirus. Do you mean disable the Auto-Protect?
> Or do you mean shutdown the whole antivirus?
>
>
>
> 10.6.2.89 is my network card.
>
> 192.168.111.1 is VMware VMNet8
>
> 192.168.138.1 is VMware VMNet1
>
>
>
> I am NOT running these instances inside a VMware instance; they are
> running on my main machine.
>
> I'm not sure what you mean by "are all started concurrently". I run
> these as applications from within IntelliJ, which does this:
>
>
>
> D:\JDKS\jdk1.5.0_15\bin\java -Didea.launcher.port=7546
> "-Didea.launcher.bin.path=C:\Program Files (x86)\JetBrains\IntelliJ
> IDEA 7.0.3\bin" -Dfile.encoding=windows-1252 -classpath
> "D:\JDKS\jdk1.5.0_15\jre\lib\charsets.jar;D:\JDKS\jdk1.5.0_15\jre\lib\deploy.jar;D:\JDKS\jdk1.5.0_15\jre\lib\javaws.jar;D:\JDKS\jdk1.5.0_15\jre\lib\jce.jar;D:\JDKS\jdk1.5.0_15\jre\lib\jsse.jar;D:\JDKS\jdk1.5.0_15\jre\lib\plugin.jar;D:\JDKS\jdk1.5.0_15\jre\lib\rt.jar;D:\JDKS\jdk1.5.0_15\jre\lib\ext\dnsns.jar;D:\JDKS\jdk1.5.0_15\jre\lib\ext\localedata.jar;D:\JDKS\jdk1.5.0_15\jre\lib\ext\sunjce_provider.jar;D:\JDKS\jdk1.5.0_15\jre\lib\ext\sunpkcs11.jar;D:\Development\shoaltest\out\production\SMessage;D:\Development\shoaltest\libs\appia\appia-3.2.4.jar;D:\Development\shoaltest\libs\jgroups\jgroups-all.jar;D:\Development\shoaltest\libs\log4j\log4j.jar;D:\Development\shoaltest\libs\shoal\shoal-gms.jar;D:\Development\shoaltest\libs\shoal\jxta.jar;C:\Program
> Files (x86)\JetBrains\IntelliJ IDEA 7.0.3\lib\idea_rt.jar"
> com.intellij.rt.execution.application.AppMain
> com.opentext.shoal.SendMessageSample SERVER-3
>
>
>
> I DO see the failure suspect for SERVER-3 in the log snippet:
>
>
>
> 30-Jun-2008 2:16:57 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
> newViewObserved
>
> INFO: Analyzing new membership snapshot received as part of event :
> IN_DOUBT_EVENT
>
> 30-Jun-2008 2:16:57 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
> addInDoubtMemberSignals
>
> INFO: gms.failureSuspectedEventReceived
>
> 30-Jun-2008 2:16:57 PM com.sun.enterprise.ee.cms.impl.common.Router
> notifyFailureSuspectedAction
>
> INFO: Sending FailureSuspectedSignals to registered Actions.
> Member:SERVER-3...
>
> 30-Jun-2008 02:16:57 PM DEBUG [pool-1-thread-4]
> com.opentext.ecm.services.smessage.impl.shoal.SignalLogger - -
> SERVER-3 >> FailureSuspectedSignalImpl @ 30/06/08 2:00 PM -
> [RCS_CLUSTER-false]:
> (Hashtable:[(String:server.name)<-->(String:SERVER-3),
> (String:local.host)<-->(Inet4Address:mwana0061/10.6.2.89)])
>
> MEMBERS: (ArrayList:[mwana0061/10.6.2.89, mwana0061/10.6.2.89,
> mwana0061/10.6.2.89])
>
>
>
> What is weird is the one for SERVER-1, which was not shut down and is
> still running.
>
>
>
>
>
> ------------------------------------------------------------------------
>
> *From:* Shreedhar.Ganapathy_at_Sun.COM [mailto:Shreedhar.Ganapathy_at_Sun.COM]
> *Sent:* June 30, 2008 3:00 PM
> *To:* users_at_shoal.dev.java.net
> *Subject:* Re: [Shoal-Users] Still not sure it's working
>
>
>
> Hi Mike
> Yes, this is indeed a new problem. I hope these are not separate
> snippets but one continuous log snippet. What seems strange in this
> pasted output is that there is no failure suspected signal (in doubt
> event) for SERVER-3. Is this what you see? There is the suspect event
> for SERVER-1.
>
> Some questions: Are all instances on the same machine? The interface
> addresses don't all seem to be in the same subnet, and/or they appear
> to be on different networks in a multihomed machine environment (I see
> 10.6.2.89 and 192.168.111.1 and 192.168.138.1).
> Are all instances started concurrently?
>
> Do you have any antivirus or firewalls running on your machine(s)? If
> yes, can you disable them and see if communications and events happen
> correctly?
>
> Thanks
> Shreedhar
>
>
>
> Mike Wannamaker wrote:
>
> Okay, I tested shutting down a non-groupleader. I do see suspect
> and failure notifications.
>
>
>
> However, you might not like this; I also see something that is very
> strange and disturbing.
>
>
>
> I start SERVER-1 (GROUPLEADER), SERVER-2, and SERVER-3.
>
>
>
> Shut down SERVER-3, get correct messages in SERVER-1 and mostly in
> SERVER-2, but I also get a FailureSuspect for SERVER-1 in the SERVER-2
> window.
>
> This might be okay if I got a notification that the node was back, but
> I don't, and it is still running. Started SERVER-3 and see SERVER-1 in
> the list and it gets notifications as well.
>
>
>
> I tried again, shutting down the newly running SERVER-3, and I get the
> same results, so it seems fully reproducible.
>
>
>
>
>
>
>
> Here is the output for SERVER-2
>
>
>
> 30-Jun-2008 2:16:57 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
> getMemberTokens
>
> INFO: GMS View Change Received for group RCS_CLUSTER : Members in view
> for (before change analysis) are :
>
> 1: MemberId: SERVER-2, MemberType: CORE, Address:
> urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FD0D4B867250FF460C9B539A161779845B03
>
> 2: MemberId: SERVER-3, MemberType: CORE, Address:
> urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FD54C54AB0D7A640E493A5C6CE427A3CE203
>
> 3: MemberId: SERVER-1, MemberType: CORE, Address:
> urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FDB946A28335F0413BBF73B77CCC8BFEC603
>
>
>
> 30-Jun-2008 2:16:57 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
> newViewObserved
>
> INFO: Analyzing new membership snapshot received as part of event :
> IN_DOUBT_EVENT
>
> 30-Jun-2008 2:16:57 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
> addInDoubtMemberSignals
>
> INFO: gms.failureSuspectedEventReceived
>
> 30-Jun-2008 2:16:57 PM com.sun.enterprise.ee.cms.impl.common.Router
> notifyFailureSuspectedAction
>
> INFO: Sending FailureSuspectedSignals to registered Actions.
> Member:SERVER-3...
>
> 30-Jun-2008 02:16:57 PM DEBUG [pool-1-thread-4]
> com.opentext.ecm.services.smessage.impl.shoal.SignalLogger - -
> SERVER-3 >> FailureSuspectedSignalImpl @ 30/06/08 2:00 PM -
> [RCS_CLUSTER-false]:
> (Hashtable:[(String:server.name)<-->(String:SERVER-3),
> (String:local.host)<-->(Inet4Address:mwana0061/10.6.2.89)])
>
> MEMBERS: (ArrayList:[mwana0061/10.6.2.89, mwana0061/10.6.2.89,
> mwana0061/10.6.2.89])
>
> 30-Jun-2008 2:16:57 PM com.sun.enterprise.jxtamgmt.HealthMonitor
> isConnected
>
> INFO: Checking for machine status for network interface :
> tcp://10.6.2.89:9701
>
> 30-Jun-2008 2:16:57 PM com.sun.enterprise.jxtamgmt.HealthMonitor
> isConnected
>
> INFO: Checking for machine status for network interface :
> tcp://192.168.111.1:9701
>
> 30-Jun-2008 2:16:57 PM com.sun.enterprise.jxtamgmt.HealthMonitor
> isConnected
>
> INFO: Checking for machine status for network interface :
> tcp://192.168.138.1:9701
>
> 30-Jun-2008 2:17:27 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
> getMemberTokens
>
> INFO: GMS View Change Received for group RCS_CLUSTER : Members in view
> for (before change analysis) are :
>
> 1: MemberId: SERVER-2, MemberType: CORE, Address:
> urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FD0D4B867250FF460C9B539A161779845B03
>
> 2: MemberId: SERVER-3, MemberType: CORE, Address:
> urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FD54C54AB0D7A640E493A5C6CE427A3CE203
>
> 3: MemberId: SERVER-1, MemberType: CORE, Address:
> urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FDB946A28335F0413BBF73B77CCC8BFEC603
>
>
>
> 30-Jun-2008 2:17:27 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
> newViewObserved
>
> INFO: Analyzing new membership snapshot received as part of event :
> IN_DOUBT_EVENT
>
> 30-Jun-2008 2:17:27 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
> addInDoubtMemberSignals
>
> INFO: gms.failureSuspectedEventReceived
>
> 30-Jun-2008 2:17:27 PM com.sun.enterprise.ee.cms.impl.common.Router
> notifyFailureSuspectedAction
>
> INFO: Sending FailureSuspectedSignals to registered Actions.
> Member:SERVER-1...
>
> 30-Jun-2008 02:17:27 PM DEBUG [pool-1-thread-4]
> com.opentext.ecm.services.smessage.impl.shoal.SignalLogger - -
> SERVER-1 >> FailureSuspectedSignalImpl @ 30/06/08 1:59 PM -
> [RCS_CLUSTER-false]:
> (Hashtable:[(String:server.name)<-->(String:SERVER-1),
> (String:local.host)<-->(Inet4Address:mwana0061/10.6.2.89)])
>
> MEMBERS: (ArrayList:[mwana0061/10.6.2.89, mwana0061/10.6.2.89,
> mwana0061/10.6.2.89])
>
> 30-Jun-2008 2:17:30 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
> getMemberTokens
>
> INFO: GMS View Change Received for group RCS_CLUSTER : Members in view
> for (before change analysis) are :
>
> 1: MemberId: SERVER-2, MemberType: CORE, Address:
> urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FD0D4B867250FF460C9B539A161779845B03
>
> 2: MemberId: SERVER-1, MemberType: CORE, Address:
> urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FDB946A28335F0413BBF73B77CCC8BFEC603
>
>
>
> 30-Jun-2008 2:17:30 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
> newViewObserved
>
> INFO: Analyzing new membership snapshot received as part of event :
> FAILURE_EVENT
>
> 30-Jun-2008 2:17:30 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
> addFailureSignals
>
> INFO: The following member has failed: SERVER-3
>
> 30-Jun-2008 2:17:30 PM com.sun.enterprise.ee.cms.impl.common.Router
> notifyFailureNotificationAction
>
> INFO: Sending FailureNotificationSignals to registered Actions.
> Member: SERVER-3...
>
> 30-Jun-2008 02:17:30 PM DEBUG [pool-1-thread-4]
> com.opentext.ecm.services.smessage.impl.shoal.SignalLogger - -
> SERVER-3 >> FailureNotificationSignalImpl @ 30/06/08 2:00 PM -
> [RCS_CLUSTER-false]:
> (Hashtable:[(String:server.name)<-->(String:SERVER-3),
> (String:local.host)<-->(Inet4Address:mwana0061/10.6.2.89)])SERVER-3
>
> MEMBERS: (ArrayList:[mwana0061/10.6.2.89, mwana0061/10.6.2.89])
>
>
>
> ------------------------------------------------------------------------
>
> *From:* Shreedhar.Ganapathy_at_Sun.COM [mailto:Shreedhar.Ganapathy_at_Sun.COM]
> *Sent:* June 30, 2008 2:07 PM
> *To:* users_at_shoal.dev.java.net
> *Subject:* Re: [Shoal-Users] Still not sure it's working
>
>
>
> That's correct. Yes, I should not mix up the provider terminology with
> the GMS terminology.
> Thanks
> Shreedhar
>
> Mike Wannamaker wrote:
>
> When you say a non-master, do you mean when a server that is not the
> groupleader is shut down?
>
>
>
> ------------------------------------------------------------------------
>
> *From:* Shreedhar.Ganapathy_at_Sun.COM [mailto:Shreedhar.Ganapathy_at_Sun.COM]
> *Sent:* June 30, 2008 1:47 PM
> *To:* users_at_shoal.dev.java.net
> *Subject:* Re: [Shoal-Users] Still not sure it's working
>
>
>
> Hi Mike
> This is a recent known issue that occurs when the master fails. I
> don't see a Shoal issue on this yet, but our QE has filed an internal
> issue on this behavior. I will post an issue in the Shoal tracker
> later today with your details.
>
> Can you confirm if behavior is okay when a non-master member fails?
>
> Thanks
> Shreedhar
>
>
>
>
>