users@shoal.java.net

RE: [Shoal-Users] Still not sure it's working

From: Mike Wannamaker <mwannama_at_opentext.com>
Date: Wed, 2 Jul 2008 14:28:12 -0400

Yes, no problem. I'm currently just in the investigation phase.

 

________________________________

From: Shreedhar.Ganapathy_at_Sun.COM [mailto:Shreedhar.Ganapathy_at_Sun.COM]
Sent: July 2, 2008 12:07 PM
To: users_at_shoal.dev.java.net
Subject: Re: [Shoal-Users] Still not sure it's working

 

Mike,
Thanks for the information. I have at times seen better results when the
antivirus and/or firewall software is turned off, as they seem to hold
state with respect to ports. We cannot, of course, rule out bugs in our
code for the behavior you have reported. For now this is an exercise in
isolating the issue.

It might be worthwhile running this outside the IDE to rule out any
possibilities on that account.

What I meant by asking about concurrently starting each process was
whether these three processes were started by a script or other
automated tool that would start them at the same time as different
processes. Not the case as you pointed out. The behavior of Shoal should
be consistent whether started concurrently or started one after the
other.

Joe Fialli, our team member, yesterday isolated a possible bug in the
health monitor that is being tested, but it may not be related to this
unexpected failure suspicion. We need some time to evaluate this and I
would appreciate your patience with this.

Thanks
Shreedhar

Mike Wannamaker wrote:

Sorry for the late reply; yesterday was Canada Day and thus a holiday.

 

Yes all instances are running on a single machine.

Machine is Windows XP 64 bit.

JVM is Sun JVM 1.5.0_15

I am running this through IntelliJ IDEA.

Windows Firewall is turned off.

I do have Symantec Antivirus. Do you mean disable the Auto-Protect? Or
do you mean shut down the whole antivirus?

 

10.6.2.89 is my network card.

192.168.111.1 is VMware VMNet8

192.168.138.1 is VMware VMNet1

 

I am NOT running these instances inside a VMware instance; they are
running on my main machine.

I'm not sure what you mean by "are all started concurrently". I run
these as applications from within IntelliJ, which launches each one like
this:

 

D:\JDKS\jdk1.5.0_15\bin\java -Didea.launcher.port=7546
  "-Didea.launcher.bin.path=C:\Program Files (x86)\JetBrains\IntelliJ IDEA 7.0.3\bin"
  -Dfile.encoding=windows-1252 -classpath
  "D:\JDKS\jdk1.5.0_15\jre\lib\charsets.jar;
   D:\JDKS\jdk1.5.0_15\jre\lib\deploy.jar;
   D:\JDKS\jdk1.5.0_15\jre\lib\javaws.jar;
   D:\JDKS\jdk1.5.0_15\jre\lib\jce.jar;
   D:\JDKS\jdk1.5.0_15\jre\lib\jsse.jar;
   D:\JDKS\jdk1.5.0_15\jre\lib\plugin.jar;
   D:\JDKS\jdk1.5.0_15\jre\lib\rt.jar;
   D:\JDKS\jdk1.5.0_15\jre\lib\ext\dnsns.jar;
   D:\JDKS\jdk1.5.0_15\jre\lib\ext\localedata.jar;
   D:\JDKS\jdk1.5.0_15\jre\lib\ext\sunjce_provider.jar;
   D:\JDKS\jdk1.5.0_15\jre\lib\ext\sunpkcs11.jar;
   D:\Development\shoaltest\out\production\SMessage;
   D:\Development\shoaltest\libs\appia\appia-3.2.4.jar;
   D:\Development\shoaltest\libs\jgroups\jgroups-all.jar;
   D:\Development\shoaltest\libs\log4j\log4j.jar;
   D:\Development\shoaltest\libs\shoal\shoal-gms.jar;
   D:\Development\shoaltest\libs\shoal\jxta.jar;
   C:\Program Files (x86)\JetBrains\IntelliJ IDEA 7.0.3\lib\idea_rt.jar"
  com.intellij.rt.execution.application.AppMain
  com.opentext.shoal.SendMessageSample SERVER-3

 

I DO see the failure suspect for SERVER-3 in the log snippet:

 

30-Jun-2008 2:16:57 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
newViewObserved

INFO: Analyzing new membership snapshot received as part of event :
IN_DOUBT_EVENT

30-Jun-2008 2:16:57 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
addInDoubtMemberSignals

INFO: gms.failureSuspectedEventReceived

30-Jun-2008 2:16:57 PM com.sun.enterprise.ee.cms.impl.common.Router
notifyFailureSuspectedAction

INFO: Sending FailureSuspectedSignals to registered Actions.
Member:SERVER-3...

30-Jun-2008 02:16:57 PM DEBUG [pool-1-thread-4]
com.opentext.ecm.services.smessage.impl.shoal.SignalLogger - - SERVER-3
>> FailureSuspectedSignalImpl @ 30/06/08 2:00 PM - [RCS_CLUSTER-false]:
(Hashtable:[(String:server.name)<-->(String:SERVER-3),
(String:local.host)<-->(Inet4Address:mwana0061/10.6.2.89)])

MEMBERS: (ArrayList:[mwana0061/10.6.2.89, mwana0061/10.6.2.89,
mwana0061/10.6.2.89])

 

What is weird is the one for SERVER-1, which was not shut down and is
still running.

 

 

________________________________

From: Shreedhar.Ganapathy_at_Sun.COM [mailto:Shreedhar.Ganapathy_at_Sun.COM]
Sent: June 30, 2008 3:00 PM
To: users_at_shoal.dev.java.net
Subject: Re: [Shoal-Users] Still not sure it's working

 

Hi Mike
Yes, this is indeed a new problem. I hope these are not separate
snippets but one continuous log snippet. What seems strange in this
pasted output is that there is no failure suspected signal (in doubt
event) for SERVER-3. Is this what you see? There is a suspect event for
SERVER-1.

Some questions: Are all instances on the same machine? The interface
addresses do not all seem to be in the same subnet, and/or they appear
to be on different networks in a multihomed machine environment (I see
10.6.2.89 and 192.168.111.1 and 192.168.138.1).
Are all instances started concurrently?
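
The subnet observation above can be checked mechanically. The following
is a minimal Java sketch and is not part of Shoal. The class name
SubnetCheck and the /24 prefix length are assumptions made for
illustration, since the thread does not state the actual netmasks. It
compares the network portion of two IPv4 address literals.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper (not a Shoal class): tests whether two IPv4 addresses share a subnet. */
final class SubnetCheck {

    /** Returns true when the first prefixLen bits of both IPv4 addresses are equal. */
    static boolean sameSubnet(String a, String b, int prefixLen) {
        if (prefixLen < 0 || prefixLen > 32) {
            throw new IllegalArgumentException("prefixLen must be between 0 and 32");
        }
        // A shift by 32 would be a no-op on an int, so prefix length 0 is handled explicitly.
        int mask = (prefixLen == 0) ? 0 : (-1 << (32 - prefixLen));
        return (toInt(a) & mask) == (toInt(b) & mask);
    }

    /** Packs a dotted-quad IPv4 literal into a 32-bit int. */
    private static int toInt(String dotted) {
        try {
            // For a literal IP address, getByName only parses the text; a host name would trigger a lookup.
            byte[] raw = InetAddress.getByName(dotted).getAddress();
            if (raw.length != 4) {
                throw new IllegalArgumentException("not an IPv4 address: " + dotted);
            }
            return ((raw[0] & 0xFF) << 24) | ((raw[1] & 0xFF) << 16)
                 | ((raw[2] & 0xFF) << 8) | (raw[3] & 0xFF);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("bad address: " + dotted, e);
        }
    }

    public static void main(String[] args) {
        // Addresses taken from the log in this thread; the /24 prefix is an assumption.
        String[] addrs = {"10.6.2.89", "192.168.111.1", "192.168.138.1"};
        for (int i = 0; i < addrs.length; i++) {
            for (int j = i + 1; j < addrs.length; j++) {
                System.out.println(addrs[i] + " vs " + addrs[j] + " same /24: "
                        + sameSubnet(addrs[i], addrs[j], 24));
            }
        }
    }
}
```

With the real netmask of each interface substituted for the assumed
prefix length, this shows whether any two of the advertised addresses
could reach each other without routing.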

Do you have any antivirus or firewalls running on your machine(s)? If
yes, can you disable them and see if communications and events happen
correctly?

Thanks
Shreedhar



Mike Wannamaker wrote:

Okay, I tested shutting down a non-groupleader. I do see suspect and
failure notifications.

 

However, you might not like this; I also see something that is very
strange and disturbing.

 

I start SERVER-1 (GROUPLEADER), SERVER-2, and SERVER-3.

 

I shut down SERVER-3 and get the correct messages in SERVER-1 and mostly
in SERVER-2, but I also get a FailureSuspect for SERVER-1 in the
SERVER-2 window.

This might be okay if I got a notification that the node was back, but I
don't, and it is still running. I started SERVER-3 again and see
SERVER-1 in the list, and it gets notifications as well.

 

I tried again, shutting down the newly running SERVER-3, and I get the
same results, so it seems fully reproducible.

 

 

 

Here is the output for SERVER-2

 

30-Jun-2008 2:16:57 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
getMemberTokens

INFO: GMS View Change Received for group RCS_CLUSTER : Members in view
for (before change analysis) are :

1: MemberId: SERVER-2, MemberType: CORE, Address:
urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FD0D4B867250FF460C9B539A1617
79845B03

2: MemberId: SERVER-3, MemberType: CORE, Address:
urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FD54C54AB0D7A640E493A5C6CE42
7A3CE203

3: MemberId: SERVER-1, MemberType: CORE, Address:
urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FDB946A28335F0413BBF73B77CCC
8BFEC603

 

30-Jun-2008 2:16:57 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
newViewObserved

INFO: Analyzing new membership snapshot received as part of event :
IN_DOUBT_EVENT

30-Jun-2008 2:16:57 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
addInDoubtMemberSignals

INFO: gms.failureSuspectedEventReceived

30-Jun-2008 2:16:57 PM com.sun.enterprise.ee.cms.impl.common.Router
notifyFailureSuspectedAction

INFO: Sending FailureSuspectedSignals to registered Actions.
Member:SERVER-3...

30-Jun-2008 02:16:57 PM DEBUG [pool-1-thread-4]
com.opentext.ecm.services.smessage.impl.shoal.SignalLogger - - SERVER-3
>> FailureSuspectedSignalImpl @ 30/06/08 2:00 PM - [RCS_CLUSTER-false]:
(Hashtable:[(String:server.name)<-->(String:SERVER-3),
(String:local.host)<-->(Inet4Address:mwana0061/10.6.2.89)])

MEMBERS: (ArrayList:[mwana0061/10.6.2.89, mwana0061/10.6.2.89,
mwana0061/10.6.2.89])

30-Jun-2008 2:16:57 PM com.sun.enterprise.jxtamgmt.HealthMonitor
isConnected

INFO: Checking for machine status for network interface :
tcp://10.6.2.89:9701

30-Jun-2008 2:16:57 PM com.sun.enterprise.jxtamgmt.HealthMonitor
isConnected

INFO: Checking for machine status for network interface :
tcp://192.168.111.1:9701

30-Jun-2008 2:16:57 PM com.sun.enterprise.jxtamgmt.HealthMonitor
isConnected

INFO: Checking for machine status for network interface :
tcp://192.168.138.1:9701

30-Jun-2008 2:17:27 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
getMemberTokens

INFO: GMS View Change Received for group RCS_CLUSTER : Members in view
for (before change analysis) are :

1: MemberId: SERVER-2, MemberType: CORE, Address:
urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FD0D4B867250FF460C9B539A1617
79845B03

2: MemberId: SERVER-3, MemberType: CORE, Address:
urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FD54C54AB0D7A640E493A5C6CE42
7A3CE203

3: MemberId: SERVER-1, MemberType: CORE, Address:
urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FDB946A28335F0413BBF73B77CCC
8BFEC603

 

30-Jun-2008 2:17:27 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
newViewObserved

INFO: Analyzing new membership snapshot received as part of event :
IN_DOUBT_EVENT

30-Jun-2008 2:17:27 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
addInDoubtMemberSignals

INFO: gms.failureSuspectedEventReceived

30-Jun-2008 2:17:27 PM com.sun.enterprise.ee.cms.impl.common.Router
notifyFailureSuspectedAction

INFO: Sending FailureSuspectedSignals to registered Actions.
Member:SERVER-1...

30-Jun-2008 02:17:27 PM DEBUG [pool-1-thread-4]
com.opentext.ecm.services.smessage.impl.shoal.SignalLogger - - SERVER-1
>> FailureSuspectedSignalImpl @ 30/06/08 1:59 PM - [RCS_CLUSTER-false]:
(Hashtable:[(String:server.name)<-->(String:SERVER-1),
(String:local.host)<-->(Inet4Address:mwana0061/10.6.2.89)])

MEMBERS: (ArrayList:[mwana0061/10.6.2.89, mwana0061/10.6.2.89,
mwana0061/10.6.2.89])

30-Jun-2008 2:17:30 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
getMemberTokens

INFO: GMS View Change Received for group RCS_CLUSTER : Members in view
for (before change analysis) are :

1: MemberId: SERVER-2, MemberType: CORE, Address:
urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FD0D4B867250FF460C9B539A1617
79845B03

2: MemberId: SERVER-1, MemberType: CORE, Address:
urn:jxta:uuid-2F39FF376B6A43E3905DAFC81B7D02FDB946A28335F0413BBF73B77CCC
8BFEC603

 

30-Jun-2008 2:17:30 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
newViewObserved

INFO: Analyzing new membership snapshot received as part of event :
FAILURE_EVENT

30-Jun-2008 2:17:30 PM com.sun.enterprise.ee.cms.impl.jxta.ViewWindow
addFailureSignals

INFO: The following member has failed: SERVER-3

30-Jun-2008 2:17:30 PM com.sun.enterprise.ee.cms.impl.common.Router
notifyFailureNotificationAction

INFO: Sending FailureNotificationSignals to registered Actions. Member:
SERVER-3...

30-Jun-2008 02:17:30 PM DEBUG [pool-1-thread-4]
com.opentext.ecm.services.smessage.impl.shoal.SignalLogger - - SERVER-3
>> FailureNotificationSignalImpl @ 30/06/08 2:00 PM -
[RCS_CLUSTER-false]:
(Hashtable:[(String:server.name)<-->(String:SERVER-3),
(String:local.host)<-->(Inet4Address:mwana0061/10.6.2.89)])SERVER-3

MEMBERS: (ArrayList:[mwana0061/10.6.2.89, mwana0061/10.6.2.89])
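
To see at a glance which members were suspected and which were declared
failed in a log like the one above, the Router lines can be scanned with
a regular expression. This is a minimal Java sketch under stated
assumptions: the class name SignalScan is hypothetical, and the pattern
is derived only from the two Router messages visible in this thread, so
it may not match the output of other Shoal versions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Shoal class): lists failure signals found in pasted log text. */
final class SignalScan {

    // \s* tolerates the line wraps the mail client inserted before or after "Member:".
    private static final Pattern ROUTER_LINE = Pattern.compile(
            "Sending (FailureSuspected|FailureNotification)Signals to registered Actions\\."
            + "\\s*Member:\\s*([A-Za-z0-9_-]+)");

    /** Returns entries of the form "SignalType:MemberId" in the order they appear in the log. */
    static List<String> signals(String log) {
        List<String> found = new ArrayList<String>();
        Matcher m = ROUTER_LINE.matcher(log);
        while (m.find()) {
            found.add(m.group(1) + ":" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample text copied from the Router messages quoted in this thread.
        String log =
              "INFO: Sending FailureSuspectedSignals to registered Actions.\n"
            + "Member:SERVER-3...\n"
            + "INFO: Sending FailureSuspectedSignals to registered Actions.\n"
            + "Member:SERVER-1...\n"
            + "INFO: Sending FailureNotificationSignals to registered Actions. Member:\n"
            + "SERVER-3...\n";
        for (String s : signals(log)) {
            System.out.println(s);
        }
    }
}
```

A suspected member that never shows up in a later FailureNotification
entry, such as SERVER-1 here, is the case this thread is asking about.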

 

________________________________

From: Shreedhar.Ganapathy_at_Sun.COM [mailto:Shreedhar.Ganapathy_at_Sun.COM]
Sent: June 30, 2008 2:07 PM
To: users_at_shoal.dev.java.net
Subject: Re: [Shoal-Users] Still not sure it's working

 

That's correct. Yes, I should not mix up the provider terminology with
the GMS terminology.
Thanks
Shreedhar

Mike Wannamaker wrote:

When you say a non-master, do you mean when a server that is not the
groupleader is shut down?

 

________________________________

From: Shreedhar.Ganapathy_at_Sun.COM [mailto:Shreedhar.Ganapathy_at_Sun.COM]
Sent: June 30, 2008 1:47 PM
To: users_at_shoal.dev.java.net
Subject: Re: [Shoal-Users] Still not sure it's working

 

Hi Mike
This is a recent known issue that occurs when the master fails. I
don't see a Shoal issue filed for this yet, but our QE has filed an
internal issue on this behavior. I will post an issue in the Shoal
tracker later today with your details.

Can you confirm whether the behavior is okay when a non-master member fails?

Thanks
Shreedhar