users@glassfish.java.net

Re: Problem with cluster 3.1.1

From: Bobby Bissett <bobby.bissett_at_oracle.com>
Date: Mon, 27 Jun 2011 10:25:30 -0400

On 6/27/11 5:33 AM, forums_at_java.net wrote:
> Thank you for your suggestion in Jira, but it seems to me that the
> problem is not in the network configuration. In my test environment,
> session replication and GMS are working (I'm able to check this with the
> get-health and list-instances commands), but validate-multicast is not,
> both with the cluster and DAS running simultaneously and without them.

If it works in one environment and not another, then it's a problem in
the environment. Since you say that list-instances and get-health show
different results, then there is most likely a problem in the network
that prevents UDP multicast traffic from getting to/from all the servers.

The information you've given me here can't all be from the same point in
time. For instance, you have:

>
> portal-instance2 started since Fri Jun 24 21:34:10 MSD 2011

...which shows that instance 2 joined the group, but instance 2 isn't in
your member list here:

> Members in view for JOINED_AND_READY_EVENT(before change analysis) are :
>
> 1: MemberId: portal-instance1, MemberType: CORE, Address:
> 192.168.101.31:9115:228.9.96.158:20796:portal-cluster:portal-instance1
>
> 2: MemberId: portal-instance12, MemberType: CORE, Address:
> 192.168.101.31:9136:228.9.96.158:20796:portal-cluster:portal-instance12
>
> 3: MemberId: portal-instance4, MemberType: CORE, Address:
> 192.168.101.34:9190:228.9.96.158:20796:portal-cluster:portal-instance4
>
> 4: MemberId: portal-instance42, MemberType: CORE, Address:
> 192.168.101.34:9119:228.9.96.158:20796:portal-cluster:portal-instance42
>
> 5: MemberId: portal-instance5, MemberType: CORE, Address:
> 192.168.101.35:9096:228.9.96.158:20796:portal-cluster:portal-instance5
>
> 6: MemberId: portal-instance52, MemberType: CORE, Address:
> 192.168.101.35:9170:228.9.96.158:20796:portal-cluster:portal-instance52
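To make this kind of cross-check mechanical, here is a minimal Java sketch, not part of GlassFish, that takes the instance name from the end of each GMS address string and reports which expected instances are absent from the view. It assumes the address layout host:port:multicast-address:multicast-port:cluster:instance seen in the log above; the class and method names (GmsViewCheck, instanceName, missingFromView) are hypothetical. Since the thread doesn't list all ten instance names, the expected list only uses names that actually appear here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of GlassFish): finds instances that are
// expected in the cluster but missing from a GMS "Members in view" list.
class GmsViewCheck {

    // Returns the last colon-separated field, which in the log above is the
    // instance name (assumed layout: host:port:mcastAddr:mcastPort:cluster:instance).
    static String instanceName(String address) {
        int idx = address.lastIndexOf(':');
        if (idx < 0 || idx == address.length() - 1) {
            throw new IllegalArgumentException("unexpected address: " + address);
        }
        return address.substring(idx + 1);
    }

    // Returns the expected names that do not appear in the view, in input order.
    static List<String> missingFromView(List<String> expected, List<String> viewAddresses) {
        Set<String> inView = new HashSet<>();
        for (String address : viewAddresses) {
            inView.add(instanceName(address));
        }
        List<String> missing = new ArrayList<>();
        for (String name : expected) {
            if (!inView.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The six addresses from the JOINED_AND_READY_EVENT view quoted above.
        List<String> view = List.of(
            "192.168.101.31:9115:228.9.96.158:20796:portal-cluster:portal-instance1",
            "192.168.101.31:9136:228.9.96.158:20796:portal-cluster:portal-instance12",
            "192.168.101.34:9190:228.9.96.158:20796:portal-cluster:portal-instance4",
            "192.168.101.34:9119:228.9.96.158:20796:portal-cluster:portal-instance42",
            "192.168.101.35:9096:228.9.96.158:20796:portal-cluster:portal-instance5",
            "192.168.101.35:9170:228.9.96.158:20796:portal-cluster:portal-instance52");
        // Partial list: only the instance names mentioned in this thread.
        List<String> expected = List.of(
            "portal-instance1", "portal-instance12", "portal-instance2",
            "portal-instance4", "portal-instance42",
            "portal-instance5", "portal-instance52");
        System.out.println("Missing from view: " + missingFromView(expected, view));
    }
}
```

Run against the view above, it prints "Missing from view: [portal-instance2]", which is the same inconsistency described here. With the full list of instance names from list-instances, it would show every instance the GMS view is missing.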

I would recommend that you stop all instances and the DAS, and use the
validate-multicast tool to verify that all of your nodes can communicate
with each other. Follow this blog entry, and make sure your nodes are all
on the same subnet and that their clocks are roughly in sync (if one is an
hour off from another, that will mess with the timing of GMS messages):

http://blogs.oracle.com/bobby/entry/validating_multicast_transport_where_d
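For the same-subnet part, here is a minimal Java sketch (a hypothetical helper, not a GlassFish tool) that compares the network prefix of two IPv4 addresses. The /24 prefix length is an assumption; use the netmask actually configured on your hosts. The addresses are the three node IPs that appear in the GMS view above; the remaining nodes and the DAS would have to be added by hand.

```java
// Hypothetical helper: checks whether two IPv4 addresses share a network
// prefix. The /24 used in main is an assumed netmask, not a known value.
class SubnetCheck {

    // Parses a dotted-quad IPv4 address such as "192.168.101.31" into a 32-bit int.
    static int toInt(String ipv4) {
        String[] parts = ipv4.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
        }
        int value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + ipv4);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    // True if both addresses have the same first prefixLength bits.
    static boolean sameSubnet(String a, String b, int prefixLength) {
        if (prefixLength < 0 || prefixLength > 32) {
            throw new IllegalArgumentException("prefix length must be 0..32");
        }
        // Java masks int shift distances to 0..31, so -1 << 32 would stay -1;
        // a /0 prefix is therefore handled explicitly as an all-zero mask.
        int mask = prefixLength == 0 ? 0 : -1 << (32 - prefixLength);
        return (toInt(a) & mask) == (toInt(b) & mask);
    }

    public static void main(String[] args) {
        // Node IPs taken from the GMS view; add the other nodes and the DAS.
        String[] hosts = {"192.168.101.31", "192.168.101.34", "192.168.101.35"};
        for (String host : hosts) {
            System.out.println(host + " same /24 as " + hosts[0] + ": "
                + sameSubnet(hosts[0], host, 24));
        }
    }
}
```

A matching prefix doesn't by itself prove that multicast traffic gets through (switch configuration such as IGMP snooping can still block it), so validate-multicast remains the real test.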

Once you're sure the network is set up properly, we can go from there.
You say you have 5 nodes with 2 instances each. Is the DAS on a 6th machine?


> You can see that GMS is working, but not all instances have joined
> this multicast group. I get exactly the same output on all my
> instances. The GMS group cannot grow beyond *6 instances*. On the
> "failed" instances, there is no SPECTACOR member (the DAS machine).

I can create a GMS group with more than 6 instances. Something else is
going on here.

Cheers,
Bobby