users@glassfish.java.net

EJB failover unexpected behaviour or the way it should be?

From: <glassfish_at_javadesktop.org>
Date: Tue, 03 Mar 2009 11:09:23 PST

I'm working on a proof-of-concept application that has some server-side stateless EJBs and a WebStart client (all in one EAR). I deployed it on a cluster (2 nodes: pc-simbad and pc-grasshopper) to test how failover works.
It does not work as I expected (I expected this: http://docs.sun.com/app/docs/doc/820-4341/fxxqs?a=view).

The first test was (I removed all logging and try-catch blocks to simplify):

String nodeName; //ip address of node
InitialContext ic = new InitialContext();
BeanRemote mbr = (BeanRemote) ic.lookup("TheBean");
nodeName = mbr.publish("message1"); //nodeName = pc-simbad
nodeName = mbr.publish("message2");
Thread.sleep(30000); //at this moment I've unplugged pc-simbad from the network
nodeName = mbr.publish("message3"); //EXCEPTION!!!! : java.rmi.MarshalException: CORBA COMM_FAILURE 1398079696
nodeName = mbr.publish("message4"); //ok, nodeName = pc-grasshopper
nodeName = mbr.publish("message5"); //ok, nodeName = pc-grasshopper

In this scenario one call fails.
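
A client-side workaround I could apply in the meantime is to retry the failed call once, since the second attempt reaches pc-grasshopper. The following is only a sketch under assumptions: FailoverRetry and callWithRetry are hypothetical names (not part of GlassFish or of my application), and retrying is only safe if publish() is idempotent, because the failed attempt may already have reached the server.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of GlassFish or of the application above.
final class FailoverRetry {

    private FailoverRetry() {}

    /**
     * Runs the call and retries it when it throws, up to maxAttempts attempts in total.
     * Assumption: the wrapped remote operation is idempotent.
     */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt failed: rethrow the most recent exception
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for mbr.publish(...): fails on the first call, succeeds on the second.
        final int[] calls = {0};
        String nodeName = callWithRetry(new Callable<String>() {
            public String call() throws Exception {
                if (calls[0]++ == 0) {
                    throw new java.rmi.MarshalException("CORBA COMM_FAILURE");
                }
                return "pc-grasshopper";
            }
        }, 2);
        System.out.println(nodeName + " after " + calls[0] + " calls");
    }
}
```

In the first test this would mean wrapping each mbr.publish(...) call in callWithRetry(..., 2). It hides the single failed call but does not explain why the call fails in the first place.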

The second test was:

String nodeName;
InitialContext ic = new InitialContext();
nodeName = ((BeanRemote) ic.lookup("TheBean")).publish("message1"); //pc-simbad
nodeName = ((BeanRemote) ic.lookup("TheBean")).publish("message2"); //pc-simbad
Thread.sleep(30000);
nodeName = ((BeanRemote) ic.lookup("TheBean")).publish("message3"); //Exception
nodeName = ((BeanRemote) ic.lookup("TheBean")).publish("message4"); //pc-grasshopper
nodeName = ((BeanRemote) ic.lookup("TheBean")).publish("message5"); //pc-grasshopper

It works much the same, but the exception during the 'message3' publication is:
javax.naming.CommunicationException: Can't find SerialContextProvider [Root exception is org.omg.CORBA.COMM_FAILURE: vmcid: SUN minor code: 208 completed: Maybe]
and the lookups for messages 4 and 5 take quite some time (timeouts while trying to contact pc-simbad).

And the final test:

        String nodeName;
        nodeName = ((BeanRemote) new InitialContext().lookup("TheBean")).publish("message1"); // pc-grasshopper
        nodeName = ((BeanRemote) new InitialContext().lookup("TheBean")).publish("message2"); //pc-simbad
        Thread.sleep(30000); //unplugging
        nodeName = ((BeanRemote) new InitialContext().lookup("TheBean")).publish("message3"); //Exception
        nodeName = ((BeanRemote) new InitialContext().lookup("TheBean")).publish("message4"); // pc-grasshopper
        nodeName = ((BeanRemote) new InitialContext().lookup("TheBean")).publish("message5"); // pc-grasshopper

The exception is the same as in the second test.

Is this a bug, or is this the way it should be? It also occurs with a stateful session bean: the first call after the node goes down fails, but the next ones work fine and the bean migrates to another instance.
I'm also not convinced that the second test should behave like this. If the context finds the primary endpoint unavailable, shouldn't it use the alternate one for subsequent lookups? It does, but why does the application have to suffer timeouts on the primary endpoint EVERY time a lookup is performed?
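
For anyone reproducing this with a standalone client rather than WebStart: as far as I understand the GlassFish documentation, the client can be given the list of cluster endpoints through the com.sun.appserv.iiop.endpoints system property, set before the first InitialContext is created. A minimal sketch, where the IiopEndpoints helper is hypothetical and port 3700 is only a placeholder for each instance's IIOP listener port:

```java
// Hypothetical helper; host names are the two nodes from the tests above,
// the port number is an assumption and must match each instance's IIOP listener.
final class IiopEndpoints {

    static final String ENDPOINTS_PROPERTY = "com.sun.appserv.iiop.endpoints";

    private IiopEndpoints() {}

    /** Builds a "host1:port,host2:port" list in the order the hosts are given. */
    static String join(int port, String... hosts) {
        StringBuilder sb = new StringBuilder();
        for (String host : hosts) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(host).append(':').append(port);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String endpoints = join(3700, "pc-simbad", "pc-grasshopper");
        // Must happen before the first InitialContext is created.
        System.setProperty(ENDPOINTS_PROPERTY, endpoints);
        System.out.println(ENDPOINTS_PROPERTY + "=" + endpoints);
    }
}
```

This only tells the client which endpoints exist; whether it avoids the repeated timeouts on the dead primary endpoint is exactly what I am asking about.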
I'm attaching logs from Java console.
GlassFish 2.1, all machines run Windows, Java 1.6.0_12.
[Message sent by forum member 'sasol' (sasol)]

http://forums.java.net/jive/thread.jspa?messageID=334882