Re: reason for destroying POA's when starting a server instance

From: Ken Cavanaugh <Ken.Cavanaugh_at_Sun.COM>
Date: Mon, 09 Feb 2009 10:50:40 -0800

Dies Koper wrote:

Hi Ken,

I've run into the following issue. I'm looking at how to fix it, but I'd
just like to understand: why are POA's destroyed when starting a server
instance? Is it really necessary?

Absolutely. The issue here is how to update the object references when the cluster membership
changes. This means that the IOR's template (which corresponds to a POA) must
be rewritten, and that requires re-creating the POA. But this must be seamless: that is, I
can't just destroy/create the POA directly, because that would create a window during
which a request would get an OBJECT_NOT_EXIST error.

So, the solution is to use a POA AdapterActivator, which runs whenever the first request
comes into the ORB after the POA has been destroyed. This mechanism is carefully synchronized
so that the AdapterActivator runs exactly once, and other concurrent requests for the same
POA are held off until the AdapterActivator runs to completion (which is an upcall and may
perform arbitrary computation, including calling other object references).
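
The "runs exactly once, everyone else waits" behavior can be illustrated with a small stand-alone model. This is only a sketch, not the ORB's code: LazyAdapterTable, dispatch, destroy and demo are made-up names, and java.util.concurrent.ConcurrentHashMap.computeIfAbsent stands in for the POA's own synchronization. One difference from the real mechanism: the activation function here must not touch the table itself, whereas a real AdapterActivator upcall may do arbitrary work.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy stand-in for a POA table; hypothetical, not the ORB's real classes.
class LazyAdapterTable {
    private final ConcurrentHashMap<String, String> adapters = new ConcurrentHashMap<>();
    private final AtomicInteger activations = new AtomicInteger();

    // Plays the role of the AdapterActivator upcall: rebuilds a missing adapter.
    private String activate(String name) {
        return name + "#" + activations.incrementAndGet();
    }

    // Called for every incoming request. computeIfAbsent applies activate() at
    // most once per missing key; concurrent callers for that key wait for it.
    String dispatch(String name) {
        return adapters.computeIfAbsent(name, this::activate);
    }

    // Models destroying the POA; the next dispatch re-creates it.
    void destroy(String name) {
        adapters.remove(name);
    }

    int activationCount() {
        return activations.get();
    }

    // Creates an adapter, destroys it, then fires `threads` concurrent requests
    // at it and returns the total number of activations that ran.
    static int demo(int threads) {
        LazyAdapterTable table = new LazyAdapterTable();
        table.dispatch("ejb-poa");   // first request creates the adapter
        table.destroy("ejb-poa");    // membership change destroys it
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    start.await();               // line all requests up
                    table.dispatch("ejb-poa");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("requests did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return table.activationCount();
    }
}
```

Calling LazyAdapterTable.demo(8) returns 2: one activation for the initial creation and one for the re-creation, no matter how many requests arrive at once after the destroy.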

For more details, see "IIOP Failover in Dynamic Clusters" at:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.83.643

(citeseer has a download link that seems to be publicly accessible)

I have a cluster with multiple instances. When another instance is
started IiopFolbGmsClient#addMember is invoked, which leads to a call to
ReferenceFactoryManagerImpl#suspend. In this method, the state is
changed to RFMState.SUSPENDED.

After that, when the EJB application is being started,
ReferenceFactoryManagerImpl#create fails because of the SUSPENDED state.
(see following exception)

Caused by: org.omg.CORBA.TRANSIENT:   vmcid: SUN  minor code: 1003 completed: No
	at com.sun.corba.ee.impl.logging.POASystemException.rfmMightDeadlock(POASystemException.java:2341)
	at com.sun.corba.ee.impl.logging.POASystemException.rfmMightDeadlock(POASystemException.java:2363)
	at com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryManagerImpl.create(ReferenceFactoryManagerImpl.java:244)
	at com.sun.enterprise.iiop.POARemoteReferenceFactory.createReferenceFactory(POARemoteReferenceFactory.java:345)
	at com.sun.enterprise.iiop.POARemoteReferenceFactory.setRepositoryIds(POARemoteReferenceFactory.java:229)
	... 18 more


ReferenceFactoryManagerImpl#suspend was called from
ServerGroupManager#restartFactories with the purpose of destroying the
POA's.

This problem is GlassFish issue 4560, which came in too late to do anything about in GFv2 (according to the
schedule at the time in any case).

What happens here is that whenever a membership change is detected (by Shoal/GMS), the ORB will
quiesce requests, update the failover information, then destroy the POAs and resume processing.
I think the quiesce mechanism is useful in general, but thinking further about this, it's probably not
really needed in this case due to the synchronization inherent in the POA. We could PROBABLY
make this work simply by updating the RFM poatable, then destroying all of the POAs. The POA will
correctly handle the synchronization of request processing/adapter activation/POA destruction,
and this will avoid a need to lock and risk possible deadlocks. It's also simpler than my original thought
for 4560, which was to block RFM create/find until suspend/resume completes. The main affected code
to fix this is the RFM and also the ServerGroupManager, which actually calls the suspend/resume code in the
RFM.
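
For reference, the failure itself reduces to a simple state check. The following is a toy model only: ToyFactoryManager and its methods are invented for illustration, and an IllegalStateException stands in for the CORBA TRANSIENT (minor code 1003) that the real create() raises.

```java
// Toy model of the reported failure; all names here are illustrative only.
class ToyFactoryManager {
    enum State { READY, SUSPENDED }

    private State state = State.READY;

    synchronized void suspend() { state = State.SUSPENDED; }

    synchronized void resume() { state = State.READY; }

    // Refuses to create a factory while suspended rather than risk a deadlock.
    synchronized String create(String name) {
        if (state == State.SUSPENDED) {
            throw new IllegalStateException("create(" + name + ") while suspended");
        }
        return "factory:" + name;
    }

    // Replays the reported sequence: membership change first, EJB startup second.
    static boolean createFailsWhileSuspended() {
        ToyFactoryManager rfm = new ToyFactoryManager();
        rfm.suspend();               // addMember -> restartFactories -> suspend
        try {
            rfm.create("MyEjb");     // application start asks for a factory
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }
}
```

Blocking create/find until resume would remove the exception at the cost of more locking; dropping suspend/resume for this case removes the state check from the picture entirely.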

Biggest problem with making changes here is testing: I need to find someone in SQE to run whatever
full cluster tests they have (yes, there should be development tests at this level, but it's harder to test
this sort of thing when you need at least 4-5 AS instances deployed in a cluster to test: calls for
Hudson and xVM/VMWare integration).

The simplest possible fix here (which MIGHT work: there may be other problems) is to do the following on
a cluster membership change:

1. update the membership label in the ServerGroupManager (necessary so we update the clients that are holding onto cached object references)
2. JUST destroy all the POAs in the RFM (this would need a new public method in the RFM that would be called from ServerGroupManager.restartFactories(), which then does NOT need to be in a spawned thread)
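
The two steps above can be sketched as follows. This is a simplified model under stated assumptions, not a patch: ToyServerGroupManager, handleRequest and onMembershipChange are hypothetical names, a map of strings stands in for the RFM's POAs, and clearing the map stands in for the new "destroy all POAs" method that does not exist yet.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the proposed simpler flow; not GlassFish code.
class ToyServerGroupManager {
    private volatile String membershipLabel = "label-0";
    private final Map<String, String> poas = new ConcurrentHashMap<>();

    // Request path: a missing POA is re-created on demand and stamped with
    // whatever membership label is current at that moment.
    String handleRequest(String poaName) {
        return poas.computeIfAbsent(poaName, n -> n + "@" + membershipLabel);
    }

    // Membership change, run inline in the caller's thread: no suspend/resume
    // and no spawned thread.
    void onMembershipChange(String newLabel) {
        membershipLabel = newLabel;  // step 1: update the membership label
        poas.clear();                // step 2: destroy all POAs (assumed new RFM method)
    }
}
```

In this model a request for "ejb" is served by "ejb@label-0" before the change and by "ejb@label-1" after onMembershipChange("label-1"). The label is updated before the POAs are destroyed so that any POA re-created after the destroy picks up the new label.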

I do plan to use the RFM at some point for general dynamic reconfiguration, but I think it is likely that
we could take a simpler approach for IIOP FOLB, because the ONLY thing that needs to change is the POA
(e.g. in general we could be destroying and re-creating transport buffer pools, and I REALLY don't want to
do that while handling requests).

Why do the POA's need to be destroyed? Would it be okay here to not
invoke ServerGroupManager#restartFactories?

If you do that, the ORB will never see cluster membership changes.

Ken.