Dies Koper wrote:
Hi Ken,
I've run into the following issue. I'm looking at how to fix it, but I'd
just like to understand why POAs are destroyed when starting a server
instance. Is it really necessary?
Absolutely. The issue here is how to update the object references when
the cluster membership changes. This means that the IOR template (which
corresponds to a POA) must be rewritten, and that requires re-creating
the POA. But this must be seamless: I can't just destroy and re-create
the POA directly, because that would open a window during which a
request would get an OBJECT_NOT_EXIST error.
So, the solution is to use a POA AdapterActivator, which runs whenever
the first request comes into the ORB after the POA has been destroyed.
This mechanism is carefully synchronized so that the AdapterActivator
runs exactly once, and other concurrent requests for the same POA are
held off until the AdapterActivator runs to completion (which is an
upcall and may perform arbitrary computation, including calling other
object references).
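The "runs exactly once, others are held off" behaviour described above
can be pictured with a toy model in plain Java. This is only a sketch,
not the ORB's code: ToyAdapterRegistry, dispatch() and destroy() are
hypothetical names, the adapter is modelled as a String, and the real
POA does this work inside its own request-dispatch path.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Toy model only. ToyAdapterRegistry and its methods are hypothetical names,
// not part of the ORB or of org.omg.PortableServer.
final class ToyAdapterRegistry {
    // POA name -> future holding the (re)created adapter, modelled as a String.
    private final ConcurrentHashMap<String, CompletableFuture<String>> adapters =
            new ConcurrentHashMap<>();
    private final Function<String, String> activator;
    // Counts how many times the activator upcall has run.
    final AtomicInteger activations = new AtomicInteger();

    ToyAdapterRegistry(Function<String, String> activator) {
        this.activator = activator;
    }

    // Called for every incoming request that targets the named adapter.
    String dispatch(String poaName) {
        CompletableFuture<String> mine = new CompletableFuture<>();
        CompletableFuture<String> existing = adapters.putIfAbsent(poaName, mine);
        if (existing != null) {
            // Adapter exists, or another thread is activating it: wait for it.
            return existing.join();
        }
        // This thread won the race, so it alone runs the activator upcall.
        // No lock is held here, so the upcall may dispatch to other adapters.
        try {
            activations.incrementAndGet();
            String adapter = activator.apply(poaName);
            mine.complete(adapter);
            return adapter;
        } catch (RuntimeException e) {
            adapters.remove(poaName, mine);   // let a later request retry
            mine.completeExceptionally(e);
            throw e;
        }
    }

    // Destroying the adapter just forgets it; the next request re-creates it.
    void destroy(String poaName) {
        adapters.remove(poaName);
    }
}
```

In this model a request never sees a missing adapter: it either finds
one, waits for one that is being activated, or activates it itself.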
For more details, see "IIOP Failover in Dynamic Clusters" at:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.83.643
(citeseer has a download link that seems to be publicly accessible)
I have a cluster with multiple instances. When another instance is
started IiopFolbGmsClient#addMember is invoked, which leads to a call to
ReferenceFactoryManagerImpl#suspend. In this method, the state is
changed to RFMState.SUSPENDED.
After that, when the EJB application is being started,
ReferenceFactoryManagerImpl#create fails because of the SUSPENDED state.
(see following exception)
Caused by: org.omg.CORBA.TRANSIENT: vmcid: SUN minor code: 1003 completed: No
    at com.sun.corba.ee.impl.logging.POASystemException.rfmMightDeadlock(POASystemException.java:2341)
    at com.sun.corba.ee.impl.logging.POASystemException.rfmMightDeadlock(POASystemException.java:2363)
    at com.sun.corba.ee.impl.oa.rfm.ReferenceFactoryManagerImpl.create(ReferenceFactoryManagerImpl.java:244)
    at com.sun.enterprise.iiop.POARemoteReferenceFactory.createReferenceFactory(POARemoteReferenceFactory.java:345)
    at com.sun.enterprise.iiop.POARemoteReferenceFactory.setRepositoryIds(POARemoteReferenceFactory.java:229)
    ... 18 more
ReferenceFactoryManagerImpl#suspend was called from
ServerGroupManager#restartFactories with the purpose of destroying the
POAs.
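The failure sequence reported above (a membership change suspends the
RFM, then an application start calls create) can be reproduced with a
very small model. This is a sketch under assumptions: ToyRfm is a
hypothetical stand-in, it throws IllegalStateException where the real
code raises org.omg.CORBA.TRANSIENT, and it makes no claim about how
ReferenceFactoryManagerImpl is actually structured.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the RFM; it only models the state check.
final class ToyRfm {
    enum State { READY, SUSPENDED }

    private State state = State.READY;
    private final Map<String, String> factories = new HashMap<>();

    // Models the suspend triggered by a cluster membership change.
    synchronized void suspend() { state = State.SUSPENDED; }

    synchronized void resume() { state = State.READY; }

    // Models reference factory creation during application start.
    synchronized String create(String name) {
        if (state == State.SUSPENDED) {
            // The real code reports this as a TRANSIENT system exception.
            throw new IllegalStateException("RFM suspended, create(" + name + ") rejected");
        }
        String factory = "factory:" + name;
        factories.put(name, factory);
        return factory;
    }
}
```

Calling suspend() and then create("SomeEjb") throws in this model, which
mirrors the race between a membership change and an EJB application
start.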
This problem is GlassFish issue 4560, which came in too late to do
anything about in GFv2 (according to the schedule at the time, in any
case).
What happens here is that whenever a membership change is detected (by
Shoal/GMS), the ORB will quiesce requests, update the failover
information, then destroy the POAs and resume processing.
I think the quiesce mechanism is useful in general, but thinking
further about this, it's probably not really needed in this case due to
the synchronization inherent in the POA. We could PROBABLY make this
work simply by updating the RFM poatable, then destroying all of the
POAs. The POA will correctly handle the synchronization of request
processing/adapter activation/POA destruction, and this will avoid the
need to lock and the risk of deadlocks. It's also simpler than my
original thought for 4560, which was to block RFM create/find until
suspend/resume completes. The main code affected by this fix is the RFM
and the ServerGroupManager, which actually calls the suspend/resume
code in the RFM.
The biggest problem with making changes here is testing: I need to find
someone in SQE to run whatever full cluster tests they have (yes, there
should be development tests at this level, but this sort of thing is
harder to test when you need at least 4-5 AS instances deployed in a
cluster: it calls for Hudson and xVM/VMware integration).
The simplest possible fix here (which MIGHT work: there may be other
problems) is to do the following on a cluster membership change:
- update the membership label in the ServerGroupManager (necessary
so we update the clients that are holding onto cached object references)
- JUST destroy all the POAs in the RFM (this would need a new
public method in the RFM that would be called from
ServerGroupManager.restartFactories(), which then does NOT need to be
in a spawned thread)
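The two steps above can be sketched as a toy model. Everything here is
an assumption for illustration: ToyFolbRfm, findOrActivate(),
destroyAll() and onMembershipChange() are hypothetical names, a POA is
modelled as a String carrying the membership label, and the real fix
would live in the RFM and ServerGroupManager.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the proposed fix; all names are hypothetical.
final class ToyFolbRfm {
    // POA name -> "template", modelled as name@membershipLabel.
    private final Map<String, String> poas = new ConcurrentHashMap<>();
    private volatile String membershipLabel;

    ToyFolbRfm(String initialLabel) {
        this.membershipLabel = initialLabel;
    }

    // Request path: find the POA, or re-create it on demand with the
    // current label (the role the AdapterActivator plays in the ORB).
    String findOrActivate(String name) {
        return poas.computeIfAbsent(name, n -> n + "@" + membershipLabel);
    }

    // Stand-in for the new public "destroy all POAs" method.
    void destroyAll() {
        poas.clear();
    }

    // Membership change: no suspend/resume, so creation is never rejected.
    void onMembershipChange(String newLabel) {
        membershipLabel = newLabel;  // step 1: update the membership label
        destroyAll();                // step 2: just destroy the POAs
    }
}
```

In this model the label is updated before the POAs are destroyed, so
any POA re-created after the destroy picks up the new label. For
example, findOrActivate("ejb") yields "ejb@v1" before
onMembershipChange("v2") and "ejb@v2" afterwards.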
I do plan to use the RFM at some point for general dynamic
reconfiguration, but I think it is likely that we could take a simpler
approach for IIOP FOLB, because the ONLY thing that needs to change is
the POA (e.g. in general we could be destroying and re-creating
transport buffer pools, and I REALLY don't want to do that while
handling requests).
Why do the POAs need to be destroyed? Would it be okay here to not
invoke ServerGroupManager#restartFactories?
If you do that, the ORB will never see cluster membership changes.
Ken.