dev@shoal.java.net

Re: [Shoal-Dev] Shutdown time is delayed for 30sec at "ABSTRACTING_TRANSPORT" branch

From: Joseph Fialli <Joseph.Fialli_at_Sun.COM>
Date: Thu, 14 Jan 2010 11:55:11 -0500

Bongjae,

See comments inline below.

-Joe

Bongjae Chang wrote:
> Hi,
> I could also see another problem.
> When I tried to shut down the shoal based on "ABSTRACTING_TRANSPORT"
> branch and using Grizzly transport, I could find the shutdown time is
> always delayed for 30sec.
> When I saw the thread dump, BlockingIOMulticastSender was waiting for
> 30sec. Please see the following thread dump.(The code line number can
> be different from current branch)
> ---
> "ServerControl [cts-76]" daemon prio=10 tid=0x09888400 nid=0x170d
> waiting on condition [0x88f1f000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x94e2b6e0> (a
> java.util.concurrent.CountDownLatch$Sync)
> at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:947)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1239)
> * at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:253)
> at
> com.sun.enterprise.mgmt.transport.BlockingIOMulticastSender.stop(BlockingIOMulticastSender.java:180)
> * - locked <0x94e2b148> (a
> com.sun.enterprise.mgmt.transport.BlockingIOMulticastSender)
> at
> com.sun.enterprise.mgmt.transport.jeus.JeusNetworkManager.stop(JeusNetworkManager.java:292)
> - locked <0x94e09de0> (a
> com.sun.enterprise.mgmt.transport.jeus.JeusNetworkManager)
> at com.sun.enterprise.mgmt.ClusterManager.stop(ClusterManager.java:343)
> - locked <0x94e036c0> (a com.sun.enterprise.mgmt.ClusterManager)
> at
> com.sun.enterprise.ee.cms.impl.base.GroupCommunicationProviderImpl.leave(GroupCommunicationProviderImpl.java:209)
> at
> com.sun.enterprise.ee.cms.impl.base.GMSContextImpl.leave(GMSContextImpl.java:159)
> at
> com.sun.enterprise.ee.cms.impl.common.GroupManagementServiceImpl.leave(GroupManagementServiceImpl.java:389)
> at
> com.sun.enterprise.ee.cms.impl.common.GroupManagementServiceImpl.shutdown(GroupManagementServiceImpl.java:286)
> ...
> ---
> Maybe, the problem is the regression of the following changes.
> ---
> File [changed]: BlockingIOMulticastSender.java
> Url:
> https://shoal.dev.java.net/source/browse/shoal/gms/src/java/com/sun/enterprise/mgmt/transport/BlockingIOMulticastSender.java?r1=1.1.2.5&r2=1.1.2.6
> <https://shoal.dev.java.net/source/browse/shoal/gms/src/java/com/sun/enterprise/mgmt/transport/BlockingIOMulticastSender.java?r1=1.1.2.5&r2=1.1.2.6>
> Delta lines: +6 -3
> -------------------
> --- BlockingIOMulticastSender.java 2009-07-21 05:50:58+0000 1.1.2.5
> +++ BlockingIOMulticastSender.java 2009-12-10 17:04:59+0000 1.1.2.6
> ...
> @@ -175,17 +176,19 @@
> multicastSocket.leaveGroup( multicastAddress );
> } catch( IOException e ) {
> }
> *- multicastSocket.close();
> - multicastSocket = null;
> * }
> boolean finished = false;
> try {
> *finished = endGate.await( shutdownTimeout, TimeUnit.MILLISECONDS );*
> } catch( InterruptedException e ) {
> }
> - if( !finished && multicastThread != null )
> + if( !finished && multicastThread != null ){
> multicastThread.interrupt();
> }
> *+ multicastSocket.close();
> + multicastSocket = null;
> *...
> + }
>
> public void run() {
> try {
> ---
> If the MulticastSocket is closed after
> endGate(CountDownLatch).await(), the endGate can almost never be waked
> because of I/O read block.
> See the following code:
> In BlockingIOMulticastSender.java
> ---
> ...
> private long shutdownTimeout = *30 * 1000*; // ms
> ...
> public void run() {
> try {
> while( running ) {
> ...
> try {
> * multicastSocket.receive( packet );*
> if( !running )
> return;
> ...
> }
> } catch(...) {
> ...
> } finally {
> multicastThread = null;
> * endGate.countDown();*
> }
> }
> ---
> So I think that the MulticastSocket should be closed before
> endGate.await().
> What do you think?
The motivation behind closing the multicastSocket AFTER interrupting the
MulticastSocket thread was I occasionally
was observing an uncaught NPE in multicastThread accessing the nulled
multicastsocket. (namely the line "multicastSocket.receive(packet)").

I do agree with your assessment that my change has the undesirable
effect of waiting 30 seconds.
I neglected to comprehend the endGate wait. I do understand it now. Wait
at least 30 seconds before interrupting the
thread to allow processing of the the currently being handled message.
Interrupt it after 30 seconds.

So I agree that the 12/10/2009 change to move the close should be reverted.

However, the concurrency issue on multicastSocket still needs to be
addressed between the thread calling stop() and the multicastSocket
thread both accessing
the variable at same time.

The following is close to addressing the concurrency issue but I am not
completely satisfied with it (so I did not do it this way originally.)

Declare multicastSocket as volatile and check if multicastSocket == null
before calling multicastSocket.receive() in the run method above.
The javadoc for the close states that another thread that is blocked
receiving from the closed socket will receive a SocketException,
so that sounds like it is okay.

Perhaps you have a better idea than above. Thanks for catching this.

-Joe
> Thanks.
>
> Regards,
> Bongjae Chang