admin@glassfish.java.net

Re: failure recovery

From: Bill Shannon <bill.shannon_at_oracle.com>
Date: Fri, 04 Jun 2010 16:13:28 -0700

Tom Mueller wrote on 06/ 4/10 03:05 PM:
> Presumably the recorded domain.xml mod time is going to be stored in a
> file other than domain.xml, because otherwise the domain.xml mod time
> would change when that value was written.

Right, it needs to be a separate "transaction log".

> So trying to answer the original questions:
>
> If the DAS is down when an instance starts up, we allow the instance
> to start and use its cached state. What if the instance was out of sync?
> When and how will we detect and correct that situation?
>
> Answer: The only way an instance can get out of sync is if the DAS is up
> and the instance is down while an admin command is executed, and then
> the DAS is down when the instance comes back up. In this case, the
> recorded mod time will be the same as the domain.xml mod time, so the
> query message isn't sent out. So the instance isn't told to restart :-(.

Yes, I think the instances need to keep track of the fact that they
started without knowing whether they were in sync, and when they see
the DAS come up they'll need to do something to find out if they're
in sync.

I worry about driving this check from both directions, because there
can be all sorts of race conditions.

I'd really like to avoid having the DAS contact all instances when it
comes up, just to find out if they're in sync, but maybe that's the
easiest and most reliable approach. The problem with that is that
when the DAS starts "cleanly" it will waste a lot of time contacting
instances that haven't been started yet, possibly delaying your ability
to start those instances.

Possibly the information needed to determine if an instance is in sync
(i.e., the domain.xml mod time) can be piggybacked on the GMS message
announcing that the DAS is ready? And possibly the domain.xml mod time
should be included with the "ready" message that GMS sends when an
instance comes online?
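
To make the piggyback idea concrete, here is a rough sketch of what an
instance might do when it sees the DAS "ready" announcement. Everything
named here (ReadyMessage, InstanceSyncState, onDasReady) is hypothetical
and is not an existing GMS or GlassFish API; it only assumes the message
can carry the DAS's domain.xml mod time as a long.

```java
/** Hypothetical payload piggybacked on a GMS "ready" announcement. */
final class ReadyMessage {
    final String sender;
    final long domainXmlModTime; // DAS's domain.xml mod time, in millis

    ReadyMessage(String sender, long domainXmlModTime) {
        this.sender = sender;
        this.domainXmlModTime = domainXmlModTime;
    }
}

/** Hypothetical per-instance record of whether sync has been verified. */
final class InstanceSyncState {
    private final long cachedModTime;   // mod time of the cached domain.xml
    private boolean startedUnverified;  // true if the DAS was down at startup

    InstanceSyncState(long cachedModTime, boolean dasWasUpAtStartup) {
        this.cachedModTime = cachedModTime;
        this.startedUnverified = !dasWasUpAtStartup;
    }

    /**
     * Called when the DAS announces it is ready.
     * Returns true if the instance is out of sync and needs to resync.
     */
    boolean onDasReady(ReadyMessage msg) {
        startedUnverified = false; // we now know one way or the other
        return msg.domainXmlModTime != cachedModTime;
    }

    boolean isUnverified() {
        return startedUnverified;
    }
}
```

With this shape the check is driven only by the GMS event, so the DAS
does not have to poll instances that were never started.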

> It seems that "up" in the 5th paragraph might have to be "configured" instances.
>
> If that is the case, then the instance being out of sync will be
> detected when the DAS starts, and will be corrected by the broadcast to
> all instances that are out of sync to restart themselves.
>
> Question: Do we want to make sure that the DAS doesn't request that all
> instances restart themselves at the same time? Because that would bring
> the whole cluster down.

Yes, Jerome and I talked about that. If we triggered the restart from
the DAS itself, rather than having each instance decide to restart
itself, the DAS could coordinate the restart of the cluster so that
only one node is down at a time. I think that's an enhancement for a
future release.
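
As a sketch only of what DAS-coordinated restart could look like:
the names below (RollingRestart, restartOneAtATime) are made up for
illustration, and the restartAndWait callback is assumed to block until
the instance is back up and to return false if it did not come back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical DAS-side helper; not part of any existing GlassFish API. */
final class RollingRestart {
    /**
     * Restarts the out-of-sync instances one at a time.
     * Stops at the first failure so that no more than one node is down
     * at any moment. Returns the instances that restarted successfully.
     */
    static List<String> restartOneAtATime(List<String> outOfSync,
                                          Predicate<String> restartAndWait) {
        List<String> restarted = new ArrayList<>();
        for (String name : outOfSync) {
            if (!restartAndWait.test(name)) {
                break; // leave the remaining instances running
            }
            restarted.add(name);
        }
        return restarted;
    }
}
```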

> We also need to handle the case where an instance gets out of sync
> because the communication link between the DAS and the instance is down,
> without either of the servers being down. I don't see how this design
> handles that case.

If a command sent to an instance fails because of a communication problem,
the instance needs to be marked "offline", and somehow this needs to be
coordinated with GMS. If communication is restored, GMS should
announce that the instance is online/ready again, which would trigger
the DAS to do the "are we in sync" request. (Or, as suggested above,
this would be done using data that is piggybacked on the GMS messages.)

> If the DAS executes an admin command locally, then replicates it to
> server instances, what happens if it crashes somewhere in the middle of this
> process? How does the DAS know whether or not an instance received the
> replicated command and executed it successfully?
>
> Answer: the DAS updates the recorded mod time only when all "configured"
> (vs. "up") instances have successfully executed the command. Success is
> confirmed by the DAS receiving a success code from the instance. If the
> instance crashes or fails to report success, the recorded mod time is
> not updated.

We need to support having configured instances that are never up.
The above wouldn't handle that.
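
To illustrate the problem, here is a minimal sketch of the rule as
described above. RecordedModTime and commit are hypothetical names, and
the map of per-instance results is an assumed representation of the
success codes the DAS collects.

```java
import java.util.Map;

/** Hypothetical "transaction log" entry kept by the DAS. */
final class RecordedModTime {
    private long recorded;

    RecordedModTime(long initial) {
        this.recorded = initial;
    }

    /**
     * results maps each configured instance to whether it reported success.
     * The recorded mod time advances only if every configured instance
     * succeeded; otherwise it is left unchanged.
     */
    boolean commit(long newDomainXmlModTime, Map<String, Boolean> results) {
        for (boolean ok : results.values()) {
            if (!ok) {
                return false;
            }
        }
        recorded = newDomainXmlModTime;
        return true;
    }

    long get() {
        return recorded;
    }
}
```

A configured instance that is never started can never report success,
so under this rule the recorded mod time would never advance.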

> Note: this is another case where the DAS has to periodically check with
> the instances if the mod times do not match (rather than just on startup).

My hope was that, by depending on GMS, we would have events that could
trigger these checks, rather than having to do them periodically.