Re: failure recovery

From: Tom Mueller <tom.mueller_at_oracle.com>
Date: Fri, 04 Jun 2010 17:05:21 -0500

Presumably the recorded domain.xml mod time is going to be stored in a
file other than domain.xml, because otherwise the domain.xml mod time
would change when that value was written.

So trying to answer the original questions:

    If the DAS is down when an instance starts up, we allow the instance
    to start and use its cached state. What if the instance was out of sync?
    When and how will we detect and correct that situation?

Answer: The only way an instance can get out of sync is if the DAS is up
and the instance is down while an admin command is executed, and then
the DAS is down when the instance comes back up. In this case, the
recorded mod time will be the same as the domain.xml mod time, so the
query message isn't sent out. So the instance isn't told to restart :-(.

It seems that "up" in the 5th paragraph might have to be "configured"
instances.

If that is the case, then the instance being out of sync will be
detected when the DAS starts, and will be corrected by the broadcast
telling all instances that are out of sync to restart themselves.
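
A minimal sketch of that startup check, to make the comparison concrete. It assumes each configured instance answers the query with the mod time of its cached domain.xml; the class and method names are hypothetical and not taken from the GlassFish code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from GlassFish.
final class StartupSyncCheck {
    /**
     * Decides which instances the DAS should tell to restart when it starts.
     *
     * @param domainXmlModTime current modification time of domain.xml
     * @param recordedModTime  mod time saved (in a separate file) after the
     *                         last command that every configured instance ran
     * @param reportedModTimes assumed query replies: instance name mapped to
     *                         the mod time of that instance's cached copy
     */
    static List<String> instancesToRestart(long domainXmlModTime,
                                           long recordedModTime,
                                           Map<String, Long> reportedModTimes) {
        List<String> outOfSync = new ArrayList<>();
        // Matching mod times mean the last command reached everyone,
        // so no query is sent and nobody is restarted.
        if (domainXmlModTime == recordedModTime) {
            return outOfSync;
        }
        for (Map.Entry<String, Long> reply : reportedModTimes.entrySet()) {
            if (reply.getValue().longValue() != domainXmlModTime) {
                outOfSync.add(reply.getKey());
            }
        }
        return outOfSync;
    }
}
```

With "configured" semantics the recorded mod time lags behind domain.xml whenever any configured instance missed a command, which is what makes the mismatch visible at DAS startup.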

Question: Do we want to make sure that the DAS doesn't request that all
instances restart themselves at the same time? Because that would bring
the whole cluster down.
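
One possible way to avoid that, sketched under the assumption that the DAS restarts out-of-sync instances in small batches and waits for each batch to come back before starting the next. The batch size and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a staggered (rolling) restart plan.
final class StaggeredRestart {
    /**
     * Splits the out-of-sync instances into batches so that at most
     * maxConcurrent instances are restarting at any one time.
     */
    static List<List<String>> batches(List<String> outOfSync, int maxConcurrent) {
        if (maxConcurrent < 1) {
            throw new IllegalArgumentException("maxConcurrent must be at least 1");
        }
        List<List<String>> plan = new ArrayList<>();
        for (int start = 0; start < outOfSync.size(); start += maxConcurrent) {
            int end = Math.min(start + maxConcurrent, outOfSync.size());
            // Copy the sublist so each batch is independent of the input list.
            plan.add(new ArrayList<>(outOfSync.subList(start, end)));
        }
        return plan;
    }
}
```

The DAS would then send the restart request to one batch at a time, so part of the cluster keeps serving requests throughout.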

We also need to handle the case where an instance gets out of sync
because the communication link between the DAS and the instance is down,
without either of the servers being down. I don't see how this design
handles that case.

    If the DAS executes an admin command locally, then replicates it to
    server instances, what happens if it crashes somewhere in the middle
    of this process? How does the DAS know whether or not an instance
    received the replicated command and executed it successfully?

Answer: the DAS updates the recorded mod time only when all "configured"
(vs. "up") instances have successfully executed the command. Success is
confirmed by the DAS receiving a success code from the instance. If the
instance crashes or fails to report success, the recorded mod time is
not updated.
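
A rough sketch of that rule, assuming the DAS collects one success flag per instance after replicating a command. The class, its fields, and the result map are hypothetical illustrations, not the actual GlassFish API.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: tracks the recorded mod time on the DAS.
final class ModTimeRecorder {
    private long recordedModTime;

    ModTimeRecorder(long initialRecordedModTime) {
        this.recordedModTime = initialRecordedModTime;
    }

    long recordedModTime() {
        return recordedModTime;
    }

    /**
     * Advances the recorded mod time only if every configured instance
     * reported success. An instance that is down, crashed, or never
     * replied has no TRUE entry in results and blocks the update.
     *
     * @return true if the recorded mod time was updated
     */
    boolean commandReplicated(Set<String> configuredInstances,
                              Map<String, Boolean> results,
                              long newDomainXmlModTime) {
        for (String instance : configuredInstances) {
            if (!Boolean.TRUE.equals(results.get(instance))) {
                return false; // leave the recorded mod time untouched
            }
        }
        recordedModTime = newDomainXmlModTime;
        return true;
    }
}
```

If the DAS itself crashes partway through replication, the update never happens, so the mismatch between the two mod times survives the crash and can be detected on restart.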

Note: this is another case where the DAS has to check with the instances
periodically for as long as the mod times do not match (rather than just
on startup).
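
A sketch of such a periodic check, assuming a simple timer on the DAS. The suppliers, the query callback, and the period are hypothetical placeholders for whatever the real design provides.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical sketch: re-queries instances while the mod times differ.
final class PeriodicSyncCheck {
    private final LongSupplier domainXmlModTime;
    private final LongSupplier recordedModTime;
    private final Runnable queryInstances;

    PeriodicSyncCheck(LongSupplier domainXmlModTime,
                      LongSupplier recordedModTime,
                      Runnable queryInstances) {
        this.domainXmlModTime = domainXmlModTime;
        this.recordedModTime = recordedModTime;
        this.queryInstances = queryInstances;
    }

    /** One pass of the check; returns true if the instances were queried. */
    boolean checkOnce() {
        if (domainXmlModTime.getAsLong() == recordedModTime.getAsLong()) {
            return false; // everything is in sync, nothing to ask
        }
        queryInstances.run();
        return true;
    }

    /** Runs checkOnce repeatedly; the caller shuts the returned timer down. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleWithFixedDelay(this::checkOnce, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return timer;
    }
}
```

The same loop would also cover an instance that missed a command because only the link to the DAS was down, since the check does not depend on either server restarting.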

Tom

On 6/4/2010 3:40 PM, Bill Shannon wrote:
> Jerome and I have been discussing some issues related to failure
> recovery.
> I wrote up the attached to describe the approach we're considering. I'd
> appreciate feedback.
>
> Thanks.