I'd consider having the whole cluster go down to resync a serious
enough problem that we need to fix it for this release. The
first time this happens in the field, this would be a real "egg on our
face" situation.
It seems that the design still handles this case incorrectly:
* DAS and two instances are up, but DAS cannot communicate with
instance 1 because the LAN is down. So DAS thinks instance 1 is
down, but instance 1 thinks the DAS is down.
* An admin command is executed, and the DAS replicates it to
instance 2 successfully, and updates the domain.xml mod time.
* The LAN is brought up. GMS notifies the DAS that instance 1 is now
up and instance 1 is notified that the DAS is up.
* The DAS sees that the recorded domain.xml mod time is the same as
the current domain.xml mod time, so it thinks everything is
synced. Instance 1 is not told to restart, so it continues to
operate with out-of-date data.
It seems that a solution would be that whenever an instance
reestablishes communication with the DAS, it should check whether a sync
is needed, not just when it is starting.
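
Roughly, something like this on the instance side (just a sketch; the
reconnect hook and the SyncClient type are made-up names, not the real GMS
or synchronization APIs):

    // Sketch of an instance-side check that runs whenever the instance
    // regains contact with the DAS, not just at instance startup.
    // All types below are hypothetical placeholders.
    public class DasReconnectCheck {

        /** Hypothetical client for asking the DAS about domain.xml. */
        interface SyncClient {
            long fetchDasDomainXmlModTime();   // mod time of domain.xml on the DAS
            void requestFullSyncAndRestart();  // pull fresh config and restart
        }

        private final SyncClient syncClient;
        private final long cachedDomainXmlModTime;  // mod time of the local cached copy

        public DasReconnectCheck(SyncClient syncClient, long cachedDomainXmlModTime) {
            this.syncClient = syncClient;
            this.cachedDomainXmlModTime = cachedDomainXmlModTime;
        }

        /** Call this from the GMS "DAS joined/ready" event handler. */
        public void onDasReachable() {
            // Compare against the DAS's current domain.xml, not the recorded
            // mod time, so a change made while the link was down is caught.
            long dasModTime = syncClient.fetchDasDomainXmlModTime();
            if (dasModTime != cachedDomainXmlModTime) {
                syncClient.requestFullSyncAndRestart();
            }
        }
    }
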
Tom
On 6/4/2010 6:13 PM, Bill Shannon wrote:
> Tom Mueller wrote on 06/ 4/10 03:05 PM:
>> Presumably the recorded domain.xml mod time is going to be stored in a
>> file other than domain.xml, because otherwise the domain.xml mod time
>> would change when that value was written.
>
> Right, it needs to be a separate "transaction log".
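>
> Something along these lines, kept next to domain.xml but in its own file
> (the file name and format here are just placeholders):
>
>     // Sketch: keep the "recorded" domain.xml mod time in a small side file
>     // so updating it never changes domain.xml itself. Names are placeholders.
>     import java.io.IOException;
>     import java.nio.charset.StandardCharsets;
>     import java.nio.file.Files;
>     import java.nio.file.Path;
>
>     public class ModTimeLog {
>         private final Path logFile;  // e.g. <domain-dir>/config/.consync
>
>         public ModTimeLog(Path configDir) {
>             this.logFile = configDir.resolve(".consync");
>         }
>
>         public void record(long domainXmlModTime) throws IOException {
>             Files.write(logFile,
>                     Long.toString(domainXmlModTime).getBytes(StandardCharsets.UTF_8));
>         }
>
>         public long lastRecorded() throws IOException {
>             if (!Files.exists(logFile)) {
>                 return -1L;  // nothing recorded yet
>             }
>             return Long.parseLong(
>                     new String(Files.readAllBytes(logFile), StandardCharsets.UTF_8).trim());
>         }
>     }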
>
>> So trying to answer the original questions:
>>
>> If the DAS is down when an instance starts up, we allow the instance
>> to start and use its cached state. What if the instance was out of sync?
>> When and how will we detect and correct that situation?
>>
>> Answer: The only way an instance can get out of sync is if the DAS is up
>> and the instance is down while an admin command is executed, and then
>> the DAS is down when the instance comes back up. In this case, the
>> recorded mod time will be the same as the domain.xml mod time, so the
>> query message isn't sent out. So the instance isn't told to restart :-(.
>
> Yes, I think the instances need to keep track of the fact that they
> started without knowing whether they were in sync, and when they see
> the DAS come up they'll need to do something to find out if they're
> in sync.
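>
> Something like this might be enough (a sketch; the marker file and the
> method names are made up, not an existing mechanism):
>
>     // Sketch: an instance that starts while the DAS is unreachable drops a
>     // marker file; when it later sees the DAS come up it knows it still owes
>     // a sync check. The file name and layout are hypothetical.
>     import java.io.IOException;
>     import java.nio.file.Files;
>     import java.nio.file.Path;
>
>     public class UnverifiedStartMarker {
>         private final Path marker;  // e.g. <instance-dir>/config/.sync-unverified
>
>         public UnverifiedStartMarker(Path instanceConfigDir) {
>             this.marker = instanceConfigDir.resolve(".sync-unverified");
>         }
>
>         /** Called at startup when the DAS could not be reached. */
>         public void markStartedWithoutSyncCheck() throws IOException {
>             Files.write(marker, new byte[0]);
>         }
>
>         /** Called when GMS reports the DAS is up; true means we must verify. */
>         public boolean needsSyncCheck() {
>             return Files.exists(marker);
>         }
>
>         /** Called once the sync check (and any resync) has completed. */
>         public void clear() throws IOException {
>             Files.deleteIfExists(marker);
>         }
>     }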
>
> I worry about driving this check from both directions, because there
> can be all sorts of race conditions.
>
> I'd really like to avoid having the DAS contact all instances when it
> comes up, just to find out if they're in sync, but maybe that's the
> easiest and most reliable approach. The problem with that is that
> when the DAS starts "cleanly" it will waste a lot of time contacting
> instances that haven't been started yet, possibly delaying your ability
> to start those instances.
>
> Possibly the information needed to determine if an instance is in sync
> (i.e., the domain.xml mod time) can be piggybacked on the GMS message
> announcing that the DAS is ready? And possibly the domain.xml mod time
> should be included with the "ready" message that GMS sends when an
> instance comes online?
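>
> In other words, roughly the following (this isn't the real Shoal/GMS API;
> the details map and key name are made up for illustration):
>
>     // Sketch: both sides publish their domain.xml mod time as part of the
>     // GMS "ready" announcement, so neither side needs a separate round trip
>     // just to find out whether a sync is needed. All names are hypothetical.
>     import java.util.HashMap;
>     import java.util.Map;
>
>     public class ReadyAnnouncement {
>         static final String MOD_TIME_KEY = "domain.xml.modtime";
>
>         /** Details to attach to our own GMS "ready" message. */
>         public static Map<String, Object> memberDetails(long domainXmlModTime) {
>             Map<String, Object> details = new HashMap<String, Object>();
>             details.put(MOD_TIME_KEY, Long.valueOf(domainXmlModTime));
>             return details;
>         }
>
>         /** Decide, from a received "ready" announcement, whether to resync. */
>         public static boolean syncNeeded(Map<String, Object> remoteDetails,
>                                          long localDomainXmlModTime) {
>             Long remote = (Long) remoteDetails.get(MOD_TIME_KEY);
>             // If the other side didn't include a mod time, play it safe and resync.
>             return remote == null || remote.longValue() != localDomainXmlModTime;
>         }
>     }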
>
>> It seems that "up" in the 5th paragraph might have to be "configured"
>> instances.
>>
>> If that is the case, then the instance being out of sync will be
>> detected when the DAS starts, and will be corrected by the broadcast to
>> all instances that are out of sync to restart themselves.
>>
>> Question: Do we want to make sure that the DAS doesn't request that all
>> instances restart themselves at the same time? Because that would bring
>> the whole cluster down.
>
> Yes, Jerome and I talked about that. If we triggered the restart from
> the DAS itself, rather than having each instance decide to restart
> itself, the DAS could coordinate the restart of the cluster so that
> only one node is down at a time. I think that's an enhancement for a
> future release.
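>
> For the record, the coordination itself wouldn't need to be much more
> than this (sketch only; the InstanceAdmin calls are made-up names):
>
>     // Sketch: DAS-driven rolling restart so that at most one instance in
>     // the cluster is down at any time. InstanceAdmin is a hypothetical
>     // handle for talking to a single instance.
>     import java.util.List;
>
>     public class RollingRestart {
>
>         interface InstanceAdmin {
>             boolean isOutOfSync();
>             void restart();      // tells the instance to resync and restart
>             void awaitReady();   // blocks until the instance reports ready
>         }
>
>         /** Restart out-of-sync instances one at a time. */
>         public void restartOutOfSync(List<InstanceAdmin> cluster) {
>             for (InstanceAdmin instance : cluster) {
>                 if (!instance.isOutOfSync()) {
>                     continue;
>                 }
>                 instance.restart();
>                 // Wait for this one to come back before touching the next,
>                 // so the resync never takes down more than one member.
>                 instance.awaitReady();
>             }
>         }
>     }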
>
>> We also need to handle the case where an instance gets out of sync
>> because the communication link between the DAS and the instance is down,
>> without either of the servers being down. I don't see how this design
>> handles that case.
>
> If a command sent to an instance fails because of a communication problem,
> the instance needs to be marked "offline", and somehow this needs to be
> coordinated with GMS. If communication is restored, GMS should
> announce that the instance is online/ready again, which would trigger
> the DAS to do the "are we in sync" request. (Or, as suggested above,
> this would be done using data that is piggybacked on the GMS messages.)
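>
> The DAS-side bookkeeping would look something like this (all of these
> names are placeholders, not existing code):
>
>     // Sketch: a failed replication marks the instance offline; the GMS
>     // "instance ready" event triggers the "are we in sync" check.
>     import java.util.Map;
>     import java.util.concurrent.ConcurrentHashMap;
>
>     public class InstanceTracker {
>         enum InstanceState { ONLINE, OFFLINE }
>
>         private final Map<String, InstanceState> states =
>                 new ConcurrentHashMap<String, InstanceState>();
>
>         /** Called when replicating a command to an instance fails. */
>         public void replicationFailed(String instanceName) {
>             states.put(instanceName, InstanceState.OFFLINE);
>         }
>
>         /** Called from the GMS listener when an instance announces it is ready. */
>         public void instanceReady(String instanceName, long instanceModTime,
>                                   long dasDomainXmlModTime) {
>             states.put(instanceName, InstanceState.ONLINE);
>             if (instanceModTime != dasDomainXmlModTime) {
>                 requestRestart(instanceName);  // out of date: resync and restart
>             }
>         }
>
>         private void requestRestart(String instanceName) {
>             // Placeholder for the actual "resync yourself and restart" request.
>             System.out.println("requesting restart of " + instanceName);
>         }
>     }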
>
>> If the DAS executes an admin command locally, then replicates it to
>> server instances, what happens if it crashes somewhere in the middle
>> of this process? How does the DAS know whether or not an instance
>> received the replicated command and executed it successfully?
>>
>> Answer: the DAS updates the recorded mod time only when all "configured"
>> (vs. "up") instances have successfully executed the command. Success is
>> confirmed by the DAS receiving a success code from the instance. If the
>> instance crashes or fails to report success, the recorded mod time is
>> not updated.
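>>
>> As a sketch of what I mean (the replicate() call and the mod-time log
>> are made-up names, not existing code):
>>
>>     // Sketch: replicate the command to every configured instance and only
>>     // record the new domain.xml mod time if every one of them confirms
>>     // success. Target and ModTimeLog are hypothetical interfaces.
>>     import java.util.List;
>>
>>     public class ReplicateAndRecord {
>>
>>         interface Target {
>>             boolean replicate(String command);  // true only on a success reply
>>         }
>>
>>         interface ModTimeLog {
>>             void record(long domainXmlModTime);
>>         }
>>
>>         public void run(String command, List<Target> configuredInstances,
>>                         long newDomainXmlModTime, ModTimeLog log) {
>>             boolean allSucceeded = true;
>>             for (Target instance : configuredInstances) {
>>                 if (!instance.replicate(command)) {
>>                     allSucceeded = false;  // crashed or never reported success
>>                 }
>>             }
>>             // Leave the recorded mod time stale unless everyone confirmed, so
>>             // the mismatch forces a later check for instances that missed it.
>>             if (allSucceeded) {
>>                 log.record(newDomainXmlModTime);
>>             }
>>         }
>>     }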
>
> We need to support having configured instances that are never up.
> The above wouldn't handle that.
>
>> Note: this is another case where the DAS has to periodically check with
>> the instances if the mod times do not match (rather than just on
>> startup).
>
> My hope was that, by depending on GMS, we would have events that could
> trigger these checks, rather than having to do them periodically.
>