Re: failure recovery

From: Bill Shannon <bill.shannon_at_oracle.com>
Date: Mon, 07 Jun 2010 11:31:30 -0700

Byron Nevins wrote on 06/ 5/10 12:44 PM:
>
>
> On 6/4/2010 3:05 PM, Tom Mueller wrote:
>> Presumably the recorded domain.xml mod time is going to be stored in a
>> file other than domain.xml, because otherwise the domain.xml mod time
>> would change when that value was written.
>>
>> So trying to answer the original questions:
>>
>> If the DAS is down when an instance starts up, we allow the instance
>> to start and use its cached state. What if the instance was out of
>> sync?
>> When and how will we detect and correct that situation?
>>
>> Answer: The only way an instance can get out of sync is if the DAS is
>> up and the instance is down while an admin command is executed, and
>> then the DAS is down when the instance comes back up. In this case,
>> the recorded mod time will be the same as the domain.xml mod time, so
>> the query message isn't sent out. So the instance isn't told to
>> restart :-(.
> I don't see a problem here. The DAS is down. The instance just starts up
> and runs. It has an out-of-date configuration but nothing can be done
> about that. When DAS starts up it simply sends off the commands to the
> instance from its journal file.

The DAS is *not* keeping track of every admin command in a journal file
to replay when the instance comes up.

>> It seems that "up" in the 5th paragraph might have to be "configured"
>> instances.
>>
>> If that is the case, then the instance being out of sync will be
>> detected when the DAS starts, and will be corrected by the broadcast to
>> all instances that are out of sync to restart themselves.
>>
>> Question: Do we want to make sure that the DAS doesn't request that
>> all instances restart themselves at the same time? Because that would
>> bring the whole cluster down.
>>
>> We also need to handle the case where an instance gets out of sync
>> because the communication link between the DAS and the instance is
>> down, without either of the servers being down. I don't see how this
>> design handles that case.
>>
>> If the DAS executes an admin command locally, then replicates it to
>> server instances, what happens if it crashes somewhere in the middle of
>> this process? How does the DAS know whether or not an instance received
>> the replicated command and executed it successfully?
>>
>> Answer: the DAS updates the recorded mod time only when all
>> "configured" (vs. "up") instances have successfully executed the
>> command. Success is confirmed by the DAS receiving a success code from
>> the instance. If the instance crashes or fails to report success, the
>> recorded mod time is not updated.
>>
>> Note: this is another case where the DAS has to periodically check
>> with the instances if the mod times do not match (rather than just on
>> startup).
> Why should DAS ever check -- other than when it is processing an admin
> command?

If they're out of sync, how long do you want that condition to persist
before detecting and fixing it? If you wait until an admin command
that affects the instance is issued, it could be a very long time.
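
To make the bookkeeping under discussion concrete, here is a rough sketch
of the "recorded mod time" idea as Tom describes it, plus a check that can
be run periodically rather than only at DAS startup. This is not GlassFish
code. The class name, the method names, and the replicate/instance_mtime
callbacks are all hypothetical, and it assumes an instance can report the
mod time of its cached domain.xml.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Das:
    """Hypothetical model of the DAS-side state discussed in this thread."""
    domain_xml_mtime: float   # current mod time of domain.xml
    recorded_mtime: float     # mod time last confirmed by all configured instances
    configured: List[str] = field(default_factory=list)

    def run_admin_command(self, new_mtime: float,
                          replicate: Callable[[str], bool]) -> List[str]:
        """Apply a command locally, replicate it, return instances that failed."""
        self.domain_xml_mtime = new_mtime
        failed = []
        for name in self.configured:
            try:
                ok = replicate(name)   # True means the instance reported success
            except ConnectionError:
                ok = False             # instance down or link down
            if not ok:
                failed.append(name)
        # Only advance the recorded mod time when every configured instance
        # (not just every "up" instance) confirmed the command.
        if not failed:
            self.recorded_mtime = new_mtime
        return failed

    def in_sync(self) -> bool:
        return self.recorded_mtime == self.domain_xml_mtime

    def periodic_check(self, instance_mtime: Callable[[str], float]) -> List[str]:
        """Return the out-of-sync instances; meant to run on a timer."""
        if self.in_sync():
            return []                  # nothing to ask the instances about
        stale = []
        for name in self.configured:
            try:
                if instance_mtime(name) != self.domain_xml_mtime:
                    stale.append(name)
            except ConnectionError:
                stale.append(name)     # unreachable, so still assumed stale
        if not stale:
            self.recorded_mtime = self.domain_xml_mtime
        return stale
```

The caller would decide what to do with the list returned by
periodic_check, for example asking those instances to restart one at a
time so the whole cluster is not brought down at once.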