admin@glassfish.java.net

Re: failure recovery

From: Byron Nevins <byron.nevins_at_oracle.com>
Date: Sat, 05 Jun 2010 12:44:26 -0700


On 6/4/2010 3:05 PM, Tom Mueller wrote:
Presumably the recorded domain.xml mod time is going to be stored in a file other than domain.xml, because otherwise the domain.xml mod time would change when that value was written.

So trying to answer the original questions:
If the DAS is down when an instance starts up, we allow the instance
to start and use its cached state. What if the instance was out of sync?
When and how will we detect and correct that situation?
Answer: The only way an instance can get out of sync is if the DAS is up and the instance is down while an admin command is executed, and then the DAS is down when the instance comes back up. In this case, the recorded mod time will be the same as the domain.xml mod time, so the query message isn't sent out. So the instance isn't told to restart :-(.
I don't see a problem here. The DAS is down. The instance just starts up and runs. It has an out-of-date configuration, but nothing can be done about that. When the DAS starts up, it simply sends the commands from its journal file to the instance.
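
A minimal sketch of the startup comparison being discussed, assuming the recorded mod time lives in a separate file and the domain.xml mod time comes from something like java.io.File.lastModified(). The class and method names are hypothetical, not GlassFish APIs:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the DAS startup check; not a GlassFish API. */
class StartupSyncCheck {

    /** True when domain.xml changed after the last fully replicated command. */
    static boolean needsSyncQuery(long recordedModTime, long domainXmlModTime) {
        return recordedModTime != domainXmlModTime;
    }

    /**
     * Names of the instances the DAS would query on startup.
     * Empty when the mod times match, which is exactly the case
     * where an out-of-sync instance goes unnoticed.
     */
    static List<String> instancesToQuery(long recordedModTime,
                                         long domainXmlModTime,
                                         List<String> configuredInstances) {
        if (!needsSyncQuery(recordedModTime, domainXmlModTime)) {
            return new ArrayList<>();
        }
        return new ArrayList<>(configuredInstances);
    }
}
```

When the two values match, no query goes out, so whether an instance can still be stale depends on when the recorded mod time gets updated.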

It seems that "up" in the 5th paragraph might have to be "configured" instances.

If that is the case, then the instance being out of sync will be detected when the DAS starts, and will be corrected by the broadcast telling all out-of-sync instances to restart themselves.

Question: Do we want to make sure that the DAS doesn't request that all instances restart themselves at the same time?  Because that would bring the whole cluster down.

We also need to handle the case where an instance gets out of sync because the communication link between the DAS and the instance is down, without either of the servers being down. I don't see how this design handles that case.
If the DAS executes an admin command locally, then replicates it to server
instances, what happens if it crashes somewhere in the middle of this
process? How does the DAS know whether or not an instance received the
replicated command and executed it successfully?
Answer: the DAS updates the recorded mod time only when all "configured" (vs. "up") instances have successfully executed the command. Success is confirmed by the DAS receiving a success code from the instance. If the instance crashes or fails to report success, the recorded mod time is not updated.
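
The rule in this answer could be sketched roughly as follows. This is only an illustration: the class name, the method names, and the map of per-instance results are assumptions made for the example, not part of the proposed design or of GlassFish:

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: advance the recorded mod time only on full success. */
class ModTimeRecorder {
    private long recordedModTime;

    ModTimeRecorder(long initialModTime) {
        this.recordedModTime = initialModTime;
    }

    long recordedModTime() {
        return recordedModTime;
    }

    /**
     * results maps instance name to true if that instance returned a success code.
     * A configured instance that is missing from results (down, unreachable,
     * or crashed before replying) counts as a failure.
     * Returns true only if the recorded mod time was advanced.
     */
    boolean commandReplicated(Set<String> configuredInstances,
                              Map<String, Boolean> results,
                              long newDomainXmlModTime) {
        for (String name : configuredInstances) {
            if (!Boolean.TRUE.equals(results.get(name))) {
                return false; // leave recordedModTime unchanged
            }
        }
        recordedModTime = newDomainXmlModTime;
        return true;
    }
}
```

Iterating over the configured instances rather than the ones that answered is what makes a silent or unreachable instance leave the two mod times different.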

Note: this is another case where the DAS has to periodically check with the instances if the mod times do not match (rather than just on startup).
Why should the DAS ever check, other than when it is processing an admin command?


Tom

On 6/4/2010 3:40 PM, Bill Shannon wrote:
Jerome and I have been discussing some issues related to failure recovery.
I wrote up the attached to describe the approach we're considering.  I'd
appreciate feedback.

Thanks.

-- 
Byron Nevins  -  Oracle Corporation
Home: 650-359-1290
Cell: 650-784-4123
Sierra: 209-295-2188