On 6/4/2010 3:05 PM, Tom Mueller wrote:
Presumably the recorded domain.xml mod time is going to be stored in a
file other than domain.xml, because otherwise the domain.xml mod time
would change when that value was written.
So trying to answer the original questions:
If the DAS is down when an instance starts up, we allow the instance
to start and use its cached state. What if the instance was out of sync?
When and how will we detect and correct that situation?
Answer: The only way an instance can get out of sync is if the DAS is
up and the instance is down while an admin command is executed, and
then the DAS is down when the instance comes back up. In this case, the
recorded mod time will be the same as the domain.xml mod time, so the
query message isn't sent out. So the instance isn't told to restart
:-(.
I don't see a problem here. The DAS is down. The instance just starts
up and runs. It has an out-of-date configuration but nothing can be
done about that. When DAS starts up it simply sends off the commands
to the instance from its journal file.
It seems that "up" in the 5th paragraph might have to be "configured"
instances.
If that is the case, then the instance being out of sync will be
detected when the DAS starts, and will be corrected by the broadcast to
all instances that are out of sync to restart themselves.
Question: Do we want to make sure that the DAS doesn't request that all
instances restart themselves at the same time? Because that would
bring the whole cluster down.
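One possible way to avoid that, sketched here only as an illustration: have the DAS split the out-of-sync instances into small batches and restart one batch at a time. The class name RollingRestartPlanner, the method plan, and the maxConcurrent limit are all hypothetical; nothing in the proposal specifies them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the proposed design or any GlassFish API.
final class RollingRestartPlanner {

    /**
     * Splits the out-of-sync instances into batches of at most maxConcurrent,
     * so the DAS can restart one batch, wait for it to come back, and only
     * then move on to the next batch.
     */
    static List<List<String>> plan(List<String> outOfSync, int maxConcurrent) {
        if (maxConcurrent < 1) {
            throw new IllegalArgumentException("maxConcurrent must be >= 1");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < outOfSync.size(); i += maxConcurrent) {
            int end = Math.min(i + maxConcurrent, outOfSync.size());
            // Copy the slice so each batch is independent of the input list.
            batches.add(new ArrayList<>(outOfSync.subList(i, end)));
        }
        return batches;
    }
}
```

With three out-of-sync instances and maxConcurrent set to 2, this yields two batches, so at least one instance stays up while the others restart.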
We also need to handle the case where an instance gets out of sync
because the communication link between the DAS and the instance is
down, without either of the servers being down. I don't see how this
design handles that case.
If the DAS executes an admin command locally, then replicates it to server
instances, what happens if it crashes somewhere in the middle of this
process? How does the DAS know whether or not an instance received the
replicated command and executed it successfully?
Answer: the DAS updates the recorded mod time only when all
"configured" (vs. "up") instances have successfully executed the
command. Success is confirmed by the DAS receiving a success code from
the instance. If the instance crashes or fails to report success, the
recorded mod time is not updated.
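A minimal sketch of that rule, under stated assumptions: ModTimeTracker, replicate, and needsSyncCheck are hypothetical names, the Predicate stands in for whatever call sends the command and waits for the instance's success code, and mod times are plain long values. It is not the actual DAS code.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not part of the proposed design or any GlassFish API.
final class ModTimeTracker {
    private long recordedModTime;

    ModTimeTracker(long initialRecordedModTime) {
        this.recordedModTime = initialRecordedModTime;
    }

    long recordedModTime() {
        return recordedModTime;
    }

    /**
     * Sends the command to every configured instance (not only those that
     * are up). The recorded mod time advances only if all of them report
     * success; an exception is treated the same as a missing success code.
     */
    boolean replicate(List<String> configuredInstances,
                      Predicate<String> sendAndAwaitSuccess,
                      long newDomainXmlModTime) {
        boolean allSucceeded = true;
        for (String instance : configuredInstances) {
            boolean ok;
            try {
                ok = sendAndAwaitSuccess.test(instance);
            } catch (RuntimeException e) {
                ok = false; // instance crashed or the link dropped
            }
            if (!ok) {
                allSucceeded = false; // keep going so other instances still get the command
            }
        }
        if (allSucceeded) {
            recordedModTime = newDomainXmlModTime;
        }
        return allSucceeded;
    }

    /** True when the recorded value and domain.xml disagree, i.e. some instance may be out of sync. */
    boolean needsSyncCheck(long currentDomainXmlModTime) {
        return recordedModTime != currentDomainXmlModTime;
    }
}
```

If any instance fails, the recorded mod time stays behind the domain.xml mod time, which is the mismatch the DAS would later use to decide that a sync check is needed.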
Note: this is another case where the DAS has to periodically check with
the instances if the mod times do not match (rather than just on
startup).
Why should DAS ever check -- other than when it is processing an admin
command?
Tom
On 6/4/2010 3:40 PM, Bill Shannon wrote:
Jerome and I have been discussing some issues related to failure recovery.
I wrote up the attached to describe the approach we're considering. I'd
appreciate feedback.
Thanks.
--
Byron Nevins - Oracle Corporation
Home: 650-359-1290
Cell: 650-784-4123
Sierra: 209-295-2188