There are several synchronization-related failure cases we need to handle.

If the DAS is down when an instance starts up, we allow the instance
to start and use its cached state.  What if the instance was out of sync?
When and how will we detect and correct that situation?

If the DAS executes an admin command locally, then replicates it to server
instances, what happens if it crashes somewhere in the middle of this
process?  How does the DAS know whether or not an instance received the
replicated command and executed it successfully?


Here's an approach to handling these issues...

After successfully executing a replicated command on all "up" instances,
the DAS records the mod time of the domain.xml.

When the DAS starts, it compares the actual mod time of domain.xml with
the recorded mod time from above.  If they're the same, all instances
that were up when the most recent admin command was executed have
successfully executed the command, and are in sync.

If the times are different, at least one instance may be out of sync.
(Note that the mod time could be recorded per-instance as each instance
reports success.)  The DAS then sends a command to all instances (or
all instances that aren't known to be in sync).  The command says "this
is my domain.xml mod time, are we in sync?"  Each instance reports
yes or no.
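
As a rough illustration of the startup check described above, here is
a minimal Java sketch.  The names (StartupSyncCheck, InstanceProbe,
findOutOfSync) are hypothetical and not taken from the DAS code, and
the probe interface merely stands in for the real "are we in sync?"
command.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the actual DAS code.
final class StartupSyncCheck {

    /** Stand-in for the "this is my mod time, are we in sync?" command. */
    interface InstanceProbe {
        boolean isInSync(String instanceName, long dasModTime);
    }

    /**
     * Returns the names of the instances that answered "no".
     *
     * recordedModTime: mod time saved after the last fully replicated command.
     * actualModTime:   current mod time of domain.xml.
     */
    static List<String> findOutOfSync(long recordedModTime, long actualModTime,
                                      List<String> instances, InstanceProbe probe) {
        List<String> outOfSync = new ArrayList<>();
        if (recordedModTime == actualModTime) {
            // The last command reached every instance that was up; nothing to ask.
            return outOfSync;
        }
        for (String name : instances) {
            if (!probe.isInSync(name, actualModTime)) {
                outOfSync.add(name);
            }
        }
        return outOfSync;
    }
}
```

If the mod time were recorded per-instance, the instance list passed
in could be narrowed to those instances not already known to be in sync.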

If the instance reports "no", the DAS marks the instance as "offline",
and the instance restarts itself.  The instance will then resynchronize
at startup.

If an instance is marked offline, commands are not replicated to that
instance.  The user is warned that the instance is down.
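
A minimal sketch of how replication might skip offline instances and
collect warnings for the user.  The class name, method names, and
warning text are assumptions made for illustration; the restart of
the instance itself is not modeled here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the class and method names are illustrative only.
final class OfflineAwareReplicator {
    // instance name -> true if the DAS has marked the instance offline
    private final Map<String, Boolean> offline = new LinkedHashMap<>();

    // "instance:command" pairs that were actually replicated
    final List<String> sent = new ArrayList<>();

    void register(String instance) {
        offline.put(instance, false);
    }

    /** Called when an instance answers "no" to the sync check. */
    void markOffline(String instance) {
        offline.put(instance, true);
    }

    /**
     * Replicates the command to every instance not marked offline.
     * Returns one warning per skipped instance, to be shown to the user.
     */
    List<String> replicate(String command) {
        List<String> warnings = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : offline.entrySet()) {
            if (e.getValue()) {
                warnings.add("Warning: instance " + e.getKey()
                        + " is down; command not replicated");
            } else {
                sent.add(e.getKey() + ":" + command);
            }
        }
        return warnings;
    }
}
```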

When the DAS starts up, all instances are considered offline until
GMS reports that the instances are ready.

XXX - Does DAS become the GMS master when it restarts?

XXX - Do we need to wait for instances to report that they're ready
before checking whether we think they're out of sync (as described
above)?  Or do we do the check "on demand" as the instance comes online?


As an instance is restarting, it may take some time for the instance
to fully synchronize before it's ready to accept admin commands.
Between the time when an instance starts synchronizing, and the time
it finishes, we don't want admin commands to fail because the instance
is offline.  Neither do we want the admin commands to wait for the
instance to fully synchronize.  During this period, the instance is
marked "starting", and admin commands issued to starting instances
are queued until the instance is fully online.

This implies that the DAS is notified of two events: the instance has
started synchronizing and thus will likely be online soon, and the
instance is fully online and ready to accept admin commands.  The first
event can be triggered by the first sync command that the DAS receives
for the instance.  The second event can be triggered by GMS detecting
that the instance is "joined and ready".

(This assumes that GMS will be used for *all* instances, not just instances
in a cluster.  Currently this seems like the best approach, rather than
reinventing a similar mechanism just for this purpose.)

When an instance is fully online, any admin commands queued for that
instance are sent.  Note that by then the original command has already
returned, so there's no user to notify if a queued command fails.
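
The offline / starting / online handling described above could be
sketched as a small per-instance state machine.  This is only a
sketch under assumptions: InstanceChannel and its method names are
hypothetical, commands are plain strings, and "delivered" stands in
for actually sending a command to the instance.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch; names are illustrative, not from the DAS code.
final class InstanceChannel {
    enum State { OFFLINE, STARTING, ONLINE }

    private State state = State.OFFLINE;               // all instances start offline
    private final Deque<String> queued = new ArrayDeque<>();
    final List<String> delivered = new ArrayList<>();  // stand-in for "sent to instance"

    State state() {
        return state;
    }

    /** First event: the DAS receives the first sync command for the instance. */
    void onFirstSyncRequest() {
        if (state == State.OFFLINE) {
            state = State.STARTING;
        }
    }

    /** Second event: GMS reports the instance as "joined and ready". */
    void onJoinedAndReady() {
        state = State.ONLINE;
        // Send queued commands in the order they were issued.
        while (!queued.isEmpty()) {
            delivered.add(queued.poll());
        }
    }

    /** Returns true if the command was sent or queued, false if the instance is offline. */
    boolean submit(String command) {
        switch (state) {
            case ONLINE:
                delivered.add(command);
                return true;
            case STARTING:
                queued.add(command);
                return true;
            default:
                return false;
        }
    }
}
```

A command submitted while the instance is starting returns at once
and is only delivered after onJoinedAndReady(), which is why a failure
at that point has no user to report to.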