There are several synchronization-related failure cases we need to handle.

If the DAS is down when an instance starts up, we allow the instance to start and use its cached state. What if the instance was out of sync? When and how will we detect and correct that situation?

If the DAS executes an admin command locally, then replicates it to server instances, what happens if it crashes somewhere in the middle of this process? How does the DAS know whether or not an instance received the replicated command and executed it successfully?

Here's an approach to handling these issues:

After successfully executing a replicated command on all "up" instances, the DAS records the mod time of domain.xml. When the DAS starts, it compares the actual mod time of domain.xml with the recorded mod time. If they're the same, all instances that were up when the most recent admin command was executed have successfully executed the command, and are in sync. If the times are different, at least one instance may be out of sync. (Note that the mod time could be recorded per-instance as each instance reports success.)

In that case, the DAS sends a command to all instances (or all instances that aren't known to be in sync). The command says "this is my domain.xml mod time, are we in sync?" Each instance reports yes or no. If an instance reports "no", the DAS marks the instance as "offline", and the instance restarts itself. The instance will then resynchronize at startup.

If an instance is marked offline, commands are not replicated to that instance. The user is warned that the instance is down.

When the DAS starts up, all instances are considered offline until GMS reports that the instances are ready.

XXX - Does DAS become the GMS master when it restarts?

XXX - Do we need to wait for instances to report that they're ready before checking whether we think they're out of sync (as described above)? Or do we do the check "on demand" as the instance comes online?
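The mod-time check proposed earlier in this section could be sketched roughly as follows. This is only an illustration, not existing DAS code: the class `SyncTracker`, its methods, and the `IN_SYNC`/`OFFLINE` states are hypothetical names, mod times are passed in as plain `long` values, and an instance that has not yet answered is assumed to be treated as offline.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical DAS-side bookkeeping for the mod-time check (not a real GlassFish class). */
final class SyncTracker {
    enum State { IN_SYNC, OFFLINE }

    // Mod time of domain.xml recorded after the last fully replicated command.
    private long recordedModTime = -1L;
    // Per-instance state; instances with no entry are assumed offline.
    private final Map<String, State> instances = new HashMap<>();

    /** Called after a replicated command succeeded on all "up" instances. */
    void recordSuccess(long domainXmlModTime) {
        recordedModTime = domainXmlModTime;
    }

    /** At DAS startup: true if domain.xml is unchanged since the last recorded success. */
    boolean allInSync(long actualModTime) {
        return actualModTime == recordedModTime;
    }

    /** Handles an instance's yes/no answer to "this is my domain.xml mod time, are we in sync?" */
    State handleReply(String instance, boolean inSync) {
        State state = inSync ? State.IN_SYNC : State.OFFLINE;
        instances.put(instance, state);
        return state;
    }

    /** Commands are not replicated to instances marked offline (or not yet heard from). */
    boolean shouldReplicateTo(String instance) {
        return instances.get(instance) == State.IN_SYNC;
    }
}
```

In a real implementation the mod time would presumably be read from the file system and the recorded value persisted across DAS restarts; both are left out of this sketch.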
As an instance is restarting, it may take some time for the instance to fully synchronize before it's ready to accept admin commands. Between the time when an instance starts synchronizing and the time it finishes, we don't want admin commands to fail because the instance is offline. Neither do we want the admin commands to wait for the instance to fully synchronize. During this period, the instance is marked "starting", and admin commands issued to starting instances are queued until the instance is fully online.

This implies that the DAS is notified of two events: the instance has started synchronizing and thus will likely be online soon, and the instance is fully online and ready to accept admin commands. The first event can be triggered by the first sync command that the DAS receives for the instance. The second event can be triggered by GMS detecting that the instance is "joined and ready". (This assumes that GMS will be used for *all* instances, not just instances in a cluster. Currently this seems like the best approach, rather than reinventing a similar mechanism just for this purpose.)

When an instance is fully online, any admin commands queued for that instance are sent. Note that by the time the queued commands are sent, there's no user to notify of failures.
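The "starting" state and its command queue might look something like the sketch below. All names here (`InstanceProxy`, `onFirstSyncRequest`, `onJoinedAndReady`, `replicate`) are hypothetical, commands are represented as plain strings, and the `sent` list merely stands in for actual delivery to the instance.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical DAS-side view of one instance (not a real GlassFish class). */
final class InstanceProxy {
    enum State { OFFLINE, STARTING, ONLINE }

    private State state = State.OFFLINE;
    private final Queue<String> pending = new ArrayDeque<>(); // queued while STARTING
    private final List<String> sent = new ArrayList<>();      // stand-in for real delivery

    /** First event: the DAS receives the first sync command for this instance. */
    void onFirstSyncRequest() {
        if (state == State.OFFLINE) {
            state = State.STARTING;
        }
    }

    /** Second event: GMS reports the instance as "joined and ready". */
    void onJoinedAndReady() {
        state = State.ONLINE;
        // Flush queued commands in the order they were issued.
        while (!pending.isEmpty()) {
            sent.add(pending.poll());
        }
    }

    /** Returns true if the command was sent or queued, false if skipped (instance offline). */
    boolean replicate(String command) {
        switch (state) {
            case ONLINE:
                sent.add(command);
                return true;
            case STARTING:
                pending.add(command);
                return true;
            default:
                return false;
        }
    }

    State state() { return state; }
    int queued() { return pending.size(); }
    List<String> sent() { return sent; }
}
```

The sketch does not address what happens if a queued command later fails on the instance; as noted above, there is no user left to report that failure to.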