Re: failure recovery

From: Bill Shannon <bill.shannon_at_oracle.com>
Date: Mon, 07 Jun 2010 11:59:38 -0700

Tom Mueller wrote on 06/ 7/10 08:12 AM:
> I'd consider the problem of having the whole cluster go down to resync a
> serious enough problem that we need to fix it for this release. The
> first time this happens in the field, this would be a real "egg on our
> face" situation.

I think the question for this release is how likely this is to happen.
The cluster has to be down when the DAS is up, changes have to be made,
then the DAS has to go down, then the cluster has to start up, then the
DAS has to start up. That's the situation that would cause the entire
cluster to *restart*.
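
To make that sequence concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the integer mod times, the instance names, and the `instances_to_restart` helper are hypothetical stand-ins, not GlassFish code. It only assumes that the DAS flags an instance when its recorded domain.xml mod time differs from the DAS's.

```python
def instances_to_restart(das_mod_time, instance_mod_times):
    """Names of instances whose recorded mod time differs from the DAS's."""
    return sorted(
        name for name, t in instance_mod_times.items() if t != das_mod_time
    )


# 1. Cluster is down, DAS is up; both instances last synced at mod time 100.
cluster = {"instance1": 100, "instance2": 100}
das_mod_time = 100

# 2. Changes are made on the DAS while the cluster is down.
das_mod_time = 101

# 3. The DAS goes down.
# 4. The cluster starts with no DAS to sync against, so every instance
#    comes up with its stale configuration.
# 5. The DAS starts and checks each running instance.
restarts = instances_to_restart(das_mod_time, cluster)
```

Here `restarts` comes out as both instances, which is the whole-cluster restart. Remove any one of the five steps and at least one instance has a chance to sync normally.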

I agree, it would be good if we could fix it for this release, but it
might be a sufficiently unlikely event that we can defer a full fix to
the next release.

> It seems that the design still handles this case incorrectly:
>
> * DAS and two instances are up, but DAS cannot communicate with
> instance 1 because the LAN is down. So DAS thinks instance 1 is
> down, but instance 1 thinks the DAS is down.
> * An admin command is executed, and the DAS replicates it to
> instance 2 successfully, and updates the domain.xml mod time.
> * The LAN is brought up. GMS notifies the DAS that instance 1 is now
> up and instance 1 is notified that the DAS is up.
> * The DAS sees that the recorded domain.xml mod time is the same as
> the current domain.xml mod time, so it thinks everything is
> synced. Instance 1 is not told to restart, so it continues to
> operate with out-of-date data.
>
> It seems that a solution would be that whenever an instance
> reestablishes communication with the DAS, it should check if a sync is
> needed; not just when it is starting.

Your last step above doesn't happen.

What I proposed is that when the DAS sees that instance 1 has come online
(because the network was restored), the DAS would check with the instance
to see if they're in sync, and if not tell instance 1 to restart.

So yes, the check has to happen not just when starting, but also when
coming online.
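
Roughly, the check I have in mind looks like the sketch below. This is illustration only: `Das`, `Instance`, and `on_instance_online` are made-up names, the mod times are plain integers, and the return value stands in for whatever the DAS would really tell the instance to do.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for an instance's view of its configuration."""
    name: str
    synced_mod_time: int  # mod time of the domain.xml it last synced against


class Das:
    """Hypothetical stand-in for the DAS; not the GlassFish API."""

    def __init__(self, domain_xml_mod_time: int) -> None:
        self.domain_xml_mod_time = domain_xml_mod_time

    def in_sync(self, instance: Instance) -> bool:
        return instance.synced_mod_time == self.domain_xml_mod_time

    def on_instance_online(self, instance: Instance) -> str:
        """Run on every 'instance is up' notification, not only at startup."""
        return "ok" if self.in_sync(instance) else "restart"
```

In Tom's scenario, the admin command moves the DAS and instance 2 to a new mod time while instance 1 is unreachable. When the LAN comes back, `on_instance_online` returns "restart" for instance 1 and "ok" for instance 2.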

There is one other aspect of the current design that works against us in
some of the above scenarios. Consider this case...

DAS with two standalone instances - I1 and I2. (No cluster.)

An admin command is issued against I1. The DAS version of domain.xml is
updated and thus gets a new mod time, the command is replicated to I1
*but not I2* (because it didn't affect I2).

Is I2 in sync?

Well, if you compare the domain.xml mod times, you'll conclude that it is
*not* in sync. If I2 were to restart, it would get a new domain.xml and go
through a full resync, just to find that nothing else changed. Not terrible,
but not optimal either.

The more worrisome case is if there's a network outage between the DAS and
I2. When I2 comes back online, the DAS will believe it's out of sync, and
will require it to restart, even though it's not "really" out of sync.
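
A small sketch of that false positive, again with made-up names (`apply_admin_command`, `out_of_sync`) and integer mod times standing in for the real thing. It only shows the problem and does not attempt a fix.

```python
def apply_admin_command(das_mod_time, instance_mod_times, targets):
    """Bump the DAS mod time and replicate only to the targeted instances."""
    new_mod_time = das_mod_time + 1
    updated = dict(instance_mod_times)
    for name in targets:
        updated[name] = new_mod_time
    return new_mod_time, updated


def out_of_sync(das_mod_time, instance_mod_times):
    """Names the DAS would flag under a plain mod-time comparison."""
    return sorted(
        name for name, t in instance_mod_times.items() if t != das_mod_time
    )


# The command targets I1 only; nothing that I2 uses has changed.
das_mod_time, instances = apply_admin_command(
    100, {"I1": 100, "I2": 100}, targets=["I1"]
)
flagged = out_of_sync(das_mod_time, instances)
```

`flagged` ends up containing only I2, the one instance the command never touched. A single domain-wide mod time can't tell "I2 missed a change" apart from "a change happened that had nothing to do with I2".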

Fixing this is more complicated, and almost certainly will wait for 3.2.