Failover Configuration Guide for Cisco Digital Media Suite 5.2.x
Recovering from a Failover

Revised: May 31, 2011

Minor Failure Event Recovery

A minor failure event is one that causes a failover but can be cleared without replacing hardware or reimaging the appliance. Examples include:

A monitored service failing more than 5 times on the active unit.

A service failing to start or stopping.

An external event, such as a network failure.

A single disk failure is also a minor failure event. To recover, replace the disk and reboot the appliance. If more than one disk fails, you must perform a major failure event recovery.

When a failover occurs, clear the cause of the failover and reboot the failed appliance. The appliance boots into standby mode and receives data from the active unit. Rebooting the appliance also clears the monitored service fail counters.

If you cannot clear the condition that caused the failover, you may have to perform a major failure event recovery.
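The fail-counter behavior described above can be pictured with a minimal sketch. It assumes a per-service counter that triggers a failover once a service has failed more than 5 times, and that a reboot resets every counter. The class, method, and service names are hypothetical illustrations and are not part of the Digital Media Suite software.

```python
from dataclasses import dataclass, field

# Taken from the text: a monitored service failing more than 5 times
# on the active unit causes a failover.
FAILURE_THRESHOLD = 5


@dataclass
class ServiceFailCounters:
    """Hypothetical model of the monitored service fail counters."""

    counts: dict[str, int] = field(default_factory=dict)

    def record_failure(self, service: str) -> bool:
        """Count one failure; return True once the threshold is exceeded."""
        self.counts[service] = self.counts.get(service, 0) + 1
        return self.counts[service] > FAILURE_THRESHOLD

    def reboot(self) -> None:
        """Rebooting the appliance clears every fail counter."""
        self.counts.clear()


counters = ServiceFailCounters()
# "web" is a made-up service name. Five failures reach the threshold;
# only the sixth exceeds it.
results = [counters.record_failure("web") for _ in range(6)]
print(results)  # [False, False, False, False, False, True]

counters.reboot()
print(counters.counts)  # {}
```

The sketch only illustrates why a reboot is part of minor failure recovery: until the counters are cleared, the same service would still be counted as over the threshold.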

Major Failure Event Recovery

Major failure events are events that require the appliance to be reimaged or replaced in order to bring it back into service.

If you need to replace hardware, obtain the replacement hardware before starting the recovery process. If you need to replace an appliance, you will need to obtain and install a new license for the appliance.


Note: A single disk failure is a minor failure event. Multiple disk failures are a major failure event.

Caution: You cannot revert a secondary appliance to standalone mode and then bring it back online as a primary appliance. When you convert a cluster to standalone mode, you must reimage the secondary appliances.

There are two major recovery procedures, depending upon which appliance failed:

If a secondary appliance failed, see Secondary Appliance Failure Recovery.

If a primary appliance failed, see Primary Appliance Failure Recovery.
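As a rough illustration of the decision rules in this section, the sketch below classifies a disk failure by the number of failed disks and maps the role of the failed appliance to the matching recovery procedure. The function names and the role strings "primary" and "secondary" are assumptions made for this example; only the rules themselves and the procedure titles come from this guide.

```python
from enum import Enum


class Severity(Enum):
    MINOR = "minor"
    MAJOR = "major"


def classify_disk_failure(failed_disks: int) -> Severity:
    """One failed disk is a minor event; more than one is a major event."""
    if failed_disks < 1:
        raise ValueError("expected at least one failed disk")
    return Severity.MINOR if failed_disks == 1 else Severity.MAJOR


def major_recovery_procedure(failed_role: str) -> str:
    """Return the title of the procedure for the failed appliance's role."""
    procedures = {
        "secondary": "Secondary Appliance Failure Recovery",
        "primary": "Primary Appliance Failure Recovery",
    }
    if failed_role not in procedures:
        raise ValueError(f"unknown appliance role: {failed_role!r}")
    return procedures[failed_role]


print(classify_disk_failure(1).value)       # minor
print(classify_disk_failure(2).value)       # major
print(major_recovery_procedure("primary"))  # Primary Appliance Failure Recovery
```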

Prerequisites

These procedures must be performed from the appliance console. You cannot perform them through an SSH session.

Secondary Appliance Failure Recovery

To recover from the failure of a secondary appliance, do the following:


Step 1 On the pair of appliances that did not fail, make the primary appliance the active appliance.

Step 2 Back up the active appliances in your failover cluster.

Step 3 Revert the active appliances to Standalone mode:

a. Log in to AAI.

b. Choose FAIL_OVER > REVERT.

Step 4 Revert the standby appliances to Standalone mode:

a. Log in to AAI.

b. Choose FAIL_OVER > REVERT.

Step 5 Apply the virtual FQDN and IP address to the primary appliances. This reverts them to the pre-failover configuration.

Step 6 Pair the primary appliances.

The appliances now operate in a standard standalone configuration.

Step 7 Reimage the secondary appliances.

Step 8 Reconfigure failover. Depending on your configuration, see Cisco Digital Signs Failover Configuration, page 2-1, or Cisco Show and Share Failover Configuration, page 3-1, for the failover configuration process.


Primary Appliance Failure Recovery

Recovering a failed primary appliance requires additional steps because you cannot use a secondary appliance as a primary appliance. You must reimage the secondary appliances after converting the failover cluster to standalone mode.

Procedure


Step 1 On the pair of appliances that did not fail, make the primary appliance the active appliance.

Step 2 Back up the active appliances in your failover cluster.

Step 3 Revert the standby appliances to Standalone mode:

a. Log in to AAI.

b. Choose FAIL_OVER > REVERT.

Step 4 Revert the active appliances to Standalone mode:

a. Log in to AAI.

b. Choose FAIL_OVER > REVERT.

Step 5 Reimage the failed primary appliance and the two standby appliances.

Step 6 Apply the virtual FQDN and IP address to the primary appliances. This reverts them to the pre-failover configuration.

Step 7 Pair the primary appliances.

Step 8 Restore the cluster backup on the appliances.

Step 9 Reconfigure failover. Depending on your configuration, see Cisco Digital Signs Failover Configuration, page 2-1, or Cisco Show and Share Failover Configuration, page 3-1, for the failover configuration process.
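The two major recovery procedures share most of their steps, so it can help to see where they differ. The sketch below encodes each procedure as an ordered checklist and reports the steps unique to each one, along with the order in which the appliances are reverted. The step labels are paraphrases written for this example, and the helper names are hypothetical; the sketch does not drive the appliances in any way.

```python
# Paraphrased step labels (not menu items or commands).
MAKE_PRIMARY_ACTIVE = "make primary appliance active"
BACK_UP = "back up active appliances"
REVERT_ACTIVE = "revert active appliances to standalone"
REVERT_STANDBY = "revert standby appliances to standalone"
APPLY_VIRTUAL = "apply virtual FQDN and IP address to primaries"
PAIR = "pair primary appliances"
REIMAGE_SECONDARY = "reimage secondary appliances"
REIMAGE_PRIMARY = "reimage failed primary and standby appliances"
RESTORE = "restore cluster backup"
RECONFIGURE = "reconfigure failover"

SECONDARY_FAILURE_STEPS = [
    MAKE_PRIMARY_ACTIVE, BACK_UP, REVERT_ACTIVE, REVERT_STANDBY,
    APPLY_VIRTUAL, PAIR, REIMAGE_SECONDARY, RECONFIGURE,
]

PRIMARY_FAILURE_STEPS = [
    MAKE_PRIMARY_ACTIVE, BACK_UP, REVERT_STANDBY, REVERT_ACTIVE,
    REIMAGE_PRIMARY, APPLY_VIRTUAL, PAIR, RESTORE, RECONFIGURE,
]


def steps_only_in(first: list[str], second: list[str]) -> list[str]:
    """Return the steps in `first` that are absent from `second`, in order."""
    return [step for step in first if step not in second]


def reverts_standby_first(steps: list[str]) -> bool:
    """True if the standby appliances are reverted before the active ones."""
    return steps.index(REVERT_STANDBY) < steps.index(REVERT_ACTIVE)


primary_only = steps_only_in(PRIMARY_FAILURE_STEPS, SECONDARY_FAILURE_STEPS)
secondary_only = steps_only_in(SECONDARY_FAILURE_STEPS, PRIMARY_FAILURE_STEPS)

print(primary_only)
# ['reimage failed primary and standby appliances', 'restore cluster backup']
print(secondary_only)
# ['reimage secondary appliances']
print(reverts_standby_first(SECONDARY_FAILURE_STEPS))  # False
print(reverts_standby_first(PRIMARY_FAILURE_STEPS))    # True
```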


Split Brain Recovery

Split brain occurs when both nodes become active or when the data on each node becomes out of sync with the other node. To recover, you must decide which set of data to keep. The recovery process overwrites the other set of data.

Procedure


Step 1 Determine which device you want to use as the data source. This is the appliance whose data will be used to populate the cluster.

Step 2 On the appliance you want to receive the data, do the following:

a. Log in to AAI.

b. Choose FAIL_OVER > RECOVER.

If split brain is not occurring, you will receive a message that split brain was not detected. Cancel out of the split brain recovery process.

If split brain is occurring, the data selection page appears.

c. Choose OVERWRITE_DATA.

d. Choose Yes if prompted to continue.

Step 3 On the appliance you are going to use as the data source, do the following:

a. Log in to AAI.

b. Choose FAIL_OVER > RECOVER.

If split brain is not occurring, you will receive a message that split brain was not detected. Cancel out of the split brain recovery process.

If split brain is occurring, the data selection page appears.

c. Choose USE_DATA.

d. Choose Yes if prompted to continue.
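Because the recovery overwrites one set of data, it is worth writing down which appliance gets which menu choice before starting. The sketch below builds that plan from the procedure above: the receiving appliance is handled first with OVERWRITE_DATA, then the data source with USE_DATA. The function, the class, and the appliance names are hypothetical; only the menu path and the two choices come from this procedure.

```python
from dataclasses import dataclass

MENU_PATH = "FAIL_OVER > RECOVER"


@dataclass(frozen=True)
class RecoveryAction:
    appliance: str   # appliance on which the action is performed
    menu_path: str   # AAI menu path to choose
    choice: str      # option to choose on the data selection page


def split_brain_plan(source: str, receiver: str) -> list[RecoveryAction]:
    """Return the ordered AAI actions for a split brain recovery.

    `source` is the appliance whose data is kept; the data on
    `receiver` is overwritten.
    """
    if source == receiver:
        raise ValueError("source and receiver must be different appliances")
    return [
        RecoveryAction(receiver, MENU_PATH, "OVERWRITE_DATA"),  # Step 2
        RecoveryAction(source, MENU_PATH, "USE_DATA"),          # Step 3
    ]


# "node-a" and "node-b" are made-up appliance names.
for action in split_brain_plan(source="node-a", receiver="node-b"):
    print(f"{action.appliance}: {action.menu_path}, then {action.choice}")
```

Rejecting identical source and receiver names is a design choice of this sketch; it guards against planning an overwrite of the data you intend to keep.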