Cisco Active Network Abstraction Administrator Guide, 3.6.5
Using High Availability

Table Of Contents

Using High Availability

High Availability Overview

Watchdog Protocol

Unit N+m High Availability

Estimating Down Time in Case of Failure

Catastrophic Process Failure

Timeout Process Failure

Measuring Ticket-Processing Down Time for AVMs

Timeout Machine Failure

Measuring Ticket-Processing Down Time for Units

Configuring Cisco ANA Units

Customizing Protection Groups

Configuring Units for High Availability and a Protection Group

Configuring Standby Units

Checking the Assignment of Units to Protection Groups

Changing the Protection Group of a Unit

Viewing and Editing Protection Group Properties

Manually Switching to a Standby Unit

Automatically Switching to a Standby Unit

Managing the Watchdog Protocol

Configuring AVMs for High Availability

Viewing and Editing Watchdog Protocol Settings

High Availability Events


Using High Availability


This appendix describes the high availability (redundancy) and protection options available for units and gateways. The topics include:

High Availability Overview

Configuring Cisco ANA Units

Managing the Watchdog Protocol

High Availability Events


Note High availability is an optional feature that can be used with Cisco ANA. For more information about obtaining the high availability feature for use with Cisco ANA, contact your Cisco Account Manager.


High Availability Overview

The high availability architecture ensures continuous availability of Cisco ANA functionality by detecting and recovering from a wide range of hardware and software failures. The distributed design of the system confines the impact of a single fault. This prevents a fault from setting off a "domino" effect that could lead to a crash of all the management services.

High availability of the server backbone is achieved at several complementary levels. For example:

NEBS-3 compliant carrier-class server hardware.

Internal watchdog within each unit, responsible for monitoring and, if necessary, automatically reloading failed processes.

N+m warm standby protection for unit groups.

See the following sections for more information:

Watchdog Protocol

Unit N+m High Availability

Estimating Down Time in Case of Failure


Note Cisco ANA does not provide a solution for the configuration of high availability for a Cisco ANA gateway. For information on configuring high availability for a Cisco ANA gateway using Veritas, contact the Cisco Project Manager or Cisco Account Team.


Watchdog Protocol

The watchdog protocol monitors the AVM processes to make sure any AVMs that have failed are restarted. The watchdog protocol is normally denoted in the GUIs as AVM Protection. Each unit executes several processes: one control process and several AVM processes that execute VNEs. Each process within the unit is completely independent. The isolation concept is tailored throughout the design so that a failure of a single process does not affect other processes on the same machine. The exact number of processes on each unit depends on the capacity and computational power of the unit.

The control process executes a watchdog protocol, which continuously monitors all other processes on the unit. This watchdog protocol requires each AVM process to continuously handshake with the control process. A process that fails to handshake with the control process after a number of attempts is automatically cancelled and reloaded.

The dynamic design of the control process implements runtime adaptation and escalation. The escalation procedure moves the AVM to suspended mode; that is, the process is suspended. An example of an escalation procedure is to stop reloading a process that has crashed more than n times within a given period, because it is suspected of having a recurring software problem.
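The handshake and escalation behavior described above can be sketched as a small state machine. This is a minimal illustration, not the Cisco ANA implementation: the class name, method names, and the `max_missed` value are hypothetical, and the reload limit and window simply reuse the defaults listed in Table F-2 (5 restarts within 180 minutes).

```python
from collections import deque


class AvmWatchdog:
    """Tracks one AVM process from the control process's point of view."""

    def __init__(self, max_missed=3, max_reloads=5, window_s=10800.0):
        self.max_missed = max_missed      # missed handshakes tolerated before a reload (assumed value)
        self.max_reloads = max_reloads    # reloads tolerated inside the window
        self.window_s = window_s          # length of the escalation window, in seconds
        self.missed = 0
        self.reload_times = deque()
        self.state = "running"

    def on_handshake(self):
        """A successful handshake clears the missed-handshake counter."""
        self.missed = 0

    def on_missed_handshake(self, now):
        """Record a missed handshake and return the action taken."""
        if self.state == "suspended":
            return "none"
        self.missed += 1
        if self.missed < self.max_missed:
            return "none"
        self.missed = 0
        # Forget reloads that fall outside the escalation window.
        while self.reload_times and now - self.reload_times[0] > self.window_s:
            self.reload_times.popleft()
        if len(self.reload_times) >= self.max_reloads:
            # Escalation: stop reloading a process that keeps failing.
            self.state = "suspended"
            return "suspend"
        self.reload_times.append(now)
        return "reload"
```

A caller would invoke `on_handshake()` for every successful pulse and `on_missed_handshake(now)` for every missed one, acting on the returned string.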

The reload process is local to the unit, and thus very rapid, with a minimal amount of downtime. Because the process can use its previous cache information (temporary persistency used to improve performance), once the stuck process is detected, reloading the process takes only a few seconds with no data loss.

All watchdog activity is logged and an alarm is generated and sent when the watchdog reloads a process.


Note An alarm persistency mechanism enables the system to clear alarms that relate to events that occurred while a VNE, an AVM, a unit, or the whole system was down, thus preserving system integrity. For more information about alarm persistency, see Appendix G, "VNE Persistency Mechanism."


All watchdog protocol parameters, such as pulse interval and retry times, are configurable in the registry. The higher these parameter values are, the longer an AVM or unit failure lasts before it is detected, but the greater the certainty that a failure has actually occurred. Lower values may shorten AVM or unit recovery, but might result in a "false positive" that unnecessarily restarts an AVM or switches to a standby unit when the AVM is merely busy or the unit is processing a heavy load of data.

Unit N+m High Availability

The clustered N+m high availability mechanism of the Cisco ANA fabric is designed to handle the failure of a unit. Such failures include hardware failures, operating system failures, power failures, and network failures, which disconnect a unit from the Cisco ANA fabric.

Unit availability is managed by the gateway, which runs a protection manager process that continuously monitors all the units in the network. Once the protection manager detects a unit that is malfunctioning, it automatically signals one of the standby servers in its cluster to load the configuration of the faulty unit (from the system registry), taking over all of its managed network elements. This design provides many possibilities for trading off protection and resources. These possibilities range from segmenting the network into clusters without any extra machines, to having a warm-swappable empty unit for each unit in the setup. We recommend that you cluster units according to geography and add an additional empty unit to heavily loaded clusters.

The switchover of the redundant standby unit does not result in any loss of information in the system because all information is autodiscovered from the network, and no persistent storage synchronization is required. As a result, the redundant standby unit relearns all the information from the network elements with no danger of persistent information corruption. Furthermore, when there is cluster saturation (that is, when more than one unit in a cluster fails at the same time and there are no extra machines), the remaining units continue to operate and manage their network scope normally.

When a unit is configured, it can be designated as being an active or standby unit. The active units (excluding the standby unit) that are connected to the gateway are known as a protection group. The standby unit that is configured for the gateway is linked to that protection group. You can define more than a single protection group. Each protection group defined has a set of protected units and a protecting standby unit.

Figure F-1 shows a protection group (cluster) of units controlled by a gateway with one unit configured as the standby for the protection group.

Figure F-1 Cisco ANA Architecture

In the example configuration, when the gateway determines that one of the units in the protection group has failed, it notifies the standby unit of the protection group to immediately load the configuration of the failed unit. The standby unit loads the configuration of the failed unit, including all AVMs and VNEs, and functions as the failed unit.
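The takeover decision can be illustrated with a short sketch. It assumes a simplified data model (plain dictionaries keyed by protection group name) and a hypothetical function name; none of this is the gateway's actual interface. The random choice among several standby units follows the note in Manually Switching to a Standby Unit, and returning `None` models cluster saturation.

```python
import random


def select_standby(failed_unit, protection_groups, standbys, rng=random):
    """Return the standby unit that should take over failed_unit, or None.

    protection_groups maps a group name to the set of active units in it;
    standbys maps a group name to the list of idle standby units for it.
    """
    for group, units in protection_groups.items():
        if failed_unit in units:
            pool = standbys.get(group, [])
            if not pool:
                return None  # cluster saturation: no spare unit is left
            chosen = rng.choice(pool)
            pool.remove(chosen)
            # The standby now functions as the failed unit within the group.
            units.remove(failed_unit)
            units.add(chosen)
            return chosen
    return None  # the unit does not belong to any known protection group
```

For example, with one standby in `default-pg`, the first failure in that group is covered and a second simultaneous failure is not; the remaining units keep managing their own scope.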

All events are recorded in the EventVision system log, which enables you to take the necessary action to bring the failed unit up again. When the failed unit becomes operational, you can decide whether to configure it as the new standby unit or to reinstate it to the protection group and configure another unit as the standby unit.

Estimating Down Time in Case of Failure

When a failure occurs in a unit or AVM, the length of time that the system is down depends on the type of failure, how long it takes to detect that the component is not working, and the length of the recovery period (during which the unit or AVM reloads and the system begins to function normally again).

Three types of failure can occur, as described in the following sections:

Catastrophic Process Failure

Timeout Process Failure

Timeout Machine Failure

Catastrophic Process Failure

Each AVM has a log file which is constantly monitored by a Perl process for log messages about catastrophic failures, such as AVM processes running out of memory. When such a failure occurs, the Perl process restarts the AVM almost immediately, so the mean time to repair (MTTR) is based on the AVM loading life cycle.
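The log-monitoring idea can be sketched as follows. The guide states that the monitor is a Perl process; this Python version is only an illustration of the technique. The message pattern and the function names are assumptions, because the actual log messages are not listed in this guide.

```python
import re

# Hypothetical pattern; the real catastrophic-failure messages are not given here.
CATASTROPHIC = re.compile(r"OutOfMemory|FATAL", re.IGNORECASE)


def find_catastrophic(lines):
    """Return the first log line that signals a catastrophic failure, or None."""
    for line in lines:
        if CATASTROPHIC.search(line):
            return line.rstrip("\n")
    return None


def monitor(lines, restart):
    """Call restart(line) once if a catastrophic message is found.

    Returns True when a restart was requested, False otherwise.
    """
    hit = find_catastrophic(lines)
    if hit is not None:
        restart(hit)
        return True
    return False
```

In practice `lines` would be the new lines appended to the AVM log since the last check, and `restart` would be whatever callback restarts the AVM.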

Table F-1 describes the impact on different AVMs when experiencing such a failure.

Table F-1 Catastrophic Process Failure Impact on AVMs

Process: AVM 0 (switch AVM)
Impact: Loss of messages to and from the machine.
MTTR: One minute to reach bootstrap.
Probability of Failure: Messages are constantly being sent and received in the system, so the probability of failure is high.

Process: AVM 99 (management AVM)
Impact: Loss of registry notifications on changes made to the Golden Source.
MTTR: One minute to reach bootstrap.
Probability of Failure: Registry modifications are made only when the VNE is first loaded into the system, so the probability of failure is low. Modifications are rarely made while the system is up and running.

Process: AVM 100 (trap management AVM)
Impact: Loss of traps and syslogs from devices.
MTTR: One minute to reach bootstrap, plus time for all the VNEs to register again for traps and syslogs.
Probability of Failure: Traps and syslogs are constantly received in a live, scaled system, so there is a high probability of losing traps and syslogs during the reloading period.

Process: AVM 11 (gateway)
Impact: Loss of persistency of any kind.
MTTR: Six to ten minutes to reach bootstrap on a scaled system.
Probability of Failure: Since AVM 11 handles Oracle communication and various gateway functions such as alarm processing, there is a high probability of loss of event persistency during this period.

Process: AVM 101-999
Impact: Loss of management to a section of devices managed by the AVM.
MTTR: One minute to reach bootstrap, plus time to load the VNEs depending on the number and type of VNEs.
Probability of Failure: No alarm processing occurs when the AVM is down, so traps and syslogs sent to the VNEs are lost. The probability of the loss of traps and syslogs for a period of one minute is high.


Timeout Process Failure

Each AVM is constantly monitored by the management AVM (AVM 99) using a watchdog protocol pulse message sent to the AVM at preconfigured intervals. When the AVM fails to respond to the pulse message after a preconfigured number of attempts, the management AVM restarts the process.

The management process also keeps a history of the number of times it has restarted the AVM. When it reaches the maximum number of preconfigured restart times, the management AVM stops restarting the AVM because this indicates a serious problem with the AVM. Each restart is logged as a system event (except when AVM 11 is restarted, because this AVM handles all persistency).

Failures on AVMs in the system are measured in a way similar to that used for catastrophic process failures (see Table F-1), with the addition of the watchdog protocol overhead. This is measured by the pulse interval multiplied by the number of restart attempts.


Note The maximum number of preconfigured restart times is five, after which the management process does not try to reload the AVM.

It takes approximately one minute for the system to detect that an AVM (including AVM 100) is not working.

The recovery period during which an AVM (including AVM 100) reloads and the system starts to function normally again is approximately five minutes, depending on the number of VNEs per AVM and the complexity of each.


Figure F-2 provides a typical example of how high availability timer parameters work while monitoring AVMs.

Figure F-2 HA Parameter Timers and AVM Monitoring Example

Measuring Ticket-Processing Down Time for AVMs

When a failure occurs on an AVM, the time during which ticket processing is down is measured as the sum of the following factors:

The time it takes to determine that the AVM has failed.

The time it takes for the AVM to reload, depending on the number of VNEs.

The time it takes to pass syslogs or traps to the VNEs (in the case of AVM 100), or to pass events to the gateway (in the case of AVM 101-999).
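The sum of these three factors can be written out as a small worked example. The class and field names are hypothetical. The 60-second detection time and 300-second reload time reuse the approximate figures from the note in Timeout Process Failure, and the 30-second forwarding time is an invented placeholder.

```python
from dataclasses import dataclass


@dataclass
class AvmDowntime:
    detect_s: float   # time to determine that the AVM has failed
    reload_s: float   # time for the AVM to reload (grows with the number of VNEs)
    forward_s: float  # time to pass syslogs/traps to VNEs, or events to the gateway

    @property
    def total_s(self) -> float:
        """Ticket-processing down time: the sum of the three factors."""
        return self.detect_s + self.reload_s + self.forward_s


estimate = AvmDowntime(detect_s=60, reload_s=300, forward_s=30)
print(estimate.total_s)  # 390
```

Keeping the three factors separate makes it easier to see which one dominates for a given AVM.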


Note For the first 30 minutes after AVM 99 (the management AVM) has started, there is no monitoring of the system to find high availability issues. This allows the system enough time to get up and running.


Timeout Machine Failure

The Cisco ANA gateway constantly monitors units by sending a watchdog protocol pulse message to the unit management AVM at preconfigured intervals. If the unit management AVM fails to respond to the pulse message after a preconfigured number of retries, the gateway loads the standby unit to replace it.

The impact of such a failure on the system is that the unresponsive unit does not manage the devices for a period of time. This unmanaged period of time is measured by the pulse interval multiplied by the number of retry times, plus the unit load time.


Note Unit load time depends on the AVMs and the load time required for the VNEs to complete their modeling, as described in Table F-1.
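As a rough illustration of the unmanaged period, the following sketch applies the formula stated above (pulse interval multiplied by the number of retries, plus the unit load time). The function name and all numeric values are hypothetical and are not Cisco ANA defaults.

```python
def unit_unmanaged_period_s(pulse_interval_s, retries, unit_load_s):
    """Pulse interval multiplied by the retry count, plus the unit load time."""
    if pulse_interval_s < 0 or retries < 0 or unit_load_s < 0:
        raise ValueError("all inputs must be non-negative")
    return pulse_interval_s * retries + unit_load_s


# Hypothetical settings: the same unit load time under two watchdog configurations.
fast = unit_unmanaged_period_s(pulse_interval_s=10, retries=3, unit_load_s=300)
slow = unit_unmanaged_period_s(pulse_interval_s=60, retries=5, unit_load_s=300)
print(fast, slow)  # 330 600
```

The comparison shows the trade-off in the watchdog parameters: larger values lengthen the unmanaged period, while smaller values shorten it at the cost of a higher chance of an unnecessary switchover.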


Figure F-3 illustrates how a unit handles events during the loading time.

Figure F-3 Stages in Event Handling Through System Restart

Measuring Ticket-Processing Down Time for Units

When a failure occurs on a unit, the time during which ticket processing is down is measured as the sum of the following factors:

The time it takes to determine that the unit has failed (depending on the ping interval).

The time it takes for the unit to reload, depending on the number of AVMs and VNEs in the unit.

The time it takes to pass correlated events to the gateway (a minimum of five minutes to obtain device history, plus a variable time depending on the number of VNEs per AVM).

Configuring Cisco ANA Units

The following sections describe customizing protection groups, configuring units for high availability, and configuring standby units:

Customizing Protection Groups

Configuring Units for High Availability and a Protection Group

Configuring Standby Units

Checking the Assignment of Units to Protection Groups

Changing the Protection Group of a Unit

Viewing and Editing Protection Group Properties

Manually Switching to a Standby Unit

Automatically Switching to a Standby Unit

Customizing Protection Groups

By default, all units in the Cisco ANA fabric belong to one cluster. You can change the default setup of the units by customizing protection groups and then assigning units to these groups.

To customize a protection group:


Step 1 Select the Global Settings branch in the ANA Manage window. The Global Settings branch is displayed.

Step 2 Expand the Global Settings branch and select the Protection Groups sub-branch.

Step 3 Open the New Protection Group dialog box in one of the following ways:

Right-click the Protection Groups sub-branch, then choose New Protection Group.

Click New Protection Group in the toolbar.

Choose File > New Protection Group.

Step 4 In the Name field, enter the name of the protection group.

Step 5 (Optional) In the Description field, enter a description for the protection group.

Step 6 Click OK. The new protection group is displayed in the workspace of the ANA Manage window with the other currently defined protection groups.


Note The default-pg protection group displayed in the workspace is the protection group to which all units in the Cisco ANA fabric belong by default.



Configuring Units for High Availability and a Protection Group

You can change the default settings of a unit and assign it to a customized protection group. For more information about customizing protection groups, see Customizing Protection Groups.

In addition, you can enable or disable high availability for a unit. In other words, these settings enable you to define to which protection group a unit is assigned and whether it is enabled for high availability.

For information about how long a unit or AVM is down when switching to a standby unit, see Estimating Down Time in Case of Failure.


Note By default, all units in the Cisco ANA fabric belong to the default-pg protection group and high availability is enabled.


Advanced settings are available in the registry to:

Enable or disable the watchdog protocol for each process, including timeouts for discovery when the process is down.

Control the timeouts for detecting when a unit is down.

For more information, contact your Cisco representative.

To configure a unit for high availability and assign it to a protection group:


Step 1 Select the ANA Servers branch in the ANA Manage window. The ANA Servers branch is displayed.

Step 2 Open the New ANA Unit dialog box in one of the following ways:

Right-click the ANA Servers branch, then choose New ANA Unit.

Click New ANA Unit in the toolbar.

Choose File > New ANA Unit.

Step 3 In the IP Address field, enter the IP address of the new unit.


Note For a detailed description of configuring units, see Chapter 5, "Managing Cisco ANA Units."


Step 4 Confirm that the Enable Unit Protection check box is checked. When this option is checked, high availability is enabled on the unit.


Note The Enable Unit Protection check box is checked by default. We strongly recommend that you do not disable this option.


Step 5 In the Protection Group drop-down list, select the required protection group.

For more information about protection groups, see:

Customizing Protection Groups

Changing the Protection Group of a Unit

Step 6 In the Gateway IP field, confirm that the IP address of the gateway appears.

Step 7 Click OK. The new unit is displayed in the ANA Manage window.

If the new unit is installed and reachable, the following events occur:

It starts automatically.

It is registered with the gateway.

A configuration registry for the new unit is created in the Golden Source. For more information about the Golden Source registry, see Appendix C, "Golden Source Registry."


Note To make an active unit a standby unit:

1. Shut down all VNEs on the active unit.
2. Remove all configurable AVMs on the active unit; AVMs 0-100 cannot be deleted.
3. Delete (remove) the active unit from the setup.
4. Configure the new standby unit.

For more information, see Configuring Standby Units.



Configuring Standby Units

ANA Manage enables you to configure standby units and assign standby units to protection groups.

For information about how long a unit or AVM is down when switching to a standby unit, see Estimating Down Time in Case of Failure.

To configure a standby unit:


Step 1 Select the ANA Servers branch in the ANA Manage window. The ANA Servers branch is displayed.

Step 2 Open the New ANA Unit dialog box in one of the following ways:

Right-click the ANA Servers branch, then choose New ANA Unit.

Click New ANA Unit in the toolbar.

Choose File > New ANA Unit.


Note For a detailed description of configuring units, see Chapter 5, "Managing Cisco ANA Units."


Step 3 In the IP Address field, enter the IP address of the new unit.

Step 4 Confirm that the Enable Unit Protection check box is checked. When this option is checked, high availability is enabled on the unit.


Note The Enable Unit Protection check box is checked by default. We strongly recommend that you do not disable this option.


Step 5 Check the Standby Unit check box to define the unit as a standby unit.

Step 6 In the Protection Group drop-down list, select the protection group for which the newly created unit will act as a standby unit.

For more information about protection groups, see:

Customizing Protection Groups

Changing the Protection Group of a Unit

Step 7 In the Gateway IP field, confirm that the IP address of the gateway appears.

Step 8 Click OK.


Note Standby units are not displayed in the ANA Servers branch in the tree pane.


For information about changing the protection group to which a unit is assigned, see Changing the Protection Group of a Unit.


Checking the Assignment of Units to Protection Groups

You can view the protection groups to which units are currently assigned, allowing you to confirm at a glance that the configuration or assignment matches the initial deployment plan.

To view unit and protection group assignments, select the ANA Servers branch in the Cisco ANA Manage tree pane. The properties of the ANA Servers branch are displayed in the workspace, including the details of the protection group to which each unit and standby unit currently belongs.

Changing the Protection Group of a Unit

You can easily and quickly change the protection group to which a unit has been assigned.

To change the protection group of a unit:


Step 1 Select the ANA Servers branch in the ANA Manage window. The ANA Servers branch is displayed.

Step 2 Expand the ANA Servers branch and select the required ANA Unit sub-branch.

Step 3 Open the ANA Unit Properties dialog box in one of the following ways:

Right-click the required unit, then choose Properties.

Click Properties in the toolbar.

Choose File > Properties.


Note For a detailed description of configuring units, see Chapter 5, "Managing Cisco ANA Units."


Step 4 In the Protection Group drop-down list, select the protection group to which you want to assign the unit.

Step 5 Click OK to save the updated protection group setting for the selected unit. The ANA Manage window is displayed.


Viewing and Editing Protection Group Properties

You can view or edit the properties of a protection group, such as the description.

To view and edit protection group properties:


Step 1 Select the Global Settings branch in the ANA Manage window. The Global Settings branch is displayed.

Step 2 Expand the Global Settings branch and select the Protection Groups sub-branch.

Step 3 Select the required protection group in the workspace.

Step 4 Open the Properties dialog box in one of the following ways:

Right-click the protection group, then choose Properties.

Click Properties in the toolbar.

Choose File > Properties.

Step 5 View the properties of the protection group or edit the description.

Step 6 Click OK. The ANA Manage window is displayed.


Manually Switching to a Standby Unit

ANA Manage enables you to manually switch to a standby unit. This is useful when, for example, a unit needs to be temporarily shut down for maintenance.

To manually switch to a standby unit:


Step 1 Select the ANA Servers branch in the ANA Manage window. The ANA Servers branch is displayed.

Step 2 Expand the ANA Servers branch and select the required ANA Unit sub-branch.

Step 3 Right-click the required unit, then choose Switch.

A confirmation message is displayed.

Step 4 Click Yes. The standby unit becomes the active unit and is displayed in the ANA Servers branch. The original unit is removed from the setup and can be safely shut down. It is no longer displayed in the ANA Servers branch of the ANA Manage window.


Note In the event of unit failover, the Cisco ANA gateway randomly selects a redundant unit when more than one standby unit is available.



Automatically Switching to a Standby Unit

When the gateway discovers that one of the active units has, for example, timed out (see High Availability Events for more information), Cisco ANA automatically transfers all data from the failed unit to a standby unit in the same protection group. The original unit is removed from the standby setup and is no longer displayed in the ANA Servers branch of the ANA Manage window.

Managing the Watchdog Protocol

The following sections describe how to define AVMs for units and enable or disable the watchdog protocol on the AVM:

Configuring AVMs for High Availability

Viewing and Editing Watchdog Protocol Settings

Configuring AVMs for High Availability

Every AVM in the Cisco ANA fabric is, by default, managed by the watchdog protocol. ANA Manage enables you to define AVMs for units and enable or disable the watchdog protocol on each AVM.

Before you define an AVM, make sure that the following prerequisites are met:

The unit must be installed.

The unit must be connected to the transport network.

The following default AVMs must be running:

AVM 0—The switch AVM.

AVM 99—The management AVM.

AVM 100—The trap management AVM.

The new AVM must have a unique identifier within the unit.


Note For detailed information on defining AVMs, see Chapter 6, "Managing AVMs and VNEs."


To define an AVM for high availability:


Step 1 Select the ANA Servers branch in the ANA Manage window. The ANA Servers branch is displayed.

Step 2 Expand the ANA Servers branch and select the required ANA Servers Entity sub-branch.

Step 3 Open the New AVM dialog box in one of the following ways:

Right-click the required unit, then choose New AVM.

Click New AVM in the toolbar.

Choose File > New AVM.

Step 4 Define the properties of the AVM. For more information, see Chapter 6, "Managing AVMs and VNEs."

Step 5 Check the Enable AVM Protection check box to enable the watchdog protocol on the AVM.


Note We strongly recommend that you do not uncheck the Enable AVM Protection check box.


Step 6 Click OK. The new AVM, with the watchdog protocol enabled, is added to the selected unit and is displayed in the workspace.

Adding the new AVM creates the registry information for the new AVM in the specified unit. The AVM can now host VNEs.


Viewing and Editing Watchdog Protocol Settings

You can view the properties of an AVM, such as its status and location. In addition, you can edit some of the properties of the AVM, including enabling or disabling the watchdog protocol.


Note For detailed information on defining and editing AVMs, see Chapter 6, "Managing AVMs and VNEs."


To view and edit AVM settings:


Step 1 Select the ANA Servers branch in the ANA Manage window. The ANA Servers branch is displayed.

Step 2 Expand the ANA Servers branch and select the required AVM sub-branch in the tree pane.

Step 3 Open the AVM Properties dialog box in one of the following ways:

Right-click the required AVM, then choose Properties.

Click Properties in the toolbar.

Choose File > Properties.

Step 4 Edit the details of the AVM, as required.


Note We strongly recommend that you do not uncheck the Enable AVM Protection check box.


Step 5 Click OK. The new properties for the AVM are displayed in the workspace.


High Availability Events

Table F-2 lists the high availability events displayed in EventVision and the preconfigured defaults for Cisco ANA failover parameters. For more information, see the Cisco Active Network Abstraction 3.6.5 User Guide.

Table F-2 Failover Defaults 

#: 1
Description: Grace period (time from system startup in which events are not raised) [1]
Default Value: 1800000 milliseconds (30 minutes)
Entry Name in Registry: Delay

#: 2
Description: Timeout for AVMs
Default Value: 300000 milliseconds (5 minutes)
Entry Name in Registry: Timeout

#: 3
Description: Timeout for units
Default Value: 300000 milliseconds (5 minutes)
Note This is the initial recovery period, defined in minutes. This period includes device polling and inventory buildup. End-to-end services, such as RCA and topology, can take longer before they become available.
Entry Name in Registry: Timeout

#: 4
Description: AVMs repeatedly not responding
Default Value: Tries a maximum of 5 times to restart the AVM within 10800000 milliseconds (180 minutes). If the number of restarts exceeds the specified value, the AVM is suspended.
Entry Name in Registry: maxTimeoutReloadTime, maxTimeoutReloadTries

[1] The grace period defines the amount of time that the system does not perform high availability operations of any kind on the configured target (either the AVM or the unit). There is one exception: When the configured target responds for the first time with ping, the grace period is over.
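The grace-period rule and the millisecond defaults can be checked with a short sketch. The constant and function names are hypothetical. Only the 1800000-millisecond default and the ping exception come from the table and its footnote, and this is not how Cisco ANA reads the registry.

```python
GRACE_PERIOD_MS = 1_800_000  # default for the Delay registry entry (30 minutes)


def ms_to_minutes(ms):
    """Convert a registry value in milliseconds to minutes."""
    return ms / 60_000


def in_grace_period(ms_since_startup, has_responded_to_ping, delay_ms=GRACE_PERIOD_MS):
    """True while high availability operations are withheld for the target."""
    if has_responded_to_ping:
        return False  # the first ping response ends the grace period early
    return ms_since_startup < delay_ms


print(ms_to_minutes(GRACE_PERIOD_MS))  # 30.0
```

The same conversion confirms the other defaults: 300000 milliseconds is 5 minutes and 10800000 milliseconds is 180 minutes.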