Table Of Contents
Introduction to High Availability
High Availability Overview
Watchdog Protocol
Unit N+m High Availability
Estimating Down Time in Case of Failure
Catastrophic Process Failure
Timeout Process Failure
Timeout Machine Failure
Related Documentation
Introduction to High Availability
This chapter describes the high availability (redundancy) and protection options available for units and gateways:
• High Availability Overview: Provides an overview of high availability in the Cisco ANA fabric.
• Watchdog Protocol: Describes the Watchdog protocol that monitors the processes on the units.
• Unit N+m High Availability: Describes the clustered N+m high availability mechanism within the Cisco ANA fabric designed to handle the failure of units.
• Estimating Down Time in Case of Failure: Describes how to estimate how long a unit or AVM is down in the event of failure, and the recovery period.
High Availability Overview
High availability is the provision of multiple interchangeable components to perform a single function to cope with failures and errors.
The high availability architecture is designed to ensure continuous availability of assurance and fulfillment functionality by detecting and recovering from a wide range of hardware and software failures, such as server machine failures, connectivity failures, and software breakdowns.
The distributed design of the system confines the "impact radius" of a single fault. This prevents a fault of any type from setting in motion a "domino" effect that could lead to the meltdown of all the management services.
The high availability of the server backbone is achieved at several complementing levels, namely:
• NEBS-3 compliant carrier-class server hardware.
• Internal watchdog within each unit, in charge of monitoring (and if necessary automatically reloading) failed processes. For more information, see Watchdog Protocol.
• N+m warm standby protection for unit clusters. For more information, see Unit N+m High Availability.
Note
Cisco ANA does not provide a solution for the configuration of high availability for a Cisco ANA gateway. For information on configuring high availability for a Cisco ANA gateway using Veritas, please contact the Cisco Project Manager or Cisco Account Team.
Watchdog Protocol
Each unit executes several processes: one control process and several Agent Virtual Machine (AVM) processes that execute Virtual Network Elements (VNEs). Each process within the unit is completely independent. Isolation is built into the design: a failure of a single process does not affect other processes on the same machine. The exact number of processes on each unit depends on the capacity and computation power of the unit.
The control process executes a Watchdog protocol, which continuously monitors all other processes on the unit. The Watchdog protocol requires each AVM process to handshake continuously with the control process. A process that fails to handshake with the control process after a configured number of attempts (that is, a process that is "stuck") is automatically killed and reloaded.
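The handshake rule can be pictured with a minimal Python sketch. It is an illustration only, not Cisco ANA code: the class name `WatchdogMonitor`, the method `record_pulse`, and the value of `retry_times` are hypothetical, and the real parameters are held in the registry.

```python
from dataclasses import dataclass, field


@dataclass
class WatchdogMonitor:
    """Tracks consecutive missed handshakes per AVM (illustrative only)."""
    retry_times: int = 3  # hypothetical value; the real one is set in the registry
    missed: dict = field(default_factory=dict)  # AVM id -> consecutive misses

    def record_pulse(self, avm_id: int, responded: bool) -> bool:
        """Record one handshake result; return True if the AVM should be reloaded."""
        if responded:
            self.missed[avm_id] = 0
            return False
        self.missed[avm_id] = self.missed.get(avm_id, 0) + 1
        if self.missed[avm_id] >= self.retry_times:
            self.missed[avm_id] = 0  # counting starts again after the reload
            return True
        return False


monitor = WatchdogMonitor(retry_times=3)
results = [monitor.record_pulse(101, ok) for ok in (True, False, False, False)]
print(results)  # [False, False, False, True]
```

In this sketch a single successful handshake resets the counter, so only consecutive misses lead to a reload.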
The dynamic design of the control process implements runtime adaptation and escalation. An escalation procedure moves the AVM to suspended mode. For example, the control process stops reloading a process that has crashed more than N times within a given period, because the process is suspected of having a recurring software problem.
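One way to picture this escalation rule is a sliding-window restart counter. The sketch below is an assumption about how such a rule could be written, not the product's implementation; `RestartPolicy`, `on_crash`, and the one-hour window are hypothetical names and values.

```python
from collections import deque


class RestartPolicy:
    """Decide between reloading and suspending a crashing process (illustrative)."""

    def __init__(self, max_restarts: int, window_seconds: float):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()  # timestamps of restarts inside the window

    def on_crash(self, now: float) -> str:
        # Forget restarts that fall outside the observation window.
        while self._restarts and now - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return "suspend"  # escalation: stop reloading
        self._restarts.append(now)
        return "reload"


# Hypothetical settings: at most 5 restarts per hour, one crash per minute.
policy = RestartPolicy(max_restarts=5, window_seconds=3600)
decisions = [policy.on_crash(t) for t in (0, 60, 120, 180, 240, 300)]
print(decisions)  # five "reload" entries followed by "suspend"
```

Because old restarts age out of the window, a process that crashes only occasionally keeps being reloaded, while one that crashes repeatedly in a short period is suspended.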
The reload process is local to the unit, and thus very rapid, with a minimal amount of downtime. Since the process can use its previous cache information (temporary persistency used to improve performance), once the stuck process is detected, reloading the process takes only a few seconds with no data loss.
All Watchdog activity is logged, and an alarm is generated and sent when the watchdog reloads a process.
Note
An alarm persistency mechanism enables the system to clear alarms which relate to events that occurred while a VNE, AVM, unit, or the whole system was down, thus preserving system integrity. For more information about alarm persistency, see the Cisco Active Network Abstraction Administrator Guide.
All the Watchdog protocol parameters, such as "pulse interval" and "retry times", are configurable in the registry by the operator. The higher these parameter values are, the longer an AVM or unit failure lasts before recovery begins, but the greater the certainty that a failure has actually occurred. Lower values may shorten the AVM or unit recovery, but might result in a "false positive" that unnecessarily restarts an AVM or switches to a standby unit when the AVM is merely busy or the unit is processing a heavy load of data.
Unit N+m High Availability
The clustered N+m high availability mechanism within the Cisco ANA fabric is designed to handle the failure of a unit. Such failures include hardware failures, operating system failures, power failures, or network failures, which disconnect a unit from the Cisco ANA fabric.
Unit availability is established in the gateway, which runs a Protection Manager process that continuously monitors all the units in the network. When the Protection Manager detects a malfunctioning unit, it automatically signals one of the m servers in that unit's cluster to load the configuration of the faulty unit from the system registry and take over all its managed network elements. This design offers many ways to trade off protection against resources, ranging from simply segmenting the network into clusters without any extra machines to having a warm-swappable empty unit for each and every unit in the setup. It is recommended that units be clustered according to geography and that an additional empty unit be added to heavily loaded clusters.
The switchover of the redundant standby unit does not result in any loss of information in the system, as all the information is auto-discovered from the network, and no persistent storage synchronization is required. Hence, the redundant standby unit relearns all the information from the network elements, with no danger of persistent information corruption. Furthermore, where there is cluster saturation (namely, more than one unit in a cluster fails at the same time and there are no extra machines), the remaining units will continue to operate and manage their network scope normally.
When a unit is configured it can be designated as being an active or standby unit. The active units (excluding the standby unit) that are connected to the gateway are known as a protection group. The standby unit that is configured for the gateway is linked to that protection group. The administrator can define more than a single protection group. Each protection group defined has a set of protected units and a protecting standby unit.
The following example shows a protection group (cluster) of units, controlled by a gateway with one unit configured as the standby for the protection group.
Figure 2-1 Cisco ANA Architecture
In the above configuration, when the gateway determines that one of the units in the protection group has failed, it notifies the protection group's standby unit to immediately load the configuration of the failed unit. The standby unit loads the configuration of the failed unit, including all its AVMs and VNEs, and functions as the failed unit.
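The takeover logic can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Protection Manager itself: `ProtectionGroup`, `handle_unit_failure`, and the unit names are hypothetical, and loading the failed unit's AVMs and VNEs is reduced to recording which standby took over.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProtectionGroup:
    active_units: set
    standby_units: list  # idle units available to take over
    takeovers: dict = field(default_factory=dict)  # failed unit -> standby

    def handle_unit_failure(self, failed_unit: str) -> Optional[str]:
        """Return the standby that takes over, or None if no standby is left."""
        if failed_unit not in self.active_units:
            raise ValueError(f"{failed_unit} is not an active unit in this group")
        self.active_units.discard(failed_unit)
        if not self.standby_units:
            return None  # cluster saturation: remaining units carry on unchanged
        standby = self.standby_units.pop(0)
        self.takeovers[failed_unit] = standby  # standby now acts as the failed unit
        self.active_units.add(standby)
        return standby


group = ProtectionGroup(active_units={"unit-a", "unit-b", "unit-c"},
                        standby_units=["unit-s"])
first = group.handle_unit_failure("unit-b")
second = group.handle_unit_failure("unit-a")
print(first, second)  # unit-s None
```

The second failure shows the saturation case: with no standby left, nothing takes over the failed unit, and the remaining units keep managing their own scope.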
These events are all recorded in the EventVision system log, which enables the user to take the necessary action to bring the failed unit up again. When the failed unit becomes operational, the user can decide whether to configure it as the new standby unit or to reinstate it to the protection group and configure another unit as the standby unit.
Estimating Down Time in Case of Failure
When a failure occurs in a unit or AVM, the length of time that the system is down depends on the type of failure, how long it takes to detect that the component is not working, and the length of the recovery period, during which the unit or AVM reloads until the system functions normally again.
Three types of failure can occur, as described in the following sections:
• Catastrophic Process Failure
• Timeout Process Failure
• Timeout Machine Failure
Catastrophic Process Failure
Each AVM has a log file which is constantly monitored by a Perl process for catastrophic log messages, such as AVM processes running out of memory. When such a failure occurs, the Perl process restarts the AVM almost immediately, so the MTTR (Mean Time To Repair) is based on the AVM loading life cycle.
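The idea of scanning a log for catastrophic messages can be shown with a short sketch. The actual monitor is a Perl process whose message list is not given in this chapter, so the Python below is only an analogy: the patterns, the sample log lines, and the function `find_catastrophic_lines` are all invented for illustration.

```python
import re

# Hypothetical patterns; the real monitor's message list is not documented here.
CATASTROPHIC_PATTERNS = [
    re.compile(r"OutOfMemoryError"),
    re.compile(r"out of memory", re.IGNORECASE),
]


def find_catastrophic_lines(log_lines):
    """Return the log lines that match any catastrophic pattern."""
    return [line for line in log_lines
            if any(p.search(line) for p in CATASTROPHIC_PATTERNS)]


# Invented sample log content.
sample_log = [
    "INFO  VNE 10.0.0.1 polling complete",
    "ERROR java.lang.OutOfMemoryError: Java heap space",
    "WARN  slow response from device",
]
hits = find_catastrophic_lines(sample_log)
if hits:
    print("restart AVM:", hits[0])
```

A real monitor would follow the log file as it grows and trigger the AVM restart itself; here the restart is represented by a printed message.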
Table 2-1 describes the impact on different AVMs when hit by such a failure:
Table 2-1 Catastrophic Process Failure Impact on AVMs
Process | Impact | MTTR | Probability of Failure
AVM 0 (switch AVM) | Loss of messages to and from the machine. | 1 minute to reach bootstrap. | Messages are constantly being sent and received in the system, so the probability of failure is high.
AVM 99 (management AVM) | Loss of registry notifications on changes done in the Golden Source. | 1 minute to reach bootstrap. | Registry modifications are only ever done at the first system load up by the VNE, so the probability of failure is low. While the system is up and running, modifications are rarely done.
AVM 100 (trap management AVM) | Loss of traps and syslogs from devices. | 1 minute to reach bootstrap, plus time for all the VNEs to re-register for traps and syslogs. | Traps and syslogs are constantly received in a live, scaled system, so there is a high probability of losing traps and syslogs during the reloading period.
AVM 11 (gateway) | Loss of persistency of any kind. | 6-10 minutes to reach bootstrap on a scaled system. | Since AVM 11 handles Oracle communication and various gateway functions, such as alarm processing, there is a high probability of loss of events persistency during this period.
AVM 101-999 | Loss of management to a section of devices managed by the AVM. | 1 minute to reach bootstrap, plus time to load the VNEs depending on the number and type of VNEs. | When the AVM is down, no alarm processing is done, so traps and syslogs sent to the VNEs are lost. The probability of losing traps and syslogs during a period of 1 minute is high.
Timeout Process Failure
Each AVM is constantly monitored by the management AVM (AVM99) using a Watchdog protocol pulse message sent to the AVM every pulse interval (preconfigured). When the AVM fails to respond to the pulse message after a preconfigured number of attempts, the management AVM restarts the process.
The management process also keeps a history of the number of times it has restarted the AVM. When it reaches the maximum number of preconfigured restart times, the management AVM stops restarting the AVM as this would indicate a serious problem with the AVM. Each restart is logged in EventVision as a system event (except when AVM11 is restarted as this AVM handles all the persistency).
Failures on AVMs in the system are measured in a similar way to a catastrophic process failure (see Table 2-1), with the addition of the Watchdog protocol overhead. This overhead is the pulse interval multiplied by the number of retry attempts made before the AVM is restarted.
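As a worked example of this estimate, the sketch below adds the Watchdog detection overhead (pulse interval multiplied by the number of unanswered pulse attempts) to the MTTR taken from Table 2-1. The function name and the 20-second, 3-retry settings are assumptions chosen for illustration; the actual values are configured in the registry.

```python
def process_timeout_downtime(pulse_interval_s: float, retry_times: int,
                             mttr_s: float) -> float:
    """Detection overhead (pulse interval x retries) plus the reload time."""
    if pulse_interval_s < 0 or retry_times < 0 or mttr_s < 0:
        raise ValueError("inputs must be non-negative")
    return pulse_interval_s * retry_times + mttr_s


# Hypothetical settings: 20 s pulse interval and 3 retries.
# The 60 s MTTR is the "1 minute to reach bootstrap" figure from Table 2-1.
estimate = process_timeout_downtime(pulse_interval_s=20, retry_times=3, mttr_s=60)
print(estimate)  # 120
```

With these assumed settings, detection alone accounts for 60 seconds, which is why a timeout failure costs more than a catastrophic failure of the same AVM.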
Note
• The preconfigured maximum number of restarts is five, after which the management process does not try to reload the AVM.
• It takes approximately 1 minute for the system to detect that an AVM (including AVM 100) is not working.
• The recovery period, during which an AVM (including AVM 100) reloads and the system starts to function normally again, takes approximately 5 minutes, depending on the number of VNEs per AVM and the complexity of each.
Figure 2-2 provides a typical example of how the High Availability timer parameters work while monitoring the AVMs.
Figure 2-2 HA Parameter Timers and AVM Monitoring Example
Measuring Ticket Processing Down Time
When a failure occurs on an AVM, the time during which ticket processing is down is measured as the sum of the following factors:
1. The time it takes to determine that the AVM has failed.
2. The time it takes for the AVM to reload, depending on its number of VNEs.
3. The time it takes to pass syslogs or traps to the VNEs (in the case of an AVM100), or to pass events to the gateway (in the case of an AVM101-999).
Note
For the first 30 minutes after AVM 99 (the management AVM) has started, the system is not monitored for high availability issues. This allows the system enough time to get up and running.
Timeout Machine Failure
The ANA gateway constantly monitors the units by sending a pulse message to each unit's management AVM every (preconfigured) pulse interval. If a unit's management AVM fails to respond to the pulse message after a preconfigured number of retries, the gateway loads the standby unit to replace that unit.
The impact of such a failure on the system is a loss of management for the devices this unit manages for a period of time. It is measured by the pulse interval multiplied by the number of retry times, plus the unit load time.
Note
The unit load time depends on the AVMs and the load time taken for the VNEs to reach a status of modeling complete as described in Table 2-1.
Figure 2-3 illustrates how the unit handles events during the loading time.
Figure 2-3 Stages in Event Handling through System Restart
Measuring Ticket Processing Down Time
When a failure occurs on a unit, the time during which ticket processing is down is measured as the sum of the following factors:
1. The time it takes to determine that the unit has failed (depending on the ping interval).
2. The time it takes for the unit to reload, depending on the number of AVMs and VNEs in the unit.
3. The time it takes to pass correlated events to the gateway (a minimum of 5 minutes to get some device history, plus a variable time depending on the number of VNEs per AVM).
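The three factors above can be summed as in the following sketch. Only the 5-minute minimum for device history comes from the text; the function name `unit_ticket_downtime` and the sample durations are hypothetical and would need to be replaced with figures measured on a real deployment.

```python
MIN_HISTORY_S = 5 * 60  # minimum of 5 minutes to gather device history


def unit_ticket_downtime(detect_s: float, reload_s: float,
                         extra_history_s: float = 0.0) -> float:
    """Sum the three factors: detection, unit reload, and event hand-off."""
    if min(detect_s, reload_s, extra_history_s) < 0:
        raise ValueError("durations must be non-negative")
    return detect_s + reload_s + MIN_HISTORY_S + extra_history_s


# Hypothetical figures: 60 s to detect, 10 min to reload,
# and 2 min of additional VNE-dependent time.
total = unit_ticket_downtime(detect_s=60, reload_s=600, extra_history_s=120)
print(total / 60)  # 18.0 (minutes)
```

Even if detection and reload were instantaneous, this model never returns less than 5 minutes, reflecting the stated minimum for the third factor.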
Related Documentation
For more detailed information see the following publications:
• Cisco Active Network Abstraction Administrator Guide
• Cisco Active Network Abstraction NetworkVision User Guide
• Cisco Active Network Abstraction EventVision User Guide
Note
Changes to the registry should be performed only with the support of Cisco. For details, please contact the Cisco Project Manager or Cisco Account Team.