Cisco ASR 9000 Series Aggregation Services Router Overview and Reference Guide
High Availability and Redundant Operation

Table of Contents

High Availability and Redundant Operation

Features Overview

High Availability Router Operations

Stateful Switchover

Fabric Switchover

Active/Standby Status Interpretation

Non-Stop Forwarding

Nonstop Routing

Graceful Restart

Process Restartability

Fault Detection and Management

Power Supply Redundancy

AC Power Redundancy

DC Power Redundancy

Detection and Reporting of Power Problems

Cooling System Redundancy

Cooling Failure Alarm

High Availability and Redundant Operation

This chapter describes the high availability and redundancy features of the Cisco ASR 9000 Series Routers.

Features Overview

The Cisco ASR 9000 Series Routers are designed for a high Mean Time Between Failures (MTBF) and a low Mean Time To Resolve (MTTR), providing a reliable platform that minimizes outages and downtime and maximizes availability.
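
The link between these two figures and availability can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below is illustrative only: the function names and the sample hours are assumptions, not figures published for the Cisco ASR 9000 Series Routers.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected downtime per 365-day year, in minutes."""
    minutes_per_year = 365 * 24 * 60
    return (1.0 - availability(mtbf_hours, mttr_hours)) * minutes_per_year


# Hypothetical inputs, chosen only to show the trend:
# a shorter repair time raises availability for the same MTBF.
print(availability(100_000, 4))
print(availability(100_000, 1))
```

Raising MTBF or lowering MTTR both move the result toward 1, which is why the design targets both.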

In addition, the Cisco ASR 9000 Series Routers offer the following high availability (HA) features to enhance network level resiliency and enable network-wide protection:

Stateful Switchover

Fabric Switchover

Non-Stop Forwarding

Process Restartability

Fault Detection and Management

High Availability Router Operations

The Cisco ASR 9000 Series Routers offer a variety of hardware and software high availability features.

Stateful Switchover

The RSP/RP cards are deployed in “active/standby” configurations. Stateful switchover (SSO) preserves state and configuration information if a switchover to the standby RSP/RP card occurs. The standby RSP/RP card holds a mirror image of protocol state, user configuration, interface state, subscriber state, system state, and other parameters. Should a hardware or software failure occur in the active RSP/RP card, the standby RSP/RP card changes state to become the active RSP/RP card. This stateful switchover has no impact on traffic forwarding.
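
The mirroring idea behind SSO can be illustrated with a toy model. The following is a minimal sketch, not Cisco IOS XR code: the class names, the card names RSP0 and RSP1, and the state key are hypothetical, and real state synchronization is far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class RouteProcessor:
    name: str
    role: str  # "active" or "standby"
    state: dict = field(default_factory=dict)


class RedundantPair:
    """Hypothetical active/standby pair that mirrors every state update."""

    def __init__(self, active: RouteProcessor, standby: RouteProcessor):
        self.active = active
        self.standby = standby

    def update(self, key: str, value) -> None:
        # Apply the change on the active card and mirror it to the standby.
        self.active.state[key] = value
        self.standby.state[key] = value

    def switchover(self) -> RouteProcessor:
        # Swap roles. In this toy model the failed card is simply
        # relabeled as standby; the new active card already holds the state.
        self.active.role, self.standby.role = "standby", "active"
        self.active, self.standby = self.standby, self.active
        return self.active


pair = RedundantPair(RouteProcessor("RSP0", "active"),
                     RouteProcessor("RSP1", "standby"))
pair.update("interface-1", "up")
new_active = pair.switchover()
print(new_active.name, new_active.role, new_active.state)
```

Because every update is mirrored before the failure, the card that takes over does not need to relearn state, which is the property the text above relies on.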

Fabric Switchover

In the Cisco ASR 9010 Router, Cisco ASR 9006 Router, and Cisco ASR 9904 Router, the RSP card makes up most of the fabric. The fabric is configured in an “active/active” configuration model, which allows the traffic load to be distributed across both RSP cards. In the case of a failure, the single remaining “active” switch fabric continues to forward traffic in the system.

In the Cisco ASR 9922 Router and Cisco ASR 9912 Router, fabric switching across the RP and line cards is provided by a separate set of seven OIR FC cards operating in 6+1 redundancy mode. Any FC card can be removed from the chassis, power-cycled, or provisioned to remain unpowered without impacting system traffic. All FC cards remain active unless disabled or faulty. Traffic from the line cards is distributed across all FC cards.

Active/Standby Status Interpretation

Status signals from each RSP/RP card are monitored to determine active/standby status and whether a failure has occurred that requires a switchover from one RSP/RP card to the other.

Non-Stop Forwarding

Cisco IOS XR Software supports non-stop forwarding (NSF) to enable the forwarding of packets without traffic loss during a brief outage of the control plane. NSF is implemented through signaling and routing protocol implementations for graceful restart extensions as standardized by the Internet Engineering Task Force (IETF).

For example, a soft reboot of certain software modules does not hinder network processors, the switch fabric, or the physical interface operation of forwarding packets. Similarly, a soft reset of a non-data path device (such as an Ethernet Out-of-Band Channel Gigabit Ethernet switch) does not impact the forwarding of packets.

Nonstop Routing

Nonstop routing (NSR) allows forwarding of data packets to continue along known routes while the routing protocol information is being refreshed following a processor switchover. NSR maintains protocol sessions and state information across SSO functions for services such as MPLS VPN. TCP connections and the routing protocol sessions are migrated from the active RSP/RP card to the standby RSP/RP card after the RSP/RP switchover without letting peers know about the switchover. The sessions terminate and the protocols running on the standby RSP/RP card reestablish the sessions after the standby RSP/RP goes active. NSR can also be used with graceful restart to protect the routing control plane during switchovers. The NSR functionality is available only for Open Shortest Path First Protocol (OSPF) and Label Distribution Protocol (LDP) routing technologies.

Graceful Restart

Graceful restart (GR) provides a control plane mechanism to ensure high availability by allowing detection and recovery from failure conditions while preserving Nonstop Forwarding (NSF) services. Graceful restart is a way to recover from signaling and control plane failures without impacting the forwarding plane. Cisco IOS XR Software uses graceful restart and a combination of checkpointing, mirroring, route switch processor redundancy, and other system resiliency features to recover before a timeout and avoid service downtime as a result of network reconvergence.

Process Restartability

The Cisco IOS XR distributed and modular microkernel operating system enables process independence, restartability, and maintenance of memory and operational states. Each process runs in a protected address space. Checkpointing facilities, reliable transports, and retransmission features enable processes to be restarted without impacting other components and with minimal or no disruption of traffic. Usually, when a process fails, crashes, or incurs a fault, it restarts itself. For example, if a Border Gateway Protocol (BGP) or Quality of Service (QoS) process incurs a fault, it restarts to resume its normal routine without impacting other processes.

Fault Detection and Management

To minimize service outage, the Cisco ASR 9000 Series Routers provide rapid and efficient response to single or multiple system component or network failures. When local fault handling cannot recover from critical faults, the system offers robust fault detection, correction, failover, and event management capabilities.

  • Fault detection and correction—In hardware, the Cisco ASR 9000 Series Routers offer error correcting code (ECC)-protected memory. If a memory corruption occurs, the system automatically restarts the impacted processes to fix the problem with minimum impact. If the problem is persistent, the Cisco ASR 9000 supports switchover and online insertion and removal (OIR) capabilities to allow replacement of defective hardware without impacting services on other hardware components in the system.
  • Resource management—Cisco ASR 9000 Series Routers support resource threshold monitoring for CPU and memory utilization to improve out of resource (OOR) management. When threshold conditions are met or exceeded, the system generates an OOR alarm to notify operators of OOR conditions. The system then automatically attempts recovery, and allows the operator to configure flexible policies using the embedded event manager.
  • Online diagnostics—Cisco ASR 9000 Series Routers provide built-in online diagnostics to monitor functions such as network path failure detection, packet diversion failures, and faulty fabric link detection. The tests are configurable through the CLI.
  • Event management—Cisco ASR 9000 Series Routers offer mechanisms such as fault-injection testing to detect hardware faults during lab testing, a system watchdog mechanism to recover failed processes, and tools such as the Route Consistency Checker to diagnose inconsistencies between the routing and forwarding tables.
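
The resource threshold monitoring described in the list above can be sketched as a simple comparison loop. This is an illustration only: the `Threshold` type, the `check_resources` function, the alarm text, and the percentage values are all hypothetical and do not reflect the Cisco IOS XR implementation or its CLI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    resource: str
    limit_percent: float


def check_resources(usage: dict[str, float],
                    thresholds: list[Threshold]) -> list[str]:
    """Return one alarm message per resource at or above its threshold."""
    alarms = []
    for t in thresholds:
        value = usage.get(t.resource)
        # "Met or exceeded" means the comparison is inclusive.
        if value is not None and value >= t.limit_percent:
            alarms.append(
                f"OOR alarm: {t.resource} at {value:.0f}% "
                f"(threshold {t.limit_percent:.0f}%)"
            )
    return alarms


# Hypothetical utilization sample and limits.
limits = [Threshold("cpu", 90.0), Threshold("memory", 80.0)]
print(check_resources({"cpu": 95.0, "memory": 40.0}, limits))
```

In this sketch only the alarm step is modeled; automatic recovery and operator-defined policies would hang off the returned alarm list.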

Power Supply Redundancy

The Cisco ASR 9000 Series Routers are configured such that a power module failure or its subsequent replacement does not cause a significant outage.

A power supply failure or over/under voltage at the output of a power module is detected, and an alarm is raised.

AC Power Redundancy

The AC power modules have a modular design that allows replacement without any outage. Figure 3-1 shows the minimum and maximum module configurations for version 1 power modules. Figure 3-2 shows that version 2 is similar, with a minimum of one module per tray.

At least one fully loaded AC tray is required to power a fully loaded system. Each module outputs 3000 W.

For Cisco ASR 9010 Routers, the slot location of a module in the power trays is irrelevant as long as the two power trays have an equal number of modules installed (in case one tray should fail) (see Figure 3-1).

For Cisco ASR 9006 Routers and Cisco ASR 9904 Routers, the slot location of a module in the tray is irrelevant as long as there are N+1 modules (see Figure 3-3 and Figure 3-4).
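
The N+1 rule can be checked with a short calculation: N is the smallest number of modules that covers the load, and one more module must be installed. The sketch below uses the 3000 W per-module output stated above. The function names and the 4500 W load are assumptions for illustration, not a Cisco sizing tool or a published power budget.

```python
import math

AC_MODULE_WATTS = 3000  # per-module output stated in this section


def modules_needed(load_watts: float,
                   module_watts: float = AC_MODULE_WATTS) -> int:
    """N: the fewest modules whose combined output covers the load."""
    return max(1, math.ceil(load_watts / module_watts))


def has_n_plus_1(installed: int, load_watts: float,
                 module_watts: float = AC_MODULE_WATTS) -> bool:
    """True when one module can fail and the rest still cover the load."""
    return installed >= modules_needed(load_watts, module_watts) + 1


# Hypothetical 4500 W load: N is 2, so three modules give N+1.
print(modules_needed(4500), has_n_plus_1(3, 4500), has_n_plus_1(2, 4500))
```

This covers only the module-count rule. The tray-level requirement for the Cisco ASR 9010 Router (equal module counts in both trays) is a separate condition and is not modeled here.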

Figure 3-1 AC System Power Redundancy for the Cisco ASR 9010 Router—Version 1

 

Figure 3-2 AC System Power Redundancy for the Cisco ASR 9010 Router—Version 2

 

Figure 3-3 AC System Power Redundancy for the Cisco ASR 9006 Router—Version 2

 

Figure 3-4 AC System Power Redundancy for the Cisco ASR 9904 Router—Version 2

 

Figure 3-5 AC System Power Redundancy for the Cisco ASR 9922 Router—Version 2

 

Figure 3-6 AC System Power Redundancy for the Cisco ASR 9912 Router—Version 2

 


Note: The Cisco ASR 9010 Router, Cisco ASR 9922 Router, and Cisco ASR 9912 Router are capable of operating with power modules installed in only one of their power trays. However, such a configuration does not provide any redundancy.



Note: AC power redundancy for the Cisco ASR 9010 Router, Cisco ASR 9922 Router, and Cisco ASR 9912 Router requires that power modules be installed in multiple power trays.


DC Power Redundancy

The DC power modules have a modular design that allows replacement without any outage. Each tray houses up to three version 1 power modules or four version 2 power modules. Figure 3-7 shows the minimum and maximum module configurations for the version 1 power modules. Figure 3-8 shows that version 2 is similar, with a minimum of one module per tray.

The Cisco ASR 9000 Series Routers have two available DC power modules, a 2100 W module and a 1500 W module. Both types of power modules can be used in a single chassis. See Appendix A, “Technical Specifications,” for power module specifications.

The slot location of a module in a tray is irrelevant as long as there are N+1 modules.

Figure 3-7 DC System Power Redundancy for the Cisco ASR 9010 Router—Version 1

 

 

Figure 3-8 DC System Power Redundancy for the Cisco ASR 9010 Router—Version 2

 

Figure 3-9 DC System Power Redundancy for the Cisco ASR 9006 Router—Version 2

 

Figure 3-10 DC System Power Redundancy for the Cisco ASR 9904 Router—Version 2

 

Figure 3-11 DC System Power Redundancy for the Cisco ASR 9922 Router—Version 2

 

Figure 3-12 DC System Power Redundancy for the Cisco ASR 9912 Router—Version 2

 


Note: The Cisco ASR 9000 Series Routers are capable of operating with one power module. However, such a configuration does not provide any redundancy.


Redundant –48 VDC power feeds are separately routed to each power tray. For maximum diversity, the power entry point to each tray is spatially separated to the left and right edges of the tray. Each feed can support the power consumed by the entire tray. There is load sharing between the feeds. Each power module in the tray uses either feed for power, enabling maintenance or replacement of a power feed without causing interruption.

Detection and Reporting of Power Problems

All –48 VDC feed and return lines have fuses and are monitored. Any blown fuse can be detected and reported. The input voltages are monitored against overvoltage and undervoltage alarm thresholds. The controller area network (CAN) monitors the power output voltage levels.
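
Monitoring an input voltage against the two alarm thresholds amounts to a range check. The following is a minimal sketch under stated assumptions: the function name is hypothetical, and the 40 V and 72 V limits are placeholders chosen for the example, not thresholds documented for these routers.

```python
def classify_input_voltage(volts: float,
                           under_limit: float,
                           over_limit: float) -> str:
    """Compare the magnitude of a DC feed voltage against alarm limits."""
    magnitude = abs(volts)  # feeds are negative with respect to return
    if magnitude < under_limit:
        return "undervoltage alarm"
    if magnitude > over_limit:
        return "overvoltage alarm"
    return "ok"


# Placeholder limits for illustration only.
for reading in (-48.0, -38.0, -75.0):
    print(reading, classify_input_voltage(reading, 40.0, 72.0))
```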

Cooling System Redundancy

The Cisco ASR 9000 Series Routers are configured in such a way that a fan failure or its subsequent replacement does not cause a significant outage. During either a fan replacement or a fan failure, the airflow is maintained and no outage occurs. Also, the fan trays are hot swappable so that no outage occurs during replacement.

The Cisco ASR 9010 Router has two fan trays at the bottom of the card tray. Each fan tray has 12 fans arranged in three groups of four fans each. Two fans of each group share a fan controller. The power supplied to the fan controller is 1:3 protected. A single fan failure has no impact on air flow because the other 11 fans will compensate for it. If the fan controller fails, there is a possibility of up to two fans failing; however, the design always has two fans operating in a row (three rows of fans) to compensate for the air speed.

The Cisco ASR 9006 Router has two fan trays at the top left of the chassis. Each fan tray has six fans arranged in three groups of two fans each. The two fans in a group share a fan controller. The power supplied to the fan controller is 1:3 protected. A single fan failure has no impact on air flow because the other five fans will compensate for it. If the fan controller fails, there is a possibility of up to two fans failing; however, the design always has two fans operating to compensate for the air speed.

The Cisco ASR 9904 Router has a single fan tray located at the left side of the chassis and is accessible from the rear. The fan tray has 12 fans arranged in three groups of four fans each. Two fans in each group share a fan controller. The power supplied to the fan controller is 1:3 protected. A single fan failure has no impact on air flow because the other 11 fans will compensate for it. If a fan controller fails, there is a possibility of up to two fans failing; however, the design always has two fans operating to compensate for the air speed.

In the Cisco ASR 9922 Router, the two top fan trays are located between the top and middle card cages, while the two bottom fan trays are located between the middle and bottom card cages. In the Cisco ASR 9912 Router, the two fan trays are located above the line card cage. Each fan tray has 12 fans arranged in three groups of four fans each. Two fans of each group share a fan controller. The power supplied to the fan controller is 1:3 protected. A single fan failure has no impact on air flow because the other 11 fans will compensate for it. If the fan controller fails, there is a possibility of up to two fans failing; however, the design always has two fans operating in a row (three rows of fans) to compensate for the air speed.
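
The 12-fan tray layout described above (three groups of four fans, two fans per controller) can be modeled to confirm that a single controller failure still leaves two fans running in every group. This is a sketch under an assumption: the text does not say which two fans in a group share a controller, so the code pairs adjacent fans. All names are hypothetical.

```python
GROUPS = 3               # rows of fans in a 12-fan tray
FANS_PER_GROUP = 4
FANS_PER_CONTROLLER = 2  # two fans of each group share a controller


def build_tray() -> dict[int, list[tuple[int, int]]]:
    """Map each controller id to the (group, fan) positions it drives."""
    controllers = {}
    cid = 0
    for group in range(GROUPS):
        # Assumption: adjacent fans in a group share a controller.
        for start in range(0, FANS_PER_GROUP, FANS_PER_CONTROLLER):
            controllers[cid] = [(group, start + i)
                                for i in range(FANS_PER_CONTROLLER)]
            cid += 1
    return controllers


def running_per_group(controllers, failed_controller=None) -> list[int]:
    """Count the fans still running in each group after a controller fails."""
    running = [0] * GROUPS
    for cid, fans in controllers.items():
        if cid == failed_controller:
            continue  # both fans on the failed controller stop
        for group, _fan in fans:
            running[group] += 1
    return running


tray = build_tray()
print(len(tray), running_per_group(tray), running_per_group(tray, 0))
```

Under this pairing the tray has six controllers, and losing any one of them removes at most two fans from a single group, which matches the "two fans operating in a row" statement for the 12-fan trays.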


Caution: If only one fan tray is installed in the system, a single point of failure does not cause all fans to stop. However, the system cannot operate without a fan tray. The system shuts itself off if all fan trays are removed and the system crosses the Shutdown Temperature Threshold (STT).

Cooling Failure Alarm

Temperature sensors are installed in all cards and fan trays. These sensors detect and report any fan failure or high temperature condition, and raise an alarm. Fan failure can be a fan stopping, fan controller failure, power failure, or a failure of a communication link to the RSP/RP card.

Every card has temperature measurement points in the hottest expected area to clearly indicate a cooling failure. The line cards have two sensors, one at the inlet and one near the hottest devices on the card. The RSP/RP card also has two sensors.