Redundancy and Availability Features

 
Every minute of downtime and every dropped session represent lost revenue for the wireless operator, and can lead to customer loss and reduced profitability. With this in mind, we have developed a system that exceeds the availability features found in the majority of today's wireless and wireline access devices.
 
Service Availability Features
In its recommended redundant configuration, the system provides the highest level of service assurance. The following sections describe the service availability features found in the system.
 
Hardware Redundancy Features
In addition to providing the highest transaction rates and session capacity, the system is designed to provide robust hardware reliability and service assurance features.
Features of the hardware design include:
 
ASR 5000
 
Important: 1:1 redundancy is supported for these cards. However, some subscriber sessions and accounting information may be lost in the event of a hardware or software failure, even though the system remains operational.
 
Hardware Redundancy Configuration
The maximum redundant configuration for a fully loaded system supporting data services consists of the following:
 
This configuration allows for the highest session capacity while still providing redundancy. The following figures depict this recommended maximum redundant configuration.
Recommended Redundant Configuration for Data Services - Front View
Recommended Redundant Configuration for Data Services - Rear View
 
Maintenance and Failure Scenarios
The following table shows various maintenance and failure scenarios involving the SMC and SPIO cards and explains how each situation is resolved.
Service Assurance Features for the SMC and SPIO
Important: When an SMC or SPIO failover occurs, the standby SMC or SPIO automatically becomes active. However, should the failed card's error condition be corrected (by replacement or configuration change), the state of the repaired SMC or SPIO does not automatically return to the active state. This migration must occur through manual intervention by a system administrative user.
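The non-revertive behavior described in the note above can be illustrated with a small model. This is a minimal sketch, not the system's implementation: the class `RedundantPair`, its method names, and the card labels `SMC-A` and `SMC-B` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RedundantPair:
    """Hypothetical model of a 1:1 redundant card pair (not a system API)."""
    active: str
    standby: Optional[str]

    def fail(self, card: str) -> None:
        """Record a card failure; a failed active card triggers automatic failover."""
        if card == self.active:
            if self.standby is None:
                raise RuntimeError("no standby card available to take over")
            # The standby automatically becomes active.
            self.active, self.standby = self.standby, None
        elif card == self.standby:
            self.standby = None

    def repair(self, card: str) -> None:
        """A repaired card rejoins as standby; it never reclaims the active role."""
        if card != self.active and self.standby is None:
            self.standby = card

    def manual_switchover(self) -> None:
        """Administrator-initiated migration back to the other card."""
        if self.standby is None:
            raise RuntimeError("no standby card to switch to")
        self.active, self.standby = self.standby, self.active
```

In this model, failing `SMC-A` promotes `SMC-B`, repairing `SMC-A` leaves it in standby, and only an explicit `manual_switchover()` call returns it to the active role.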
By performing online process migration, supporting 1:1 SMC and SPIO redundancy, and using the fully redundant switching fabric and control bus, the system eliminates single points of failure from the switch fabric and system management capabilities.
The following table shows various maintenance and failure situations involving the processing cards (PSC, PSC2, PPC), Line Cards (LCs), and RCC cards and explains how each situation is resolved. Note that LCs are not needed behind the standby processing cards that provide redundancy.
Service Assurance Features for Processing and Line Cards
< 5 sec. interrupt until new A11/GTP-C manager is available (new sessions only). NOTE: Applies only when A11/GTP-C manager is on failed card.
Important: If the session recovery feature is enabled, a processing card hardware failure will not cause any loss of fully established HA subscriber sessions. This feature does, however, require a minimum processing card configuration per chassis of three active cards and two standby cards to prevent data loss and to support session recovery.
Important: When a processing or line card failover occurs, the redundant component (when installed) automatically begins providing service. However, once the failed card's error condition is corrected (by replacement or configuration change), there is no automatic return of control to the repaired processing or line card. This migration must occur through manual intervention by a system administrative user.
 
Software Assurance Features
Numerous features are built into the system software to ensure the continuation of service in the case of software process failures. SMC software controls the management contexts and overall system control, while processing card software controls the PPP sessions, AAA, and VPN processes.
The following table shows various software process failure situations involving the SMC and SPIO cards, provides impact analysis (if any), and explains how each situation is resolved using rapid failure detection techniques found in the system.
Service Assurance Features for the SPC/SMC Software
The following table shows various software process failure situations involving the processing cards, provides impact analysis (if any), and explains how each situation is resolved using rapid failure detection techniques found in the system.
Service Assurance Features for the Processing Cards Software
 
Session Recovery Feature
This licensed software feature performs an automatic recovery of all fully established subscriber sessions should a session manager task failure occur. This functionality is available for the following call types:
 
With this feature enabled, there is no loss of session information as described in the table above. Session recovery consists of the migration and recreation of control and data packet state information, subscriber session statistics, and session time parameters such as the idle timer.
Typical recovery time for a single session manager failure is not expected to exceed 10 seconds. Should a processing card hardware failure occur during a migration, then the time to recover all tasks and subscriber sessions should not exceed 60 seconds.
This feature is enabled or disabled on a chassis-wide basis and requires additional processing card hardware to ensure that enough reserve resources (memory, processing, etc.) are available to fully recover sessions in the event of a software or hardware failure.
 
Interchassis Session Recovery
The Interchassis Session Recovery feature provides the highest possible availability for continuous call processing without interrupting subscriber services. This is accomplished through the use of redundant chassis. The chassis are configured as primary and backup with one being active and one inactive. Both chassis are connected to the AAA server. When calls pass the checkpoint duration timer, checkpoint data is sent from the active chassis to the inactive chassis. If the active chassis handling the call traffic goes out of service, the inactive chassis transitions to the active state and continues processing the call traffic without interrupting the subscriber session.
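The checkpoint-and-takeover flow described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the product's implementation: the `Chassis` class, the `send_checkpoints` and `fail_over` functions, and the time values are hypothetical, and session state is reduced to an opaque string.

```python
from dataclasses import dataclass, field


@dataclass
class Chassis:
    """Hypothetical stand-in for one chassis in a redundant pair."""
    name: str
    active: bool
    # call_id -> (start_time_seconds, session_state) for calls handled locally
    calls: dict = field(default_factory=dict)
    # call_id -> (start_time_seconds, session_state) received from the peer
    checkpoints: dict = field(default_factory=dict)


def send_checkpoints(active, standby, now, checkpoint_after):
    """Copy state to the standby for calls that have passed the checkpoint timer."""
    for call_id, (start_time, state) in active.calls.items():
        if now - start_time >= checkpoint_after:
            standby.checkpoints[call_id] = (start_time, state)


def fail_over(failed, standby):
    """Promote the standby and resume every call it holds checkpoint data for."""
    failed.active = False
    standby.active = True
    standby.calls.update(standby.checkpoints)
    return standby
```

In this model, a call that has not yet passed the checkpoint duration timer when the active chassis fails has no checkpoint data on the peer, so only checkpointed calls continue on the newly active chassis.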
The chassis determine which is active through a proprietary TCP-based connection called a redundancy link. This link is used to exchange status messages between the primary and backup chassis and must be maintained for proper system operation. If the redundancy link goes out of service, interchassis session recovery is maintained through the use of authentication probes and BGP peer monitoring. BGP routing must be enabled.
Interchassis Session Recovery is currently supported on chassis configured for GGSN service or HA services in support of Mobile IP and Proxy Mobile IP session types.
 
Mean Time Between Failure and System Availability
Mean Time Between Failure (MTBF) data provides a statistical estimate of the length of time that should elapse before a particular card or system fails. This information is calculated using the following method:
Calculated MTBF - Expected elapsed time before failure occurs using the method defined in Telcordia TR-NWT-000332-CORE. This is based on reliability of components and design factors.
Failures per million hours (Fpmh) identifies the predicted failure rate per one million hours for a component of the system (for every 1,000,000 hours of operation, “FITS number” of failures would be expected to occur).
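The relationship between the two figures can be made concrete with a short calculation. The sketch below assumes a constant failure rate, in which case MTBF in hours is one million divided by the Fpmh value, and the failure rates of non-redundant components in series add together. The function names and the sample values are illustrative only and are not taken from the MTBF table.

```python
HOURS_PER_MILLION = 1_000_000


def mtbf_hours_from_fpmh(fpmh: float) -> float:
    """Convert failures per million hours to MTBF in hours (constant-rate assumption)."""
    if fpmh <= 0:
        raise ValueError("fpmh must be positive")
    return HOURS_PER_MILLION / fpmh


def fpmh_from_mtbf_hours(mtbf_hours: float) -> float:
    """Convert MTBF in hours back to failures per million hours."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return HOURS_PER_MILLION / mtbf_hours


def series_fpmh(component_fpmh) -> float:
    """Failure rate of non-redundant components in series: the rates add."""
    return sum(component_fpmh)


# Illustrative values only: a component rated at 4 Fpmh.
example_mtbf = mtbf_hours_from_fpmh(4.0)  # 250000.0 hours
```

Redundant configurations are not covered by `series_fpmh`; modeling them requires the redundancy scheme to be taken into account.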
 
MTBF Table
The following table shows the MTBF characteristics of each major component of the system.
Mean Time Between Failure Statistics
 
System Availability
System-level Mean Time To Failure (MTTF) is the average interval of time that a component will operate before failing. Reliability information is based on the number of overall anticipated failures of the individual components, in conjunction with any redundancy schemes employed to minimize the impact of such failures.
The following table provides service availability calculations (based on reliability modeling) for the ASR 5000 platform.
Platform Service Availability Calculations
One suggestion to help improve overall system availability is to institute an on-site spares program, wherein key components are housed locally with the deployed equipment. The following section defines a recommended spares program and quantities for the system.
Mean Time To Repair (MTTR) is the amount of time needed to repair a component, recover the system, or otherwise restore service after a failure. System availability calculations are based on the industry standard of four hours.
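As an illustration of how MTTF and MTTR combine, the sketch below uses the common steady-state formula availability = MTTF / (MTTF + MTTR) with the four-hour MTTR mentioned above. The formula choice, the function names, and the sample MTTF are assumptions made for illustration; the platform figures in the table come from reliability modeling that also accounts for redundancy, which this sketch does not.

```python
MTTR_HOURS = 4.0  # four-hour repair time stated in the text
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(mttf_hours: float, mttr_hours: float = MTTR_HOURS) -> float:
    """Steady-state availability as the fraction of time in service."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mttf_hours must be positive and mttr_hours non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected minutes out of service per year for a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


# Illustrative MTTF only (not a published figure for any card or system).
sample = availability(39_996.0)
```

With these sample inputs the availability is approximately 0.9999, which corresponds to roughly 52.6 minutes of downtime per year. A shorter MTTR, for example through an on-site spares program, raises availability for the same MTTF.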
 
Spare Component Recommendations
This section provides recommended quantities of spare parts for a spare components program for the system. This information is provided for guidance only and should be used as a guideline for designing a spares program that meets your company's design, deployment, and availability goals.
It is recommended that your company either has fully-trained personnel available to effect the exchange of Field Replaceable Units (FRUs) within your network, or requests on-site or field engineering resources to perform such duties.
Based on industry-leading redundancy and failover features found in the system, the following minimum spare parts levels for any planned deployment are recommended.
Recommended FRU Parts Sparing Quantities

Cisco Systems Inc.