Cisco 7500 Series Routers

High-Availability Initiative - Beat the Downtime

Overview

Summary

As the number of Internet users and Internet-based mission-critical applications increases daily at an unprecedented pace, service-provider and enterprise customers are demanding greater reliability and availability. When every minute of downtime can mean millions of dollars in lost revenue and embarrassing headlines, companies are eagerly looking for solutions to make their systems highly available. The Cisco High-Availability (HA) feature of networking products helps customers increase uptime and protect financial performance, reputation, and customer loyalty.

Cisco Systems has initiated a series of programs to enhance HA performance of its products. This overview covers the High-Availability initiative specific to the Cisco 7500 Series Router.

Availability Basics

Availability isn't just a concept, but a science, which can be expressed mathematically. A highly available system is one that is usable when the customer needs it. A system can be highly available, operating from 8 a.m. to 5 p.m., if that is all the business demands. The remaining time can be used for scheduled maintenance and repair. Availability is defined as actual service divided by required service. The challenge for many of today's systems is to operate 24 hours a day, 365 days a year (also referred to as 7 x 24, or 365 x 24).

Availability is often expressed in percentages. A 365 x 24 system with 99.9-percent availability has an average downtime of 8.76 hours per year (about 526 minutes). A system with only 3 minutes of service outage per year must have better than 99.999-percent availability, which allows about 5.26 minutes of downtime per year.
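The conversion between an availability percentage and yearly downtime can be sketched in a few lines of Python. This is a minimal illustration that assumes a 365 x 24 service requirement; the function and constant names are hypothetical and chosen only for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365 x 24 year


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the average downtime per year, in minutes, for a 365 x 24 system."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


def availability_from_downtime(downtime_minutes: float) -> float:
    """Return the availability percentage implied by a yearly downtime in minutes."""
    if not 0.0 <= downtime_minutes <= MINUTES_PER_YEAR:
        raise ValueError("downtime must fit within one year")
    return (1.0 - downtime_minutes / MINUTES_PER_YEAR) * 100.0


if __name__ == "__main__":
    for percent in (99.0, 99.9, 99.99, 99.999):
        minutes = annual_downtime_minutes(percent)
        print(f"{percent}% availability -> {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, 99.9 percent corresponds to 525.6 minutes (8.76 hours) per year, and 99.999 percent to roughly 5.26 minutes per year.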

Availability is calculated using statistical models for all the system components. The simplest model for a component is binary: the component is either in or out of service. Availability can be calculated from failure rates, measured in mean time between failures (MTBF), and repair times, measured in mean time to repair (MTTR).

The average downtime contribution by any component is calculated by amortizing the MTTR time over the MTBF period. For example, if a component critical to the operation of the platform has an MTBF of 250,000 hours and an MTTR of 1 hour, it contributes 2.1 minutes (60 min / 250,000 hr x 8760 hr/yr) of unavailability to the system per year.
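The amortization described above can be written as a short Python sketch. It follows the approximation used in the text (MTTR divided by MTBF); the function name is hypothetical.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def downtime_minutes_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Amortize the repair time over the failure interval, scaled to one year."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    unavailable_fraction = mttr_hours / mtbf_hours  # approximation used in the text
    return unavailable_fraction * HOURS_PER_YEAR * 60


if __name__ == "__main__":
    # Values from the example: MTBF of 250,000 hours, MTTR of 1 hour.
    print(f"{downtime_minutes_per_year(250_000, 1):.1f} minutes per year")
```

The exact steady-state unavailability is MTTR / (MTBF + MTTR); when the MTBF is much larger than the MTTR, as in this example, the difference from the simpler ratio is negligible.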

Availability in the two-nines or three-nines range (99 to 99.9 percent) can be achieved by maximizing the reliability of components and minimizing repair times. To achieve higher reliability or to compensate for less-reliable components, redundancy is used. Having a backup for a component that fails keeps the system operating. Availability of redundant configurations is calculated based on the time taken to detect and switch over to the redundant component.
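As a rough illustration of why switchover time dominates in a redundant configuration, the following Python sketch estimates yearly downtime when each failure costs only the time to detect it and switch over. This is a simplified model, not a formula from Cisco: it assumes the standby component is always healthy when the active one fails, and the detection and switchover times in the example are hypothetical.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def redundant_downtime_minutes_per_year(
    mtbf_hours: float, detect_seconds: float, switchover_seconds: float
) -> float:
    """Yearly downtime when each failure costs only detection plus switchover time.

    Simplifying assumption (not from the source): the standby component is
    always healthy when the active one fails.
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    failures_per_year = HOURS_PER_YEAR / mtbf_hours
    outage_minutes_per_failure = (detect_seconds + switchover_seconds) / 60.0
    return failures_per_year * outage_minutes_per_failure


if __name__ == "__main__":
    # Hypothetical inputs: a 250,000-hour MTBF with an illustrative
    # 5-second detection time and 35-second switchover time.
    minutes = redundant_downtime_minutes_per_year(250_000, 5, 35)
    print(f"{minutes:.4f} minutes of downtime per year")
```

In this model the repair time of the failed component no longer appears, because the repair happens while the redundant component carries the load.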

Cisco Systems has been working diligently to move the Cisco 7500 Series Router into the "five-nines" (99.999-percent) availability area, the highest performance achieved in the industry.

Cisco 7500 High-Availability Initiatives

The Cisco 7500 Series Router high-availability initiatives focus on increasing availability during both planned and unplanned network outages. Because the Cisco 7500 holds a prominent position in large-scale networks as an edge router, customers are demanding high availability from it. Edge routers do not benefit from the redundant network topologies that core routers typically enjoy and are therefore likely to be a single point of failure in a network. Customers see downtime as a major obstacle to their business goals and customer relations. However, it is not always possible to build equipment and circuit redundancy throughout the entire network. Therefore, the availability initiatives of an edge router must concentrate on features that will:

  • Isolate errors in one part of the router so that they do not affect the rest of the system.

  • Allow a switchover from a faulty processor to a redundant processor in the event of a failure.

  • Minimize the switchover time between processors.

Cisco 7500 Series Router enhanced high-availability features include:

  • Cisco 7500 Route Processor Redundancy (RPR)

  • Cisco 7500 Fast Software Upgrade (FSU)

  • Cisco 7500 Route Processor Redundancy+ (RPR+)

  • Cisco 7500 Single Line Card Reload (SLCR)

  • Cisco 7500 Stateful SwitchOver (SSO)

  • Cisco 7500 Non-Stop Forwarding (NSF)

High System Availability Review

Redundancy is one of the key methods for increasing system availability. When the active Route Switch Processor (RSP) fails, the standby RSP takes over so that the system continues processing and forwarding. Cisco High System Availability (HSA) uses this approach to deal with route processor failure. However, areas of this process can still be optimized. In the HSA process, the time from initial failure to first packet transmission can be broken down as follows:

1. Time to identify failure

2. Time to load and boot software on standby RSP

3. Time to load new configuration on standby RSP

4. Time to reset and reload line cards

5. Time to load new configuration on line cards

6. Time to learn routes, pass keepalive messages, and forward traffic

7. Route convergence

This process is called Cold Standby, which implies that the entire system will lose function for the duration of the restoration. All traffic flowing through the router is lost during this time. The benefit of using Cold Standby is that the device will restart without manual intervention by rebooting with the standby RSP taking control of the router.

Route Processor Redundancy

The Cisco 7500 implements the RPR feature to eliminate Steps 2 and 3 in the HSA switchover process, thereby reducing the failure recovery time. The recovery time is reduced because the standby RSP has already started the bootup process before taking control of the router. This is called warm standby mode.

In warm standby mode, when the router powers up, both the active and standby RSPs will boot and initialize themselves. The standby RSP goes through most of the Cisco IOS® Software bootup process, but does not complete some final steps. Initialization will take place as if all the line cards (LCs) have been removed by online-insertion-and-removal (OIR).

If the active RSP fails, the standby RSP takes over. The standby RSP needs only to complete the final steps of the bootup process when becoming the active RSP, thus reducing the recovery time. The LCs will then be OIR inserted by the standby RSP (now active RSP) during the switchover. This new switchover strategy reduces the switchover time by 50 percent compared to the Cold Standby scenario (from 8-10 minutes down to 4-5 minutes).

Route Processor Redundancy+

Building on the RPR feature, the RPR+ feature further eliminates Steps 4 and 5 in the HSA switchover process. RPR+ on the Cisco 7500 keeps the LCs up during the switchover, so they are not reloaded or reinitialized. This feature reduces the route processor switchover time by about 90 percent (down to 30-40 seconds) compared to RPR.
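The relationship between the seven HSA switchover steps and the steps that RPR and RPR+ eliminate can be summarized in a small Python sketch. The data structure and function names are hypothetical; the step list and the switchover times are the ones quoted in this overview.

```python
# The seven steps of the HSA (Cold Standby) switchover, as listed in the overview.
HSA_STEPS = {
    1: "Identify failure",
    2: "Load and boot software on standby RSP",
    3: "Load new configuration on standby RSP",
    4: "Reset and reload line cards",
    5: "Load new configuration on line cards",
    6: "Learn routes, pass keepalive messages, and forward traffic",
    7: "Route convergence",
}

# Steps each redundancy mode eliminates, per the text.
ELIMINATED_STEPS = {
    "HSA": set(),
    "RPR": {2, 3},
    "RPR+": {2, 3, 4, 5},
}


def remaining_steps(mode: str) -> list[int]:
    """Return the step numbers still performed during a switchover in this mode."""
    return [n for n in sorted(HSA_STEPS) if n not in ELIMINATED_STEPS[mode]]


def percent_reduction(before_seconds: float, after_seconds: float) -> float:
    """Return the percentage by which the switchover time shrinks."""
    return (before_seconds - after_seconds) / before_seconds * 100.0


if __name__ == "__main__":
    for mode in ("HSA", "RPR", "RPR+"):
        print(mode, "performs steps", remaining_steps(mode))
    # Lower bounds of the quoted ranges: 8 min (HSA), 4 min (RPR), 30 s (RPR+).
    print(f"HSA -> RPR:  {percent_reduction(8 * 60, 4 * 60):.1f}% faster")
    print(f"RPR -> RPR+: {percent_reduction(4 * 60, 30):.1f}% faster")
```

With the quoted ranges, moving from HSA to RPR halves the switchover time, and moving from RPR to RPR+ cuts it by roughly 87 percent, which is consistent with the approximate 90-percent figure.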

Fast Software Upgrade

While RPR and RPR+ deal with unanticipated RSP failures, Fast Software Upgrade (FSU) increases availability during planned downtime, such as software upgrades and maintenance.

FSU uses the same process as RPR to dramatically reduce planned downtime. Instead of running the same Cisco IOS Software image on both the active and standby RSPs, as in RPR, an upgraded Cisco IOS Software image is loaded on the standby RSP before switchover. This saves the same amount of time as in the RPR scenario, including the download, decompression, and initialization time otherwise spent when upgrading the Cisco IOS image.

Single Line Card Reload

Before this feature was available, a fault on one LC rendered the entire backplane inactive and caused all LCs to be reloaded; no packets were forwarded during this time. To increase HA on the Cisco 7500 Series Router, SLCR introduces a new recovery process: when an error occurs on a single LC, only the faulty LC is reloaded rather than all the LCs. This new process reduces the time to recover from a single LC failure by 85 percent.

Stateful Switchover (SSO)

This feature, which is based on RPR+, will reduce the time in Steps 6 and 7. The stateful switchover feature allows the active RSP to pass the necessary state information of key routing and interface protocols to the standby RSP upon switchover, thereby reducing the time for the standby RSP to learn and converge routes. This feature is planned to be available in 12.0(22)S.

Non-Stop Forwarding (NSF)

Also based on RPR+, Non-Stop Forwarding allows routers with redundant RSPs to continue forwarding data while control switches over to the standby RSP. During this time, the router uses the Forwarding Information Base (FIB) that was current at the time of the switchover. Once the routing protocols have converged, the FIB table is updated and stale route entries are deleted. This feature eliminates downtime during the switchover. Planned availability of this feature is in 12.0(22)S.
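The stale-entry handling described above can be illustrated with a toy Python model. This is only a conceptual sketch, not Cisco's implementation: the class and method names are hypothetical, the addresses are documentation examples, and the lookup is an exact-match dictionary rather than a real longest-prefix match.

```python
class ToyFib:
    """Toy forwarding table that maps a prefix to a next hop and a stale flag."""

    def __init__(self) -> None:
        self._entries: dict[str, dict] = {}

    def install(self, prefix: str, next_hop: str) -> None:
        """Add or refresh a route; a refreshed route is no longer stale."""
        self._entries[prefix] = {"next_hop": next_hop, "stale": False}

    def lookup(self, prefix: str):
        """Return the next hop for a prefix, whether or not the entry is stale."""
        entry = self._entries.get(prefix)
        return entry["next_hop"] if entry else None

    def mark_all_stale(self) -> None:
        """At switchover, keep forwarding on every entry but flag it as stale."""
        for entry in self._entries.values():
            entry["stale"] = True

    def purge_stale(self) -> list[str]:
        """After convergence, delete entries that were never refreshed."""
        stale = [p for p, e in self._entries.items() if e["stale"]]
        for prefix in stale:
            del self._entries[prefix]
        return stale


if __name__ == "__main__":
    fib = ToyFib()
    fib.install("10.0.0.0/8", "192.0.2.1")
    fib.install("172.16.0.0/12", "192.0.2.2")

    fib.mark_all_stale()                     # switchover: forwarding continues
    print(fib.lookup("172.16.0.0/12"))       # still resolves from the old FIB

    fib.install("10.0.0.0/8", "192.0.2.9")   # route relearned after switchover
    print(fib.purge_stale())                 # routes that were not relearned
```

The point of the sketch is that lookups never stop during the switchover; only entries that the routing protocols fail to relearn are removed once convergence completes.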

Conclusion

Cisco Systems is committed to continuously improving the High Availability performance of the Cisco 7500 Series Router. For service providers, this means millions of dollars in operating savings from avoided system outages as well as a reputation for reliability, two critical success factors for service providers. For enterprise customers, this means continuous operation of critical business communications and applications, which will dramatically increase an enterprise's productivity and competitiveness. In summary, the enhanced High Availability performance will be critical for both service-provider and enterprise customers to be successful in today's competitive marketplace.