Cisco on Cisco
IP Telephony Case Study: How Cisco Built Survivable Remote Site Telephony into IT Operations Command Center
SRST ensures highly reliable backup for critical operations command center.
If there’s one group at Cisco that is critical to the company’s logical existence, it is the IT Operations Command Center (OCC). With fewer than half a dozen people working around the clock, OCC shoulders a huge responsibility: facilitating problem resolution for significant failures that could affect business continuity. Of the 24,000 monitored resources in the Cisco network, the OCC group continuously monitors the 10,000 resources considered priority 1 and priority 2 (P1 and P2), including hosts, network devices, applications, and databases. Should a failure occur, the OCC staff provides communication, escalation, coordination, and documentation, which is the framework that enables technical resources to resolve the incident, and then performs root cause analysis later.
“We’re process people,” says Ian Reddy, IT project manager for the IT Operations Command Center. “The phone and paging system are our most critical communications tools during an outage. Even if there’s a problem with the regular phone system, it’s imperative that the phones in the OCC remain available.”
In early 2000 the Cisco San Jose campus was still supported by a pair of TDM-based private branch exchange (PBX) switches. They were very reliable, which was a good thing, because backing them up with a second system would have been prohibitively expensive and difficult. During that time the IT OCC used the same phone system as did every Cisco employee in San Jose, without any form of backup.
When Cisco converted to an IP telephony network and exchanged its TDM-based PBX switches for Cisco Unified Communications Manager (formerly Cisco CallManager) clusters, the OCC was the last to make the transition. There was concern in 2000 about the new technology; the IP solution needed to provide the same or a higher level of reliability than any TDM solution would provide. While IP telephony proved itself over time to be a remarkably reliable architecture for voice, for a while there was still no additional backup. Survivable Remote Site Telephony (SRST), a feature of the Cisco IOS® Software that appeared in 2002 and can run on almost any Cisco router, provided that backup capability. “Our legacy PBX had redundancy,” says Fran McBrearty, IT project manager, “so we didn’t have to concern ourselves with a backup system. SRST provides a greater level of backup. Before the OCC would make the transition, we provided complete confidence that the IP telephony system would deliver continuous availability.”
The basic IP telephony solution used on the Cisco campus was already highly reliable in its own right. The Cisco Unified Communications Manager “supercluster” at headquarters manages 50,000 phone numbers, and incorporates a redundant design featuring a publisher, two TFTP servers (one primary and one secondary), and 16 subscribers (eight primary and eight secondary). Primary servers are kept in one data center while the secondary servers are kept in another data center on the same campus. In addition, all campus LAN connections and all WAN connections are duplexed, and each path is kept physically separate wherever possible, which reduces the chances of any IP phone being cut off from its Cisco Unified Communications Manager cluster. This level of redundancy delivers the high availability that major enterprises demand for their telephony systems. The San Jose campus IP telephony system has provided 99.998 percent dial-tone availability over the past 12 months.
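To put the 99.998 percent figure in concrete terms, the short calculation below converts an availability percentage into the downtime it implies over a year. It is only a sketch: the function name is illustrative, and it assumes a 365-day year and that availability is measured uniformly across the period.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes(availability_percent: float,
                     period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the downtime (in minutes) implied by an availability percentage."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * period_minutes


# 99.998 percent availability over 12 months
print(f"{downtime_minutes(99.998):.1f} minutes per year")  # prints "10.5 minutes per year"
```

Under these assumptions, 99.998 percent availability corresponds to roughly ten and a half minutes without dial tone in a year.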
This level of availability was still not enough to convince the OCC to migrate to IP telephony in 2000, and the OCC saw an opportunity for even greater availability. “The telephony system is so critical to the OCC, and to Cisco business continuity, that the OCC wanted yet another backup in the event that not only the Unified Communications Manager supercluster failed, but also the PSTN gateways,” says McBrearty.
With the availability of SRST features in Cisco routers, Cisco IT began planning the OCC transition to IP telephony in mid-2002. “Over the past two years we evaluated every alternative we could find,” says McBrearty. “That included other vendors’ solutions, and even having two telephones on each desk. Until SRST was available, we didn’t see anything that really met our needs.”
Through demonstrations and statistical evidence, McBrearty and Reddy determined that the best solution for ensuring continuous availability of the IP telephony system was a redundant architecture based on Cisco SRS Telephony, a feature of the Cisco IOS Software. An SRS Telephony router automatically detects when it (and the IP phones it supports) can no longer connect to the Cisco Unified Communications Manager cluster. The router then uses the Cisco Simple Network Automated Provisioning (SNAP) capability to autoconfigure itself and provide limited backup call processing for the IP phones.
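The detect-then-take-over behavior described above can be pictured as a small state machine. The sketch below is illustrative only and is not how Cisco IOS implements SRST: the class name, the idea of counting missed keepalives, and the threshold of three are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FallbackMonitor:
    """Tracks whether a router should provide backup call processing."""
    missed_limit: int = 3        # hypothetical threshold, not from the case study
    missed: int = 0
    fallback_active: bool = False

    def record_keepalive(self, acknowledged: bool) -> bool:
        """Update state for one keepalive attempt; return True while in fallback."""
        if acknowledged:
            # Cluster is reachable: hand call processing back to it.
            self.missed = 0
            self.fallback_active = False
        else:
            self.missed += 1
            if self.missed >= self.missed_limit:
                # Too many consecutive misses: start limited call processing.
                self.fallback_active = True
        return self.fallback_active


monitor = FallbackMonitor()
history = [monitor.record_keepalive(ok) for ok in (True, False, False, False, True)]
print(history)  # [False, False, False, True, False]
```

In this model the router enters fallback only after several consecutive misses, which avoids reacting to a single lost packet, and leaves fallback as soon as the cluster answers again.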
In its usual role, SRS Telephony is used in branch offices so that employees can continue to place and receive calls in the event of a WAN outage. Cisco branch offices are equipped with both a primary and a backup WAN link, which provide connectivity to other sites, including a centralized Cisco Unified Communications Manager cluster located in the closest Cisco data center. There are currently 13 Cisco Unified Communications Manager clusters, supporting more than 70,000 IP phones (and 30,000 Cisco IP Communicators) in more than 300 locations worldwide. If WAN connectivity from the branch office to its call-processing cluster is lost, the branch office router (which supports the IOS-based SRST features) provides temporary and limited call handling until access to the Cisco Unified Communications Manager cluster is regained. The SRST router interconnects calls between all IP phones within that branch office. It connects all other calls, that is, all calls to or from the outside world, by sending them to the PSTN gateway in that branch office. All branch offices have a PSTN gateway, because only some of an office’s calls are “on-net”; that is, calls to other Cisco locations on the WAN. Most calls to and from the branch office are “off-net”; that is, calls to or from the rest of the world, and PSTN gateways are needed to connect those calls to the regular phone network.
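The call-handling rule during a WAN outage reduces to a single decision: calls between phones in the same office stay local, and everything else goes to the PSTN gateway. The sketch below illustrates that rule only; the function name, the return labels, and the sample extensions and numbers are hypothetical.

```python
from typing import AbstractSet


def route_in_fallback(dialed: str, local_extensions: AbstractSet[str]) -> str:
    """Decide where a router in fallback mode sends a call.

    Calls to phones registered in the same office are connected locally;
    all other calls are handed to the office's PSTN gateway.
    """
    if dialed in local_extensions:
        return "local"
    return "pstn_gateway"


branch_phones = {"5001", "5002", "5003"}          # hypothetical extensions
print(route_in_fallback("5002", branch_phones))    # prints "local"
print(route_in_fallback("14085550100", branch_phones))  # prints "pstn_gateway"
```

Note that in this simplified model a call to another Cisco site, normally on-net, also goes to the PSTN gateway, because the WAN path to that site is assumed to be unavailable.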
The OCC, in contrast, uses SRS Telephony in the same location as the Unified Communications Manager cluster, to give IP phones an added level of service to survive unplanned technical or security incidents. “We use SRS Telephony in a different way than originally intended—that is, not for survival at the edge, but for business continuity at the core,” says Reddy. “SRS Telephony is typically used to help a remote site that’s cut off from headquarters to continue to manage its own telephony needs. But in our case, it’s not the remote site that needs to survive; rather, it’s the core facility that needs to survive, even if everything surrounding it should fail.”
This innovative application of SRS Telephony has broader appeal. Numerous enterprise customers have visited the Cisco OCC facility to investigate how the solution might deliver continuous availability for their data operations centers or mission-critical call centers.
The SRS Telephony deployment at the OCC ensures business continuity through no fewer than four levels of backup (Figure 1). McBrearty likens the deployment to a “belt and suspenders” approach. “The redundant design features of the Unified Communications Manager cluster are the belt, holding up the operation,” he says. “Should Unified Communications Manager fail entirely, a Cisco router enabled with SRS Telephony takes over as the suspenders. Now, most data centers would be content with that level of availability, but the Cisco OCC wanted yet more assurance, so we set up a completely redundant system in another building.” Says Reddy, “The SRS Telephony architecture transparently supports the physical redundancy of our operation.”
Figure 1. Four Levels of Backup for the OCC IP Telephony System
All calls to or from the OCC, whether they travel internally over the Cisco LAN or WAN, or externally over the gateway-connected PSTN, come through one of two separate, duplexed gateway routers. These gateway routers are connected to two separate pairs of SRST gateway routers, and in each pair one SRST router is designated as primary, while the other is designated as secondary. Ordinarily, these SRST routers do nothing more than pass the IP telephony traffic to the appropriate IP phone endpoints, based on the call control information from the Cisco Unified Communications Manager supercluster located across the LAN or WAN. However, each SRST router is in continual contact with the Cisco Unified Communications Manager cluster, and if the SRST router can no longer reach the cluster (because of a failure in the LAN or WAN link to the cluster, or even a rare failure in the cluster itself), it turns on SRST features that enable it to provide limited call-processing support sufficient for the OCC. Calls within the OCC are then routed to the IP phones in the OCC by the primary SRST router itself, and calls into or out of the OCC are sent to the PSTN gateway to be handled (and carried) by the chosen telephony vendor. If the primary SRST router fails, all calls are routed through the secondary SRST router in the same building, which picks up the responsibility for handling all OCC calls.
Another level of redundancy is activated in the unlikely event that the Cisco Unified Communications Manager cluster and both the primary and secondary SRST routers are all unavailable at the same time; for example, in the event of a physical disaster on the headquarters campus. In that situation, calls automatically flow to the backup system in the redundant OCC location halfway around the world. One OCC is in San Jose, USA, while the other OCC is in Bangalore, India. If a manual switchover were necessary (in the case of a bomb threat, for instance), it would take just minutes.
This solution provides four levels of redundancy, since:
- If one set of gateway routers fails, calls are routed through the second set of gateway routers to the Cisco Unified Communications Manager cluster.
- If all connectivity to the Cisco Unified Communications Manager cluster fails, the primary SRST router takes over.
- If access to the primary SRST router fails, the secondary SRST router takes over.
- If access to the secondary SRST router fails, calls are directed to the backup OCC.
These four levels of redundancy for the OCC are in addition to the LAN and Cisco Unified Communications Manager cluster redundancy already in place for all other Cisco IP Phones.
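The four levels listed above form an ordered fallback chain, which can be sketched as a function that picks the first healthy handler. This is a simplified model rather than the actual OCC configuration: the component names are hypothetical, and it assumes the cluster is reachable whenever the cluster itself and at least one gateway path are up.

```python
from typing import Mapping


def select_call_handler(up: Mapping[str, bool]) -> str:
    """Return which system handles OCC calls, given component health."""
    # Level 1: either gateway path can carry calls to the cluster.
    gateway_path_up = up["gateway_a"] or up["gateway_b"]
    if gateway_path_up and up["cluster"]:
        return "cluster"
    # Level 2: the cluster is unreachable, so the primary SRST router takes over.
    if up["primary_srst"]:
        return "primary_srst"
    # Level 3: the primary SRST router is unavailable as well.
    if up["secondary_srst"]:
        return "secondary_srst"
    # Level 4: nothing at this site can take calls; use the backup OCC.
    return "backup_occ"


all_up = {"gateway_a": True, "gateway_b": True, "cluster": True,
          "primary_srst": True, "secondary_srst": True}

print(select_call_handler(all_up))                                  # prints "cluster"
print(select_call_handler({**all_up, "gateway_a": False}))          # prints "cluster"
print(select_call_handler({**all_up, "cluster": False}))            # prints "primary_srst"
print(select_call_handler({**all_up, "cluster": False,
                           "primary_srst": False}))                 # prints "secondary_srst"
```

Ordering the checks from the most capable handler to the least means each failure degrades service by only one step.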
Before deploying the solution, McBrearty’s group tested SRS Telephony in the OCC environment with 10 phone numbers. “Because this application of SRS Telephony is unique, it behaved somewhat differently than would be expected in a typical SRS Telephony environment used with central call processing,” he says. “Our engineer, Jeff McDowell, tested the solution rigorously and directed us in enhancements. We went through months of iterations to develop just the right set of features.” Some of these features have since become standard for all Cisco customers using SRS Telephony.
The OCC transitioned to the Cisco Unified Communications Manager and SRS Telephony architecture on July 10, 2003. Due to the criticality of the OCC phone numbers, Cisco service providers SBC/Pacific Bell and Sprint were present during the cutover and participated actively. Engineers from both carriers diverted two telephone numbers from their previous PSTN trunks, which led to the Unified Communications Manager supercluster, to separate trunks that feed the SRS Telephony-enabled routers. They then worked with McDowell’s team to extensively test the call flow (Figure 2).
Figure 2. Call Failover
On the night of the cutover, preparation for the transition was accomplished in about an hour. This included final routing changes in the Cisco Unified Communications Manager supercluster and last-minute changes in Dial Peer Statements. Cisco then asked the carriers’ engineers to move the phone numbers from the regular trunk groups to the new ones, and testing was done next. When testing was complete, trial failovers began. “Our service providers partnered with us to put the right people on the job—those with the skill sets to make certain calls were going where we expected,” says Reddy. “Though they had little to do during cutover, they stayed for more than seven hours to test the solution and be sure it worked. Great partnering.”
The OCC now has the confidence that its IP telephony network will continue to operate even if the general Cisco network should experience problems. “When all else fails, our telephony system must work,” says Reddy. “It’s a very positive statement about IP telephony in general and SRS Telephony in particular that our group made the transition.”
The OCC tests this backup solution monthly by manually failing the gateway connectivity to the Cisco Unified Communications Manager cluster. After an hour the gateway connection is re-established, and the SRST routers hand back all call processing to the cluster. The SRST solution continues to work, month after month, without anyone in the OCC noticing when the failover and the fail-back take place.
Numerous large enterprise customers have already expressed interest in this specialized application of SRS Telephony, primarily for their operations command centers and critical call centers. “For the companies that have approached us, telephony plays a major or supporting role in ensuring business continuity,” says Reddy.
“Fundamentally, IP phones are very stable,” Reddy continues. “Our deployment of SRS Telephony is designed to ensure availability even in a very extreme scenario. What’s remarkable is that we didn’t have to give up any features or capabilities of IP telephony to achieve extraordinary levels of availability. Rather, availability is a bonus.”
For companies considering the SRS Telephony solution for their data centers or call centers, McBrearty emphasizes the importance of teamwork among IT, the client, and third-party service providers. “When we executed the cutover, SBC/Pacific Bell and Sprint participated in every planning meeting. Our service providers understood what we were doing and its criticality to Cisco. Know your business needs, know your network, know your vendors, and have a relationship,” he says.
McBrearty also states that testing is essential. “Build out your solution first in a lab environment, on a small scale, and resist the temptation to test solutions in production,” he says. Cisco, for example, tested its implementation with just 10 phone numbers.
Based on the success of SRS Telephony for operations requiring highly redundant IP telephony systems, the OCC will become a Cisco Showcase site for SRS Telephony. In the long term, Reddy and McBrearty are investigating the possibility of extending the fully redundant OCC solution to support wireless phones and Cisco IP SoftPhone software access for OCC personnel. “I can envision a more mobile OCC function and more flexible disaster recovery,” says Reddy. “For example, we could potentially scale our operation to ‘follow the sun,’ with staff around the world monitoring the phones that answer the OCC phone number.”
“By combining the rich feature set of IP telephony with the redundancy of our SRS Telephony implementation, we’ve gained capabilities far beyond those possible with our old PBX switch.”