Cisco on Cisco
Cisco Unified Operations Manager Case Study: How Cisco Actively Manages Voice Availability and Quality
Cisco IT automates daily testing of all voice elements in the global network
Overall, voice quality on the Cisco global network is excellent, with 99.99 percent of calls meeting the company’s high quality standards. But even one call that is dropped or is difficult to hear is one too many.
Cisco IT strives to identify and remediate problems that affect voice quality before they affect productivity or the customer experience. However, the size and diversity of the global Cisco network make it difficult to identify the source of voice issues. At the end of 2009, Cisco IT managed more than 200,000 Cisco Unified IP phones, including 50,000 soft phones for employees, contractors, and extranet partners. The endpoints reside in approximately 500 buildings in 90 countries. A voice conversation or call to the Cisco Unity® Unified Messaging system might be processed by any of 15 Cisco Unified Communications Manager clusters around the world. Along the way, the call will travel through multiple Cisco Catalyst® switches and Cisco Integrated Services Routers that provide the Quality of Service (QoS), security, and multicast services needed for unified communications. Any of these devices might be the source of a problem.
Traditionally, Cisco IT had used custom scripts to test the network, and then reviewed call detail information to look for anomalies. But the process was so labor-intensive that Cisco IT only used the scripts reactively, which delayed detection of voice issues and identification of the causes. “To continue providing excellent voice quality without adding staff, we needed a better way to identify and solve voice issues,” says Jon Heaton, unified communications architect, Cisco. “We wanted to be proactive rather than waiting to react to user complaints.”
In 2007, Cisco IT began proactively monitoring and troubleshooting the voice environment using the Cisco Unified Communications Management Suite. Cisco currently uses two software tools in the suite, Cisco Unified Operations Manager and Cisco Unified Service Monitor.
Deployed in six global Cisco IT data centers, Cisco Unified Operations Manager provides service-level monitoring of all system elements (Figure 1). Cisco IT uses the software to generate synthetic calls from nearly 200 virtual Cisco Unified IP Phones distributed across all geographic theaters (Figure 2). These synthetic calls confirm that the endpoint responds appropriately.
Figure 1. Synthetic Calls Accurately Measure End-User Experience
Each virtual phone initiates a synthetic call every 1-3 minutes, for a total of approximately 200,000 test calls each day. The phones call physical IP phones, voicemail ports, interactive voice response (IVR) ports, Session Border Controllers (SBCs) used for Cisco WebEx™ sessions, and Cisco Unified MeetingPlace® gateways. The virtual phones place some calls within the same Cisco Unified Communications Manager cluster, reflecting the experience of employees making local calls, and place other calls to phones in different geographic theaters. Cisco Unified Operations Manager also uses synthetic calls to verify that phone registration and dial tone are available from each Cisco Unified Communications Manager subscriber. “The load that the test calls place on the system is almost undetectable,” says Heaton.
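The daily call volume can be sanity-checked with simple arithmetic. The following is a minimal sketch, assuming exactly 200 virtual phones (the case study says "nearly 200") and a fixed interval per phone; the function name is hypothetical and is not part of any Cisco tool.

```python
MINUTES_PER_DAY = 24 * 60


def daily_test_calls(virtual_phones: int, interval_minutes: float) -> int:
    """Calls per day if every phone places one call per interval."""
    return round(virtual_phones * MINUTES_PER_DAY / interval_minutes)


# Assumed bounds: one call per minute versus one call every three minutes.
upper_bound = daily_test_calls(200, 1)
lower_bound = daily_test_calls(200, 3)
print(lower_bound, upper_bound)
```

Under these assumptions the daily total falls between 96,000 and 288,000 calls, so the reported figure of approximately 200,000 is consistent with an interval that averages somewhat under two minutes.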
Figure 2. Cisco IT Uses Cisco Unified Operations Manager Monitoring Dashboard to Set Alert Thresholds
Cisco IT uses Cisco Unified Operations Manager to set alert thresholds (Figure 2). “We are not concerned with occasional individual call failures, which can result from a software malfunction or momentary outage on a network segment,” says David Neustedter, unified communications design engineer, Cisco. “Therefore, we configure the software to alert us to a certain number of sequential failures, or to a larger number of sporadic failures within a specified time. We would not receive that information by monitoring user complaints.”
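The alerting rule Neustedter describes (a run of sequential failures, or a larger number of sporadic failures within a time window) can be sketched as follows. This is an illustration only, not the Cisco Unified Operations Manager implementation; the class name, parameter names, and default thresholds are all assumptions.

```python
from collections import deque


class FailureAlerter:
    """Hypothetical alert rule: N sequential failures, or M failures in a window."""

    def __init__(self, max_sequential=3, max_in_window=5, window_seconds=600):
        self.max_sequential = max_sequential    # assumed threshold
        self.max_in_window = max_in_window      # assumed threshold
        self.window_seconds = window_seconds    # assumed window length
        self._streak = 0
        self._recent_failures = deque()         # timestamps of recent failures

    def record(self, timestamp: float, succeeded: bool) -> bool:
        """Record one synthetic call result; return True if an alert should fire."""
        # Drop failures that have aged out of the sliding window.
        while (self._recent_failures
               and timestamp - self._recent_failures[0] >= self.window_seconds):
            self._recent_failures.popleft()

        if succeeded:
            self._streak = 0
            return False

        self._streak += 1
        self._recent_failures.append(timestamp)
        return (self._streak >= self.max_sequential
                or len(self._recent_failures) >= self.max_in_window)
```

A single failed call followed by a success returns `False` both times, which matches the stated intent of ignoring occasional individual failures.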
The synthetic calls measure whether a connection is made, not whether the parties can hear and understand each other. To measure voice quality, Cisco IT uses the jitter and packet loss metrics that Cisco Unified IP Phones record at the end of each phone call. The phones use the Cisco Voice Transmission Quality (VTQ) algorithm to calculate a single Mean Opinion Score (MOS), which becomes part of the call management record (CMR). The CMR is embedded in the call detail record (CDR) logs that Cisco Unified Communications Manager transmits to Cisco Unified Service Monitor (Figure 3).
Figure 3. Call Management Records with Quality Information are Sent to Cisco Unified Service Monitor
Unlike Cisco Unified IP Phones, Cisco Unity Unified Messaging and Cisco Unified MeetingPlace do not report Cisco VTQ quality information from their side of the connection. Therefore, Cisco IT deployed Cisco 1040 sensors in front of these servers to collect jitter and packet loss information (Figure 4). The sensors transmit the information to Cisco Unified Service Monitor for conversion into a MOS value that appears in reports.
Figure 4. Cisco 1040 Sensor Estimates Voice Quality Score for Unity and MeetingPlace Servers
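To give a sense of how network measurements can be reduced to a single MOS value, the sketch below uses the R-factor to MOS mapping from the ITU-T G.107 E-model. This is a stand-in for illustration: it is not the Cisco VTQ algorithm and not the conversion that Cisco Unified Service Monitor performs, neither of which is described in this case study. Deriving the R-factor from jitter and packet loss is outside the scope of the sketch.

```python
def mos_from_r_factor(r: float) -> float:
    """Map an E-model R-factor to an estimated MOS (ITU-T G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)


# Hypothetical R-factors: a clean call and a noticeably impaired one.
for r in (93.2, 70.0):
    print(r, round(mos_from_r_factor(r), 2))
```

A higher R-factor yields a higher MOS, and the mapping is capped between 1.0 and 4.5.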
Cisco employees initiate three million calls daily. If only one or two percent of calls have quality issues, identifying the causes can be difficult or impossible using CDRs alone. Cisco Unified Service Monitor simplifies issue detection by comparing voice quality across clusters. “Most of our clusters process more than 99 percent of calls with excellent voice quality, so Cisco Unified Service Monitor flags clusters with quality of 96 or 97 percent and lower,” says Neustedter.
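The cross-cluster comparison can be sketched as a simple threshold check. This is a minimal illustration, assuming per-cluster counts of good-quality and total calls are already available; the function name, the cluster names, and the sample counts are hypothetical, and the 97 percent threshold comes from the quote above.

```python
def flag_clusters(call_counts, threshold_percent=97.0):
    """Return {cluster: percent_good} for clusters at or below the threshold.

    call_counts maps a cluster name to (good_quality_calls, total_calls).
    """
    flagged = {}
    for cluster, (good, total) in call_counts.items():
        if total == 0:
            continue  # no calls, nothing to compare
        percent_good = 100.0 * good / total
        if percent_good <= threshold_percent:
            flagged[cluster] = percent_good
    return flagged


# Hypothetical sample data for three clusters.
sample = {
    "cluster-a": (9950, 10000),
    "cluster-b": (9600, 10000),
    "cluster-c": (0, 0),
}
print(flag_clusters(sample))
```

Comparing each cluster against a common threshold surfaces an outlier even when the overall, network-wide percentage still looks healthy.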
A small global operations team meets twice weekly to review the data from Cisco Unified Communications Management Suite, discuss findings and trends, and decide on action. “In the current economic climate, we have had to find ways to do more with less,” says Neustedter. “The Cisco Unified Communications Management Suite makes it possible for just two people to manage reporting for the global Cisco enterprise.”
Cisco Unified Communications Management Suite has improved voice service availability at Cisco in several ways, including detecting intermittent outages, identifying a misconfigured dial plan, and revealing a media translation issue.
Solved: Intermittent Outage Affecting Voicemail
Detecting intermittent outages is a traditional challenge for IT organizations, because recreating the problem can be difficult, and users tend not to complain if they get through on the next attempt. Now Cisco IT can often discover problems even before users report them by conducting synthetic tests with Cisco Unified Operations Manager.
As an example, one report showed that a portion of calls attempting to connect to Cisco Unity Voicemail were failing every night at about the same time, for 30 minutes to an hour. Then the problem went away all by itself, until the next night. Cisco IT never heard about the outages, because they occurred late at night, and the few users who were affected probably just hung up and tried again.
Based on the time range of the failed calls, Cisco IT realized that the failures might be related to the nightly synchronization of the Message Waiting Indicator on the Cisco Unity server. With this information, Cisco IT deduced that the problem resulted from the way the dial string was implemented. A simple error had caused heap errors on the Cisco Unity memory stack that took several hours to clear out. “Synthetic testing enabled us to fix an availability issue before we ever heard about it from users,” says Heaton.
Solved: Another Intermittent Outage Affecting Voicemail
For more than a year, the Cisco Global Technical Response Center (GTRC) had received occasional calls from Cisco users reporting that the voicemail system would not recognize their key entries or just hung up. GTRC could not verify the problem because it was intermittent.
Cisco IT diagnosed the problem by using Cisco Unified Service Monitor to measure voice bearer quality and call disconnects. “When we looked at voice quality on the lines going into the affected Cisco Unity servers, we discovered that outbound quality was perfect, but inbound quality was poor enough to make speech unintelligible,” Neustedter says. “After investigating, we discovered that the devices on the subnet had not been properly configured for QoS.” Under normal network loads, voice quality was fine. But during data center backups, when load was high, the network devices were dropping packets. As a result, for about five percent of calls, the Cisco Unity server received enough data to keep the port open, but not enough to recognize dual-tone multifrequency (DTMF) tones and voice.
Without Cisco Unified Service Monitor, Cisco IT might never have diagnosed the problem, because 95 percent of calls to the voicemail system were fine. The diagnosis enabled Cisco IT to eliminate an inconvenience that was costing time for a small portion of users every day.
Solved: Misconfigured Dial Plan
When more than the usual percent of calls were dropping, Cisco IT used synthetic testing to identify the source of the error. Calls from point A to point B completed at the usual rate, but calls from point A to point C did not. “In this situation, the dial plan is usually the culprit,” says Heaton. “After looking at call records to see where the problem calls were handed off, we found the suspect entry in the call routing table and corrected it.”
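The route comparison described above can be sketched by grouping synthetic test results by source and destination and looking for routes that complete well below the usual rate. This is an illustration under assumed data shapes; the function names, the tuple layout, and the tolerance value are hypothetical.

```python
from collections import defaultdict


def completion_rate_by_route(results):
    """results: iterable of (source, destination, completed) tuples."""
    totals = defaultdict(int)
    completed = defaultdict(int)
    for source, destination, ok in results:
        route = (source, destination)
        totals[route] += 1
        if ok:
            completed[route] += 1
    return {route: completed[route] / totals[route] for route in totals}


def routes_below(rates, usual_rate, tolerance=0.05):
    """Routes whose completion rate is more than `tolerance` below usual."""
    return sorted(route for route, rate in rates.items()
                  if rate < usual_rate - tolerance)


# Hypothetical results: A->B completes normally, A->C does not.
results = ([("A", "B", True)] * 99 + [("A", "B", False)]
           + [("A", "C", True)] * 60 + [("A", "C", False)] * 40)
rates = completion_rate_by_route(results)
print(routes_below(rates, usual_rate=0.99))
```

Isolating the failing source and destination pair narrows the search to the routing entries that only that pair traverses.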
Solved: Media Translation Issue
A small portion of employees who use Cisco IP Communicator software on their laptops, including employees in sites without broadband access, use a low-bandwidth codec. Cisco IT discovered that calls using this codec in one region were not working properly. Using synthetic testing, Cisco IT discovered that the Cisco Unified Communications Manager cluster had been misconfigured, preventing calls from receiving transcoding resources. “Fewer than five percent of employees use the low-bandwidth codec, so GTRC had been advising callers to use a different codec,” Heaton says. “By identifying the root cause and fixing it, Cisco IT gave employees more flexibility to work the way they want.”
When the Cisco Technical Assistance Center (TAC) reported that callers were complaining that they would hear a couple of clicks and then a dial tone, Cisco IT suspected that the issue lay in the equipment of the vendor that TAC used to manage call routing. The vendor, however, was convinced that the carrier was responsible. Cisco IT used Cisco Unified Service Monitor to track call quality and stream quality and pull CDRs in real time. “Using Cisco Unified Operations Manager and Cisco Unified Service Monitor, we proved empirically that calls were completing end to end,” says Neustedter. The report persuaded the vendor to examine its equipment, which led to the discovery of a misconfigured dial stream on the T1 blade that received inbound calls.
Cisco IT accelerates troubleshooting by using the service quality thresholding and reporting features in Cisco Unified Service Monitor, and the fault correlation and diagnostic capabilities in Cisco Unified Operations Manager. “We now have a holistic and systemic view of the unified communications environment, which enables us to detect and fix minor problems before they become major,” says Neustedter. “For example, we now become aware of problems with Cisco Unity Voicemail in 5 minutes, instead of the 30 minutes it typically takes for someone to report the issue to GTRC and the GTRC to contact Cisco IT.”
Typically, more than 99.9 percent of calls across the Cisco network complete with high voice quality. When Cisco Unified Service Monitor reports showed that just 96 percent of calls had acceptable quality in Research Triangle Park, North Carolina, Cisco IT used Cisco Unified Service Monitor to identify the places in the network where the problem was occurring. “We identified 10 subnets that had a problem, and after examining device configuration, we determined that QoS had either been misapplied or not applied at all to certain devices,” says Neustedter. “The problem was limited to one floor, which meant that only a few percent of calls were affected. Users wouldn’t necessarily notice the problem unless they were listening to voicemail.”
Neustedter continues, “This underscores the challenge of managing 600,000 ports and 40,000 switches. Some 2000 switches were under management, and only 10 had a problem. We would never have noticed the problem without doing comparisons.”
As the Cisco voice operations teams become comfortable using the Cisco Unified Communications Management Suite instead of their custom scripts, Cisco IT plans to use additional features and components:
- Alerting features: Currently, Cisco Remote Operations Service monitors the network equipment and notifies Cisco IT if a device or line card becomes unavailable.
- Cisco Unified Service Statistics Manager: This software compiles data from Cisco Unified Service Monitor and Cisco Unified Operations Manager to reveal long-term trends. The team will use this tool to plan capacity upgrades for gateways and routers. “Until now, our infrastructure management center used internally developed tools that took longer to use, which meant they had to focus on the most critical service events instead of all of them,” says Heaton.
- Cisco Unified Provisioning Manager 2.0: Cisco IT plans to integrate this software with global IT systems for unified communications provisioning and service activation, replacing and augmenting internally developed tools. The expected benefit is reduced resource requirements.
Cisco IT advises other large enterprises that adopt the Cisco Unified Communications Management Suite to realize that it represents a new operations model. Data collection is one step toward improving processes, not an end in itself. “Some people regard the metrics as something they have to collect, as opposed to something that helps them do their jobs more effectively,” says Heaton. “Realize that it will take time for teams to accept that they now have the tools to solve all voice quality problems, not just the problems that affect a large percentage of calls. Every call is important.”