Cisco on Cisco
Cisco Remote Operations Services Case Study: How Cisco IT Outsourced Network Management Operations
Outsourcing contextual activities to a trusted provider allows IT staff to focus on strategic projects.
In 2004, Cisco® acquired a services organization now known as Cisco ROS. Today, the organization provides global clients with the following remote management services:
- Foundation technology: routing and switching infrastructure
- Core routing: Cisco Carrier Routing Systems (CRS), optical, and others
- Cisco Unified Communications
- Network security
- Cisco TelePresence
The services are delivered by a global team of engineers with CCIE® certification and other experienced engineers who use accepted industry practices, including IT Infrastructure Library (ITIL)-based processes. Cisco IT became a customer of Cisco ROS in 2005. By outsourcing daily management tasks to Cisco ROS, Cisco IT receives industry-leading service, maintains control of its network, and frees its IT staff to focus on strategic activities.
Monitoring and managing Cisco LANs and the Cisco global WAN backbone, which comprises more than 10,000 devices, is an unrelenting and resource-intensive discipline. To gain the most value from IT resources, Cisco IT distinguishes between core and contextual activities. "Core activities involve strategic IT programs and new technology, while contextual activities are repeatable, consistent day-to-day tasks," says Kevin Buchan, Cisco IT program manager. Examples of core activities for Cisco IT include the architecture and design for newer programs such as next-generation wireless technologies, network admission control, and security. Examples of contextual activities are responding to outages on voice circuits that connect field offices, installing operating system patches, and supporting the company's more than 3000 legacy access points. (Newer Cisco Aironet® Wireless Access Points are centrally managed.)
Until 2003, Cisco IT managed contextual as well as core activities. In Asia Pacific, for example, IT staff took turns being on call 24 hours a day for a full week, once every six to eight weeks. When it was their turn, staff members would be paged in the middle of the night to respond to events such as voice circuit outages, which can occur more than 100 times nightly in certain countries. Each time, the engineer would have to get up and log a ticket with the carrier, whose support personnel might not speak the same language as the local support person. "The next morning, the engineers were expected to be productive at their 'real' job, which was to deploy new technology intended to increase operational efficiency or create a competitive advantage," says Arthur Rosling, senior IT engineer for Asia Pacific voice operations.
Not only did nighttime pager duty diminish job satisfaction and productivity, it complicated event tracking. When engineers fixed a problem such as a failed power supply in the middle of the night, they did not always thoroughly document changes or check them for quality after execution. "Changes that occur outside of normal business hours sometimes do not manifest as problems until the following morning," says Mike Flannagan, senior manager for Cisco ROS. "Our customer, Cisco IT, wanted better change management."
Cisco IT decided to outsource network monitoring and management activities for the Cisco global LANs and global WAN. The service organization would need to offer the following:
- Service-level objectives (SLOs) for availability, which is the top priority for Cisco IT, as well as long-term fix resolution. In addition to tracking what went wrong, Cisco IT wanted to know why it went wrong and what actions were needed to make sure that it did not happen again.
- An SLO that defined the operational agreement between Cisco IT and the service organization.
- 24-hour monitoring and response.
- SLOs for mean time to notify, mean time to isolate, and mean time to resolve incidents.
- Ability and willingness to collaborate with the Cisco IT Tier 3 organization, which would provide knowledge transfer about new network technologies after managing them internally, usually for six months or longer.
- Support for all Cisco technologies. Rather than trying to manage relationships with different outsourced partners for the WAN, routers, firewalls, and other technologies, Cisco wanted the simplicity of a single, trusted relationship.
- A commitment to customer satisfaction.
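The time-based SLOs in the list above (mean time to notify, isolate, and resolve) can be made concrete with a small calculation. The following is a minimal sketch, assuming each incident records four timestamps and that every metric is measured from the moment of detection. The field names and sample incidents are hypothetical and are not taken from Cisco ROS tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical timestamps recorded for each incident.
    detected: datetime
    notified: datetime
    isolated: datetime
    resolved: datetime


def mean_minutes(incidents, end_field):
    """Average minutes from detection to the given milestone."""
    return mean(
        (getattr(i, end_field) - i.detected).total_seconds() / 60
        for i in incidents
    )


def slo_report(incidents):
    """Compute the three time-based metrics for a set of incidents."""
    return {
        "mean_time_to_notify": mean_minutes(incidents, "notified"),
        "mean_time_to_isolate": mean_minutes(incidents, "isolated"),
        "mean_time_to_resolve": mean_minutes(incidents, "resolved"),
    }


# Two made-up incidents, for illustration only.
t0 = datetime(2007, 1, 1, 2, 0)
sample = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=20), t0 + timedelta(minutes=60)),
    Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=40), t0 + timedelta(minutes=120)),
]
report = slo_report(sample)
```

Comparing each value in `report` against an agreed target would show whether an SLO was met for the period.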
Cisco IT decided to contract with Cisco ROS, a service organization that Cisco acquired in 2004. Within the Cisco lifecycle strategy for delivering services (Prepare, Plan, Design, Implement, Operate, and Optimize, or PPDIOO), Cisco ROS takes responsibility for selected aspects of the operate phase (Figure 1). "The operate phase is the core business of Cisco ROS," says Carlos Castano, manager of Cisco ROS. "By taking responsibility for day-to-day operations activities, Cisco ROS frees Cisco IT to focus on more strategic activities, such as planning, designing, and implementing new technologies."
Figure 1. Cisco ROS Assumes Responsibility for the Operate Phase of the Cisco Lifecycle Approach
Cisco IT and Cisco ROS complement each other, each applying its core competencies to the Cisco network. "The core competency for Cisco ROS is running a 24-hour network operations center on a Cisco infrastructure," says Rich West, senior network engineer for network operations. "This frees up Cisco IT staff to focus on their core competency, which is developing and supporting new technologies for Cisco."
In addition to managing day-to-day activities for the Cisco global network, Cisco ROS also resolves outages for all equipment that it monitors. For most of its clients, Cisco ROS is completely responsible for handling all outages. For Cisco, it owns responsibility for all outages of priority two or lower, and assists with priority-one outages, although Cisco IT owns the case. "We are an extension of the Cisco IT group," says James Jones, customer support engineer for Cisco ROS. "We take recommendations and give recommendations. By taking full responsibility for the activities that are our core competencies (identifying malfunctioning devices, fixing circuits, optimizing voice performance, and more), we free up Cisco IT to focus on their strategic enhancements that will create business benefits for the company."
Duncan Mennie, EMEA network operations lead for Cisco, helped to develop the processes for transitional support. The topic of the first planning meeting, he says, was identifying the global IT tasks to outsource. "We considered which tasks caused us problems, which would save us the most time, and which we were comfortable outsourcing," says Mennie. The team wanted to make a decision that would be applied globally. Following are the factors that the team considered when deciding which activities to outsource to Cisco ROS:
At the outset of the relationship, Cisco retained core activities while outsourcing contextual activities. "If an activity is measurable, repeatable, and consistent, and we can come up with a test plan to verify the deployment, then that activity is a good candidate to outsource," says Rosling. An example is patching the Microsoft Windows servers used for Cisco Unified CallManager, almost 50 servers in Asia Pacific alone. If Cisco determines that a new patch is necessary, Cisco ROS can install the patches on all global servers quickly, and as often as necessary.
When does a core activity become contextual? Each time Cisco adopts a new network technology, Cisco IT manages that technology until it can establish best practices, ordinarily within 6 to 12 months. "During that time, we engage Cisco ROS for limited activities, such as monitoring, so that they can begin becoming familiar with the technology in our environment," says Craig Williams, director for Cisco Global Network Operations. "Later we outsource more responsibility, such as configuration, and eventually we outsource all monitoring and management responsibility."
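Rosling's rule of thumb can be expressed as a simple check. This is an illustrative sketch only: the `Activity` type, its field names, and the two sample activities are assumptions made for the example and do not describe a tool that Cisco IT uses.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    # Hypothetical attributes mirroring the four criteria in the quote.
    name: str
    measurable: bool
    repeatable: bool
    consistent: bool
    has_test_plan: bool


def is_outsourcing_candidate(activity):
    """True only when the activity meets all four criteria."""
    return all([
        activity.measurable,
        activity.repeatable,
        activity.consistent,
        activity.has_test_plan,
    ])


# Sample activities with made-up attribute values.
patching = Activity("server patching", True, True, True, True)
new_design = Activity("new technology design", False, False, False, False)

candidates = [a.name for a in (patching, new_design) if is_outsourcing_candidate(a)]
```

Under these assumed attributes, only the patching activity qualifies, which matches the example given in the text.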
Currently Cisco IT is transferring knowledge to Cisco ROS about the newest version of Cisco Unified CallManager and the Cisco TelePresence solution. "For most of our customers, Cisco ROS manages new technologies so that customers can adopt them even if they do not have the internal expertise," says Castano. "Cisco IT is a highly sophisticated customer that does have the internal expertise needed to adopt new technologies. Unlike many Cisco ROS customers, they prefer to manage new technologies for several months so that their engineers can keep their knowledge and skills current. We tailored an agreement with Cisco IT, as we do with all our customers, to meet their unique business needs."
Cisco outsources noncritical (Tier 2) activities and some mission-critical (Tier 3) activities. The Tier-3 component that Cisco ROS manages is the Cisco global WAN backbone, which links more than 34,000 employees in more than 25 global sites. Availability is nearly 99.999 percent. Cisco IT has chosen to manage noncritical activities that occur in the Cisco data center environment. "All tasks that occur in our data center are considered core, and although Cisco ROS has the capability to manage them, and does for its other customers, we made the decision to retain control," says Mennie.
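To put the availability figure in perspective, the arithmetic below converts an availability percentage into the downtime it permits per year. This is a minimal sketch, assuming a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


five_nines = allowed_downtime_minutes(99.999)
three_nines = allowed_downtime_minutes(99.9)
```

At 99.999 percent, the budget works out to roughly 5.3 minutes of downtime per year, compared with about 525.6 minutes at 99.9 percent.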
When Cisco IT adopts a new technology, management of the technology is always considered core. After Cisco IT engineers have learned about the technology in depth and documented best practices, the technology becomes contextual. At that point, Cisco IT considers outsourcing it to Cisco ROS. "When I joined Cisco, the backbone was considered core, and now it is contextual," says Williams. Cisco IT has outsourced some aspects of IP telephony to Cisco ROS in Asia Pacific, Europe, and emerging markets and is now considering expansion into North and South America.
Cisco ROS today manages all or some of Cisco's three infrastructure components:
- Foundation technologies, including routers, switches, and wireless access points. Currently, Cisco ROS provides most services for foundation technologies, which Cisco IT considers mature.
- Security, including firewalls, intrusion detection systems, and intrusion prevention systems.
- Unified Communications management, such as patching Cisco Unified CallManager servers and monitoring circuits.
Cisco IT transferred responsibility to Cisco ROS in phases:
- North and South America: WAN/LAN management, regional offices
- Asia and Pacific Rim: WAN/LAN management, regional offices, and high-priority sites, including India
- Europe and emerging markets: WAN/LAN management, regional offices, and high-priority sites
- Cisco global backbone, which ties together all regional WANs
Before transferring responsibilities for network monitoring, Cisco IT performed a thorough assessment of the Cisco ROS organization. Members of the Cisco IT team met with the Cisco ROS staff, reviewed their management tools, talked to the tool developers to determine how the tools compared to those that Cisco IT uses internally, and observed operational practices. Before the meeting, Cisco built a virtual office in a Cisco location and added it to the Cisco ROS management network. "For six hours we deliberately caused network problems and observed the techniques that Cisco ROS used for troubleshooting and resolution," says West. "After we satisfied ourselves that they had the necessary tools and expertise, we had the confidence to begin outsourcing. We now consider Cisco ROS to be a partner rather than an outsourcing company."
The Americas geographic region had not previously outsourced any IT activities. Therefore, to become comfortable, Cisco IT decided to first transition operational responsibility for lower-priority IT environments, such as regional sales offices, and then medium-priority sites. As Cisco ROS successfully handled each new activity, the momentum of the cutover increased. "At the beginning, our attitude was 'Prove yourself,'" says Williams. "Later it became, 'Can you take on this other task, as well?'"
In the Asia Pacific geographic region, unlike the Americas, Cisco was already accustomed to outsourcing network management. Cisco IT simply pointed its devices to Cisco ROS and transitioned one regional center at a time, over six months. "The major issues that we had to resolve were who Cisco ROS should contact in our organization for each type of event, and how they should provide the information," says Rosling.
Cisco ROS provides consistent global support while accommodating regional differences. "In India, for example, it is part of the culture to call local support instead of global support, so Cisco ROS is establishing a local help desk," says Wai Min Higa, director of global service delivery. "Their flexibility is very important for a global enterprise like Cisco."
Table 1 shows the activities that Cisco outsources to Cisco ROS.
Table 1. Cisco ROS Responsibilities for Cisco
|Scope|Americas|Europe and Emerging Markets|Asia Pacific and Japan|
|WAN management, including global backbone|x|x|x|
|Wireless LAN management|x|x|x|
|Uninterruptible power supply (UPS) monitoring|x|x|x|
|IP telephony management (voice circuits and change management for Cisco Unified CallManager software)|Planned|x|x|
|IP telephony full operational management|Planned|Planned|Under way in India for partners; planned elsewhere|
|Security management|Pilot under way|Under discussion|Under discussion|
|Cisco TelePresence management|x|x|x|
Cisco ROS follows the best practices defined in the IT Infrastructure Library (ITIL), an industry-standard approach to running an IT organization. ITIL includes processes for service desk, incident management, problem management, change management, release management, configuration management, service level management, service readiness and continuous improvement, and transition management. Cisco takes advantage of all these services in varying degrees. For example, if a network device fails, the following processes might apply, in order:
- Incident management to determine if the cause and solution are known.
- Problem management to discover the root cause of the problem, if it is not known
- Change management to get required approvals, for example, if the problem is an incorrect configuration that must be changed
- Release management to make the change
- Configuration management to update the internal database with the change
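The sequence above can be sketched as a simple ordered pipeline. This is an illustrative sketch only, assuming one flag for whether the cause is already known and another for whether a change is required. The function and step names are hypothetical and do not represent Cisco ROS tooling.

```python
def processes_for_device_failure(cause_known, change_required):
    """Return the ITIL processes that apply, in order, to a device failure."""
    # Incident management always runs first to check for a known cause and fix.
    steps = ["incident management"]
    if not cause_known:
        # Root cause is unknown, so problem management investigates it.
        steps.append("problem management")
    if change_required:
        # A change needs approval, execution, and a record of the new state.
        steps += [
            "change management",
            "release management",
            "configuration management",
        ]
    return steps


full_path = processes_for_device_failure(cause_known=False, change_required=True)
short_path = processes_for_device_failure(cause_known=True, change_required=False)
```

In this sketch, `full_path` walks through all five processes, while `short_path` stops at incident management because the cause is known and no change is needed.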
At the outset of the relationship, Cisco IT became familiar with the ITIL processes that Cisco ROS uses and got to know the team. "An IT partnership like ours requires trust," says Williams. "We began using Cisco ROS for Tier 2 network monitoring, initially just outsourcing a few tasks. As trust grew, we gave them more responsibilities."
Today Cisco IT regards Cisco ROS as a trusted partner rather than as simply an outsourcer. "We still own the function that Cisco ROS performs, which is different than a typical relationship with an outsourcer," says West. "Cisco ROS performs operational work, such as responding to outages, which frees our personnel to focus on strategic activities such as optimizing the network to make it more resilient."
A Cisco IT representative and Cisco ROS meet frequently to review key performance indicators, including metrics and SLOs. "Cisco ROS has become a strategic partner because of their credentials, people, expertise, and management metrics," says Williams. "When I approach Cisco ROS with a new IT environment that we would like for them to manage, our account manager tells me the cost, and then I can determine whether it makes sense to outsource that function. It does not get much easier."
As Cisco introduces new technologies, Cisco IT is generally the first customer to deploy them, which gives Cisco ROS early experience working with the technology. For example, Cisco ROS is working closely with Cisco IT and the Cisco TelePresence business unit on the internal global deployment of Cisco TelePresence. From the first day of the deployment, Cisco ROS assumed selected responsibilities for monitoring and management, following best practices that Cisco IT and Cisco ROS developed jointly. Early experience with Cisco TelePresence has prepared Cisco ROS to provide skilled monitoring and management for other customers that adopt the technology.
Cisco and Cisco ROS agreed on escalation practices based on the event priority. Cisco ROS notifies Cisco IT about low-priority events through e-mail. When medium- or high-priority events occur, Cisco ROS works to resolve them for a defined number of hours and then, if the problem is not resolved, pages Cisco IT. "We have clearly defined triggers based on event type, event priority, and how long Cisco ROS will work on the problem without resolution before paging us," says West. The escalation agreement is global and can be modified based on what Cisco IT needs.
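The escalation rules described here can be sketched as a small decision function. This is a minimal sketch under stated assumptions: the time thresholds are invented for illustration (the source says only "a defined number of hours"), event type is omitted for brevity, and the names are hypothetical.

```python
from datetime import timedelta

# Hypothetical thresholds: how long the provider works an event before paging.
PAGE_AFTER = {
    "medium": timedelta(hours=4),
    "high": timedelta(hours=1),
}


def escalation_action(priority, time_unresolved):
    """Decide how to escalate an event given its priority and age."""
    if priority == "low":
        # Low-priority events are reported by e-mail only.
        return "email"
    if time_unresolved >= PAGE_AFTER[priority]:
        # The agreed working window has elapsed without resolution.
        return "page"
    return "continue working"
```

For example, `escalation_action("high", timedelta(minutes=30))` returns `"continue working"`, and the same event returns `"page"` once a full hour has passed. Keeping the thresholds in one table reflects the point that the agreement is global and can be modified as needs change.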
"By outsourcing contextual activities to a trusted provider, we gain more time for core activities that will create operational efficiencies or a competitive advantage," says Buchan. Williams says, "The Cisco IT organization functions most effectively if we assign our talented resources to our core activities. For Cisco ROS, in contrast, Tier 2 network monitoring is a core activity. It is what they do best."
Relieved of 24-hour pager duty, Cisco engineers have more time and energy to devote to strategic new applications, such as Network Admission Control (NAC) and other advanced security technologies. "Our IT resources regard strategic outsourcing of contextual activities as a boost to their careers," says Higa. "This avoids the frustration that comes from doing contextual work for many years."
"People whose core competency is to respond to network problems tend to do a better, more consistent job," says Rosling. "Cisco ROS has a 24-hour NOC as well as a translation service so that their engineers can talk to service providers that do not have English-language support."
Cisco IT emphasizes that it did not eliminate jobs as a result of outsourcing. In 2007, Cisco IT is managing 20 strategic IT programs, such as availability metrics and strategy for the IP Next-Generation Network (IP NGN). Each requires an enabling and operating function. "By outsourcing contextual functions, Cisco IT has more resources to devote to these programs," says Williams. Rosling says, "Our IT professionals would prefer to work on new technologies rather than tasks like patching, and we have enough work to keep them busy indefinitely."
Global monitoring of the Cisco network enables Cisco ROS to identify potential network problems more quickly than when regional Cisco IT organizations monitored their own portions of the network. "Our global monitoring personnel talk weekly," says Quincy Hopkins, customer support engineer in Cisco ROS. "Therefore, if we notice a bug that occurs with a particular router hardware and software combination in Europe, we can take action in other geographic regions before the problem even surfaces." An example is a recurring problem that Cisco experienced with its wireless access points. Cisco ROS determined that one user's faulty network interface card was generating enough bad traffic to make the wireless access point reset, which caused the network interface card to associate with another access point and reset that one, as well. The remedy is to identify the user and request that the network interface card be turned off. "Global monitoring enables us to see rare problems more frequently than we would if we monitored each region separately, which accelerates problem identification and resolution," says Hopkins.
Cisco IT is also outsourcing tasks that previously were not getting done. For example, Cisco ROS raises cases when the public switched telephone network (PSTN) circuits go down and when server hard drives fail. In the past, Cisco IT was not necessarily aware of these events if they did not interrupt the business. "Outsourcing enables us to follow best practices for management and maintenance without hiring new staff," says Rosling.
Cisco ROS handles hundreds of alerts each night in Asia Pacific alone, which means that Cisco IT staff can sleep better. "The improvement in quality of life for me and my team is invaluable," says Rosling. "With a good night's sleep, we perform better during the day." West says that nighttime events that require his attention have decreased by 75 percent.
Previously, each of Cisco's three geographic regions followed different processes for incident response. Cisco ROS employs consistent global processes, which facilitates scalability of the network support operation. If an operating system vendor issues a patch, for example, Cisco ROS installs it quickly, working outside of Cisco's business hours in each geographic region. Cisco ROS also performs the patching for Cisco Unified CallManager, with an SLO for completion time. As Cisco grows, Cisco IT does not need to increase staff to handle operational tasks such as patching.
"Cisco ROS designed its organization for one purpose: to provide remote network monitoring and management," says Mennie. "They can scale far better than we can on a global basis. For Cisco IT to respond to network outages as quickly as Cisco ROS does-day or night-would require very large support teams and high costs."
"The relationship has been mutually beneficial for Cisco ROS and Cisco IT," says Jones. "When we first began working with Cisco IT, we managed a smattering of services-different ones in each geographic region. As we gained trust and Cisco IT became confident in our abilities, we were asked to watch circuits, as well. Our responsibilities continue to grow." For example, Cisco ROS has been asked to establish a baseline for the 100 Cisco TelePresence systems used within Cisco so that it can advise Cisco customers on thresholds and how to set up the systems. "By monitoring and managing new Cisco technologies that Cisco uses internally, such as Cisco TelePresence, we stay on the forefront," Jones says. "By the time that Cisco releases new technologies to its customers, we already know what to watch and how to configure devices for optimum performance. The same Cisco division that supports Cisco IT-an extremely demanding customer-can support Cisco customers as well."
The Cisco IT team emphasizes that its objectives for network monitoring and management change constantly, and that some problems are unavoidable. Following are lessons learned from Cisco IT for establishing a partnership with Cisco ROS:
- Develop guidelines for which tasks to turn over to Cisco ROS and which to keep within the organization, and then work with the partner to develop a plan and roadmap.
- Take the time to set up a clear SLO so that everyone understands the service. "Both parties need to understand and sign off on the processes," says Castano. "This is a critical success factor."
- Charter Cisco ROS to achieve the SLOs and then give them flexibility within that charter. "We had to learn to give Cisco ROS the latitude to solve problems in their own way, provided that they were delivering the agreed-on results," says Buchan.
- Define processes for communication and escalation as well as support. "It is not realistic to expect that every process will be fully defined at the outset," says Mennie. "When our partnership with Cisco ROS began, gaps would periodically appear. That is when communication is most essential." Rosling says, "Outsourcing is not smooth 100 percent of the time, so open, frequent, and direct communication is needed to resolve issues. We provide Cisco ROS with feedback for change in a timely and constructive manner and listen to their point of view, as well."
- Focus on service management rather than project management. "Previously, Cisco IT thought in terms of project management," says Higa. "Our people were accustomed to defined timelines, with a beginning and end. In contrast, service management is ongoing. The difference is that we look for continuous improvement rather than getting results on time and on budget."
- Consider the internal processes of the affected organizations when defining processes. At Cisco, numerous support organizations are involved in problem reporting and resolution, and each had its own processes. For example, the Cisco Operations Command Center (OCC) opens low- and medium-priority tickets and sends pages to the appropriate staff. In the original process that Cisco IT and Cisco ROS defined, Cisco ROS was to contact the OCC for low- and medium-priority tickets. However, the OCC does not page outside of normal local business hours, which stalled trouble tickets. "We amended the process so that Cisco ROS sends the page when a medium-priority event occurs outside the normal business hours," says Mennie.
- Be sure to provide Cisco ROS with the documentation that they need to do their job well. Among the documents that Cisco IT provided to Cisco ROS were descriptions of the addressing scheme, carrier contacts, circuit databases, and escalation procedures.
Until recently, most of the tasks that Cisco IT outsourced to Cisco ROS have pertained to foundation technologies: WAN, LAN, and wireless. "Now that we have reached agreement on SLOs, processes, and procedures for foundation technologies, Cisco ROS is positioned to take on additional responsibilities when needed," says Flannagan. "We are a trusted partner, and our global monitoring and management operations enable us to scale to provide whatever types of support that our customer needs."