Multiprotocol Label Switching (MPLS) VPNs can deliver an impressive range of next-generation telecommunications services. Their sophistication can create complex fault-finding and troubleshooting challenges for network operations center (NOC) staff. In the growing and competitive MPLS VPN market, many providers and enterprises are under pressure to improve the operational efficiency of their NOC to reduce costs, constrain headcount growth, and provide better customer support. Troubleshooting an MPLS VPN is often a manual task, requiring complex procedures not supported today in traditional fault management tools. Thus, the traditional way to meet increased demand for MPLS network maintenance support would be to hire or train appropriately skilled specialists; however, this approach can be costly.
To help address these challenges, Cisco Systems® has developed a next-generation approach to MPLS VPN troubleshooting. Cisco® MPLS Diagnostics Expert is an automated, workflow-based network management application that troubleshoots and diagnoses problems in MPLS VPN deployments. It optionally co-exists and integrates with the Cisco IP Solution Center L3 VPN Management product. This innovative product is suitable for MPLS VPNs in both service provider and enterprise settings.
Cisco MPLS Diagnostics Expert can not only help significantly reduce the direct costs of NOC troubleshooting but also deliver notable additional savings and benefits. For example, fewer escalations from the NOC troubleshooting group to the network engineering group enable the design engineers to deliver next year's new services earlier. By providing a detailed log of all troubleshooting steps performed automatically, the comprehensive diagnostics can also reduce time lost in organizational handovers, such as transferring a problem from level 1 to level 2 to level 3 support, leading to faster resolution of support requests. Faster service response translates to more satisfied customers and a competitive advantage for the provider. The diagnostics capability can also be used during provisioning to reduce the time spent troubleshooting start-up issues.
Cisco MPLS Diagnostics Expert is designed around a knowledge base of MPLS VPN failure scenarios, based on Cisco experience in worldwide MPLS VPN deployment. It is the only MPLS VPN diagnostic tool in today's market that includes built-in domain knowledge from Cisco IOS MPLS development experts and Technical Assistance Center (TAC). This enables Cisco MPLS Diagnostics Expert to deliver the experience of a team of MPLS experts to help troubleshoot and diagnose MPLS VPN operational problems. The Cisco MPLS Diagnostics Expert will become an essential competitive requirement for providers delivering MPLS VPN services.
This paper describes how Cisco MPLS Diagnostics Expert can, when deployed strategically in an NOC help desk, give entry-level staff a range of troubleshooting capabilities that once were the exclusive tools of staff with advanced network qualifications and experience. For both common and not-so-common MPLS VPN network problems, this product can help reduce troubleshooting time, in some cases from hours to minutes.
This paper starts with a review of MPLS VPN network management and the fault management procedures in NOC organizations. It then describes how Cisco MPLS Diagnostics Expert has been designed to work in that setting, both in its user interface design and in its use of the underlying functions of MPLS operations, administration, and maintenance (OAM) in Cisco IOS® MPLS Embedded Management. And finally, it considers how Cisco MPLS Diagnostics Expert can reduce operating expenses over traditional methods and give a strategic competitive advantage to MPLS providers.
Challenges of MPLS VPN Management
MPLS VPNs have become a ubiquitous part of networks in service providers and in many enterprises. MPLS is a cost-efficient way of integrating services on a single network in a highly secure manner. With its advantages comes increased complexity for OAM support. Much of this complexity can be managed with the help of an operation support system (OSS) and appropriate processes, but sooner or later equipment failure, sometimes even firmware issues, or, more frequently, human error will cause problems in the MPLS provider network.
Because of the complexity of MPLS networks running multiple MPLS VPNs and the volume of label-switched paths (LSPs) in the network, diagnosing the underlying cause of a fault is tedious and time consuming and requires a great deal of expertise. Here are some examples of complex MPLS VPN problems that can challenge network operations groups:
• Network outages caused by operator configuration "finger trouble"
• Failure of Label Distribution Protocol (LDP) but not of Open Shortest Path First (OSPF) Protocol
• Problems related to route installation into virtual routing and forwarding (VRF) tables
• LSP "black holes," in which packets somehow vanish
• Failure on the LSP return path
• Failure to start LDP sessions
• Packets not being labeled, label allocation problems
• Route target mismatch
• Chip failure on a router line card
• Visualizing MPLS packet flow paths through a complex network
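One item in the list above, route target mismatch, can be illustrated with a short sketch. The code below is a hypothetical check written for this paper's example, not the logic used by any product: it assumes each VRF is described only by its sets of import and export route targets, and the names `Vrf` and `rt_mismatches` are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Vrf:
    """Hypothetical, simplified view of a VRF's route-target policy on one PE."""
    name: str
    import_rts: frozenset = field(default_factory=frozenset)
    export_rts: frozenset = field(default_factory=frozenset)


def rt_mismatches(local: Vrf, remote: Vrf) -> list:
    """Describe each direction in which no exported route target
    is imported by the other side."""
    problems = []
    if not (local.export_rts & remote.import_rts):
        problems.append(
            f"{remote.name} imports none of the route targets exported by {local.name}")
    if not (remote.export_rts & local.import_rts):
        problems.append(
            f"{local.name} imports none of the route targets exported by {remote.name}")
    return problems


# Illustrative values only: site-b exports 65000:200, which site-a does not import.
site_a = Vrf("site-a", frozenset({"65000:100"}), frozenset({"65000:100"}))
site_b = Vrf("site-b", frozenset({"65000:100"}), frozenset({"65000:200"}))
print(rt_mismatches(site_a, site_b))
```

A mismatch of this kind is easy to introduce with a single mistyped value, which is why it tends to surface as a one-directional loss of routes rather than a total outage.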
Anyone operating an MPLS VPN network in today's economic environment requires that the isolation and diagnosis of problems within the network be as simple and efficient as possible. While faults can sometimes be detected using traditional fault management tools, the important task often remains: determining the reason for that fault. Troubleshooting requires particular expertise and it can take a lot of time to isolate the problem. If the problem is not apparent from the outset, the engineer is likely to narrow down the problem space by a process of elimination. This can be the most time-consuming part of the process, requiring the engineer to take a networkwide view of the problem and perform diagnostics on multiple devices. Automating this process could greatly reduce the time spent during this elimination phase and allow the engineer to quickly focus on resolving the problem.
The Problem Detection and Resolution Lifecycle
Traditional fault management products focus on problem detection, alarm management, and trouble ticket management. Problem detection is usually based upon listening to messages from the network devices, using Simple Network Management Protocol (SNMP) traps and syslog messages, or polling the devices at regular intervals for certain fault symptoms. Cisco routers also provide the option of monitoring the network by injecting test traffic. The problems detected are typically represented in a fault management system such as Cisco Info Center or Cisco Network Connectivity Center in the form of an alarm. While traditional fault management systems can help in the detection, logging, and management of network problems, until now there has been little tool support for the actual troubleshooting.
Whether a problem is detected automatically or reported by a customer, it is usually managed using some kind of workflow management system, such as a trouble ticketing system. An NOC engineer is assigned to:
• Confirm and (where possible) reproduce the problem
• Locate the source and diagnose the cause of the problem
• Carry out the required repair
• Verify that the problem is no longer present
• Report back to the customer that service has been re-established
Even with expert personnel, the troubleshooting activities after problem detection, reporting, and logging can be the most time-consuming part of the end-to-end fault management cycle. And again, traditional fault management products provide limited or no automated support for these activities.
Automated MPLS VPN Diagnostics and Troubleshooting: The Requirements
What should automation tools for troubleshooting look like? The ideal diagnostic tool has a simple user interface requiring the input of the minimum operational information to describe the problem. And the ideal product checks for problems in the customer's traffic path and does not force unnecessary manual effort such as separate GUIs for optional parts of the workflow. If there is a problem, then the ideal automatic troubleshooting tool can call upon the heuristic experience of experts to accurately diagnose the underlying fault.
The results should be understandable by novices and exploitable by experts, presenting a design challenge for the user interface. For novices, the user interface should include a concise summary of the problem, including the possible cause and recommended actions. For experts, it must explain all steps arriving at a conclusion, and if there has been no conclusion, it should clearly show what troubleshooting steps have already been performed. Finally, such a product must make appropriate use of advanced MPLS instrumentation, such as the IETF standards for MPLS OAM, discussed later.
After the problem has been remedied, the operator should be able to easily rerun the test without entering any new data (or reentering any of the existing data), and the application should either verify that the problem is solved or point to any remaining problems or new ones that have appeared.
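The rerun requirement can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of the product: `SavedRun`, `stub_runner`, and the input keys are hypothetical names, and the "network" is a stub dictionary so that the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SavedRun:
    """Hypothetical container that keeps the operator's inputs so the same
    test can be repeated after a repair without retyping anything."""
    inputs: Dict[str, str]
    runner: Callable[[Dict[str, str]], str]
    history: List[str] = field(default_factory=list)

    def run(self) -> str:
        # Pass a copy so the saved inputs cannot be altered by the runner.
        result = self.runner(dict(self.inputs))
        self.history.append(result)
        return result


# Stub standing in for the real network; purely illustrative.
network_state = {"ldp_up": False}


def stub_runner(inputs: Dict[str, str]) -> str:
    return "no fault found" if network_state["ldp_up"] else "LDP session down"


saved = SavedRun({"local_pe": "pe1", "remote_pe": "pe2"}, stub_runner)
before_repair = saved.run()
network_state["ldp_up"] = True      # the operator repairs the fault
after_repair = saved.run()          # same inputs, no re-entry
print(before_repair, "->", after_repair)
```

Keeping the inputs and the result history together also means the verification run is recorded next to the original failure.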
The Cisco Solution
Cisco MPLS Diagnostics Expert is designed to be that ideal troubleshooting product. By helping automate the troubleshooting and diagnosis process, it helps reduce service downtime and helps ensure that business-impacting services are available.
The Cisco MPLS Diagnostics Expert product is designed for NOC fault and assurance operators, providing automated troubleshooting and diagnostics for access circuits, edge, and core in Layer 3 MPLS VPN deployments. Cisco MPLS Diagnostics Expert significantly reduces the time to isolate and diagnose failures in these networks by employing an automated, workflow-based troubleshooting approach.
There are five steps in an effective network fault assurance process:
1. Detection - Customer reports problem
2. Isolation and localization - Reproduce and localize problem
3. Diagnosis - Diagnose problem cause and recommend action
4. Repair - Fix problem based on diagnosis
5. Verification - Verify that fix is implemented
This release of Cisco MPLS Diagnostics Expert is designed to support reactive situations, after step 1. Cisco MPLS Diagnostics Expert focuses on automating steps 2, 3, and 5 - isolation, diagnosis, and verification. It has functions for isolating and diagnosing failures in the network, determining the devices at fault, and automatically checking appropriate device status and configuration to determine the likely reason for the failure. The objective of this product is to shorten the lifecycle of complex MPLS outages from hours to minutes. The simple user interface makes this product well suited for providing the NOC help desk with advanced MPLS diagnosis skills.
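The division of labor described above can be written down as a small model. The sketch below only restates the five steps and which of them the text says are automated; the names `Step`, `AUTOMATED`, and `manual_steps` are illustrative.

```python
from enum import Enum


class Step(Enum):
    DETECTION = 1
    ISOLATION = 2
    DIAGNOSIS = 3
    REPAIR = 4
    VERIFICATION = 5


# Steps 2, 3, and 5 are the ones the text describes as automated.
AUTOMATED = {Step.ISOLATION, Step.DIAGNOSIS, Step.VERIFICATION}


def manual_steps() -> list:
    """Steps that remain with the customer or the NOC operator."""
    return [step for step in Step if step not in AUTOMATED]


for step in Step:
    owner = "automated" if step in AUTOMATED else "manual"
    print(f"{step.value}. {step.name.title()}: {owner}")
```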
The diagnostics performed by Cisco MPLS Diagnostics Expert are based upon analysis of network failure scenarios across MPLS access, edge, and core components. The product achieves this through its MPLS VPN Failure Scenario Knowledge Base: more than 100 troubleshooting workflows, each with 100 to more than 150 decision steps and dozens of lines of Cisco IOS Command Line Interface (CLI) being executed and automatically analyzed. The failure scenarios have been identified based on problems that have been reported to Cisco MPLS VPN experts, including the Cisco Technical Assistance Center (TAC) and Cisco Advanced Services. The knowledge base will be updated in future product releases as more failure scenarios are discovered and their detection automated.
When the application does not determine an issue, it lists out every decision step and every Cisco IOS Software operation performed. This complete transcript of the automated troubleshooting session gives NOC technicians a head start on subsequent diagnosis.
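A workflow that records every step, whether or not it reaches a conclusion, might look like the sketch below. It is a simplified assumption about how such a transcript could be kept, not the product's design; `Session`, `run_workflow`, and the check names are hypothetical, and the checks are stubs that return fixed results.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Each check is (name, function); the function returns (passed, detail).
Check = Tuple[str, Callable[[], Tuple[bool, str]]]


@dataclass
class Session:
    transcript: List[str] = field(default_factory=list)
    diagnosis: Optional[str] = None


def run_workflow(checks: List[Check]) -> Session:
    """Run checks in order, log every step, and stop at the first failure."""
    session = Session()
    for name, check in checks:
        passed, detail = check()
        status = "pass" if passed else "FAIL"
        session.transcript.append(f"{name}: {status} ({detail})")
        if not passed:
            session.diagnosis = f"{name} failed: {detail}"
            break
    return session


# Stub checks that all pass, so the session ends without a diagnosis.
inconclusive = run_workflow([
    ("per-node configuration", lambda: (True, "as expected")),
    ("LDP operation", lambda: (True, "sessions up")),
    ("VRF route installation", lambda: (True, "route present")),
])
print(inconclusive.diagnosis)
print("\n".join(inconclusive.transcript))
```

Even in the inconclusive case, the transcript tells the next engineer which possibilities have already been ruled out.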
Whether it is checking for NOC operator configuration errors or detecting symptoms of software bugs and hardware issues, Cisco MPLS Diagnostics Expert applies an "80/20" philosophy, checking the 20 percent of possible causes that account for 80 percent of all problems. With this approach, it is possible to reduce troubleshooting time to around 4 minutes or less instead of - in some cases - hours.
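The "80/20" idea can be made concrete with a short calculation. The sketch below picks the most frequent causes until they account for a target share of all reported problems. The counts are invented placeholders chosen only to make the arithmetic visible; they are not Cisco data, and `causes_to_cover` is a hypothetical helper.

```python
def causes_to_cover(counts: dict, target_percent: int = 80) -> list:
    """Return the most frequent causes that together account for at least
    target_percent of all reported problems."""
    total = sum(counts.values())
    covered = 0
    chosen = []
    for cause, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= target_percent * total:
            break
        chosen.append(cause)
        covered += count
    return chosen


# Invented placeholder counts, for illustration only.
example_counts = {
    "cause A": 40, "cause B": 25, "cause C": 15,
    "cause D": 8, "cause E": 7, "cause F": 5,
}
print(causes_to_cover(example_counts))
```

With these placeholder numbers, three of the six causes already cover 80 of the 100 reports, which is the kind of skew the approach relies on.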
In addition to troubleshooting, Cisco MPLS Diagnostics Expert can be used for postprovisioning connectivity checks on VPN connections. This has the potential to significantly reduce the cost of provisioning, in which as much as 75 percent of the cost can be in verification that the VPN connections function correctly.
Localization and Diagnosis Procedures
Cisco MPLS Diagnostics Expert encapsulates a rich set of connectivity testing, troubleshooting, and diagnostics steps and procedures. The exact steps performed for each test depend upon the nature of the failure symptoms found and the location of the failure within the network. This section provides a high-level overview of the connectivity testing, troubleshooting, and diagnostics logic.
The test scope is determined by the test configuration entered. For example, for each site, testing could be performed to a customer device within the site, the customer edge (CE) access circuit interface, or the provider edge (PE) access circuit interface. For simplicity, let us assume that testing will be to the CE access circuit interface.
The first step tests VPN connectivity between the two sites using the VRF ping functionality (Figure 1). Ideally, this test should be initiated from a device in the local site subnet to a destination IP address in the remote site subnet. In Figure 1, the first stage tests connectivity from the remote site PE to the local site CE and the second tests connectivity from the local site PE to the remote site CE.
Figure 1. Connectivity Testing Process in Cisco MPLS Diagnostics Expert
By testing connectivity in two stages, the troubleshooting and diagnostic functions simulate an end-to-end test from the local site CE to the remote site CE, and thus can identify any VRF connectivity problems between the sites. This connectivity test exercises VPN, MPLS, and IP connectivity between the two sites.
If the VPN connectivity is confirmed to be functional, the MPLS provider can conclude that the problem is in the customer's network. If a VRF connectivity problem is detected, then further tests are performed to isolate the problem. These tests are initiated on the PE device and performed in the direction in which a VPN failure was detected. They include checks on the PE and core devices, using the OAM tools in Cisco IOS MPLS Embedded Management.
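The two-stage test and the decision that follows it can be sketched as below. This is an illustration of the logic as described in the text, not product code: the `ping` argument is a stand-in callable (a real tool would run a VRF ping on the PE device), and all device names are hypothetical.

```python
from typing import Callable


def two_stage_test(ping: Callable[[str, str], bool],
                   local_pe: str, local_ce: str,
                   remote_pe: str, remote_ce: str) -> str:
    """Stage 1: remote PE to local CE. Stage 2: local PE to remote CE."""
    stages = [(remote_pe, local_ce), (local_pe, remote_ce)]
    failed = [(pe, ce) for pe, ce in stages if not ping(pe, ce)]
    if not failed:
        return "VPN connectivity confirmed; suspect the customer network"
    directions = ", ".join(f"{pe} -> {ce}" for pe, ce in failed)
    return f"VRF connectivity problem; continue isolation from PE in direction: {directions}"


def stub_ping(source_pe: str, destination_ce: str) -> bool:
    # Stub: only the path from the local PE to the remote CE is broken.
    return (source_pe, destination_ce) != ("pe-local", "ce-remote")


print(two_stage_test(stub_ping, "pe-local", "ce-local", "pe-remote", "ce-remote"))
```

Recording which direction failed matters because the follow-up tests are started on the PE and run in that direction.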
Connectivity testing stops when the fault is isolated. Once the fault is isolated, a sequence of automated troubleshooting and diagnostics steps is performed to diagnose its cause. The specific steps depend on the nature and location of the fault; on a provider router they may include checking basic per-node configuration, validating that LDP operation is correct, and checking route installation into the VRF. On the access circuit, configuration mismatch errors are checked automatically, among many other in-depth checks.
Ease of Use
The GUI of Cisco MPLS Diagnostics Expert (Figure 2) is designed to be easily usable by staff at all levels of the NOC support organization.
Figure 2. Cisco MPLS Diagnostics Expert Data Input GUI
The figure shows a snapshot of the input screen. The user specifies two endpoints in the network. There are six mandatory fields to fill in, three for each of the endpoints on either side of the network, that uniquely identify each of the endpoints. There is also an optional fourth field where a destination within the customer network can be specified to test connectivity into the customer network and eliminate it as the source of the VPN connectivity problem.
Note that there was no need to specify any device information in the input screen except for the PE devices and the customer IP addresses. Cisco MPLS Diagnostics Expert can automatically derive all of the other network information required to perform an end-to-end diagnosis, such as loopback addresses for the LSP ping if required, router interfaces, and device status.
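The shape of the input, three mandatory fields per endpoint plus one optional destination, can be sketched as a small validation routine. The field names below (`pe_device`, `access_circuit`, `ce_address`, `customer_destination`) are guesses made for illustration; the text does not name the fields, and the real screen may differ.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class Endpoint:
    # Three mandatory fields; the names are assumptions, not the real labels.
    pe_device: str
    access_circuit: str
    ce_address: str
    # Optional fourth field: a destination inside the customer network.
    customer_destination: Optional[str] = None

    def errors(self) -> list:
        found = []
        for name in ("pe_device", "access_circuit", "ce_address"):
            if not getattr(self, name).strip():
                found.append(f"{name} is required")
        for name in ("ce_address", "customer_destination"):
            value = getattr(self, name)
            if value:
                try:
                    ipaddress.ip_address(value)
                except ValueError:
                    found.append(f"{name} is not a valid IP address")
        return found


good = Endpoint("pe1", "circuit-1", "10.1.1.2")
bad = Endpoint("pe1", "", "10.1.1.999")
print(good.errors())
print(bad.errors())
```

Validating the few values the operator does enter keeps typing mistakes from being reported later as network faults.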
The GUI for diagnostic results is shown in Figure 3.
Figure 3. The Diagnostics Results Summary Screen
The screenshot illustrates an example of a failure condition that has been detected in the core network. Cisco MPLS Diagnostics Expert provides three easy-to-understand pieces of information on the result screen:
• Summary - the type of problem encountered
• Possible Cause - the precise nature of the problem
• Recommended Action - what an operator should do to rectify the problem
This screen is suitable for interpretation by an entry-level NOC operator. The details of the diagnostics are in the test log, which is mainly for use by experienced NOC operators and Cisco TAC. An example is shown in Figure 4. All Cisco IOS Software steps, that is, the lines of Command Line Interface (CLI) used in the diagnosis, can also be examined, so that a more experienced user can determine why Cisco MPLS Diagnostics Expert has reached a particular conclusion. An example is illustrated in Figure 5.
Figure 4. Test Log Example - all diagnostics steps are shown to explain the diagnosis
Figure 5. Cisco IOS Command example, as executed as part of a diagnostics workflow
The IP addresses shown are derived automatically from the network by Cisco MPLS Diagnostics Expert.
This test log is particularly useful when problems are escalated through the support chain, as it not only helps to find the problem, but also records what has already been tested. This feature helps avoid miscommunication issues that arise as a problem report is passed from one part of the organization to the next.
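The two audiences for a result, the entry-level operator and the engineer receiving an escalation, suggest a record with two views. The sketch below is one hypothetical way to model that; `DiagnosticResult` and its methods are illustrative names, and the sample values are invented rather than taken from a real test log.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiagnosticResult:
    summary: str
    possible_cause: str
    recommended_action: str
    test_log: List[str] = field(default_factory=list)

    def for_operator(self) -> str:
        """The three-part summary intended for an entry-level operator."""
        return "\n".join([
            f"Summary: {self.summary}",
            f"Possible Cause: {self.possible_cause}",
            f"Recommended Action: {self.recommended_action}",
        ])

    def for_escalation(self) -> str:
        """The summary plus every recorded step, for attaching to a ticket."""
        steps = "\n".join(f"  {n}. {line}" for n, line in enumerate(self.test_log, 1))
        return f"{self.for_operator()}\nTest Log:\n{steps}"


# Invented sample values, for illustration only.
result = DiagnosticResult(
    summary="Failure detected in the core",
    possible_cause="LDP session not established between two core routers",
    recommended_action="Escalate to level 3 support with this log attached",
    test_log=["VRF ping failed", "LSP ping failed", "LDP neighbor check failed"],
)
print(result.for_escalation())
```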
Cisco MPLS Diagnostics Expert and MPLS OAM Standards
The recent standardization and implementation of support for MPLS OAM in Cisco network elements has made problem localization significantly easier. Cisco IOS Software now supports mechanisms such as VRF ping and traceroute, and LSP ping and traceroute. These standard mechanisms are discussed in detail in the following references:
Cisco MPLS Diagnostics Expert automatically makes use of these mechanisms to the extent that end users do not need to know their detailed syntax. It also makes use of advanced MPLS OAM options that can be used to diagnose conditions such as inconsistent route processor and line card MPLS forwarding tables on routers with distributed forwarding architectures, such as the Cisco 12000 Series Router.
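To show what "not needing to know the detailed syntax" can mean in practice, the sketch below builds the basic forms of the four commands from validated inputs. The helper names are hypothetical, the VRF name and addresses are made up, and the command strings use the commonly documented basic Cisco IOS forms; exact syntax and available options vary by software release, so treat them as an assumption to verify against the documentation for a given platform.

```python
import ipaddress


def vrf_ping(vrf: str, destination: str) -> str:
    ipaddress.ip_address(destination)          # raises ValueError if malformed
    return f"ping vrf {vrf} {destination}"


def vrf_traceroute(vrf: str, destination: str) -> str:
    ipaddress.ip_address(destination)
    return f"traceroute vrf {vrf} {destination}"


def lsp_ping(prefix: str) -> str:
    network = ipaddress.ip_network(prefix)     # expects address/length, e.g. a /32 loopback
    return f"ping mpls ipv4 {network}"


def lsp_traceroute(prefix: str) -> str:
    network = ipaddress.ip_network(prefix)
    return f"traceroute mpls ipv4 {network}"


# Made-up VRF name and addresses.
for command in (
    vrf_ping("customer-a", "10.1.1.2"),
    vrf_traceroute("customer-a", "10.1.1.2"),
    lsp_ping("192.168.0.1/32"),
    lsp_traceroute("192.168.0.1/32"),
):
    print(command)
```

Generating the commands from checked values, rather than typing them by hand, removes one common source of operator error during troubleshooting.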
MPLS OAM support has been adopted by a variety of other vendors besides Cisco. Note that Cisco MPLS Diagnostics Expert can function appropriately in a multivendor environment, whether the third-party devices conform to the MPLS OAM standards or not.
Cisco MPLS Diagnostics Expert and the Cisco IP Solution Center Product Family
Cisco MPLS Diagnostics Expert can be used with or without the applications of the Cisco IP Solution Center product family. Use of Cisco IP Solution Center MPLS VPN provisioning is not required for use of Cisco MPLS Diagnostics Expert. Cisco MPLS Diagnostics Expert is the latest addition to the product family.
Old and New Troubleshooting Systems Compared
Consider the following simplified hypothetical comparison of an NOC with and without Cisco MPLS Diagnostics Expert. A customer phones NOC support, complaining that there is loss of VPN connectivity between two customer sites. Consider the potential timeline comparisons (Table 1) between a traditional MPLS provider and a next-generation MPLS provider using Cisco MPLS Diagnostics Expert on the front-line support help desk.
Table 1. Troubleshooting Comparison
Traditional NOC Process
1. Customer calls to report problem.
2. Operator collects customer and problem details, creates trouble ticket, and gives reference number to customer.
3. Problem placed in queue, awaiting level 2 support attention.
4. Still in queue.
5. Level 2 support engineer starts looking at the problem.
6. Level 2 support engineer escalates to level 3 support.
7. Level 3 support picks up the case and starts manual troubleshooting.
8. Level 3 engineer finds fault in core.
9. Problem is fixed and manually validated.
10. Customer is called back; case closed.
NOC Using Cisco MPLS Diagnostics Expert
1. Customer calls to report problem.
2. Support runs tests while customer is on the phone.
3. Cisco MPLS Diagnostics Expert finds the problem in the core and verbal feedback is given to customer.
4. Trouble ticket is created; detailed test log is attached. Test log clearly indicates that level 3 escalation is required.
5. Trouble ticket, including Cisco MPLS Diagnostics Expert test log, is inspected by level 3 support engineer. Clear diagnosis of problem is provided by test log.
6. Problem is fixed and validated using test log.
7. Customer is called back; case closed.
Use of Cisco MPLS Diagnostics Expert can significantly speed up troubleshooting. Note that in this example, much of the elapsed time in the traditional process results from delays in organizational handover. Faster troubleshooting not only results in decreased operational cost, but also enables redeployment of the saved resources to other productive activities. Equally important, customer satisfaction improves with better service and network availability, creating a competitive advantage for the MPLS provider.
Cisco MPLS Diagnostics Expert delivers radical improvements to MPLS VPN customer support organizations through an automated troubleshooting and diagnostics approach. The MPLS provider gains not only by reducing operating expenses, but also by increasing organizational efficiency and competitive advantage by serving the customer faster. Key features that deliver these improvements are the simple user interface suitable for all levels of help personnel, together with the MPLS VPN Failure Scenario Knowledge Base, which is built from Cisco's extensive experience as an MPLS VPN infrastructure provider.
For a wide range of common and not-so-common MPLS VPN problems, Cisco MPLS Diagnostics Expert can reduce connectivity troubleshooting time from hours down to a few minutes. It reduces the human error in the troubleshooting process through an automated process that is both methodical and deterministic. The product enables MPLS providers to quickly identify whether the problem is in their network or their customers' network. It encapsulates Cisco experience in troubleshooting MPLS networks. Finally, it enables first-level NOC technicians to carry out troubleshooting and diagnostics that would normally require the attention of a highly trained MPLS expert.
Cisco MPLS Diagnostics Expert can reduce service downtime and, hence, increase MPLS provider profitability. It helps deliver MPLS VPN operational excellence to customers and helps in meeting service-level agreements. These operational improvements can have a direct impact in enhancing customer satisfaction, reducing customer turnover, and raising an MPLS provider's MPLS VPN operations to a new level of excellence.