Service-Centric Approach to AIOps

Augment Cisco Network Services Orchestrator (NSO) deployment with Cisco Crosswork Situation Manager to boost operational efficiency

 

Drivers for AIOps

Operations teams are undergoing a paradigm shift and embracing big data, modern machine learning, and other advanced analytics technologies to boost operations efficiency with proactive, personal, and dynamic insight. Gartner has coined the term AIOps (artificial intelligence for IT operations) to capture the spirit of these changes.

Current methodologies, techniques, and best practices are shackled by traditional siloed Operations Support System (OSS) stacks, rigid rule-based systems, and monolithic architectures. AIOps helps to quickly extract actionable insights from the operational data to help automate tasks and processes that have traditionally required human intervention.

According to Gartner, AIOps is going to drive a major change in operations over the next few years. Both Communications Service Providers (CSPs) and enterprises will substantially benefit from this shift as they undergo digital transformation. The change is manifesting as adoption and integration of different digital technologies to fundamentally change the way services are delivered to end customers. However, it is also posing new challenges for operations teams in terms of scale (Figure 1) and pace of change.

Figure 1. Digital transformation poses a scalability challenge

Digital transformation encompasses trends that stakeholders need to understand in order to effectively embrace digitization as part of service delivery:

Increasing infrastructure complexity and scale

Digitization demands that service delivery infrastructure be scalable and extensible to support growing numbers of applications, services, and subscribers. Such infrastructure would leverage multiple technology layers such as IP, Multiprotocol Label Switching (MPLS), and optical, and would connect multiple domains such as RF, backhaul, access, and core. In addition, a proliferation of diverse use cases is demanding different architectural requirements. For example, upcoming use cases such as the Internet of Things (IoT), autonomous vehicles, and edge analytics demand a combination of cloud and edge computing.

Managing such mass-scale infrastructure can be complex. From an operational perspective, the systems need to effectively deal with technology layers, domains, a large number and variety of devices, data volumes, and disparate data formats while delivering a simplified experience to improve operational efficiency. Virtualization introduces an additional dimension of complexity by decoupling the software from the underlying hardware and by enabling dynamic workload-based scaling.

Dynamic infrastructure and software-defined networking

According to the Open Networking Foundation (ONF), Software-Defined Networking (SDN) is a new approach to networking in which network control is decoupled from the data forwarding function and is directly programmable. The net result is an extremely dynamic, manageable, cost-effective, and adaptable architecture that gives administrators unprecedented programmability, automation, and control. Implementing SDN as an open standard enables rapid service development and deployment, reduced operational costs, and flexibility for network administrators to integrate best-in-class technology.

The rise of data- and context-driven decision making

Traditional management systems can easily be overloaded with operational data, impeding operator productivity. What is required are dynamic insights, such as detection of anomalous behaviors or identification of causal patterns. Further, such insights need to be translated to precise actions that can be automated within a relevant context to improve service quality and ultimately the end-user experience.

AIOps as a critical business enabler

Customer experience and delivery of committed service levels are paramount. As part of digitization, services are increasingly offered as Software as a Service (SaaS) or Platform as a Service (PaaS) and are associated with stringent Service-Level Agreements (SLAs) around service availability. Service disruption or degradation can have a significant impact in terms of revenue or customer churn. The success of operations will be measured not just in terms of its effectiveness in managing the complex infrastructure but, more importantly, in how quickly and proactively it can respond to service issues.

Such transitions pose significant challenges for current Business Support System (BSS) and OSS stacks. In the next section, we will discuss what is needed to effectively embrace the digital transformation.

As the operational focus expands beyond infrastructure to services spanning multilayer, multivendor, and multidomain environments, the benefits of tying service context to statistical results to improve AIOps accuracy become more evident, as highlighted in this white paper. In this context, Cisco® Crosswork Situation Manager offers comprehensive AIOps capabilities to address such operational challenges and delivers substantial improvement in operational efficiency. It can be deployed in conjunction with Cisco Network Services Orchestrator (NSO), which enriches the AIOps data with service context and so enables a service-centric approach to AIOps.

Embracing digital transformation

To realize the full potential of the digital transformation, a fundamental shift in operational best practices and adoption of AIOps principles is required. Realistically, the change can be implemented in phases, driven by specific business outcomes and use cases.

Today, restrained by the current toolset, IT and telco operations are still reactive, relying on rules-based anomaly detection and manual remediation procedures. The goal is to adopt AIOps to help transition from a reactive approach to a proactive and predictive one, and to use analytics for anomaly detection and automation of closed-loop operational workflows. A service-centric approach to AIOps advocates the principles in the table below to boost operational efficiency.

Table 1. AIOps principles

Principle

Description

Service awareness

Service delivery is a central theme across digital transformation trends. Enrichment of operational data with service attributes such as service name, identifier, and topology helps to provide necessary context for AIOps to improve accuracy in data processing and consequently service support quality.

Unified cross-domain visibility

Today’s siloed operational model is a big impediment to operational efficiency. End-to-end cross-domain awareness is essential to track service health, contextually monitor the performance of underlying infrastructure components, and characterize traffic flows and the interplay among different technologies.

Scalable data processing and effective use of machine learning techniques

From an operational perspective, the systems need to effectively deal with multiple technology layers and domains, a large number and variety of physical and virtual devices, growing volumes of monitoring data, and disparate data formats. Data analysis is becoming too complex and resource intensive to be efficiently addressed by human resources. The transformative algorithmic approaches of AIOps are needed to effectively handle massive data volumes and extract dynamic insights to help improve operational efficiency.

Knowledge sharing and collaboration

Digitization simplifies knowledge capture and sharing and facilitates cross-silo operational workflows. Collaboration results in significant productivity gains and reduction of operational metrics such as Mean Time To Know (MTTK) and Mean Time To Restore (MTTR).

Automation focus

Dynamic insights help to drive automation and to implement closed-loop control of large-scale change. This minimizes error-prone manual methods or procedures and substantially improves operational efficiency.


 

The power of a service-centric approach to AIOps

Dynamic and programmable infrastructure can generate a lot of operational data. This data can be related to infrastructure health, application performance, activity logs, event notifications, traffic flows, social media interaction, IT Service Management (ITSM) system integration, etc. The data volumes and variety are increasing daily, and it is impossible for humans to process the data and still deliver on stringent SLAs. Applying AIOps helps to address the challenge. The software algorithms reduce large volumes of data into actionable bits of information and assist in the learning and codification of the knowledge. Termed “proactive insight” by Gartner, this capability speeds up analysis and operational decisions. For example, it would take numerous hours for a human to manually analyze and correlate every event in your production environment; however, the algorithms can accomplish the task in a matter of seconds (Figure 2). Combining machine learning with human skills and knowledge offers an optimal approach to delivering the target SLA and boosting operational efficiency in a large-scale and growing IT or telco environment.

Figure 2. Applying artificial intelligence to network operations

The more AIOps learns the context associated with the data, the greater the accuracy with which the alert can be processed. Enriching the event streams with contextual attributes such as service name, tenant name, and service topology from Cisco NSO enables more precise correlations and reinforces functions such as service impact analysis and probable root cause analysis.
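The enrichment idea can be illustrated with a minimal Python sketch. It is not an actual Cisco NSO or Crosswork Situation Manager API: the inventory dictionary stands in for service data that would be retrieved from NSO, and the `enrich` helper, the field names, the device names, and the tenant name are all hypothetical. Only the service name "volvo" is taken from the scenario described later in this paper.

```python
from typing import Dict, List, Optional

# Hypothetical service inventory keyed by device name. In a real deployment
# this context would be learned from Cisco NSO; here it is hard-coded.
SERVICE_INVENTORY: Dict[str, Dict[str, object]] = {
    "pe-1": {
        "service": "volvo",
        "tenant": "tenant-a",  # assumed tenant name
        "topology": ["ce-hq", "pe-1", "pe-2", "ce-branch"],
    },
    "pe-2": {
        "service": "volvo",
        "tenant": "tenant-a",
        "topology": ["ce-hq", "pe-1", "pe-2", "ce-branch"],
    },
}


def enrich(alert: dict) -> dict:
    """Return a copy of the alert with service context attached when known."""
    context: Optional[Dict[str, object]] = SERVICE_INVENTORY.get(alert.get("device"))
    enriched = dict(alert)
    enriched["service_context"] = context  # None when the device is unknown
    return enriched


alerts: List[dict] = [
    {"device": "pe-1", "message": "interface down"},
    {"device": "lab-switch", "message": "fan failure"},
]

for item in map(enrich, alerts):
    ctx = item["service_context"]
    label = ctx["service"] if ctx else "no service context"
    print(f'{item["device"]}: {item["message"]} ({label})')
```

With the context attached, later correlation steps can group alerts by service or tenant instead of by device alone.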

 

Orchestrating AIOps service context with Cisco NSO

Cisco NSO, enabled by Tail-f®, is an industry-leading orchestration platform for hybrid networks. It provides comprehensive lifecycle service automation to enable the design and delivery of high-quality services much faster and more easily, reinforcing the digital transformation. In addition, it simplifies the automation of configuration tasks in a rapidly growing, complex network environment.

Figure 3. Cisco NSO interplays with AIOps as part of self-healing closed-loop automation

The top priority for operations teams is to improve the service level, that is, to proactively detect and remediate any service- and customer-impacting issues. Deploying AIOps with Cisco Crosswork Situation Manager in conjunction with Cisco NSO (Figure 3) helps enrich the service context, delivering the following benefits:

  • Bridge the disconnect between service and infrastructure views. Enrich AIOps data with service and device attributes such as service name, service components, and topology to help correlate them across the service and infrastructure layers.
  • Keep up with dynamic infrastructure. Subscribe to Cisco NSO for any changes in the service lifecycle status, underlying service components, and topology to help ensure accuracy in the analytics outcomes using machine learning techniques.
  • Prioritize based on business and service impact. Enrich AIOps with customer attributes, which, when correlated with the service and infrastructure information, help to characterize the impact of the issues that require attention from the operator.
  • Automate service assurance through a model-driven approach. Cisco NSO enables “orchestrated assurance” to validate service status at the time of service provisioning and monitor service health throughout the service lifecycle. Based on the assurance intent expressed in the YANG model in conjunction with the definition of the service, NSO can configure the device instrumentation, such as Simple Network Management Protocol (SNMP) and model-driven telemetry pertaining to a specific service. It can provision active probes to monitor end-to-end service status. It can also configure the AIOps system to contextualize and analyze streamed data during the entire service lifecycle, immediately after the service is provisioned until it is retired.
  • Simplify multivendor device configuration. Cisco NSO offers a single interface to configure all devices as part of closed-loop automation.

Figure 4. Service enrichment by Cisco NSO

Let’s consider a scenario of a service provider deploying a Layer 3 VPN service (named volvo) to connect two customer sites, headquarters and a branch.

  1. The service provider operator uses Cisco NSO to define the Layer 3 VPN service and express the service and resource attributes (such as tenant name, site name, Provider Edge [PE] and Customer Edge [CE] device details, and Quality-of-Service [QoS] policy) as a YANG-based service model. The model is used by NSO to provision the service.
  2. The service model is extended to include service assurance enablement attributes such as provisioning of active probes at designated sites, configuring test operations, setting up thresholds to generate Threshold Crossing Alerts (TCA), and enabling syslogs to be forwarded to AIOps implemented with Cisco Crosswork Situation Manager.
  3. As a result of integration with Cisco NSO, Cisco Crosswork Situation Manager gathers the service and device attributes (Figure 4) in order to enrich a live stream of alerts and identified situations.
  4. Applying machine learning techniques, Cisco Crosswork Situation Manager expedites detection of device and service issues and highlights situations that require operator attention (Figure 5). In addition, it enriches alerts and situations with contextual information such as the service impacted and the tenant name learned from NSO. The context, along with the knowledge base, helps in facilitating collaborative problem resolution in the situation room.
  5. As a part of the remediation, NSO takes care of required device configuration changes, which can be accomplished in a vendor-agnostic manner.
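Steps 1 and 2 can be sketched in Python as a service definition that carries its own assurance attributes. This is an illustration only: in the scenario the model is written in YANG and provisioned by NSO, whereas the classes, field names, tenant name, QoS policy name, threshold value, and the `check_probe` helper below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProbeThreshold:
    metric: str       # name of the measured value, e.g. "latency_ms"
    max_value: float  # a measurement above this raises a TCA


@dataclass
class L3VpnService:
    name: str
    tenant: str
    sites: List[str]
    qos_policy: str
    # Assurance attributes added alongside the service definition (step 2).
    thresholds: List[ProbeThreshold] = field(default_factory=list)


def check_probe(service: L3VpnService, metric: str, value: float) -> Optional[dict]:
    """Return a Threshold Crossing Alert carrying service context, or None."""
    for threshold in service.thresholds:
        if threshold.metric == metric and value > threshold.max_value:
            return {
                "type": "TCA",
                "service": service.name,
                "tenant": service.tenant,
                "metric": metric,
                "value": value,
                "threshold": threshold.max_value,
            }
    return None


volvo = L3VpnService(
    name="volvo",
    tenant="tenant-a",         # assumed tenant name
    sites=["headquarters", "branch"],
    qos_policy="gold",         # assumed policy name
    thresholds=[ProbeThreshold(metric="latency_ms", max_value=50.0)],
)

print(check_probe(volvo, "latency_ms", 80.0))  # crosses the assumed threshold
print(check_probe(volvo, "latency_ms", 20.0))  # within the assumed threshold
```

Because the alert is generated from the service definition, it already carries the service and tenant names that the AIOps layer needs for step 4.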

Figure 5. Situations identified by Cisco Crosswork Situation Manager

This summarizes the service-centric approach to AIOps and how it can drive closed-loop automation. Coupling service monitoring with orchestration and configuration changes helps to reduce the typical delays between monitoring and control. Reducing the latency between monitoring and orchestration, provisioning, and change to near zero helps ensure continuous service quality from the moment the service is activated until its retirement.

 

Implementing AIOps with Cisco Crosswork Situation Manager

Cisco Crosswork Situation Manager uses patented artificial intelligence technology to automatically reduce your alert volume, provide proactive insights into the health of your technology stack, and facilitate collaboration across multiple stakeholders to quickly resolve any incident that might arise.

With Cisco Crosswork Situation Manager, you can automatically reduce alert volumes by up to 99 percent and proactively detect problems through smart correlation of alerts (situations). Once an issue has been identified, Cisco Crosswork Situation Manager streamlines collaboration and automates workflows across teams and tools using built-in features such as the “situation room.”
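To make the two ideas concrete, deduplicating events and grouping related alerts into situations, here is a deliberately simple Python sketch. It does not represent the product's patented algorithms: it swaps in a fixed key for deduplication and a shared-service, time-window heuristic for clustering. All function names, field names, and the 60-second window are assumptions.

```python
from typing import Dict, List, Tuple


def deduplicate(events: List[dict]) -> List[dict]:
    """Collapse events with the same (source, check) into one alert with a count."""
    alerts: Dict[Tuple[str, str], dict] = {}
    for event in events:
        key = (event["source"], event["check"])
        if key in alerts:
            alerts[key]["count"] += 1
            alerts[key]["last_seen"] = event["time"]
        else:
            alerts[key] = {
                "source": event["source"],
                "check": event["check"],
                "service": event["service"],
                "first_seen": event["time"],
                "last_seen": event["time"],
                "count": 1,
            }
    return list(alerts.values())


def cluster(alerts: List[dict], window: float = 60.0) -> List[dict]:
    """Group alerts on the same service that start within `window` seconds
    of the most recent alert already in the group."""
    situations: List[dict] = []
    for alert in sorted(alerts, key=lambda a: a["first_seen"]):
        for situation in situations:
            same_service = situation["service"] == alert["service"]
            if same_service and alert["first_seen"] - situation["latest"] <= window:
                situation["alerts"].append(alert)
                situation["latest"] = alert["first_seen"]
                break
        else:
            situations.append(
                {"service": alert["service"], "latest": alert["first_seen"], "alerts": [alert]}
            )
    return situations


events = [
    {"time": 0, "source": "pe-1", "check": "link-down", "service": "volvo"},
    {"time": 5, "source": "pe-1", "check": "link-down", "service": "volvo"},
    {"time": 10, "source": "ce-hq", "check": "bgp-down", "service": "volvo"},
    {"time": 12, "source": "pe-1", "check": "link-down", "service": "volvo"},
    {"time": 500, "source": "pe-9", "check": "high-cpu", "service": "other"},
]

alerts = deduplicate(events)
situations = cluster(alerts)
print(f"{len(events)} events, {len(alerts)} alerts, {len(situations)} situations")
```

The service field used for grouping is exactly the kind of context that enrichment from Cisco NSO would supply.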

Cisco Crosswork Situation Manager is constantly learning, capturing knowledge gained from an operator’s experience and then automatically storing and sharing it for future reuse through features such as probable root cause analysis and algorithmic knowledge (Figure 6).

Figure 6. Implementing AIOps with Cisco Crosswork Situation Manager

The result is a drastic reduction in the overall number of incidents affecting production systems and in the time to detect and resolve incidents that do occur (Mean Time To Detect [MTTD] and MTTR), improving your quality of service and reducing the overall cost of addressing downtimes and outages. The table below gives a list of features provided by Cisco Crosswork Situation Manager and the related benefits.

Table 2. Cisco Crosswork Situation Manager features and benefits

AI-driven features

Benefit

Automatic noise reduction

Ranks monitoring events by significance, enabling IT and network operations teams to focus only on relevant alerts. Up to 99 percent noise reduction without rules or filters, avoiding both false positives (useless alerts that distract operators) and false negatives (useful alerts that are filtered out before operators see them), as well as reducing operator fatigue from excessive alert volumes.

Avoids having issues be lost in the noise and helps operators focus on relevant events.

Algorithmic correlation

A variety of patented algorithmic and machine learning techniques build clusters of related alerts automatically, identifying unique situations without the need for laborious development and time-intensive maintenance of rules, filters, or inventory-based service maps. These techniques reduce incidents being created in ticketing systems (existing customers have seen a 62 percent reduction in tickets), avoiding the corresponding direct and indirect costs.

Delivers situational context around an issue, enabling correct prioritization without requiring laborious creation and maintenance of rules, filters, or service maps.

Cross-silo collaboration

Automatically assembles cross-disciplinary teams to work on identified situations while avoiding the distraction of involving unnecessarily large teams in issues where they are not required. New participants can be invited automatically (based on skills required) or manually and receive notification by email or SMS or through integration with existing notification systems.

Streamlines both diagnosis and remediation by limiting the size of incident teams and providing powerful collaboration tools (situation room). Existing customers have seen a 30 percent reduction in customer-identified incidents and a 25 percent reduction in MTTR, leading to greater quality of service for less effort.

Integrated ChatOps functionality further accelerates diagnosis and remediation by integrating Cisco or external automation tools.

Probable root cause

Machine learning functionality to identify events that might be the root cause of a situation and rank them according to the confidence of that identification.

Enables IT and network operations teams to start their investigation from events that have the greatest chance of being the cause of the incident, without requiring manual dependency mapping.

Algorithmic knowledge capture

Suggests resolving actions captured automatically from past situations (Figure 7) in order to speed resolution times and identify recurring issues. Assembles knowledge based on proven solutions to past situations, with inline ratings by operators to help flag useful suggestions.

Provides IT and network operations teams with proven remediation actions from past situations, without requiring manual documentation to be compiled or consulted.


Figure 7. Algorithmic knowledge capture identifies similar situations from the past

 

Groundbreaking improvement in operational efficiency

Deploying Cisco NSO in conjunction with Cisco Crosswork Situation Manager allows you to implement a service-centric approach to AIOps and realize groundbreaking improvement in operational efficiency. It is alarming to know that Mean Time To Identify (MTTI) and MTTK make up as much as 80 percent of the total MTTR, as found by Gartner. Applying AI and machine learning techniques, as discussed earlier, can substantially reduce MTTI and MTTK, as shown in Figure 8, hence reducing the overall MTTR.

Figure 8. Reduction in MTTR

Deploying Cisco NSO in conjunction with Cisco Crosswork Situation Manager allows:

  • Multivendor, multilayer, and multidomain correlation and service impact detection and localization
  • Fully automated service onboarding and incident ticketing automation
  • Automated service deployment, verification, and assurance onboarding through YANG definition
  • Resource and service contextualization to expedite impact isolation
  • Assessment of impact of infrastructure vs. overlay service via machine learning in a multilayer, multivendor, and multidomain environment
  • Expedited troubleshooting and closed-loop remediation via NSO services exposure

In a customer deployment scenario, it was observed that it takes an average of 150 minutes to resolve a P1 issue (Figure 9). A substantial proportion of the time was spent in the initial identification and triaging effort. Due to the siloed operational model, it was cumbersome to isolate the issue. Lengthy audio bridges were set up, and significant time was taken to get the right stakeholders and subject matter experts on the bridge. Individuals not associated with the issue were also pulled in, and the entire process involved a lot of wasted effort.

Figure 9. Boost in operational efficiency (example scenario)

MTTR can be significantly reduced, and incident and problem management processes can be streamlined with Cisco Crosswork Situation Manager, combining the strengths of algorithms and human intellect. The benefits include:

  • Drastically reduced alert volumes by eliminating operational noise, deduplication of data, and clustering of related alerts
  • Proactive detection of service-impacting issues in a matter of seconds
  • Automated ticket creation and stakeholder notifications
  • Streamlined collaboration and an automated workflow across teams and tools using the situation room
  • Identification of a possible root cause using smart correlation
  • Capture of knowledge and automated sharing to make operators more knowledgeable over time

In this scenario, MTTR was reduced to less than 60 minutes, which was a substantial improvement.
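The size of that improvement can be worked out with a short Python calculation. The 150-minute and 60-minute figures come from the scenario above; treating 60 minutes as an upper bound gives a minimum reduction. Applying the "up to 80 percent" identification share quoted earlier to this particular scenario is an assumption made purely for illustration, as are the variable and function names.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) * 100.0 / before


baseline_mttr_min = 150.0  # average P1 resolution time in the scenario
improved_mttr_min = 60.0   # upper bound: the text says "less than 60 minutes"

min_reduction = percent_reduction(baseline_mttr_min, improved_mttr_min)

# Assumption: the "up to 80 percent" MTTI + MTTK share applies to this baseline.
identification_share_pct = 80.0
max_identification_min = baseline_mttr_min * identification_share_pct / 100.0

print(f"MTTR reduction: at least {min_reduction:.0f} percent")
print(
    f"Identification and triage: up to {max_identification_min:.0f} "
    f"of {baseline_mttr_min:.0f} minutes"
)
```

Under that assumption, most of the saved time would have to come from the identification and triage phase, which is where alert correlation and contextual enrichment apply.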

 

Conclusion

Cisco Crosswork Situation Manager is a next-generation approach to event management driven by real-time machine learning to detect anomalies across your production stack of applications, infrastructure, and monitoring tools. This gives development and operations teams a single pane of glass and unique operational insight, so they can detect issues in seconds, troubleshoot in minutes, and give business users a superior level of service.

Deploying Cisco NSO with Crosswork Situation Manager helps to enrich the alerts with service context so that the service impact can be accurately characterized. Moreover, the use of the YANG model decouples the definition of the services from the actual implementation. As a result, the solution can support heterogeneous, multivendor deployments and the coexistence of SDN/NFV and legacy technologies, and is future ready for any new definition of service.

The benefits of this service-centric approach to AIOps include:

  • Discovering and acting on service issues proactively before they affect end users
  • Empowering IT and network operations to embrace new agility-enhancing technologies such as cloud and SDN with greater confidence
  • Facilitating more effective collaboration among IT and network operators across technology domains, locations, and organizations
  • Enabling delivery of new services within existing staff constraints
  • Increasing customer satisfaction, IT credibility, and business performance

Take advantage of Cisco Crosswork Situation Manager to implement innovative artificial intelligence for IT and network operations, accelerating detection, diagnosis, triage, and remediation of incidents while smoothing collaboration and automating workflow across technological and organizational boundaries.

 

 

For more information, review the Crosswork Situation Manager data sheet. Below are additional resources for more details: