Programmable Closed-Loop Remediation


Overview

Objective

Detect anomalies and generate alerts that can be used to notify an operator or trigger automation workflows.

Challenge

Discovering and repairing problems in the network usually involves manual network operator intervention and is time-consuming and error-prone.

Solution

Incorporating Change Automation and Health Insights into Crosswork Network Controller allows service providers to automate the discovery and remediation of network problems by enabling an operator to match an alarm to predefined remediation tasks. These tasks are performed after a defined Key Performance Indicator (KPI) threshold has been breached. Remediation can be implemented with or without the network operator's approval, depending on the operator's settings and preferences.

Using such closed-loop remediation reduces the time to discover and repair a problem while minimizing the risk of making a mistake and creating an additional error through high-stakes manual network operator intervention.

How does it work?

Smart monitoring

  • The Smart Monitoring feature helps operators collect, filter, and present the data in usable formats, such as graphs and tables. Operators can concentrate on their business goals while Crosswork Network Controller, Change Automation, and Health Insights handle the configuration required for the data collection using the zero-touch telemetry feature.

  • A common collector gathers network device data over SNMP, CLI, and model-driven telemetry and makes it available as modeled data described in YANG. This prevents duplicate data collection and optimizes the load on both the devices and the network.

  • The Recommendation Engine analyzes network device hardware and software configuration and employs a pre-trained model built from data mining. It produces KPI-relevant recommendations, facilitating per-use-case monitoring.

  • KPIs cover a wide range of statistics, from CPU, memory, disk, and layer 1/2/3 network counters to per protocol, LPTS, and ASIC statistics.

Smart filtering

  • Health Insights builds dynamic detection and analytics modules that allow operators to monitor and see alerts on network events based on user-defined logic (KPI).

  • KPI alerting logic can be:

    • Simple static thresholds (TCA). For example, CPU load above 90 percent.

    • Moving average, standard deviation, percentile-based, and so on. For example, CPU load rising above the mean and staying there for five minutes.

    • Streaming jobs that provide real-time alerts or batch jobs that run periodically.

    • Customized for threshold values and visualization dashboards.

    • Customized operator-created KPIs based on business logic.

    • TCAs that can be exported or integrated with other systems via HTTP, Slack, and socket interfaces.

  • KPIs are associated with dashboards that provide real-time and historical views of the data and corresponding TCAs.

  • KPIs also provide purpose-built dashboards that go beyond raw data and provide valuable information in various infographic style charts and graphs useful for triaging and root-causing complex issues.
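The two alerting examples above (a static threshold and a load that stays above the mean) can be sketched in a few lines of Python. This is only an illustration of the logic, not how Health Insights implements its KPIs. The function and class names, the window size, and the assumption of one sample per minute (so that five consecutive samples approximate five minutes) are all hypothetical.

```python
from collections import deque
from statistics import mean


def static_threshold_alert(cpu_load: float, threshold: float = 90.0) -> bool:
    """Static threshold (TCA): alert when a sample exceeds a fixed value."""
    return cpu_load > threshold


class SustainedAboveMeanAlert:
    """Alert when samples stay above the moving average for `hold` samples in a row."""

    def __init__(self, window: int, hold: int) -> None:
        self.history = deque(maxlen=window)  # recent samples for the moving average
        self.hold = hold                     # consecutive samples required to alert
        self.streak = 0                      # current run of above-average samples

    def update(self, sample: float) -> bool:
        # Compare against the mean of earlier samples only, so the new
        # sample does not raise its own baseline before the comparison.
        baseline = mean(self.history) if self.history else sample
        self.streak = self.streak + 1 if sample > baseline else 0
        self.history.append(sample)
        return self.streak >= self.hold


# Assumed sampling: one CPU reading per minute, so hold=5 is about five minutes.
detector = SustainedAboveMeanAlert(window=10, hold=5)
alerts = [detector.update(load) for load in [40.0] * 5 + [80.0] * 5]
```

In this run the detector stays quiet during the steady 40 percent period and raises an alert only on the fifth consecutive reading above the moving average.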

Smart remediation

  • Health Insights KPIs can be associated with Change Automation playbooks, which can be executed manually or via auto-remediation. A remediation workflow can be used to fix the issue or to collect more data from the network devices. Operators can save time and money by proactively remediating the situation instead of resorting to ad hoc debugging and unscheduled downtime, which provides better quality of experience (QoE) to their customers.

  • Health Insights correlates alerts or anomalies on the network's topology, allowing easy visualization of the impact of events.

Scenario: Achieve predictive traffic load balancing using segment routing affinity

Scenario context

To maintain smooth and optimal traffic flow, operators need to be able to monitor traffic on the interfaces, identify errors such as CRC, watchdog, and overrun, and then reroute the traffic so that the SLA is maintained. This process can be automated using the Crosswork Network Controller with the installed Health Insights and Change Automation applications.

Assumptions and prerequisites

Health Insights and Change Automation must be installed and running.

Workflow

The following is a high-level workflow for executing this scenario:

Procedure


Step 1

Deploy Day0 ODN templates on the edge nodes. Delegate dynamic path calculation to SR-PCE, and configure the ODN template to exclude links that are tagged with a specific affinity, such as RED affinity. ODN allows a service head-end router to automatically instantiate an SR-TE policy to a BGP next-hop when required (on-demand). The ODN template defines the required SLA using a specific color.

For information on creating an ODN template, refer to Step 1 in Scenario: Implement and maintain SLA for an L3VPN service for SR-MPLS (using ODN).
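To make the effect of the affinity exclusion concrete, the following is a minimal sketch of a shortest-path computation that skips any link tagged with an excluded affinity. It is not SR-PCE's algorithm or API. The topology, node names, metrics, and the `shortest_path_excluding` function are hypothetical, and links are modeled as one-directional for brevity.

```python
import heapq
from typing import Dict, FrozenSet, List, Optional, Tuple

# Each link is (neighbor, IGP metric, affinities tagged on the link).
Link = Tuple[str, int, FrozenSet[str]]
Topology = Dict[str, List[Link]]


def shortest_path_excluding(
    topology: Topology, src: str, dst: str, exclude_any: FrozenSet[str]
) -> Optional[List[str]]:
    """Dijkstra over the topology, ignoring links that carry an excluded affinity."""
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, metric, affinities in topology.get(node, []):
            if affinities & exclude_any:
                continue  # the link is tagged with an excluded affinity
            if neighbor not in visited:
                heapq.heappush(queue, (cost + metric, neighbor, path + [neighbor]))
    return None  # no path satisfies the constraint


NONE: FrozenSet[str] = frozenset()
RED = frozenset({"RED"})

# Hypothetical topology: the path through P1 is cheaper than the path through P2.
healthy = {
    "PE1": [("P1", 10, NONE), ("P2", 20, NONE)],
    "P1": [("PE2", 10, NONE)],
    "P2": [("PE2", 20, NONE)],
}
# The same topology after the PE1-to-P1 link has been tagged RED.
degraded = dict(healthy, PE1=[("P1", 10, RED), ("P2", 20, NONE)])

path_before = shortest_path_excluding(healthy, "PE1", "PE2", RED)
path_after = shortest_path_excluding(degraded, "PE1", "PE2", RED)
```

Before the tag is applied, the computed path runs through P1. Once the PE1-to-P1 link carries RED affinity, the same computation returns the path through P2, which is the rerouting behavior this scenario relies on.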

Step 2

Create an L3VPN route policy to specify the prefixes to which the SLA applies and mark them with the same color used in the ODN template. When traffic from the specified network with a matching color is received, paths are computed based on the SLA defined in the ODN template.

For information on creating a route policy, refer to Step 1 in Scenario: Implement and maintain SLA for an L3VPN service for SR-MPLS (using ODN).
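The matching step of such a route policy can be pictured as a lookup from prefix to color. The sketch below uses Python's standard `ipaddress` module and is only a conceptual model, not the route policy language used on the routers. The prefixes (taken from documentation address ranges), the color value 72, and the `color_for_prefix` function are assumptions made for illustration.

```python
import ipaddress
from typing import List, Optional, Tuple, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Hypothetical policy: prefixes covered by the SLA and the color they are marked with.
# The color must match the color used in the ODN template.
POLICY: List[Tuple[Network, int]] = [
    (ipaddress.ip_network("198.51.100.0/24"), 72),
]


def color_for_prefix(prefix: str, policy: List[Tuple[Network, int]]) -> Optional[int]:
    """Return the color for a prefix covered by the policy, or None if it is not covered."""
    candidate = ipaddress.ip_network(prefix)
    for network, color in policy:
        # subnet_of() raises TypeError across IP versions, so check the version first.
        if candidate.version == network.version and candidate.subnet_of(network):
            return color
    return None


covered = color_for_prefix("198.51.100.128/25", POLICY)
not_covered = color_for_prefix("203.0.113.0/24", POLICY)
```

A prefix that falls inside a policy entry receives the SLA color, while any other prefix is left uncolored and follows default forwarding.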

Step 3

Provision an L3VPN across the required endpoints and create an association between the VPN and the route policy. This will connect the VPN to the ODN template that defines the SLA.

For information on provisioning an L3VPN, refer to Step 3 Create and provision the L3VPN service.

Step 4

Define and enable the KPIs on the devices. This will continuously monitor the uplink interfaces on the L3VPN endpoints.

For information on defining KPIs, see Crosswork Network Controller 7.2 Closed-Loop Network Automation.

Step 5

When an error occurs on a monitored interface, mark the affected link with RED affinity so that it is excluded from path computation, as specified in the ODN template. This is achieved with a custom playbook. Crosswork Network Controller identifies the interface where the error occurred and generates an alert. The custom playbook then uses this information to push the affinity configuration to the relevant router, creating a closed-loop automation process. This helps ensure that the customer does not experience any outages.

For information on defining playbooks, see Crosswork Network Controller 7.2 Closed-Loop Network Automation.

Step 6

Crosswork Network Controller will continue monitoring the link, and once there are no longer any alerts, the RED affinity tag can be removed. Another playbook should be defined for this purpose.
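The interplay of Steps 5 and 6 can be summarized as a small control loop: tag a link when its interface alerts, and remove the tag once the alerts have stopped. The sketch below is a conceptual model only and does not use any Crosswork API. The `AffinityRemediator` class, its `tag_link` and `untag_link` callbacks (standing in for the two playbooks), and the `clear_after` hold-down count are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class AffinityRemediator:
    """Tags alerting links with RED affinity and clears the tag after a quiet period."""

    tag_link: Callable[[str], None]    # hypothetical hook for the "add RED affinity" playbook
    untag_link: Callable[[str], None]  # hypothetical hook for the "remove RED affinity" playbook
    clear_after: int = 3               # assumed: consecutive clean polls before untagging
    tagged: Dict[str, int] = field(default_factory=dict)  # interface -> clean polls so far

    def poll(self, alerting: Set[str]) -> None:
        """Process one monitoring cycle, given the interfaces that currently alert."""
        for interface in alerting:
            if interface not in self.tagged:
                self.tag_link(interface)   # Step 5: exclude the link from path computation
            self.tagged[interface] = 0     # an active alert resets the quiet counter
        for interface in list(self.tagged):
            if interface in alerting:
                continue
            self.tagged[interface] += 1
            if self.tagged[interface] >= self.clear_after:
                self.untag_link(interface)  # Step 6: return the link to service
                del self.tagged[interface]


# Usage: record the playbook invocations instead of touching a real device.
actions: List[Tuple[str, str]] = []
remediator = AffinityRemediator(
    tag_link=lambda i: actions.append(("tag", i)),
    untag_link=lambda i: actions.append(("untag", i)),
    clear_after=2,
)
for cycle in [{"Gi0/0/0/1"}, {"Gi0/0/0/1"}, set(), set()]:
    remediator.poll(cycle)
```

Waiting for several clean polls before removing the tag is a design choice in this sketch, intended to avoid repeatedly tagging and untagging a link whose errors come and go.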