The Real Work of Automating NetOps White Paper

Updated: April 27, 2022



Depending on who you ask, it is either the best or the worst of times to be an information technology professional.

“Glass half empty” types may argue that IT teams are being pulled in too many directions at once. These teams are asked to address competing priorities within their organizations: security, network maintenance, user support, big data, application oversight, and management of the technology needs for hybrid workforces. Despite this expanding roster of duties, many IT groups are woefully understaffed. One study suggests that 87 percent of organizations are presently experiencing an IT talent shortage or expect to face one within the next few years.[1] That’s bad news for technology leaders who are already asking their people to do more with fewer resources.

Other IT pros, however, see a silver lining. While there is an acknowledged disconnect between expectations and the budget and staffing dedicated to IT’s mission, some IT leaders are seeking creative alternatives to combat labor shortages and unrealistic workloads. Many forward-thinking technology professionals are turning to automation, artificial intelligence, and machine learning for configuration management and network testing.

Well-maintained network operations can lead to virtually limitless cost savings and revenue generation opportunities across all business areas. When mean-time-to-resolution metrics look good, downtime diminishes, and this equates to real financial benefits for companies. In the largest enterprise organizations, IT downtime can cost up to $5,600 per minute (when accounting for IT hours plus lost employee productivity across thousands of devices).[2] In the race to digital transformation, automation can augment understaffed teams, minimize risk, and reduce human errors. Automation also permits organizations to focus on the most strategic technology initiatives, because it takes on the tasks that can be most tedious, time-consuming, and error-prone (such as testing, technical support, deployments and configurations, network exception checks, and security monitoring). This grants leadership the flexibility to reallocate IT resources to high-value efforts, where they’re needed most. Standardizing the IT and network operations experiences can sidestep a shrinking talent pool while still driving growth and innovation.

IT automation defined

Since the earliest remote equipment alarms for centralized fault management, automation has taken many forms. Throughout, the goal of deploying automation—to drive simplification and precision of network operations—has changed little. There are a variety of approaches and outcomes associated with automation. Generally, IT automation can be grouped into five categories:

1.     Task automation: Automating manual tasks that otherwise would be executed by people using command line interfaces is a core automation benefit. Task automation processes—employing standardized scripts and Robotic Process Automation (RPA) software—are relatively simple with few decision points and limited error condition handling.

2.     Device automation: Device automation runs a multi-step process for an entire device by consolidating sets of scripts. The automation software or RPA can run more sophisticated process flows with branches, loops, and error conditions.

3.     Domain automation: Groups of devices are treated as part of a larger whole, a domain, whose definition is set by the Communications Service Provider (CSP). The domain’s definition can relate to the technology (core routing, mobile core, or SD-WAN), to geography (region or work function), or to the manufacturer. The domains follow the boundaries of the CSP’s network organization structure. It is common to have domain controllers automate all the work within a domain.

4.     Cross-domain automation: Traditionally, work across multiple domains has been conducted manually through interrelated work orders tracked by project management software. New cross-domain automation techniques (sometimes called orchestration) are allowing cross-domain work to be standardized and automated.

5.     Process automation: With cross-domain automation established, overall process automation can be applied. This requires the configuration of the network and alignment across Operations Support Systems (OSS) and Business Support Systems (BSS).

The growing need for network automation

Today, network engineers and IT professionals face an expanding set of challenges. Services are growing increasingly complicated and varied, while expectations for security and reliability have ballooned. 5G technology will support a range of new services with low latency, high bandwidth, and a vast number of endpoints. Network operations will also become more complex. Thus, automation will be crucial to maintain network visibility and control.

NetOps teams that have not yet adopted automation report they struggle to keep pace with rapid release cycles, dedicating more and more time to manual, unpredictable deployments and the resulting effects. The typical network engineer spends more than 400 hours annually (or about 10 weeks of the calendar year) just dedicated to network troubleshooting.

The NetOps Time Sink

Every week, the typical network engineer spends more than 20 hours on troubleshooting alone.[3]


Network operations are never in a steady state for long. As business needs change, network configurations change too, from the simple addition of clients to the network to complicated changes in core network routers. With more security, cloud, and controller elements, it becomes harder to implement required changes without excessive operational rigor. Even with the best-staffed teams of technicians, the sheer volume of work will lead to human error in the world of “finger-defined networks”; it is simply unavoidable. Mistakes cascade down, affecting related systems and leading to costly firefighting and a continuous drain on the resources that must react to each new issue. Indeed, technology leaders are realizing that automation represents the only reasonable (and affordable) solution to manage complex network operations. Without it, adoption and testing can slow to a crawl as engineers must verify security configurations (often on a device-by-device basis), conduct pre-implementation validation, track down malfunctioning devices, install software updates, and more.

Perhaps the greatest stressor on NetOps is the need for visibility into network status in a manner that can be easily absorbed and acted upon. As networks grow more complex, the ability of any individual or team to collect incoming information, then analyze it for meaningful outcomes, wanes. In fact, 56 percent of NetOps professionals say they are held back by cumbersome information gathering and analysis.[4] This case of information overload can be overcome by machine learning systems that sift through the noise, efficiently highlighting relevant information and displaying it in a digestible format.

Transforming NetOps infrastructure

When the complexities of a system become a drag on operations, IT’s mission is in jeopardy. Adopting automation offsets the most common impediments to efficiency, delivering a variety of benefits:

NetOps at scale

Automation is a wise investment for the future. In the coming years, more business processes (both internal and external) will become dependent on network systems. As a result, NetOps will become more difficult to maintain. Organic growth will be impossible to manage without some degree of staff augmentation through artificial intelligence and machine learning. The thoughtful and planned deployment of automated systems will allow controlled scaling of network operations that will continue to meet the needs of the business.

Testing and validation

As networks have become more critical, testing has shifted from a “nice to have” to an accepted and valued part of the deployment process. Automated testing helps IT teams accelerate deployments and minimize the risks associated with making changes that can adversely impact the network. Every application relies on the stability of the underlying network, yet manual confirmation that deployments will not harm the wider system can be elusive. Each vendor focuses its testing on its own features and functionality; relying on such narrow validation in a complex multivendor environment can lead to less-than-optimal findings at the change window. A more in-depth understanding of network behavior, based on specific requirements, is critical. Companies that can access the dedicated software platforms, test cases, and guidance to verify network changes will prevent outages and boost their network productivity.

Manageable data output

The amount of raw data available from network operations is vast. Every network element can deliver information via various methods such as syslog and streaming telemetry. However, having so many disparate data pipelines can lead to information overload, making it difficult for administrators to separate valuable, actionable data from random noise. Integrated machine learning algorithms can be trained to automatically monitor and collect only the information most pertinent to NetOps.
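As a rough illustration of separating actionable messages from noise, the sketch below uses a simple severity threshold rather than machine learning. It assumes messages begin with the standard syslog `<PRI>` header, where PRI = facility × 8 + severity. The function names and sample messages are hypothetical and are not part of any Cisco product.

```python
import re

# Syslog messages start with "<PRI>", where PRI = facility * 8 + severity.
# Lower severity numbers are more urgent (0 = emergency, 7 = debug).
PRI_PATTERN = re.compile(r"^<(\d{1,3})>")


def severity_of(message):
    """Return the syslog severity (0-7), or None if no valid PRI header is present."""
    match = PRI_PATTERN.match(message)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:  # highest valid PRI: facility 23, severity 7
        return None
    return pri % 8


def actionable(messages, max_severity=3):
    """Keep only messages whose severity is max_severity or more urgent."""
    kept = []
    for message in messages:
        severity = severity_of(message)
        if severity is not None and severity <= max_severity:
            kept.append(message)
    return kept


# Hypothetical sample messages, for illustration only.
sample = [
    "<189>interface Gi0/1 changed state to up",  # severity 5 (notice)
    "<187>power supply 2 failure detected",      # severity 3 (error)
    "<190>configured from console by admin",     # severity 6 (informational)
]
print(actionable(sample))
```

With the default threshold, only the power supply message is kept. A trained model could replace the fixed threshold while keeping the same filter-then-forward shape.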

Syslog Patterns: The What and When Matter

Figure 1. In the event of a network issue, syslog patterns reveal information to restore services and limit future impacts.

Visualization and data use

With the ability to curate inbound data, it becomes imperative to process and display this information in a way that can be used to benefit the organization. Research shows that 95 percent of businesses need to manage unstructured data and 40 percent need to do so regularly.[5] Formatted presentation of key metrics, determined by the needs of the company, gives leaders visibility into the entire IT infrastructure, revealing blind spots and potential security vulnerabilities. With visualization in place, IT leaders are well equipped to rapidly identify and remediate issues that otherwise would be lurking in the background.

Securing the enterprise

With 61 percent of CIOs saying their IT spend is being driven by increased security concerns, frontline IT and SecOps professionals are under pressure to ensure these dollars aren’t going to waste.[6] Automation promotes easier management of network security. Test validation of new software releases and policy changes leads to increased network reliability. Policy algorithms and improved analysis can reveal security gaps that are leaving networks open to cyber intrusions. Predicting vulnerabilities and uncovering weak infrastructure help enable speedy remediations back to a secure state, thwarting hackers. In an era where data breaches are becoming all too frequent—with the average cost of a breach topping $3.83 million—added network insight minimizes risk and boosts confidence during technology transitions.[7]

Cisco Automated Fault Management

Network automation is a logical solution for riding the wave of digital disruption, giving businesses the tools to keep pace with a rapidly evolving technology landscape. Intelligent automation tools, such as Cisco® Automated Fault Management, can automatically analyze situations faster and more accurately than manual means.

Powered by an industry-leading database of millions of devices worldwide, Cisco collects vast amounts of data. Machine learning is applied to find patterns and trends that feed an ever-growing global database of signatures. In turn, these data are used to identify conditions indicating urgent network problems that require attention. The signatures are programmed as “triggers” within the Automated Fault Management service, so that when a trigger appears during routine daily reports, it automatically activates a sequence of predefined responses:

     Generating an email identifying the issue and directing it to the appropriate source

     Opening a case with the Cisco Technical Assistance Center (TAC)

     Performing system self-diagnosis (with no human interaction)

     Surfacing an automatic list of remediation steps to solve the problem
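The trigger-and-response pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Automated Fault Management implementation: the `Signature` class, the response functions, and the log lines are all hypothetical, and signatures are modeled as plain regular expressions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Signature:
    """A hypothetical signature: a name, a regex, and the responses it triggers."""
    name: str
    pattern: str
    responses: list = field(default_factory=list)


def notify_by_email(sig, line):
    # Stand-in for sending an email; returns a description of the action.
    return f"email: {sig.name}"


def open_support_case(sig, line):
    # Stand-in for opening a support case; returns a description of the action.
    return f"case opened: {sig.name}"


def run_triggers(lines, signatures):
    """Run every response of every signature whose pattern matches a log line."""
    actions = []
    for line in lines:
        for sig in signatures:
            if re.search(sig.pattern, line):
                actions.extend(respond(sig, line) for respond in sig.responses)
    return actions


# Hypothetical signature and log lines, for illustration only.
signatures = [
    Signature("fan-failure", r"fan tray \d+ failed",
              [notify_by_email, open_support_case]),
]
log_lines = ["interface Gi0/1 up", "fan tray 1 failed"]
print(run_triggers(log_lines, signatures))
```

Keeping responses as a list on each signature means a new response, such as a self-diagnosis step, can be added without changing the matching loop.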


Figure 2. Cisco Automated Fault Management collects vast amounts of data, then triggers a series of predefined responses to identify and mitigate issues.

Customized for your IT environment

Every network is different, with distinct priorities and vulnerabilities. Cisco Automated Fault Management is customized to align with your business requirements. You define the devices critical to your operations and determine whether you want TAC case notifications to remain internal. Cisco sets up the service to match your preferences.

To start, Cisco engineers collect up to a year’s worth of data from your network (literally billions of device syslog messages). We identify all faults and pinpoint the sequences that preceded them. We correlate that data with our intellectual capital to create rules (or signatures) specifically for your IT environment. Moving forward, when a pattern appears or an event occurs, the Automated Fault Management service recognizes it and responds.
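To make the idea of pinpointing the sequences that precede faults more concrete, here is a minimal sketch that uses plain frequency counting. It is not Cisco's method, which the text describes only at a high level. The event names, the five-minute look-back window, and the `precursor_counts` function are assumptions made for illustration.

```python
from collections import Counter


def precursor_counts(events, fault_type, window_seconds=300):
    """Count message types seen within window_seconds before each fault.

    events is a list of (timestamp_seconds, message_type) tuples sorted by time.
    """
    counts = Counter()
    for index, (fault_time, message_type) in enumerate(events):
        if message_type != fault_type:
            continue
        # Walk backwards from the fault until we leave the look-back window.
        seen = set()
        for earlier_time, earlier_type in reversed(events[:index]):
            if fault_time - earlier_time > window_seconds:
                break
            if earlier_type != fault_type:
                seen.add(earlier_type)
        counts.update(seen)  # count each type at most once per fault
    return counts


# Hypothetical event history, for illustration only.
events = [
    (0, "LINK_FLAP"),
    (100, "HIGH_TEMP"),
    (200, "DEVICE_RELOAD"),
    (5000, "LOGIN"),
    (9000, "HIGH_TEMP"),
    (9100, "DEVICE_RELOAD"),
]
counts = precursor_counts(events, "DEVICE_RELOAD")
print(counts.most_common(1))
```

In this toy history, `HIGH_TEMP` precedes both reloads, so it is the strongest candidate for a warning signature. A real system would also need to discount message types that are common whether or not a fault follows.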

Furthermore, most network signatures have a shelf life. When you upgrade network software, configurations, and hardware, the initial signatures may no longer be valid. Machine learning keeps discovering the patterns specific to your environment, so Automated Fault Management’s ability to identify warning patterns improves over time.

Manual vs. Automated Fault Management


Manual Fault Management

When a fault occurs, your IT team must follow a time-consuming process:

  Expend significant time and effort to manually pinpoint what occurred
  Open a ticket with Cisco TAC
  Wait for a response to requested information
  Locate information, input it, and reply to the TAC
  Wait for the TAC to identify appropriate steps
  Perform recommended remediation activities prescribed by the TAC

Automated Fault Management

Automated Fault Management runs behind the scenes in the client environment. The service leverages a library of global and customer signatures and performs real-time syslog monitoring. When an event is detected, the Automated Fault Management service:

  Connects to the device to collect event data
  Matches data against our global database
  Opens a Cisco TAC case via an API
  Sends a notification to the customer’s network operations center
  Surfaces a remediation plan for the TAC engineer



Data formatting and visualization in Automated Fault Management

As noted earlier, capturing the data flowing through your network is of little use unless it can be presented in a way that is meaningful to network engineers. The convenient Cisco Automated Fault Management dashboard displays the most relevant information needed to view and respond to network issues.

Users can drill into intuitive tabs to get deeper insights:

     Alarms Tab: If an event matches a signature on a device monitored by Automated Fault Management, an alarm is created on this tab.

     Signatures Tab: On this tab, events are created, modified, and mapped to a product series.

     Devices Tab: Here, you can access details on all devices monitored by Automated Fault Management.

Figure 3. The Automated Fault Management Dashboard

From this interface, you can generate custom reports and open service requests.

Figure 4. Tabs provide easy access to relevant NetOps information

Continuous Automation and Integration Testing

Network operators and architects will naturally ask how changes will affect the network. Even updates that are less critical can carry some risk. For instance, consider a change to all the trunk interfaces for closet switches. Relying solely on a vendor’s system functional testing can leave interactions between systems that are hard to predict. Even when two routers from different vendors speak the same protocol and are expected to work together properly, implementation variances can lead to catastrophic results. Testing automation can eliminate deployment and complex migration risks while minimizing the cost of new changes and capability deployments.

Integrating network DevOps models requires a combination of programming and network expertise—challenging skillsets for CIOs to obtain. Leveraging Cisco’s top experts in validation testing, Continuous Automation and Integration Testing (CAIT) allows a network replica (or the production network directly) to run through specific test requirements whenever needed.
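One common shape for an automated change test is a pre-change and post-change state comparison. The sketch below illustrates that general technique for the trunk-interface example. It is not a description of how CAIT works internally: the snapshot format, interface names, and `compare_snapshots` function are hypothetical.

```python
def compare_snapshots(before, after, invariant_keys=("status", "mode")):
    """Report attributes that were expected to stay the same but changed.

    Each snapshot maps an interface name to a dict of attributes. Only the
    attributes named in invariant_keys are checked, so intended changes
    (for example, a new allowed-VLAN list) do not raise false alarms.
    """
    problems = []
    for name, old in before.items():
        new = after.get(name)
        if new is None:
            problems.append(f"{name}: missing after change")
            continue
        for key in invariant_keys:
            if old.get(key) != new.get(key):
                problems.append(
                    f"{name}: {key} changed from {old.get(key)!r} to {new.get(key)!r}"
                )
    return problems


# Hypothetical snapshots taken before and after a trunk change.
before = {
    "Gi1/0/1": {"status": "up", "mode": "trunk", "allowed_vlans": "10,20"},
    "Gi1/0/2": {"status": "up", "mode": "trunk", "allowed_vlans": "10,20"},
}
after = {
    "Gi1/0/1": {"status": "up", "mode": "trunk", "allowed_vlans": "10,20,30"},
    "Gi1/0/2": {"status": "down", "mode": "trunk", "allowed_vlans": "10,20,30"},
}
for problem in compare_snapshots(before, after):
    print(problem)
```

Here the intended VLAN change passes silently, while the unexpected status change on the second interface is reported. Running the same comparison against a network replica first would surface the problem before the production change window.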

Benefits of Continuous Automation and Integration Testing


Leveraging Cisco tooling and libraries, our engineers can build automated test cases that are easy to understand and execute.

Cisco automation delivers results

Cisco customers and the industry at large have realized automation’s beneficial outcomes:

Banking on automation

A major bank with a large network was spending too much time and effort resolving network outages and chasing down potential problems. The strain on network engineering and the productivity loss for bank employees were beginning to impact the bank’s bottom line. Every day, engineers were having to identify failures from hundreds of syslogs, open TAC cases, collect data, manage the TAC cases, analyze the problems, and determine remediation steps, all of which are manual, time-consuming processes. The bank implemented Cisco Automated Fault Management, enrolling 10,000 devices. The new installation automated this NetOps process through remediation (and those resolution steps were surfaced by the system’s machine learning), reducing mean time to resolution (MTTR) from hours to minutes. Conservatively, the bank estimates the resulting savings in labor and productivity at $5 million annually.

Service Provider improves MTTR

Another company, a telecom service provider, was struggling with deep work queues among its network engineers. Because the team was unable to respond to outages in a timely manner, customer satisfaction scores were plummeting. After the provider applied Cisco Automated Fault Management, the automatic collection of failure data and case creation removed the need for engineers to filter through the extensive syslog noise. As a result, outage detection dropped from over an hour to less than 10 minutes, and the service provider estimated that an average of 40 to 100 minutes was saved per remediation incident. Over a 12-month period, the organization estimated an $8 million cost savings and a historic 50 percent drop in MTTR.

Technology & Services Industry Association 2021 STAR Award

The global IT community acknowledges the benefits that Automated Fault Management brings to NetOps. The application won the Technology & Services Industry Association’s (TSIA) 2021 STAR award for featured applications. TSIA is the world’s leading research organization dedicated to helping technology companies achieve profitable growth and solve their top business challenges. Services, sales, product, and channel organizations at technology companies large and small look to TSIA for world-class business frameworks, best practices based on real-world results, detailed performance benchmarking, and exceptional peer networking opportunities.

Realizing similar results in your own organization may seem complex, requiring deep changes to many network management processes, including code development, testing, quality assurance, provisioning, upgrades, patching, and more. Cisco can streamline the process, delivering end-to-end lifecycle services and consultative guidance. Our engineers work closely with yours to ensure your technical vision aligns to business priorities. Cisco has the technical depth and breadth to successfully implement your automation strategy. We’ll help you transition your NetOps orchestration while training your engineers, architects, and software developers to get the most from your automation solutions.

Cisco CX Business Critical Services: Resilient, adaptive, transformative IT

Optimize performance and de-risk IT with guidance throughout the technology lifecycle to get the most from your investments. Leveraging Cisco best practices at every step—from evaluation through transformation—Cisco Business Critical Services serve as your strategic advisor. With experience spanning more than 35 years, Cisco maintains a network of 1.7 million certified professionals and 62,000 trusted partner ecosystems. Our expertise—combined with automation, innovation, and insight from horizontal and vertical industry professionals—is a core component in Cisco’s drive to help organizations improve performance and accelerate their transformation into secure digital enterprises.

Resilient IT

     Predict and resolve incidents more quickly by combining human intelligence and expertise with artificial intelligence, machine learning, telemetry, automation, and insights.

     Identify availability, capacity, and operational process improvements and optimize collaboration through design, configuration, and monitoring workshops.

     Strengthen your security posture with a mix of proactive and reactive services to identify risks and problems, and proactively protect and defend against attacks.

Adaptive IT

     Identify capabilities and deliverables to address changing priorities, speed adoption, support cross-architecture projects, and accelerate transformation.

     Augment your workforce with industry-leading expertise in strategy, design, implementation, operations, and optimization.

     Accelerate complex problem resolution through 1:1 expert coaching based on specific use cases and insight reviews.

Transformative IT

     Architect the right strategy, roadmap, and vision with high-touch consultative guidance and an adaptive workforce to speed transformation.

     Create adaptive and innovative designs for improved performance by applying AI and data-driven insights.

     Deploy new services and applications faster to revolutionize customer experiences and fortify security in real time.

Your strategic advisor: Cisco Scrum Services

Automation is only one focus area within Cisco’s range of Scrum Services, through which you gain access to a top talent pool and cutting-edge intellectual capital, tools, and best practices. Our expertise—powered by Cisco analytics, insights, and automation—helps you address your top-priority projects throughout the technology lifecycle. Scrum Services can be sized up-front based on your primary and secondary architecture focus areas. Within these parameters, consulting capabilities can be prioritized based on your business needs, making it easier to adjust to your most strategic or urgent initiatives. Consulting capabilities include (but are not limited to):

     Planning and Architecture

     Design Engineering

     Implementation Planning

     Assessments and Analysis


     Security and Threat Management

     Operations and Enablement

     Automation and DevOps

     Cloud Transformation

     Matrix Analytics

     Automated Fault Management (AFM)

     Compliance and Remediation

     Continuous Automation and Integration Testing (CAIT)

     Intelligent Insights Service

Cisco Business Critical Services offer Automated Fault Management, Continuous Automation and Integration Testing, and other foundational automations, including:

     Data collectors to automate the login, collection, and storage of data from network elements.

     Automated analysis applied to data to uncover possible areas of concern:

    Security Advisory Analysis (PSIRTs).

    Field Notices (FN).

    Devices reaching end of support/useful life (EOX).

    Identification of concerns in digital exhaust, such as the log messages mentioned previously (Log Analysis).

    Configuration Best Practice exceptions (CBP).

    ML-backed Device criticality and importance based on Place In Network (PIN).

    ML-backed Policy Variation Analysis (PVA).

    ML-backed Device Risk of Crash probability (Fingerprinting).

    ML-backed Outlier Analysis for configurations and compliance (Fingerprinting).

     Automated discovery of additional best practices and outliers based on deployment patterns.

     Automated recommendations ranked to identify the most impactful places to spend your time.

These Business Critical Services foundational automations and outputs can be compared to the indicator lights on your car’s dashboard. The tools use automation and advanced machine learning to continuously check your network, only raising areas of concern when they are important and need attention. To take the analogy a step further, as your mechanic does with your automobile, you can use the data for deeper and more advanced analysis to drive NetOps success. In your organization, this can mean more uptime, reduced costs, and strengthened security. With our Scrum Services, Cisco experts provide support for automated fault learning and recognition, troubleshooting acceleration and automation, trouble ticket integration with TAC, and custom workflow automation in incident handling.

Your Cisco CX Business Critical Services team provides guidance at every step of your journey as you embrace automation and re-imagine IT for agility, growth, and innovation.

Contact your local Cisco account representative or authorized Cisco partner today to start the conversation.





[1] McKinsey, Beyond Hiring: How Companies are Reskilling to Address Talent Gaps, Feb 2020
[2] Gartner, How Much Does an Hour of Downtime Cost?, Jul 2021
[3] NetOptics, How Much Time Do Network Engineers Spend Troubleshooting the Network?, Nov 2020
[4] Gartner, How Much Does an Hour of Downtime Cost?, Jul 2021
[5] Forbes, Big Data Goes Big, Feb 2019
[6] Gartner, CIO Survey, May 2021
[7] Ponemon Institute, Cost of a Data Breach Report, Jul 2021
