Networking has been rapidly evolving over the past two decades. This evolution has led to a broad range of available networking technologies (numerous PHY/MACs, routing protocols, QoS, high availability, security, etc.) in order to serve a plethora of applications with highly differentiated requirements (from best effort to deterministic) and at unprecedented scales.
Networks today are the lifeblood of many organizations worldwide. Whether in healthcare, manufacturing, education, retail, or other industries, the proper, ongoing operation of the network is vital to the operation of today’s organizations. As such, analytics for the network has become of the utmost importance, as this allows an organization to gauge how well the network is working to support the goals of its business and to constantly adapt to new requirements serving a broad range of IT and IoT applications.
Are applications being accessed efficiently by users?
Are devices able to communicate with their servers and services seamlessly?
Are the users of the network getting the service levels they need to be productive?
The answers to these and many other questions allow the organization to determine how well the network is operating, to quickly and efficiently troubleshoot problems when they arise, and to keep adapting the network’s topology, configurations, and dimensioning.
Beyond this, there is the issue of scale. Networks today are so large, so widespread, and encompass so many various interacting technologies, that the sheer volume of data provided by the network quickly becomes overwhelming for the network operator. Moreover, how can the network move from a purely reactive stance to a more proactive one, spotting key trends and issues and making the necessary network changes before problems occur?
Let’s Take the Example of Troubleshooting
What network operators need is a platform capable of extracting astute and relevant insights. Insights into the services provided to their applications and end users. Insights into network operations via multiple correlated key performance indicators (KPIs). Insights into network trending which reveals key aspects of network and application operation. Insights that help to rapidly distill the many thousands of otherwise-disparate data points down to a few key conclusions that drive actionable outcomes for the business.
Predictive Analytics Is Yet Another Example
Can the network predict issues before they happen, or even trigger automatic remediation (for example, through a closed loop control system)?
This is where the power of machine learning (ML) comes into play. Cisco has built a breakthrough learning platform, making use of an extremely rich data platform used to train the most advanced machine learning technologies.
This white paper explores how such a platform is used in the context of Cisco’s Digital Network Architecture (Cisco DNA). ML capability is now being introduced with Cisco DNA Center, bringing powerful ML capabilities – and more importantly, the key business insights that ML provides – within reach of networks and organizations worldwide.
This introductory white paper provides an overview of such a learning platform in the context of Cisco AI Network Analytics, a solution which is initially focused on the complex technologies and challenges inherent to wireless networks. Future revisions of this document will discuss other areas, such as switching, SD-WAN, and cross-domain, to name a few, as Cisco’s machine learning technologies become applied to these areas.
As we begin to explore the uses and technologies provided by the Cisco AI Network Analytics solution, let’s first start with a few common wireless challenges that will be immediately familiar to any network operator, manager, or architect: wireless joining/roaming times and per-user wireless application throughput.
Joining/roaming refers to the set of complex tasks triggered when a wireless client attempts to join a wireless network or roams from one access point (AP) to another within the wireless system; without a doubt, the time for a client to successfully join a network, or to roam between APs, contributes significantly to the quality of experience (QoE) for the end user. Being able to monitor such complex, multidimensional KPIs so as to detect abnormal joining/roaming times, along with determining potential root causes should an issue be observed, is a fundamental task for IT teams.
A traditional approach in networking to tackle such issues has consisted of measuring the KPI of interest (for example, joining/roaming times, or a simple metric such as memory utilization, CPU utilization, or packet loss) and then setting a static threshold which, when crossed, triggers an alarm (for example, “If joining/roaming times > 2s then set an alarm”). Although such a simple approach applies nicely to several network metrics (such as the CPU on a router or switch), experience has shown that a metric such as join or roam times may differ greatly with a number of variables, such as the network characteristics (for example, signal strength, number of clients, type of clients, and the performance of network servers and services such as Dynamic Host Configuration Protocol [DHCP] and/or Authentication, Authorization and Accounting [AAA]), the network topology involved, and many other environmental factors. This makes a single static threshold that would apply across a range of network deployments impractical.
As an example, Figure 1 below shows how the joining/roaming performance varies over time for several networks. On examining this data, one can easily see that finding a unique threshold for all networks, and even all locations in a given network, is simply not possible, due to the wide variances of what is “normal” observed across multiple varying network deployments.
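To make the limitation of a static threshold concrete, here is a minimal Python sketch of the “> 2s” rule quoted earlier, applied to two sites. The site names and all sample values are invented for illustration and are not taken from Figure 1.

```python
STATIC_THRESHOLD_S = 2.0  # the "> 2s" rule quoted in the text


def static_alarms(join_times_s, threshold_s=STATIC_THRESHOLD_S):
    """Return the samples that a fixed threshold would flag."""
    return [t for t in join_times_s if t > threshold_s]


# Hypothetical onboarding times in seconds for two very different sites.
office = [0.4, 0.5, 0.6, 0.5, 0.4, 1.9, 0.5]   # 1.9 s is unusual for this site
stadium = [2.4, 2.6, 2.5, 2.7, 2.3, 2.6, 2.5]  # slow but steady: normal here

office_alarms = static_alarms(office)
stadium_alarms = static_alarms(stadium)

# The office outlier stays below 2 s and is missed, while every ordinary
# stadium sample raises an alarm.
print("office alarms:", office_alarms)
print("stadium alarms:", len(stadium_alarms), "of", len(stadium))
```

The same rule is too lax for one site and too strict for the other, which is the situation the figure illustrates across real deployments.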
This highlights the need for Learning (see the next section for the definition of "Learning" as used here). Cisco AI Network Analytics is a learning platform capable of dynamically predicting what is a normal onboarding time for any given network, taking into account hundreds of dimensions, without any prior knowledge and without the use of static thresholds. This is possible thanks to the use of advanced machine learning algorithms.
Another interesting network metric that is highly relevant to the overall user quality of experience is the average throughput per user (across layers). As shown in Figure 2, such a metric once again tends to vary over time and as conditions change within the unique network involved (number of users, network characteristics, etc.), once again highlighting the need for learning. One can observe that the per-user throughput is always high for some APs and always low for other APs of the same network, whereas in other cases the per-user throughput varies over time.
In both of these cases – involving onboarding times and per-user wireless throughput – machine learning can help to cut through the thicket of dozens or hundreds of KPIs, and allow network operators, managers, and architects to rapidly identify and resolve issues – allowing them to be more responsive to the business and provide better services, more rapidly, to end users. Now, let’s explore the technology behind how this works.
Why Machine Learning (ML)?
What does "Learning" mean, in this context? "Learning" refers to the ability to train a machine learning (ML) model, that is, a mathematical model capable of understanding or modeling the behavior of a complex system or of a given KPI or variable.
“Why machine learning (ML)” is a fair question. ML algorithms are certainly very powerful, but they also have a reputation of being difficult to design, tune, and adapt to a variety of situations (such an ability is known as robustness) and sometimes have been known to produce results that may be difficult to interpret (a complex topic, and one that is outside the scope of this introductory white paper). ML algorithms outperform all existing (traditional) approaches in some areas (for example, image recognition), whereas, in other circumstances, the overall benefits may not be worth the extra cost and complexity.
Thanks to years of experience in building some of the world’s largest and most advanced networks, along with deep ML algorithm expertise developed over the past decade and focused on networking problems, Cisco is uniquely able to provide the Cisco AI Network Analytics solution, a solution that has been designed with the most advanced and highly developed ML technologies, for solving issues where ML will provide an indisputable benefit over existing technologies and approaches.
For example, the ML model(s) may be used to predict the lower and upper bounds for a given KPI (such as joining/roaming times) according to hundreds of variables (a task sometimes referred to as regression). KPIs falling outside the bounds predicted by the ML model would be considered “abnormal” for the unique network involved, and thus would be candidates for raising an alarm (that is, an alarm based on a learned bound, not on a static value).
For example, Figure 3 shows a predicted “band” (shown in green) of normal values for the percentage of failed onboarding sessions. As we can see, at some point the percentage of failed onboarding sessions (blue line) became abnormal (falling outside the green band), considering a number of network variables involved, as analyzed by the ML algorithm in use. This departure from normal to abnormal behavior for this network is denoted by the red section of the timeline in the diagram shown.
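The idea of a learned band can be sketched in a few lines of Python. This is only an illustration and not the algorithm used by Cisco AI Network Analytics: it stands in for the regression models described above with a much simpler robust statistic (median plus or minus k times the median absolute deviation), and it conditions on a single hypothetical context label instead of hundreds of variables. The function names, context labels, and sample values are all assumptions.

```python
from collections import defaultdict
from statistics import median


def fit_bands(history, k=3.0):
    """Learn a (lower, upper) band per context from (context, value) pairs.

    The band is median +/- k * MAD (median absolute deviation), a simple
    robust stand-in for a learned regression band.
    """
    by_context = defaultdict(list)
    for context, value in history:
        by_context[context].append(value)

    bands = {}
    for context, values in by_context.items():
        centre = median(values)
        mad = median(abs(v - centre) for v in values)
        bands[context] = (centre - k * mad, centre + k * mad)
    return bands


def is_abnormal(bands, context, value):
    """True when the value falls outside the band learned for its context."""
    lower, upper = bands[context]
    return value < lower or value > upper


# Hypothetical history: percentage of failed onboarding sessions,
# labelled by how busy the site was when the sample was taken.
history = [
    ("quiet", 1.0), ("quiet", 1.2), ("quiet", 0.9), ("quiet", 1.1), ("quiet", 1.0),
    ("busy", 4.0), ("busy", 5.0), ("busy", 4.5), ("busy", 5.5), ("busy", 5.0),
]
bands = fit_bands(history)

# The same 4% failure rate is abnormal in one context and normal in the other.
print(is_abnormal(bands, "quiet", 4.0))
print(is_abnormal(bands, "busy", 4.0))
```

The point of the sketch is that the alarm condition is derived from the data of the network itself, so no operator has to pick a number that fits every deployment.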
Note that ML models may also be used for other tasks, such as classification (a classification algorithm may, for example, try to recognize malware or other undesired network content or use). In yet another example, an ML model may be used to predict a specific event, such as the failure of an item of networking gear.
A second set of machine learning algorithms is then used to determine the root cause of issues that may be observed (note that when attempting root-cause determination, correlation is very often confused with causation; the two are different, but may appear to be the same until a deeper analysis of the data is made). In some cases, a single ML algorithm may be able to detect anomalies together with their root causes, whereas in other situations more than one ML algorithm may be used in conjunction with anomaly detection to determine the root cause.
Referring to the previous example, the root cause for the abnormal percentage of failed onboarding sessions was, in fact, a DHCP timeout issue, as shown in Figure 4. In this case, the root cause was automatically and rapidly determined by the Cisco AI Network Analytics platform – a conclusion that may have taken a network manager or operator many hours to come to, after manual correlation of all of the possible variables involved.
Last but not least, the Cisco AI Network Analytics platform is capable of relevance learning (RL). Thanks to simple user feedback on the relevancy of anomalies observed and reported to the network manager, the Cisco AI Network Analytics platform adjusts itself (“auto-adjusts”) to raise (that is, bring to the user’s attention) the anomalies that are of most interest to the user.
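As a rough illustration of feedback-driven relevance, the Python sketch below keeps a per-category score built from “relevant / not relevant” votes and suppresses categories the user keeps rejecting. The class name, the category labels, and the scoring rule (a Laplace-smoothed vote share) are assumptions made for this example; the paper does not describe how the platform implements relevance learning.

```python
class RelevanceFilter:
    """Track per-category relevance from simple user feedback."""

    def __init__(self, min_score=0.5):
        self.min_score = min_score
        self.votes = {}  # category -> [relevant_count, total_count]

    def feedback(self, category, relevant):
        """Record one user vote on an anomaly of the given category."""
        counts = self.votes.setdefault(category, [0, 0])
        if relevant:
            counts[0] += 1
        counts[1] += 1

    def score(self, category):
        """Laplace-smoothed share of 'relevant' votes (0.5 with no votes)."""
        relevant, total = self.votes.get(category, (0, 0))
        return (relevant + 1) / (total + 2)

    def should_raise(self, category):
        """Decide whether anomalies of this category are still shown."""
        return self.score(category) >= self.min_score


relevance = RelevanceFilter()
for _ in range(3):
    relevance.feedback("dhcp_latency", relevant=False)   # hypothetical category
for _ in range(2):
    relevance.feedback("onboarding_failures", relevant=True)

print(relevance.should_raise("dhcp_latency"))         # rejected three times
print(relevance.should_raise("onboarding_failures"))  # confirmed twice
print(relevance.should_raise("never_seen"))           # no feedback yet
```

Starting unseen categories at a neutral score means new kinds of anomalies are still shown until the user has expressed an opinion about them.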
Consider the use case where the system tries to predict a networking equipment failure: this is a situation where there is a ground truth (the equipment either did fail or did not). Thus, the precision or accuracy of the prediction is easy to measure. In contrast, when raising an anomaly in a wireless network, an anomaly may be considered relevant by a given user and irrelevant by another user (there is no ground truth), making the notion of accuracy, in this case, subjective.
It is worth discussing the critical topic of relevancy. Anomaly detection (AD) refers to the capacity of a system to detect anomalies, where an anomaly is an outlier. Said differently, an event may be (rightly, mathematically) considered a real anomaly even though the user may not find it relevant from a user or networking standpoint. Consequently, in contrast to problems with ground truth, the notion of the accuracy of the system makes little sense.
For example, in the case of cognitive analytics, determining that an AP is offering significantly less throughput for a given class of device at regular times of day in a complex network (without any explicit and manual configuration) is extremely difficult using classic approaches. Finding a pattern to predict the actual user experience of a video call, taking into account the nature of the application, video codec parameters, the states of the network (data rate, RF, etc.), the current observed load on the network, and the destination being reached (to mention a few parameters), is simply impossible using existing so-called rules-based systems. By way of contrast, an ML approach, leveraging intelligent algorithms and the power of big data, can succeed at such tasks.
Yet another example where ML provides high value relates to the detection of “subtle” changes over time, which, after enough time has passed, may become a real anomaly. In this case, algorithms are used to detect that an access point (AP) provides a throughput that slowly but surely degrades over time compared to other APs in the network, alerting the network manager so that corrective action may be taken – possibly before any end user actually notices a problem!
This is illustrated in Figure 5, where such a parameter is tracked over several weeks and months and a deviation from the “normal” operation of the AP is highlighted, both by itself and compared with other, peer APs. In this diagram, each bubble represents an AP, the relative size of the bubbles indicates the number of clients served by that AP, and the line running across the various weeks tracks the AP’s progress and deviation over time, allowing the network manager to easily spot the issue and determine both impact and severity at a single glance.
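One simple way to picture this kind of slow-drift detection is to track each AP's throughput relative to the median of its peers and fit a straight line through that difference. The Python sketch below does exactly that with an ordinary least-squares slope. It is an illustrative simplification under stated assumptions, not the method behind Figure 5: the AP names, the weekly values (nominally Mbps), and the `min_drop_per_week` parameter are all hypothetical.

```python
from statistics import median


def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def drifting_aps(weekly_throughput, min_drop_per_week):
    """Return {ap: slope} for APs that steadily fall behind their peers."""
    names = list(weekly_throughput)
    weeks = len(weekly_throughput[names[0]])
    # Median across all APs for each week: the "peer" reference line.
    peer_median = [
        median(weekly_throughput[ap][w] for ap in names) for w in range(weeks)
    ]
    flagged = {}
    for ap in names:
        relative = [weekly_throughput[ap][w] - peer_median[w] for w in range(weeks)]
        trend = slope(relative)
        if trend <= -min_drop_per_week:
            flagged[ap] = trend
    return flagged


# Hypothetical weekly per-user throughput for three APs.
weekly = {
    "ap1": [20, 20, 21, 20, 20, 21],
    "ap2": [19, 20, 20, 19, 20, 20],
    "ap3": [20, 19, 18, 17, 16, 15],  # loses about one unit per week
}
flagged = drifting_aps(weekly, min_drop_per_week=0.5)
print(flagged)
```

Comparing against peers rather than against a fixed value keeps the check meaningful when the whole network gets busier or quieter at the same time.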
This shows the power of ML in action: distilling a vast amount of data down to a clear set of insights from which actionable outcomes can be drawn, without dependence on static thresholds that can, and do, vary from one network deployment to another.
Which Machine Learning Algorithm Are You Using?
This is, without doubt, a very common question. Unfortunately, no single ML algorithm is capable of solving all (or even most) use cases (this fact is referred to as the “No Free Lunch” theorem in ML). Efficient ML approaches rely on a set of ML algorithms working together to achieve a desired outcome. Cisco AI Network Analytics is a learning platform that uses a collection of ML algorithms focused on providing key insights and outcomes not readily possible by other means.
Cisco AI Network Analytics focuses on use cases where an ML approach offers the only viable answer and where existing techniques fall short; specifically, highly dimensional problem spaces where patterns must be understood and learned over time using vast data sets in which ML has proven to be extremely powerful. For example, in the case of cognitive analytics, determining that an AP is offering significantly less throughput for a given class of device, considering a specific set of parameters, in a complex network (without any explicit and manual configuration) is extremely difficult with classic threshold-based approaches. By leveraging the power of machine learning, such insights and outcomes can rapidly be provided.
A Few Words About the Role of the Data Platform
In the world of ML, data is ultimately more important than the algorithm. Indeed, the volume of data used to feed the ML algorithm is a strong factor influencing the overall efficiency of the solution. However, sheer volume of data is not sufficient; diversity is also critical. Traffic and network characteristics vary drastically between networks, and sometimes even between areas of a given network. One cannot expect an ML model to predict well on data unlike anything it was trained on. Thus, it is imperative to build a data platform with high diversity.
Last, but not least, is the quality of the data. The paradigm under which models (such as a deep neural network) may be fed with “infinite” input variables, relying on some form of automatic selection and filtering, is far from being proven for most use cases. Feeding models with noise unavoidably leads to random output. The Cisco AI Network Analytics solution makes use of techniques to automatically ensure data quality.
Selecting a small number of relevant anomalies: Building an ML system capable of raising anomalies is not the toughest challenge. Many anomaly detection (AD) applications have been designed over the past two decades (statistical deviation, percentile regression, and auto-encoders, to name a few). Still, raising a large number of anomalies is likely to make the system unusable for the network operator. The number of anomalies raised should be limited, and those raised should be highly relevant. Cisco AI Network Analytics makes use of several advanced techniques to reach such an objective, combining an ML approach with Cisco’s 35 years of deep subject matter expertise in networking.
As shown in Figure 6, Cisco AI Network Analytics is a cloud-based ML/AI solution that, within Cisco DNA Center, empowers IT administrators to accurately and effectively improve performance and issue resolution within their networks.
Figure 6 shows how the data flows across the main components of the platform:
- Data collection
  - From various sources, data is collected and aggregated on-premises by the Network Data Platform (NDP) component within the Cisco DNA Center appliance.
- Data validation and anonymization
  - Handled by the local ML agent, operating on premises.
  - Sent to the cloud after being anonymized on premises.
- Data processing
  - Handled in the cloud, based on anonymized data.
  - Data is stored and fed into the machine learning models.
  - Insights generated in the cloud and sent back to the on-premises agent.
- Issues/insight visualization
  - Data received from the cloud is first de-anonymized.
  - Data is then displayed in the Cisco AI Network Analytics UI, which is fully and seamlessly integrated with Cisco DNA Center.
The Cisco AI Network Analytics platform is architected to feed its analytics with a wide variety of data sourced from across the network: not only the wireless components, but also routers, switches, management stations, application servers, RADIUS/DHCP servers, and more. During the initial phases, however, the focus of the data input into the Cisco AI Network Analytics solution will be data produced by Cisco Wireless LAN Controllers (WLCs).
The Cisco Wireless LAN Controller platform and version requirements are the same as for Cisco DNA Center:
- Software version: AireOS 126.96.36.199 (8.5MR4+ recommended)
- WLC Models 3504, 5520, 8510, 8540
- The Application Visibility and Control (AVC) feature must be enabled on the WLC/WLAN configuration to enable throughput-related use-cases.
- APs need to be assigned to a building floor in Cisco DNA Center to allow per-site data aggregation.
The security of customer data is of paramount importance. In order to ensure the quality of telemetry while at the same time guaranteeing privacy, personal and confidential data is anonymized, namely when it pertains to:
- End-user identity (user name, device MAC address, etc.)
- Device location (hostname, AP location string, etc.)
- Network addresses (IPv4 / IPv6), including routing table information
Other nonsensitive data - such as the relative location of clients and APs, mobility of clients between APs and controllers, and similar variables - need to be kept intact in order to feed meaningful data to the Cisco AI Network Analytics engines.
The anonymization scheme used is based on strong encryption. Every bit of sensitive data is run through an AES-based encryption function using a strong key that is generated, managed, and archived by the customer on-premises (it is kept in the secure storage of Cisco DNA Center). The original data is replaced with the encrypted version before being sent to the Cisco AI Network Analytics cloud.
Once the output data (results, alerts, visualization, etc.) comes back from the platform cloud to be displayed in the Cisco DNA Center UI in the user’s browser, it is run by a local UI proxy through a de-anonymization process that decrypts the anonymized values and restores the original data as needed. The user is then presented with relevant names, IDs, addresses, and other such information, even though the analytics were done using anonymized versions of those items.
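The round trip described above (replace sensitive fields on premises, analyze the masked data in the cloud, restore the originals for display) can be sketched as follows. Note the substitution: the scheme in this paper uses AES-based encryption, but Python's standard library has no AES, so this sketch swaps in keyed HMAC-SHA256 tokens plus a lookup table that stays on premises. The class, function, and field names are hypothetical, and the sample record is invented.

```python
import hashlib
import hmac
import secrets


class LocalAnonymizer:
    """Replace sensitive strings with keyed tokens; reverse them locally."""

    def __init__(self, key=None):
        self._key = key or secrets.token_bytes(32)  # key never leaves the site
        self._reverse = {}  # token -> original value, kept on premises only

    def anonymize(self, value):
        # Deterministic: the same input always maps to the same token, so
        # relationships (a client moving between APs) survive masking.
        # Truncated to 16 hex characters for readability in this sketch.
        digest = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256)
        token = digest.hexdigest()[:16]
        self._reverse[token] = value
        return token

    def deanonymize(self, token):
        # Unknown strings are returned unchanged.
        return self._reverse.get(token, token)


def anonymize_record(anon, record, sensitive_fields):
    """Mask only the listed fields; leave everything else intact."""
    return {
        k: anon.anonymize(v) if k in sensitive_fields else v
        for k, v in record.items()
    }


def deanonymize_record(anon, record):
    """Restore any string value that is a known token."""
    return {
        k: anon.deanonymize(v) if isinstance(v, str) else v
        for k, v in record.items()
    }


anon = LocalAnonymizer()
record = {"client_mac": "aa:bb:cc:dd:ee:ff", "ap_hostname": "ap-floor2-east", "rssi_dbm": -61}
masked = anonymize_record(anon, record, {"client_mac", "ap_hostname"})
restored = deanonymize_record(anon, masked)
print(masked)
print(restored == record)
```

Unlike the encryption-based scheme in the text, this variant needs the lookup table to reverse a token; the property the two share is that only the on-premises side can map masked values back to the originals.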
Cisco AI Network Analytics is part of the Cisco DNA Advantage software license for Cisco DNA Center. It is provided as an additional component that seamlessly blends in with the Cisco DNA Assurance user interface. The solution provides advanced ML-generated insights and issues, along with the visualization tools required for analyzing, troubleshooting, and reacting to the issues raised by the ML engines.
Deploying Cisco AI Network Analytics is very straightforward, thanks to Cisco DNA infrastructure, and simply requires a working instance of Cisco DNA Center (which runs in an appliance form factor) as well as HTTPS connectivity to the Cisco AI Network Analytics cloud. All data is mapped, processed, and anonymized before being sent to the cloud; results and insights are returned by the Cisco AI Network Analytics cloud services and are displayed, after de-anonymization, directly in the Cisco DNA Assurance UI on Cisco DNA Center.
Communication and Authentication to the Cisco AI Network Analytics Cloud
All communications to the Cisco AI Network Analytics cloud (hosted on AWS) are secured using Transport Layer Security (TLS) 1.2 with strong encryption. Mutual authentication between the Cisco AI Network Analytics agent on Cisco DNA Center and the associated cloud services is ensured through the use of certificates generated and managed by Cisco. The initial enrollment with the Cisco AI Network Analytics cloud certificate authority is authenticated by a customer ID and a secure onboarding key (acting as a one-time enrollment password).
All connections to the Cisco AI Network Analytics cloud are outbound on TCP port 443; no inbound connections are required (that is, the Cisco AI Network Analytics cloud will not be initiating TCP flows toward Cisco DNA Center). A list of cloud server fully qualified domain names (FQDNs) and IP addresses is provided in order to configure firewalls accordingly (please refer to the Installation Overview > Proxy Configuration section in the Cisco AI Network Analytics documentation for current information). Cisco DNA Center must also be able to perform DNS lookups for the cloud server addresses.
Connections to the Cisco AI Network Analytics cloud may also go through a proxy (explicit or transparent), if required. The proxy server setting, if any, is inherited from Cisco DNA Center.
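For readers who want to see what “outbound TLS 1.2 with mutual authentication” looks like from a client's point of view, the following Python sketch builds a TLS client context with the standard `ssl` module. It is a generic illustration, not the agent's actual implementation: the function name and the certificate and key paths are hypothetical, and setting TLS 1.2 as the minimum version (which also permits newer versions) is an assumption of this sketch.

```python
import ssl


def build_client_context(cert_file=None, key_file=None):
    """Create a TLS client context suitable for mutual authentication."""
    # The default context verifies the server certificate and its hostname.
    ctx = ssl.create_default_context()
    # Refuse anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Presenting a client certificate is what makes authentication mutual.
    # The paths are placeholders; they are only loaded when both are given.
    if cert_file and key_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


ctx = build_client_context()
print(ctx.minimum_version, ctx.verify_mode, ctx.check_hostname)
```

A context like this would be passed to an HTTPS client that opens the connection outbound on TCP port 443, so nothing has to listen for inbound connections.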
Scope of Data Collection
The following categories of information will be collected from the Cisco Wireless LAN Controllers involved:
- Client: MAC and IP address, VLAN, association / connection AP / state / time, CCX, device / OS type, RF rate / RSSI/SNR, and ACLs
- AP: MAC and IP addresses, WLC, state, SN / model/location, RF channel / stats / interference, neighbors, client count, and RF quality/power
- Application-level stats, such as those provided by Cisco AVC (Application Visibility and Control)
- System: CPU and memory usage, and client/AP count
- RF: interferer and rogue AP/RSSI/SNR / channels, and SSID
- DHCP: server IP address, counters, and statistics
- RADIUS authentication and authorization: server IP address, counters, and statistics
- Client events (state changes during association, roaming, etc.)
- AP/RRM events (Radio Resource Management events)
This paper provides a short overview of Cisco AI Network Analytics, a breakthrough cloud-based ML/AI platform for the network, with a specific focus on the initial components of the ML platform dedicated to use with the wireless network.
The most advanced ML/AI techniques and algorithms have been developed by Cisco for this solution to support cognitive and predictive analytics. The system is capable of learning ranges of normality for a number of variables, detecting relevant anomalies, and adjusting continuously to user feedback. Other functionalities of the solution allow for detecting long-term anomalies, such as subtle changes over a long period of time, and also make use of ML algorithms for comparing network element performance metrics, or even for comparing the organization’s network performance with that of comparable peers.
The ML models employed are trained with a vast amount of data, with high diversity and quality, a must-have for any ML-based platform, and the solution leverages strong data anonymization techniques with in-built security to ensure the appropriate level of handling for sensitive customer data.
In short, the Cisco AI Network Analytics solution gives network operators, managers, and architects key network insights that drive actionable outcomes, leveraging the power of machine learning to tackle problems that were not previously tractable. Note that the output of such a system may also be used by a closed-loop control system for automation.
Initially, the Cisco AI Network Analytics solution (an instantiation of a broader learning platform) is focused on wireless use cases, as these are some of the most difficult-to-troubleshoot and difficult-to-analyze areas of the network – and yet, at the same time, some of the most critical areas for modern organizations as they transition to a wireless-first future.
Over time, further revisions of this document will extend to other use cases, such as switching, SD-WAN and cross-domain, to name a few, as the solution continues to evolve and grow.