Table Of Contents
Cisco ANA Event Correlation and Suppression
Event Suppression
Root-Cause Correlation Process
Root-Cause Alarms
Correlation Flows
Correlation by Key
Correlation by Flow
DC Model Correlation Cache
Using Weights
Correlating TCA
Cisco ANA Event Correlation and Suppression
This chapter describes how Cisco ANA performs correlation logic decisions:
• Event Suppression: Describes enabling or disabling port-down, port-up, link-down, and link-up alarms on a selected port.
• Root-Cause Correlation Process: Describes the root-cause correlation concept.
• Root-Cause Alarms: Describes the root-cause alarm and weights concepts.
• Correlation Flows: Describes correlation by flow and correlation by key. In addition, it describes the DC model correlation cache.
Event Suppression
The user can enable or disable the port-down, port-up, link-down, and link-up alarms on a selected port. By default, alarms are enabled on all ports except xDSL ports. When alarms are disabled on a port, no alarms are generated for that port and none are displayed in the ticket pane. Using the Registry Editor advanced tool, it is possible to enable or disable service alarms on network entities other than ports, such as MPBGP (to enable or disable BGP neighbor down events) or an MPLS TE tunnel (for the TE-Tunnel down service alarm). It is also possible to enable or disable specific alarm types regardless of the network entity.
By default, port-down alarms are suppressed on xDSL ports. Cisco ANA supports selectively enabling the sending of port-down alarms on xDSL ports in either of two ways:
• Using the GUI: right-click the port in the inventory and select Enable Sending Alarms.
• Setting a flag in the registry under the OID of the port. Changes to the registry should only be carried out with the support of Cisco Professional Services.
Refer to the Cisco Active Network Abstraction NetworkVision User Guide for information about disabling or enabling a port alarm.
Events can also be filtered according to the type of DC they originate from. For example, all events that come from any ATM DC can be filtered by configuring the registry. The following alarm, listed under its DC type, is filtered by default:
• VRF: duplicate ip on vpn
Root-Cause Correlation Process
Root-cause correlation is implemented in several stages within the Cisco ANA VNEs. Initially, the system tries to find the root-cause alarm. When a VNE detects a fault and opens an alarm, it attempts to find another open alarm within the same device that qualifies as the root cause of the new alarm. For example, for a "link-down syslog" alarm, the VNE looks for a root-cause alarm such as "link down" within the device. When such a root cause is found and qualified, the correlation relationship is set in the alarm database. This process is correlation by key.
A more complex scenario is finding the root cause in a different device, which could be many network hops away. In the above example, the link-down alarm could cause multiple BGP Neighbor Down events throughout the network. In such cases, the BGP Neighbor Down is configured by default to actively go and search for a root cause in other VNEs, by initiating correlation by flow. In this example, the VNE that detected the BGP Neighbor Down uses the network topology model maintained in the Cisco ANA fabric to trace the path to its lost neighbor. During this trace it will encounter the faulty link, and qualify it as the BGP Neighbor Down root cause.
The following figure illustrates the local and active correlation processes.
Figure 3-1 Root-Cause Correlation Process
The correlation mechanisms are highly configurable (per alarm), as described in the following sections.
Root-Cause Alarms
Each potential root-cause alarm has a weight determined by its event customization. Refer to Chapter 6, "Event and Alarm Configuration Parameters" for additional information about setting the weights. For example, a link-down alarm is configured to allow other alarms to correlate to it. When a link-down event is recognized, other alarms that occur in the network may correlate to it, thereby identifying it as the cause of their occurrence. However, an event that is configured to be the cause of other alarms can itself correlate to another alarm. The topmost alarm in the correlation tree is the root cause of all the alarms in the tree.
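The idea that the topmost alarm in the correlation tree is the root cause can be illustrated with a minimal sketch. The `Alarm` class, the `caused_by` field, and the alarm names below are hypothetical and are not part of the Cisco ANA data model; the sketch only shows how following correlation links upward reaches the root cause.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alarm:
    """Hypothetical alarm record, not the Cisco ANA data model."""
    name: str
    caused_by: Optional["Alarm"] = None  # the alarm this one correlates to


def root_cause(alarm: Alarm) -> Alarm:
    """Follow caused_by links upward until the topmost alarm is reached."""
    seen = set()
    current = alarm
    while current.caused_by is not None:
        if id(current) in seen:  # guard against an accidental cycle
            raise ValueError("correlation cycle detected")
        seen.add(id(current))
        current = current.caused_by
    return current


# Illustrative three-level correlation tree (names are placeholders).
topmost = Alarm("topmost alarm")
intermediate = Alarm("intermediate alarm", caused_by=topmost)
leaf = Alarm("leaf alarm", caused_by=intermediate)
```

Calling `root_cause(leaf)` returns `topmost`, and calling it on `topmost` returns `topmost` itself, since an alarm with no parent is its own root cause in this sketch.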
Correlation Flows
The VNEs utilize their internal device component model (DCM) in order to perform the actual correlation. This action is considered to be a correlation flow. There are two basic correlation mechanisms used by the VNE:
1. Correlation by Key (correlation in the same VNE).
2. Correlation by Flow (correlation across VNEs or in the same VNE).
Each event can be configured to:
• Not correlate at all.
• Perform correlation by key.
• Perform correlation by flow.
For more information about these parameters, see Chapter 6, "Event and Alarm Configuration Parameters".
In addition, the DC model cache enables the system to issue correlation flows over a historical snapshot of the network as it existed before a failure occurred. For more information, see DC Model Correlation Cache.
Correlation by Key
When the root cause problem is at the box level, attempts to correlate to other events are restricted to the specific VNE. This means that the correlation flow does not cross the DCM models of more than one VNE. An example is a port-down syslog event correlating to a port-down event.
The link-down alarm is an exception to this behavior. Because a link entity connects two endpoints in the DCM model, it involves the DCMs of two different VNEs. On each VNE, however, events are correlated to that VNE's own copy of the link-down event.
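Correlation by key can be pictured as a lookup among the open alarms of a single VNE. The following is a minimal sketch under stated assumptions: the key shape `(entity, event type)`, the `VneAlarmIndex` class, and the port identifiers are all hypothetical, and the actual key definition used by Cisco ANA is not described in this chapter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple

Key = Tuple[str, str]  # (entity, event type); a hypothetical key shape


@dataclass
class OpenAlarm:
    event_type: str
    entity: str
    correlated: List[str] = field(default_factory=list)


class VneAlarmIndex:
    """Open alarms of one VNE, indexed by key (illustrative only)."""

    def __init__(self) -> None:
        self._open: Dict[Key, OpenAlarm] = {}

    def open_alarm(self, event_type: str, entity: str) -> OpenAlarm:
        alarm = OpenAlarm(event_type, entity)
        self._open[(entity, event_type)] = alarm
        return alarm

    def correlate_by_key(
        self, event_type: str, entity: str, root_types: Sequence[str]
    ) -> Optional[OpenAlarm]:
        """Look for an open alarm on the same entity that may be the cause."""
        for root_type in root_types:
            parent = self._open.get((entity, root_type))
            if parent is not None:
                parent.correlated.append(event_type)
                return parent
        return None  # no local root cause found


vne = VneAlarmIndex()
port_down = vne.open_alarm("port down", "port-1")
parent = vne.correlate_by_key("port-down syslog", "port-1", ["port down"])
orphan = vne.correlate_by_key("port-down syslog", "port-2", ["port down"])
```

Here the syslog event on `port-1` correlates to the open port-down alarm, while the same event on `port-2` finds no candidate because the lookup never leaves the VNE's own index.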
Correlation by Flow
Network problems and their effects are not always restricted to one network element, so an event may need to correlate to an alarm several hops away. To do this, the correlation mechanism within the VNE uses an active correlation flow that runs on the VNE's internal DCM model and tries to correlate to an alarm along a specified network path. This is similar to the Cisco ANA PathTracer operation, which traces a path on the DCM model from point A to point Z, except that the correlation flow tries to correlate to a root-cause alarm along the way rather than just tracing a path. This method is usually applicable to problems in the network layer and above (in the OSI network model) that might be caused by a problem upstream or downstream. An example is an OSPF Neighbor Down event caused by a link-down problem in an upstream router. Another important distinction between Cisco ANA PathTracer and the correlation flow is that the correlation flow may run on a historical snapshot of the network.
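The behavior described above, tracing a path and collecting alarms along the way, can be sketched as follows. This is a simplified illustration and not the Cisco ANA flow engine: the topology is a plain adjacency map standing in for a snapshot of the network, the path is found with a breadth-first search, and the router names and helper functions are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional

Topology = Dict[str, List[str]]          # node -> neighbors (snapshot)
LinkAlarms = Dict[FrozenSet[str], List[str]]  # link -> open alarm names


def trace_path(topology: Topology, source: str, destination: str) -> Optional[List[str]]:
    """Breadth-first search; returns the node list of a path, or None."""
    previous: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == destination:
            path: List[str] = []
            step: Optional[str] = node
            while step is not None:
                path.append(step)
                step = previous[step]
            return path[::-1]
        for neighbor in topology.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


def correlate_by_flow(
    topology: Topology, link_alarms: LinkAlarms, source: str, destination: str
) -> List[str]:
    """Collect open alarms found on the links of the traced path."""
    path = trace_path(topology, source, destination)
    if path is None:
        return []
    found: List[str] = []
    for a, b in zip(path, path[1:]):
        found.extend(link_alarms.get(frozenset((a, b)), []))
    return found


# Snapshot of a three-router chain taken before the failure (hypothetical).
snapshot = {"R1": ["R2"], "R2": ["R1", "R3"], "R3": ["R2"]}
alarms = {frozenset(("R2", "R3")): ["link down"]}
candidates = correlate_by_flow(snapshot, alarms, "R1", "R3")
```

In this example a neighbor-down event raised on `R1` for its neighbor `R3` walks the path `R1`, `R2`, `R3` and collects the link-down alarm on the `R2`-`R3` link as a candidate root cause. The search runs on the pre-failure snapshot because the failed link would no longer appear in the current topology.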
DC Model Correlation Cache
The DC model correlation cache stores DC data so that the VNE can represent the network as it was before an event occurred or during a specific time frame.
As part of correlation, a flow of packets travels across the virtual network from one VNE to a destination VNE while simulating the state of the virtual network at a past moment in time. The message processing mechanism forwards these packets from one DC to another according to the rules of the flow. When a property value of an active DC changes, or when a DC is removed, all DC properties that are marked as cache-based are stored in the DC model cache for a configurable period of time defined in the registry, and these property values can be restored.
To implement this, the VNE holds cache information for each flow related to a DC (for example, routing entries or bridge entries) and for forwarding tables. When a VNE needs to reflect its DC model as it was at some point in the past, it can do so based on the cached information it keeps. When cache management is enabled, the DC Property mechanism stores the related data of each property for a configurable period of time; the default is 10 minutes. The cache can be enabled or disabled in the registry and is enabled by default.
Cached data older than the period configured in the registry is cleaned up periodically so that the VNE cache holds only the latest valid information. The cleanup covers old property values as well as previously removed DCs, so removed DCs are kept in the cache only for the defined amount of time. The Cache Manager Component of the DA repeatedly sends itself a cleanup message (at an interval defined in the registry) to initiate a cleanup of the old property values and of all DCs that were removed earlier than the defined period. With the default setting, all DC properties with a timeout are therefore cleared automatically after 10 minutes.
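The store, restore, and cleanup cycle of the cache can be sketched with a small time-bounded history of property values. This is an illustration under assumptions, not the DC Property mechanism itself: the `PropertyCache` class and its method names are hypothetical, timestamps are plain seconds supplied by the caller in non-decreasing order, and removed DCs are not modeled. Only the 10-minute default retention is taken from the text above.

```python
from typing import Any, Dict, List, Tuple


class PropertyCache:
    """Keeps superseded property values for a limited time (illustrative)."""

    def __init__(self, retention_seconds: float = 600.0) -> None:
        self.retention = retention_seconds  # 10 minutes by default
        self._current: Dict[str, Any] = {}
        # name -> list of (time the value was replaced, old value)
        self._history: Dict[str, List[Tuple[float, Any]]] = {}

    def set(self, name: str, value: Any, now: float) -> None:
        """Record a new value; the replaced value goes into the cache."""
        if name in self._current and self._current[name] != value:
            self._history.setdefault(name, []).append((now, self._current[name]))
        self._current[name] = value

    def value_at(self, name: str, when: float) -> Any:
        """Return the value the property had at time `when`, if still cached."""
        for replaced_at, old_value in self._history.get(name, []):
            if replaced_at > when:
                return old_value
        return self._current.get(name)

    def cleanup(self, now: float) -> None:
        """Drop cached values that were replaced before the retention window."""
        cutoff = now - self.retention
        for name in list(self._history):
            kept = [(t, v) for t, v in self._history[name] if t >= cutoff]
            if kept:
                self._history[name] = kept
            else:
                del self._history[name]


cache = PropertyCache()
cache.set("oper_status", "up", now=0.0)
cache.set("oper_status", "down", now=100.0)
before_failure = cache.value_at("oper_status", when=50.0)
after_failure = cache.value_at("oper_status", when=150.0)
```

In this sketch a correlation flow that needs the state from before the failure reads `"up"` for time 50, while the live value is `"down"`. Once `cleanup` runs more than 600 seconds after the change, the old value is gone and the same query falls back to the current value.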
Using Weights
In cases where there are multiple potential root causes along the same service path, Cisco ANA enables the user to define a priority scheme (weight) which can determine the actual root cause.
The correlation system uses the following information to identify the root-cause alarm more precisely:
• weight >= 0: The correlation flow collects the alarm but does not stop.
The correlation mechanism will choose the alarm with the highest weight as the root cause for the alarm that triggered the correlation by flow.
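The selection rule can be sketched in a few lines. The `Candidate` type, the alarm names, and the weight values below are hypothetical. Two behaviors are assumptions that this chapter does not specify: alarms with a negative weight are treated as not collected, and ties are resolved in favor of the alarm collected first.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    """An alarm collected by a correlation flow (hypothetical type)."""
    name: str
    weight: int


def pick_root_cause(collected: List[Candidate]) -> Optional[Candidate]:
    """Return the collected alarm with the highest weight, if any."""
    # Assumption: only alarms with weight >= 0 take part in the selection.
    eligible = [c for c in collected if c.weight >= 0]
    if not eligible:
        return None
    # max() keeps the first candidate on ties (assumed tie-breaking).
    return max(eligible, key=lambda c: c.weight)


# Illustrative weights; real values come from the event configuration.
collected = [Candidate("port down", 500), Candidate("link down", 850)]
chosen = pick_root_cause(collected)
```

With these illustrative weights the link-down alarm is chosen as the root cause because 850 is the highest weight among the collected alarms.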
Correlating TCA
TCAs participate in the correlation mechanism, and can correlate or be correlated to other alarms.