Fault Management in Cisco ANA
A common problem in the management of large networks is that a single fault manifests itself as multiple alarms in the Network Management System (NMS). This makes manual analysis of faults costly and also diverts attention from major problems. This is a major motivation for applying analysis and correlation in an NMS.
Many types of problems threaten service delivery; for example, hardware failures, software failures, and so on. An effective root cause analysis technique must be capable of identifying all these problems automatically. This technique must work accurately for any environment and for any topology, including interrelated logical and physical topologies, with or without redundancy.
Cisco ANA analyzes and manages faults by implementing world-class event collection, identification, and correlation functionality. After identifying the event, Cisco ANA groups related events and uses the automatically discovered virtual network model to perform fault inspection and advanced correlation to determine the root cause of the fault and to create a ticket.
Cisco ANA 3.7 introduces a fault management architecture that:
•Ensures that events are never inadvertently dropped.
•Supports a greater volume of events.
•Improves fault management performance.
•Enables you to define and generate reports on demand in a variety of formats.
•Is predictable and consistent so that the same scenarios, one played after the other, produce the same results.
•Clearly defines terminology that is used correctly and implemented consistently.
The following topics describe the key concepts of Cisco ANA fault management:
•Correlation and Root Cause Analysis
•Event Flow Through Cisco ANA
•Northbound Integration Services
An event is a distinct incident that occurs at a specific point in time. Examples of events include:
•Port status change
•Connectivity loss (for example, BGP Neighbor Loss) between routing protocol processes on peer routers
•Device becoming unreachable by the management station
An event is a possible symptom of a fault, which is an error, failure, or exceptional condition in the network.
Events and Alarms
In Cisco ANA NetworkVision and Cisco ANA EventVision, an event is represented by a small icon in the form of a bell (see Figure 2-1).
Figure 2-1 Example Events
Events have an associated severity. Events with a severity of Critical (red), Major (orange), Minor (yellow), and Warning (sky blue) are said to be flagging events; events with a severity of Cleared (green) are called clearing events. Events that are informational in nature are marked in dark blue.
The lifecycle of a fault scenario is called an alarm. An alarm is characterized by a sequence of related events, such as port-down and port-up. In this guide, an alarm is denoted by an oval (see Figure 2-2).
Figure 2-2 Example of an Alarm
If event A is followed by event B as part of an event sequence, then event A is the predecessor of B and B is A's successor.
The last event in the sequence determines the severity and state of the alarm. An alarm that ends with an event that has a severity of cleared is called a cleared alarm.
Cisco ANA discovers events in the following ways:
•By receiving notification events and analyzing them.
•By automatically polling devices and discovering changes that cause service alarms.
Events can pertain to network faults or network elements:
•Incoming network event notifications—SNMP traps and syslog event messages sent asynchronously by network elements are captured by Cisco ANA and processed by the appropriate VNE. Cisco ANA supports SNMP v1, v2, and v3 traps.
•Generated event notifications—The VNE generates an event message when it detects a state change in the network element, typically after polling it via Telnet or SNMP. Some event notifications, such as the VPN Leak event, can also be generated by the gateway.
Upon receipt of an event notification or service alarm, Cisco ANA frequently expedites (immediately triggers) the polling of specific data from the device. Thus, the incoming notifications often yield additional generated event notifications or service alarms that are key to the subsequent correlation process.
The list of traps and syslog notifications that cause polling expedition is device specific, and varies from VNE to VNE.
In addition to network faults, the gateway and units can generate system events that indicate Cisco ANA internal faults, such as disk full or unit unreachable.
The event manager parses the event notification message to obtain more information about the event and creates an internal event object to represent the underlying event. During event identification, Cisco ANA determines the following information:
•Event Functionality Type—Trap event, syslog event, service alarm, and so on.
•Event Type—The event type is an identifier, describing the nature of the fault, such as Link Down.
•Event Subtype—The event subtype is a further clarification of the event type, such as Link Down Due to Admin Down.
•Event description strings—Includes the content of the notification message (for incoming event notifications) and a short description; this text is used to display the event to the operator.
•Event Severity—For more information, see Event Severity.
Based on the event type and event subtype, an event has numerous additional correlation and metadata attributes that determine how the event is processed by Cisco ANA. These attributes are discussed in Chapter 3, "Correlation Scenarios," and are documented in the Cisco Active Network Abstraction 3.7 Reference Guide.
Incoming Event Identification
Incoming event notifications (traps and syslogs) are identified by matching the event data to predefined patterns. A trap or syslog is considered supported by Cisco ANA if it has matching patterns and can be properly identified.
If the incoming event notification cannot be identified, Cisco ANA creates an event-object of the nonactionable type. Nonactionable events are not used in subsequent correlation activity. Instead, Cisco ANA preserves the trap OID and forwards it through the event notification service for both nonactionable and known event types.
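The identification step described above can be sketched as a pattern lookup with a nonactionable fallback. This is only an illustration: the pattern table, the `IdentifiedEvent` class, and the `identify` function are hypothetical names, and the mapping from the syslog text to an event type is an assumption rather than the actual Cisco ANA pattern set.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern table; the real patterns are predefined per VNE.
SYSLOG_PATTERNS = [
    (re.compile(r"%LINK-3-UPDOWN: Interface (\S+), changed state to down"), "Link Down"),
    (re.compile(r"%LINK-3-UPDOWN: Interface (\S+), changed state to up"), "Link Up"),
]


@dataclass
class IdentifiedEvent:
    event_type: str    # identifier describing the nature of the fault
    actionable: bool   # False for notifications that matched no pattern
    raw_message: str   # original notification text, kept for display


def identify(message: str) -> IdentifiedEvent:
    """Match a notification against the known patterns, in order."""
    for pattern, event_type in SYSLOG_PATTERNS:
        if pattern.search(message):
            return IdentifiedEvent(event_type, True, message)
    # No pattern matched: create a nonactionable event object, which is
    # excluded from subsequent correlation.
    return IdentifiedEvent("Nonactionable", False, message)
```

A notification that matches no pattern still yields an event object, so it can be forwarded even though it takes no part in correlation.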
The payload structure of a particular trap might differ between SNMP versions, and a trap might be supported by Cisco ANA only for a specific SNMP version.
In some cases, the identification is based on the internal values of the event. For example, the Line Down trap is mapped to event type and subtype as described in Table 2-1.
Table 2-1 Line Down Trap
Line Down trap
Line Down trap
Line Down trap
Line Up trap
Traps are mapped to event type and subtype regardless of the SNMP version in use.
The event type determines whether the event continues to be processed or is dropped. Events that are dropped after event identification are not stored in the fault database, and do not participate in correlation. Dropping events at this stage is important because it prevents Cisco ANA from being overwhelmed by large numbers of insignificant event notifications.
Each event has an assigned severity. Events broadly fall into the following three severity categories, each with its associated color in Cisco ANA EventVision and Cisco ANA NetworkVision:
•Flagging—Indicative of a fault: Critical (red), Major (orange), Minor (yellow), or Warning (sky blue).
•Clearing—Indicative of a fault that has been resolved: Cleared (green).
•Informational—Info (dark blue).
For example, a Link Down event might be assigned a critical severity, while its corresponding Link Up event will have a cleared severity.
The last event in the sequence determines the severity of an alarm (an event sequence). Exceptions to this rule include bookkeeping events (see Bookkeeping Events), which do not change the severity of the sequence (the alarm).
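The severity rules above can be summarized in a short sketch. The function names and the tuple layout are assumptions made for illustration; only the category names and the "last event wins, except bookkeeping events" rule come from the text.

```python
FLAGGING = {"Critical", "Major", "Minor", "Warning"}


def severity_category(severity: str) -> str:
    """Map an event severity to its category."""
    if severity in FLAGGING:
        return "flagging"
    if severity == "Cleared":
        return "clearing"
    if severity == "Info":
        return "informational"
    raise ValueError(f"unknown severity: {severity}")


def alarm_severity(events):
    """Return the severity of an alarm (an event sequence).

    `events` is a list of (severity, is_bookkeeping) tuples in arrival
    order. The last event determines the severity, skipping bookkeeping
    events, which do not change the severity of the sequence.
    """
    for severity, is_bookkeeping in reversed(events):
        if not is_bookkeeping:
            return severity
    return None  # empty sequence, or bookkeeping events only
```

For example, a Link Down (Critical) followed by a Link Up (Cleared) produces a cleared alarm, and a later bookkeeping event leaves that result unchanged.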
Event Source Association
Event identification is followed by source association. Cisco ANA examines and parses the event notification message to pinpoint the precise entity that is the location, or source, of the event. Rather than simply relate the event to the managed element as a whole, the association code determines the precise source of the event. The source corresponds to an object in the VNE model. The event is populated with the unique IMO identifier of that object (the OID).
For instance, the source of an Ethernet flow point (EFP) Down event would be the corresponding EFP model object. Correctly associating an event to its closest source is an important step for the subsequent correlation actions.
Source Association Fallback
In some cases, the event source might not be in the internal VNE model at the time of the event notification. For instance, when a new module is inserted, it takes some time for Cisco ANA to poll all its interfaces and build up (populate) the model. If the new event notification is handled before the model is fully populated, the association logic might fail to find and retrieve the entity that is the correct source of the trap. A retry mechanism minimizes the occurrence of such a rare condition, but if it persists, the association logic falls back to the managed element entity (the network element) that is the source of the new event. An additional identifier (the alarm differentiator), representing the intended source, is later used in the correlation logic. For more information about alarm source OIDs, see the Cisco Active Network Abstraction 3.7 Reference Guide.
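A minimal sketch of the fallback logic follows. The `associate_source` function, the `lookup` callable, and the retry count are hypothetical; a real retry mechanism would also wait between attempts while the model is being populated.

```python
from typing import Callable, Optional, Tuple


def associate_source(
    lookup: Callable[[str], Optional[str]],
    managed_element_oid: str,
    intended_source: str,
    retries: int = 3,  # assumed value, for illustration only
) -> Tuple[str, Optional[str]]:
    """Return (source OID, alarm differentiator).

    `lookup` resolves the intended source to a model object OID, or
    returns None if the object is not (yet) in the VNE model.
    """
    for _ in range(retries):
        oid = lookup(intended_source)
        if oid is not None:
            return oid, None  # precise source found, no differentiator needed
    # Fall back to the managed element as a whole and keep the intended
    # source as a differentiator for later use in correlation.
    return managed_element_oid, intended_source
```

The differentiator keeps events that fell back to the same managed element distinguishable from one another.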
Correlation and Root Cause Analysis
Causality correlation is the process of relating an event to an existing alarm in a causality relationship. The root of the resulting causality hierarchy is called the root cause, and the correlation process that determines the root cause of a fault is referred to as root cause analysis (RCA).
RCA is performed on historic snapshots of the VNE model and forwarding information. These snapshots, maintained for ten minutes, enable the correlation logic to traverse the VNE network model at a time in the past, before the occurrence of the fault that is being analyzed. This is critical for root cause analysis in the presence of network faults.
The unique correlation capability of Cisco ANA is derived from this ability to learn the impact of the suspected fault by examining snapshots of the network before the occurrence of the fault.
Network and Local Correlation
Cisco ANA supports both network and local correlation and RCA of events:
•Network correlation—Network correlation associates events according to network logic. This logic correlates an event that occurred on a specific VNE component to an event that occurred on another component on the same VNE, or a component on a different VNE. The correlation is based on a flow that runs across the distributed model of the network elements and their topology. Network correlation is most successful if the event holds forwarding information, such as the IP address of a Border Gateway Protocol (BGP) neighbor, or a Frame Relay virtual connection. A root cause is identified among the group of correlated alarms.
•Local correlation—Local correlation is used for events that occur on the same VNE. In this model, events are correlated according to event weight. Each parent event can have multiple child events, and each event can have a different weight. Local correlation is tried first with the event having the highest associated weight.
•Local special correlation—Local special correlation is used for events that occur on the same VNE, such as link down to card out, or link down to device unreachable.
After correlation and RCA, the event is considered an enriched event.
For more information about correlation and root cause analysis, see Chapter 3, "Correlation Scenarios."
Event Correlation and Alarms
Event correlation is the process of relating one event to other events. Cisco ANA distinguishes two types of relations between events:
•A sequence of events. Events that have the same type and the same source are considered part of an event sequence, or an alarm. An alarm represents the complete lifecycle of a fault (see Figure 2-3).
Figure 2-3 Event Sequence
For more information, see New Event Associations.
•A hierarchy of event sequences (alarms), representing causality.
Causality correlation is the process of relating an event to an existing alarm in a causality relationship (see Figure 2-4).
Figure 2-4 Causality Correlation
Causality correlation creates a hierarchy, and the top-most cause is called the root cause.
In Figure 2-5, the Link Down alarm is the cause for OSPF Neighbor Loss alarm, and Card Out is the cause for Link Down and the root cause for all the other alarms as well.
Figure 2-5 Root Cause Analysis
For more information about event correlation, see Chapter 3, "Correlation Scenarios."
New Event Associations
Cisco ANA associates a new event to an existing event if it identifies an existing event with the following criteria:
•The existing event has the same event type and source as the new event.
•The existing event does not have a successor.
•The existing event is not archived.
•One of the following two conditions:
–The existing event's severity is not cleared. This is the normal case of an open alarm being updated, and is illustrated in Figure 2-6.
Figure 2-6 Updating an Alarm
–The existing event has cleared severity, and the new event arrives within a short time interval after the clearing event. The interval default is 20 minutes and can be configured. Cisco ANA considers the new event to be an extension of the existing fault despite the fact that it was already cleared. This is illustrated in Figure 2-7.
Figure 2-7 Extending a Cleared Alarm
If the new event arrives later, cannot be associated, and is ticketable, a new alarm is created, as illustrated in Figure 2-8.
Figure 2-8 Creating a New Alarm
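The association criteria can be expressed as a single predicate. The sketch below uses a hypothetical `Event` class and treats the 20-minute boundary as inclusive, which is an assumption.

```python
from dataclasses import dataclass

CLEARED_EXTENSION_WINDOW = 20 * 60  # default interval, in seconds


@dataclass
class Event:
    event_type: str
    source: str
    severity: str
    time: float              # arrival time, in seconds
    has_successor: bool = False
    archived: bool = False


def can_associate(existing: Event, new: Event,
                  window: float = CLEARED_EXTENSION_WINDOW) -> bool:
    """Decide whether `new` extends the sequence ending at `existing`."""
    if (existing.event_type, existing.source) != (new.event_type, new.source):
        return False
    if existing.has_successor or existing.archived:
        return False
    if existing.severity != "Cleared":
        return True  # normal case: an open alarm is updated
    # Cleared alarm: extend it only if the new event arrives soon enough.
    return new.time - existing.time <= window
```

When the predicate is false for every existing event and the new event is ticketable, the new event starts a new alarm.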
Flapping is the occurrence of a flood of consecutive event notifications (often severity toggling) related to the same alarm. Flapping can happen when a fault is unstable and causes repeated event notifications; for example, the use of a cable with a loosely fitting, rattling connector. Cisco ANA recognizes this flapping phenomenon, and represents the new event notifications with a single generated event with a flapping subtype. The alarm is said to be flapping. When the fault stabilizes and the new event notification frequency returns to normal, the fault management logic terminates the alarm's flapping mode by generating a final event notification (with either the Flapping Stopped Cleared or the Flapping Stopped Uncleared subtype), based on the state of the fault (the last received new event notification) at that time.
During flapping, the fault management logic generates periodic event notifications with a Flapping Update subtype that also becomes part of the alarm's event sequence.
A flapping situation is illustrated in Figure 2-9.
Figure 2-9 Flapping Event
A sequence of events is identified as flapping if:
•All events share the same event type and are associated with the same source.
•The time interval between consecutive events is less than one minute (default value).
•There are more than five events (default value) with a severity different from Cleared.
Flapping detection is enabled for certain events and disabled for others.
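The three flapping criteria can be checked as follows. This is a simplified sketch: it evaluates a whole list of events at once rather than a sliding window, and the function name and parameter names are hypothetical. The defaults mirror the values stated above.

```python
def is_flapping(events, max_gap=60.0, min_flagging=5):
    """Check a sequence of events for flapping.

    `events` is a time-ordered list of (timestamp_seconds, severity)
    tuples that already share the same event type and source.
    """
    # Every pair of consecutive events must be less than `max_gap` apart.
    gaps_ok = all(b[0] - a[0] < max_gap for a, b in zip(events, events[1:]))
    # There must be more than `min_flagging` events that are not Cleared.
    flagging = sum(1 for _, severity in events if severity != "Cleared")
    return gaps_ok and flagging > min_flagging
```

For example, twelve alternating down/up events spaced ten seconds apart contain six non-cleared events and are therefore classified as flapping, whereas ten such events (five non-cleared) are not.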
For information about customizing fault behavior, see the Cisco Active Network Abstraction 3.7 Customization User Guide.
Events that are not dropped after the identification phase, and that are correlated, are stored in the fault database. The relationships between the events are also retained. The content of the database can be reviewed using the Cisco ANA EventVision application.
Note Events are stored in the form of the Cisco ANA event object. The original notification structure of incoming event notifications (trap or syslog) is not maintained.
The stored events might be marked as archived. Archived events are persisted in the fault database and can be viewed by using Cisco ANA EventVision.
An event is archived if either of the following is true:
•The whole correlation hierarchy that it was part of is marked as archived (see Ticket Management Operations).
•The event does not relate to any other event and it is not ticketable.
Attributes Used by the Event Correlation Algorithm
Table 2-2 describes the attributes that control the algorithm Cisco ANA uses to perform event correlation.
Table 2-2 Event Correlation Attributes
Determines whether the event should attempt to correlate and find a causing event.
Determines if the new event should initiate a network correlation process or a local correlation process.
Event-specific flow information, such as the IP address of a BGP neighbor, used in network correlation. This information includes the following:
•The component to start the correlation flow from. This might be the VNE component that identified the event (for example, an IP interface component in an Interface Status Down event) or any other component in the VNE.
•Additional forwarding data used by the flow logic, including the flow direction and one of the following:
–The IP address used as the destination IP address of the flow.
–The virtual connection used as the entry point of the flow.
–The MPLS label used as the entry point of the label switched path (LSP).
–The type of VNE component to stop at.
One or more unique identifiers that are typically derived from the event type, source, and parameters from the event message content, used in local and flow correlation.
Whether this event can be the possible cause for other events. If so, other alarms can be correlated to it. An event that is configured to allow correlation should also be ticketable.
The relative weight of an event as a cause candidate in relation to other causing events. A new event can only correlate to an event that has a higher weight. The heavier the event, the more likely it is to be chosen as the cause.
Figure 2-10 illustrates the Cisco ANA correlation process.
Figure 2-10 Correlation Process in Cisco ANA
The correlation process starts after a VNE receives a new event notification (see Event Identification). The correlation steps, shown in Figure 2-10, include:
The first step in the process is event identification. After the trap has been identified, and if it is not dropped by the VNE, its attributes are determined.
Note If the VNE is in maintenance state, the VNE does not expedite events but does respond to flows.
Cisco ANA examines and parses the event notification message to pinpoint the precise entity that is the location, or source, of the event. For more information on event source association, see Event Source Association.
Cisco ANA then examines the new event for flapping. If this event is part of a flapping sequence, it is suppressed as described in Flapping Events.
Cisco ANA then further examines the new event. If the event is configured for correlation, the new event is subject to correlation. Most flagging events are subject to correlation; clearing events are not. If this event was configured to expedite polling, the specific polling command is generated as well, which might result in additional generated events.
If the event is not configured for correlation, Cisco ANA tries to relate the event to an event sequence. See Step J to Step M for additional details.
The correlation process is suspended for two minutes. This gives other events, possibly related in terms of cause and effect, a chance to be detected.
No further processing is performed on the event while it is suspended. As a result, the update to the fault database and Cisco ANA NetworkVision for this event is delayed by two minutes.
After the two-minute suspension, other events are examined for possible causality. Two different processes can be followed: local correlation (Step G) or network correlation (Step H), depending on the event configuration.
Local correlation examines and finds other events related to the same (local) NE from which the new event originated; no network flow is initiated.
Local correlation is well suited for the following scenarios:
•A new event (such as a Module Out event) is in the device scope.
•The new event has a corresponding generated event message in Cisco ANA. It is desirable for the event to correlate to its generated event message. For example, a Link Down syslog event tries to correlate to a local Link Down or Port Down generated event message.
•The new event does not contain information that can be used to perform correlation beyond the scope of the local device. Such information is used to initiate the correlation flow.
Most trap and syslog events use the local correlation process. The correlation logic looks for causing events in the VNE that:
•Are configured to allow correlation.
•Arrived within the last nine minutes (including the two-minute suspension time).
•Have a correlation key that matches at least one of the correlation keys of the new event.
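The three local correlation criteria amount to a filter over recent events in the VNE. In the sketch below, the `Candidate` class and the `local_candidates` function are hypothetical, and the nine-minute window is taken from the text above.

```python
from dataclasses import dataclass

LOCAL_WINDOW = 9 * 60  # seconds, includes the two-minute suspension


@dataclass
class Candidate:
    name: str
    allows_correlation: bool
    arrival: float          # arrival time, in seconds
    keys: frozenset         # correlation keys of this event


def local_candidates(new_keys, now, events, window=LOCAL_WINDOW):
    """Return the events in the VNE that could have caused the new event."""
    return [
        e for e in events
        if e.allows_correlation          # configured to allow correlation
        and now - e.arrival <= window    # arrived recently enough
        and e.keys & new_keys            # shares at least one correlation key
    ]
```

The result can contain several candidates; choosing among them is a separate step.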
Flow or network correlation broadens the search for events to those coming from other NEs, including those that are several hops away.
Flow correlation is well suited for the following scenarios:
•The event represents a failure in a connection or service that spans multiple devices. For example, an MPLS traffic engineering (TE) Tunnel Down event tries to correlate to faults on the path that the tunnel traverses.
•Logically, the new event can result from events that occurred in other devices. For example, Cisco ANA tries to find the root cause for a Device Unreachable event in other devices by performing a flow to the management IP address.
Flow correlation uses historic snapshots of the VNE model to search the local VNE and other VNEs for causing events that meet the following criteria:
•Are configured to allow correlation.
•Arrived within the last seven minutes (including the two-minute suspension time).
•Exist on VNE components that appear on a flow path traversed according to the forwarding information of the new event.
Correlation processes often yield more than one candidate-causing event. Additional, specific, rule-based filtering logic eliminates unlikely causing events, such as Cloud Problem, BGP Process Down, or LDP Neighbor Down. Finally, the causing event with the highest weight (or the closest in time, when there is more than one) is selected as the causing event.
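Selecting among the remaining candidates can be sketched as a maximum over weight with a time-based tie-break. The function name, the tuple layout, and the weight values in the example are hypothetical; the rule-based filtering mentioned above is not modeled.

```python
def select_cause(new_weight, new_time, candidates):
    """Pick the causing event for a new event.

    `candidates` is a list of (name, weight, time) tuples. A new event
    can only correlate to an event with a higher weight; among those,
    the heaviest wins, and ties go to the candidate closest in time.
    """
    heavier = [c for c in candidates if c[1] > new_weight]
    if not heavier:
        return None  # no cause found
    best = max(heavier, key=lambda c: (c[1], -abs(new_time - c[2])))
    return best[0]
```

For example, with a new event of weight 100 at time 100 and candidates `("Link Down", 500, 100)`, `("Card Out A", 800, 90)`, and `("Card Out B", 800, 99)`, the sketch selects `"Card Out B"`: both Card Out candidates have the highest weight, and B is closer in time.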
The new event becomes the starting point of a new, correlated alarm.
If the previously described correlation process does not yield any causing events, or the new event was not subjected to correlation, it is possible that the new event simply relates to an existing event sequence (alarm). Cisco ANA searches for such an event sequence with the same source and event type as the new event.
If a matching event sequence is found, the new event is appended to it, and the alarm severity is updated accordingly.
If no matching alarm is detected, the new event is potentially a root cause in its own right. If the new event is ticketable, it is the basis for a new alarm that, in turn, causes the creation of a new ticket.
If the new event is not ticketable, it is archived and is no longer involved in correlations.
Ticket management refers to managing the life cycle of tickets, including:
•Associating events and alarms to tickets
Ticket creation is distributed among the units. When a ticket needs to be created, it is created in the unit that generated it by the fault agent running in that unit.
Events and alarms are associated with tickets by polling the fault database and associating enriched events with the tickets and alarms.
Ticket management includes both automatic and manual operations that are involved in managing the life cycle of a ticket and alarm. Automatic operations include associating an event with a ticket and alarm, updating the aggregated ticket state (such as severity and event counter), and automatic archiving. Manual operations include acknowledging a ticket, clearing a ticket, and so on. The automatic operations are orchestrated by the ticket agent in the gateway.
For information about automatic operations, see Automatic Operations. For information about manual operations, see Ticket Management Operations.
An alarm represents a scenario that involves a fault in the network, a managed element, or the management system. A ticket represents an attention-worthy root alarm whose type is marked as ticketable. A ticket has the same type as the root alarm it represents, and it has a status, which represents the entire correlation tree.
A ticketable event is an event that becomes the root cause of a new ticket if it is not correlated to any other event. Events are configured to be ticketable or not ticketable. For more information, see Ticket-Related Configuration Options.
Each ticket assumes the propagated severity of the alarm with the top-most severity, within all the alarms in the correlation hierarchy at any level.
A ticket is considered open as long as its severity is not cleared.
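Severity propagation over the correlation hierarchy can be sketched as a recursive maximum. The `Alarm` class and the numeric ranking are assumptions for illustration; the ranking covers only the flagging severities and Cleared.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed ordering, lowest to highest.
RANK = {"Cleared": 0, "Warning": 1, "Minor": 2, "Major": 3, "Critical": 4}


@dataclass
class Alarm:
    name: str
    severity: str
    children: List["Alarm"] = field(default_factory=list)


def propagated_severity(root: Alarm) -> str:
    """Return the top-most severity found at any level of the hierarchy."""
    worst = root.severity
    for child in root.children:
        child_worst = propagated_severity(child)
        if RANK[child_worst] > RANK[worst]:
            worst = child_worst
    return worst


def is_open(root: Alarm) -> bool:
    """A ticket is open as long as its propagated severity is not Cleared."""
    return propagated_severity(root) != "Cleared"
```

Under this rule, a ticket whose root cause alarm is cleared stays open while any correlated alarm below it remains uncleared.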
Both Cisco ANA NetworkVision and Cisco ANA EventVision display tickets and allow drilling down to view the consequent alarm hierarchy.
From an operator's point of view, a fault is always represented by a complete ticket. Operations such as Acknowledge or Remove are always applied to the whole ticket.
A ticket points to the root cause alarm that is the top-most alarm in the correlation hierarchy. The attributes of the ticket (such as short description) are derived from the root cause alarm.
In ticket management, automatic operations include finding the ticket to which an event belongs, finding the alarm to which the event belongs, and updating both. Alarms define a sequence of events in a ticket. Once the ticket is found, updating the alarm is immediate. Each event originating in the network is associated with a ticket. Only those events that are marked attention worthy, or ticketable, are visible to Cisco ANA NetworkVision. All alarms and tickets can be viewed in Cisco ANA EventVision.
Associating Events, Tickets, and Alarms
The main component of ticket management queries the fault database and retrieves events that can be connected to the ticket. Events that can be connected to a ticket are those events whose ticket ID is not known yet, but the ticket ID of the preceding event is known. The component continually executes this query to retrieve events that are not yet connected to a ticket or alarm. The following actions are taken for each event that is selected by the query:
1. The alarm whose source and name match those of the event (within the ticket) is updated for severity, state, and counter.
2. The ticket is updated for severity, last modification time, and counter.
3. The event is associated with the alarm and the ticket.
No events are lost in this process. As long as events are successfully persisted with their correlation information and the preceding event ID, they are associated with a ticket.
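The repeated query can be pictured as a loop that runs until no more events can be connected. This in-memory sketch stands in for the fault database; the dictionary layout and the function name are hypothetical, and the alarm and counter updates listed in the steps above are omitted.

```python
def associate_with_tickets(events):
    """Propagate ticket IDs along chains of preceding events.

    `events` maps an event ID to a dict with keys "ticket" (ticket ID or
    None) and "prev" (ID of the preceding event, or None). An event is
    connected once its preceding event's ticket ID is known.
    """
    progress = True
    while progress:  # stands in for the continually executed query
        progress = False
        for event in events.values():
            prev = events.get(event["prev"])
            if event["ticket"] is None and prev is not None and prev["ticket"] is not None:
                event["ticket"] = prev["ticket"]
                progress = True
    return events
```

Because an event is picked up as soon as its preceding event has a ticket ID, processing order does not matter: an event that cannot be connected in one pass is connected in a later one.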
Every minute, Cisco ANA reviews all the tickets and looks for tickets to clear. Cisco ANA clears a ticket if all of its events either are cleared or are configured for automatic clearing. In addition, Cisco ANA checks for tickets to archive. The archive timeout of a ticket is determined by the archive timeout setting of the root cause event.
Auto-archiving is a ticket management component that periodically checks for tickets that are candidates for archiving. A ticket is a candidate for archiving if:
•All of its events are clear or auto-clear.
•The configured length of time has elapsed after the last event joined it.
Auto-archiving interacts only with the fault database. Archiving a ticket marks its events and alarms as archived. They are not permanently deleted from the fault database. Events are kept in the fault database for three months. Alarms can be deleted sooner than three months because all required information exists in the event tables.
Ticket Management Operations
The following management operations might be applied to a ticket either manually or through the system (northbound) API:
•Acknowledge—Mark a ticket as acknowledged. Acknowledge is used to distinguish between new faults and faults that are known or handled by the operation team.
•Remove—Set the ticket and all the events in the hierarchy as archived. An archived ticket is removed from the display in Cisco ANA NetworkVision.
•Clear—Set all uncleared alarms in the hierarchy to cleared severity.
Remove and Clear operations might be done automatically by the system. The mechanism used for these automatic processes is described in the following topics:
For more information about integrating with northbound applications, see the Cisco Active Network Abstraction Integration Developer Guide.
Cisco ANA also generates bookkeeping events. When a ticket is archived or acknowledged, a bookkeeping event is generated for all alarms that are correlated to the ticket.
Ticket-Related Configuration Options
You can configure the following ticket management options:
•Whether or not the alarm is to be treated as attention-worthy. If so, it is treated as a ticket. If not, it is treated as an alarm and is not promoted to a ticket.
•The duration between a flagging event and the preceding clearing event of the same alarm so that the events are considered a single occurrence.
For example, in a link down > link up > link down scenario, you can specify how long to wait between the Link Up event and the second Link Down event before a new ticket is opened. This configuration option is relevant for tickets only. Once a ticket is cleared or auto-cleared and this timeout is exceeded, the ticket can be archived.
•The amount of time that must elapse from the time the ticket is cleared until it can be archived.
•Automatic clearing of alarms if they are not cleared by the network.
For more information about customizing fault behavior, see the Cisco Active Network Abstraction 3.7 Customization User Guide.
Cisco ANA uses the auto-remove process to remove tickets automatically. The process launches periodically and scans all unarchived tickets. It removes a ticket automatically if all of the following conditions are met:
•The ticket is cleared, or its type is Info.
•The time that has passed from the clearing of the ticket is greater than that configured for correlation.
•The subtype of the first event in the sequence of the root cause alarm is configured for automatic removal.
•The time that has passed from the last update to the ticket is greater than the time for automatically removing the ticket. Any change to the correlation hierarchy or to the sequence of one of the alarms in the correlation hierarchy is considered an update to the ticket.
The default value for the time interval to trigger the auto-remove process is one minute.
Note After the ticket is archived, the events in its correlation hierarchy are no longer presented in Cisco ANA NetworkVision.
Cisco ANA implements an additional automatic process to maintain the number of concurrent open (noncleared) tickets below a predefined threshold. If the number of open tickets found is above the threshold, the oldest tickets are removed and archived.
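The threshold check can be sketched as follows. The function name and tuple layout are hypothetical, and "oldest" is interpreted here as earliest creation time, which is an assumption.

```python
def tickets_to_remove(open_tickets, threshold):
    """Return the IDs of the oldest open tickets that exceed the threshold.

    `open_tickets` is a list of (ticket_id, created_time) tuples.
    """
    excess = len(open_tickets) - threshold
    if excess <= 0:
        return []  # at or below the threshold, nothing to remove
    oldest_first = sorted(open_tickets, key=lambda t: t[1])
    return [ticket_id for ticket_id, _ in oldest_first[:excess]]
```

Only as many tickets as needed to return to the threshold are selected, so the most recent faults remain visible.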
For more information about customizing fault behavior, see the Cisco Active Network Abstraction 3.7 Customization User Guide.
Situations can occur in which the root cause alarm is cleared, but alarms that are not cleared remain in the ticket correlation tree.
Noncleared alarms might exist in the correlation tree for one of the following reasons:
•The network event that caused the alarm creation has not been fixed, or the network event that caused the alarm creation was fixed, but the VNE has not yet identified the change.
•The network event that caused the alarm creation was fixed. A clearing notification (trap or syslog) associated with this event was sent from the device but did not reach Cisco ANA, or was not identified correctly by Cisco ANA.
The situation described in the second scenario is undesirable, and Cisco ANA enables you to configure automatic clearing of alarms to address this type of situation.
Figure 2-11 shows an example ticket of a Link Down alarm.
Figure 2-11 Ticket of a Link Down Alarm
In this example, the alarm has been cleared (Link Up), but the severity of the ticket is still Major (Orange).
The reason for this situation is found in the correlation tree of the ticket (see Figure 2-12). The tree contains one Link Down syslog alarm that has not been cleared because its clearing syslog event did not reach Cisco ANA.
Figure 2-12 Correlation Tree of the Link Down Ticket
You can configure Cisco ANA so that it clears alarms automatically, or you can manually clear the ticket in Cisco ANA NetworkVision.
You can configure automatic clearing for each event type in Cisco ANA. This setting indicates whether events of that type can be cleared automatically by the auto-clear mechanism. The mechanism runs on Cisco ANA every minute and iterates over all tickets that are not archived. For each open ticket, the auto-clear mechanism determines whether every one of its events is either already cleared or configured for automatic clearing. If every event meets one of these criteria, the mechanism automatically clears the ticket.
Note The auto-clear mechanism does not clear a ticket if the root cause event has not been cleared.
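The auto-clear decision, including the root cause restriction in the preceding note, can be sketched as below. The Event fields and the function name are hypothetical and serve only to illustrate the rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    # Hypothetical fields for illustration only.
    name: str
    cleared: bool
    auto_clear: bool             # per-event-type configuration
    is_root_cause: bool = False


def should_auto_clear(events: List[Event]) -> bool:
    """Decide whether an open ticket can be cleared automatically."""
    # The root cause event must already be cleared.
    root_cleared = all(e.cleared for e in events if e.is_root_cause)
    # Every event must be cleared or configured for automatic clearing.
    all_eligible = all(e.cleared or e.auto_clear for e in events)
    return root_cleared and all_eligible
```

In the Link Down example, the root cause alarm is cleared and the remaining Link Down syslog is configured for automatic clearing, so the function returns True and the ticket can be cleared.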
When an event is cleared automatically, the clearing event description displayed in Cisco ANA NetworkVision indicates this (for example, Auto Cleared - Link Down due to Admin Down).
In Cisco ANA, all syslogs and traps are configured to clear automatically, except the following:
•Syslogs and traps that are ticketable.
•A few important syslogs and traps that do not have a corresponding service alarm.
Note A device that suddenly loses power does not send a Down event. Instead, it sends a coldstart trap when it subsequently recovers, and this trap does not clear automatically. In this situation, because there is no corresponding Down event, if the coldstart trap were automatically cleared, the device-recovery notification would be lost.
Syslog and trap events that are not cleared automatically must be cleared manually, using Cisco ANA NetworkVision.
The process that checks periodically for open tickets is also used to automatically remove tickets. As a result, both operations share the same time interval, which is one minute by default.
Duplicate Alarm Processing
By default, Cisco ANA avoids creating duplicate tickets by identifying and appropriately linking duplicate alarms to existing alarms within previously generated tickets.
Figure 2-13 shows a generic example. Alarm 1 is the top-most, or root, alarm for the ticket. Alarms 2 and 3 are both part of the correlation tree, derived from Alarm 1. Two duplicate alarms, Alarm 1.1 and Alarm 1.2, have arrived and Cisco ANA has assigned them as successors to Alarm 1.
Figure 2-13 Duplicate Alarms in a Ticket Correlation Tree
Cisco ANA uses the predecessor/successor concept to handle incoming duplicates properly, without either discarding them or creating new tickets for them. In this example, Alarm 1 is the predecessor of Alarm 1.1, and Alarm 1.1 is the successor of Alarm 1. Similarly, Alarm 1.1 is the predecessor of Alarm 1.2, and Alarm 1.2 is the successor of Alarm 1.1.
When an alarm arrives, Cisco ANA searches among its stored alarms for a possible predecessor. In this example, when Alarm 1.1 arrives, Cisco ANA identifies Alarms 1, 2, and 3 as possible predecessors, and then identifies the correct predecessor by matching it against the incoming alarm according to the following rules:
•The predecessor and successor both come from the same OID.
•The predecessor and successor have the same alarm type.
•The predecessor has not been archived. As explained in Ticket Auto-Clear, a ticket is archived automatically when its alarms are configured for automatic clearing and the alarm associated with its root cause has been cleared. If any alarm within the ticket is not configured for automatic clearing, the ticket is not archived. Instead, it can receive duplicate alarms until the operator clears it manually.
•The predecessor is not an indeterminate, informational, or cleared alarm.
•The predecessor's correlation timeout period has not elapsed.
•The predecessor currently has no successor.
If all of these conditions are met, the incoming alarm is assigned to that predecessor, as shown in the example. The predecessor alarm also sets its successor to the new alarm. Later duplicates (like Alarm 1.2) become successors to the new alarm under the same processing rules.
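The predecessor search can be sketched as a filter over the stored alarms that applies the six rules in order. The Alarm fields and function names are hypothetical, and measuring the correlation timeout from the predecessor's creation time is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Severities that disqualify a stored alarm from being a predecessor.
EXCLUDED_SEVERITIES = {"indeterminate", "info", "cleared"}


@dataclass
class Alarm:
    # Hypothetical fields for illustration only.
    alarm_id: str
    source_oid: str
    alarm_type: str
    severity: str
    created_at: float                     # seconds
    archived: bool = False
    successor: Optional["Alarm"] = None


def find_predecessor(incoming: Alarm, stored: List[Alarm],
                     correlation_timeout: float) -> Optional[Alarm]:
    """Return the stored alarm that satisfies all six matching rules."""
    for candidate in stored:
        if (candidate.source_oid == incoming.source_oid
                and candidate.alarm_type == incoming.alarm_type
                and not candidate.archived
                and candidate.severity.lower() not in EXCLUDED_SEVERITIES
                # Assumption: the timeout is measured from the candidate's creation.
                and incoming.created_at - candidate.created_at <= correlation_timeout
                and candidate.successor is None):
            return candidate
    return None


def link_duplicate(incoming: Alarm, stored: List[Alarm],
                   correlation_timeout: float) -> Optional[Alarm]:
    """Attach the incoming alarm to its predecessor, if one exists."""
    predecessor = find_predecessor(incoming, stored, correlation_timeout)
    if predecessor is not None:
        predecessor.successor = incoming
    stored.append(incoming)
    return predecessor
```

Because a predecessor may have only one successor, a second duplicate skips Alarm 1 and attaches to Alarm 1.1, which produces the chain shown in Figure 2-13.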
Event Flow Through Cisco ANA
Figure 2-14 illustrates the flow of events through Cisco ANA.
Figure 2-14 Event Flow Through Cisco ANA
Events flow through Cisco ANA as follows:
1. The Cisco ANA Event Collector (AEC):
–Receives all incoming events.
–Performs initial parsing to obtain basic information about each event.
2. The AEC stores each incoming event in the event archive database.
3. The AEC distributes each incoming event to the corresponding VNE.
4. The VNE:
–Defines the type of application to use for the event type.
–Extracts additional information from the raw event.
–Passes the event to the fault agent.
–Correlates associated events.
–Performs root cause analysis (RCA) on the event.
–Sends updated correlation and RCA information to the fault agent.
5. The fault agent:
–Runs on a dedicated AVM in each unit.
–Sends events to the alarm database.
–Creates tickets, as appropriate.
6. Cisco ANA forwards all event notifications for actionable events, nonactionable items, tickets, system events, and ticket updates to the gateway.
7. The gateway forwards BQL, event, and ticket notifications to external OSS applications via northbound integration services. The notifications are formatted in accordance with Cisco EPM-NOTIFICATION-MIB.
8. Event information is available on demand to users in standard or user-defined formats via the Report Manager.
9. Events older than three months are removed from the event archive and the alarm database.
The actual event rates and optimal configurations of the flow processing path are highly dependent on many factors, including:
•Deployed networking technologies and configurations
•Number of network elements under management
•Frequency of fault incidents
Exhaustive testing is required before changing default values.
A key concept that is implemented in Cisco ANA fault management is that of event reduction, thereby providing you with correlated, actionable tickets. See Figure 2-15.
Figure 2-15 Event Reduction
As illustrated, the AEC collects a large number of events from the network and from managed devices. Some of the events are actionable, while others are not. After saving all events to the event archive, Cisco ANA drops the events that are nonactionable.
The remaining actionable events are identified and examined to determine whether they represent a sequence of related events or alarms. For example, a number of events that arrive from the same device within a short period of time might be grouped and identified as an alarm because it is likely that they are related to each other.
The groups are then examined further to determine a root cause. If multiple candidates exist for the root cause, the candidate with the highest weight or the closest in time is selected as the causing event. If the causing event is ticketable, a ticket is issued and the corresponding alarms are associated with the ticket.
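One possible reading of the selection rule is sketched below. The text does not say how weight and time proximity are combined, so this example assumes that weight is compared first and time proximity breaks ties. The tuple layout and function name are hypothetical.

```python
def select_causing_event(event_time, candidates):
    """Pick a causing event from (name, weight, timestamp) candidates.

    Assumption: the highest weight wins, and among equal weights the
    candidate closest in time to the correlated event wins.
    """
    if not candidates:
        return None
    return max(candidates,
               key=lambda c: (c[1], -abs(event_time - c[2])))
```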
Events, the AEC, and VNEs
Cisco ANA polls devices to discover changes in its network model, and generates internal events. External events (traps, syslogs) are treated as indications of faults that Cisco ANA confirms via expedited polling.
The AEC uses a high-speed collector to receive traps and syslogs from network elements, and distributes each event to the VNE that represents the device that sent it.
For each trap or syslog that it receives, the AEC creates a raw (unparsed) event. The AEC also stores each event from a managed element by sending the event to the event archive.
The AEC then sends the event to a shared buffer for forwarding to the appropriate VNE. If a high number of notifications arrive simultaneously, exceeding the per-VNE limit, the overflow notifications are directed to a burst buffer. The burst buffer operates on a first-come, first-served basis. If the burst buffer is full, any additional new events are dropped, and a system event is generated to inform the operator of this situation.
If required, you can configure the:
•Shared buffer size
•Burst buffer size
•Amount of time that elapses between each event that is sent
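The buffering behavior can be sketched as follows. This is a simplified model, not the AEC implementation: the class and attribute names are hypothetical, the shared buffer is modeled as one queue per VNE, and the configurable delay between sent events is not modeled.

```python
from collections import deque


class EventDispatcher:
    """Toy model of the shared buffer and burst buffer."""

    def __init__(self, per_vne_limit, burst_size):
        self.per_vne_limit = per_vne_limit
        self.burst_size = burst_size
        self.shared = {}            # vne_id -> deque of pending events
        self.burst = deque()        # first-come, first-served overflow
        self.system_events = []     # notifications for the operator

    def submit(self, vne_id, event):
        """Queue an event and report where it ended up."""
        queue = self.shared.setdefault(vne_id, deque())
        if len(queue) < self.per_vne_limit:
            queue.append(event)
            return "shared"
        if len(self.burst) < self.burst_size:
            self.burst.append((vne_id, event))
            return "burst"
        # Burst buffer is full: drop the event and raise a system event.
        self.system_events.append(
            f"event for {vne_id} dropped: burst buffer full")
        return "dropped"
```

With a per-VNE limit of 2 and a burst buffer size of 1, four simultaneous events for the same VNE are handled as two shared, one burst, and one dropped.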
Upon receipt of the event, the VNE identifies the type of event and drops all nonactionable events.
Events that are dropped at this stage are not stored in the fault archive, and do not participate in correlation. Dropping events at this stage prevents Cisco ANA from being overwhelmed by large numbers of insignificant event notifications.
Pruning events older than three months in the event archive is handled by the Cisco ANA integrity process. For more information, see Integrity and Pruning Process.
When an event notification reaches the VNE, it is handled by the event manager. The event manager handles all network events, regardless of their detection method, such as polling, syslogs, traps, or threshold crossing alerts (TCAs).
If the VNE is in maintenance state, the VNE does not expedite events, but does respond to flows. Similarly, if a port is in disable alarm state, it does not expedite events but does respond to flows.
The event manager parses and persists the events as follows:
•Event parsing applications extract information from the raw event, such as its source, the problem it represents, and its perceived severity.
•The VNE saves the event to the fault database. As soon as the event is persisted, it can be viewed in Cisco ANA EventVision.
VNE and Fault Agent
The VNE sends a message to the fault agent with information about the correlation result, which is then updated in the event record.
As soon as the update message arrives, the fault agent determines if a new ticket should be created and, if so, creates one.
After the VNE completes the correlation for a flapping event or completes the handling for a clearing event, it sends an update to the fault agent, providing it with the following information:
•Event ID—The identifier of the updated event.
•Causing event ID—The identifier of the causing event if relevant.
•Causing event source OID—The source OID of the causing event.
•Causing event name—The name of the causing event.
•ConnectTo ID—The identifier of the preceding event. The preceding event can be the causing event, if it is found, or the preceding event in the alarm sequence.
This information is updated in the event record in the fault database. The preceding event strengthens the association of the event with an alarm or ticket. If the event does not have a ConnectTo ID, the fault agent creates a ticket for it, provided that the event is defined as attention worthy.
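The update message and the resulting ticket decision can be sketched as below. The field names mirror the list above, but the class, the function, and the attention_worthy flag are hypothetical illustrations rather than Cisco ANA interfaces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorrelationUpdate:
    # Fields follow the update message described in the text.
    event_id: int
    causing_event_id: Optional[int] = None
    causing_event_source_oid: Optional[str] = None
    causing_event_name: Optional[str] = None
    connect_to_id: Optional[int] = None   # preceding event, if any


def needs_new_ticket(update: CorrelationUpdate, attention_worthy: bool) -> bool:
    """A ticket is created only for an unconnected, attention-worthy event."""
    return update.connect_to_id is None and attention_worthy
```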
The fault agent runs in each Cisco ANA unit and is responsible for:
•Updating correlation information for enriched events.
•Creating new tickets, as required.
The fault agent is available as soon as its hosting AVM is up and recovers quickly if failover occurs in a high availability environment.
The fault database receives events in the following modes:
•Batch mode—Events are inserted into the database in batches of configurable size. The configuration is dynamic and can be changed. For more information about customizing fault behavior, see the Cisco Active Network Abstraction 3.7 Customization User Guide.
•Poll mode—Polling is used so that events wait no longer than a configured amount of time if events are received at a rate that does not fill the buffer.
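The two insertion modes can be viewed as a buffer that is flushed either when it is full or when its oldest event has waited too long. The following sketch assumes that interpretation. The class, its parameters, and the write callback are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
class EventBatcher:
    """Flush on batch size (batch mode) or on maximum wait (poll mode)."""

    def __init__(self, batch_size, max_wait, write):
        self.batch_size = batch_size
        self.max_wait = max_wait      # seconds an event may wait
        self.write = write            # callable that persists a list of events
        self._buffer = []
        self._oldest = None           # arrival time of the oldest buffered event

    def add(self, event, now):
        if not self._buffer:
            self._oldest = now
        self._buffer.append(event)
        if len(self._buffer) >= self.batch_size:
            self._flush()             # batch mode: buffer is full

    def poll(self, now):
        if self._buffer and now - self._oldest >= self.max_wait:
            self._flush()             # poll mode: events waited long enough

    def _flush(self):
        self.write(list(self._buffer))
        self._buffer.clear()
        self._oldest = None
```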
As soon as an event is persisted, it can be viewed in Cisco ANA EventVision.
All events, alarms, and tickets that are not dropped after the identification phase, and the relationship between these items, are stored in the alarm database. The content of the database can be reviewed using Cisco ANA EventVision.
Events are stored in the form of a Cisco ANA event object. The original notification structure of incoming event notifications (trap or syslog) is not maintained.
Cisco ANA stores all event types, whether or not it recognizes them.
Statistics, such as how many events were persisted per second, how many tickets were created, and more, can be shown through the Report Manager in Cisco ANA Manage, Cisco ANA NetworkVision, and Cisco ANA EventVision.
If the fault agent fails for any reason, events that were waiting in its queue are lost and cannot be recovered.
The stored events might be marked as archived. Archived events are persisted in the fault database and can be viewed by using Cisco ANA EventVision.
An event is archived either because:
•The whole correlation hierarchy that it was part of was marked as archived (see Ticket Management Operations).
•The event did not relate to any other event, or it was not ticketable.
Northbound Integration Services
Cisco ANA supports notification services to both external OSS applications and northbound clients. Notification services are available for all event notifications for actionable and nonactionable events, as well as tickets and ticket updates.
NBI services are available for both:
•SNMP, new with Cisco ANA 3.7
Notifications are available in EPM-NOTIFICATION-MIB format only.
Note If you deployed the Cisco ANA2Netcool (AVM80) solution to integrate with CIC or IBM Tivoli (Netcool), contact your Cisco representative for migration support.
See the Cisco Active Network Abstraction Integration Developer Guide for more information about:
•Cisco EPM-NOTIFICATION-MIB and its format
•Configuring notification services
Cisco ANA includes Report Manager, a tool that enables you to produce and customize a variety of reports about events, traps, tickets, and syslogs. The standard event reports included with Cisco ANA enable you to view the:
•Number of syslogs, tickets, and traps in the alarm database and the AEC
•Daily average and peak number of syslogs and traps
•Daily event count
•Devices with the most events by severity or type
•Devices with the most syslogs or traps
•Most common daily events
•Most common syslogs
•Syslog count by device
The reports allow you to specify the time period and the devices to be included, and provide output in a variety of formats, such as XML, XLS, HTML, and PDF.
Report Manager is available in Cisco ANA Manage, Cisco ANA NetworkVision, and Cisco ANA EventVision. For information on the types of fault management reports available and how to create them, see the Cisco Active Network Abstraction 3.7 User Guide.
Cisco ANA provides the following processes for managing the database:
•Integrity and Pruning Process
•Database Size Maintenance
Integrity and Pruning Process
The integrity and pruning process runs as a scheduled process in Cisco ANA and is responsible for:
•Guaranteeing the integrity of the alarm database by fixing any detected integrity violations. Integrity violations include correlation loops, missing root cause events, clearing events with no flagging event, and so on.
•Pruning old events and alarm and ticket records from the database.
•Pruning old raw event records from the event archive.
You can control how long events, alarms, and tickets are kept before being pruned from the database. For more information about customizing fault behavior, see the Cisco Active Network Abstraction 3.7 Customization User Guide.
Database Size Maintenance
To prevent overflow in the database, Cisco ANA implements an automatic process that deletes old data. You can configure the period of time for which events are maintained. The cutoff time is the current time minus the event history size, and any event older than the cutoff is deleted. For more information about customizing fault behavior, see the Cisco Active Network Abstraction 3.7 Customization User Guide.
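The cutoff calculation can be illustrated with a short sketch. The function name and the (event_id, timestamp) representation are hypothetical.

```python
from datetime import datetime, timedelta


def prune_events(events, now, history_size):
    """Keep only events that are not older than now - history_size.

    events: list of (event_id, timestamp) pairs (hypothetical shape).
    """
    cutoff = now - history_size
    return [event for event in events if event[1] >= cutoff]
```

For example, with a history size of 90 days, an event recorded 100 days ago is deleted and an event recorded 10 days ago is kept.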