Cisco Active Network Abstraction Fault Management User Guide, 3.6.2
Fault Management Overview


This chapter describes the key concepts of Cisco ANA fault management.

Introduction

Event Management

Tickets

Database Size Maintenance

Event Flow Through Cisco ANA


Note For details relating to configuring parameters in the registry throughout this guide, see Appendix C, "Event and Alarm Configuration Parameters".


Introduction

A common problem in the management of large networks is that a single fault manifests itself as multiple alarms in the Network Management System (NMS). This makes manual analysis of faults costly and diverts attention from major problems. It is a major motivation for applying analysis and correlation in an NMS.

Many types of problems threaten service delivery, such as hardware failures and software failures. An effective root cause analysis technique must be capable of identifying all these problems automatically. This technique must work accurately for any environment and for any topology, including interrelated logical and physical topologies, with or without redundancy.

Cisco ANA is used for analyzing and managing faults using fault detection, identification and correlation. Once a fault is identified, the system uses the auto-discovered virtual network model to perform fault inspection and correlation in order to determine the root cause of the fault.

Event Management

An event is a representation of a distinct incident occurring at a specific point in time. An event is a possible symptom of a fault. Examples of events include:

Port status change

Connectivity loss between routing protocol processes on peer routers (for example, BGP neighbor loss)

Device reset

Device becoming unreachable by the management station

Events and Alarms

In Cisco ANA NetworkVision, Cisco ANA EventVision, and this guide, an event is represented by a small icon in the form of a bell.

Figure 1-1 Example Events

Events have an associated severity. Events with a severity of Critical (red), Major (orange), Minor (yellow), or Warning (sky blue) are called flagging events; an event with a severity of Cleared (green) is called a clearing event. Events that are informational in nature are marked in dark blue.

The lifecycle of a fault scenario is called an alarm. An alarm is characterized by a sequence of related events, such as port-down and port-up. In this guide, an alarm is denoted by an oval, as follows:

Figure 1-2 Example of an Alarm

If event A is followed by event B as part of an event sequence, then event A is the predecessor of B, and B is the successor of A.

The last event in the sequence determines the severity/state of the alarm. An alarm that ends with an event that has a severity of cleared is called a cleared alarm.

Event Discovery

Network Faults

Cisco ANA's fault management system learns about network faults via network event notifications:

1. Incoming Network Event Notifications—SNMP traps and syslog event messages sent asynchronously by network elements are captured by Cisco ANA and processed by the appropriate VNE. Cisco ANA supports SNMP v1, v2 and v3.

2. Generated Event Notifications—The VNE generates an event message when it detects a state change in the network element, typically after polling it via Telnet or SNMP. Some event notifications can also be generated by the gateway, for instance the "VPN Leak" event.

Cisco ANA can be set to generate a threshold crossing event (TCA) notification when a certain device attribute condition is violated. For instance, Cisco ANA can be set to monitor CPU temperature and yield an event notification when it exceeds a certain value. For more information about TCAs, see the Cisco Active Network Abstraction Customization User Guide.

Generated event notifications are also referred to as service alarms.

Expedition

The Cisco ANA fault management system frequently expedites (immediately triggers) the polling of specific data from the device, upon receipt of a trap or syslog event notification. Thus, these incoming notifications often yield additional generated event notifications (service alarms), which are key to the subsequent correlation process.

The list of traps and syslog notifications that cause polling expedition is device specific, and varies from VNE to VNE.

Internal Faults

In addition to network faults, the gateway and/or the units generate so-called system events, which indicate Cisco ANA internal faults such as disk full, unit unreachable, and so on.

Event Identification

Once the fault management system becomes aware of a fault via an event notification, it processes the event notification message to identify the event, and creates an internal event object to represent the underlying event. At this stage, Cisco ANA determines the following information:

Event Functionality Type—Trap event, syslog event, service alarm, and so on.

Event Type—The event type is an identifier, describing the nature of the fault. For instance "Link down".

Event Subtype—The event subtype is a further clarification of the event type, for instance "Link down due to admin down".

The Event Type and Event Subtype are internal fields that Cisco ANA uses for event processing. Cisco ANA does not externalize these fields.

Event description strings, including the content of the notification message (for incoming event notifications), and a "short description"; this text is used to display the event to the operator.

Event Severity—For more information see Severity.

Based on the event type and event sub-type, an event has numerous additional correlation and metadata attributes that determine how the event will be processed by Cisco ANA. These attributes, further discussed in this chapter and in Chapter 2, "Causality Correlation and Root Cause Analysis", are defined in the Cisco ANA registry, and documented in Appendix A, "Supported Service Alarms" of this manual, and further detailed in the Cisco ANA VNE Reference Guide.

Incoming Event Identification

Incoming event notifications (traps and syslogs) are identified by matching the event data to predefined patterns. A trap or syslog is considered supported by Cisco ANA if it has a matching pattern and can be properly identified.

If the incoming event notification cannot be identified, Cisco ANA creates an event-object of a so-called "generic" type. Generic events are not used in subsequent correlation activity. When a generic event is generated, the raw data is populated in the long description field of the generic event.
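The identification step described above can be pictured with the following Python sketch. It is an illustration only and not Cisco ANA code: the PATTERNS table, the Event class, and the identify function are hypothetical names, and the single syslog pattern is an example rather than a pattern taken from the Cisco ANA registry.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern table: (compiled regex, event type).
# In Cisco ANA the real patterns are defined in the registry.
PATTERNS = [
    (re.compile(r"%LINK-3-UPDOWN: Interface (\S+), changed state to down"), "Link down"),
]


@dataclass
class Event:
    event_type: str
    long_description: str = ""
    correlatable: bool = True


def identify(raw: str) -> Event:
    """Return an identified event, or a generic event if no pattern matches."""
    for pattern, event_type in PATTERNS:
        if pattern.search(raw):
            return Event(event_type)
    # Unidentified notification: keep the raw data in the long description
    # and exclude the event from subsequent correlation.
    return Event("generic", long_description=raw, correlatable=False)
```

In this sketch an unmatched notification is never discarded; it becomes a generic event that carries the raw data, which mirrors the behavior described above.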

The payload structure of a particular trap may differ between SNMP versions and a trap may be supported by Cisco ANA only for a specific SNMP version.

Each supported trap or syslog has a corresponding configuration in the registry that maps it to a specific event type and sub type. The identification process described above builds the event object based on this configuration. In some cases the mapping is based on the internal values of the event. For example, the "Line down" trap is mapped to event type and sub type as follows:

Table 1-1 Line down Trap

Trap Internal Value    Event Type        Event Sub Type
Status=down            Line down trap    Line down trap
Status=up              Line down trap    Line up trap

Traps are mapped to event type and sub type regardless of the SNMP versions in use.
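As a rough illustration of a mapping that depends on an internal value of the trap, the rows of Table 1-1 can be written as a lookup. The dictionary layout, the key names, and the function name are assumptions made for this sketch; the actual mapping is held in the Cisco ANA registry.

```python
from typing import Dict, Tuple

# Rows of Table 1-1, keyed by the trap's internal Status value.
# Each value is (event type, event sub type).
LINE_DOWN_TRAP_MAP: Dict[str, Tuple[str, str]] = {
    "down": ("Line down trap", "Line down trap"),
    "up": ("Line down trap", "Line up trap"),
}


def map_line_trap(status: str) -> Tuple[str, str]:
    """Return (event type, event sub type) for a Line down trap."""
    return LINE_DOWN_TRAP_MAP[status]
```

Both rows share one event type and differ only in sub type, which is what later allows the "up" event to join the same event sequence as the "down" event.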

Event Dropping

Upon identification, the event type determines whether the event continues to be processed or is dropped.

Events that are dropped at this stage are not stored in the Cisco ANA database, and do not participate in correlation. Dropping events at this stage is important in order to prevent Cisco ANA from being overwhelmed by large numbers of insignificant event notifications.

Severity

Each event has an assigned severity. Events broadly fall into three severity categories:

Flagging—Indicative of a fault: Critical, Major, Minor, or Warning

Clearing—Indicative of a fault that has been resolved: Cleared

Informational—Info

For example, a link-down event may be assigned critical severity, while its corresponding link-up event will have a cleared severity.

The last event in the sequence determines the severity of an alarm (an event sequence). Bookkeeping events (see Bookkeeping Events) are an exception to this rule: they do not change the severity of the sequence (the alarm).
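The severity rule can be sketched as follows. This is a minimal illustration, assuming that each event carries a flag marking it as a bookkeeping event; the Event class and the alarm_severity function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    severity: str
    bookkeeping: bool = False  # assumed flag, for illustration only


def alarm_severity(sequence: List[Event]) -> Optional[str]:
    """Severity of the alarm: that of the last non-bookkeeping event."""
    for event in reversed(sequence):
        if not event.bookkeeping:
            return event.severity
    return None  # empty sequence, or bookkeeping events only
```

For a sequence of link-down (Critical), link-up (Cleared), and a trailing bookkeeping event, the function returns Cleared.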

Event Source Association

Event identification is followed by source association. Cisco ANA examines and parses the event notification message in order to pinpoint the precise entity that is the location, or source, of the event. Rather than simply relate the event to the managed element as a whole, the association code determines the precise source of the event. The source corresponds to an object in the VNE model. The event is populated with the unique IMO identifier of that object (OID).

See Appendix B, "Source OIDs of Alarms Generated by Cisco ANA" for more details.

For instance, the source of a "Neighbor Lost" event would be the relevant IP interface of the managed element. Correctly associating an event to its closest source is an important step for the subsequent correlation actions.

Source Association Fallback

In some cases the event source may not be in the internal VNE model at the time of the event notification. For instance, when a new module is inserted, it takes some time for Cisco ANA to poll all of its interfaces and build up (populate) the model. If the new event notification is handled before the model is fully populated, the association logic may fail to find and retrieve the entity that is the correct source of the trap. A retry mechanism minimizes the occurrence of such a race condition, but if it persists, the association logic will fall back to the managed element entity (Network Element) that is the source of the new event. An additional identifier (the alarm differentiator), representing the intended source, is later used in the correlation logic. See also alarm differentiator in Appendix B, "Source OIDs of Alarms Generated by Cisco ANA".
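The retry and fallback behavior can be approximated by the sketch below. The function name, the retry count, the delay, and the representation of the model as a lookup callable are all assumptions made for illustration; they are not Cisco ANA parameters.

```python
import time
from typing import Callable, Optional, Tuple


def associate_source(
    lookup: Callable[[str], Optional[str]],
    intended_source: str,
    managed_element_oid: str,
    retries: int = 3,      # hypothetical retry count
    delay_s: float = 0.0,  # hypothetical delay between retries
) -> Tuple[str, Optional[str]]:
    """Return (source OID, alarm differentiator)."""
    for attempt in range(retries):
        found = lookup(intended_source)
        if found is not None:
            return found, None  # precise source found, no differentiator needed
        if attempt < retries - 1:
            time.sleep(delay_s)  # give the model time to populate
    # Fallback: associate with the managed element as a whole and keep the
    # intended source as the alarm differentiator for later correlation.
    return managed_element_oid, intended_source
```

For example, with a model that contains only "port-1", a lookup for "port-9" exhausts the retries and returns the managed element OID together with "port-9" as the differentiator.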

Event Correlation and Alarms

Event correlation is the term used to describe the process of relating an event to other events. Cisco ANA distinguishes two types of relations between events:

1. A sequence of events. Events that have the same type and the same source are considered part of an event sequence, or an alarm. An alarm represents the complete life-cycle of a fault.

Figure 1-3 Event Sequence

For more information see Relating a New Event to an Event Sequence.

2. A hierarchy of event-sequences (alarms), representing causality.

Causality correlation is the process of relating an event to an existing alarm in a causality relationship.

Figure 1-4 Causality Correlation

Causality correlation creates a hierarchy, and the top-most cause is called the root cause.

In the following picture, the "Link Down" alarm is the cause for "OSPF neighbor loss" alarm, and "Card Out" is the cause for "Link Down", and the root cause for all the other alarms as well.

Figure 1-5 Root Cause Analysis
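The hierarchy in this example can be represented as a map from each alarm to its direct cause; the root cause is found by following the map upward until an alarm with no cause is reached. The caused_by dictionary and the root_cause function below are illustrative names, not part of Cisco ANA.

```python
from typing import Dict

# Direct causality relations from the example: alarm -> its cause.
caused_by: Dict[str, str] = {
    "OSPF neighbor loss": "Link Down",
    "Link Down": "Card Out",
}


def root_cause(alarm: str, causes: Dict[str, str]) -> str:
    """Follow the causality chain to the topmost alarm."""
    seen = set()
    while alarm in causes and alarm not in seen:
        seen.add(alarm)  # guard against an accidental cycle
        alarm = causes[alarm]
    return alarm
```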

For more information about event correlation see Chapter 2, "Causality Correlation and Root Cause Analysis".

Relating a New Event to an Event Sequence

Cisco ANA associates a new event with an existing event if it finds an existing event that meets the following criteria:

1. The existing event has the same event type and source as the new event

2. The existing event doesn't have a successor

3. The existing event is not archived

4. One of the following two conditions:

a. The existing event's severity is not cleared. This is the normal case of an open alarm being updated, and is illustrated in Figure 1-6.

Figure 1-6 Updating an Alarm

b. The existing event has cleared severity, and the new event arrives within a short time interval after the clearing event. The interval is configurable per event type using the gw-correlation-timeout parameter (default 20 minutes). Cisco ANA considers the new event to be an extension of the existing fault despite the fact that it was already cleared. This is illustrated in Figure 1-7.

Figure 1-7 Extending a Cleared Alarm

If the new event arrives later, and cannot be associated, and the new event is ticketable (see Ticketable Event), a new alarm will be created, as illustrated in Figure 1-8.

Figure 1-8 Creating New Alarm
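The four criteria can be expressed as a single predicate. The sketch below is an interpretation of the rules above, not the Cisco ANA implementation: the Event fields and the can_extend function are hypothetical, timestamps are plain seconds, and treating an arrival exactly at the timeout as still inside the interval is an assumption.

```python
from dataclasses import dataclass

GW_CORRELATION_TIMEOUT_S = 20 * 60  # default of 20 minutes, in seconds


@dataclass
class Event:
    event_type: str
    source: str
    severity: str
    timestamp: float  # seconds
    has_successor: bool = False
    archived: bool = False


def can_extend(existing: Event, new: Event,
               timeout_s: float = GW_CORRELATION_TIMEOUT_S) -> bool:
    """True if the new event joins the sequence that ends with the existing event."""
    # Criterion 1: same event type and source
    if (existing.event_type, existing.source) != (new.event_type, new.source):
        return False
    # Criteria 2 and 3: no successor, not archived
    if existing.has_successor or existing.archived:
        return False
    # Criterion 4a: the alarm is still open
    if existing.severity != "Cleared":
        return True
    # Criterion 4b: the alarm is cleared, but only recently
    return new.timestamp - existing.timestamp <= timeout_s
```

When the predicate is false for every stored event and the new event is ticketable, a new alarm is created, as in Figure 1-8.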

Flapping Events

Flapping is the occurrence of a flood of consecutive event notifications (often severity toggling) related to the same alarm. This can happen when a fault is unstable and causes repeated event notifications, for instance, the use of a cable with a loosely-fitting, rattling connector. Cisco ANA recognizes this flapping phenomenon, and represents the new event notifications with a single generated event with a "flapping" sub-type. The alarm is said to be flapping. Once the fault stabilizes and the new event notification frequency goes back to normal, the fault management logic terminates the alarm's flapping mode by generating a final event notification, with either "flapping stopped cleared" or "flapping stopped un-cleared" subtype, based on the state of the fault (the last received new event notification) at that time.

During flapping, the fault management logic generates periodic event notifications with a "flapping update" sub-type, which also become part of the alarm's event sequence.

A flapping situation is illustrated by the following figure:

Figure 1-9 Flapping Event

The flapping detection code is configurable. The following parameters affect the behavior (default values in parentheses):

flapping-threshold—The number of consecutive events that must be received at intervals shorter than the flapping interval, to be considered a flapping sequence (five).

flapping-interval—The maximum time interval between consecutive event notifications that are part of a flapping sequence (one minute).

update-threshold—The number of events in an incoming flapping sequence that triggers the generation of a "flapping update" event notification (20).

update-interval—If no "flapping update" event notification was sent during this time, one will be generated (about three minutes).

clear-interval—The time that must pass without the alarm being updated by new events for the alarm to exit the flapping mode (four minutes).

Flapping detection is enabled for certain events and disabled for others.
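The interplay of flapping-threshold, flapping-interval, and clear-interval can be sketched as a small state machine. This is one possible reading of the parameter descriptions above, not the Cisco ANA detection code: the FlappingDetector class is hypothetical, times are in seconds, and the update-threshold and update-interval parameters are left out for brevity.

```python
from typing import Optional


class FlappingDetector:
    def __init__(self, threshold: int = 5, interval_s: float = 60.0,
                 clear_interval_s: float = 240.0) -> None:
        self.threshold = threshold                # flapping-threshold
        self.interval_s = interval_s              # flapping-interval
        self.clear_interval_s = clear_interval_s  # clear-interval
        self.run_length = 0
        self.last_ts: Optional[float] = None
        self.flapping = False

    def on_event(self, ts: float) -> bool:
        """Record an event for the alarm; return True while flapping."""
        if self.last_ts is not None and ts - self.last_ts < self.interval_s:
            self.run_length += 1  # arrived quickly enough to extend the run
        else:
            self.run_length = 1   # too slow: start a new run
        self.last_ts = ts
        if self.run_length >= self.threshold:
            self.flapping = True
        return self.flapping

    def on_tick(self, now: float) -> bool:
        """Periodic check; leave flapping mode after a quiet clear-interval."""
        if (self.flapping and self.last_ts is not None
                and now - self.last_ts >= self.clear_interval_s):
            self.flapping = False
            self.run_length = 0
        return self.flapping
```

With the default values, five events spaced ten seconds apart put the alarm into flapping mode on the fifth event, and four minutes without further events take it out again.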

Events Persistency

All events that are not dropped after the identification phase, and the relationship between these events, are stored in the system database. The content of the database can be reviewed using the Cisco ANA EventVision application.


Note Events are stored in the form of the Cisco ANA event object. The original notification structure of incoming event notifications (trap or syslog) is not maintained.


Archived Events

The stored events may be marked as archived. Archived events are persisted in the system database but can be presented to the operator only by using the Cisco ANA EventVision application.

An event is archived either because:

The whole correlation hierarchy that it was part of was marked as archived (see Ticket Management Operations).

The event was found not to relate to any other event, nor was it ticketable.

Tickets

As mentioned above, an alarm represents a scenario which involves a fault in the network, the managed element or the management system. A ticket represents the complete hierarchy of correlated alarms representing a single specific fault scenario. Both Cisco ANA NetworkVision and Cisco ANA EventVision display tickets and allow drilling down to view the consequent alarm hierarchy.

From an operator's point of view, a fault is always represented by a complete ticket. Operations such as Acknowledge or Remove are always applied to the whole ticket.

A ticket points to the root cause alarm, which is the topmost alarm in the correlation hierarchy. The attributes of the ticket, such as the short description, are derived from the root cause alarm.

Ticketable Event

A ticketable event is an event that becomes the root cause of a new ticket if it is not correlated to any other event.

An event is configured to be ticketable through the registry if the is-ticketable parameter for the event sub-type is set to True.

Ticket Severity

Each ticket assumes the propagated severity of the alarm with the highest severity among all the alarms at any level of the correlation hierarchy.

A ticket is referred to as Open as long as its severity is not cleared.
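Ticket severity propagation amounts to taking the maximum severity over the hierarchy. In the sketch below, the numeric ranking (Critical highest, Cleared lowest) is an assumption based on the order in which the severities are listed in this chapter; informational severity is omitted, and the function names are hypothetical.

```python
from typing import Iterable

# Assumed ranking, highest number = most severe.
SEVERITY_RANK = {"Cleared": 0, "Warning": 1, "Minor": 2, "Major": 3, "Critical": 4}


def ticket_severity(alarm_severities: Iterable[str]) -> str:
    """Highest severity among all alarms in the ticket's hierarchy."""
    return max(alarm_severities, key=SEVERITY_RANK.__getitem__)


def is_open(alarm_severities: Iterable[str]) -> bool:
    """A ticket is open as long as its severity is not Cleared."""
    return ticket_severity(alarm_severities) != "Cleared"
```

A ticket whose root cause alarm is Cleared but which still contains a Major alarm lower in the hierarchy therefore remains open with Major severity.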

Ticket Management Operations

The following management operations may be applied to a ticket either manually or through the system (northbound) API:

Acknowledge—Mark a ticket as acknowledged. It is used to distinguish between new faults and faults that are known or handled by the operation team.

Remove—Set the ticket and all the events in the hierarchy as archived. An archived ticket is removed from the display in Cisco ANA NetworkVision.

Clear—Set all un-cleared alarms in the hierarchy to cleared severity.

The Remove and Clear operations may also be performed automatically by the system. Ticket Auto-Remove and Ticket Auto Clear describe the mechanisms used for these automatic processes.

Bookkeeping Events

Cisco ANA also generates so-called bookkeeping events. When a ticket is archived or acknowledged a bookkeeping event is generated for all alarms that are correlated to the ticket.

Ticket Auto-Remove

Cisco ANA implements a process to remove tickets automatically. The process is launched periodically and scans all non-archived tickets. It removes a ticket automatically if all of the following conditions are met:

The ticket is cleared.

The time elapsed since the ticket was cleared is greater than the gw-correlation-timeout parameter.

The auto-remove parameter for the sub-type of the first event in the sequence of the root cause alarm is set to True.

The time elapsed since the last update to the ticket is greater than auto-remove-timeout, which is set to 88 minutes by default. Any change to the correlation hierarchy, or to the sequence of one of the alarms in the correlation hierarchy, is considered an update to the ticket.

The default value for the time interval to trigger the auto-remove process is one minute.
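The auto-remove conditions combine into one check per ticket. The sketch below is illustrative only: the Ticket fields and the should_auto_remove function are hypothetical, times are in seconds, and the 20-minute value used for gw-correlation-timeout is the default quoted earlier in this chapter.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    cleared: bool
    cleared_at: float      # time the ticket became cleared, in seconds
    last_update_at: float  # time of the last change to the hierarchy or sequences
    auto_remove: bool      # auto-remove parameter of the root cause's first event
    archived: bool = False


def should_auto_remove(
    ticket: Ticket,
    now: float,
    gw_correlation_timeout_s: float = 20 * 60,
    auto_remove_timeout_s: float = 88 * 60,
) -> bool:
    """True only if every auto-remove condition holds."""
    return (
        not ticket.archived
        and ticket.cleared
        and now - ticket.cleared_at > gw_correlation_timeout_s
        and ticket.auto_remove
        and now - ticket.last_update_at > auto_remove_timeout_s
    )
```

With the default values, the last condition is the binding one: a ticket that was cleared and last updated at the same moment becomes eligible only after 88 minutes.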


Note Once the ticket is archived, the events in its correlation hierarchy are no longer presented in Cisco ANA NetworkVision.


Cisco ANA implements an additional automatic process to keep the number of concurrent open (non-cleared) tickets below a predefined threshold. If the number of open tickets exceeds the threshold, the oldest tickets are removed and archived. This automatic process is part of the integrity test; see Appendix D, "Cisco ANA Integrity Service". The default threshold is 5000.

Ticket Auto Clear

There are situations when the root cause alarm is cleared but there are still non-cleared alarms in the correlation tree (hierarchy) of the ticket.

Non-cleared alarms may exist in the correlation tree for one of the following reasons:

The network event that caused the alarm creation has still not been fixed or the network event that caused the alarm creation was fixed, but the VNE has still not identified the change.

The network event that caused the alarm creation was fixed. A clearing notification (trap or syslog) associated with this event was sent from the device but did not reach Cisco ANA, or was not identified correctly by Cisco ANA.

The situation described in the second scenario is undesirable, which is why Cisco ANA supports a feature called auto clearing (the ticket is cleared automatically).

In the following example a ticket of a Link down alarm is shown.

Figure 1-10 Ticket of a Link Down Alarm

In this example, the alarm has already been cleared ("Link up") while the severity of the ticket is still Major (Orange).

The reason for this situation is hidden in the correlation tree of the ticket. The tree contains one Link down syslog alarm, which has not been cleared, because its clearing syslog event did not reach Cisco ANA.

Figure 1-11 Correlation Tree of the Link Down Ticket

The auto clear mechanism handles such situations automatically.

It is of course also possible to manually clear the ticket in Cisco ANA NetworkVision.

The auto clear attribute is set per event and is configured in Cisco ANA. It indicates whether this type of event can be cleared by the auto clear mechanism.

This mechanism runs on the gateway every minute and iterates over all tickets that are not archived. For each open ticket, it checks whether all of its events are either cleared or have the auto clear attribute set to true. If so, the mechanism auto clears the ticket.
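The per-ticket check can be sketched as follows. It assumes that the events examined are the latest event of each alarm in the ticket's correlation hierarchy; the Event class and the should_auto_clear function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    severity: str
    auto_clear: bool  # auto clear attribute configured for this event


def should_auto_clear(events: List[Event]) -> bool:
    """True if an open ticket holds only cleared or auto-clearable events."""
    ticket_is_open = any(e.severity != "Cleared" for e in events)
    return ticket_is_open and all(
        e.severity == "Cleared" or e.auto_clear for e in events
    )
```

In the Link down example above, the uncleared syslog alarm has the auto clear attribute set, so the ticket qualifies and is cleared on the next run.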

In Cisco ANA, all syslogs and traps have the auto clear attribute set to true, except in the following instances:

Syslogs and traps which are ticketable.

A few important syslogs and traps, which don't have a corresponding service alarm.

The process that periodically checks for open tickets is the same one used for auto-remove. Therefore, both operations share the same time interval, which is one minute by default.

Database Size Maintenance

To prevent the database from overflowing, Cisco ANA implements an automatic process that deletes old data. A configurable setting, the event history size, defines the period of time for which events are maintained. The oldest time for which events are maintained is the current time minus the event history size; any event older than this is deleted. The automatic process is part of the integrity test (see Appendix D, "Cisco ANA Integrity Service") and is activated periodically.
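The cutoff calculation is simple arithmetic, shown here as a short sketch. The function names are hypothetical, and the 7-day history size in the usage note is an arbitrary example value, not a Cisco ANA default.

```python
from datetime import datetime, timedelta
from typing import List


def purge_cutoff(now: datetime, event_history_size: timedelta) -> datetime:
    """Oldest time for which events are maintained."""
    return now - event_history_size


def events_to_delete(event_times: List[datetime], now: datetime,
                     event_history_size: timedelta) -> List[datetime]:
    """Timestamps of events that fall before the cutoff."""
    cutoff = purge_cutoff(now, event_history_size)
    return [t for t in event_times if t < cutoff]
```

For example, with a current time of 15 January and an event history size of 7 days, the cutoff is 8 January; an event from 1 January is deleted and an event from 10 January is kept.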

Event Flow Through Cisco ANA

Figure 1-12 Event Flow Through Cisco ANA

Incoming event notifications (traps, syslogs) are received by an internal event listener process, also known as AVM 100. The event listener stores the notifications in a VNE input buffer that corresponds to the network element from which the event notification came. The event notifications are then sent to the VNE at a fixed rate.

When a VNE input buffer in AVM 100 fills up, further events from this NE are dropped. This capability protects Cisco ANA from event storms and DoS attacks.
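The buffering behavior can be modeled with one bounded queue per network element. The sketch below is a simplified model, not the AVM 100 implementation: the EventListener class, its capacity parameter, and the drain method that stands in for the fixed-rate delivery to the VNE are all assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class EventListener:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity  # hypothetical per-NE buffer size
        self.buffers: Dict[str, Deque[str]] = defaultdict(deque)
        self.dropped: Dict[str, int] = defaultdict(int)

    def receive(self, ne: str, notification: str) -> bool:
        """Buffer a notification; drop it if the buffer for this NE is full."""
        buf = self.buffers[ne]
        if len(buf) >= self.capacity:
            self.dropped[ne] += 1
            return False
        buf.append(notification)
        return True

    def drain(self, ne: str, max_events: int) -> List[str]:
        """Hand at most max_events to the VNE, oldest first."""
        buf = self.buffers[ne]
        return [buf.popleft() for _ in range(min(max_events, len(buf)))]
```

Because each network element has its own buffer, a storm from one element fills only that element's buffer and does not cause notifications from other elements to be dropped.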

As discussed earlier, the VNE parses, identifies and processes the received notifications, drops flapping and other non-interesting events, expedites certain polling operations, attempts to correlate (see Chapter 2, "Causality Correlation and Root Cause Analysis"), and sends the incoming and generated events to the gateway.

Event rate limiters are used to prioritize the behavior of Cisco ANA during major outages and under heavy peak and flooding conditions. Each VNE has its own outgoing rate limiter; another rate limiter acts on the aggregate flow of events from all the VNEs coming into the gateway. See Figure 1-12.

The event rate limiters are configured for multiple commit and burst rates and durations per event type, based on priority and/or overall aggregate rates. Events that exceed the burst rates are dropped, and a special generated event is sent to the gateway, to inform the operator of this situation.
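This guide does not specify the algorithm behind the rate limiters. As a general illustration of limiting by a commit rate and a burst size, the sketch below uses a token bucket, which is a common technique and is substituted here as an assumption; the TokenBucket class and its parameters are not Cisco ANA configuration names.

```python
class TokenBucket:
    def __init__(self, commit_rate: float, burst: int) -> None:
        self.commit_rate = commit_rate  # sustained events per second
        self.burst = burst              # maximum events admitted back to back
        self.tokens = float(burst)
        self.last = 0.0
        self.dropped = 0

    def allow(self, now: float) -> bool:
        """Admit one event at time `now` (seconds), or count it as dropped."""
        # Refill tokens for the elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.commit_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.dropped += 1
        return False
```

In a fuller model, a nonzero dropped count would lead to the special generated event that informs the operator that events were discarded.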

The gateway stores the events it receives in the database, performs final alarm-association and ticket-management, and notifies the Northbound interface and Cisco ANA clients of the event.

Every minute, the gateway reviews all the tickets and looks for tickets to clear. The gateway clears a ticket if all its events are either cleared or have the auto-clear attribute. In addition, the gateway checks for tickets to archive (auto-remove). The archive timeout of a ticket is determined by its root cause event's archive timeout attribute.

The actual event rates and optimal configurations of the flow processing path are highly dependent on the network topology and deployed networking technologies and configurations, the number of network elements under management, the frequency of fault incidents, and many other factors. Exhaustive testing is required before changing default values.