Table Of Contents
Advanced Correlation Scenarios
Lab Setup for Scenarios
Device Unreachable Correlation Scenarios
Device Unreachable on Device Reload or Device Down Event
Description of Fault Scenario in the Network
Cisco ANA Failure Processing
Device Unreachable on Another Device Unreachable Event
Description of Fault Scenario in the Network
Cisco ANA Failure Processing
Device Unreachable on Link Down Event
Description of Fault Scenario in the Network
Cisco ANA Failure Processing
Multiroute Correlation Scenarios
Description of Fault Scenario in the Network
Cisco ANA Failure Processing
BGP Neighbor Loss Correlation Scenarios
BGP Neighbor Loss Due to Port Down
Description of Fault Scenario in the Network
Cisco ANA Failure Processing
BGP Link Down Scenarios
Description of Fault Scenario in the Network
Cisco ANA Failure Processing
HSRP Scenarios
HSRP Alarms
HSRP Example
IP Interface Failure Scenarios
IP Interface Status Down Alarm
Correlation of Syslogs and Traps
All IP Interfaces Down Alarm
IP Interface Failure Examples
Interface Example 1
Interface Example 2
Interface Example 3
Interface Example 4
Interface Example 5
ATM Examples
Ethernet, Fast Ethernet, Giga Ethernet Examples
Interface Example 6
Interface Example 7
Generic Routing Encapsulation (GRE) Tunnel Down/Up
GRE Tunnel Down/Up Alarm
GRE Tunnel Down Correlation Example 1
GRE Tunnel Down Correlation Example 2
Advanced Correlation Scenarios
This chapter describes the specific alarms that use advanced correlation logic on top of the root cause analysis flow:
•
Lab Setup for Scenarios—Describes the lab setup for the scenarios described in this chapter.
•
Device Unreachable Correlation Scenarios—Describes the Device unreachable alarm, its correlation and provides various examples.
•
Multiroute Correlation Scenarios—Describes support for multiroute scenarios and their correlation. In addition, it provides several examples.
•
BGP Neighbor Loss Correlation Scenarios—Describes the BGP alarms, and their correlation.
•
HSRP Scenarios—Describes the HSRP alarms and provides various examples.
•
IP Interface Failure Scenarios—Describes the ip interface status down alarm and its correlation. In addition, it describes the All ip interfaces down alarm, its correlation and provides several examples.
•
Generic Routing Encapsulation (GRE) Tunnel Down/Up—Provides an overview of GRE tunneling, describes the GRE tunnel alarm, and provides correlation examples.
Note
For a description of the conventions used when describing events, see Conventions Used in this Guide, page -xxi.
Lab Setup for Scenarios
This section describes the lab's functionality and the applied technologies. The diagram illustrates the lab setup for the scenarios described in this section.
Figure 6-1 Lab Setup
The lab simulates a network that is used by the Service Provider (SP). The core is based on MPLS technology and runs the OSPF routing protocol as IGP.
The P-network is topologically contiguous, whereas the C-network is delineated into a number of sites (contiguous parts of the customer network that are connected in some way other than through the VPN service). Note that a site does not need to be geographically contained.
The devices that link the customer sites to the P-network are called Customer Edge (CE) devices, whereas the service provider devices to which the CE routers connect are called Provider Edge (PE) devices. Where the provider manages an Ethernet access network, the CE routers are connected to the PEs, which are usually LAN switches with L3 capabilities.
The access network can be any L2 technology.
In this lab there are two L2 technologies in the access network, namely:
•
Ethernet
•
Frame Relay
The access network in the lab is unmanaged, namely, a cloud.
In most cases, the P-network is made up of more than just the PE routers. These other devices are called P devices (or, if the P-network is implemented with Layer 3 technology, P routers). Similarly, the additional Layer 3 devices at the customer sites that have no direct connectivity to the P-network are called C routers. In this example, C routers are not part of the Lab setup and are not managed by Cisco ANA.
The Customer Edge (CE) devices are located at the customer site and can be managed by the SP. All other devices, namely the PEs (Provider Edge), Ps, and RRs (Route Reflectors), are located at the SP's site and are maintained by the SP.
An end-to-end MPLS VPN solution is, like any other VPN solution, divided into the central P-network to which a large number of customer sites (sites in the C-network) are attached. The customer sites are attached to the PE devices (PE routers) through CE devices (CE routers). Each PE router contains several virtual routing and forwarding tables (VRFs), at least one per VPN customer. These tables are used together with multiprotocol BGP, which runs between the PE routers, to exchange customer routes and to propagate customer datagrams across the MPLS network. The PE routers perform the label imposition (ingress PE router) and removal (egress PE router). The central devices in the MPLS network (P routers) perform simple label switching.
BGP processes run on the PE devices, and each PE is a neighbor to both RR devices. This way, the lab has a backup if one RR is down.
All the devices are managed inband.
The management access point is Ethernet 0/0 on PE-East.
In order to have access to the CE devices a loop was created between two ports on PE-East.
Device Unreachable Correlation Scenarios
Device reachability is measured by management protocol connectivity. Connectivity tests are used to verify the connection between VNEs and the managed network elements. Connectivity is tested for each protocol that a VNE uses to poll a device. The protocols that Cisco ANA supports for connectivity tests are SNMP, Telnet, and ICMP.
There are three scenarios in which device reachability issues occur, namely:
•
Device Unreachable on Device Reload or Device Down Event
•
Device Unreachable on Another Device Unreachable Event
•
Device Unreachable on Link Down Event
For more information about device reachability see Chapter 5, "Causality Correlation and Root Cause Analysis".
Device Unreachable on Device Reload or Device Down Event
The diagram displays the Device unreachable on device down or reload setup.
Figure 6-2 Device Unreachable on Device Reload or Device Down Setup
Description of Fault Scenario in the Network
CE-5 goes down or is reloaded.
Related Faults
•
The port S.1/2 of CE-5 operationally goes down (between CE-5 and PE-East).
•
The port S.2/3 of PE-East operationally goes down (between PE-East and CE-5).
•
CE-5 is unreachable from the management subnet.
Note
Other related faults might occur due to the CE-5 down or reload. Syslogs and traps corresponding to network faults are also reported. Additional faults, other than for the connectivity issue of CE-5 and the Link down with the PE-East device, might be reported but are not described in this section. This section relates specifically to Device unreachable events.
Cisco ANA Failure Processing
Events Identification
The following service alarms are generated by the system:
•
The event [Device unreachable, CE-5] see Component unreachable, page 16-11.
The device unreachability event means that no other information can be collected from this device by the VNE.
•
The event [Link down on unreachable, PE-East<->CE-5] see Link down, page 16-23.
The Link down event is issued by the PE-East VNE (active) as a result of the link down negotiation process.
(Possible) Root Cause
Step 1
Waits two minutes. For more information see Correlation Process, page 5-3.
Step 2
The event:
•
[Device unreachable, CE-5] triggers the CE-5 VNE to initiate an IP-based flow to the management IP.
–
Flow Path—CE-5->PE-East->management subnet.
•
[Link down on unreachable, PE-East<->CE-5] triggers the CE-5 VNE to initiate local correlation.
Root Cause Selection
•
For the event [Device unreachable, CE-5]:
–
Collected Events—The event [Link down on unreachable, PE-East<->CE-5].
Note
Collects other possible events, for example, Interface status down events.
–
Root Cause—There is no root cause (opens a new ticket in the gateway).
Note
The root cause selection process activates special filtering for the event [Device unreachable, CE-5] for which the event [Link down on unreachable] cannot be selected as the root cause, therefore the event [Link down on unreachable, PE-East<->CE-5] will not be selected as the root cause.
•
For the event [Link down on unreachable, PE-East<->CE-5]:
–
Collected Events—The event [Device unreachable, CE-5].
–
Root Cause—Correlates to the event [Device unreachable, CE-5].
Figure 6-3 displays the events identified by the system in this scenario.
Figure 6-3 Device Unreachable on Device Down NetworkVision
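The special filtering described in the note above can be pictured as a rule table that is consulted during root cause selection. The following is a minimal Python sketch and not Cisco ANA code: the names EXCLUDED_ROOT_CAUSES and select_root_cause are hypothetical, and returning the first remaining candidate is an assumption made for brevity.

```python
from typing import List, Optional

# Hypothetical rule table: for a given event type, the event types that
# may never be chosen as its root cause (the special filtering in the note).
EXCLUDED_ROOT_CAUSES = {
    "Device unreachable": {"Link down on unreachable"},
}


def select_root_cause(event_type: str, collected: List[str]) -> Optional[str]:
    """Return the first collected event type allowed as a root cause.

    Returns None when no candidate survives the filter, which models
    "there is no root cause" (a new ticket is opened).
    """
    excluded = EXCLUDED_ROOT_CAUSES.get(event_type, set())
    for candidate in collected:
        if candidate not in excluded:
            return candidate
    return None


# Device down scenario: the only collected event is filtered out.
print(select_root_cause("Device unreachable", ["Link down on unreachable"]))
# The Link down on unreachable event has no such filter.
print(select_root_cause("Link down on unreachable", ["Device unreachable"]))
```

With this rule table, the Device unreachable event ends up without a root cause, while the Link down on unreachable event correlates to it, which matches the selection results listed above.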
Clearing Phase
When a down or reloaded device comes up again and starts responding to polling requests made by the corresponding VNE, the device is declared reachable thus clearing the unreachable alarm. Other related alarms are cleared in turn after the corresponding VNEs identify that the malfunctions have recovered.
Variation
In a device reload scenario the following additional events are identified by the system (in addition to the device down scenario):
•
[Reloading device syslog].
•
[Cold start trap].
The event [Device unreachable, CE-5]:
–
Additional Collected Events—The event [Reloading device syslog, CE-5].
–
Root Cause—Correlates to the event [Reloading device syslog, CE-5].
Figure 6-4 displays the events identified by the system in this scenario.
Figure 6-4 Device Unreachable on Device Reload NetworkVision
Device Unreachable on Another Device Unreachable Event
The diagram displays the Device unreachable on another Device unreachable event setup.
Figure 6-5 Device Unreachable on Another Device Unreachable Event Setup
Description of Fault Scenario in the Network
P-North device is reloaded.
Related Faults
•
P-North is unreachable from the management subnet.
•
The links of P-North operationally go down and as a result its surrounding devices go down.
•
RR2, accessed by the link P-North, RR2 (also known as L3) is unreachable.
Cisco ANA Failure Processing
Note
This scenario is similar to Device Unreachable on Device Reload or Device Down Event, except that in this scenario the L3 Link down is not discovered because both connected devices (RR2 and P-North) are unreachable by Cisco ANA, and therefore the VNE is unable to detect the Link down problem.
Events Identification
The following service alarms are generated by the system:
•
The event [Device unreachable, P-North] see Component unreachable, page 16-11.
The device unreachability event means that no other information can be collected from this device by the VNE.
•
The event [Device unreachable, RR2].
(Possible) Root Cause
Step 1
Waits two minutes.
Step 2
The event:
•
[Device unreachable, P-North] triggers the P-North VNE to initiate an IP-based flow to the management IP.
–
Flow Path—P-North->PE-East->management subnet
•
[Device unreachable, RR2] triggers the RR2 VNE to initiate an IP-based flow to the management IP.
–
Flow Path—RR2->P-North->PE-East->management subnet
Root Cause Selection
•
For the event [Device unreachable, P-North]:
–
Collected Events—The event [Reloading device syslog, P-North]
–
Root Cause—Correlates to the event [Reloading device syslog, P-North]
•
For the event [Device unreachable, RR2]:
–
Collected Events—The events [Device unreachable, P-North] and [Reloading device syslog, P-North]
–
Root Cause—Correlates to the event [Reloading device syslog, P-North] (as this has a higher weight than the event [Device unreachable, P-North]).
Figure 6-6 displays the events identified by the system in this scenario.
Figure 6-6 Device Unreachable on Other Device Unreachable NetworkVision
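The weight comparison used in the RR2 selection above can be sketched as picking the collected event with the highest weight. The numeric weights below are hypothetical placeholders chosen only to preserve the ordering stated in the text; the values that Cisco ANA actually uses are not given in this chapter.

```python
# Hypothetical weights; only their relative order reflects the text.
EVENT_WEIGHTS = {
    "Reloading device syslog": 20,
    "Device unreachable": 10,
}


def heaviest_event(collected):
    """Return the (event_type, source) pair with the highest weight."""
    return max(collected, key=lambda event: EVENT_WEIGHTS[event[0]])


# Events collected for [Device unreachable, RR2] in this scenario.
collected_for_rr2 = [
    ("Device unreachable", "P-North"),
    ("Reloading device syslog", "P-North"),
]
print(heaviest_event(collected_for_rr2))
```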
Clearing Phase
When the reloaded device comes up again, and with it the L3 link that is vital for RR2 management, RR2 starts responding to polling requests from the RR2 VNE. The device is declared reachable, thus clearing the Device unreachable alarm.
Device Unreachable on Link Down Event
Figure 6-7 displays the Device unreachable on a Link down event setup.
Figure 6-7 Device Unreachable on a Link Down Event Setup
Description of Fault Scenario in the Network
The S.2/3 port of PE-East connected to the S.1/2 port of the CE-5 device (also called L1 link) is set to administrative status down. This effectively takes the L1 Link down.
Related Faults
•
The CE-5 device is managed from this link with no backup. The L1 Link down renders the CE-5 Device unreachable from the management subnet.
Cisco ANA Failure Processing
Events Identification
The following service alarms are generated by the system:
•
The event [Device unreachable, CE-5] see Component unreachable, page 16-11.
The device unreachability event means that no other information can be collected from this device by the VNE.
•
[Link down due to admin down, PE-East<->CE-5] see Link down, page 16-23.
The Link down event is issued by the PE-East VNE (active) as a result of the Link down negotiation process.
Noncorrelating Events
•
[Link down due to admin down, PE-East<->CE-5] opens a new ticket in the gateway.
The L1 Link down event is configured to not correlate to other events at all. This is logical due to the fact that the edge VNEs identify the Link down events as [Link down due to admin down] events. This implies that the VNEs know the root cause of the event already, based on the administrator's configurations. The [Link down due to admin down] events reach the northbound immediately after the links' new statuses are discovered by Cisco ANA and after the link down negotiation methods are over.
(Possible) Root Cause
Step 1
Waits two minutes.
Step 2
The event:
•
[Device unreachable, CE-5] triggers the CE-5 VNE to initiate an IP-based flow to the management IP.
–
Flow Path—CE-5->PE-East->management subnet
Root Cause Selection
•
For the event [Device unreachable, CE-5]:
–
Collected Events—The event [Link down due to admin down, PE-East<->CE-5]
Note
Collects other possible events, for example, [Interface status down] events.
–
Root Cause—[Link down due to admin down, PE-East<->CE-5]
Figure 6-8 displays the events identified by the system in this scenario.
Figure 6-8 Device Unreachable on Link Down NetworkVision
Note
In Figure 6-7 port E.0/3 should read S.2/3 and E.0/2 should read S.1/2.
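The collection step for [Device unreachable, CE-5] in this scenario can be pictured as walking the flow path hop by hop and gathering the events raised on each link that is crossed. The sketch below is an illustration under stated assumptions: the data layout and the collect_events_on_path helper are hypothetical and do not represent how Cisco ANA stores its model.

```python
def collect_events_on_path(path, link_events):
    """Collect the events found on each link crossed by a flow path.

    path        -- ordered hop names, for example ["CE-5", "PE-East"]
    link_events -- maps frozenset({a, b}) to the events raised on link a<->b
    """
    collected = []
    for a, b in zip(path, path[1:]):
        collected.extend(link_events.get(frozenset((a, b)), []))
    return collected


# Hypothetical snapshot of the Link down scenario.
link_events = {
    frozenset(("PE-East", "CE-5")): ["Link down due to admin down"],
}
flow_path = ["CE-5", "PE-East", "management subnet"]
print(collect_events_on_path(flow_path, link_events))
```

Keying links by an unordered pair means that the same event is found regardless of the direction in which the flow crosses the link.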
Clearing Phase
When the PE-East port S.2/3 (L1 link) comes up again the CE-5 reachability from the management subnet comes back too. The CE-5 starts responding to polling requests from the CE-5 VNE. The device is declared reachable thus clearing the Device unreachable alarm. The L1 Link down is cleared when the PE-East device indicates that the status of the connected port has changed to up again.
Multiroute Correlation Scenarios
The diagram displays the lab multiroute configuration setup between the P-South, P-North and P-West devices. The OSPF cost is the same along the path from P-South to P-North whether or not it goes via P-West; that is, P-South and P-North connect along two paths with equal cost.
Figure 6-9 Multiroute Configuration Setup Between P-South, P-North and P-West
Description of Fault Scenario in the Network
In this example, the P-North, P-South link (also known as L2) goes down in a multiroute segment between P-South and P-North. After approximately one minute another link, L1 (PE-East, P-North), goes down too. Both links are taken down administratively: the first from the P-North device, and the latter from the PE-East device's port.
Related Faults
•
Almost all the devices are unreachable from the management subnet. This section focuses on CE-1 unreachability (see Figure 6-1).
Note
Syslogs and traps corresponding to network faults are also reported. Additional related faults might also be reported, but are not described in this section.
Cisco ANA Failure Processing
Events Identification
The following service alarms are generated by the system:
•
The event [Device unreachable, CE-1] see Component unreachable, page 16-11.
The device unreachability event means that no other information can be collected from this device by the VNE.
•
The event [Link down due to admin down, P-North<->PE-East] see Link down, page 16-23.
The Link down event is issued by the PE-East VNE (active) as a result of the link down negotiation process.
•
The event [Link down due to admin down, P-North<->P-South].
The Link down event is issued by the P-North VNE as a result of the link down negotiation process.
Noncorrelating Events
•
[Link down due to admin down, P-North<->PE-East] opens a new ticket in the gateway.
•
[Link down due to admin down, P-North<->P-South] opens a new ticket in the gateway.
For more information see Noncorrelating Events.
(Possible) Root Cause
Step 1
Waits two minutes.
Step 2
The event:
•
[Device unreachable, CE-1] triggers the CE-1 VNE to initiate an IP-based flow to the management IP.
–
Flow Path—CE-1->Cloud->PE-South->P-South->P-North->PE-East->management subnet.
–
Flow Path—CE-1->Cloud->PE-South->P-South->P-West->P-North->PE-East->management subnet.
Root Cause Selection
•
For the event [Device unreachable, CE-1]:
For the flow path—CE-1->Cloud->PE-South->P-South->P-North->PE-East->management subnet:
–
Collected Events—The event [Link down due to admin down, P-North<->PE-East] and the event [Link down due to admin down, P-South->P-North]
Note
Collects other possible events, for example, [Interface status down] events.
–
Root Cause
[Link down due to admin down, P-SouthS.1/0 >P-North S.1/0<->PE-East S.2/2]
[Link down due to admin down, P-NorthS.1/3 >PE-East S.2/2]
For the Flow Path—CE-1->Cloud->PE-South->P-South->P-West->P-North->PE-East->management subnet:
–
Root Cause for Flow Path—CE-1->Cloud->PE-South->P-South->P-West->P-North->PE-East->management subnet
[Link down due to admin down, P-North S.1/0<->PE-East S.2/2]
Note
The CE-1 VNE's root cause selection method identifies the L1 Link down event as the root cause of the Device unreachable event. According to the selection logic, when two flows split and produce two sets of possible root cause events, any set that is a superset of another is removed (depending on whether both flows end at the same location). The sets that are not removed are united into one set containing all of their events. In this scenario, the set that includes both links is removed because it is a superset of the set that contains only the L1 link.
Note
All devices that are unreachable correlate their unreachability events to the L1 link as expected.
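The set handling described in the notes above can be sketched in a few lines of Python. This is a minimal illustration, assuming that both flows end at the same location (the management subnet); the function name merge_root_cause_sets and the event strings are hypothetical.

```python
def merge_root_cause_sets(candidate_sets):
    """Drop every set that is a proper superset of another, then unite the rest."""
    sets = [frozenset(s) for s in candidate_sets]
    # "other < s" is true when other is a proper subset of s,
    # that is, when s is a proper superset of other.
    kept = [s for s in sets if not any(other < s for other in sets)]
    return frozenset().union(*kept)


l1_down = "Link down due to admin down, P-North<->PE-East"
l2_down = "Link down due to admin down, P-North<->P-South"

direct_flow = {l1_down, l2_down}   # path through the P-South, P-North link
via_p_west = {l1_down}             # path through P-West

print(merge_root_cause_sets([direct_flow, via_p_west]))
```

Only the L1 event remains, because the set collected on the direct path contains everything in the set collected on the path via P-West. Two sets that do not contain one another would both be kept and united.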
Figure 6-10 displays the events identified by the system in this scenario (L1).
Figure 6-10 Multiroute Scenario NetworkVision - L1
Figure 6-11 displays the events identified by the system in this scenario (L2).
Figure 6-11 Multiroute Scenario NetworkVision - L2
Clearing Phase
Enabling the L1 link makes the CE-1 device reachable from the management subnet. This alone clears the Device unreachable event of the CE-1 device. When the L1 link's new status is discovered by Cisco ANA, the PE-East device eventually initiates a link up event for this link. When the administrator enables the L2 link and Cisco ANA discovers this change, the L2 Link down event is cleared by its matching link up event.
BGP Neighbor Loss Correlation Scenarios
The VNE models the BGP connection between routers and actively monitors its state. BGP neighbor loss events are generated from both sides of the connection only in the case of a connectivity loss, and where the other side of the link is unmanaged.
The correlation engine identifies various faults that affect the BGP connection and reports them as the root cause for the BGP neighbor loss alarm, for example, Link down, CPU over utilized, and link data loss.
Figure 6-12 Setup for BGP Neighbor Loss Correlation Scenarios
Note
In Figure 6-12 the link between P-West and PE-North-West is not real and merely emphasizes how PE-North-West is connected in the network.
There are two main scenarios that might lead to a BGP neighbor loss event:
•
BGP neighbor loss due to a Link down (or an equivalent port down).
•
BGP neighbor loss due to BGP process down or device down.
BGP Neighbor Loss Due to Port Down
Figure 6-13 displays the BGP neighbor loss due to port down scenario.
Figure 6-13 BGP Neighbor Loss Due to Physical Port Down (P-West -> PE-North-West)
Description of Fault Scenario in the Network
In Figure 6-13 the BGP neighbor loss occurs due to a physical port down (in P-West that connects to PE-North-West). The relevant devices are PE-North-West, RR2, P-North and P-West.
Related Faults
•
Port on P-West that is connected to the PE-North-West goes down.
•
BGP neighbor, on RR2, to PE-North-West changes state from Established to Idle.
Note
Syslogs and traps corresponding to network faults are also reported. Additional related faults might also be reported, but are not described in this section.
Cisco ANA Failure Processing
Events Identification
The following service alarms are generated by the system:
•
The event [BGP neighbor loss, RR2] see BGP neighbor loss, page 16-5.
Since the VNE that monitors each PE or RR holds records of the entire device's BGP information, the change in the BGP table is identified by the VNE and causes it to send this event.
(Possible) Root Cause
Step 1
Waits two minutes. For more information, see Correlation Process, page 5-3.
Step 2
The event:
•
[BGP neighbor loss, RR2] triggers the VNE to initiate an IP-based flow to the destination IP of its lost BGP neighbor, namely, PE-North-West.
–
Flow Path—RR2->P-North->P-West->P-West port is connected to PE-North-West (which is unmanaged), and is in a down state.
Root Cause Selection
•
For the event [BGP neighbor loss, RR2]:
–
Collected Events—The event [Port down, P-West].
–
Root Cause—Correlates to the event [Port down, P-West].
Figure 6-14 displays the events identified by the system in this scenario.
Figure 6-14 BGP Neighbor Loss Due to Physical Port Down
Clearing Phase
When a [Port up] event is detected by the system for the same port that was detected as the root cause for the [BGP neighbor loss] event, the alarm is cleared. The ticket is cleared (colored green) when all the alarms in the ticket have been cleared.
Figure 6-15 displays the up event that clears all the down events identified by the system.
Figure 6-15 BGP Neighbor Up Event that Clears All the Down Events
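The ticket rule in the clearing phase above (a ticket is cleared only when all of its alarms are cleared) can be sketched as follows. The Alarm and Ticket classes are hypothetical, and how each alarm receives its clearing event is outside the scope of this sketch. Treating a ticket with no alarms as not cleared is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    name: str
    cleared: bool = False


@dataclass
class Ticket:
    alarms: list = field(default_factory=list)

    def is_cleared(self):
        """A ticket is cleared only when every alarm in it is cleared."""
        return bool(self.alarms) and all(a.cleared for a in self.alarms)


port_down = Alarm("Port down, P-West")
neighbor_loss = Alarm("BGP neighbor loss, RR2")
ticket = Ticket([port_down, neighbor_loss])

port_down.cleared = True        # a Port up event arrives for the same port
print(ticket.is_cleared())      # one alarm in the ticket is still open
neighbor_loss.cleared = True    # the correlated alarm is cleared as well
print(ticket.is_cleared())
```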
Variation
In a BGP process down scenario, the following additional events are identified by the system (in addition to the BGP neighbor loss event):
•
[BGP process down] see BGP process down, page 16-7.
In Figure 6-16 the BGP process down causes several events (the [BGP neighbor loss] event cannot be seen). The relevant devices are RR2 ([BGP process down], marked in red), and PE-North-West (marked as unmanaged).
Figure 6-16 BGP Process Down Causes Several Events
The event [BGP neighbor loss, RR2]:
•
Additional Collected Events—The event [BGP process down, RR2], [BGP neighbor loss, RR2].
•
Root cause—Correlates to the event [BGP process down, RR2].
Figure 6-17 displays the events identified by the system in this scenario.
Figure 6-17 BGP Process Down Correlation
BGP Link Down Scenarios
Figure 6-18 illustrates the setup for the BGP link down scenarios described in this section.
Figure 6-18 Setup for BGP Link Down Scenarios
Figure 6-19 illustrates the setup for the scenarios with only the BGP links displayed.
Figure 6-19 Setup for Scenarios With Only the BGP Links Displayed
The VNE models the BGP connection between routers and actively monitors its state. In case the connectivity is lost and a link between the devices exists in the VNE, then and only then a [BGP link down] event is created. In other words, a [BGP link down] event is created only if both sides of the link are managed.
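The distinction between the two BGP alarms can be summarized as a small decision function. This is a sketch under the assumption that the local side is always managed (it is the VNE that observes the loss); the function name bgp_alarm_type is hypothetical.

```python
def bgp_alarm_type(connectivity_lost, peer_managed):
    """Return the alarm raised for a BGP connection, or None if it is healthy.

    peer_managed -- True when the other side of the BGP connection is also
                    managed, so that a link between the devices exists in
                    the model.
    """
    if not connectivity_lost:
        return None
    return "BGP link down" if peer_managed else "BGP neighbor loss"


print(bgp_alarm_type(True, peer_managed=True))
print(bgp_alarm_type(True, peer_managed=False))
```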
A BGP link might be disconnected in the following scenarios:
•
The BGP process goes down on a certain device, causing all the BGP links that were connected to that device to disconnect.
•
Disconnection of a physical link (path) that causes one side of the logical BGP link to become unreachable.
•
A device that becomes unreachable, due to reload or because it is shutting down. This causes all the links to the device to be lost, including the BGP links.
Description of Fault Scenario in the Network
Due to a physical link down, the BGP connection between PE-North-West and RR2 is lost.
Related Faults
•
Port that is connected to the P-North goes down.
•
Port that is connected to the RR2 goes down.
•
BGP link between RR2 and PE-North-West is disconnected.
Note
Syslogs and traps corresponding to network faults are also reported. Additional related faults might also be reported, but are not described in this section.
Figure 6-20 reflects the BGP link down due to physical link down scenario. The relevant devices are RR2, P-North, P-West and PE-North-West.
Figure 6-20 BGP Link Down Due to Physical Link Down
Cisco ANA Failure Processing
This section includes the following:
•
Events Identification
•
(Possible) Root Cause
•
Root Cause Selection
•
Clearing Phase
•
Variation
Events Identification
The following service alarms are generated by the system:
•
The event [BGP link down, RR2<->PE-North-West] see BGP link down, page 16-4. This event might be revealed in one of two ways, namely:
–
Changes in the BGP neighbor list in the device are found after polling
–
Syslogs suggest that something has changed in the device's BGP neighbors or process
This actually causes an acceleration of the polling for the BGP neighbor data on the device.
(Possible) Root Cause
Step 1
Waits two minutes. For more information see Correlation Process, page 5-3.
Step 2
The event:
•
[BGP link down, RR2<->PE-North-West] triggers the RR2 VNE to initiate two IP-based flows: one from its routing entity to the destination IP of its lost BGP neighbor, namely, PE-North-West, and one from the destination IP of its lost BGP neighbor back to RR2.
–
Flow Path—RR2->P-North->PE-North-West
–
Flow Path—RR2->PE-North-West -> P-North -> RR2
Root Cause Selection
•
For the event [BGP link down, RR2<->PE-North-West]:
–
Collected Events—The event [Link down, P-North<->RR2], and the event [BGP link down, RR2<->PE-North-West]
–
Root Cause—Correlates to the event [Link down, P-North<->RR2]
Figure 6-21 displays the events identified by the system in this scenario.
Figure 6-21 BGP Link Down Correlation to the Root Cause of Physical Link Down
Clearing Phase
A [BGP link up] event arrives when the root cause event is fixed so that the network is repaired. This clearing event is created after a clearing syslog arrives or after the next polling result reestablishes the BGP connection.
Figure 6-22 displays the up event that clears all the tickets identified by the system.
Figure 6-22 BGP Link Up Clears All the Tickets
Variation
In a managed network, the following additional events might be identified by the system (in addition to the BGP link down event):
•
[BGP process down]
•
[Device unreachable]
[BGP process down]
Figure 6-23 displays the scenario where the event [BGP process down] causes [BGP link down] events.
Figure 6-23 BGP Process Down Causes BGP Link Down Events
The event [BGP link down, RR2<->PE-North-West]:
•
Additional Collected Events—The event [BGP process down, RR2] see BGP process down, page 16-7.
•
Root cause—Correlates to the event [BGP process down, RR2].
Figure 6-24 displays the events identified by the system in this scenario.
Figure 6-24 BGP Process Down Correlation
[Device unreachable]
The event [BGP link down, RR2<->PE-North-West]:
•
Additional Collected Events—The event [Device unreachable, RR2].
•
Root cause—Correlates to the event [Device unreachable, RR2].
In an unmanaged network core, the following additional events might be identified by the system (in addition to the BGP link down event):
•
[BGP process down]
•
[Device unreachable]
Figure 6-25 Setup Without the Core
[BGP process down]
Note
The event [BGP process down] occurs on the managed PE-South.
In Figure 6-26, the event [BGP process down] on PE-South causes [BGP link down] events. The relevant devices are PE-South, RR1 and RR2.
Figure 6-26 BGP Process Down on PE-South Causes BGP Link Down Events
The event [BGP link down, PE-South<->RR2] see BGP link down, page 16-4:
•
Additional Collected Events—The event [BGP process down, PE-South], [BGP link down, PE-South<->RR1].
•
Root cause—Correlates to the event [BGP process down, PE-South].
Figure 6-27 displays the events identified by the system in this scenario.
Figure 6-27 BGP Process Down Correlation
[Device unreachable]
For the event [Device unreachable] one or more PEs report on BGP connectivity loss to a neighbor that is unreachable.
In Figure 6-28, the Device unreachable on an unmanaged core causes multiple [BGP link down] events. The relevant devices are RR2 (Device unreachable), RR1, PE-East, PE-South.
Figure 6-28 Device Unreachable on Unmanaged Core Causes Multiple BGP Link Down Events
The event [BGP link down, RR2<->PE-South] see BGP link down, page 16-4:
•
Additional Collected Events—The event [Device unreachable, RR2], [BGP link down, RR2<->RR1].
•
Root cause—Correlates to the event [Device unreachable, RR2].
Figure 6-29 displays the events identified by the system in this scenario.
Figure 6-29 Device Unreachable on Unmanaged Core Correlation
The event [Device unreachable, RR2] (see Figure 6-30) see Component unreachable, page 16-11:
•
Additional Collected Events—The event [BGP link down, RR2<->PE-South].
•
Root cause—Correlates to the event [BGP link down, RR2<->PE-South].
Figure 6-30 Device Unreachable on CE
HSRP Scenarios
This section includes:
•
HSRP Alarms
•
HSRP Example
HSRP Alarms
When an active Hot Standby Router Protocol (HSRP) group's status changes, a service alarm is generated and a syslog is sent.
Table 6-1 HSRP Service Alarms
Alarm | Ticketable? | Correlation allowed? | Correlated to | Severity
Primary HSRP interface is not active / Primary HSRP interface is active (see HSRP group status changed, page 16-17) | Yes | No | Can be correlated to several other alarms, for example, link down | Major
Secondary HSRP interface is active / Secondary HSRP interface is not active (see HSRP group status changed, page 16-17) | Yes | No | Can be correlated to several other alarms, for example, link down | Major
Note
HSRP group information can be viewed in the Inventory window of Cisco ANA NetworkVision.
HSRP Example
In Figure 6-31, the link between Router 2 and Switch 2 is shut down causing the HSRP standby group on Router 3 to become active, and a link-down service alarm is generated. The primary HSRP group on Router 2 is not active anymore. A service alarm is generated and correlated to the link-down alarm. Router 2 also sends a syslog which is correlated to the link-down alarm.
The secondary HSRP group configured on Router 3 now changes from standby to active. This network event triggers an IP-based active flow with the destination being the virtual IP address configured in the HSRP group. When the flow reaches its destination, a service alarm is generated and correlated to the link-down alarm. Router 3 also sends a syslog that is correlated to the link-down alarm.
Figure 6-31 Example
In this case, the system provides the following report:
•
Root cause—Link down (Router 2-Switch 2)
•
Correlated events:
–
Primary HSRP interface is not active (source: Router 2)
–
%HSRP-6-STATECHANGE: FastEthernet0/0 Grp 1 state Active -> Speak (source: Router 2)
–
Secondary HSRP interface is active (source: Router 3)
–
%STANDBY-6-STATECHANGE: Ethernet0/0 Group 1 state Standby -> Active (source: Router 3)
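To correlate the two syslogs in the report above, a receiver first has to extract the interface, the group and the state transition from each message. The following sketch parses only the two message shapes quoted in this example; the pattern and the parse_statechange name are assumptions, and messages from other software releases might be formatted differently.

```python
import re

# Matches only the two message shapes quoted in the example report.
STATECHANGE = re.compile(
    r"%(?P<facility>HSRP|STANDBY)-(?P<severity>\d)-STATECHANGE: "
    r"(?P<interface>\S+) (?:Grp|Group) (?P<group>\d+) "
    r"state (?P<old>\w+) -> (?P<new>\w+)"
)


def parse_statechange(line):
    """Return the parsed fields as a dict, or None if the line does not match."""
    match = STATECHANGE.search(line)
    return match.groupdict() if match else None


router2 = "%HSRP-6-STATECHANGE: FastEthernet0/0 Grp 1 state Active -> Speak"
router3 = "%STANDBY-6-STATECHANGE: Ethernet0/0 Group 1 state Standby -> Active"
for line in (router2, router3):
    print(parse_statechange(line))
```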
IP Interface Failure Scenarios
This section includes:
•
IP Interface Status Down Alarm
•
All IP Interfaces Down Alarm
•
IP Interface Failure Examples
IP Interface Status Down Alarm
Alarms related to subinterfaces, for example, line-down trap, line-down syslog, and so on, are reported on IP interfaces configured above the relevant subinterface. This means that in the system, subinterfaces are represented by the IP interfaces configured above them. All events sourcing from subinterfaces without a configured IP are reported on the underlying Layer 1.
An [ip interface status down] alarm is generated when the status of an IP interface (whether it is configured over an interface or a subinterface) changes from up to down or to any other non-operational state (see Table 6-2). All events sourced from the subinterfaces correlate to this alarm. In addition, an [All ip interfaces down] alarm is generated when all the IP interfaces above a physical port change state to down. For more information, see Interface status, page 16-18.
Table 6-2 IP Interface Status Down Alarm
Name | Description | Ticketable | Correlation allowed | Correlated to | Severity
Interface status down/up | Sent when an IP interface changes oper status to "down"/"up" | Yes | Yes | Link Down/Device unreachable | Major
The alarm's description includes the full name of the IP interface, for example Serial0.2 (including the subinterface identifier, if it is a subinterface), and the source of the alarm points to the IP interface (and not to Layer 1).
All syslogs and traps indicating changes in subinterfaces (above which an IP is configured) correlate to the [ip interface status down] alarm. The source of these events is the IP interface. Syslogs and traps that indicate problems in Layer1 (that do not have a subinterface qualifier in their description) are sourced to Layer1.
Note
In case a syslog or trap is received from a subinterface that does not have an IP configured above it, the source of the created alarm is the underlying Layer 1.
For example:
•
Line-down trap (for subinterface).
•
Line-down syslogs (for subinterface).
For events that occur on subinterfaces:
•
When sending the information northbound, the system puts the full subinterface name in the source field, as given by the ifDescr/ifName OID (for example, Serial0/0.1 and not Serial0/0 DLCI 50).
•
The source of the alarm is the IP interface configured above the subinterface.
•
If there is no IP configured, the source is the underlying Layer 1.
If the main interface goes down, the traps and syslogs of all related subinterfaces are correlated as child tickets to the parent ticket of the main interface.
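The sourcing rules above can be sketched in Python. This is only an illustration, not Cisco ANA code: the function name resolve_alarm_source, the "IP:" and "Layer1:" labels, and the assumption that a subinterface is recognized by a ".N" suffix on the interface name are all hypothetical.

```python
import re
from typing import Set

# Assumption: a subinterface name ends in ".<number>", e.g. "Serial0/0.1".
_SUBINTERFACE = re.compile(r"^(?P<parent>[^.]+)\.(?P<sub>\d+)$")


def resolve_alarm_source(event_interface: str, ip_interfaces: Set[str]) -> str:
    """Return the component that an event should be sourced to.

    ip_interfaces holds the names of interfaces that have an IP configured.
    """
    match = _SUBINTERFACE.match(event_interface)
    if match is None:
        # No subinterface qualifier: the event describes a Layer 1 problem.
        return f"Layer1:{event_interface}"
    if event_interface in ip_interfaces:
        # Subinterface with an IP configured above it: source is the IP interface.
        return f"IP:{event_interface}"
    # Subinterface without an IP: fall back to the underlying Layer 1.
    return f"Layer1:{match.group('parent')}"
```

For example, with an IP configured only on Serial0/0.1, an event on Serial0/0.1 resolves to the IP interface, while events on Serial0/0.2 or on Serial0/0 itself resolve to Layer 1 of Serial0/0.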
The following technologies are supported:
•
Frame Relay/HSSI
•
ATM
•
Ethernet, Fast Ethernet, Gigabit Ethernet
•
POS
•
Channelized Optical Carrier (CHOC)
Correlation of Syslogs and Traps
When receiving a trap or syslog for the subinterface level, immediate polling of the status of the relevant IP interface occurs and a polled parent event (for example, ip interface status down) is created. The trap or syslog is correlated to this alarm.
In a multipoint setup where only some circuits under an IP interface go down, and the state of the IP interface does not change to down, no [ip interface status down] alarm is created. Instead, all the circuit-down syslogs correlate by flow to the possible root cause, for example, Device unreachable on a Customer Edge (CE) device.
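The two cases described in this section can be sketched as follows. This is a simplified model under stated assumptions, not the product's implementation: the Alarm class and the poll_status and find_root_cause_by_flow callbacks are hypothetical stand-ins for the polling and flow mechanisms.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Alarm:
    name: str
    source: str
    children: List["Alarm"] = field(default_factory=list)


def correlate_subinterface_event(
    event: Alarm,
    ip_interface: str,
    poll_status: Callable[[str], str],
    find_root_cause_by_flow: Callable[[Alarm], Optional[Alarm]],
) -> Optional[Alarm]:
    """Return the alarm that the trap or syslog was correlated to, if any."""
    # A trap or syslog on the subinterface triggers an immediate status poll.
    if poll_status(ip_interface) != "up":
        # The IP interface is not operational: create the polled parent event
        # and correlate the trap or syslog to it.
        parent = Alarm("ip interface status down", ip_interface)
        parent.children.append(event)
        return parent
    # The IP interface is still up (for example, only some multipoint circuits
    # failed): no parent alarm is created, so correlate by flow instead.
    root = find_root_cause_by_flow(event)
    if root is not None:
        root.children.append(event)
    return root
```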
All IP Interfaces Down Alarm
•
When all the IP interfaces configured above a physical interface change their state to down, the All ip interfaces down alarm is sent.
•
When at least one of the IP interfaces changes its state to up, a clearing (active ip interfaces found) alarm is sent.
•
The ip interface status down alarm for each of the failed IP interfaces is correlated to the All ip interfaces down alarm.
Note
When an All ip interfaces down alarm is cleared by the Active ip interfaces found alarm but there are still correlated ip interface status down alarms for some IP interfaces, the severity of the parent ticket is the highest severity among all the correlated alarms. For example, if there is an uncleared interface status down alarm, the severity of the ticket remains major, even though the Active ip interfaces found alarm has a cleared severity.
For more information, see Table 6-3 and All ip interfaces down, page 16-3.
Table 6-3 All IP Interfaces Down
Name | Description | Ticketable | Correlation allowed | Correlated to | Severity
All ip interfaces down/Active ip interfaces found | Sent when all the IP interfaces configured above a physical port change their operational status to down | Yes | Yes | Link Down | Major
The All ip interfaces down alarm is sourced to the Layer 1 component. All alarms from "the other side" (for example, Device unreachable) correlate to the All ip interfaces down alarm.
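The raise and clear conditions for this alarm, and the ticket severity rule from the note, can be sketched as below. The function names and the full severity scale in SEVERITY_ORDER are assumptions made for illustration. The text above only states that cleared ranks below major.

```python
from typing import Dict, List, Optional

# Assumed severity scale, lowest to highest (only "cleared" < "major" is given).
SEVERITY_ORDER = ["cleared", "warning", "minor", "major", "critical"]


def all_down(statuses: Dict[str, str]) -> bool:
    """True when the port has IP interfaces and every one of them is down."""
    return bool(statuses) and all(s == "down" for s in statuses.values())


def port_alarm_on_change(
    before: Dict[str, str], after: Dict[str, str]
) -> Optional[str]:
    """Compare IP interface states above one physical port before and after a change."""
    if not all_down(before) and all_down(after):
        return "All ip interfaces down"
    if all_down(before) and any(s == "up" for s in after.values()):
        # At least one IP interface came back up: send the clearing alarm.
        return "Active ip interfaces found"
    return None


def ticket_severity(correlated_severities: List[str]) -> str:
    """The ticket takes the highest severity among its correlated alarms."""
    return max(correlated_severities, key=SEVERITY_ORDER.index)
```

With one cleared alarm and one uncleared major alarm in the ticket, ticket_severity returns "major", which matches the behavior described in the note.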
IP Interface Failure Examples
Note
In all the examples that follow, it is assumed that problems that occur in the unmanaged cloud, or on the other side of the cloud (for example, a CE device that is unreachable from a provider edge (PE) device), cause the state of the relevant IP interfaces to change to down. This in turn causes the ip interface status down alarm to be sent.
If this is not the case, as in some Ethernet networks, and there is no change to the state of the IP interface, all the events on the subinterfaces that are capable of correlation flow will try to correlate to other possible root causes, including [cloud problem].
Interface Example 1
In Figure 6-32 there is multipoint connectivity between a PE and a number of CEs through an unmanaged Frame Relay network. All the CEs (Router2 and Router3) have logical connectivity to the PE through a multipoint subinterface on the PE (Router10). The keepalive option is enabled for all circuits. A link is disconnected inside the unmanaged network, which causes all the CEs to become unreachable.
Figure 6-32 Interface Example 1
The following failures are identified in the network:
•
A Device unreachable alarm is generated for each CE.
•
An ip interface status down alarm is generated for the multipoint IP interface on the PE.
The following correlation information is provided:
•
The root cause is IP interface down.
•
All the Device unreachable alarms are correlated to the ip interface status down alarm on the PE.
Interface Example 2
In Figure 6-33 there is point-to-point connectivity between a PE and a CE through an unmanaged Frame Relay network. CE1 became unreachable, and the status of the IP interface on the other side (on the PE1) changed state to down. The "keepalive" option is enabled. The interface is shut down between the unmanaged network and CE1.
Figure 6-33 Interface Example 2
The following failures are identified in the network:
•
A Device unreachable alarm is generated on the CE.
•
An ip interface status down alarm is generated on the PE.
The following correlation information is provided:
•
The root cause is Device unreachable:
–
The ip interface status down alarm is correlated to the Device unreachable alarm.
–
The syslogs and traps for the related subinterfaces are correlated to the ip interface status down alarm.
Interface Example 3
In Figure 6-34 there is a failure of multiple IP interfaces above the same physical port (mixed point-to-point and multipoint Frame Relay connectivity). CE1 (Router2) has a point-to-point connection to PE1 (Router10). CE1 and CE2 (Router3) have multipoint connections to PE1. The IP interfaces on PE1 that are connected to CE1 and CE2 are all configured above Serial0/0. The "keepalive" option is enabled. A link is disconnected inside the unmanaged network, which causes all the CEs to become unreachable.
Figure 6-34 Interface Example 3
The following failures are identified in the network:
•
All the CEs become unreachable.
•
An ip interface status down alarm is generated for each IP interface above Serial0/0 that has failed.
The following correlation information is provided:
•
The root cause is All IP interfaces down on Serial0/0 port:
–
The ip interface status down alarms are correlated to the All IP interfaces down alarm.
–
The Device unreachable alarms are correlated to the All IP interfaces down alarm.
–
The syslogs and traps for the related subinterfaces are correlated to the All IP interfaces down alarm.
Interface Example 4
In Figure 6-35 there is a failure of multiple IP interfaces above the same physical port (mixed point-to-point and multipoint Frame Relay connectivity). CE1 (Router2) has a point-to-point connection to PE1 (Router10). CE1 and CE2 (Router3) have multipoint connections to PE1. The IP interfaces on PE1 that are connected to CE1 and CE2 are all configured above Serial0/0. The "keepalive" option is enabled.
A link is disconnected inside the unmanaged network that has caused all the CEs to become unreachable. In a situation where a Link down occurs, whether it involves a cloud or not, the link failure is considered to be the most probable root cause for any other failures. In this example, a link is disconnected between the unmanaged network and the PE.
Figure 6-35 Interface Example 4
The following failures are identified in the network:
•
A Link down alarm is generated on Serial0/0.
•
A Device unreachable alarm is generated for each CE.
•
An ip interface status down alarm is generated for each IP interface above Serial0/0.
•
An All IP interfaces down alarm is generated on Serial0/0.
The following correlation information is provided:
•
The Device unreachable alarms are correlated to the link-down alarm.
•
The ip interface status down alarms are correlated to the link-down alarm.
•
The All IP interfaces down alarm is correlated to the link-down alarm.
•
All the traps and syslogs for the subinterfaces are correlated to the link-down alarm.
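The outcomes of Interface Examples 3 and 4 can be summarized in a small sketch. It is a deliberate simplification that covers only these two examples: the preference list and the function name are assumptions, and the sketch does not model the flow-based correlation behind Interface Examples 1 and 2.

```python
from typing import Dict, List

# Assumed preference order, taken from Interface Examples 3 and 4 only:
# a Link down wins, otherwise All ip interfaces down on the port is the root.
ROOT_CAUSE_PREFERENCE = ["Link down", "All ip interfaces down"]


def correlate_port_alarms(alarms: List[str]) -> Dict[str, List[str]]:
    """Map the chosen root cause to the alarms correlated to it."""
    for candidate in ROOT_CAUSE_PREFERENCE:
        if candidate in alarms:
            return {candidate: [a for a in alarms if a != candidate]}
    # Neither candidate is present: this sketch makes no decision.
    return {}
```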
Interface Example 5
In Figure 6-36 on the PE1 device that has multipoint connectivity, one of the circuits under the IP interface has gone down and the CE1 device which is connected to it has become unreachable. The status of the IP interface has not changed and other circuits are still operational.
Figure 6-36 General Interface Example
The following failures are identified in the network:
•
A Device unreachable alarm is generated on CE1.
•
A syslog alarm is generated notifying the user about a circuit down.
The following correlation information is provided:
•
Device unreachable on the CE:
–
The syslog alarm is correlated by flow to the Device unreachable alarm on CE1.
ATM Examples
Similar examples involving ATM technology have the same result, assuming that a failure in an unmanaged network causes the status of the IP interface to change to down (ILMI is enabled).
Ethernet, Fast Ethernet, Giga Ethernet Examples
This section includes the following examples:
•
There is an unreachable CE due to a failure in the unmanaged network.
•
There is a Link down on the PE that results in the CE becoming unreachable.
Interface Example 6
In Figure 6-37 there is an unreachable CE due to a failure in the unmanaged network.
Figure 6-37 Interface Example 6
The following failures are identified in the network:
•
A Device unreachable alarm is generated on the CE. For more information see Component unreachable, page 16-11.
•
A cloud problem alarm is generated. For more information see Cloud problem, page 16-10.
The following correlation information is provided:
•
No alarms are generated on a PE for Layer1, Layer2 or for the IP layers.
•
The Device unreachable alarm is correlated to the cloud problem alarm.
Interface Example 7
In Figure 6-38 there is a Link down on the PE that results in the CE becoming unreachable.
Figure 6-38 Interface Example 7
The following failures are identified in the network:
•
A Link down alarm is generated on the PE. For more information see Link down, page 16-23.
•
An ip interface status down alarm is generated on the PE. For more information see Interface status, page 16-18.
•
A Device unreachable alarm is generated on the CE. For more information see Component unreachable, page 16-11.
The following correlation information is provided:
•
Link down on the PE:
–
The ip interface status down alarm on the PE is correlated to the Link down alarm.
–
The Device unreachable alarm on the CE is correlated to the Link down alarm on the PE.
–
The traps and syslogs for the subinterface are correlated to the Link down alarm on the PE.
Generic Routing Encapsulation (GRE) Tunnel Down/Up
Generic Routing Encapsulation (GRE) is a tunneling protocol that encapsulates a variety of network layer packets inside IP tunneling packets, creating a virtual point-to-point link to devices at remote points over an IP network. It is commonly used on the Internet, together with IPsec, to build secure virtual private networks (VPNs). GRE encapsulates the entire original packet with a standard IP header and GRE header before the IPsec process. GRE can carry multicast and broadcast traffic, making it possible to configure a routing protocol for virtual GRE tunnels. The routing protocol detects loss of connectivity and reroutes packets to the backup GRE tunnel, thus providing high resiliency.
GRE is stateless, which means that the tunnel endpoints do not monitor the state or availability of other tunnel endpoints. This feature helps service providers support IP tunnels for clients who do not know the service provider's internal tunneling architecture. It gives clients the flexibility of reconfiguring their IP architectures without worrying about connectivity.
GRE Tunnel Down/Up Alarm
When a GRE tunnel link exists, if the status of the IP interface of the GRE tunnel edge changes to down, a GRE tunnel down alarm is created. The IP interface down alarms of both sides of the link correlate to the GRE tunnel down alarm. The GRE tunnel down alarm initiates an IP-based flow toward the GRE destination. If another alarm is found during the flow, the GRE tunnel down alarm correlates to it. For more information, see GRE tunnel down, page 16-16.
Note
The GRE tunnel down alarm is supported only on GRE tunnels that are configured with keepalive. When keepalive is configured on the GRE tunnel edge and a failure occurs in the GRE tunnel link, both IP interfaces of the GRE tunnel go to the Down state. If keepalive is not configured on the GRE tunnel edge, the GRE tunnel down alarm might not be generated, because the alarm is generated arbitrarily from one of the tunnel devices when its IP interface changes to the Down state.
When a failure occurs, the GRE tunnel link is marked orange. When the IP interface comes back up, a fixing alarm is sent, and the link is marked green. The GRE tunnel down alarm is cleared by a corresponding GRE Tunnel Up alarm.
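The alarm and link-color behavior described above can be sketched as follows. This is an illustrative model, not Cisco ANA code: the TunnelEdge class, the gre_tunnel_event function, and the choice to emit nothing when keepalive is missing (where the text only says the alarm might not be generated) are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TunnelEdge:
    device: str
    interface: str
    ip_status: str   # "up" or "down"
    keepalive: bool  # whether keepalive is configured on this tunnel edge


# Link colors as described in the text.
LINK_COLOR = {"GRE tunnel down": "orange", "GRE tunnel up": "green"}


def gre_tunnel_event(
    edge_a: TunnelEdge, edge_b: TunnelEdge, tunnel_was_down: bool
) -> Optional[str]:
    """Return the alarm to raise for the tunnel link, or None if nothing changes."""
    if not (edge_a.keepalive and edge_b.keepalive):
        # Without keepalive the alarm is not guaranteed; this sketch emits nothing.
        return None
    any_down = "down" in (edge_a.ip_status, edge_b.ip_status)
    if any_down and not tunnel_was_down:
        return "GRE tunnel down"
    if not any_down and tunnel_was_down:
        # Both IP interfaces are back up: send the fixing alarm.
        return "GRE tunnel up"
    return None
```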
GRE Tunnel Down Correlation Example 1
Figure 6-39 provides an example of a GRE tunnel down correlation for a single GRE tunnel.
In this example:
•
Router 1 (R1) is connected to Router 3 (R3) through a physical link L1.
•
Router 3 is connected to Router 2 through a physical link L2.
•
Router 1 is connected to Router 2 through a GRE tunnel.
Figure 6-39 GRE Tunnel Down Example 1 (Single GRE Tunnel)
When a Link down occurs on L2, a Link down alarm appears. A GRE tunnel down alarm is issued as the IP interfaces of the tunnel edge devices go down. The ip interface status down alarms will correlate to the GRE tunnel down alarm. The GRE tunnel down will correlate to the Link down alarm.
The system provides the following report:
•
Root cause—Link down: L2 Router 2 <-> Router 3 (see Link down, page 16-23)
•
Correlated events:
GRE tunnel down Router 1:tunnel <-> Router 2:tunnel
–
IP interface down Router 1:tunnel
–
IP interface down Router 2:tunnel
GRE Tunnel Down Correlation Example 2
This example provides a real-world scenario in which multiple GRE tunnels cross a single physical link. When this link is shut down by an administrator, many alarms are generated. All the alarms are correlated to the root cause ticket [Link down due to admin down], as illustrated in Figure 6-40.
Figure 6-40 GRE Tunnel Down Example 2 (Multiple GRE Tunnels)
Figure 6-41 shows the Correlation tab of the Ticket Properties dialog box that displays all the alarms that are correlated to the ticket, including the correlation for each GRE tunnel and its interface status.
Figure 6-41 Alarms Correlation to GRE Tunnel Down Ticket
As illustrated, the system provides the following report:
•
Root cause—Link down due to admin down (see Link down, page 16-23)
•
Correlated events:
GRE tunnel down ME-6524A GRE:Tunnel2 <-> ME-6524B GRE:Tunnel2
–
Interface status down ME-6524A IP:Tunnel2
–
Interface status down ME-6524B IP:Tunnel2
GRE tunnel down ME-6524A GRE:Tunnel3 <-> ME-6524B GRE:Tunnel3
–
Interface status down ME-6524A IP:Tunnel3
–
Interface status down ME-6524B IP:Tunnel3
and so on.
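A report of this shape can be rebuilt from a flat list of alarms, where each alarm records the alarm it is correlated to. The sketch below is a hypothetical illustration. The (name, parent) tuple format and the build_report function are assumptions and are not part of the product.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_report(alarms: List[Tuple[str, Optional[str]]]) -> List[str]:
    """Turn (alarm name, parent alarm name) pairs into indented report lines.

    An alarm whose parent is None is a root cause.
    """
    children: Dict[str, List[str]] = defaultdict(list)
    roots: List[str] = []
    for name, parent in alarms:
        if parent is None:
            roots.append(name)
        else:
            children[parent].append(name)

    lines: List[str] = []

    def walk(name: str, depth: int) -> None:
        # Indent each alarm by two spaces per correlation level.
        lines.append("  " * depth + name)
        for child in children.get(name, []):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines
```

Feeding it the Link down root cause, one GRE tunnel down alarm correlated to that root, and the two Interface status down alarms correlated to the tunnel alarm yields a three-level listing in the same order as the report above.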