User Guide for Device Fault Manager 1.2 and 1.2 Updated for Common Services Version 2.2 (With LMS 2.2)
Excessive Restarts and Flapping


This topic describes how DFM concludes that a system is excessively restarting or that a network adapter is flapping. Excessive restarts and flapping are faulty conditions that trigger the notification of an Operational Exception.

A system is considered to be excessively restarting if it performs too many cold or warm starts over a short period of time. A network adapter is considered to be flapping if it fluctuates too often between up and down states over a short period of time.

DFM monitors the state of a system or network adapter through the SNMP traps it receives. DFM determines when to send an Excessive Restart or Flapping Notification based on a combination of fixed values and user-controlled settings. DFM also calculates a Stable Time that must pass before it clears the Excessive Restart or Flapping Notification.

DFM monitors the following SNMP traps to determine a change in an element's state:

Warm Start Traps and Cold Start Traps for a system (bridge, host, hub, probe, router, RSM, switch, terminal server, or uncertified device)

Link Down Traps for a network adapter (port or interface)
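As a rough illustration of this mapping, the sketch below classifies SNMPv1 generic-trap codes (coldStart = 0, warmStart = 1, linkDown = 2, as defined in RFC 1157) by the kind of element they count against. The function name and the returned labels are hypothetical and are not part of DFM.

```python
from typing import Optional

# SNMPv1 generic-trap codes as defined in RFC 1157.
COLD_START = 0
WARM_START = 1
LINK_DOWN = 2


def monitored_element_kind(generic_trap: int) -> Optional[str]:
    """Return the kind of element a trap counts against, or None.

    Hypothetical helper for illustration only; it mirrors the two
    list items above and does not describe DFM internals.
    """
    if generic_trap in (COLD_START, WARM_START):
        return "system"            # counts toward excessive restarts
    if generic_trap == LINK_DOWN:
        return "network adapter"   # counts toward flapping
    return None                    # other traps are ignored in this sketch
```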

DFM uses the following values to diagnose an element as excessively restarting or flapping:

Minimum Traps—The minimum number of Link/Restart Traps received in order to conclude that the element is flapping/excessively restarting. This variable is set by the Link Trap Threshold parameter (contained in the Interface/Port Flapping setting) for network adapters and the Restart Trap Threshold parameter (contained in the Connectivity setting) for systems.

Trap Window—The period within which the Minimum Traps must be received to declare the element as flapping/excessively restarting. This window is set by the Link Trap Window parameter (contained in the Interface/Port Flapping setting) for network adapters and the Restart Trap Window parameter (contained in the Connectivity setting) for systems. Once an element is declared at fault, DFM computes the Stable Time.

Stable Time—The amount of time that must elapse without further traps before DFM declares the element stable again. Stable Time depends on the length of time the element was at fault. It is at least as large as that time, and at least as large as the Trap Window. However, it can be no longer than one hour.
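The three values above can be sketched in a few lines of Python. This is a minimal illustration, not DFM's implementation: the function names are hypothetical, times are assumed to be in seconds, the window comparison is assumed to be inclusive, and the Stable Time is assumed to equal the smallest value that satisfies the stated bounds (at least the fault duration, at least the Trap Window, at most one hour).

```python
from typing import Sequence

MAX_STABLE_TIME = 3600.0  # one hour, in seconds


def is_at_fault(trap_times: Sequence[float], min_traps: int,
                trap_window: float) -> bool:
    """True if any min_traps traps arrive within trap_window seconds.

    Assumption: a gap exactly equal to trap_window still counts.
    """
    if min_traps < 1:
        raise ValueError("min_traps must be at least 1")
    times = sorted(trap_times)
    # Slide over every run of min_traps consecutive traps.
    for i in range(len(times) - min_traps + 1):
        if times[i + min_traps - 1] - times[i] <= trap_window:
            return True
    return False


def stable_time(fault_duration: float, trap_window: float) -> float:
    """Smallest Stable Time consistent with the bounds described above."""
    return min(max(fault_duration, trap_window), MAX_STABLE_TIME)
```

For instance, under these assumptions `stable_time(10, 30)` is bounded below by the Trap Window, while `stable_time(5000, 30)` is capped at one hour.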

Figure A-1 illustrates how a system is diagnosed as performing excessive restarts or a network adapter is diagnosed as flapping.

Figure A-1 Diagnosing Excessive System Restarts or Flapping Network Adapters

In the example, let's assume the Link/Restart Trap Window parameter has a value of 30 seconds and the Link/Restart Trap Threshold parameter has a value of 2. DFM would perform the following actions for the example:

1. As soon as DFM receives a Link Down Trap from a physical port or interface (or a Warm Start/Cold Start Trap from a system), it begins counting.

2. When DFM receives two or more traps within the 30-second Trap Window, it considers the network adapter or system to be at fault and sends an Excessive Restart or Flapping Notification, which in turn triggers an Operational Exception Notification. The Minimum Traps variable (set by the Link/Restart Trap Threshold parameter, 2 in this example) determines how many traps DFM must receive within the Trap Window (set by the Link/Restart Trap Window parameter) before it considers an element at fault.

3. DFM continues to receive traps for 80 seconds after the initial trap. This results in a Stable Time of 80 seconds.

The Stable Time is the amount of time DFM waits before it clears the Excessive Restart or Flapping Notification. In our example, the Stable Time is 80 seconds, the length of time the element was at fault, because that value is greater than the Trap Window (30 seconds) and less than one hour.

As you can see, DFM uses a relative measure to determine how long an element must be stable before it clears the notification of the faulty condition. This measure is proportional to the amount of time an element is at fault. The longer an element is at fault, the longer it must be stable before the notification of the faulty condition is cleared. Because the element in our example was at fault for 80 seconds, DFM clears the notification of the faulty condition no sooner than 80 seconds after it receives the final trap.
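The walk-through above can be replayed with a small simulation. This is a sketch under stated assumptions, not DFM's actual logic: the class and method names are hypothetical, the fault duration is measured from the initial trap to the most recent trap (as the example suggests), and the trap arrival times (0, 20, 40, 60, and 80 seconds) are invented to match the 80-second span in the example.

```python
from collections import deque
from typing import Deque, Optional


class FlapDetector:
    """Hypothetical model of the restart/flapping diagnosis described above."""

    def __init__(self, min_traps: int = 2, trap_window: float = 30.0,
                 max_stable: float = 3600.0) -> None:
        self.min_traps = min_traps
        self.trap_window = trap_window
        self.max_stable = max_stable
        self.recent: Deque[float] = deque()   # traps inside the window
        self.fault_start: Optional[float] = None
        self.last_trap: Optional[float] = None

    def on_trap(self, now: float) -> bool:
        """Record a trap; return True if this trap raises the notification."""
        self.last_trap = now
        if self.fault_start is not None:
            return False                      # already at fault
        self.recent.append(now)
        # Drop traps that fell out of the Trap Window.
        while now - self.recent[0] > self.trap_window:
            self.recent.popleft()
        if len(self.recent) >= self.min_traps:
            self.fault_start = self.recent[0]  # assumed: count from first trap
            return True
        return False

    def stable_time(self) -> float:
        """Stable Time for the current fault, or 0 if not at fault."""
        if self.fault_start is None or self.last_trap is None:
            return 0.0
        duration = self.last_trap - self.fault_start
        return min(max(duration, self.trap_window), self.max_stable)

    def clear_time(self) -> Optional[float]:
        """Earliest clearing time if no further traps arrive."""
        if self.fault_start is None or self.last_trap is None:
            return None
        return self.last_trap + self.stable_time()


detector = FlapDetector(min_traps=2, trap_window=30.0)
raised = [detector.on_trap(t) for t in (0, 20, 40, 60, 80)]
# The second trap (at 20 s) raises the notification; with the last trap
# at 80 s, the Stable Time is 80 s and the earliest clearing time is 160 s.
```

Any further trap before the clearing time would lengthen the fault duration in this model, and with it the Stable Time, which reflects the proportional behavior described above.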