Health Monitor and Diagnostics Monitor for the Cisco AS5850 [Cisco AS5800 Series Universal Gateways]

Table Of Contents

Health Monitor and Diagnostic Monitor for the Cisco AS5850

Contents

Prerequisites for HM and DM for the Cisco AS5850

Restrictions for HM and DM for the Cisco AS5850

Information About HM and DM for the Cisco AS5850

Health Monitor Overview

Health Monitor Design

Rules Database

Action Handler

Health

Event Notification

Diagnostic Monitor Overview

Diagnostic Monitor Design

Benefits

How To Configure HM and DM for the Cisco AS5850

Activating or De-activating Health-Monitor Rules

Setting HM Notifications

Configure Diagnostic Monitor Tests

Verify System Health

General IOS Health Monitor Rules Operation

System Health Monitor Subsystem Rules

The zero_system_health_rule Operation

FIB Health Monitor Subsystem Rules

FIB Rules Operation

Memory Health Monitor Subsystem Rules

Low Memory Rules Operation

Fragmented Memory Rules Operation

rsc_slot Health Monitor Subsystem Rules

rsc_slot Rules Operation

Diagnostic Monitor Health Monitor Rules Operation

Troubleshooting Tips

Additional References

Related Documents

Standards

MIBs

RFCs

Technical Assistance

Command Reference

bootup

show health-monitor events

show health-monitor variable

show health-monitor subsystem

show health-monitor rule

show diagnostic-monitor test

show monitor event-trace hm

health-monitor subsystem

debug health-monitor

debug ip cef health-monitor

debug memory health-monitor

debug diagnostic-monitor

debug slot health-monitor

rule

notify subsystem

test

default

Glossary

Health Monitor and Diagnostic Monitor for the Cisco AS5850

Health Monitor (HM) and Diagnostic Monitor (DM) are components of the Cisco IOS software that monitor the health of registered subsystems on the Cisco AS5850 Route Switch Controller (RSC).

Feature Specifications for Health Monitor and Diagnostic Monitor for the Cisco AS5850

Feature History

Release

Modification

12.3(2)T

This feature was introduced on the Cisco AS5850.

Supported Platforms

Cisco AS5850

Determining Platform Support Through Cisco Feature Navigator

Cisco IOS software is packaged in feature sets that are supported on specific platforms. To get updated information regarding platform support for this feature, access Cisco Feature Navigator. Cisco Feature Navigator dynamically updates the list of supported platforms as new platform support is added for the feature.

Cisco Feature Navigator is a web-based tool that enables you to determine which Cisco IOS software images support a specific set of features and which features are supported in a specific Cisco IOS image. You can search by feature or release. Under the release section, you can compare releases side by side to display both the features unique to each software release and the features in common.

To access Cisco Feature Navigator, you must have an account on Cisco.com. If you have forgotten or lost your account information, send a blank e-mail to cco-locksmith@cisco.com. An automatic check will verify that your e-mail address is registered with Cisco.com. If the check is successful, account details with a new random password will be e-mailed to you. Qualified users can establish an account on Cisco.com by following the directions found at this URL:

http://www.cisco.com/register

Cisco Feature Navigator is updated regularly when major Cisco IOS software releases and technology releases occur. For the most current information, go to the Cisco Feature Navigator home page at the following URL:

http://www.cisco.com/go/fn

Availability of Cisco IOS Software Images

Platform support for particular Cisco IOS software releases is dependent on the availability of the software images for those platforms. Software images for some platforms may be deferred, delayed, or changed without prior notice. For updated information about platform support and availability of software images for each Cisco IOS software release, refer to the online release notes or, if supported, Cisco Feature Navigator.

Contents

Prerequisites for HM and DM for the Cisco AS5850

•Cisco IOS Release 12.3(2)T or later release

Restrictions for HM and DM for the Cisco AS5850

•none

Information About HM and DM for the Cisco AS5850

This section provides detailed information about the Cisco AS5850 Health Monitor and the Cisco As5850 Diagnostics Monitor.

Health Monitor Overview

Health Monitor (HM) is a Cisco IOS subsystem that monitors the state of hardware and software in the Cisco AS5850. By monitoring the state of individual hardware and software subsystems, the state or "health" of the system can be determined. Early detection of faults within individual subsystems can help increase the availability of the entire system. Health Monitor increases availability in the following ways:

•Fault notification

The Health Monitor receives notification events from the system and contains logs and statistics about faults that occur. This information can be displayed on a per subsystem basis and hence all events for a particular subsystem can be examined.

•Fault isolation

Once it is detected that the system health is suboptimal it will be possible to "drill down" and identify subcomponents that are causing the system health to be compromised. This can be used to isolate subsystems with problems that require attention.

•Failure recovery

The Health Monitor rules can trigger an action that can be used to recover from a fault or minimize its effect.

Health Monitor Design

Health Monitor (HM) is a rules based system that allows the user to enable or disable a set of rules for monitoring registered hardware or software subsystem. When the Cisco IOS is loaded, hardware and software subsystems register with Health Monitor. Once registered, the subsystems can then create health monitor rules and the health monitor will monitor events for these subsystems.

Note HM rules can be enabled or disabled. The conditions and actions associated with a rule are predetermined and cannot be altered.

The important aspects of Health Monitor are described below:

•Rules Database

•Action Handler

•Health

•Event Notification

Rules Database

The rules which are added to the rules database determine how the health of the subsystem is affected. They are also used to recover from error conditions or minimize the effect of an encountered condition. The rules processor is responsible for evaluating rule conditions, and calling the corresponding action handler(s) when appropriate.

Each HM rule consists of one condition, which may consist of several subconditions, and one or more actions. Table 1 shows the default rules registered with Health Monitor. This list may change as other subsystems register with Health Monitor. A current list can be seen by using the show health-monitor rule command.

Table 1 Health Monitor Rules

Rule ID

Subsystem

Rule Name

Action

1

peer_rsc_ping_diags

peer_rsc_ping_rc_dec

Decrement health by 100%

2

peer_rsc_ping_diags

peer_rsc_ping_rc_inc

Increment health by 100%

3

rsc_redundancy_line_diags

rsc_redundancy_line_rc_dec

Decrement health by 100%

4

rsc_redundancy_line_diags

rsc_redundancy_line_rc_incr_health

Increment health by 100%

5

fb0_mil_ping_diags

fb0_mil_path_ping_rc_dec

Decrement health by 100%

6

fb0_mil_ping_diags

fb0_mil_path_ping_rc_inc

Increment health by 100%

7

fb1_mil_ping_diags

fb1_mil_path_ping_rc_dec

Decrement health by 100%

8

fb1_mil_ping_diags

fb1_mil_path_ping_rc_inc

Increment health by 100%

9

fb2_mil_ping_diags

fb2_mil_path_ping_rc_dec

Decrement health by 100%

10

fb2_mil_ping_diags

fb2_mil_path_ping_rc_inc

Increment health by 100%

11

fb3_mil_ping_diags

fb3_mil_path_ping_rc_dec

Decrement health by 100%

12

fb3_mil_ping_diags

fb3_mil_path_ping_rc_inc

Increment health by 100%

13

fb4_mil_ping_diags

fb4_mil_path_ping_rc_dec

Decrement health by 100%

14

fb4_mil_ping_diags

fb4_mil_path_ping_rc_inc

Increment health by 100%

15

fb5_mil_ping_diags

fb5_mil_path_ping_rc_dec

Decrement health by 100%

16

fb5_mil_ping_diags

fb5_mil_path_ping_rc_inc

Increment health by 100%

17

fb8_mil_ping_diags

fb8_mil_path_ping_rc_dec

Decrement health by 100%

18

fb8_mil_ping_diags

fb8_mil_path_ping_rc_inc

Decrement health by 100%

19

fb9_mil_ping_diags

fb9_mil_path_ping_rc_dec

Decrement health by 100%

20

fb9_mil_ping_diags

fb9_mil_path_ping_rc_inc

Decrement health by 100%

21

fb10_mil_ping_diags

fb10_mil_path_ping_rc_dec

Decrement health by 100%

22

fb10_mil_ping_diags

fb10_mil_path_ping_rc_inc

Increment health by 100%

23

fb11_mil_ping_diags

fb11_mil_path_ping_rc_dec

Decrement health by 100%

24

fb11_mil_ping_diags

fb11_mil_path_ping_rc_inc

Increment health by 100%

25

fb12_mil_ping_diags

fb12_mil_path_ping_rc_dec

Decrement health by 100%

26

fb12_mil_ping_diags

fb12_mil_path_ping_rc_inc

Increment health by 100%

27

fb13_mil_ping_diags

fb13_mil_path_ping_rc_dec

Decrement health by 100%

28

fb13_mil_ping_diags

fb13_mil_path_ping_rc_inc

Increment health by 100%

29

rsc_iofpga_diags

rsc_iofpga_id_rc_dec

Decrement health by 100%

30

rsc_iofpga_diags

rsc_iofpga_scratch_rw_rc_dec

Decrement health by 100%

31

rsc_iofpga_diags

rsc_iofpga_id_rc_inc

Increment health by 100%

32

rsc_iofpga_diags

rsc_iofpga_scratch_rw_rc_inc

Increment health by 100%

33

rsc_rommon_diags

rsc_rommon_read_rc_dec

Decrement health by 100%

34

rsc_rommon_diags

rsc_rommon_read_rc_inc

Increment health by 100%

35

rsc_mmc_diags

rsc_mmc_id_rc_dec

Decrement health by 100%

36

rsc_mmc_diags

rsc_mmc_id_rc_inc

Increment health by 100%

37

rsc_epif0_diags

rsc_epif0_id_rc_dec

Decrement health by 100%

38

rsc_epif0_diags

rsc_epif0_id_rc_inc

Increment health by 100%

39

rsc_epif0_diags

rsc_epif0_phy_read_rc_dec

Decrement health by 100%

40

rsc_epif0_diags

rsc_epif0_phy_read_rc_inc

Increment health by 100%

41

rsc_epif0_diags

rsc_epif0_imem_read_rc_dec

Decrement health by 100%

42

rsc_epif0_diags

rsc_epif0_imem_read_rc_inc

Increment health by 100%

43

rsc_epif4_diags

rsc_epif4_id_rc_dec

Decrement health by 100%

44

rsc_epif4_diags

rsc_epif4_id_rc_inc

Increment health by 100%

45

rsc_epif4_diags

rsc_epif4_phy_read_rc_dec

Decrement health by 100%

46

rsc_epif4_diags

rsc_epif4_phy_read_rc_inc

Increment health by 100%

47

rsc_epif4_diags

rsc_epif4_imem_read_rc_dec

Decrement health by 100%

48

rsc_epif4_diags

rsc_epif4_imem_read_rc_inc

Increment health by 100%

49

rsc_epif8_diags

rsc_epif8_id_rc_dec

Decrement health by 100%

50

rsc_epif8_diags

rsc_epif8_id_rc_inc

Increment health by 100%

51

rsc_epif8_diags

rsc_epif8_phy_read_rc_dec

Decrement health by 100%

52

rsc_epif8_diags

rsc_epif8_phy_read_rc_inc

Decrement health by 100%

53

rsc_epif8_diags

rsc_epif8_imem_read_rc_dec

Decrement health by 100%

54

rsc_epif8_diags

rsc_epif8_imem_read_rc_inc

Increment health by 100%

55

rsc_epif12_diags

rsc_epif12_id_rc_dec

Decrement health by 100%

56

rsc_epif12_diags

rsc_epif12_id_rc_inc

Increment health by 100%

57

rsc_epif12_diags

rsc_epif12_phy_read_rc_dec

Decrement health by 100%

58

rsc_epif12_diags

rsc_epif12_phy_read_rc_inc

Increment health by 100%

59

rsc_epif12_diags

rsc_epif12_imem_read_rc_dec

Decrement health by 100%

60

prsc_epif12_diags

rsc_epif12_imem_read_rc_inc

Increment health by 100%

61

rsc_xpif_diags

rsc_xpif_id_rc_dec

Decrement health by 100%

62

rsc_xpif_diags

rsc_xpif_id_rc_inc

Increment health by 100%

63

rsc_xpif_diags

rsc_xpif_phy_read_rc_dec

Decrement health by 100%

64

rsc_xpif_diags

rsc_xpif_phy_read_rc_inc

Increment health by 100%

65

rsc_xpif_diags

rsc_xpif_imem_read_rc_dec

Decrement health by 100%

66

rsc_xpif_diags

rsc_xpif_imem_read_rc_inc

Increment health by 100%

67

rsc_xpif_diags

rsc_xpif_pc_read_rc_dec

Decrement health by 100%

68

rsc_xpif_diags

rsc_xpif_pc_read_rc_inc

Increment health by 100%

69

rsc_fpfe0_diags

rsc_fpfe0_id_root_cause_dec

Decrement health by 100%

70

rsc_fpfe0_diags

rsc_fpfe0_usr_reg_rw_root_cause_dec

Decrement health by 100%

71

rsc_fpfe0_diags

rsc_fpfe0_xtal_root_cause_dec

Decrement health by 100%

72

rsc_fpfe0_diags

rsc_fpfe0_low_voltage_root_cause_dec

Decrement health by 100%

73

rsc_fpfe0_diags

rsc_fpfe0_reg_rw_root_cause_dec

Decrement health by 100%

74

rsc_fpfe0_diags

rsc_fpfe0_loopback_result_root_cause_de

Decrement health by 100%

75

rsc_fpfe0_diags

rsc_fpfe0_id_root_cause_inc

Increment health by 100%

76

rsc_fpfe0_diags

rsc_fpfe0_user_reg_rw_root_cause_inc

Increment health by 100%

77

rsc_fpfe0_diags

rsc_fpfe0_xtal_root_cause_inc

Increment health by 100%

78

rsc_fpfe0_diags

rsc_fpfe0_low_voltage_root_cause_inc

Increment health by 100%

79

rsc_fpfe0_diags

rsc_fpfe0_reg_rw_root_cause_inc

Increment health by 100%

80

rsc_fpfe0_diags

rsc_fpfe0_loopback_result_root_cause_in

Increment health by 100%

81

rsc_gige0_diags

rsc_gige0_reg_rw_result_rc_dec

Decrement health by 100%

82

rsc_gige0_diags

rsc_gige0_addr_reg_rw_rc_dec

Decrement health by 100%

83

rsc_gige0_diags

rsc_gige0_reg_rw_rc_inc

Increment health by 100%

84

rsc_gige0_diags

rsc_gige0_addr_reg_rw_rc_inc

Increment health by 100%

85

rsc_gige1_diags

rsc_gige1_reg_rw_result_rc_dec

Decrement health by 100%

86

rsc_gige1_diags

rsc_gige1_addr_reg_rw_rc_dec

Decrement health by 100%

87

rsc_gige1_diags

rsc_gige1_reg_rw_rc_inc

Increment health by 100%

88

rsc_gige1_diags

rsc_gige1_addr_reg_rw_rc_inc

Increment health by 100%

89

fb0_dsip_ping_diags

fb0_dsip_ping_rc_dec

Decrement health by 100%

90

fb0_dsip_ping_diags

fb0_dsip_ping_rc_inc

Increment health by 100%

91

fb1_dsip_ping_diags

fb1_dsip_ping_rc_dec

Decrement health by 100%

92

fb1_dsip_ping_diags

fb1_dsip_ping_rc_inc

Increment health by 100%

93

fb2_dsip_ping_diags

fb2_dsip_ping_rc_dec

Decrement health by 100%

94

fb2_dsip_ping_diags

fb2_dsip_ping_rc_inc

Increment health by 100%

95

fb3_dsip_ping_diags

fb3_dsip_ping_rc_dec

Decrement health by 100%

96

fb3_dsip_ping_diags

fb3_dsip_ping_rc_inc

Increment health by 100%

97

fb4_dsip_ping_diags

fb4_dsip_ping_rc_dec

Decrement health by 100%

98

fb4_dsip_ping_diags

fb4_dsip_ping_rc_inc

Increment health by 100%

99

fb5_dsip_ping_diags

fb5_dsip_ping_rc_dec

Decrement health by 100%

100

fb5_dsip_ping_diags

fb5_dsip_ping_rc_inc

Increment health by 100%

101

fb8_dsip_ping_diags

fb8_dsip_ping_rc_dec

Decrement health by 100%

102

fb8_dsip_ping_diags

fb8_dsip_ping_rc_inc

Increment health by 100%

103

fb9_dsip_ping_diags

fb9_dsip_ping_rc_dec

Decrement health by 100%

104

fb9_dsip_ping_diags

fb9_dsip_ping_rc_inc

Increment health by 100%

105

fb10_dsip_ping_diags

fb10_dsip_ping_rc_dec

Decrement health by 100%

106

fb10_dsip_ping_diags

fb10_dsip_ping_rc_inc

Increment health by 100%

107

fb11_dsip_ping_diags

fb11_dsip_ping_rc_dec

Decrement health by 100%

108

fb11_dsip_ping_diags

fb11_dsip_ping_rc_inc

Increment health by 100%

109

fb12_dsip_ping_diags

fb12_dsip_ping_rc_dec

Decrement health by 100%

110

fb12_dsip_ping_diags

fb12_dsip_ping_rc_inc

Increment health by 100%

111

fb13_dsip_ping_diags

fb13_dsip_ping_rc_dec

Decrement health by 100%

112

fb13_dsip_ping_diags

fb13_dsip_ping_rc_inc

Increment health by 100%

113

rsc_mbus_diags

rsc_mbus_temp_sensor_rc_dec

Decrement health by 100%

114

rsc_mbus_diags

rsc_mbus_temp_sensor_rc_inc

Increment health by 100%

115

rsc_mbus_diags

rsc_mbus_eeprom_rw_rc_dec

Decrement health by 100%

116

rsc_mbus_diags

rsc_mbus_eeprom_rw_rc_inc

Increment health by 100%

117

rsc_mbus_diags

rsc_mbus_id_rc_dec

Decrement health by 100%

118

rsc_mbus_diags

rsc_mbus_id_rc_inc

Increment health by 100%

119

rsc_tcam_diags

rsc_tcam_rw_rc_dec

Decrement health by 100%

120

rsc_tcam_diags

rsc_tcam_rw_rc_inc

Increment health by 100%

121

rsc_compactflash_diags

rsc_compactflash_rw_rc_dec

Decrement health by 100%

122

rsc_compactflash_diags

rsc_compactflash_id_rc_dec

Decrement health by 100%

123

rsc_compactflash_diags

rsc_compactflash_read_rc_dec

Decrement health by 100%

124

rsc_compactflash_diags

rsc_compactflash_rw_rc_inc

Increment health by 100%

125

rsc_compactflash_diags

rsc_compactflash_id_rc_inc

Increment health by 100%

126

rsc_compactflash_diags

rsc_compactflash_read_rc_inc

Increment health by 100%

127

rsc_bic_diags

rsc_bic_id_rc_dec

Decrement health by 100%

128

rsc_bic_diags

rsc_bic_cfg_read_rc_dec

Decrement health by 100%

129

rsc_bic_diags

rsc_bic_cfg_rw_rc_dec

Decrement health by 100%

130

rsc_bic_diags

rsc_bic_reg_rw_rc_dec

Decrement health by 100%

131

rsc_bic_diags

rsc_bic_loopback_rc_dec

Decrement health by 100%

132

rsc_bic_diags

rsc_bic_id_rc_inc

Increment health by 100%

133

rsc_bic_diags

rsc_bic_cfg_read_rc_inc

Increment health by 100%

134

rsc_bic_diags

rsc_bic_cfg_rw_rc_inc

Increment health by 100%

135

rsc_bic_diags

rsc_bic_reg_rw_rc_inc

Increment health by 100%

136

rsc_bic_diags

rsc_bic_int_loopback_rc_inc

Increment health by 100%

137

rsc_sys_contoller_diags

rsc_system_controller_id_root_cause_dec

Decrement health by 100%

138

rsc_sys_contoller_diags

rsc_system_controller_root_cause_inc

Increment health by 100%

139

system

zero_system_health_rule

Reload this RSC if health <= 0%

140

memory

low_processor_memory_rule

Decrement health by 100% and reload this RSC

141

memory

low_iomem1_memory_rule

Decrement health by 100% and reload this RSC

142

memory

fragmented_processor_memory_rule

Decrement health by 100%and reload this RSC

143

memory

fragmented_iomem1_memory_rule

Decrement health by 100% and reload this RSC

144

fib

fibdisable_busyout_&_power_cycle_fb

Busyout & Power cycle FB

145

fib

fib_disabled_event_rule

Decrement FIB subsys health

146

fib

fib_recovered_event_rule

Increment FIB subsys health

147

rsc_cpu_utilisation_diags

rsc_cpu_util_rc_dec

Decrement health by 100%

148

rsc_cpu_utilisation_diags

rsc_cpu_util_rc_inc

Increment health by 100%

149

rsc_process_latency_diags

rsc_high_prio_latency_rc_dec

Decrement health by 100%

150

rsc_process_latency_diags

rsc_high_prio_latency_rc_inc

Increment health by 100%

151

rsc_process_latency_diags

rsc_normal_prio_latency_rc_dec

Decrement health by 50%

152

rsc_process_latency_diags

rsc_normal_prio_latency_rc_inc

Increment health by 50%

153

rsc_process_latency_diags

rsc_low_prio_latency_rc_dec

Decrement health by 10%

154

rsc_process_latency_diags

rsc_low_prio_latency_rc_inc

Increment health by 10%

155

rsc_slot

boot_adjust_fb10_health_rule

Adjust rsc_slot subsys health

156

rsc_slot

repeat_reboot_fb10_rule

Power down FB 10

157

rsc_slot

boot_adjust_fb2_health_rule

Adjust rsc_slot subsys health

158

rsc_slot

repeat_reboot_fb2_rule

Power down FB 2

159

rsc_slot

boot_adjust_fb0_health_rule

Adjust rsc_slot subsys health

160

rsc_slot

repeat_reboot_fb0_rule

Power down FB 0

Action Handler

Invokes all actions associated with a rule if the condition of that rule evaluates to TRUE. In the following example, the zero_system_health_rule checks the system health rating. If the rating for the system is 0, the action will be to reload the RSC.

The following example shows the condition and action for the zero_system_health_rule:
Router#show health-monitor rule subsystem system rule-name zero_system_health_rule
Status (S) codes:
A = active
D = deactivated
S ID    Subsystem                 Name
D 69    system                    zero_system_health_rule
Condition:
(system_health <= 0%)
Action:
Reload this RSC
Number of times this rule has been evaluated = 0
Number of times this rule evaluated to TRUE = 0
Number of times associated actions were invoked = 0
Health

The health of a system or subsystem is always indicated by a percentage between 0 (no health) and 100 (full health). A health rating of 100% indicates full health and a rating of 0% indicates the lowest health rating. The overall health of a system depends upon the composite health of the subsystems registered with the Health Monitor. Each subsystem that is registered with the Health Monitor has a health weighting associated with it. This indicates the amount by which the overall system health will change as a result of health changes of the subsystem.

For example, the rsc_cf_iosdiags subsystem has a health weighting of 500 (i.e. 500/10000 of the system health). Suppose that the overall system including the rsc_cf_iosdiags subsystem was at full health, this would be represented by a system and subsystem health of 100%. Should a serious rsc_cf_iosdiags fault now occur which lowers the rsc_cf_iosdiags subsystem health to 0% then the overall system health would be lowered to 95%.

It is possible for a single subsystem to have anywhere from no affect on the overall system health (weighting of zero) to 100% affect on overall system health. A single subsystem can drive the overall system health to zero.

Event Notification

The Heath Monitor increments and decrements the health of subsystems based on the rules registered for that subsystem. It uses the health weighting of each subsystem to determine the effect this has on the overall system health.

The events are typically internally generated notifications in response to detrimental or correctional changes in the state of the hardware or software of the system. Detrimental events are classified under one of the following severity levels:

•Catastrophic—causes or leads to system failure

•Critical—major subsystem or functionality failure

•High—potential for major impact to important functions

•Medium—potential for minor impact to functionality

•Low—negligible impact to functionality

Correctional events fall under the following classification:

•Positive—not a fault event. May cause or lead to the return of functionality.

Diagnostic Monitor Overview

Diagnostic Monitor (DM) is a new Cisco IOS subsystem that will pro-actively detect hardware and software failures on the active and standby RSCs on the Cisco AS5850. Intensive diagnostics are run on the RSC's components and a sophisticated dependency tree algorithm determines the root cause of component failures. RSC component status is sent to the Health Monitor to be included in overall system health. Individual diagnostic tests can be enabled or disabled and the testing frequency and interval can be changed.

Different sets of diagnostic tests are run on the active and standby RSCs. Because the active RSC is using more processor resources to handle calls than the standby RSC, more intensive tests can be run on the standby RSC. This assures the availability of the standby RSC if a switchover is necessary. Tests run on both RSCs include:

•Multicast pings from RSC to feature cards (FC) with packets transported across the switching fabric every 5 seconds.

•MBUS queries and responses for MBUS Local ID (performed every 15 seconds). EEPROM read (performed every hour) and temperature sensor test (performed every 30 seconds).

•Compact Flash device test (performed every hour)

•Compact Flash file system read test (performed every hour)

•Compact Flash file system write test (performed every 7 days from time of boot-up)

•Peer RSC polling over FE and MBUS (performed every 5 seconds)

•ROMMON EMT calls (performed hourly)

•FATAL and GIVE_UP line tests (performed daily)

•System Controller ID test (performed every 5 seconds)

•IO FPGA register test (performed every 5 seconds)

•Backplane Inter-Connect (BIC) configuration register and ID tests (performed every 3 and 5 seconds respectively)

•RSC Front Panel Fast Ethernet (FPFE) register test (performed every 5 seconds)

•RSC Gigabit Ethernet (GigE) register test (performed every 5 seconds)

•CPU utilization in the last 5 seconds (performed every 30 seconds)

•CPU latency tests for high, normal, and low priority processes (performed every 5, 15, and 30 seconds respectively)

•Switching fabric tests (performed every 5 seconds)

•XPIF/EPIF tests (performed every 5 seconds except XPIF PC read test-every 3 seconds)

Tests run on the active RSC only:

•DSIP client ping tests from RSC to FC (performed every 10 seconds)

In addition to the ongoing tests being run on the RSCs during run-time, additional tests are run during bootup. These tests could interfere with live traffic and are run once before traffic is being routed.

•Backplane Interconnect (BIC) Local Register and internal loop-back test

•RSC FE register and internal loop-back tests

•RSC GigE register test

Diagnostic Monitor Design

Diagnostic Monitor (DM) is tied closely to HM. Tests are designed to exercise the RSC components and report problems to HM. HM has rules established so that when a failure notification comes from DM, the appropriate rule can be applied and any necessary action taken.

During bootup, DM runs tests on both RSCs. Once the RSCs have come up, DM determines if the RSC is in active or standby mode and runs the appropriate diagnostics. DM registers with HM, just like other subsystems.

DM determines the root cause of a failure and reports that to HM. If an RSC component has a problem, it doesn't necessarily mean that component has failed. It's possible another component, that this component depends on, has failed. DM will follow the dependency tree, running diagnostics on all components, and determine the component with the actual failure. This information will be reported to HM. If a component failure has already been reported to HM, DM will not send another notification to HM. Once HM is notified of a problem, it will react according to the HM rules in place for that component. See Table 2 for a description of the diagnostic monitor tests.

Note For information about diagnostic monitor tests, use the show diagnostic-monitor test command.

Table 2 Diagnostic Monitor Tests

Test ID

Test Name

Description

rsc-compactflash-id

RSC Compact Flash Id

Reading the RSC compact flash: (disk0:)

rsc-compactflash-read

RSC Compact flash read

Reading from a test file on the RSC compact flash: (disk0:).

rsc-compactflash-rw

RSC Compact flash read/write

Writing to and reading from a test file on the RSC compact flash: (disk0:).

rsc-cpu-utilisation

RSC CPU Utilization

Monitoring and reporting CPU intensive processes that utilize more than 99% CPU in the last 5 sec. The process name and pid are reported as an IOS information error message.

rsc-mbus-eeprom-read

RSC MBUS eeprom read

Reading and validating data from MBUS EEPROM on the RSC slot.

rsc-epif0-id

RSC EPIF switch-port 0 ID

Reading and verifying the version of EPIF-0 through a valid MMC NP5400 switch local port (SLP) register.

rsc-epif0-imem-read

RSC EPIF switch-port 0 IMEM Read

Reading from an internal memory of each EPIF or XPIF.

rsc-epif0-phy-read

RSC EPIF switch-port 0 PHY Read

Reading from a valid PIF PHY Reg for EPIF-0.

rsc-epif12-id

RSC EPIF switch-port 12 ID

Reading and verifying the version of EPIF-12 through a valid MMC NP5400 switch local port (SLP) register.

rsc-epif12-imem-read

RSC EPIF switch-port 12 IMEM Read

Reading from an internal memory of each EPIF or XPIF.

rsc-epif12-phy-read

RSC EPIF switch-port 12 PHY Read

Reading from a valid PIF PHY Reg for EPIF-12.

rsc-epif4-id

RSC EPIF switch-port 4 ID

Reading and verifying the version of EPIF-4 through a valid MMC NP5400 switch local port (SLP) register.

rsc-epif4-imem-read

RSC EPIF switch-port 4 IMEM Read

Reading from an internal memory of each EPIF or XPIF.

rsc-epif4-phy-read

RSC EPIF switch-port 4 PHY Read

Reading from a valid PIF PHY Reg for EPIF-4.

rsc-epif8-id

RSC EPIF switch-port 8 ID

Reading and verifying the version of EPIF-8 through a valid MMC NP5400 switch local port (SLP) register.

rsc-epif8-imem-read

RSC EPIF switch-port 8 IMEM Read

Reading from an internal memory of each EPIF or XPIF.

rsc-epif8-phy-read

RSC EPIF switch-port 8 PHY Read

Reading from a valid PIF PHY Reg for EPIF-8.

fb0-dsip-ping

DSIP Client Ping per Featureboard

Sending and receiving a test message from the DSIP master clients on the RSC to corresponding DSIP slave clients on the FB via reliable IPC transport.

fb0-mil-path-ping

Interconnect Layer ping for featureboard 0

Verifying the path from the RSC to any FB inserted in the Marvel chassis via the MIL.

fb1-dsip-ping

DSIP Client Ping per Featureboard

Sending and receiving a test message from the DSIP master clients on the RSC to corresponding DSIP slave clients on the FB via reliable IPC transport.

fb1-mil-path-ping

Interconnect Layer ping for featureboard 1

Verifying the path from the RSC to any FB inserted in the Marvel chassis via the MIL.

fb10-dsip-ping

DSIP Client Ping per Featureboard

Sending and receiving a test message from the DSIP master clients on the RSC to corresponding DSIP slave clients on the FB via reliable IPC transport.

fb10-mil-path-ping

Interconnect Layer ping for featureboard 10

Verifying the path from the RSC to any FB inserted in the Marvel chassis via the MIL.

fb11-dsip-ping

DSIP Client Ping per Featureboard

Sending and receiving a test message from the DSIP master clients on the RSC to corresponding DSIP slave clients on the FB via reliable IPC transport.

fb11-mil-path-ping

Interconnect Layer ping for featureboard 11

Verifying the path from the RSC to any FB inserted in the Marvel chassis via the MIL.

fb12-dsip-ping

DSIP Client Ping per Featureboard

Sending and receiving a test message from the DSIP master clients on the RSC to corresponding DSIP slave clients on the FB via reliable IPC transport.

fb12-mil-path-ping

Interconnect Layer ping for featureboard 12

Verifying the path from the RSC to any FB inserted in the Marvel chassis via the MIL.

fb13dsip-ping

DSIP Client Ping per Featureboard

Sending and receiving a test message from the DSIP master clients on the RSC to corresponding DSIP slave clients on the FB via reliable IPC transport.

fb13-mil-path-ping

Interconnect Layer ping for featureboard 13

Verifying the path from the RSC to any FB inserted in the Marvel chassis via the MIL.

fb2-dsip-ping

DSIP Client Ping per Featureboard

Sending and receiving a test message from the DSIP master clients on the RSC to corresponding DSIP slave clients on the FB via reliable IPC transport.

fb2mil-path-ping

Interconnect Layer ping for featureboard 2

Verifying the path from the RSC to any FB inserted in the Marvel chassis via the MIL.

fb3-dsip-ping

DSIP Client Ping per Featureboard

Sending and receiving a test message from the DSIP master clients on the RSC to corresponding DSIP slave clients on the FB via reliable IPC transport.

fb3-mil-path-ping

Interconnect Layer ping for featureboard 3

Verifying the path from the RSC to any FB inserted in the Marvel chassis via the MIL.

fb4-dsip-ping

DSIP Client Ping per Featureboard

Sending and receiving a test message from the DSIP master clients on the RSC to corresponding DSIP slave clients on the FB via reliable IPC transport.

fb4-mil-path-ping

Interconnect Layer ping for featureboard 4

Verifying the path from the RSC to any FB inserted in the Marvel chassis via the MIL.

fb5-dsip-ping

DSIP Client Ping per Featureboard

Sending and receiving a test message from the DSIP master clients on the RSC to corresponding DSIP slave clients on the FB via reliable IPC transport.

fb5-mil-path-ping

Interconnect Layer ping for featureboard 5

Verifying the path from the RSC to any FB inserted in the Marvel chassis via the MIL.

fb8-dsip-ping

DSIP Client Ping per Featureboard

Sending and receiving a test message from the DSIP master clients on the RSC to corresponding DSIP slave clients on the FB via reliable IPC transport.

fb8-mil-path-ping

Interconnect Layer ping for featureboard 8

Verifying the path from the RSC to any FB inserted in the Marvel chassis via the MIL.

fb9-dsip-ping

DSIP Client Ping per Featureboard

Sending and receiving a test message from the DSIP master clients on the RSC to corresponding DSIP slave clients on the FB via reliable IPC transport.

fb9-mil-path-ping

Interconnect Layer ping for featureboard 9

Verifying the path from the RSC to any FB inserted in the Marvel chassis via the MIL.

rsc-fpfe0-clock

RSC Front Panel FastEthernet 0 Clock

Reading and testing the RSC FPFE (Front Panel Fast Ethernet) PHY crystal circuit status. (For RSC-1 only.)

rsc-fpfe0-id

RSC Front Panel FastEthernet 0 ID

Reading and validating the RSC FPFE (Front Panel Fast Ethernet) PHY ID.

rsc-fpfe1-id

RSC Front Panel FastEthernet 1 ID

Reading and validating the RSC FPFE (Front Panel Fast Ethernet) PHY ID. (For eRSC only.)

rsc-fpfe0-loopback

RSC Front Panel FastEthernet 0 Loopback

Transmission and receiving of data via PCI FE MAC ports.

rsc-fpfe1-loopback

RSC Front Panel FastEthernet 1 Loopback

Transmission and receiving of data via PCI FE MAC ports. (For eRSC only.)

rsc-fpfe0-low-voltage

RSC Front Panel FastEthernet 0 Low Voltage

Reading and testing the RSC FPFE (Front Panel Fast Ethernet) PHY low voltage status. (For RSC-1 only.)

rsc-fpfe0-reg-rw

RSC Front Panel FastEthernet 0 Register Read/Write

Reading the RSC FPFE (Front Panel Fast Ethernet) PHY control and configuration registers read write test result.

rsc-fpfe1-reg-rw

RSC Front Panel FastEthernet 1 Register Read/Write

Reading the RSC FPFE (Front Panel Fast Ethernet) PHY control and configuration registers read write test result. (For eRSC only.)

rsc-fpfe0-user-reg-rw

RSC Front Panel FastEthernet 0 User Register Read/Write

Writing to and reading from FP FE PHY user defined mirror register of LXT970 PHY chip and verifying the test data. (For RSC-1 only.)

rsc-gige0-address-reg-rw

RSC GigabitEthernet port 0 Address Register Read/Write

Writing to and reading from GIGE GMAC (MMC GMAC-B3) address register and verifying the test data.

rsc-gige0-reg-rw

RSC GigabitEthernet port 0 Register Read/Write

Reading the RSC GIGE GMAC registers read/write test result.

rsc-gige1-address-reg-rw

RSC GigabitEthernet port 1 Address Register Read/Write

Writing to and reading from GIGE GMAC (MMC GMAC-B3) address register and verifying the test data.

rsc-gige1-reg-rw

RSC GigabitEthernet port 1 Register Read/Write

Reading the RSC GIGE GMAC registers read/write test result.

rsc-iofpga-id

IOFPGA ID

Reading and validating the revision of the IO FPGA.

rsc-iofpga-rw

IOFPGA Read/Write

Writing to and reading from IO FPGA register using the scratch register and verifying the test data.

rsc-bic-config-read

Backplane InterConnect Configuration Read

Reading and validating the RSC BIC configuration registers (read-only).

rsc-bic-config-rw

Backplane InterConnect Configuration Read/Write

Writing to and reading from the RSC BIC configuration registers (read-write) and verifying the test data. (For RSC-1 only.)

rsc-bic-id

Backplane InterConnect ID

Reading and validating the RSC BIC (Backplane Interconnect) identity. (For RSC-1 only.)

rsc-bic-loopback

Backplane InterConnect FastEthernet Loopback

The RSC BIC is internally looped and tested for the following 4 test cases: (For RSC-1 only.)

Burst Read: OFF, Burst Write: OFF (case 0)
Burst Read: ON, Burst Write: OFF (case 1)
Burst Read: OFF, Burst Write: ON (case 2)
Burst Read: ON, Burst Write: ON (case 3)

rsc-bic-reg-rw

Backplane InterConnect Register Read/Write

Writing to and reading from the CSR (Control and Status Register) and BCR (Bus Configuration Register) of the RSC BIC and verifying the test data. (For RSC-1 only.)

rsc-mbus-id

RSC MBUS ID

Read the identity of the local MBUS on the RSC.

rsc-mbus-temp-sensor

RSC temperature sensor

Checking environmental sensors for RSC temperature.

rsc-mmc-id

RSC Switching Fabric ID

Reading and validating MMC switch fabric version and device type.

peer-rsc-ping

Peer RSC ping

Sending a ping message from the local RSC to the peer RSC via the MIL through the BIC.

rsc-high-proc-latency

Latency response test for high priority processes

Measuring and reporting CPU latency when scheduling processes at the different process priority as an indirect indication for CPU utilisation. Any latency over the default latency threshold in msec is reported as an IOS information error message.

rsc-low-proc-latency

Latency response test for low priority processes

Measuring and reporting CPU latency when scheduling processes at the different process priority as an indirect indication for CPU utilisation. Any latency over the default latency threshold in msec is reported as an IOS information error message.

rsc-normal-proc-latency

Latency response test for normal priority processes

Measuring and reporting CPU latency when scheduling processes at the different process priority as an indirect indication for CPU utilisation. Any latency over the default latency threshold in msec is reported as an IOS information error message.

rsc-redundancy-lines

RSC high availability hardware lines

Asserting and de-asserting the RSC HA Fatal and Giveup lines from one RSC to its peer RSC.

rsc-rommon-read

RSC ROM monitor read

Reading and validating ROMMON version.

rsc-system-controller-id

RSC System Controller ID

Reading and validating vendor ID and device ID from Galileo GT64120 System Controller. (For RSC-1 only.)

rsc-tcam-rw

RSC TCAM Read/Write

Writing to and reading from the TCAM and verifying the test data.

rsc-xpif2-id

RSC XPIF switch-port 2 ID

Reading and verifying the version of each XPIF through a valid MMC NP5400 switch local port (SLP) register.

rsc-xpif6-id

RSC XPIF switch-port 6 ID

Reading and verifying the version of each XPIF through a valid MMC NP5400 switch local port (SLP) register. (For eRSC only.)

rsc-xpif2-imem-read

RSC XPIF switch-port 2 IMEM Read

Reading from an internal memory of each EPIF or XPIF.

rsc-xpif6-imem-read

RSC XPIF switch-port 6 IMEM Read

Reading from an internal memory of each EPIF or XPIF. (For eRSC only.)

rsc-xpif2-pc-read

RSC XPIF switch-port 2 PC Read

Checking for Program Counter lock-up of each XPIF2 process through the valid MMC NP5400 switch local port register of the RSC.

rsc-xpif6-pc-read

RSC XPIF switch-port 6 PC Read

Checking for Program Counter lock-up of each XPIF6 process through the valid MMC NP5400 switch local port register of the RSC. (For eRSC only.)

rsc-xpif2-phy-read

RSC XPIF switch-port 2 PHY Read

Reading from a valid PIF PHY Reg for XPIF-2.

rsc-xpif6-phy-read

RSC XPIF switch-port 6 PHY Read

Reading from a valid PIF PHY Reg for XPIF-6. (For eRSC only.)

Benefits

Health Monitor

•Increased system availability through fault analysis and recovery

•Ability to incorporate diagnostic test results into the system health

Diagnostic Monitor

•Early detection of RSC component failures

•Customizing of RSC diagnostic test intervals

•Root cause analysis

How To Configure HM and DM for the Cisco AS5850

See the following sections for configuration tasks for the Health Monitor feature. Each task in the list is identified as either required or optional.

•Activating or De-activating Health-Monitor Rules (optional)

•Setting HM Notifications (optional)

•Configure Diagnostic Monitor Tests (optional)

Activating or De-activating Health-Monitor Rules

To disable or enable a specific rule, use the rule subsystem command in health-monitor configuration mode.
Router#configure terminal
Router(config)#health-monitor 
Router(config-hm)#rule subsystem subsystem-name rule-name rule-name [disable | enable]
To disable or enable the rules for a subsystem, use the rule subsystem command in health-monitor configuration mode.
Router#configure terminal
Router(config)#health-monitor 
Router(config-hm)#rule subsystem subsystem-name [disable | enable]
To disable or enable all rules for all subsystems, use the rule all command in health-monitor configuration mode.
Router#configure terminal
Router(config)#health-monitor
Router(config-hm)#rule all [disable | enable]
Setting HM Notifications

To disable or enable health monitor notifications, use the notify subsystem command in health-monitor configuration mode. This configuration will enable or disable rules on both the active and standby RSCs at the same time.
Router#configure terminal
Router(config)#health monitor 
Router(config-hm)#notify subsystem subsystem-name [disable | enable]

To set the high threshold for SNMP notifications for a subsystem, use the notify high-threshold command in health-monitor configuration mode.
Router#configure terminal
Router(config)#health monitor 
Router(config-hm)#notify subsystem subsystem-name high-threshold threshold-value

To set the low threshold for SNMP notifications for a subsystem, use the notify low-threshold command in health-monitor configuration mode.
Router#configure terminal
Router(config)#health monitor 
Router(config-hm)#notify subsystem subsystem-name low-threshold threshold-value

Configure Diagnostic Monitor Tests

To set the default parameters for all diagnostic tests, use the default all command in diagnostic-monitor configuration mode.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#default all

To set the default parameters for a specific diagnostic test, use the default test command in diagnostic-monitor configuration mode.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#default test test-name

To configure the frequency value for a specific diagnostic test, use the test command in diagnostic-monitor configuration mode.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#test test-name frequency [ active | standby ] frequency-value
To configure the timeout value for a specific diagnostic test, use the test command in diagnostic-monitor configuration mode.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#test test-name timeout [ active | standby ] timeout-value

To disable a specific DM test, use the no test command in diagnostic-monitor configuration mode.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(config-dm)#no test testname
Reset test result(s) to pass?? [yes/no]:
Answer "yes" if there is a known software problem with this diagnostic test.
Answer "no" if the test is being disabled for a failed component.
To disable the diagnostic bootup tests, use the test command in diagnostic-monitor configuration mode.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#no bootup tests
Verify System Health

•To check the overall system health, use the show health-monitor subsystem command in privileged EXEC mode.
Router#show health-monitor subsystem
System health is 100%
Subsystem                 Health  Weighting (max 10000)
fb0_dsip_ping_diags       100%    834
fb0_mil_ping_diags        100%    834
fb10_dsip_ping_diags      100%    834
fb10_mil_ping_diags       100%    834
fb11_dsip_ping_diags      100%    834
fb11_mil_ping_diags       100%    834
fb12_dsip_ping_diags      100%    834
fb12_mil_ping_diags       100%    834
fb13_dsip_ping_diags      100%    834
fb13_mil_ping_diags       100%    834
fb1_dsip_ping_diags       100%    834
fb1_mil_ping_diags        100%    834
fb2_dsip_ping_diags       100%    834
fb2_mil_ping_diags        100%    834
fb3_dsip_ping_diags       100%    834
fb3_mil_ping_diags        100%    834
fb4_dsip_ping_diags       100%    834
fb4_mil_ping_diags        100%    834
fb5_dsip_ping_diags       100%    834
fb5_mil_ping_diags        100%    834
fb8_dsip_ping_diags       100%    834
fb8_mil_ping_diags        100%    834
fb9_dsip_ping_diags       100%    834
fb9_mil_ping_diags        100%    834
fib                       100%    100
health_monitor            100%    10000
memory                    100%    10000
peer_rsc_ping_diags       100%    0
rsc_bic_diags             100%    10000
rsc_compactflash_diags    100%    0
rsc_cpu_utilisation_diags 100%    0
rsc_epif0_diags           100%    3334
rsc_epif12_diags          100%    10000
rsc_epif4_diags           100%    3334
rsc_epif8_diags           100%    3334
rsc_fpfe0_diags           100%    2500
rsc_gige0_diags           100%    5000
rsc_gige1_diags           100%    5000
rsc_iofpga_diags          100%    10000
rsc_mbus_diags            100%    10000
rsc_mmc_diags             100%    10000
rsc_process_latency_diags 100%    0
rsc_redundancy_line_diags 100%    0
rsc_rommon_diags          100%    100
rsc_slot                  100%    100
rsc_sys_contoller_diags   100%    10000
rsc_tcam_diags            100%    10000
rsc_xpif_diags            100%    10000
system                    100%    10000
•To see a report of health monitor events, use the show health-monitor events command in privileged EXEC mode.
Router#show health-monitor events
Event Statistics
0          catastrophic
0          criticial
6          high
1          medium
18         low
0          positive
The following events were discarded
26         unknown
0          negligible
0          health monitor events
Event buffer pool
Number of free event buffers = 300
Number of events awaiting processing by HM Normal process = 0
Number of events awaiting processing by HM Urgent process = 0
•To check the health of a subsystem, use the show health-monitor subsystem <subsys-name> command in privileged EXEC mode.
Router#show health-monitor subsystem memory
Subsystem            Health  Weighting (max 10000)
memory               100%     10000
Subsystem Event Statistics
0          catastrophic
0          criticial
0          high
0          medium
0          low
0          positive
Subsystem Notification Configuration
100        high-threshold
0          low-threshold
FALSE      notify-enable
General IOS Health Monitor Rules Operation

This section details how the currently implemented Health Monitor rules operate, grouped by Health-Monitor subsystem:

•System Health Monitor Subsystem Rules

•FIB Health Monitor Subsystem Rules

•Memory Health Monitor Subsystem Rules

•rsc_slot Health Monitor Subsystem Rules

System Health Monitor Subsystem Rules

There is one system Health Monitor subsystem rule:
zero_system_health_rule
The zero_system_health_rule Operation

This rule simply reloads the RSC if the overall system health goes to zero. The idea of this rule is to have other HM rules that trigger on catastrophic failures/problems drive the overall system health to 0, which would trigger this rule and reload the RSC. Note that it will trigger regardless of how the overall system health goes to 0. So, if a lot of smaller problems cause the system health to go to 0, this rule will trigger.

Note This zero_system_health_rule is redundancy mode independent and operates on both the active and standby RSCs.

FIB Health Monitor Subsystem Rules

There are three FIB Health Monitor subsystem rules:
fib_disabled_busyout_&_power_cycle_fb 
fib_disabled_event_rule 
fib_recovered_event_rule 
These three rules are best viewed as working together as a single rule. The FIBDISABLE rule is the major rule and it invokes the other 2 rules to increase or decrease the FIB Health Monitor subsystem health.

FIB Rules Operation

When a FIBDISABLE errmsg occurs for an FB, the FIB rules are triggered. The rules busy-out the FB that had the problem, decrement the FIB Health Monitor subsystem health, and then perform the following tasks every 30 seconds:

1. If FIB has recovered abort the rule

2. If the busy-out is complete, reload the FB

3. If 20 minutes has elapsed, reload the FB

When the rule is aborted or the FB comes back up, the FIB Health Monitor subsystem health is reinstated. The FIB Health Monitor subsystem health weighting is non-zero and affects the overall system health.

Note The FIB rules are redundancy mode independent and are installed on both the active and standby RSCs, but are only operational on the active RSC. They remain dormant on the standby RSC.

Memory Health Monitor Subsystem Rules

There are four Memory Health Monitor subsystem rules, two low Memory rules and two fragmented Memory rules:
low_processor_memory_rule 
low_iomem1_memory_rule 
fragmented_processor_memory_rule 
fragmented_iomem1_memory_rule 
Low Memory Rules Operation

The low memory rules check the amount of free memory on the RSC every 30 seconds. If they detect that the amount of free memory is below the hard coded threshold, they decrement the Memory Health Monitor subsystem to zero (0) and reload the RSC.

Note The RSC reload is built into the memory rules. They do not rely on the zero_system_health_rule to reload the RSC.

The hard coded thresholds are:

•Processor memory-5 Mbytes

•IOMEM memory-2 Mbytes

The Memory Health Monitor subsystem health weighting is non-zero and affects the overall system health.

Fragmented Memory Rules Operation

The fragmented memory rules check the size of the largest available block of free memory on the RSC every 30 seconds. If they detect that the memory is too fragmented (the largest block of free memory is below the hard coded threshold), they decrement the Memory Health Monitor subsystem to zero (0) and reload the RSC.

Note The RSC reload is built into the memory rules. They do not rely on the zero_system_health_rule to reload the RSC.

The hard coded thresholds are:

•Processor memory-500 kbytes

•IOMEM memory-100 kbytes

The Memory Health Monitor subsystem health weighting is non-zero and affects the overall system health.

Note The Memory rules are redundancy mode dependent, so their operation changes depending on which redundancy mode the RSC is in. The changes are internal to the rule and not noticeable via the console. These rules operate on both the active and standby RSCs.

rsc_slot Health Monitor Subsystem Rules

There are two rsc_slot Health Monitor subsystem rules:
repeat_reboot_fbx_rule 
boot_adjust_fbx_health_rule 
Where x is the slot number a FB is installed in.

rsc_slot Rules Operation

If an FB reboots 5 times in 60 minutes, the repeat_reboot_fbx_rule is triggered and powers down the FB. It also decrements the health of the rsc_slot subsystem.

The FB is allowed to fail 4 times. Then, on the 5th attempt, the FB is powered down, regardless of whether it would have rebooted successfully or not.

If the FB is manually powered up after this, then the boot_adjust_fbx_health_rule is triggered and the RSC slot subsystem health is restored.

Once the rsc_slot rule has triggered and the FB has been powered down, the FB can be manually rebooted by issuing the hw-module slot X reset command. Manually rebooting the FB causes two things to happen:

•It triggers the boot_adjust_fbx_health_rule, which restores the rsc_slot health monitor subsystem health.

•It re-initializes the repeat_reboot_fbx_rule so that another "5 FB reloads in 60 minutes" must occur for it to trigger again.

The rsc_slot Health Monitor subsystem rules for a particular FB only exist while that FB is physically inserted in the chassis. Removing an FB removes the rules for that FB. Inserting an FB installs the rules for that FB. The rules are installed only for FBs, not for RSCs. You will not see the rsc_slot rules installed for slot 6 or 7.

If the active RSC reloads for any reason, the repeat_reboot_fbx_rule is re-initialized so that another "5 FB reloads in 60 minutes" must occur for it to trigger again.

The rsc_slot Health Monitor subsystem health weighting is non-zero and affects the overall system health.

Note The rsc_slot rules are redundancy mode independent and are installed on both the active and standby RSCs. However, they are only active on the active RSC. They remain dormant on the standby RSC.

Diagnostic Monitor Health Monitor Rules Operation

A diagnostic monitor rule is triggered when a test fails, then the rule decrements the corresponding subsystem's health. When the failed test is repeated and passes, the rule increments the corresponding subsystems health.

The health weightings of the various diagnostic monitor subsystems are assigned such that they have differing effects on the overall system health. They may have from no affect (zero health weighting) to maximum affect (maximum health weighting). Those that have maximum effect will drive the overall system health to zero when the rule associated with the particular diagnostic monitor subsystem triggers. This will trigger the zero_system_health_rule, which causes the RSC to reload. Use the show health-monitor subsystem command to determine the weightings of any health monitor subsystem.

Troubleshooting Tips

•To see the state of the HM system, use the show health-monitor subsystem health_monitor command in privileged EXEC mode:
Router#show health-monitor subsystem health_monitor
Subsystem            Health  Weighting (max 10000)
health_monitor       100%     500
Subsystem Event Statistics
0          catastrophic
0          criticial
0          high
0          medium
0          low
0          positive
Subsystem Notification Configuration
100        high-threshold
0          low-threshold
FALSE      notify-enable
•To troubleshoot the Health Monitor feature, use the debug health-monitor command in privileged EXEC mode.

•To troubleshoot the Diagnostic Monitor feature, use the debug diagnostic-monitor command in user EXEC mode.

•To troubleshoot the non-diagnostic based Health Monitor Rules, use the debug <hm-subsys> health-monitor command in privileged EXEC mode.
debug ip cef health-monitor
debug memory health-monitor
debug slot health-monitor
debug hm-rules redundancy
•DM and HM continuously report a component as being faulty and then ok. When a component has intermittent problems, the DM test and HM rule associated with the component need to be disabled while the component is replaced.

To disable a DM test and HM rule, follow this procedure:

Note When enabling or disabling Health Monitor rules which are associated with diagnostic component tests it is imperative that this be done in a specific order so that Diagnostic Monitor events which are sent to the Health Monitor are not lost. Failure to do so may leave the Health Monitor and Diagnostic Monitor out of sync.

a. Disable scheduling of DM test.

To disable a specific DM test, use the no test command in diagnostic-monitor configuration mode.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(config-dm)#no test testname
Reset test result(s) to pass?? [yes/no]:
Note Answer "no" so that HM still sees this as a failed component.

b. Disable associated HM rule.

To disable a specific HM rule, use the rule subsystem command in health-monitor configuration mode.
Router#configure terminal
Router(config)#health-monitor 
Router(config-hm)#rule subsystem subsystem-name rule-name rule-name [disable | enable]

c. Replace faulty component

d. Enable HM rule.

e. Enable scheduling of DM test.

Additional References

For additional information related to HM and DM for the Cisco AS5850, refer to the following references:

Related Documents

Related Topic

Document Title

Redundancy modes

RPR+ for the Cisco AS5850

Configuration

•Cisco AS5850 Universal Gateway Operations, Administration, Maintenance, and Provisioning Guide, located at:

http://www.cisco.com/en/US/products/hw/univgate/ps509/products_maintenance_guide_book09186a008007e624.html

Standards

Standards

Title

No new or modified standards are supported by this feature.

—

MIBs

MIBs

MIBs Link

The Health Monitor feature introduces a new SNMP MIB:

CISCO-HEALTH-MONITOR-MIB

To obtain lists of supported MIBs by platform and Cisco IOS release, and to download MIB modules, go to the Cisco MIB website on Cisco.com at the following URL:

http://www.cisco.com/public/sw-center/netmgmt/cmtk/mibs.shtml

This MIB module provides health information for the system and each of its subsystems on the active and standby RSC. In addition to providing health metrics, statistics are provided for the number of error events and correctional events which are received for each subsystem. Also, notifications can be configured so that they are sent when the health of a subsystem reaches a high or low threshold. Indexing in the MIB is performed on the ASCII subsystem name.

To locate and download MIBs for selected platforms, Cisco IOS releases, and feature sets, use Cisco MIB Locator found at the following URL:

http://tools.cisco.com/ITDIT/MIBS/servlet/index

If Cisco MIB Locator does not support the MIB information that you need, you can also obtain a list of supported MIBs and download MIBs from the Cisco MIBs page at the following URL:

http://www.cisco.com/public/sw-center/netmgmt/cmtk/mibs.shtml

To access Cisco MIB Locator, you must have an account on Cisco.com. If you have forgotten or lost your account information, send a blank e-mail to cco-locksmith@cisco.com. An automatic check will verify that your e-mail address is registered with Cisco.com. If the check is successful, account details with a new random password will be e-mailed to you. Qualified users can establish an account on Cisco.com by following the directions found at this URL:

http://www.cisco.com/register

RFCs

RFCs

Title

No new or modified RFCs are supported by this feature.

—

Technical Assistance

Description

Link

Technical Assistance Center (TAC) home page, containing 30,000 pages of searchable technical content, including links to products, technologies, solutions, technical tips, tools, and lots more. Registered Cisco.com users can log in from this page to access even more content.

http://www.cisco.com/public/support/tac/home.shtml

Command Reference

This section documents new or modified commands. All other commands used with this feature are documented in the Cisco IOS High Availability command reference publications for various releases.

New Commands

•bootup

•show health-monitor events

•show health-monitor variable

•show health-monitor subsystem

•show health-monitor rule

•show diagnostic-monitor test

•show monitor event-trace hm

•health-monitor subsystem

•debug ip cef health-monitor

•debug memory health-monitor

•debug diagnostic-monitor

•debug slot health-monitor

•rule

•notify subsystem

•test

•default

bootup

To enable or disable bootup test, use the bootup command in diagnostic-monitor configuration mode.

[no] bootup {tests}
Syntax Description

tests

Specifies all bootup tests for DM.

Defaults

No default behavior or values.

Command Modes

Diagnostic-monitor configuration mode.

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850

Usage Guidelines

Use this command to enable or disable the bootup diagnostic tests that can be monitored by the Diagnostic Monitor.

Examples

This example enables the bootup tests to be monitored by Diagnostic Monitor after the RSC is reloaded.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#bootup tests
Related Commands

Command

Description

show diagnostic-monitor test

Display the results and default values for Diagnostic Monitor tests.

show health-monitor events

To see statistics for the events that the Health Monitor has received use the show health-monitor events command.

show health-monitor events
Syntax Description

This command has no arguments or keywords.

Defaults

No default behavior or values.

Command Modes

privileged EXEC

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850

Usage Guidelines

Use this command if there is reason to believe that the Health Monitor is not processing events from the system.

Examples

This example shows the output of this command:
Event Statistics
0          catastrophic
0          critical
0          high
3          medium
5          low
10         positive
The following events were discarded
136        unknown
0          negligible
0          health monitor events
Event buffer pool
Number of free event buffers = 300
Number of events awaiting processing by HM Normal process = 0
Number of events awaiting processing by HM Urgent process = 0
Related Commands

None.

show health-monitor variable

To see information about a variable in the Health Monitor Variable Database use the show health-monitor variable command.

show health-monitor variable [subsystem subsystem-name [var-name variable-name]]
Syntax Description

subsystem subsystem-name

Designates a Health Monitor subsystem.
subsystem-name specifies the name of Health Monitor subsystem.

var-name variable-name

Designates a Health Monitor variable.
variable-name specifies the name of Health Monitor variable.

Defaults

No default behavior or values.

Command Modes

privileged EXEC

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850

Usage Guidelines

Use this command to see the value of a Health Monitor variable. This can be of use if a variable is used to trigger a Health Monitor rule.

Examples

The following example shows part of the Variable Database output:
Router#show health-monitor variable 
Type Key:
  (Num)Number    (Hlth)Health       (VPtr)Void Pointer  (Str)String
  (Bool)Boolean  (Freq)Frequency    (Arg)Argument       (Tokn)Token
Subsystem                 Variable Name                          Type Value
fb0_dsip_ping_diags       fb0_dsip_ping_diags_health             Hlth 100%      
fb0_mil_ping_diags        fb0_mil_ping_diags_health              Hlth 100%      
fb10_dsip_ping_diags      fb10_dsip_ping_diags_health            Hlth 100%      
fb10_mil_ping_diags       fb10_mil_ping_diags_health             Hlth 100%   
The following shows the output for all of the variables in the memory subsystem:
Router#show health-monitor variable subsystem memory
Variable Name                                        Type           Value
free_iomem1_memory                                   Number         101481696   
free_processor_memory                                Number         683770428   
largest_block_iomem1_memory                          Number         101390044   
largest_block_processor_memory                       Number         564721540   
memory_health                                        Health         100%        
The following shows the output of this command for one specific variable. Note that the rules associated with this variable are also shown:
Router#show health-monitor variable subsystem memory var-name free_iomem1_memory
Subsystem    : memory
Variable Name: free_iomem1_memory
Type         : Number
Value        : 101481696
Num Reads    : 20859
Num Writes   : 20859
Associated rules:
Status (S) codes:
A = active
D = deactivated
S ID    Subsystem                 Name
A 137   memory                    low_iomem1_memory_rule
Related Commands

None.

show health-monitor subsystem

To check the overall system health as well as the health of each subsystem, use the show health-monitor subsystem command in privileged EXEC mode.

show health-monitor subsystem [ subsystem-name | standby ]
Syntax Description

standby

All subsystems on the standby RSC

subsystem-name

Name of health-monitor subsystem

Defaults

No default behavior or values.

Command Modes

privileged EXEC

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850

Usage Guidelines

Use the show health-monitor subsystem command to check the health of a system or a subsystem on either the active or standby RSC. If the system health is degraded, use this command to isolate which subsystem has less than perfect health and is effecting the system health.

Examples

The following example shows output of the show health-monitor subsystem command:
Router#show health-monitor subsystem 
System health is 100%
Subsystem            Health  Weighting (max 10000)
dsip_fbx_ping_iosdia 100%     10000
fdm_appl_iosdiags    100%     10000
fib                  100%     100
gt64120_iosdiags     100%     10000
health_monitor       100%     500
inter_rsc_iosdiags   100%     5000
mbus_eeprom_iosdiags 100%     500
memory               100%     10000
mha_line_iosdiags    100%     5000
pci_amdfe_iosdiags   100%     5000
pif_iosdiags         100%     10000
rsc_cf_iosdiags      100%     500
rsc_common_iosdiags  100%     8000
rsc_envm_iosdiags    100%     1000
rsc_fb_iosdiags      100%     10000
rsc_fpfe_iosdiags    100%     2000
rsc_fpga_iosdiags    100%     10000
rsc_gige_iosdiags    100%     10000
rsc_mmc_iosdiags     100%     10000
rsc_rommon_iosdiags  100%     100
rsc_slot             100%     100
system               100%     10000
Router#show health-monitor subsystem memory
Subsystem            Health  Weighting (max 10000)
memory               100%     10000
Subsystem Event Statistics
0          catastrophic
0          criticial
0          high
0          medium
0          low
0          positive
Subsystem Notification Configuration
100        high-threshold
0          low-threshold
FALSE      notify-enable
Table 3 Show Health-Monitor Subsystem Fields

Field

Description

Subsystem

Name of subsystem

Health

Current health of the subsystem

Weighting

Subsystem weighting

catastrophic

The number of catastrophic faults that have occurred in this subsystem on the specified entity since the system was initialized.

critical

The number of critical faults that have occurred in this subsystem on the specified entity since the system was initialized.

high

The number of high severity faults that have occurred in this subsystem on the specified entity since the system was initialized.

medium

The number of medium severity faults that have occurred in this subsystem on the specified entity since the system was initialized.

low

The number of low severity faults that have occurred in this subsystem on the specified entity since the system was initialized.

positive

The number of positive events that have occurred in this subsystem on the specified entity since the system was initialized.

high-threshold

Specifies the health level at which a ciscoHealthMonitorHealthLevel notification will be generated for the specified subsystem and entity. A notification will only be generated if the health level had previously reached the low threshold level prior to reaching this high threshold level. Health levels oscillating between the high and the low threshold levels do not generate notifications. A health level going from low threshold (or below) to high threshold (or above) represents a return to normal health for the specified subsystem.

low-threshold

Specifies the health level at which a ciscoHealthMonitorHealthLevel notification will be generated for the specified subsystem and entity. A notification will only be generated if the health level had previously reached the high threshold level prior to reaching this low threshold level. Health levels oscillating between the high and the low threshold levels do not generate notifications. A health level going from high threshold (or above) to low threshold (or below) represents a deterioration of the health for the specified subsystem

notify-enable

TRUE if notifications are enabled, FALSE if not enabled.

Related Commands

Command

Description

health-monitor subsystem

Sets the subsystem's health and those subsystems dependent on it. Use of this command is not recommended unless recovering from a fault situation.

notify subsystem

Enables/Disables SNMP notifications or configures the health level notification thresholds for a subsystem.

show health-monitor rule

To display current or historical status relating to health monitor rules, use the show health-monitor rule command in privileged EXEC mode.

show health-monitor rule [rule-ID | event-id | rule-id | subsystem [subsystem-name | rule-name rule-name | detail] | detail]
Syntax Description

rule-ID

Rule ID

event-id

Event ID table

rule-id

Rule ID table

subsystem

Designates a health monitor subsystem

subsystem-name

Name of the health monitor subsystem

rule-name

Designates a health-monitor rule

rule-name

Name of a health-monitor rule

detail

Displays a detailed view of each rule

Defaults

No default behavior or values.

Command Modes

Privileged EXEC

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850 platform.

Usage Guidelines

Use this command to view a summary list of Health Monitor rules, or detailed information on an individual rule. Detailed information includes the condition which triggers the rule, the action(s) the rule performs, whether the rule is activated or deactivated and historical data on how many times the rule has been evaluated and triggered.

The show health-monitor rule command can be used to find the rule-name, the rule-id or the subsystem name associated with a rule. This information can then be used in the longer commands based on the show health-monitor rule command.

Examples

The following example shows the output from the show health-monitor rule command:
Router#show health-monitor rule
Status (S) codes:
A = active
D = deactivated
S ID    Subsystem                 Name
A 43    dsip_fbx_ping_iosdiags    dsip_fbx_ping_root_cause_dec
A 44    dsip_fbx_ping_iosdiags    dsip_fbx_ping_root_cause_inc
A 49    fdm_appl_iosdiags         tcam_rw_root_cause_dec
A 50    fdm_appl_iosdiags         tcam_rw_root_cause_inc
A 72    fib                       fibdisable_busyout_&_power_cycle_fb
A 73    fib                       fib_disabled_event_rule
A 74    fib                       fib_recovered_event_rule
A 67    gt64120_iosdiags          gt64120_id_root_cause_dec
A 68    gt64120_iosdiags          gt64120_id_root_cause_inc
A 1     inter_rsc_iosdiags        inter_rsc_root_cause_dec
A 2     inter_rsc_iosdiags        inter_rsc_root_cause_inc
A 45    mbus_eeprom_iosdiags      mbus_eeprom_rw_root_cause_dec
A 46    mbus_eeprom_iosdiags      mbus_eeprom_rw_root_cause_inc
A 70    memory                    low_processor_memory_rule
A 71    memory                    low_iomem1_memory_rule
A 3     mha_line_iosdiags         ha_line_root_cause_dec
A 4     mha_line_iosdiags         ha_line_root_cause_incr_health
A 57    pci_amdfe_iosdiags        pci_mac_id_root_cause_dec
A 58    pci_amdfe_iosdiags        pci_mac_cfg_read_root_cause_dec
A 59    pci_amdfe_iosdiags        pci_mac_cfg_rw_root_cause_dec
A 60    pci_amdfe_iosdiags        pci_mac_reg_rw_root_cause_dec
A 61    pci_amdfe_iosdiags        pci_mac_int_loopback_root_cause_dec
A 62    pci_amdfe_iosdiags        pci_mac_id_root_cause_inc
A 63    pci_amdfe_iosdiags        pci_mac_cfg_read_root_cause_inc
A 64    pci_amdfe_iosdiags        pci_mac_cfg_rw_root_cause_inc
A 65    pci_amdfe_iosdiags        pci_mac_reg_rw_root_cause_inc
A 66    pci_amdfe_iosdiags        pci_mac_int_loopback_root_cause_inc
A 15    pif_iosdiags              epifx_id_root_cause_dec
A 16    pif_iosdiags              epifx_phy_read_root_cause_dec
A 17    pif_iosdiags              epifx_imem_read_root_cause_dec
A 18    pif_iosdiags              xpifx_id_root_cause_dec
A 19    pif_iosdiags              xpifx_phy_read_root_cause_dec
A 20    pif_iosdiags              xpifx_imem_read_root_cause_dec
A 21    pif_iosdiags              epifx_id_root_cause_inc
A 22    pif_iosdiags              epifx_phy_read_root_cause_inc
A 23    pif_iosdiags              epifx_imem_read_root_cause_inc
A 24    pif_iosdiags              xpifx_id_root_cause_inc
A 25    pif_iosdiags              xpifx_phy_read_root_cause_inc
A 26    pif_iosdiags              xpifx_imem_read_root_cause_inc
A 51    rsc_cf_iosdiags           compact_flash_rw_root_cause_dec
A 52    rsc_cf_iosdiags           compact_flash_id_root_cause_dec
A 53    rsc_cf_iosdiags           compact_flash_read_root_cause_dec
A 54    rsc_cf_iosdiags           compact_flash_rw_root_cause_inc
A 55    rsc_cf_iosdiags           compact_flash_id_root_cause_inc
A 56    rsc_cf_iosdiags           compact_flash_read_root_cause_inc
A 75    rsc_common_iosdiags       rsc_high_prio_latency_root_cause_dec
A 76    rsc_common_iosdiags       rsc_high_prio_latency_root_cause_inc
A 77    rsc_common_iosdiags       rsc_normal_prio_latency_root_cause_dec
A 78    rsc_common_iosdiags       rsc_normal_prio_latency_root_cause_inc
A 79    rsc_common_iosdiags       rsc_low_prio_latency_root_cause_dec
A 80    rsc_common_iosdiags       rsc_low_prio_latency_root_cause_inc
A 81    rsc_common_iosdiags       rsc_cpu_util_root_cause_dec
A 82    rsc_common_iosdiags       rsc_cpu_util_root_cause_inc
A 83    rsc_common_iosdiags       rsc_mbus_local_id_root_cause_dec
A 84    rsc_common_iosdiags       rsc_mbus_local_id_root_cause_inc
A 47    rsc_envm_iosdiags         rsc_mbus_temp_sensor_root_cause_dec
A 48    rsc_envm_iosdiags         rsc_mbus_temp_sensor_root_root_cause_in
A 5     rsc_fb_iosdiags           rsc_fb_mil_path_ping_root_cause_dec
A 6     rsc_fb_iosdiags           rsc_fb_mil_path_ping_root_cause_inc
A 27    rsc_fpfe_iosdiags         rsc_fpfe_id_root_cause_dec
A 28    rsc_fpfe_iosdiags         rsc_fpfe_usr_reg_rw_root_cause_dec
A 29    rsc_fpfe_iosdiags         rsc_fpfe_xtal_root_cause_dec
A 30    rsc_fpfe_iosdiags         rsc_fpfe_low_voltage_root_cause_dec
A 31    rsc_fpfe_iosdiags         rsc_fpfe_reg_rw_root_cause_dec
A 32    rsc_fpfe_iosdiags         rsc_fpfe_loopback_result_root_cause_dec
A 33    rsc_fpfe_iosdiags         rsc_fpfe_id_root_cause_inc
A 34    rsc_fpfe_iosdiags         rsc_fpfe_user_reg_rw_root_cause_inc
A 35    rsc_fpfe_iosdiags         rsc_fpfe_xtal_root_cause_inc
A 36    rsc_fpfe_iosdiags         rsc_fpfe_low_voltage_root_cause_inc
A 37    rsc_fpfe_iosdiags         rsc_fpfe_reg_rw_root_cause_inc
A 38    rsc_fpfe_iosdiags         rsc_fpfe_loopback_result_root_cause_inc
A 7     rsc_fpga_iosdiags         fpga_id_root_cause_dec
A 8     rsc_fpga_iosdiags         fpga_scratch_rw_root_cause_dec
A 9     rsc_fpga_iosdiags         fpga_id_root_cause_inc
A 10    rsc_fpga_iosdiags         fpga_scratch_rw_root_cause_inc
A 39    rsc_gige_iosdiags         rsc_gige_reg_rw_result_root_cause_dec
A 40    rsc_gige_iosdiags         rsc_gige_addr_reg_rw_root_cause_dec
A 41    rsc_gige_iosdiags         rsc_gige_reg_rw_root_cause_inc
A 42    rsc_gige_iosdiags         rsc_gige_addr_reg_rw_root_cause_inc
A 13    rsc_mmc_iosdiags          rsc_mmc_id_root_cause_dec
A 14    rsc_mmc_iosdiags          rsc_mmc_id_root_cause_inc
A 11    rsc_rommon_iosdiags       rsc_rommon_read_root_cause_dec
A 12    rsc_rommon_iosdiags       rsc_rommon_read_root_cause_inc
A 85    rsc_slot                  repeat_reboot_fb2_rule
A 86    rsc_slot                  repeat_reboot_fb3_rule
A 87    rsc_slot                  repeat_reboot_fb6_rule
A 88    rsc_slot                  repeat_reboot_fb9_rule
A 89    rsc_slot                  repeat_reboot_fb10_rule
A 90    rsc_slot                  repeat_reboot_fb12_rule
D 69    system                    zero_system_health_rule
The following example show the output for a health-monitor rule:
Router#show health-monitor rule 70
Status (S) codes:
A = active
D = deactivated
S ID    Subsystem                 Name
A 70    memory                    low_processor_memory_rule
Condition:
(free_processor_memory < 5242880)
Action:
Decrement health by 100%
Action:
Reload this RSC
Number of times this rule has been evaluated = 131
Number of times this rule evaluated to TRUE = 0
Number of times associated actions were invoked = 0
Note To see the same information as above, you could also use the command show health-monitor subsystem memory rule-name low_processor_memory_rule.

The following example shows the output for a health-monitor subsystem:
Router#show health-monitor rule subsystem memory
Status (S) codes:
A = active
D = deactivated
S ID    Subsystem                 Name
A 70    memory                    low_processor_memory_rule
A 71    memory                    low_iomem1_memory_rule
Table 4 Show Health-Monitor Rule Fields

Field

Description

active

The rule is monitored.

deactivated

The rule is not being monitored.

ID

The rule identification number.

Subsystem

The name of the subsystem.

Name

The name of the rule.

Action

The event that happens if the condition evaluates to TRUE.

Condition

A logical expression that evaluates to TRUE or FALSE.

Related Commands

Command

Description

show health-monitor variable subsystem <subsystem-name> var-name <var-name>

Shows all rules associated with a variable.

rule

Activates/Deactivates rule(s).

show diagnostic-monitor test

To display the tests run by diagnostic monitor, use the show diagnostic-monitor test command in privileged EXEC mode.

show diagnostic-monitor test {all | test-name {counters | details | status | summary | timers}
Syntax Description

all

Displays all the tests registered with Diagnostic Monitor.

test-name

Name of diagnostic monitor test.

counters

Shows how many times the test has been run.

details

Shows the test results.

status

Shows whether the test is running on the RSC (Running column) and whether it is allowed to run on the RSC (Runnable column).

summary

Summarizes the status of the test and when it is next scheduled to run.

timers

Displays information about the frequency of a test running on the active and standby RSCs as well as what their timeout values are.

Defaults

No default behavior or values.

Command Modes

privileged EXEC

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850

Usage Guidelines

Use this command to display the details and status of diagnostic monitor tests.

Examples

The following example shows how many times the test has been run.
Router#show diagnostic-monitor test tcam-rw counters
Name                        Passed         Failed         Unknown 
                            Count          Count          Count       
--------------------------  -------------  -------------  -------------
tcam-rw                     2337           0              0 
The following example shows the test results.
Router#show diagnostic-monitor test tcam-rw details
Note: R = Root cause failure.
      S = Superseded root cause failure.
      * = Bootup test only.
   Name                        Test         Test Result Reason       
                               Result                                
   --------------------------  -----------  -----------------------  
   tcam-rw                     Pass         Test response   
The output below is useful to see whether the test is running on the RSC (Running column) and more importantly whether it is allowed to run on the RSC (Runnable column). The RSC may be in a mode where the test is not allowed to run and this will indicate it.
Router#show diagnostic-monitor test tcam-rw status
Note: R = Root cause failure.
      S = Superseded root cause failure.
      * = Bootup test only.
      Runnable = Test is allowed to run.
      Running = Test is scheduled to run.
   Name                        Test         Runnable  Running
                               Result
   --------------------------  -----------  --------  -------
   tcam-rw                     Pass         Yes        Yes   
The output below summarizes the status of the test and when it is next scheduled to run.
Router#show diagnostic-monitor test tcam-rw summary
Note: R = Root cause failure.
      S = Superseded root cause failure.
      * = Bootup test only.
   Name                        Test          Next-Test Scheduled
                               Result       (days.hrs:min:sec.ms)
   --------------------------  -----------  ---------------------
   tcam-rw                     Pass         00.00:00:30.772      
The following command displays information about the frequency of a test running on the active and standby RSCs as well as what their timeout values are.
Router#show diagnostic-monitor test tcam-rw timers
Name                        Active-Freq       Standby-Freq     Timeout
                                (days.hrs:min:sec.ms)           (ms) 
--------------------------  ---------------------------------  -------
tcam-rw                     00.00:01:00.000   00.00:01:00.000   1000 
Table 5 Show Diagnostic-Monitor Test Fields

Field

Description

Name

The name of the test.

Passed Count

Number of times, since last bootup, the test has run and passed.

Failed Count

Number of times, since last bootup, the test has run and failed.

Unknown Count

Number of times, since last bootup, the test has run and the result could not be confirmed.

Root cause failure

Root-cause failure - The component is currently considered to be a root cause of a failure.

Superseded root cause failure

Superseded root-cause - The component was once a root cause failure but another component it is reliant on has also failed and thus it has been superseded as a root cause of a failure.

Bootup test only

The test is run only during bootup of the RSC.

Test Result

Possible options are:
Unavailable
Pass
Fail

Test Result Reason

Test response - Test was scheduled and returned this result back
Test not applicable - Test is not applicable in the mode of circumstance the RSC is in, eg a test for a featureboard slot which has no featureboard in it.
SW/HW not ready - The test was requested to by the Diagnostic monitor but when it when to run it was not able to due to something not being ready.

Next-Test Scheduled

The time, since the test was last run, the test will run again.

Active-Freq

How often the test will run on the active RSC.

Standby-Freq

How often the test will run on the standby RSC.

Timeout

Waiting period.

Related Commands

Command

Description

bootup

To enable or disable bootup diagnostic tests.

test

To change diagnostic test parameters.

default

To set diagnostic test parameters to their defaults.

show monitor event-trace hm

To display the events sent to HM, use the show monitor event-trace hm command in privileged EXEC mode.

show monitor event-trace hm {all | back {mmm / hhh:mm }| clock {hh:mm }| from-boot {seconds}| latest | parameters size}
Syntax Description

all

All events.

back

Events from the current time back to a specified time.

clock

Show events from a specific time.

from-boot

Time from boot in seconds.

latest

parameters size

Shows how many entries are stored.

Defaults

No default behavior or values.

Command Modes

privileged EXEC

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850

Usage Guidelines

Use this command to display the events that have changed the health of the system and subsystems.

Examples

The following example shows events, actions, and changes to the system and subsystem health.
Router#show monitor event-trace hm all
Feb 24 03:10:34: Event: Subsystem rsc_slot: ev_num 12, ntf type EVENT
Feb 24 03:10:34: Event: Subsystem rsc_slot: ev_num 1, ntf type EVENT
Feb 24 03:10:34: Event: Subsystem rsc_slot: ev_num 3, ntf type EVENT
Feb 24 03:10:34: Event: Subsystem rsc_slot: ev_num 4, ntf type EVENT
Feb 24 03:10:35: Event: Subsystem rsc_slot: ev_num 9, ntf type EVENT
Feb 24 03:10:35: Event: Subsystem rsc_slot: ev_num 10, ntf type EVENT
Feb 24 03:12:54: Event: Subsystem fb3_dsip_ping_diags: ev_num 1, ntf type EVENT
Feb 24 03:12:54: Health change: Subsystem fb3_dsip_ping_diags: Health decreased to 0%
Feb 24 03:12:54: Health change: Subsystem system: Health decreased to 91.66%
Feb 24 03:12:54: Action invoked: Subsystem fb3_dsip_ping_diags: Rule-name 
fb3_dsip_ping_rc_dec, Action: Decrement health by 100%
Feb 24 03:13:04: Event: Subsystem fb10_dsip_ping_diags: ev_num 1, ntf type EVENT
Feb 24 03:13:04: Health change: Subsystem fb10_dsip_ping_diags: Health decreased to 0%
Feb 24 03:13:04: Health change: Subsystem system: Health decreased to 83.32%
Feb 24 03:13:04: Action invoked: Subsystem fb10_dsip_ping_diags: Rule-name 
fb10_dsip_ping_rc_dec, Action: Decrement health by 100%
Feb 24 03:13:04: Event: Subsystem fb9_dsip_ping_diags: ev_num 1, ntf type EVENT
Feb 24 03:13:04: Health change: Subsystem fb9_dsip_ping_diags: Health decreased to 0%
Feb 24 03:13:04: Health change: Subsystem system: Health decreased to 74.98%
Feb 24 03:13:04: Action invoked: Subsystem fb9_dsip_ping_diags: Rule-name 
fb9_dsip_ping_rc_dec, Action: Decrement health by 100%
Feb 24 03:13:04: Event: Subsystem fb4_dsip_ping_diags: ev_num 1, ntf type EVENT
Feb 24 03:13:04: Health change: Subsystem fb4_dsip_ping_diags: Health decreased to 0%
Feb 24 03:13:04: Health change: Subsystem system: Health decreased to 66.64%
Feb 24 03:13:04: Action invoked: Subsystem fb4_dsip_ping_diags: Rule-name 
fb4_dsip_ping_rc_dec, Action: Decrement health by 100%
Feb 24 03:13:27: Event: Subsystem fb4_dsip_ping_diags: ev_num 2, ntf type EVENT
Feb 24 03:13:27: Health change: Subsystem fb4_dsip_ping_diags: Health increased to 100%
Feb 24 03:13:27: Health change: Subsystem system: Health increased to 74.98%
Feb 24 03:13:27: Action invoked: Subsystem fb4_dsip_ping_diags: Rule-name 
fb4_dsip_ping_rc_inc, Action: Increment health by 100%
Feb 24 03:13:27: Event: Subsystem fb10_dsip_ping_diags: ev_num 2, ntf type EVENT
Feb 24 03:13:27: Health change: Subsystem fb10_dsip_ping_diags: Health increased to 100%
Feb 24 03:13:27: Health change: Subsystem system: Health increased to 83.32%
Feb 24 03:13:27: Action invoked: Subsystem fb10_dsip_ping_diags: Rule-name 
fb10_dsip_ping_rc_inc, Action: Increment health by 100%
Feb 24 03:13:27: Event: Subsystem fb9_dsip_ping_diags: ev_num 2, ntf type EVENT
Feb 24 03:13:27: Health change: Subsystem fb9_dsip_ping_diags: Health increased to 100%
Feb 24 03:13:27: Health change: Subsystem system: Health increased to 91.66%
Feb 24 03:13:27: Action invoked: Subsystem fb9_dsip_ping_diags: Rule-name 
fb9_dsip_ping_rc_inc, Action: Increment health by 100%
Feb 24 03:13:34: Event: Subsystem fb3_dsip_ping_diags: ev_num 2, ntf type EVENT
Feb 24 03:13:34: Health change: Subsystem fb3_dsip_ping_diags: Health increased to 100%
Feb 24 03:13:34: Health change: Subsystem system: Health increased to 100%
Feb 24 03:13:34: Action invoked: Subsystem fb3_dsip_ping_diags: Rule-name 
fb3_dsip_ping_rc_inc, Action: Increment health by 100%
Related Commands

Command

Description

None

health-monitor subsystem

To set the health monitor subsystem health value, use the health-monitor subsystem command in privileged EXEC mode.

health-monitor subsystem subsystem set health-value

Syntax Description

subsystem

Subsystem name.

set

Sets the subsystem's health value.

health-value

Health value in .01 percent increments.

Defaults

No default behavior or values.

Command Modes

Privileged EXEC.

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850.

Usage Guidelines

This will set a subsystem's health and update those subsystems dependent on it, including the system health. Use of this command is not recommended unless recovering from a Health Monitor internal fault or from an incorrect user procedure which led to incorrect subsystem health.

Examples

This example sets the health value for the memory subsystem to 50 percent.

Router#health-monitor subsystem memory health set 5000

Related Commands

Command

Description

show health-monitor subsystem

Display the subsystem health status.

debug health-monitor

To turn on health monitoring debugging, use the debug health-monitor command in privileged EXEC mode. Use the no form of this command to turn off debugging.

debug health-monitor [ action | api | cli | condition { duration | frequency} | errors | events | mib | remote-support | rule | subsystem | variable]
Syntax Description

action

Enables debugging for any Health Monitor actions that are created, destroyed or invoked during rules operations.

api

Enables debugging for any API (Application Programming Interface) of the Health Monitor that is called.

cli

Enables debugging for the Health Monitor CLI (command line interface)

condition

Enables debugging for any Health Monitor conditions that are created, destroyed or evaluated during rules operations.

duration

Enables debugging for Health Monitor duration conditions only

frequency

Enables debugging for Health Monitor frequency conditions only

errors

Enables debugging for errors in the Health Monitor. This is not applicable to errors that enter the Health Monitor as inputs.

events

Enables debugging for all events entering the Health Monitor

mib

Enables debugging related to the operation of the CISCO-HEALTH-MONITOR-MIB

remote-support

Enables debugging related to Health Monitor communication of the Active RSC with the Standby RSC.

rule

Enables debugging for any Health Monitor rules that are created, destroyed or triggered

subsystem

Enables debugging for the Health Monitor Subsystem Database. This includes debugging for any changes in subsystem or system health.

variable

Enables debugging for the Health Monitor Variable database. This includes debugging for any accesses to Health Monitor Variable database variables

Defaults

No default behavior or values.

Command Modes

Privileged EXEC

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850.

Usage Guidelines

Use the debug health-monitor commands to turn on debugging for various aspects of the health-monitor. This is used to diagnose the operations of rules or to determine why system or subsystem health is changing.

This command provides detailed information on the health monitor internal processing. As the health monitor rules rely so heavily on the health monitor, this provides a lot of extra information on the rule operation.

Examples

The example below shows the HM evaluating conditions in the rule database:
Router# debug health-monitor condition
Aug  4 15:38:59.206: HM COND: eval condition
Aug  4 15:38:59.206: HM COND: eval leaf condition (0x648F83B8), left var type, operator<
Aug  4 15:38:59.206: HM COND: eval leaf condition, right val type
Aug  4 15:38:59.206: HM COND: eval leaf condition (0x648F83B8), evaluates to FALSE<
Aug  4 15:38:59.206: HM COND: eval condition, leaf result FALSE
Aug  4 15:38:59.206: HM COND: eval condition
Aug  4 15:38:59.206: HM COND: eval leaf condition (0x648F8710), left var type, operator<
Aug  4 15:38:59.206: HM COND: eval leaf condition, right val type
Aug  4 15:38:59.206: HM COND: eval leaf condition (0x648F8710), evaluates to FALSE<
Aug  4 15:38:59.206: HM COND: eval condition, leaf result FALSE
Aug  4 15:38:59.206: HM COND: eval condition
Aug  4 15:38:59.206: HM COND: eval leaf condition (0x648F8A68), left var type, operator<
Aug  4 15:38:59.206: HM COND: eval leaf condition, right val type
Aug  4 15:38:59.206: HM COND: eval leaf condition (0x648F8A68), evaluates to FALSE<
Aug  4 15:38:59.206: HM COND: eval condition, leaf result FALSE
Aug  4 15:38:59.206: HM COND: eval condition
Aug  4 15:38:59.206: HM COND: eval leaf condition (0x648F8DC0), left var type, operator<
Aug  4 15:38:59.206: HM COND: eval leaf condition, right val type
Aug  4 15:38:59.206: HM COND: eval leaf condition (0x648F8DC0), evaluates to FALSE<
Aug  4 15:38:59.206: HM COND: eval condition, leaf result FALSE
The example below shows the reading and writing of memory variables in the Health Monitor Variable database:
Router# debug health-monitor variable
Aug  4 15:39:29.253: HM_VAR: Write to var (free_processor_memory) succeeded
Aug  4 15:39:29.253: HM_VAR: Write to var (free_iomem1_memory) succeeded
Aug  4 15:39:29.253: HM_VAR: Write to var (largest_block_processor_memory) succeeded
Aug  4 15:39:29.253: HM_VAR: Write to var (largest_block_iomem1_memory) succeeded
Aug  4 15:39:29.253: HM_VAR: Var (free_processor_memory) read succeeded
Aug  4 15:39:29.253: HM_VAR: Var (free_iomem1_memory) read succeeded
Aug  4 15:39:29.253: HM_VAR: Var (largest_block_processor_memory) read succeeded
Aug  4 15:39:29.253: HM_VAR: Var (largest_block_iomem1_memory) read succeeded
Related Commands

Command

Description

debug diagnostic-monitor

Turns on debugging for the Diagnostics Monitor.

show monitor event-trace hm

Displays the Health Monitor event trace

debug ip cef health-monitor

To turn on debugging for the FIB health monitor subsystem, use the debug ip cef health-monitor command in privileged EXEC mode. Use the no form of this command to turn off debugging.

debug ip cef health-monitor
Syntax Description

This command has no arguments or keywords.

Defaults

No default behavior or values.

Command Modes

Privileged EXEC

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850.

Usage Guidelines

Use the debug ip cef health-monitor command to turn on debugging for all health monitor rules associated with the FIB health monitor subsystem.This is used to diagnose faults in or view more detailed operational information re the health monitor rules that belong to the FIB health monitor subsystem.

The FIB health-monitor rules are shown below:
Router#show health-monitor rule subsystem fib 
Status (S) codes:
A = active
D = deactivated
S ID    Subsystem                 Name
A 72    fib                       fib_disabled_busyout_&_power_cycle_fb
A 73    fib                       fib_disabled_event_rule
A 74    fib                       fib_recovered_event_rule
Together these rules perform the required actions when a FIBDISABLE errmsg occurs. That is, they busyout the FB and decrement the FIB subsystem health. Periodically, the rules check to see if FIB has recovered or the busyout is complete. If FIB recovers, the rule aborts and increments the health. If FIB does not recover and the busyout completes, the rule reloads the FB immediately. There is a timeout period after which the rule will reload the FB regardless of whether the busyout is complete or not. When the FB boots, the FIB subsystem health will be reinstated.

Note The terms FIB and CEF are interchangeable. These terms refer to the same switching functionality.

Examples

The following example turns on debugging for the FIB Health Monitor rules:
Router#debug ip cef health-monitor
IP CEF Health Monitor Rules debugging is on 
The following example shows detailed operational information when the rule triggers and FIB recovers:
Oct 4 14:38:13.506: %FIB-3-FIBDISABLE: Fatal error, slot 0: No window 
message, LC to RP IPC is non-operational 
Oct 4 14:38:13.510: CEF_HM: FIBDISABLE Rule triggered - busying out FB 0 
Oct 4 14:38:13.510: CEF_HM: Sent FIB_DISABLED event to Health Monitor 
(slot 0) 
Oct 4 14:38:13.510: CEF_HM: Started check timer for slot 0 
Oct 4 14:38:13.510: %SLOT-4-FB_BUSYOUT: Busy out feature board 0, initiated 
by the Health Monitor 
Oct 4 14:38:13.510: CEF_HM: CEF problem on slot 0 detected. Decremented CEF 
subsystem health by 5000
Oct 4 14:38:43.510: CEF_HM: Check timer expired for FB 0 - processing... 
Oct 4 14:38:43.510: CEF_HM: Sent FIB_RECOVERED event to Health Monitor 
(slot 0) 
Oct 4 14:38:43.510: %SLOT-4-FB_BUSYOUT: Busy out feature board 0, cancelled 
by the Health Monitor due to FIB recovery 
Oct 4 14:38:43.606: CEF_HM: CEF problem on slot 0 recovered. Incremented 
CEF subsystem health by 5000
The following example shows detailed operational information when the rule triggers and the busyout completes before FIB recovers (the feature card is reset).
Oct 4 14:41:10.561: %FIB-3-FIBDISABLE: Fatal error, slot 0: No window 
message, LC to RP IPC is non-operational 
Oct 4 14:41:10.565: CEF_HM: FIBDISABLE Rule triggered - busying out FB 0 
Oct 4 14:41:10.565: CEF_HM: Sent FIB_DISABLED event to Health Monitor 
(slot 0) 
Oct 4 14:41:10.565: CEF_HM: Started check timer for slot 0 
Oct 4 14:41:10.565: %SLOT-4-FB_BUSYOUT: Busy out feature board 0, initiated 
by the Health Monitor 
Oct 4 14:41:10.565: CEF_HM: CEF problem on slot 0 detected. Decremented CEF 
subsystem health by 5000
Oct 4 14:41:40.565: CEF_HM: Check timer expired for FB 0 - processing... 
Oct 4 14:41:40.573: CEF_HM: Busyout complete - reset FB 0 
Oct 4 14:41:40.573: %SLOT-4-FB_RESET: Resetting feature board 0, as 
requested by the Health Monitor 
Oct 4 14:41:40.573: %DSIPPF-5-DS_KEEPALIVE_LOSS: DSIP Keepalive Loss from 
shelf 0 slot 0
Oct 4 14:41:55.577: CEF_HM: Notify HM that FB 0 reloaded 
Oct 4 14:41:55.577: CEF_HM: Sent FIB_RECOVERED event to Health Monitor 
(slot 0) 
Oct 4 14:41:55.577: %SLOT-4-FB_BUSYOUT: Busy out feature board 0, cancelled 
by the Health Monitor due to FB reload 
Oct 4 14:41:55.653: CEF_HM: CEF problem on slot 0 recovered. Incremented 
CEF subsystem health by 5000
Related Commands

Command

Description

show health-monitor rule subsystem fib

View all rules associated with a health monitor subsystem.

show health-monitor rule

View full details of the specified rule by rule number.

show health-monitor rule subsystem fib rule-name

View full details of the specified rule by rule name.

debug health-monitor

View detailed information on the health monitor internal processing. As the health monitor rules rely so heavily on the health monitor, this provides a lot of extra information on the rule operation.

debug memory health-monitor

To turn on debugging for the memory health monitor subsystem, use the debug memory health-monitor command in privileged EXEC mode. Use the no form of this command to turn off debugging.

debug memory health-monitor
Syntax Description

This command has no arguments or keywords.

Defaults

No default behavior or values.

Command Modes

Privileged EXEC

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850.

Usage Guidelines

Use the debug memory health-monitor command to turn on debugging for all health monitor rules associated with the memory subsystem.This is used to diagnose faults in or view more detailed operational information re the health monitor rules that belong to the memory subsystem.
Router#show health-monitor rule subsystem memory
Status (S) codes:
A = active
D = deactivated
S ID    Subsystem                 Name
A 140   memory                    low_processor_memory_rule
A 141   memory                    low_iomem1_memory_rule
A 142   memory                    fragmented_processor_memory_rule
A 143   memory                    fragmented_iomem1_memory_rule
These low memory rules periodically check the amount of free processor/IOMEM memory on the RSC and reload the RSC if the amount of free memory falls below a predefined threshold. The fragmented memory rules periodically check for excessively fragmented memory and reload the RSC if the memory fragmentation exceeds a predefined threshold.

Examples

The following example turns on debugging for the memory health monitor rules:
Router#debug memory health-monitor
Memory Health Monitor Rules debugging is on 
The following example shows detailed operational information of the low memory rule where the amount of free memory is above the threshold (rule does not trigger):
Oct 4 15:45:09.232: HM_VAR: Write to var (Free_Processor_Memory) succeeded 
Oct 4 15:45:09.232: HM_VAR: Write to var (Free_IOMEM1_Memory) succeeded 
Oct 4 15:45:09.232: HM RULE: Received var update; evaluating rule list 
Oct 4 15:45:09.236: HM_VAR: Var (Free_Processor_Memory) read succeeded 
Oct 4 15:45:09.236: HM RULE: Rule [70] evaluation: FALSE 
Oct 4 15:45:09.236: HM RULE: Received var update; evaluating rule list 
Oct 4 15:45:09.236: HM_VAR: Var (Free_IOMEM1_Memory) read succeeded 
Oct 4 15:45:09.236: HM RULE: Rule [71] evaluation: FALSE
The following example shows detailed operational information of the Low Processor memory rule where the amount of free memory is below the threshold (rule triggers):
Oct 4 15:56:06.341: HM_VAR: Write to var (Free_Processor_Memory) succeeded 
Oct 4 15:56:06.341: HM_VAR: Write to var (Free_IOMEM1_Memory) succeeded 
Oct 4 15:56:06.341: HM RULE: Received var update; evaluating rule list 
Oct 4 15:56:06.341: HM_VAR: Var (Free_Processor_Memory) read succeeded 
Oct 4 15:56:06.341: HM RULE: Rule [70] evaluation: TRUE 
Oct 4 15:56:06.341: HM ACTION: invoke 
Oct 4 15:56:06.341: HM ACTION: Action:Decrement health by 100%: 2 args 
Oct 4 15:56:06.341: HM ACTION: Arg 1: Type Subsys Handle, Value PTR 
0x63D97E9C 
Oct 4 15:56:06.341: HM ACTION: Arg 2: Type Health, Value 100% 
Oct 4 15:56:06.341: HM SUBSYS: decrementing health of subsystem Memory by 
10000 
Oct 4 15:56:06.341: HM_VAR: Var (Memory_health) read succeeded 
Oct 4 15:56:06.341: HM_VAR: Write to var (Memory_health) succeeded 
Oct 4 15:56:06.341: HM SUBSYS: system health decr. due to health decr. of 
Memory by 10000 
Oct 4 15:56:06.341: HM_VAR: Var (system_health) read succeeded 
Oct 4 15:56:06.341: HM_VAR: Write to var (system_health) succeeded 
Oct 4 15:56:06.341: HM_VAR: Var (Memory_health) read succeeded 
Oct 4 15:56:06.341: HM ACTION: invoke 
Oct 4 15:56:06.341: HM ACTION: Action:Reload this RSC: 1 args 
Oct 4 15:56:06.341: HM ACTION: Arg 1: Type Number, Value 1 
Oct 4 15:56:06.341: HM SUBSYS: hm_subsys_db_search_common: name Memory: 
Oct 4 15:56:06.341: hm_subsys_db_compare_name: name Memory, elem 
0x63D97FB0, s 0x63D97E9C, subsys_name Memory (len 6)found subsys in 
Subsystem Database at 0x63D97FB0 
Oct 4 15:56:06.341: %MEMORY_HM-3-RSC_LOW_MEMORY: Health Monitor detected 
low Processor_Memory on the RSC: Reload this RSC
Related Commands

Command

Description

show health-monitor rule

View full details of the specified rule (use show health-monitor rule subsystem FIB to find the rule-id).

debug health-monitor

View detailed information on the health monitor internal processing. As the health monitor rules rely so heavily on the health monitor, this provides a lot of extra information on the rule operation.

debug diagnostic-monitor

To turn on diagnostic debugging, use the debug diagnostic-monitor command in privileged EXEC mode. Use the no form of this command to turn off debugging.

debug diagnostic-monitor [ errors | events | test { all | test-name} ]
Syntax Description

errors

Debug output of any errors (not including the diagnostic test result failures) that the DM may encounter.

events

Debug events the DM is dealing with. These include what tests are being scheduled and what results are coming back from the tests.

test

Enables you to selectively debug particular test results coming into the DM as well as debug how the root-cause is determined.

all

Turns on debugging for all the DM tests.

test-name

Turns on debugging for a specific DM test.

Defaults

No default behavior or values.

Command Modes

Privileged EXEC

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850.

Usage Guidelines

Use the debug diagnostic-monitor commands to turn on debugging for various aspects of the diagnostic-monitor.This is used to diagnose why certain tests have passed or failed as well as determine why a choice was made in determining why something was marked as being the root cause of a failure.

Examples

The example below shows DM scheduling tests and receiving the results for them.
Router#debug diagnostic-monitor events
Dec 10 17:59:19.466: DM: Component test start for "fb0-dsip-ping".
Dec 10 17:59:19.466: DM: Component test start for "fb1-dsip-ping".
Dec 10 17:59:19.466: DM: Component test start for "fb10-dsip-ping".
Dec 10 17:59:19.466: DM: Component test start for "fb11-dsip-ping".
Dec 10 17:59:19.466: DM: Component test start for "fb12-dsip-ping".
Dec 10 17:59:19.466: DM: Component test start for "fb13-dsip-ping".
Dec 10 17:59:19.466: DM: Component test start for "fb2-dsip-ping".
Dec 10 17:59:19.466: DM: Component test start for "proc-latency-priority-low".
Dec 10 17:59:19.466: DM: Fail result recvd for component fb0-dsip-ping 
Dec 10 17:59:19.466: DM: Fail result recvd for component fb1-dsip-ping 
Dec 10 17:59:19.466: DM: Pass result recvd for component fb10-dsip-ping 
Dec 10 17:59:19.466: DM: Pass result recvd for component fb11-dsip-ping 
Dec 10 17:59:19.466: DM: Pass result recvd for component fb12-dsip-ping 
Dec 10 17:59:19.466: DM: Fail result recvd for component fb13-dsip-ping 
The example below shows a test which has passed, failed and been determined to be the root cause
Router#debug diagnostic-monitor test fb0-dsip-ping
Dec 10 18:02:51.679: DM: Component "fb0-dsip-ping" test has been scheduled to run in 10000 
ms. (reason: periodic)
Dec 10 18:02:01.679: DM: Component "Pass" received test result 
Dec 10 18:12:39.028: DM: Component "Fail" received test result 
Dec 10 18:12:39.028: DM: Health change detected for component fb0-dsip-ping
Dec 10 18:12:39.028: DM: VComponent linked to Component fb0-dsip-ping in Module Domain 22 
marked as RC_Candidate
Dec 10 18:12:39.028: DM: Component "fb0-dsip-ping" test has been scheduled to run in 10000 
ms. (reason: periodic)
Dec 10 18:12:39.032: DM: Checking whether RC_CANDIDATE for VComponent linked to Component 
fb0-dsip-ping in Module Domain 22 is a root-cause
Dec 10 18:12:39.032: DM: VComponent linked to Component fb0-dsip-ping in Module Domain 22 
has been detected as root cause.
Dec 10 18:12:39.032: DM: VComponent linked to Component fb0-dsip-ping in Module Domain 22 
is now marked as root cause. Propogating it through the tree.
Dec 10 18:12:39.032: %DM-6-ROOT_CAUSE_DETECTED: Component fb0-dsip-ping detected as a root 
cause of a failure.
Dec 10 18:12:39.032: DM: Notifying Component fb0-dsip-ping, Reason: DM_NODE_RC
Related Commands

Command

Description

debug health-monitor

Turns on debugging for Health Monitor.

debug slot health-monitor

To turn on debugging for the rsc_slot health monitor subsystem, use the debug slot health-monitor command in privileged EXEC mode. Use the no form of this command to turn off debugging.

debug slot health-monitor
Syntax Description

This command has no arguments or keywords.

Defaults

No default behavior or values.

Command Modes

privileged EXEC

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850

Usage Guidelines

Use the debug slot health-monitor command to turn on debugging for all rsc_slot health monitor rules. This command is used to diagnose faults or to view detailed operational information.

The rsc_slot health-monitor rules are shown below:
Router#show health-monitor rule subsystem rsc_slot
Status (S) codes:
A = active
D = deactivated
S ID    Subsystem                 Name
A 155   rsc_slot                  repeat_reboot_fb0_rule
A 156   rsc_slot                  boot_adjust_fb0_health_rule
A 157   rsc_slot                  repeat_reboot_fb10_rule
A 158   rsc_slot                  boot_adjust_fb10_health_rule
Examples

The following example turns on debugging for the rsc_slot health monitor rules:
Router#debug slot health-monitor 
RSC Slot Health Monitor Subsystem debugging is on
The following example shows detailed operational information of the repeat_reboot_fb10_rule.

When the FB reloads, but the rule does not trigger (has not reloaded often enough):
Jun 5 11:00:25.438: SLOT_HM: Sent POWERED_ON (slot 10) event to Health Monitor
Jun 5 11:00:25.438: SLOT_HM: slot 10 health had not been decremented for this slot, so no 
adjustment needs to be made 
When the FB reloads and the rule does trigger:
Jun  5 11:15:50.129: SLOT_HM: Sent POWERED_ON (slot 10) event to Health Monitor
Jun  5 11:15:50.129: %SLOT_HM-4-FB_POWERDOWN: Powering down feature board 10, as requested 
by the Health Monitor
Jun  5 11:15:50.129: SLOT_HM: Problem on slot 10 detected. Decremented rsc slot subsystem 
health by 5000
When the FB recovers (if a user forced FB reload is performed for example):
Jun  5 11:26:25.505: SLOT_HM: Sent POWERED_ON (slot 10) event to Health Monitor
Jun  5 11:26:25.505: SLOT_HM: slot 10.Incremented rsc slot subsystem health by 5000
Related Commands

Command

Description

show health-monitor rule

View full details of the specified rule. Use the show health-monitor rule subsystem rsc_slot command to find the rule-id

debug health-monitor

View detailed information on the health monitor internal processing. As the health monitor rules rely so heavily on the health monitor, this provides a lot of extra information on the rule operation.

rule

To disable or enable HM rules, use the rule command in health monitor configuration mode.

rule [all | subsystem {subsystem-name {disable | enable | rule-name rule-name {disable | enable}]
Syntax Description

all

Designates all the health monitor rules.

subsystem-name

Name of the HM subsystem.

rule-name

Name or ID or the HM rule.

disable

Disable HM rules.

enable

Enable HM rules.

Defaults

Rules are enabled by default.

Command Modes

Health monitor configuration mode.

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850.

Usage Guidelines

Users may turn off specific rules in the system. You can customize the rules that are active by using this command.

Examples

The following example disables the rule, low_processor_memory_rule, for the memory subsystem.
Router#configure terminal
Router(config)#health-monitor 
Router(config-hm)#rule subsystem memory rule-name low_processor_memory_rule

The following example disables the rules for the memory subsystem.
Router#configure terminal
Router(config)#health-monitor 
Router(config-hm)#rule subsystem memory disable

Related Commands

Command

Description

show health-monitor rule

Displays a specific rule.

show health-monitor rule subsystem

Displays HM rules for a subsystem.

notify subsystem

To enable or configure health monitor SNMP notifications, use the notify subsystem command in health monitor configuration mode.

notify subsystem subsystem-name { enable | high-threshold {threshold-value} | low-threshold {threshold-value}}
Syntax Description

subsystem-name

Name of the HM subsystem.

enable

Enable notifications for the HM subsystem.

high-threshold

Specifies the health level at which a ciscoHealth Monitor health level notification will be generated for the specified subsystem and entity. Set this to the optimal health level. Defaults to 100.

low-threshold

Specifies the health level at which a ciscoHealth Monitor health level notification will be generated for the specified subsystem and entity. Set this to an unacceptable health level. Defaults to 0.

threshold-value

Expressed as percentage of 0 - 100%.

Defaults

Notifications are disabled by default.

Command Modes

Health-monitor configuration mode.

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850.

Usage Guidelines

Use this command to make HM status information available to external network management applications.

Examples

The following example enables notifications for the memory subsystem.
Router#configure terminal
Router(config)#health monitor 
Router(config-hm)#notify subsystem memory enable

Related Commands

Command

Description

show health-monitor rule

Displays all the HM rules.

show health-monitor rule subsystem

Displays HM rules for a subsystem

show health-monitor subsystem

Displays information on the specified subsystem, including whether notifications are enabled for it, and the current notification low/high threshold.

test

To change test parameters, use the test command in diagnostic-monitor configuration mode. To de-activate a test, use the no form of this command.

test {all | test-name [ timeout {timeout-value} | frequency {active | standby} {frequency-value} [active | standby] {frequency-value}]}
Syntax Description

all

Specifies all DM tests.

test-name

Specifies a specific DM test.

frequency

How often the test is run.

active

The active RSC.

standby

The standby RSC.

frequency-value

Frequency in milliseconds.

timeout

Waiting period.

timeout-value

Timeout in milliseconds.

Defaults

Each test has its own default values.

Command Modes

Diagnostic-monitor configuration mode.

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850.

Usage Guidelines

Alter testing frequency to suit your system. You might run a test on a particular component more often when the component is having a problem.

Note If you de-activate a test, the component test acts as the test has passed before deactivating it. If it was a root cause and you answered yes it would no longer be a root-cause and it's health effect on the subsystem it belongs to and hence on the overall system health would be removed.

Examples

This example sets the timeout and frequency for the epif0-id test.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#test epif0-id timeout 324 frequency active 3000 stanby 5000
Related Commands

Command

Description

show diagnostic-monitor test

Display the results and default values for diagnostic monitor tests.

default

To set test parameters to their defaults, use the default command in diagnostic-monitor configuration mode.

default {all | test {all timers | test-name timers}}
Syntax Description

all

Specifies all DM tests.

test

Specified individual DM tests.

all timers

Frequency and timeout timers for all DM tests.

test-name

Specifies a specific DM test.

timers

Frequency and timeout timers.

Defaults

Each test has its own default values.

Command Modes

Diagnostic-monitor configuration mode.

Command History

Release

Modification

12.3(2)T

This command was introduced on the Cisco AS5850.

Usage Guidelines

Use this command to reset the DM tests to their default parameters.

Examples

This example sets the timeout and frequency for the epif0-id test to the default values.

Router#configure terminal

Router(config)#diagnostic-monitor

Router(diag-mon)#default test epif0-id timers

Related Commands

Command

Description

show diagnostic-monitor test

Display the results and default values for diagnostic monitor tests.

Glossary

BIC

Backplane Inter-Connect

DSIP

Distributed Systems Interconnect Protocol

EEPROM

Electrically Erasable Programmable Read-only Memory

EMT

Emulation Trap; minimal interface to Rommon functions

EPIF

Ethernet Port Interface

FDM

Forwarding Database Manager

FE

Fast Ethernet

FPFE

Front Panel Fast Ethernet

GigE

Gigabit Ethernet

HSA

High System Availability. Allows for a switchover to a redundant RSC. OS reboots on switchover, line cards reset

MBUS

Chassis maintenance bus

MIB

Management Information Base

RSC

Route Switch Controller. This is the generic term for the route processor and refers to both RSC-1 and eRSC route processors.

eRSC

Enhanced Route Switch Controller. This is the next generation RSC.

RSC-1

The original Route Switch Controller

TCAM

Ternary Content Addressable Memory

Cisco AS5800 Series Universal Gateways

Health Monitor and Diagnostics Monitor for the Cisco AS5850

Hierarchical Navigation

Downloads

Table Of Contents

Health Monitor and Diagnostic Monitor for the Cisco AS5850

Contents

Prerequisites for HM and DM for the Cisco AS5850

Restrictions for HM and DM for the Cisco AS5850

Information About HM and DM for the Cisco AS5850

Health Monitor Overview

Health Monitor Design

Rules Database

Action Handler

Health

Event Notification

Diagnostic Monitor Overview

Diagnostic Monitor Design

Benefits

How To Configure HM and DM for the Cisco AS5850

Activating or De-activating Health-Monitor Rules

Setting HM Notifications

Configure Diagnostic Monitor Tests

Verify System Health

General IOS Health Monitor Rules Operation

System Health Monitor Subsystem Rules

The zero_system_health_rule Operation

FIB Health Monitor Subsystem Rules

FIB Rules Operation

Memory Health Monitor Subsystem Rules

Low Memory Rules Operation

Fragmented Memory Rules Operation

rsc_slot Health Monitor Subsystem Rules

rsc_slot Rules Operation

Diagnostic Monitor Health Monitor Rules Operation

Troubleshooting Tips

Additional References

Related Documents

Standards

MIBs

RFCs

Technical Assistance

Command Reference

bootup

Syntax Description

Defaults

Command Modes

Command History

Usage Guidelines

Examples

Related Commands

show health-monitor events

Syntax Description

Defaults

Command Modes

Command History

Usage Guidelines

Examples

Related Commands

show health-monitor variable

Syntax Description

Defaults

Command Modes

Command History

Usage Guidelines

Examples

Related Commands

show health-monitor subsystem

Syntax Description

Defaults

Command Modes

Command History

Usage Guidelines

Examples

Related Commands

show health-monitor rule

Syntax Description

Defaults

Command Modes

Command History