|
Table Of Contents
Health Monitor and Diagnostic Monitor for the Cisco AS5850
Prerequisites for HM and DM for the Cisco AS5850
Restrictions for HM and DM for the Cisco AS5850
Information About HM and DM for the Cisco AS5850
How To Configure HM and DM for the Cisco AS5850
Activating or De-activating Health-Monitor Rules
Configure Diagnostic Monitor Tests
General IOS Health Monitor Rules Operation
System Health Monitor Subsystem Rules
The zero_system_health_rule Operation
FIB Health Monitor Subsystem Rules
Memory Health Monitor Subsystem Rules
Fragmented Memory Rules Operation
rsc_slot Health Monitor Subsystem Rules
Diagnostic Monitor Health Monitor Rules Operation
Health Monitor and Diagnostic Monitor for the Cisco AS5850
Health Monitor (HM) and Diagnostic Monitor (DM) are components of the Cisco IOS software that monitor the health of registered subsystems on the Cisco AS5850 Route Switch Controller (RSC).
Feature Specifications for Health Monitor and Diagnostic Monitor for the Cisco AS5850
Feature History
Release 12.3(2)T: This feature was introduced on the Cisco AS5850.
Supported Platforms: Cisco AS5850
Determining Platform Support Through Cisco Feature Navigator
Cisco IOS software is packaged in feature sets that are supported on specific platforms. To get updated information regarding platform support for this feature, access Cisco Feature Navigator. Cisco Feature Navigator dynamically updates the list of supported platforms as new platform support is added for the feature.
Cisco Feature Navigator is a web-based tool that enables you to determine which Cisco IOS software images support a specific set of features and which features are supported in a specific Cisco IOS image. You can search by feature or release. Under the release section, you can compare releases side by side to display both the features unique to each software release and the features in common.
To access Cisco Feature Navigator, you must have an account on Cisco.com. If you have forgotten or lost your account information, send a blank e-mail to cco-locksmith@cisco.com. An automatic check will verify that your e-mail address is registered with Cisco.com. If the check is successful, account details with a new random password will be e-mailed to you. Qualified users can establish an account on Cisco.com by following the directions found at this URL:
Cisco Feature Navigator is updated regularly when major Cisco IOS software releases and technology releases occur. For the most current information, go to the Cisco Feature Navigator home page at the following URL:
Availability of Cisco IOS Software Images
Platform support for particular Cisco IOS software releases is dependent on the availability of the software images for those platforms. Software images for some platforms may be deferred, delayed, or changed without prior notice. For updated information about platform support and availability of software images for each Cisco IOS software release, refer to the online release notes or, if supported, Cisco Feature Navigator.
Contents
Prerequisites for HM and DM for the Cisco AS5850
•Cisco IOS Release 12.3(2)T or later release
Restrictions for HM and DM for the Cisco AS5850
•none
Information About HM and DM for the Cisco AS5850
This section provides detailed information about the Cisco AS5850 Health Monitor and the Cisco AS5850 Diagnostic Monitor.
Health Monitor Overview
Health Monitor (HM) is a Cisco IOS subsystem that monitors the state of hardware and software in the Cisco AS5850. By monitoring the state of individual hardware and software subsystems, the state or "health" of the system can be determined. Early detection of faults within individual subsystems can help increase the availability of the entire system. Health Monitor increases availability in the following ways:
•Fault notification
The Health Monitor receives notification events from the system and maintains logs and statistics about faults that occur. This information can be displayed per subsystem, so all events for a particular subsystem can be examined.
•Fault isolation
Once the system detects that its health is suboptimal, you can "drill down" and identify the subcomponents that are compromising system health. This capability can be used to isolate subsystems with problems that require attention.
•Failure recovery
The Health Monitor rules can trigger an action that can be used to recover from a fault or minimize its effect.
Health Monitor Design
Health Monitor (HM) is a rules-based system that allows the user to enable or disable a set of rules for monitoring registered hardware or software subsystems. When Cisco IOS is loaded, hardware and software subsystems register with Health Monitor. Once registered, the subsystems can create health monitor rules, and Health Monitor monitors events for these subsystems.
Note HM rules can be enabled or disabled. The conditions and actions associated with a rule are predetermined and cannot be altered.
The important aspects of Health Monitor are described below:
Rules Database
The rules which are added to the rules database determine how the health of the subsystem is affected. They are also used to recover from error conditions or minimize the effect of an encountered condition. The rules processor is responsible for evaluating rule conditions, and calling the corresponding action handler(s) when appropriate.
Each HM rule consists of one condition, which may consist of several subconditions, and one or more actions. Table 1 shows the default rules registered with Health Monitor. This list may change as other subsystems register with Health Monitor. A current list can be seen by using the show health-monitor rule command.
Action Handler
The action handler invokes all actions associated with a rule if the condition of that rule evaluates to TRUE. In the following example, the zero_system_health_rule checks the system health rating. If the rating for the system is 0, the action is to reload the RSC.
The following example shows the condition and action for the zero_system_health_rule:
Router#show health-monitor rule subsystem system rule-name zero_system_health_rule
Status (S) codes:
A = active
D = deactivated

S  ID  Subsystem  Name
D  69  system     zero_system_health_rule
Condition:
  (system_health <= 0%)
Action:
  Reload this RSC
Number of times this rule has been evaluated = 0
Number of times this rule evaluated to TRUE = 0
Number of times associated actions were invoked = 0

Health
The health of a system or subsystem is always expressed as a percentage between 0% (no health) and 100% (full health). The overall health of a system depends upon the composite health of the subsystems registered with the Health Monitor. Each subsystem registered with the Health Monitor has a health weighting associated with it, which indicates the amount by which the overall system health changes as a result of health changes in that subsystem.
For example, the rsc_cf_iosdiags subsystem has a health weighting of 500 (that is, 500/10000 of the system health). Suppose the overall system, including the rsc_cf_iosdiags subsystem, is at full health; this is represented by a system and subsystem health of 100%. Should a serious rsc_cf_iosdiags fault occur that lowers the rsc_cf_iosdiags subsystem health to 0%, the overall system health is lowered to 95%.
It is possible for a single subsystem to have anywhere from no effect on the overall system health (weighting of zero) to full effect on the overall system health. A single subsystem can drive the overall system health to zero.
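The weighting arithmetic described above can be sketched in Python. This is a hypothetical illustration of the calculation, not Cisco IOS code; the function name and data layout are assumptions:

```python
def system_health(subsystems):
    """Composite system health: each (health_pct, weighting) pair can lower
    the overall health by up to weighting/10000 of 100%."""
    overall = 100.0
    for health_pct, weighting in subsystems:
        overall -= (100.0 - health_pct) * weighting / 10000.0
    return max(overall, 0.0)

# rsc_cf_iosdiags (weighting 500) dropping from 100% to 0% health:
print(system_health([(0, 500)]))  # 95.0
```

A subsystem with the maximum weighting of 10000 dropping to 0% health drives the overall health to 0%, which is what allows a single subsystem to trigger the zero_system_health_rule.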
Event Notification
The Health Monitor increments and decrements the health of subsystems based on the rules registered for that subsystem. It uses the health weighting of each subsystem to determine the effect this has on the overall system health.
The events are typically internally generated notifications in response to detrimental or correctional changes in the state of the hardware or software of the system. Detrimental events are classified under one of the following severity levels:
•Catastrophic—causes or leads to system failure
•Critical—major subsystem or functionality failure
•High—potential for major impact to important functions
•Medium—potential for minor impact to functionality
•Low—negligible impact to functionality
Correctional events fall under the following classification:
•Positive—not a fault event. May cause or lead to the return of functionality.
Diagnostic Monitor Overview
Diagnostic Monitor (DM) is a Cisco IOS subsystem that proactively detects hardware and software failures on the active and standby RSCs on the Cisco AS5850. Intensive diagnostics are run on the RSC's components, and a dependency tree algorithm determines the root cause of component failures. RSC component status is sent to the Health Monitor to be included in overall system health. Individual diagnostic tests can be enabled or disabled, and the testing frequency and interval can be changed.
Different sets of diagnostic tests are run on the active and standby RSCs. Because the active RSC is using more processor resources to handle calls than the standby RSC, more intensive tests can be run on the standby RSC. This assures the availability of the standby RSC if a switchover is necessary. Tests run on both RSCs include:
•Multicast pings from RSC to feature cards (FC) with packets transported across the switching fabric every 5 seconds.
•MBUS queries and responses for MBUS Local ID (performed every 15 seconds), EEPROM read (performed every hour), and temperature sensor test (performed every 30 seconds)
•Compact Flash device test (performed every hour)
•Compact Flash file system read test (performed every hour)
•Compact Flash file system write test (performed every 7 days from time of boot-up)
•Peer RSC polling over FE and MBUS (performed every 5 seconds)
•ROMMON EMT calls (performed hourly)
•FATAL and GIVE_UP line tests (performed daily)
•System Controller ID test (performed every 5 seconds)
•IO FPGA register test (performed every 5 seconds)
•Backplane Inter-Connect (BIC) configuration register and ID tests (performed every 3 and 5 seconds respectively)
•RSC Front Panel Fast Ethernet (FPFE) register test (performed every 5 seconds)
•RSC Gigabit Ethernet (GigE) register test (performed every 5 seconds)
•CPU utilization in the last 5 seconds (performed every 30 seconds)
•CPU latency tests for high, normal, and low priority processes (performed every 5, 15, and 30 seconds respectively)
•Switching fabric tests (performed every 5 seconds)
•XPIF/EPIF tests (performed every 5 seconds except XPIF PC read test-every 3 seconds)
Tests run on the active RSC only:
•DSIP client ping tests from RSC to FC (performed every 10 seconds)
In addition to the ongoing tests run on the RSCs during run time, additional tests are run during bootup. Because these tests could interfere with live traffic, they are run once, before traffic is routed.
•Backplane Interconnect (BIC) Local Register and internal loop-back test
•RSC FE register and internal loop-back tests
•RSC GigE register test
Diagnostic Monitor Design
Diagnostic Monitor (DM) is tied closely to HM. Tests are designed to exercise the RSC components and report problems to HM. HM has rules established so that when a failure notification comes from DM, the appropriate rule can be applied and any necessary action taken.
During bootup, DM runs tests on both RSCs. Once the RSCs have come up, DM determines if the RSC is in active or standby mode and runs the appropriate diagnostics. DM registers with HM, just like other subsystems.
DM determines the root cause of a failure and reports it to HM. A problem with an RSC component does not necessarily mean that component has failed; another component that it depends on may have failed instead. DM follows the dependency tree, running diagnostics on all components, to determine the component with the actual failure, and reports that component to HM. If a component failure has already been reported to HM, DM does not send another notification. Once HM is notified of a problem, it reacts according to the HM rules in place for that component. See Table 2 for a description of the diagnostic monitor tests.
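The dependency-tree walk described above can be sketched as follows. This is a simplified illustration; the names and data structures are hypothetical and do not reflect the actual DM implementation:

```python
def root_cause(component, depends_on, test_passes):
    """Report the deepest failing component, not the first symptom.

    depends_on  : dict mapping a component to the components it depends on
    test_passes : callable(component) -> True if its diagnostic passes
    """
    for dep in depends_on.get(component, []):
        if not test_passes(dep):
            # A dependency failed, so keep walking down the tree.
            return root_cause(dep, depends_on, test_passes)
    # No failing dependency: this component is the actual failure to report.
    return component

deps = {"xpif": ["fabric"], "fabric": ["bic"]}
failed = {"xpif", "fabric"}           # bic still passes its diagnostic
print(root_cause("xpif", deps, lambda c: c not in failed))  # fabric
```

In this sketch, the "xpif" test failure is traced to "fabric", so only one failure is reported to HM rather than a failure for every dependent component.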
Note For information about diagnostic monitor tests, use the show diagnostic-monitor test command.
Benefits
Health Monitor
•Increased system availability through fault analysis and recovery
•Ability to incorporate diagnostic test results into the system health
Diagnostic Monitor
•Early detection of RSC component failures
•Customizing of RSC diagnostic test intervals
•Root cause analysis
How To Configure HM and DM for the Cisco AS5850
See the following sections for configuration tasks for the Health Monitor feature. Each task in the list is identified as either required or optional.
•Activating or De-activating Health-Monitor Rules (optional)
•Setting HM Notifications (optional)
•Configure Diagnostic Monitor Tests (optional)
Activating or De-activating Health-Monitor Rules
To disable or enable a specific rule, use the rule subsystem command in health-monitor configuration mode.
Router#configure terminal
Router(config)#health-monitor
Router(config-hm)#rule subsystem subsystem-name rule-name rule-name [disable | enable]

To disable or enable the rules for a subsystem, use the rule subsystem command in health-monitor configuration mode.

Router#configure terminal
Router(config)#health-monitor
Router(config-hm)#rule subsystem subsystem-name [disable | enable]

To disable or enable all rules for all subsystems, use the rule all command in health-monitor configuration mode.

Router#configure terminal
Router(config)#health-monitor
Router(config-hm)#rule all [disable | enable]

Setting HM Notifications
To disable or enable health monitor notifications, use the notify subsystem command in health-monitor configuration mode. This configuration will enable or disable rules on both the active and standby RSCs at the same time.
Router#configure terminal
Router(config)#health-monitor
Router(config-hm)#notify subsystem subsystem-name [disable | enable]
To set the high threshold for SNMP notifications for a subsystem, use the notify high-threshold command in health-monitor configuration mode.
Router#configure terminal
Router(config)#health-monitor
Router(config-hm)#notify subsystem subsystem-name high-threshold threshold-value
To set the low threshold for SNMP notifications for a subsystem, use the notify low-threshold command in health-monitor configuration mode.
Router#configure terminal
Router(config)#health-monitor
Router(config-hm)#notify subsystem subsystem-name low-threshold threshold-value
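As a rough sketch of how such threshold notifications typically behave: the document does not spell out the exact crossing semantics, so the hysteresis-style logic below is an assumption, and the function and trap names are hypothetical:

```python
def snmp_notification(prev_health, new_health, high, low, notify_enabled):
    """Hypothetical sketch: emit a notification when a subsystem's health
    crosses a configured threshold. Crossing semantics are assumed."""
    if not notify_enabled:
        return None
    if prev_health >= low and new_health < low:
        return "health-below-low-threshold"
    if prev_health < high and new_health >= high:
        return "health-above-high-threshold"
    return None

print(snmp_notification(100, 40, high=90, low=50, notify_enabled=True))
```

Under this assumed model, a subsystem dropping from 100% to 40% with a low threshold of 50 would generate a low-threshold notification, and a recovery past the high threshold would generate the corresponding high-threshold notification.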
Configure Diagnostic Monitor Tests
To set the default parameters for all diagnostic tests, use the default all command in diagnostic-monitor configuration mode.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#default all
To set the default parameters for a specific diagnostic test, use the default test command in diagnostic-monitor configuration mode.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#default test test-name
To configure the frequency value for a specific diagnostic test, use the test command in diagnostic-monitor configuration mode.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#test test-name frequency [active | standby] frequency-value

To configure the timeout value for a specific diagnostic test, use the test command in diagnostic-monitor configuration mode.

Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#test test-name timeout [active | standby] timeout-value
To disable a specific DM test, use the no test command in diagnostic-monitor configuration mode.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(config-dm)#no test testname
Reset test result(s) to pass?? [yes/no]:

Answer "yes" if there is a known software problem with this diagnostic test. Answer "no" if the test is being disabled for a failed component.

To disable the diagnostic bootup tests, use the no bootup tests command in diagnostic-monitor configuration mode.

Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#no bootup tests

Verify System Health
•To check the overall system health, use the show health-monitor subsystem command in privileged EXEC mode.
Router#show health-monitor subsystem
System health is 100%

Subsystem                  Health  Weighting (max 10000)
fb0_dsip_ping_diags         100%     834
fb0_mil_ping_diags          100%     834
fb10_dsip_ping_diags        100%     834
fb10_mil_ping_diags         100%     834
fb11_dsip_ping_diags        100%     834
fb11_mil_ping_diags         100%     834
fb12_dsip_ping_diags        100%     834
fb12_mil_ping_diags         100%     834
fb13_dsip_ping_diags        100%     834
fb13_mil_ping_diags         100%     834
fb1_dsip_ping_diags         100%     834
fb1_mil_ping_diags          100%     834
fb2_dsip_ping_diags         100%     834
fb2_mil_ping_diags          100%     834
fb3_dsip_ping_diags         100%     834
fb3_mil_ping_diags          100%     834
fb4_dsip_ping_diags         100%     834
fb4_mil_ping_diags          100%     834
fb5_dsip_ping_diags         100%     834
fb5_mil_ping_diags          100%     834
fb8_dsip_ping_diags         100%     834
fb8_mil_ping_diags          100%     834
fb9_dsip_ping_diags         100%     834
fb9_mil_ping_diags          100%     834
fib                         100%     100
health_monitor              100%   10000
memory                      100%   10000
peer_rsc_ping_diags         100%       0
rsc_bic_diags               100%   10000
rsc_compactflash_diags      100%       0
rsc_cpu_utilisation_diags   100%       0
rsc_epif0_diags             100%    3334
rsc_epif12_diags            100%   10000
rsc_epif4_diags             100%    3334
rsc_epif8_diags             100%    3334
rsc_fpfe0_diags             100%    2500
rsc_gige0_diags             100%    5000
rsc_gige1_diags             100%    5000
rsc_iofpga_diags            100%   10000
rsc_mbus_diags              100%   10000
rsc_mmc_diags               100%   10000
rsc_process_latency_diags   100%       0
rsc_redundancy_line_diags   100%       0
rsc_rommon_diags            100%     100
rsc_slot                    100%     100
rsc_sys_contoller_diags     100%   10000
rsc_tcam_diags              100%   10000
rsc_xpif_diags              100%   10000
system                      100%   10000

•To see a report of health monitor events, use the show health-monitor events command in privileged EXEC mode.
Router#show health-monitor events
Event Statistics
0 catastrophic
0 critical
6 high
1 medium
18 low
0 positive

The following events were discarded
26 unknown
0 negligible
0 health monitor events

Event buffer pool
Number of free event buffers = 300
Number of events awaiting processing by HM Normal process = 0
Number of events awaiting processing by HM Urgent process = 0

•To check the health of a subsystem, use the show health-monitor subsystem <subsys-name> command in privileged EXEC mode.
Router#show health-monitor subsystem memory
Subsystem  Health  Weighting (max 10000)
memory      100%   10000

Subsystem Event Statistics
0 catastrophic
0 critical
0 high
0 medium
0 low
0 positive

Subsystem Notification Configuration
100 high-threshold
0 low-threshold
FALSE notify-enable

General IOS Health Monitor Rules Operation
This section details how the currently implemented Health Monitor rules operate, grouped by Health-Monitor subsystem:
•System Health Monitor Subsystem Rules
•FIB Health Monitor Subsystem Rules
•Memory Health Monitor Subsystem Rules
•rsc_slot Health Monitor Subsystem Rules
System Health Monitor Subsystem Rules
There is one system Health Monitor subsystem rule:
zero_system_health_rule

The zero_system_health_rule Operation
This rule simply reloads the RSC if the overall system health goes to zero. The intent is that other HM rules, which trigger on catastrophic failures or problems, drive the overall system health to 0, which in turn triggers this rule and reloads the RSC. Note that it triggers regardless of how the overall system health reaches 0; if many smaller problems drive the system health to 0, this rule still triggers.
Note This zero_system_health_rule is redundancy mode independent and operates on both the active and standby RSCs.
FIB Health Monitor Subsystem Rules
There are three FIB Health Monitor subsystem rules:
fib_disabled_busyout_&_power_cycle_fb
fib_disabled_event_rule
fib_recovered_event_rule

These three rules are best viewed as working together as a single rule. The FIBDISABLE rule is the major rule; it invokes the other two rules to increase or decrease the FIB Health Monitor subsystem health.
FIB Rules Operation
When a FIBDISABLE errmsg occurs for an FB, the FIB rules are triggered. The rules busy-out the FB that had the problem, decrement the FIB Health Monitor subsystem health, and then perform the following tasks every 30 seconds:
1. If the FIB has recovered, abort the rule.
2. If the busy-out is complete, reload the FB.
3. If 20 minutes have elapsed, reload the FB.
When the rule is aborted or the FB comes back up, the FIB Health Monitor subsystem health is reinstated. The FIB Health Monitor subsystem health weighting is non-zero and affects the overall system health.
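The 30-second evaluation loop above can be sketched as follows. This is illustrative only; the function and state names are hypothetical:

```python
def fib_rule_tick(fib_recovered, busyout_complete, elapsed_seconds):
    """One 30-second evaluation of the FIB rules' recovery loop."""
    if fib_recovered:
        return "abort-rule"       # 1. FIB recovered: abort, reinstate health
    if busyout_complete:
        return "reload-fb"        # 2. busy-out finished: reload the FB
    if elapsed_seconds >= 20 * 60:
        return "reload-fb"        # 3. 20-minute cap reached: reload the FB
    return "wait"                 # otherwise, check again in 30 seconds

print(fib_rule_tick(False, False, 300))  # wait
```

The 20-minute cap ensures the FB is eventually reloaded even if the busy-out never completes on its own.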
Note The FIB rules are redundancy mode independent and are installed on both the active and standby RSCs, but are only operational on the active RSC. They remain dormant on the standby RSC.
Memory Health Monitor Subsystem Rules
There are four Memory Health Monitor subsystem rules, two low Memory rules and two fragmented Memory rules:
low_processor_memory_rule
low_iomem1_memory_rule
fragmented_processor_memory_rule
fragmented_iomem1_memory_rule

Low Memory Rules Operation
The low memory rules check the amount of free memory on the RSC every 30 seconds. If they detect that the amount of free memory is below the hard-coded threshold, they decrement the Memory Health Monitor subsystem health to zero (0) and reload the RSC.
Note The RSC reload is built into the memory rules. They do not rely on the zero_system_health_rule to reload the RSC.
The hard coded thresholds are:
•Processor memory: 5 Mbytes
•IOMEM memory: 2 Mbytes
The Memory Health Monitor subsystem health weighting is non-zero and affects the overall system health.
Fragmented Memory Rules Operation
The fragmented memory rules check the size of the largest available block of free memory on the RSC every 30 seconds. If they detect that the memory is too fragmented (the largest block of free memory is below the hard-coded threshold), they decrement the Memory Health Monitor subsystem health to zero (0) and reload the RSC.
Note The RSC reload is built into the memory rules. They do not rely on the zero_system_health_rule to reload the RSC.
The hard coded thresholds are:
•Processor memory: 500 kbytes
•IOMEM memory: 100 kbytes
The Memory Health Monitor subsystem health weighting is non-zero and affects the overall system health.
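Both memory rule families reduce to a threshold check every 30 seconds. The sketch below takes the thresholds from the text; everything else (names, byte units) is a hypothetical illustration:

```python
# Hard-coded thresholds described above (assuming Mbytes/kbytes are binary).
LOW_FREE_BYTES = {"processor": 5 * 1024 * 1024, "iomem1": 2 * 1024 * 1024}
MIN_LARGEST_BLOCK = {"processor": 500 * 1024, "iomem1": 100 * 1024}

def memory_rule_check(pool, free_bytes, largest_block_bytes):
    """Return the action the low-memory or fragmented-memory rule would take."""
    if free_bytes < LOW_FREE_BYTES[pool]:
        return "reload-rsc"    # low-memory rule: subsystem health -> 0
    if largest_block_bytes < MIN_LARGEST_BLOCK[pool]:
        return "reload-rsc"    # fragmented-memory rule: subsystem health -> 0
    return "healthy"

print(memory_rule_check("processor", 683770428, 564721540))  # healthy
```

Note that, as stated in the Notes above, the RSC reload is built into these rules directly rather than going through the zero_system_health_rule.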
Note The Memory rules are redundancy mode dependent, so their operation changes depending on which redundancy mode the RSC is in. The changes are internal to the rule and not noticeable via the console. These rules operate on both the active and standby RSCs.
rsc_slot Health Monitor Subsystem Rules
There are two rsc_slot Health Monitor subsystem rules:
repeat_reboot_fbx_rule
boot_adjust_fbx_health_rule

Where x is the slot number the FB is installed in.
rsc_slot Rules Operation
If an FB reboots 5 times in 60 minutes, the repeat_reboot_fbx_rule is triggered and powers down the FB. It also decrements the health of the rsc_slot subsystem.
The FB is allowed to fail 4 times. Then, on the 5th attempt, the FB is powered down, regardless of whether it would have rebooted successfully or not.
If the FB is manually powered up after this, then the boot_adjust_fbx_health_rule is triggered and the RSC slot subsystem health is restored.
Once the rsc_slot rule has triggered and the FB has been powered down, the FB can be manually rebooted by issuing the hw-module slot X reset command. Manually rebooting the FB causes two things to happen:
•It triggers the boot_adjust_fbx_health_rule, which restores the rsc_slot health monitor subsystem health.
•It re-initializes the repeat_reboot_fbx_rule so that another "5 FB reloads in 60 minutes" must occur for it to trigger again.
The rsc_slot Health Monitor subsystem rules for a particular FB only exist while that FB is physically inserted in the chassis. Removing an FB removes the rules for that FB. Inserting an FB installs the rules for that FB. The rules are installed only for FBs, not for RSCs. You will not see the rsc_slot rules installed for slot 6 or 7.
If the active RSC reloads for any reason, the repeat_reboot_fbx_rule is re-initialized so that another "5 FB reloads in 60 minutes" must occur for it to trigger again.
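The "5 FB reboots in 60 minutes" trigger and its re-initialization can be sketched as a sliding window. This is a hypothetical illustration; the class and method names are not part of Cisco IOS:

```python
from collections import deque

class RepeatRebootRule:
    """Sketch of repeat_reboot_fbx_rule: power down an FB that reboots
    5 times within 60 minutes; a manual reset re-initializes the window."""
    def __init__(self, max_reboots=5, window_seconds=3600):
        self.max_reboots = max_reboots
        self.window_seconds = window_seconds
        self.reboot_times = deque()

    def record_reboot(self, now):
        """Record a reboot; return True if the FB should be powered down."""
        self.reboot_times.append(now)
        # Discard reboots that fell out of the 60-minute window.
        while now - self.reboot_times[0] > self.window_seconds:
            self.reboot_times.popleft()
        return len(self.reboot_times) >= self.max_reboots

    def reset(self):
        """A manual reboot (hw-module slot X reset) re-initializes the rule."""
        self.reboot_times.clear()
```

In this model, the fifth reboot inside the window returns True (power down the FB), and calling reset() means another five reboots within 60 minutes are required before the rule triggers again.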
The rsc_slot Health Monitor subsystem health weighting is non-zero and affects the overall system health.
Note The rsc_slot rules are redundancy mode independent and are installed on both the active and standby RSCs. However, they are only active on the active RSC. They remain dormant on the standby RSC.
Diagnostic Monitor Health Monitor Rules Operation
A diagnostic monitor rule is triggered when a test fails; the rule then decrements the corresponding subsystem's health. When the failed test is repeated and passes, the rule increments the corresponding subsystem's health.
The health weightings of the various diagnostic monitor subsystems are assigned such that they have differing effects on the overall system health. They may have anywhere from no effect (zero health weighting) to maximum effect (maximum health weighting). Those that have maximum effect drive the overall system health to zero when the rule associated with the particular diagnostic monitor subsystem triggers. This in turn triggers the zero_system_health_rule, which causes the RSC to reload. Use the show health-monitor subsystem command to determine the weightings of any health monitor subsystem.
Troubleshooting Tips
•To see the state of the HM system, use the show health-monitor subsystem health_monitor command in privileged EXEC mode:
Router#show health-monitor subsystem health_monitor
Subsystem       Health  Weighting (max 10000)
health_monitor   100%   500

Subsystem Event Statistics
0 catastrophic
0 critical
0 high
0 medium
0 low
0 positive

Subsystem Notification Configuration
100 high-threshold
0 low-threshold
FALSE notify-enable

•To troubleshoot the Health Monitor feature, use the debug health-monitor command in privileged EXEC mode.
•To troubleshoot the Diagnostic Monitor feature, use the debug diagnostic-monitor command in user EXEC mode.
•To troubleshoot the non-diagnostic based Health Monitor Rules, use the debug <hm-subsys> health-monitor command in privileged EXEC mode.
debug ip cef health-monitor
debug memory health-monitor
debug slot health-monitor
debug hm-rules redundancy

•DM and HM continuously report a component as being faulty and then OK. When a component has intermittent problems, the DM test and HM rule associated with the component need to be disabled while the component is replaced.
To disable a DM test and HM rule, follow this procedure:
Note When enabling or disabling Health Monitor rules which are associated with diagnostic component tests it is imperative that this be done in a specific order so that Diagnostic Monitor events which are sent to the Health Monitor are not lost. Failure to do so may leave the Health Monitor and Diagnostic Monitor out of sync.
a. Disable scheduling of DM test.
To disable a specific DM test, use the no test command in diagnostic-monitor configuration mode.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(config-dm)#no test testname
Reset test result(s) to pass?? [yes/no]:
Note Answer "no" so that HM still sees this as a failed component.
b. Disable associated HM rule.
To disable a specific HM rule, use the rule subsystem command in health-monitor configuration mode.
Router#configure terminal
Router(config)#health-monitor
Router(config-hm)#rule subsystem subsystem-name rule-name rule-name [disable | enable]
c. Replace the faulty component.
d. Enable HM rule.
e. Enable scheduling of DM test.
Additional References
For additional information related to HM and DM for the Cisco AS5850, refer to the following references:
Related Documents
Standards
MIBs
This MIB module provides health information for the system and each of its subsystems on the active and standby RSC. In addition to providing health metrics, statistics are provided for the number of error events and correctional events which are received for each subsystem. Also, notifications can be configured so that they are sent when the health of a subsystem reaches a high or low threshold. Indexing in the MIB is performed on the ASCII subsystem name.
To locate and download MIBs for selected platforms, Cisco IOS releases, and feature sets, use Cisco MIB Locator found at the following URL:
http://tools.cisco.com/ITDIT/MIBS/servlet/index
If Cisco MIB Locator does not support the MIB information that you need, you can also obtain a list of supported MIBs and download MIBs from the Cisco MIBs page at the following URL:
http://www.cisco.com/public/sw-center/netmgmt/cmtk/mibs.shtml
To access Cisco MIB Locator, you must have an account on Cisco.com. If you have forgotten or lost your account information, send a blank e-mail to cco-locksmith@cisco.com. An automatic check will verify that your e-mail address is registered with Cisco.com. If the check is successful, account details with a new random password will be e-mailed to you. Qualified users can establish an account on Cisco.com by following the directions found at this URL:
RFCs
Technical Assistance
Command Reference
This section documents new or modified commands. All other commands used with this feature are documented in the Cisco IOS High Availability command reference publications for various releases.
New Commands
•show health-monitor subsystem
•rule
•test
bootup
To enable or disable the bootup tests, use the bootup command in diagnostic-monitor configuration mode.
[no] bootup {tests}
Syntax Description
Defaults
No default behavior or values.
Command Modes
Diagnostic-monitor configuration mode.
Command History
Usage Guidelines
Use this command to enable or disable the bootup diagnostic tests that can be monitored by the Diagnostic Monitor.
Examples
This example enables the bootup tests to be monitored by Diagnostic Monitor after the RSC is reloaded.
Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#bootup tests

Related Commands
Command                        Description
show diagnostic-monitor test   Displays the results and default values for Diagnostic Monitor tests.
show health-monitor events
To see statistics for the events that the Health Monitor has received, use the show health-monitor events command.
show health-monitor events
Syntax Description
Defaults
No default behavior or values.
Command Modes
privileged EXEC
Command History
Usage Guidelines
Use this command if there is reason to believe that the Health Monitor is not processing events from the system.
Examples
This example shows the output of this command:
Event Statistics
0 catastrophic
0 critical
0 high
3 medium
5 low
10 positive

The following events were discarded
136 unknown
0 negligible
0 health monitor events

Event buffer pool
Number of free event buffers = 300
Number of events awaiting processing by HM Normal process = 0
Number of events awaiting processing by HM Urgent process = 0

Related Commands
show health-monitor variable
To see information about a variable in the Health Monitor Variable Database, use the show health-monitor variable command.
show health-monitor variable [subsystem subsystem-name [var-name variable-name]]
Syntax Description
Defaults
No default behavior or values.
Command Modes
privileged EXEC
Command History
Usage Guidelines
Use this command to see the value of a Health Monitor variable. This can be of use if a variable is used to trigger a Health Monitor rule.
Examples
The following example shows part of the Variable Database output:
Router#show health-monitor variable

Type Key:
(Num)Number    (Hlth)Health     (VPtr)Void Pointer  (Str)String
(Bool)Boolean  (Freq)Frequency  (Arg)Argument       (Tokn)Token

Subsystem             Variable Name                Type  Value
fb0_dsip_ping_diags   fb0_dsip_ping_diags_health   Hlth  100%
fb0_mil_ping_diags    fb0_mil_ping_diags_health    Hlth  100%
fb10_dsip_ping_diags  fb10_dsip_ping_diags_health  Hlth  100%
fb10_mil_ping_diags   fb10_mil_ping_diags_health   Hlth  100%

The following shows the output for all of the variables in the memory subsystem:
Router#show health-monitor variable subsystem memory
Variable Name                  Type   Value
free_iomem1_memory             Number 101481696
free_processor_memory          Number 683770428
largest_block_iomem1_memory    Number 101390044
largest_block_processor_memory Number 564721540
memory_health                  Health 100%
The following example shows the output of this command for one specific variable. Note that the rules associated with this variable are also shown:
Router#show health-monitor variable subsystem memory var-name free_iomem1_memory
Subsystem    : memory
Variable Name: free_iomem1_memory
Type         : Number
Value        : 101481696
Num Reads    : 20859
Num Writes   : 20859
Associated rules:
Status (S) codes:
A = active
D = deactivated
S ID  Subsystem Name
A 137 memory    low_iomem1_memory_rule
Related Commands
show health-monitor subsystem
To check the overall system health as well as the health of each subsystem, use the show health-monitor subsystem command in privileged EXEC mode.
show health-monitor subsystem [subsystem-name | standby]
Syntax Description
Defaults
No default behavior or values.
Command Modes
privileged EXEC
Command History
Usage Guidelines
Use the show health-monitor subsystem command to check the health of the system or a subsystem on either the active or standby RSC. If the system health is degraded, use this command to isolate which subsystem has less than perfect health and is affecting the system health.
Examples
The following example shows output of the show health-monitor subsystem command:
Router#show health-monitor subsystem
System health is 100%
Subsystem            Health Weighting (max 10000)
dsip_fbx_ping_iosdia 100%   10000
fdm_appl_iosdiags    100%   10000
fib                  100%   100
gt64120_iosdiags     100%   10000
health_monitor       100%   500
inter_rsc_iosdiags   100%   5000
mbus_eeprom_iosdiags 100%   500
memory               100%   10000
mha_line_iosdiags    100%   5000
pci_amdfe_iosdiags   100%   5000
pif_iosdiags         100%   10000
rsc_cf_iosdiags      100%   500
rsc_common_iosdiags  100%   8000
rsc_envm_iosdiags    100%   1000
rsc_fb_iosdiags      100%   10000
rsc_fpfe_iosdiags    100%   2000
rsc_fpga_iosdiags    100%   10000
rsc_gige_iosdiags    100%   10000
rsc_mmc_iosdiags     100%   10000
rsc_rommon_iosdiags  100%   100
rsc_slot             100%   100
system               100%   10000
Router#show health-monitor subsystem memory
Subsystem Health Weighting (max 10000)
memory    100%   10000
Subsystem Event Statistics
0 catastrophic
0 criticial
0 high
0 medium
0 low
0 positive
Subsystem Notification Configuration
100 high-threshold
0 low-threshold
FALSE notify-enable
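The Weighting column indicates how strongly each subsystem's health influences the overall system figure. The exact IOS combination formula is not documented here, but a weighted average is one plausible reading; the Python sketch below is purely illustrative of how such a combination would behave.

```python
def system_health(subsystems):
    """Combine per-subsystem health percentages (0-100) into one system
    health figure, scaling each subsystem by its configured weight.
    Illustrative only: the actual IOS combination rule may differ."""
    total_weight = sum(weight for _, weight in subsystems)
    if total_weight == 0:
        return 100.0  # nothing registered: report perfect health
    weighted = sum(health * weight for health, weight in subsystems)
    return weighted / total_weight

# Subsystems as (health %, weight) pairs; here memory has degraded to 50%.
subsystems = [(100, 10000), (50, 10000), (100, 500)]
print(round(system_health(subsystems), 2))  # 75.61
```

With equal weights a single failed subsystem drags the system figure down proportionally, which matches the stepwise system-health decreases seen in the event-trace output elsewhere in this document.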
Related Commands
show health-monitor rule
To display current or historical status relating to health monitor rules, use the show health-monitor rule command in privileged EXEC mode.
show health-monitor rule [rule-id | subsystem subsystem-name [rule-name rule-name | detail] | detail]
Syntax Description
Defaults
No default behavior or values.
Command Modes
Privileged EXEC
Command History
Usage Guidelines
Use this command to view a summary list of Health Monitor rules, or detailed information on an individual rule. Detailed information includes the condition that triggers the rule, the action(s) the rule performs, whether the rule is activated or deactivated, and historical data on how many times the rule has been evaluated and triggered.
The show health-monitor rule command can be used to find the rule-name, the rule-id or the subsystem name associated with a rule. This information can then be used in the longer commands based on the show health-monitor rule command.
Examples
The following example shows the output from the show health-monitor rule command:
Router#show health-monitor rule
Status (S) codes:
A = active
D = deactivated
S ID Subsystem Name
A 43 dsip_fbx_ping_iosdiags dsip_fbx_ping_root_cause_dec
A 44 dsip_fbx_ping_iosdiags dsip_fbx_ping_root_cause_inc
A 49 fdm_appl_iosdiags tcam_rw_root_cause_dec
A 50 fdm_appl_iosdiags tcam_rw_root_cause_inc
A 72 fib fibdisable_busyout_&_power_cycle_fb
A 73 fib fib_disabled_event_rule
A 74 fib fib_recovered_event_rule
A 67 gt64120_iosdiags gt64120_id_root_cause_dec
A 68 gt64120_iosdiags gt64120_id_root_cause_inc
A 1 inter_rsc_iosdiags inter_rsc_root_cause_dec
A 2 inter_rsc_iosdiags inter_rsc_root_cause_inc
A 45 mbus_eeprom_iosdiags mbus_eeprom_rw_root_cause_dec
A 46 mbus_eeprom_iosdiags mbus_eeprom_rw_root_cause_inc
A 70 memory low_processor_memory_rule
A 71 memory low_iomem1_memory_rule
A 3 mha_line_iosdiags ha_line_root_cause_dec
A 4 mha_line_iosdiags ha_line_root_cause_incr_health
A 57 pci_amdfe_iosdiags pci_mac_id_root_cause_dec
A 58 pci_amdfe_iosdiags pci_mac_cfg_read_root_cause_dec
A 59 pci_amdfe_iosdiags pci_mac_cfg_rw_root_cause_dec
A 60 pci_amdfe_iosdiags pci_mac_reg_rw_root_cause_dec
A 61 pci_amdfe_iosdiags pci_mac_int_loopback_root_cause_dec
A 62 pci_amdfe_iosdiags pci_mac_id_root_cause_inc
A 63 pci_amdfe_iosdiags pci_mac_cfg_read_root_cause_inc
A 64 pci_amdfe_iosdiags pci_mac_cfg_rw_root_cause_inc
A 65 pci_amdfe_iosdiags pci_mac_reg_rw_root_cause_inc
A 66 pci_amdfe_iosdiags pci_mac_int_loopback_root_cause_inc
A 15 pif_iosdiags epifx_id_root_cause_dec
A 16 pif_iosdiags epifx_phy_read_root_cause_dec
A 17 pif_iosdiags epifx_imem_read_root_cause_dec
A 18 pif_iosdiags xpifx_id_root_cause_dec
A 19 pif_iosdiags xpifx_phy_read_root_cause_dec
A 20 pif_iosdiags xpifx_imem_read_root_cause_dec
A 21 pif_iosdiags epifx_id_root_cause_inc
A 22 pif_iosdiags epifx_phy_read_root_cause_inc
A 23 pif_iosdiags epifx_imem_read_root_cause_inc
A 24 pif_iosdiags xpifx_id_root_cause_inc
A 25 pif_iosdiags xpifx_phy_read_root_cause_inc
A 26 pif_iosdiags xpifx_imem_read_root_cause_inc
A 51 rsc_cf_iosdiags compact_flash_rw_root_cause_dec
A 52 rsc_cf_iosdiags compact_flash_id_root_cause_dec
A 53 rsc_cf_iosdiags compact_flash_read_root_cause_dec
A 54 rsc_cf_iosdiags compact_flash_rw_root_cause_inc
A 55 rsc_cf_iosdiags compact_flash_id_root_cause_inc
A 56 rsc_cf_iosdiags compact_flash_read_root_cause_inc
A 75 rsc_common_iosdiags rsc_high_prio_latency_root_cause_dec
A 76 rsc_common_iosdiags rsc_high_prio_latency_root_cause_inc
A 77 rsc_common_iosdiags rsc_normal_prio_latency_root_cause_dec
A 78 rsc_common_iosdiags rsc_normal_prio_latency_root_cause_inc
A 79 rsc_common_iosdiags rsc_low_prio_latency_root_cause_dec
A 80 rsc_common_iosdiags rsc_low_prio_latency_root_cause_inc
A 81 rsc_common_iosdiags rsc_cpu_util_root_cause_dec
A 82 rsc_common_iosdiags rsc_cpu_util_root_cause_inc
A 83 rsc_common_iosdiags rsc_mbus_local_id_root_cause_dec
A 84 rsc_common_iosdiags rsc_mbus_local_id_root_cause_inc
A 47 rsc_envm_iosdiags rsc_mbus_temp_sensor_root_cause_dec
A 48 rsc_envm_iosdiags rsc_mbus_temp_sensor_root_root_cause_in
A 5 rsc_fb_iosdiags rsc_fb_mil_path_ping_root_cause_dec
A 6 rsc_fb_iosdiags rsc_fb_mil_path_ping_root_cause_inc
A 27 rsc_fpfe_iosdiags rsc_fpfe_id_root_cause_dec
A 28 rsc_fpfe_iosdiags rsc_fpfe_usr_reg_rw_root_cause_dec
A 29 rsc_fpfe_iosdiags rsc_fpfe_xtal_root_cause_dec
A 30 rsc_fpfe_iosdiags rsc_fpfe_low_voltage_root_cause_dec
A 31 rsc_fpfe_iosdiags rsc_fpfe_reg_rw_root_cause_dec
A 32 rsc_fpfe_iosdiags rsc_fpfe_loopback_result_root_cause_dec
A 33 rsc_fpfe_iosdiags rsc_fpfe_id_root_cause_inc
A 34 rsc_fpfe_iosdiags rsc_fpfe_user_reg_rw_root_cause_inc
A 35 rsc_fpfe_iosdiags rsc_fpfe_xtal_root_cause_inc
A 36 rsc_fpfe_iosdiags rsc_fpfe_low_voltage_root_cause_inc
A 37 rsc_fpfe_iosdiags rsc_fpfe_reg_rw_root_cause_inc
A 38 rsc_fpfe_iosdiags rsc_fpfe_loopback_result_root_cause_inc
A 7 rsc_fpga_iosdiags fpga_id_root_cause_dec
A 8 rsc_fpga_iosdiags fpga_scratch_rw_root_cause_dec
A 9 rsc_fpga_iosdiags fpga_id_root_cause_inc
A 10 rsc_fpga_iosdiags fpga_scratch_rw_root_cause_inc
A 39 rsc_gige_iosdiags rsc_gige_reg_rw_result_root_cause_dec
A 40 rsc_gige_iosdiags rsc_gige_addr_reg_rw_root_cause_dec
A 41 rsc_gige_iosdiags rsc_gige_reg_rw_root_cause_inc
A 42 rsc_gige_iosdiags rsc_gige_addr_reg_rw_root_cause_inc
A 13 rsc_mmc_iosdiags rsc_mmc_id_root_cause_dec
A 14 rsc_mmc_iosdiags rsc_mmc_id_root_cause_inc
A 11 rsc_rommon_iosdiags rsc_rommon_read_root_cause_dec
A 12 rsc_rommon_iosdiags rsc_rommon_read_root_cause_inc
A 85 rsc_slot repeat_reboot_fb2_rule
A 86 rsc_slot repeat_reboot_fb3_rule
A 87 rsc_slot repeat_reboot_fb6_rule
A 88 rsc_slot repeat_reboot_fb9_rule
A 89 rsc_slot repeat_reboot_fb10_rule
A 90 rsc_slot repeat_reboot_fb12_rule
D 69 system zero_system_health_rule
The following example shows the output for a health-monitor rule:
Router#show health-monitor rule 70
Status (S) codes:
A = active
D = deactivated
S ID Subsystem Name
A 70 memory low_processor_memory_rule
Condition:
(free_processor_memory < 5242880)
Action: Decrement health by 100%
Action: Reload this RSC
Number of times this rule has been evaluated = 131
Number of times this rule evaluated to TRUE = 0
Number of times associated actions were invoked = 0
Note To see the same information as above, you could also use the command show health-monitor subsystem memory rule-name low_processor_memory_rule.
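The rule detail above is a trigger condition paired with an ordered list of actions, plus evaluation counters. The following Python sketch is a hedged illustration of that structure; the real HM rule engine is internal to IOS, and only the rule name, threshold, and action strings are taken from the output above.

```python
# Threshold taken from the rule condition shown above (5242880 bytes = 5 MB).
LOW_PROCESSOR_MEMORY_THRESHOLD = 5242880

class Rule:
    """Minimal model of an HM rule: a condition, ordered actions,
    and the evaluated/triggered counters reported by show output."""
    def __init__(self, name, condition, actions):
        self.name, self.condition, self.actions = name, condition, actions
        self.evaluated = self.triggered = 0

    def evaluate(self, variables):
        self.evaluated += 1
        if self.condition(variables):
            self.triggered += 1
            for action in self.actions:
                action(variables)

log = []
rule = Rule(
    "low_processor_memory_rule",
    lambda v: v["free_processor_memory"] < LOW_PROCESSOR_MEMORY_THRESHOLD,
    [lambda v: log.append("Decrement health by 100%"),
     lambda v: log.append("Reload this RSC")],
)
rule.evaluate({"free_processor_memory": 683770428})  # plenty free: no trigger
rule.evaluate({"free_processor_memory": 1048576})    # below threshold: actions fire
print(rule.evaluated, rule.triggered, log)
```

The counters mirror the "Number of times" lines in the command output: every variable update increments the evaluated count, and only a TRUE condition invokes the actions.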
The following example shows the output for a health-monitor subsystem:
Router#show health-monitor rule subsystem memory
Status (S) codes:
A = active
D = deactivated
S ID Subsystem Name
A 70 memory low_processor_memory_rule
A 71 memory low_iomem1_memory_rule
Related Commands
Command Description
show health-monitor variable subsystem <subsystem-name> var-name <var-name>
Shows all rules associated with a variable.
rule
Activates/Deactivates rule(s).
show diagnostic-monitor test
To display the tests run by diagnostic monitor, use the show diagnostic-monitor test command in privileged EXEC mode.
show diagnostic-monitor test {all | test-name {counters | details | status | summary | timers}}
Syntax Description
Defaults
No default behavior or values.
Command Modes
privileged EXEC
Command History
Usage Guidelines
Use this command to display the details and status of diagnostic monitor tests.
Examples
The following example shows how many times the test has been run.
Router#show diagnostic-monitor test tcam-rw counters
Name                       Passed        Failed        Unknown
                           Count         Count         Count
-------------------------- ------------- ------------- -------------
tcam-rw                    2337          0             0
The following example shows the test results.
Router#show diagnostic-monitor test tcam-rw details
Note: R = Root cause failure.
S = Superseded root cause failure.
* = Bootup test only.
Name                       Test        Test Result Reason
                           Result
-------------------------- ----------- -----------------------
tcam-rw                    Pass        Test response
The output below shows whether the test is running on the RSC (Running column) and, more importantly, whether it is allowed to run on the RSC (Runnable column). The RSC may be in a mode where the test is not allowed to run; this output will indicate that.
Router#show diagnostic-monitor test tcam-rw status
Note: R = Root cause failure.
S = Superseded root cause failure.
* = Bootup test only.
Runnable = Test is allowed to run.
Running = Test is scheduled to run.
Name                       Test        Runnable Running
                           Result
-------------------------- ----------- -------- -------
tcam-rw                    Pass        Yes      Yes
The output below summarizes the status of the test and when it is next scheduled to run.
Router#show diagnostic-monitor test tcam-rw summary
Note: R = Root cause failure.
S = Superseded root cause failure.
* = Bootup test only.
Name                       Test        Next-Test Scheduled
                           Result      (days.hrs:min:sec.ms)
-------------------------- ----------- ---------------------
tcam-rw                    Pass        00.00:00:30.772
The following command displays the frequency of a test on the active and standby RSCs, as well as their timeout values.
Router#show diagnostic-monitor test tcam-rw timers
Name                       Active-Freq     Standby-Freq    Timeout
                           (days.hrs:min:sec.ms)           (ms)
-------------------------- --------------- --------------- -------
tcam-rw                    00.00:01:00.000 00.00:01:00.000 1000
Related Commands
Command Description
bootup
Enables or disables bootup diagnostic tests.
test
Changes diagnostic test parameters.
default
Sets diagnostic test parameters to their defaults.
show monitor event-trace hm
To display the events sent to HM, use the show monitor event-trace hm command in privileged EXEC mode.
show monitor event-trace hm {all | back {mmm | hhh:mm} | clock hh:mm | from-boot seconds | latest | parameters size}
Syntax Description
Defaults
No default behavior or values.
Command Modes
privileged EXEC
Command History
Usage Guidelines
Use this command to display the events that have changed the health of the system and subsystems.
Examples
The following example shows events, actions, and changes to the system and subsystem health.
Router#show monitor event-trace hm all
Feb 24 03:10:34: Event: Subsystem rsc_slot: ev_num 12, ntf type EVENT
Feb 24 03:10:34: Event: Subsystem rsc_slot: ev_num 1, ntf type EVENT
Feb 24 03:10:34: Event: Subsystem rsc_slot: ev_num 3, ntf type EVENT
Feb 24 03:10:34: Event: Subsystem rsc_slot: ev_num 4, ntf type EVENT
Feb 24 03:10:35: Event: Subsystem rsc_slot: ev_num 9, ntf type EVENT
Feb 24 03:10:35: Event: Subsystem rsc_slot: ev_num 10, ntf type EVENT
Feb 24 03:12:54: Event: Subsystem fb3_dsip_ping_diags: ev_num 1, ntf type EVENT
Feb 24 03:12:54: Health change: Subsystem fb3_dsip_ping_diags: Health decreased to 0%
Feb 24 03:12:54: Health change: Subsystem system: Health decreased to 91.66%
Feb 24 03:12:54: Action invoked: Subsystem fb3_dsip_ping_diags: Rule-name fb3_dsip_ping_rc_dec, Action: Decrement health by 100%
Feb 24 03:13:04: Event: Subsystem fb10_dsip_ping_diags: ev_num 1, ntf type EVENT
Feb 24 03:13:04: Health change: Subsystem fb10_dsip_ping_diags: Health decreased to 0%
Feb 24 03:13:04: Health change: Subsystem system: Health decreased to 83.32%
Feb 24 03:13:04: Action invoked: Subsystem fb10_dsip_ping_diags: Rule-name fb10_dsip_ping_rc_dec, Action: Decrement health by 100%
Feb 24 03:13:04: Event: Subsystem fb9_dsip_ping_diags: ev_num 1, ntf type EVENT
Feb 24 03:13:04: Health change: Subsystem fb9_dsip_ping_diags: Health decreased to 0%
Feb 24 03:13:04: Health change: Subsystem system: Health decreased to 74.98%
Feb 24 03:13:04: Action invoked: Subsystem fb9_dsip_ping_diags: Rule-name fb9_dsip_ping_rc_dec, Action: Decrement health by 100%
Feb 24 03:13:04: Event: Subsystem fb4_dsip_ping_diags: ev_num 1, ntf type EVENT
Feb 24 03:13:04: Health change: Subsystem fb4_dsip_ping_diags: Health decreased to 0%
Feb 24 03:13:04: Health change: Subsystem system: Health decreased to 66.64%
Feb 24 03:13:04: Action invoked: Subsystem fb4_dsip_ping_diags: Rule-name fb4_dsip_ping_rc_dec, Action: Decrement health by 100%
Feb 24 03:13:27: Event: Subsystem fb4_dsip_ping_diags: ev_num 2, ntf type EVENT
Feb 24 03:13:27: Health change: Subsystem fb4_dsip_ping_diags: Health increased to 100%
Feb 24 03:13:27: Health change: Subsystem system: Health increased to 74.98%
Feb 24 03:13:27: Action invoked: Subsystem fb4_dsip_ping_diags: Rule-name fb4_dsip_ping_rc_inc, Action: Increment health by 100%
Feb 24 03:13:27: Event: Subsystem fb10_dsip_ping_diags: ev_num 2, ntf type EVENT
Feb 24 03:13:27: Health change: Subsystem fb10_dsip_ping_diags: Health increased to 100%
Feb 24 03:13:27: Health change: Subsystem system: Health increased to 83.32%
Feb 24 03:13:27: Action invoked: Subsystem fb10_dsip_ping_diags: Rule-name fb10_dsip_ping_rc_inc, Action: Increment health by 100%
Feb 24 03:13:27: Event: Subsystem fb9_dsip_ping_diags: ev_num 2, ntf type EVENT
Feb 24 03:13:27: Health change: Subsystem fb9_dsip_ping_diags: Health increased to 100%
Feb 24 03:13:27: Health change: Subsystem system: Health increased to 91.66%
Feb 24 03:13:27: Action invoked: Subsystem fb9_dsip_ping_diags: Rule-name fb9_dsip_ping_rc_inc, Action: Increment health by 100%
Feb 24 03:13:34: Event: Subsystem fb3_dsip_ping_diags: ev_num 2, ntf type EVENT
Feb 24 03:13:34: Health change: Subsystem fb3_dsip_ping_diags: Health increased to 100%
Feb 24 03:13:34: Health change: Subsystem system: Health increased to 100%
Feb 24 03:13:34: Action invoked: Subsystem fb3_dsip_ping_diags: Rule-name fb3_dsip_ping_rc_inc, Action: Increment health by 100%
Related Commands
health-monitor subsystem
To set the health monitor subsystem health value, use the health-monitor subsystem command in privileged EXEC mode.
health-monitor subsystem subsystem set health-value
Syntax Description
subsystem
Subsystem name.
set
Sets the subsystem's health value.
health-value
Health value, in 0.01-percent increments (for example, 5000 represents 50 percent).
Defaults
No default behavior or values.
Command Modes
Privileged EXEC.
Command History
Usage Guidelines
This command sets a subsystem's health and updates the subsystems that depend on it, including the system health. Use this command only when recovering from a Health Monitor internal fault or from an incorrect user procedure that resulted in an incorrect subsystem health value.
Examples
This example sets the health value for the memory subsystem to 50 percent.
Router#health-monitor subsystem memory health set 5000
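Because health-value is expressed in 0.01-percent increments, the 50-percent example above corresponds to 5000. A trivial conversion sketch (illustrative; the function name is ours, not an IOS API):

```python
def percent_to_health_value(percent):
    """Convert a health percentage to the 0.01-percent-increment units
    accepted by the health-monitor subsystem ... set command."""
    return int(round(percent * 100))

print(percent_to_health_value(50))   # 5000, as in the example above
print(percent_to_health_value(100))  # 10000, i.e. perfect health
```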
Related Commands
debug health-monitor
To turn on health monitoring debugging, use the debug health-monitor command in privileged EXEC mode. Use the no form of this command to turn off debugging.
debug health-monitor [ action | api | cli | condition { duration | frequency} | errors | events | mib | remote-support | rule | subsystem | variable]
Syntax Description
Defaults
No default behavior or values.
Command Modes
Privileged EXEC
Command History
Usage Guidelines
Use the debug health-monitor commands to turn on debugging for various aspects of the health monitor. Use them to diagnose the operation of rules or to determine why system or subsystem health is changing.
This command provides detailed information on health monitor internal processing. Because the health monitor rules rely so heavily on the health monitor itself, this output also provides considerable extra information on rule operation.
Examples
The example below shows the HM evaluating conditions in the rule database:
Router# debug health-monitor condition
Aug 4 15:38:59.206: HM COND: eval condition
Aug 4 15:38:59.206: HM COND: eval leaf condition (0x648F83B8), left var type, operator <
Aug 4 15:38:59.206: HM COND: eval leaf condition, right val type
Aug 4 15:38:59.206: HM COND: eval leaf condition (0x648F83B8), evaluates to FALSE
Aug 4 15:38:59.206: HM COND: eval condition, leaf result FALSE
Aug 4 15:38:59.206: HM COND: eval condition
Aug 4 15:38:59.206: HM COND: eval leaf condition (0x648F8710), left var type, operator <
Aug 4 15:38:59.206: HM COND: eval leaf condition, right val type
Aug 4 15:38:59.206: HM COND: eval leaf condition (0x648F8710), evaluates to FALSE
Aug 4 15:38:59.206: HM COND: eval condition, leaf result FALSE
Aug 4 15:38:59.206: HM COND: eval condition
Aug 4 15:38:59.206: HM COND: eval leaf condition (0x648F8A68), left var type, operator <
Aug 4 15:38:59.206: HM COND: eval leaf condition, right val type
Aug 4 15:38:59.206: HM COND: eval leaf condition (0x648F8A68), evaluates to FALSE
Aug 4 15:38:59.206: HM COND: eval condition, leaf result FALSE
Aug 4 15:38:59.206: HM COND: eval condition
Aug 4 15:38:59.206: HM COND: eval leaf condition (0x648F8DC0), left var type, operator <
Aug 4 15:38:59.206: HM COND: eval leaf condition, right val type
Aug 4 15:38:59.206: HM COND: eval leaf condition (0x648F8DC0), evaluates to FALSE
Aug 4 15:38:59.206: HM COND: eval condition, leaf result FALSE
The example below shows the reading and writing of memory variables in the Health Monitor Variable database:
Router# debug health-monitor variable
Aug 4 15:39:29.253: HM_VAR: Write to var (free_processor_memory) succeeded
Aug 4 15:39:29.253: HM_VAR: Write to var (free_iomem1_memory) succeeded
Aug 4 15:39:29.253: HM_VAR: Write to var (largest_block_processor_memory) succeeded
Aug 4 15:39:29.253: HM_VAR: Write to var (largest_block_iomem1_memory) succeeded
Aug 4 15:39:29.253: HM_VAR: Var (free_processor_memory) read succeeded
Aug 4 15:39:29.253: HM_VAR: Var (free_iomem1_memory) read succeeded
Aug 4 15:39:29.253: HM_VAR: Var (largest_block_processor_memory) read succeeded
Aug 4 15:39:29.253: HM_VAR: Var (largest_block_iomem1_memory) read succeeded
Related Commands
Command Description
debug diagnostic-monitor
Turns on debugging for the Diagnostic Monitor.
show monitor event-trace hm
Displays the Health Monitor event trace.
debug ip cef health-monitor
To turn on debugging for the FIB health monitor subsystem, use the debug ip cef health-monitor command in privileged EXEC mode. Use the no form of this command to turn off debugging.
debug ip cef health-monitor
Syntax Description
This command has no arguments or keywords.
Defaults
No default behavior or values.
Command Modes
Privileged EXEC
Command History
Usage Guidelines
Use the debug ip cef health-monitor command to turn on debugging for all health monitor rules associated with the FIB health monitor subsystem. This command is used to diagnose faults in, or view more detailed operational information about, the health monitor rules that belong to the FIB health monitor subsystem.
The FIB health-monitor rules are shown below:
Router#show health-monitor rule subsystem fib
Status (S) codes:
A = active
D = deactivated
S ID Subsystem Name
A 72 fib fib_disabled_busyout_&_power_cycle_fb
A 73 fib fib_disabled_event_rule
A 74 fib fib_recovered_event_rule
Together these rules perform the required actions when a FIBDISABLE error message occurs: they busy out the feature board (FB) and decrement the FIB subsystem health. Periodically, the rules check whether FIB has recovered or the busyout has completed. If FIB recovers, the rule aborts the busyout and increments the health. If FIB does not recover and the busyout completes, the rule reloads the FB immediately. There is also a timeout period, after which the rule reloads the FB regardless of whether the busyout is complete. When the FB boots, the FIB subsystem health is reinstated.
Note The terms FIB and CEF are interchangeable. These terms refer to the same switching functionality.
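The recovery sequence described above (busy out, periodic check, reload) reduces to a small decision step each time the check timer fires. The Python sketch below is illustrative only; the actual rules run asynchronously inside IOS, and the function name and return strings are ours.

```python
def handle_fibdisable(fib_recovered, busyout_complete, timed_out):
    """One check-timer step of the FIBDISABLE handling described above.
    Inputs are booleans sampled when the timer fires; the real rules
    re-arm the timer and run asynchronously inside IOS."""
    if fib_recovered:
        # FIB came back: abort the busyout and restore subsystem health.
        return "cancel busyout, increment FIB health"
    if busyout_complete or timed_out:
        # No recovery and calls are drained (or we gave up waiting).
        return "reload feature board"
    return "re-arm check timer"

print(handle_fibdisable(True, False, False))   # recovery case
print(handle_fibdisable(False, True, False))   # busyout finished: reload FB
print(handle_fibdisable(False, False, False))  # neither yet: keep waiting
```

The two debug traces in the Examples section correspond to the first two branches: one where FIB recovers before the busyout completes, and one where the busyout completes first and the FB is reset.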
Examples
The following example turns on debugging for the FIB Health Monitor rules:
Router#debug ip cef health-monitor
IP CEF Health Monitor Rules debugging is on
The following example shows detailed operational information when the rule triggers and FIB recovers:
Oct 4 14:38:13.506: %FIB-3-FIBDISABLE: Fatal error, slot 0: No window message, LC to RP IPC is non-operational
Oct 4 14:38:13.510: CEF_HM: FIBDISABLE Rule triggered - busying out FB 0
Oct 4 14:38:13.510: CEF_HM: Sent FIB_DISABLED event to Health Monitor (slot 0)
Oct 4 14:38:13.510: CEF_HM: Started check timer for slot 0
Oct 4 14:38:13.510: %SLOT-4-FB_BUSYOUT: Busy out feature board 0, initiated by the Health Monitor
Oct 4 14:38:13.510: CEF_HM: CEF problem on slot 0 detected. Decremented CEF subsystem health by 5000
Oct 4 14:38:43.510: CEF_HM: Check timer expired for FB 0 - processing...
Oct 4 14:38:43.510: CEF_HM: Sent FIB_RECOVERED event to Health Monitor (slot 0)
Oct 4 14:38:43.510: %SLOT-4-FB_BUSYOUT: Busy out feature board 0, cancelled by the Health Monitor due to FIB recovery
Oct 4 14:38:43.606: CEF_HM: CEF problem on slot 0 recovered. Incremented CEF subsystem health by 5000
The following example shows detailed operational information when the rule triggers and the busyout completes before FIB recovers (the feature card is reset).
Oct 4 14:41:10.561: %FIB-3-FIBDISABLE: Fatal error, slot 0: No window message, LC to RP IPC is non-operational
Oct 4 14:41:10.565: CEF_HM: FIBDISABLE Rule triggered - busying out FB 0
Oct 4 14:41:10.565: CEF_HM: Sent FIB_DISABLED event to Health Monitor (slot 0)
Oct 4 14:41:10.565: CEF_HM: Started check timer for slot 0
Oct 4 14:41:10.565: %SLOT-4-FB_BUSYOUT: Busy out feature board 0, initiated by the Health Monitor
Oct 4 14:41:10.565: CEF_HM: CEF problem on slot 0 detected. Decremented CEF subsystem health by 5000
Oct 4 14:41:40.565: CEF_HM: Check timer expired for FB 0 - processing...
Oct 4 14:41:40.573: CEF_HM: Busyout complete - reset FB 0
Oct 4 14:41:40.573: %SLOT-4-FB_RESET: Resetting feature board 0, as requested by the Health Monitor
Oct 4 14:41:40.573: %DSIPPF-5-DS_KEEPALIVE_LOSS: DSIP Keepalive Loss from shelf 0 slot 0
Oct 4 14:41:55.577: CEF_HM: Notify HM that FB 0 reloaded
Oct 4 14:41:55.577: CEF_HM: Sent FIB_RECOVERED event to Health Monitor (slot 0)
Oct 4 14:41:55.577: %SLOT-4-FB_BUSYOUT: Busy out feature board 0, cancelled by the Health Monitor due to FB reload
Oct 4 14:41:55.653: CEF_HM: CEF problem on slot 0 recovered. Incremented CEF subsystem health by 5000
Related Commands
debug memory health-monitor
To turn on debugging for the memory health monitor subsystem, use the debug memory health-monitor command in privileged EXEC mode. Use the no form of this command to turn off debugging.
debug memory health-monitor
Syntax Description
This command has no arguments or keywords.
Defaults
No default behavior or values.
Command Modes
Privileged EXEC
Command History
Usage Guidelines
Use the debug memory health-monitor command to turn on debugging for all health monitor rules associated with the memory subsystem. This command is used to diagnose faults in, or view more detailed operational information about, the health monitor rules that belong to the memory subsystem.
Router#show health-monitor rule subsystem memory
Status (S) codes:
A = active
D = deactivated
S ID Subsystem Name
A 140 memory low_processor_memory_rule
A 141 memory low_iomem1_memory_rule
A 142 memory fragmented_processor_memory_rule
A 143 memory fragmented_iomem1_memory_rule
The low memory rules periodically check the amount of free processor/IOMEM memory on the RSC and reload the RSC if the amount of free memory falls below a predefined threshold. The fragmented memory rules periodically check for excessively fragmented memory and reload the RSC if the memory fragmentation exceeds a predefined threshold.
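The two kinds of checks described above can be sketched as a single periodic evaluation. This is a hedged illustration: the function name is ours, and the threshold and fragmentation-ratio values used below are hypothetical placeholders, not the IOS defaults.

```python
def memory_rule_check(free_bytes, largest_block, low_threshold, frag_ratio_threshold):
    """Sketch of the low-memory and fragmented-memory checks described
    above. A small largest free block relative to total free memory
    indicates fragmentation. Thresholds are illustrative placeholders."""
    if free_bytes < low_threshold:
        return "reload RSC (low memory)"
    if free_bytes and largest_block / free_bytes < frag_ratio_threshold:
        return "reload RSC (fragmented memory)"
    return "healthy"

# Values loosely modeled on the show health-monitor variable output above.
print(memory_rule_check(683770428, 564721540, 5242880, 0.1))  # healthy
print(memory_rule_check(1048576, 524288, 5242880, 0.1))       # low memory
```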
Examples
The following example turns on debugging for the memory health monitor rules:
Router#debug memory health-monitor
Memory Health Monitor Rules debugging is on
The following example shows detailed operational information of the low memory rule where the amount of free memory is above the threshold (rule does not trigger):
Oct 4 15:45:09.232: HM_VAR: Write to var (Free_Processor_Memory) succeeded
Oct 4 15:45:09.232: HM_VAR: Write to var (Free_IOMEM1_Memory) succeeded
Oct 4 15:45:09.232: HM RULE: Received var update; evaluating rule list
Oct 4 15:45:09.236: HM_VAR: Var (Free_Processor_Memory) read succeeded
Oct 4 15:45:09.236: HM RULE: Rule [70] evaluation: FALSE
Oct 4 15:45:09.236: HM RULE: Received var update; evaluating rule list
Oct 4 15:45:09.236: HM_VAR: Var (Free_IOMEM1_Memory) read succeeded
Oct 4 15:45:09.236: HM RULE: Rule [71] evaluation: FALSE
The following example shows detailed operational information of the low processor memory rule where the amount of free memory is below the threshold (rule triggers):
Oct 4 15:56:06.341: HM_VAR: Write to var (Free_Processor_Memory) succeeded
Oct 4 15:56:06.341: HM_VAR: Write to var (Free_IOMEM1_Memory) succeeded
Oct 4 15:56:06.341: HM RULE: Received var update; evaluating rule list
Oct 4 15:56:06.341: HM_VAR: Var (Free_Processor_Memory) read succeeded
Oct 4 15:56:06.341: HM RULE: Rule [70] evaluation: TRUE
Oct 4 15:56:06.341: HM ACTION: invoke
Oct 4 15:56:06.341: HM ACTION: Action: Decrement health by 100%: 2 args
Oct 4 15:56:06.341: HM ACTION: Arg 1: Type Subsys Handle, Value PTR 0x63D97E9C
Oct 4 15:56:06.341: HM ACTION: Arg 2: Type Health, Value 100%
Oct 4 15:56:06.341: HM SUBSYS: decrementing health of subsystem Memory by 10000
Oct 4 15:56:06.341: HM_VAR: Var (Memory_health) read succeeded
Oct 4 15:56:06.341: HM_VAR: Write to var (Memory_health) succeeded
Oct 4 15:56:06.341: HM SUBSYS: system health decr. due to health decr. of Memory by 10000
Oct 4 15:56:06.341: HM_VAR: Var (system_health) read succeeded
Oct 4 15:56:06.341: HM_VAR: Write to var (system_health) succeeded
Oct 4 15:56:06.341: HM_VAR: Var (Memory_health) read succeeded
Oct 4 15:56:06.341: HM ACTION: invoke
Oct 4 15:56:06.341: HM ACTION: Action: Reload this RSC: 1 args
Oct 4 15:56:06.341: HM ACTION: Arg 1: Type Number, Value 1
Oct 4 15:56:06.341: HM SUBSYS: hm_subsys_db_search_common: name Memory:
Oct 4 15:56:06.341: hm_subsys_db_compare_name: name Memory, elem 0x63D97FB0, s 0x63D97E9C, subsys_name Memory (len 6) found subsys in Subsystem Database at 0x63D97FB0
Oct 4 15:56:06.341: %MEMORY_HM-3-RSC_LOW_MEMORY: Health Monitor detected low Processor_Memory on the RSC: Reload this RSC
Related Commands
debug diagnostic-monitor
To turn on diagnostic debugging, use the debug diagnostic-monitor command in privileged EXEC mode. Use the no form of this command to turn off debugging.
debug diagnostic-monitor [ errors | events | test { all | test-name} ]
Syntax Description
Defaults
No default behavior or values.
Command Modes
Privileged EXEC
Command History
Usage Guidelines
Use the debug diagnostic-monitor commands to turn on debugging for various aspects of the Diagnostic Monitor. Use them to diagnose why certain tests have passed or failed, and to determine why a component was marked as the root cause of a failure.
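Root-cause determination, as described above, works over a tree of components: the DM marks failing components as root-cause candidates and decides which failure supersedes the others. One simplified model of that isolation step follows; the real DM algorithm (RC candidates, superseded root causes, propagation through the tree) is internal to IOS, and the component names and dependency map below are hypothetical.

```python
def find_root_causes(failed, depends_on):
    """Simplified root-cause isolation over a dependency tree: a failing
    component counts as a root cause only if none of the components it
    depends on have also failed (those would supersede it)."""
    return {c for c in failed
            if not any(dep in failed for dep in depends_on.get(c, []))}

# Hypothetical dependency map: the fb0 DSIP ping test depends on the
# RSC GigE path, so a GigE failure supersedes the ping failure.
depends_on = {"fb0-dsip-ping": ["rsc-gige"], "rsc-gige": []}
print(find_root_causes({"fb0-dsip-ping", "rsc-gige"}, depends_on))
```

In this sketch, when both components fail only rsc-gige is reported as the root cause, mirroring how the DM trace output marks one failure as root cause and treats dependent failures as superseded.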
Examples
The example below shows DM scheduling tests and receiving the results for them.
Router#debug diagnostic-monitor events
Dec 10 17:59:19.466: DM: Component test start for "fb0-dsip-ping".
Dec 10 17:59:19.466: DM: Component test start for "fb1-dsip-ping".
Dec 10 17:59:19.466: DM: Component test start for "fb10-dsip-ping".
Dec 10 17:59:19.466: DM: Component test start for "fb11-dsip-ping".
Dec 10 17:59:19.466: DM: Component test start for "fb12-dsip-ping".
Dec 10 17:59:19.466: DM: Component test start for "fb13-dsip-ping".
Dec 10 17:59:19.466: DM: Component test start for "fb2-dsip-ping".
Dec 10 17:59:19.466: DM: Component test start for "proc-latency-priority-low".
Dec 10 17:59:19.466: DM: Fail result recvd for component fb0-dsip-ping
Dec 10 17:59:19.466: DM: Fail result recvd for component fb1-dsip-ping
Dec 10 17:59:19.466: DM: Pass result recvd for component fb10-dsip-ping
Dec 10 17:59:19.466: DM: Pass result recvd for component fb11-dsip-ping
Dec 10 17:59:19.466: DM: Pass result recvd for component fb12-dsip-ping
Dec 10 17:59:19.466: DM: Fail result recvd for component fb13-dsip-ping
The example below shows a test that has passed, failed, and been determined to be the root cause:
Router#debug diagnostic-monitor test fb0-dsip-ping
Dec 10 18:02:51.679: DM: Component "fb0-dsip-ping" test has been scheduled to run in 10000 ms. (reason: periodic)
Dec 10 18:02:01.679: DM: Component "Pass" received test result
Dec 10 18:12:39.028: DM: Component "Fail" received test result
Dec 10 18:12:39.028: DM: Health change detected for component fb0-dsip-ping
Dec 10 18:12:39.028: DM: VComponent linked to Component fb0-dsip-ping in Module Domain 22 marked as RC_Candidate
Dec 10 18:12:39.028: DM: Component "fb0-dsip-ping" test has been scheduled to run in 10000 ms. (reason: periodic)
Dec 10 18:12:39.032: DM: Checking whether RC_CANDIDATE for VComponent linked to Component fb0-dsip-ping in Module Domain 22 is a root-cause
Dec 10 18:12:39.032: DM: VComponent linked to Component fb0-dsip-ping in Module Domain 22 has been detected as root cause.
Dec 10 18:12:39.032: DM: VComponent linked to Component fb0-dsip-ping in Module Domain 22 is now marked as root cause. Propogating it through the tree.
Dec 10 18:12:39.032: %DM-6-ROOT_CAUSE_DETECTED: Component fb0-dsip-ping detected as a root cause of a failure.
Dec 10 18:12:39.032: DM: Notifying Component fb0-dsip-ping, Reason: DM_NODE_RC
Related Commands
debug slot health-monitor
To turn on debugging for the rsc_slot health monitor subsystem, use the debug slot health-monitor command in privileged EXEC mode. Use the no form of this command to turn off debugging.
debug slot health-monitor
Syntax Description
This command has no arguments or keywords.
Defaults
No default behavior or values.
Command Modes
privileged EXEC
Command History
Usage Guidelines
Use the debug slot health-monitor command to turn on debugging for all rsc_slot health monitor rules. This command is used to diagnose faults or to view detailed operational information.
The rsc_slot health-monitor rules are shown below:
Router#show health-monitor rule subsystem rsc_slot
Status (S) codes:
A = active
D = deactivated
S ID Subsystem Name
A 155 rsc_slot repeat_reboot_fb0_rule
A 156 rsc_slot boot_adjust_fb0_health_rule
A 157 rsc_slot repeat_reboot_fb10_rule
A 158 rsc_slot boot_adjust_fb10_health_rule
Examples
The following example turns on debugging for the rsc_slot health monitor rules:
Router#debug slot health-monitor
RSC Slot Health Monitor Subsystem debugging is on
The following example shows detailed operational information of the repeat_reboot_fb10_rule.
When the FB reloads but the rule does not trigger (the FB has not reloaded often enough):
Jun 5 11:00:25.438: SLOT_HM: Sent POWERED_ON (slot 10) event to Health Monitor
Jun 5 11:00:25.438: SLOT_HM: slot 10 health had not been decremented for this slot, so no adjustment needs to be made
When the FB reloads and the rule does trigger:
Jun 5 11:15:50.129: SLOT_HM: Sent POWERED_ON (slot 10) event to Health Monitor
Jun 5 11:15:50.129: %SLOT_HM-4-FB_POWERDOWN: Powering down feature board 10, as requested by the Health Monitor
Jun 5 11:15:50.129: SLOT_HM: Problem on slot 10 detected. Decremented rsc slot subsystem health by 5000
When the FB recovers (if a user-forced FB reload is performed, for example):
Jun 5 11:26:25.505: SLOT_HM: Sent POWERED_ON (slot 10) event to Health Monitor
Jun 5 11:26:25.505: SLOT_HM: slot 10. Incremented rsc slot subsystem health by 5000
Related Commands
rule
To disable or enable HM rules, use the rule command in health monitor configuration mode.
rule {all | subsystem subsystem-name [rule-name rule-name] {disable | enable}}
Syntax Description
all
Designates all the health monitor rules.
subsystem-name
Name of the HM subsystem.
rule-name
Name or ID of the HM rule.
disable
Disables the specified HM rules.
enable
Enables the specified HM rules.
Defaults
Rules are enabled by default.
Command Modes
Health-monitor configuration mode.
Command History
Usage Guidelines
Use this command to customize which HM rules are active by enabling or disabling specific rules or entire subsystems.
Examples
The following example disables the rule low_processor_memory_rule for the memory subsystem:
Router#configure terminal
Router(config)#health-monitor
Router(config-hm)#rule subsystem memory rule-name low_processor_memory_rule disable
The following example disables all rules for the memory subsystem:
Router#configure terminal
Router(config)#health-monitor
Router(config-hm)#rule subsystem memory disable
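Disabled rules can be turned back on in the same way. The following sketch (following the command syntax above, not taken from device output) re-enables all rules for the memory subsystem:

```
Router#configure terminal
Router(config)#health-monitor
Router(config-hm)#rule subsystem memory enable
```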
Related Commands
Command Description
show health-monitor rule
Displays a specific rule.
show health-monitor rule subsystem
Displays HM rules for a subsystem.
notify subsystem
To enable or configure health monitor SNMP notifications, use the notify subsystem command in health monitor configuration mode.
notify subsystem subsystem-name {enable | high-threshold threshold-value | low-threshold threshold-value}
Syntax Description
Defaults
Notifications are disabled by default.
Command Modes
Health-monitor configuration mode.
Command History
Usage Guidelines
Use this command to make HM status information available to external network management applications.
Examples
The following example enables notifications for the memory subsystem.
Router#configure terminal
Router(config)#health-monitor
Router(config-hm)#notify subsystem memory enable
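The high-threshold and low-threshold keywords follow the same pattern. The following sketch (the threshold values are illustrative assumptions, not documented defaults) configures notification thresholds for the memory subsystem:

```
Router#configure terminal
Router(config)#health-monitor
Router(config-hm)#notify subsystem memory high-threshold 90
Router(config-hm)#notify subsystem memory low-threshold 10
```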
Related Commands
test
To change test parameters, use the test command in diagnostic-monitor configuration mode. To de-activate a test, use the no form of this command.
test {all | test-name [timeout timeout-value] [frequency {active | standby} frequency-value [{active | standby} frequency-value]]}
Syntax Description
Defaults
Each test has its own default values.
Command Modes
Diagnostic-monitor configuration mode.
Command History
Usage Guidelines
Alter the testing frequency to suit your system. For example, you might run a test on a particular component more often while that component is having a problem.
Note If you deactivate a test, the test is treated as if it had passed before deactivation. If the tested component was a root cause and you confirm the deactivation, the component is no longer considered a root cause, and its health effect on the subsystem it belongs to, and hence on the overall system health, is removed.
Examples
This example sets the timeout and frequency for the epif0-id test:
Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#test epif0-id timeout 324 frequency active 3000 standby 5000
Related Commands
Command Description
show diagnostic-monitor test
Displays the results and default values for diagnostic monitor tests.
default
To set test parameters to their defaults, use the default command in diagnostic-monitor configuration mode.
default {all | test {all timers | test-name timers}}
Syntax Description
all
Specifies all DM tests.
test
Specifies an individual DM test.
all timers
Frequency and timeout timers for all DM tests.
test-name
Specifies a specific DM test.
timers
Frequency and timeout timers.
Defaults
Each test has its own default values.
Command Modes
Diagnostic-monitor configuration mode.
Command History
Usage Guidelines
Use this command to reset the DM tests to their default parameters.
Examples
This example sets the timeout and frequency for the epif0-id test to the default values:
Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#default test epif0-id timers
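To reset every DM test at once rather than a single named test, the all keyword can be used. This sketch follows the command syntax above:

```
Router#configure terminal
Router(config)#diagnostic-monitor
Router(diag-mon)#default all
```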
Related Commands
Command Description
show diagnostic-monitor test
Displays the results and default values for diagnostic monitor tests.
Glossary