Configuring GOLD Health Monitoring for the Cisco ASR 903 Router

Generic Online Diagnostic (GOLD) is a health monitoring feature implemented on the Cisco ASR 903 Router. GOLD provides online diagnostic capabilities that run at bootup, periodically in the background, or on demand from the CLI.


Note

This feature is not applicable to the Cisco ASR 900 RSP3 Module.

Restrictions for the GOLD Feature

  • GOLD test cases are designed at a per-chip or per-interface level and are not expected to monitor at a per-modem or per-service-flow level.

  • The Cisco ASR 903 Router currently supports the Error Counter Monitoring Test. Other GOLD tests are not supported.

Information About GOLD

The following sections provide details of the GOLD feature.

Limitations of Existing Logging Mechanism

To provide high availability for a router, it is imperative to analyze the stability of the system. The primary method of discovering the cause of a system failure is system messages. However, certain system failures do not send notifications. The cause of these failures is difficult to determine because the existing logging mechanism neither sends a notification nor maintains a log of them.

Understanding the Importance of GOLD Functionality

Because certain system failures do not send a notification or keep a log of the failure, it is essential to address these limitations. The GOLD feature is designed specifically to provide error detection by polling for errors in system modules that have no notification mechanism. GOLD is implemented on the Cisco ASR 903 Router to actively poll for system errors. Online diagnostics is one of the requirements for high availability (HA). HA is a set of quality standards that seeks to limit the impact of equipment failures on the network. A key part of HA is detecting system failures and taking corrective action while the system is running in a live network.

Understanding the GOLD Feature

The GOLD feature is primarily used to poll for system errors in components that do not send a notification upon failure. Although the infrastructure can be used to poll for both hardware and system errors, its main scope is polling the status and error registers on physical hardware devices. The Cisco ASR 903 Router uses a distributed GOLD implementation. In this model, the core Cisco IOS GOLD subsystem is linked on both the Route Switch Processor (RSP) and the interface modules.

Diagnostic tests can be registered either as local tests, which run on the RSP, or as proxy tests, which run on the interface modules (line cards). When a proxy test is requested on the RSP, a command is sent using Inter-Process Communication (IPC) to the line card to instruct it to run the test locally. The results are then returned to the RSP using IPC. Tests are specified by card type on a per slot/subslot basis. Diagnostic tests can run at bootup, periodically (triggered by a timer), or on demand from the CLI. The GOLD feature is managed through a range of commands that run on-demand diagnostic tests, schedule tests at particular intervals, monitor system health on a periodic basis, and display the diagnostic test results.
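The local and proxy test model described above can be pictured with a small Python sketch. This is only an illustration of the dispatch idea, not the Cisco IOS implementation: the class names (DiagTest, InterfaceModule, Rsp), the method names, and the test names are all hypothetical, and the IPC exchange is reduced to a direct method call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class DiagTest:
    """A registered diagnostic test (hypothetical structure)."""
    name: str
    run: Callable[[], bool]  # returns True when the test passes
    proxy: bool              # True: runs on the interface module; False: runs on the RSP


@dataclass
class InterfaceModule:
    """Stands in for the GOLD subsystem linked on one interface module."""
    slot: int
    handled: List[str] = field(default_factory=list)

    def handle_request(self, test: DiagTest) -> str:
        # Models the IPC request and reply: run the test locally, return the result.
        self.handled.append(test.name)
        return "pass" if test.run() else "fail"


@dataclass
class Rsp:
    """Stands in for the GOLD subsystem linked on the RSP."""
    modules: Dict[int, InterfaceModule] = field(default_factory=dict)
    tests: Dict[Tuple[int, int], DiagTest] = field(default_factory=dict)

    def register(self, slot: int, test_id: int, test: DiagTest) -> None:
        # Tests are registered per slot, keyed by (slot, test ID).
        self.tests[(slot, test_id)] = test

    def start(self, slot: int, test_id: int) -> str:
        test = self.tests[(slot, test_id)]
        if test.proxy:
            # Proxy test: forward to the module in that slot and relay its reply.
            return self.modules[slot].handle_request(test)
        # Local test: run directly on the RSP.
        return "pass" if test.run() else "fail"


rsp = Rsp(modules={1: InterfaceModule(slot=1)})
rsp.register(0, 1, DiagTest("local-check", lambda: True, proxy=False))
rsp.register(1, 2, DiagTest("module-check", lambda: False, proxy=True))
print(rsp.start(0, 1))  # result of the local test
print(rsp.start(1, 2))  # result of the proxy test, relayed from slot 1
```

In this sketch only the proxy test reaches the interface module; the local test never leaves the RSP object.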

Configuring Online Diagnostics

The following sections describe how to configure various types of diagnostics and view test reports.

Configuring the Bootup Diagnostics Level

You can configure the bootup diagnostics level as minimal or complete or you can bypass the bootup diagnostics entirely. Enter the complete keyword to run all bootup diagnostic tests and the minimal keyword to run minimal tests such as loopback. Enter the no form of the command to bypass all diagnostic tests. The default bootup diagnostics level is minimal.


Note

None of the currently implemented tests on the Cisco ASR 903 Router are bootup tests.

SUMMARY STEPS

  1. enable
  2. configure terminal
  3. diagnostic bootup level {minimal | complete}

DETAILED STEPS

  Command or Action Purpose
Step 1

enable

Example:


Router> enable

Enables privileged EXEC mode. Enter your password if prompted.

Step 2

configure terminal

Example:


Router# configure terminal

Enters global configuration mode.

Step 3

diagnostic bootup level {minimal | complete}

Example:


Router(config)# diagnostic bootup level complete 

Configures the bootup diagnostic level.

Configuring On-Demand Diagnostics

You can run the on-demand diagnostic tests from the CLI. You can set the execution action to either stop or continue the test when a failure is detected or to stop the test after a specific number of failures occur by using the failure count setting. You can configure a test to run multiple times using the iteration setting.
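The interaction between the iteration, action-on-error, and error count settings can be sketched in Python. This is a hedged model only: the function name run_ondemand is hypothetical, and it assumes that an error count of 0 means no limit and that the error count applies when the action is continue. The router's actual behavior may differ in detail.

```python
from typing import Callable, Tuple


def run_ondemand(test: Callable[[], bool], iterations: int = 1,
                 action_on_error: str = "continue",
                 error_count: int = 0) -> Tuple[int, int]:
    """Run `test` up to `iterations` times and return (runs, failures)."""
    if action_on_error not in ("continue", "stop"):
        raise ValueError("action_on_error must be 'continue' or 'stop'")
    runs = failures = 0
    for _ in range(iterations):
        runs += 1
        if test():
            continue
        failures += 1
        if action_on_error == "stop":
            break  # stop at the first failure
        if error_count and failures >= error_count:
            break  # assumed meaning: stop once the failure limit is reached
    return runs, failures


always_fail = lambda: False
print(run_ondemand(always_fail, iterations=3))
print(run_ondemand(always_fail, iterations=50, action_on_error="stop"))
print(run_ondemand(always_fail, iterations=50, error_count=10))
```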

SUMMARY STEPS

  1. enable
  2. diagnostic ondemand {iteration iteration_count} | {action-on-error {continue | stop} [error_count]}
  3. diagnostic start {slot slot-no} test {test-id | test-id-range | all | complete | minimal | non-disruptive}
  4. diagnostic stop {slot slot-no}

DETAILED STEPS

  Command or Action Purpose
Step 1

enable

Example:


Router> enable

Enables privileged EXEC mode. Enter your password if prompted.

Step 2

diagnostic ondemand {iteration iteration_count} | {action-on-error {continue | stop} [error_count]}

Example:


Router# diagnostic ondemand iteration 3 

Configures on-demand diagnostic tests to run, how many times to run (iterations), and what action to take when errors are found.

Step 3

diagnostic start {slot slot-no} test {test-id | test-id-range | all | complete | minimal | non-disruptive}

Example:


Router# diagnostic start slot 1 test 5

Starts the on-demand diagnostic test on the specified slot.

Step 4

diagnostic stop {slot slot-no}

Example:


Router# diagnostic stop slot 1

Stops the diagnostic test running on the specified slot.

Scheduling Diagnostics

You can schedule online diagnostics to run at a designated time of day or on a daily, weekly, or monthly basis. You can schedule tests to run only once or to repeat at an interval. Use the no form of this command to remove the scheduling.
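As an illustration of the three schedule forms (daily, on a specific date, weekly), the following Python sketch assembles diagnostic schedule command lines for a slot. The helper names are hypothetical and the time check is an assumption about the hh:mm format; the sketch only builds strings and does not talk to a router.

```python
import re

_HH_MM = re.compile(r"([01]\d|2[0-3]):[0-5]\d")


def _check_time(hh_mm: str) -> str:
    # Assumed format: two-digit hours (00-23) and two-digit minutes.
    if not _HH_MM.fullmatch(hh_mm):
        raise ValueError(f"expected hh:mm, got {hh_mm!r}")
    return hh_mm


def schedule_daily(slot, test, hh_mm):
    return f"diagnostic schedule slot {slot} test {test} daily {_check_time(hh_mm)}"


def schedule_on(slot, test, month, day, year, hh_mm):
    return (f"diagnostic schedule slot {slot} test {test} "
            f"on {month} {day} {year} {_check_time(hh_mm)}")


def schedule_weekly(slot, test, day_of_week, hh_mm):
    return (f"diagnostic schedule slot {slot} test {test} "
            f"weekly {day_of_week} {_check_time(hh_mm)}")


print(schedule_on(1, 1, "september", 2, 2009, "12:00"))
print(schedule_daily(1, "complete", "08:00"))
print(schedule_weekly(1, "all", "monday", "23:30"))
```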

To schedule online diagnostics, follow these steps:

SUMMARY STEPS

  1. enable
  2. configure terminal
  3. diagnostic schedule {slot slot-no} test {test-id | test-id-range | all | complete | minimal | non-disruptive} {daily hh:mm | on mm dd year hh:mm | weekly day-of-week hh:mm}

DETAILED STEPS

  Command or Action Purpose
Step 1

enable

Example:


Router> enable

Enables privileged EXEC mode. Enter your password if prompted.

Step 2

configure terminal

Example:


Router# configure terminal

Enters global configuration mode.

Step 3

diagnostic schedule {slot slot-no} test {test-id | test-id-range | all | complete | minimal | non-disruptive} {daily hh:mm | on mm dd year hh:mm | weekly day-of-week hh:mm}

and

diagnostic schedule {subslot slot/sub-slot} test {test-id | test-id-range | all | complete | minimal | non-disruptive | per-port {daily hh:mm | on mm dd year hh:mm | weekly day-of-week hh:mm | port {{num | port# range | all} {daily hh:mm | on mm dd year hh:mm | weekly day-of-week hh:mm}}}}

Example:


This example shows how to schedule the diagnostic testing on a specific date and time for a specific slot:

Router(config)# diagnostic schedule slot 1 test 1 on september 2 2009 12:00 

This example shows how to schedule the diagnostic testing to occur daily at a certain time for a specific slot:

Router(config)# diagnostic schedule slot 1 test complete daily 08:00 

Schedules diagnostic tests for the specified slot or subslot to run on a specific date and time, or on a daily or weekly basis.

Configuring Health-Monitoring Diagnostics

You can configure health-monitoring diagnostic testing while the system is connected to a live network. You can configure the execution interval for each health monitoring test, whether or not to generate a system message upon test failure, or to enable or disable an individual test. Use the no form of this command to disable testing.


Note

Before enabling the diagnostic monitor test, you first need to set the interval to run the diagnostic test. An error message is displayed if the interval is not configured before enabling the monitoring.
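The ordering requirement in this note can be made explicit with a short Python sketch that emits the health-monitoring commands in a safe order, with the interval always first. The function name monitor_commands is hypothetical, the 0 to 20 day limit is taken from the command description later in this document, and the sketch only produces command strings.

```python
from typing import List, Optional


def monitor_commands(slot, test, interval: str, milliseconds: int, days: int,
                     syslog: bool = False,
                     failure_count: Optional[int] = None) -> List[str]:
    """Return health-monitoring commands with the interval configured first."""
    if not 0 <= days <= 20:
        raise ValueError("days must be between 0 and 20")
    commands = [
        # The interval must exist before monitoring is enabled.
        f"diagnostic monitor interval slot {slot} test {test} "
        f"{interval} {milliseconds} {days}",
        f"diagnostic monitor slot {slot} test {test}",
    ]
    if syslog:
        commands.append("diagnostic monitor syslog")
    if failure_count is not None:
        commands.append(
            f"diagnostic monitor threshold slot {slot} test {test} "
            f"failure count {failure_count}")
    return commands


for line in monitor_commands(1, 2, "06:00:00", 100, 10, syslog=True, failure_count=10):
    print(line)
```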

SUMMARY STEPS

  1. enable
  2. configure terminal
  3. diagnostic monitor interval {slot slot-no} test {test-id | test-id-range | all} {hh:mm:ss} {milliseconds} {number-of-days}
  4. diagnostic monitor {slot slot-no} test {test-id | test-id-range | all}
  5. diagnostic monitor syslog
  6. diagnostic monitor threshold {slot slot-no} test {test-id | test-id-range | all} {failure count no-of-allowed-failures}

DETAILED STEPS

  Command or Action Purpose
Step 1

enable

Example:


Router> enable

Enables privileged EXEC mode. Enter your password if prompted.

Step 2

configure terminal

Example:


Router# configure terminal

Enters global configuration mode.

Step 3

diagnostic monitor interval {slot slot-no} test {test-id | test-id-range | all} {hh:mm:ss} {milliseconds} {number-of-days}

Example:


The following example shows how to configure the periodic interval for running diagnostic tests on the Cisco ASR 903 Router before enabling monitoring:

Router(config)# diagnostic monitor interval slot 1/0 test 2 06:00:00 100 10 

Configures the health-monitoring interval of the specified tests. The no form of this command will change the interval to the default interval, or zero.

Step 4

diagnostic monitor {slot slot-no} test {test-id | test-id-range | all}

Example:


The following example enables health monitoring for test 2 on slot 1. If the test interval has not been configured first, the router displays an error message:

Router(config)# diagnostic monitor slot 1 test 2 

Enables or disables health-monitoring diagnostic tests.

Step 5

diagnostic monitor syslog

Example:


Router(config)# diagnostic monitor syslog 

Enables the generation of system logging messages when a health-monitoring test fails.

Step 6

diagnostic monitor threshold {slot slot-no} test {test-id | test-id-range | all} {failure count no-of-allowed-failures}

Example:


Router(config)# diagnostic monitor threshold slot 1 test 2 failure count 10 

Configures the failure threshold value for the slot.

Displaying Online Diagnostic Tests and Test Results

You can display the online diagnostic tests that are configured and check the results of the tests using the show commands.

SUMMARY STEPS

  1. enable
  2. show diagnostic content [all | slot slot-no]
  3. show diagnostic result [[slot slot-no] {detail | test {test-id | test-id-range | all}} | all]
  4. show diagnostic schedule [all | slot slot-no]
  5. show diagnostic events [slot slot-no | event-type {error | info | warning}]

DETAILED STEPS

  Command or Action Purpose
Step 1

enable

Example:


Router> enable

Enables privileged EXEC mode. Enter your password if prompted.

Step 2

show diagnostic content [all | slot slot-no]

Example:


Router# show diagnostic content slot 1 

Displays the online diagnostics tests and test attributes that are configured.

Step 3

show diagnostic result [[slot slot-no] {detail | test {test-id | test-id-range | all}} | all]

Example:


Router# show diagnostic result all 

Displays the diagnostic test results (pass, fail, or untested) for a slot.

Step 4

show diagnostic schedule [all | slot slot-no]

Example:


Router# show diagnostic schedule slot 1 

Displays the current scheduled diagnostic tasks.

Step 5

show diagnostic events [slot slot-no | event-type {error | info | warning}]

Example:


Router# show diagnostic events slot 1 

Displays the diagnostic event log details for the specified slot.

Supported GOLD Tests on the Cisco ASR 903 Router

This section discusses the GOLD test cases that have been implemented on the Cisco ASR 903 Router. The Cisco ASR 903 Router supports the following categories of GOLD tests:

  • Boot-up test

  • On-demand test

  • Health monitoring test

The following tests are currently supported:

  • Error Counter Monitoring Test—The error counter monitoring test is defined as a health monitoring test. It detects errors on ASICs attached to the active RSP. If errors exceed a certain threshold, the router displays a syslog message with details that include the ASIC, register identifier, ASIC ID, ASIC instance, and counter values. The interval for polling for errors is fixed at 5 seconds.
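The polling and threshold idea behind the error counter monitoring test can be sketched in Python. This is a simplified model rather than the router's logic: it assumes the threshold is compared against the increase in a counter between two polls, the function and field names are hypothetical, and the alert text is not the real syslog format.

```python
from typing import Dict, List, Tuple

POLL_INTERVAL_SECONDS = 5  # fixed polling interval stated in the text

# (ASIC name, ASIC ID, ASIC instance, register identifier)
CounterKey = Tuple[str, int, int, str]


def find_alerts(previous: Dict[CounterKey, int],
                current: Dict[CounterKey, int],
                threshold: int) -> List[str]:
    """Return one alert per counter whose increase since the last poll exceeds the threshold."""
    alerts = []
    for key, value in sorted(current.items()):
        before = previous.get(key, 0)
        if value - before > threshold:
            asic, asic_id, instance, register = key
            alerts.append(
                f"asic={asic} id={asic_id} instance={instance} "
                f"register={register} previous={before} current={value}")
    return alerts


key = ("asic-a", 0, 0, "reg-1")  # hypothetical identifiers
for alert in find_alerts({key: 10}, {key: 70}, threshold=50):
    print(alert)
```

A real poller would call such a check once per polling interval and keep the latest counter snapshot for the next comparison.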

For an example of an error counter monitoring test configuration, see Configuration Examples for GOLD Feature.

How to Manage Diagnostic Tests

This section describes how to manage the diagnostic tests. The following GOLD commands are used to manage the on-demand and periodic diagnostic tests:

SUMMARY STEPS

  1. diagnostic ondemand
  2. show diagnostic ondemand settings
  3. diagnostic start {slot slot-no} test {test-id | test-id-range | all | complete | minimal | non-disruptive}
  4. show diagnostic content
  5. show diagnostic result
  6. show diagnostic events
  7. diagnostic stop {slot slot-no} test {test-id | test-id-range | all | complete | minimal | non-disruptive}
  8. configure terminal
  9. diagnostic bootup level {minimal | complete}
  10. show diagnostic bootup level
  11. diagnostic event-log size size
  12. diagnostic monitor interval {slot slot-no} test {test-id | test-id-range | all} hh:mm:ss milliseconds days
  13. diagnostic schedule module {module-number | slot/subslot} test {test-id | all | complete | minimal | non-disruptive | per-port}
  14. show diagnostic schedule

DETAILED STEPS

  Command or Action Purpose
Step 1

diagnostic ondemand

Example:


Router# diagnostic ondemand iteration 50

Configures the on-demand diagnostic parameters, such as iteration-count and action-on-error. These parameters set the number of times the test runs and the action taken when a failure is detected, and they apply when the diagnostic start command is executed. In this example, the on-demand diagnostic test is configured to run 50 times.

Note 
By default, iteration-count is 1, action-on-error is continue, and error count is 0.
Step 2

show diagnostic ondemand settings

Example:


Router# show diagnostic ondemand settings

Displays the ondemand diagnostic settings configured using the command diagnostic ondemand .

Step 3

diagnostic start {slot slot-no} test {test-id | test-id-range | all | complete | minimal | non-disruptive}

Example:


Router# diagnostic start slot 1 test all

Starts an ondemand diagnostic test.

  • slot slot-no —Indicates the slot number of the full-height line card where the diagnostic test is executed. The slot keyword refers to a full-height line card on the router. The valid range for slot is from 1 to 8.
  • test — Specifies a test to run.
  • test-id —Identification number for the test to run.
  • test-id-range —Range of identification numbers for tests to run.
  • minimal —Runs minimal bootup diagnostic tests.
  • complete —Runs complete bootup diagnostic tests.
  • non-disruptive —Runs the non-disruptive health-monitoring tests.
  • all —Runs all diagnostic tests.
Step 4

show diagnostic content

Example:


Router# show diagnostic content

Displays the registered tests, attributes, and the configured interval at which the test runs.

Step 5

show diagnostic result

Example:


Router# show diagnostic result

Displays the diagnostic test results for an interface module.

Step 6

show diagnostic events

Example:


Router# show diagnostic events

Displays the diagnostic event log details for all interface modules installed on the Cisco ASR 903 Router.

Step 7

diagnostic stop {slot slot-no} test {test-id | test-id-range | all | complete | minimal | non-disruptive}

Example:


Router# diagnostic stop slot 1 all

Stops the ondemand diagnostic test.

Step 8

configure terminal

Example:


Router# configure terminal

Enters global configuration mode.

Step 9

diagnostic bootup level {minimal | complete}

Example:


Router(config)# diagnostic bootup level complete

Configures the bootup diagnostic level.

  • minimal —Specifies minimal diagnostics.
  • complete —Specifies complete diagnostics.
Step 10

show diagnostic bootup level

Example:


Router# show diagnostic bootup level

Displays the configured bootup diagnostic level.

Step 11

diagnostic event-log size size

Example:


Router(config)# diagnostic event-log size 10000

Modifies the diagnostic event log size dynamically.

  • size —Size of the diagnostic event log, in entries. The valid range is from 1 to 10000.
Step 12

diagnostic monitor interval {slot slot-no} test {test-id | test-id-range | all} hh:mm:ss milliseconds days

Example:


Router(config)# diagnostic monitor interval slot 1 test 2 06:00:00 100 20

Configures the health monitoring diagnostic test interval to rerun the tests.

  • hh:mm:ss —Hours, minutes, and seconds interval configured to run the test again.
  • milliseconds —Number of milliseconds between tests.
  • days —Number of days between tests. The valid range is from 0 to 20.
Step 13

diagnostic schedule module {module-number | slot/subslot} test {test-id | all | complete | minimal | non-disruptive | per-port }

Example:


Router(config)# diagnostic schedule slot 1 test complete daily 08:00

Schedules the online diagnostic test to run at a designated time, or on a daily, weekly, or monthly basis.

  • module-number —Specifies the module number.
  • per-port —Selects the per-port test suite.
Step 14

show diagnostic schedule

Example:


Router# show diagnostic schedule

Displays the current scheduled diagnostic tests.

Configuration Examples for GOLD Feature

The following example shows a sample output of the test configuration, test attributes, and the supported coverage test levels for each test and for each slot:


Router# show diagnostic description slot R0 test ?
  Diagnostics test suite attributes:
    M/C/* - Minimal bootup level test / Complete bootup level test / NA
      B/* - Basic ondemand test / NA
    P/V/* - Per port test / Per device test / NA
    D/N/* - Disruptive test / Non-disruptive test / NA
      S/* - Only applicable to standby unit / NA
      X/* - Not a health monitoring test / NA
      F/* - Fixed monitoring interval test / NA
      E/* - Always enabled monitoring test / NA
      A/I - Monitoring is active / Monitoring is inactive
                                                          Test Interval   Thre-
  ID   Test Name                          Attributes      day hh:mm:ss.ms shold
  ==== ================================== ============    =============== =====
    1) TestErrorCounterMonitor ---------> ***N**F*A       000 00:00:05.00 50
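
The attribute string and test interval in the output above can be read with the help of the legend. The following Python sketch decodes both fields for the TestErrorCounterMonitor row. The function names are hypothetical, the column order is taken directly from the legend, and the interval parser assumes the digits after the dot are milliseconds, as the column header suggests.

```python
# One entry per attribute column, in the order the legend lists them.
LEGEND = [
    {"M": "Minimal bootup level test", "C": "Complete bootup level test"},
    {"B": "Basic ondemand test"},
    {"P": "Per port test", "V": "Per device test"},
    {"D": "Disruptive test", "N": "Non-disruptive test"},
    {"S": "Only applicable to standby unit"},
    {"X": "Not a health monitoring test"},
    {"F": "Fixed monitoring interval test"},
    {"E": "Always enabled monitoring test"},
    {"A": "Monitoring is active", "I": "Monitoring is inactive"},
]


def decode_attributes(attrs: str) -> list:
    """Translate an attribute string such as '***N**F*A' into legend text."""
    if len(attrs) != len(LEGEND):
        raise ValueError(f"expected {len(LEGEND)} attribute characters")
    # '*' means the column does not apply (NA), so it is skipped.
    return [column[ch] for ch, column in zip(attrs, LEGEND) if ch != "*"]


def interval_seconds(field: str) -> float:
    """Convert a 'day hh:mm:ss.ms' interval field into seconds."""
    days, clock = field.split()
    hours, minutes, rest = clock.split(":")
    seconds, millis = rest.split(".")
    return (int(days) * 86400 + int(hours) * 3600 + int(minutes) * 60
            + int(seconds) + int(millis) / 1000)


print(decode_attributes("***N**F*A"))
print(interval_seconds("000 00:00:05.00"))
```

Decoded this way, the row describes a non-disruptive test with a fixed monitoring interval of 5 seconds whose monitoring is active, which agrees with the description of the Error Counter Monitoring Test.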