Configuring Online Diagnostics

Beginning with Cisco MDS NX-OS Release 6.2, the Cisco MDS 9700 Series supports the GOLD (Generic Online Diagnostics) feature. GOLD is a diagnostic service which is also supported on the Cisco Nexus 7000 and 7700 series switches. This chapter describes how to configure the GOLD feature on a Cisco MDS 9700 Series switch.

Information About Online Diagnostics

Online diagnostics verifies the hardware and data paths and identifies faulty devices.

Online Diagnostic Overview

The GOLD (Generic Online Diagnostics) framework tests and verifies the hardware devices and data path in a live system.

The GOLD tests can be executed in three modes:

  • Bootup

  • Health-monitoring (also called Runtime)

  • On-demand

The following explains the diagnostics test suite attributes:

  • B/C/* - Bypass bootup level test / Complete bootup level test / NA
  • P/* - Per port test / NA
  • M/S/* - Only applicable to active / standby unit / NA
  • D/N/* - Disruptive test / Non-disruptive test / NA
  • H/O/* - Always enabled monitoring test / Conditionally enabled test / NA
  • F/* - Fixed monitoring interval test / NA
  • X/* - Not a health monitoring test / NA
  • E/* - Sup to line card test / NA
  • L/* - Exclusively run this test / NA
  • T/* - Not an ondemand test / NA
  • A/I/* - Monitoring is active / Monitoring is inactive / NA
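
The attribute strings that appear in the tables of this chapter (for example, C**D**X**T*) can be read position by position against the list above. The following Python sketch is illustrative only. It assumes that each of the 11 character positions maps, in order, to one line of the list, and that I in the last position means that monitoring is inactive. The names LEGEND and decode_attributes are hypothetical and are not part of NX-OS.

```python
# Legend transcribed from the attribute list above, one entry per position.
# Assumption: position N of an attribute string corresponds to line N of the list.
LEGEND = [
    {"B": "Bypass bootup level test", "C": "Complete bootup level test"},
    {"P": "Per port test"},
    {"M": "Only applicable to active unit", "S": "Only applicable to standby unit"},
    {"D": "Disruptive test", "N": "Non-disruptive test"},
    {"H": "Always enabled monitoring test", "O": "Conditionally enabled test"},
    {"F": "Fixed monitoring interval test"},
    {"X": "Not a health monitoring test"},
    {"E": "Sup to line card test"},
    {"L": "Exclusively run this test"},
    {"T": "Not an ondemand test"},
    # Assumption: "I" stands for inactive monitoring.
    {"A": "Monitoring is active", "I": "Monitoring is inactive"},
]


def decode_attributes(attrs):
    """Return the meanings of every flag in attrs that is not '*' (NA)."""
    if len(attrs) != len(LEGEND):
        raise ValueError(f"expected {len(LEGEND)} characters, got {len(attrs)}")
    meanings = []
    for position, (flag, table) in enumerate(zip(attrs, LEGEND), start=1):
        if flag == "*":
            continue  # '*' means the attribute does not apply
        if flag not in table:
            raise ValueError(f"unexpected flag {flag!r} at position {position}")
        meanings.append(table[flag])
    return meanings


for meaning in decode_attributes("C**D**X**T*"):
    print(meaning)
```

For the string C**D**X**T*, the sketch prints the meanings for C, D, X, and T: a complete bootup level test that is disruptive, is not a health monitoring test, and is not an on-demand test.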

Bootup Diagnostics

Bootup diagnostics run during bootup and detect faulty hardware before a Cisco MDS 9700 Series switch brings a module online. For example, if there is a faulty module in the device, the appropriate bootup diagnostic test fails, which indicates the fault.


Note


The bootup diagnostics tests are triggered during bootup.

Table 1 describes the bootup diagnostic tests for a module and a supervisor.

Table 1. Bootup Diagnostics

| Component | Diagnostic | Attributes | Description |
|---|---|---|---|
| Linecard | EOBCPortLoopback | C**D**X**T* | Verifies the health of EOBC (Ethernet Out-of-Band Connectivity) interface. |
| Linecard | OBFL | C**N**X**T* | Verifies the integrity of the OBFL (Onboard Failure Logging) flash. |
| Linecard | BootupPortLoopback | CP*N**XE*T* | PortLoopback test that runs only during module bootup. Note: From Cisco MDS NX-OS Release 6.2(11), BootupPortLoopback failure for FC ports (on the Cisco MDS 48-Port 32-Gbps Fibre Channel module) puts the failed ports in a diagfailure mode. |
| Supervisor | USB | C**N**X**T* | Verifies the USB controller initialization on a module. |
| Supervisor | ManagementPortLoopback | C**D**X**T* | Verifies the health of management interface of a module. |
| Supervisor | EOBCPortLoopback | C**D**X**T* | Verifies the health of EOBC (Ethernet Out-of-Band Connectivity) interface. |
| Supervisor | OBFL | C**N**X**T* | Verifies the integrity of the OBFL (Onboard Failure Logging) flash. |

When the show module command is executed, the result of the bootup diagnostics is displayed as Online Diag Status. The result of an individual test is displayed when the show diagnostic result command is executed for the appropriate module and test ID or test name.

A Cisco MDS 9700 Family switch can be configured either to bypass the bootup diagnostics or to run the complete set of bootup diagnostics. See Setting the Bootup Diagnostic Level.

Health Monitoring Diagnostics

Health Monitoring (HM) diagnostics are enabled by default and verify the health of a live system at periodic intervals. Each test has its own monitoring interval, which the user can configure within an allowed range. See Activating a Health Monitoring Diagnostic Test for more information. The diagnostic tests detect hardware errors and data path issues.

Health Monitoring diagnostics are non-disruptive (they do not disrupt data or control traffic). The user can disable the Health Monitoring tests. See Deactivating a Health Monitoring Diagnostic Test for more information.

The following table describes the health monitoring diagnostics for a supervisor.

| Component | Diagnostic | Default Testing Interval | Attributes | Description |
|---|---|---|---|---|
| Supervisor | ASICRegisterCheck | 20 seconds | ***N******A | Verifies read or write access to scratch registers for the ASICs on the supervisor. |
| Supervisor | NVRAM | 5 minutes | ***N******A | Verifies the sanity of the NVRAM blocks on a supervisor. |
| Supervisor | RealTimeClock | 5 minutes | ***N******A | Verifies that the real-time clock on the supervisor is ticking. |
| Supervisor | PrimaryBootROM | 30 minutes | ***N******A | Verifies the integrity of the primary boot device on the supervisor. |
| Supervisor | SecondaryBootROM | 30 minutes | ***N******A | Verifies the integrity of the secondary boot device on the supervisor. |
| Supervisor | CompactFlash | 30 minutes | ***N******A | Verifies access to the compact flash devices. |
| Supervisor | ExternalCompactFlash | 30 minutes | ***N******A | Verifies access to the external compact flash devices. |
| Supervisor | PwrMgmtBus | 30 seconds | **MN******A | Verifies the standby power management control bus. |
| Supervisor | SystemMgmtBus | 30 seconds | **MN******A | Verifies the availability of the standby system management bus. |
| Supervisor | StatusBus | 30 seconds | **MN******A | Verifies the status transmitted by the status bus for the supervisor, modules, and fabric cards. |
| Supervisor | StandbyFabricLoopback | 30 seconds | **SN******A | Verifies the connectivity of the standby supervisor to the fabric modules. |

Table 2 describes the health monitoring diagnostics for the Cisco MDS 9700 48-Port 32-Gbps Fibre Channel Switching Module.

Table 2. Health Monitoring Diagnostics

| Component | Diagnostic | Default Testing Interval | Attributes | Description |
|---|---|---|---|---|
| Linecard | ASICRegisterCheck | 1 minute | ***N******A | Verifies read or write access to scratch registers for the ASICs on a module. |
| Linecard | PrimaryBootROM | 30 minutes | ***N******A | Verifies the integrity of the primary boot device on a module. |
| Linecard | SecondaryBootROM | 30 minutes | ***N******A | Verifies the integrity of the secondary boot device on a module. |
| Linecard | SnakeLoopback | 20 minutes | *P*N***E** | Verifies connectivity from the supervisor to all the ports in the linecard. It checks the integrity of the data path up to the MAC component in a progressive manner (a single run of the test covers all the ports). It runs on all the ports irrespective of their states. This test is non-disruptive. |
| Linecard | IntPortLoopback | 5 minutes | *P*N***E*** | Verifies connectivity from the supervisor to all the ports in the linecard (one port at a time). It checks the integrity of the data path up to the MAC component. This test runs in Health Monitoring (HM) mode and can also be triggered in on-demand mode. This test is non-disruptive. Note: The IntPortLoopback test is supported from Cisco MDS NX-OS Release 6.2(7). |
| Linecard | RewriteEngineLoopback | 1 minute | *P*N***E**A | Verifies the integrity of each link on the fabric module from the supervisor to the linecard. |

On-Demand Diagnostics

All the Health Monitoring tests can also be invoked on demand. On-demand diagnostics run only when invoked by the user.

On the Cisco MDS 48-Port 32-Gbps Fibre Channel module, two tests can be invoked in on-demand mode only; see Table 3.


Note


The data paths (PHY and SFP) which are not verified by other Health Monitoring tests can be verified by the PortLoopback and ExtPortLoopback tests.

You can run on-demand diagnostics whenever required. See Starting or Stopping an On-Demand Diagnostic Test for more information.

On the Cisco MDS 48-Port 32-Gbps Fibre Channel module, both the PortLoopback and ExtPortLoopback tests are available in on-demand mode only because they are disruptive.

Table 3 describes the on-demand diagnostics (for module only) on the Cisco MDS 48-Port 32-Gbps Fibre Channel module.

Table 3. On-Demand Diagnostics

| Component | Diagnostic | Attributes | Description |
|---|---|---|---|
| Linecard | PortLoopback | *P*D**XE*** | Verifies connectivity from the supervisor to all the ports in the module. It checks the integrity of the data path up to PHY. This test is available only in on-demand mode. The test runs on all the ports irrespective of the port state. Note: The PortLoopback test is equivalent to the Serdes Loopback test of OHMS. |
| Linecard | ExtPortLoopback | *P*D**XE*** | Identifies hardware errors in the entire data path up to PHY including the SFP. Note: Connect a loopback plug to loop the Tx of the port to the Rx of the port before running the test. If the loopback plug is not connected, this test fails. Note: The ExtPortLoopback test is supported from Cisco MDS NX-OS Release 6.2(11c). |


Caution


The PortLoopback and ExtPortLoopback tests are disruptive as they bring down the port for the purpose of diagnostic operation.


Recovery Actions on Specified Health Monitoring Diagnostics

When a Health Monitoring diagnostic test fails consecutively for a threshold number of times (up to 10), the system takes the default action through EEM (Embedded Event Manager): it generates alerts (callhome, syslog), logs the failure (OBFL, exception logs), and disables the diagnostic test on the failed instance (port, fabric, or device).

These actions are informative, but they do not remove faulty devices from the live system, which can lead to network disruption, traffic black holing, and so forth.
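
The threshold behavior described above can be pictured with a small counter. The Python sketch below is a hypothetical model, not the GOLD implementation: the class name is invented, and it assumes that a passing run resets the count (which is what "consecutively" suggests) and that the threshold is between 1 and 10.

```python
class ConsecutiveFailureTracker:
    """Counts back-to-back failures of one test instance (hypothetical helper)."""

    def __init__(self, threshold):
        # The text describes a threshold of up to 10 consecutive failures;
        # the lower bound of 1 is an assumption of this sketch.
        if not 1 <= threshold <= 10:
            raise ValueError("threshold must be between 1 and 10")
        self.threshold = threshold
        self.consecutive_failures = 0
        self.enabled = True

    def record(self, passed):
        """Record one test run; return True when the default action fires."""
        if not self.enabled:
            return False  # a disabled test no longer runs on this instance
        if passed:
            self.consecutive_failures = 0  # a pass breaks the failure streak
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            self.enabled = False  # the test is disabled on the failed instance
            return True
        return False


tracker = ConsecutiveFailureTracker(threshold=3)
runs = [False, False, True, False, False, False]
print([tracker.record(passed) for passed in runs])
```

In this run the action fires only on the last entry, because the single pass in the middle resets the streak.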


Note


Restart the Health Monitoring tests on failed instances by clearing the test result, deactivating the test, and then activating it again on the same module. For more information, see Clearing Diagnostic Results, Deactivating a Health Monitoring Diagnostic Test, and Activating a Health Monitoring Diagnostic Test.
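
The restart sequence in the note above can be written out as an ordered list of commands. The Python helper below is a hypothetical convenience (the function name is invented). It only assembles command lines whose syntax is documented later in this chapter, and it assumes a numeric test ID because diagnostic clear result is documented with a test ID or all.

```python
def restart_commands(slot, test_id):
    """Return the CLI lines that restart one health monitoring test."""
    return [
        # 1. Clear the stored test result (EXEC mode).
        f"diagnostic clear result module {slot} test {test_id}",
        # 2. Enter global configuration mode.
        "configure terminal",
        # 3. Deactivate the test, then 4. activate it again on the same module.
        f"no diagnostic monitor module {slot} test {test_id}",
        f"diagnostic monitor module {slot} test {test_id}",
    ]


for line in restart_commands(slot=6, test_id=3):
    print(line)
```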

Beginning with the Cisco MDS NX-OS Release 6.2(11), the system can be configured to take corrective (recovery) actions in addition to the default actions after reaching the threshold number of consecutive failures for any of the following Health Monitoring tests:

  • PortLoopback test (supported only on Cisco MDS 48-Port 10-Gbps FCoE Module)
  • RewriteEngineLoopback test
  • StandbyFabricLoopback test
  • Internal PortLoopback test

Note


The corrective (recovery) actions are disabled by default.

Corrective (Recovery) Action for Supervisor

The corrective action for the supervisor is as follows:

StandbyFabricLoopback test—The system reloads the standby supervisor and after three retries, the system powers off the standby supervisor.


Note


After reload, when the standby supervisor comes online, the Health Monitoring diagnostics start by default.

Note


One retry means a complete cycle of reloading the standby supervisor followed by the threshold number of consecutive failures of the StandbyFabricLoopback test.

Corrective (Recovery) Action for Cisco MDS 48-Port 32-Gbps Fibre Channel Module

The corrective action for each test is as follows:

  • Internal PortLoopback test—The system brings down the failed ports and puts them in a diagfailure state.

  • RewriteEngineLoopback test—The system takes different corrective action depending on the faulty component (supervisor or fabric):

    • On a chassis with a standby supervisor (which is in ha-standby state), if the system detects a fault with the active supervisor, the system triggers a switchover and switches over to the standby supervisor. If there is no standby supervisor in the chassis, the system does not take any action.


Note


As the PortLoopback test is available only in on-demand mode on the Cisco MDS 48-Port 32-Gbps Fibre Channel Module, it does not support corrective actions.

Note


From Cisco MDS NX-OS Release 6.2(13), the RewriteEngineLoopback test and corrective actions for the RewriteEngineLoopback test are supported on the Cisco MDS 48-Port 32-Gbps Fibre Channel Module.

High Availability

A key part of high availability is detecting hardware failures and taking corrective action in a live system. GOLD contributes to the high availability of the system by detecting hardware failures and providing feedback to software components to make switchover decisions.

Cisco MDS 9700 Family switches support stateless restart for GOLD by applying the running configuration after a reboot. After supervisor switchover, GOLD resumes diagnostics from the new active supervisor.

Licensing Requirements for Online Diagnostics

| Product | License Requirement |
|---|---|
| Cisco NX-OS | Online diagnostics require no license. Any feature not included in a license package is bundled with the Cisco NX-OS system images and is provided at no extra charge to you. For a complete explanation of the Cisco NX-OS licensing scheme, see the Cisco MDS 9000 Family NX-OS Licensing Guide. |

Default Settings

Table 4 lists the default settings for online diagnostic parameters.

Table 4. Default Online Diagnostic Parameters

| Parameters | Default |
|---|---|
| Bootup diagnostics level | complete |
| Health Monitoring tests | active |
| Corrective (Recovery) actions | disabled |

Configuring Online Diagnostics

Setting the Bootup Diagnostic Level

To configure the bootup diagnostics to run the complete set of tests, or to bypass all bootup diagnostic tests for a faster module bootup time, perform these tasks:


Note


It is recommended to set the bootup online diagnostics level to complete.

Procedure


Step 1

configure terminal

Example:


switch# configure terminal
Enter configuration commands, one per line.  End with CNTL/Z.
switch(config)#

Enters global configuration mode.

Step 2

diagnostic bootup level {complete | bypass }

Example:


switch(config)# diagnostic bootup level complete

Configures the bootup diagnostic level to trigger diagnostics when the device boots:

  • complete —Performs all bootup diagnostics. The default is complete.
  • bypass —Does not perform any bootup diagnostics.

Step 3

show diagnostic bootup level

Example:


switch(config)# show diagnostic bootup level

(Optional) Displays the bootup diagnostic level (bypass or complete) that is currently in place on the device.

Step 4

copy running-config startup-config

Example:


switch(config)# copy running-config startup-config

(Optional) Copies the running configuration to the startup configuration.


Displaying the List of Available Tests

Procedure


show diagnostic content module slot

Example:


switch# show diagnostic content module 1

(Optional) Displays the list of diagnostics and their attributes on a given module.

slot—The module number on which the test is activated.


Activating a Health Monitoring Diagnostic Test

Procedure


Step 1

configure terminal

Example:


switch# configure terminal
Enter configuration commands, one per line.  End with CNTL/Z.
switch(config)#

Enters global configuration mode.

Step 2

diagnostic monitor interval module slot test [test-id | name | all ] hour hour min minutes second sec

Example:


switch(config)# diagnostic monitor interval module 6 test 3 hour 1 min 0 sec 0

(Optional) Configures the interval at which the specified test is run. If no interval is set, the test runs at the interval set previously, or the default interval.

The arguments are as follows:

  • slot—The module number on which the test is activated.
  • test-id—Unique identification number for the test.
  • name—Predefined name of the test.
  • hour —The range is from 0 to 23 hours.
  • minutes—The range is from 0 to 59 minutes.
  • seconds—The range is from 0 to 59 seconds.
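
The ranges listed above can be checked before a command is sent to the switch. The Python sketch below is a hypothetical helper, not a Cisco tool: it validates the hour, minute, and second ranges from the list and assembles the command in the same form as the example in this step (with the sec keyword). It does not model the per-test allowed interval range, which this chapter does not list.

```python
def monitor_interval_command(slot, test, hour, minute, second):
    """Build a 'diagnostic monitor interval' command after range checks."""
    # Upper bounds taken from the argument list above: 0-23, 0-59, 0-59.
    limits = {"hour": (hour, 23), "min": (minute, 59), "sec": (second, 59)}
    for keyword, (value, upper) in limits.items():
        if not 0 <= value <= upper:
            raise ValueError(f"{keyword} must be between 0 and {upper}, got {value}")
    return (
        f"diagnostic monitor interval module {slot} test {test} "
        f"hour {hour} min {minute} sec {second}"
    )


print(monitor_interval_command(slot=6, test=3, hour=1, minute=0, second=0))
```

Passing hour=24 or minute=60 raises ValueError instead of producing a command that the switch would reject.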

Step 3

diagnostic monitor module slot test [test-id | name | all ]

Example:


switch(config)# diagnostic monitor module 6 test 3
switch(config)# diagnostic monitor module 6 test SecondaryBootROM

Activates the specified test.

The arguments are as follows:

  • slot—The module number on which the test is activated.
  • test-id—Unique identification number for the test.
  • name—Predefined name of the test.

Step 4

show diagnostic content module {slot | all }

Example:


switch(config)# show diagnostic content module 6

(Optional) Displays information about the diagnostics and their attributes.

The argument is as follows:

  • slot—The module number on which the test is activated.

Deactivating a Health Monitoring Diagnostic Test


Note


Inactive tests keep their current configuration but do not run at the scheduled interval.

To deactivate a test, perform this task:

Command

Purpose

no diagnostic monitor module slot test [test-id | name | all ]

Examples:


switch(config)# no diagnostic monitor module 8 test 3

switch(config)# no diagnostic monitor module 8 test SecondaryBootROM

Deactivates the specified test.

The arguments are as follows:

  • slot—The module number on which the test is activated.
  • test-id—Unique identification number for the test.
  • name—Predefined name of the test.

Starting or Stopping an On-Demand Diagnostic Test

An on-demand diagnostic test can be started or stopped. Optionally, you can set the number of iterations for which the test repeats and the action to take when the test fails.


Note


It is recommended to manually start a disruptive diagnostic test during a scheduled network maintenance time.

To start or stop an on-demand diagnostic test, perform these tasks:

Procedure


Step 1

diagnostic ondemand iteration number

Example:


switch# diagnostic ondemand iteration 5

(Optional) Configures the number of times that the on-demand test runs. The range is from 1 to 999. The default is 1.

Step 2

diagnostic ondemand action-on-failure {continue failure-count num-fails | stop }

Example:

switch# diagnostic ondemand action-on-failure stop

(Optional) Configures the action to take if the on-demand test fails.

Step 3

show diagnostic ondemand setting

Example:


switch# show diagnostic ondemand setting
Test iterations = 1
Action on test failure = continue until test failure limit reaches 1

(Optional) Displays information about on-demand diagnostics.

Step 4

diagnostic start module slot test [test-id | name | all | non-disruptive ][port port-number | all ]

Example:


switch# diagnostic start module 6 test all

Starts one or more diagnostic tests on a module.

The arguments are as follows:

  • all— All the tests are triggered.

Note

 
Multiple test IDs or names can be specified, separated by commas.
  • non-disruptive—All the non-disruptive tests are triggered.
  • port— The tests can be invoked on a single port or range of ports or all ports.

Step 5

diagnostic run module slot test {PortLoopback | RewriteEngineLoopback | SnakeLoopback | IntPortLoopback | ExtPortLoopback } {port port-id }

Example:


switch# diagnostic run module 3 test PortLoopback port 1

Starts the selected test on a module and displays the result on the completion of the test.

Note

 
This command was introduced in Cisco MDS NX-OS Release 6.2(11c).

For more information, see Starting an On-Demand Diagnostic Test in On-demand Mode.

Step 6

diagnostic stop module slot test [test-id | name | all ]

Example:


switch# diagnostic stop module 6 test all 

(Optional) Stops one or more diagnostic tests on a module.

Step 7

show diagnostic status module slot

Example:


switch# show diagnostic status module 6

 

(Optional) Displays all the tests that are running or queued on that module, with information about the testing mode.

When no tests are running or queued on the given module, the status is displayed as NA.

Step 8

show diagnostic result module slot test [test-id | name]

Example:


switch# show diagnostic result module 1 test 3
switch# show diagnostic result module 1 test SecondaryBootROM

(Optional) Displays the result of the specified test.


Starting an On-Demand Diagnostic Test in On-demand Mode

OHMS (Online Health Management System) supports invoking tests in an “on-demand mode” which displays the results immediately after running the test.

From the Cisco MDS NX-OS Release 6.2(11c), GOLD supports invoking a specific test from a set of tests in “on-demand mode” and displaying the test results immediately after running the test.

GOLD tests can be invoked in on-demand mode using the diagnostic start module command. The diagnostic run module command supports the same action, with the following key differences:

  • In contrast to the diagnostic start module command, the diagnostic run module command blocks the current CLI session until the test completes. The CLI session is then unblocked, and the result is displayed on the same console.

Note


The CLI session is blocked until the test completes or for a maximum of 15 seconds. If the test does not complete within 15 seconds, GOLD unblocks the CLI session and lets the test run in the background until completion.

Note


Only one test can be invoked on a particular module using the diagnostic run module command. If the user attempts to invoke another test on the same module, the system displays an error and the test is not invoked.
  • The diagnostic start module command runs the test in the background (the current CLI session is not blocked), so the user must issue the show diagnostic result command to view the result. With the diagnostic run module command, the test result is displayed implicitly on the same console.
  • The results displayed through the diagnostic run module command are more intuitive than those from the show diagnostic result command.

Note


The maximum number of ports recommended for the diagnostic run module command is 5.
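
The difference between a blocking call with a 15-second limit and a background run can be illustrated in a few lines of Python. This is a conceptual sketch only, not how GOLD is implemented: the function names are hypothetical, a worker thread stands in for the diagnostic test, and the single-worker pool is an assumption that mirrors the one-test-per-module rule. The timeout is shortened in the usage lines so that the example finishes quickly.

```python
import concurrent.futures
import time

# One worker stands in for "only one test per module" (assumption of this sketch).
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)


def run_test_blocking(test, timeout_s=15.0):
    """Run test() and wait up to timeout_s seconds for its result.

    Returns (result, future). result is None if the wait timed out; the test
    then keeps running in the background and future.result() yields the
    outcome later.
    """
    future = _executor.submit(test)
    try:
        return future.result(timeout=timeout_s), future
    except concurrent.futures.TimeoutError:
        return None, future


def slow_test():
    time.sleep(0.2)  # stands in for a long-running diagnostic
    return "pass"


result, future = run_test_blocking(slow_test, timeout_s=0.05)
print(result)           # None: the caller was unblocked before completion
print(future.result())  # pass: the test finished in the background
```

A caller that behaves like diagnostic start module would submit the test and return immediately, leaving the result to be fetched later, which corresponds to issuing show diagnostic result.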

Clearing Diagnostic Results

To clear the diagnostic test results, use the following command:

Command

Purpose

diagnostic clear result module [slot | all ] test {test-id | all }

Example:


switch# diagnostic clear result module 2 test all

switch# diagnostic clear result module 2 test 3

Clears the test result for the specified test.

Simulating Diagnostic Results

GOLD provides a mechanism to simulate a test failure on a port, supervisor, or fabric, so that you can test how GOLD behaves when a diagnostic test fails.


Note


Simulating a failure after enabling corrective actions triggers the corrective action (see Recovery Actions on Specified Health Monitoring Diagnostics) on the component where the failure was simulated.

To simulate a diagnostic test result, use the following command:

Command

Purpose

diagnostic test simulation module slot test test-id {fail | random-fail | success } [port number | all ]

Example:


switch# diagnostic test simulation module 2 test 2 fail

Simulates a test result.

To clear the simulated diagnostic test result, use the following command:

Command

Purpose

diagnostic test simulation module slot test test-id clear

Example:


switch# diagnostic test simulation module 2 test 2 clear

Clears the simulated test result.

Enabling Corrective (Recovery) Actions

To enable or disable corrective (recovery) actions, perform these tasks:

Procedure


Step 1

configure terminal

Enters global configuration mode.

Step 2

diagnostic eem action conservative

Example:


switch(config)# diagnostic eem action conservative

Enables corrective or recovery actions.

Note

 
This command is applicable to the system as a whole and cannot be specifically configured to any particular module or test.

Step 3

no diagnostic eem action conservative

Disables corrective (recovery) actions.


Verifying the Online Diagnostics

To display GOLD test results, status, and configuration information, use one of these commands:

Command

Purpose

show diagnostic bootup level

Displays information about bootup diagnostics.

show diagnostic content module {slot | all }

Displays information about diagnostic test content for a module.

show diagnostic description module slot test [test-name | all ]

Displays the diagnostic description.

show diagnostic events [error | info ]

Displays diagnostic events by error and information event type.

show diagnostic ondemand setting

Displays information about on-demand diagnostics.

show diagnostic result module slot [test [test-name | all ]] [detail ]

Displays information about the results of a diagnostic.

show diagnostic simulation module slot

Displays information about a simulated diagnostic.

show diagnostic status module slot

Displays the test status for all tests on a module.

show module

Displays module information including the online diagnostic test status.

show diagnostic eem action

Displays the status of the corrective (recovery) action.

Configuration Examples for Online Diagnostics

This example shows how to start all on-demand tests on a module:

diagnostic start module 6 test all

This example shows how to activate a test and set the test interval for a test on a module:

configure terminal

diagnostic monitor module 6 test 2

diagnostic monitor interval module 6 test 2 hour 3 min 30 sec 0

Additional References

For additional information related to implementing online diagnostics, see the following sections:

Related Documents

| Related Topic | Document Title |
|---|---|
| Online diagnostics CLI commands | Cisco MDS 9000 Series Command Reference |

Feature History for Online Diagnostics

Table 5 lists the release history for this feature.

Table 5. Feature History for Online Diagnostics

| Feature Name | Releases | Feature Information |
|---|---|---|
| Support for corrective (recovery) actions, IntPortLoopback, ExtPortLoopback, and RewriteEngine Loopback on Cisco MDS 48-Port 32-Gbps Fibre Channel Module | 8.1(1) | This feature was introduced. |
| Generic Online Diagnostics (GOLD) | 6.2 | This feature was introduced. |