Information About Online Diagnostics
Online diagnostics verifies the hardware and data paths and identifies faulty devices.
Online Diagnostic Overview
The GOLD (Generic Online Diagnostics) framework tests and verifies the hardware devices and data path in a live system.
The GOLD tests can be executed in three modes:
- Bootup
- Health-monitoring (also called Runtime)
- On-demand
The following explains the diagnostics test suite attributes:
- B/C/* - Bypass bootup level test / Complete bootup level test / NA
- P/* - Per port test / NA
- M/S/* - Only applicable to active / standby unit / NA
- D/N/* - Disruptive test / Non-disruptive test / NA
- H/O/* - Always enabled monitoring test / Conditionally enabled test / NA
- F/* - Fixed monitoring interval test / NA
- X/* - Not a health monitoring test / NA
- E/* - Sup to line card test / NA
- L/* - Exclusively run this test / NA
- T/* - Not an ondemand test / NA
- A/I/* - Monitoring is active / Monitoring is inactive / NA
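The attribute strings that appear in the tables of this chapter (for example, C**D**X**T*) can be read against the legend above one character at a time. The following Python sketch is illustrative only and is not part of NX-OS. It assumes that each string carries one character per legend entry, in the order listed, and that I stands for "Monitoring is inactive"; the function name decode_attributes is hypothetical.

```python
# Positional legend transcribed from the attribute list above.
# Assumption: one character per legend entry, in the order listed.
# Assumption: "I" in the last position means "Monitoring is inactive".
LEGEND = [
    {"B": "Bypass bootup level test", "C": "Complete bootup level test"},
    {"P": "Per port test"},
    {"M": "Only applicable to active unit", "S": "Only applicable to standby unit"},
    {"D": "Disruptive test", "N": "Non-disruptive test"},
    {"H": "Always enabled monitoring test", "O": "Conditionally enabled test"},
    {"F": "Fixed monitoring interval test"},
    {"X": "Not a health monitoring test"},
    {"E": "Sup to line card test"},
    {"L": "Exclusively run this test"},
    {"T": "Not an ondemand test"},
    {"A": "Monitoring is active", "I": "Monitoring is inactive"},
]


def decode_attributes(attrs: str) -> list:
    """Return the legend text for every position that is not '*' (NA)."""
    if len(attrs) != len(LEGEND):
        raise ValueError(f"expected {len(LEGEND)} characters, got {len(attrs)}")
    meanings = []
    for position, (char, options) in enumerate(zip(attrs, LEGEND), start=1):
        if char == "*":
            continue  # NA for this position
        if char not in options:
            raise ValueError(f"unexpected {char!r} at position {position}")
        meanings.append(options[char])
    return meanings


if __name__ == "__main__":
    # Decode one of the bootup attribute strings used in this chapter.
    for meaning in decode_attributes("C**D**X**T*"):
        print(meaning)
```

A string with fewer characters than the legend is rejected rather than padded, because the missing position cannot be inferred.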
Bootup Diagnostics
Bootup diagnostics run during bootup and detect faulty hardware before a Cisco MDS 9700 Family switch brings a module online. For example, if there is a faulty module in the device, the appropriate bootup diagnostics test fails, indicating the fault.
Note The bootup diagnostics tests are triggered during bootup.
Table 9-1 describes the bootup diagnostic tests for a linecard and a supervisor.
Table 9-1 Bootup Diagnostics
Diagnostic | Attributes | Description |
Linecard |
EOBCPortLoopback | C**D**X**T* | Verifies the health of EOBC (Ethernet Out-of-Band Connectivity) interface. |
OBFL | C**N**X**T* | Verifies the integrity of the OBFL (Onboard Failure Logging) flash. |
BootupPortLoopback | CP*N**XE*T* | PortLoopback test that runs only during module bootup. Note Beginning from the Cisco MDS NX-OS Release 6.2(11), BootupPortLoopback failure for FC ports (on the Cisco MDS 48-Port 16-Gbps Fibre Channel module) puts the failed ports in a diagfailure mode. |
Supervisor |
USB | C**N**X**T* | Verifies the USB controller initialization on a module. |
ManagementPortLoopback | C**D**X**T* | Verifies the health of management interface of a module. |
EOBCPortLoopback | C**D**X**T* | Verifies the health of EOBC (Ethernet Out-of-Band Connectivity) interface. |
OBFL | C**N**X**T* | Verifies the integrity of the OBFL (Onboard Failure Logging) flash. |
When the show module command is executed, the result of bootup diagnostics is displayed as Online Diag Status. The result of an individual test is displayed when the show diagnostic result command is executed for the appropriate module and test ID or test name.
The Cisco MDS 9700 Family switch can be configured to either bypass the bootup diagnostics or run the complete set of bootup diagnostics. See the “Setting the Bootup Diagnostic Level” section.
Health Monitoring Diagnostics
Health Monitoring (HM) diagnostics are enabled by default and verify the health of a live system at periodic intervals. The user can configure the monitoring interval for each test within an allowed range, which differs from test to test. See Activating a Health Monitoring Diagnostic Test for more information. The diagnostic tests detect hardware errors and data path issues.
Health Monitoring diagnostics are non-disruptive (they do not disrupt data or control traffic). The user can disable the Health Monitoring tests. See Deactivating a Health Monitoring Diagnostic Test for more information.
Table 9-2 describes the health monitoring diagnostics for a supervisor.
Table 9-2 Health Monitoring Diagnostics
Diagnostic | Default Interval | Attributes | Description |
ASICRegisterCheck | 20 seconds | ***N******A | Verifies read or write access to scratch registers for the ASICs on the supervisor. |
NVRAM | 5 minutes | ***N******A | Verifies the sanity of the NVRAM blocks on a supervisor. |
RealTimeClock | 5 minutes | ***N******A | Verifies that the real-time clock on the supervisor is ticking. |
PrimaryBootROM | 30 minutes | ***N******A | Verifies the integrity of the primary boot device on the supervisor. |
SecondaryBootROM | 30 minutes | ***N******A | Verifies the integrity of the secondary boot device on the supervisor. |
CompactFlash | 30 minutes | ***N******A | Verifies access to the compact flash devices. |
ExternalCompactFlash | 30 minutes | ***N******A | Verifies access to the external compact flash devices. |
PwrMgmtBus | 30 seconds | **MN******A | Verifies the connectivity of line cards and crossbars from supervisors through the Power Management Bus. Note Starting from Cisco MDS NX-OS Release 6.2(17), PwrMgmtBus is supported on a standby supervisor. For accurate results, ensure that GOLD is enabled on both active and standby supervisors. |
SystemMgmtBus | 30 seconds | **MN******A | Verifies the availability of the standby system management bus. |
StatusBus | 30 seconds | **MN******A | Verifies the status transmitted by the status bus for the supervisor, modules, and fabric cards. |
StandbyFabricLoopback | 30 seconds | **SN******A | Verifies the connectivity of the standby supervisor to the fabric modules. |
Table 9-3 describes the health monitoring diagnostics for the Cisco MDS 48-Port 16-Gbps Fibre Channel module.
Table 9-3 Health Monitoring Diagnostics
Diagnostic | Default Interval | Attributes | Description |
ASICRegisterCheck | 1 minute | ***N******A | Verifies read or write access to scratch registers for the ASICs on a module. |
PrimaryBootROM | 30 minutes | ***N******A | Verifies the integrity of the primary boot device on a module. |
SecondaryBootROM | 30 minutes | ***N******A | Verifies the integrity of the secondary boot device on a module. |
SnakeLoopback | 20 minutes | *P*N***E** | Verifies connectivity from the supervisor to all the ports in the linecard. It checks the integrity of the data path up to the MAC component in a progressive manner (a single run of tests covers all the ports). It runs on all the ports irrespective of their states. This is a non-disruptive test. |
IntPortLoopback | 5 minutes | *P*N***E*** | Verifies connectivity from the supervisor to all the ports in the linecard (one port at a time). It checks the integrity of the data path up to the MAC component. This test runs in Health Monitoring (HM) mode and can also be triggered in on-demand mode. This is a non-disruptive test. Note The IntPortLoopback test is supported beginning with Cisco MDS NX-OS Release 6.2(7). |
RewriteEngineLoopback | 1 minute | *P*N***E**A | Verifies the integrity of each link on the fabric module from the supervisor to the linecard. |
Table 9-4 describes the health monitoring diagnostics for the Cisco MDS 48-Port 10-Gbps Fibre Channel over Ethernet Module.
Table 9-4 Health Monitoring Diagnostics
Diagnostic | Default Interval | Attributes | Description |
ASICRegisterCheck | 1 minute | ***N******A | Verifies read or write access to scratch registers for the ASICs on a module. |
PrimaryBootROM | 30 minutes | ***N******A | Verifies the integrity of the primary boot device on a module. |
SecondaryBootROM | 30 minutes | ***N******A | Verifies the integrity of the secondary boot device on a module. |
PortLoopback | 15 minutes | *P*D***E**A | Verifies connectivity from the supervisor to all the ports in the linecard. It checks the integrity of the data path up to PHY. This test runs in Health Monitoring (HM) mode and can also be triggered in on-demand mode. This is a disruptive test. Note The PortLoopback test runs only on ports which are administratively down. |
RewriteEngineLoopback | 1 minute | *P*N***E**A | Verifies the integrity of each link between linecards, or between the supervisor and a linecard, through fabric modules. |
SnakeLoopback | 20 minutes | *P*N***E** | Verifies connectivity from the supervisor to all the ports in the linecard. It checks the integrity of the data path up to the MAC component in a progressive manner. It runs on all the ports irrespective of their states. This is a non-disruptive test. |
On-Demand Diagnostics
All the Health Monitoring tests can also be invoked on demand. On-demand diagnostics run only when invoked by the user.
Cisco MDS 48-Port 16-Gbps Fibre Channel module: Only two tests can be invoked exclusively in on-demand mode; see Table 9-5.
Cisco MDS 48-Port 10-Gbps Fibre Channel over Ethernet Module: No tests are restricted to on-demand mode.
Note The data paths (PHY and SFP) which are not verified by other Health Monitoring tests can be verified by the PortLoopback and ExtPortLoopback tests.
You can run on-demand diagnostics whenever required. See Starting or Stopping an On-Demand Diagnostic Test for more information.
On Cisco MDS 48-Port 16-Gbps Fibre Channel module, both the PortLoopback and ExtPortLoopback tests are available in on-demand mode only as they are disruptive.
Table 9-5 describes the on-demand diagnostics (for linecard only) on the Cisco MDS 48-Port 16-Gbps Fibre Channel module.
Table 9-5 On-Demand Diagnostics
Diagnostic | Attributes | Description |
PortLoopback | *P*D**XE*** | Verifies connectivity from the supervisor to all the ports in the linecard. It checks the integrity of the data path up to PHY. This test is available only in on-demand mode. The test runs on all the ports irrespective of the port state. Note The PortLoopback test is equivalent to the Serdes Loopback test of OHMS. |
ExtPortLoopback | *P*D**XE*** | Identifies hardware errors in the entire data path up to PHY, including the SFP. Note Connect a loopback plug to loop the Tx of the port to the Rx of the port before running the test. If the loopback plug is not connected, this test fails. Note The ExtPortLoopback test is supported beginning with Cisco MDS NX-OS Release 6.2(11c). |
Caution The PortLoopback and ExtPortLoopback tests are disruptive as they bring down the port for the purpose of diagnostic operation.
Recovery Actions on Specified Health Monitoring Diagnostics
When a Health Monitoring diagnostic test fails consecutively for a threshold number of times (up to 10), the system takes the default action through EEM (Embedded Event Manager). The default action generates alerts (callhome, syslog), logs the failure (OBFL, exception logs), and disables the diagnostic test on the failed instance (port, fabric, or device).
These actions are informative, but they do not remove faulty devices from the live system, which can lead to network disruption, traffic black holing, and so forth.
Note To restart the Health Monitoring tests on failed instances, clear the test result, deactivate the test, and then activate it again on the same module. For more information, see Clearing Diagnostic Results, Deactivating a Health Monitoring Diagnostic Test, and Activating a Health Monitoring Diagnostic Test.
Beginning with the Cisco MDS NX-OS Release 6.2(11), the system can be configured to take corrective (recovery) actions in addition to the default actions after reaching the threshold number of consecutive failures for any of the following Health Monitoring tests:
- PortLoopback test (supported only on Cisco MDS 48-Port 10-Gbps FCoE Module)
- RewriteEngineLoopback test
- StandbyFabricLoopback test
- Internal PortLoopback test
Note The corrective (recovery) actions are disabled by default.
Corrective (Recovery) Action for Supervisor
The corrective action for the supervisor is as follows:
StandbyFabricLoopback test: The system reloads the standby supervisor and, after three retries, powers off the standby supervisor.
Note After reload, when the standby supervisor comes online, the Health Monitoring diagnostics start by default.
Note One retry means a complete cycle of reloading the standby supervisor followed by the threshold number of consecutive failures of the StandbyFabricLoopback test.
Corrective (Recovery) Action for Cisco MDS 48-Port 16-Gbps Fibre Channel Module
The corrective action for each test is as follows:
- Internal PortLoopback test—The system brings down the failed ports and puts them in a diagfailure state.
- RewriteEngineLoopback test—The system takes different corrective action depending on the faulty component (supervisor or fabric):
– On a chassis with a standby supervisor (which is in ha-standby state), if the system detects a fault with the active supervisor, the system triggers a switchover and switches over to the standby supervisor. If there is no standby supervisor in the chassis, the system does not take any action.
Note As the PortLoopback test is available only in on-demand mode on the Cisco MDS 48-Port 16-Gbps Fibre Channel Module, it does not support corrective actions.
Note Beginning with Cisco MDS NX-OS Release 6.2(13), the RewriteEngineLoopback test and its corrective actions are supported on the Cisco MDS 48-Port 16-Gbps Fibre Channel Module.
Corrective (Recovery) Action for Cisco MDS 48-Port 10-Gbps FCoE Module
- PortLoopback test—The system brings down the failed ports and puts them in an error disabled state.
- RewriteEngineLoopback test—The system takes different corrective action depending on the faulty component (supervisor or fabric):
– On a chassis with a standby supervisor (which is in ha-standby state), if the system detects a fault with the active supervisor, the system triggers a “switchover” and switches over to the standby supervisor. If there is no standby supervisor in the chassis, the system does not take any action.
Note If the standby supervisor present in the chassis is powered down in response to the corrective action (associated with StandbyFabricLoopback test), the system does not take any action.
– After 10 consecutive failures of the RewriteEngineLoopback test, if the faulty component is determined to be the fabric module, the system reloads that fabric module. This cycle of 10 consecutive failures and a reload occurs three consecutive times, and then the fabric module is powered down.
– After 10 consecutive failures of the PortLoopback test, if the faulty component is determined to be the port, the system moves the faulty port to an error-disabled state.
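The escalation described for the fabric module (a threshold of consecutive failures, a reload, and a power-down after three such cycles) can be pictured as a small state machine. The Python sketch below is a model for illustration, not the GOLD or EEM implementation. The class name RecoveryTracker and its return strings are hypothetical. It assumes that the power-down happens when the threshold is reached again after the third reload, and that a passing result resets both counters; the text does not state either detail.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTracker:
    """Illustrative model of the reload/power-down escalation (hypothetical)."""

    threshold: int = 10    # consecutive failures before acting (from the text)
    max_reloads: int = 3   # reload cycles before power-down (from the text)
    consecutive_failures: int = 0
    reloads: int = 0
    powered_down: bool = False

    def record(self, passed: bool) -> str:
        """Record one test result and return the action the model would take."""
        if self.powered_down:
            return "powered-down"
        if passed:
            # Assumption: a pass breaks the chain of consecutive cycles.
            self.consecutive_failures = 0
            self.reloads = 0
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures < self.threshold:
            return "failing"
        # Threshold reached: start counting again after taking an action.
        self.consecutive_failures = 0
        if self.reloads < self.max_reloads:
            self.reloads += 1
            return "reload"
        # Assumption: power-down follows the threshold after the last reload.
        self.powered_down = True
        return "power-down"


if __name__ == "__main__":
    tracker = RecoveryTracker()
    for run in range(1, 41):
        action = tracker.record(passed=False)
        if action in ("reload", "power-down"):
            print(f"run {run}: {action}")
```

Under these assumptions, an uninterrupted series of failures produces three reloads followed by a power-down.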
High Availability
A key part of high availability is detecting hardware failures and taking corrective action in a live system. GOLD contributes to the high availability of the system by detecting hardware failures and providing feedback to software components to make switchover decisions.
Cisco MDS 9700 Family switches support stateless restart for GOLD by applying the running configuration after a reboot. After supervisor switchover, GOLD resumes diagnostics from the new active supervisor.
Configuring Online Diagnostics
Setting the Bootup Diagnostic Level
To configure the bootup diagnostics to run the complete set of tests, or to bypass all bootup diagnostic tests for a faster module bootup time, perform these tasks:
Note It is recommended to set the bootup online diagnostics level to complete.
Step |
Command |
Purpose |
Step 1 |
config terminal Example: switch# config terminal Enter configuration commands, one per line. End with CNTL/Z. switch(config)# |
Enters global configuration mode. |
Step 2 |
diagnostic bootup level { complete | bypass } Example: switch(config)# diagnostic bootup level complete |
Configures the bootup diagnostic level to trigger diagnostics when the device boots:
- complete —Performs all bootup diagnostics. The default is complete.
- bypass —Does not perform any bootup diagnostics.
|
Step 3 |
show diagnostic bootup level Example: switch(config)# show diagnostic bootup level |
(Optional) Displays the bootup diagnostic level (bypass or complete) that is currently in place on the device. |
Step 4 |
copy running-config startup-config Example: switch(config)# copy running-config startup-config |
(Optional) Copies the running configuration to the startup configuration. |
Displaying the List of Available Tests
Step |
Command |
Purpose |
Step 1 |
show diagnostic content module slot Example: switch# show diagnostic content module 1 |
(Optional) Displays the list of diagnostic tests and their attributes on a given module.
- slot: The module number whose tests are listed.
|
Activating a Health Monitoring Diagnostic Test
Step |
Command |
Purpose |
Step 1 |
config terminal Example: switch# config terminal Enter configuration commands, one per line. End with CNTL/Z. switch(config)# |
Enters global configuration mode. |
Step 2 |
diagnostic monitor interval module slot test [test-id | name | all ] hour hour min minutes second sec Example: switch(config)# diagnostic monitor interval module 6 test 3 hour 1 min 0 sec 0 |
(Optional) Configures the interval at which the specified test is run. If no interval is set, the test runs at the interval set previously, or the default interval. The arguments are as follows:
- slot: The module number on which the test is activated.
- test-id: Unique identification number for the test.
- name: Predefined name of the test.
- hour: The range is from 0 to 23 hours.
- minutes: The range is from 0 to 59 minutes.
- sec: The range is from 0 to 59 seconds.
|
Step 3 |
diagnostic monitor module slot test [test-id | name | all ] Example: switch(config)# diagnostic monitor module 6 test 3 Example: switch(config)# diagnostic monitor module 6 test SecondaryBootROM |
Activates the specified test. The arguments are as follows:
- slot—The module number on which the test is activated.
- test-id—Unique identification number for the test.
- name—Predefined name of the test.
|
Step 4 |
show diagnostic content module { slot | all } Example: switch(config)# show diagnostic content module 6 |
(Optional) Displays information about the diagnostics and their attributes. The argument is as follows:
- slot—The module number on which the test is activated.
|
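When monitoring intervals are generated by a script rather than typed by hand, the argument ranges given for Step 2 can be checked before the command is sent to the switch. The Python helper below is a hypothetical sketch: the function name monitor_interval_command is not part of any Cisco tool, and it only formats a string in the shape of the documented example. It uses the sec keyword as shown in that example.

```python
def monitor_interval_command(slot: int, test: str, hour: int, minutes: int, sec: int) -> str:
    """Build a 'diagnostic monitor interval' command after range checks.

    Hypothetical helper: it validates only the ranges stated in this section
    and does not know the per-test interval limits enforced by the switch.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in the range 0 to 23")
    if not 0 <= minutes <= 59:
        raise ValueError("minutes must be in the range 0 to 59")
    if not 0 <= sec <= 59:
        raise ValueError("sec must be in the range 0 to 59")
    # 'test' may be a test ID, a test name, or the keyword 'all'.
    return (
        f"diagnostic monitor interval module {slot} test {test} "
        f"hour {hour} min {minutes} sec {sec}"
    )


if __name__ == "__main__":
    # Reproduces the example command shown in Step 2.
    print(monitor_interval_command(6, "3", 1, 0, 0))
```

The switch may still reject an interval that is outside the allowed range for a particular test, so the helper is a first check only.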
Deactivating a Health Monitoring Diagnostic Test
Note Inactive tests keep their current configuration but do not run at the scheduled interval.
To deactivate a test, perform this task:
Command |
Purpose |
no diagnostic monitor module slot test [test-id | name | all ] Example: switch(config)# no diagnostic monitor module 8 test 3 Example: switch(config)# no diagnostic monitor module 8 test SecondaryBootROM |
Deactivates the specified test. The arguments are as follows:
- slot: The module number on which the test is deactivated.
- test-id: Unique identification number for the test.
- name: Predefined name of the test.
|
Starting or Stopping an On-Demand Diagnostic Test
An on-demand diagnostic test can be started or stopped. Optionally, you can first set the number of iterations for the test and the action to take when the test fails.
Note It is recommended to manually start a disruptive diagnostic test during a scheduled network maintenance time.
To start or stop an on-demand diagnostic test, perform these tasks:
Step |
Command |
Purpose |
Step 1 |
diagnostic ondemand iteration number Example: switch# diagnostic ondemand iteration 5 |
(Optional) Configures the number of times that the on-demand test runs. The range is from 1 to 999. The default is 1. |
Step 2 |
diagnostic ondemand action-on-failure { continue failure-count num-fails | stop } Example: switch# diagnostic ondemand action-on-failure stop |
(Optional) Configures the action to take if the on-demand test fails. |
Step 3 |
show diagnostic ondemand setting Example: switch# show diagnostic ondemand setting Test iterations = 1 Action on test failure = continue until test failure limit reaches 1 |
(Optional) Displays information about on-demand diagnostics. |
Step 4 |
diagnostic start module slot test [test-id | name | all | non-disruptive ] [ port port-number | all ] Example: switch# diagnostic start module 6 test all |
Starts one or more diagnostic tests on a module. The arguments are as follows:
- all: All the tests are triggered.
- non-disruptive: All the non-disruptive tests are triggered.
- port: The tests can be invoked on a single port, a range of ports, or all ports.
Note Multiple test IDs or names can be specified, separated by commas.
|
diagnostic run module slot test {PortLoopback | RewriteEngineLoopback | SnakeLoopback | IntPortLoopback | ExtPortLoopback} { port port-id } Example: switch# diagnostic run module 3 test PortLoopback port 1 |
Starts the selected test on a module and displays the result on completion of the test. Note This command was introduced in Cisco MDS NX-OS Release 6.2(11c). For more information, see Starting an On-Demand Diagnostic Test in On-demand Mode. |
Step 5 |
diagnostic stop module slot test [test-id | name | all ] Example: switch# diagnostic stop module 6 test all |
(Optional) Stops one or more diagnostic tests on a module. |
Step 6 |
show diagnostic status module slot Example: switch# show diagnostic status module 6 |
(Optional) Displays all the tests which are running and queued up with information about the testing mode for that module. When the tests are not running or enqueued on the given module, the status is displayed as NA. |
Step 7 |
show diagnostic result module slot test [test-id | name] Example: switch# show diagnostic result module 1 test 3 SecondaryBootROM |
(Optional) Displays the result of the specified test. |
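The interaction between the iteration count (Step 1) and the action on failure (Step 2) can be illustrated with a short simulation. The Python sketch below is an interpretation, not the GOLD implementation. It assumes that stop ends the run at the first failed iteration and that continue failure-count ends the run once the given number of failures has been reached. The function name run_on_demand and the run_once callback are hypothetical.

```python
from typing import Callable, List


def run_on_demand(
    run_once: Callable[[], bool],
    iterations: int = 1,
    action: str = "stop",
    failure_count: int = 1,
) -> List[bool]:
    """Simulate repeated on-demand runs; return the result of each run made.

    Assumptions (not confirmed by this chapter): 'stop' halts at the first
    failure, and 'continue' halts once failure_count failures have occurred.
    """
    if not 1 <= iterations <= 999:
        raise ValueError("iterations must be in the range 1 to 999")
    if action not in ("stop", "continue"):
        raise ValueError("action must be 'stop' or 'continue'")
    limit = 1 if action == "stop" else failure_count
    results: List[bool] = []
    failures = 0
    for _ in range(iterations):
        passed = run_once()
        results.append(passed)
        if not passed:
            failures += 1
            if failures >= limit:
                break  # failure limit reached; remaining iterations are skipped
    return results


if __name__ == "__main__":
    outcomes = iter([True, False, True, False, True])
    print(run_on_demand(lambda: next(outcomes), iterations=5, action="stop"))
```

The callback stands in for one execution of a diagnostic test and returns True for a pass.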
Starting an On-Demand Diagnostic Test in On-demand Mode
OHMS (Online Health Management System) supports invoking tests in an “on-demand mode” which displays the results immediately after running the test.
Beginning with Cisco MDS NX-OS Release 6.2(11c), GOLD supports invoking a specific test from a set of tests in on-demand mode and displaying the test results immediately after the test runs.
GOLD tests can be invoked in on-demand mode using the diagnostic start module command. The diagnostic run module command supports the same action, with the following key differences:
- In contrast to the diagnostic start module command, the diagnostic run module command blocks the current CLI session until the test completes. The CLI session is then unblocked, and the result is displayed on the same console.
Note The CLI session is blocked until the test completes or for a maximum of 15 seconds. If the test does not complete within 15 seconds, GOLD unblocks the CLI session and lets the test run in the background until completion.
Note Only one test can be invoked on a particular module at a time using the diagnostic run module command. If the user attempts to invoke another test on the same module, an error is displayed and the test is not invoked.
- The diagnostic start module command runs the test in the background (the current CLI session is not blocked), so the user must issue the show diagnostic result command to view the result. The diagnostic run module command displays the test result implicitly on the same console.
- The results displayed through the diagnostic run module command are more intuitive than those from the show diagnostic result command.
Note The maximum number of ports recommended for the diagnostic run module command is 5.
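The blocking behavior of the diagnostic run module command (wait for the result, but release the console after at most 15 seconds while the test continues in the background) follows a common wait-with-timeout pattern. The Python sketch below illustrates that pattern with a worker thread. It is an analogy, not how NX-OS implements the command, and the class name DiagnosticRun is hypothetical.

```python
import threading
import time
from typing import Callable, Optional


class DiagnosticRun:
    """Run a test in a worker thread and wait for it with a time limit."""

    def __init__(self, test: Callable[[], str]) -> None:
        self.result: Optional[str] = None
        self._thread = threading.Thread(target=self._work, args=(test,))

    def _work(self, test: Callable[[], str]) -> None:
        self.result = test()

    def run(self, timeout_s: float = 15.0) -> Optional[str]:
        """Block until the test finishes or timeout_s elapses.

        Returns the result, or None if the test is still running in the
        background when the caller is released.
        """
        self._thread.start()
        self._thread.join(timeout_s)
        return None if self._thread.is_alive() else self.result

    def wait(self) -> Optional[str]:
        """Wait for a backgrounded test to finish and return its result."""
        self._thread.join()
        return self.result


if __name__ == "__main__":
    def slow_test() -> str:
        time.sleep(0.2)  # stands in for a long-running diagnostic
        return "pass"

    run = DiagnosticRun(slow_test)
    print(run.run(timeout_s=0.05))  # caller is released before the test ends
    print(run.wait())               # result becomes available later
```

Retrieving the result later with wait() plays the role that the show diagnostic result command plays for a test that was left running in the background.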
Clearing Diagnostic Results
To clear the diagnostic test results, use the following command:
Command |
Purpose |
diagnostic clear result module [ slot | all ] test {test-id | all } Example: switch# diagnostic clear result module 2 test all Example: switch# diagnostic clear result module 2 test 3 |
Clears the test result for the specified test. |
Simulating Diagnostic Results
To test the behavior of GOLD in case of a diagnostic test failure, GOLD provides a mechanism to simulate the test failure on a port, sup, or fabric.
Note Simulating a failure after enabling corrective actions triggers the corrective action (see Recovery Actions on Specified Health Monitoring Diagnostics) on the component where the failure was simulated.
To simulate a diagnostic test result, use the following command:
Command |
Purpose |
diagnostic test simulation module slot test test-id {fail | random-fail | success} [ port number | all ] Example: switch# diagnostic test simulation module 2 test 2 fail |
Simulates a test result. |
To clear the simulated diagnostic test result, use the following command:
Command |
Purpose |
diagnostic test simulation module slot test test-id clear Example: switch# diagnostic test simulation module 2 test 2 clear |
Clears the simulated test result. |
Enabling Corrective (Recovery) Actions
To enable corrective (recovery) actions, perform these tasks:
Step |
Command |
Purpose |
Step 1 |
configure terminal |
Enters global configuration mode. |
Step 2 |
diagnostic eem action conservative Example: switch(config)# diagnostic eem action conservative switch(config)# |
Enables corrective or recovery actions. Note This command is applicable to the system as a whole and cannot be specifically configured to any particular module or test. |
Step 3 |
no diagnostic eem action conservative |
(Optional) Disables corrective (recovery) actions. |