Information About Online Diagnostics
Online diagnostics verifies the hardware and data paths and identifies faulty devices.
Online Diagnostic Overview
The GOLD (Generic Online Diagnostics) framework tests and verifies the hardware devices and data path in a live system.
The GOLD tests can be executed in three modes:
- Bootup
- Health-monitoring (also called Runtime)
- On-demand
The following explains the diagnostics test suite attributes:
- B/C/* - Bypass bootup level test / Complete bootup level test / NA
- P/* - Per port test / NA
- M/S/* - Only applicable to active / standby unit / NA
- D/N/* - Disruptive test / Non-disruptive test / NA
- H/O/* - Always enabled monitoring test / Conditionally enabled test / NA
- F/* - Fixed monitoring interval test / NA
- X/* - Not a health monitoring test / NA
- E/* - Sup to line card test / NA
- L/* - Exclusively run this test / NA
- T/* - Not an ondemand test / NA
- A/I/* - Monitoring is active / Monitoring is inactive / NA
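The attribute strings shown in the tables of this section (for example, C**D**X**T*) appear to use one character per attribute, in the order listed above, with * meaning NA. The following Python sketch decodes such a string under that assumption. The names ATTRIBUTE_LEGEND and decode_attributes are hypothetical and are not part of NX-OS, and the meaning of I (monitoring inactive) is inferred from its pairing with A.

```python
# Illustrative decoder for GOLD attribute strings. This is a reader's aid,
# not an NX-OS API. It assumes one character per attribute, in the same
# order as the attribute list, with "*" meaning "not applicable".
ATTRIBUTE_LEGEND = [
    {"B": "Bypass bootup level test", "C": "Complete bootup level test"},
    {"P": "Per port test"},
    {"M": "Only applicable to active unit", "S": "Only applicable to standby unit"},
    {"D": "Disruptive test", "N": "Non-disruptive test"},
    {"H": "Always enabled monitoring test", "O": "Conditionally enabled test"},
    {"F": "Fixed monitoring interval test"},
    {"X": "Not a health monitoring test"},
    {"E": "Sup to line card test"},
    {"L": "Exclusively run this test"},
    {"T": "Not an ondemand test"},
    # "I" is assumed to mean inactive monitoring.
    {"A": "Monitoring is active", "I": "Monitoring is inactive"},
]


def decode_attributes(flags):
    """Return the meanings of all non-NA flags in an attribute string."""
    if len(flags) != len(ATTRIBUTE_LEGEND):
        raise ValueError(
            f"expected {len(ATTRIBUTE_LEGEND)} characters, got {len(flags)}"
        )
    meanings = []
    for position, (flag, legend) in enumerate(zip(flags, ATTRIBUTE_LEGEND), start=1):
        if flag == "*":
            continue  # attribute does not apply to this test
        if flag not in legend:
            raise ValueError(f"unexpected flag {flag!r} at position {position}")
        meanings.append(legend[flag])
    return meanings
```

For example, decode_attributes("C**D**X**T*") yields the four meanings "Complete bootup level test", "Disruptive test", "Not a health monitoring test", and "Not an ondemand test". A string that is not exactly 11 characters long raises ValueError rather than being guessed at.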
Bootup Diagnostics
Bootup diagnostics run during bootup and detect faulty hardware before a Cisco MDS 9700 Series switch brings a module online. For example, if a module in the device is faulty, the corresponding bootup diagnostic test fails, which indicates the fault.
Note |
The bootup diagnostics tests are triggered during bootup. |
Table 1 describes the bootup diagnostic tests for a module and a supervisor.
Diagnostic | Attributes | Description |
---|---|---|
Linecard | | |
EOBCPortLoopback | C**D**X**T* | Verifies the health of EOBC (Ethernet Out-of-Band Connectivity) interface. |
OBFL | C**N**X**T* | Verifies the integrity of the OBFL (Onboard Failure Logging) flash. |
BootupPortLoopback | CP*N**XE*T* | PortLoopback test that runs only during module bootup. |
Supervisor | | |
USB | C**N**X**T* | Verifies the USB controller initialization on a module. |
ManagementPortLoopback | C**D**X**T* | Verifies the health of management interface of a module. |
EOBCPortLoopback | C**D**X**T* | Verifies the health of EOBC (Ethernet Out-of-Band Connectivity) interface. |
OBFL | C**N**X**T* | Verifies the integrity of the OBFL (Onboard Failure Logging) flash. |
When the show module command is executed, the result of bootup diagnostics is displayed as Online Diag Status. The result of an individual test is displayed when the show diagnostic result command is executed for the appropriate module and test ID or test name.
The Cisco MDS 9700 Family switch can be configured either to bypass the bootup diagnostics or to run the complete set of bootup diagnostics. See Setting the Bootup Diagnostic Level.
Health Monitoring Diagnostics
Health Monitoring (HM) diagnostics are enabled by default and verify the health of a live system at periodic intervals. Each test has its own default monitoring interval, which the user can change within the range allowed for that test. See Activating a Health Monitoring Diagnostic Test for more information. The diagnostic tests detect hardware errors and data path issues.
Health Monitoring diagnostics are non-disruptive: they do not disrupt data or control traffic. The user can disable Health Monitoring tests. See Deactivating a Health Monitoring Diagnostic Test for more information.
The following table describes the health monitoring diagnostics for a supervisor.
Diagnostic | Default Testing Interval | Attributes | Description |
---|---|---|---|
Supervisor | | | |
ASICRegisterCheck | 20 seconds | ***N******A | Verifies read or write access to scratch registers for the ASICs on the supervisor. |
NVRAM | 5 minutes | ***N******A | Verifies the sanity of the NVRAM blocks on a supervisor. |
RealTimeClock | 5 minutes | ***N******A | Verifies that the real-time clock on the supervisor is ticking. |
PrimaryBootROM | 30 minutes | ***N******A | Verifies the integrity of the primary boot device on the supervisor. |
SecondaryBootROM | 30 minutes | ***N******A | Verifies the integrity of the secondary boot device on the supervisor. |
CompactFlash | 30 minutes | ***N******A | Verifies access to the compact flash devices. |
ExternalCompactFlash | 30 minutes | ***N******A | Verifies access to the external compact flash devices. |
PwrMgmtBus | 30 seconds | **MN******A | Verifies the standby power management control bus. |
SystemMgmtBus | 30 seconds | **MN******A | Verifies the availability of the standby system management bus. |
StatusBus | 30 seconds | **MN******A | Verifies the status transmitted by the status bus for the supervisor, modules, and fabric cards. |
StandbyFabricLoopback | 30 seconds | **SN******A | Verifies the connectivity of the standby supervisor to the fabric modules. |
Table 1 describes the health monitoring diagnostics for the Cisco MDS 9700 48-Port 32-Gbps Fibre Channel Switching Module.
Diagnostic | Default Testing Interval | Attributes | Description |
---|---|---|---|
Linecard | | | |
ASICRegisterCheck | 1 minute | ***N******A | Verifies read or write access to scratch registers for the ASICs on a module. |
PrimaryBootROM | 30 minutes | ***N******A | Verifies the integrity of the primary boot device on a module. |
SecondaryBootROM | 30 minutes | ***N******A | Verifies the integrity of the secondary boot device on a module. |
SnakeLoopback | 20 minutes | *P*N***E** | Verifies connectivity from the sup to all the ports in the linecard. It checks the integrity of the data path up to the MAC component in a progressive manner (a single run of the test covers all the ports). It runs on all the ports irrespective of their states. This is a non-disruptive test. |
IntPortLoopback | 5 minutes | *P*N***E*** | Verifies connectivity from the sup to all the ports in the linecard (one port at a time). It checks the integrity of the data path up to the MAC component. This test runs in Health Monitoring (HM) mode and can also be triggered in on-demand mode. This is a non-disruptive test. |
RewriteEngineLoopback | 1 minute | *P*N***E**A | Verifies the integrity of each link on the fabric module from sup to linecard. |
On-Demand Diagnostics
All the Health Monitoring tests can also be invoked on demand. On-demand diagnostics run only when invoked by the user.
On the Cisco MDS 48-Port 32-Gbps Fibre Channel module, only two tests are restricted to on-demand mode; see Table 1.
Note |
The data paths (PHY and SFP) that are not verified by other Health Monitoring tests can be verified by the PortLoopback and ExtPortLoopback tests. |
You can run on-demand diagnostics whenever required. See the Starting or Stopping an On-Demand Diagnostic Test for more information.
On the Cisco MDS 48-Port 32-Gbps Fibre Channel module, both the PortLoopback and ExtPortLoopback tests are available in on-demand mode only because they are disruptive.
Table 1 describes the on-demand diagnostics (for module only) on the Cisco MDS 48-Port 32-Gbps Fibre Channel module.
Diagnostic | Attributes | Description |
---|---|---|
Linecard | | |
PortLoopback | *P*D**XE*** | Verifies connectivity from sup to all the ports in the module. It checks the integrity of the data path up to PHY. This test is available only in “on-demand mode.” The test runs on all the ports irrespective of the port state. |
ExtPortLoopback | *P*D**XE*** | Identifies hardware errors in the entire data path up to PHY including the SFP. |
Caution |
The PortLoopback and ExtPortLoopback tests are disruptive as they bring down the port for the purpose of diagnostic operation. |
Recovery Actions on Specified Health Monitoring Diagnostics
When a Health Monitoring diagnostic test fails consecutively a threshold number of times (up to 10), the system takes a default action through EEM. The default action generates alerts (callhome, syslog), writes logs (OBFL, exception logs), and disables the diagnostic test on the failed instance (port, fabric, or device).
These actions are informative, but they do not remove faulty devices from the live system, which can lead to network disruption, traffic black holing, and so forth.
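The consecutive-failure rule described above can be sketched in Python as follows. This is only an illustration of the counting logic, not the GOLD or EEM implementation. The class name ConsecutiveFailureTracker is hypothetical, the default threshold of 10 is an assumption taken from the stated upper bound, and resetting the counter on a passing run is inferred from the word "consecutively".

```python
class ConsecutiveFailureTracker:
    """Counts back-to-back failures of one test instance (hypothetical helper)."""

    def __init__(self, threshold=10):
        # Assumption: 10 is used as the default because the text gives it
        # as the upper bound for the threshold.
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0
        self.enabled = True

    def record(self, passed):
        """Record one test run; return True when the default action should fire."""
        if not self.enabled:
            return False  # test already disabled on this instance
        if passed:
            # Assumption: a passing run resets the consecutive-failure count.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            # Mirrors the text: the test is disabled on the failed instance
            # once the threshold is reached.
            self.enabled = False
            return True
        return False
```

With a threshold of 3, the run sequence fail, fail, pass, fail, fail, fail triggers the default action only on the last run, because the passing run in the middle resets the count. Any later runs are ignored until the test is re-enabled.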
Note |
To restart the Health Monitoring tests on failed instances, clear the test result, deactivate the test, and then activate it again on the same module. For more information, see Clearing Diagnostic Results, Deactivating a Health Monitoring Diagnostic Test, and Activating a Health Monitoring Diagnostic Test. |
Beginning with the Cisco MDS NX-OS Release 6.2(11), the system can be configured to take corrective (recovery) actions in addition to the default actions after reaching the threshold number of consecutive failures for any of the following Health Monitoring tests:
- PortLoopback test (supported only on Cisco MDS 48-Port 10-Gbps FCoE Module)
- RewriteEngineLoopback test
- StandbyFabricLoopback test
- Internal PortLoopback test
Note |
The corrective (recovery) actions are disabled by default. |
Corrective (Recovery) Action for Supervisor
The corrective action for the supervisor is as follows:
StandbyFabricLoopback test: The system reloads the standby supervisor. After three retries, the system powers off the standby supervisor.
Note |
After the reload, when the standby supervisor comes online, Health Monitoring diagnostics start by default. |
Note |
One retry means a complete cycle of reloading the standby supervisor followed by the threshold number of consecutive failures of the StandbyFabricLoopback test. |
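One reading of the retry rule above can be sketched in Python as follows. The function names corrective_action and simulate are hypothetical, and the sketch assumes that the first time the failure threshold is reached the system reloads the standby supervisor, and that each later threshold event completes one retry. The actual sequencing inside NX-OS may differ.

```python
MAX_RETRIES = 3  # from the text: power off after three retries


def corrective_action(completed_retries, max_retries=MAX_RETRIES):
    """Action taken when StandbyFabricLoopback reaches its failure threshold.

    completed_retries is the number of reload-then-fail cycles already
    finished for the standby supervisor.
    """
    if completed_retries < 0:
        raise ValueError("completed_retries cannot be negative")
    if completed_retries >= max_retries:
        return "power-off"
    return "reload"


def simulate(threshold_events):
    """List the actions taken over successive threshold events."""
    actions = []
    retries = 0
    for _ in range(threshold_events):
        action = corrective_action(retries)
        actions.append(action)
        if action == "power-off":
            break  # the standby supervisor is off; nothing more to test
        # Assumption: when the next threshold event arrives, the reload
        # issued here has been followed by another run of consecutive
        # failures, which completes one retry.
        retries += 1
    return actions
```

Under these assumptions, simulate(5) returns three reload actions followed by one power-off action, and stops there because the standby supervisor is no longer running.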
Corrective (Recovery) Action for Cisco MDS 48-Port 32-Gbps Fibre Channel Module
The corrective action for each test is as follows:
- Internal PortLoopback test: The system brings down the failed ports and puts them in a diagfailure state.
- RewriteEngineLoopback test: The system takes different corrective action depending on the faulty component (supervisor or fabric):
  - On a chassis with a standby supervisor (which is in ha-standby state), if the system detects a fault with the active supervisor, the system triggers a switchover to the standby supervisor. If there is no standby supervisor in the chassis, the system does not take any action.
Note |
As the PortLoopback test is available only in on-demand mode on the Cisco MDS 48-Port 32-Gbps Fibre Channel Module, it does not support corrective actions. |
Note |
Beginning with Cisco MDS NX-OS Release 6.2(13), the RewriteEngineLoopback test and the corrective actions for the RewriteEngineLoopback test are supported on the Cisco MDS 48-Port 32-Gbps Fibre Channel Module. |
High Availability
A key part of high availability is detecting hardware failures and taking corrective action in a live system. GOLD contributes to the high availability of the system by detecting hardware failures and providing feedback to software components to make switchover decisions.
Cisco MDS 9700 Family switches support stateless restart for GOLD by applying the running configuration after a reboot. After supervisor switchover, GOLD resumes diagnostics from the new active supervisor.