About Online Diagnostics
With online diagnostics, you can test and verify the hardware functionality of the device while the device is connected to a live network.
The online diagnostics contain tests that check different hardware components and verify the data path and control signals. Disruptive online diagnostic tests (such as the disruptive loopback test) and nondisruptive online diagnostic tests (such as the ASIC register check) run during bootup, line module online insertion and removal (OIR), and system reset. The nondisruptive online diagnostic tests run as part of the background health monitoring, and you can run these tests on demand.
Online diagnostics fall into three categories: bootup diagnostics, runtime (health-monitoring) diagnostics, and on-demand diagnostics. Bootup diagnostics run during bootup, health-monitoring tests run in the background, and on-demand diagnostics run once or at user-designated intervals when the device is connected to a live network.
Bootup Diagnostics
Bootup diagnostics run during bootup and detect faulty hardware before Cisco NX-OS brings a module online. For example, if you insert a faulty module in the device, bootup diagnostics test the module and take it offline before the device uses the module to forward traffic.
Bootup diagnostics also check the connectivity between the supervisor and module hardware and the data and control paths for all the ASICs. The following table describes the bootup diagnostic tests for a module and a supervisor.
| Diagnostic | Description |
|---|---|
| OBFL | Verifies the integrity of the onboard failure logging (OBFL) flash. |
| MacSecPortLoopback (Cisco Nexus 9736C-FX and 9736Q-FX line cards only) | Tests the packet path from the supervisor to each physical front-panel port on the ASIC, the MACsec capabilities of each port, and the encryption and decryption capabilities of the Cisco Nexus 9736C-FX and 9736Q-FX line cards. The test runs at boot time when the diagnostic bootup level is set to complete. It runs on each of the 36 front-panel ports on these line cards, including ports that are broken out. The MACsec hardware is tested for the four available cipher suites: GCM-AES-128, GCM-AES-256, GCM-AES-XPN-128, and GCM-AES-XPN-256. |
| USB | Nondisruptive test. Checks the USB controller initialization on a module. |
| ManagementPortLoopback | Disruptive test, not an on-demand test. Tests loopback on the management port of a module. |
| EOBCPortLoopback | Disruptive test, not an on-demand test. Ethernet out of band (EOBC) port loopback. |
Bootup diagnostics log failures to onboard failure logging (OBFL) and syslog and trigger a diagnostic LED indication (on, off, pass, or fail).
You can configure the device to either bypass the bootup diagnostics or run the complete set of bootup diagnostics.
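The gating behavior described in this section can be pictured with a short Python sketch. It is only a model under stated assumptions: the function name, the dictionary of stand-in test callables, and the return shape are hypothetical, and real bootup tests exercise hardware rather than returning fixed values.

```python
from typing import Callable, Dict, List, Tuple


def run_bootup_diagnostics(
    level: str, tests: Dict[str, Callable[[], bool]]
) -> Tuple[bool, List[str]]:
    """Return (bring_online, failed_test_names) for one module.

    Hypothetical model: "bypass" skips every test, "complete" runs them all.
    """
    if level == "bypass":
        return True, []
    if level != "complete":
        raise ValueError(f"unknown bootup level: {level!r}")
    failed = [name for name, test in tests.items() if not test()]
    # A module with any failed test stays offline and never forwards traffic.
    return (not failed), failed


# Stand-in test functions (assumptions for illustration only).
tests = {
    "OBFL": lambda: True,
    "USB": lambda: True,
    "EOBCPortLoopback": lambda: False,
}
print(run_bootup_diagnostics("complete", tests))  # (False, ['EOBCPortLoopback'])
print(run_bootup_diagnostics("bypass", tests))    # (True, [])
```

The sketch also shows the trade-off of bypassing: the module comes online faster, but a fault such as the failing loopback above goes undetected at bootup.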
Runtime or Health Monitoring Diagnostics
Runtime diagnostics are also called health monitoring (HM) diagnostics. These diagnostics provide information about the health of a live device. They detect runtime hardware errors, memory errors, the degradation of hardware modules over time, software faults, and resource exhaustion.
Health monitoring diagnostics are nondisruptive and run in the background to ensure the health of a device that is processing live network traffic. You can enable or disable health monitoring tests or change their runtime interval.
The following table describes the health monitoring diagnostics and test IDs for a module and a supervisor.
Note: Some tests might not be present, depending on the capabilities of the module. To list the tests available on a module, use the show diagnostic content module <module> command.
| Diagnostic | Default Interval | Default Setting | Description | Corrective Action |
|---|---|---|---|---|
| Module | | | | |
| ACT2 | 30 minutes | active | Verifies the integrity of the security device on the module. | Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD "ACT2" test. |
| ASICRegisterCheck | Modular switches: 1 minute. Non-modular switches: 20 seconds, with a minimum configuration default simulation interval of 10 seconds. | active | Validates read/write access to the ASICs on a module. | Do CallHome, log error, and disable further HM testing for that ASIC device/instance after 20 consecutive failures of GOLD "ASICRegisterCheck" test. |
| PrimaryBootROM | 24 hours | active | Verifies the integrity of the primary boot device on a module. | Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD "PrimaryBootROM" test. |
| SecondaryBootROM | 24 hours | active | Verifies the integrity of the secondary boot device on a module. | Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD "SecondaryBootROM" test. |
| BootupPortLoopback | Only on bootup | Only on bootup - active | Checks whether the path from the supervisor to the front-panel port and back is operational. For every front port, the test generates a packet on an active supervisor, sends the packet toward a target port, and, using the internal loopback inside a front port, redirects the packet back to the active supervisor. | Do CallHome, error-disable affected ports, and log error on affected ports after 1 failure of GOLD "BootupPortLoopback" test. |
| PortLoopback | 30 minutes | active | Checks diagnostics on a per-port basis on all admin down ports. | Do CallHome, log error in Syslog/OBFL/Exception Log, and disable further HM testing on affected ports after 10 consecutive failures of GOLD "PortLoopback" test. |
| RewriteEngineLoopback | 1 minute | active | Verifies the integrity of the nondisruptive loopback for all ports up to the Rewrite Engine ASIC device. | Do CallHome, log error in Syslog/OBFL/Exception Log, and disable further HM testing on affected ports after 10 consecutive failures of GOLD "RewriteEngine" test. |
| AsicMemory | Only on bootup | Only on bootup - inactive | Checks whether the ASIC memory is consistent, using the MBIST bit in the ASIC. | Do CallHome and log error when GOLD "AsicMemory" test fails. As the issue causing the test failure may be transient, attempt recovery reload through kernel panic. |
| FpgaRegTest | 30 seconds | Health monitoring test - every 30 seconds - active | Tests the FPGA status by reading from and writing to the FPGA. | Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD "FpgaRegTest" test. As the issue causing the test failure may be transient, attempt recovery reload through kernel panic. |
| L2ACLRedirect | 1 minute | Health monitoring test - every minute - active | Checks whether the active inband path is operational. The test generates a packet on an active supervisor through the active fabric module. It then sends the packet toward the front-panel port (physical interface on the line card) and, using the ACL entry, redirects the packet back to the active supervisor. | Do CallHome, log error, and disable further HM testing after 10 consecutive failures of GOLD "L2ACLRedirect" test. As the issue causing the test failure may be transient, attempt recovery reload through kernel panic. |
| OBFL | 30 minutes | active | Verifies the integrity of the onboard failure logging (OBFL) flash, and monitors the available storage in the device. | |
| FabricConnectivityTest | 1 minute | active | Verifies fabric/linecard link status. Validates that the fabric links are functioning. | |
| FabricReachabilityTest | 1 minute | active | Verifies fabric/linecard reachability status. Validates that each fabric component has a valid path to every other fabric component in the system. | |
| Supervisor | | | | |
| Backplane | 30 minutes | active | Verifies the integrity of the backplane SPROM devices. | |
| NVRAM | 5 minutes | active | Verifies the sanity of the NVRAM blocks on a supervisor. | Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD "NVRAM" test. |
| RealTimeClock | 5 minutes | active | Verifies that the real-time clock on the supervisor is ticking. | Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD "RealTimeClock" test. |
| PrimaryBootROM | 30 minutes | active | Verifies the integrity of the primary boot device on the supervisor. | Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD "PrimaryBootROM" test. |
| SecondaryBootROM | 30 minutes | active | Verifies the integrity of the secondary boot device on the supervisor. | Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD "SecondaryBootROM" test. |
| BootFlash | 30 minutes | active | Verifies access to the bootflash devices. | Do CallHome and log error when GOLD "BootFlash" test fails. |
| USB | 30 minutes | active | Verifies access to the USB devices. | Do CallHome and log error when GOLD "USB" test fails. |
| SystemMgmtBus | 30 seconds | active | Verifies the availability of the system management bus. | Do CallHome, log error, and disable further HM testing for that fan or power supply after 20 consecutive failures of GOLD "SystemMgmtBus" test. |
| Mce | 30 minutes | Health monitoring test - 30 minutes - active | Uses the mcd_daemon and reports any machine check error reported by the kernel. | Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD "Mce" test. |
| Pcie | Only on bootup | Only on bootup - inactive | Reads the PCIe status registers and checks for any error on the PCIe device. | Do CallHome and log error when GOLD "Pcie" test fails. |
| Console | Only on bootup | Only on bootup - inactive | Runs a port loopback test on the management port at bootup to check its consistency. | Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD "Console" test. |
| FpgaRegTest | 30 seconds | Health monitoring test - every 30 seconds - active | Tests the FPGA status by reading from and writing to the FPGA. | Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD "FpgaRegTest" test. As the issue causing the test failure may be transient, attempt recovery reload through kernel panic. |
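Most corrective actions in the preceding table share one pattern: a test keeps running until it has failed a set number of times in a row, and then the device raises CallHome, logs the error, and stops further health-monitoring runs of that test. The following Python sketch models only that counting logic. The class, its field names, and the action strings are hypothetical, and the reset-on-pass behavior is an assumption drawn from the word "consecutive" rather than from any documented NX-OS internals.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class HealthMonitorTest:
    """Hypothetical model of one health-monitoring test (not an NX-OS API)."""

    name: str
    failure_threshold: int          # e.g. 20 for "NVRAM", 10 for "PortLoopback"
    consecutive_failures: int = 0
    enabled: bool = True

    def record_result(self, passed: bool) -> list[str]:
        """Record one test run and return the corrective actions it triggers."""
        if not self.enabled:
            return []               # testing was already disabled
        if passed:
            # Assumption: a passing run resets the consecutive-failure count.
            self.consecutive_failures = 0
            return []
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return []
        # Threshold reached: mirror the "Corrective Action" column.
        self.enabled = False
        return ["callhome", "log error", "disable further HM testing"]


nvram = HealthMonitorTest(name="NVRAM", failure_threshold=20)
for _ in range(19):
    nvram.record_result(passed=False)
nvram.record_result(passed=True)    # one pass breaks the failure streak
for _ in range(19):
    nvram.record_result(passed=False)
print(nvram.enabled)                # True: only 19 failures in a row so far
actions = nvram.record_result(passed=False)
print(actions, nvram.enabled)       # 20th consecutive failure disables the test
```

Under this model, 38 failures interrupted by a single pass do not trigger the corrective action; only an unbroken run of 20 does.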
On-Demand Diagnostics
On-demand tests help localize faults and are usually needed in one of the following situations:
- To respond to an event that has occurred, such as isolating a fault.
- In anticipation of an event that may occur, such as a resource exceeding its utilization limit.
You can run all the health monitoring tests on demand. You can schedule on-demand diagnostics to run immediately.
You can also modify the default interval for a health monitoring test.
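When interval changes are scripted, it can help to generate the configuration line from a duration in seconds. The Python sketch below assumes the command form diagnostic monitor interval module <slot> test <name> hour <h> min <m> second <s>; confirm the exact syntax and the permitted ranges in the command reference for your release before using it. The function name and its parameters are hypothetical.

```python
def interval_command(slot: int, test_name: str, seconds: int) -> str:
    """Build an interval configuration line from a duration in seconds.

    Assumption: the CLI takes separate hour, min, and second values.
    """
    if seconds <= 0:
        raise ValueError("interval must be a positive number of seconds")
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return (
        f"diagnostic monitor interval module {slot} test {test_name} "
        f"hour {hours} min {minutes} second {secs}"
    )


# Example: run PortLoopback on module 1 every 15 minutes instead of every 30.
print(interval_command(1, "PortLoopback", 900))
# diagnostic monitor interval module 1 test PortLoopback hour 0 min 15 second 0
```

The function only formats text; it does not check that the test exists on the module or that the interval is within the range the device accepts.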
High Availability
A key part of high availability is detecting hardware failures and taking corrective action while the device runs in a live network. Online diagnostics in high availability detect hardware failures and provide feedback to high availability software components to make switchover decisions.
Cisco NX-OS supports stateless restarts for online diagnostics. After a reboot or supervisor switchover, Cisco NX-OS applies the running configuration.
Virtualization Support
Online diagnostics are virtual routing and forwarding (VRF) aware. You can configure online diagnostics to use a particular VRF to reach the online diagnostics SMTP server.