Online Diagnostics

This chapter describes how to configure the generic online diagnostics (GOLD) feature on NX-OS devices.

Generic Online Diagnostics

Generic Online Diagnostics (GOLD) are tests that

  • run automatically during startup and continuously monitor health in the background,

  • help detect hardware problems early to keep the switch running smoothly and avoid downtime, and

  • check hardware and data paths while the device is running without interrupting network traffic.

Operation and types of Online Diagnostics

Online diagnostics enable you to test and verify the hardware functionality of the device while it is connected to a live network.

The online diagnostics contain tests that check different hardware components and verify the data path and control signals. Both disruptive tests (such as the disruptive loopback test) and non-disruptive tests (such as the ASIC register check) run during bootup, line module online insertion and removal (OIR), and system reset. The non-disruptive tests also run as part of background health monitoring, and you can run them on demand.

Online diagnostics are categorized as bootup, runtime or health-monitoring diagnostics, and on-demand diagnostics. Bootup diagnostics run during bootup, health-monitoring tests run in the background, and on-demand diagnostics run once or at user-designated intervals when the device is connected to a live network.

Bootup diagnostics

Bootup diagnostics run during bootup and detect faulty hardware before NX-OS brings a module online. For example, if you insert a faulty module in the device, bootup diagnostics test the module and take it offline before the device uses the module to forward traffic.

Bootup diagnostics also check the connectivity between the supervisor and module hardware and the data and control paths for all the ASICs. The following table describes the bootup diagnostic tests for a module and a supervisor.

Table 1. Bootup diagnostics

Diagnostic

Description

OBFL

Verifies the integrity of the onboard failure logging (OBFL) flash.

MacSecPortLoopback (Nexus 9736C-FX and 9736Q-FX line cards only)

Tests the packet path from the supervisor to each physical front-panel port on the ASIC, the MACsec capabilities of each port, and the encryption and decryption capabilities of the Nexus 9736C-FX and 9736Q-FX line cards. The MacSecPortLoopback test runs at boot time when the diagnostic bootup level is set to complete.

The MacSecPortLoopback test runs on each of the 36 front-panel ports on the Nexus 9736C-FX and 9736Q-FX line cards, including ports that are broken out. The MACsec hardware is tested for the four available cipher suite algorithms: GCM-AES-128, GCM-AES-256, GCM-AES-XPN-128, and GCM-AES-XPN-256.

Note

 

If a MacSecPortLoopback test fails, the failure is reported through syslog or OBFL. The port is taken down, and the show interface command output displays a MACsec failure. You can skip the MACsec test by setting the diagnostic bootup level to either minimal or bypass.

USB

Nondisruptive test. Checks the USB controller initialization on a module.

ManagementPortLoopback

Disruptive test, not an on-demand test. Tests loopback on the management port of a module.

EOBCPortLoopback

Disruptive test, not an on-demand test. Tests loopback on the Ethernet out-of-band channel (EOBC) port.

Bootup diagnostics log failures to onboard failure logging (OBFL) and system log and trigger a diagnostic LED indication (on, off, pass, or fail).

You can configure the device to either bypass the bootup diagnostics or run the complete set of bootup diagnostics.

Runtime or health monitoring diagnostics

Runtime diagnostics, also called health monitoring (HM) diagnostics, provide information about the health of a live device and detect runtime hardware errors, memory errors, hardware degradation, software faults, and resource exhaustion.

Health monitoring diagnostics are non-disruptive and run in the background to ensure the health of a device that is processing live network traffic. You can enable or disable health monitoring tests or change their runtime interval.

The table describes the health monitoring diagnostics and test IDs for a module and a supervisor.


Note


Some tests may not be present, depending on the capabilities of the module. To list the tests available on a module, use the show diagnostic content module <module> command.


Table 2. Health monitoring non-disruptive diagnostics

Diagnostic

Default Interval

Default Setting

Description

Corrective Action

Module

ACT2

30 minutes

active

Verifies the integrity of the security device on the module.

Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD ACT2 test

ASICRegisterCheck

modular switches: 1 minute

non-modular switches: 20 seconds and a minimum configuration default simulation interval of 10 seconds

active

Validates read/write access to the ASICs on a module.

Do CallHome, log error, and disable further HM testing for that ASIC device/instance after 20 consecutive failures of GOLD ASICRegisterCheck test

PrimaryBootROM

24 hours (see footnote 1)

active

Verifies the integrity of the primary boot device on a module.

Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD PrimaryBootROM test

SecondaryBootROM

24 hours (see footnote 1)

active

Verifies the integrity of the secondary boot device on a module.

Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD SecondaryBootROM test

BootupPortLoopback

Only on bootup

Only on boot up - active

Checks if the supervisor to front-panel port (and back) path is operational. For every front port, the test generates a packet on an active supervisor, sends the packet toward a target port, and, using the internal loopback inside a front port, redirects the packet back to the active supervisor.

Do CallHome, error-disable affected ports, log error, and disable further testing on affected ports after 1 failure of GOLD BootupPortLoopback test

PortLoopback

30 minutes

active

Checks diagnostics on a per-port basis on all admin down ports.

Do CallHome, log error in Syslog/OBFL/Exception Log, and disable further HM testing on affected ports after 10 consecutive failures of GOLD PortLoopback test

RewriteEngineLoopback

1 minute

active

Verifies the integrity of the nondisruptive loopback for all ports up to the Rewrite Engine ASIC device.

Do CallHome, log error in Syslog/OBFL/Exception Log, and disable further HM testing on affected ports after 10 consecutive failures of GOLD RewriteEngineLoopback test

AsicMemory

Only on boot up

Only on boot up - inactive

Checks if the AsicMemory is consistent using the Mbist bit in the ASIC.

Do CallHome and log error when GOLD AsicMemory test fails. As the issue causing the test failure may be transient, attempt recovery reload through kernel panic.

Note

 

To avoid a kernel panic when the test fails, you can override the EEM system policy.

FpgaRegTest

30 seconds

Health monitoring test - every 30 seconds - active

Tests the FPGA status by reading from and writing to the FPGA.

Do CallHome, log error, disable further HM testing after 20 consecutive failures of GOLD FpgaRegTest test. As the issue causing the test failure may be transient, attempt recovery reload through kernel panic.

Note

 

To avoid a kernel panic when the test fails, you can override the EEM system policy.

L2ACLRedirect

1 minute

Health monitoring test - every minute - active

Checks if the active inband path is operational. The test generates a packet on an active supervisor through the active fabric module. It then sends the packet toward the front panel port (physical interface on the line card) and, using the ACL entry, redirects the packet back to the active supervisor.

Do CallHome, log error, disable further HM testing after 10 consecutive failures of L2ACLRedirect test. As the issue causing the test failure may be transient, attempt recovery reload through kernel panic.

Note

 

To avoid a kernel panic when the test fails, you can override the EEM system policy.

OBFL

30 minutes

active

Verifies the integrity of the onboard failure logging (OBFL) flash, and monitors for available storage in the device.

FabricConnectivityTest

1 minute

active

Verifies fabric/linecard link status.

Validates that the fabric links are functioning.

Note

 

Available only on Nexus 9500-R series line cards and N9K-X9836DM-A line cards. Also supported on X9836DM-A and X98900CD-A line cards with Nexus 9808 and 9804 switches.

FabricReachabilityTest

1 minute

active

Verifies fabric/linecard reachability status.

Validates that each fabric component has a valid path to every other fabric component in the system.

Note

 

Available only on Nexus 9500-R series line cards and N9K-X9836DM-A line cards. Also supported on X9836DM-A and X98900CD-A line cards with Nexus 9808 and 9804 switches.

Supervisor

Backplane

30 minutes

active

Verifies the integrity of the backplane SPROM devices.

NVRAM

5 minutes

active

Verifies the sanity of the NVRAM blocks on a supervisor.

Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD NVRAM test

RealTimeClock

5 minutes

active

Verifies that the real-time clock on the supervisor is ticking.

Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD RealTimeClock test

PrimaryBootROM

30 minutes

active

Verifies the integrity of the primary boot device on the supervisor.

Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD PrimaryBootROM test

SecondaryBootROM

30 minutes

active

Verifies the integrity of the secondary boot device on the supervisor.

Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD SecondaryBootROM test

BootFlash

30 minutes

active

Verifies access to the bootflash devices.

Do CallHome and log error when GOLD BootFlash test fails

USB

30 minutes

active

Verifies access to the USB devices.

Do CallHome and log error when GOLD USB test fails

SystemMgmtBus

30 seconds

active

Verifies the availability of the system management bus.

Do CallHome, log error, and disable further HM testing for that fan or power supply after 20 consecutive failures of GOLD SystemMgmtBus test

Mce

30 minutes

Health monitoring test - 30 minutes - active

Uses the mcd_daemon to report any machine check errors raised by the kernel.

Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD Mce test

Pcie

Only on boot up

Only on boot up - inactive

Reads the PCIe status registers and checks for any error on the PCIe device.

Do CallHome and log error when GOLD Pcie test fails

Console

Only on boot up

Only on boot up - inactive

This runs a port loopback test on the management port on boot up to check for its consistency.

Do CallHome, log error, and disable further HM testing after 20 consecutive failures of GOLD Console test

FpgaRegTest

30 seconds

Health monitoring test - every 30 seconds - active

Tests the FPGA status by reading from and writing to the FPGA.

Note

 

On Nexus 9808 and 9804 switches, FpgaRegTest results for fabric modules (19-26) are displayed under the FpgaRegTest result of the active supervisor.

Do CallHome, log error, disable further HM testing after 20 consecutive failures of GOLD FpgaRegTest test. As the issue causing the test failure may be transient, attempt recovery reload through kernel panic.

Note

 

To avoid a kernel panic when the test fails, you can override the EEM system policy.

1. The minimum configurable test interval is 6 hours.

On-demand diagnostics

On-demand tests help localize faults and are usually needed in one of these situations:

  • To respond to an event that has occurred, such as isolating a fault.

  • In anticipation of an event that may occur, such as a resource exceeding its utilization limit.

You can run all the health monitoring tests on demand. You can schedule on-demand diagnostics to run immediately.

You can also modify the default interval for a health monitoring test.

High Availability

A key part of high availability is detecting hardware failures and taking corrective action while the device runs in a live network.

Online diagnostics in high availability detect hardware failures and provide feedback to high availability software components to make switchover decisions.

NX-OS supports stateless restarts for online diagnostics. After a reboot or supervisor switchover, NX-OS applies the running configuration.

Virtualization support

Online diagnostics are virtual routing and forwarding (VRF) aware. You can configure online diagnostics to use a particular VRF to reach the online diagnostics SMTP server.

Guidelines and limitations for online diagnostics

This topic describes the guidelines and limitations for Generic Online Diagnostics (GOLD) configuration.

  • The Nexus platform switches that do not support the BootupPortLoopback test on breakout ports are:

      • Nexus N9K-C9808, Nexus N9K-C9804

      • Nexus 9364E-SG2-Q, Nexus 9364E-SG2-O

      • Nexus N9K-X9836DM-A

      • Nexus N9K-C9232E-B1

      • Nexus 9336C-SE1

      • N9396Y12C-SE1, N9396T12C-SE1

      On these platforms, the BootupPortLoopback diagnostic test is not supported on breakout ports during bootup. The test may display as Untested (U) or Fail (F) for breakout subports. On-demand PortLoopback tests are supported on breakout ports after bootup.

  • You cannot run disruptive online diagnostic tests on demand.

  • Interface Rx and Tx packet counters are incremented (approximately four packets every 15 minutes) for ports in the shutdown state.

  • The PortLoopback test is periodic, so the packet counter is incremented on admin down ports every 30 minutes. The test runs only on admin down ports. When a port is unshut, the counters are not affected.

  • When a port fails for the per-port BootupPortLoopback test, the port enters the error-disabled state. (To remove this state, enter the shutdown and no shutdown commands on the port.)

  • Beginning with NX-OS Release 10.6(1)F, on Nexus modular switches, if the Backplane diagnostic test fails and a BACKPLANE_AUTHENTICATION_FAIL syslog appears, do not perform an upgrade or a system reload.
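As a minimal sketch of the recovery noted above for a port that the BootupPortLoopback test has error-disabled, the sequence below cycles the port administratively. The interface ethernet 1/1 is a hypothetical example; substitute the affected port.

```
switch# configure terminal
switch(config)# interface ethernet 1/1
switch(config-if)# shutdown
switch(config-if)# no shutdown
```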

Platform support

  • The Nexus platform switches and line cards that do not support the run-time PortLoopback test but support the BootupPortLoopback test:

    Switches

    Line Cards

    • Nexus 9736C-EX

    • Nexus 97160YC-EX

    • Nexus 9732C-EX

    • Nexus 9732C-EXM

  • Beginning with NX-OS Release 10.3(1)F, Generic Online Diagnostics (GOLD) is supported on the Nexus 9800 platform switches.

  • Beginning with NX-OS Release 10.4(1)F, Generic Online Diagnostics (GOLD) is supported on the following line cards and switches:

    • Nexus 9804 switch

    • Nexus C9332D-H2R switch

    • Nexus X98900CD-A line card

    • Nexus X98900CD-A and X9836DM-A line cards with Nexus 9808 and 9804 switches

  • Beginning with NX-OS Release 10.4(2)F, Generic Online Diagnostics (GOLD) is supported on Nexus 93400LD-H1 platform switches.

  • Beginning with NX-OS Release 10.4(3)F, Generic Online Diagnostics (GOLD) is supported on Nexus 9364C-H1 platform switches.

  • Beginning with NX-OS Release 10.5(3)F, Generic Online Diagnostics (GOLD) is supported on N9364E-SG2-O and N9364E-SG2-Q switches. However, BootupPortLoopback test is not supported on any breakout ports on these switches.

  • Beginning with NX-OS Release 10.6(2)F, Generic Online Diagnostics (GOLD) is supported on N9396Y12C-SE1 and N9396T12C-SE1 switches.

Default settings for online diagnostics

This table lists the default settings for online diagnostic parameters.

Parameters

Default

Bootup diagnostics level

complete

Nondisruptive tests

active

Configuring Online Diagnostics


Note


Be aware that the NX-OS commands for this feature may differ from those commands used in IOS.

Set the bootup diagnostic level

You can configure the bootup diagnostics to run the complete set of tests, or you can bypass all bootup diagnostic tests for a faster module bootup time.


Note


We recommend that you set the bootup online diagnostics level to complete. We do not recommend bypassing the bootup online diagnostics.

Procedure


Step 1

Enter global configuration mode using the configure terminal command.

Example:

switch# configure terminal
switch(config)#

Step 2

Configure the bootup diagnostic level to trigger diagnostics when the device boots using the diagnostic bootup level {complete | bypass} command.

Example:

switch(config)# diagnostic bootup level complete

The bootup diagnostic levels include:

  • complete—Perform a complete set of bootup diagnostics. The default is complete.

  • bypass—Do not perform any bootup diagnostics.

Step 3

(Optional) Display the bootup diagnostic level (bypass or complete) that is currently in place on the device using the show diagnostic bootup level command.

Example:

switch(config)# show diagnostic bootup level

Step 4

(Optional) Copy the running configuration to the startup configuration using the copy running-config startup-config command.

Example:

switch(config)# copy running-config startup-config

Activate a diagnostic test

You can set a diagnostic test as active and optionally modify the interval (in hours, minutes, and seconds) at which the test runs.

Procedure


Step 1

Enter global configuration mode using the configure terminal command.

Example:

switch# configure terminal
switch(config)#

Step 2

Configure the interval at which the specified test is run using the diagnostic monitor interval module slot test [test-id | name | all] hour hour min minute second second command.

Example:

switch(config)# diagnostic monitor interval module 6 test 3 hour 1 min 0 second 0

If no interval is set, the test runs at the interval set previously, or the default interval.

The argument ranges are as follows:

  • slot —The range is from 1 to 10.

  • test-id —The range is from 1 to 14.

  • name —Can be any case-sensitive, alphanumeric string up to 32 characters.

  • hour —The range is from 0 to 23 hours.

  • minute —The range is from 0 to 59 minutes.

  • second —The range is from 0 to 59 seconds.
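The same command also accepts a test name in place of a test ID. The sketch below assumes that module 6 carries the PrimaryBootROM test (you can confirm this with show diagnostic content module 6) and sets it to the 6-hour minimum interval given in the footnote to Table 2. The module number is illustrative.

```
switch(config)# diagnostic monitor interval module 6 test PrimaryBootROM hour 6 min 0 second 0
```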

Step 3

Activate the specified test using the [no] diagnostic monitor module slot test [test-id | name | all] command.

Example:

switch(config)# diagnostic monitor module 6 test 3

The argument ranges are:

  • slot —The range is from 1 to 10.

  • test-id —The range is from 1 to 14.

  • name —Can be any case-sensitive, alphanumeric string up to 32 characters.

The no form of this command deactivates the specified test. Inactive tests keep their current configuration but do not run at the scheduled interval.
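For instance, assuming test 3 on module 6 is currently active, the no form sketched below would deactivate it while leaving its configured interval in place. The module and test numbers are illustrative.

```
switch(config)# no diagnostic monitor module 6 test 3
```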

Step 4

(Optional) Display information about the diagnostics and their attributes using the show diagnostic content module {slot | all} command.

Example:

switch(config)# show diagnostic content module 6

Start or stop an on-demand diagnostic test

You can start or stop an on-demand diagnostic test. You can optionally modify the number of iterations to repeat this test, and the action to take if the test fails.

We recommend that you only manually start a disruptive diagnostic test during a scheduled network maintenance time.

Procedure


Step 1

(Optional) Configure the number of times that the on-demand test runs using the diagnostic ondemand iteration number command.

The range is from 1 to 999. The default is 1.

Example:

switch# diagnostic ondemand iteration 5

Step 2

(Optional) Configure the action to take if the on-demand test fails using the diagnostic ondemand action-on-failure {continue failure-count num-fails | stop} command.

The num-fails range is from 1 to 999. The default is 1.

Example:

switch# diagnostic ondemand action-on-failure stop
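As an alternative sketch, the continue form lets the on-demand run keep going until a failure count is reached. The value 3 is an illustrative choice within the documented range, and the second command is the documented way to review the resulting settings.

```
switch# diagnostic ondemand action-on-failure continue failure-count 3
switch# show diagnostic ondemand setting
```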

Step 3

Start one or more diagnostic tests on a module using the diagnostic start module slot test [test-id | name | all | non-disruptive] [port port-number | all] command.

The module slot range is from 1 to 10. The test-id range is from 1 to 14. The test name can be any case-sensitive, alphanumeric string up to 32 characters. The port range is from 1 to 48.

Example:

switch# diagnostic start module 6 test all

Step 4

Stop one or more diagnostic tests on a module using the diagnostic stop module slot test [test-id | name | all] command.

The module slot range is from 1 to 10. The test-id range is from 1 to 14. The test name can be any case-sensitive, alphanumeric string up to 32 characters.

Example:

switch# diagnostic stop module 6 test all

Step 5

(Optional) Verify that the diagnostic has been scheduled using the show diagnostic status module slot command.

Example:

switch# show diagnostic status module 6
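Taken together, the steps above might be combined as in the following sketch, which runs only the non-disruptive tests on module 6 for five iterations, stops at the first failure, and then reviews the outcome. Module 6 and the iteration count are illustrative values, and command output is omitted because it varies by platform.

```
switch# diagnostic ondemand iteration 5
switch# diagnostic ondemand action-on-failure stop
switch# diagnostic start module 6 test non-disruptive
switch# show diagnostic status module 6
switch# show diagnostic result module 6 test all detail
```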

Simulate diagnostic results

You can simulate a diagnostic test result.

Procedure


Simulate a test result using the diagnostic test simulation module slot test test-id {fail | random-fail | success} [port number | all] command.

The test-id range is from 1 to 14. The port range is from 1 to 48.

Example:

switch# diagnostic test simulation module 2 test 2 fail
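A simulation can also be limited to a single port and then confirmed with the documented show command. The sketch below assumes that test 2 on module 2 is a per-port test; the module, test, and port numbers are illustrative.

```
switch# diagnostic test simulation module 2 test 2 random-fail port 1
switch# show diagnostic simulation module 2
```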

Clear diagnostic results

You can clear diagnostic test results.

Procedure


Step 1

Clear the test result for the specified test using the diagnostic clear result module [slot | all] test {test-id | all} command.

The argument ranges are:

  • slot —The range is from 1 to 10.

  • test-id —The range is from 1 to 14.

Example:

switch# diagnostic clear result module 2 test all

Step 2

Clear the simulated test result using the diagnostic test simulation module slot test test-id clear command.

The test-id range is from 1 to 14.

Example:

switch# diagnostic test simulation module 2 test 2 clear
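To clear results across the whole chassis rather than one module, the all keyword can replace the slot number. The sketch below follows the syntax in Step 1 and then checks one module (module 2, chosen for illustration) to confirm that its results were reset.

```
switch# diagnostic clear result module all test all
switch# show diagnostic result module 2
```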

Commands for verifying Online Diagnostics configuration

Use any of the show commands provided in the table to display the required configuration information about online diagnostics.

Command

Purpose

show diagnostic bootup level

Displays information about bootup diagnostics.

show diagnostic content module {slot | all}

Displays information about diagnostic test content for a module.

show diagnostic description module slot test [test-name | all]

Displays the diagnostic description.

show diagnostic events [error | info]

Displays diagnostic events by error and information event type.

show diagnostic ondemand setting

Displays information about on-demand diagnostics.

show diagnostic result module slot [test [test-name | all]] [detail]

Displays information about the results of a diagnostic.

show diagnostic simulation module slot

Displays information about a simulated diagnostic.

show diagnostic status module slot

Displays the test status for all tests on a module.

show hardware capacity [eobc | forwarding | interface | module | power]

Displays information about the hardware capabilities and current hardware utilization by the system.

show module

Displays module information including the online diagnostic test status.
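A typical verification pass after changing the diagnostic configuration might look like the sketch below. It uses only commands from the table above; module 6 is an illustrative slot, and output is omitted because it varies by platform and release.

```
switch# show diagnostic bootup level
switch# show diagnostic content module 6
switch# show diagnostic description module 6 test all
switch# show diagnostic events error
switch# show module
```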

Configuration examples for Online Diagnostics

This topic provides configuration examples for starting and monitoring online diagnostics on modules.

This example shows how to start all on-demand tests on module 6.

diagnostic start module 6 test all

This example shows how to activate test 2 and set the test interval on module 6.

configure terminal
diagnostic monitor module 6 test 2
diagnostic monitor interval module 6 test 2 hour 3 min 30 second 0
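
As a further sketch in the same style, assuming you want every health monitoring test on module 6 active and the change preserved across reloads, the all keyword and the copy command documented earlier can be combined:

```
configure terminal
diagnostic monitor module 6 test all
copy running-config startup-config
```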