Configuring Online Diagnostics

This chapter describes how to configure the generic online diagnostics (GOLD) feature on Cisco NX-OS devices.


About Online Diagnostics

With online diagnostics, you can test and verify the hardware functionality of the device while the device is connected to a live network.

The online diagnostics contain tests that check different hardware components and verify the data path and control signals. Disruptive online diagnostic tests (such as the disruptive loopback test) and nondisruptive online diagnostic tests (such as the ASIC register check) run during bootup, line module online insertion and removal (OIR), and system reset. The nondisruptive online diagnostic tests run as part of the background health monitoring, and you can run these tests on demand.

Online diagnostics fall into three categories: bootup diagnostics, runtime (health monitoring) diagnostics, and on-demand diagnostics. Bootup diagnostics run during bootup, health monitoring diagnostics run in the background, and on-demand diagnostics run once or at user-designated intervals while the device is connected to a live network.

Bootup Diagnostics

Bootup diagnostics run during bootup and detect faulty hardware before Cisco NX-OS brings a module online. For example, if you insert a faulty module in the device, bootup diagnostics test the module and take it offline before the device uses the module to forward traffic.

Bootup diagnostics also check the connectivity between the supervisor and module hardware and the data and control paths for all the ASICs. The following table describes the bootup diagnostic tests for a module and a supervisor.

Table 1. Bootup Diagnostics

Diagnostic               Description
OBFL                     Verifies the integrity of the onboard failure logging (OBFL) flash.
BootupPortLoopback       Runs only during module bootup. Tests the packet path from the supervisor CPU to each physical front-panel port on the ASIC.
USB                      Nondisruptive test. Checks the USB controller initialization on a module.
ManagementPortLoopback   Disruptive test, not an on-demand test. Tests loopback on the management port of a module.
EOBCPortLoopback         Disruptive test, not an on-demand test. Tests loopback on the Ethernet out-of-band channel (EOBC) port of a module.

Bootup diagnostics log failures to onboard failure logging (OBFL) and syslog and trigger a diagnostic LED indication (on, off, pass, or fail).

You can configure the device to either bypass the bootup diagnostics or run the complete set of bootup diagnostics.

Runtime or Health Monitoring Diagnostics

Runtime diagnostics are also called health monitoring (HM) diagnostics. These diagnostics provide information about the health of a live device. They detect runtime hardware errors, memory errors, the degradation of hardware modules over time, software faults, and resource exhaustion.

Health monitoring diagnostics are nondisruptive and run in the background to ensure the health of a device that is processing live network traffic. You can enable or disable health monitoring tests or change their runtime interval.

The following table describes the health monitoring diagnostics and test IDs for a module and a supervisor.

Table 2. Health Monitoring Nondisruptive Diagnostics

Diagnostic             Default Interval   Default Setting   Description
Module
ACT2                   30 minutes         Active            Verifies the integrity of the security device on the module.
ASICRegisterCheck      1 minute           Active            Checks read/write access to scratch registers for the ASICs on a module.
PrimaryBootROM         24 hours (1)       Active            Verifies the integrity of the primary boot device on a module.
SecondaryBootROM       24 hours (1)       Active            Verifies the integrity of the secondary boot device on a module.
PortLoopback           30 minutes (2)     Active            Runs per-port diagnostics on all admin-down ports.
RewriteEngineLoopback  1 minute           Active            Verifies the integrity of the nondisruptive loopback for all ports up to the 1 Engine ASIC device.
AsicMemory             Bootup only        Inactive          Checks whether the ASIC memory is consistent by using the MBIST (memory built-in self-test) bit in the ASIC.
FpgaRegTest            30 seconds         Active            Tests the FPGA status by reading from and writing to the FPGA.
Supervisor
NVRAM                  5 minutes          Active            Verifies the sanity of the NVRAM blocks on a supervisor.
RealTimeClock          5 minutes          Active            Verifies that the real-time clock on the supervisor is ticking.
PrimaryBootROM         30 minutes         Active            Verifies the integrity of the primary boot device on the supervisor.
SecondaryBootROM       30 minutes         Active            Verifies the integrity of the secondary boot device on the supervisor.
BootFlash              30 minutes         Active            Verifies access to the bootflash devices.
USB                    30 minutes         Active            Verifies access to the USB devices.
SystemMgmtBus          30 seconds         Active            Verifies the availability of the system management bus.
Mce                    30 minutes         Active            Uses the mcd_dameon to report any machine check error reported by the kernel.
Pcie                   Bootup only        Inactive          Reads the PCIe status registers and checks for errors on the PCIe device.
Console                Bootup only        Inactive          Runs a port loopback test on the management port at bootup to check its consistency.
FpgaRegTest            30 seconds         Active            Tests the FPGA status by reading from and writing to the FPGA.

(1) The minimum configurable test interval is 6 hours.
(2) In releases prior to Cisco NX-OS Release 7.0(3)I1(2), PortLoopback runs on demand only. Starting with Cisco NX-OS Release 7.0(3)I1(2), it runs every 30 minutes.
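The defaults above determine how often each test fires on a live module. The following Python sketch is a hypothetical helper, not part of NX-OS: it encodes the periodic module defaults from Table 2, counts runs per day, and checks a proposed interval against the 6-hour minimum. It assumes that minimum applies only to the module PrimaryBootROM and SecondaryBootROM tests, which is where the table attaches the footnote.

```python
from datetime import timedelta

# Default intervals of the periodic module tests, copied from Table 2.
# Bootup-only tests (AsicMemory) are omitted; PortLoopback uses the
# 30-minute default of Cisco NX-OS Release 7.0(3)I1(2) and later.
MODULE_DEFAULT_INTERVALS = {
    "ACT2": timedelta(minutes=30),
    "ASICRegisterCheck": timedelta(minutes=1),
    "PrimaryBootROM": timedelta(hours=24),
    "SecondaryBootROM": timedelta(hours=24),
    "PortLoopback": timedelta(minutes=30),
    "RewriteEngineLoopback": timedelta(minutes=1),
    "FpgaRegTest": timedelta(seconds=30),
}

# Assumption: the 6-hour minimum covers only these two module tests.
SIX_HOUR_MINIMUM = {"PrimaryBootROM", "SecondaryBootROM"}


def runs_per_day(interval: timedelta) -> int:
    """Return how many times a test with this interval runs in 24 hours."""
    return int(timedelta(days=1) / interval)


def interval_allowed(test: str, interval: timedelta) -> bool:
    """Check a proposed module test interval against the 6-hour minimum."""
    if interval <= timedelta(0):
        return False
    if test in SIX_HOUR_MINIMUM:
        return interval >= timedelta(hours=6)
    return True


if __name__ == "__main__":
    for name, interval in MODULE_DEFAULT_INTERVALS.items():
        print(f"{name}: every {interval}, {runs_per_day(interval)} runs/day")
```

For example, ASICRegisterCheck at its 1-minute default runs 1440 times a day, which is why only lightweight, nondisruptive checks are scheduled at such short intervals.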

On-Demand Diagnostics

On-demand tests help localize faults and are usually needed in one of the following situations:

  • To respond to an event that has occurred, such as isolating a fault.

  • In anticipation of an event that may occur, such as a resource exceeding its utilization limit.

You can run all the health monitoring tests on demand. You can schedule on-demand diagnostics to run immediately.

You can also modify the default interval for a health monitoring test.

High Availability

A key part of high availability is detecting hardware failures and taking corrective action while the device runs in a live network. Online diagnostics in high availability detect hardware failures and provide feedback to high availability software components to make switchover decisions.

Cisco NX-OS supports stateless restarts for online diagnostics. After a reboot or supervisor switchover, Cisco NX-OS applies the running configuration.

Virtualization Support

Online diagnostics are virtual routing and forwarding (VRF) aware. You can configure online diagnostics to use a particular VRF to reach the online diagnostics SMTP server.

Guidelines and Limitations for Online Diagnostics

Online diagnostics has the following configuration guidelines and limitations:

  • The following Cisco Nexus switches and line cards do not support the run-time PortLoopback test but do support the BootupPortLoopback test:

    Switches

    • Cisco Nexus 92160YC-X

    • Cisco Nexus 92304QC

    • Cisco Nexus 9264PQ

    • Cisco Nexus 9272Q

    • Cisco Nexus 9232C

    • Cisco Nexus 9236C

    • Cisco Nexus 9256PV

    • Cisco Nexus 92300YC

    • Cisco Nexus 93108TC-EX

    • Cisco Nexus 93108TC-EX-24

    • Cisco Nexus 93180LC-EX

    • Cisco Nexus 93180YC-EX

    • Cisco Nexus 93180YC-EXU

    • Cisco Nexus 93180YC-EX-24

    Line Cards

    • Cisco Nexus 9736C-EX

    • Cisco Nexus 97160YC-EX

    • Cisco Nexus 9732C-EX

    • Cisco Nexus 9732C-EXM

  • You cannot run disruptive online diagnostic tests on demand.

  • The BootupPortLoopback test is not supported.

  • Interface Rx and Tx packet counters are incremented (approximately four packets every 15 minutes) for ports in the shutdown state.

  • On admin-down ports, the unicast Rx and Tx packet counters increment for GOLD loopback packets. In releases prior to Cisco NX-OS Release 7.0(3)I1(2), the PortLoopback test runs on demand, so the counters increment only when you run the test on admin-down ports. Starting with Cisco NX-OS Release 7.0(3)I1(2), the PortLoopback test is periodic, so the counters on admin-down ports increment every 30 minutes. Because the test runs only on admin-down ports, the counters of a port that is unshut are not affected.
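When auditing a mixed deployment, it can help to encode these guidelines as data. The following Python sketch is a hypothetical helper, not a Cisco tool. It checks a model name against the list of platforms without run-time PortLoopback support (assuming names in the "Cisco Nexus <model>" form used above) and converts the "approximately four packets every 15 minutes" rate into an expected counter increase on a shutdown port. The result is an estimate only, because the guideline itself is approximate.

```python
from datetime import timedelta

# Platforms listed above that support BootupPortLoopback but not the
# run-time PortLoopback test.
NO_RUNTIME_PORTLOOPBACK = {
    # Switches
    "92160YC-X", "92304QC", "9264PQ", "9272Q", "9232C", "9236C", "9256PV",
    "92300YC", "93108TC-EX", "93108TC-EX-24", "93180LC-EX", "93180YC-EX",
    "93180YC-EXU", "93180YC-EX-24",
    # Line cards
    "9736C-EX", "97160YC-EX", "9732C-EX", "9732C-EXM",
}


def runtime_port_loopback_supported(model: str) -> bool:
    """Return False for models on the guideline list, e.g. 'Cisco Nexus 9232C'."""
    normalized = model.strip().upper()
    prefix = "CISCO NEXUS "
    if normalized.startswith(prefix):
        normalized = normalized[len(prefix):]
    return normalized not in NO_RUNTIME_PORTLOOPBACK


def expected_gold_packets(window: timedelta) -> float:
    """Approximate Rx/Tx counter increase on a shutdown port over a window,
    using the guideline's rate of about four packets every 15 minutes."""
    return 4 * (window / timedelta(minutes=15))
```

At that rate, a shutdown port accumulates roughly 16 packets per hour, so small, steady counter growth on such a port is consistent with GOLD traffic rather than a fault.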

Default Settings for Online Diagnostics

The following table lists the default settings for online diagnostic parameters.

Parameter                  Default
Bootup diagnostics level   Complete
Nondisruptive tests        Active

Configuring Online Diagnostics


Note


Be aware that the Cisco NX-OS commands for this feature may differ from those commands used in Cisco IOS.

Setting the Bootup Diagnostic Level

You can configure the bootup diagnostics to run the complete set of tests, or you can bypass all bootup diagnostic tests for a faster module bootup time.


Note


We recommend that you set the bootup online diagnostics level to complete. We do not recommend bypassing the bootup online diagnostics.

Procedure

  Command or Action Purpose

Step 1

configure terminal

Example:

switch# configure terminal
switch(config)#

Enters global configuration mode.

Step 2

diagnostic bootup level {complete | minimal | bypass}

Example:

switch(config)# diagnostic bootup level complete

Configures the bootup diagnostic level to trigger diagnostics as follows when the device boots:

  • complete—Perform a complete set of bootup diagnostics. The default is complete.
  • minimal—Perform a minimal set of bootup diagnostics for the supervisor engine and bootup port loopback tests.
  • bypass—Do not perform any bootup diagnostics.

Step 3

(Optional) show diagnostic bootup level

Example:

switch(config)# show diagnostic bootup level

Displays the bootup diagnostic level (complete, minimal, or bypass) that is currently in place on the device.

Step 4

(Optional) copy running-config startup-config

Example:

switch(config)# copy running-config startup-config

Copies the running configuration to the startup configuration.
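If you apply this setting from a script, a small generator keeps the level within the three documented keywords. The following Python sketch is hypothetical: it only builds the command strings for the procedure above, in step order, and leaves delivery to the switch out of scope. The default level is complete, as recommended.

```python
from typing import List

# Levels accepted by "diagnostic bootup level", per the command syntax above.
BOOTUP_LEVELS = ("complete", "minimal", "bypass")


def bootup_level_commands(level: str = "complete", save: bool = True) -> List[str]:
    """Build the CLI lines for the bootup-level procedure, in step order."""
    if level not in BOOTUP_LEVELS:
        raise ValueError(f"unknown bootup level {level!r}; expected one of {BOOTUP_LEVELS}")
    commands = [
        "configure terminal",                # Step 1
        f"diagnostic bootup level {level}",  # Step 2
        "show diagnostic bootup level",      # Step 3 (optional)
    ]
    if save:
        commands.append("copy running-config startup-config")  # Step 4 (optional)
    return commands


if __name__ == "__main__":
    print("\n".join(bootup_level_commands()))
```

Rejecting unknown levels locally surfaces a typo before any command reaches the device.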

Activating a Diagnostic Test

You can set a diagnostic test as active and optionally modify the interval (in hours, minutes, and seconds) at which the test runs.

Procedure

  Command or Action Purpose

Step 1

configure terminal

Example:

switch# configure terminal
switch(config)#

Enters global configuration mode.

Step 2

diagnostic monitor interval module slot test [test-id | name | all] hour hour min minute second second

Example:

switch(config)# diagnostic monitor interval module 6 test 3 hour 1 min 0 second 0

Configures the interval at which the specified test is run. If no interval is set, the test runs at the interval set previously, or the default interval.

The argument ranges are as follows:

  • slot —The range is from 1 to 10.
  • test-id —The range is from 1 to 14.
  • name —Can be any case-sensitive, alphanumeric string up to 32 characters.
  • hour —The range is from 0 to 23 hours.
  • minute —The range is from 0 to 59 minutes.
  • second —The range is from 0 to 59 seconds.

Step 3

[no] diagnostic monitor module slot test [test-id | name | all]

Example:

switch(config)# diagnostic monitor module 6 test 3

Activates the specified test.

The argument ranges are as follows:

  • slot —The range is from 1 to 10.
  • test-id —The range is from 1 to 14.
  • name —Can be any case-sensitive, alphanumeric string up to 32 characters.

The [no] form of this command inactivates the specified test. Inactive tests keep their current configuration but do not run at the scheduled interval.

Step 4

(Optional) show diagnostic content module {slot | all}

Example:

switch(config)# show diagnostic content module 6

Displays information about the diagnostics and their attributes.
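The argument ranges in Steps 2 and 3 are easy to check before sending a command. The following Python sketch uses hypothetical helper names and only builds strings. It validates slot (1 to 10), test-id (1 to 14), test name (case-sensitive alphanumeric, up to 32 characters), and the hour, minute, and second fields, then emits the interval and activation commands in the form shown in the examples.

```python
from typing import Union


def _check_range(name: str, value: int, low: int, high: int) -> None:
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def _test_arg(test: Union[int, str]) -> str:
    """Validate a test-id (1-14), a test name (alphanumeric, <= 32 chars), or 'all'."""
    if isinstance(test, int):
        _check_range("test-id", test, 1, 14)
        return str(test)
    if test.isalnum() and len(test) <= 32:
        return test
    raise ValueError(f"invalid test name {test!r}")


def monitor_interval_command(slot: int, test: Union[int, str], hour: int = 0,
                             minute: int = 0, second: int = 0) -> str:
    """Build 'diagnostic monitor interval ...' after checking the documented ranges."""
    _check_range("slot", slot, 1, 10)
    _check_range("hour", hour, 0, 23)
    _check_range("minute", minute, 0, 59)
    _check_range("second", second, 0, 59)
    return (f"diagnostic monitor interval module {slot} test {_test_arg(test)} "
            f"hour {hour} min {minute} second {second}")


def monitor_command(slot: int, test: Union[int, str], activate: bool = True) -> str:
    """Build the command that activates (or, with activate=False, inactivates) a test."""
    _check_range("slot", slot, 1, 10)
    command = f"diagnostic monitor module {slot} test {_test_arg(test)}"
    return command if activate else f"no {command}"
```

For example, monitor_interval_command(6, 3, hour=1) reproduces the Step 2 example, and monitor_command(6, 3, activate=False) produces the no form that inactivates the test while keeping its configuration.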

Starting or Stopping an On-Demand Diagnostic Test

You can start or stop an on-demand diagnostic test. You can optionally set the number of times the test repeats and the action to take if the test fails.

We recommend that you manually start a disruptive diagnostic test only during a scheduled network maintenance window.

Procedure

  Command or Action Purpose

Step 1

(Optional) diagnostic ondemand iteration number

Example:

switch# diagnostic ondemand iteration 5

Configures the number of times that the on-demand test runs. The range is from 1 to 999. The default is 1.

Step 2

(Optional) diagnostic ondemand action-on-failure {continue failure-count num-fails | stop}

Example:

switch# diagnostic ondemand action-on-failure stop

Configures the action to take if the on-demand test fails. The num-fails range is from 1 to 999. The default is 1.

Step 3

diagnostic start module slot test [test-id | name | all | non-disruptive] [port port-number | all]

Example:

switch# diagnostic start module 6 test all

Starts one or more diagnostic tests on a module. The module slot range is from 1 to 10. The test-id range is from 1 to 14. The test name can be any case-sensitive, alphanumeric string up to 32 characters. The port range is from 1 to 48.

Step 4

diagnostic stop module slot test [test-id | name | all]

Example:

switch# diagnostic stop module 6 test all

Stops one or more diagnostic tests on a module. The module slot range is from 1 to 10. The test-id range is from 1 to 14. The test name can be any case-sensitive, alphanumeric string up to 32 characters.

Step 5

(Optional) show diagnostic status module slot

Example:

switch# show diagnostic status module 6

Verifies that the diagnostic has been scheduled.
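The on-demand steps can be assembled into one validated sequence. The following Python sketch is hypothetical and only builds strings. It applies the documented ranges (iterations and failure count 1 to 999, slot 1 to 10, test-id 1 to 14, port 1 to 48) and accepts "all" or "non-disruptive" as the test keyword. The stop command from Step 4 is kept separate because you issue it only when you need to end a running test.

```python
from typing import List, Optional, Union


def ondemand_commands(slot: int, test: Union[int, str] = "all", iterations: int = 1,
                      stop_on_failure: bool = False, failure_count: int = 1,
                      port: Optional[Union[int, str]] = None) -> List[str]:
    """Build Steps 1, 2, 3, and 5 of the on-demand procedure as CLI lines."""
    if not 1 <= slot <= 10:
        raise ValueError("module slot must be between 1 and 10")
    if not 1 <= iterations <= 999:
        raise ValueError("iterations must be between 1 and 999")
    if isinstance(test, int) and not 1 <= test <= 14:
        raise ValueError("test-id must be between 1 and 14")
    commands = [f"diagnostic ondemand iteration {iterations}"]
    if stop_on_failure:
        commands.append("diagnostic ondemand action-on-failure stop")
    else:
        if not 1 <= failure_count <= 999:
            raise ValueError("failure-count must be between 1 and 999")
        commands.append(
            f"diagnostic ondemand action-on-failure continue failure-count {failure_count}")
    start = f"diagnostic start module {slot} test {test}"
    if port is not None:
        if port != "all" and not (isinstance(port, int) and 1 <= port <= 48):
            raise ValueError("port must be 1-48 or 'all'")
        start += f" port {port}"
    commands.append(start)
    commands.append(f"show diagnostic status module {slot}")
    return commands


def stop_command(slot: int, test: Union[int, str] = "all") -> str:
    """Build the Step 4 command that stops on-demand tests on a module."""
    return f"diagnostic stop module {slot} test {test}"
```

For example, ondemand_commands(6, iterations=5, stop_on_failure=True) yields the Step 1 and Step 2 examples, starts all tests on module 6, and ends with the status check.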

Simulating Diagnostic Results

You can simulate a diagnostic test result.

Procedure

Command or Action Purpose

diagnostic test simulation module slot test test-id {fail | random-fail | success} [port number | all]

Example:

switch# diagnostic test simulation module 2 test 2 fail

Simulates a test result. The test-id range is from 1 to 14. The port range is from 1 to 48.
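Simulated results are useful for exercising alerting without waiting for a real fault. The following Python sketch is a hypothetical string builder for the simulation command. It restricts the outcome to fail, random-fail, or success and applies the documented test-id (1 to 14) and port (1 to 48, or all) ranges. This section does not state a slot range, so the sketch does not validate the slot.

```python
from typing import Optional, Union

SIMULATION_OUTCOMES = ("fail", "random-fail", "success")


def simulation_command(slot: int, test_id: int, outcome: str,
                       port: Optional[Union[int, str]] = None) -> str:
    """Build a 'diagnostic test simulation' command for one test."""
    if outcome not in SIMULATION_OUTCOMES:
        raise ValueError(f"outcome must be one of {SIMULATION_OUTCOMES}")
    if not 1 <= test_id <= 14:
        raise ValueError("test-id must be between 1 and 14")
    command = f"diagnostic test simulation module {slot} test {test_id} {outcome}"
    if port is not None:
        if port != "all" and not (isinstance(port, int) and 1 <= port <= 48):
            raise ValueError("port must be 1-48 or 'all'")
        command += f" port {port}"
    return command
```

For example, simulation_command(2, 2, "fail") reproduces the example above.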

Clearing Diagnostic Results

You can clear diagnostic test results.

Procedure

  Command or Action Purpose

Step 1

diagnostic clear result module [slot | all] test {test-id | all}

Example:

switch# diagnostic clear result module 2 test all

Clears the test result for the specified test.

The argument ranges are as follows:

  • slot —The range is from 1 to 10.
  • test-id —The range is from 1 to 14.

Step 2

diagnostic test simulation module slot test test-id clear

Example:

switch# diagnostic test simulation module 2 test 2 clear

Clears the simulated test result. The test-id range is from 1 to 14.
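The two clearing steps differ in scope: results can be cleared for all slots and all tests, while the simulation-clear syntax names a single slot and test-id. The following Python sketch is a hypothetical helper that reflects that difference. It always emits the result-clearing command and adds the simulation-clearing command only when both a specific slot and a specific test are given.

```python
from typing import List, Optional, Union


def clear_commands(slot: Union[int, str] = "all", test_id: Optional[int] = None) -> List[str]:
    """Return commands that clear results, plus a simulated result when targeted."""
    if slot != "all" and not (isinstance(slot, int) and 1 <= slot <= 10):
        raise ValueError("slot must be 1-10 or 'all'")
    if test_id is not None and not 1 <= test_id <= 14:
        raise ValueError("test-id must be between 1 and 14")
    target = "all" if test_id is None else str(test_id)
    commands = [f"diagnostic clear result module {slot} test {target}"]
    # The simulation-clear syntax takes one slot and one test-id, so it is
    # emitted only when both are specific.
    if test_id is not None and slot != "all":
        commands.append(f"diagnostic test simulation module {slot} test {test_id} clear")
    return commands
```

For example, clear_commands(2, 2) returns both commands for module 2, test 2, which restores that test to reporting real results.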

Verifying the Online Diagnostics Configuration

To display online diagnostics configuration information, perform one of the following tasks:

Command

Purpose

show diagnostic bootup level

Displays information about bootup diagnostics.

show diagnostic content module {slot | all}

Displays information about diagnostic test content for a module.

show diagnostic description module slot test [test-name | all]

Displays the diagnostic description.

show diagnostic events [error | info]

Displays diagnostic events by error and information event type.

show diagnostic ondemand setting

Displays information about on-demand diagnostics.

show diagnostic result module slot [test [test-name | all]] [detail]

Displays information about the results of a diagnostic.

show diagnostic simulation module slot

Displays information about a simulated diagnostic.

show diagnostic status module slot

Displays the test status for all tests on a module.

show hardware capacity [eobc | forwarding | interface | module | power]

Displays information about the hardware capabilities and current hardware utilization by the system.

show module

Displays module information including the online diagnostic test status.

Configuration Examples for Online Diagnostics

This example shows how to start all on-demand tests on module 6:

diagnostic start module 6 test all

This example shows how to activate test 2 and set the test interval on module 6:

configure terminal
diagnostic monitor module 6 test 2
diagnostic monitor interval module 6 test 2 hour 3 min 30 second 0