Cisco Nexus 4001I and 4005I Switch Module for IBM BladeCenter NX-OS Configuration Guide
Configuring Online Diagnostics

This chapter describes how to configure the online diagnostics feature.

This chapter includes the following sections:

Online Health Management System

On-Board Failure Logging

Online Health Management System

The Online Health Management System (OHMS) is a hardware fault detection and recovery feature. It ensures the general health of the switch.

This section includes the following topics:

System Health Initiation

Loopback Test Configuration Frequency

Hardware Failure Action

Test Run Requirements

Tests for a Specified Module

Clearing Previous Error Reports

Interpreting the Current Status

Displaying System Health

The OHMS monitors system hardware in the following ways:

The OHMS application launches a daemon process in the switch and runs multiple tests. The tests run at preconfigured intervals, cover all major fault points, and isolate any failing component in the switch. The OHMS maintains control over all other OHMS components in the switch.

On detecting a fault, the system health application attempts the following recovery actions:

Performs additional testing to isolate the faulty component

If unable to recover, sends Call Home notifications, system messages, and exception logs, then shuts down and discontinues testing the failed component (such as an Ethernet interface).

Sends Call Home and system messages and exception logs as soon as it detects a failure.

Shuts down the failing component (such as an Ethernet interface).

Isolates failed ports from further testing.

Reports the failure to the appropriate software component.

Provides CLI support to view, test, and obtain test run statistics or change the system health test configuration on the switch.

Performs tests to focus on the problem area.

The switch is configured to run the relevant test. You can change the default parameters of the test as required.

System Health Initiation

By default, the system health feature is enabled in the switch.

To disable or enable this feature in the switch, perform this task:

 
Step 1
    Command: switch# config terminal
             switch(config)#
    Purpose: Enters configuration mode.

Step 2
    Command: switch(config)# no system health
             System Health is disabled.
    Purpose: Disables system health from running tests in this switch.

    Command: switch(config)# system health
             System Health is enabled.
    Purpose: Enables (default) system health to run tests in this switch.

Step 3
    Command: switch(config)# no system health interface ethernet 1/1
             System health for interface ethernet1/1 is disabled.
    Purpose: Disables system health from testing the Ethernet interface.

Loopback Test Configuration Frequency

Loopback tests are designed to identify hardware errors in the data path in the module. One loopback frame is sent to each module at a preconfigured frequency and passes through the Ethernet interface.

The loopback tests can be run at frequencies ranging from 60 seconds (default) to 255 seconds. If you do not configure a loopback frequency, the switch uses the default of 60 seconds.

To configure the frequency of loopback tests on a switch, perform this task:

 
Step 1
    Command: switch# config terminal
             switch(config)#
    Purpose: Enters configuration mode.

Step 2
    Command: switch(config)# system health loopback frequency 60
             The new frequency is set at 60 Seconds.
    Purpose: Configures the loopback frequency to 60 seconds. The default loopback frequency is 60 seconds. The valid range is from 60 to 255 seconds.
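When the loopback frequency is set from a script rather than typed at the CLI, it can help to check the value against the documented range before sending it. The following is a minimal sketch in Python; the helper name and constants are hypothetical, and only the command text and the 60 to 255 second range come from this section.

```python
# Bounds stated in this section: valid loopback frequency is 60 to 255 seconds.
LOOPBACK_MIN_SECONDS = 60
LOOPBACK_MAX_SECONDS = 255


def loopback_frequency_command(seconds: int = LOOPBACK_MIN_SECONDS) -> str:
    """Return the configuration-mode command that sets the loopback frequency.

    The helper itself is hypothetical; only the command text comes from the guide.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise TypeError("frequency must be an integer number of seconds")
    if not LOOPBACK_MIN_SECONDS <= seconds <= LOOPBACK_MAX_SECONDS:
        raise ValueError(
            f"frequency must be between {LOOPBACK_MIN_SECONDS} and "
            f"{LOOPBACK_MAX_SECONDS} seconds, got {seconds}"
        )
    return f"system health loopback frequency {seconds}"


if __name__ == "__main__":
    print(loopback_frequency_command(120))
```

Rejecting out-of-range values locally is a design choice of the sketch; how the switch itself responds to an out-of-range value is not described here.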

Hardware Failure Action

The failure-action command controls whether the Cisco NX-OS software takes any action when a hardware failure is detected while running the tests.

By default, this feature is enabled in the switch: when a failure is detected, action is taken and the failed component is isolated from further testing.

Failure action is controlled for the entire switch.

To configure failure action in a switch, perform this task:

 
Step 1
    Command: switch# config terminal
             switch(config)#
    Purpose: Enters configuration mode.

Step 2
    Command: switch(config)# system health failure-action
             System health global failure action is now enabled.
    Purpose: Enables the switch to take failure action (default).

Step 3
    Command: switch(config)# no system health failure-action
             System health global failure action now disabled.
    Purpose: Reverts the switch configuration so that no failure action is taken.

Step 4
    Command: switch(config)# system health module 1 failure-action
             System health failure action for module 1 is now enabled.
    Purpose: Enables the switch to take failure action for failures in module 1.

Step 5
    Command: switch(config)# no system health module 1 loopback failure-action
             System health failure action for module 1 loopback test is now disabled.
    Purpose: Prevents the switch from taking action on failures determined by the loopback test in module 1.

Test Run Requirements

Enabling a test does not guarantee that a test will run.

Tests on a given interface or module only run if you enable system health for all of the following items:

The entire switch.

The required module.

The required interface.


Tip The test does not run if system health is disabled for any one of these items. When system health is disabled for a test, the test status is shown as disabled.

Tip If the switch or Ethernet interface is enabled to run tests but is not running them because system health is disabled, the tests are shown as enabled (not running).
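The requirement above amounts to a logical AND across the levels. The sketch below models it in Python as one simplified reading of this section; the class, field, and function names are hypothetical and do not correspond to any NX-OS interface, and the status strings follow the Enabled, Disabled, and Running states the guide uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthSettings:
    """Hypothetical snapshot of the settings that gate one test."""

    switch_enabled: bool = True     # system health for the entire switch
    module_enabled: bool = True     # system health for the required module
    interface_enabled: bool = True  # system health for the required interface
    test_enabled: bool = True       # the individual test itself


def will_run(s: HealthSettings) -> bool:
    """A test runs only when it is enabled at every level."""
    return (s.test_enabled and s.switch_enabled
            and s.module_enabled and s.interface_enabled)


def expected_status(s: HealthSettings) -> str:
    """Status a reader might expect to see, under this simplified model."""
    if not s.test_enabled:
        return "Disabled"
    if will_run(s):
        return "Running"
    # The test is enabled, but system health is off at some level.
    return "Enabled"  # shown as enabled (not running)


if __name__ == "__main__":
    print(expected_status(HealthSettings(interface_enabled=False)))
```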


Tests for a Specified Module

The system health feature in the NX-OS software performs tests in the following areas:

Bootflash connectivity and accessibility on the switch.

Data path integrity for each interface on the switch.

Management port connectivity.

User-driven test for internal connectivity verification (Ethernet ports).

To perform the required test on a specific module, perform this task:

 
Step 1
    Command: switch# config terminal
             switch(config)#
    Purpose: Enters configuration mode.

Note The following steps can be performed in any order.

Note The various options for each test are described in the next step. Each command can be configured in any order. The various options are presented in the same step for documentation purposes.

Step 2
    Command: switch(config)# system health module 1 bootflash
    Purpose: Enables the bootflash test on the switch.

    Command: switch(config)# system health module 1 bootflash frequency 200
    Purpose: Sets the new frequency of the bootflash test on the switch.

Step 3
    Command: switch(config)# system health module 1 loopback
    Purpose: Enables the loopback test on the switch.

Step 4
    Command: switch(config)# system health module 1 management
    Purpose: Enables the management test on the switch.

Clearing Previous Error Reports

You can clear the error history for Ethernet interfaces or the switch. By clearing the history, you are directing the software to retest all failed components that were previously excluded from tests.

If you previously enabled the failure-action option for a period of time (for example, one week) to prevent OHMS from taking any action when a failure is encountered, and you are now ready to start receiving these errors again, you must clear the system health error status for each test.

Use the EXEC-level system health clear-errors command for the interface or switch to erase any previous error conditions logged by the system health application. The bootflash, the loopback, and the mgmt test options can be individually specified for a given module.

The following example clears the error history for the specified Ethernet interface:

switch# system health clear-errors interface ethernet 1/1

The following example clears the error history for the specified module:

switch# system health clear-errors module 1

The following example clears the management test error history for the specified module:

switch# system health clear-errors module 1 mgmt
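Because the error status has to be cleared for each test, a script that drives the switch may want to generate the clear-errors commands rather than hard-code them. The Python sketch below builds the command strings shown in this section. The function names are hypothetical. The per-test forms for bootflash and loopback are assumed from the statement that these options can be specified for a module, since only the mgmt form appears in an example.

```python
from typing import List, Optional

# Test options named in this section for module-level clearing.
MODULE_TESTS = ("bootflash", "loopback", "mgmt")


def clear_interface_errors(slot: int, port: int) -> str:
    """EXEC command that clears the error history for one Ethernet interface."""
    return f"system health clear-errors interface ethernet {slot}/{port}"


def clear_module_errors(module: int, test: Optional[str] = None) -> str:
    """EXEC command that clears errors for a module, optionally for one test."""
    command = f"system health clear-errors module {module}"
    if test is None:
        return command
    if test not in MODULE_TESTS:
        raise ValueError(f"unknown test {test!r}; expected one of {MODULE_TESTS}")
    # Assumption: bootflash and loopback take the same form as the mgmt example.
    return f"{command} {test}"


def clear_each_test(module: int) -> List[str]:
    """One clear-errors command per test option for the given module."""
    return [clear_module_errors(module, test) for test in MODULE_TESTS]


if __name__ == "__main__":
    for line in clear_each_test(1):
        print(line)
```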

Interpreting the Current Status

The status of each module or test depends on the current configured state of the OHMS test (see Table 24-1).

Table 24-1 OHMS Configured Status for Tests and Modules 

Status            Description
----------------  ----------------------------------------------------------------
Enabled           You have enabled the test, and the test is not running.
Disabled          You have disabled the test.
Running           You have enabled the test, and the test is currently running.
Failing           A failure is imminent for the running test. Recovery is still
                  possible in this state.
Failed            The test has failed, and the state cannot be recovered.
Stopped           The test has been internally stopped by the Cisco NX-OS software.
Internal failure  The test encountered an internal failure in this module. For
                  example, the system health application is not able to open a
                  socket as part of the test procedure.
                  Note The internal failure status does not apply to the
                  loopback test.
On demand         The system health internal-loopback tests are currently running.
                  This command can be issued on demand.


The status of each test in the switch is visible when you display any of the show system health commands. See the "Displaying System Health" section.

Displaying System Health

Use the show system health command to display system-related status information (see Example 24-1 to Example 24-8).

Example 24-1 Displays the Current Health of All Modules in the Switch

switch# show system health
 
Current health information for module 1.
 
Test                    Frequency       Status          Action
-----------------------------------------------------------------
Bootflash                10 Sec         Running         Enabled
Management Port          60 Sec         Running         Enabled
Loopback                 60 Sec         Running         Enabled
-----------------------------------------------------------------
 
   

Example 24-2 Displays the Current Health of a Specified Module

switch# show system health module 1
 
Current health information for module 1.
 
Test                    Frequency       Status          Action
-----------------------------------------------------------------
Bootflash                10 Sec         Running         Enabled
Management Port          60 Sec         Running         Enabled
Loopback                 60 Sec         Running         Enabled
-----------------------------------------------------------------
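Output in this layout is regular enough to parse in a monitoring script. The sketch below, in Python, turns the table into a list of records. It assumes columns are separated by at least two spaces, as in the examples here; the function name and record keys are hypothetical, and the layout may differ on other software releases.

```python
import re
from typing import Dict, List

MODULE_RE = re.compile(r"^Current health information for module (\d+)\.")
# Test name and status may contain single spaces ("Management Port"),
# so columns are split on runs of two or more spaces.
ROW_RE = re.compile(
    r"^(?P<test>\S.*?)\s{2,}(?P<seconds>\d+)\s+Sec\s{2,}"
    r"(?P<status>\S.*?)\s{2,}(?P<action>\S+)$"
)


def parse_system_health(output: str) -> List[Dict[str, object]]:
    """Parse 'show system health' text into one record per test row."""
    rows: List[Dict[str, object]] = []
    module = None
    for raw in output.splitlines():
        line = raw.strip()
        header = MODULE_RE.match(line)
        if header:
            module = int(header.group(1))
            continue
        row = ROW_RE.match(line)
        if row:
            rows.append({
                "module": module,
                "test": row.group("test"),
                "frequency_seconds": int(row.group("seconds")),
                "status": row.group("status"),
                "action": row.group("action"),
            })
    return rows


# Sample copied from Example 24-2.
SAMPLE = """\
Current health information for module 1.

Test                    Frequency       Status          Action
-----------------------------------------------------------------
Bootflash                10 Sec         Running         Enabled
Management Port          60 Sec         Running         Enabled
Loopback                 60 Sec         Running         Enabled
-----------------------------------------------------------------
"""

if __name__ == "__main__":
    for record in parse_system_health(SAMPLE):
        print(record)
```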

Example 24-3 Displays Health Statistics for All Modules

switch# show system health statistics
 
Test statistics for module 1
------------------------------------------------------------------------------
Test Name         State       Frequency     Run      Pass      Fail CFail Errs
------------------------------------------------------------------------------
Bootflash         Running        10s        705       705         0     0    0
Management Port   Running        60s        117       117         0     0    0
Loopback          Running        60s       1504      1493        11     0    0
Loopback Port   Status
      1         Failed
      2         Failed
      3         Failed
      4         Failed
      5         Failed
      6         Failed
      7         Passed
      8         Passed
      9         Passed
     10         Passed
     11         Passed
     12         Passed
     13         Passed
     14         Passed
     15         Passed
     16         Failed
     17         Failed
     18         Failed
     19         Failed
     20         Failed
------------------------------------------------------------------------------
 
   

Example 24-4 Displays Statistics for a Specified Module

switch# show system health statistics module 1
 
Test statistics for module 1
------------------------------------------------------------------------------
Test Name         State       Frequency     Run      Pass      Fail CFail Errs
------------------------------------------------------------------------------
Bootflash         Running        10s        706       706         0     0    0
Management Port   Running        60s        117       117         0     0    0
Loopback          Running        60s       1504      1493        11     0    0
Loopback Port   Status
      1         Failed
      2         Failed
      3         Failed
      4         Failed
      5         Failed
      6         Failed
      7         Passed
      8         Passed
      9         Passed
     10         Passed
     11         Passed
     12         Passed
     13         Passed
     14         Passed
     15         Passed
     16         Failed
     17         Failed
     18         Failed
     19         Failed
     20         Failed
------------------------------------------------------------------------------
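The statistics output can be reduced to a pass rate and a list of failed loopback ports. The following Python sketch shows one way to do that. It assumes the column layout shown in the examples above and that port status is reported only as Passed or Failed; the function names are hypothetical, and the sample is abridged from Example 24-4.

```python
import re
from typing import Dict, List, Tuple

STAT_RE = re.compile(
    r"^(?P<test>\S.*?)\s{2,}(?P<state>\S.*?)\s{2,}(?P<freq>\d+)s"
    r"\s+(?P<run>\d+)\s+(?P<passed>\d+)\s+(?P<failed>\d+)"
    r"\s+(?P<cfail>\d+)\s+(?P<errs>\d+)$"
)
# Assumption: only the two port states seen in the examples are handled.
PORT_RE = re.compile(r"^(?P<port>\d+)\s+(?P<status>Passed|Failed)$")


def parse_statistics(output: str) -> Tuple[Dict[str, Dict[str, int]], List[int]]:
    """Return per-test counters and the loopback ports marked Failed."""
    tests: Dict[str, Dict[str, int]] = {}
    failed_ports: List[int] = []
    for raw in output.splitlines():
        line = raw.strip()
        stat = STAT_RE.match(line)
        if stat:
            tests[stat.group("test")] = {
                key: int(stat.group(key))
                for key in ("freq", "run", "passed", "failed", "cfail", "errs")
            }
            continue
        port = PORT_RE.match(line)
        if port and port.group("status") == "Failed":
            failed_ports.append(int(port.group("port")))
    return tests, failed_ports


def pass_rate(counters: Dict[str, int]) -> float:
    """Fraction of runs that passed; 0.0 when the test has not run yet."""
    return counters["passed"] / counters["run"] if counters["run"] else 0.0


# Abridged from Example 24-4: only ports 5-8 and 15-16 are kept.
SAMPLE = """\
Test Name         State       Frequency     Run      Pass      Fail CFail Errs
------------------------------------------------------------------------------
Bootflash         Running        10s        706       706         0     0    0
Management Port   Running        60s        117       117         0     0    0
Loopback          Running        60s       1504      1493        11     0    0
Loopback Port   Status
      5         Failed
      6         Failed
      7         Passed
      8         Passed
     15         Passed
     16         Failed
"""

if __name__ == "__main__":
    tests, failed_ports = parse_statistics(SAMPLE)
    print(f"loopback pass rate: {pass_rate(tests['Loopback']):.4f}")
    print("failed ports:", failed_ports)
```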

Example 24-5 Displays Loopback Test Statistics for the Entire Switch

switch# show system health statistics loopback
-----------------------------------------------------------------
Mod Port Status                Run     Pass     Fail   CFail Errs
  1   20 Running                 0        0        0       0    0
-----------------------------------------------------------------
 
   

Example 24-6 Displays Loopback Test Statistics for a Specified Interface

switch# show system health statistics loopback interface ethernet 1/1
-----------------------------------------------------------------
Mod Port Status                Run     Pass     Fail   CFail Errs
  1    1 Running                 0        0        0       0    0
-----------------------------------------------------------------

Note Interface-specific counters will remain at zero unless the loopback test reports errors or failures.


Example 24-7 Displays the Loopback Test Time Log for the Switch

switch# show system health statistics loopback timelog
-----------------------------------------------------------------
Mod        Samples     Min(usecs)     Max(usecs)     Ave(usecs)
  1              0              0              0              0
-----------------------------------------------------------------
 
   

Example 24-8 Displays the Loopback Test Time Log for a Specified Module

switch# show system health statistics loopback module 1 timelog
-----------------------------------------------------------------
Mod        Samples     Min(usecs)     Max(usecs)     Ave(usecs)
  1              0              0              0              0
-----------------------------------------------------------------

On-Board Failure Logging

The on-board failure logging (OBFL) feature stores failure and environmental information in nonvolatile memory on the module. The information will help in post-mortem analysis of failed cards.

This section includes the following topics:

About OBFL

Configuring OBFL for the Switch

Displaying OBFL Logs

Default Settings

About OBFL

OBFL data is stored in the existing eUSB on the module. OBFL uses the persistent logging (PLOG) facility available in the module firmware to store data in the eUSB. It also provides the mechanism to retrieve the stored data.

The data stored by the OBFL facility includes the following:

Time of initial power-on

Firmware, BIOS, FPGA, and ASIC versions

Serial number of the card

Stack trace for crashes

Software error messages

Hardware exception logs

Environmental history

OBFL-specific history information

Configuring OBFL for the Switch

To configure OBFL for all the modules on the switch, perform this task:

 
Step 1
    Command: switch# config terminal
             switch(config)#
    Purpose: Enters configuration mode.

Step 2
    Command: switch(config)# hw-module logging onboard
    Purpose: Enables all OBFL features.

    Command: switch(config)# hw-module logging onboard environmental-history
    Purpose: Enables the OBFL environmental history.

    Command: switch(config)# hw-module logging onboard obfl-log
    Purpose: Enables the boot uptime, device version, and OBFL history.

    Command: switch(config)# no hw-module logging onboard
    Purpose: Disables all OBFL features.

Use the show logging onboard status command to display the configuration status of OBFL:

switch# show logging onboard status
----------------------------
OBFL Status
----------------------------
    Switch OBFL Log:                                    Enabled
 
   
    Module:  1 OBFL Log:                                Enabled
    environmental-history                               Enabled
    exception-log                                       Enabled
    obfl-log (boot-uptime/device-version/obfl-history)  Enabled
    temp error                                          Enabled
    stack-trace                                         Enabled
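A script that audits OBFL settings can read this status output into a dictionary. The Python sketch below is one possible approach. It assumes each feature line ends with a state word separated from the name by at least two spaces, as in the output above; the function name is hypothetical, and any state other than Enabled is treated as not enabled.

```python
import re
from typing import Dict

# Feature name, optional trailing colon, two or more spaces, then the state.
FEATURE_RE = re.compile(r"^(?P<name>\S.*?):?\s{2,}(?P<state>\S+)$")


def parse_obfl_status(output: str) -> Dict[str, bool]:
    """Map each OBFL feature line to True when its state is 'Enabled'."""
    status: Dict[str, bool] = {}
    for raw in output.splitlines():
        match = FEATURE_RE.match(raw.strip())
        if match:
            # Collapse internal runs of spaces, e.g. "Module:  1 OBFL Log".
            name = " ".join(match.group("name").split())
            status[name] = match.group("state") == "Enabled"
    return status


# Sample copied from the 'show logging onboard status' output above.
SAMPLE = """\
----------------------------
OBFL Status
----------------------------
    Switch OBFL Log:                                    Enabled

    Module:  1 OBFL Log:                                Enabled
    environmental-history                               Enabled
    exception-log                                       Enabled
    obfl-log (boot-uptime/device-version/obfl-history)  Enabled
    temp error                                          Enabled
    stack-trace                                         Enabled
"""

if __name__ == "__main__":
    for feature, enabled in parse_obfl_status(SAMPLE).items():
        print(f"{feature}: {'on' if enabled else 'off'}")
```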
 
   

Displaying OBFL Logs

To display OBFL information stored in the switch, use the following commands:

Command                                     Purpose
------------------------------------------  -----------------------------------------------
show logging onboard boot-uptime            Displays the boot and uptime information.
show logging onboard device-version         Displays device version information.
show logging onboard endtime                Displays OBFL logs up to a specified end time.
show logging onboard environmental-history  Displays environmental history.
show logging onboard exception-log          Displays exception log information.
show logging onboard miscellaneous-error    Displays miscellaneous error information.
show logging onboard obfl-history           Displays history information.
show logging onboard stack-trace            Displays kernel stack trace information.
show logging onboard starttime              Displays OBFL logs from a specified start time.
show logging onboard system-health          Displays system health information.


Default Settings

Table 24-2 lists the default system health and log settings.

Table 24-2 Default System Health and Log Settings  

Parameters              Default
----------------------  -----------
Kernel core generation  One module.
System health           Enabled.
Loopback frequency      60 seconds.
Failure action          Enabled.