Nexus 5010/5020 Switches %NOHMS-2-NOHMS_DIAG_ Error Message Interpretation

Document ID: 116247

Updated: Jul 05, 2013

Contributed by Alejandro Eguiarte and Shelley Bhalla, Cisco TAC Engineers.

Introduction

This document describes a problem on Cisco Nexus 5010/5020 switches in which a hardware issue in the Altos ASIC produces the error message %NOHMS-2-NOHMS_DIAG_ERROR: Module 1: Runtime diag detected major event: Port failure. It also provides a solution to the problem.

Prerequisites

Requirements

Cisco recommends that you have knowledge of the Nexus CLI.

Components Used

The information in this document is based on Cisco Nexus 5010/5020 switches only. The problem described here does not affect Cisco Nexus 5548/5596 switches.

The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.

Problem

Multiple interfaces on Card 2 are down, and you see this alert:

N5020 %$ VDC-1 %$ %NOHMS-2-NOHMS_DIAG_ERROR: Module 1: Runtime diag detected major event

The alert suggests a card failure, but some ports remain up. Although the Nexus 5020 switch is online, the Fibre Channel (FC) module in Slot 2 is offline. Enter the show module command in order to view the status of the modules:

Mod Ports Module-Type                      Model                  Status
--- ----- -------------------------------- ---------------------- ------------
1   40    40x10GE/Supervisor               N5K-C5020P-BF-SUP      active *
2   8     8x1/2/4G FC Module               N5K-M1008              offline <<<<<<

Mod Sw             Hw     World-Wide-Name(s) (WWN)
--- -------------- ------ --------------------------------------------------
1   4.2(1)N2(1)    1.3    --
2   4.2(1)N2(1)    1.0    77:9f:b7:62:2f:6c:69:62 to 00:00:00:b8:27:0a:08:2c
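
When the show module output is collected from many switches, the status column can be checked with a short script. The following is a minimal sketch, not a Cisco tool: the function name modules_not_ok, the list of healthy states, and the assumption that every model string begins with N5K- are illustrative choices made for this example.

```python
import re

# Sample text copied from the "show module" output in this document.
SHOW_MODULE = """\
Mod Ports Module-Type                      Model                  Status
--- ----- -------------------------------- ---------------------- ------------
1   40    40x10GE/Supervisor               N5K-C5020P-BF-SUP      active *
2   8     8x1/2/4G FC Module               N5K-M1008              offline
"""

# Assumption: the model column starts with "N5K-", which lets the regex find
# the end of the free-form Module-Type column (it may contain spaces).
ROW = re.compile(r"^(\d+)\s+(\d+)\s+(.+?)\s+(N5K-\S+)\s+(\S+)")


def modules_not_ok(text, ok_states=("active", "ok")):
    """Return (module, model, status) for every module not in ok_states."""
    bad = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match and match.group(5) not in ok_states:
            bad.append((int(match.group(1)), match.group(4), match.group(5)))
    return bad


print(modules_not_ok(SHOW_MODULE))
```

With the sample above, only module 2 (N5K-M1008) is reported, which matches the offline FC module described in this section.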

Enter the show environment command in order to view the module environment data.

Mod Model                   Power     Power       Power     Power       Status
                            Requested Requested   Allocated Allocated
                            (Watts)   (Amp)       (Watts)   (Amp)
--- ----------------------  -------   ----------  --------- ----------  ----------
1    N5K-C5020P-BF-SUP      625.20    52.10       625.20    52.10       powered-up
2    N5K-M1008              9.96      0.83        9.96      0.83        fail/shutdown

Enter the show logging nvram command in order to view this output:

N5020 %$ VDC-1 %$ %NOHMS-2-NOHMS_DIAG_ERROR: Module 1: Runtime diag detected major event:
Port failure: Ethernet1/1
N5020 %$ VDC-1 %$ last message repeated 2 times
N5020 %$ VDC-1 %$ %NOHMS-2-NOHMS_DIAG_ERROR: Module 1: Runtime diag detected major event:
Port failure: Ethernet1/2
N5020 %$ VDC-1 %$ last message repeated 7 times
N5020 %$ VDC-1 %$ %NOHMS-2-NOHMS_DIAG_ERROR: Module 1: Runtime diag detected major event:
Port failure: Ethernet1/5
N5020 %$ VDC-1 %$ last message repeated 3 times
N5020 %$ VDC-1 %$ %NOHMS-2-NOHMS_DIAG_ERROR: Module 1: Runtime diag detected major event:
Port failure: Ethernet1/13
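
To get a quick list of the ports named in these messages, the saved log text can be scanned for the "Port failure:" string. This is a minimal sketch under stated assumptions: the function name failed_ports is hypothetical, and the pattern only covers the EthernetX/Y port names that appear in this document.

```python
import re

# Sample text copied from the "show logging nvram" output in this document.
LOG = """\
N5020 %$ VDC-1 %$ %NOHMS-2-NOHMS_DIAG_ERROR: Module 1: Runtime diag detected major event:
Port failure: Ethernet1/1
N5020 %$ VDC-1 %$ last message repeated 2 times
N5020 %$ VDC-1 %$ %NOHMS-2-NOHMS_DIAG_ERROR: Module 1: Runtime diag detected major event:
Port failure: Ethernet1/2
N5020 %$ VDC-1 %$ last message repeated 7 times
N5020 %$ VDC-1 %$ %NOHMS-2-NOHMS_DIAG_ERROR: Module 1: Runtime diag detected major event:
Port failure: Ethernet1/5
N5020 %$ VDC-1 %$ last message repeated 3 times
N5020 %$ VDC-1 %$ %NOHMS-2-NOHMS_DIAG_ERROR: Module 1: Runtime diag detected major event:
Port failure: Ethernet1/13
"""

# Matches the port name wherever "Port failure:" appears, so it does not
# matter whether the message is wrapped onto a second line.
PORT_FAILURE = re.compile(r"Port failure:\s*(Ethernet\d+/\d+)")


def failed_ports(log_text):
    """Return the failed ports in first-seen order, without duplicates."""
    seen = []
    for port in PORT_FAILURE.findall(log_text):
        if port not in seen:
            seen.append(port)
    return seen


print(failed_ports(LOG))
```

For the sample log, the result is Ethernet1/1, Ethernet1/2, Ethernet1/5, and Ethernet1/13.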

The logs show that several ports failed the runtime diagnostics. In addition, two ports on every Gatos ASIC report "Hardware failure" because the fabric is down. Enter the show interface brief command in order to view this output:

--------------------------------------------------------------------------------
Ethernet      VLAN   Type Mode   Status  Reason                   Speed     Port
Interface                                                                   Ch #
--------------------------------------------------------------------------------
Eth1/1        1      eth  fabric down    Hardware failure         10G(D)    138
Eth1/2        1      eth  fabric down    Hardware failure         10G(D)    138
Eth1/3        1      eth  fabric up      none                     10G(D)    138
Eth1/4        1      eth  fabric up      none                     10G(D)    138
Eth1/5        1      eth  fabric down    Hardware failure         10G(D)    140
Eth1/6        1      eth  fabric down    Hardware failure         10G(D)    140
Eth1/7        1      eth  fabric up      none                     10G(D)    140
Eth1/8        1      eth  fabric up      none                     10G(D)    140
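
The pattern in this table (two failed ports followed by two healthy ports) is easier to see when the failed ports are grouped. The sketch below groups ports that are down with the reason "Hardware failure" by the value in the Port Ch # column. It is an illustration only: the function name is hypothetical, and the regular expression assumes the eight-column layout shown in this document.

```python
import re
from collections import defaultdict

# Sample rows copied from the "show interface brief" output in this document.
SHOW_INTERFACE_BRIEF = """\
Eth1/1 1 eth fabric down Hardware failure 10G(D) 138
Eth1/2 1 eth fabric down Hardware failure 10G(D) 138
Eth1/3 1 eth fabric up none 10G(D) 138
Eth1/4 1 eth fabric up none 10G(D) 138
Eth1/5 1 eth fabric down Hardware failure 10G(D) 140
Eth1/6 1 eth fabric down Hardware failure 10G(D) 140
Eth1/7 1 eth fabric up none 10G(D) 140
Eth1/8 1 eth fabric up none 10G(D) 140
"""

# Columns: interface, VLAN, type, mode, status, reason, speed, port channel.
# The reason column may contain spaces, so it is matched lazily.
ROW = re.compile(
    r"^(Eth\d+/\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(up|down)\s+(.+?)\s+(\S+)\s+(\S+)\s*$"
)


def hardware_failures_by_port_channel(text):
    """Map each port-channel number to its ports that report Hardware failure."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match and match.group(5) == "down" and match.group(6) == "Hardware failure":
            groups[match.group(8)].append(match.group(1))
    return dict(groups)


print(hardware_failures_by_port_channel(SHOW_INTERFACE_BRIEF))
```

For the sample rows, Eth1/1 and Eth1/2 are grouped under 138, and Eth1/5 and Eth1/6 under 140.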

The Gatos ASIC reports failures for some of the ports and disables them. Enter the show hardware internal gatos event-history error command in order to view this output:

1) Event:E_DEBUG, length:81, at 775734 usecs after Fri May 24 15:28:10 2013
[101] xcvr_set_port_to_hw_failure(): Sending nohms failure notif for port xgb1/13
2) Event:E_DEBUG, length:44, at 775726 usecs after Fri May 24 15:28:10 2013
[100] CODE-PATH: xcvr_set_port_to_hw_failure
935) Event:E_DEBUG, length:34, at 434695 usecs after Fri May 24 15:28:06 2013
[100] CODE-PATH: xcvr_port_disable
936) Event:E_DEBUG, length:38, at 434653 usecs after Fri May 24 15:28:06 2013
[100] CODE-PATH: xcvr_set_port_disable
937) Event:E_DEBUG, length:81, at 408233 usecs after Fri May 24 15:28:06 2013
[101] xcvr_set_port_to_hw_failure(): Sending nohms failure notif for port xgb1/30
938) Event:E_DEBUG, length:44, at 408224 usecs after Fri May 24 15:28:06 2013
[100] CODE-PATH: xcvr_set_port_to_hw_failure

The Altos ASIC logs numerous "error interrupt" messages caused by synchronization issues, and these trigger Fabric Interconnect (FI) resets. Enter the show hardware internal altos event-history errors command in order to view this output:

1)  Event:E_DEBUG, length:131, at 959201 usecs after Fri May 24 14:19:20 2013
[100] Threshold reached for error interrupt - ALT_FIC3_INT_3_XGXS_rx2_loss_of_sync, flags:
0xa8, fabric port: 15, Action: fi-reset
2) Event:E_DEBUG, length:122, at 372727 usecs after Fri May 24 14:15:05 2013
[100] Threshold reached for interrupt - ALT_FIC6_INT_0_XGXS_EXT_serdes_rx2_sync, masking it
(threshold=3 period=10 msecs)
453) Event:E_DEBUG, length:122, at 658189 usecs after Fri May 24 03:38:48 2013
[100] Threshold reached for interrupt - ALT_FIC6_INT_1_XGXS_EXT_serdes_rx0_sync, masking it
(threshold=3 period=10 msecs)
454) Event:E_DEBUG, length:129, at 658137 usecs after Fri May 24 03:38:48 2013
[100] Threshold reached for error interrupt - ALT_FIC6_INT_1_XGXS_rx2_code_eerror, flags:
0xa8, fabric port: 25, Action: fi-reset
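
Because the full event history can run to hundreds of entries, it can help to count the fi-reset actions per fabric port. The following sketch does that for text saved from the command above. The function name resets_per_fabric_port is hypothetical, and the pattern is based only on the entry format shown in this document, so other releases may need a different expression.

```python
import re
from collections import Counter

# Sample entries copied from the Altos event history in this document.
ALTOS_ERRORS = """\
1)  Event:E_DEBUG, length:131, at 959201 usecs after Fri May 24 14:19:20 2013
[100] Threshold reached for error interrupt - ALT_FIC3_INT_3_XGXS_rx2_loss_of_sync, flags:
0xa8, fabric port: 15, Action: fi-reset
2) Event:E_DEBUG, length:122, at 372727 usecs after Fri May 24 14:15:05 2013
[100] Threshold reached for interrupt - ALT_FIC6_INT_0_XGXS_EXT_serdes_rx2_sync, masking it
(threshold=3 period=10 msecs)
454) Event:E_DEBUG, length:129, at 658137 usecs after Fri May 24 03:38:48 2013
[100] Threshold reached for error interrupt - ALT_FIC6_INT_1_XGXS_rx2_code_eerror, flags:
0xa8, fabric port: 25, Action: fi-reset
"""

# \s* between fields tolerates the line wrap after "flags:" in the output.
ERROR_INTERRUPT = re.compile(
    r"error interrupt - (\w+),\s*flags:\s*(0x[0-9a-fA-F]+),"
    r"\s*fabric port:\s*(\d+),\s*Action:\s*([\w-]+)"
)


def resets_per_fabric_port(text):
    """Count entries whose action is fi-reset, keyed by fabric port number."""
    counts = Counter()
    for _name, _flags, port, action in ERROR_INTERRUPT.findall(text):
        if action == "fi-reset":
            counts[int(port)] += 1
    return dict(counts)


print(resets_per_fabric_port(ALTOS_ERRORS))
```

In the sample, fabric ports 15 and 25 each show one fi-reset. The "masking it" entries are ignored because they do not carry an action.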

Solution

The problem is due to a hardware issue in the Altos ASIC. Contact the Cisco Technical Assistance Center (TAC) in order to replace the Nexus 5000 Series switch.
