Troubleshoot Common Hardware and Architecture Issues in Nexus 7000 Series Switches

Updated:May 15, 2015

Document ID:118959

Introduction

Problem: SpineControlBus Failure

Solution

Problem: Bad Blocks Found on NVRAM

Solution

Problem: Module 9 Compact Flash Failure

Solution

Problem: N7K-M132XP-12 Linecard PortLoopback Test Failure

Solution

Problem: N7K-M132XP-12 Linecard MODULE-4-MOD_WARNING

Solution

Problem: N7K-M224XP-23L chico serdes sync loss Error

Solution

Problem: N7K-F248XP-25 PrimaryBootROM and SecondaryBootROM Test Failures

Solution

Problem: Temperature Sensor Failure

Solution

Problem: Xbar Error/C7010-FAB-1 in Power Down State

Solution

Problem: N7K-C7010-FAN-F Failed Fan Module

Solution

Problem: %PLATFORM-2-PS_CAPACITY_CHANGE Power Supply Alarm

Solution

Problem: %PLATFORM-5-PS_STATUS: PowerSupply X PS_FAIL Alarm

Solution

Problem: Power Supply Issue on FEX

Solution

Problem: N7K-AC-6.0KW Power Supplies are Reported as Fail

Solution

Problem: Software Packet Drops

Solution

Problem: USER-2-SYSTEM_MSG FIPS Self-test Failure System Error

Solution

Introduction

This document provides a brief explanation and solutions for common hardware and architecture issues for Cisco Nexus 7000 Series switches that run Cisco NX-OS system software.

Note: The exact format of the syslog and error messages that this document describes can vary slightly. The variation depends on the software release that runs on the Supervisor Engine.

Problem: SpineControlBus Failure

The spine control test fails for the Nexus 7000 Supervisor:

Nexus7000# show module internal exceptionlog module 5
...
System Errorcode  : 0x418b0022 Spine control test failed
Error Type        : Warning
PhyPortLayer      : 0x0
Port(s) Affected  : none
Error Description : Module 10 Spine Control Bus test Failed
...
         11) SpineControlBus E
                Error code ------------------> DIAG TEST ERR DISABLE
                Total run count -------------> 1597800
                Last test execution time ----> Mon May 27 21:57:17 2013
                First test failure time -----> Sun Nov 20 00:30:55 2011
                Last test failure time ------> Mon May 27 21:57:17 2013
                Last test pass time ---------> Mon May 27 21:56:47 2013
                Total failure count ---------> 33
                Consecutive failure count ---> 1
                Last failure reason ---------> Spine control test failed

Solution

This issue is related to Cisco bug ID CSCuc72466. Refer to Nexus 7000 FAQ: What is the recommended action to take when the SpineControlBus test fails?

Problem: Bad Blocks Found on NVRAM

NVRAM errors appear in diagnostic events:

Nexus7000#show diagnostic events
1) Event:E_DEBUG, length:97, at 9664 usecs after Wed Dec  5 01:03:42 2012
    [103] Event_ERROR: TestName->NVRAM TestingType->health monitoring module->5 
Result->fail Reason->
#show diagnostic result module 5 test NVRAM detail
 4) NVRAM-------------------------> E
                Error code ------------------> DIAG TEST ERR DISABLE
                Total run count -------------> 52596
                Last test execution time ----> Wed Dec  5 01:03:41 2012
                First test failure time -----> Tue Dec  4 23:28:45 2012
                Last test failure time ------> Wed Dec  5 01:03:42 2012
                Last test pass time ---------> Tue Dec  4 23:23:41 2012
                Total failure count ---------> 20
                Consecutive failure count ---> 20
                Last failure reason ---------> Bad blocks found on nvram

This is either a hardware issue, a Supervisor Engine failure, or a transient issue.

Solution

1. Rerun the NVRAM test in order to see if this is a false alarm. Enter these commands in order to disable and reenable the diagnostic test (the example is given for problem module 5):
- no diagnostic monitor module 5 test NVRAM
- diagnostic monitor module 5 test NVRAM
2. Enter the show diagnostic result module 5 test NVRAM detail command in order to see the results of the test.
3. If the NVRAM test fails again, reseat module 5. Observe the result of the show diagnostic result module 5 and show module commands.
4. If the module fails again, raise a Return Material Authorization (RMA) request for the Supervisor in the problem slot.
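
When the test is rerun several times, it can help to compare the counters in the detail output between runs. This is a minimal sketch, not a Cisco tool: it assumes the detail output keeps the "name ---> value" layout shown in the sample above, and the rule in nvram_still_failing (a nonzero consecutive failure count means the test is still failing) is an assumption made for illustration.

```python
import re

# Matches lines such as "Total failure count ---------> 20".
# The test header line (" 4) NVRAM-----> E") starts with a digit and is skipped.
FIELD_RE = re.compile(
    r"^\s*(?P<key>[A-Za-z][A-Za-z ]*?)\s*-+>\s*(?P<value>.*\S)\s*$"
)


def parse_diag_detail(text):
    """Return the 'name ---> value' fields of one diagnostic test as a dict."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group("key")] = match.group("value")
    return fields


def nvram_still_failing(fields):
    """Assumed rule: a nonzero consecutive failure count means still failing."""
    return int(fields.get("Consecutive failure count", "0")) > 0


SAMPLE = """\
 4) NVRAM-------------------------> E
                Error code ------------------> DIAG TEST ERR DISABLE
                Total run count -------------> 52596
                Total failure count ---------> 20
                Consecutive failure count ---> 20
                Last failure reason ---------> Bad blocks found on nvram
"""

if __name__ == "__main__":
    parsed = parse_diag_detail(SAMPLE)
    print(parsed["Last failure reason"], nvram_still_failing(parsed))
```

The same parser can be pointed at the SpineControlBus and CompactFlash detail output shown elsewhere in this document, because those samples use the same layout.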

Problem: Module 9 Compact Flash Failure

One or more of these symptoms are seen on the Supervisor 2/Supervisor 2E:

Error Message:

DEVICE_TEST-2-COMPACT_FLASH_FAIL: Module 5 has failed test CompactFlash 
20 times on device Compact Flash due to error The compact flash power test failed.

Unable to save config.

Diagnostic test failures:

       Test results: (. = Pass, F = Fail, I = Incomplete,
       U = Untested, A = Abort, E = Error disabled) 
       7) CompactFlash E 
               Error code ------------------> DIAG TEST ERR DISABLE
               Total run count -------------> 23302
               Last test execution time ----> Sun Apr 13 10:07:30 2014
               First test failure time -----> Sun Apr 13 00:37:41 2014
               Last test failure time ------> Sun Apr 13 10:07:40 2014
               Last test pass time ---------> Sun Apr 13 00:07:41 2014
               Total failure count ---------> 20
               Consecutive failure count ---> 20
               Last failure reason ---------> The compact flash power test
                                               failed
               Next Execution time ---------> Sun Apr 13 10:37:30 2014

Root Cause

Second-generation Nexus 7000 Supervisors ship with two identical eUSB flashes for redundancy. The flashes provide a repository for bootflash, configurations, and other pertinent information. These two flashes are configured as a Redundant Array of Independent Disks (RAID) 1 array, which implements internal mirroring. With this redundancy, a Supervisor can function with the loss of one of the flashes, but not both.

There are a few instances in the field where one or both of these flashes are marked as bad by the RAID software over a time span of several months or years in service. After a reset or reboot of the board, these flashes are rediscovered as healthy at the next bootup.

Solution

Complete these steps in order to verify whether this is a hardware issue:

1. Reload the problem Supervisor, if possible.
2. If the issue is seen after reload, you need a hardware replacement.
3. If the issue is fixed by reload, the root cause is related to Cisco bug ID CSCus22805.

Problem: N7K-M132XP-12 Linecard PortLoopback Test Failure

The linecard reports a diagnostics failure because the PortLoopback test fails 10 consecutive times:

DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL  Module:16 Test:PortLoopback
 failed 10 consecutive times. Faulty module:Module 16 affected ports:5,7  
Error:Loopback test failed. Packets lost on the LC at the Queueing engine ASIC

MODULE-4-MOD_WARNING  Module 16 (serial: XXXX) reported warning on 
ports 16/5-16/5 (Ethernet) due to Loopback test failed.  
Packets lost on the LC at the Queueing engine ASIC in device 78 
(device error 0x41830059)

Root Cause

This is a warning message and in most cases indicates a hardware issue with the port.

Solution

Check for Cisco bug ID CSCtn81109 and Cisco bug ID CSCti95293 first, as this could be a software issue.

Reseat the module first in order to reinitialize the card and rerun bootup hardware sanity tests. If the diagnostics tests still show failure for the same card, replace the card.

Reload the card at a convenient time and collect the outputs of these commands:

show logging log
show module
show diagnostic result module all detail

Alternatively, you can rerun only this specific test and do not need to reload the card. This example shows module 16:

show diagnostic result module 16
diagnostic clear result module all
(config)# no diagnostic monitor module 16 test 5
(config)# diagnostic monitor module 16 test 5
diagnostic start module 16 test 5
show diagnostic result module 16 test 5

Problem: N7K-M132XP-12 Linecard MODULE-4-MOD_WARNING

These errors appear and there is a possible module reload:

2013 Mar 27 00:40:23 DC3-7000-PRODD2-A23  MODULE-4-MOD_WARNING  
Module 9 (serial: XXX) reported warning on ports 9/1-9/3 (Unknown) 
due to BE2 Arbiter experienced an error in device 65 (device error 0xc410f613)
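
When many of these warnings are logged, it can be useful to pull out the module, ports, reason, and device error code so that repeated hits on the same device stand out. This is a minimal sketch, not a Cisco tool: the regular expression is an assumption based only on the MODULE-4-MOD_WARNING samples in this document, and the exact message format can vary with the software release.

```python
import re

# Pattern derived from the sample messages in this document (an assumption).
# The "on ports ..." part is optional because some samples omit it.
MOD_WARNING_RE = re.compile(
    r"Module (?P<module>\d+) \(serial: (?P<serial>\S+)\) reported warning "
    r"(?:on ports (?P<ports>\S+) \((?P<port_type>\w+)\) )?"
    r"due to (?P<reason>.+?) in device (?P<device>\S+) "
    r"\(device error (?P<code>0x[0-9a-fA-F]+)\)"
)


def parse_mod_warning(message):
    """Return the fields of a MOD_WARNING message, or None if it does not match."""
    # Syslog lines are often wrapped, so collapse all whitespace first.
    flat = " ".join(message.split())
    match = MOD_WARNING_RE.search(flat)
    return match.groupdict() if match else None


SAMPLE = """\
2013 Mar 27 00:40:23 DC3-7000-PRODD2-A23  MODULE-4-MOD_WARNING
Module 9 (serial: XXX) reported warning on ports 9/1-9/3 (Unknown)
due to BE2 Arbiter experienced an error in device 65 (device error 0xc410f613)
"""

if __name__ == "__main__":
    print(parse_mod_warning(SAMPLE))
```

Messages that use a different layout, such as the chico serdes sync loss warning, do not match this pattern and return None.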

Root Cause

This is a hardware failure caused by parity errors or hardware issues on the daughter card.

Solution

1. Check the output of these commands:
- show version
- show system reset-reason module X
- show logging onboard internal reset-reason
- show module internal event-history module X
- show log
2. If your version of Cisco NX-OS is earlier than Version 4.2, upgrade to a newer version in order to ensure that the fixes for these software defects are integrated (this minimizes the possibility of parity errors):
- Cisco bug ID CSCso72230 - L1 D-cache enabled 8541 CPU crashes with L1 D-cache parity errors
- Cisco bug ID CSCsr90831 - L1 D-cache enabled 8541 CPU crashes with L1 D-cache Push parity errors
3. If the errors occur repeatedly, reseat the card and monitor.
4. If the errors still repeat, replace the problem module.

Additional Known Software Defect

Cisco bug ID CSCtb98876

Problem: N7K-M224XP-23L chico serdes sync loss Error

These errors appear on the module:

%MODULE-4-MOD_WARNING: Module # (Serial number: XXXX) reported warning 
Ethernet#/# due to chico serdes sync loss in device DEV_SKYTRAIN 
(device error 0xc9003600)

Root Cause

These errors indicate that there is a sync loss issue between module # and the Xbar/ASIC. In most cases the cause is a hardware failure of the module.

If your version of Cisco NX-OS is earlier than 6.1(4) and the message does not appear continuously, the switch can be affected by Cisco bug ID CSCud91672. The cause of the defect is that the NX-OS serdes settings are different from the diagnostic settings on the two channels between SKT <--> SAC.

Solution

Collect the output of these commands:

show version
show module
show run
show module internal event-history module X
show module internal activity module X
show module internal exception-log module X
show module internal event-history errors
show logging last 200
show logging nvram

Upgrade the switch to NX-OS Version 6.1(4) or later in order to rule out the software defect as the cause.

Perform this test in order to confirm if the card is faulty instead of the xbar or chassis slot:

1. Move the problem module to another free slot in the chassis.
2. If you have a spare module, insert it in the problem slot.
3. If the errors are not seen after step 1, insert the module back in the problem slot and verify.

Problem: N7K-F248XP-25 PrimaryBootROM and SecondaryBootROM Test Failures

Module N7K-F248XP-25 fails in both PrimaryBootROM and SecondaryBootROM tests:

show module internal exceptionlog module 1 | i Error|xception
********* Exception info for module 1 ********
exception information --- exception instance 1 ----
Error Description : Secondary BootROM test failed
 
exception information --- exception instance 2 ----
Error Description : Primary BootROM test failed

Root Cause

This is usually caused by BIOS file corruption or a linecard hardware failure.

Solution

Cisco bug ID CSCuf82089 adds code to show more descriptive information about such failures for better diagnostics. For example, it shows a failed component instead of a currently null value.

In some cases the issue is caused by BIOS corruption on the module. Enter the install module X bios forced command in order to resolve this. Note that this command can potentially impact service. The recommendation is to execute it only during a maintenance window.

Complete these steps in order to resolve the issue:

1. Schedule a maintenance window and enter the install module X bios forced command as a possible workaround. Only enter this command during a maintenance window in order to avoid potential service impact.
2. If step 1 does not help or it is not possible to have a maintenance window for this action, replace the module. This example output shows a failed attempt:

Nexus7000# install module 1 bios forced
Warning: Installing Bios forcefully...!
Warning: Please do not remove or power off the module at this time
Upgrading primary bios
Started bios programming .... please wait
[#                            0%                             ]
BIOS install failed for module 1, Error=0x40710027(BIOS flash-type verify failed)
BIOS is OK ...
Please try the command again...

Problem: Temperature Sensor Failure

This error is seen on the platform:

%PLATFORM-4-MOD_TEMPFAIL: Module-2 temperature sensor 7 failed

Root Cause

This is an intermittent issue with the temperature/voltage block in the ASIC under certain conditions due to internal ASIC timing. Cisco bug ID CSCtw79052 describes the known cause for this issue.

This is a timing issue between the ASIC, which latches the temperature internally, and the software that samples the valid bit. The issue can occur on any of the 12 Clipper instances. There is no particular trigger for this problem and it is intermittent. This problem does not impact service. It arises because the temperature read logic has an issue that requires more retries in the driver.

Solution

Collect the output from these commands and check against Cisco bug ID CSCtw79052:

show version
show env temperature
show sprom module <module #>
Nexus# attach module <module #>
<module#>#show hardware internal sensor event-history errors

Problem: Xbar Error/C7010-FAB-1 in Power Down State

The C7010-FAB-1 is in a power down state and these errors appear:

%PLATFORM-3-EJECTOR_STAT_CHANGED: Ejectors' status in slot 13 has changed, 
Left Ejector is OPEN, Right Ejector is CLOSE

%PLATFORM-3-EJECTOR_STAT_CHANGED: Ejectors' status in slot 13 has changed,
 Left Ejector is OPEN, Right Ejector is OPEN

%PLATFORM-2-XBAR_REMOVE: Xbar 3 removed (Serial number XXX)
 
Xbar Ports  Module-Type                         Model              Status
---  -----  ----------------------------------- ------------------ ----------
3    0      Fabric Module                       N/A                powered-dn

Xbar Power-Status  Reason
---  ------------  ---------------------------
3    powered-dn    failure(powered-down) since maximum number of bringups were exceeded

Alternatively, xbar ASIC errors appear:

%MODULE-4-MOD_WARNING: Module 15 (serial: XXX) reported warning due to 
X-bar Interface ASIC Error in device 70 (device error 0xc4600248)

%OC_USD-SLOT15-2-RF_CRC: OC2 received packets with CRC error from MOD 15 
through XBAR slot 3/inst 2

Root Cause

This issue is due to either a faulty or improperly seated xbar module, or a bad chassis slot.

Solution

1. Check the output of these commands:
- show version
- show module
- show logging
- show logging nvram
- show module internal exception-log
- show module internal event-history
- show core
- show system reset-reason
- show environment | in xbar
- show system internal platform internal event-history xbar X (where X is the xbar number)
- show system internal xbar-client internal event-history errors
- show system internal xbar all
- show system internal xbar event-history errors
2. Perform a hard reseat of the xbar module and check the status.
3. If the reseat fails, test the xbar in another slot or test the same slot with another xbar module in order to ensure the chassis is fine.
4. Replace the faulty hardware based on the tests performed in steps 2 and 3.

Problem: N7K-C7010-FAN-F Failed Fan Module

One or more of these fan failure symptoms are observed:

%PLATFORM-5-FAN_STATUS: Fan module 3 (Serial number XXX) 
Fan3(fab_fan1) current-status is  FAN_FAIL
 
Nexus 7000#show environment fan
Fan3(fab_fan1)  N7K-C7010-FAN-F    1.1     Failure (Failed Fanlets: 2 6 7 8 9 10 14 15 )
Fan4(fab_fan2)  N7K-C7010-FAN-F    1.1     Ok 
...
 
#show hardware
----------------------------------
Chassis has 4 Fan slots
----------------------------------
Fan3(fab_fan1) failed
  Model number is N7K-C7010-FAN-F
...

Root Cause

In most cases this is a failure of the fan or chassis slot.

Solution

1. Check the output of these commands:
- show version
- show module
- show inventory
- show log
- show log nvram
- show environment fan
2. Test this N7K-C7010-FAN-F in another good chassis.
3. Replace the fan or chassis based on the results of steps 1 and 2.

Problem: %PLATFORM-2-PS_CAPACITY_CHANGE Power Supply Alarm

Alarms are seen for the capacity changes, sometimes very frequently.

%PLATFORM-2-PS_CAPACITY_CHANGE: Power supply PS2 changed its capacity. 
possibly due to On/Off or power cable removal/

2013 Oct 17 17:06:40 ... last message repeated 14 times

Root Cause

This issue is due to either a faulty or disconnected power cable, or a power supply failure.

Solution

Check the output of the show env power detail command and examine the power supply status. In this example output, both power cords (shown as chord 1 and chord 2) are connected, but the second shows a capacity of only 1200 W instead of the 3000 W expected for 220 V AC on the N7K-AC-6.0KW. Because the power source tested OK, replace the power supply.

PS_2 total capacity:    4200 W   Voltage:50V
chord 1    capacity:    3000 W
chord 1    connected to 110v AC
chord 2    capacity:    1200 W
chord 2    connected to 220v AC

Problem: %PLATFORM-5-PS_STATUS: PowerSupply X PS_FAIL Alarm

This alert appears on the platform:

%PLATFORM-5-PS_STATUS: PowerSupply 3 current-status is PS_FAIL

%PLATFORM-2-PS_FAIL: Power supply 3 failed or shut down (Serial number xxxxx)

Root Cause

This alert is due to either a faulty or disconnected power cable, or a power supply failure.

Solution

1. Check the output of these commands:
- show environment power detail
- show power
2. Reseat the failed power supply. Use the redundant power supply in order to ensure the power does not go offline.
3. Submit an RMA for the power supply. Use the redundant power supply in order to ensure the power does not go offline.

References

Cisco Nexus 7000 Series Power Supply Redundancy

Problem: Power Supply Issue on FEX

These alarms appear for the FEX power supply:

%SATCTRL-FEX104-2-SOHMS_DIAG_ERROR: FEX-104 Module 1: Runtime diag detected major event:
Voltage failure on power supply: 1
%SATCTRL-FEX104-2-SOHMS_DIAG_ERROR: FEX-104 System minor alarm on power 
supply 1: failed

%SATCTRL-FEX104-2-SOHMS_DIAG_ERROR: FEX-104 Recovered: System minor alarm 
on power supply 1: failed

Solution

Check for hardware and power issues. If you have a software issue, error messages continue even after you swap hardware.

Methods to resolve these issues include:

1. Reseat the FEX power supply. Use the redundant power supply in order to ensure the power does not go offline.
2. Submit an RMA for the FEX power supply. Use the redundant power supply in order to ensure the power does not go offline.
3. Repeat these steps for the second power supply.

Review and answer these questions in order to help define the circumstances of the failure:

How many FEX power supplies are affected?
For a minor alarm, did you swap the input source, and did that make any difference?
Do you have other FEX power supplies that have issues?
Do you have any other devices on the same power source?
Did you replace the power cord?
Was there a power surge or glitch in the environment?

Gather output from these commands in order to investigate the failures:

show sprom fex 100 all
show logging log | no-more
show tech fex 100 | no-more
attach fex 100
show platform software satctrl trace

Known Software Defect

Cisco bug ID CSCtr77620

Problem: N7K-AC-6.0KW Power Supplies are Reported as Fail

Emerson N7K-AC-6.0KW power supplies are reported as Fail/Shut, but the switch runs fine and a nonzero actual output is seen for the problem power supply.

Root Cause

On a supply with both inputs active, when an input is disconnected, reconnected, and disconnected again within 1.5 seconds the supply can latch an under-voltage fault and NX-OS can flag the power supply as failed. In another variation, on a supply with two inputs, remove one input and wait 20 to 30 seconds. The supply might intermittently set the Internal Fault alarm and NX-OS reports the power supply as failed.

Cisco bug ID CSCty78612 makes changes to the firmware on the power supply units in order to fix the issue.

Cisco bug ID CSCuc86262 adds a software enhancement in order to recover from these false failures. NX-OS now autonomously monitors the Power Supply Unit (PSU) status and modifies it to the appropriate status if the reported state differs from the real state.

Solution

Enter the show env power detail command and check the actual output in order to confirm the false failure:

Nexus7000# show env power
Power Supply:
Voltage: 50 Volts
Power                              Actual        Total
Supply   Model                     Output     Capacity   Status
                                 (Watts )     (Watts )
-------  -------------------  -----------  -----------  --------------
1        N7K-AC-6.0KW                 0 W          0 W  Shutdown
2        N7K-AC-6.0KW              3888 W       6000 W  Fail/Shut

The erroneous Fail/Shut status is cleared when you power off/on the PSU.

Cisco bug ID CSCty78612 makes changes to the firmware on the PSU. Software has been enhanced through Cisco bug ID CSCuc86262 which recovers from false fail/shut notifications with the correction of the false bits if the power supply in runtime operates normally. NX-OS Versions 5.2(9), 6.1(3), 6.2(2) and later have the enhancement present which avoids an RMA.
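
The check described in this section (a supply reported as Fail/Shut that still shows a nonzero actual output) can be sketched as a small script. This is an illustration only, not a Cisco tool: it assumes the show env power rows look like the sample above, with slot, model, actual output, capacity, and status in that order, and the function name is hypothetical.

```python
import re

# One data row: slot, model, actual output (W), total capacity (W), status.
# The column order is taken from the sample output in this document.
ROW_RE = re.compile(
    r"^\s*(?P<slot>\d+)\s+(?P<model>\S+)\s+(?P<actual>\d+)\s*W"
    r"\s+(?P<capacity>\d+)\s*W\s+(?P<status>\S+)\s*$"
)


def find_suspect_supplies(text):
    """Return slots reported as Fail/Shut while still delivering power."""
    suspects = []
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if not match:
            continue  # header, separator, or unrelated line
        failed = match.group("status") == "Fail/Shut"
        if failed and int(match.group("actual")) > 0:
            suspects.append(int(match.group("slot")))
    return suspects


SAMPLE = """\
------- ------------------- ----------- ----------- --------------
1 N7K-AC-6.0KW 0 W 0 W Shutdown
2 N7K-AC-6.0KW 3888 W 6000 W Fail/Shut
"""

if __name__ == "__main__":
    print(find_suspect_supplies(SAMPLE))
```

A slot returned by this check is only a candidate for the false failure described here. A supply that reports Fail/Shut with zero output is not flagged and needs the normal hardware checks.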

Problem: Software Packet Drops

Some large packets are dropped when there is a high rate of IP packets with a length longer than the configured MTU on the egress interface of the packet.

Root Cause

This is expected behavior. When the system receives an IP packet with a length longer than the configured MTU on the egress interface of the packet, the system sends this packet to the control plane, which takes care of the fragmentation. In NX-OS 4.1.3 and later, a rate-limiter is applied to such punted packets. This limits it to a maximum of 500 pps by default.
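
The effect of the rate limiter can be estimated with simple arithmetic. This is an idealized sketch that assumes a steady arrival rate and a hard per-second cap. The 500 pps default comes from the text above, while the function name and the model itself are assumptions, not the NX-OS implementation.

```python
DEFAULT_LIMIT_PPS = 500  # default rate limit stated in the text above


def punted_packet_outcome(oversize_pps, limit_pps=DEFAULT_LIMIT_PPS):
    """Return (handled_pps, dropped_pps) for a steady rate of oversize packets.

    Idealized model: packets up to the limit reach the control plane for
    fragmentation, and everything above the limit is dropped.
    """
    if oversize_pps < 0 or limit_pps < 0:
        raise ValueError("rates must not be negative")
    handled = min(oversize_pps, limit_pps)
    return handled, oversize_pps - handled


if __name__ == "__main__":
    # 2000 oversize packets per second against the default limit.
    print(punted_packet_outcome(2000))
```

Under this model, a flow of 2000 oversize packets per second loses three quarters of its packets, which matches the symptom of partial drops under a high rate.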

Solution

This is a known software defect in Cisco bug ID CSCsu01048.

Problem: USER-2-SYSTEM_MSG FIPS Self-test Failure System Error

The "USER-2-SYSTEM_MSG FIPS self-test failure in DCOS_rand - netstack" error displays.

Root Cause

Whenever a random number is generated, the Conditional Random Number Generator (CRNG) self-test runs. If the test fails, a syslog message is logged. This is done as per the Federal Information Processing Standards (FIPS) recommendation. However, the impact of this is harmless as the random number is generated again.

There are two types of random number generators (RNGs) in NX-OS:

FIPS RNG which is implemented in the openssl crypto library
Non-FIPS RNG which is the linux RNG

As per FIPS, all RNGs must implement the Conditional Random Number Generator Test (CRNGT). The test compares the current generated random number with the previous one. If the numbers are the same, then a syslog message is generated and one more random number is generated.

The test is run in order to ensure the uniqueness of the random number. There is no functional impact as the number is regenerated.
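
The compare-and-regenerate behavior described above can be sketched as follows. This is an illustration of the idea only, not the NX-OS or OpenSSL code: the class name, the 64-bit value size, and the log text are assumptions, and the sketch keeps regenerating until the value differs from the previous one.

```python
import secrets


class ContinuousTestedRng:
    """Wrap a random source with a compare-to-previous check (CRNGT-style)."""

    def __init__(self, source=None, log=print):
        # Hypothetical default source: 64 random bits from the OS.
        self._source = source or (lambda: secrets.randbits(64))
        self._log = log
        self._previous = None

    def next_value(self):
        value = self._source()
        # If the new value equals the previous one, log it and draw again.
        while value == self._previous:
            self._log("self-test failure: repeated random value, regenerating")
            value = self._source()
        self._previous = value
        return value


if __name__ == "__main__":
    # Deterministic source so that the repeated values are visible.
    canned = iter([7, 7, 9, 9, 9, 4])
    rng = ContinuousTestedRng(source=lambda: next(canned))
    print([rng.next_value() for _ in range(3)])
```

The caller always receives a value that differs from the previous one, which is why the logged message has no functional impact.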

Solution

This message is harmless to system operation. In Cisco NX-OS Version 5.2x and later, the severity of the message is lowered from 2, so it is no longer seen with the default logging configuration. This logging occurs as part of internal NX-OS self-tests for various functions on the switch.

This is a known software defect in Cisco bug ID CSCtn70083.

Revision History

Revision	Publish Date	Comments
1.0	15-May-2015	Initial Release

Contributed by Cisco Engineers

Ivan Shirshin and Naveen Venkateshaiah
Cisco TAC Engineers.

This Document Applies to These Products

Nexus 7000 Series Switches
