This document provides a brief explanation and solutions for common hardware and architecture issues for Cisco Nexus 7000 Series switches that run Cisco NX-OS system software.
Note: The exact format of the syslog and error messages that this document describes can vary slightly. The variation depends on the software release that runs on the Supervisor Engine.
Problem: SpineControlBus Failure
The spine control test fails for the Nexus 7000 Supervisor:
Nexus7000# show module internal exceptionlog module 5
...
System Errorcode : 0x418b0022 Spine control test failed
Error Type : Warning
PhyPortLayer : 0x0
Port(s) Affected : none
Error Description : Module 10 Spine Control Bus test Failed
...
11) SpineControlBus E
Error code ------------------> DIAG TEST ERR DISABLE
Total run count -------------> 1597800
Last test execution time ----> Mon May 27 21:57:17 2013
First test failure time -----> Sun Nov 20 00:30:55 2011
Last test failure time ------> Mon May 27 21:57:17 2013
Last test pass time ---------> Mon May 27 21:56:47 2013
Total failure count ---------> 33
Consecutive failure count ---> 1
Last failure reason ---------> Spine control test failed
Problem: NVRAM Test Failure

Nexus7000# show diagnostic events
1) Event:E_DEBUG, length:97, at 9664 usecs after Wed Dec 5 01:03:42 2012
 Event_ERROR: TestName->NVRAM TestingType->health monitoring module->5 Result->fail Reason->

#show diagnostic result module 5 test NVRAM detail
4) NVRAM-------------------------> E
Error code ------------------> DIAG TEST ERR DISABLE
Total run count -------------> 52596
Last test execution time ----> Wed Dec 5 01:03:41 2012
First test failure time -----> Tue Dec 4 23:28:45 2012
Last test failure time ------> Wed Dec 5 01:03:42 2012
Last test pass time ---------> Tue Dec 4 23:23:41 2012
Total failure count ---------> 20
Consecutive failure count ---> 20
Last failure reason ---------> Bad blocks found on nvram
This is either a hardware issue, a Supervisor Engine failure, or a transient issue.
Rerun the NVRAM test in order to see if this is a false alarm. Enter these commands in order to disable and reenable the diagnostic test (the example given is for problem module 5):
no diagnostic monitor module 5 test NVRAM
diagnostic monitor module 5 test NVRAM
Enter the show diagnostic result module 5 test NVRAM detail command in order to see the results of the test command.
If the NVRAM test fails again, reseat module 5. Observe the results of the show diagnostic result module 5 and show module commands.
If the module fails again, raise a Return Material Authorization (RMA) request for the Supervisor in the problem slot.
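If you track these counters across many modules, the detail output can be scraped with a short script. This is an illustrative sketch, not a Cisco tool; the field names are taken from the sample output above:

```python
import re

def parse_diag_detail(output):
    """Extract failure counters from 'show diagnostic result ... detail' text."""
    fields = {
        "total_failure_count": r"Total failure count\s*-+>\s*(\d+)",
        "consecutive_failure_count": r"Consecutive failure count\s*-+>\s*(\d+)",
    }
    result = {}
    for name, pattern in fields.items():
        m = re.search(pattern, output)
        result[name] = int(m.group(1)) if m else None
    return result

sample = """
Total failure count ---------> 20
Consecutive failure count ---> 20
Last failure reason ---------> Bad blocks found on nvram
"""
print(parse_diag_detail(sample))
```

A consecutive failure count that keeps climbing points at a persistent fault rather than a transient one.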
Problem: Compact Flash Failure
One or more of these symptoms are seen on the Supervisor 2/Supervisor 2E:
DEVICE_TEST-2-COMPACT_FLASH_FAIL: Module 5 has failed test CompactFlash 20 times on device Compact Flash due to error The compact flash power test failed.
Unable to save config.
Diagnostic test failures:
Test results: (. = Pass, F = Fail, I = Incomplete, U = Untested, A = Abort, E = Error disabled)
7) CompactFlash E
Error code ------------------> DIAG TEST ERR DISABLE
Total run count -------------> 23302
Last test execution time ----> Sun Apr 13 10:07:30 2014
First test failure time -----> Sun Apr 13 00:37:41 2014
Last test failure time ------> Sun Apr 13 10:07:40 2014
Last test pass time ---------> Sun Apr 13 00:07:41 2014
Total failure count ---------> 20
Consecutive failure count ---> 20
Last failure reason ---------> The compact flash power test failed
Next Execution time ---------> Sun Apr 13 10:37:30 2014
Second-generation Nexus 7000 Supervisors ship with two identical eUSB flashes for redundancy. The flashes provide a repository for bootflash, configurations, and other pertinent information. The two flashes are configured as a Redundant Array of Independent Disks (RAID) 1 array, which implements internal mirroring. With this redundancy, a Supervisor can function with the loss of one of the flashes, but not both.
There are a few instances in the field where one or both of these flashes are marked as bad by the RAID software after several months or years in service. After a reset/reboot of the board, the failed flashes are rediscovered as healthy at the next boot up.
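The mirroring behavior described above can be sketched in a few lines. This is a toy model for illustration only (the class and its behavior are invented for the example), not the actual RAID software:

```python
class Raid1:
    """Toy RAID 1 mirror: every write goes to both members; reads fall
    back to the surviving member if one is marked failed."""

    def __init__(self):
        self.members = [dict(), dict()]   # stand-ins for the two eUSB flashes
        self.failed = [False, False]

    def write(self, key, data):
        # Mirror the write to every healthy member.
        for i, member in enumerate(self.members):
            if not self.failed[i]:
                member[key] = data

    def read(self, key):
        # Any healthy member can serve the read.
        for i, member in enumerate(self.members):
            if not self.failed[i] and key in member:
                return member[key]
        raise IOError("both mirror members unavailable")

r = Raid1()
r.write("startup-config", b"...")
r.failed[0] = True               # one flash marked bad by the RAID software
print(r.read("startup-config"))  # still readable from the surviving mirror
```

The same logic explains the failure mode: losing one flash is survivable, losing both is not.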
Complete these steps in order to verify if this is or is not a hardware issue:
Reload the problem Supervisor, if possible.
If the issue is seen after reload, you need a hardware replacement.
If the issue is fixed by reload, the root cause is related to Cisco bug ID CSCus22805.
Problem: N7K-M132XP-12 Linecard PortLoopback Test Failure
The linecard reports a diagnostics failure because the PortLoopback test failed 10 consecutive times:
DIAG_PORT_LB-2-PORTLOOPBACK_TEST_FAIL Module:16 Test:PortLoopback failed 10 consecutive times. Faulty module:Module 16 affected ports:5,7 Error:Loopback test failed. Packets lost on the LC at the Queueing engine ASIC
MODULE-4-MOD_WARNING Module 16 (serial: XXXX) reported warning on ports 16/5-16/5 (Ethernet) due to Loopback test failed. Packets lost on the LC at the Queueing engine ASIC in device 78 (device error 0x41830059)
This is a warning message and in most cases indicates a hardware issue with the port.
Check for Cisco bug ID CSCtn81109 and Cisco bug ID CSCti95293 first, as this could be a software issue.
Reseat the module first in order to reinitialize the card and rerun bootup hardware sanity tests. If the diagnostics tests still show failure for the same card, replace the card.
Reload the card at a convenient time and collect the outputs of these commands:
show logging log
show diagnostic result module all detail
Alternatively, you can rerun only this specific test and do not need to reload the card. This example shows module 16:
show diagnostic result module 16
diagnostic clear result module all
(config)# no diagnostic monitor module 16 test 5
(config)# diagnostic monitor module 16 test 5
diagnostic start module 16 test 5
show diagnostic result module 16 test 5
Problem: BE2 Arbiter Error

These errors appear and there is a possible module reload:
2013 Mar 27 00:40:23 DC3-7000-PRODD2-A23 MODULE-4-MOD_WARNING Module 9 (serial: XXX) reported warning on ports 9/1-9/3 (Unknown) due to BE2 Arbiter experienced an error in device 65 (device error 0xc410f613)
This is a hardware failure caused by parity errors or hardware issues on the daughter card.
Check the output of these commands:
show system reset-reason module X
show logging onboard internal reset-reason
show module internal event-history module X
If your version of Cisco NX-OS is earlier than Version 4.2, upgrade to a newer version in order to ensure the fixes for these software defects are integrated (this minimizes the possibility of parity errors):
Cisco bug ID CSCso72230 - L1 D-cache enabled 8541 CPU crashes with L1 D-cache parity errors
Cisco bug ID CSCsr90831 - L1 D-cache enabled 8541 CPU crashes with L1 D-cache Push parity errors
If the errors repeatedly occur, reseat the card and monitor.
If the errors are still repeating, replace the problem module.
Problem: N7K-M224XP-23L chico serdes sync loss Error
These errors appear on the module:
%MODULE-4-MOD_WARNING: Module # (Serial number: XXXX) reported warning Ethernet#/# due to chico serdes sync loss in device DEV_SKYTRAIN (device error 0xc9003600)
These errors indicate that there is a sync loss issue between module # and the Xbar/ASIC. In most cases the cause is a hardware failure of the module.
If your version of Cisco NX-OS is earlier than 6.1(4) and the message does not appear continuously, you can be affected by Cisco bug ID CSCud91672. The cause of the defect is that the NX-OS serdes settings differ from the diagnostic settings on the two channels between SKT <--> SAC.
Collect the output of these commands:
show module internal event-history module X
show module internal activity module X
show module internal exception-log module X
show module internal event-history errors
show logging last 200
show logging nvram
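When you review the collected logs, it can help to tally how often the sync-loss warning fires per module. A minimal sketch, assuming the log lines follow the format shown above (the helper itself is hypothetical):

```python
import re
from collections import Counter

SYNC_LOSS = re.compile(r"Module (\d+) .*chico serdes sync loss")

def count_sync_loss(log_lines):
    """Tally 'chico serdes sync loss' warnings per module number."""
    counts = Counter()
    for line in log_lines:
        m = SYNC_LOSS.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts

log = [
    "%MODULE-4-MOD_WARNING: Module 3 (Serial number: X) reported warning "
    "Ethernet3/1 due to chico serdes sync loss in device DEV_SKYTRAIN",
    "%MODULE-4-MOD_WARNING: Module 3 (Serial number: X) reported warning "
    "Ethernet3/2 due to chico serdes sync loss in device DEV_SKYTRAIN",
]
print(count_sync_loss(log))
```

A continuous stream of warnings from one module points at hardware; sporadic messages across software versions earlier than 6.1(4) point at the defect described above.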
Upgrade the switch to NX-OS Version 6.1(4) or later in order to rule out this defect as the cause.
Perform this test in order to confirm if the card is faulty instead of the xbar or chassis slot:
Move the problem module to another free slot in the chassis.
If you have a spare module, insert it in a problem slot.
If the errors are not seen after step 1, insert the module back in the problem slot and verify.
Problem: N7K-F248XP-25 PrimaryBootROM and SecondaryBootROM Test Failures
Module N7K-F248XP-25 fails in both PrimaryBootROM and SecondaryBootROM tests:
show module internal exceptionlog module 1 | i Error|xception ********* Exception info for module 1 ******** exception information --- exception instance 1 ---- Error Description : Secondary BootROM test failed
exception information --- exception instance 2 ---- Error Description : Primary BootROM test failed
This usually is seen due to BIOS file corruption or linecard hardware failure.
Cisco bug ID CSCuf82089 adds code to show more descriptive information about such failures for better diagnostics. For example, it shows a failed component instead of a currently null value.
In some cases the issue is caused by BIOS corruption on the module. Enter the install module X bios forced command in order to resolve this. Note that this command can potentially impact service. The recommendation is to execute it only during a maintenance window.
Complete these steps in order to resolve the issue:
Schedule a maintenance window and enter the install module X bios forced command as a possible workaround. Only enter this command during a maintenance window in order to avoid potential service impact.
If step 1 does not help or it is not possible to have a maintenance window for this action, replace the module. This example output shows a failed attempt:
Nexus7000# install module 1 bios forced Warning: Installing Bios forcefully...! Warning: Please do not remove or power off the module at this time Upgrading primary bios Started bios programming .... please wait [# 0% ] BIOS install failed for module 1, Error=0x40710027(BIOS flash-type verify failed) BIOS is OK ... Please try the command again...
Problem: Temperature Sensor Failure
This error is seen on the platform:
%PLATFORM-4-MOD_TEMPFAIL: Module-2 temperature sensor 7 failed
This is an intermittent issue with the temperature/voltage block in the ASIC under certain conditions due to internal ASIC timing. Cisco bug ID CSCtw79052 describes the known cause for this issue.
This is a timing issue between the ASIC which latches the temperature internally and the software that samples the valid bit. The issue is that it can hit on any of the 12 Clipper instances. There is no particular trigger for this problem and it is intermittent. This problem does not impact service and it arises because the temperature read logic has an issue that requires more retries in the driver.
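The driver fix essentially retries the register read until the valid bit is latched. A schematic illustration, with an invented sensor interface that stands in for the real ASIC access:

```python
class SensorReadError(Exception):
    pass

def read_temperature(read_register, max_retries=5):
    """Poll a temperature register until its valid bit is set.

    read_register() is a stand-in for the real ASIC access; it returns
    (valid_bit, temperature). Retrying masks the internal timing race
    between the ASIC latching a sample and software sampling the valid bit.
    """
    for _ in range(max_retries):
        valid, temp = read_register()
        if valid:
            return temp
    raise SensorReadError("temperature sensor read failed after retries")

# Simulated register: the valid bit reads clear on the first attempt,
# set on the second, which mimics the intermittent timing miss.
_attempts = {"n": 0}
def flaky_register():
    _attempts["n"] += 1
    return (_attempts["n"] >= 2, 42)

print(read_temperature(flaky_register))  # 42, recovered on the second attempt
```

Without enough retries, a single missed latch is reported as a sensor failure even though the hardware is fine, which matches the harmless intermittent symptom described above.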
Collect the output from these commands and check against Cisco bug ID CSCtw79052:
Problem: Xbar Error/C7010-FAB-1 in Power Down State
The C7010-FAB-1 is in a power down state and these errors appear:
%PLATFORM-3-EJECTOR_STAT_CHANGED: Ejectors' status in slot 13 has changed, Left Ejector is OPEN, Right Ejector is CLOSE
%PLATFORM-3-EJECTOR_STAT_CHANGED: Ejectors' status in slot 13 has changed, Left Ejector is OPEN, Right Ejector is OPEN
%PLATFORM-2-XBAR_REMOVE: Xbar 3 removed (Serial number XXX)
Xbar Ports Module-Type                         Model              Status
---- ----- ----------------------------------- ------------------ ----------
3    0     Fabric Module                       N/A                powered-dn

Xbar Power-Status Reason
---- ------------ ---------------------------
3    powered-dn   failure(powered-down) since maximum number of bringups were exceeded
Alternatively, xbar ASIC errors appear:
%MODULE-4-MOD_WARNING: Module 15 (serial: XXX) reported warning due to X-bar Interface ASIC Error in device 70 (device error 0xc4600248)
%OC_USD-SLOT15-2-RF_CRC: OC2 received packets with CRC error from MOD 15 through XBAR slot 3/inst 2
This issue is due to either a faulty or bad-seated xbar module, or a bad chassis slot.
Check the output of these commands:
show logging nvram
show module internal exception-log
show module internal event-history
show system reset-reason
show environment | in xbar
show system internal platform internal event-history xbar X (X is the xbar number)
show system internal xbar-client internal event-history errors
show system internal xbar all
show system internal xbar event-history errors
Perform a hard reseat of the xbar module and check the status.
If the reseat fails, test the xbar in another slot, or test the same slot with another xbar module, in order to ensure the chassis is fine.
Replace the faulty hardware based on the tests performed in steps 2 and 3.
Problem: N7K-C7010-FAN-F Failed Fan Module
One or more of these fan failure symptoms are observed:
%PLATFORM-5-FAN_STATUS: Fan module 3 (Serial number XXX) Fan3(fab_fan1) current-status is FAN_FAIL
#show hardware
----------------------------------
Chassis has 4 Fan slots
----------------------------------
Fan3(fab_fan1) failed
Model number is N7K-C7010-FAN-F
...
In most cases this is a failure of the fan or the chassis slot.
Check the output of these commands:
show log nvram
show environment fan
Test this N7K-C7010-FAN-F in another good chassis.
Replace the fan or chassis based on the results of steps 1 and 2.
Problem: %PLATFORM-2-PS_CAPACITY_CHANGE Power Supply Alarm
Alarms are seen for power supply capacity changes, sometimes very frequently.
%PLATFORM-2-PS_CAPACITY_CHANGE: Power supply PS2 changed its capacity. possibly due to On/Off or power cable removal/
2013 Oct 17 17:06:40 ... last message repeated 14 times
This issue is due to either a faulty or disconnected power cable, or a power supply failure.
Check the output of the show env power detail command and examine the power supply status. In this example output, both chords are connected, but the second shows only a 1200W capacity instead of the 3000W expected for 220V AC on the N7K-AC-6.0KW. The power source tested OK, so replace the power supply.
PS_2 total capacity: 4200 W Voltage:50V
chord 1 capacity: 3000 W
chord 1 connected to 110v AC
chord 2 capacity: 1200 W
chord 2 connected to 220v AC
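A quick sanity check of the reported per-chord capacity against the input voltage can be automated. This sketch assumes the nominal figures implied in this section for the N7K-AC-6.0KW (roughly 1200 W per chord at 110V and 3000 W per chord at 220V); verify the exact values for your power supply:

```python
# Nominal per-chord capacity for an N7K-AC-6.0KW, per the figures in
# this section. These values are assumptions for the example.
EXPECTED_W = {110: 1200, 220: 3000}

def check_chord(voltage, reported_watts):
    """Flag a chord whose reported capacity does not match its input voltage."""
    expected = EXPECTED_W.get(voltage)
    if expected is None:
        return "unknown input voltage"
    if reported_watts == expected:
        return "ok"
    return (f"mismatch: {reported_watts} W reported, "
            f"{expected} W expected at {voltage} V")

print(check_chord(220, 1200))  # the faulty chord from the example output
print(check_chord(220, 3000))
```

A mismatch with a verified-good power source, as in the example above, points at the power supply itself.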
Problem: %PLATFORM-5-PS_STATUS: PowerSupply X PS_FAIL Alarm
This alert appears on the platform:
%PLATFORM-5-PS_STATUS: PowerSupply 3 current-status is PS_FAIL
%PLATFORM-2-PS_FAIL: Power supply 3 failed or shut down (Serial number xxxxx)
This alert is due to either a faulty or disconnected power cable, or a power supply failure.
Check the output of these commands:
show environment power detail
Reseat the failed power supply. Use the redundant power supply in order to ensure the power does not go offline.
Submit an RMA for the power supply. Use the redundant power supply in order to ensure the power does not go offline.
Problem: FEX Power Supply Failure Alarms

%SATCTRL-FEX104-2-SOHMS_DIAG_ERROR: FEX-104 Module 1: Runtime diag detected major event: Voltage failure on power supply: 1
%SATCTRL-FEX104-2-SOHMS_DIAG_ERROR: FEX-104 System minor alarm on power supply 1: failed
%SATCTRL-FEX104-2-SOHMS_DIAG_ERROR: FEX-104 Recovered: System minor alarm on power supply 1: failed
Check for hardware and power issues. If you have a software issue, error messages continue even after you swap hardware.
Methods to resolve these issues include:
Reseat the FEX power supply. Use the redundant power supply in order to ensure the power does not go offline.
Submit an RMA for the FEX power supply. Use the redundant power supply in order to ensure the power does not go offline.
Repeat these steps for the second power supply.
Review and answer these questions in order to help define the circumstances of the failure:
How many FEX power supplies are affected?
For a minor alarm, did you swap the input source, and did that make any difference?
Do you have other FEX power supplies that have issues?
Do you have any other boxes of the same power source?
Did you replace the power cord?
Was there a power surge or glitch in the environment?
Gather output from these commands in order to investigate the failures:
Problem: N7K-AC-6.0KW Power Supplies are Reported as Fail
Emerson N7K-AC-6.0KW power supplies are reported as Fail/Shut, but the switch runs fine and a non-zero actual output is seen for the problem power supply.
On a supply with both inputs active, when an input is disconnected, reconnected, and disconnected again within 1.5 seconds the supply can latch an under-voltage fault and NX-OS can flag the power supply as failed. In another variation, on a supply with two inputs, remove one input and wait 20 to 30 seconds. The supply might intermittently set the Internal Fault alarm and NX-OS reports the power supply as failed.
Cisco bug ID CSCty78612 makes changes to the firmware on the power supply units in order to fix the issue.
Cisco bug ID CSCuc86262 adds a software enhancement in order to recover from these false failures. NX-OS now autonomously monitors the Power Supply Unit (PSU) status and modifies it to the appropriate status if the reported state differs from the real state.
Enter the show env power detail command and verify the actual output in order to verify the false failure:
Nexus7000# show env power
Power Supply:
Voltage: 50 Volts
Power                               Actual      Total
Supply  Model                       Output      Capacity    Status
                                    (Watts )    (Watts )
------- ------------------- ----------- ----------- --------------
1       N7K-AC-6.0KW                0 W         0 W    Shutdown
2       N7K-AC-6.0KW             3888 W      6000 W    Fail/Shut
The erroneous Fail/Shut status is cleared when you power off/on the PSU.
NX-OS Versions 5.2(9), 6.1(3), 6.2(2) and later include the enhancement from Cisco bug ID CSCuc86262, which clears false Fail/Shut notifications when the power supply operates normally at runtime and thus avoids an unnecessary RMA.
Problem: Software Packet Drops
Some large packets are dropped when there is a high rate of IP packets with a length longer than the configured MTU on the egress interface.
This is expected behavior. When the system receives an IP packet with a length longer than the configured MTU on the egress interface, it sends the packet to the control plane, which takes care of the fragmentation. In NX-OS Release 4.1.3 and later, a rate-limiter is applied to such punted packets, which limits them to a maximum of 500 pps by default.
This is a known software defect in Cisco bug ID CSCsu01048.
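To see why a high rate of oversized packets stresses the control plane, consider how many fragments each punted packet produces. A small sketch of standard IPv4 fragment arithmetic (illustrative only):

```python
def fragment_count(packet_len, mtu, ip_header=20):
    """Number of IP fragments needed to carry a packet over the given MTU.

    Per standard IPv4 fragmentation, each fragment's payload must be a
    multiple of 8 bytes except for the last fragment.
    """
    if packet_len <= mtu:
        return 1
    payload = packet_len - ip_header
    per_fragment = (mtu - ip_header) // 8 * 8   # largest 8-byte-aligned payload
    return -(-payload // per_fragment)           # ceiling division

# A 9000-byte packet egressing a 1500-byte MTU interface:
print(fragment_count(9000, 1500))  # 7 fragments
```

With the default 500 pps rate-limiter on punted packets, any oversized traffic beyond that rate is dropped rather than fragmented, which produces the observed partial loss.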
Problem: USER-2-SYSTEM_MSG FIPS Self-test Failure System Error
The "USER-2-SYSTEM_MSG FIPS self-test failure in DCOS_rand - netstack" error displays.
Whenever a random number is generated, the Conditional Random Number Generator (CRNG) self-test runs. If the test fails, a syslog message is logged. This is done as per the Federal Information Processing Standards (FIPS) recommendation. However, the impact of this is harmless as the random number is generated again.
There are two types of random number generators (RNGs) in NX-OS:
FIPS RNG which is implemented in the openssl crypto library
Non-FIPS RNG which is the linux RNG
As per FIPS, all RNGs must implement the Conditional Random Number Generator Test (CRNGT). The test compares the current generated random number with the previous one. If the numbers are the same, then a syslog message is generated and one more random number is generated.
The test is run in order to ensure the uniqueness of the random number. There is no functional impact because the number is regenerated.
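The CRNGT logic described above can be sketched as follows. This is an illustration of the general FIPS-style continuous test, not the NX-OS implementation:

```python
import os

BLOCK = 16  # compare fixed-size output blocks; the width here is arbitrary

def crngt_next(state, generate=lambda: os.urandom(BLOCK)):
    """Continuous RNG test: reject an output equal to the previous one.

    Mirrors the check described above: if two consecutive outputs match,
    log the event and draw again rather than hand out the value.
    """
    value = generate()
    while value == state["previous"]:
        print("CRNGT: repeated output detected, regenerating")  # syslog analogue
        value = generate()
    state["previous"] = value
    return value

state = {"previous": None}
a = crngt_next(state)
b = crngt_next(state)
assert a != b  # consecutive outputs are never identical
```

A repeat only triggers a log message and a redraw, which is why the syslog is harmless: the caller always receives a fresh value.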
This message is harmless to system operation. From Cisco NX-OS Version 5.2x and later, the severity of the message is lowered from 2, so it is no longer seen with the default logging configuration. This logging occurs as part of internal NX-OS self-tests for various functions on the switch.
This is a known software defect in Cisco bug ID CSCtn70083.