This document discusses how to troubleshoot Cisco Catalyst 6000/6500 Series Switch Supervisor Engine switch processor (SP) and Multilayer Switch Feature Card (MSFC) route processor (RP) crashes.
There are no specific requirements for this document.
The information in this document is based on the Cisco Catalyst 6000/6500 Series Switch Supervisors and MSFC modules.
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.
Refer to Cisco Technical Tips Conventions for more information on document conventions.
A Catalyst 6500/6000 with an SP configuration register that allows break, for example 0x2, and that receives a console break signal enters ROMmon diagnostic mode. The system appears to crash.
This example switch output indicates that the switch entered ROMmon diagnostic mode from a switch processor console break signal.
Note: The RP configuration register is 0x2102.
6500_IOS#show version
Cisco Internetwork Operating System Software
IOS (tm) c6sup2_rp Software (c6sup2_rp-PS-M), Version 12.1(13)E14, EARLY DEPLOYMENT RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2004 by Cisco Systems, Inc.
Compiled Tue 30-Mar-04 01:56 by pwade
Image text-base: 0x40008C00, data-base: 0x417A6000
ROM: System Bootstrap, Version 12.1(4r)E, RELEASE SOFTWARE (fc1)
BOOTLDR: c6sup2_rp Software (c6sup2_rp-PS-M), Version 12.1(13)E14, EARLY DEPLOYMENT RELEASE SOFTWARE (fc1)
6500_IOS uptime is 31 minutes
Time since 6500_IOS switched to active is 31 minutes
System returned to ROM by power-on (SP by abort at PC 0x601061A8)
System image file is "slot0:c6sup12-ps-mz.121-13.E14"
cisco Catalyst 6000 (R7000) processor with 227328K/34816K bytes of memory.
Processor board ID SAD053701CF
R7000 CPU at 300Mhz, Implementation 39, Rev 2.1, 256KB L2, 1024KB L3 Cache
Last reset from power-on
X.25 software, Version 3.0.0.
Bridging software.
1 Virtual Ethernet/IEEE 802.3 interface(s)
192 FastEthernet/IEEE 802.3 interface(s)
18 Gigabit Ethernet/IEEE 802.3 interface(s)
381K bytes of non-volatile configuration memory.
16384K bytes of Flash internal SIMM (Sector size 512K).
Configuration register is 0x2102
The solution is to re-configure the configuration register and reload the system. Complete these steps:
In global configuration mode, issue the config-register 0x2102 command, and set the configuration register to 0x2102 for both the RP and the SP.
6500_IOS#config terminal
Enter configuration commands, one per line. End with CNTL/Z.
6500_IOS(config)#config-register 0x2102
6500_IOS(config)#end
Issue the show bootvar command in order to verify the configuration register value at the next reload.
6500_IOS#show bootvar
BOOT variable = slot0:c6sup12-ps-mz.121-13.E14,1
CONFIG_FILE variable =
BOOTLDR variable =
Configuration register is 0x2102
Issue the remote command switch show bootvar command in order to verify that the configuration register on the SP also changed.
6500_IOS#remote command switch show bootvar
6500_IOS-sp#
BOOT variable = slot0:c6sup12-ps-mz.121-13.E14,1
CONFIG_FILE variable =
BOOTLDR variable =
Configuration register is 0x2 (will be 0x2102 at next reload)
Reload the switch for the new SP configuration register setting to take effect.
Note: You can issue the copy running-config startup-config command at this point in order to save the configuration. However, this step is not necessary because the configuration register setting is not part of the startup or running configuration.
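The difference between the 0x2 and 0x2102 values can be illustrated with a short Python sketch. It is not part of any Cisco tool. It assumes the commonly documented bit layout of the 16-bit configuration register (bits 0-3 boot field, bit 6 ignore startup configuration, bit 8 break disabled), and the function name decode_confreg is hypothetical. Verify the bit meanings against the documentation for your platform.

```python
# Commonly documented configuration register bits (a subset, assumed here
# for illustration; confirm against your platform documentation).
BOOT_FIELD_MASK = 0x000F      # bits 0-3: boot field
IGNORE_CONFIG_BIT = 0x0040    # bit 6: ignore the startup configuration
BREAK_DISABLED_BIT = 0x0100   # bit 8: ignore the console break signal


def decode_confreg(value: int) -> dict:
    """Return the register settings that matter for this troubleshooting case."""
    return {
        "boot_field": value & BOOT_FIELD_MASK,
        "ignores_startup_config": bool(value & IGNORE_CONFIG_BIT),
        "break_enabled": not (value & BREAK_DISABLED_BIT),
    }


if __name__ == "__main__":
    # 0x2 is the SP value from the example, 0x2102 is the recommended value,
    # and 0x2142 is the value used during password recovery.
    for reg in (0x2, 0x2102, 0x2142):
        print(f"0x{reg:04X}: {decode_confreg(reg)}")
```

Under these assumptions, 0x2 leaves the break signal enabled, which matches the behavior described in this section, while 0x2102 disables it.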
During a password recovery procedure on a Supervisor Engine 720, the switch can crash when you send the break signal in order to gain access to the console of the RP.
*** System received a Software forced crash ***
signal= 0x17, code= 0x24, context= 0x4269f6f4
PC = 0x401370d8, Cause = 0x3020, Status Reg = 0x34008002
Use this password recovery workaround procedure in order to prevent the Supervisor from crashing when you perform a password recovery:
Press the Break key on the terminal keyboard directly after the RP gains control of the console port.
On the Catalyst 6500 that runs Cisco IOS®, the SP boots first. It then turns control over to the RP. After the RP gains control, initiate the break sequence. The RP has gained control of the console port when you see this message. (Do not initiate the break sequence until you see this message):
00:00:03: %OIR-6-CONSOLE: Changing console ownership to route processor
Tip: Refer to the Standard Break Key Sequence Combinations During Password Recovery for the key combinations.
Enter the confreg 0x2142 command at the rommon 1> prompt, within 10 seconds, in order to boot from Flash without loading the configuration.
Reload the switch and continue to configure the new password.
In global configuration mode, issue the config-register 0x2102 command, or use the original configuration register value in place of 0x2102.
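The timing rule in this procedure (do not send the break until the RP owns the console) can be sketched as a simple check on captured console output. This is a minimal Python illustration, not a Cisco utility. The function name safe_to_send_break is hypothetical, and the sketch assumes that you already have the console output as a list of text lines.

```python
# Console message that signals the RP now owns the console port
# (text taken from the message shown in this document).
RP_OWNERSHIP_MARKER = "%OIR-6-CONSOLE: Changing console ownership to route processor"


def safe_to_send_break(console_lines) -> bool:
    """Return True once the RP ownership message has appeared in the output."""
    return any(RP_OWNERSHIP_MARKER in line for line in console_lines)


if __name__ == "__main__":
    captured = [
        "System Bootstrap output ...",  # placeholder line, not real output
        "00:00:03: %OIR-6-CONSOLE: Changing console ownership to route processor",
    ]
    print(safe_to_send_break(captured))
```

A terminal script that automates password recovery could poll a check like this before it sends the break sequence.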
The Cisco Catalyst 6000/6500 Switches can unexpectedly reload due to an unknown cause. The output of the show version command displays an error message similar to this:
System returned to ROM by unknown reload cause - suspect boot_data[BOOT_COUNT] 0x0, BOOT_COUNT 0, BOOTDATA 19 (SP by power-on)
This message indicates that the firmware of the specified module has detected a parity error. The system automatically resets the module in order to recover from the error. A crashinfo file also appears on this module. The error message can be due to a transient condition or a hardware failure. If the error message occurs only once, it is most likely a transient issue, and the system recovers from it automatically. The parity error symptom can be identified by the CPO_ECC in the cache memory. The ECC represents a parity error that the system itself has corrected.
These are the two kinds of parity errors:
Soft parity errors
These errors occur when a Single Event Latch up (SEL) happens within the chip. When the CPU references the affected data, such errors either crash the system (if the error is in an area that is not recoverable) or cause other subsystems to recover (for example, a CyBus complex restarts if the error was in the packet memory [MEMD]). In case of a soft parity error, there is no need to swap the board or any of the components.
Hard parity errors
These errors occur when there is a chip or board failure that corrupts data. In this case, you need to re-seat or replace the affected component, which usually involves a memory chip swap or a board swap. There is a hard parity error when multiple parity errors occur at the same address. There are more complicated cases that are harder to identify. In general, if you see more than one parity error in a particular memory region in a relatively short period, you can consider it to be a hard parity error. The error message looks similar to this:
Mar 9 12:12:24.427 GMT: %PM_SCP-SP-1-LCP_FW_ERR: Module 6 is experiencing the following error: Pinnacle #0 PB parity error. Tx path. Status=0x0042
Studies have shown that soft parity errors are 10 to 100 times more frequent than hard parity errors. Therefore, Cisco highly recommends you wait for a hard parity error before you replace anything. This greatly reduces the impact on your network.
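The rule of thumb above (more than one parity error in the same memory region within a relatively short period suggests a hard parity error) can be sketched in Python. This is an illustration only. The function name classify_parity_errors, the region labels, and the 30-day window are assumptions made for the example, because this document does not define how short "a relatively short period" is.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def classify_parity_errors(events, window=timedelta(days=30)):
    """Label each region 'soft' or 'suspected hard'.

    events: iterable of (datetime, region) tuples, where region is any
    hashable label such as a module number or a memory address.
    window: assumed meaning of "a relatively short period" (hypothetical).
    """
    by_region = defaultdict(list)
    for when, region in events:
        by_region[region].append(when)

    result = {}
    for region, times in by_region.items():
        times.sort()
        # Two consecutive errors in the same region inside the window
        # are treated as a repeat.
        repeated = any(later - earlier <= window
                       for earlier, later in zip(times, times[1:]))
        result[region] = "suspected hard" if repeated else "soft"
    return result


if __name__ == "__main__":
    sample = [
        (datetime(2024, 3, 9, 12, 12), "module 6"),
        (datetime(2024, 3, 10, 8, 0), "module 6"),
        (datetime(2024, 1, 1, 0, 0), "module 3"),
    ]
    print(classify_parity_errors(sample))
```

A region that logs a single error stays labeled soft, which is consistent with the recommendation not to replace hardware after one event.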
The message indicates that the system controller has detected an error. Reload the device. If this message occurs again, replace the faulty memory or the MSFC card.
%SYSTEM_CONTROLLER-3-FATAL: An unrecoverable error has been detected. The system is being reset.
%Software-forced reload
When a fan tray fails or a power supply is turned off, the Supervisor modules of Cisco Catalyst Switches that run Cisco IOS Software Release 12.1(19)E1 might crash. The issue is documented in Cisco bug ID CSCeb51698. Upgrade the switch to a Cisco IOS release that is not affected by this defect.
If you suspect that the switch has reset by itself, issue the show version command in order to verify the switch uptime, which is the time since the last reset. Issue the show log command in order to look at the reboot history, as this example shows. View this command output in order to see if there are any exceptions recorded.
sup2a> (enable)show version
WS-C6506 Software, Version NmpSW: 6.3(10)
!--- Output is suppressed.
Uptime is 7 days, 4 hours, 27 minutes
sup2a> (enable)show log
Network Management Processor (ACTIVE NMP) Log:
Reset count: 1
Re-boot History: Jan 06 2003 10:35:56 0
Bootrom Checksum Failures: 0 UART Failures: 0
Flash Checksum Failures: 0 Flash Program Failures: 0
Power Supply 1 Failures: 0 Power Supply 2 Failures: 0
Swapped to CLKA: 0 Swapped to CLKB: 0
Swapped to Processor 1: 0 Swapped to Processor 2: 0
DRAM Failures: 0
Exceptions: 0
Loaded NMP version: 6.3(10)
Software version: slot0:cat6000-sup2.6-3-10.bin
Reload same NMP version count: 1
Last software reset by user: 1/6/2003,10:35:35
EOBC Exceptions/Hang: 0
Heap Memory Log:
Corrupted Block = none
This show log command output displays no software exceptions. The last reboot of the switch was on Jan 06 2003. The reboot time matches the time in the Last software reset by user field.
This show log command output shows an exception that was recorded at the time of the last reboot.
esc-cat5500-b (enable)show log
Network Management Processor (STANDBY NMP) Log:
Reset count: 38
Re-boot History: Oct 14 2001 05:48:53 0, Jul 30 2001 06:51:38 0
Jul 28 2001 20:31:40 0, May 16 2001 21:15:39 0
May 02 2001 01:02:53 0, Apr 26 2001 21:42:24 0
Apr 07 2001 05:23:42 0, Mar 25 2001 02:48:03 0
Jan 05 2001 00:21:39 0, Jan 04 2001 4:54:52 0
Bootrom Checksum Failures: 0 UART Failures: 0
Flash Checksum Failures: 0 Flash Program Failures: 0
Power Supply 1 Failures: 4 Power Supply 2 Failures: 0
Swapped to CLKA: 0 Swapped to CLKB: 0
Swapped to Processor 1: 3 Swapped to Processor 2: 0
DRAM Failures: 0
Exceptions: 1
Loaded NMP version: 5.5(7)
Reload same NMP version count: 3
Last software reset by user: 7/28/2001,20:30:38
Last Exception occurred on Oct 14 2001 05:47:29 ...
Software version = 5.5(7)
Error Msg:
PID = 86 telnet87
EPC: 80269C44
!--- Output is suppressed.
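When you review show log output from many switches, the fields discussed here can be pulled out with a short script. This Python sketch assumes that the output has been saved as text and that the field labels match the examples in this document. The function name parse_show_log is hypothetical.

```python
import re


def parse_show_log(output: str) -> dict:
    """Extract the reset count and exception data from saved `show log` text."""

    def first_int(pattern):
        match = re.search(pattern, output)
        return int(match.group(1)) if match else None

    last = re.search(
        r"Last Exception occurred on (\w{3} \d{1,2} \d{4} \d{1,2}:\d{2}:\d{2})",
        output,
    )
    return {
        "reset_count": first_int(r"Reset count:\s*(\d+)"),
        # Requires a colon directly after the word, so the separate
        # "EOBC Exceptions/Hang:" counter is not matched.
        "exceptions": first_int(r"\bExceptions:\s*(\d+)"),
        "last_exception": last.group(1) if last else None,
    }


if __name__ == "__main__":
    sample = (
        "Reset count: 38\n"
        "DRAM Failures: 0\n"
        "Exceptions: 1\n"
        "Last Exception occurred on Oct 14 2001 05:47:29 ...\n"
    )
    print(parse_show_log(sample))
```

A nonzero exceptions value, together with a last_exception time close to the most recent entry in the reboot history, points to a software exception as the cause of the reset.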
If your switch shows such a software exception, issue the dir bootflash: command, which displays the MSFC (route processor [RP]) bootflash device, and the dir slavebootflash: command in order to check for a software crash. The output in this section shows that crashinfo has been recorded in the RP bootflash. Make sure that the crashinfo that you view is of the most recent crash.
cat6knative#dir bootflash:
Directory of bootflash:/
1 -rw- 1693168 Jul 24 2002 15:48:22 c6msfc2-boot-mz.121-8a.EX
2 -rw- 183086 Aug 29 2002 11:23:40 crashinfo_20020829-112340
3 -rw- 20174748 Jan 30 2003 11:59:18 c6sup22-jsv-mz.121-8b.E9
4 -rw- 7146 Feb 03 2003 06:50:39 test.cfg
5 -rw- 31288 Feb 03 2003 07:36:36 01_config.txt
6 -rw- 30963 Feb 03 2003 07:36:44 02_config.txt
31981568 bytes total (9860396 bytes free)
The dir sup-bootflash: command displays the Supervisor Engine bootflash device. You can also issue the dir slavesup-bootflash: command in order to display the standby Supervisor Engine bootflash device. This output shows crashinfo recorded in the Supervisor Engine bootflash device.
cat6knative11#dir sup-bootflash:
Directory of sup-bootflash:/
1 -rw- 14849280 May 23 2001 12:35:09 c6sup12-jsv-mz.121-5c.E10
2 -rw- 20176 Aug 02 2001 18:42:05 crashinfo_20010802-234205
!--- Output is suppressed.
If the command output indicates that a software crash occurred at the time you suspected that the switch rebooted, contact Cisco Technical Support. Provide the output of the show tech-support command and the show logging command, as well as the output of the crashinfo file.
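In order to confirm that you view the most recent crashinfo file, you can read the timestamp that is embedded in each file name. This Python sketch assumes the crashinfo_YYYYMMDD-HHMMSS naming pattern seen in the directory listings in this document. The function name crashinfo_timestamps is hypothetical. The sketch reads only the file name, because the time in the name can differ from the listed file time (in the sup-bootflash: example, the name shows 23:42:05 and the listing shows 18:42:05).

```python
import re
from datetime import datetime

# Naming pattern observed in the directory listings in this document.
CRASHINFO_NAME = re.compile(r"crashinfo_(\d{8}-\d{6})")


def crashinfo_timestamps(dir_output: str):
    """Return the timestamps embedded in crashinfo file names, newest first."""
    stamps = [datetime.strptime(stamp, "%Y%m%d-%H%M%S")
              for stamp in CRASHINFO_NAME.findall(dir_output)]
    return sorted(stamps, reverse=True)


if __name__ == "__main__":
    listing = (
        "2 -rw- 183086 Aug 29 2002 11:23:40 crashinfo_20020829-112340\n"
        "2 -rw- 20176 Aug 02 2001 18:42:05 crashinfo_20010802-234205\n"
    )
    for stamp in crashinfo_timestamps(listing):
        print(stamp.isoformat())
```

Compare the newest timestamp with the time of the suspected reset before you send the file to Cisco Technical Support.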
If a Distributed Forwarding Card (DFC)-equipped module has reset on its own without user reload, you can check the bootflash of the DFC card in order to see if it crashed. If a crash information file is available, you can find the cause of the crash. Issue the dir dfc#module#-bootflash: command in order to verify if there is a crash information file and when it was written. If the DFC reset matches the crashinfo timestamp, issue the more dfc#module#-bootflash:filename command. Or, issue the copy dfc#module#-bootflash:filename tftp command in order to transfer the file via TFTP to a TFTP server.
cat6knative#dir dfc#6-bootflash:
Directory of dfc#6-bootflash:/
-#- ED ----type---- --crc--- -seek-- nlen -length- -----date/time------ name
1 .. crashinfo 2B745A9A C24D0 25 271437 Jan 27 2003 20:39:43 crashinfo_20030127-203943
After you have the crashinfo file available, collect the output of the show logging command and the show tech command and contact Cisco Technical Support for further assistance.
An attempt to boot from a device that is not listed in the device table causes the Supervisor module to crash. Upgrade the switch to Cisco IOS Software Release 12.2(18r)SX05 or later.
%CONST_DIAG-2-HM_SUP_CRSH: Supervisor crashed due to unrecoverable errors, Reason: Failed TestSPRPInbandPing
%CONST_DIAG-2-HM_SUP_CRSH: Standby supervisor crashed due to unrecoverable errors, Reason: Failed TestSPRPInbandPing
Causes and Resolutions:
If there is any corruption in the TCAM entries, the SPRPInbandPing test can fail. If the test, which runs as part of Cisco Generic Online Diagnostics (GOLD), fails 10 times consecutively, the supervisor engine can crash.
If health-monitoring is enabled on the device and complete diagnostics is configured during the startup, then the supervisor can crash at the time of the boot process.
Health-monitoring and complete diagnostics conflict with each other for some tests. As a workaround, disable one of them, based on your requirements.
The Cisco Catalyst 6500/6000 Switches can unexpectedly reload during the bootup process. The crash log can display system messages similar to these:
From the active Supervisor module:
%SYS-SP-2-MALLOCFAIL: Memory allocation of 320000 bytes failed from 0x40BCF26C, alignment 8
Pool: Processor Free: 75448 Cause: Not enough free memory
Alternate Pool: None Free: 0 Cause: No Alternate pool
-Process= "CEF process", ipl= 0, pid= 240
-Traceback= 40280AB4 40288058 40BCF274 40BE5660 40BE5730 4029A764 4029A750
%L2-SP-4-NOMEM: Malloc failed: L2-API Purge/Search failed. size req. 512
SP: EARL Driver:lyra_purge_search:process_push_event_list failed
%SCHED-SP-2-SEMNOTLOCKED: L2 bad entry (7fff/0) purge proc attempted to unlock an unlocked semaphore
-Traceback= 402C202C 4058775C 4058511C 40587CB8
From the standby Supervisor module:
%SYS-SP-STDBY-2-MALLOCFAIL: Memory allocation of 2920 bytes failed from 0x40174088, alignment 8
Pool: Processor Free: 9544 Cause: Memory fragmentation
Alternate Pool: None Free: 0 Cause: No Alternate pool
-Process= "DiagCard2/-1", ipl= 0, pid= 154
-Traceback= 4016F7CC 40172984 40174090 4063601C 40636584 4062D194 4062ABD8 4062A9EC 4017E0B0 4017E09C
%L2-SP-STDBY-4-NOMEM: Malloc failed: L2-API Purge/Search failed. size req. 512
%SCHED-SP-STDBY-2-SEMNOTLOCKED: L2 bad entry (7fff/0) purge proc attempted to unlock an unlocked semaphore
-Traceback= 4018A300 403F0400 403EDD7C 403F0A48
SP-STDBY: EARL Driver:lyra_purge_search:process_push_event_list failed
%SYS-SP-STDBY-2-MALLOCFAIL: Memory allocation of 1400 bytes failed from 0x409928B4, alignment 8
Pool: Processor Free: 7544 Cause: Memory fragmentation
Alternate Pool: None Free: 0 Cause: No Alternate pool
-Process= "CEF LC Stats", ipl= 0, pid= 138
-Traceback= 4016F7CC 40172984 409928BC 409C5EEC 4098A5EC
From Cisco IOS Software Release 12.2(17d)SXB, the Supervisor Engine 2 needs a minimum DRAM of 256MB. If your Supervisor module has DRAM of 128MB, then in order to resolve this issue, upgrade the memory to 256MB or more. Refer to Release Notes for Cisco IOS Release 12.2SX on the Supervisor Engine 720, Supervisor Engine 32, and Supervisor Engine 2 for more information.
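The MALLOCFAIL messages in this section carry the requested size, the free memory in the pool, and the reported cause. The Python sketch below extracts those fields from saved log text so that you can see at a glance how little processor memory was free. It assumes the message layout shown in the examples in this document, and the function name parse_mallocfail is hypothetical.

```python
import re

# Pattern based on the message layout shown in this document. Whitespace
# between fields is matched loosely so that line breaks are tolerated.
MALLOCFAIL = re.compile(
    r"MALLOCFAIL: Memory allocation of (\d+) bytes failed from "
    r"(0x[0-9A-Fa-f]+), alignment \d+\s+"
    r"Pool:\s+(\w+)\s+Free:\s+(\d+)\s+Cause:\s+(.+?)\s+Alternate Pool",
    re.DOTALL,
)


def parse_mallocfail(log_text: str):
    """Return one dict per MALLOCFAIL message found in the log text."""
    return [
        {
            "requested": int(requested),
            "caller": caller,
            "pool": pool,
            "free": int(free),
            "cause": cause,
        }
        for requested, caller, pool, free, cause in MALLOCFAIL.findall(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "%SYS-SP-2-MALLOCFAIL: Memory allocation of 320000 bytes failed "
        "from 0x40BCF26C, alignment 8\n"
        "Pool: Processor Free: 75448 Cause: Not enough free memory\n"
        "Alternate Pool: None Free: 0 Cause: No Alternate pool\n"
    )
    for entry in parse_mallocfail(sample):
        print(entry)
```

In the active Supervisor example, the request (320000 bytes) exceeds the free memory (75448 bytes). In the standby examples, the free memory exceeds the request, and the reported cause is memory fragmentation.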
The Cisco Catalyst 6000/6500 Switches can unexpectedly reload due to an unexpected exception.
01:22:25: %SNMP-3-AUTHFAIL: Authentication failure for SNMP req from host 10.1.2.2
01:23:25: %SNMP-3-AUTHFAIL: Authentication failure for SNMP req from host 10.1.2.2
01:23:40: ROMMON image upgrade in progress
01:23:40: Erasing flash
Unexpected exception, CPU signal 5, PC = 0x402F3DC4
While the ROMMon upgrade is in progress, if the system receives an SNMP query, it can cause the switch to reload.
Complete this procedure in order to prevent the switch from crashing when you perform the ROMMon upgrade:
Disable the SNMP agent in the switch.
Disable possible SNMP queries to this device from the Network Management Stations.
Perform the ROMMon upgrade on the standby supervisor alone. In order to upgrade the active supervisor, force a switchover and then perform the ROMMon upgrade.
This message appears as part of the output of the show stacks command (which is also part of the show tech-support command). The complete message is similar to this:
***************************************************
******* Information of Last System Crash **********
***************************************************
Using bootflash:crashinfo.
%Error opening bootflash:crashinfo (File not found)
***************************************************
****** Information of Last System Crash - SP ******
***************************************************
The last crashinfo failed to be written.
Please verify the exception crashinfo configuration
the filesytem devices, and the free space on the
filesystem devices.
Using crashinfo_FAILED.
%Error opening crashinfo_FAILED (File not found)
There are two conditions where such a message displays:
The bootflash: device does not have enough space to store the crashinfo file. In order to verify whether the bootflash: has enough space, issue the dir bootflash: command or the dir all command. Ensure some free space in the bootflash for the crashinfo (if the switch crashes for any reasons in future).
The system has never encountered a crash. If you have restarted the switch after any suspected crash, issue the show version command. In the output, look for the line that starts with System returned to ROM by. If the text that follows is power-on, the switch did not crash. The list is not comprehensive, but other phrases that can indicate that a crash has occurred are these: unknown reload cause - suspect, processor memory parity error at PC, and SP by abort at PC.
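The check of the System returned to ROM by line can be sketched in Python as shown here. The sketch uses only the phrases named in this section, so it is not a complete classifier, and the function name classify_reload_reason is hypothetical. Crash phrases are checked first, because a line can start with power-on and still contain a phrase such as SP by abort at PC.

```python
import re

# Phrases that this document lists as possible crash indicators.
CRASH_PHRASES = (
    "unknown reload cause - suspect",
    "processor memory parity error at PC",
    "SP by abort at PC",
)


def classify_reload_reason(show_version: str) -> str:
    """Classify the reload reason found in saved `show version` text."""
    match = re.search(r"System returned to ROM by (.+)", show_version)
    if not match:
        return "unknown"
    reason = match.group(1)
    # Check crash phrases before power-on, because both can appear together.
    if any(phrase in reason for phrase in CRASH_PHRASES):
        return "possible crash"
    if reason.startswith("power-on"):
        return "no crash indicated"
    return "unknown"


if __name__ == "__main__":
    line = "System returned to ROM by power-on (SP by abort at PC 0x601061A8)"
    print(classify_reload_reason(line))
```

Any result other than no crash indicated is a prompt to look for a crashinfo file, not a final diagnosis.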
The MSFC can crash with a bus error exception, which might be caused by a software or hardware problem. These error messages might display:
On the console:
*** System received a Bus Error exception ***
signal= 0xa, code= 0x10, context= 0x60ef02f0
PC = 0x601d22f8, Cause = 0x2420, Status Reg = 0x34008002
In the output of the show version command:
!--- Output is suppressed.
System was restarted by bus error at PC 0x0, address 0x0 at 15:31:54 EST Wed Mar 29 2000
!--- Output is suppressed.
If the address indicated is an invalid address out of the memory range, it is a software bug. If the address is in the valid range, the cause of the problem is probably a hardware failure of the processor memory.
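The address test described above can be sketched in Python. The memory ranges in EXAMPLE_REGIONS are hypothetical values chosen only to make the example run. On a real device, take the valid ranges from the memory map of that device. The function names are also hypothetical.

```python
import re

# Hypothetical (start, end) address ranges, for illustration only.
# Replace them with the valid ranges from the memory map of your device.
EXAMPLE_REGIONS = [
    (0x60000000, 0x67FFFFFF),
]


def bus_error_address(show_version: str):
    """Return the faulting address from a 'bus error at PC' line, or None."""
    match = re.search(
        r"bus error at PC (0x[0-9A-Fa-f]+), address (0x[0-9A-Fa-f]+)",
        show_version,
    )
    return int(match.group(2), 16) if match else None


def likely_cause(address: int, regions) -> str:
    """Apply the rule of thumb: valid address -> hardware, invalid -> software."""
    in_range = any(start <= address <= end for start, end in regions)
    if in_range:
        return "suspect hardware (processor memory)"
    return "suspect software bug"


if __name__ == "__main__":
    line = ("System was restarted by bus error at PC 0x0, address 0x0 "
            "at 15:31:54 EST Wed Mar 29 2000")
    address = bus_error_address(line)
    print(hex(address), likely_cause(address, EXAMPLE_REGIONS))
```

With the example line from this section, address 0x0 falls outside the assumed range, which points toward a software problem under this rule of thumb.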
The MSFC does not contain ECC memory protection. Therefore, the MSFC crashes at the detection of a parity error. These are some of the errors that you can see when this occurs:
On the console, you see:
*** System received a Cache Parity Exception ***
signal= 0x14, code= 0xa405c428, context= 0x60dd1ee0
PC = 0x6025b2a8, Cause = 0x6420, Status Reg = 0x34008002
In the output of the show version command, you see:
!--- Output is suppressed.
System returned to ROM by processor memory parity error at PC 0x6020F4D0, address 0x0 at 18:18:31 UTC Wed Aug 22 2001
!--- Output is suppressed.
In the crashinfo file, recorded in the bootflash or on the console, you see:
Error: primary data cache, fields: data, SysAD
virtual addr 0x4B288202, physical addr(21:3) 0x288200, vAddr(14:12) 0x0000
virtual address corresponds to pcimem, cache word 0
Address: 0x4B288200 not in L1 Cache
Address: 0x4B288202 Can not be loaded into L1 Cache
If the error occurs more than once, you must replace the MSFC. If the error occurs only once, you might have experienced a single event upset. In this case, monitor the MSFC. Refer to Processor Memory Parity Errors (PMPEs) for more information on parity errors.
The MSFC2 contains ECC memory protection. However, there are memory locations in which parity is checked but single-bit errors cannot be corrected. These are some error messages that you can see in the crashinfo file that indicate a parity error:
Error condition detected: TM_NPP_PARITY_ERROR
Error condition detected: SYSAD_PARITY_ERROR
Error condition detected: SYSDRAM_PARITY
If these error messages are logged only once, you might have experienced a single event upset. Monitor the MSFC2. If the errors happen more frequently, replace the MSFC2. Refer to Processor Memory Parity Errors (PMPEs) for more information on parity errors.
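In order to judge whether these errors are logged once or repeatedly, you can count them across the crashinfo files that you have collected. This Python sketch assumes that the crashinfo contents are available as text and looks only for the three error conditions listed above. The function name parity_conditions is hypothetical.

```python
import re
from collections import Counter

# Error conditions that this document lists as parity error indicators.
PARITY_CONDITIONS = {
    "TM_NPP_PARITY_ERROR",
    "SYSAD_PARITY_ERROR",
    "SYSDRAM_PARITY",
}


def parity_conditions(crashinfo_text: str) -> Counter:
    """Count the listed parity error conditions in crashinfo text."""
    found = re.findall(r"Error condition detected:\s*(\w+)", crashinfo_text)
    return Counter(name for name in found if name in PARITY_CONDITIONS)


if __name__ == "__main__":
    sample = (
        "Error condition detected: SYSAD_PARITY_ERROR\n"
        "Error condition detected: SYSAD_PARITY_ERROR\n"
    )
    print(parity_conditions(sample))
```

A count of one suggests that you monitor the MSFC2. Repeated counts over time support a replacement.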
If your MSFC2 crashes and you have a crashinfo file in your bootflash device, issue the more bootflash:crashinfo_filename command. The command displays the information from the crashinfo file. If you see the MISTRAL-3-ERROR message in the initial log section of the crashinfo log, refer to MSFC2 Crashes with Mistral-3-Error Messages in the Crashinfo File in order to determine if you have run into one of the common reasons.
The show system sanity command runs a set of predetermined checks on the configuration with a possible combination of certain system states in order to compile a list of warning conditions. The checks are designed to look for anything that seems out of place. The checks are intended to help you maintain the desired and correct system configuration and functionality. This command is supported in CatOS version 8.3x or later.
Refer to Sanity Check for Configuration Issues and System Health for the list of checks that the command performs and for sample output of the command.
Refer to Recover the Catalyst 6500/6000 with Supervisor Engine I or II in order to recover Cisco Catalyst 6000/6500 with Supervisor Engine 1 or 2.
Refer to Recover the Catalyst 6500/6000 with Supervisor Engine 720 or Supervisor Engine 32 in order to recover Cisco Catalyst 6000/6500 with Supervisor Engine 720 or 32.
The crashinfo file is a collection of useful information related to the current crash stored in bootflash or Flash memory. When a router crashes due to data or stack corruption, more reload information is needed to debug this type of crash than just the output from the normal show stacks command.
The crashinfo file contains this information:
limited error message (log) and command history
description of the image that runs at the time of the crash
output of the show alignment command
malloc and free traces
process level stack trace
process level context
process level stack dump
interrupt level stack dump
process level info
process level register memory dump
Refer to Retrieving Information from the Crashinfo File for more information and for the procedure to retrieve the crashinfo file.
Refer to Creating Core Dumps for more information and for the procedure to collect a core dump from the device.
For Cisco Catalyst 6000/6500 Switches that run Native IOS, refer to Common Error Messages on Catalyst 6500/6000 Series Switches Running Cisco IOS Software. If you see an error message that is not in one of the common error messages, refer to:
For Cisco Catalyst 6000/6500 Switches that run Hybrid OS, refer to Common CatOS Error Messages on Catalyst 6500/6000 Series Switches. If you see an error message that is not in one of the common error messages, refer to:
Refer to these other resources for more information:
- Error and System Messages - Cisco Catalyst 6500 Series Switches
- Common CatOS Error Messages on Catalyst 6500/6000 Series Switches
- Common Error Messages on Catalyst 6500/6000 Series Switches Running Cisco IOS Software
- Switches Product Support
- LAN Switching Technology Support
- Technical Support & Documentation - Cisco Systems