This document provides information on how to troubleshoot line card crashes on the Cisco 12000 Series Internet Router.
There are no specific requirements for this document.
The information in this document is based on these software and hardware versions:
All Cisco 12000 Series Internet Routers, including the 12008, 12012, 12016, 12404, 12406, 12410, and the 12416.
All Cisco IOS® Software versions that support the Cisco 12000 Series Internet Router.
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.
Refer to Cisco Technical Tips Conventions for more information on document conventions.
This section provides a background on how to identify a line card crash.
In order to quickly identify a line card crash, use the show context summary command:
Router#show context summary
CRASH INFO SUMMARY
  Slot 0 : 0 crashes
  Slot 1 : 0 crashes
  Slot 2 : 0 crashes
  Slot 3 : 0 crashes
  Slot 4 : 1 crashes
    1 - crash at 04:28:56 EDT Tue Apr 20 1999
  Slot 5 : 0 crashes
  Slot 6 : 0 crashes
  Slot 7 : 0 crashes
  Slot 8 : 0 crashes
  Slot 9 : 0 crashes
  Slot 10: 0 crashes
  Slot 11: 0 crashes
If the crash affects the router itself (and not the line card only), refer to Troubleshooting Router Crashes.
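When the summary has to be checked on many routers, the scan for non-zero crash counts can be scripted. The following is a minimal sketch, assuming the output has already been captured as text and follows the layout of the show context summary example above. The function name crashed_slots is hypothetical and is not part of any Cisco tool.

```python
import re

def crashed_slots(summary_text):
    """Return {slot: crash_count} for every slot that reports at least one crash."""
    # Matches entries such as "Slot 4 : 1 crashes" and "Slot 10: 0 crashes".
    pattern = re.compile(r"Slot\s+(\d+)\s*:\s*(\d+)\s+crash")
    result = {}
    for slot, count in pattern.findall(summary_text):
        if int(count) > 0:
            result[int(slot)] = int(count)
    return result

# Shortened copy of the sample output shown above.
sample = """CRASH INFO SUMMARY
  Slot 0 : 0 crashes
  Slot 4 : 1 crashes
    1 - crash at 04:28:56 EDT Tue Apr 20 1999
  Slot 10: 0 crashes"""

print(crashed_slots(sample))
```

Each slot returned by this sketch is a candidate for the show context slot [slot #] command described in the next steps.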
In order to collect the relevant data about the crash, use the commands shown in Table 1.
|Command||Description|
|show version||Provides general information about the system's hardware and software configurations.|
|show logging||Displays the general logs of the router.|
|show diag [slot #]||Provides specific information about a particular slot: type of engine, hardware revisions, memory configuration, and so on.|
|show context slot [slot #]||Provides context information about the recent crash(es). This is often the most useful command for troubleshooting line card crashes.|
|core dump||A core dump of a line card is the full content of its memory at the time of the crash. This data is normally not needed for an initial troubleshooting. It may be required later if the problem turns out to be a new software bug. In that case, refer to Configuring a Core Dump on a GSR Line Card.|
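As a small illustration, the show commands from Table 1 can be generated for a given slot before they are pasted into a terminal session or a collection script. This is only a sketch: the helper name crash_data_commands is hypothetical, and the core dump entry is left out because it is a configuration task rather than a show command.

```python
def crash_data_commands(slot):
    """Return the Table 1 show commands for one line card slot."""
    if not isinstance(slot, int) or slot < 0:
        raise ValueError("slot must be a non-negative integer")
    return [
        "show version",
        "show logging",
        f"show diag {slot}",
        f"show context slot {slot}",
    ]

for command in crash_data_commands(4):
    print(command)
```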
Check the value of the SIG= field in the show context slot [slot #] output:
Router#show context slot 4
CRASH INFO: Slot 4, Index 1, Crash at 04:28:56 EDT Tue Apr 20 1999
VERSION:
GS Software (GLC1-LC-M), Version 11.2(15)GS1a, EARLY DEPLOYMENT RELEASE SOFTWARE (fc1)
Compiled Mon 28-Dec-98 14:53 by tamb
Card Type: 1 Port Packet Over SONET OC-12c/STM-4c, S/N CAB020500AL
System exception: SIG=20, code=0xA414EF5A, context=0x40337424
Traceback Using RA
STACK TRACE:
  traceback 4014CFC0 40141AB8 40143944 4014607C 4014A7EC 401499D4 40149BB4 40149FD4 40080118 40080104
CONTEXT:
$0 : 00000000, AT : 40330000, v0 : 00000000, v1 : 00000038
a0 : 4094EF58, a1 : 00000120, a2 : 00000002, a3 : 00000001
t0 : 00000010, t1 : 3400BF01, t2 : 34008D00, t3 : FFFF00FF
t4 : 400A1410, t5 : 00000002, t6 : 00000000, t7 : 4041783C
s0 : 4093F980, s1 : 4093F980, s2 : 4094EEF0, s3 : 4094EF00
s4 : 00000000, s5 : 00000001, s6 : 00000000, s7 : 00000000
t8 : 34008000, t9 : 00000000, k0 : 404D1860, k1 : 400A2F68
gp : 402F3070, sp : 4082BFB0, s8 : 00000000, ra : 400826FC
EPC : 0x40098824, SREG : 0x3400BF04, Cause : 0x00000000
ErrorEPC : 0x4015B7E4
See Table 2 to find out what error reason matches the SIG value you recorded.
|SIG Value||SIG Name||Error Reason|
|2||SIGINT||Unexpected hardware interrupt.|
|3||SIGQUIT||Abort due to break key.|
|4||SIGILL||Illegal opcode exception.|
|5||SIGTRAP||Abort due to Break Point or an arithmetic exception.|
|8||SIGFPE||Floating point unit (FPU) exception.|
|10||SIGBUS||Bus error exception.|
|20||SIGCACHE||Cache parity exception.|
|21||SIGWBERR||Write bus error interrupt.|
|22||SIGERROR||Fatal hardware error.|
Note: Cache Parity Exception (SIG=20), Bus Error Exception (SIG=10), and Software-forced Crashes (SIG=23) account for more than 95% of line card crashes.
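The lookup in Table 2 can be sketched in a few lines of Python. The dictionary below copies the rows of Table 2. The entry for SIG=23 is taken from the note above rather than from the table, so its signal name is deliberately left unset. The function name crash_reason is hypothetical, and the parsing assumes the SIG= field appears as in the sample show context slot output.

```python
import re

# Rows copied from Table 2: SIG value -> (SIG name, error reason).
SIG_TABLE = {
    2: ("SIGINT", "Unexpected hardware interrupt."),
    3: ("SIGQUIT", "Abort due to break key."),
    4: ("SIGILL", "Illegal opcode exception."),
    5: ("SIGTRAP", "Abort due to Break Point or an arithmetic exception."),
    8: ("SIGFPE", "Floating point unit (FPU) exception."),
    10: ("SIGBUS", "Bus error exception."),
    20: ("SIGCACHE", "Cache parity exception."),
    21: ("SIGWBERR", "Write bus error interrupt."),
    22: ("SIGERROR", "Fatal hardware error."),
    # Not a Table 2 row: reason taken from the note, name intentionally unknown here.
    23: (None, "Software-forced crash."),
}

def crash_reason(context_text):
    """Return (sig, name, reason) for the first SIG= field found, or None."""
    match = re.search(r"SIG=(\d+)", context_text)
    if match is None:
        return None
    sig = int(match.group(1))
    name, reason = SIG_TABLE.get(sig, (None, "Not listed in Table 2."))
    return sig, name, reason

line = "System exception: SIG=20, code=0xA414EF5A, context=0x40337424"
print(crash_reason(line))
```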
The Cisco 12000 Series supports the diag [slot #] command for testing the different board components. This command is useful for troubleshooting hardware-related crashes and for identifying the faulty board.
The verbose option causes the router to display the list of tests as they are being performed. Otherwise, it simply displays a "PASSED" or "FAILURE" message.
Note: Performing this diagnostic stops all activities of the line card for the duration of the tests (usually around five minutes).
Starting with Cisco IOS Software Release 12.0(22)S, Cisco has unbundled the Cisco 12000 Series Internet Router field diagnostics line card image from the Cisco IOS software image. In earlier releases, running diagnostics from the command line launched the embedded image. In order to accommodate customers with 20 MB Flash memory cards, line card field diagnostics are now stored and maintained as a separate image. This image must be available on a Flash memory card or a Trivial File Transfer Protocol (TFTP) boot server before the field diagnostics commands can be used. Route processor and switch fabric field diagnostics continue to be bundled and need not be launched from a separate image. You can find more information at Field Diagnostics for the Cisco 12000 Series Internet Router.
Here is an example of a diag [slot#] command output:
Router#diag 3 verbose
Running DIAG config check
Running Diags will halt ALL activity on the requested slot. [confirm]
CR1.LND10#
Launching a Field Diagnostic for slot 3
Downloading diagnostic tests to slot 3 (timeout set to 400 sec.)
Field Diag download COMPLETE for slot 3
FD 3> *****************************************************
FD 3> GSR Field Diagnostics V3.0
FD 3> Compiled by award on Tue Aug 3 15:58:13 PDT 1999
FD 3> view: award-bfr_112.FieldDiagRelease
FD 3> *****************************************************
FD 3> BFR_CARD_TYPE_OC48_1P_POS testing...
FD 3> running in slot 3 (128 tests)
Executing all diagnostic tests in slot 3
(total/indiv. timeout set to 600/200 sec.)
FD 3> Verbosity now (0x00000001) TESTSDISP
FDIAG_STAT_IN_PROGRESS: test #1 R5K Internal Cache
FDIAG_STAT_IN_PROGRESS: test #2 Burst Operations
FDIAG_STAT_IN_PROGRESS: test #3 Subblock Ordering
FDIAG_STAT_IN_PROGRESS: test #4 Dram Marching Pattern
FDIAG_STAT_DONE_FAIL test_num 4, error_code 6
Field Diagnostic: ****TEST FAILURE**** slot 3: last test run 4, Dram Marching Pattern, error 6
Field Diag eeprom values: run 2 fail mode 1 (TEST FAILURE) slot 3
last test failed was 4, error code 6
Shutting down diags in slot 3
slot 3 done, will not reload automatically
Depending on the error encountered, the slot might or might not be automatically reloaded. If it is not, it might be in a stuck or inconsistent state (check with the show diag [slot #] command) until manually reloaded. This is normal. In order to manually reload the card, use the hw-module slot [slot#] reload command.
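A long verbose diagnostic log is easier to review when the failing tests are extracted automatically. The sketch below assumes one message per line and the FDIAG_STAT_IN_PROGRESS and FDIAG_STAT_DONE_FAIL message formats seen in the example output above. Other field diagnostics versions may print different text, and the function name failed_tests is hypothetical.

```python
import re

def failed_tests(diag_text):
    """Return a list of (test_number, test_name, error_code) for failed tests."""
    # Map test numbers to names from lines like
    # "FDIAG_STAT_IN_PROGRESS: test #4 Dram Marching Pattern".
    names = dict(re.findall(r"FDIAG_STAT_IN_PROGRESS: test #(\d+) (.+)", diag_text))
    failures = []
    fail_pattern = r"FDIAG_STAT_DONE_FAIL test_num (\d+), error_code (\d+)"
    for number, code in re.findall(fail_pattern, diag_text):
        name = names.get(number, "unknown").strip()
        failures.append((int(number), name, int(code)))
    return failures

# Excerpt of the sample diag output shown above.
sample = """FDIAG_STAT_IN_PROGRESS: test #3 Subblock Ordering
FDIAG_STAT_IN_PROGRESS: test #4 Dram Marching Pattern
FDIAG_STAT_DONE_FAIL test_num 4, error_code 6"""

print(failed_tests(sample))
```

An empty list means that no failure message was found in the captured text. It does not prove that the diagnostic ran to completion.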
You can identify cache parity exceptions by the SIG=20 in the show context slot [slot #] output.
There are two different kinds of parity errors:
Soft parity errors: These occur when an energy level within the chip (for example, a one or a zero) changes. In the case of a soft parity error, there is no need to swap the board or any of its components.
Hard parity errors: These occur when a chip or board failure causes data to be corrupted. In this case, re-seat or replace the affected component, which usually means a memory chip swap or a board swap. Multiple parity errors at the same address indicate a hard parity error. Some cases are harder to identify but, in general, if more than one parity error is seen in a particular memory region within a relatively short period of time (several weeks to months), treat it as a hard parity error.
Studies have shown that soft parity errors are 10 to 100 times more frequent than hard parity errors.
In order to troubleshoot these errors, find a maintenance window to run the diag command for that slot.
If the diagnostic reports a failure, replace the line card.
If there is no failure, the crash was most likely caused by a soft parity error, and the line card does not have to be replaced (unless it crashes a second time with a parity error after a short period of time).
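The decision steps above can be written down as a small function, for example as part of a runbook script. This is only a sketch of the logic stated in this section. The function name and the wording of the returned strings are hypothetical, and what counts as "a short period of time" is left to the operator.

```python
def parity_crash_action(diag_passed, repeated_soon=False):
    """Suggest a next step for a SIG=20 (cache parity) crash.

    diag_passed:   True if the diag command reported no failure for the slot.
    repeated_soon: True if the card crashed again with a parity error
                   after a short period of time.
    """
    if not diag_passed:
        # A failed diagnostic points to a hardware fault.
        return "replace the line card"
    if repeated_soon:
        # A repeated parity crash suggests a hard parity error.
        return "replace the line card"
    return "keep the line card and monitor (likely a soft parity error)"

print(parity_crash_action(diag_passed=True))
```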
You can identify bus error exceptions by the SIG=10 in the show context slot [slot #] output.
This type of crash is normally software-related. If you have reason to suspect a hardware problem (for example, the card is brand new, or the crashes started after a power outage), run the diag command for that slot.
Note: Some software bugs have been known to cause the diag command to report errors, even though there is no problem with the hardware. If a card has already been replaced, but still fails at the same test in the diagnostic, you might be affected by this issue. In that case, treat the crash as a software problem.
Upgrading to the latest version of your Cisco IOS software release train gives you the fixes for all resolved bugs that cause line card bus errors. If the crash is still present after the upgrade, collect the relevant information (see Gather Information about the Crash), along with the show tech-support output and any information that you think might be useful (such as a recent topology change or a newly implemented feature), and contact your Cisco support representative.
You can identify software-forced crashes by the SIG=23 in the show context slot [slot #] output. Despite the name, these crashes are not always software-related.
The most common reason for software-forced crashes is the "Fabric Ping Timeout". During normal router operation, the Route Processor (RP) continually pings the line cards. If a line card doesn't answer, the route processor decides to reset it. This results in a software-forced crash (SIG=23) of the affected line card, and you should see these errors in the router's logs:
Mar 12 00:42:48: %GRP-3-FABRIC_UNI: Unicast send timed out (4)
Mar 12 00:42:50: %GRP-3-COREDUMP: Core dump incident on slot 4, error: Fabric ping failure
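To find which slots were reset because of a fabric ping failure, the router log can be searched for the %GRP-3-COREDUMP message shown above. This sketch assumes the message text matches the example exactly. The function name fabric_ping_failures is hypothetical.

```python
import re

def fabric_ping_failures(log_text):
    """Return the sorted slot numbers that logged a fabric ping failure."""
    pattern = re.compile(
        r"%GRP-3-COREDUMP: Core dump incident on slot (\d+), "
        r"error: Fabric ping failure"
    )
    # A set removes duplicates when the same slot fails more than once.
    return sorted({int(slot) for slot in pattern.findall(log_text)})

# The two log lines from the example above.
sample = """Mar 12 00:42:48: %GRP-3-FABRIC_UNI: Unicast send timed out (4)
Mar 12 00:42:50: %GRP-3-COREDUMP: Core dump incident on slot 4, error: Fabric ping failure"""

print(fabric_ping_failures(sample))
```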
In order to troubleshoot fabric ping timeouts, you need to find out why the line card didn't respond to the ping. There can be multiple causes:
The line card is experiencing high CPU utilization: You can verify this with the execute-on slot [slot #] show proc cpu command. If CPU utilization is very high (above 95%), refer to Troubleshooting High CPU Utilization on Cisco Routers.
There are software bugs in Inter Process Communication (IPC) or the line card is running out of IPC buffers. Most of the time these software-forced reloads are caused by software bugs.
Upgrading to the latest version of your Cisco IOS software release train gives you the fixes for all resolved bugs that cause fabric ping timeouts. If the crash is still present after the upgrade, collect the relevant information (see Getting Information about the Crash), along with the show tech-support output, the show ipc status output, and any information that you think might be useful (such as a recent topology change or a newly implemented feature), and contact your Cisco support representative.
Hardware failure: If the card has been running fine for a long time and no recent topology, software, or feature changes have taken place, or if the problems started after a move or a power outage, defective hardware might be the cause. Run the diag command on the affected line card and replace the line card if it is faulty. If multiple line cards are affected, or if the diagnostic passes, replace the fabric.
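The CPU check from the first cause in the list above can be automated once the show proc cpu output has been captured. The sketch below assumes the usual Cisco IOS summary line of the form "CPU utilization for five seconds: X%/Y%; one minute: ...". That format is not shown in this document, so treat it as an assumption. The 95% threshold comes from the text, and the function names are hypothetical.

```python
import re

CPU_THRESHOLD_PERCENT = 95  # threshold quoted in the list of causes above

def five_second_cpu(show_proc_cpu_text):
    """Return the five-second total CPU percentage, or None if not found."""
    # Assumed line format: "CPU utilization for five seconds: 97%/12%; ..."
    match = re.search(
        r"CPU utilization for five seconds:\s*(\d+)%", show_proc_cpu_text
    )
    return int(match.group(1)) if match else None

def cpu_is_suspect(show_proc_cpu_text):
    """True when the five-second CPU utilization is above the threshold."""
    cpu = five_second_cpu(show_proc_cpu_text)
    return cpu is not None and cpu > CPU_THRESHOLD_PERCENT

sample = "CPU utilization for five seconds: 97%/12%; one minute: 96%; five minutes: 95%"
print(five_second_cpu(sample), cpu_is_suspect(sample))
```

A single five-second sample can be misleading, so it is worth repeating the command a few times before concluding that the CPU is the cause.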
%GSRSPA-6-ERRORRECOVER: A Hardware or Software error occurred on Subslot 0.Reason Marvel: TXECCERR Automatic Error recovery initiate
A TXECCERR or RXECCERR error is reported when the number of unrecoverable ECC error interrupts on the TxFIFO or RxFIFO in the MAC exceeds the threshold value within the time interval. Unrecoverable ECC errors cannot be corrected by the ECC logic. When an unrecoverable error occurs during an RxFIFO read, the packet to which the data belongs is marked with EOP/Abort on the SPI4 receive interface and is discarded by the upper layers.
This error is caused by the hardware and clears once the SIP/SPA is reloaded. The permanent solution is to replace the SIP/SPA.
Other crash types are far less common than those described above. In most cases, the diag command should indicate whether the card needs to be replaced or not. If the card passes the diagnostic test, consider upgrading the software.
If you still need assistance after following the troubleshooting steps above and want to open a service request (registered customers only) with the Cisco TAC, be sure to include the following information:
Note: Do not manually reload or power-cycle the router before you collect the above information, unless this is required to troubleshoot the line card crash. A reload can destroy information that is needed to determine the root cause of the problem.