This document provides information on how to troubleshoot line card crashes on the Cisco 12000 Series Internet Router.
There are no specific requirements for this document.
The information in this document is based on these software and hardware versions:
All Cisco 12000 Series Internet Routers, including the 12008, 12012, 12016, 12404, 12406, 12410, and the 12416.
All Cisco IOS® Software versions that support the Cisco 12000 Series Internet Router.
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.
Gather Information about the Crash
In order to collect the relevant data about the crash, use the commands shown in Table 1.
Table 1 – Commands to Use to Collect Data About the Crash
show version
Provides general information about the system's hardware and software configurations.
show log
Displays the general logs of the router.
show diag [slot #]
Provides specific information about a particular slot: type of engine, hardware revisions, memory configuration, and so on.
show context slot [slot #]
Provides context information about the recent crash(es). This is often the most useful command for troubleshooting line card crashes.
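As a convenience, the commands in Table 1 can be generated for a given slot before pasting them into a terminal session. The following is a minimal sketch, not a Cisco tool: the function name `collection_commands` is hypothetical, and `show version` and `show log` are assumed to be the commands that match the first two descriptions in the table.

```python
def collection_commands(slot):
    """Return the Table 1 commands for one line card slot, in order."""
    # Slot numbers are assumed to be non-negative integers; the valid
    # range depends on the chassis model and is not checked here.
    if not isinstance(slot, int) or slot < 0:
        raise ValueError("slot must be a non-negative integer")
    return [
        "show version",               # general hardware/software information
        "show log",                   # general router logs
        f"show diag {slot}",          # engine type, hardware revisions, memory
        f"show context slot {slot}",  # context of the recent crash(es)
    ]


if __name__ == "__main__":
    for command in collection_commands(3):
        print(command)
```

The script only builds the command strings; running them on the router and capturing the output is left to the operator.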
A core dump of a line card is the full content of its memory at the time of the crash. This data is normally not needed for initial troubleshooting. It may be required later if the problem turns out to be a new software bug. In that case, refer to Configuring a Core Dump on a GSR Line Card.
Analyze the Collected Data
Check the value of the sig= field in the show context slot [slot #] output and record it.
See Table 2 to find out what error reason matches the SIG value you recorded.
Table 2 – Find the Error That Matches the SIG Value
Unexpected hardware interrupt.
Abort due to break key.
Illegal opcode exception.
Abort due to Break Point or an arithmetic exception.
Floating point unit (FPU) exception.
Bus error exception (SIG=10).
Cache parity exception (SIG=20).
Write bus error interrupt.
Fatal hardware error.
Software-forced crash (SIG=23).
Note: Cache Parity Exception (SIG=20), Bus Error Exception (SIG=10), and Software-forced Crashes (SIG=23) account for more than 95% of line card crashes.
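The lookup described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: only the three SIG values named in the note are mapped, the helper names are hypothetical, and the sample line is invented for the example rather than copied from a real show context slot output.

```python
import re

# Only the three SIG values named in the note are mapped here.
SIG_REASONS = {
    10: "Bus error exception",
    20: "Cache parity exception",
    23: "Software-forced crash",
}

# Matches "sig=20" or "SIG=20", with optional spaces around "=".
SIG_PATTERN = re.compile(r"\bsig\s*=\s*(\d+)", re.IGNORECASE)


def extract_sig(context_output):
    """Return the first SIG value found in the text, or None."""
    match = SIG_PATTERN.search(context_output)
    return int(match.group(1)) if match else None


def describe_sig(sig):
    """Map a SIG value to its error reason, falling back to Table 2."""
    return SIG_REASONS.get(sig, "Other crash type; see Table 2")


if __name__ == "__main__":
    # Hypothetical line, for illustration only.
    sample = "System exception: sig=20, code=0x0, context=0x0"
    sig = extract_sig(sample)
    print(sig, describe_sig(sig))
```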
The diag Exec Command
The Cisco 12000 Series supports the diag [slot#] command for testing the different board components. This command is useful for troubleshooting hardware-related crashes and for identifying the faulty board.
The verbose option causes the router to display the list of tests as they are being performed. Otherwise, it simply displays a "PASSED" or "FAILURE" message.
Note: Performing this diagnostic stops all activities of the line card for the duration of the tests (usually around five minutes).
Starting with Cisco IOS Software Release 12.0(22)S, Cisco has unbundled the Cisco 12000 Series Internet Router field diagnostics line card image from the Cisco IOS software image. In earlier versions, diagnostics could be started from the command line, which launched the embedded image. In order to accommodate customers with 20 MB Flash memory cards, line card field diagnostics are now stored and maintained as a separate image that must be available on a Flash memory card or a Trivial File Transfer Protocol (TFTP) boot server before the field diagnostics commands can be used. Route processor and switch fabric field diagnostics continue to be bundled and need not be launched from a separate image. You can find more information at Field Diagnostics for the Cisco 12000 Series Internet Router.
Here is an example of a diag [slot#] command output:
Router#diag 3 verbose
Running DIAG config check
Running Diags will halt ALL activity on the requested slot.
Launching a Field Diagnostic for slot 3
Downloading diagnostic tests to slot 3 (timeout set to 400 sec.)
Field Diag download COMPLETE for slot 3
FD 3> *****************************************************
FD 3> GSR Field Diagnostics V3.0
FD 3> Compiled by award on Tue Aug 3 15:58:13 PDT 1999
FD 3> view: award-bfr_112.FieldDiagRelease
FD 3> *****************************************************
FD 3> BFR_CARD_TYPE_OC48_1P_POS testing...
FD 3> running in slot 3 (128 tests)
Executing all diagnostic tests in slot 3
(total/indiv. timeout set to 600/200 sec.)
FD 3> Verbosity now (0x00000001) TESTSDISP
FDIAG_STAT_IN_PROGRESS: test #1 R5K Internal Cache
FDIAG_STAT_IN_PROGRESS: test #2 Burst Operations
FDIAG_STAT_IN_PROGRESS: test #3 Subblock Ordering
FDIAG_STAT_IN_PROGRESS: test #4 Dram Marching Pattern
FDIAG_STAT_DONE_FAIL test_num 4, error_code 6
Field Diagnostic: ****TEST FAILURE**** slot 3: last test run 4,
Dram Marching Pattern, error 6
Field Diag eeprom values: run 2 fail mode 1 (TEST FAILURE) slot 3
last test failed was 4, error code 6
Shutting down diags in slot 3
slot 3 done, will not reload automatically
Depending on the error encountered, the slot might or might not be automatically reloaded. If it is not, it might be in a stuck or inconsistent state (check with the show diag [slot #] command) until manually reloaded. This is normal. In order to manually reload the card, use the hw-module slot [slot#] reload command.
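When diag output is captured to a file, the failing test can be picked out automatically. The sketch below assumes the output follows the format of the example above (FDIAG_STAT_IN_PROGRESS and FDIAG_STAT_DONE_FAIL lines); other software versions may format these lines differently, and the function name `find_diag_failure` is hypothetical.

```python
import re

# Patterns modeled on the example output above.
PROGRESS = re.compile(r"FDIAG_STAT_IN_PROGRESS:\s+test\s+#(\d+)\s+(.+)")
FAILURE = re.compile(r"FDIAG_STAT_DONE_FAIL\s+test_num\s+(\d+),\s*error_code\s+(\d+)")


def find_diag_failure(output):
    """Return details of the first failed test, or None if none failed."""
    test_names = {}
    for line in output.splitlines():
        progress = PROGRESS.search(line)
        if progress:
            # Remember each test name so the failure can be labeled.
            test_names[int(progress.group(1))] = progress.group(2).strip()
            continue
        failure = FAILURE.search(line)
        if failure:
            number = int(failure.group(1))
            return {
                "test_num": number,
                "test_name": test_names.get(number),
                "error_code": int(failure.group(2)),
            }
    return None


if __name__ == "__main__":
    captured = (
        "FDIAG_STAT_IN_PROGRESS: test #3 Subblock Ordering\n"
        "FDIAG_STAT_IN_PROGRESS: test #4 Dram Marching Pattern\n"
        "FDIAG_STAT_DONE_FAIL test_num 4, error_code 6\n"
    )
    print(find_diag_failure(captured))
```

A result of None only means that no failure line was found in the captured text; it does not by itself confirm that the diagnostic ran to completion.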
Cache Parity Exceptions
You can identify cache parity exceptions by the SIG=20 in the show context slot [slot #] output.
There are two different kinds of parity errors:
Soft parity errors: These occur when an energy level within the chip (for example, a one or a zero) changes. In case of a soft parity error, there is no need to swap the board or any of the components.
Hard parity errors: These occur when there is a chip or board failure that causes data to be corrupted. In this case, you should re-seat or replace the affected component, which usually means a memory chip swap or a board swap. Multiple parity errors at the same address indicate a hard parity error. There are more complicated cases which are harder to identify but, in general, if more than one parity error is seen in a particular memory region in a relatively short period of time (several weeks to months), this can be considered a hard parity error.
Studies have shown that soft parity errors are 10 to 100 times more frequent than hard parity errors.
In order to troubleshoot these errors, find a maintenance window to run the diag command for that slot.
If the diagnosis results in a failure, replace the line card.
If there is no failure, it is likely to be a soft parity error, and the line card does not have to be replaced (unless it crashes a second time with a parity error after a short period of time).
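The rule of thumb for telling hard parity errors from soft ones (repeated errors at the same address point to a hard error) can be expressed as a small check. This is a simplified sketch: it assumes the operator has already recorded the failing address of each parity crash as an integer, it only compares exact addresses rather than memory regions, and it ignores the time between crashes. The function name is hypothetical.

```python
from collections import Counter


def classify_parity_errors(addresses):
    """Classify recorded parity error addresses as likely hard or soft.

    Returns a (verdict, repeated_addresses) tuple.
    """
    counts = Counter(addresses)
    # Addresses seen more than once suggest a failing chip or board.
    repeated = sorted(addr for addr, seen in counts.items() if seen > 1)
    if repeated:
        return ("likely hard", repeated)
    return ("likely soft", [])


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    verdict, repeated = classify_parity_errors([0x1000, 0x2000, 0x1000])
    print(verdict, [hex(addr) for addr in repeated])
```

The verdict is a hint only; the diag result for the slot remains the deciding factor for replacing the card.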
Bus Error Exceptions
You can identify bus error exceptions by the SIG=10 in the show context slot [slot #] output.
This type of crash is normally software-related, but if for some reason (for example, it is a brand new card, or the crashes start after a power outage) you think the problem could be hardware-related, run the diag command for that slot.
Note: Some software bugs have been known to cause the diag command to report errors, even though there is no problem with the hardware. If a card has already been replaced, but still fails at the same test in the diagnostic, you might be affected by this issue. In that case, treat the crash as a software problem.
Upgrading to the latest version of your Cisco IOS software release train picks up the fixes for all resolved bugs that cause line card bus errors. If the crash is still present after the upgrade, collect the relevant information (see Gather Information about the Crash), along with a show tech-support, and any information that you think might be useful (such as a recent topology change, or a new feature recently implemented) and contact your Cisco support representative.
Software-Forced Crashes
You can identify software-forced crashes by the SIG=23 in the show context slot [slot #] output. Despite the name, these crashes are not always software-related.
The most common reason for software-forced crashes is the "Fabric Ping Timeout". During normal router operation, the Route Processor (RP) continually pings the line cards. If a line card doesn't answer, the route processor decides to reset it. This results in a software-forced crash (SIG=23) of the affected line card, and you should see these errors in the router's logs:
Mar 12 00:42:48: %GRP-3-FABRIC_UNI: Unicast send timed out (4)
Mar 12 00:42:50: %GRP-3-COREDUMP: Core dump incident on slot 4, error: Fabric ping failure
In order to troubleshoot fabric ping timeouts, you need to find out why the line card didn't respond to the ping. There can be multiple causes:
There are software bugs in Inter Process Communication (IPC) or the line card is running out of IPC buffers. Most of the time these software-forced reloads are caused by software bugs.
Upgrading to the latest version of your Cisco IOS software release train picks up the fixes for all resolved bugs that cause fabric ping timeouts. If the crash is still present after the upgrade, collect the relevant information (see Gather Information about the Crash), along with a show tech-support, a show ipc status, and any information that you think may be useful (such as a recent topology change, or a new feature recently implemented) and contact your Cisco support representative.
Hardware failure—If the card has been running fine for a long time and no recent topology, software, or feature changes have taken place, or if the problems started after a move or a power outage, defective hardware may be the cause. Run the diag command on the affected line card. Replace the line card, if faulty. If multiple line cards are affected or the diag is fine, replace the fabric.
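To see which slots were reset because of fabric ping failures, the router log can be scanned for the %GRP-3-COREDUMP message shown earlier. The sketch below assumes the message format matches that example exactly; the function name `fabric_ping_failure_slots` is hypothetical.

```python
import re

# Modeled on the %GRP-3-COREDUMP log line shown in the example above.
COREDUMP = re.compile(
    r"%GRP-3-COREDUMP: Core dump incident on slot (\d+), "
    r"error: Fabric ping failure"
)


def fabric_ping_failure_slots(log_text):
    """Return the sorted, de-duplicated slots that logged a fabric ping failure."""
    return sorted({int(match.group(1)) for match in COREDUMP.finditer(log_text)})


if __name__ == "__main__":
    log = (
        "Mar 12 00:42:48: %GRP-3-FABRIC_UNI: Unicast send timed out (4)\n"
        "Mar 12 00:42:50: %GRP-3-COREDUMP: Core dump incident on slot 4, "
        "error: Fabric ping failure\n"
    )
    print(fabric_ping_failure_slots(log))
```

If more than one slot appears in the result, the guidance above about multiple affected line cards applies, and the fabric becomes a suspect.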
%GSRSPA-6-ERRORRECOVER: A Hardware or Software error occurred on Subslot 0.Reason Marvel: TXECCERR Automatic Error recovery initiate
A TXECCERR or RXECCERR error is reported when the number of unrecoverable ECC error interrupts from the TxFIFO or RxFIFO in the MAC exceeds the threshold value within the time interval. Unrecoverable ECC errors cannot be corrected by the ECC logic. When an unrecoverable error occurs during an RxFIFO read, the packet to which the data belongs is marked with EOP/Abort on the SPI4 receive interface and is discarded by the upper layers.
This error is caused by the hardware and clears once the SIP/SPA is reloaded. The permanent solution is to replace the SIP/SPA.
Other crash types are far less common than the three described above. In most cases, the diag command should indicate whether the card needs to be replaced. If the card passes the diagnostic test, consider upgrading the software.
Information to Collect if You Open a TAC Service Request
If you still need assistance after following the troubleshooting steps above and want to open a service request (registered customers only) with the Cisco TAC, be sure to include the following information:
Troubleshooting performed before opening the service request.
show technical-support output (in enable mode if possible).
show log output or console captures, if available.
execute-on slot [slot #] show tech for the slot which experienced the line card crash.
Attach the collected data to your service request in non-zipped, plain text format (.txt). You can attach information to your service request by uploading it with the TAC Service Request tool (registered customers only). If you cannot access the Service Request tool, you can send the information in an email attachment to email@example.com with your service request number in the subject line of your message.
Note: Do not manually reload or power-cycle the router before you collect the above information, unless this is required to troubleshoot the line card crash. A reload can erase information that is needed to determine the root cause of the problem.