Troubleshooting Line Card Crashes on the Cisco 12000 Series Internet Router

Document ID: 12770

Updated: Apr 23, 2007

Introduction

This document provides information on how to troubleshoot line card crashes on the Cisco 12000 Series Internet Router.

Prerequisites

Requirements

There are no specific requirements for this document.

Components Used

The information in this document is based on these software and hardware versions:

  • All Cisco 12000 Series Internet Routers, including the 12008, 12012, 12016, 12404, 12406, 12410, and the 12416.

  • All Cisco IOS® Software versions that support the Cisco 12000 Series Internet Router.

The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.

Conventions

Refer to Cisco Technical Tips Conventions for more information on document conventions.

Background Information

This section provides background on how to identify a line card crash.

Identify a Line Card Crash

In order to quickly identify a line card crash, use the show context summary command:

   Router#show context summary 
       CRASH INFO SUMMARY 
         Slot 0 : 0 crashes 
         Slot 1 : 0 crashes 
         Slot 2 : 0 crashes 
         Slot 3 : 0 crashes 
         Slot 4 : 1 crashes 
           1 - crash at 04:28:56 EDT Tue Apr 20 1999 
         Slot 5 : 0 crashes 
         Slot 6 : 0 crashes 
         Slot 7 : 0 crashes 
         Slot 8 : 0 crashes 
         Slot 9 : 0 crashes 
         Slot 10: 0 crashes 
         Slot 11: 0 crashes
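
When the summary has been captured to a text file, the slots that report crashes can be pulled out with a short script. The following is a minimal Python sketch, not a Cisco tool: the function name crashed_slots and the sample text are illustrative assumptions, and the line format is taken only from the output shown above.

```python
import re

# Abbreviated sample, copied from the "show context summary" output above.
SAMPLE = """\
CRASH INFO SUMMARY
  Slot 0 : 0 crashes
  Slot 4 : 1 crashes
    1 - crash at 04:28:56 EDT Tue Apr 20 1999
  Slot 10: 0 crashes
"""

# Matches lines such as "Slot 4 : 1 crashes" or "Slot 10: 0 crashes".
SLOT_LINE = re.compile(r"^\s*Slot\s+(\d+)\s*:\s*(\d+)\s+crashes", re.MULTILINE)


def crashed_slots(summary_text):
    """Return {slot number: crash count} for slots with at least one crash."""
    return {
        int(slot): int(count)
        for slot, count in SLOT_LINE.findall(summary_text)
        if int(count) > 0
    }


print(crashed_slots(SAMPLE))  # prints {4: 1}
```

Each slot number returned this way is a candidate for the show context slot [slot #] command described later in this document.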

If the crash affects the router itself (and not the line card only), refer to Troubleshooting Router Crashes.

Gather Information About the Crash

In order to collect the relevant data about the crash, use the commands shown in Table 1.

Table 1 – Commands to Use to Collect Data About the Crash

Command | Description
show version | Provides general information about the system's hardware and software configurations.
show logging | Displays the general logs of the router.
show diag [slot #] | Provides specific information about a particular slot: type of engine, hardware revisions, memory configuration, and so on.
show context slot [slot #] | Provides context information about the recent crash(es). This is often the most useful command for troubleshooting line card crashes.
core dump | A core dump of a line card is the full content of its memory at the time of the crash. This data is normally not needed for initial troubleshooting. It may be required later if the problem turns out to be a new software bug. In that case, refer to Configuring a Core Dump on a GSR Line Card.
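
As a rough illustration, the commands in Table 1 can be generated for a given slot with a small Python helper, for example to paste into a terminal session or a capture script. The function name collection_commands is hypothetical; the command strings come from Table 1, with the slot number substituted for [slot #].

```python
def collection_commands(slot):
    """Return the Table 1 show commands for one line card slot."""
    if not isinstance(slot, int) or slot < 0:
        raise ValueError("slot must be a non-negative integer")
    return [
        "show version",
        "show logging",
        f"show diag {slot}",
        f"show context slot {slot}",
    ]


for command in collection_commands(4):
    print(command)
```

The core dump is left out on purpose, because Table 1 notes that it is normally not needed for initial troubleshooting.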

If you have the output of a show tech-support command (run from enable mode) from your Cisco device, you can use Cisco's online output analysis tool to display potential issues and fixes. In order to use the tool, you must be a registered customer, be logged in, and have JavaScript enabled.

Analyze the Collected Data

Check the value of the SIG= field on the "System exception" line of the show context slot [slot #] output:

       Router#show context slot 4 
       CRASH INFO: Slot 4, Index 1, Crash at 04:28:56 EDT Tue Apr 20 1999 

       VERSION: 
       GS Software (GLC1-LC-M), Version 11.2(15)GS1a, EARLY DEPLOYMENT RELEASE 
       SOFTWARE (fc1) 
       Compiled Mon 28-Dec-98 14:53 by tamb 
       Card Type: 1 Port Packet Over SONET OC-12c/STM-4c, S/N CAB020500AL 
       System exception: SIG=20, code=0xA414EF5A, context=0x40337424 

       Traceback Using RA 
       STACK TRACE: 
         traceback 4014CFC0 40141AB8 40143944 4014607C 4014A7EC 401499D4 40149BB4 
       40149FD4 40080118 40080104 
       CONTEXT: 
       $0 : 00000000, AT : 40330000, v0 : 00000000, v1 : 00000038 
       a0 : 4094EF58, a1 : 00000120, a2 : 00000002, a3 : 00000001 
       t0 : 00000010, t1 : 3400BF01, t2 : 34008D00, t3 : FFFF00FF 
       t4 : 400A1410, t5 : 00000002, t6 : 00000000, t7 : 4041783C 
       s0 : 4093F980, s1 : 4093F980, s2 : 4094EEF0, s3 : 4094EF00 
       s4 : 00000000, s5 : 00000001, s6 : 00000000, s7 : 00000000 
       t8 : 34008000, t9 : 00000000, k0 : 404D1860, k1 : 400A2F68 
       gp : 402F3070, sp : 4082BFB0, s8 : 00000000, ra : 400826FC 
       EPC : 0x40098824, SREG : 0x3400BF04, Cause : 0x00000000 
       ErrorEPC : 0x4015B7E4

See Table 2 to find out what error reason matches the SIG value you recorded.

Table 2 – Find the Error That Matches the SIG Value

SIG Value | SIG Name | Error Reason
2 | SIGINT | Unexpected hardware interrupt.
3 | SIGQUIT | Abort due to break key.
4 | SIGILL | Illegal opcode exception.
5 | SIGTRAP | Abort due to Break Point or an arithmetic exception.
8 | SIGFPE | Floating point unit (FPU) exception.
9 | SIGKILL | Reserved exception.
10 | SIGBUS | Bus error exception.
11 | SIGSEGV | SegV exception.
20 | SIGCACHE | Cache parity exception.
21 | SIGWBERR | Write bus error interrupt.
22 | SIGERROR | Fatal hardware error.
23 | SIGRELOAD | Software-forced crash.

Note: Cache Parity Exception (SIG=20), Bus Error Exception (SIG=10), and Software-forced Crashes (SIG=23) account for more than 95% of line card crashes.
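
The lookup in Table 2 can also be done programmatically on captured output. The sketch below is illustrative only: the names SIG_TABLE and explain_sig are assumptions, the table contents are copied from Table 2, and the regular expression relies on the "System exception: SIG=..." line format shown in the sample output above.

```python
import re

# SIG value -> (SIG name, error reason), copied from Table 2.
SIG_TABLE = {
    2: ("SIGINT", "Unexpected hardware interrupt"),
    3: ("SIGQUIT", "Abort due to break key"),
    4: ("SIGILL", "Illegal opcode exception"),
    5: ("SIGTRAP", "Abort due to Break Point or an arithmetic exception"),
    8: ("SIGFPE", "Floating point unit (FPU) exception"),
    9: ("SIGKILL", "Reserved exception"),
    10: ("SIGBUS", "Bus error exception"),
    11: ("SIGSEGV", "SegV exception"),
    20: ("SIGCACHE", "Cache parity exception"),
    21: ("SIGWBERR", "Write bus error interrupt"),
    22: ("SIGERROR", "Fatal hardware error"),
    23: ("SIGRELOAD", "Software-forced crash"),
}

SIG_FIELD = re.compile(r"System exception:\s*SIG=(\d+)")


def explain_sig(context_output):
    """Return (value, name, reason) for the first SIG= field, or None."""
    match = SIG_FIELD.search(context_output)
    if match is None:
        return None
    sig = int(match.group(1))
    name, reason = SIG_TABLE.get(sig, ("unknown", "Not listed in Table 2"))
    return sig, name, reason


line = "System exception: SIG=20, code=0xA414EF5A, context=0x40337424"
print(explain_sig(line))  # prints (20, 'SIGCACHE', 'Cache parity exception')
```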

The diag Exec Command

The Cisco 12000 Series supports the diag [slot#] command for testing the different board components. This command is useful for troubleshooting hardware-related crashes and for identifying the faulty board.

The verbose option causes the router to display the list of tests as they are being performed. Otherwise, it simply displays a "PASSED" or "FAILURE" message.

Note: Performing this diagnostic stops all activities of the line card for the duration of the tests (usually around five minutes).

Starting with Cisco IOS Software Release 12.0(22)S, Cisco has unbundled the Cisco 12000 Series Internet Router field diagnostics line card image from the Cisco IOS software image. In earlier versions, launching diagnostics from the command line ran the embedded image. In order to accommodate customers with 20 MB Flash memory cards, line card field diagnostics are now stored and maintained as a separate image that must be available on a Flash memory card or a Trivial File Transfer Protocol (TFTP) boot server before the field diagnostics commands can be used. Route processor and switch fabric field diagnostics continue to be bundled and need not be launched from a separate image. You can find more information at Field Diagnostics for the Cisco 12000 Series Internet Router.

Here is an example of a diag [slot#] command output:

Router#diag 3 verbose 
Running DIAG config check 
Running Diags will halt ALL activity on the requested slot. 
[confirm] 
CR1.LND10# 
Launching a Field Diagnostic for slot 3 
Downloading diagnostic tests to slot 3 (timeout set to 400 sec.) 
Field Diag download COMPLETE for slot 3 
FD 3> ***************************************************** 
FD 3> GSR Field Diagnostics V3.0 
FD 3> Compiled by award on Tue Aug 3 15:58:13 PDT 1999 
FD 3> view: award-bfr_112.FieldDiagRelease 
FD 3> ***************************************************** 
FD 3> BFR_CARD_TYPE_OC48_1P_POS testing... 
FD 3> running in slot 3 (128 tests) 

Executing all diagnostic tests in slot 3 
(total/indiv. timeout set to 600/200 sec.) 
FD 3> Verbosity now (0x00000001) TESTSDISP 

FDIAG_STAT_IN_PROGRESS: test #1 R5K Internal Cache 
FDIAG_STAT_IN_PROGRESS: test #2 Burst Operations 
FDIAG_STAT_IN_PROGRESS: test #3 Subblock Ordering 
FDIAG_STAT_IN_PROGRESS: test #4 Dram Marching Pattern 
FDIAG_STAT_DONE_FAIL test_num 4, error_code 6 
Field Diagnostic: ****TEST FAILURE**** slot 3: last test run 4, 
Dram Marching Pattern, error 6 
Field Diag eeprom values: run 2 fail mode 1 (TEST FAILURE) slot 3 
last test failed was 4, error code 6 
Shutting down diags in slot 3 

slot 3 done, will not reload automatically

Depending on the error encountered, the slot might or might not be automatically reloaded. If it is not, it might be in a stuck or inconsistent state (check with the show diag [slot #] command) until manually reloaded. This is normal. In order to manually reload the card, use the hw-module slot [slot#] reload command.
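
If the verbose diagnostic output is logged to a file, the failing test can be extracted automatically. The following Python sketch assumes only the FDIAG_STAT_IN_PROGRESS and FDIAG_STAT_DONE_FAIL line formats shown in the example above; the function name first_failure is hypothetical, and other diagnostic versions might print different lines.

```python
import re

# Lines copied from the verbose diag example above.
SAMPLE = """\
FDIAG_STAT_IN_PROGRESS: test #3 Subblock Ordering
FDIAG_STAT_IN_PROGRESS: test #4 Dram Marching Pattern
FDIAG_STAT_DONE_FAIL test_num 4, error_code 6
"""

IN_PROGRESS = re.compile(r"FDIAG_STAT_IN_PROGRESS:\s+test\s+#(\d+)\s+(.+)")
DONE_FAIL = re.compile(r"FDIAG_STAT_DONE_FAIL\s+test_num\s+(\d+),\s*error_code\s+(\d+)")


def first_failure(diag_output):
    """Return (test number, test name, error code) of the first failure, or None."""
    test_names = {}
    for raw_line in diag_output.splitlines():
        line = raw_line.strip()
        started = IN_PROGRESS.search(line)
        if started:
            # Remember the name so it can be reported if this test fails.
            test_names[int(started.group(1))] = started.group(2).strip()
            continue
        failed = DONE_FAIL.search(line)
        if failed:
            number = int(failed.group(1))
            return number, test_names.get(number, "unknown"), int(failed.group(2))
    return None


print(first_failure(SAMPLE))  # prints (4, 'Dram Marching Pattern', 6)
```

A result of None means only that no failure line was found in the captured text, so the overall pass or fail message of the diag command should still be checked.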

Cache Parity Exceptions

You can identify cache parity exceptions by the SIG=20 in the show context slot [slot #] output.

There are two different kinds of parity errors:

  • Soft parity errors—These occur when an energy level within the chip (for example, a one or a zero) changes. In case of a soft parity error, there is no need to swap the board or any of the components.

  • Hard parity errors— These occur when there is a chip or board failure that causes data to be corrupted. In this case, you should re-seat or replace the affected component, usually a memory chip swap or a board swap. There is a hard parity error when multiple parity errors are seen at the same address. There are more complicated cases which are harder to identify but, in general, if more than one parity error is seen in a particular memory region in a relatively short period of time (several weeks to months), this can be considered a hard parity error.

Studies have shown that soft parity errors are 10 to 100 times more frequent than hard parity errors.

In order to troubleshoot these errors, find a maintenance window to run the diag command for that slot.

  • If the diagnosis results in a failure, replace the line card.

  • If there is no failure, it is likely to be a soft parity error, and the line card does not have to be replaced (unless it crashes a second time with parity error after a short period of time).
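
The decision described above can be sketched as a small Python function. This is an illustration, not a Cisco procedure: the name parity_action is hypothetical, and the 90-day REPEAT_WINDOW is an assumed value, because the text only speaks of a "short period of time" and of "several weeks to months".

```python
from datetime import datetime, timedelta

# Assumption: the document gives no exact figure for a "short period of time".
REPEAT_WINDOW = timedelta(days=90)


def parity_action(diag_passed, parity_crash_times):
    """Suggest an action after a SIG=20 crash, following the two bullets above."""
    if not diag_passed:
        return "replace line card"
    times = sorted(parity_crash_times)
    # A second parity crash soon after the first points to a hard parity error.
    for earlier, later in zip(times, times[1:]):
        if later - earlier <= REPEAT_WINDOW:
            return "replace line card"
    return "monitor (likely soft parity error)"


print(parity_action(True, [datetime(1999, 4, 20, 4, 28, 56)]))
# prints monitor (likely soft parity error)
```

Adjust REPEAT_WINDOW to whatever interval your operations team treats as a repeat failure.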

Bus Error Exceptions

You can identify bus error exceptions by the SIG=10 in the show context slot [slot #] output.

This type of crash is normally software-related, but if for some reason (for example, it is a brand new card, or the crashes start after a power outage) you think the problem could be hardware-related, run the diag command for that slot.

Note: Some software bugs have been known to cause the diag command to report errors, even though there is no problem with the hardware. If a card has already been replaced, but still fails at the same test in the diagnostic, you might be affected by this issue. In that case, treat the crash as a software problem.

Upgrading to the latest version of your Cisco IOS software release train brings in the fixes for all resolved bugs that cause line card bus errors. If the crash still occurs after the upgrade, collect the relevant information (see Gather Information About the Crash), along with show tech-support output and any other information that you think might be useful (such as a recent topology change or a newly implemented feature), and contact your Cisco support representative.

Software-forced Crashes

You can identify software-forced crashes by the SIG=23 in the show context slot [slot #] output. Despite the name, these crashes are not always software-related.

The most common reason for software-forced crashes is the "Fabric Ping Timeout". During normal router operation, the Route Processor (RP) continually pings the line cards. If a line card doesn't answer, the route processor decides to reset it. This results in a software-forced crash (SIG=23) of the affected line card, and you should see these errors in the router's logs:

Mar 12 00:42:48: %GRP-3-FABRIC_UNI: Unicast send timed out (4)
Mar 12 00:42:50: %GRP-3-COREDUMP: Core dump incident on slot 4, error: Fabric ping failure
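
To find which slots were reset because of a fabric ping failure, saved show logging output can be scanned for the %GRP-3-COREDUMP message. The Python sketch below is illustrative: the function name fabric_ping_failures is hypothetical, and the pattern is based only on the message format shown above.

```python
import re

# Log lines copied from the example above.
SAMPLE_LOG = """\
Mar 12 00:42:48: %GRP-3-FABRIC_UNI: Unicast send timed out (4)
Mar 12 00:42:50: %GRP-3-COREDUMP: Core dump incident on slot 4, error: Fabric ping failure
"""

COREDUMP = re.compile(
    r"%GRP-3-COREDUMP: Core dump incident on slot (\d+), error: (.+)"
)


def fabric_ping_failures(log_text):
    """Return the slot numbers reported with a fabric ping failure, in log order."""
    slots = []
    for line in log_text.splitlines():
        match = COREDUMP.search(line)
        if match and "fabric ping" in match.group(2).lower():
            slots.append(int(match.group(1)))
    return slots


print(fabric_ping_failures(SAMPLE_LOG))  # prints [4]
```

If several different slots appear in the result, that matches the multiple-line-card case discussed in the hardware failure bullet below.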

In order to troubleshoot fabric ping timeouts, you need to find out why the line card didn't respond to the ping. There can be multiple causes:

  • The line card is experiencing high CPU utilization. You can verify this with the execute-on slot [slot #] show proc cpu command. If CPU utilization is very high (above 95%), refer to Troubleshooting High CPU Utilization on Cisco Routers.

  • There are software bugs in Inter Process Communication (IPC) or the line card is running out of IPC buffers. Most of the time these software-forced reloads are caused by software bugs.

    Upgrading to the latest version of your Cisco IOS software release train brings in the fixes for all resolved bugs that cause fabric ping timeouts. If the crash still occurs after the upgrade, collect the relevant information (see Gather Information About the Crash), along with show tech-support output, show ipc status output, and any other information that you think may be useful (such as a recent topology change or a newly implemented feature), and contact your Cisco support representative.

  • Hardware failure—If the card has been running fine for a long time and no recent topology, software, or feature changes have taken place, or if the problems started after a move or a power outage, defective hardware may be the cause. Run the diag command on the affected line card. Replace the line card, if faulty. If multiple line cards are affected or the diag is fine, replace the fabric.

On a SIP/SPA, you might also see this error message in the logs:

%GSRSPA-6-ERRORRECOVER: A Hardware or Software error occurred on Subslot 0.Reason Marvel: TXECCERR Automatic Error recovery initiate

A TXECCERR or RXECCERR error occurs when the number of unrecoverable TxFIFO or RxFIFO ECC error interrupts in the MAC exceeds the threshold value within the time interval. Unrecoverable ECC errors cannot be corrected by the ECC logic. When an unrecoverable error occurs during an RxFIFO read, the packet to which the data belongs is marked with EOP/Abort on the SPI4 receive interface and is discarded by the upper layers.

This error is caused by the hardware and clears when the SIP/SPA is reloaded. The permanent solution is to replace the SIP/SPA.

Other Crashes

Other crash types are far less common than the three described above. In most cases, the diag command should indicate whether the card needs to be replaced. If the card passes the diagnostic test, consider upgrading the software.

Information to Collect if You Open a TAC Service Request

If you still need assistance after you follow the troubleshooting steps above and want to open a service request (registered customers only) with the Cisco TAC, be sure to include this information:

  • Troubleshooting performed before opening the service request.

  • show technical-support output (in enable mode if possible).

  • show log output or console captures, if available.

  • execute-on slot [slot #] show tech output for the slot that experienced the line card crash.

Attach the collected data to your service request in non-zipped, plain text format (.txt). You can attach information to your service request by uploading it with the TAC Service Request tool (registered customers only). If you cannot access the Service Request tool, you can send the information as an email attachment to attach@cisco.com with your service request number in the subject line of your message.

Note: Do not manually reload or power-cycle the router before you collect the above information, unless this is required to troubleshoot the line card crash. A reload or power cycle can erase information that is needed to determine the root cause of the problem.
