Document ID: 42102
Contents
Introduction
Prerequisites
Requirements
Components Used
Conventions
What are QA Errors?
What Types of QA Errors Can You See?
How to Troubleshoot QA Errors
Zero Link Error
Reused Error
Improve the Troubleshooting Mechanism
Case Studies
%RSP-2-QAERROR: Zero Link Error
%RSP-2-QAERROR: Reused Error
Information to Collect
Related Information
Introduction
This document explains the causes of Queueing and Accumulator ASIC (QA) errors on Cisco 7500 Series Routers, and how to troubleshoot them.
Prerequisites
Requirements
There are no specific prerequisites for this document.
Components Used
The information in this document is based on this hardware platform:
- Cisco 7500 Series Routers
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.
Conventions
For more information on document conventions, refer to the Cisco Technical Tips Conventions.
What are QA Errors?
Route Switch Processors (RSPs) for Cisco 7500 routers have a fast packet memory reserved for packet buffers. Both RSP CPU and interface processors have direct access to this packet memory (also known as MEMD).
At boot-up time, the MEMD is carved. This means that the total available memory is distributed among the different interface processors (classic xIP or Versatile Interface Processor (VIP) cards) that are present in the router. This distribution is optimized according to the characteristics of each interface, mainly speed and Maximum Transmission Unit (MTU).
When a packet comes into the router, if the VIP can switch it itself, it does so. If it cannot (for example, distributed switching is not enabled, the received packet is not an IP packet, it has to be sent out through an interface that is not in the same VIP, and so on), the packet is sent to the RSP and buffered in MEMD.
If this packet in MEMD is going to be transmitted by another VIP, this second VIP takes it from MEMD without any intervention from the RSP CPU.
If the intervention of the RSP CPU is required (the packet has to be process-switched or the destination is the router itself), the packet data remains in MEMD and the MEMD buffer header is enqueued onto the Raw Queue (RawQ) for processing by the RSP.
QA errors appear when a buffer header is enqueued into the Queueing and Accumulator ASIC (QAASIC) and the ASIC detects a problem with the buffer header pointer. The ASIC signals a QA error, and the RSP dumps the diagnostics and performs a cbus complex.
In order to understand the different types of QA errors, it is important to know how the different queues in MEMD work, and what all the values in the output of the show controller cbus command mean, as this shows the status of all the queues.
What Types of QA Errors Can You See?
There are two causes of QA errors. In both cases, the beginning of the error log looks like this:
%RSP-2-QAERROR: reused or zero link error, write at addr 1A80 (QA) log 221A80C0, data 98800000 00000000
%QA-3-DIAG: Failed to enqueue buffer header 0x9880
Then, you can distinguish:
- Zero link error: This happens when you enqueue a buffer header with address 0x0 into the hardware queue.
- Reused error: The QAASIC detects that the buffer header that gets enqueued is a duplicate. That is, a buffer header with the same address has already been enqueued into some other hardware queue in the QAASIC.
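The distinction between the two error types can be sketched in a few lines of Python. This is a minimal illustration, not a Cisco tool: the function name `classify_qa_error` is hypothetical, and the only thing it relies on is the "Failed to enqueue buffer header" message format quoted in this document.

```python
import re

# Matches the diagnostic line that follows %RSP-2-QAERROR in the log.
ENQUEUE_RE = re.compile(
    r"%QA-3-DIAG: Failed to enqueue buffer header 0x([0-9A-Fa-f]+)"
)

def classify_qa_error(log_text):
    """Return 'zero link', 'reused', or None if no enqueue failure is logged."""
    match = ENQUEUE_RE.search(log_text)
    if match is None:
        return None
    header = int(match.group(1), 16)
    # A buffer header address of 0x0 indicates a zero link error;
    # any other address indicates a duplicate (reused) buffer header.
    return "zero link" if header == 0 else "reused"

sample = (
    "%RSP-2-QAERROR: reused or zero link error, write at addr 1A80 (QA) "
    "log 221A80C0, data 98800000 00000000 "
    "%QA-3-DIAG: Failed to enqueue buffer header 0x9880"
)
print(classify_qa_error(sample))
```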
How to Troubleshoot QA Errors
This section explains how to recognize the two types of QA errors and how to troubleshoot them.
Zero Link Error
In the case of a zero link error, the log shows this error message:
%QA-3-DIAG: Failed to enqueue buffer header 0x0
In this case, the logs after the error do not show detailed diagnostic of the hardware queues and the buffer header address is 0x0.
Normally, this problem is caused by hardware issues, although in some cases it can also be due to a software bug.
In this situation, look for parity errors in the log and replace the part that reports this problem.
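Note: When a MEMD parity error occurs, the router software deliberately enqueues the buffer header 0x0 into the Raw Queue to cause cleanup of the queues. This is why a zero link error often appears together with parity errors in the log.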
If there are no parity errors, collect the output of the show technical-support and show log commands, and create a service request with Cisco Technical Support.
These troubleshooting steps are illustrated in the Zero Link Error case study in the Case Studies section.
Reused Error
MEMD is organized in several queues based on the different interfaces installed in the router and their MTU. Each of these queues has a NULL value at the end that shows which one is the last element of the queue.
In the case of a reused error, the log shows these messages:
- %QA-3-DIAG: Queue 0x33 (48000198) has 88 elements
  This means that the queue 0x33 at address 0x48000198 has 88 elements (buffer headers) enqueued into it.
- %QA-3-DIAG: No NULL terminator for queue 0x36
  This means that the queue 0x36 does not have a NULL terminator at the end.
- %QA-3-DIAG: Buffer 0x9880 is element 1 on queue 0x350
  This means that the buffer header 0x9880 is enqueued into the queue 0x350 as the first element. The buffer header reported in the initial QA error has been found in one of the hardware queues, which means that there is a duplicate buffer header.
- Sep 5 16:09:38: %QA-3-DIAG: At least one QA queue is broken
  This is the final diagnostic message.
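As an illustration of how these four message types can be read together, this Python sketch collects them into one summary. It is an assumption-laden helper, not part of Cisco IOS: the name `summarize_diag` and the shape of the returned dictionary are hypothetical, and the patterns cover only the message formats shown above.

```python
import re

QUEUE_RE = re.compile(
    r"Queue 0x([0-9A-Fa-f]+) \(([0-9A-Fa-f]+)\) has (\d+) elements"
)
NO_NULL_RE = re.compile(r"No NULL terminator for queue 0x([0-9A-Fa-f]+)")
PLACED_RE = re.compile(
    r"Buffer 0x([0-9A-Fa-f]+) is element (\d+) on queue 0x([0-9A-Fa-f]+)"
)

def summarize_diag(log_text):
    """Collect queue sizes, unterminated queues, and buffer placements."""
    queues = {
        int(q, 16): {"address": int(addr, 16), "elements": int(count)}
        for q, addr, count in QUEUE_RE.findall(log_text)
    }
    unterminated = sorted(int(q, 16) for q in NO_NULL_RE.findall(log_text))
    placements = [
        (int(buf, 16), int(pos), int(q, 16))
        for buf, pos, q in PLACED_RE.findall(log_text)
    ]
    return {
        "queues": queues,
        "unterminated": unterminated,
        "placements": placements,  # (buffer header, element index, queue)
        "broken": "At least one QA queue is broken" in log_text,
    }

sample = """%QA-3-DIAG: Queue 0x33 (48000198) has 88 elements
%QA-3-DIAG: No NULL terminator for queue 0x36
%QA-3-DIAG: Buffer 0x9880 is element 1 on queue 0x350
Sep 5 16:09:38: %QA-3-DIAG: At least one QA queue is broken"""
summary = summarize_diag(sample)
print(summary)
```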
In the log messages, there are two pieces of information that are important:
- The traceback of the functions that the RSP was executing at the moment the QA error occurred:
%QA-3-DIAG: Approximate stack backtrace prior to interrupt:
%QA-3-DIAG: -Traceback= 6044C7A8 6044C764 605FD320 602E8198 602E4D9C 60010A0C
If the tracebacks do not show the RSP doing an enqueue, this means that the RSP did not do the enqueue that triggered the QA error. It is one of the VIP cards or the slave RSP that did the enqueue. This can be due to a faulty VIP, a faulty slave RSP, or a software bug.
If you have these tracebacks from your Cisco device, you can use Output Interpreter (registered customers only) to display the decode and potential issues and fixes.
- The dump of the packet headers in the queue where the problem appeared:
%QA-3-DIAG: Buffer header at 0x40009880: 3CDC88 5E20190 1900000 3CDC80
If you create a service request with Cisco Technical Support, give this information to the Support Engineer because decoding these packet headers can provide some information about the issue.
These troubleshooting steps are illustrated in the Reused Error case study in the Case Studies section.
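When you gather the traceback for a decoder or for a service request, it can help to pull the addresses out of the log programmatically. This is a minimal sketch that assumes the "-Traceback=" format shown in this section. The function name `traceback_addresses` is hypothetical, and the sketch only extracts the addresses. It does not decode them.

```python
import re

# One or more 8-digit hexadecimal addresses after "-Traceback=".
TRACEBACK_RE = re.compile(r"-Traceback=((?:\s+[0-9A-Fa-f]{8}\b)+)")

def traceback_addresses(log_text):
    """Return one list of address strings per traceback found in the log."""
    return [m.group(1).split() for m in TRACEBACK_RE.finditer(log_text)]

sample = (
    "%QA-3-DIAG: Approximate stack backtrace prior to interrupt: "
    "%QA-3-DIAG: -Traceback= 6044C7A8 6044C764 605FD320 602E8198 "
    "602E4D9C 60010A0C"
)
for addresses in traceback_addresses(sample):
    print(" ".join(addresses))
```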
Improve the Troubleshooting Mechanism
The main issue with QA errors is that they trigger a cbus complex. This causes a few minutes of downtime and disrupts traffic.
In order to reduce this impact, a new command was introduced by Cisco bug ID CSCin29078 (registered customers only). With this command enabled, the cbus complex only takes place if the QA error occurs three times.
This new recovery mechanism can be enabled by configuring hw-module main-cpu qa error-recovery in the active RSP. The command can be saved in nonvolatile RAM (NVRAM). Also, the "no" form of the command (no hw-module main-cpu qa error-recovery) is not seen in the running configuration and startup-configuration because it is the default and leads to the original behavior.
When you enable this command and you face a QA error, these messages appear:
%QA-3-DIAG: Trying to recover from QA ERROR.
%QA-3-DIAG: Removing buffer header 0xE360 from all queues
%QA-3-DIAG: Buffer 0xE360 is element 155 on queue 0x2E
%QA-3-DIAG: Queue 0x2E (48000170) has 154 elements
%QA-3-DIAG: Buffer 0xE360 is element 1 on queue 0x340
%QA-3-DIAG: Queue 0x340 (48001A00) has 0 elements
%QA-3-DIAG: At least one QA queue is broken
%QA-3-DIAG: Recovered from QA ERROR
There is also information about these errors in the output of the show controllers cbus command. A new field is also added:
Router# show controller cbus
MEMD at 40000000, 2097152 bytes (unused 1728, recarves 1962, lost/qaerror recoveries 0/0)
  RawQ 48000100, ReturnQ 48000108, EventQ 48000110
  BufhdrQ 48000138 (2849 items), LovltrQ 48000150 (42 items, 1632 bytes)
  IpcbufQ 48000158 (32 items, 4096 bytes)
  3570 buffer headers (48002000 - 4800FF10)
  pool0: 15 buffers, 256 bytes, queue 48000140
  pool1: 368 buffers, 1536 bytes, queue 48000148
  pool2: 260 buffers, 4544 bytes, queue 48000160
  pool3: 4 buffers, 4576 bytes, queue 48000168
Although this mechanism reduces the impact of QA errors, you still need to troubleshoot them and find out the root cause. Normally, if the recovery succeeds, the problem is caused by a software issue. Otherwise, it is usually a hardware issue. Cisco recommends that you follow the procedures indicated in the previous sections for the different cases.
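If you monitor several routers, you can track the recovery counters from the MEMD line of the show controllers cbus output. This Python sketch is an illustration only. The function name `qa_recovery_counters` is hypothetical, and the interpretation of the two numbers (lost-buffer recoveries first, QA error recoveries second) is an assumption based on the order of the labels in the "lost/qaerror recoveries" field.

```python
import re

RECOVERY_RE = re.compile(r"lost/qaerror recoveries (\d+)/(\d+)")

def qa_recovery_counters(show_output):
    """Return the two recovery counters, or None if the field is absent."""
    match = RECOVERY_RE.search(show_output)
    if match is None:
        # The field is not present, for example on software without it.
        return None
    # Assumption: the counters follow the order of the labels.
    return {"lost": int(match.group(1)), "qaerror": int(match.group(2))}

sample = (
    "MEMD at 40000000, 2097152 bytes "
    "(unused 1728, recarves 1962, lost/qaerror recoveries 0/0)"
)
print(qa_recovery_counters(sample))
```

A QA error counter that increases between two samples suggests that the recovery mechanism has run, and that the root cause still needs to be investigated.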
Case Studies
%RSP-2-QAERROR: Zero Link Error
This case study shows how to apply the troubleshooting steps to a zero link error.
The logs show this information:
%RSP-2-QAERROR: reused or zero link error, write at addr 0110 (QA) log 22011000, data 00000000 00000000
%QA-3-DIAG: Failed to enqueue buffer header 0x0
%QA-3-DIAG: Approximate stack backtrace prior to interrupt:
%QA-3-DIAG: -Traceback= 6025AE28 6050CA5C
Normally, these logs are caused by a parity error in MEMD. Determine whether the MEMD itself is faulty or if another card (a VIP or the slave RSP) has written bad parity there, or whether the hardware is fine and this behavior is caused by a bug.
The logs usually reveal the reason for the error, for example, a VIP crash that precedes the error messages or parity issues in MEMD. This output shows an example:
%RSP-3-ERROR: MD error 0000008000002000
%RSP-3-ERROR: SRAM parity error (bytes 0:7) 02
%RSP-3-ERROR: MEMD parity error condition
%RSP-2-QAERROR: reused or zero link error, write at addr 0100 (QA) log 22010000, data 00000000 00000000
%QA-3-DIAG: Failed to enqueue buffer header 0x0
%QA-3-DIAG: Approximate stack backtrace prior to interrupt:
%QA-3-DIAG: -Traceback= 6019B808 60196A38 600109B0
%RSP-3-RESTART: cbus complex
In this case, it is a faulty MEMD because a parity error has been detected there.
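The check for parity errors that precede the QA error can be sketched in Python. This is a rough illustration under stated assumptions: the helper names `split_messages` and `parity_before_qa_error` are hypothetical, and the sketch treats any %RSP-3-ERROR message that contains the words "parity error" as a parity report, which matches the example output above but might not cover every message format.

```python
import re

# Start of a syslog-style message, for example "%RSP-3-ERROR:".
MSG_RE = re.compile(r"%[A-Z0-9_]+-\d-[A-Z0-9_]+:")

def split_messages(log_text):
    """Split a log into individual messages, one per %FACILITY-SEV-MNEMONIC."""
    starts = [m.start() for m in MSG_RE.finditer(log_text)]
    ends = starts[1:] + [len(log_text)]
    return [log_text[s:e].strip() for s, e in zip(starts, ends)]

def parity_before_qa_error(log_text):
    """Return parity messages logged before the first QA error.

    Returns None when the log contains no QA error at all.
    """
    parity = []
    for msg in split_messages(log_text):
        if msg.startswith("%RSP-2-QAERROR"):
            return parity
        if msg.startswith("%RSP-3-ERROR") and "parity error" in msg:
            parity.append(msg)
    return None

sample = """%RSP-3-ERROR: MD error 0000008000002000
%RSP-3-ERROR: SRAM parity error (bytes 0:7) 02
%RSP-3-ERROR: MEMD parity error condition
%RSP-2-QAERROR: reused or zero link error, write at addr 0100 (QA) log 22010000, data 00000000 00000000
%QA-3-DIAG: Failed to enqueue buffer header 0x0"""
for msg in parity_before_qa_error(sample):
    print(msg)
```

An empty list means that a QA error was logged without a preceding parity report, which is the situation in which this document recommends that you create a service request.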
If you cannot determine where the problem comes from, create a service request with Cisco Technical Support.
%RSP-2-QAERROR: Reused Error
This output shows an example of a reused error message:
%RSP-2-QAERROR: reused or zero link error, write at addr 1A10 (QA) log 221A1080, data 5B900000 00000000
%QA-3-DIAG: Failed to enqueue buffer header 0x5B90
%QA-3-DIAG: Approximate stack backtrace prior to interrupt:
%QA-3-DIAG: -Traceback= 4034B43C 40743038 40364780 40361338 40010BF4
%QA-3-DIAG: Queue 0x2A (E8000150) has 261 elements
%QA-3-DIAG: Queue 0x2B (E8000158) has 8 elements
%QA-3-DIAG: Buffer 0x5B90 is element 2729 on queue 0x2C
%QA-3-DIAG: Buffer 0x0000, element 2730 on queue 0x2C is NULL
%QA-3-DIAG: Queue 0x2C (E8000160) has 2729 elements
%QA-3-DIAG: Queue 0x2D (E8000168) has 21 elements
%QA-3-DIAG: Queue 0x2E (E8000170) has 20 elements
%QA-3-DIAG: Queue 0x2F (E8000178) has 39 elements
%QA-3-DIAG: No NULL terminator for queue 0x30
%QA-3-DIAG: Queue 0x30 (E8000180) has 331 elements
%QA-3-DIAG: Queue 0x31 (E8000188) has 4 elements
%QA-3-DIAG: No NULL terminator for queue 0x32
%QA-3-DIAG: Queue 0x32 (E8000190) has 3 elements
%QA-3-DIAG: Queue 0x34 (E80001A0) has 4 elements
%QA-3-DIAG: Queue 0x35 (E80001A8) has 4 elements
%QA-3-DIAG: Queue 0x36 (E80001B0) has 5 elements
%QA-3-DIAG: Buffer 0x5B90 is element 1 on queue 0x342
%QA-3-DIAG: Queue 0x342 (E8001A10) has 1 elements
%QA-3-DIAG: At least one QA queue is broken
%QA-3-DIAG: Buffer header at 0xE0005B90: 4E9B88 5E20160 1600000 4E9B80
%QA-3-DIAG: Buffer contents:
%SYS-3-DMPMEM: F84E9B80: 65732F67 72616669 00E0B601 B1910004 289C90C0 08004500
%SYS-3-DMPMEM: F84E9B98: 05D47A9B 0000FF11 00048800 45000578 3EBC4000 7C06561A
%SYS-3-DMPMEM: F84E9BB0: 3E57AD9A D586A331 0D2004BE 22AB071E E82ECD08 50104370
%SYS-3-DMPMEM: F84E9BC8: FEFC0000 0A7BB0DA 25764408 118448CD 97E12946 B344C2A9
%SYS-3-DMPMEM: F84E9BE0: D5C88ACA F302238D 2446404E C01FDB03 6DD4046E 4000900A
%SYS-3-DMPMEM: F84E9BF8: 8E500B81 A6434DD1 713F1604 899DBEC8 ABDEC387 11A8F770
%QA-3-DIAG: Global queues:
%QA-3-DIAG: 3314 buffer headers
%QA-3-DIAG: RawQ 0xE8000100 (0), ReturnQ 0xE8000
Since you have all the dumps of the queues, you see that the problem is a reused pointer. Search for indications of this in the logs:
%QA-3-DIAG: Buffer 0x5B90 is element 2729 on queue 0x2C
%QA-3-DIAG: Buffer 0x0000, element 2730 on queue 0x2C is NULL
%QA-3-DIAG: Queue 0x2C (E8000160) has 2729 elements
%QA-3-DIAG: Buffer 0x5B90 is element 1 on queue 0x342
%QA-3-DIAG: Queue 0x342 (E8001A10) has 1 elements
Notice that 0x5B90 is in two queues. It is element 2729 on queue 0x2C and element 1 on queue 0x342.
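In a long log, the duplicate can be hard to spot by eye. This Python sketch shows one way to search for it. It assumes the "Buffer 0x... is element N on queue 0x..." message format shown above, and the function name `duplicated_buffers` is hypothetical.

```python
import re
from collections import defaultdict

PLACED_RE = re.compile(
    r"Buffer 0x([0-9A-Fa-f]+) is element (\d+) on queue 0x([0-9A-Fa-f]+)"
)

def duplicated_buffers(log_text):
    """Map each buffer header found on more than one queue to its placements."""
    seen = defaultdict(list)
    for buf, pos, queue in PLACED_RE.findall(log_text):
        seen[int(buf, 16)].append((int(queue, 16), int(pos)))
    return {
        buf: places
        for buf, places in seen.items()
        if len({queue for queue, _ in places}) > 1
    }

sample = """%QA-3-DIAG: Buffer 0x5B90 is element 2729 on queue 0x2C
%QA-3-DIAG: Buffer 0x0000, element 2730 on queue 0x2C is NULL
%QA-3-DIAG: Queue 0x2C (E8000160) has 2729 elements
%QA-3-DIAG: Buffer 0x5B90 is element 1 on queue 0x342
%QA-3-DIAG: Queue 0x342 (E8001A10) has 1 elements"""
for buf, places in duplicated_buffers(sample).items():
    for queue, pos in places:
        print(f"buffer 0x{buf:X}: element {pos} on queue 0x{queue:X}")
```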
Another hint to help you find where the problem comes from is the fact that the queue 0x2C has a NULL terminator.
Next, decode the traceback of what the RSP was doing when the problem occurred:
%QA-3-DIAG: Approximate stack backtrace prior to interrupt:
%QA-3-DIAG: -Traceback= 4034B43C 40743038 40364780 40361338 40010BF4
You can see that the RSP was not handling packets. Therefore, the RSP is not the faulty piece.
Finally, look into the dump of the packets to see exactly what happens:
%SYS-3-DMPMEM: F84E9B80: 65732F67 72616669 00E0B601 B1910004 289C90C0 08004500
%SYS-3-DMPMEM: F84E9B98: 05D47A9B 0000FF11 00048800 45000578 3EBC4000 7C06561A
%SYS-3-DMPMEM: F84E9BB0: 3E57AD9A D586A331 0D2004BE 22AB071E E82ECD08 50104370
%SYS-3-DMPMEM: F84E9BC8: FEFC0000 0A7BB0DA 25764408 118448CD 97E12946 B344C2A9
%SYS-3-DMPMEM: F84E9BE0: D5C88ACA F302238D 2446404E C01FDB03 6DD4046E 4000900A
%SYS-3-DMPMEM: F84E9BF8: 8E500B81 A6434DD1 713F1604 899DBEC8 ABDEC387 11A8F770
When you decode these buffer headers, the result normally reveals the reason for the error. In this particular case, the source IP address shows that each time the problem occurs, the packet that is dumped comes in from the same VIP. After a check confirms that the problem is not due to a bug, the VIP card is replaced.
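As a first step before any decode, the memory dump can be converted into raw bytes. This Python sketch is only an illustration. The function names `parse_dmpmem` and `is_contiguous` are hypothetical, and the sketch assumes that each %SYS-3-DMPMEM line carries a start address followed by 32-bit words, taken in the order in which they are printed. It does not interpret the packet contents.

```python
import re

DMPMEM_RE = re.compile(
    r"%SYS-3-DMPMEM: ([0-9A-Fa-f]{8}):((?: [0-9A-Fa-f]{8})+)"
)

def parse_dmpmem(log_text):
    """Return a list of (start_address, data_bytes) tuples, one per dump line."""
    rows = []
    for addr, words in DMPMEM_RE.findall(log_text):
        data = bytes.fromhex("".join(words.split()))
        rows.append((int(addr, 16), data))
    return rows

def is_contiguous(rows):
    """Check that each dump line starts where the previous one ended."""
    return all(
        addr + len(data) == next_addr
        for (addr, data), (next_addr, _) in zip(rows, rows[1:])
    )

sample = """%SYS-3-DMPMEM: F84E9B80: 65732F67 72616669 00E0B601 B1910004 289C90C0 08004500
%SYS-3-DMPMEM: F84E9B98: 05D47A9B 0000FF11 00048800 45000578 3EBC4000 7C06561A"""
rows = parse_dmpmem(sample)
for addr, data in rows:
    print(f"{addr:08X}: {data.hex()}")
print("contiguous:", is_contiguous(rows))
```

The contiguity check is a simple guard against a dump that lost lines in transit, for example when the log is copied from a terminal session.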
Information to Collect
If you still need assistance after you complete the troubleshooting steps in this document and want to open a case (registered customers only) with Cisco Technical Support, be sure to include this information:
Attach the collected data to your case in non-zipped, plain text format (.txt). You can attach information to your case by uploading it using the TAC Service Request Tool (registered customers only). If you cannot access the Case Query Tool, you can send the information in an E-mail attachment to attach@cisco.com with your case number in the subject line of your message.
Note: Do not manually reload or power-cycle the router before you collect this information unless required for troubleshooting reasons. This can cause important information to be lost that is needed in order to determine the root cause of the problem.
Related Information
- Troubleshooting Versatile Interface Processor Crashes
- What Causes a "%RSP-3-RESTART: cbus complex"?
- What Causes %RSP-3-RESTART: interface [xxx], output stuck/frozen/not transmitting messages?
- Hardware Troubleshooting for the 7500 Series Router
- Troubleshooting TechNotes - Cisco 7500 Series Routers
- Troubleshooting Router Crashes
- Single Line Card Reload Feature
- Technical Support - Cisco Systems
| Updated: Jul 07, 2005 | Document ID: 42102 |
