Nexus 7000 Fabric CRC Errors Troubleshoot

Document ID: 116458

Updated: Sep 11, 2013

Contributed by Yogesh Ramdoss, Cisco TAC Engineer.

Introduction 

This document describes how to resolve fabric errors reported in the Cisco Nexus 7000 platform. Troubleshooting fabric Cyclic Redundancy Check (CRC) errors involves data collection, data analysis, and a process of elimination in order to isolate the faulty component. This document covers the most common types of fabric CRC errors.

Fabric CRC Detection Overview

Here is a high-level diagram of a Nexus 7018 fabric module with M1 linecards:

[Diagram not reproduced here.]

The previous image gives an overview of the components involved when a packet traverses a fabric module. Stage 1 (S1), Stage 2 (S2), and Stage 3 (S3) are the three stages of the Nexus 7000 fabric, Octopus is the queue engine, Santa Cruz (SC) is the fabric ASIC, and Instance 1 and 2 are the two SC instances on the crossbar fabric module (XBAR). This document considers only one XBAR. Keep in mind that most Nexus 7000 Series switches have three or more XBARs installed.

Assume a unidirectional flow from Module 1 (M1) to Module 2 (M2). The ingress Octopus-1 on M1 performs error checks on packets it receives from the south, and the egress Octopus-1 on M2 performs error checks on packets it receives from the north. If a CRC error is detected in S3, the problem might have been introduced in S1 or S2, because no CRC check is performed in those stages. The devices involved in the path are therefore the ingress Octopus, chassis, crossbar fabric, and egress Octopus.

In M1/Fab1 architecture, CRCs are detected only on the egress linecard (S3).

Here is a sample error message:

%OC_USD-SLOT1-2-RF_CRC: OC1 received packets with
 CRC error from MOD 15 through XBAR slot 1/inst 1

This message is reported by M1 and indicates that it received packets with a bad CRC from Module 15 (M15) through XBAR slot 1/instance 1.

Understand the Different Fabric CRC Errors

This section describes four of the most common types of fabric CRC errors.

  • CRC error with a single source module, receive module, and XBAR instance:
    %OC_USD-SLOT1-2-RF_CRC: OC1 received packets with
     CRC error from MOD 15 through XBAR slot 1/inst 1
    This means that the module in slot 1 (M1) detected a CRC error on packets from M15 that traversed XBAR slot 1/instance 1. The module where the packets with CRC errors originate is referred to as the ingress module (M15 in this case), and the module that reports the problem is the egress module (M1). XBAR 1 is the crossbar through which the packet was received. There are two instances per XBAR.

  • CRC error with a single source module, receive module, but no XBAR instance:
    %OC_USD-SLOT4-2-RF_CRC: OC2 received packets with
     CRC error from MOD 1
    In this message, Module 4 (M4) reported the CRC error from M1. Notice that the XBAR information is missing, which means the system is unable to ascertain the XBAR that the packet traversed. There are many possible reasons; the two most common are: (1) the information in the fabric header of the packet might be corrupt, so the source module cannot be determined; (2) the XBAR that was traversed was removed from the system after the error counter incremented, so it was not reported in the hourly syslog message.

  • CRC error with no receive module:
    %OC_USD-2-RF_CRC: OC1 received packets with
     CRC error from MOD 16 through XBAR slot 1/inst 1
    In this instance, a device detected a CRC error from Module 16 (M16) through XBAR 1, but no receive module is listed. When the Supervisor (SUP) detects a CRC error that comes from the fabric module, the slot information is not logged. Therefore, when you do not see slot information, the SUP detected the problem. This does not mean that the SUP is bad. Just as when a module reports the problem, there are multiple components that might have caused it: M16, the chassis (not as likely), XBAR 1, or the SUP.

  • CRC error with multiple possible source modules:
    %OC_USD-SLOT6-2-RF_CRC: OC2 received packets with
     CRC error from MOD 11 or 12 or 14 or 15 or 16 or 17 or 18
    The source module is gleaned from the ingress Octopus that sourced the bad packet. The driver that raises an interrupt in order to log this error message does not always know the ingress Octopus from which the bad packet originated, because some of the bits that represent the ingress Octopus are not used. If the system determines that multiple modules have these unused bits turned on, it must assume that any one of them might be the source, so the error message includes all of those modules. In this example, the system found that Module 13 (M13) cannot have this conflict because of those unused bits; thus, it is not logged as a potential source.
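
The four message variants above can be split into their fields before any analysis. Here is a minimal Python sketch, under the assumption that messages follow exactly the layouts shown in this document (other NX-OS releases might format them differently). The names `FabricCrcEvent` and `parse_rf_crc` are hypothetical and are not part of any Cisco tool.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Pattern modeled only on the sample messages in this document (assumption).
RF_CRC = re.compile(
    r"%OC_USD-(?:SLOT(?P<slot>\d+)-)?2-RF_CRC:.*?"
    r"CRC error from MOD (?P<mods>\d+(?: or \d+)*)"
    r"(?: through XBAR (?P<xbars>.*))?"
)
XBAR_INST = re.compile(r"slot (\d+)/inst (\d+)")


@dataclass
class FabricCrcEvent:
    reporter: str                 # egress side, e.g. "M1"; "SUP" if no slot is logged
    sources: List[int]            # candidate ingress modules ("or" list)
    xbars: List[Tuple[int, int]]  # (XBAR slot, instance); empty if not logged


def parse_rf_crc(message: str) -> Optional[FabricCrcEvent]:
    """Parse one RF_CRC syslog message; return None if it does not match."""
    text = " ".join(message.split())  # join wrapped syslog lines
    match = RF_CRC.search(text)
    if match is None:
        return None
    slot = match.group("slot")
    reporter = f"M{slot}" if slot else "SUP"
    sources = [int(n) for n in re.findall(r"\d+", match.group("mods"))]
    xbars = [(int(s), int(i))
             for s, i in XBAR_INST.findall(match.group("xbars") or "")]
    return FabricCrcEvent(reporter, sources, xbars)


if __name__ == "__main__":
    sample = ("%OC_USD-SLOT1-2-RF_CRC: OC1 received packets with\n"
              " CRC error from MOD 15 through XBAR slot 1/inst 1")
    print(parse_rf_crc(sample))
```

A message without slot information is labeled `SUP`, which matches the third variant described above, and a message without XBAR information yields an empty `xbars` list.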

Fabric CRC Troubleshoot Approach

Newer linecards (M2) and Fabric Module 2 (FAB2) detect CRC errors in S1, S2, or S3. A detailed investigation that finds patterns in the failures and log messages helps isolate the faulty component.

Here are some questions to ask:

  • Was the error message a one-time event, or have multiple CRC error messages been logged?
  • How frequently are the CRC error messages logged? (Every hour, once a day, once a month?)
  • Do the CRC errors ALL come from the same ingress module?
  • Are the CRC errors ALL reported on the same egress module?
  • Are the CRC errors from multiple ingress modules AND reported on multiple egress modules?
  • If multiple modules report CRC errors, is there a common source module or XBAR module?

Answers to these questions allow you to approach the troubleshoot procedure from an angle that is more likely to lead to a faster resolution.
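
Several of these questions come down to counting which modules and XBARs appear across the logged messages. Here is a minimal Python sketch of such a tally, under the assumption that the messages use the formats shown in this document. The function names `tally` and `in_every_message` are hypothetical, and the sample lines are taken from the second case study later in this document.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def tally(messages: Iterable[str]) -> Tuple[Counter, Counter, Counter]:
    """Count messages per egress module, candidate ingress module, and XBAR slot."""
    egress, ingress, xbars = Counter(), Counter(), Counter()
    for message in messages:
        text = " ".join(message.split())  # join wrapped syslog lines
        slot = re.search(r"%OC_USD-SLOT(\d+)-", text)
        # No slot in the message means the SUP reported it.
        egress[f"M{slot.group(1)}" if slot else "SUP"] += 1
        mods = re.search(r"from MOD (\d+(?: or \d+)*)", text)
        if mods:
            for n in re.findall(r"\d+", mods.group(1)):
                ingress[f"M{n}"] += 1
        # Count each XBAR slot once per message, even if several instances are listed.
        for s in sorted(set(re.findall(r"slot (\d+)/inst \d+", text))):
            xbars[f"XBAR {s}"] += 1
    return egress, ingress, xbars


def in_every_message(counter: Counter, total: int) -> List[str]:
    """Return the components that appear in all `total` messages."""
    return sorted(name for name, count in counter.items() if count == total)


if __name__ == "__main__":
    logs = [
        "%OC_USD-SLOT11-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1",
        "%OC_USD-SLOT12-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1",
        "%OC_USD-SLOT13-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1",
    ]
    egress, ingress, xbars = tally(logs)
    print("Common ingress:", in_every_message(ingress, len(logs)))
    print("Common XBAR:", in_every_message(xbars, len(logs)))
    print("Common egress:", in_every_message(egress, len(logs)))
```

A component that appears in every message is a natural first suspect. The tally only summarizes the logs; it does not replace the elimination steps described in the next section.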

General CRC Troubleshoot Guidelines

This section establishes a general framework used in order to troubleshoot these issues.

  1. Find the common modules (including XBARs) that are reported in the fabric CRC error messages.
  2. After you find the common modules, pick the most likely cause of the problem. Shut it down (in the case of an XBAR), move it to a slot that is known to work, reseat it, or replace it, and monitor after each action in order to verify whether the problem goes away. Shut down, reseat, and replace modules one at a time. This makes it easier to isolate the faulty part.
  3. When you shut down, move, reseat, or replace a part, look for any changes in the problem symptoms. You might have to revise your action plan after you learn more from each step taken.
  4. If multiple parts are replaced and the problem still persists, then:

    • The new parts might be bad.
    • Multiple XBARs might be bad.
    • A bad chassis slot might be the cause.

Case Studies

This section provides examples of how to troubleshoot similar problems.

Ingress Module Corrupts the Packets

Logs

%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT1-2-RF_CRC: OC2 received packets with CRC error from MOD 7
%OC_USD-SLOT3-2-RF_CRC: OC2 received packets with CRC error from MOD 7

Problem

For a few hours, CRC errors are seen on M1 and Module 3 (M3) that come from Module 7 (M7) only. 

Probable Cause of the Problem

There is a bad or mis-seated XBAR that corrupts packets sent from M7, or M7 is bad or mis-seated.

Faulty Component Isolation Process

  1. Shut down the XBARs one by one while you monitor in order to verify if the problem is resolved.
  2. Reseat the ingress M7 while you monitor.
  3. Replace the M7 while you monitor.

If you have three XBARs installed, it gives you N+1 redundancy. Therefore, you are able to shut them down one at a time (never shut down more than one at any given time) with only minimal impact in order to see if the problem is resolved. Enter these commands in order to complete this process:

N7K(config)# poweroff xbar 1

<monitor>

N7K(config)# no poweroff xbar 1
N7K(config)# poweroff xbar 2

<monitor>

N7K(config)# no poweroff xbar 2
N7K(config)# poweroff xbar 3

<monitor>

N7K(config)# no poweroff xbar 3

In this particular case study, the problem was not resolved when the XBARs were shut down.
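
The one-at-a-time XBAR sequence can also be generated for any set of XBAR slots. Here is a minimal Python sketch that builds the command list from the `poweroff xbar` and `no poweroff xbar` commands shown above and checks that a plan never has more than one XBAR powered off at the same time. The function names are hypothetical, and `<monitor>` is a placeholder for the observation period, not a CLI command.

```python
from typing import List


def xbar_isolation_steps(xbar_slots: List[int]) -> List[str]:
    """Build a plan that powers off one XBAR, monitors, and restores it before the next."""
    steps: List[str] = []
    for slot in xbar_slots:
        steps.append(f"poweroff xbar {slot}")
        steps.append("<monitor>")  # placeholder: watch for new RF_CRC messages
        steps.append(f"no poweroff xbar {slot}")
    return steps


def max_powered_off(steps: List[str]) -> int:
    """Return the largest number of XBARs that are powered off at once in a plan."""
    off = set()
    worst = 0
    for step in steps:
        parts = step.split()
        if parts[:2] == ["poweroff", "xbar"]:
            off.add(parts[2])
        elif parts[:3] == ["no", "poweroff", "xbar"]:
            off.discard(parts[3])
        worst = max(worst, len(off))
    return worst


if __name__ == "__main__":
    plan = xbar_isolation_steps([1, 2, 3])
    # The plan must never take down more than one XBAR at any given time.
    assert max_powered_off(plan) == 1
    for step in plan:
        print(step)
```

The check in `max_powered_off` encodes the rule stated above: with N+1 redundancy, only one XBAR is shut down at any given time.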

Because two modules report CRC errors, it is unlikely that those two modules (M1 and M3) are the cause. The next step is to reseat M7 (the ingress module), because it is most likely the faulty component. Mis-seated linecards might cause this problem, so it is recommended to reseat the module before you replace it.

In this case study, CRC errors continued to increment on the fabric module after a reseat of M7. Contact the Cisco Technical Assistance Center (TAC) at this point (or before this point) in order to replace M7, since the reseat did not resolve the problem.

In this case study, the replacement of M7 stopped the fabric CRC error messages, and resolved the packet loss.

Mis-Seated XBAR Injects Corrupt Packets

Logs

%OC_USD-SLOT11-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT12-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT13-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT15-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT2-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT4-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT5-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT6-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT7-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1
%OC_USD-SLOT8-2-RF_CRC: CRC error from MOD 12 through XBAR slot 3/inst 1

Problem

Multiple modules report CRC errors from Module 12 (M12) that go through XBAR 3.

Probable Cause of the Problem

XBAR 3 is bad or mis-seated, or M12 is mis-seated or faulty.  

Faulty Component Isolation Process

  1. Shut down XBAR 3 while you monitor.
  2. Reseat the ingress M12 while you monitor.
  3. Replace M12 while you monitor.

In this case, XBAR 3 was shut down with the procedure previously described (in the first case study) and monitored for further errors. The errors ceased when XBAR 3 was shut down. XBAR 3 was then reseated, with care taken in order to ensure that no pins were bent on the midplane and that the module was properly inserted. After XBAR 3 was reenabled, the problem did not recur. This problem was attributed to a mis-seated XBAR module.

Faulty Egress Module Corrupts Packets from the Fabric

Logs

%OC_USD-SLOT6-2-RF_CRC: OC1 received packets with CRC error from
 MOD 1 or 2 or 7 or 13 or 17 through XBAR 
 slot 1/inst 1 and slot 2/inst 1 and slot 3/inst 1

%OC_USD-SLOT6-2-RF_CRC: OC2 received packets with CRC error from
 MOD 1 or 2 or 3 or 7 or 15 or 17 through XBAR
 slot 2/inst 1 and slot 3/inst 1

%OC_USD-SLOT6-2-RF_CRC: OC1 received packets with CRC error from
 MOD 1 or 2 or 5 or 7 or 16 or 17 through XBAR
 slot 1/inst 1 and slot 2/inst 1 and slot 3/inst 1

Problem

Module 6 (M6) reports packets with CRC errors received from multiple linecards and XBARs.

Probable Cause of the Problem

M6 is mis-seated or bad.

Faulty Component Isolation Process

  1. Reseat M6 while you monitor.
  2. Replace M6 while you monitor.

M6 is the most likely cause of this issue because it is the one module common to all of the error messages. Therefore, attempt to reseat M6 in order to see if the issue is resolved before you replace it.

In this case, M6 was reseated, but the errors persisted, so a Cisco TAC case was opened in order to have M6 replaced. After M6 was replaced, the errors were no longer reported.
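
When the source list is ambiguous ("MOD 1 or 2 or ..."), one way to rank suspects is to give the reporting module a full point per message and to split one point evenly across the candidate source modules. Here is a minimal Python sketch of that idea. The weighting is an illustrative heuristic assumed for this example and is not a Cisco method, and `suspect_scores` is a hypothetical name. XBARs are not scored in this sketch.

```python
import re
from collections import Counter
from fractions import Fraction
from typing import Iterable


def suspect_scores(messages: Iterable[str]) -> Counter:
    """Score modules: 1 per message for the reporter, 1/N for each of N candidate sources."""
    scores: Counter = Counter()
    for message in messages:
        text = " ".join(message.split())  # join wrapped syslog lines
        slot = re.search(r"%OC_USD-SLOT(\d+)-", text)
        if slot:
            scores[f"M{slot.group(1)}"] += Fraction(1)
        mods = re.search(r"from MOD (\d+(?: or \d+)*)", text)
        if mods:
            candidates = re.findall(r"\d+", mods.group(1))
            for n in candidates:
                # An "or" list means only one of the candidates sent the packet.
                scores[f"M{n}"] += Fraction(1, len(candidates))
    return scores


if __name__ == "__main__":
    logs = [
        "%OC_USD-SLOT6-2-RF_CRC: OC1 received packets with CRC error from\n"
        " MOD 1 or 2 or 7 or 13 or 17 through XBAR\n"
        " slot 1/inst 1 and slot 2/inst 1 and slot 3/inst 1",
        "%OC_USD-SLOT6-2-RF_CRC: OC2 received packets with CRC error from\n"
        " MOD 1 or 2 or 3 or 7 or 15 or 17 through XBAR\n"
        " slot 2/inst 1 and slot 3/inst 1",
        "%OC_USD-SLOT6-2-RF_CRC: OC1 received packets with CRC error from\n"
        " MOD 1 or 2 or 5 or 7 or 16 or 17 through XBAR\n"
        " slot 1/inst 1 and slot 2/inst 1 and slot 3/inst 1",
    ]
    for module, score in suspect_scores(logs).most_common(3):
        print(module, float(score))
```

With the three messages from this case study, M6 scores highest because it is the reporter in every message, while each candidate source receives only a fraction of a point per message.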

Troubleshoot Commands

Here is a list of the commands used in order to troubleshoot/debug:

  • show clock
  • show mod xbar
  • show hardware fabric-utilization detail 
  • show hardware fabric-utilization detail timestamp
  • show hardware internal xbar-driver all event-history errors
  • show hardware internal xbar-driver all event-history msgs
  • show system internal xbar-client internal event-history msgs
  • show system internal xbar all
  • show module internal event-history xbar 1
  • show module internal activity xbar 1
  • show module internal event-history xbar 2
  • show module internal activity xbar 2
  • show module internal event-history xbar 3
  • show module internal activity xbar 3
  • show module internal event-history xbar 4
  • show module internal activity xbar 4
  • show module internal event-history xbar 5
  • show module internal activity xbar 5
  • show logging onboard internal xbar
  • show logging onboard internal octopus
  • show tech detail
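
The per-XBAR commands in this list repeat for each XBAR slot. Here is a minimal Python sketch that builds the same list for a given set of XBAR slots, for example in order to paste it into a collection script. Only the commands listed above are used; the function name `troubleshoot_commands` and the default of slots 1 through 5 are assumptions for this example.

```python
from typing import Iterable, List

# Commands copied from the list above, in the same order.
BEFORE_XBAR = [
    "show clock",
    "show mod xbar",
    "show hardware fabric-utilization detail",
    "show hardware fabric-utilization detail timestamp",
    "show hardware internal xbar-driver all event-history errors",
    "show hardware internal xbar-driver all event-history msgs",
    "show system internal xbar-client internal event-history msgs",
    "show system internal xbar all",
]
AFTER_XBAR = [
    "show logging onboard internal xbar",
    "show logging onboard internal octopus",
    "show tech detail",
]


def troubleshoot_commands(xbar_slots: Iterable[int] = range(1, 6)) -> List[str]:
    """Return the command list, with the per-XBAR commands expanded for each slot."""
    commands = list(BEFORE_XBAR)
    for slot in xbar_slots:
        commands.append(f"show module internal event-history xbar {slot}")
        commands.append(f"show module internal activity xbar {slot}")
    commands.extend(AFTER_XBAR)
    return commands


if __name__ == "__main__":
    for command in troubleshoot_commands():
        print(command)
```

Pass only the slots that are populated in your chassis, for example `troubleshoot_commands([1, 2, 3])` for a system with three XBARs.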