Cisco Carrier Routing System

Field Notice: FN - 62814 - CRS - MSCs and CRS-16-RPs Can Encounter Cache Parity Errors, Which May Cause Board Resets - Workaround - Fix On Failure


Revised June 23, 2008
June 18, 2007


NOTICE:

THIS FIELD NOTICE IS PROVIDED ON AN "AS IS" BASIS AND DOES NOT IMPLY ANY KIND OF GUARANTEE OR WARRANTY, INCLUDING THE WARRANTY OF MERCHANTABILITY. YOUR USE OF THE INFORMATION ON THE FIELD NOTICE OR MATERIALS LINKED FROM THE FIELD NOTICE IS AT YOUR OWN RISK. CISCO RESERVES THE RIGHT TO CHANGE OR UPDATE THIS FIELD NOTICE AT ANY TIME.

Revision History

    
Revision Date Comment
1.1 23-JUN-2008 Removed Upgrade Form and references to the upgrade program in the Workaround/Solution section and Title.
1.0 18-JUN-2007 Initial Public Release

Products Affected

Products Affected  Top Assembly Part Number  Revision  Comments
CRS-16-RP          All
CRS-16-RP=         All
CRS-MSC            800-25021-10              All       This TAN and lower are all affected within the Serial Number List
CRS-MSC-20G        800-25021-10
CRS-MSC-20G=       800-25021-10
CRS-MSC-40G        800-25021-10
CRS-MSC-40G=       800-25021-10

Problem Description

A cache parity error on the CRS-MSC or the CRS-16-RP can occur, resulting in a board reload.

Background

Cisco has identified boards built between August 1, 2006 and May 1, 2007 as having a higher susceptibility to cache parity errors. Analysis has shown that the expected failure rate for these boards is around one percent. Detailed failure analysis has determined that these failures are related to a batch of processors used on the boards built in that time frame. As a result, a new processor has been certified for the CRS-MSC. The CRS-16-RP replacement will be the CRS-16-RP-B, which already uses the new processor. Additionally, the processor vendor's manufacturing test process has been enhanced to capture cache parity errors.


Click here to see which CRS-MSC and CRS-16-RP serial numbers are affected.

Problem Symptoms

Cache error symptoms can be seen in multiple ways. The samples below show cache errors that were determined from a combination of the console log (heartbeat loss error message) and the core dump (hex decode of cache errors).

Sample CRS-16-RP failure message on console or from show log:

RP/0/RP0/CPU0:Nov 15 09:26:50.927 : sc_reddrv[284]:
%HA-REDDRV-7-KDUMPER_MESSAGE : Active RP received kdumper keepalive message
RP/0/RP0/CPU0:Nov 15 09:26:52.844 : shelfmgr[287]:
%PLATFORM-SHELFMGR-3-NODE_RESET_BRINGDOWN : Reset node 0/RP1/CPU0 due to heartbeat loss
RP/0/RP0/CPU0:Nov 15 09:26:53.105 : syncfs2[301]:
%MEDIA-SYNCFS2-7-LRD_NOTAVAILABLE : Standby has been rendered UNAVAILABLE -STOP SYNC
RP/0/RP0/CPU0:Nov 15 09:27:38.836 : shelfmgr[287]:
%PLATFORM-MBIMGR-7-IMAGE_VALIDATED : 0/RP1/CPU0: MBI disk0:hfr-os-mbi-3.2.4/mbihfr-rp.vm validated
RP/0/RP0/CPU0:Nov 15 09:27:39.040 : syncfs2[301]:
%MEDIA-SYNCFS2-7-LRD_NOTAVAILABLE : Standby has been rendered UNAVAILABLE -STOP SYNC
RP/0/RP0/CPU0:Nov 15 09:27:40.916 : exec[65771]:
%MGBL-LIBPARSER-6-HIDDEN_COMMAND : This command has been deprecated:
'show ip bgp summary
RP/0/RP0/CPU0:Nov 15 09:28:40.354 : sc_reddrv[284]:
%HA-REDDRV-7-KDUMPER_MESSAGE : Active RP received kdumper keepalive message
RP/0/RP0/CPU0:Nov 15 09:29:12.381 : oir_daemon[250]:
%PLATFORM-OIRD-5-OIROUT : OIR: Node 0/RP1/SP removed
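
The heartbeat-loss reset shown in the sample above can be picked out of a saved log automatically. The following is a minimal Python sketch, assuming the log has been saved as plain text and that the message wording matches the samples in this notice. The function name find_heartbeat_resets is hypothetical and is not part of any Cisco tool.

```python
import re

# Message text taken from the sample logs in this notice; other IOS XR
# releases may word it differently (assumption).
RESET_RE = re.compile(
    r"%PLATFORM-SHELFMGR-3-NODE_RESET_BRINGDOWN\s*:\s*"
    r"Reset node (\S+) due to heartbeat loss"
)


def find_heartbeat_resets(log_text):
    """Return the node IDs that shelfmgr reset because of heartbeat loss."""
    return RESET_RE.findall(log_text)


sample = """RP/0/RP0/CPU0:Nov 15 09:26:52.844 : shelfmgr[287]:
%PLATFORM-SHELFMGR-3-NODE_RESET_BRINGDOWN : Reset node 0/RP1/CPU0 due to heartbeat loss
"""
print(find_heartbeat_resets(sample))
```

A heartbeat-loss reset alone does not confirm a cache parity error. As described above, the core dump registers must also be decoded.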


Sample CRS-16-RP core dump:

RP/0/RP1/CPU0:Apr 9 20:58:32.631 : ipv6_io[227]: %FORWARDING-FIB-2-INIT : FIB initialization failure: ltrace initialization failed 'Lightweight Tracing' detected the 'fatal' condition 'System error': Not enough memory
Crash at mem_alloc line 100.
KDEBUG at 0x3d7c74, signal 5(TRAP) c=1 f=3
r0 r1 r2 r3 r4 r5 r6 r7
00001036 00035f70 00421390 0000001d 0000000a 00000000 31500000 00000005
r8 r9 r10 r11 r12 r13 r14 r15
00001036 00419ce0 00000000 00000001 24000004 00420468 00000000 00000000
r16 r17 r18 r19 r20 r21 r22 r23
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00418d40
r24 r25 r26 r27 r28 r29 r30 r31
00008000 0debcdd8 0dec2000 00002100 4815e000 00000000 00000000 04813d00
pc msr cnd lr cnt xer ear mq
003d7c74 00021036 24000004 003d7c64 003f14e4 20000000 00000000 04813d00
srr0 srr1 dar msssr0
003d7c70 40021036 48013000 00000000

Note: This particular error is an L1 Instruction Parity Error (PE).
SRR1 = 2XXXXXXX => L1 Data PE
SRR1 = 4XXXXXXX => L1 Instruction PE
MSSSR0= XXX2XXXX => L2 Data PE
MSSSR0= XXX1XXXX => L3 Tag PE
MSSSR0= XXXX8XXX => L3 Data PE
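
The decode table above can be applied mechanically to the srr1 and msssr0 values from a core dump. The following is a minimal Python sketch, assuming each pattern is read literally, with X matching any hex digit and every other digit required to match exactly. The names PATTERNS, matches and decode_parity_errors are hypothetical.

```python
# Patterns copied from the decode table in this notice.
# "X" is treated as a wildcard for one hex digit (assumption).
PATTERNS = [
    ("srr1", "2XXXXXXX", "L1 Data PE"),
    ("srr1", "4XXXXXXX", "L1 Instruction PE"),
    ("msssr0", "XXX2XXXX", "L2 Data PE"),
    ("msssr0", "XXX1XXXX", "L3 Tag PE"),
    ("msssr0", "XXXX8XXX", "L3 Data PE"),
]


def matches(pattern, value):
    """Check an 8-digit hex string against a pattern such as '4XXXXXXX'."""
    value = value.strip().upper().zfill(8)
    if len(value) != 8:
        return False
    return all(p == "X" or p == v for p, v in zip(pattern, value))


def decode_parity_errors(srr1, msssr0):
    """Return the labels of all table rows matched by the two registers."""
    regs = {"srr1": srr1, "msssr0": msssr0}
    return [label for reg, pat, label in PATTERNS if matches(pat, regs[reg])]


# Register values from the sample CRS-16-RP core dump above.
print(decode_parity_errors("40021036", "00000000"))
```

For the sample dump above, the only row that matches is the L1 Instruction PE row, which agrees with the note.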

Releasing Mastership lock
init_driver: initing PROCESSOR_MODULE_TYPE_ASMP eth driver
$kernel dumper: Initializing harddisk file system.
$Writing crash info
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
Writing kernel core file
Dumping core to harddisk:/kernel_core.Z

Current process pkg/bin/gsp


Sample CRS-MSC failure message on console or from show log:

SP/0/15/SP:Apr 18 16:59:32.916 : alphadisplay[100]: %PLATFORM-ALPHA_DISPLAY-6-CHANGE :
Alpha display on node 0/15/SP changed to IOS XR FAIL in state default
RP/0/RP0/CPU0:Apr 18 16:59:35.649 : shelfmgr[342]:
%PLATFORM-SHELFMGR-3-NODE_RESET_BRINGDOWN : Reset node 0/15/CPU0 due to heartbeat loss
RP/0/RP0/CPU0:Apr 18 16:59:35.935 : invmgr[206]: %PLATFORM-INV-6-NODE_STATE_CHANGE : Node:
0/15/SP, state: BRINGDOWN
RP/0/RP0/CPU0:Apr 18 16:59:36.755 : isis[250]: %ROUTING-ISIS-4-ADJCHANGE : Adjacency to
CR2.DCK (TenGigE0/15/0/0) (L2) Down, BFD session DOWN
RP/0/RP0/CPU0:Apr 18 16:59:36.765 : isis[250]: %ROUTING-ISIS-4-ADJCHANGE : Adjacency to
MSR1.DCK (TenGigE0/15/2/0) (L2) Down, BFD session DOWN
RP/0/RP0/CPU0:Apr 18 16:59:37.045 : invmgr[206]: %PLATFORM-INV-6-NODE_STATE_CHANGE : Node:
0/15/CPU0, state: BRINGDOWN
LC/0/0/CPU0:Apr 18 16:59:37.356 : ingressq[159]: %DRIVERS-INGRESSQ_DLL-4-LNS_LOP_DROP :
low availability of planes, aggr cell drop count: 102
RP/0/RP0/CPU0:Apr 18 16:59:51.974 : shelfmgr[342]: %PLATFORM-MBIMGR-7-IMAGE_VALIDATED :
0/15/SP: MBI bootflash:mbis/hfr-os-mbi-3.4.1.CSCsi00922-1.0.0/60a306d3f81de89d8db0878efc1ca2d4/mbihfr-sp.vm validated
RP/0/RP0/CPU0:Apr 18 17:00:41.811 : shelfmgr[342]: %PLATFORM-MBIMGR-7-IMAGE_VALIDATED :
0/15/CPU0: MBI tftp:/disk0/hfr-os-mbi-3.4.1.CSCsi00922-1.0.0/lc/mbihfr-lc.vm validated
RP/0/RP0/CPU0:Apr 18 17:00:44.492 : invmgr[206]: %PLATFORM-INV-6-NODE_STATE_CHANGE : Node:
0/15/SP, state: IOS XR RUN
SP/0/15/SP:Apr 18 17:00:06.493 : init[65541]: %OS-INIT-7-MBI_STARTED : total time 10.490
seconds
SP/0/15/SP:Apr 18 17:00:34.541 : sysmgr[76]: %OS-SYSMGR-5-NOTICE : Card is COLD started
SP/0/15/SP:Apr 18 17:00:37.351 : init[65541]: %OS-INIT-7-INSTALL_READY : total time 41.367
seconds


Sample CRS-MSC core dump:


Doing CRC check.
............................................
###############################################
KDEBUG at 0x2f807c, signal 9(KILL) c=3 f=33
r0 r1 r2 r3 r4 r5 r6 r7
01010101 000ffe70 00313380 40000000 00000000 00017274 7fd7c000 0002c000
r8 r9 r10 r11 r12 r13 r14 r15
0002bfff 00000000 40014d88 40000000 22000004 00312478 00000000 00000000
r16 r17 r18 r19 r20 r21 r22 r23
00000000 00000000 00000000 00000000 00000000 00000000 00000000 0030b428
r24 r25 r26 r27 r28 r29 r30 r31
00000000 0002b86c 40000000 000c6f2c 0002c000 0002c000 000c6f18 00000000
pc msr cnd lr cnt xer ear mq
002f807c 00109930 42000002 002a8a1c 00000000 20000000 00000000 00000000
srr0 srr1 dar msssr0
002f807c 00109930 40014000 00008000

Note: This particular error is an L3 Data Parity Error (PE).
SRR1 = 2XXXXXXX => L1 Data PE
SRR1 = 4XXXXXXX => L1 Instruction PE
MSSSR0= XXX2XXXX => L2 Data PE
MSSSR0= XXX1XXXX => L3 Tag PE
MSSSR0= XXXX8XXX => L3 Data PE
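
To read srr1 and msssr0 out of a captured KDEBUG dump without copying them by hand, the alternating name and value rows can be paired up. The following is a minimal Python sketch, assuming the dump keeps the layout shown in the samples above: a row of register names followed by a row of 8-digit hex values. The function names are hypothetical.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def is_hex8(token):
    """True for an 8-character hex value such as '00109930'."""
    return len(token) == 8 and set(token) <= HEX_DIGITS


def parse_register_dump(text):
    """Pair each row of register names with the value row that follows it."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    regs = {}
    i = 0
    while i + 1 < len(rows):
        names, values = rows[i], rows[i + 1]
        looks_like_pair = (
            len(names) == len(values)
            and all(is_hex8(v) for v in values)
            and not any(is_hex8(n) for n in names)
        )
        if looks_like_pair:
            regs.update(zip(names, values))
            i += 2
        else:
            i += 1
    return regs


# Last four lines of the sample CRS-MSC core dump above.
sample = """pc msr cnd lr cnt xer ear mq
002f807c 00109930 42000002 002a8a1c 00000000 20000000 00000000 00000000
srr0 srr1 dar msssr0
002f807c 00109930 40014000 00008000
"""
regs = parse_register_dump(sample)
print(regs["srr1"], regs["msssr0"])
```

The two values printed can then be checked against the decode table above.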

Workaround/Solution

Cisco recommends a fix-on-fail strategy for this problem.

Although an upgrade program was previously provided to replace potentially affected but otherwise working product, that program has ended and Cisco will replace only product that has actually failed. Use the standard RMA process to replace failed product.

Note:
The CRS-16-RP-Bs require IOS-XR 3.3.0 or higher to run. These boards will be shipped with a new ROMMON version of 1.45.

Note for IOS-XR Release 3.4.1 only: When running Release 3.4.1, FPD SMU CSCsi86270 (registered customers only) must be applied before upgrading the entire CRS-1 Router to the new ROMMON version, 1.45. After this SMU is applied, follow the ROMMON FPD pie Upgrade Procedure to upgrade to the new ROMMON version.

Note: Cisco recommends that ROMMON versions should not be downgraded on any board as per FN62710.

As of May 21st, 2007, new products manufactured under Engineering Change Order (ECO) E089579 use the new processor. Refer to How To Identify Hardware Levels below for instructions on how to view the version of in-service product.

How To Identify Hardware Levels

To identify a CRS-MSC or CRS-16-RP, use the show diag command to display both the board serial number and the Top Assembly Number (TAN).
1) A CRS-MSC whose serial number is on the affected serial number list and whose TAN is 800-25021-10 or lower does need to be replaced. TANs 800-25021-11 and higher are not affected, even when the serial number is on the list.

2) A CRS-16-RP with a matching serial number on the affected serial number list does need to be replaced.
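
The two rules above can be expressed as a single check. The following is a minimal Python sketch under stated assumptions: the affected serial number list is supplied by the caller (it is not reproduced in this notice), "or lower" is read as a lower numeric suffix on the same 800-25021 base, and a CRS-MSC with any other TAN base is treated as not covered. The function names and the serial numbers in the usage lines are hypothetical.

```python
def split_tan(tan):
    """Split '800-25021-10' into ('800-25021', 10)."""
    base, _, suffix = tan.rpartition("-")
    return base, int(suffix)


def is_flagged(pid, serial, tan, affected_serials):
    """Apply rules 1) and 2) above to one board."""
    if serial not in affected_serials:
        return False
    family = pid.rstrip("=")  # spare PIDs such as 'CRS-MSC=' end in '='
    if family == "CRS-16-RP":
        return True  # rule 2: a serial number match alone is enough
    if family.startswith("CRS-MSC"):
        base, suffix = split_tan(tan)
        # Rule 1: TAN 800-25021-10 or lower (other bases: assumption).
        return base == "800-25021" and suffix <= 10
    return False


# Hypothetical serial numbers, for illustration only.
listed = {"SAD00000001", "SAD00000002"}
print(is_flagged("CRS-MSC", "SAD00000001", "800-25021-06", listed))
print(is_flagged("CRS-MSC", "SAD00000001", "800-25021-11", listed))
```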

Sample show diag output (in admin mode) for identifying the CRS-MSC TAN and Serial Number:

RP/0/RP1/CPU0:ios#admin
RP/0/RP1/CPU0:ios(admin)#show diag det

NODE 0/2/SP : Cisco CRS-1 Series Modular Services Card
MAIN: board type 500060
800-25021-06 rev A0
dev 080840, 080229
S/N SAD10xxxxxx
PCA: 73-7648-08 rev C0
PID: CRS-MSC VID: V03
CLEI: IPUCAD0BAA
ECI: 135786
Board State : IOS XR RUN
PLD: Motherboard: 0x0025, Processor: 0xda13, Power: N/A
MONLIB: QNXFFS Monlib Version 3.1
ROMMON: Version 1.43(20061109:050237) [CRS-1 ROMMON]
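
The TAN, serial number, and PID can be pulled out of saved show diag output with a few regular expressions. The following is a minimal Python sketch, assuming the output for a single node has been saved as text in the layout shown in the sample above. The function name parse_show_diag is hypothetical, and output covering several nodes would first need to be split on the NODE lines.

```python
import re

# Patterns are based only on the sample output in this notice (assumption).
TAN_RE = re.compile(r"\b(\d{3}-\d{5}-\d{2})\s+rev\b")
SERIAL_RE = re.compile(r"S/N\s+(\S+)")
PID_RE = re.compile(r"PID:\s*(\S+)")


def first_match(pattern, text):
    """Return the first captured group, or None if there is no match."""
    m = pattern.search(text)
    return m.group(1) if m else None


def parse_show_diag(text):
    """Extract TAN, serial number and PID for one node."""
    return {
        "tan": first_match(TAN_RE, text),
        "serial": first_match(SERIAL_RE, text),
        "pid": first_match(PID_RE, text),
    }


sample = """NODE 0/2/SP : Cisco CRS-1 Series Modular Services Card
MAIN: board type 500060
800-25021-06 rev A0
dev 080840, 080229
S/N SAD10xxxxxx
PCA: 73-7648-08 rev C0
PID: CRS-MSC VID: V03
"""
info = parse_show_diag(sample)
print(info)
```

The TAN pattern requires a 3-5-2 digit layout, so the PCA number on the following line (73-7648-08) is not mistaken for the TAN.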

For More Information

If you require further assistance, or if you have any further questions regarding this field notice, please contact the Cisco Systems Technical Assistance Center (TAC).
