Revision | Date | Comment |
---|---|---|
1.1 | 23-JUN-2008 | Removed Upgrade Form and references to the upgrade program in the Workaround/Solution section and Title. |
1.0 | 18-JUN-2007 | Initial Public Release |
Products Affected | Top Assembly Part Number | Top Assembly Revision | Comments |
---|---|---|---|
CRS-16-RP | All | | |
CRS-16-RP= | All | | |
CRS-MSC | 800-25021-10 | All | This TAN and lower are all affected within the Serial Number List |
CRS-MSC-20G | 800-25021-10 | | |
CRS-MSC-20G= | 800-25021-10 | | |
CRS-MSC-40G | 800-25021-10 | | |
CRS-MSC-40G= | 800-25021-10 | | |
A cache parity error on the CRS-MSC or the CRS-16-RP can occur, resulting in a board reload.
Cisco has identified boards built between August 1, 2006 and May 1, 2007 to have a higher susceptibility to cache parity errors. Analysis has shown that the expected failure rate for these boards is around one percent. Detailed failure analysis has determined that these failures are related to a batch of processors used on the boards built in that time frame. As a result, a new processor has been certified for the CRS-MSC. The CRS-16-RP replacement will be the CRS-16-RP-B, which already uses the new processor. Additionally, the manufacturing test process for the processor vendor has been enhanced to capture cache parity errors.
Refer to the affected serial number list to see which CRS-MSC and CRS-16-RP serial numbers are affected.
Cache error symptoms can be seen in multiple ways. The samples below show cache errors that were determined from a combination of the console log (heartbeat loss error message) and the core dump (hex decode of cache errors).
Sample CRS-16-RP failure message on console or from show log:
RP/0/RP0/CPU0:Nov 15 09:26:50.927 : sc_reddrv[284]:
%HA-REDDRV-7-KDUMPER_MESSAGE : Active RP received kdumper keepalive message
RP/0/RP0/CPU0:Nov 15 09:26:52.844 : shelfmgr[287]:
%PLATFORM-SHELFMGR-3-NODE_RESET_BRINGDOWN : Reset node 0/RP1/CPU0 due to heartbeat loss
RP/0/RP0/CPU0:Nov 15 09:26:53.105 : syncfs2[301]:
%MEDIA-SYNCFS2-7-LRD_NOTAVAILABLE : Standby has been rendered UNAVAILABLE -STOP SYNC
RP/0/RP0/CPU0:Nov 15 09:27:38.836 : shelfmgr[287]:
%PLATFORM-MBIMGR-7-IMAGE_VALIDATED :
0/RP1/CPU0: MBI
disk0:hfr-os-mbi-3.2.4/mbihfr-rp.vm validated
RP/0/RP0/CPU0:Nov 15 09:27:39.040 : syncfs2[301]:
%MEDIA-SYNCFS2-7-LRD_NOTAVAILABLE : Standby has been rendered UNAVAILABLE -STOP SYNC
RP/0/RP0/CPU0:Nov 15 09:27:40.916 : exec[65771]:
%MGBL-LIBPARSER-6-HIDDEN_COMMAND : This command has been deprecated:
'show ip bgp summary
RP/0/RP0/CPU0:Nov 15 09:28:40.354 : sc_reddrv[284]:
%HA-REDDRV-7-KDUMPER_MESSAGE : Active RP received kdumper keepalive message
RP/0/RP0/CPU0:Nov 15 09:29:12.381 : oir_daemon[250]:
%PLATFORM-OIRD-5-OIROUT : OIR: Node 0/RP1/SP removed
Sample CRS-16-RP core dump:
RP/0/RP1/CPU0:Apr 9 20:58:32.631 : ipv6_io[227]: %FORWARDING-FIB-2-INIT : FIB initialization failure: ltrace initialization failed 'Lightweight Tracing' detected the 'fatal' condition 'System error': Not enough memory
Crash at mem_alloc line 100.
KDEBUG at 0x3d7c74, signal 5(TRAP) c=1 f=3
r0 r1 r2 r3 r4 r5 r6 r7
00001036 00035f70 00421390 0000001d 0000000a 00000000 31500000 00000005
r8 r9 r10 r11 r12 r13 r14 r15
00001036 00419ce0 00000000 00000001 24000004 00420468 00000000 00000000
r16 r17 r18 r19 r20 r21 r22 r23
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00418d40
r24 r25 r26 r27 r28 r29 r30 r31
00008000 0debcdd8 0dec2000 00002100 4815e000 00000000 00000000 04813d00
pc msr cnd lr cnt xer ear mq
003d7c74 00021036 24000004 003d7c64 003f14e4 20000000 00000000 04813d00
srr0 srr1 dar msssr0
003d7c70 40021036 48013000 00000000
Note: This particular error is an L1 Instruction parity error (PE), as indicated by the SRR1 value 40021036.
SRR1 = 2XXXXXXX => L1 Data PE
SRR1 = 4XXXXXXX => L1 Instruction PE
MSSSR0= XXX2XXXX => L2 Data PE
MSSSR0= XXX1XXXX => L3 Tag PE
MSSSR0= XXXX8XXX => L3 Data PE
Releasing Mastership lock
init_driver: initing PROCESSOR_MODULE_TYPE_ASMP eth driver $kernel dumper: Initializing harddisk file system.
$Writing crash info
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
Writing kernel core file
Dumping core to harddisk:/kernel_core.Z
Current process pkg/bin/gsp
Sample CRS-MSC failure message on console or from show log:
SP/0/15/SP:Apr 18 16:59:32.916 : alphadisplay[100]: %PLATFORM-ALPHA_DISPLAY-6-CHANGE :
Alpha display on node 0/15/SP changed to IOS XR FAIL in state default
RP/0/RP0/CPU0:Apr 18 16:59:35.649 : shelfmgr[342]:
%PLATFORM-SHELFMGR-3-NODE_RESET_BRINGDOWN : Reset node 0/15/CPU0 due to heartbeat loss
RP/0/RP0/CPU0:Apr 18 16:59:35.935 : invmgr[206]: %PLATFORM-INV-6-NODE_STATE_CHANGE : Node:
0/15/SP, state: BRINGDOWN
RP/0/RP0/CPU0:Apr 18 16:59:36.755 : isis[250]: %ROUTING-ISIS-4-ADJCHANGE : Adjacency to
CR2.DCK (TenGigE0/15/0/0) (L2) Down, BFD session DOWN
RP/0/RP0/CPU0:Apr 18 16:59:36.765 : isis[250]: %ROUTING-ISIS-4-ADJCHANGE : Adjacency to
MSR1.DCK (TenGigE0/15/2/0) (L2) Down, BFD session DOWN
RP/0/RP0/CPU0:Apr 18 16:59:37.045 : invmgr[206]: %PLATFORM-INV-6-NODE_STATE_CHANGE : Node:
0/15/CPU0, state: BRINGDOWN
LC/0/0/CPU0:Apr 18 16:59:37.356 : ingressq[159]: %DRIVERS-INGRESSQ_DLL-4-LNS_LOP_DROP :
low availability of planes, aggr cell drop count: 102
RP/0/RP0/CPU0:Apr 18 16:59:51.974 : shelfmgr[342]: %PLATFORM-MBIMGR-7-IMAGE_VALIDATED :
0/15/SP: MBI
bootflash:mbis/hfr-os-mbi-3.4.1.CSCsi00922-1.0.0/60a306d3f81de89d8db0878efc1ca2d4/mbihfr-s
p.vm validated
RP/0/RP0/CPU0:Apr 18 17:00:41.811 : shelfmgr[342]: %PLATFORM-MBIMGR-7-IMAGE_VALIDATED :
0/15/CPU0: MBI tftp:/disk0/hfr-os-mbi-3.4.1.CSCsi00922-1.0.0/lc/mbihfr-lc.vm validated
RP/0/RP0/CPU0:Apr 18 17:00:44.492 : invmgr[206]: %PLATFORM-INV-6-NODE_STATE_CHANGE : Node:
0/15/SP, state: IOS XR RUN
SP/0/15/SP:Apr 18 17:00:06.493 : init[65541]: %OS-INIT-7-MBI_STARTED : total time 10.490
seconds
SP/0/15/SP:Apr 18 17:00:34.541 : sysmgr[76]: %OS-SYSMGR-5-NOTICE : Card is COLD started
SP/0/15/SP:Apr 18 17:00:37.351 : init[65541]: %OS-INIT-7-INSTALL_READY : total time 41.367
seconds
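The heartbeat-loss resets shown in the two log samples can be picked out of a saved log mechanically. The following is a minimal sketch, assuming the log has been saved as plain text; the function name `nodes_reset_for_heartbeat_loss` is hypothetical and not part of any Cisco tool. A heartbeat loss on its own does not confirm a cache parity error; as described above, the core dump must also be decoded.

```python
import re

# Message text taken verbatim from the sample logs in this notice.
HEARTBEAT_LOSS = re.compile(
    r"%PLATFORM-SHELFMGR-3-NODE_RESET_BRINGDOWN : "
    r"Reset node (\S+) due to heartbeat loss"
)


def nodes_reset_for_heartbeat_loss(log_text):
    """Return the node IDs that shelfmgr reset because of heartbeat loss."""
    return HEARTBEAT_LOSS.findall(log_text)


if __name__ == "__main__":
    sample = (
        "RP/0/RP0/CPU0:Apr 18 16:59:35.649 : shelfmgr[342]:\n"
        "%PLATFORM-SHELFMGR-3-NODE_RESET_BRINGDOWN : "
        "Reset node 0/15/CPU0 due to heartbeat loss\n"
    )
    print(nodes_reset_for_heartbeat_loss(sample))
```

Each node reported this way is a candidate for the core dump check, not a confirmed cache parity failure.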
Sample CRS-MSC core dump:
Doing CRC check.
............................................
###############################################
KDEBUG at 0x2f807c, signal 9(KILL) c=3 f=33
r0 r1 r2 r3 r4 r5 r6 r7
01010101 000ffe70 00313380 40000000 00000000 00017274 7fd7c000 0002c000
r8 r9 r10 r11 r12 r13 r14 r15
0002bfff 00000000 40014d88 40000000 22000004 00312478 00000000 00000000
r16 r17 r18 r19 r20 r21 r22 r23
00000000 00000000 00000000 00000000 00000000 00000000 00000000 0030b428
r24 r25 r26 r27 r28 r29 r30 r31
00000000 0002b86c 40000000 000c6f2c 0002c000 0002c000 000c6f18 00000000
pc msr cnd lr cnt xer ear mq
002f807c 00109930 42000002 002a8a1c 00000000 20000000 00000000 00000000
srr0 srr1 dar msssr0
002f807c 00109930 40014000 00008000
Note: This particular error is an L3 Data parity error (PE), as indicated by the MSSSR0 value 00008000.
SRR1 = 2XXXXXXX => L1 Data PE
SRR1 = 4XXXXXXX => L1 Instruction PE
MSSSR0= XXX2XXXX => L2 Data PE
MSSSR0= XXX1XXXX => L3 Tag PE
MSSSR0= XXXX8XXX => L3 Data PE
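The SRR1 and MSSSR0 patterns listed with each core dump can be applied to the register values programmatically. The sketch below is illustrative only: it assumes each pattern corresponds to a single status bit in a 32-bit register (the notice gives only the hex-digit patterns), and the function name `decode_cache_parity` is hypothetical.

```python
# Bit masks inferred from the patterns in this notice. Treating each
# pattern as one status bit is an assumption, not a documented layout.
SRR1_FLAGS = {
    0x20000000: "L1 Data PE",         # SRR1   = 2XXXXXXX
    0x40000000: "L1 Instruction PE",  # SRR1   = 4XXXXXXX
}
MSSSR0_FLAGS = {
    0x00020000: "L2 Data PE",         # MSSSR0 = XXX2XXXX
    0x00010000: "L3 Tag PE",          # MSSSR0 = XXX1XXXX
    0x00008000: "L3 Data PE",         # MSSSR0 = XXXX8XXX
}


def decode_cache_parity(srr1_hex, msssr0_hex):
    """Return the parity-error labels whose bits are set in the registers."""
    srr1 = int(srr1_hex, 16)
    msssr0 = int(msssr0_hex, 16)
    found = [name for mask, name in SRR1_FLAGS.items() if srr1 & mask]
    found += [name for mask, name in MSSSR0_FLAGS.items() if msssr0 & mask]
    return found


if __name__ == "__main__":
    # Register values from the two sample core dumps in this notice.
    print(decode_cache_parity("40021036", "00000000"))  # CRS-16-RP sample
    print(decode_cache_parity("00109930", "00008000"))  # CRS-MSC sample
```

With the sample values, the first call reports an L1 Instruction PE and the second an L3 Data PE, matching the notes under each core dump.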
Cisco recommends a fix-on-fail strategy for this problem.
Although an upgrade program had previously been provided to replace potentially affected but otherwise working product, the upgrade program is now over and Cisco will only replace product which has actually failed. The standard RMA process should be used to replace failed product.
Note: The CRS-16-RP-B requires IOS-XR 3.3.0 or higher to run. These boards will be shipped with a new ROMMON version of 1.45.
Note for IOS-XR Release 3.4.1 only: When running Rel 3.4.1, FPD SMU CSCsi86270 (registered customers only) is required prior to upgrading the entire CRS-1 Router to a new ROMMON version - that is, 1.45. After this SMU is applied, the ROMMON FPD pie Upgrade Procedure should be followed to upgrade to the new ROMMON version.
Note: Cisco recommends that ROMMON versions should not be downgraded on any board as per FN62710.
As of May 21, 2007, new products manufactured under Engineering Change Order (ECO) E089579 use the new processor. Refer to How to Identify Hardware Levels below for instructions on how to view the version of in-service product.
To identify a CRS-MSC or a CRS-16-RP, use the show diag command to display both the board serial number and the TAN.
1) A CRS-MSC whose serial number appears on the affected serial number list and whose TAN is 800-25021-10 or lower does need to be replaced. TANs 800-25021-11 and higher are not affected, even when the serial number matches the list.
2) A CRS-16-RP whose serial number appears on the affected serial number list does need to be replaced.
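The two rules above can be expressed as a small check. This is a sketch under stated assumptions: the caller has already determined whether the serial number is on the affected list (`serial_on_list`), the part numbers are those in the Products Affected table, and the function name `is_affected` is hypothetical.

```python
# Part numbers from the Products Affected table in this notice.
MSC_PIDS = {"CRS-MSC", "CRS-MSC-20G", "CRS-MSC-40G"}
MSC_TAN_BASE = "800-25021"
MSC_MAX_AFFECTED_SUFFIX = 10  # TAN 800-25021-10 and lower are affected


def is_affected(pid, tan, serial_on_list):
    """Apply rules 1) and 2): serial on the list, plus a TAN check for MSCs."""
    if not serial_on_list:
        return False
    # The table lists each part both with and without a trailing "=".
    pid = pid.rstrip("=")
    if pid == "CRS-16-RP":
        return True
    if pid in MSC_PIDS:
        base, _, suffix = tan.rpartition("-")
        return (
            base == MSC_TAN_BASE
            and suffix.isdigit()
            and int(suffix) <= MSC_MAX_AFFECTED_SUFFIX
        )
    return False


if __name__ == "__main__":
    print(is_affected("CRS-MSC", "800-25021-06", True))
    print(is_affected("CRS-MSC", "800-25021-11", True))
```

The exact match on `CRS-16-RP` is deliberate, so that the replacement CRS-16-RP-B is never reported as affected.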
Sample show diag output (in admin mode) for identifying the CRS-MSC TAN and Serial Number:
RP/0/RP1/CPU0:ios#admin
RP/0/RP1/CPU0:ios(admin)#show diag det
NODE 0/2/SP : Cisco CRS-1 Series Modular Services Card
MAIN: board type 500060
800-25021-06 rev A0
dev 080840, 080229
S/N SAD10xxxxxx
PCA: 73-7648-08 rev C0
PID: CRS-MSC VID: V03
CLEI: IPUCAD0BAA
ECI: 135786
Board State : IOS XR RUN
PLD: Motherboard: 0x0025, Processor: 0xda13, Power: N/A
MONLIB: QNXFFS Monlib Version 3.1
ROMMON: Version 1.43(20061109:050237) [CRS-1 ROMMON]
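When many boards need checking, the TAN, serial number, and PID can be pulled from captured show diag output. The following is a minimal sketch, assuming the output layout matches the sample above and that the TAN is the 800-level number followed by "rev"; the function name `parse_show_diag` is hypothetical, and only the first node in the text is read.

```python
import re


def parse_show_diag(text):
    """Extract TAN, serial number, and PID from one node's show diag output."""
    # Assumption: the TAN starts with "800-"; the PCA number (73-...) does not.
    tan = re.search(r"\b(800-\d+-\d+)\s+rev\b", text)
    serial = re.search(r"\bS/N\s+(\S+)", text)
    pid = re.search(r"\bPID:\s*(\S+)", text)
    return {
        "tan": tan.group(1) if tan else None,
        "serial": serial.group(1) if serial else None,
        "pid": pid.group(1) if pid else None,
    }


if __name__ == "__main__":
    sample = """\
NODE 0/2/SP : Cisco CRS-1 Series Modular Services Card
MAIN: board type 500060
800-25021-06 rev A0
dev 080840, 080229
S/N SAD10xxxxxx
PCA: 73-7648-08 rev C0
PID: CRS-MSC VID: V03
"""
    print(parse_show_diag(sample))
```

The extracted serial number still has to be compared against the affected serial number list by hand or by a separate lookup.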
If you require further assistance, or if you have any further questions regarding this field notice, please contact the Cisco Systems Technical Assistance Center (TAC).
Product Alert Tool - Set up a profile to receive email updates about reliability, safety, network security, and end-of-sale issues for the Cisco products you specify.