
Field Notice: FN - 63743 - Catalyst 6500 - Might Fail to Boot Up After a Software Upgrade or Power Cycle – Fix on Failure

Revised: July 23, 2014
Originally Published: March 3, 2014


NOTICE:

THIS FIELD NOTICE IS PROVIDED ON AN "AS IS" BASIS AND DOES NOT IMPLY ANY KIND OF GUARANTEE OR WARRANTY, INCLUDING THE WARRANTY OF MERCHANTABILITY. YOUR USE OF THE INFORMATION ON THE FIELD NOTICE OR MATERIALS LINKED FROM THE FIELD NOTICE IS AT YOUR OWN RISK. CISCO RESERVES THE RIGHT TO CHANGE OR UPDATE THIS FIELD NOTICE AT ANY TIME.

Revision History

Revision  Date         Comment
1.3       23-JUL-2014  Updated the Workaround/Solution Section
1.2       13-JUN-2014  Fixed the Workaround/Solution Section
1.1       31-MAR-2014  Fixed the Products Affected and Workaround/Solution Sections
1.0       03-MAR-2014  Initial Public Release

Products Affected
WS-SUP720
WS-X6716-10G-3C 
WS-X6716-10G-3CXL
WS-X6716-10T-3CXL
VS-F6K-PFC3C  
VS-F6K-PFC3CXL 
VS-S720-10G-3C
VS-S720-10G-3CXL 
WS-F6700-DFC3B 
WS-F6700-DFC3BXL
WS-F6700-DFC3C 
WS-F6700-DFC3CXL
WS-F6K-PFC3B  
WS-F6K-PFC3BXL 
WS-SUP32-10GE-3B  
WS-SUP32P-GE-3B   
WS-SUP720-3B 
WS-SUP720-3BXL   
WS-X6148-FE-SFP 
WS-X6148A-45AF 
WS-X6148A-GE-45AF  
WS-X6148A-GE-TX 
WS-X6148A-RJ-45 
WS-X6148E-GE-45AT  
WS-X6704-10GE  
WS-X6708-10G-3C 
WS-X6708-10G-3CXL
WS-X6716-10T-3C  
WS-X6724-SFP 
WS-X6748-GE-TX 
WS-X6748-SFP 
ME-C6524GS-8S  
ME-C6524GT-8S  
WS-F6K-PISA 

Problem Description

Some Catalyst 6500 supervisors and linecards might fail to boot up after a software upgrade, or after any other user action that requires the board to be power cycled.

Background

Cisco has been working with some customers on an issue related to memory components manufactured by a single supplier between 2005 and 2010. These memory components are widely used across the industry and are included in a number of Cisco products. 

Although the majority of Cisco products using these components are experiencing field failure rates below expected levels, some components may fail earlier than anticipated. A handful of our customers have recently experienced a higher number of failures, leading us to change our approach to managing this issue. 

While other vendors have chosen to address this issue in different ways, Cisco believes its approach is the best course of action for its customers. Despite the cost, this approach demonstrates that customer satisfaction is always a top priority. Customers can learn more about this topic on the Memory Component Issue web page.

A degraded component will not affect the ongoing operation of a device, but will be exposed by a subsequent power cycle event. This event will result in a hard failure of the device, which cannot be recovered by a reboot or additional power cycle. For these reasons, additional caution is recommended for operational activities requiring the simultaneous power cycling of multiple devices. This issue has been observed most commonly on devices that have been in service for 24 months or more.

Problem Symptoms

If the suspected Catalyst 6500 supervisor, linecard, or fixed configuration hardware has been in operation for approximately 24 months, the product hardware might fail to boot up due to memory failure during a power cycle event. The power cycle can be triggered by one or more of these actions:

  • A software upgrade
  • A reload of the entire product
  • A reload after installation
  • A chassis power cycle
  • An Online Insertion and Removal (OIR) of the board

Note: This issue does not affect boards while they are in operation. A board failure might occur after one or more of the listed actions is executed.

For the affected Catalyst 6500 hardware, there is no single error message common to all situations. If error messages are seen, they might resemble these examples.

-----------------------------------------------------------------------------------------------------------------

Here is an example of an error message seen when a memory module has failed on the baseboard of a 6700 linecard:

%ONLINE-SP-6-TIMER: Module *module number*, Proc. 0. Failed to bring online
because of timer event

%C6KPWR-SP-4-DISABLED: power to module in slot *slot number* set off
(Module Failed SCP dnld)

%ONLINE-SP-6-REGN_TIMER: Module *module number*, Proc. 0. Failed to
bring online because of registration timer event

%C6KPWR-SP-4-DISABLED: power to module in slot *slot number* set off
(Module Failed SCP dnld)
-----------------------------------------------------------------------------------------------------------------

Here is an example of an error message seen when a memory module has failed on a DFC daughter card:

%EARL-SP-2-PATCH_INVOCATION_LIMIT: 10 Recovery patch invocations in the last
30 secs have been attempted. Max limit reached

%EARL-SW1_SP-2-PATCH_INVOCATION_LIMIT: 10 Recovery patch invocations in the
last 30 secs have been attempted. Max limit reached

%EARL-SW2_SP-2-PATCH_INVOCATION_LIMIT: 10 Recovery patch invocations in the
last 30 secs have been attempted. Max limit reached
-----------------------------------------------------------------------------------------------------------------

Here is an example of an error message seen on a Supervisor 720 when a memory module has failed on a Policy Feature Card (PFC) daughter card:

%CONST_DIAG-SP-3-HM_TEST_FAIL: Module *module number* TestSPRPInbandPing
consecutive failure count:5

%CONST_DIAG-SP-4-HM_TEST_WARNING: Sup switchover will occur after 10
consecutive failures

%CONST_DIAG-SP-3-HM_TEST_FAIL: Module *module number* TestSPRPInbandPing
consecutive failure count:10

%CONST_DIAG-SP-2-HM_SUP_SWOVER: Supervisor card switchover due to unrecoverable
errors, Reason: Failed TestSPRPInbandPin

-----------------------------------------------------------------------------------------------------------------

Here is the first of two possible error messages seen on a ME-C6524 when a memory module has failed:

%DIAG-SP-3-MINOR: Module 1: Online Diagnostics detected a Minor Error. Please
use 'show diagnostic result ' to see test results. 
%CONST_DIAG-SP-3-BOOTUP_TEST_FAIL: Module 1: TestLoopback failed on port(s) 1-12 
%CONST_DIAG-SP-3-BOOTUP_TEST_FAIL: Module 1: TestProtocolMatchChannel failed 
%PM-SP-4-ERR_DISABLE: diagnostics error detected on Gi1/3, putting Gi1/3 in
err-disable state 
%PM-SP-4-ERR_DISABLE: diagnostics error detected on Gi1/4, putting Gi1/4 in
err-disable state 
%PM-SP-4-ERR_DISABLE: diagnostics error detected on Gi1/6, putting Gi1/6 in
err-disable state 
%PM-SP-4-ERR_DISABLE: diagnostics error detected on Gi1/7, putting Gi1/7 in
err-disable state 
%PM-SP-4-ERR_DISABLE: diagnostics error detected on Gi1/12, putting Gi1/12 in
err-disable state 
%OIR-SP-6-INSCARD: Card inserted in slot 1, interfaces are now online

-----------------------------------------------------------------------------------------------------------------

Here is the second of two possible error messages seen on a ME-C6524 when a memory module has failed:

IPC: Message 7C2EECD8 timed out waiting for Ack
IPC: MSG: ptr: 0x7C2EECD8, flags: 0x24100, retries: 21, seq: 0x102FC,
refcount: 2, rpc_result = 0x0, data_buffer = 0x6C8A61E8, header = 0xB83D0B0,
data = 0xB83D0D0 || HDR: src: 0x2210000, dst: 0x10012, index: 0, seq: 764,
sz: 28, type: 883, flags: 0x400, ext_flags: 0x0, hi: 0xA3D, lo: 0xB83D0D0 ||
DATA: 02 F9 00 00 00 02 00 00 00 02 00 10 D6 A8 B0 B2 97 8D 51 80 00 00 00 04
00 00 00 00

%DUMPER-3-CRASHINFO_FILE_NAME: 12313: Crashinfo for process sbin/ios-base at
bootflash:/crashinfo_ios-base-20140214-151819
%SYSMGR-3-ABNORMTERM: ios-base:1 (jid 76) abnormally terminated, restart disabled 
%SYSMGR-6-ERROR_EOK: ios-base:1 (jid 76) mandatory process exited, rebooting
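As a triage aid, collected syslogs can be scanned for the example messages above. Here is a minimal sketch in Python; the patterns are taken directly from the examples in this notice, and a match only indicates a possible occurrence of this issue, not a confirmed failure.

```python
import re

# Syslog patterns taken from the example messages in this field notice.
# A match only flags a *possible* memory-component failure; confirm against
# the Products Affected list and TAC before taking action.
SIGNATURES = [
    r"%ONLINE-SP-6-(?:REGN_)?TIMER: Module \d+",
    r"%C6KPWR-SP-4-DISABLED: power to module in slot \d+ set off",
    r"%EARL-\S*SP-2-PATCH_INVOCATION_LIMIT",
    r"%CONST_DIAG-SP-3-HM_TEST_FAIL: Module \d+ TestSPRPInbandPing",
    r"%CONST_DIAG-SP-3-BOOTUP_TEST_FAIL",
    r"%SYSMGR-3-ABNORMTERM: ios-base",
]

def flag_suspect_lines(log_lines):
    """Return the log lines that match any known failure signature."""
    patterns = [re.compile(p) for p in SIGNATURES]
    return [line for line in log_lines
            if any(p.search(line) for p in patterns)]
```

A positive hit should simply prompt a closer look at the board's service age and PID; absence of these messages does not rule the issue out, since no error message is common to all situations.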

Workaround/Solution

Fix on Failure Replacement Guidelines: Request a replacement product (RMA) through normal service support channels.

If you need assistance in order to determine which hardware part(s) might need replacement, consult the error messages documented in the Problem Symptoms section.

For assistance with replacement part disposition, refer to this table. In cases where replacement of the memory DIMM and/or daughter card is not a viable option, a request may be made to replace the entire card.

Refer to this documentation for assistance with memory module replacement:

How To Identify Hardware Levels

Enter the show inventory command in order to obtain the Product ID (PID). If the CLI is not available, physically inspect the device in order to locate the PID.
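An inventory capture can also be checked against the affected-product list programmatically. Here is a minimal, hypothetical sketch in Python: AFFECTED_PIDS is abbreviated for illustration and should be populated from the full Products Affected list above, and the serial numbers in the usage example are placeholders.

```python
import re

# A subset of PIDs copied from the Products Affected list in this
# field notice (abbreviated here for illustration only).
AFFECTED_PIDS = {
    "WS-SUP720", "WS-SUP720-3B", "WS-SUP720-3BXL",
    "VS-S720-10G-3C", "VS-S720-10G-3CXL",
    "WS-X6704-10GE", "WS-X6748-GE-TX", "ME-C6524GS-8S",
}

def affected_pids(show_inventory_output):
    """Extract PID values from captured 'show inventory' output and
    return, sorted, those that appear in the affected-product list."""
    pids = re.findall(r"PID:\s*(\S+)", show_inventory_output)
    return sorted({p for p in pids if p in AFFECTED_PIDS})
```

For example, feeding this function a capture that contains the line `PID: WS-X6748-GE-TX    , VID: V02, SN: SAL00000000` would report WS-X6748-GE-TX as a match, while PIDs not on the list are ignored.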

For More Information

If you require further assistance, or if you have any further questions regarding this field notice, contact the Cisco Systems Technical Assistance Center (TAC).

Receive Email Notification For New Field Notices

Cisco Notification Service - Set up a profile to receive email updates about reliability, safety, network security, and end-of-sale issues for the Cisco products you specify.