Field Notice: FN - 63732 - UCS Product Family - LSI RAID Controller Impacted by Several Critical Issues - Replacement Required

Revised October 8, 2014
July 29, 2014


NOTICE:

THIS FIELD NOTICE IS PROVIDED ON AN "AS IS" BASIS AND DOES NOT IMPLY ANY KIND OF GUARANTEE OR WARRANTY, INCLUDING THE WARRANTY OF MERCHANTABILITY. YOUR USE OF THE INFORMATION ON THE FIELD NOTICE OR MATERIALS LINKED FROM THE FIELD NOTICE IS AT YOUR OWN RISK. CISCO RESERVES THE RIGHT TO CHANGE OR UPDATE THIS FIELD NOTICE AT ANY TIME.

Revision History

Revision  Date          Comments
1.2       08-OCT-2014   Updated the How To Identify Hardware Levels and Upgrade Program Sections
1.1       06-OCT-2014   Updated Workaround/Solution Section
1.0       29-JULY-2014  Initial Public Release

Products Affected

Products Affected       Comments
CE-RAID9271-8I          Marketing PID based on base PID UCS-RAID9271CV-8I
CIT2-RAID-9271CV        Marketing PID based on base PID UCS-RAID9271CV-8I
EXP-RAID9271-8I         Marketing PID based on base PID UCS-RAID9271CV-8I
UC-RAID-9271            Marketing PID based on base PID UCS-RAID9271CV-8I
SCE10000-RAID9271(=)    Marketing PID based on base PID UCS-RAID9271CV-8I
MXE-UCS-RAID-9271       Marketing PID based on base PID UCS-RAID9271CV-8I
NAM-RAID9271CV-8I       Marketing PID based on base PID UCS-RAID9271CV-8I
CNRC-RAID9271CV-8I      Marketing PID based on base PID UCS-RAID9271CV-8I
CNRD-RAID9271CV-8I      Marketing PID based on base PID UCS-RAID9271CV-8I
CPAR-RAID9271CV-8I      Marketing PID based on base PID UCS-RAID9271CV-8I
UCS-RAID-9265CV(=)
UCS-RAID-9266(=)        Not affected by the TMM or Slow ASIC issues
UCS-RAID-9266-CV(=)
UCS-RAID-9266-NB(=)     Not affected by the TMM or Slow ASIC issues
UCS-RAID-9285CV-E(=)
UCS-RAID9270CV-8I(=)
UCS-RAID9271-8I(=)      Not affected by the TMM issue
UCS-RAID9271CV-8I(=)
UCS-RAID9286CV-8E(=)
UCSC-NYTRO-200GB(=)
UCSV-RAID-9265CV=       Marketing PID based on base PID UCS-RAID-9265CV
MDE-RAID9271CV-8I       Marketing PID based on base PID UCS-RAID9271CV-8I
TCE-RAID9271CV-8I       Marketing PID based on base PID UCS-RAID9271CV-8I
CSM-RAID-9271           Marketing PID based on base PID UCS-RAID9271CV-8I
PRSM-RAID9271CV-8I      Marketing PID based on base PID UCS-RAID9271CV-8I
CCS-RAID-9271           Marketing PID based on base PID UCS-RAID9271CV-8I
CPS-RAID9271CV-8ID      Marketing PID based on base PID UCS-RAID9271CV-8I

Problem Description

This document describes three issues that are related to the LSI 6GB RAID controllers that are used with the Cisco UCS C-Series Rack Servers:

  • PMU Fault - Several PMU fault codes are generated due to a marginal voltage level on an internal voltage rail, which is observed on certain RAID controllers.

  • Slow ASIC/PMU Fault - The RAID controller resets itself due to inadequate power supply noise decoupling on the second PPC power plane, which is aggravated by slow process material found on the Thunderbolt Series ASIC silicon.

  • Transportable Flash Module (TFM) / Transportable Memory Module (TMM/TMMB) - The PCIe Reset sequencing can cause DRAM data loss in scenarios where an unexpected external power outage (AC) occurs, or in the event of certain power supply failures.

Background

Cisco has worked with the supplier in order to address the issues related to RAID controllers that impact the UCS product family. The affected products were shipped in various overlapping time periods. This document, and the replacement program that is described herein, is intended to address all three of the known issues:

  • PMU Fault - The LSI RAID controllers PPC core voltage rail noise issue.  

    • This issue was initially observed on the UCS-RAID-9266 controller, where it causes the PMU fault codes 0x620B, 00002651, or 00002656. The root cause for these codes was linked to the LSI Thunderbolt Series RAID cards. The issue is due to a marginal internal voltage that impacts several UCS RAID controllers.

    • An Engineering Change Order (ECO) was implemented in order to fix this issue.

  • Slow ASIC/PMU Fault - The PPC core voltage rail noise and design enhancement issue on the LSI Thunderbolt Series RAID controllers.

    • This issue causes the board to reset itself, and IO processing is delayed while the reset occurs. Thus, there is a performance impact or lack of system availability. 

    • No data corruption or data loss has been observed. 

    • An ECO was implemented in order to fix this issue.

  • TFM, TMM, and TMMB - The LSI Thunderbolt Series RAID controllers with the TFM/TMM improvement for the LSI RAID 9265CV, 9266CV, 9270CV, 9271CV, 9285CV, and 9286CV, or the TMMB improvement for the UCSC-NYTRO-200GB(=).

    • Data loss can occur in the DRAM that has not been written into the persistent storage upon power failure or other events that normally trigger a cache offload. 

    • The new TFM/TMM/TMMB design detects and responds to any loss of power based on the board detection of the power loss. This is completely independent of whether the power loss is due to an unexpected AC power loss or a power supply failure.

Problem Symptoms

These symptoms have been observed for each known issue:

  • PMU Fault - The fault code 0x620B, 00002651, or 00002656 is observed on the LSI Thunderbolt Series cards. These fault code symptoms can occur: 

    • Upon start up or under heavy IO load. 

    • If the issue occurs during normal operation, the FW transparently recovers from the fault with an automatic controller reset. 

    • If the fault occurs during a host boot, then manual or remote intervention via CIMC is required.

    Replacement of the 9266 RAID controller is recommended only when these symptoms are observed:

    • The Operating System (OS) is ESXi5, ESXi4, or Red Hat Version 6.4.  

    • One or more of these messages appear in the MegaCLI log:

      • Pmu Msg Fault!!! faultcode 00002651 

      • Pmu Msg Fault!!! faultcode 00002656

      • Pmu Msg Fault!!! faultcode 0000620B

      • Controller encountered a fatal error and was reset (multiple instances)

    Enter these commands in order to retrieve the logs:

    MegaCLI -AdpEventLog -GetEvents -f eventlog.txt -a0 -nolog 
    MegaCli -fwtermlog -dsply -a0 -nolog > lsi-fwterm.log

  • Slow ASIC/PMU Fault - This fault causes the board to reset itself and the IO processing is delayed during the reset. There is also a performance impact or lack of system availability.

    The replacement of the UCS-RAID9271-8I controller is recommended only if the LSI Serial Number (SN) is listed in the Cisco UCS-RAID Serial Number Validation Tool (SNV).

  • TFM, TMM, and TMMB - Data loss can occur in the DRAM that is not written into the persistent storage upon power failure or other events that normally trigger a cache offload. These two conditions have been observed: 

    • With the Cisco UCS systems that do not provide a PCIe Reset signal prior to the loss of operating power or after an unexpected AC power failure, the TFM or TMMB might not properly protect the content of the LSI MegaRAID controller write cache, which leads to a loss of customer data. 

    • With the Cisco UCS systems that do provide a PCIe Reset signal prior to the loss of operating power or after an unexpected AC power failure, if the power supply itself fails and does not generate the early warning (PCIe Reset signal), then the TFM or TMMB might not properly protect the content of the LSI MegaRAID controller write cache, which leads to a loss of the customer data. 

    Replacement of the TFM, TMM, or TMMB that affects the UCS RAID controllers is recommended.
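
The PMU Fault log check described above can be partly automated. The following Python sketch is illustrative only and is not a Cisco or LSI tool. It searches log text for the three fault code messages and the reset message quoted in this notice. The function names are hypothetical, and reading "multiple instances" as two or more resets is an assumption.

```python
import re
from collections import Counter

# Fault codes and reset message exactly as quoted in this field notice.
PMU_FAULT_CODES = {"00002651", "00002656", "0000620B"}
PMU_FAULT_RE = re.compile(r"Pmu Msg Fault!!! faultcode ([0-9A-Fa-f]{8})")
RESET_MESSAGE = "Controller encountered a fatal error and was reset"


def scan_log_text(text):
    """Count the listed PMU fault codes and fatal-error resets in log text."""
    faults = Counter()
    for match in PMU_FAULT_RE.finditer(text):
        code = match.group(1).upper()
        if code in PMU_FAULT_CODES:
            faults[code] += 1
    return {"faults": dict(faults), "resets": text.count(RESET_MESSAGE)}


def shows_pmu_symptoms(summary):
    """True if a listed fault code appears or the reset message repeats."""
    # Assumption: "multiple instances" is interpreted as two or more resets.
    return bool(summary["faults"]) or summary["resets"] >= 2
```

To use it, read the contents of eventlog.txt and lsi-fwterm.log (the files produced by the commands above) and pass the text to scan_log_text. A positive result only indicates that the log symptoms are present. The OS condition and the SNV tool result still need to be checked separately.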

Workaround/Solution

Note: The Product Identifiers (PIDs) that are listed in this document might not be affected by all three of the problems described. Use the Part Number and Revision Key Code table that is provided in the next section in order to determine, at the Cisco Part Number (CPN)-level, how these three issues have impacted the affected PIDs. You can use this table when you perform a physical inspection of the units, and compare the PID and CPN in order to ensure that the RAID card is in line with the latest CPN revision.

For all products, use the SNV tool in order to verify that your product is affected. Enter all of the suspect LSI SN(s) for your RAID controller(s) into the search box of the SNV tool with a comma-separated list or with one SN per line in order to determine if it is affected.
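
When many controllers need to be checked, the serial number list for the SNV search box can be prepared with a small script. This is a minimal sketch, assuming the serial numbers have already been collected as strings. The function name and the sample values in the usage note are hypothetical, and the SNV tool itself is not called.

```python
def format_serials(serials, one_per_line=False):
    """Return a de-duplicated serial number list for pasting into a search box.

    The two output shapes match the input forms named in this notice:
    a comma-separated list, or one serial number per line.
    """
    seen = set()
    cleaned = []
    for raw in serials:
        sn = raw.strip()
        # Skip blank entries and repeats while keeping the original order.
        if sn and sn not in seen:
            seen.add(sn)
            cleaned.append(sn)
    separator = "\n" if one_per_line else ","
    return separator.join(cleaned)
```

For example, format_serials([" SN0001 ", "SN0002", "SN0001"]) returns "SN0001,SN0002", where SN0001 and SN0002 are placeholders and not real LSI serial numbers.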

  • PMU Fault - Cisco recommends that you replace the affected UCS-RAID-9266(=) and UCS-RAID-9266-NB(=) controllers if the previously described problem symptoms are observed and the LSI SN shows Affected in the SNV tool. Use the process that is described in the Upgrade Program section in order to upgrade the affected hardware. If the hardware is in line with the current CPN version and/or is not listed as Affected in the SNV tool but you still encounter this or a similar issue, then open a case with the Cisco Technical Assistance Center (TAC) in order to troubleshoot the problem. 

    Note: As of August 14, 2013 (approximately), the new products that were manufactured under ECO E115678 (listed in the 1. PMU Fault column of the table provided in the next section) are guaranteed to be free of this problem. The current version of these controllers is only needed if the problem symptoms are observed.

  • Slow ASIC/PMU Fault - Cisco recommends that you replace the affected UCS-RAID9271-8I controllers. Use the SNV tool in order to determine if your product is affected, and use the process that is described in the Upgrade Program section in order to upgrade the affected hardware. If the hardware is in line with the current CPN revision and/or is not listed as Affected in the SNV tool but you still encounter this or a similar issue, then open a case with the Cisco TAC in order to troubleshoot the problem.

    As of October 18, 2013 (approximately), the new products that were manufactured under ECO E116134 (listed in the 2. Slow ASIC / PMU Fault column of the table provided in the next section) are guaranteed to be free of this problem. 

  • TFM, TMM, and TMMB - Cisco recommends that you replace all of the affected hardware for this issue. If your hardware is not in line with the CPN version that is listed in green, bolded text in the Part Number and Revision Key Code table (provided in the next section) and the LSI SN shows Affected in the SNV tool, use the process that is described in the Upgrade Program section of this document in order to upgrade the affected hardware. If the hardware is in line with the current CPN revision and/or is not listed as Affected in the SNV tool but you still encounter this or a similar issue, then open a case with the Cisco TAC in order to troubleshoot the problem.

    Cisco released these three ECOs in order to address the TFM, TMM, and TMMB issues:

    As of January 28, 2014 (approximately), the new products that were manufactured under ECO E118005 are guaranteed to be free of this problem.
    As of March 06, 2014 (approximately), the new products that were manufactured under ECO E118299 are guaranteed to be free of this problem. 
    As of April 16, 2014 (approximately), the new products that were manufactured under ECO E119404 are guaranteed to be free of this problem. 

    Note: These ECOs address the different PIDs that are affected by this issue, all of which are described in this document and can be found in the table that is provided in the next section. Refer to the next section for instructions about how to view the version and/or the new tmmPcbAssmNo property name.

At the time that this document was published, the products that are shipped from Service Logistics might still exhibit these issues. In order to ensure that an RMA replacement is not affected by these issues, use the upgrade form (provided in the Upgrade Program section of this document) in order to obtain the replacements from a clean pool of inventory, which will be used until all of the service depots are purged and cleared of all issues. 

Note: Service-level agreements do not apply when an upgrade program is in place.

Replacement Procedures

Instructions for a RAID controller replacement can be found in the Install and Service Guide for your specific platform. Each server hardware guide describes how to replace a PCIe card, and the process is similar for all of the UCS C-Series platforms. This requires that you shut down the server and completely remove the power. In order to locate the hardware guide for your platform, refer to the Cisco UCS C-Series Rack Servers Install and Upgrade Guides web page.

In order to view an example that shows how to replace a PCIe card in a UCS C240 M3, refer to the Replacing a PCIe Card section of the Cisco UCS C240 M3 Server Installation and Service Guide.

The last step in these procedures, if you replace the RAID controller card, requires that you complete the steps described in the Restoring RAID Configuration After Replacing a RAID Controller section.

Each hardware guide has an appendix, the first section of which describes the supported RAID controllers (along with the required cables) for the specific server. The appendix also includes sections that describe how to route and install the cables. Refer to the RAID Controller Considerations section of the Cisco UCS C240 M3 Server Installation and Service Guide for an example.

How To Identify Hardware Levels

This is the Part Number and Revision Key Code table.

Note: The CPNs (74-xxxxx-xx) that are highlighted in green, bolded text identify where the fixes for these three issues are applied. The last two numbers of the CPN identify the version of the release. If the version of your product is beyond the latest version that is listed in this table, then the product should not encounter these issues. 

For all of the PIDs that have not been deployed, or for which you are unable to electronically find the SN, use the product (box, bag, or card) labels in order to locate the appropriate identifier (CPN and LSI SN). 
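
The CPN version comparison described in the note above can be sketched in Python as follows. This is an illustration only. It assumes that each "x" in 74-xxxxx-xx is a decimal digit and that the comparison is only meaningful between CPNs that share the same base (74-xxxxx). The function names are hypothetical, and the CPN values in the usage note are placeholders and not real part numbers.

```python
import re

# Assumption: a CPN is "74-", five digits, "-", and a two-digit version.
CPN_RE = re.compile(r"^(74-\d{5})-(\d{2})$")


def parse_cpn(cpn):
    """Split a CPN into its base part number and numeric version."""
    match = CPN_RE.match(cpn.strip())
    if match is None:
        raise ValueError(f"not a 74-xxxxx-xx part number: {cpn!r}")
    return match.group(1), int(match.group(2))


def at_or_beyond_fix(installed_cpn, fixed_cpn):
    """True if the installed version is at or beyond the version with the fix."""
    base, version = parse_cpn(installed_cpn)
    fixed_base, fixed_version = parse_cpn(fixed_cpn)
    if base != fixed_base:
        raise ValueError("part numbers have different bases; compare like with like")
    return version >= fixed_version
```

For example, at_or_beyond_fix("74-00000-03", "74-00000-02") returns True, and at_or_beyond_fix("74-00000-01", "74-00000-02") returns False. The SNV tool remains the authoritative check for whether a specific unit is affected.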

Here is an example of the box label:

Here is an example of the bag label: 

Here is an example of the card label:

For the RAID controllers that are already installed, use the product interface tools in order to obtain the needed information.

The CIMC CLI is a command-line management interface for the Cisco UCS C-Series servers. You can launch the CIMC CLI and manage the server over the network via Secure Shell (SSH) or Telnet. By default, Telnet access is disabled. 

In order to identify the SN of the LSI RAID controller in a C-Series server, enter the show storageadapter command into the CLI:

You can also enter the MegaCli64 -AdpAllInfo -aAll CLI command in order to obtain the required information:

You can log into the CIMC and identify the PID and LSI SN via the Storage/Controller Info Tab:

You can enter the storcli64 command in order to obtain the PID, LSI SN, and view the new TFM, TMM, or TMMB property name tmmPcbAssmNo that was added with the updated TFM, TMM, or TMMB release:

  1. Download Version 1.5(3)2 of the Cisco UCS Host Upgrade Utility (ucs-c240-huu-1.5.3.2.iso).

  2. Enter the storcli64 command via Microsoft Windows, Linux, or VMware.

  3. Locate the PID, the LSI SN, and the property name tmmPcbAssmNo.

If the tmmPcbAssmNo is present and the LSI SN is listed as Not Affected in the SNV tool, then your controller already has the fix. If the tmmPcbAssmNo is not present and the LSI SN is listed as Affected in the SNV tool, then your controller is affected and should be replaced.
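
The two outcomes stated above can be expressed as a small decision function. This Python sketch is hedged: it does not assume any particular storcli64 output layout and only checks whether the property name tmmPcbAssmNo appears anywhere in the captured text. The function names and the "undetermined" result are assumptions added for illustration. This notice does not define an outcome for the two mixed cases.

```python
def has_tmm_property(storcli_output):
    """True if the tmmPcbAssmNo property name appears in the captured text."""
    return "tmmPcbAssmNo" in storcli_output


def classify_controller(has_property, snv_status):
    """Map the property check and the SNV result to the outcomes in this notice."""
    status = snv_status.strip().lower()
    if has_property and status == "not affected":
        return "fixed"      # the controller already has the fix
    if not has_property and status == "affected":
        return "replace"    # the controller is affected and should be replaced
    # Assumption: mixed results are not covered by this notice, so they are
    # flagged for manual review (for example, with the Cisco TAC).
    return "undetermined"
```

A typical use is classify_controller(has_tmm_property(text), "Affected"), where text is the saved storcli64 output for one controller and the second argument is the status shown for its LSI SN in the SNV tool.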

Here is a Microsoft Windows PowerShell example:

Here is a Linux and VMware example:

Here are some important notes about the upgrade program that is currently in use for the hardware that is affected by the issues described in this document:

  • The next table addresses replacement consideration. The top portion of the table, Un-Like PID Replacement, will be updated as like inventory becomes available. The bottom portion of the table, Marketing PID Replacement, provides the replacement strategy for the Marketing PIDs.

  • For a short time only, Cisco will perform an unlike replacement for the PIDs that currently have availability issues and for the Marketing PIDs that map to a host PID.

  • The replacement PIDs are form, fit, and functionality equivalent to the Marketing PIDs.

  • This unlike replacement can end at any time and without any prior notice.

  • The unlike replacements are form, fit, and functionality equivalent or better than the PID they replace. 

Upgrade Program

UCSLSITMM
A Form must be filled out for each separate Ship to Address.

The Upgrade Order Reference Number should be unique for each time the Form is filled out.

Please enter a valid Serial Number in the Form.
Immediately after you submit the Form, you will receive an acknowledgement email with a Request #. Depending on material availability, both you AND the Customer email address will receive a confirmation email with an Order# in 7 to 10 days. UMPIRE orders are proactive replacements and do NOT adhere to normal SLAs or Service Contracts.

NOTE: If your Ship to Address is in the following countries, please expect delays of up to 3 months depending on importation regulations: Argentina, Brazil, Colombia, Mexico, Venezuela, India, all countries in Asia (e.g., Singapore, Malaysia, Hong Kong, China, Vietnam, Korea, Thailand, Philippines), and all non-EU countries (e.g., UAE, Russia, Turkey). You will receive your Order# at that time. Thank you for your patience as this process is beneficial for the customer; it will save them the cost of VAT/duty in these countries (which is very high).

If you were given a Sales Order Number for the shipment of your replacement parts, please refer to the SO Status Tool (Please note: you must have a CCO User ID and Password to access this site): https://cisco-apps.cisco.com/cisco/psn/commerce

If you were given an RMA Number for the shipment of your replacement parts, please refer to the "Service Order QuickSearch" Tool at the following location (Please note: you must have a CCO User ID and Password to access this site): http://tools.cisco.com/support/serviceordertool/home.svo

If you have not received an email with an Order# after 10 days, please send an email with your Request#(s) and Customer in the Subject line to: umpire-escalations@cisco.com


Note: Fields marked with an asterisk (*) are required fields.

Requestor Information
*Name
*E-mail Address
TAC SR Number
Customer Shipping Information
*Company
*Address
Address_line2
*City
*State/Province
*ZIP/Postal Code
*Country
Product
Product *Quantity *Serial# 2
UCS-RAID-9266=
UCS-RAID-9266CV=
UCS-RAID-9266-NB=
UCS-RAID9270CV-8I=
UCS-RAID9271-8I=
UCS-RAID9271CV-8I=
UCS-RAID9286CV-8E=
Customer Contact Information
*First Name
*Last Name
*Phone 1 Ext.
Fax 1 Ext.
*E-Mail
Please use the following format: user@domain.com
*Upgrade Order Reference Number
Please provide a number that you can use when inquiring about order status
Notes
1 For phone and fax, include 011 and the country code outside North America.

2 The serial number input field for each Product ID can hold up to 4,000 characters, including commas and white space. For longer lists of serial numbers, please submit additional requests.

3 For customers in Japan only: please enter the building and the floor in the address field. Also, enter the contact person's name, telephone number, and e-mail address in the appropriate fields.
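
Note 2 above limits each serial number field to 4,000 characters, including commas and white space. For long lists, the serial numbers can be split into batches that each fit within one request. The following Python sketch shows one way to do that. The function name is hypothetical, and it assumes the serial numbers are joined with a single comma and no added white space.

```python
MAX_FIELD_CHARS = 4000  # per-field limit stated in note 2


def split_for_form(serials, limit=MAX_FIELD_CHARS):
    """Group serial numbers into comma-separated batches of at most `limit` characters."""
    batches = []
    current = ""
    for sn in serials:
        if len(sn) > limit:
            raise ValueError(f"serial number longer than {limit} characters: {sn!r}")
        candidate = sn if not current else current + "," + sn
        if len(candidate) > limit:
            # The next serial number would overflow the field; start a new batch.
            batches.append(current)
            current = sn
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches
```

Each returned string can be pasted into the serial number field of one request, with any remaining batches submitted as additional requests.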

For More Information

If you require further assistance, or if you have any further questions regarding this field notice, please contact the Cisco Systems Technical Assistance Center (TAC) by one of the following methods:
