Field Notice: FN - 63472 - UCS B250 M2 Voltage Regulator Setting Causes Non-Recoverable Memory Errors - Firmware Upgrade Required

Revised April 19, 2013
February 29, 2012


NOTICE:

THIS FIELD NOTICE IS PROVIDED ON AN "AS IS" BASIS AND DOES NOT IMPLY ANY KIND OF GUARANTEE OR WARRANTY, INCLUDING THE WARRANTY OF MERCHANTABILITY. YOUR USE OF THE INFORMATION ON THE FIELD NOTICE OR MATERIALS LINKED FROM THE FIELD NOTICE IS AT YOUR OWN RISK. CISCO RESERVES THE RIGHT TO CHANGE OR UPDATE THIS FIELD NOTICE AT ANY TIME.

Revision History

Revision  Date         Comment
2.0       19-APR-2013  Updated the 'Workaround/Solution' and 'How to Identify Software Levels' sections for additional clarity with regards to a board controller update.
1.0       29-FEB-2012  Initial Public Release

Products Affected

Products Affected     Top Assembly Part #   Rev.
N20-B6625-2           68-3726-05            A0
N20-B6625-2-UPG       68-3726-05            A0
N20-B6625-2=          68-3726-05            A0
N20-B6625-2D          68-3726-05            A0
UCSB-DBUN-B250-104    68-3726-05            A0
UCSB-DBUN-B250-105    68-3726-05            A0
B250M2-BUN1           68-3726-05            A0
B250M2-BUN2           68-3726-05            A0

Engineering Change Order (ECO)

Engineering Change Order (ECO)   Remarks
E107438                          Implementation of revised CIMC Controller firmware for UCS B250 M2 in factory

Problem Description

Cisco UCS B250 M2 blade servers experience intermittent uncorrectable ECC errors due to marginal voltage regulator settings.

Background

During the investigation of a field failure on a B250 M2 blade, it was discovered that there was an oscillation on the 1.5V power rail that is used to power the DDR3 DIMMs. On the failing system, under heavy load, the amplitude of this oscillation increased to the point where the 1.5V power rail was out of spec and an ECC memory error occurred.

The root cause is marginal values programmed into the digital compensation loop inside the voltage regulator, which allow this oscillation to occur.

The fix is to reduce the gain of the compensation loop to increase the stability, and thus reduce the oscillations. Since this is a digital voltage regulator, this is done entirely in firmware, and no change to the circuit board or components is required.

Problem Symptoms

The UCS B250 M2 blade shows Degraded status, and uncorrectable ECC errors are visible in the SEL Log. When caused by this issue, the intermittent uncorrectable ECC errors occur mostly when the system is under heavy load. The uncorrectable errors are often preceded by correctable errors, but that is not always the case.

[Screenshot: example blade in degraded status]

Example SEL Log entries (note line f):

e | 11/23/2011 23:50:15 | CIMC | Memory DDR3_P1_A4_ECC #0x95 | | read 240 correctable ECC errors on Dimm 4 | Asserted
f | 11/23/2011 23:50:16 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 1, DIMM Socket: 5, Channel: A, Socket: 0, DIMM: A5 | Asserted
10 | 11/23/2011 23:50:17 | CIMC | Entity presence BIOS_POST_CMPLT #0x50 | Device Absent | Asserted
11 | 11/23/2011 23:50:17 | CIMC | Entity presence BIOS_POST_CMPLT #0x50 | Device Present | Deasserted
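
As a rough illustration, the entries above could be scanned for this symptom with a short script. This is a minimal sketch, not a Cisco tool: it assumes the SEL has been exported as pipe-separated text exactly as shown here, that the first column is a hexadecimal record number, and that the event text contains the phrase "Uncorrectable ECC". The names SelEntry, parse_sel_line and uncorrectable_ecc are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class SelEntry(NamedTuple):
    record_id: int            # first column, read as hexadecimal (e, f, 10, 11)
    timestamp: str
    source: str               # CIMC or BIOS in the entries above
    sensor: str
    details: Tuple[str, ...]  # every field between the sensor and the state
    state: str                # Asserted or Deasserted


def parse_sel_line(line: str) -> SelEntry:
    """Split one pipe-separated SEL line into its fields (assumed layout)."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) < 5:
        raise ValueError(f"unexpected SEL line: {line!r}")
    return SelEntry(int(parts[0], 16), parts[1], parts[2], parts[3],
                    tuple(parts[4:-1]), parts[-1])


def uncorrectable_ecc(entries: List[SelEntry]) -> List[SelEntry]:
    """Return asserted entries whose event text reports an uncorrectable ECC error."""
    return [e for e in entries
            if e.state == "Asserted"
            and any("uncorrectable ecc" in d.lower() for d in e.details)]


# The four example entries from this field notice.
SAMPLE = [
    "e | 11/23/2011 23:50:15 | CIMC | Memory DDR3_P1_A4_ECC #0x95 | | read 240 correctable ECC errors on Dimm 4 | Asserted",
    "f | 11/23/2011 23:50:16 | BIOS | Memory #0x02 | Uncorrectable ECC/other uncorrectable memory error | RUN, Rank: 1, DIMM Socket: 5, Channel: A, Socket: 0, DIMM: A5 | Asserted",
    "10 | 11/23/2011 23:50:17 | CIMC | Entity presence BIOS_POST_CMPLT #0x50 | Device Absent | Asserted",
    "11 | 11/23/2011 23:50:17 | CIMC | Entity presence BIOS_POST_CMPLT #0x50 | Device Present | Deasserted",
]

entries = [parse_sel_line(line) for line in SAMPLE]
flagged = uncorrectable_ecc(entries)
for entry in flagged:
    print(hex(entry.record_id), entry.timestamp, entry.details[-1])
```

Only line f is flagged; the correctable-error entry on line e is deliberately not matched, since correctable errors alone do not indicate this issue.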

Workaround/Solution

The issue is resolved in the following Cisco UCS B-Series software releases or higher:

  • UCS B-Series software release 2.0(1w)
  • UCS B-Series software release 1.4(3u)
    Note: 1.4(4d) is the minimum recommended 1.4 release due to other severe issues in 1.4(3u). See the release notes for more detail.
  • UCS B-Series software release 1.3(1y)
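
The release list above can be expressed as a per-train check. The following is a minimal sketch under stated assumptions: release strings have the form major.minor(maintenance + patch letters), patch letters order alphabetically within a maintenance level, and release trains later than 2.0 carry the fix. None of these ordering rules is stated in this notice, and the function names are hypothetical.

```python
import re

# Minimum fixed release per train, taken from the list above:
# (major, minor) -> (maintenance number, patch letter).
FIXED_MINIMUMS = {
    (1, 3): (1, "y"),
    (1, 4): (3, "u"),
    (2, 0): (1, "w"),
}

_RELEASE = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z]*)\)$")


def parse_release(text):
    """Split a string such as '2.0(1w)' into ((2, 0), (1, 'w'))."""
    match = _RELEASE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized release string: {text!r}")
    major, minor, maintenance, patch = match.groups()
    return (int(major), int(minor)), (int(maintenance), patch)


def release_has_fix(text):
    """True if the release is at or above the fixed release for its train."""
    train, build = parse_release(text)
    if train in FIXED_MINIMUMS:
        # Assumption: patch letters compare alphabetically.
        return build >= FIXED_MINIMUMS[train]
    # Assumption: trains newer than 2.0 include the fix, older unlisted ones do not.
    return train > max(FIXED_MINIMUMS)


for release in ("2.0(1w)", "2.0(1t)", "1.4(4d)", "1.4(3q)", "1.3(1y)"):
    print(release, release_has_fix(release))
```

A release that passes this check is not necessarily a recommended one; for the 1.4 train, the note above still applies.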

The fix for the issue is in the Board Controller firmware 111026-111026 which is included with the release versions noted above. The board controller version will not be visible in UCSM unless the UCSM version is at or above one of the versions needed to resolve this issue.

UCS software can be downloaded from the Download Software page on Cisco.com.

UCS software upgrade instructions can be found at the following location:
http://www.cisco.com/en/US/products/ps10281/prod_installation_guides_list.html

BIOS update instructions can be found at the following location:
http://www.cisco.com/en/US/products/ps10280/products_configuration_example09186a0080af4547.shtml

The voltage regulator is reprogrammed only by the board controller update. The board controller update in turn requires an updated CIMC, so complete the CIMC update successfully before attempting the board controller update. The CIMC update does not require a reset of the blade server; the board controller firmware update does. It is recommended to perform the board controller update as part of the host firmware update package.
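
The ordering and reset requirements described above can be summarized in a small planning helper. This is only an illustrative sketch: the function name, its inputs and the step wording are hypothetical, and it does not call UCS Manager or perform any update.

```python
def update_plan(cimc_updated, board_controller_fixed):
    """Return the remaining steps as (description, requires_blade_reset) pairs."""
    steps = []
    if board_controller_fixed:
        # The voltage regulator has already been reprogrammed; nothing to do.
        return steps
    if not cimc_updated:
        # The CIMC update must succeed first and needs no blade reset.
        steps.append(("Update CIMC firmware", False))
    # The board controller update reprograms the voltage regulator
    # and requires a reset of the blade server.
    steps.append(("Update board controller firmware via the host firmware package", True))
    return steps


for description, needs_reset in update_plan(cimc_updated=False, board_controller_fixed=False):
    print(description, "(blade reset required)" if needs_reset else "(no reset)")
```

Because the final step always resets the blade, it would normally be scheduled in a maintenance window.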

Deviation

Units reworked with new firmware prior to ECO E107438 are marked with Deviation number D123126.

How To Identify Hardware Levels

The fix for this issue is in the Board Controller firmware of the blade. The fix is already in place if the Board Controller version is 111026-111026 or higher. This should be checked carefully because the Board Controller and CIMC Controller are updated separately.
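
For an inventory of many blades, the "111026-111026 or higher" test could be scripted. The sketch below assumes that a Board Controller version consists of two numeric fields separated by a hyphen and that "higher" means a numerically larger pair compared left to right; the notice does not define the ordering, so this is an assumption. The function names and the non-fixed sample versions are hypothetical.

```python
FIXED_BOARD_CONTROLLER = "111026-111026"  # version named in this field notice


def parse_board_controller(version):
    """Turn '111026-111026' into (111026, 111026); assumes numeric fields."""
    return tuple(int(field) for field in version.strip().split("-"))


def board_controller_has_fix(version):
    """True if the version is at or above the fixed version (assumed ordering)."""
    return parse_board_controller(version) >= parse_board_controller(FIXED_BOARD_CONTROLLER)


# Hypothetical inventory: blade label -> Board Controller version as read
# from the "Installed Firmware" tab. Only 111026-111026 comes from the notice.
inventory = {
    "chassis-1/blade-1": "111026-111026",
    "chassis-1/blade-3": "100805-100805",  # made-up older version
}

for blade, version in inventory.items():
    status = "fix present" if board_controller_has_fix(version) else "update required"
    print(blade, version, status)
```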

To identify the running Board Controller and CIMC Controller versions, log in to UCS Manager, navigate to the UCS B250 M2 blade server, and click the "Installed Firmware" tab.

Equipment > Chassis > Servers > (server in question)

[Screenshot: example server status and firmware level]

For More Information

If you require further assistance, or if you have any further questions regarding this field notice, please contact the Cisco Systems Technical Assistance Center (TAC).
