Advanced Memory Features and Troubleshooting on UCS M7 and M8 Platforms


Updated: February 4, 2026

Document ID: 225429



Introduction

This document describes the new memory features introduced in UCS M7 and M8 generation servers and the steps to understand and troubleshoot memory errors.

Prerequisites

Requirements

Cisco recommends that you have knowledge of these topics:

Basic understanding of UCS.
Basic understanding of Memory Architecture.

Components Used

The information in this document is based on these software and hardware versions:

UCS Family Servers M7 and M8
UCS Manager
Cisco Integrated Management Controller (CIMC)
Cisco Intersight Managed Mode (IMM)

The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.

Background Information

Overview of memory errors

Memory errors are among the most common types of errors on modern servers. Errors are often discovered when an attempt is made to read a memory location and the value read does not match the value last written.

Memory errors can be soft or hard. Some errors are correctable, but multiple simultaneous soft or hard errors on a single memory access can be uncorrectable.

Cisco UCS M7/M8 Memory RAS features

Cisco UCS M7 & M8 servers have a robust set of RAS features, as detailed here. These minimize the impact of memory errors on performance and system uptime.

System-level ECC

All Cisco UCS M7 servers use memory modules with ECC codes that can correct any error confined to a single x4 DRAM chip and detect any double-bit error in up to two devices. This is the same protection provided in older-generation servers; it is now referred to as system-level ECC.

Virtual Lock-Step (VLS) / Adaptive Double Device Data Correction (ADDDC) Sparing

ADDDC Sparing can correct two successive DRAM failures if they reside in the same region. This feature tracks correctable errors and dynamically maps out failing bits by spare-copying (“sparing”) contents into a “buddy” cache line. This mechanism can mitigate correctable errors that, if left untreated, could become uncorrectable. This feature uses virtual lockstep (VLS) to assign cache-line buddy pairs within the same memory channel at either the DRAM bank-level using bank VLS or the DRAM device-level using rank VLS.

On-die ECC

On-die ECC is a new feature in DDR5. This feature is enabled by default. All single-bit errors (hard and soft) are corrected by DRAM before data is transmitted to the host. However, this corrected data is not written back to DRAM. Error Check and Scrub (ECS) is the feature used to scrub and correct single-bit errors in memory.

Error Check and Scrub (ECS)

ECS checks for errors in the background by scrubbing each DRAM die periodically (every 24 hours), correcting them by writing data back to the array and providing a count of the errors found during the scrub. This feature is enabled by default.
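The scrub cycle described here (read each word, correct it, write it back, and count the errors) can be pictured with a toy model. This is a minimal sketch and not how a DDR5 die implements ECS: it assumes a Hamming(7,4) code in place of the real on-die code, and the names `syndrome`, `encode`, and `scrub` are hypothetical.

```python
from functools import reduce


def syndrome(word):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    return reduce(lambda acc, pos: acc ^ pos,
                  (i + 1 for i, bit in enumerate(word) if bit), 0)


def encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (list of 0/1)."""
    word = [0] * 7
    for pos, bit in zip((3, 5, 6, 7), nibble):  # data bit positions
        word[pos - 1] = bit
    s = syndrome(word)
    for p in (1, 2, 4):                         # parity bit positions
        word[p - 1] = 1 if s & p else 0
    return word


def scrub(array):
    """Read every word, repair single-bit errors in place, return the count."""
    corrected = 0
    for word in array:
        s = syndrome(word)
        if s:                  # non-zero syndrome is the flipped bit position
            word[s - 1] ^= 1   # write the corrected bit back to the array
            corrected += 1
    return corrected


# Toy memory array with one injected single-bit fault.
array = [encode([1, 0, 1, 1]), encode([0, 1, 1, 0])]
array[1][4] ^= 1
print("errors corrected in first scrub:", scrub(array))
print("errors corrected in second scrub:", scrub(array))
```

Because the first pass writes the corrected bit back, a second pass over the same array finds nothing to fix. That write-back is the difference between ECS and a plain on-die ECC read.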

Post Package Repair (PPR)

Post package repair is a feature where spare rows are used to replace a bad cell or row in a DRAM device.

There are three types: Soft PPR (reconfigurable), Hard PPR (permanent), and Runtime PPR.

Cisco UCS M7 servers with Intel CPUs support “hard” PPR. This is a permanent repair and is carried out during reboot based on the error data collected during the previous runtime or if any row errors are encountered during EMT.
Repairs typically occur during warm/cold resets or AC cycles.
UCS M8 servers support all three types of PPR. Hard PPR is enabled by default, while Runtime PPR is disabled.
Runtime PPR allows repairs to happen during system operation without affecting uptime.
If both Hard and Runtime PPR are enabled, all PPR features are utilized. If Hard PPR is disabled but Runtime PPR is enabled, the system defaults to Soft PPR.
PPR is closely linked with correctable errors, and each correctable error generates a SEL record when PPR is enabled.
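The way the two M8 settings combine can be summarized as a small lookup. This is a sketch for illustration only: the function name `effective_ppr` and the returned labels are hypothetical, and the two cases the text does not spell out (Hard PPR alone, and both settings disabled) are assumptions.

```python
def effective_ppr(hard_enabled: bool, runtime_enabled: bool) -> str:
    """Return the PPR behavior implied by the two settings."""
    if hard_enabled and runtime_enabled:
        return "all PPR features"   # stated in the text
    if runtime_enabled:
        return "soft PPR"           # stated: Hard off, Runtime on
    if hard_enabled:
        return "hard PPR only"      # assumption: the M8 default combination
    return "no PPR"                 # assumption: both settings disabled


# M8 default according to the text: Hard PPR enabled, Runtime PPR disabled.
print(effective_ppr(hard_enabled=True, runtime_enabled=False))
```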

PMIC (Power Management Integrated Circuit)

The PMIC on a DIMM is a key feature of DDR5 memory modules. This integration moves the power management function from the motherboard onto the memory module itself, offering several significant advantages.

For DDR5 memory, PMIC error handling is enabled.

PMIC failures generate SEL records during both runtime and post-boot.
During memory training, if a PMIC failure is detected in a memory channel, the affected DIMM is mapped out, and the system continues to boot with reduced memory.

Log Analysis

Files to Check in Tech Support

UCSM_X_TechSupport > sam_techsupportinfo provides information about the DIMMs and the memory array.

Chassis/server tech support

CIMCX_TechSupport\tmp\CICMX_TechSupport.txt -> Generic tech support information about server X.

CIMCX_TechSupport\obfl\obfl-log -> OBFL logs provide an ongoing record of the status and boot of server X.

CIMCX_TechSupport\var\log\sel -> SEL logs for server X.

Based on the platform/version, navigate to these files in the tech support bundle.

RAS: ECS (Error Check and Scrub) data, such as CE error locations, collected during runtime on every scrub:

/nv/etc/BIOS/bt/DDR5_CISCO_ECS

AMT executes automatically on the next boot if CE or UCE errors are hit on DIMMs:

nv/etc/BIOS/bt/MrcOut.

AMT_TEST_PATTERN:
ADV_MT_SAMSUNG

AMT_RESULT: PASSED.
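If you review many bundles, the AMT fields can be pulled out with a short script. This is a minimal sketch that assumes the two fields appear in the file exactly as in the sample lines above; the function name `parse_amt` is hypothetical, and the real file layout can differ by platform and BIOS version.

```python
import re


def parse_amt(mrcout_text):
    """Extract the AMT test pattern and result, or None when a field is absent."""
    pattern = re.search(r"AMT_TEST_PATTERN:\s*(\S+)", mrcout_text)
    result = re.search(r"AMT_RESULT:\s*(\w+)", mrcout_text)
    return {
        "pattern": pattern.group(1) if pattern else None,
        "result": result.group(1) if result else None,
    }


# Sample lines copied from the text above.
sample = "AMT_TEST_PATTERN:\nADV_MT_SAMSUNG\n\nAMT_RESULT: PASSED.\n"
print(parse_amt(sample))
```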

PMIC error: /nv/etc/DIMM-PMIC.txt
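To confirm which of these files are present in an extracted bundle, a recursive search by path suffix is usually enough. The sketch below assumes the bundle is already extracted to a local directory and that the relative paths match the ones listed in this section; the function name `find_memory_logs` is hypothetical, and the actual file names can vary by platform and version.

```python
from pathlib import Path

# Relative paths taken from this section (assumed to match the bundle layout).
WANTED_SUFFIXES = (
    "obfl/obfl-log",
    "var/log/sel",
    "nv/etc/BIOS/bt/DDR5_CISCO_ECS",
    "nv/etc/BIOS/bt/MrcOut",
    "nv/etc/DIMM-PMIC.txt",
)


def find_memory_logs(bundle_root):
    """Map each wanted suffix to the list of matching files under bundle_root."""
    found = {suffix: [] for suffix in WANTED_SUFFIXES}
    for path in Path(bundle_root).rglob("*"):
        if not path.is_file():
            continue
        posix = path.as_posix()
        for suffix in WANTED_SUFFIXES:
            if posix.endswith("/" + suffix):
                found[suffix].append(path)
    return found
```

An empty list for a given suffix means the file was not found under that root. That can be normal, for example when no PMIC error has been recorded.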

M8 servers contain:

nv/etc/BIOS/bt >MrcOut

These files provide information about memory as seen from the BIOS level.

The information there can be cross-referenced with the DIMM states report tables.

Example from an AMD server:

nv/etc/BIOS/bt >MrcOut

It contains:

BIOS version, build date and time
PSP firmware versions
DIMM Presence and status (indicates DIMM is present or not)
DIMM configuration details.




2025/08/14 13:44:34

BIOS ID : C245M8.4.3.6b.0 Built 04/28/2025 14:15:22

=====================
PSP Firmware Versions
=====================
ABL Version: 100E8012
PSP: 0.29.0.9B
PFMW (SMU): 4.71.126.0
SEV: 1.1.37.28
PHY: 0.1.38.0
MPIO: 1.0.2D.C4
TF MPDMA: 0.47.3.0
PM MPDMA: 0.47.46.0
GMI: AB.1.27.0
RIB: 2.0.8.39
SEC: D.E.90.71
PMU: 0.0.90.4E
EMCR: 0.0.E0.4E
uCode B1: 0xA101154


DIMM Status:
|=======================|
| Memory | DIMM Status |
| Channel | |
|=======================|
| P1_A | 01 |
| P1_B | 01 |
| P1_C | 01 |
| P1_D | 01 |
| P1_E | 01 |
| P1_F | 00 |
| P1_G | 01 |
| P1_H | 01 |
| P1_I | 01 |
| P1_J | 01 |
| P1_K | 01 |
| P1_L | 00 |
| P2_A | 01 |
| P2_B | 01 |
| P2_C | 01 |
| P2_D | 01 |
| P2_E | 01 |
| P2_F | 00 |
| P2_G | 01 |
| P2_H | 01 |
| P2_I | 01 |
| P2_J | 01 |
| P2_K | 01 |
| P2_L | 00 |
|=======================|


DIMM Configuration: 
================================================= 
MbistTest = Disabled
MbistAggressor = Disabled
MbistPerBitSlaveDieReport = Enabled
DramTempControlledRefreshEn = Disabled
UserTimingMode = Disabled
UserTimingValue = Disabled
MemBusFreqLimit = Disabled
EnablePowerDown = Disabled
DramDoubleRefreshRate = Disabled
PmuTrainMode = 0x0000 
EccSymbolSize = 0x0000 
UEccRetry = Disabled
IgnoreSpdChecksum = Disabled
EnableBankGroupSwapAlt = Disabled
EnableBankGroupSwap = Disabled
DdrRouteBalancedTee = Disabled
OdtsCmdThrotEn = Disabled
OdtsCmdThrotCyc = Disabled
=================================================

Enhanced Memory Context Restore : APOB_SAVED

2025/08/14 13:44:34
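The DIMM Status table in this output can be scanned with a script to list the channels that report a given status code. This is a minimal sketch: the names `parse_dimm_status` and `channels_with_status` are hypothetical, and the reading of the codes (01 as DIMM present, 00 as not present) is an assumption based on the description of the file contents, not a documented definition.

```python
import re

# Matches table rows such as "| P1_F | 00 |".
ROW = re.compile(r"^\|[ \t]*(P\d+_[A-Z])[ \t]*\|[ \t]*(\d{2})[ \t]*\|", re.MULTILINE)


def parse_dimm_status(mrcout_text):
    """Return {channel: status_code} for every row of the DIMM Status table."""
    return {channel: status for channel, status in ROW.findall(mrcout_text)}


def channels_with_status(mrcout_text, status="00"):
    """List channels whose status code equals `status` (assumed: 00 = not present)."""
    return [ch for ch, st in parse_dimm_status(mrcout_text).items() if st == status]


# Three rows copied from the sample output above.
sample = "| P1_E | 01 |\n| P1_F | 00 |\n| P1_G | 01 |\n"
print(channels_with_status(sample))
```

Compare the resulting list with the expected DIMM population of the server before you treat a 00 entry as a fault.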

MCA out file inventory:

This file contains information about the MCA registers of all banks. It is populated whenever a UCE error has been detected.




--- START OF MCA FILE --- 
Timestamp H:M:S 13:44:15 D:M:Y 14:8:2025

--- Note ---
The legacy MCA registers include:
MCA_CTL - Enables error reporting via machine check exception.
MCA_STATUS - Logs information associated with errors.
MCA_ADDR - Logs address information associated with errors. The use of AMD Secure Memory Encryption may change the information logged in the address register.
MCA_MISC0 - Logs miscellaneous information associated with errors.
The MCA Extension registers include:
MCA_CONFIG - Provide configuration capabilities for this MCA bank.
MCA_IPID - Provides information on the block associated with this MCA bank.
MCA_SYND - Logs physical location information associated with a logged error.
MCA_DESTATUS - Logs status information associated with a deferred error.
MCA_DEADDR - Logs address information associated with a deferred error.
MCA_MISC[1:4] - Provides additional threshold counters within an MCA bank.
MCA_TRANSSYND - Logs location information associated with a transparent error.
MCA_TRANSADDR - Logs address information associated with a transparent error.


LS - Load-Store Unit -> Bank 0
IF - Instruction Fetch Unit -> Bank 1
L2 - L2 Cache Unit -> Bank 2
DE - Decode Unit -> Bank 3
Empty/Unused bank -> Bank 4
EX - Execution Unit -> Bank 5
FP - Floating Point Unit -> Bank 6
L3 - L3 Cache Unit -> Bank 7 to 14
MP5 - Microprocessor5 Management Controller -> Bank 15
PB - Parameter Block -> Bank 16 
PCS-GMI - GMI Controller -> Bank 17 to 18
KPX-GMI - High Speed Interface Unit(GMI) -> Bank 19 to 20
UMC - Unified Memory Controller -> Bank 21 to 22
CS - Coherent Station -> Bank 23 to 24
NBIO - NorthBridge IO Unit -> Bank 25
PCIE - PCIe Root port -> Bank 26 to 27
PIE - Power Management, Interrupts, Etc -> Bank 28
SMU - System Management Controller Unit -> Bank 29
PCS_XGMI - XGMI Controller -> Bank 30
KPX_SERDES - High Speed Interface Unit(XGMI)-> Bank 31
Empty/Unused bank -> Bank 32 to 63

Total BankNumber = 32
MC Global Capability Value = 120
MC Global Status Value = 0
MC Global Control Value = 0
Number of processor = 64
ProcNum BankNum Socket CCD CCX Core Thread MCA Bank Status MCA Bank Address MCA Configuration MCA IPID MSR VAL MCA SYND MSR VAL MC MISC0 MSR VAL MC MISC1 MSR VAL MC DESTAT MSR VAL MC DEADDR MSR VAL MC SYND1 MSR VAL MC SYND2 MSR VAL


Timestamp H:M:S 13:44:32 D:M:Y 14:8:2025
--- END OF MCA FILE ---
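When the table in this file does contain rows, the BankNum column has to be translated back to a hardware unit. The sketch below builds that lookup from the bank legend printed in the file itself. The names `BANK_LINE` and `parse_bank_map` are hypothetical, and the legend format is assumed to match the sample above; the bank assignment can differ on other platforms.

```python
import re

# Matches legend lines such as "UMC - Unified Memory Controller -> Bank 21 to 22".
BANK_LINE = re.compile(
    r"^(?P<unit>.+?)[ \t]*->[ \t]*Bank[ \t]+(?P<lo>\d+)"
    r"(?:[ \t]+to[ \t]+(?P<hi>\d+))?[ \t]*$",
    re.MULTILINE,
)


def parse_bank_map(mca_text):
    """Return {bank_number: unit_description} from the legend in the MCA file."""
    banks = {}
    for match in BANK_LINE.finditer(mca_text):
        lo = int(match.group("lo"))
        hi = int(match.group("hi") or lo)   # single-bank lines have no "to N"
        for bank in range(lo, hi + 1):
            banks[bank] = match.group("unit").strip()
    return banks


# A few legend lines copied from the sample above.
legend = (
    "LS - Load-Store Unit -> Bank 0\n"
    "L3 - L3 Cache Unit -> Bank 7 to 14\n"
    "UMC - Unified Memory Controller -> Bank 21 to 22\n"
    "KPX_SERDES - High Speed Interface Unit(XGMI)-> Bank 31\n"
)
print(parse_bank_map(legend)[21])
```

For memory troubleshooting, the banks of interest in this sample are 21 and 22, which the legend maps to the Unified Memory Controller.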

Example of a PMIC failure in the SEL logs:

Whenever there is a runtime PMIC failure on a DIMM, a SEL entry like this one is generated, and the host is turned off.

2024-06-11 20:26:36 IST ◆Warning System Software event: Memory sensor, Memory Failed (PMIC Fault detected and isolated) was asserted, DIMM socket 1, Channel A, CPU 2. was asserted
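To locate the affected DIMM from such an entry, the socket, channel, and CPU can be extracted with a regular expression. This is a minimal sketch that assumes the SEL wording matches the single example above; the names `PMIC_SEL` and `parse_pmic_sel` are hypothetical, and other firmware versions can word the event differently.

```python
import re

PMIC_SEL = re.compile(
    r"PMIC Fault.*?DIMM socket (?P<socket>\d+), "
    r"Channel (?P<channel>\w+), CPU (?P<cpu>\d+)"
)


def parse_pmic_sel(line):
    """Return the DIMM location from a PMIC fault SEL line, or None if no match."""
    match = PMIC_SEL.search(line)
    if match is None:
        return None
    return {
        "socket": int(match["socket"]),
        "channel": match["channel"],
        "cpu": int(match["cpu"]),
    }


# SEL text copied from the example above (severity icon omitted).
line = (
    "2024-06-11 20:26:36 IST Warning System Software event: Memory sensor, "
    "Memory Failed (PMIC Fault detected and isolated) was asserted, "
    "DIMM socket 1, Channel A, CPU 2. was asserted"
)
print(parse_pmic_sel(line))
```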

The faulty DIMM is mapped out by the BIOS on the next host power-on, and a corresponding SEL entry is logged.

A fault is also raised.

Troubleshooting RAS Faults

Generally, you see these faults in UCS Manager as a RAS event.


UCSM CLI commands to reset all memory error counters:

UCS-A# scope server x/y

UCS-A /chassis/server # reset-all-memory-errors

UCS-A /chassis/server* # commit

To clear the SPD data:

Power off the server.

Then run the following commands from the UCSM CLI:

UCS-A# connect cimc x/y

UCS-A /chassis/server # reset-all-memory-errors

UCS-A /chassis/server* # commit

Notable Bugs

1. Cisco bug ID CSCwo62396

2. Cisco bug ID CSCwq33148

3. Cisco bug ID CSCwh73760

Revision History

Revision	Publish Date	Comments
1.0	04-Feb-2026	Initial Release

Contributed by Cisco Engineers

Gayatri Kumari
TAC Technical Leader
Anupam Asthana
TAC Technical Leader
Rahul Kumar
TAC Technical Leader
