This document describes how to troubleshoot memory module issues in the Cisco Unified Computing System (UCS) solution. UCS uses Dual In-line Memory Modules (DIMMs) as RAM modules.
Cisco recommends that you have knowledge of Cisco Unified Computing System (Cisco UCS).
This document is not restricted to specific software and hardware versions.
However, this document focuses on:
Cisco UCS B-Series Blade Servers
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.
This section covers the main parts of UCS memory issues.
Troubleshoot DIMMs via UCSM and CLI
Logs to check in tech support
Terms and Acronyms
DIMM - Dual In-line Memory Module
ECC - Error Correcting Code
LVDIMM - Low Voltage DIMM
MCA - Machine Check Architecture
MEMBIST - Memory Built-in Self Test
MRC - Memory Reference Code
POST - Power On Self Test
SPD - Serial Presence Detect
DDR - Double Data Rate
RAS - Reliability, Availability and Serviceability
Memory placement is probably one of the most notable physical aspects of the UCS solution. Typically, the server comes pre-populated with the requested amount of memory. However, when in doubt, refer to the hardware installation guide, which should be updated regularly as new hardware is introduced.
For memory population rules, refer to the B-Series technical specifications for the specific platform.
Uncorrectable error during POST: the DIMM is mapped out by the BIOS and the OS does not see the DIMM
Uncorrectable error at runtime: usually causes an OS reboot
Singlebit = Correctable
OS continues to see the DIMM
ECC (Error Correcting Code) Error
SPD (Serial Presence Detect) Error
Not supported DIMMs
Not supported DIMM population
Identity unestablishable error
Check and update the catalog
Correctable vs. Uncorrectable Errors
Whether a particular error is correctable or uncorrectable depends on the strength of the ECC code employed within the memory system. Dedicated hardware is able to fix correctable errors when they occur with no impact on program execution.
DIMMs with correctable errors are not disabled and remain available for the OS to use. Total Memory and Effective Memory are the same (taking memory mirroring into account). UCSM reports these correctable errors as an operability state of Degraded, while the DIMM overall remains Operable with correctable errors.
Uncorrectable errors generally cannot be fixed and may make it impossible for the application or operating system to continue execution. DIMMs with uncorrectable errors are disabled, and the OS does not see that memory. The UCSM operState changes to "Inoperable" in this case.
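To illustrate why the strength of the ECC code decides what is correctable, here is a minimal sketch of a single-error-correct, double-error-detect (SECDED) code. It uses a toy extended Hamming code over 4 data bits; this is an assumption chosen for brevity and is not the ECC scheme used by UCS memory controllers. The function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the total number of ones even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position        # XOR of set positions is 0 for a valid word
    parity_broken = sum(code) % 2 == 1
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: treat as a single-bit error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with intact overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flipping one bit of an encoded word yields "corrected" with the original data, while flipping two bits yields "uncorrectable": the error is detected but the data cannot be recovered, which mirrors the single-bit versus multi-bit distinction described above.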
Troubleshooting DIMMs via UCSM and CLI
To Check Errors from GUI
Check SEL log for DIMM related errors
A DIMM is installed and functional.
Check SEL for ECC errors
A correctable ECC DIMM error is detected during run time.
A DIMM is not installed or has corrupted SPD data.
Check SEL for Identity unestablishable errors
Check and update capability catalog
Check the SEL to see whether another DIMM failed in the same channel
A DIMM may be healthy but disabled because a configuration rule could not be maintained due to a failed DIMM in the same channel.
Failed to follow a memory configuration rule because of missing DIMMs.
An uncorrectable (UE) ECC error was detected.
Check SEL for ECC errors
DIMM status and operability changed because ECC errors were detected before the host rebooted.
Check SEL for ECC error during POST/MRC
An uncorrectable ECC error was detected during runtime. The DIMM remains available to the OS; the OS crashes and comes back up but can still use this DIMM. The error can occur again later, so the DIMM should be replaced in most situations.
In order to obtain statistics, navigate to Equipment > Chassis > Server > Inventory > Memory, then right-click the memory and select Show Navigator.
To Check Errors from CLI
These commands are useful when you troubleshoot errors from the CLI.
scope server x/y -> show memory detail
scope server x/y -> show memory-array detail
scope server x/y -> scope memory-array x -> show stats history memory-array-env-stats detail
From the memory array scope, you can also access an individual DIMM.
scope server X/Y > scope memory-array Z > scope dimm N
From there, you can obtain per-DIMM statistics or reset the error counters.
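The scope sequences above can be assembled for a specific blade as in the following minimal sketch. The helper names are hypothetical and the code only builds command strings from the commands listed in this document; it does not connect to UCS Manager.

```python
def memory_check_commands(chassis, blade, array):
    """Return the three memory check sequences as lists of CLI commands."""
    server = f"scope server {chassis}/{blade}"
    return [
        [server, "show memory detail"],
        [server, "show memory-array detail"],
        [
            server,
            f"scope memory-array {array}",
            "show stats history memory-array-env-stats detail",
        ],
    ]


def dimm_scope_commands(chassis, blade, array, dimm):
    """Return the scope commands, in order, that lead to a single DIMM."""
    return [
        f"scope server {chassis}/{blade}",
        f"scope memory-array {array}",
        f"scope dimm {dimm}",
    ]


if __name__ == "__main__":
    # Example values only: chassis 1, blade 3, memory array 1, DIMM 2.
    for command in dimm_scope_commands(1, 3, 1, 2):
        print(command)
```

Keeping each sequence as a list preserves the order in which the scopes must be entered, so the output can be pasted into a CLI session one line at a time.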
If you see a correctable error reported that matches the information above, the problem can be corrected by resetting the BMC instead of reseating or resetting the blade server. Use these Cisco UCS Manager CLI commands:
Resetting the BMC does not impact the OS running on the blade.
To reset memory-error counters on a Cisco UCS C-Series Rack Server operating in standalone mode, run the following script on the CLI:
Industry demands for greater capacity, greater bandwidth, and lower operating voltages lead to increased memory error rates. Traditionally, the industry has treated correctable errors in the same way as uncorrectable errors, requiring the module to be replaced immediately upon alert. Given extensive research showing that correctable errors are not correlated with uncorrectable errors, and that correctable errors do not degrade system performance, the Cisco UCS team recommends against immediate replacement of modules with correctable errors. Customers who experience a Degraded memory alert for correctable errors should reset the memory error and resume operation. Following this recommendation avoids unnecessary server disruption. Future enhancements to error management will help distinguish among various types of correctable errors and identify the appropriate actions, if any.
Cisco recommends running at least version 2.1(3c) or 2.2(1b), which include enhancements to UCS memory error management.
If the above troubleshooting did not help, raise a support request for assistance.
Log Files to Check in Tech Support
UCSM_X_TechSupport > sam_techsupportinfo
Provides information about DIMM and memory array.
Chassis/server tech support
CIMCX_TechSupport\tmp\CIMCX_TechSupport.txt -> Generic tech support information about server X.
CIMCX_TechSupport\obfl\obfl-log -> OBFL logs provide an ongoing log of the status and boot of server X.
CIMCX_TechSupport\var\log\sel -> SEL logs for server X.
Based on the platform/version, navigate to these files in the tech support bundle:
var/nuova/BIOS > RankMarginTest.txt
var/nuova/BIOS > MemoryHob.txt
var/nuova/BIOS > MrcOut_*.txt
These files provide information about memory as seen from BIOS level.
The information in these files can be cross-referenced against the DIMM state tables shown above.
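As a convenience, the BIOS-level files named above can be located in an extracted tech support bundle with a minimal sketch like the following. It assumes the bundle has already been unpacked to a local directory; the function name and the bundle_root parameter are hypothetical.

```python
from pathlib import Path

# File names and the MrcOut_* pattern are taken from the list above.
BIOS_LOG_PATTERNS = ("RankMarginTest.txt", "MemoryHob.txt", "MrcOut_*.txt")


def find_bios_memory_logs(bundle_root):
    """Return the BIOS memory log files found under var/nuova/BIOS."""
    bios_dir = Path(bundle_root) / "var" / "nuova" / "BIOS"
    found = []
    for pattern in BIOS_LOG_PATTERNS:
        # Sort each group so that MrcOut_* files come back in a stable order.
        found.extend(sorted(bios_dir.glob(pattern)))
    return found
```

Because the available files depend on the platform and version, an empty result for one pattern is expected on some servers rather than a sign of a broken bundle.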
18h - The DIMM status is marked as failed when the DIMM fails the MemBIST test. Replace it with a known good DIMM.
DIMM Status Description
00h Not Installed (No DIMM)
01h Installed (Working)
10h Failed (Training)
11h Failed (Clock training)
18h Failed (MemBIST)
20h Ignored (Disabled from debug console)
21h Ignored (SPD Error reported by BMC)
22h Ignored (Non-RDIMM)
23h Ignored (Non-ECC)
24h Ignored (Non-x4)
25h Ignored (Other PDIMM in same LDIMM failed)
26h Ignored (Other LDIMM in same channel failed)
27h Ignored (Other channel in LockStep or Mirror)
28h Ignored (Invalid memory population)
29h Ignored (Organization mismatch)
2Ah Ignored (Register vendor mismatch)
2Bh - 7Fh Reserved
80h Ignored (Workaround-Looping)
81h Ignored (Stuck I2C bus)
82h - FFh Reserved
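When many status codes need to be decoded, the table above can be turned into a lookup, as in this minimal sketch. The dictionary and function names are hypothetical; the codes and descriptions are copied from the table, and any code that the table does not cover is reported as not listed rather than guessed.

```python
DIMM_STATUS = {
    0x00: "Not Installed (No DIMM)",
    0x01: "Installed (Working)",
    0x10: "Failed (Training)",
    0x11: "Failed (Clock training)",
    0x18: "Failed (MemBIST)",
    0x20: "Ignored (Disabled from debug console)",
    0x21: "Ignored (SPD Error reported by BMC)",
    0x22: "Ignored (Non-RDIMM)",
    0x23: "Ignored (Non-ECC)",
    0x24: "Ignored (Non-x4)",
    0x25: "Ignored (Other PDIMM in same LDIMM failed)",
    0x26: "Ignored (Other LDIMM in same channel failed)",
    0x27: "Ignored (Other channel in LockStep or Mirror)",
    0x28: "Ignored (Invalid memory population)",
    0x29: "Ignored (Organization mismatch)",
    0x2A: "Ignored (Register vendor mismatch)",
    0x80: "Ignored (Workaround-Looping)",
    0x81: "Ignored (Stuck I2C bus)",
}

# Reserved ranges from the table: 2Bh - 7Fh and 82h - FFh.
RESERVED_RANGES = (range(0x2B, 0x80), range(0x82, 0x100))


def describe_dimm_status(code):
    """Translate a status code such as '18h' or 0x18 into its description."""
    if isinstance(code, str):
        value = int(code.strip().rstrip("hH"), 16)
    else:
        value = code
    if value in DIMM_STATUS:
        return DIMM_STATUS[value]
    if any(value in reserved for reserved in RESERVED_RANGES):
        return "Reserved"
    return "Not listed in the table"
```

For example, describe_dimm_status("18h") returns "Failed (MemBIST)", which corresponds to the replacement advice given for that code.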
In Cisco UCS Manager, the state of the Dual In-line Memory Module (DIMM) is based on SEL event records. When the BIOS encounters an uncorrectable memory error during memory test execution, the DIMM is marked as faulty. A faulty DIMM is considered a nonfunctional device.
If you enable DIMM blacklisting, Cisco UCS Manager monitors the memory test execution messages and blacklists any DIMMs that encounter memory errors in the DIMM SPD data. This allows the host to map out any DIMMs that encounter uncorrectable ECC errors.
DIMM Blacklisting was introduced as an optional global policy in UCSM 2.2(2).
Server firmware must be 2.2(1)+ for B-series blades and 2.2(3)+ for C-series rack servers to properly implement this feature.
In UCSM 2.2(4), DIMM Blacklisting is enabled by default.
Open the tech support file …/var/log/DimmBL.log
Open the file /var/nuova/BIOS/MrcOut.txt if it is available
Find the DIMM Status table. Look for “DIMM Status:”
DIMM Blacklisted = 1E
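The manual search for blacklisted DIMMs can be approximated with a minimal sketch like the one below. The layout of the DIMM Status table is not described in this document, so the code makes two loudly stated assumptions: the table rows follow the line that contains "DIMM Status:", and a blank line ends the table. The function name is hypothetical, and any match should be confirmed by reading the file.

```python
import re

# 1E is the blacklisted status value; an optional trailing "h" is tolerated.
BLACKLIST_CODE = re.compile(r"\b1Eh?\b", re.IGNORECASE)


def blacklisted_rows(mrc_text, marker="DIMM Status:"):
    """Return the rows after the marker line that contain the 1E status code."""
    rows = []
    in_table = False
    for line in mrc_text.splitlines():
        if not in_table:
            # Skip everything up to and including the marker line.
            in_table = marker in line
            continue
        if not line.strip():
            break  # assumption: a blank line ends the table
        if BLACKLIST_CODE.search(line):
            rows.append(line.strip())
    return rows
```

The function takes the file contents as a string, so it can be applied to MrcOut.txt after the file is read from the tech support bundle.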