Table Of Contents
Troubleshooting Cisco UCS B-Series Server Hardware Issues
This chapter describes how to troubleshoot hardware issues not specific to a given model of Cisco UCS B-Series server.
This chapter includes the following sections:
Diagnostics Button and LEDs
At blade start-up, the POST diagnostics test the CPUs, DIMMs, HDDs, and adapter cards. Any failure notifications are sent to Cisco UCS Manager. You can view these notifications in the system error log (SEL) or in the output of the show tech-support command. If errors are found, an amber diagnostic LED lights up next to the failed component. During run time, the blade BIOS, component drivers, and OS monitor for hardware faults. The amber diagnostic LED lights up for a component if an uncorrectable error occurs, or if correctable errors (such as a host ECC error) exceed the allowed threshold.
The LED states are saved. If you remove the blade from the chassis, the LED values persist for up to 10 minutes. Pressing the LED diagnostics button on the motherboard causes the LEDs that currently show a component fault to light up for up to 30 seconds. The LED fault values are reset when the blade is reinserted into the chassis and booted.
If any DIMM insertion errors are detected, they can cause blade discovery to fail, and errors are reported in the server POST information. You can view these errors in either the Cisco UCS Manager CLI or the Cisco UCS Manager GUI. Specific rules must be followed when populating DIMMs in a blade server; the rules depend on the blade server model. Refer to the documentation for a specific blade server for those rules.
The HDD status LEDs are on the front of the HDD. Faults on the CPU, DIMMs, or adapter cards also cause the server health LED to light up as a solid amber for minor error conditions or blinking amber for critical error conditions.
DIMM Memory Issues
A problem with the DIMM memory can cause a server to fail to boot or cause the server to run below its capabilities. If DIMM issues are suspected, consider the following:
•DIMMs tested, qualified, and sold by Cisco are the only DIMMs supported on your system. Third-party DIMMs are not supported, and if they are present, Cisco technical support will ask you to replace them with Cisco DIMMs before continuing to troubleshoot a problem.
•Check if the malfunctioning DIMM is supported on that model of server. Refer to the server's installation and service notes to verify whether you are using the correct combination of server, CPU and DIMMs.
•Check whether the malfunctioning DIMM is seated correctly in the slot. Remove and reseat the DIMMs.
•All Cisco servers have either a required or recommended order for installing DIMMs. Refer to the server's installation and service notes to verify that you are adding the DIMMs appropriately for a given server type.
•Most DIMMs are sold in matched pairs. They are intended to be added two at a time, paired with each other. Splitting the pairs can cause memory problems.
•If the replacement DIMMs have a maximum speed lower than those previously installed, all DIMMs in the server run at the slower speed or do not work at all. All of the DIMMs in a server should be of the same type.
•The number and size of DIMMs should be the same for all CPUs in a server. Mismatched DIMM configurations can degrade system performance.
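The per-CPU population guidance above can be expressed as a small validation helper. This is an illustrative sketch only; the inventory dictionary shape is an assumption for the example, not a Cisco UCS data structure.

```python
# Hypothetical sketch: check that every CPU has the same number and size
# of DIMMs, and that DIMMs are installed in matched pairs, per the
# guidance above. The inventory format (CPU id -> list of DIMM sizes in
# GB) is an assumption for illustration.

def check_dimm_symmetry(inventory):
    """Return a list of human-readable problems (empty if the config looks OK)."""
    problems = []
    baseline = None
    for cpu, dimms in sorted(inventory.items()):
        profile = (len(dimms), sorted(dimms))
        if baseline is None:
            baseline = (cpu, profile)
        elif profile != baseline[1]:
            problems.append(
                f"CPU {cpu} has {len(dimms)} DIMMs {sorted(dimms)}; "
                f"CPU {baseline[0]} has {baseline[1][0]} DIMMs {baseline[1][1]}"
            )
        # Matched pairs: an odd DIMM count usually means a split pair.
        if len(dimms) % 2 != 0:
            problems.append(f"CPU {cpu} has an odd number of DIMMs ({len(dimms)})")
    return problems
```

For example, two CPUs each populated with two 8 GB DIMMs pass cleanly, while a 16 GB DIMM on only one CPU is flagged as a mismatch.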
Rule out the following known issues before you contact Cisco TAC with any DIMM-related issues:
The Cisco UCS Manager GUI incorrectly reports bad DIMMs.
The Cisco UCS Manager GUI can incorrectly report "inoperable memory" when the Cisco UCS Manager CLI indicates no failures. This problem has occurred when running Cisco UCS Manager, Release 1.0(1e).
Upgrade to Cisco UCS Manager, Release 1.0(2d) or a later release. If that is not possible, to confirm that the memory is OK, enter the following CLI commands in order (where x is the chassis number, y is the server number, and z is the memory array ID):
•scope server x/y
show memory detail
•scope server x/y
show memory-array detail (provides the memory-array ID)
•scope server x/y
scope memory-array z
show stats history memory-array-env-stats detail
Correctable DIMM error reporting in Cisco UCS Manager does not go away until BMC is rebooted.
Correctable DIMM errors report a DIMM as "Degraded" in Cisco UCS Manager, but the DIMMs are still available to the OS on the blade.
To correct this problem, use the following commands to clear the SEL logs from the BMC and then reboot the BMC of the affected blade, or remove and reseat the blade server in the chassis:
SAM-FCS-A# scope server x/y
SAM-FCS-A /chassis/server # scope bmc
SAM-FCS-A /chassis/server/bmc # reset
SAM-FCS-A /chassis/server/bmc* # commit-buffer
Cisco UCS Manager incorrectly reports effective memory.
When running Cisco UCS Manager, Release 1.0(1e), Cisco UCS Manager can misread the SMBIOS table and cannot read it correctly until the server is rebooted.
Upgrade to Cisco UCS Manager, Release 1.2(0) or a later release.
Memory misreported in Cisco UCS Manager.
Memory arrays show more memory sockets than are physically present on the system board.
Upgrade to Cisco UCS Manager, Release 1.0(2j) or a later release.
A single DIMM can cause other DIMMs to get marked as bad. POST fails.
The server does not complete its boot cycle, and the FSM remains stuck at 54 percent.
Upgrade to Cisco UCS Manager, Release 1.2(1b) or a later release.
Types of DIMM Errors
The BIOS in the blade servers can detect and report the following two different types of DIMM errors:
Correctable DIMM Errors
DIMMs with correctable errors are not disabled and are available for the OS to use. The total memory and effective memory are the same (memory mirroring is taken into account). These correctable errors are reported in Cisco UCS Manager as degraded.
If you see a correctable error reported that matches the information above, the problem can be corrected by resetting the BMC instead of reseating or resetting the blade server. Use the following Cisco UCS Manager CLI commands:
UCS1-A# scope server x/y
UCS1-A /chassis/server # scope bmc
UCS1-A /chassis/server/bmc # reset
UCS1-A /chassis/server/bmc* # commit-buffer
Resetting the BMC does not impact the OS running on the blade.
Uncorrectable DIMM Errors
DIMMs with uncorrectable errors are disabled, and the OS on the server does not see that memory. If one or more DIMMs fail while the system is up, the OS could crash unexpectedly. Cisco UCS Manager shows the DIMMs as inoperable in the case of uncorrectable DIMM errors. These errors are not correctable via software. If the BIOS fails to pass POST due to one or more bad DIMMs, you can identify the bad DIMMs and remove them to allow the server to boot.
In situations where BIOS POST failures occur due to suspected memory issues and the particular DIMMs or DIMM slots are not identifiable, follow these steps to further isolate a particular failed part:
1. Remove all DIMMs from the system.
2. Install a single DIMM (preferably a tested good DIMM) or a DIMM pair in the first usable slot for the first processor (minimum requirement for POST success). For example, on a B200 blade it is DIMM slot A1. Refer to the published memory population rules to determine which slot to use.
3. Reattempt to boot the system.
If the BIOS POST is still unsuccessful, repeat steps 1 to 3 using a different DIMM for Step 2.
If the BIOS POST is successful and the blade can associate to a service profile, continue adding memory. Follow the population rules for that server model. If the system can successfully pass the BIOS POST in some memory configurations but not others, use that information to help isolate the source of the problem.
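The single-DIMM isolation loop in the steps above can be sketched as follows. Here `boots_ok` is a stand-in for physically installing one DIMM in the first usable slot (for example, A1 on a B200) and reattempting POST; it is a hypothetical helper for illustration, since the real test is a hardware boot.

```python
# Minimal sketch of the isolation procedure above: try each DIMM alone in
# the first slot and record which ones fail POST. `boots_ok` is a
# caller-supplied stand-in for the physical install-and-boot test.

def isolate_bad_dimms(dimms, boots_ok):
    """Return the DIMMs that fail POST when installed alone."""
    bad = []
    for dimm in dimms:
        # Steps 1-3 above: remove all DIMMs, install this one, reattempt boot.
        if not boots_ok(dimm):
            bad.append(dimm)
    return bad
```

In practice you stop as soon as one DIMM boots successfully and then add memory back following the population rules, using each pass/fail result to narrow down the failed part.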
Troubleshooting DIMM Errors
To use the Cisco UCS Manager GUI to determine the type of DIMM errors being experienced, in the navigation pane, expand the correct chassis and select the server. From the Inventory list, select the Memory tab. Memory errors on that server are displayed. You can also check memory environmental statistics under Statistics > Chart. Expand the relevant memory array.
To check memory information in the Cisco UCS Manager CLI, enter the following commands:
UCS-A# scope server chassis-id/server-id
UCS-A /chassis/server # show memory detail
UCS-A# scope server chassis-id/server-id
UCS-A /chassis/server # show memory-array detail
UCS-A# scope server chassis-id/server-id
UCS-A /chassis/server # scope memory-array x
UCS-A /chassis/server/memory-array # show stats history memory-array-env-stats detail
Confirm that the amount of memory seen from the OS point-of-view matches that listed for the server's associated service profile. Check if the OS sees all the memory or just part of the memory. If possible, run a memory diagnostic tool from the OS.
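On a Linux guest, the OS-visible total can be read from /proc/meminfo and compared with the service profile value. The sketch below assumes a Linux OS; the 2% tolerance for firmware-reserved memory is an assumption for illustration, not a Cisco-documented figure.

```python
# Sketch: compare OS-visible memory (Linux /proc/meminfo) with the amount
# listed in the server's service profile. The small tolerance allows for
# memory reserved by firmware; 2% is an illustrative assumption.

def memtotal_gib(meminfo_text):
    """Extract MemTotal from /proc/meminfo content and return it in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])   # /proc/meminfo reports the value in kB
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")

def memory_matches(meminfo_text, profile_gib, tolerance=0.02):
    """True if the OS sees at least the profile amount, within tolerance."""
    return memtotal_gib(meminfo_text) >= profile_gib * (1 - tolerance)
```

If the OS sees substantially less than the profile amount, suspect disabled DIMMs (uncorrectable errors) and check the Memory tab or show memory detail output as described above.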
In the first example in Figure 6-1 a DIMM is correctly inserted and latched. Unless there is a small bit of dust blocking one of the contacts, this DIMM should function correctly. The second example shows a DIMM that is mismatched with the key for its slot. That DIMM cannot be inserted in this orientation and must be rotated to fit into the slot. In the third example, the left side of the DIMM seems to be correctly seated and the latch is fully connected, but the right side is just barely touching the slot and the latch is not seated into the notch on the DIMM. In the fourth example, the left side is again fully inserted and seated, and the right side is partially inserted and incompletely latched.
Figure 6-1 Checking DIMM Insertion
Recommended Solutions for DIMM Issues
Table 6-1 lists issues and recommended solutions for troubleshooting DIMM issues. These suggested solutions include those solutions that are described in the "Known Issues" section and the "Troubleshooting DIMM Errors" section.
CPU Issues
Cisco UCS servers support one to two or one to four CPUs, depending on the model. A problem with a CPU can cause a server to fail to boot, run very slowly, or cause serious data loss or corruption. If CPU issues are suspected, consider the following:
•All CPUs in a server should be the same type, running at the same speed and populated with the same number and size of DIMMs.
•If the CPU was recently replaced or upgraded, make sure the new CPU is compatible with the server and that a BIOS supporting the CPU was installed. Refer to the server's documentation for a list of supported Cisco models and product IDs. Use only those CPUs supplied by Cisco. The BIOS version information can be found in the release notes for a software release.
•When replacing a CPU, make sure to correctly thermally bond the CPU and the heat sink. An overheating CPU produces fault messages visible in Cisco UCS Manager. The CPU can also lower its performance in order to prevent damage to itself.
•If CPU overheating is suspected, check the baffles and air flow for all servers in a chassis. Air flow problems in adjacent servers can also cause improper CPU cooling in a server.
•The CPU speed and memory speed should match. If they do not match, the server runs at the slower of the two speeds.
•In the event of a failed CPU, the remaining active CPU or CPUs do not have access to memory assigned to the failed CPU.
Troubleshooting CPU Issues
Using the Cisco UCS Manager GUI, determine the type of CPU errors being experienced. In the navigation pane, expand the correct chassis and select the server. In the Inventory window, select the CPU tab. CPU errors on that server are displayed.
Using the Cisco UCS Manager CLI, check CPU information by using the following commands:
UCS-A# scope server chassis-id/server-id
UCS-A /chassis/server # show cpu
UCS-A# scope server chassis-id/server-id
UCS-A /chassis/server # show bios
UCS-A# scope server chassis-id/server-id
UCS-A /chassis/server # show cimc
Recommended Solutions for CPU Issues
Table 6-2 contains a list of guidelines and recommended solutions that can assist you in troubleshooting CPU issues.
CPU CATERR Details
The system event log (SEL) contains events related to the processor's catastrophic error (CATERR) sensor. A CATERR message indicates a failure, while a CATERR_N message indicates that the sensor is not in a failure state.
A CATERR_N message indicates an assertion of the no-fault bit, which is turned on when a predictive failure is deasserted; that is, it indicates that there is no failure.
When the sensor is initialized, the BMC sends out a SEL event with the initial state of the sensor in order to stay in synchronization with the server manager software, which monitors when the sensors are active and the state of the sensors. In most cases, the initial reading of the sensor is that a predictive failure has been deasserted, resulting in a CATERR_N message being sent.
Transitions from a nonfault state to a fault state turn off a no-fault bit and turn on a fault bit. In this case, you can expect two events to occur:
•No-fault (predictive failure deasserted) bit has been deasserted
•Fault (predictive failure asserted) bit has been asserted
Together, these events indicate that the no-fault bit has been turned OFF (deasserted) and the fault bit (predictive failure asserted) has been turned ON.
Transitions from a fault state to a nonfault state often are redundant and not generally logged, as they indicate a condition that is not an error or a false positive case. These messages state that a reading was received from the sensor and the no-failure bit in the sensor is turned ON. The initial sensor state readings are logged for synchronization reasons with the management software.
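The pair of SEL events generated on a nonfault-to-fault transition can be modeled as follows. This is an illustrative sketch of the logic described above; the event strings mirror the text and are not the exact SEL record format.

```python
# Sketch of CATERR sensor event generation as described above: a
# nonfault -> fault transition produces two events (no-fault bit OFF,
# fault bit ON), while fault -> nonfault transitions are generally not
# logged. Event wording is illustrative, not the literal SEL format.

def caterr_transition_events(old_fault, new_fault):
    """Return the SEL-style events for a CATERR sensor state change."""
    events = []
    if not old_fault and new_fault:
        events.append("no-fault bit deasserted (predictive failure deasserted -> OFF)")
        events.append("fault bit asserted (predictive failure asserted -> ON)")
    # fault -> nonfault: redundant, generally not logged
    return events
```

So when reading the SEL, a lone "predictive failure deasserted" event at sensor initialization is normal synchronization traffic, while the two-event pattern above signals a real fault transition.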
Disk Drive and RAID Issues
A problem with the disk drive or RAID controller can cause a server to fail to boot, or cause serious data loss or corruption. If drive issues are suspected, consider the following:
•Use OS tools regularly to detect and correct drive problems (for example, bad sectors). Cisco UCS Manager cannot correct drive problems as effectively as the server's OS.
•Each disk drive has an activity LED that indicates an outstanding I/O operation to the drive and a health LED that turns solid amber if a drive fault is detected. Drive faults can be detected in the BIOS POST. SEL messages can contain important information to help you find these problems.
•Disk drives are the only major component that can be removed from the server without removing the blade from the system chassis. Before removing a disk drive, always decommission the server. If you remove one or more disk drives from an active, commissioned server for any length of time, you could create problems with the RAID controller, service profile, and the drives.
•Disk drives are available in several sizes. If the disk drive performance is slow because the drive is full or there are issues with the drive that the OS cannot solve, you might need to back up the drive contents and install a larger or new hard drive.
How to Determine Which RAID Controller Is in Your Server
You can order or configure the B-Series servers with the following RAID controller options:
•The Cisco UCS B200 and B250 servers have the Intel ICH10R onboard SATA controller on the motherboard. The controller supports RAID 0 and 1 for up to two SATA drives. The controller must be enabled in Cisco UCS Manager before configuring RAID. All RAID options can be configured from Cisco UCS Manager.
•The Cisco UCS B440 servers have the LSI MegaRAID controller card (the model varies by server). Depending on the license key installed, these cards provide RAID 0, 1, 5, 6, 10, 50, and 60 support for up to four SAS or SATA drives.
If there is no record of which option is used in the server, disable the quiet boot feature and read the messages that appear during system boot.
•Information about the models of installed RAID controllers appears as part of the verbose boot feature. You are prompted to press Ctrl-H to launch configuration utilities for those cards. See the "How to Disable Quiet Boot" section.
•If no card models are displayed after you disable the quiet boot feature, but there is a RAID configuration, the server uses the onboard ICH10R controller. You are prompted to press Ctrl-M to launch the configuration utilities for this controller (see Figure 6-2). See the "How To Launch Option ROM-Based Controller Utilities" section.
Figure 6-2 Startup Screen for the ICH10R Controller Configuration Utilities
How to Disable Quiet Boot
When the quiet boot feature is disabled, the controller information and the prompts for the option ROM-based LSI utilities are displayed during bootup. To disable this feature, follow these steps:
Step 1 Boot the server and watch for the F2 prompt during the boot process.
Step 2 To enter the BIOS Setup Utility, press F2 when prompted.
Step 3 On the Main page of the BIOS Setup Utility, set Quiet Boot to disabled. This action allows nondefault messages, prompts, and POST messages to display during bootup instead of the Cisco logo screen.
Step 4 Press F10 to save the changes and exit the utility.
How To Launch Option ROM-Based Controller Utilities
To alter the RAID configurations on your hard drives, use the host-based utilities that were installed on top of the host OS. You can also use the LSI option ROM-based utilities that are installed on the server.
When you boot the server and quiet boot is disabled (see the "How to Disable Quiet Boot" section), information about the controller appears along with the prompts for the key combination to launch the LSI option ROM-based utilities for your controller.
During the verbose boot process, watch for the prompt for the controller:
•The prompt for the LSI controller card utility is Ctrl-H.
•The prompt for the onboard Intel ICH10R controller utility is Ctrl-M.
For More Information
The LSI utilities have help documentation for more information. For basic information on RAID and how to use the LSI utilities, see the following documentation:
Moving a RAID Cluster
This section describes how to set a server to recognize a RAID array created on another server. This procedure is useful when upgrading from the M1 version of a server to the M2 server. It can also be used any time you need to move data on a RAID array between servers. An array that was created on another server and not recognized on its current server is a foreign array. A native array is an active array and is recognized on the server.
For UCS Manager Release 1.4(1), follow these steps to move a RAID cluster:
Step 1 Put both the start and destination servers for the RAID cluster in the associated state.
Step 2 Shut down both servers. The service profiles for both servers must have an identical local disk configuration policy.
Note When using this procedure during an M1 to M2 upgrade or a direct replacement within a slot, the destination server is not associated or does not have a disk policy. When the destination server is inserted into the slot where the start server was located, the destination server inherits the same policies as the start server.
Step 3 After the servers power off, physically move the drives in the array to the destination server. If you are changing servers but keeping the drives in the same slot, insert the new server into the slot of the original server.
Step 4 Connect the KVM dongle. Connect a monitor, keyboard, and mouse to the destination server.
Step 5 Boot the destination server, using the power switch on the front of the server. If necessary, disable the quiet boot feature and boot again. (See the "How to Disable Quiet Boot" section.)
Step 6 Wait for the LSI Configuration Utility banner.
Step 7 To enter the LSI Configuration Utility, press Ctrl-C.
Step 8 From the SAS Adapter List, choose the SAS Adapter used in the server. If needed, refer to the "How to Determine Which RAID Controller Is in Your Server" section.
Step 9 Choose RAID Properties. The View Array screen appears.
Step 10 Choose Manage Array. The Manage Array screen appears.
Step 11 Choose Activate Array. When the activation is complete, the RAID status changes to Optimal.
Step 12 On the Manage Array screen, choose the Synchronize Array option.
Step 13 Wait for the mirror synchronization to complete, monitoring the progress bar that appears. The time to complete the synchronization can vary depending on the size of the disks in the RAID array.
Step 14 When the mirror synchronization is complete, press the ESC key several times to go back through each of the screens (one at a time) and then exit the LSI Configuration Utility. Choose the reboot option to implement the changes.
For UCS Manager Release 1.4(2) and later versions, follow these steps to move a RAID cluster:
Step 1 Verify that the service profiles for both the source and destination servers have an identical local disk configuration policy and can boot successfully.
Step 2 Decommission both the source and destination servers from UCS Manager.
Step 3 Wait for the servers to shut down (Decommission Server will prompt the user to shut the server down).
Note When you use this procedure during an M1 to M2 upgrade or a direct replacement within a slot, the destination server is not associated or does not have a disk policy. When the destination server is inserted into the slot where the start server was located, the destination server inherits the same policies as the start server.
Step 4 After the servers power off, physically move the drives in the array to the destination server. If you are changing servers but keeping the drives in the same slot, insert the new server into the slot of the original server.
Step 5 Power on the servers by pressing the front power button of each of the servers.
Step 6 Choose Reacknowledge Slot for each of the slots (Source and Destination). If UCS Manager prompts you to "Resolve Slot Issue", then choose the here link in the Resolve Slot screen and resolve the slot issue before server discovery begins.
Step 7 Wait for server discovery and association to complete for each server.
If each of the preceding steps runs without issues, the servers will boot up with the OS that was installed on the respective RAID volumes prior to the RAID Cluster Migration.
Adapter Issues
A problem with the Ethernet or FCoE adapter can cause a server to fail to connect to the network and make it unreachable from Cisco UCS Manager. All adapters are unique Cisco designs, and non-Cisco adapters are not supported. If adapter issues are suspected, consider the following:
•Check if the Cisco adapter is genuine.
•Check if the adapter type is supported in the software release you are using. The Internal Dependencies table in the Cisco UCS Manager Release Notes provides minimum and recommended software versions for all adapters.
•Check if the appropriate firmware for the adapter has been loaded on the server. In Release versions 1.0(1) through 1.3(1), the Cisco UCS Manager version and the adapter firmware version must match. To update the Cisco UCS software and the firmware, refer to the appropriate Upgrading Cisco UCS document for your installation.
•If the software version update was incomplete, and the firmware version no longer matches the Cisco UCS Manager version, try updating the adapter firmware as described in the "Managing Firmware" chapter of the Cisco UCS Manager CLI Configuration Guide.
•If you are deploying two Cisco UCS M81KR Virtual Interface Cards on the Cisco UCS B250 Extended Memory Blade Server running ESX 4.0, you must upgrade to the patch 5 (ESX4.0u1p5) or later release of ESX 4.0.
•If you are migrating from one adapter type to another, make sure that the drivers for the new adapter type are available. Update the service profile to match the new adapter type. Configure appropriate services to that adapter type.
•If you are using dual adapters, note that there are certain restrictions on the supported combinations. The following combinations are supported:
Server: Cisco UCS B250
Dual card, mixed type: M71KR-Q or -E + M81KR; M72KR-Q or -E + M81KR
Server: Cisco UCS B440
Dual card, same type: all except 82598KR-CI
Dual card, mixed type: M72KR-Q or -E + M81KR
There are a number of known issues and open bugs with adapters. These problems are called out in the Release Notes documentation. Refer to the document for your software release. The following is a persistent known condition:
(CSCtd32884 and CSC71310) The type of adapter in a server affects the maximum transmission unit (MTU) supported. Setting a network MTU above the adapter's maximum can cause packets to be dropped for the following adapters:
•The Cisco UCS CNA M71KR adapter supports an MTU of 9216.
•The Cisco UCS 82598KR-CI adapter supports an MTU of 14000.
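A quick sanity check of a proposed MTU against the limits quoted above can be sketched as below. Only the two adapters from this known issue are listed; treat the table as illustrative, not as a complete support matrix.

```python
# Sketch: verify that a proposed network MTU does not exceed the
# adapter's supported maximum, using the two limits quoted above.
# The table is illustrative and covers only the adapters from this
# known issue.

ADAPTER_MAX_MTU = {
    "Cisco UCS CNA M71KR": 9216,
    "Cisco UCS 82598KR-CI": 14000,
}

def mtu_ok(adapter, mtu):
    """True if the MTU is within the adapter's supported maximum."""
    maximum = ADAPTER_MAX_MTU.get(adapter)
    if maximum is None:
        raise KeyError(f"unknown adapter: {adapter}")
    return mtu <= maximum
```

For example, a network MTU of 9216 is fine on the M71KR, but anything larger risks dropped packets on that adapter.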
Troubleshooting Adapter Errors
The link LED on the front of the server is off if the adapter cannot establish even one network link. It is green if one or more of the links are active. Any adapter errors are reported in the LEDs on the motherboard. See the "Diagnostics Button and LEDs" section.
To use the Cisco UCS Manager GUI to determine the type of adapter errors being experienced, in the navigation pane, expand the chassis and choose the server. In the Inventory window, choose the Interface Cards tab. Any adapter errors on that server are displayed on the screen.
You can check adapter state information in the CLI by using the following commands:
UCS-A# scope server chassis-id/server-id
UCS-A /chassis/server # show adapter [detail]
Table 6-3 contains a list of guidelines and recommended solutions that can assist you in troubleshooting adapter issues. These suggested solutions include those described in the "Known Issues" section and the "Troubleshooting Adapter Errors" section.
Power Issues
A problem with a server's onboard power system can cause a server to shut down without warning, fail to power on, or fail the discovery process.
The following are known power issues:
FET Failure in a Cisco UCS B440 Server
The failure of a field effect transistor (FET) in a Cisco UCS B440 server's power section can cause the server to shut down, fail to power on, or fail the discovery process. When the server has detected the failure, you are unable to power on the server, even using the front panel power button.
To determine whether a FET failure has occurred, perform the following steps:
Step 1 Using the procedure in the "Faults" section on page 1-2, check the reported faults for Fault Code F0806, "Compute Board Power Fail." This fault will cause the server's overall status to be "Inoperable."
Step 2 Check the system event log (SEL) for a power system fault of the type in this example:
58f | 06/28/2011 22:00:19 | BMC | Power supply POWER_SYS_FLT #0xdb | Predictive Failure deasserted | Asserted
Step 3 From the CLI of the fabric interconnect, access the CIMC of the failed server and display the fault sensors as shown in this example:
Fabric-Interconnect-A# connect cimc chassis/server
Trying 127.5.1.1...
Connected to 127.5.1.1.
Escape character is '^]'.
CIMC Debug Firmware Utility Shell
[ help ]# sensors fault
HDD0_INFO | 0x0 | discrete | 0x2181| na | na | na | na | na | na
HDD1_INFO | 0x0 | discrete | 0x2181| na | na | na | na | na | na
.
. [lines removed for readability]
.
LED_RTC_BATT_FLT | 0x0 | discrete | 0x2180| na | na | na | na | na | na
POWER_SYS_FLT | 0x0 | discrete | 0x0280| na | na | na | na | na | na
[ sensors fault]#
For the POWER_SYS_FLT sensor, a reading of 0x0280 confirms the FET failure. In normal operation, this sensor has a reading of 0x0180.
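Picking the POWER_SYS_FLT reading out of the sensors fault output and comparing it with the two values quoted above can be automated with a short parser. The column layout follows the example output; the helper itself is an illustrative sketch, not a Cisco tool.

```python
# Sketch: extract the POWER_SYS_FLT reading from `sensors fault` output
# and compare it with the values quoted above (0x0280 = FET failure,
# 0x0180 = normal operation). Column positions follow the example.

FET_FAILURE = 0x0280
NORMAL = 0x0180

def power_sys_flt_reading(sensors_output):
    """Return the POWER_SYS_FLT reading as an int, or None if absent."""
    for line in sensors_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 4 and fields[0] == "POWER_SYS_FLT":
            return int(fields[3], 16)   # 4th column holds the reading
    return None

def fet_failed(sensors_output):
    """True if the sensor output shows the FET-failure reading."""
    return power_sys_flt_reading(sensors_output) == FET_FAILURE
```

Run against the example transcript above, this reports the FET failure; against a healthy server it returns 0x0180 and fet_failed is False.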
If you determine that a FET failure has occurred, perform the following steps:
Step 1 In the Cisco UCS Manager CLI, collect the output of the following commands:
•show tech-support ucsm detail
•show tech-support chassis x all detail
Step 2 Contact the Cisco Technical Assistance Center (TAC) to confirm the failure.
Step 3 Install a replacement server using the Recover Server action in Cisco UCS Manager.
Gathering Information Before Calling Support
If you cannot isolate the issue to a particular component, consider the following questions. They can be helpful when contacting the Cisco Technical Assistance Center (TAC).
1. Was the blade working before the problem occurred? Did the problem occur while the blade was running with a service profile associated?
2. Was this a newly inserted blade?
3. Was this blade assembled onsite or did it arrive assembled from Cisco?
4. Has the memory been reseated?
5. Was the blade powered down or moved from one slot to another slot?
6. Have there been any recent upgrades of Cisco UCS Manager? If so, was the BIOS also upgraded?
When contacting Cisco TAC for any Cisco UCS issues, it is important to capture the tech-support output from Cisco UCS Manager and the chassis in question. For more information, see the "Creating a Technical Support File" section on page 3-2.
Check for known issues with individual server models in the Cisco UCS Blade Server Installation and Service Notes.