Troubleshooting Server Hardware or Software Issues

This chapter includes the following sections:

Troubleshooting Operating System and Drivers Installation

Table 1 Operating System and Driver Issues

Issue

Recommended Solution

  • Basic server configuration steps

  • Steps for CIMC or BMC configuration

  • BIOS settings information

  • BIOS upgrade steps

  • CIMC or BMC firmware upgrade steps

The Windows 2003 R2 64-bit install is not starting because the system is not seeing the install CD on the C200 servers.

Slow performance (slow mouse and keyboard) on C200 or C210 servers when running Windows 2008 R2.

There is a known issue with the Intel 82576 driver included with Windows 2008 R2. Update to the latest Intel driver for this chipset at the following link: https://downloadcenter.intel.com/product/32261/Intel-82576-Gigabit-Ethernet-Controller

Installation of the Windows 2008 R2 OS failed with error message:

The computer restarted unexpectedly or encountered an unexpected error. Windows installation cannot proceed.

On the C200 server, the Windows 2008 R2 installation fails when the Intel Quad Port NIC is installed. Remove the NIC, run the installation, and reinstall the NIC after the installation is complete. Also, see this forum message: https://supportforums.cisco.com/message/3179297

VMware ESX/ESXi on C200, C210, or C250 failed.

  • The onboard NIC might be disabled or not recognized. Check the BIOS to ensure the onboard NICs are enabled.

  • It is possible that the device ID of the Intel NIC is wrong. Use the Host Upgrade Utility to update the LOM firmware.

  • Download the latest ISO of the SCU from Cisco.com for the specific server.

Running Windows 2008 R2, Task Manager shows multiple spikes.

Go to this URL and update the drivers to the latest version: http://www.cisco.com/en/US/docs/unified_computing/ucs/overview/guide/UCS_rack_roadmap.html

The ESXi installation does not recognize the LOM or NIC Ethernet ports.

  • Update when the LOM is used for ESXi.

  • Update when add-on adapters are used for ESXi.

The ESXi update does not recognize the NICs.

Update the LOM firmware using the Cisco Host Upgrade Utility. Download the 1.2.x version from this link: http://www.cisco.com/en/US/docs/unified_computing/ucs/c/sw/lomug/install/LOMUG.html

Download the 1.3.x version from this link: http://www.cisco.com/en/US/docs/unified_computing/ucs/c/sw/lomug/1.3.x/install/HUUUG.html

Unable to install older OS.

Different C-Series servers support different OS versions. Use the following link to see the matrix of supported operating systems: http://www.cisco.com/en/US/products/ps10477/prod_technical_reference_list.html

Cannot upgrade BIOS on the system with no OS.

Use the BIOS upgrade instructions in the hardware installation and service guide for your server. Go to: http://www.cisco.com/en/US/products/ps10493/prod_installation_guides_list.html

With ESXi installed on the drives, unable to boot from the partition.

Review the documentation at the following link: http://www.VMware.com

CIMC defaults to DHCP and will not retain the IP address.

Review the documentation at the following link: http://www.cisco.com/en/US/products/ps10739/products_installation_and_configuration_guides_list.html

System becomes unresponsive during BIOS POST.

If the system hangs at the LSI prompt during boot, waiting for user input, follow the instructions on the screen. Possible reasons include:

  • The battery hardware is missing or disabled. Enter D to disable this warning for the next boot. This bypasses the warning, and the system no longer hangs for this reason.

  • The message could be about importing a foreign configuration. Press F to import the foreign configuration. Alternatively, enter the config utility (press Ctrl+C), open the WebBIOS, which is the LSI config utility, then preview the foreign configuration and decide whether it should be imported.

Drives are not detected or the system hangs when the adapter ROM for the ICH10R SATA Software RAID scans the SATA ports.

  • ICH10R is SATA controller software embedded in the motherboard on the C200 and C210 servers only. There is no adapter. It might not see a SAS drive because it does not support SAS drives. Only SATA drives are supported.

  • The cable from the HDD backplane must be connected to the motherboard to use ICH10R.

The drives are not detected or the system hangs when the adapter ROM for the LSI RAID Controller scans the SAS/SATA Drives.

  • ICH10R is SATA controller software embedded in the motherboard on the C200 and C210 servers only. There is no adapter. It might not see a SAS drive because it does not support SAS drives. Only SATA drives are supported.

  • The onboard ICH10R controller is not compatible with VMware software. Use an add-on controller card in this case.

  • The cable from the HDD backplane must be connected to the motherboard to use ICH10R.

  • Make sure all the drives are plugged in properly (reseat the drives if needed).

The Operating System does not boot.

  • Make sure that the correct virtual drive on which the OS is installed is selected in the LSI WebBIOS. Do this by entering the LSI WebBIOS using Ctrl+H during system boot up. In the LSI WebBIOS menu, navigate to the virtual drive menu and get a list of the virtual drives. Choose the virtual drive as the boot drive by selecting it.

  • Make sure that you have properly selected the boot device in the system BIOS setup by pressing F2. Navigate to the boot devices screen and make sure the LSI RAID controller appears before all of the other bootable devices attached to the server. We recommend that this be the third bootable device in the list.

Troubleshooting Disk Drive and RAID Issues

Disk Drive/RAID Configuration Issues

Table 2 RAID Configuration Issues

Issue

Recommended Solution

Windows does not detect hard drives.

LSI drivers may not be bundled with the Windows OS version being installed. These drivers must be installed during the installation process. During the install process, if the hard drives fail to be detected, use the load driver option to point the drives to the correct drivers for the LSI controller in the system. The drivers can be loaded using a USB drive. When loaded, the hard drives are displayed and the hard drive for the OS can be selected.

Installing Windows 2008 64-bit and RAID controller had issues.

LSI drivers are not bundled in Windows 2008 64-bit. These must be installed during the installation process. During the install process, if the hard drives fail to be detected, use the load driver option to point the drives to the correct drivers for the LSI controller in the system. The drivers can be loaded using a USB drive. When loaded, the hard drives are displayed and the hard drive for the OS can be selected.

Unable to install ESX on server with only the onboard controller.

The LSI hardware RAID controller is required.

  • Unable to see the LSI RAID controller in the BOOT environment.

  • Unable to access the onboard RAID controller.

  • During the BIOS POST, the LSI option ROM should be displayed. The LSI RAID controller can be configured using Ctrl+H to create virtual drives. When configured, the BIOS should list the RAID controller in the boot device menu. To verify, enter the BIOS POST menu by pressing F2. Confirm that the LSI RAID controller is listed in the boot device menu.

  • If, after completing the above process, the LSI RAID card is not detected, power off the system and reseat the LSI card. Make sure that the cables are connected to the backplane and then follow the above procedure to verify that the LSI card is seen in the BIOS Setup menu.

  • If reseating the card does not solve the problem, replace the LSI controller (the card could be bad) and verify if this card is seen during BIOS POST.

VMware does not show the local drive during installation.

VMware supports a maximum partition size of 2 TB. Resize the partition so that it does not exceed the 2 TB limit.
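The size check can be sketched in a few lines of Python. This is a minimal illustration only: it assumes that the 2 TB limit means 2 × 1024^4 bytes, and the function name and sample sizes are hypothetical. Consult the VMware documentation for the exact limit of your release.

```python
# Assumption: "2 TB" is interpreted as 2 * 1024**4 bytes. The exact limit
# may differ slightly between VMware releases.
TWO_TB_BYTES = 2 * 1024**4


def exceeds_partition_limit(size_bytes, limit_bytes=TWO_TB_BYTES):
    """Return True when a partition of size_bytes is larger than the limit."""
    return size_bytes > limit_bytes


if __name__ == "__main__":
    # Hypothetical virtual drive sizes, in bytes.
    samples = [("1.5 TB", 3 * 1024**4 // 2), ("3 TB", 3 * 1024**4)]
    for label, size in samples:
        verdict = "resize needed" if exceeds_partition_limit(size) else "ok"
        print(f"{label}: {verdict}")
```

A virtual drive that is flagged here would need to be resized, or split into smaller virtual drives, before the installer can use it.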

The RAID controller card is not working.

Verify that the installed card is supported for this server. If it is supported, follow the steps listed in Unable to see the LSI RAID controller in the BOOT environment (above).

Problem with setup of the RAID6 virtual device and installation of Windows 2003 X64.

  • When the system boots up and the LSI Option ROM screen displays, press Ctrl+H to enter the LSI option ROM screen.

  • Choose the Configuration Wizard and follow the instructions to configure the RAID 6 array group. (RAID 6 needs a minimum of three drives.) Once RAID 6 is created, initialize the virtual drives (full initialization) on which the OS is to be installed.

  • After the virtual drive is initialized, the virtual drive on which the OS is to be installed must be set as the boot drive.

  • Go to the virtual drive menu and choose the virtual drive number and click Set Virtual drive. This is very important because Windows will report an error message during install if this is not set.

  • When the Windows 2003 installation starts, follow the instructions on the screen to install the LSI controller drivers on Windows 2003. The LSI drivers must be copied to a floppy disk and the floppy drive connected to the server. During the install, press F6 to install the drivers. This step is essential for Windows LSI driver installation because it ensures that the LSI virtual drive is seen during the install process.

Unable to see HDD.

  • If not able to see the LSI controller during system boot up, follow the instructions in Unable to see LSI controller (above) to ensure the LSI controller is seen during BIOS bootup.

  • If the LSI controller does not see the hard drives, ensure they are properly plugged in and making contact and that the green LED is visible. If still not seen, insert a different HDD (in case of a bad HDD).

  • Note that the BIOS boot device menu does not list the physical drives that are plugged in. It displays only the RAID controller, which points to the virtual drive (set as the boot virtual disk). Make sure to configure the virtual drives using the LSI WebBIOS to ensure the RAID controller is seen in the boot device menu of the BIOS setup.

Problem setting up the RAID configuration.

  • During system boot, enter the WebBIOS by pressing Ctrl+H. Use the Configuration Wizard and follow the screen instructions to create the RAID configurations.

  • Check the BIOS and CIMC versions and upgrade to the latest versions. Get the upgrade software at the following link: http://www.cisco.com/cisco/software/navigator.html

Configuring Multiple (Redundant) RAID controllers

Cisco does not support multiple (redundant) RAID controllers that automatically fail over if one RAID controller fails. However, it is possible to recover from a RAID controller failure by installing a new RAID card of the same type and model.

Configuration data about a RAID array is stored inside the disks being managed by the controller. A new controller can import those configurations from disks to restore proper RAID operation. Each disk has its own copy of the metadata. If there are 16 disks in an array, each disk can contain its own copy of the metadata.

Detailed steps are available in the LSI document 80-00156-01_RevH_SAS_SW_UG.pdf.

This document is available from the Documents & Downloads section of the LSI support site at this URL: http://www.lsi.com

When configuring the RAID card for the first time, the step “Import foreign config” in the file provides details on how to import the RAID configuration from previously configured disks.

RHEL 5.4 64-bit Recommended Installation with RAID (C200)

To ensure that the RAID drives are properly recognized, complete the following steps:

Procedure
    Step 1   Follow the normal installation process of RHEL 5.4 i386 from the ISO or DVD.
    Step 2   At the prompt, enter the command:

    boot: linux dd noprobe=ata1 noprobe=ata2 noprobe=ata3 noprobe=ata4

    Step 3   Mount the megaraid driver and map it from the virtual media. The .img file is emulated as a floppy. The driver is also on the driver CD available on CCO, at the path Drivers\Linux\Storage\Intel\ICH10R\RHEL\RHEL5.4 from the root.
    Step 4   At the “before installation starts” step, the system will ask whether you want to add any additional drivers.
    Step 5   Provide the drivers (usually the mapped file will be /dev/sdb, because it is a floppy).
    Step 6   Continue the installation.
    Step 7   When the system looks for storage, it should list the RAID as “LSI MegaSR”.

    DIMM Memory Issues

    Types of DIMM Errors

    Cisco UCS Servers can detect and report correctable and uncorrectable DIMM errors.

    Correctable DIMM Errors

    DIMMs with correctable errors are not disabled and are available for the OS to use. The total memory and effective memory are the same (memory mirroring is taken into account). These correctable errors are reported in Cisco IMC as degraded once they exceed pre-determined error thresholds.

    Uncorrectable DIMM Errors

    Uncorrectable errors generally cannot be fixed, and may make it impossible for the application or operating system to continue execution. A DIMM with uncorrectable errors is disabled if DIMM blacklisting is enabled or if the DIMM fails during BIOS POST upon reboot, and the OS does not see that memory. In this case, the Cisco IMC operState for this DIMM is inoperable.

    A problem with the DIMM memory can cause a server to fail to boot or cause the server to run below its capabilities. If DIMM issues are suspected, consider the following:

    • DIMMs tested, qualified, and sold by Cisco are the only DIMMs supported on your system. Third-party DIMMs are not supported, and if they are present, Cisco technical support will ask you to replace them with Cisco DIMMs before continuing to troubleshoot a problem.

    • Check if the malfunctioning DIMM is supported on that model of server. Refer to the server’s installation guide and technical specifications to verify whether you are using the correct combination of server, CPU and DIMMs.

    • Check if the malfunctioning DIMM is seated correctly in the slot. Remove and reseat the DIMMs.

    • All Cisco servers have either a required or recommended order for installing DIMMs. Refer to the server’s installation guide and technical specifications to verify that you are adding the DIMMs appropriately for a given server type.

    • If the replacement DIMMs have a maximum speed lower than those previously installed, all DIMMs in a server run at the slower speed or do not work at all. All of the DIMMs in a server should be of the same type for optimal performance.

    • The number and size of DIMMs should be the same for all CPUs in a server. Mismatching DIMM configurations can degrade system performance.
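The last guideline can be checked mechanically once the DIMM inventory is known. The following Python sketch is a hypothetical helper, not a Cisco utility. It assumes that you have already collected the DIMM sizes (in MB) for each CPU, for example from the Cisco IMC inventory, and it compares every CPU against the first one.

```python
from collections import Counter


def dimm_layout_mismatches(dimms_by_cpu):
    """Return the CPUs whose DIMM count/size mix differs from the first CPU.

    dimms_by_cpu maps a CPU label to a list of DIMM sizes in MB.
    The CPU labels and the data layout are assumptions for illustration.
    """
    if not dimms_by_cpu:
        return []
    cpus = list(dimms_by_cpu)
    # Counter compares both the number of DIMMs and their sizes.
    reference = Counter(dimms_by_cpu[cpus[0]])
    return [cpu for cpu in cpus[1:] if Counter(dimms_by_cpu[cpu]) != reference]


if __name__ == "__main__":
    inventory = {
        "CPU1": [16384, 16384, 16384, 16384],
        "CPU2": [16384, 16384, 16384, 8192],  # one smaller DIMM
    }
    print(dimm_layout_mismatches(inventory))  # prints ['CPU2']
```

An empty result means that every CPU has the same number and sizes of DIMMs. It does not verify slot order, so the population rules in the server's installation guide still apply.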

    Memory Terms and Acronyms

    Table 3 Memory Terms and Acronyms

    Acronym

    Meaning

    DIMM

    Dual In-line Memory Module

    DRAM

    Dynamic Random Access Memory

    ECC

    Error Correction Code

    LVDIMM

    Low voltage DIMM

    MCA

    Machine Check Architecture

    MEMBIST

    Memory Built-In Self Test

    MRC

    Memory Reference Code

    POST

    Power On Self Test

    SPD

    Serial Presence Detect

    DDR

    Double Data Rate

    CAS

    Column Address Strobe

    RAS

    Row Address Strobe

    Troubleshooting DIMM Errors

    Correct Installation of DIMMs

    Verify that the DIMMs are installed correctly.

    In the first example in the following figure, a DIMM is correctly inserted and latched. Unless there is a small bit of dust blocking one of the contacts, this DIMM should function correctly. The second example shows a DIMM that is mismatched with the key for its slot. That DIMM cannot be inserted in this orientation and must be rotated to fit into the slot. In the third example, the left side of the DIMM seems to be correctly seated and the latch is fully connected, but the right side is just barely touching the slot and the latch is not seated into the notch on the DIMM. In the fourth example, the left side is again fully inserted and seated, and the right side is partially inserted and incompletely latched.

    Figure 1. Installation of DIMMs



    Troubleshooting DIMM Errors Using Cisco IMC CLI

    You can check memory information to identify possible DIMM errors in the Cisco IMC CLI.

    Procedure
       Command or Action                              Purpose
      Step 1   Server# scope chassis                  Enters chassis command mode.
      Step 2   Server /chassis # show dimm [detail]   Displays memory properties.

      The following example shows how to check memory information using the Cisco IMC CLI:

      Server# scope chassis
      Server /chassis# show dimm detail
      
      Name DIMM_A1:
          Capacity: Failed
          Channel Speed (MHz): NA
          Channel Type: NA
          Memory Type Detail: NA
          Bank Locator: NA
          Visibility: NA
          Operability: NA
          Manufacturer: NA
          Part Number: NA
          Serial Number: NA
          Asset Tag: NA
          Data Width: NA
      Name DIMM_A2:
          Capacity: Not Installed
          Channel Speed (MHz): NA
          Channel Type: NA
          Memory Type Detail: NA
          Bank Locator: NA
          Visibility: NA
          Operability: NA
          Manufacturer: NA
          Part Number: NA
          Serial Number: NA
          Asset Tag: NA
          Data Width: NA
          ...
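
When many DIMMs are listed, the detail output can be scanned with a short script instead of by eye. The following Python sketch is illustrative only. It assumes that the output of show dimm detail has been captured as text in the layout shown in the example above. The function names and the abbreviated sample are hypothetical and are not part of any Cisco tool.

```python
# Abbreviated sample in the same layout as the CLI example above.
SAMPLE = """\
Name DIMM_A1:
    Capacity: Failed
    Channel Speed (MHz): NA
    Operability: NA
Name DIMM_A2:
    Capacity: Not Installed
    Channel Speed (MHz): NA
    Operability: NA
"""


def parse_dimm_detail(text):
    """Parse 'show dimm detail' style text into {dimm_name: {field: value}}."""
    dimms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "...":
            continue  # skip blank lines and the elision marker
        if line.startswith("Name ") and line.endswith(":"):
            current = line[len("Name "):-1]
            dimms[current] = {}
        elif current is not None and ":" in line:
            field, _, value = line.partition(":")
            dimms[current][field.strip()] = value.strip()
    return dimms


def failed_dimms(dimms):
    """Return the names of DIMMs whose Capacity field reads 'Failed'."""
    return [name for name, fields in dimms.items()
            if fields.get("Capacity") == "Failed"]


if __name__ == "__main__":
    print(failed_dimms(parse_dimm_detail(SAMPLE)))  # prints ['DIMM_A1']
```

A DIMM reported as Not Installed is not flagged, because an empty slot is not an error by itself.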
      

      Troubleshooting DIMM Errors Using Cisco IMC GUI

      You can determine the type of DIMM errors being experienced using the Cisco IMC GUI.

      Procedure
        Step 1   In the Navigation pane, click the Server tab.
        Step 2   On the Server tab, click Inventory.
        Step 3   In the Inventory pane, click the Memory tab.
        Step 4   In the Memory Summary area, review the summary information about memory. A list of DIMMs is displayed. Corrupt or bad DIMMs are displayed as Failed.
        Step 5   Replace the corrupt or bad DIMM with a good DIMM.

        Troubleshooting Degraded DIMM Errors

        DIMMs with correctable errors are not disabled and are available for the OS to use. The total memory and effective memory are the same (memory mirroring is taken into account). These correctable errors are reported in Cisco IMC as degraded.

        If you see a correctable error reported in Cisco IMC, you can clear the report by resetting the BMC. Resetting the BMC only hides the correctable error on the DIMM; it does not fix the DIMM. To troubleshoot the DIMM physically, see Troubleshooting Inoperable DIMMs Errors.

        Use the following Cisco IMC CLI commands to reset BMC:

        Procedure
           Command or Action                                      Purpose
          Step 1   Server # scope chassis                         Enters chassis configuration mode.
          Step 2   Server /chassis # show dimm                    Displays any DIMMs with correctable errors. These DIMMs display their capacity as Failed. Clear the DIMM error flag by running the error correction code (ECC) command.
          Step 3   Server /chassis # scope reset-ecc              Enters error correction code configuration mode.
          Step 4   Server /chassis/reset-ecc # set enabled yes    Enables ECC.
          Step 5   Server /chassis/reset-ecc * # commit           Commits the transaction to the system configuration.

          The following example shows how to view and reset the DIMM error flag:

          Server # scope chassis
          Server /chassis # show dimm
          Name                 Capacity        Channel Speed (MHz) Channel Type
          -------------------- --------------- ------------------- ---------------
          DIMM_A1              Failed          NA                  NA
          DIMM_A2              Ignored/Disa... NA                  NA
          DIMM_B1              16384 MB        1866                DDR3
          DIMM_B2              16384 MB        1866                DDR3
          DIMM_C1              16384 MB        1866                DDR3
          DIMM_C2              16384 MB        1866                DDR3
          DIMM_D1              16384 MB        1866                DDR3
          DIMM_D2              16384 MB        1866                DDR3
          DIMM_E1              16384 MB        1866                DDR3
          DIMM_E2              16384 MB        1866                DDR3
          DIMM_F1              16384 MB        1866                DDR3
          DIMM_F2              16384 MB        1866                DDR3
          DIMM_G1              16384 MB        1866                DDR3
          DIMM_G2              16384 MB        1866                DDR3
          DIMM_H1              16384 MB        1866                DDR3
          DIMM_H2              16384 MB        1866                DDR3
           
           
          Clear DIMM Error flag:
          Server/chassis# top
          Server/chassis# scope reset-ecc
          Server/chassis /reset-ecc # set enabled yes
          Server/chassis /reset-ecc *# commit
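
The summary table printed by show dimm can also be filtered with a script. The following Python sketch is an illustration under stated assumptions: the output is captured as text, the columns are fixed width, and the row of dashes marks the column boundaries, as in the example above. The helper names are hypothetical, and the rule that any Capacity value not ending in MB needs attention is an assumption rather than a Cisco definition.

```python
import re

# Rebuild a few rows of the example table with the same column widths.
ROW = "{:<20} {:<15} {:<19} {:<15}"
SAMPLE = "\n".join([
    ROW.format("Name", "Capacity", "Channel Speed (MHz)", "Channel Type"),
    ROW.format("-" * 20, "-" * 15, "-" * 19, "-" * 15),
    ROW.format("DIMM_A1", "Failed", "NA", "NA"),
    ROW.format("DIMM_A2", "Ignored/Disa...", "NA", "NA"),
    ROW.format("DIMM_B1", "16384 MB", "1866", "DDR3"),
])


def parse_dimm_table(text):
    """Split a fixed-width 'show dimm' table into one dict per row."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # The ruler is the line made only of dashes and spaces.
    ruler_idx = next(i for i, ln in enumerate(lines)
                     if re.fullmatch(r"[- ]+", ln))
    spans = [m.span() for m in re.finditer(r"-+", lines[ruler_idx])]
    headers = [lines[ruler_idx - 1][a:b].strip() for a, b in spans]
    return [{h: ln[a:b].strip() for h, (a, b) in zip(headers, spans)}
            for ln in lines[ruler_idx + 1:]]


def dimms_needing_attention(rows):
    """Assumption: a healthy DIMM reports its capacity as '<number> MB'."""
    return [row["Name"] for row in rows if not row["Capacity"].endswith("MB")]


if __name__ == "__main__":
    # prints ['DIMM_A1', 'DIMM_A2']
    print(dimms_needing_attention(parse_dimm_table(SAMPLE)))
```

Slicing by the ruler positions, rather than splitting on whitespace, keeps values that contain spaces, such as 16384 MB, in a single column.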

          Troubleshooting Inoperable DIMMs Errors

          DIMMs with uncorrectable errors are disabled and the OS on the server does not see that memory. If one or more DIMMs fail while the system is up, the OS could crash unexpectedly. Cisco IMC shows the DIMMs as inoperable in the case of uncorrectable DIMM errors. These errors cannot be corrected using software. You can identify a bad DIMM and remove it to allow the server to boot, for example, when the BIOS fails to pass POST because of one or more bad DIMMs.

          To view and identify a bad DIMM using the Cisco IMC GUI, see Troubleshooting DIMM Errors Using Cisco IMC GUI.

          Procedure
            Step 1   Remove the inoperable DIMM from the system.
            Step 2   Install a single DIMM (preferably a tested good DIMM) or a DIMM pair in the first usable slot for the first processor (minimum requirement for POST success).
            Step 3   Re-attempt to boot the system.
            Step 4   If the BIOS POST is still unsuccessful, repeat steps 1 through 3 using a different DIMM in step 2.
            Step 5   If the BIOS POST is successful, continue adding memory. Follow the population rules for that server model. If the system can successfully pass the BIOS POST in some memory configurations but not others, use that information to help isolate the source of the problem.
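
The logic of steps 2 through 5 can be summarized as a short Python sketch. It is a model of the decision process only: the post_passes callback is a hypothetical stand-in for physically installing the listed DIMMs and attempting to boot, and the sketch ignores slot order and the population rules for the server model.

```python
def isolate_bad_dimms(candidates, post_passes):
    """Add DIMMs one at a time and keep only those that still pass POST.

    candidates  -- DIMM labels in the order you plan to install them
    post_passes -- hypothetical callback: takes the list of installed DIMMs
                   and returns True if the BIOS POST succeeds with them
    Returns (installed, suspects).
    """
    installed, suspects = [], []
    for dimm in candidates:
        trial = installed + [dimm]
        if post_passes(trial):
            installed = trial      # POST succeeded, keep this DIMM in place
        else:
            suspects.append(dimm)  # POST failed, remove it and try the next
    return installed, suspects


if __name__ == "__main__":
    bad = {"DIMM_B1"}  # simulated fault, for illustration only

    def simulated_post(dimms):
        return not (bad & set(dimms))

    good, suspect = isolate_bad_dimms(
        ["DIMM_A1", "DIMM_B1", "DIMM_C1"], simulated_post)
    print(good, suspect)  # prints ['DIMM_A1', 'DIMM_C1'] ['DIMM_B1']
```

A DIMM that lands in the suspect list should also be tried in a slot that is known to work, because a damaged slot produces the same symptom as a bad DIMM.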

            Recommended Solutions for DIMM Issues

            The following table lists guidelines and recommended solutions for troubleshooting DIMM issues.

            Table 4 DIMM Issues

            Issue

            Recommended Solution

            DIMM is not recognized.

            Verify that the DIMM is in a slot that supports an active CPU.

            Verify that the DIMM is sourced from Cisco. Third-party memory is not supported in Cisco UCS.

            DIMM does not fit in slot.

            Verify that the DIMM is supported on that server model.

            Verify that the DIMM is oriented correctly in the slot. DIMMs and their slots are keyed and only seat in one of the two possible orientations.

            The DIMM is reported as bad in the SEL, POST, or LEDs, or the DIMM is reported as inoperable in Cisco IMC.

            Verify that the DIMM is supported on that server model.

            Verify that the DIMM is populated in its slot according to the population rules for that server model.

            Verify that the DIMM is seated fully and correctly in its slot. Reseat it to ensure good contact and rerun POST.

            Verify that the DIMM is the problem by trying it in a slot that is known to be functioning correctly.

            Verify that the slot for the DIMM is not damaged by trying a DIMM that is known to be functioning correctly in the slot.

            Reset the BMC.

            The DIMM is reported as degraded in the GUI or CLI, or is running slower than expected.

            Reset the BMC.

            Reseat the rack server in the chassis.

            The DIMM is reported as overheating.

            Verify that the DIMM is seated fully and correctly in its slot. Reseat it to ensure good contact and rerun POST.

            Verify that all empty HDD bays, server slots, and power supply bays use blanking covers to ensure that the air is flowing as designed.

            Verify that the server air baffles are installed to ensure that the air is flowing as designed.

            Verify that any needed CPU air blockers are installed to ensure that the air is flowing as designed.

            Troubleshooting Server and Memory Issues

            Table 5 Server and Memory Issues

            Issue

            Recommended Solution

            Server Related Issues

            Every several days, the server requires a hard boot.

            Host is unreachable via IP, the CIMC works but KVM shows a blank screen.

            Upgrade the CIMC firmware and BIOS.

            Memory Configuration Issues

            Memory fault LED is amber on a new server.

            Upgrade the CIMC and BIOS.

            Memory errors on a previously working server.

            • Replace any DIMM with a reported error.

            • Upgrade the BIOS.

            Troubleshooting Communication Issues

            “No Signal” on vKVM and Physical Video Connection

            If you receive a “No Signal” message from the vKVM and the physical video connection immediately at boot, the PCI riser card might not be properly seated in the motherboard. To resolve the issue, complete these steps:

            Procedure
              Step 1   Power off the server and disconnect the power cord.
              Step 2   Confirm that all cards are properly seated.
              Step 3   Connect the power cord and power on the server.