Troubleshooting Operating System Issues

This section helps you troubleshoot operating system issues in your system.

NOTE: If the problem persists, contact Wistron Technical Support for further assistance.

Topics:

  • How to install the operating system using Virtual media - Ubuntu 22.04

  • AMD GPU

    • How to add amdgpu driver to blacklist in Linux

    • How to Install and Reinstall AMD MI300X ROCm and Driver in Ubuntu

    • AMDGPU driver causes fatal error on AMD platform

    • How to uninstall ROCm

    • Why does the AMD application become unresponsive on multi-GPU systems

    • How to update AMD MI300X UBB Firmware via BMC Console

How to install the operating system using Virtual media - Ubuntu 22.04

Before you begin

  1. Ensure that the server supports the operating system you intend to install.

  2. Make sure you have the OS source media available for the server.

  3. Prepare a storage location where you can store the necessary installation files.

SUMMARY STEPS

  1. Access the OpenBMC Web UI interface by entering the server's IP address in a web browser.
  2. Mount OS Source media into the Virtual Media drive.
  3. Click Add file (such as an ISO file) to add the virtual media device.
  4. Click Start to start the virtual media after the file is added.
  5. Confirm that the ISO image is mounted successfully.
  6. Save the changes, reboot the server, and boot into the OS installation image.
  7. In the BIOS setup utility or boot menu, navigate to the Save & Exit section or a similar option.
  8. Select the UEFI: Linux File-CD Gadget 0515 or similar entry from the available boot devices.
  9. Save the changes and exit the BIOS setup utility or boot menu.
  10. The server will now boot from the selected boot device, which is the OS installation image on the virtual media drive.
  11. Once the installation completes, the server reboots again, this time booting from the installed operating system on the server's local disk.

DETAILED STEPS

  Command or Action Purpose

Step 1

Access the OpenBMC Web UI interface by entering the server's IP address in a web browser.

NOTE: The default username is root and the default password is 0penBmc.

Figure 1. Login BMC WebGUI

Step 2

Mount OS Source media into the Virtual Media drive.

Step 3

Click Add file (such as an ISO file) to add the virtual media device.

Figure 2. Add media file

Step 4

Click Start to start the virtual media after the file is added.

Figure 3. Start the ISO image

Step 5

Confirm that the ISO image is mounted successfully.

Figure 4. Mounted the ISO

Step 6

Save the changes, reboot the server, and boot into the OS installation image.

Step 7

In the BIOS setup utility or boot menu, navigate to the Save & Exit section or a similar option.

Step 8

Select the UEFI: Linux File-CD Gadget 0515 or similar entry from the available boot devices.

Step 9

Save the changes and exit the BIOS setup utility or boot menu.

Step 10

The server will now boot from the selected boot device, which is the OS installation image on the virtual media drive.

Figure 5. Boot the Virtual Media from BIOS

Step 11

Once the installation completes, the server reboots again, this time booting from the installed operating system on the server's local disk.

AMD GPU

How to add amdgpu driver to blacklist in Linux

Cause: If the amdgpu driver or GPU firmware is faulty, the system can shut down or reboot when the amdgpu driver loads. Blacklisting the amdgpu driver prevents it from being loaded automatically at boot. If the blacklist is not configured, the issue may persist.

Solution:

To add the parameters to GRUB_CMDLINE_LINUX in /etc/default/grub, perform the following:

SUMMARY STEPS

  1. Run the following command to open the GRUB configuration file in a text editor: $ sudo nano /etc/default/grub
  2. Locate the line that starts with GRUB_CMDLINE_LINUX in the file, and add the required parameters inside the quotation marks.
  3. Save the changes by pressing Ctrl + X, followed by Y, and then Enter.
  4. Update the GRUB configuration by running the following command:
  5. Reboot your system for the changes to take effect:
  6. After amdgpu is added to the blacklist, load the module manually when it is needed using the following command:

DETAILED STEPS

  Command or Action Purpose

Step 1

Run the following command to open the GRUB configuration file in a text editor: $ sudo nano /etc/default/grub

Step 2

Locate the line that starts with GRUB_CMDLINE_LINUX in the file, and add the required parameters inside the quotation marks.

For example, to add the parameters modprobe.blacklist=amdgpu, iommu=pt, and pci=noats, the line would look like:

GRUB_CMDLINE_LINUX="modprobe.blacklist=amdgpu iommu=pt pci=noats"

Step 3

Save the changes by pressing Ctrl + X, followed by Y, and then Enter.

Step 4

Update the GRUB configuration by running the following command:

$ sudo grub-mkconfig -o /boot/efi/EFI/ubuntu/grub.cfg
$ sudo update-grub

Step 5

Reboot your system for the changes to take effect:

$ sudo reboot

Step 6

After amdgpu is added to the blacklist, the driver is no longer loaded automatically at boot. Load the module manually when it is needed using the following command:

$ sudo modprobe amdgpu

What to do next

NOTE: After rebooting, the kernel starts with the specified parameters and the amdgpu driver is not loaded automatically. Be aware that manually loading the driver may introduce potential issues or conflicts. If you experience any problems, review the parameters added in Step 2 or consult AMD technical support for further assistance.
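
The edit to GRUB_CMDLINE_LINUX described in Steps 1 through 3 can also be scripted. The following is a minimal sketch, not a vendor-supplied tool: add_grub_params is a hypothetical helper name, and the sketch assumes GNU sed (as shipped with Ubuntu) and a GRUB_CMDLINE_LINUX line that uses double quotes. It appends only the parameters that are missing, so running it twice does not duplicate them.

```shell
#!/bin/sh
# Hypothetical helper (not part of GRUB or ROCm): append kernel parameters to
# the GRUB_CMDLINE_LINUX line of the given file, skipping any already present.
# Assumes the line has the form GRUB_CMDLINE_LINUX="..." with double quotes.
add_grub_params() {
    file=$1
    shift
    # Refuse to edit files that do not contain the expected double-quoted line.
    if ! grep -q '^GRUB_CMDLINE_LINUX=".*"$' "$file"; then
        echo "no double-quoted GRUB_CMDLINE_LINUX line in $file" >&2
        return 1
    fi
    # Extract the current value between the quotes.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/\1/p' "$file")
    new=$current
    for param in "$@"; do
        case " $new " in
            *" $param "*) ;;                 # already present, skip
            *) new="${new:+$new }$param" ;;  # append, space-separated
        esac
    done
    # Rewrite the line in place (sed -i is a GNU sed extension).
    sed -i "s|^GRUB_CMDLINE_LINUX=\".*\"\$|GRUB_CMDLINE_LINUX=\"$new\"|" "$file"
}
```

For example, after backing up /etc/default/grub, running add_grub_params /etc/default/grub modprobe.blacklist=amdgpu iommu=pt pci=noats in a root shell would produce the line shown in Step 2. The GRUB configuration still has to be regenerated as described in Step 4.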

How to Install AMD MI300X ROCm and Driver in Ubuntu

Installation prerequisites

Before installing ROCm, complete the following prerequisites.

SUMMARY STEPS

  1. Confirm the system has a supported Linux version.
  2. Verify the kernel version.

DETAILED STEPS

  Command or Action Purpose

Step 1

Confirm the system has a supported Linux version.

  • To obtain the Linux distribution information, use the following command:

    $ uname -m && cat /etc/*release

  • Confirm that your Linux distribution matches a supported distribution.

    Example: Running the preceding command on an Ubuntu system produces the following output:

    
    x86_64
    DISTRIB_ID=Ubuntu
    DISTRIB_RELEASE=20.04
    DISTRIB_CODENAME=focal
    DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
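
The distribution check can be sketched as a small shell function. check_release is a hypothetical helper name; the sketch assumes the standard /etc/os-release file, which defines ID and VERSION_ID as shell-compatible variable assignments, rather than the /etc/lsb-release fields shown in the example output.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the os-release style file in $1
# reports distribution ID $2 and version $3.
# The body runs in a subshell so the sourced variables do not leak out.
check_release() (
    . "$1" || exit 1
    [ "${ID:-}" = "$2" ] && [ "${VERSION_ID:-}" = "$3" ]
)
```

A possible use is check_release /etc/os-release ubuntu 22.04 && echo "distribution matches", where the expected ID and version are whatever the ROCm release you install lists as supported.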

Step 2

Verify the kernel version.

  • To check the kernel version of your Linux system, type the following command:

    $ uname -srmv

    Example: The preceding command lists the kernel version in the following format:

    Linux 5.15.0-46-generic #44~20.04.5-Ubuntu SMP Fri Jun 24 13:27:29 UTC 2022 x86_64

  • Confirm that your kernel version matches the system requirements, as listed in Supported operating systems.

What to do next

Install ROCm using one of the following methods. For a quick summary on installing ROCm on Linux, choose your preferred operating system and installation method and follow the steps listed. If you want more in-depth installation instructions, refer to ROCm installation options.

Method 1:
  • Ubuntu 22.04

    Native package manager

    $ sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
    $ sudo usermod -a -G render,video $LOGNAME # Adding current user to Video, Render groups. See prerequisites.
    $ wget https://repo.radeon.com/amdgpu-install/6.1.1/ubuntu/jammy/amdgpu-install_6.1.60101-1_all.deb
    $ sudo apt install ./amdgpu-install_6.1.60101-1_all.deb
    $ sudo apt update
    $ sudo apt install amdgpu-dkms
    $ sudo apt install rocm
    $ echo "Please reboot system for all settings to take effect."

    AMDGPU installer

    $ sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
    $ sudo usermod -a -G render,video $LOGNAME # Adding current user to Video, Render groups. See prerequisites.
    $ sudo apt update
    $ wget https://repo.radeon.com/amdgpu-install/6.1.1/ubuntu/jammy/amdgpu-install_6.1.60101-1_all.deb
    $ sudo apt install ./amdgpu-install_6.1.60101-1_all.deb
    $ sudo amdgpu-install --usecase=graphics,rocm

Method 2:
  1. Before installing ROCm, ensure that your operating system is clean or uninstall any old versions of ROCm.

  2. Check your kernel version by running the command uname -r to confirm that it matches the requirements specified in the ROCm README or documentation.

  3. Extract the ROCm package by running the command:

    $ tar -xvzf $ROCm_FILE

    Replace $ROCm_FILE with the name of the downloaded ROCm file.

  4. Change to the ROCm directory by running the command:

    $ cd $ROCm_DIR

    Replace $ROCm_DIR with the name of the extracted ROCm directory.

  5. Give execute permissions to the amdgpu-install script by running the command:

    $ chmod u+x amdgpu-install

  6. Begin the installation process by running the command:

    $ ./amdgpu-install --usecase=hiplibsdk,rocm

    Follow the prompts and provide any necessary inputs during the installation process.

  7. Once the ROCm installation is complete, and before rebooting the system, it is recommended to blacklist the amdgpu driver to prevent it from being probed automatically when the operating system boots.

    For instructions, refer to the How to add amdgpu driver to blacklist in Linux section.

NOTE: The steps provided are general guidelines, and the exact commands or procedures may vary depending on your specific OS and ROCm version. For detailed instructions specific to your setup, consult the ROCm documentation or official resources.

AMDGPU driver causes fatal error on AMD platform

Cause: Some versions of the amdgpu driver may have issues that can result in fatal errors on AMD platforms.

Solution: To avoid potentially fatal errors caused by the amdgpu driver, you can try the following two methods:

Method 1: Add amdgpu driver to blacklist in Linux
  1. Run the following command to open the GRUB configuration file in a text editor:

    $ sudo nano /etc/default/grub

  2. Locate the line that starts with GRUB_CMDLINE_LINUX in the file.

    Add the modprobe.blacklist=amdgpu, iommu=pt, and pci=noats parameters after the existing content within the quotation marks ("").

    For example, if the existing content is empty, the line would look like:

    GRUB_CMDLINE_LINUX="modprobe.blacklist=amdgpu iommu=pt pci=noats"

  3. Save the changes by pressing Ctrl + X, then Y, and finally Enter.

  4. Update the GRUB configuration by running the following command:

    $ sudo grub-mkconfig -o /boot/efi/EFI/ubuntu/grub.cfg

    $ sudo update-grub

  5. Reboot your system for the changes to take effect:

    $ sudo reboot

    Figure 6. After updating the GRUB configuration

    NOTE: After rebooting, the iommu=pt and pci=noats parameters are present on the kernel command line. These parameters may help address certain issues or improve compatibility for your AMD platform.

    Modifying the GRUB configuration can affect system behavior. If you experience any further issues or instabilities, consider reverting the changes or seeking assistance from AMD technical support or your distribution's support channels.


Method 2: Disable IOMMU in the BIOS
  1. In the BIOS setup utility, go to Advanced > AMD CBS > NBIO Common Options > IOMMU and set it to Disabled.

    Figure 7. Disable IOMMU

How to uninstall ROCm

Solution:

To uninstall ROCm, please follow the steps below:

SUMMARY STEPS

  1. Open a terminal window.
  2. Run the following command to uninstall the ROCm packages from your system:
  3. Next, clean the ROCm repository configuration by running the following command:
  4. Use the following command to remove the amdgpu-install package:
  5. Finally, remove the amd-nonfree-radeon package by running the following command:

DETAILED STEPS

  Command or Action Purpose

Step 1

Open a terminal window.

Step 2

Run the following command to uninstall the ROCm packages from your system:

$ sudo amdgpu-uninstall

Step 3

Next, clean the ROCm repository configuration by running the following command:

$ sudo amdgpu-repo --clean

Step 4

Use the following command to remove the amdgpu-install package:

$ sudo apt-get -y purge amdgpu-install

Step 5

Finally, remove the amd-nonfree-radeon package by running the following command:

$ sudo apt-get -y purge amd-nonfree-radeon

These steps will uninstall the ROCm packages and related components from your system.

What to do next

NOTE: The commands provided are for Ubuntu or other Debian-based systems. If you are using a different Linux distribution, the package manager commands may vary. Adjust the commands accordingly, or refer to your distribution's documentation for the correct package management commands.

After completing these steps, it is recommended to reboot your system to ensure any remaining ROCm components are fully cleared.
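
To confirm that nothing was left behind, the installed package list can be filtered for leftover entries. The following is a rough sketch only: filter_gpu_packages is a hypothetical helper name, and the assumption that the relevant package names begin with rocm or amdgpu may not cover every package on your system.

```shell
#!/bin/sh
# Hypothetical helper: read package names on stdin, one per line, and print
# those whose names start with "rocm" or "amdgpu" (an assumed naming pattern).
# "|| true" keeps the exit status at 0 when nothing matches.
filter_gpu_packages() {
    grep -E '^(rocm|amdgpu)' || true
}
```

On a Debian-based system, the input could come from dpkg-query -W -f='${Package}\n' | filter_gpu_packages. Empty output suggests that no matching packages remain in the dpkg database.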

Why does the AMD application become unresponsive on multi-GPU systems

Cause: On a system with multiple GPUs, the application hangs with GPU utilization at 100% but without the expected rise in GPU temperature.

This issue often results in the following message in the application transcript:

NCCL WARN Missing "iommu=pt" from kernel command line which can lead to system instability or hang!

Solution:

To resolve this issue, add iommu=pt to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub.

Then run the following command:

$ sudo update-grub

Reboot the system, and run the following command:

$ cat /proc/cmdline

The returned information should include iommu=pt:

BOOT_IMAGE=/vmlinuz-5.15.0-101-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro iommu=pt
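
The check of /proc/cmdline can also be done by a script instead of by eye. This is a minimal sketch under the assumption that kernel parameters are separated by whitespace; has_kernel_param is a hypothetical helper name, and the optional second argument exists only so that the function can be tried against a saved copy of the command line.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the exact parameter $1 appears on the kernel
# command line. $2 optionally names a file to read instead of /proc/cmdline.
has_kernel_param() {
    # Put each whitespace-separated parameter on its own line, then look for
    # an exact, whole-line, fixed-string match.
    tr -s '[:space:]' '\n' < "${2:-/proc/cmdline}" | grep -Fxq -- "$1"
}
```

For example, has_kernel_param iommu=pt || echo "iommu=pt is missing" reports when the parameter did not take effect after the reboot.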