GPU Management

Overview

GPUs are widely used for high-performance computing and graphics processing. The BMC monitors GPU health indicators, such as temperature, to help prevent overheating or malfunction under heavy computational load, which protects the reliability and longevity of the hardware.

Monitored and Controlled Features

The BMC provides the following GPU monitoring and control functions:

  • Monitor GPU temperature

  • Monitor current GPU power consumption

  • Monitor the temperature of components on the GPU board

  • Monitor the power consumption of components on the GPU board

  • Display the version of components on the GPU board

  • Remotely update the GPU firmware and the component firmware on the GPU board

Configuring GPU Date and Time Settings


Note


This option is available only on some Cisco UCS C885A M8 Rack Server configurations.


Procedure


Step 1

From the Navigation Pane, select Settings > Date and time.

Step 2

Under Configure Settings, choose one of the following options:

  • Manual

  • Set GPU Datetime to be the same as BMC Datetime

Step 3

For Manual, update the following properties:

  • Date field: Enter the date in YYYY-MM-DD format.

  • 24-hour time (UTC) field: Enter the time in HH:MM format.

Step 4

Alternatively, select Set GPU Datetime to be the same as BMC Datetime to import the date and time from the BMC automatically.

Step 5

Click Set.
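The Manual option expects a date in YYYY-MM-DD format and a 24-hour UTC time in HH:MM format. As a minimal sketch, the following Python helper checks both values on your workstation before you enter them. The function name parse_gpu_datetime is hypothetical and is not part of the BMC. Rejecting values without leading zeros is an assumption about how strictly the fields are validated.

```python
import re
from datetime import datetime, timezone

# Patterns for the two Manual fields described above.
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
TIME_PATTERN = re.compile(r"[0-9]{2}:[0-9]{2}")


def parse_gpu_datetime(date_text: str, time_text: str) -> datetime:
    """Validate the two field values and return a timezone-aware UTC datetime.

    Hypothetical helper for illustration only. Raises ValueError when a
    value does not match YYYY-MM-DD or HH:MM, or is not a real date or time.
    """
    if not DATE_PATTERN.fullmatch(date_text):
        raise ValueError(f"Date must use YYYY-MM-DD format: {date_text!r}")
    if not TIME_PATTERN.fullmatch(time_text):
        raise ValueError(f"Time must use HH:MM format: {time_text!r}")
    # strptime rejects impossible values such as month 13 or hour 24.
    parsed = datetime.strptime(f"{date_text} {time_text}", "%Y-%m-%d %H:%M")
    # The time field is documented as UTC, so attach the UTC time zone.
    return parsed.replace(tzinfo=timezone.utc)
```

For example, parse_gpu_datetime("2025-03-14", "09:30") returns a UTC datetime, while parse_gpu_datetime("2025-3-14", "09:30") raises ValueError.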


Viewing GPU FRU Information

Procedure


Step 1

From the Navigation Pane, select GPU Management > Information.

Step 2

Under FRU Assembly, you can view the following properties:

  • Model: Displays the GPU model.

  • Name: Displays the GPU name.

  • Part Number: Lists the part number associated with the GPU.

  • Physical Context: Describes the physical context or placement of the GPU.

  • Serial Number: Displays the serial number of the GPU.

  • Vendor: Identifies the vendor or manufacturer of the GPU.

Step 3

Under Versions, you can view the following properties:

  • Name column: Identifies the component or software related to the GPU.

  • Version column: Shows the version number associated with the component or software.


Viewing GPU Power and Temperature Sensors

Procedure


Step 1

From the Navigation Pane, select GPU Management > Sensors.

Step 2

Under Power, you can view the following properties:

  • Name column: Identifies the power sensor.

  • Current Value column: Displays the current power reading.

  • Min Value column: Shows the minimum recorded power value.

  • Max Value column: Displays the maximum recorded power value.

Step 3

Under Temperature, you can view the following properties:

  • Name column: Identifies the temperature sensor.

  • Current Value column: Displays the current temperature reading.

  • Min Value column: Shows the minimum recorded temperature value.

  • Max Value column: Displays the maximum recorded temperature value.

  • Critical High column: Indicates the critical high threshold for temperature sensors.

  • Critical Low column: Indicates the critical low threshold for temperature sensors.
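The temperature properties relate a current reading to a critical high and a critical low threshold. As a rough sketch of that relationship, the following Python code classifies a reading that you have copied from the Sensors page. The class name, function name, and sensor name are hypothetical. Treating a reading that equals a threshold as critical is an assumption, as is the temperature unit, which this section does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TemperatureReading:
    """One row of the Temperature table (hypothetical structure)."""
    name: str
    current: float
    critical_low: Optional[float] = None   # None when no threshold is shown
    critical_high: Optional[float] = None  # None when no threshold is shown


def classify(reading: TemperatureReading) -> str:
    """Return "critical-high", "critical-low", or "ok" for a reading.

    Assumption: a value equal to a threshold counts as critical.
    """
    if reading.critical_high is not None and reading.current >= reading.critical_high:
        return "critical-high"
    if reading.critical_low is not None and reading.current <= reading.critical_low:
        return "critical-low"
    return "ok"


# Illustrative values only; they are not taken from real hardware.
example = TemperatureReading("GPU_0_TEMP", current=45.0, critical_low=0.0, critical_high=90.0)
status = classify(example)
```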


Viewing GPU Power Configuration

Procedure


Step 1

From the Navigation Pane, select GPU Management > Powers.

Step 2

You can view the following properties:

  • Name column: Identifies the GPU.

  • Power Consumption column: Displays the current power usage.

  • Power Cap column: Indicates the maximum power limit set for the GPU.


Applying GPU Power Cap

Procedure


Step 1

From the Navigation Pane, select GPU Management > Powers.

Step 2

Check the Apply power cap check box.

Step 3

In the Power cap value (in watts) field, enter a value between 200 and 750.

Step 4

Click Save.
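The power cap field accepts a value between 200 and 750 watts. As a minimal sketch, the following Python check applies that range to a planned value before you enter it. The function name validate_power_cap is hypothetical. Treating both limits as inclusive and accepting whole numbers only are assumptions that this section does not confirm.

```python
MIN_POWER_CAP_WATTS = 200  # lower limit stated in the procedure
MAX_POWER_CAP_WATTS = 750  # upper limit stated in the procedure


def validate_power_cap(value_watts: int) -> int:
    """Return value_watts unchanged if it is an acceptable power cap.

    Hypothetical helper. Assumes whole watts and an inclusive range.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value_watts, bool) or not isinstance(value_watts, int):
        raise TypeError("Power cap must be a whole number of watts")
    if not MIN_POWER_CAP_WATTS <= value_watts <= MAX_POWER_CAP_WATTS:
        raise ValueError(
            f"Power cap must be between {MIN_POWER_CAP_WATTS} and "
            f"{MAX_POWER_CAP_WATTS} watts, got {value_watts}"
        )
    return value_watts
```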


Event Logs

Viewing GPU Event Logs

Procedure


Step 1

From the Navigation Pane, select GPU Management > Event logs.

Step 2

You can filter the event logs by the following criteria:

  • Date range, using the from and to dates

  • Severity: OK, Warning, or Critical

  • Keyword, using the search field

You can view the following log properties:

  • ID column: Displays the unique identifier for each log entry.

  • Severity column: Indicates the importance or impact of the log entry. The value is one of the following: OK (a normal or successful operation), Warning (a potential issue that should be monitored), or Critical (a severe issue that requires immediate attention).

  • Date column: Shows the date and time when the log entry was recorded.

  • Description column: Provides a brief summary or details about the log entry.


Exporting GPU Event Logs

Procedure


Step 1

From the Navigation Pane, select GPU Management > Event logs.

Step 2

To export a single log entry, click the export icon in the row of that entry.

Step 3

(Optional) To export all log entries, click Export all.

Depending on your browser settings, you may be prompted to open or save the JSON log file.
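Once the log is saved as JSON, the same date, severity, and keyword filters can be applied offline. The following Python sketch shows one way to do that. The layout of the exported file is not described in this section, so the field names id, severity, date, and description are assumptions modeled on the column names, and the sample entries are invented for illustration. Adjust the keys to match the file that your BMC produces.

```python
import json
from datetime import datetime
from typing import Dict, List, Optional

# Invented sample data; a real export may use different keys and layout.
SAMPLE_EXPORT = """
[
  {"id": "1", "severity": "OK", "date": "2025-03-14T09:30:00",
   "description": "GPU firmware update completed"},
  {"id": "2", "severity": "Warning", "date": "2025-03-15T11:05:00",
   "description": "GPU temperature above normal"},
  {"id": "3", "severity": "Critical", "date": "2025-03-16T02:45:00",
   "description": "GPU temperature critical"}
]
"""


def filter_entries(
    entries: List[Dict[str, str]],
    severity: Optional[str] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    keyword: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Return the entries that satisfy every filter that is not None."""
    matches = []
    for entry in entries:
        # Timestamps here carry no time zone, so start and end must not either.
        recorded = datetime.fromisoformat(entry["date"])
        if severity is not None and entry["severity"] != severity:
            continue
        if start is not None and recorded < start:
            continue
        if end is not None and recorded > end:
            continue
        if keyword is not None and keyword.lower() not in entry["description"].lower():
            continue
        matches.append(entry)
    return matches


entries = json.loads(SAMPLE_EXPORT)
critical_entries = filter_entries(entries, severity="Critical")
```

To work with a saved file instead of the sample string, load it with json.load on the opened file and pass the resulting list to filter_entries.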


Updating GPU Firmware

Before you begin

Ensure that the firmware file is available on the client before starting this procedure.
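One way to confirm that the firmware file is present and intact on the client is to compute its checksum before you upload it. The following Python code is a sketch only. It assumes that a SHA-256 checksum is available from wherever you obtained the firmware, which this section does not state, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of the file at path.

    Raises FileNotFoundError if the file is not on the client.
    """
    if not path.is_file():
        raise FileNotFoundError(f"Firmware file not found: {path}")
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so that large firmware images are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare the file digest with a published checksum, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```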

Procedure


Step 1

From the Navigation Pane, select GPU Management > Firmware.

Step 2

Click Add File, then browse to and select the firmware file.

Step 3

Click Start Update to initiate the firmware update.


What to do next

After the firmware update completes, perform an AC power cycle to activate and complete the GPU upgrade.