Lenovo Hybrid AI 285 with Cisco Networking

AI-Ready Data Center Validated Design Guide



Introduction

Lenovo Hybrid AI 285 is a platform that enables enterprises of all sizes to quickly deploy hybrid AI factory infrastructure, supporting Enterprise AI use cases as either a new, greenfield environment or an extension of their existing IT infrastructure. The Lenovo Hybrid AI 285 platform, combined with Cisco Networking, offers a validated design for enterprises to deploy hybrid AI infrastructure.

This guide provides a step-by-step approach to implementing Cisco Networking and Cisco Nexus® Dashboard and refers users to the foundational Lenovo Hybrid AI 285 platform guide as necessary.

The Lenovo Hybrid AI 285 is a platform based on the NVIDIA 2-8-5 PCIe-optimized configuration (2x CPUs, 8x GPUs, and 5x network adapters) and is ideally suited for medium (per GPU) to large (per node) inference use cases and for small-to-large model training or fine-tuning, depending on the chosen scale. It combines market-leading, GPU-rich Lenovo ThinkSystem servers, which can be equipped with NVIDIA L40S, H200 NVL, or RTX PRO 6000 Blackwell Server Edition GPUs, with Cisco Networking. It enables the use of the NVIDIA AI Enterprise (NVAIE) software stack with NVIDIA Blueprints. In this validation, we are using NVIDIA L40S GPUs.

Lenovo Hybrid AI 285 platform overview with Cisco Networking

Figure 1.            

Lenovo Hybrid AI 285 platform overview with Cisco Networking

The Lenovo Hybrid AI 285 platform with Cisco Networking scales from a starter-kit environment with between 4 and 32 PCIe GPUs to a Scalable Unit (SU) deployment with four servers and 32 GPUs per SU, and up to three SUs with 12 servers and 96 GPUs. See Figure 2 for a sizing overview.

Lenovo Hybrid AI 285 with Cisco Networking scaling from starter kit to 96 GPUs

Figure 2.            

Lenovo Hybrid AI 285 with Cisco Networking scaling from starter kit to 96 GPUs

Figure 3 shows the networking architecture of the platform deployed with 96 GPUs.

Lenovo Hybrid AI 285 with Cisco Networking platform with three scalable units

Figure 3.            

Lenovo Hybrid AI 285 with Cisco Networking platform with three scalable units

Platform overview

The Lenovo Hybrid AI 285 platform is built on the NVIDIA 2-8-5 PCIe-optimized configuration:

     2 CPUs, 8 GPUs, and 5 network adapters

     Designed for medium-to-large inference and model training workloads

     Consists of Lenovo ThinkSystem SR675 V3 servers with GPUs and NICs chosen per use case, plus SR635 V3 servers as additional service nodes that handle management, orchestration, and other cluster-wide functions for the overall AI environment

Scalability:

     Starter kit: 4 GPUs

     Scalable Unit (SU): 4 servers, 32 GPUs per SU

     Maximum deployment: 3 SUs, 12 servers, 96 GPUs

In summary, here are the key features:

     Scalability: from a starter kit (4 GPUs) to a full-scale deployment (96 GPUs across 3 scalable units)

     Flexibility: supports NVIDIA GPUs (L40S, H200 NVL, and RTX PRO 6000) and NVIDIA AI Enterprise software stack

     Performance: optimized for medium-to-large inference and training workloads

Hardware components

The main hardware components of Lenovo Hybrid AI 285 platforms are compute nodes and the Cisco Networking infrastructure. As an integrated solution, they are combined in either a Lenovo EveryScale Rack (Machine Type 1410) or Lenovo EveryScale Client Site Integration Kit (Machine Type 7X74).

Topics in this section:

     AI compute node – SR675 V3

     Cisco Networking

Lenovo AI compute node

The AI compute node leverages the GPU-rich Lenovo ThinkSystem SR675 V3 server.

The SR675 V3 is a 2-socket server based on 5th Gen AMD EPYC 9005 processors, supporting up to 8 PCIe double-wide (DW) GPUs and up to 5 network adapters in a 3-Unit (3U) rack server chassis. This makes it the ideal choice for NVIDIA’s 2-8-5 configuration requirements.

Lenovo ThinkSystem SR675 V3 in 8 PCIe DW setup

Figure 4.            

Lenovo ThinkSystem SR675 V3 in 8 PCIe DW setup

AI compute node: Lenovo ThinkSystem SR675 V3

     Processor: Dual AMD EPYC 9535 (64 cores, 2.4 GHz, 3.5 GHz boost)

     GPU support: Up to 8 NVIDIA GPUs (L40S, H200 NVL, RTX PRO 6000)

     Network adapters: 5 PCIe Gen5 slots for low-latency NVIDIA GPUDirect

     Storage: RAID1 OS drives and high-capacity NVMe drives for application data

The AI compute node is configured with two AMD EPYC 9535 64-core 2.4 GHz processors with an all-core boost frequency of 3.5 GHz. Besides consistently providing more than 2 GHz of frequency, this ensures that, with seven NVIDIA Multi-Instance GPU (MIG) instances on each of the eight physical GPUs, there are two cores available per MIG instance plus a few additional cores for the operating system and other operations.

The GPUs are connected to the CPUs through two PCIe Gen5 switches, each supporting up to four GPUs. With the NVIDIA H200 NVL PCIe GPU, the four GPUs on a switch are additionally interconnected through an NVLink bridge, creating a unified memory space. In an entry configuration with two GPUs per PCIe switch, the Lenovo ThinkSystem SR675 V3 supports connecting all four GPUs with an NVLink bridge for maximized shared memory, thereby accommodating larger inference models. No NVLink bridge is applicable with the NVIDIA RTX PRO 6000 Blackwell Server Edition; the same applies to configurations with the L40S. For this test, we are using four L40S GPUs in each compute node.

The Ethernet adapters for the compute (east/west) network are directly connected to the GPUs through PCIe switches, thus minimizing latency and enabling NVIDIA GPUDirect and GPUDirect storage operations. For pure inference workloads, they are optional, but for training and fine-tuning operations, they should provide at least 200 Gb/s per GPU.

Finally, the system is completed by local storage with two 960 GB read-intensive M.2 drives in a RAID1 configuration for the operating system and four 3.84 TB read-intensive E3.S drives for local application data.

Cisco AI Networking

The default setup leverages Cisco Networking with the Nexus 9364D-GX2A for the converged and compute network.

Cisco Nexus 9364D-GX2A switch

The Cisco Nexus 9364D-GX2A is a 2-Rack-Unit (2RU) switch that supports 25.6 Tbps of bandwidth and 8.35 billion packets per second (bpps) across 64 fixed 400G QSFP-DD ports and two fixed 1/10G SFP+ ports. QSFP-DD ports also support native 200G (QSFP56), 100G (QSFP28), and 40G (QSFP+). Each port can also support 4x 10G, 4x 25G, 4x 50G, 4x 100G, and 2x 200G breakouts.

It supports flexible configurations, including 128 ports of 200GbE or 256 ports of 100/50/25/10GE ports, thereby accommodating diverse AI cluster requirements.
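As an illustration, a 400G QSFP-DD port can be split into four 100G interfaces with the NX-OS breakout command; the module and port numbers below are placeholders, and the supported mappings should be confirmed in the Nexus 9000 configuration guide:

interface breakout module 1 port 1 map 100g-4x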

Cisco Nexus 9364D-GX2A switch

Figure 5.            

Cisco Nexus 9364D-GX2A switch

Cisco Nexus 9300-FX3 Series Switch

The Cisco Nexus 93108TC-FX3P Series Switch is a high-performance, fixed-port switch designed for modern data centers. It features 48 ports of 100M/1/2.5/5/10GBASE-T, providing flexible connectivity options for various network configurations. Additionally, it includes six uplink ports that support 40/100 Gigabit Ethernet QSFP28, ensuring high-speed data transfer and scalability.

Cisco Nexus 93108TC-FX3P Series Switch

Figure 6.            

Cisco Nexus 93108TC-FX3P Series Switch

The out-of-band (management) network encompasses all AI compute node and NVIDIA BlueField-3 DPU Baseboard Management Controllers (BMCs) as well as the network infrastructure management.

Cisco Nexus Dashboard

Cisco Nexus Dashboard, included with every Cisco Nexus 9000 Series Switch tiered licensing purchase, serves as a centralized hub that unifies disparate network configurations and views from multiple switches and data centers. For AI fabric operations, it acts as the ultimate command center, from the initial setup of AI fabric automation to continuous fabric analytics, all within a few clicks.

AI fabric workflow on Cisco Nexus Dashboard

Figure 7.            

AI fabric workflow on Cisco Nexus Dashboard

Key capabilities of Cisco Nexus Dashboard, such as congestion scoring, PFC/ECN statistics, and microburst detection, empower organizations to proactively identify and address performance bottlenecks for their AI backend infrastructure.

Advanced features, including anomaly detection, event correlation, and suggested remediation, ensure that networks are not only resilient but also self-healing, thus minimizing downtime and accelerating issue resolution.

Cisco Nexus Dashboard offers flexible deployment options—including physical appliances (pND) on Cisco UCS® M6 and M8 servers, a virtual appliance (vND) for existing virtual infrastructure, and a cloud-hosted option—thereby meeting diverse data-center needs. These options deliver high performance, scalability, and simplified management across LAN, SAN, and IPFM environments without requiring dedicated hardware. The installation and deployment guide is available here.

Server configuration and tuning

Server hardware configuration

The AI compute node is configured with two AMD EPYC 9535 64-core 2.4 GHz processors with an all-core boost frequency of 3.5 GHz. The GPUs are connected to the CPUs through two PCIe Gen5 switches, each supporting up to four GPUs. The Ethernet adapters for the compute (east/west) network are directly connected to the GPUs through PCIe switches, thus minimizing latency and enabling NVIDIA GPUDirect and GPUDirect Storage operations. For this test, we are using four NVIDIA L40S GPUs in each compute node. The system is completed by local storage with two 960 GB read-intensive M.2 drives in a RAID1 configuration for the operating system and four 3.84 TB read-intensive E3.S drives for local application data.

Understanding NUMA domains

The Lenovo ThinkSystem SR675 V3 is a dual-socket AI compute node. The architecture divides the system into two non-uniform memory access (NUMA) domains, one associated with each CPU socket. For optimal AI workload performance, it is crucial to ensure that the process threads and the data they operate on are allocated within the same local NUMA domain.

Four NVIDIA L40S GPUs and two NVIDIA BlueField-3 B3140H NICs are configured in each AI compute node for this testing. To optimize performance and minimize latency, two GPUs and one NIC are installed on NUMA node 0, and the other two GPUs and one NIC are installed on NUMA node 1.
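As a quick check of this placement, the PCIe topology and NUMA affinities can be inspected from the host; the interface name below is one of the data interfaces from this setup:

tester@ubuntu-ai-1:~$ nvidia-smi topo -m
tester@ubuntu-ai-1:~$ cat /sys/class/net/ens16f0np0/device/numa_node

The nvidia-smi topo -m command prints the GPU/NIC connection matrix along with NUMA affinities, and the sysfs query returns the NUMA node (0 or 1) of the NIC.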

BIOS UEFI best practices

Properly configuring UEFI parameters in a server is important for achieving a desired outcome such as high performance or efficiency. For AI compute nodes, the UEFI operating-mode setting should be set to “Maximum Performance Mode.”

UEFI setup operating mode

Figure 8.            

UEFI setup operating mode

For custom-mode configurations, these additional settings are required: enable both P-state (performance state) and CPPC (collaborative processor performance control).

OS-level performance tuning

Set the frequency governor to performance:

# cpupower frequency-set --governor performance

. . .

 

# cpupower frequency-info

analyzing CPU 82:

driver: acpi-cpufreq

CPUs which run at the same hardware frequency: 82

CPUs which need to have their frequency coordinated by software: 82

maximum transition latency:  Cannot determine or is not supported.

hardware limits: 1.50 GHz - 2.45 GHz

available frequency steps:  2.45 GHz, 1.90 GHz, 1.50 GHz

available cpufreq governors: conservative ondemand userspace powersave performance schedutil

current policy: frequency should be within 1.50 GHz and 2.45 GHz.

The governor "performance" may decide which speed to use

within this range.

current CPU frequency: 2.45 GHz (asserted by call to hardware)

boost state support:

Supported: yes

Active: yes

Boost States: 0

Total States: 3

Pstate-P0:  2450MHz

Pstate-P1:  1900MHz

Pstate-P2:  1500MHz

Disable idle states 2 and 1 (disabling state 1 is optional):

# cpupower idle-set -d 2

# cpupower idle-set -d 1
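To confirm the resulting idle-state configuration, cpupower can list the available C-states and whether they are disabled:

# cpupower idle-info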

Software stack on Lenovo servers

Deploying AI to production involves implementing multiple layers of software. The process begins with server BIOS tuning, system management, and the operating system of the compute nodes. It progresses through workload or container scheduling and cluster management. And it culminates in the AI software stack that enables delivering AI tools and agents to users.

For an AI starter kit deployment, the Kubernetes control plane operates directly on the AI compute nodes, negating the requirement for dedicated service nodes to run the Kubernetes control plane. The AI compute nodes will function as primary-secondary nodes. A minimum of one service node is still required to run NVIDIA Base Command Manager (BCM). The software stack for AI starter kit or single-node deployments is similar to the full AI software stack, though some components may be considered optional or less practical for these smaller configurations.

Table 1.        Lenovo AI software stack (software role: software packages)

Bare-metal management: Lenovo XClarity One Management Hub 2.0; Lenovo XClarity One (cloud or on-premises VM)

Linux operating system: Ubuntu Server 22.04.4 LTS

Container orchestration: Upstream Kubernetes 1.31.5

Container runtime: Containerd 1.7.23

Orchestration: NVIDIA Base Command Manager 10.0; Prometheus 2.55.0; Permission-Manager 0.5.1

Container network interface (CNI): Calico 3.27.4

Package manager: Helm 3.16.0

Load balancer – control plane: NGINX 1.12.0

Load balancer – network services: MetalLB 0.14.8

Operator: NVIDIA GPU Operator 24.9.2; Linux GPU driver 550.127.05; NVIDIA NIM Operator 1.0.1; NVIDIA Network Operator v25.1.0

DOCA host: 2.9.0-0.4.7

NCCL: 2.21

Grafana: 11.2.2

Storage: NFS Provisioner 4.0.2; Local Path Provisioner 0.0.31

Lenovo and Cisco AI network validated deployment

Test topology

Figure 9.            

Test topology

The key purpose of the testing is to make sure we understand the lossless fabric configurations required for high networking performance and can replicate them in different environments (systems with NVIDIA L40S, H200 NVL, or RTX PRO 6000 Blackwell Server Edition GPUs). The following components were used for testing:

     Compute:

    2x Lenovo ThinkSystem SR675 V3 with NVIDIA L40S GPUs, NVIDIA BlueField-3 B3140H adapters for east/west communication, and NVIDIA BlueField-3 B3220 adapters for north/south communication

    Ubuntu Server 22.04.4 LTS

     Network:

    Cisco Nexus 9364D-GX2A (NX-OS: 10.5.3(F))

     Network management:

    Cisco Nexus Dashboard (4.1.1g)

     Optics:

    Cisco N9364D-GX2A: QSFP-DD-400G-DR4 with CB-M12-M12-SMF cable

    NVIDIA B3140H: QSFP-400G-DR4 with CB-M12-M12-SMF cable

     License:

    Cisco Data Center Networking (Cisco DCN) Advantage for the Cisco Nexus switch

Verification steps for OS and drivers

1. Verifying that the CUDA driver is properly installed and running:

The first and most important verification step is to use the NVIDIA System Management Interface (nvidia-smi) command-line utility. This tool is essential for monitoring GPU status, checking driver versions, and managing GPU states.

Running this command successfully confirms that the kernel driver is loaded correctly and can communicate with the GPU hardware.

tester@ubuntu-ai-1:~$ nvidia-smi

Fri Oct 17 16:01:12 2025

+-----------------------------------------------------------------------------------------+

| NVIDIA-SMI 575.64.03              Driver Version: 575.64.03      CUDA Version: 12.9     |

|-----------------------------------------+------------------------+----------------------+

| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |

| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |

|                                         |                        |               MIG M. |

|=========================================+========================+======================|

|   0  NVIDIA L40S                    Off |   00000000:03:00.0 Off |                    0 |

| N/A   33C    P8             33W /  350W |       0MiB /  46068MiB |      0%      Default |

|                                         |                        |                  N/A |

+-----------------------------------------+------------------------+----------------------+

|   1  NVIDIA L40S                    Off |   00000000:04:00.0 Off |                    0 |

| N/A   31C    P8             33W /  350W |       0MiB /  46068MiB |      0%      Default |

|                                         |                        |                  N/A |

+-----------------------------------------+------------------------+----------------------+

|   2  NVIDIA L40S                    Off |   00000000:83:00.0 Off |                    0 |

| N/A   31C    P8             34W /  350W |       0MiB /  46068MiB |      0%      Default |

|                                         |                        |                  N/A |

+-----------------------------------------+------------------------+----------------------+

|   3  NVIDIA L40S                    Off |   00000000:84:00.0 Off |                    0 |

| N/A   31C    P8             32W /  350W |       0MiB /  46068MiB |      0%      Default |

|                                         |                        |                  N/A |

+-----------------------------------------+------------------------+----------------------+

 

+-----------------------------------------------------------------------------------------+

| Processes:                                                                              |

|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |

|        ID   ID                                                               Usage      |

|=========================================================================================|

|  No running processes found                                                             |

 

2. Verifying the NVIDIA DOCA installation and NVIDIA BlueField-3 Status:

You must verify that the NVIDIA BlueField-3 is recognized correctly by the host and that the internal NVIDIA DOCA services on the DPU are operational. This verification process is divided into two parts: checks performed on the host server and checks performed directly on the NIC.

Check PCIe device recognition (lspci)

First, verify that the host system detects the NVIDIA BlueField DPU on the PCIe bus.

tester@ubuntu-ai-1:~$ lspci | grep Mellanox

31:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)

31:00.1 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)

75:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)

75:00.1 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)

85:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)

85:00.1 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)

85:00.2 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)

c1:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)

c1:00.1 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)

f5:00.0 Ethernet controller: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller (rev 01)

f5:00.1 DMA controller: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface (rev 01)

Check device status with Mellanox Software Tools (mst status)

The Mellanox Software Tools (MST) provide a more detailed status. This is the most common tool used to manage NVIDIA networking devices.

tester@ubuntu-ai-1:~$ sudo mst start

Starting MST (Mellanox Software Tools) driver set

Loading MST PCI module - Success

[warn] mst_pciconf is already loaded, skipping

Create devices

Unloading MST PCI module (unused) – Success

 

tester@ubuntu-ai-1:~$ sudo mst status -v

MST modules:

------------

MST PCI module is not loaded

MST PCI configuration module loaded

PCI devices:

------------

DEVICE_TYPE            MST                           PCI       RDMA            NET                 NUMA

 

BlueField3(rev:1)      /dev/mst/mt41692_pciconf4    f5:00.0    mlx5_2          net-enp245s0f0np0      1

 

BlueField3(rev:1)      /dev/mst/mt41692_pciconf3    c1:00.0    mlx5_5          net-enp193s0f0np0      1

 

BlueField3(rev:1)      /dev/mst/mt41692_pciconf2.1  85:00.1    mlx5_4          net-enp133s0f1np1      1

 

BlueField3(rev:1)      /dev/mst/mt41692_pciconf2    85:00.0    mlx5_3          net-enp133s0f0np0      1

 

BlueField3(rev:1)      /dev/mst/mt41692_pciconf1    75:00.0    mlx5_1          net-ens16f0np0         0

 

BlueField3(rev:1)      /dev/mst/mt41692_pciconf0    31:00.0    mlx5_0          net-ens15f0np0         0

 

3. Verifying NIC status and mapping

Before configuring RoCEv2, it is essential to verify that the system correctly recognizes the RDMA-capable Network Interface Cards (RNICs) and to identify their corresponding Linux network interface names. The ibdev2netdev utility is the standard tool for this task.

This command maps the low-level InfiniBand device names to the familiar kernel network interface names that are used for IP addressing and other network configurations.

tester@ubuntu-ai-1:~$ ibdev2netdev

mlx5_0 port 1 ==> ens15f0np0 (Up)

mlx5_1 port 1 ==> ens16f0np0 (Up)

mlx5_2 port 1 ==> enp245s0f0np0 (Up)

mlx5_3 port 1 ==> enp133s0f0np0 (Up)

mlx5_4 port 1 ==> enp133s0f1np1 (Down)

mlx5_5 port 1 ==> enp193s0f0np0 (Up)

4. Configuring per-source interface routing with netplan

This configuration is designed for a server with two or more network interfaces connected to different subnets, each with its own gateway. The goal is to ensure that when the server initiates traffic from an IP address on one interface, it uses the gateway associated with that specific interface, rather than relying on a single, global default gateway. In this example, we are using enp133s0f0np0 for management, while the others are used as data interfaces.

Below is an example netplan configuration. Use interfaces and an IP addressing scheme that match your setup requirements.

tester@ubuntu-ai-1:~$ sudo cat /etc/netplan/00-installer-config.yaml

network:
  version: 2
  ethernets:
    enp133s0f0np0:
      addresses:
        - 172.22.194.7/24
      nameservers:
        addresses:
          - 171.70.168.183
        search:
          - cisco.com
      routes:
        - to: default
          via: 172.22.194.1
    enp133s0f1np1:
      dhcp4: true
    ens15f0np0:
      dhcp4: true
    ens16f0np0:
      mtu: 9000
      dhcp4: false
      addresses:
        - 192.168.101.25/31
      routing-policy:
        - from: 192.168.101.25
          table: 101
      routes:
        - to: 0.0.0.0/0
          via: 192.168.101.24
          table: 101
    enp245s0f0np0:
      mtu: 9000
      dhcp4: false
      addresses:
        - 192.168.101.45/31
      routing-policy:
        - from: 192.168.101.45
          table: 102
      routes:
        - to: 0.0.0.0/0
          via: 192.168.101.44
          table: 102

When you make changes to your netplan YAML configuration files, apply them to the system. Once applied, you may also need to restart the networking service:

sudo netplan apply

sudo systemctl restart systemd-networkd
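To confirm that the per-source routing rules took effect, you can inspect the policy-routing rules and the per-interface tables; the table numbers match the netplan example above:

tester@ubuntu-ai-1:~$ ip rule show
tester@ubuntu-ai-1:~$ ip route show table 101
tester@ubuntu-ai-1:~$ ip route show table 102

The rule list should include entries such as "from 192.168.101.25 lookup 101", and each table should contain the default route via its interface-specific gateway.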

5. Verify connectivity between interfaces on the servers

tester@ubuntu-ai-1:~$ ping -I 192.168.101.45 192.168.201.45

PING 192.168.201.45 (192.168.201.45) from 192.168.101.45 : 56(84) bytes of data.

64 bytes from 192.168.201.45: icmp_seq=1 ttl=63 time=15.3 ms

64 bytes from 192.168.201.45: icmp_seq=2 ttl=63 time=0.673 ms

64 bytes from 192.168.201.45: icmp_seq=3 ttl=63 time=0.833 ms

64 bytes from 192.168.201.45: icmp_seq=4 ttl=63 time=0.853 ms

64 bytes from 192.168.201.45: icmp_seq=5 ttl=63 time=0.844 ms

^C

--- 192.168.201.45 ping statistics ---

5 packets transmitted, 5 received, 0% packet loss, time 4074ms

rtt min/avg/max/mdev = 0.673/3.710/15.348/5.819 ms

6. Configure RoCEv2, ECN, and PFC on the NICs (follow the instructions in the NVIDIA documentation)

     Set mode to RoCEv2 on the NIC

sudo cma_roce_mode -d mlx5_1 -p 1 -m 2

     Set Type of Service (ToS) 104

sudo cma_roce_tos -d mlx5_1 -t 104

     Map ToS 104 to traffic-class 1 (TC1)

sudo bash -c 'echo 104 > /sys/class/infiniband/mlx5_1/tc/1/traffic_class'

     PFC and DSCP-based classification for class 3

sudo bash -c 'echo 1 > /sys/class/net/ens1f1np1/ecn/roce_np/enable/3'

sudo mlnx_qos -i ens1f1np1 --trust=dscp --pfc 0,0,0,1,0,0,0,0

     Enable ECN and change the CNP packets DSCP value to 48

sudo sysctl -w net.ipv4.tcp_ecn=1

sudo bash -c 'echo 6 > /sys/class/net/ens1f1np1/ecn/roce_np/cnp_802p_prio'

sudo bash -c 'echo 48 > /sys/class/net/ens1f1np1/ecn/roce_np/cnp_dscp'
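As a sanity check, the same tools can be run without a set argument to print the current values; the device and interface names follow the examples above:

sudo cma_roce_mode -d mlx5_1 -p 1
sudo cma_roce_tos -d mlx5_1
sudo mlnx_qos -i ens1f1np1

The first two commands report the active RoCE mode and ToS, and mlnx_qos displays the trust state and the PFC enable bitmap (priority 3 should show as enabled).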

Configuring Cisco Nexus 9000 Series Switches with CLI

1. Configuring a lossless fabric for RoCEv2

To create a high-performance, lossless network fabric suitable for traffic such as RoCEv2, Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) must be configured to work together. This guide explains the necessary command-line configurations using the Cisco Modular Quality of Service (MQC) framework on Cisco Nexus 9000 Series Switches.

The MQC framework separates the configuration into three logical parts:

     type qos: classifies incoming traffic based on markers such as DSCP and assigns it to an internal qos-group

     type queuing: defines how traffic is buffered, scheduled, and handled during congestion (for example, bandwidth allocation and ECN/WRED)

     type network-qos: configures system-level lossless characteristics, such as enabling PFC for specific traffic classes

The following example demonstrates a complete configuration applied to all switches and interfaces in the fabric. The goal is to properly handle RoCEv2 traffic (marked with DSCP 26) and its associated Congestion Notification Packet (CNP) traffic (marked with DSCP 48).

Step 1: Classify incoming traffic (type qos)

First, we must identify the RoCEv2 and CNP traffic as it enters the switch. We create class maps to match their DSCP values and then a policy map to assign them to internal qos-groups. This qos-group acts as a tag that other QoS policies will use to identify the traffic.

     class-map: matches RoCEv2 traffic (DSCP 26) and CNP traffic (DSCP 48)

     policy-map: assigns matched RoCEv2 traffic to qos-group 3 and CNP traffic to qos-group 7. All other traffic is assigned to the default qos-group 0.

class-map type qos match-any CNP

match dscp 48

class-map type qos match-any ROCEv2

match dscp 26

 

policy-map type qos QOS_CLASSIFICATION

class ROCEv2

set qos-group 3

class CNP

set qos-group 7

class class-default

set qos-group 0
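Before proceeding, the classification constructs can be verified with standard NX-OS show commands:

show class-map type qos

show policy-map type qos QOS_CLASSIFICATION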

Step 2: Define queuing and congestion behavior (type queuing)

Next, we define how the classified traffic is handled in the egress queues. This policy uses the qos-groups assigned in step 1 to apply specific queuing behaviors.

     CNP traffic (qos-group 7): placed in a strict priority queue (priority level 1) to ensure that these critical notifications are never delayed

     RoCEv2 traffic (qos-group 3): allocated 50 percent of the remaining bandwidth. ECN is enabled on this queue using Weighted Random Early Detection (WRED). When the queue depth reaches the minimum threshold (150 KB), the switch begins marking packets with an ECN flag instead of dropping them. The maximum threshold (3000 KB) defines the upper limit. These values are optimized for high-speed 100G/400G networks.

     Default traffic: receives the remaining 50 percent of the bandwidth

policy-map type queuing custom-8q-out-policy

class type queuing c-out-8q-q7

priority level 1

class type queuing c-out-8q-q6

bandwidth remaining percent 0

class type queuing c-out-8q-q5

bandwidth remaining percent 0

class type queuing c-out-8q-q4

bandwidth remaining percent 0

class type queuing c-out-8q-q3

bandwidth remaining percent 50

random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 7 weight 0 ecn

class type queuing c-out-8q-q2

bandwidth remaining percent 0

class type queuing c-out-8q-q1

bandwidth remaining percent 0

class type queuing c-out-8q-q-default

bandwidth remaining percent 50

Step 3: Enable lossless behavior (type network-qos)

This policy enables PFC for the specific traffic class carrying RoCEv2 traffic, making its queue lossless.

     pause pfc-cos 3: enables the generation and processing of PFC pause frames for Class of Service (CoS) 3. This command links PFC to the queue handling RoCEv2 traffic (qos-group 3), preventing packet drops during congestion by pausing transmission.

     mtu 9216: sets a jumbo MTU. While this does not directly enforce an MTU on traffic, it is used by the system to properly calculate the headroom buffer required for the lossless queue.

     PFC thresholds: The xOFF (pause) and xON (resume) thresholds are left at their default values and are not displayed in the configuration.

policy-map type network-qos custom-8q-nq-policy

<...>

class type network-qos c-8q-nq3

mtu 9216

pause pfc-cos 3

<...>

Step 4: Apply the policies

Finally, the policies must be attached to the correct scopes to become active.

System-wide application

The queuing and network-qos policies are applied globally to ensure consistent behavior across the entire switch. This guarantees that ECN marking and PFC pause handling are active system wide.

system qos

service-policy type network-qos custom-8q-nq-policy

service-policy type queuing output custom-8q-out-policy

Step 5: Interface-level application

The initial qos classification policy must be applied to the input of every interface that will receive RoCEv2 traffic. Additionally, PFC must be explicitly enabled on these interfaces. This configuration is required on all spine and leaf interfaces in the fabric.

     service-policy type qos input ...: attaches the classification policy to handle incoming traffic

     priority-flow-control mode on: activates PFC functionality on the physical port

     priority-flow-control watch-dog-interval on: enables a watchdog (with a default 100 ms interval) to detect and recover from PFC storm conditions

interface Ethernet1/2

no switchport

priority-flow-control mode on

priority-flow-control watch-dog-interval on

mtu 9216

service-policy type qos input QOS_CLASSIFICATION

ip address 192.168.101.24/31

no shutdown
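After the interface configuration is applied, PFC operation and policy attachment can be verified per port:

show interface ethernet1/2 priority-flow-control

show policy-map interface ethernet1/2 type qos

The PFC output reports the administrative and operational mode along with the RxPPP/TxPPP pause-frame counters for the port.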

Cisco Nexus Dashboard for AI network automation

Cisco Nexus Dashboard serves as a centralized orchestration platform for deploying network fabrics optimized for AI workloads. The controller is agnostic to the underlying transport architecture, providing automated provisioning for both Layer-3 routed designs and VXLAN-EVPN overlay networks.

Nexus Dashboard leverages validated design templates to rapidly instantiate the entire fabric, including the complex policy configurations required for lossless Ethernet. This includes the programmatic deployment of Quality of Service (QoS) settings for Priority Flow Control (PFC) and Explicit Congestion Notification (ECN), ensuring high-performance, low-latency transport. The platform also automates day-2 operational tasks, such as fabric scaling through the addition of new leaf or spine switches, and modifications to host-facing interface configurations.

Note: Direct back-to-back leaf switch connections are unsupported in L3 routed or EVPN VXLAN fabrics for AI networks due to the strict enforcement of a Clos topology. The example used here shows one leaf switch for illustration purposes only.

1. Create an AI fabric under Manage -> Fabric -> Create Fabric -> Create a new LAN fabric.

Cisco Nexus Dashboard for AI network automation

2. Select the fabric type AI. (For this example, we are selecting AI Routed.)

Cisco Nexus Dashboard for AI network automation 2

3. Provide details for your fabric. Default configurations are based on the best practices described in the previous CLI-based steps. For more advanced settings, you can proceed with the Advanced configuration steps.

4. Upon fabric creation, the subsequent phase is the onboarding of the network switches. This process is initiated by providing Nexus Dashboard with a seed switch's IP address and administrative credentials. Nexus Dashboard then leverages the Cisco Discovery Protocol (CDP) to recursively discover all interconnected devices, automatically building out the fabric topology based on physical adjacencies.

Cisco Nexus Dashboard for AI network automation 3

5. Once the fabric is built, edit the interfaces under the Connectivity tab to enable Priority Flow Control and QoS configuration, and enter a custom QoS policy (for example, QOS_CLASSIFICATION).

Connectivity Tab to Enable Priority

6. Verify the configurations by selecting one of the switches in Inventory and clicking Actions -> Maintenance -> Show Commands.

Connectivity Tab to Enable Priority  3

7. To diagnose a network congestion event identified by an anomaly (if or when one is present in your network), follow the procedure below to analyze detailed interface-level statistics.

a.   Navigate to the affected interface: From the Network Congestion Indication anomaly alert, select the hyperlink for the specific leaf switch and interface that triggered the event. This will take you to the detailed monitoring page for that interface.

b.   Locate congestion metrics: On the interface details page, scroll to the Congestion Score and Congestion Details graphs. These sections provide a visual representation and statistical breakdown of congestion-related events.

c.   Analyze congestion counters: The Congestion Details section displays key performance indicators essential for root-cause analysis. Examine the reported values for:

i.     WRED/AFD/drops: metrics related to proactive queue management and packet drops

ii.    PFC (priority flow control): counters for pause frames, indicating lossless traffic behavior

iii.   ECN (explicit congestion notification): statistics for packets marked for congestion notification

d.   Perform per-queue investigation: For a more granular diagnosis, select the View Queues link located next to each counter. This action provides a detailed breakdown of statistics at the individual hardware queue level, enabling precise identification of the source of congestion.


RDMA/IB performance test for network performance validation

To validate the performance of RoCEv2 for direct communication between GPUs, Cisco employed the RDMA/InfiniBand (IB) performance benchmark suite: https://github.com/linux-rdma/perftest. The tests were designed to confirm that the underlying lossless network and routing protocols were configured correctly and that the end-to-end connections were stable enough to support traffic at a full 100 percent line-rate.

You can download the suite from the GitHub link above and follow the installation steps on your servers.
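A typical build sequence for the suite is sketched below; the exact prerequisite package names (the libibverbs and librdmacm development headers) vary by distribution:

git clone https://github.com/linux-rdma/perftest.git
cd perftest
./autogen.sh
./configure
make -j
sudo make install

After successful installation, run the following commands for performance validation: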

Server-1:

numactl --cpunodebind=0 --membind=0 ib_write_bw -a -d mlx5_1 --report_gbits -q 16

This command is composed of several key parts. The numactl portion is a critical performance tuning step that pins the test process and its memory to a specific CPU socket (NUMA node 0), preventing latency spikes that can occur when accessing memory across different CPUs. The ib_write_bw command itself is the executable for the InfiniBand Write Bandwidth test; when run without an IP address, it defaults to server mode.

The parameters modify the test's behavior:

     -a: runs the test across all available message sizes to provide a comprehensive performance profile

     -d mlx5_1: specifies the RDMA-capable network device to use, in this case, the second Mellanox/NVIDIA adapter found by the system (mlx5_1)

     --report_gbits: changes the output unit from the default Megabytes Per Second (MB/s) to the more network-centric Gigabits per second (Gbps)

     -q 16: sets the number of Queue Pairs (QPs) to 16. Using multiple QPs enables parallel data transfers, which is necessary to saturate high-speed links.

Server-2:

numactl --cpunodebind=0 --membind=0 ib_write_bw 192.168.101.25 -a -d mlx5_1 --report_gbits -q 16

This command is nearly identical to the server command, ensuring a consistent test environment on both ends. The one crucial difference is the inclusion of the server's IP address.

     192.168.101.25: This IP address tells ib_write_bw to operate in client mode and specifies the target server to connect to.

     All other parameters (numactl, -a, -d mlx5_1, --report_gbits, and -q 16) function exactly as they do on the server side, ensuring that both nodes are using the same NUMA policy, device, and test configuration.

In practice, you first run the server command on one machine. It will indicate that it is listening for a connection. You then run the client command on the second machine. The test will execute automatically, and the bandwidth results will be printed to the console on the client side upon completion.
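Note that this run exercises host memory. If perftest is built with CUDA support, the same test can optionally target GPU memory directly (GPUDirect RDMA) by adding the --use_cuda=<GPU index> flag on both ends, for example:

numactl --cpunodebind=0 --membind=0 ib_write_bw 192.168.101.25 -a -d mlx5_1 --report_gbits -q 16 --use_cuda=0

The output shown below is from the host-memory run.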

Output:

Server-1

************************************

* Waiting for client to connect... *

************************************

---------------------------------------------------------------------------------------

                    RDMA_Write BW Test

 Dual-port       : OFF          Device         : mlx5_1

 Number of qps   : 16           Transport type : IB

 Connection type : RC           Using SRQ      : OFF

 PCIe relax order: ON           Lock-free      : OFF

 ibv_wr* API     : ON           Using DDP      : OFF

 CQ Moderation   : 100

 CQE Poll Batch  : 16

 Mtu             : 4096[B]

 Link type       : Ethernet

 GID index       : 3

 Max inline data : 0[B]

 rdma_cm QPs     : OFF

 Data ex. method : Ethernet

---------------------------------------------------------------------------------------

 local address: LID 0000 QPN 0x07c5 PSN 0xf04146 RKey 0x203d00 VAddr 0x007fffefac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07c6 PSN 0x4018b8 RKey 0x203d00 VAddr 0x007ffff02c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07c7 PSN 0x9157a2 RKey 0x203d00 VAddr 0x007ffff0ac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07c8 PSN 0x1c7b49 RKey 0x203d00 VAddr 0x007ffff12c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07c9 PSN 0xa1ca28 RKey 0x203d00 VAddr 0x007ffff1ac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07ca PSN 0xfb586a RKey 0x203d00 VAddr 0x007ffff22c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07cb PSN 0x15a281 RKey 0x203d00 VAddr 0x007ffff2ac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07cc PSN 0xe752e6 RKey 0x203d00 VAddr 0x007ffff32c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07cd PSN 0xcfc992 RKey 0x203d00 VAddr 0x007ffff3ac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07ce PSN 0xf07cac RKey 0x203d00 VAddr 0x007ffff42c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07cf PSN 0x8d2e0 RKey 0x203d00 VAddr 0x007ffff4ac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07d0 PSN 0xf1c9f3 RKey 0x203d00 VAddr 0x007ffff52c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07d1 PSN 0x703c63 RKey 0x203d00 VAddr 0x007ffff5ac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07d2 PSN 0x9a91d4 RKey 0x203d00 VAddr 0x007ffff62c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07d3 PSN 0xc41402 RKey 0x203d00 VAddr 0x007ffff6ac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 local address: LID 0000 QPN 0x07d4 PSN 0x98209e RKey 0x203d00 VAddr 0x007ffff72c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x0871 PSN 0x2be139 RKey 0x1fff00 VAddr 0x007fffefa26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x0872 PSN 0xd1f866 RKey 0x1fff00 VAddr 0x007ffff0226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x0873 PSN 0x957582 RKey 0x1fff00 VAddr 0x007ffff0a26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x0874 PSN 0xbe0072 RKey 0x1fff00 VAddr 0x007ffff1226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x0875 PSN 0x4cc5d4 RKey 0x1fff00 VAddr 0x007ffff1a26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x0876 PSN 0xfe9570 RKey 0x1fff00 VAddr 0x007ffff2226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x0877 PSN 0x5b8ed9 RKey 0x1fff00 VAddr 0x007ffff2a26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x0878 PSN 0xad9128 RKey 0x1fff00 VAddr 0x007ffff3226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x0879 PSN 0x892476 RKey 0x1fff00 VAddr 0x007ffff3a26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x087a PSN 0x3e9489 RKey 0x1fff00 VAddr 0x007ffff4226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x087b PSN 0xebfd2f RKey 0x1fff00 VAddr 0x007ffff4a26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x087c PSN 0xf83acd RKey 0x1fff00 VAddr 0x007ffff5226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x087d PSN 0x13b1ff RKey 0x1fff00 VAddr 0x007ffff5a26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x087e PSN 0x375a09 RKey 0x1fff00 VAddr 0x007ffff6226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x087f PSN 0xb763c9 RKey 0x1fff00 VAddr 0x007ffff6a26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x0880 PSN 0xc8558f RKey 0x1fff00 VAddr 0x007ffff7226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

---------------------------------------------------------------------------------------

 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]

 8388608    80000            391.53             391.53               0.005834

---------------------------------------------------------------------------------------

Server-2

---------------------------------------------------------------------------------------

                    RDMA_Write BW Test

 Dual-port       : OFF          Device         : mlx5_1

 Number of qps   : 16           Transport type : IB

 Connection type : RC           Using SRQ      : OFF

 PCIe relax order: ON           Lock-free      : OFF

 ibv_wr* API     : ON           Using DDP      : OFF

 TX depth        : 128

 CQ Moderation   : 100

 CQE Poll Batch  : 16

 Mtu             : 4096[B]

 Link type       : Ethernet

 GID index       : 3

 Max inline data : 0[B]

 rdma_cm QPs     : OFF

 Data ex. method : Ethernet

---------------------------------------------------------------------------------------

 local address: LID 0000 QPN 0x0871 PSN 0x2be139 RKey 0x1fff00 VAddr 0x007fffefa26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x0872 PSN 0xd1f866 RKey 0x1fff00 VAddr 0x007ffff0226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x0873 PSN 0x957582 RKey 0x1fff00 VAddr 0x007ffff0a26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x0874 PSN 0xbe0072 RKey 0x1fff00 VAddr 0x007ffff1226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x0875 PSN 0x4cc5d4 RKey 0x1fff00 VAddr 0x007ffff1a26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x0876 PSN 0xfe9570 RKey 0x1fff00 VAddr 0x007ffff2226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x0877 PSN 0x5b8ed9 RKey 0x1fff00 VAddr 0x007ffff2a26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x0878 PSN 0xad9128 RKey 0x1fff00 VAddr 0x007ffff3226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x0879 PSN 0x892476 RKey 0x1fff00 VAddr 0x007ffff3a26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x087a PSN 0x3e9489 RKey 0x1fff00 VAddr 0x007ffff4226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x087b PSN 0xebfd2f RKey 0x1fff00 VAddr 0x007ffff4a26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x087c PSN 0xf83acd RKey 0x1fff00 VAddr 0x007ffff5226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x087d PSN 0x13b1ff RKey 0x1fff00 VAddr 0x007ffff5a26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x087e PSN 0x375a09 RKey 0x1fff00 VAddr 0x007ffff6226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x087f PSN 0xb763c9 RKey 0x1fff00 VAddr 0x007ffff6a26000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 local address: LID 0000 QPN 0x0880 PSN 0xc8558f RKey 0x1fff00 VAddr 0x007ffff7226000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:201:25

 remote address: LID 0000 QPN 0x07c5 PSN 0xf04146 RKey 0x203d00 VAddr 0x007fffefac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07c6 PSN 0x4018b8 RKey 0x203d00 VAddr 0x007ffff02c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07c7 PSN 0x9157a2 RKey 0x203d00 VAddr 0x007ffff0ac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07c8 PSN 0x1c7b49 RKey 0x203d00 VAddr 0x007ffff12c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07c9 PSN 0xa1ca28 RKey 0x203d00 VAddr 0x007ffff1ac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07ca PSN 0xfb586a RKey 0x203d00 VAddr 0x007ffff22c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07cb PSN 0x15a281 RKey 0x203d00 VAddr 0x007ffff2ac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07cc PSN 0xe752e6 RKey 0x203d00 VAddr 0x007ffff32c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07cd PSN 0xcfc992 RKey 0x203d00 VAddr 0x007ffff3ac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07ce PSN 0xf07cac RKey 0x203d00 VAddr 0x007ffff42c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07cf PSN 0x8d2e0 RKey 0x203d00 VAddr 0x007ffff4ac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07d0 PSN 0xf1c9f3 RKey 0x203d00 VAddr 0x007ffff52c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07d1 PSN 0x703c63 RKey 0x203d00 VAddr 0x007ffff5ac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07d2 PSN 0x9a91d4 RKey 0x203d00 VAddr 0x007ffff62c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07d3 PSN 0xc41402 RKey 0x203d00 VAddr 0x007ffff6ac3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

 remote address: LID 0000 QPN 0x07d4 PSN 0x98209e RKey 0x203d00 VAddr 0x007ffff72c3000

 GID: 00:00:00:00:00:00:00:00:00:00:255:255:01:01:101:25

---------------------------------------------------------------------------------------

 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]

 2          80000           0.095617            0.095542            5.971350

 4          80000            0.19               0.19                 6.043600

 8          80000            0.39               0.39                 6.057085

 16         80000            0.77               0.77                 6.029973

 32         80000            1.56               1.55                 6.067049

 64         80000            3.11               3.11                 6.069521

 128        80000            6.18               6.18                 6.031414

 256        80000            12.36              12.36                6.033941

 512        80000            24.72              24.67                6.022545

 1024       80000            49.33              49.29                6.016425

 2048       80000            98.16              98.06                5.985260

 4096       80000            194.37             25.64                0.782418

 8192       80000            381.19             54.09                0.825285

 16384      80000            376.17             108.07               0.824473

 32768      80000            382.82             270.39               1.031458

 65536      80000            386.12             372.57               0.710622

 131072     80000            388.62             355.34               0.338879

 262144     80000            390.25             390.22               0.186074

 524288     80000            391.65             391.63               0.093373

 1048576    80000            391.96             391.95               0.046724

 2097152    80000            392.09             392.09               0.023370

 4194304    80000            391.67             391.67               0.011673

 8388608    80000            391.53             391.53               0.005834

---------------------------------------------------------------------------------------

 

We leveraged the ib_write_bw benchmark to assess RoCEv2 throughput for inter-GPU workloads, confirming the health and proper configuration of the lossless network fabric. The tests verified that the end-to-end paths across the Cisco Nexus 9364D-GX2A switch could sustain 100 percent line-rate traffic. The measured average bandwidth of 391 Gbps on a 400-Gbps link represents the maximum achievable goodput after accounting for the encapsulation overhead of the RoCEv2 protocol headers.

Verify end-to-end QoS on the Cisco Nexus 9300 switch for RDMA traffic

Verify that the QoS group 3 TX packet counters increment across all test runs, which confirms that AI traffic is receiving the intended QoS treatment.

Nexus-9364D-GX2A-1# sh queuing interface et1/1

slot  1

=======

 

Egress Queuing for Ethernet1/1 [System]

-------------------------------------------------------------------------------

QoS-Group# Bandwidth%% PrioLevel                Shape               QLimit

Min          Max        Units

-------------------------------------------------------------------------------

7             -         1           -            -     -            9(D)

6             0         -           -            -     -            9(D)

5             0         -           -            -     -            9(D)

4             0         -           -            -     -            9(D)

3            99         -           -            -     -           (N/A)

2             0         -           -            -     -            9(D)

1             0         -           -            -     -            9(D)

0             1         -           -            -     -            9(D)

+-------------------------------------------------------------+

|                              QOS GROUP 0                    |

+-------------------------------------------------------------+

|                           |  Unicast       |Multicast       |

+-------------------------------------------------------------+

|                   Tx Pkts |            1959|               0|

|                   Tx Byts |          257502|               0|
| WRED/AFD & Tail Drop Pkts |               0|               0|
| WRED/AFD & Tail Drop Byts |               0|               0|
|                  ECN Pkts |               0|               0|
|                  ECN Byts |               0|               0|
|              Q Depth Byts |               0|               0|
+-------------------------------------------------------------+
|                              QOS GROUP 1                    |
+-------------------------------------------------------------+
|                           |  Unicast       |Multicast       |
+-------------------------------------------------------------+
|                   Tx Pkts |               0|               0|
|                   Tx Byts |               0|               0|
| WRED/AFD & Tail Drop Pkts |               0|               0|
| WRED/AFD & Tail Drop Byts |               0|               0|
|                  ECN Pkts |               0|               0|
|                  ECN Byts |               0|               0|
|              Q Depth Byts |               0|               0|
+-------------------------------------------------------------+
|                              QOS GROUP 2                    |
+-------------------------------------------------------------+
|                           |  Unicast       |Multicast       |
+-------------------------------------------------------------+
|                   Tx Pkts |               0|               0|
|                   Tx Byts |               0|               0|
| WRED/AFD & Tail Drop Pkts |               0|               0|
| WRED/AFD & Tail Drop Byts |               0|               0|
|                  ECN Pkts |               0|               0|
|                  ECN Byts |               0|               0|
|              Q Depth Byts |               0|               0|
+-------------------------------------------------------------+
|                              QOS GROUP 3                    |
+-------------------------------------------------------------+
|                           |  Unicast       |Multicast       |
+-------------------------------------------------------------+
|                   Tx Pkts |        85923028|               0|
|                   Tx Byts |    343892489604|               0|
| WRED/AFD & Tail Drop Pkts |               0|               0|
| WRED/AFD & Tail Drop Byts |               0|               0|
|                  ECN Pkts |               0|               0|
|                  ECN Byts |               0|               0|
|              Q Depth Byts |               0|               0|
+-------------------------------------------------------------+
|                              QOS GROUP 4                    |
+-------------------------------------------------------------+
|                           |  Unicast       |Multicast       |
+-------------------------------------------------------------+
|                   Tx Pkts |               0|               0|
|                   Tx Byts |               0|               0|
| WRED/AFD & Tail Drop Pkts |               0|               0|
| WRED/AFD & Tail Drop Byts |               0|               0|
|                  ECN Pkts |               0|               0|
|                  ECN Byts |               0|               0|
|              Q Depth Byts |               0|               0|
+-------------------------------------------------------------+
|                              QOS GROUP 5                    |
+-------------------------------------------------------------+
|                           |  Unicast       |Multicast       |
+-------------------------------------------------------------+
|                   Tx Pkts |               0|               0|
|                   Tx Byts |               0|               0|
| WRED/AFD & Tail Drop Pkts |               0|               0|
| WRED/AFD & Tail Drop Byts |               0|               0|
|                  ECN Pkts |               0|               0|
|                  ECN Byts |               0|               0|
|              Q Depth Byts |               0|               0|
+-------------------------------------------------------------+
|                              QOS GROUP 6                    |
+-------------------------------------------------------------+
|                           |  Unicast       |Multicast       |
+-------------------------------------------------------------+
|                   Tx Pkts |               0|               0|
|                   Tx Byts |               0|               0|
| WRED/AFD & Tail Drop Pkts |               0|               0|
| WRED/AFD & Tail Drop Byts |               0|               0|
|                  ECN Pkts |               0|               0|
|                  ECN Byts |               0|               0|
|              Q Depth Byts |               0|               0|
+-------------------------------------------------------------+
|                              QOS GROUP 7                    |
+-------------------------------------------------------------+
|                           |  Unicast       |Multicast       |
+-------------------------------------------------------------+
|                   Tx Pkts |               0|               0|
|                   Tx Byts |               0|               0|
| WRED/AFD & Tail Drop Pkts |               0|               0|
| WRED/AFD & Tail Drop Byts |               0|               0|
|                  ECN Pkts |               0|               0|
|                  ECN Byts |               0|               0|
|              Q Depth Byts |               0|               0|
+-------------------------------------------------------------+
|                      CONTROL QOS GROUP                      |
+-------------------------------------------------------------+
|                   Tx Pkts |          162940|               0|
|                   Tx Byts |        53055304|               0|
| WRED/AFD & Tail Drop Pkts |               0|               0|
| WRED/AFD & Tail Drop Byts |               0|               0|
+-------------------------------------------------------------+
|                         SPAN QOS GROUP                      |
+-------------------------------------------------------------+
|                   Tx Pkts |               0|               0|
|                   Tx Byts |               0|               0|
| WRED/AFD & Tail Drop Pkts |               0|               0|
| WRED/AFD & Tail Drop Byts |               0|               0|
+-------------------------------------------------------------+

Per Slice Egress SPAN Statistics
---------------------------------------------------------------
SPAN Copies Tail Drop Pkts                           0
SPAN Input Queue Drop Pkts                           0
SPAN Copies/Transit Tail Drop Pkts                   0
SPAN Input Desc Drop  Pkts                           0

Ingress Queuing for Ethernet1/1
--------------------------------------------------------
QoS-Group#                 Pause
            Buff Size       Pause Th      Resume Th
--------------------------------------------------------
7              -            -            -
6              -            -            -
5              -            -            -
4              -            -            -
3        1114176       584832       580800
2              -            -            -
1              -            -            -
0              -            -            -

Per Port Ingress Statistics
--------------------------------------------------------
Hi Priority Drop Pkts                           0
Low Priority Drop Pkts                          0

Ingress Overflow Drop Statistics
--------------------------------------------------------
All Pause Drop Pkts                             0
High Pause Drop Pkts                            0
Low Pause Drop Pkts                             0

PFC Statistics
------------------------------------------------------------------------------
TxPPP:                    0,   RxPPP:                    0
------------------------------------------------------------------------------
PFC_COS QOS_Group   TxPause             TxCount   RxPause             RxCount
0         -  Inactive                   0  Inactive                   0
1         -  Inactive                   0  Inactive                   0
2         -  Inactive                   0  Inactive                   0
3         3  Inactive                   0  Inactive                   0
4         -  Inactive                   0  Inactive                   0
5         -  Inactive                   0  Inactive                   0
6         -  Inactive                   0  Inactive                   0
7         -  Inactive                   0  Inactive                   0
------------------------------------------------------------------------------
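
In this run, all RoCEv2 test traffic is carried in QoS group 3, the lossless class: its transmit counters are the only nonzero data-plane values, and the WRED/tail-drop, ECN, and PFC pause counters are all zero, indicating that the fabric absorbed the benchmark load without congestion or pause events. Counters of this kind can be collected on a Cisco Nexus switch with the following commands (a sketch; Ethernet1/1 is the example port shown above and should be replaced with the interface of interest):

switch# show queuing interface ethernet 1/1
switch# show interface priority-flow-control

The first command reports the per-QoS-group transmit, drop, and ECN statistics together with the PFC summary shown above; the second lists PFC pause frames sent and received on each PFC-enabled port. Checking both on the server-facing and inter-switch links after each benchmark run quickly confirms whether the lossless configuration is behaving as intended.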

Conclusion

This validated design provides a comprehensive blueprint for deploying high-performance, scalable AI infrastructure by combining Lenovo's GPU-rich ThinkSystem SR675 V3 compute nodes with a Cisco Nexus networking fabric. By following the detailed steps for server tuning, software-stack deployment, and lossless network configuration, enterprises can build a reliable foundation for demanding AI/ML workloads. The guide demonstrates two methods for fabric configuration: a granular CLI approach using the Cisco Modular Quality of Service framework, and an automated, template-driven process using Cisco Nexus Dashboard. Performance validation with the ib_write_bw benchmark confirms the configuration, achieving 391 Gbps of goodput on a 400-Gbps link and demonstrating that the end-to-end architecture is correctly tuned to prevent packet loss and maximize throughput for RoCEv2 traffic. Ultimately, this guide gives organizations a clear path from a starter kit to a multi-rack, production-grade AI factory, accelerating their AI initiatives.
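
For reference, goodput figures like the one above are typically measured with the perftest suite. The following is a minimal sketch of a two-node run, not the exact invocation used in this validation; the RDMA device name (mlx5_0), the RoCEv2 GID index (-x 3), and the target address (192.0.2.11) are placeholders that must be adapted to the actual environment:

# On the target node: start the ib_write_bw responder on the 400G adapter
ib_write_bw -d mlx5_0 -x 3 -q 4 -D 30 --report_gbits

# On the initiator node: run the same test against the target node's fabric IP
ib_write_bw -d mlx5_0 -x 3 -q 4 -D 30 --report_gbits 192.0.2.11

With several queue pairs (-q) and a sustained duration in seconds (-D), the reported average bandwidth should approach line rate; on a healthy lossless fabric this result is accompanied by zero drop and pause counters on the switches, as in the queuing output above.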

Sample bill of materials

AI compute nodes, as tested

Part number | Product description | Qty
7D9RCTO1WW | Server: Lenovo ThinkSystem SR675 V3 3yr Warranty with Controlled GPU | 2
BR7F | Lenovo ThinkSystem SR675 V3 8DW PCIe GPU Base | 2
C2AL | Lenovo ThinkSystem AMD EPYC 9535 64C 300W 2.4GHz Processor | 4
C0CK | Lenovo ThinkSystem 64GB TruDDR5 6400MHz (2Rx4) RDIMM-A | 48
C1AE | Lenovo ThinkSystem E3.S PM9D3a 3.84TB Read Intensive NVMe PCIe 5.0 x4 HS SSD | 4
BTMB | Lenovo ThinkSystem 1x4 E3.S Backplane | 2
B8P9 | Lenovo ThinkSystem M.2 NVMe 2-Bay RAID Adapter | 2
BXMH | Lenovo ThinkSystem M.2 PM9A3 960GB Read Intensive NVMe PCIe 4.0 x4 NHS SSD | 4
C0Q4 | Lenovo ThinkSystem NVIDIA BlueField-3 B3140H VPI QSFP112 1P 400G PCIe Gen5 x16 Adapter | 8
BVBG | Lenovo ThinkSystem NVIDIA BlueField-3 B3220 VPI QSFP112 2P 200G PCIe Gen5 x16 Adapter | 2
BYFH | Lenovo ThinkSystem NVIDIA L40S 48GB PCIe Gen4 Passive GPU | 8
BR7L | Lenovo ThinkSystem SR675 V3 x16/x16 PCIe Riser Option Kit | 4
BR7H | Lenovo ThinkSystem SR675 V3 2x16 PCIe Front IO Riser | 2
BR7S | Lenovo ThinkSystem SR675 V3 Switched 4x16 PCIe DW GPU Direct RDMA Riser | 4
BKTJ | Lenovo ThinkSystem 2600W 230V Titanium Hot-Swap Gen2 Power Supply | 8
6252 | 2.5m, 16A/100-250V, C19 to C20 Jumper Cord | 8
C3KA | Lenovo ThinkSystem SR670 V2/SR675 V3 Heavy Systems Toolless Slide Rail Kit | 2
C3EF | Lenovo ThinkSystem SR675 V3 System Board v2 | 2
BABV | Lenovo ThinkSystem Screw for fix M.2 Adapter | 2
BR7U | Lenovo ThinkSystem SR675 V3 Root of Trust Module | 2
BFTH | Lenovo ThinkSystem SR670 V2/SR675 V3 Front Operator Panel ASM | 2
BFD6 | Lenovo ThinkSystem SR670 V2/SR675 V3 Power Mezzanine Board | 2
C5WW | Lenovo ThinkSystem SR675 V3 Dual Rotor System High Performance Fan | 10
BTME | Lenovo ThinkSystem SR675 V3 E3.S Backplane Cage Assembly for 8DW PCIe GPU Base | 2
5641PX3 | Lenovo XClarity Pro, Per Endpoint w/3 Yr SW S&S | 2
1340 | Lenovo XClarity Pro, Per Managed Endpoint w/3 Yr SW S&S | 2
7Q01CTSAWW | SERVER KEEP YOUR DRIVE ADD-ON | 2
7Q01CTS4WW | SERVER PREMIER 24X7 4HR RESP | 2

Service nodes

Part number | Description | Qty
7D9GCTO1WW | Server: Lenovo ThinkSystem SR635 V3 - 3yr Warranty |
BLK4 | Lenovo ThinkSystem V3 1U 10x2.5" Chassis | 1
BVGL | Data Center Environment 30 Degree Celsius / 86 Degree Fahrenheit | 1
C2AQ | Lenovo ThinkSystem AMD EPYC 9335 32C 210W 3.0GHz Processor | 1
BQ26 | Lenovo ThinkSystem SR645 V3/SR635 V3 1U High Performance Heatsink | 1
C1PL | Lenovo ThinkSystem 32GB TruDDR5 6400MHz (1Rx4) RDIMM-A | 12
BC4V | Non RAID NVMe | 1
C0ZU | Lenovo ThinkSystem 2.5" U.2 VA 3.84TB Read Intensive NVMe PCIe 5.0 x4 HS SSD | 2
BPC9 | Lenovo ThinkSystem 1U 4x 2.5" NVMe Gen 4 Backplane | 1
B5XJ | Lenovo ThinkSystem M.2 SATA/NVMe 2-Bay Adapter | 1
BTTY | M.2 NVMe | 1
BKSR | Lenovo ThinkSystem M.2 7450 PRO 960GB Read Intensive NVMe PCIe 4.0 x4 NHS SSD | 2
BQBN | Lenovo ThinkSystem NVIDIA ConnectX-7 NDR200/200GbE QSFP112 2-port PCIe Gen5 x16 Adapter | 1
BLK7 | Lenovo ThinkSystem SR635 V3/SR645 V3 x16 PCIe Gen5 Riser 1 | 1
BLK9 | Lenovo ThinkSystem V3 1U MS LP+LP BF Riser Cage | 1
BNFG | Lenovo ThinkSystem 750W 230V/115V Platinum Hot-Swap Gen2 Power Supply v3 | 2
BH9M | Lenovo ThinkSystem V3 1U Performance Fan Option Kit v2 | 7
BLKD | Lenovo ThinkSystem 1U V3 10x2.5" Media Bay w/ Ext. Diagnostics Port | 1
7Q01CTS2WW | 5Yr Premier NBD Resp + KYD SR635 V3 | 1

Converged network switches

Part number | Description | Qty
7DLKCTO1WW | Cisco Nexus 9300-GX2 Series Switch (N9K-C9364D-GX2A) |
C5P0 | Cisco Nexus 9300-GX2 Series Switch (N9K-C9364D-GX2A) | 2
C6FK | Mode selection between Cisco ACI® and Cisco NX-OS (MODE-NXOS) | 2
6252 | 2.5m, 16A/100-250V, C19 to C20 Jumper Cord | 4
C1P1TN9300XF2-5Y | 5 Years (60 months) Cisco® software Premier license | 2

Out-of-band network switches

Part number | Description | Qty
7DL8CTO1WW | Cisco Nexus 9300-FX3 Series Switch (N9K-C93108TC-FX3) |
C5PB | Cisco Nexus 9300-FX3 Series Switch (N9K-C93108TC-FX3) | 2
C6FK | Mode selection between Cisco ACI and Cisco NX-OS (MODE-NXOS) | 2
6252 | 2.5m, 16A/100-250V, C19 to C20 Jumper Cord | 4
C1P1TN9300XF-5Y | 5 Years (60 months) Cisco software Premier license | 2

Learn more

The Lenovo Hybrid AI 285 product guide and configurator tool are available at:
https://lenovopress.lenovo.com/lp2236-lenovo-hybrid-ai-285-with-cisco-networking

 
