Scaling FlexPod for GPU Intensive Applications

Bias-Free Language

The documentation set for this product strives to use bias-free language. For the purposes of this documentation set, bias-free is defined as language that does not imply discrimination based on age, disability, gender, racial identity, ethnic identity, sexual orientation, socioeconomic status, and intersectionality. Exceptions may be present in the documentation due to language that is hardcoded in the user interfaces of the product software, language used based on RFP documentation, or language that is used by a referenced third-party product. Learn more about how Cisco is using Inclusive Language.

Published:  December 2023


About the Cisco Validated Design Program

The Cisco Validated Design (CVD) program consists of systems and solutions designed, tested, and documented to facilitate faster, more reliable, and more predictable customer deployments. For more information, go to: http://www.cisco.com/go/designzone.

Executive Summary

Organizations across industries face competitive pressure to improve quality, boost productivity, accelerate digital transformation, and decrease time to market, costs, and risks. The exponential growth in data is propelling the adoption of GPU-intensive applications, HPC, and AI technologies across multiple domains, as organizations seek to extract meaningful insights, drive innovation, and make data-driven decisions in an increasingly data-rich world.

The definition of High-Performance Computing (HPC) is constantly changing. Traditionally, HPC focused on latency-sensitive transports and the highest levels of parallelization; GPUs take this evolution a step further. Combining GPU-accelerated applications with AI has led to significant breakthroughs in fields such as healthcare, finance, materials science, and climate research. This combination allows researchers to work with more complex models, simulate real-world scenarios, and analyze large datasets for data-driven discoveries. To fully leverage these capabilities, a well-balanced and high-performing infrastructure encompassing compute, network, and storage is necessary. FlexPod facilitates this comprehensive approach.

AI has transformed how IT systems and services are developed, managed, and utilized. AI-native infrastructure includes hardware accelerators such as Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and Tensor Processing Units (TPUs) to handle AI computation effectively at scale. Such infrastructure is capable of extracting insights, learning from data patterns, generating predictions, and making intelligent decisions, all at a scale and speed previously unattainable. The convergence of HPC and AI is reshaping the future data center design paradigm.

The convergence of GPU Intensive Applications, High-Performance Computing (HPC) and Artificial Intelligence (AI) technologies in a unified architecture brings about enhanced performance, scalability, and versatility for data-intensive tasks. This integration seamlessly incorporates popular AI development frameworks like TensorFlow and PyTorch, allowing data scientists and researchers to develop, train, and deploy AI models within the HPC environment. The fusion of custom solutions, hyperparameter optimization, and cross-validation enables the fine-tuning of models, empowering the exploration of complex scenarios and analysis of vast datasets in fields such as healthcare, finance, materials science, and climate research. Groundbreaking advancements are evident in applications like climate modeling, weather prediction, finance (risk analysis and fraud detection), energy exploration, material science, and healthcare (drug discovery and patient care).

While challenges exist, addressing them through strategic approaches involving talent development, best practices, robust data governance, and awareness of evolving technologies and regulations can unlock the transformative benefits of HPC and AI integration. This collaborative approach fosters improved decision-making, scientific advancements, and innovation across diverse industries and research domains.

This document summarizes results from SPEChpc 2021 benchmark applications that represent real-life model simulations. It demonstrates linear scalability when executing datasets of various sizes on an HPC cluster of eight Cisco UCS C240 M7 Rack Servers hosting NVIDIA A100-80G GPUs in a FlexPod architecture. The application fields include, but are not limited to, the following:

   Weather simulation – Weather forecasting and climate modeling, agriculture, aviation, natural disaster prediction and prevention, renewable energy.

   Nuclear engineering (radiation transport) – Nuclear energy, radiation shielding, nuclear security and safeguards, medical imaging, and treatment.

   High performance geometric multigrid – Biomedical simulation, oil and gas reservoir simulation, Fluid Dynamics and Aerodynamics, structural mechanics.

 

FlexPod is designed to meet the demands of AI workloads. It offers:

   Simplified deployment and operation of general-purpose AI workloads.

   Seamless integration into AI eco systems.

   Operational simplicity and efficiency.

   Accelerated time to value and faster AI implementation.

   Protection of AI infrastructure to safeguard systems, management, data, and applications.

   Linear Scalability: Demonstrated through benchmark tests, showcasing consistent performance even with varying dataset sizes.

   Centralized Management and Automation: Powered by Cisco Intersight, FlexPod reduces deployment times, optimizes resource utilization, minimizes energy consumption, and streamlines operations.

   NVIDIA HPC-X Software Toolkit Setup and Configuration: We've validated FlexPod using the NVIDIA HPC-X software toolkit, ensuring seamless integration and optimal performance. This toolkit harnesses the power of technologies like MPI (Message Passing Interface), OpenACC (Open Accelerators), and UCX (Unified Communication X) to enhance the capabilities of our solution.

   NetApp Tools: NetApp DataOps Toolkit is a Python library that makes it easy for developers, data scientists, and data engineers to perform numerous data management tasks.

   Comprehensive Testing for Real-World Workloads: Our rigorous testing process evaluates the scalability of FlexPod across different dataset sizes and application areas. We compare CPU-only performance with GPU-equipped systems, providing valuable insights into the capabilities of our solution.

The pre-validated design of FlexPod, combined with the centralized management and automation capabilities of Cisco Intersight, reduces deployment times, optimizes resource utilization, minimizes energy consumption, and streamlines operations, leading to better Total Cost of Ownership (TCO) and improved Return on Investment (ROI).
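The linear-scalability and CPU-versus-GPU comparisons mentioned in this summary are commonly quantified as speedup and parallel efficiency. The following is a minimal sketch of that arithmetic; the function names and the sample runtimes are illustrative placeholders, not measurements from this validation.

```python
def speedup(baseline_seconds: float, scaled_seconds: float) -> float:
    """Speedup of a scaled run relative to the baseline run."""
    return baseline_seconds / scaled_seconds


def parallel_efficiency(baseline_seconds: float, scaled_seconds: float,
                        baseline_nodes: int, scaled_nodes: int) -> float:
    """Fraction of ideal (linear) scaling achieved; 1.0 means perfectly linear."""
    ideal_speedup = scaled_nodes / baseline_nodes
    return speedup(baseline_seconds, scaled_seconds) / ideal_speedup


# Hypothetical runtimes in seconds for one fixed-size dataset (placeholders only).
runtimes = {1: 800.0, 2: 410.0, 4: 215.0, 8: 112.0}

for nodes, seconds in runtimes.items():
    s = speedup(runtimes[1], seconds)
    e = parallel_efficiency(runtimes[1], seconds, 1, nodes)
    print(f"{nodes} node(s): speedup {s:.2f}x, efficiency {e:.0%}")
```

The same `speedup` function can compare a CPU-only run against a GPU-equipped run of the same dataset by passing the CPU runtime as the baseline.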

Solution Overview

This chapter contains the following:

   Audience

   Purpose of this Document

   What’s New in this Release?

   Solution Summary

The FlexPod AI (Artificial Intelligence) solution for HPC and AI workloads aims to integrate seamlessly with the current FlexPod portfolio, enabling a single architecture that can be sized and optimized for GPU acceleration and faster access to data through a high-speed data fabric.

This document describes the steps to install and configure the Cisco UCS C240 M7 Rack Server with NVIDIA GPUs in FlexPod AI. The deployment details can be extended to other Cisco UCS servers with supported NVIDIA GPUs, such as Cisco UCS C22X M7, C24X M7, and Cisco UCS X210c compute nodes with the X440p PCIe node, within FlexPod AI.

Audience

The intended audience for this document includes, but is not limited to, sales engineers, field consultants, professional services, IT managers, IT engineers, partners, and customers who want to take advantage of an infrastructure that caters to diverse workloads (HPC, AI/ML, and analytics) in a single architecture, delivering IT efficiency and enabling IT innovation.

Purpose of this Document

This document serves as a comprehensive guide for integrating the Cisco Intersight-managed Cisco UCS M7 servers into the FlexPod AI infrastructure. It provides essential design guidance, covering various elements and considerations necessary for a successful deployment. This document highlights the significant value of horizontal scaling in accelerating applications through the addition of GPU and CPU resources. Horizontal scaling enables organizations to harness the power of multiple GPUs and CPUs, unlocking enhanced processing capabilities and improved performance for AI/ML workloads. By leveraging the scalability of the Cisco UCS M7 servers, businesses can achieve optimal utilization of computational resources and drive breakthroughs in fields that demand intensive computing power. Additionally, this document emphasizes the design and product requirements for incorporating scalable AI/ML solutions to address high-performance computing use cases, including weather modeling, high energy physics, scientific experiment simulation, and life sciences.

Furthermore, through the integration of Cisco Intersight management and the FlexPod AI infrastructure, this document provides valuable insights into the best practices and considerations for deploying a scalable and high-performing solution. It empowers organizations to leverage advanced technologies and accelerate their AI/ML initiatives, ultimately driving innovation and achieving transformative results in their respective industries.

What’s New in this Release?

The following design elements distinguish this version of FlexPod from previous models:

     Optimized integration of Cisco UCS C240 M7 servers with NVIDIA A100 GPU into the platform design.

     Scalable HPC cluster with GPU (NVIDIA A100) and CPU (4th Gen Intel Xeon Scalable processors) based practices.

     Support for 4th Gen Intel Xeon Scalable Processors (Sapphire Rapids) with up to 60 cores per processor and up to 8TB memory with 32 x 256GB DDR5-4800 DIMMs.

     Integration of NetApp A400 NVMe based all flash storage system to support AI/ML dataset.

     NetApp ONTAP 9.12.1.

     FlexPod AI architecture with end-to-end 100Gbps.

     Cisco Intersight managed stand-alone UCS C-Series rack server connected to Cisco Nexus switch.

     Cisco Intersight automated operating system installation.

     Ansible for post-OS configuration of HPC-AI cluster on bare metal.

Solution Summary

The FlexPod AI solution offers the following key benefits:

   Converged Infrastructure: FlexPod provides a pre-validated and integrated solution that combines compute, storage, and networking components, streamlining deployment and management.

   Modular Scalability: With its modular architecture, FlexPod allows for the independent scaling of resources, enabling organizations to adapt to changing workload demands.

   Simplified Management: FlexPod includes management and orchestration tools that automate provisioning and ensure consistent and compliant infrastructure management.

   Multitenancy Support: It offers multitenancy capabilities, allowing multiple workloads or applications to run concurrently while maintaining isolation and security.

   Compatibility Assurance: Cisco and NetApp certification processes ensure that all components are compatible and reliable, reducing the risk of integration issues and providing stable and supported infrastructure.

   Investment protection: as technology evolves, businesses can add new components or upgrade existing ones to adapt to changing workload demands while maintaining the integrity of their initial investment in FlexPod.

Technology Overview

This chapter contains the following:

   FlexPod AI

   Cisco Unified Computing System

   Cisco UCS C-Series Rack Server

   Cisco Nexus 93600CD-GX

   NetApp AFF A-Series Storage

FlexPod AI

FlexPod is an integrated data center solution that combines compute, storage, and networking components, simplifying deployment, and offering scalability. It ensures compatibility and supports multiple workloads, making it an efficient and adaptable choice for modern data centers.

Figure 1.   FlexPod AI components


Go to the Cisco Design Zone for pre-validated FlexPod Design Guides containing a wide variety of enterprise applications.

Cisco Unified Computing System

Cisco Unified Computing System (Cisco UCS) is a next-generation data center platform that integrates compute, network, storage, and virtualization resources into a single cohesive architecture. This simplifies data center management, improves operational efficiency, reduces total cost of ownership, and increases business agility.

Cisco UCS Manager is the central management software that provides a unified interface for configuring, monitoring, and automating the entire UCS infrastructure, streamlining operations and enabling rapid resource provisioning. Cisco Intersight is a cloud-based management platform that offers enhanced visibility, analytics, and automation for Cisco UCS and other Cisco infrastructure, providing a scalable and intelligent solution for optimizing data center operations and achieving agility in the modern IT landscape.

Cisco UCS Differentiators

Cisco Unified Computing System is revolutionizing the way servers are managed in the datacenter. The following are the unique differentiators of Cisco Unified Computing System and Cisco UCS Manager:

   Embedded Management: In Cisco UCS, the servers are managed by the embedded firmware in the Fabric Interconnects, eliminating the need for any external physical or virtual devices to manage the servers.

   Unified Fabric: In Cisco UCS, from blade server chassis or rack servers to the Fabric Interconnect (FI), a single Ethernet cable is used for LAN, SAN, and management traffic. This converged I/O results in fewer cables, SFPs, and adapters, reducing the capital and operational expenses of the overall solution.

   Auto Discovery: By simply inserting the blade server in the chassis or connecting the rack server to the fabric interconnect, discovery and inventory of compute resources occurs automatically without any management intervention. The combination of unified fabric and auto-discovery enables the wire-once architecture of Cisco UCS, where compute capability of Cisco UCS can be extended easily while keeping the existing external connectivity to LAN, SAN, and management networks.

Cisco Intersight

Cisco Intersight is a lifecycle management platform for your infrastructure, regardless of where it resides. In your enterprise data center, at the edge, in remote and branch offices, at retail and industrial sites—all these locations present unique management challenges and have typically required separate tools. Cisco Intersight Software as a Service (SaaS) unifies and simplifies your experience of the Cisco Unified Computing System (Cisco UCS). See Figure 2.

Figure 2.          Cisco Intersight


Cisco UCS C-Series Rack Server

Cisco UCS C-Series Rack-Mount Servers keep pace with Intel Xeon and AMD EPYC processor innovation by offering the latest processors with increased processor frequency and improved security and availability features. Cisco UCS C-Series servers offer an improved price-to-performance ratio. They also extend Cisco UCS innovations to an industry-standard rack-mount form factor, including a standards-based unified network fabric, Cisco VN-Link virtualization support, and Cisco Extended Memory Technology.

Cisco UCS C-Series servers are designed to operate both in standalone environments and as part of Cisco UCS managed by Cisco Intersight or Cisco UCS Manager. They enable organizations to deploy systems incrementally, using as many or as few servers as needed, on a schedule that best meets the organization’s timing and budget.

Cisco UCS C240 M7 Rack Server

The Cisco UCS C240 M7 Rack Server extends the capabilities of the Cisco UCS rack server portfolio with up to two 4th Gen Intel Xeon Scalable CPUs, with up to 60 cores per socket. The maximum memory capacity for 2 CPUs is 8 TB (32 x 256 GB DDR5 4800 MT/s DIMMs). The Cisco UCS C240 M7 has a 2-Rack-Unit (2RU) form factor and supports up to 8 PCIe 4.0 slots or up to 4 PCIe 5.0 slots, plus a modular LAN on motherboard (mLOM) slot. Up to five GPUs are supported. The server delivers significant performance and efficiency gains that will improve your application performance.
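The memory figures quoted for this server can be checked with simple arithmetic. The sketch below uses only the DIMM counts and sizes stated in this document (the 32 x 256 GB maximum above and the sixteen 64 GB RDIMMs listed in Table 2); the variable names are illustrative.

```python
# Maximum configuration stated in the text: 32 DIMMs of 256 GB each.
MAX_DIMMS = 32
MAX_DIMM_SIZE_GB = 256

max_memory_gb = MAX_DIMMS * MAX_DIMM_SIZE_GB
max_memory_tb = max_memory_gb / 1024  # memory is conventionally quoted in binary units

# Validated configuration from Table 2: sixteen 64 GB RDIMMs per server.
validated_memory_gb = 16 * 64

print(f"Maximum:   {max_memory_gb} GB = {max_memory_tb:.0f} TB")
print(f"Validated: {validated_memory_gb} GB per server")
```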

Figure 3.          Cisco UCS C240 M7 Rack Server – front and rear image


For more details, go to: Cisco UCS C240 M7 Rack Server Data Sheet.

Cisco Nexus 93600CD-GX

Based on Cisco Cloud Scale technology, the Cisco Nexus 9300-GX switches are the next generation of fixed Cisco Nexus 9000 Series Switches capable of supporting 400 Gigabit Ethernet (GE). With the increase in use cases for applications requiring Artificial Intelligence (AI) and Machine Learning (ML), the platform addresses the need for high-performance, power-efficient, compact switches in the networking infrastructure. These switches are designed to support 100G and 400G fabrics for mobile service provider environments, including the network edge, 5G, IoT, Professional Media Networking platform (PMN), and Network Functions Virtualization (NFV).

The Cisco Nexus 93600CD-GX Switch (Figure 4) is a 1RU switch that supports 12 Tbps of bandwidth and 4.0 billion packets per second (bpps) across 28 fixed 40/100G QSFP-28 ports and 8 fixed 10/25/40/50/100/200/400G QSFP-DD ports.
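The 12 Tbps figure is consistent with the port counts above if every port runs at its maximum speed and both traffic directions are counted. The sketch below works through that arithmetic; the full-duplex interpretation is an assumption made for illustration, not a statement taken from the data sheet.

```python
# (port count, maximum speed in Gbps) taken from the paragraph above.
ports = [
    (28, 100),  # 28 fixed ports at up to 100 Gbps
    (8, 400),   # 8 fixed QSFP-DD ports at up to 400 Gbps
]

one_way_gbps = sum(count * speed for count, speed in ports)

# Assumption: the quoted switch bandwidth counts transmit and receive together.
full_duplex_tbps = 2 * one_way_gbps / 1000

print(f"One direction: {one_way_gbps} Gbps")
print(f"Full duplex:   {full_duplex_tbps} Tbps")
```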

Cisco provides two modes of operation for Cisco Nexus 9000 Series Switches. Organizations can deploy Cisco Application Centric Infrastructure (Cisco ACI) or Cisco NX-OS mode.

Figure 4.          Cisco Nexus 93600CD-GX Switch

For more details, see https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/nexus-9300-gx-series-switches-ds.html

NetApp AFF A-Series Storage

NetApp AFF A-Series controller lineup provides industry leading performance while continuing to provide a full suite of enterprise-grade data management and data protection features. NetApp AFF A-Series systems support end-to-end NVMe technologies, from NVMe-attached SSDs to frontend NVMe over Fibre Channel (NVMe/FC) host connectivity. These systems deliver enterprise class performance, making them a superior choice for driving the most demanding workloads and applications. With a simple software upgrade to the modern NVMe/FC SAN infrastructure, you can drive more workloads with faster response times, without disruption or data migration. Additionally, more organizations are adopting a “cloud first” strategy, driving the need for enterprise-grade data services for a shared environment across on-premises data centers and the cloud. As a result, modern all-flash arrays must provide robust data services, integrated data protection, seamless scalability, and new levels of performance — plus deep application and cloud integration. These new workloads demand performance that first-generation flash systems cannot deliver.

For more information about the NetApp AFF A-series controllers, see the AFF product page: https://www.netapp.com/us/products/storage-systems/all-flash-array/aff-a-series.aspx.

You can view or download more technical specifications of the NetApp AFF A-series controllers here: https://www.netapp.com/us/media/ds-3582.pdf

NetApp AFF A400

The NetApp AFF A400 offers full end-to-end NVMe support. The frontend NVMe/FC connectivity makes it possible to achieve optimal performance from an all-flash array for workloads that include artificial intelligence, machine learning, and real-time analytics as well as business-critical databases. On the back end, the A400 supports both serial-attached SCSI (SAS) and NVMe-attached SSDs, offering the versatility for current customers to move up from their legacy A-Series systems and satisfying the increasing interest that all customers have in NVMe-based storage.

The NetApp AFF A400 offers greater port availability, network connectivity, and expandability. The NetApp AFF A400 has 10 PCIe Gen3 slots per high availability pair. The NetApp AFF A400 offers 25GbE or 100GbE, as well as 32Gb/FC and NVMe/FC network connectivity. This model was created to keep up with changing business needs and performance and workload requirements by merging the latest technology for data acceleration and ultra-low latency in an end-to-end NVMe storage system.

Figure 5.          NetApp AFF A400 Storage – front and rear image


NetApp ONTAP 9

NetApp storage systems harness the power of ONTAP to simplify the data infrastructure from edge, core, and cloud with a common set of data services and 99.9999 percent availability. NetApp ONTAP 9 data management software from NetApp enables customers to modernize their infrastructure and transition to a cloud-ready data center. ONTAP 9 has a host of features to simplify deployment and data management, accelerate and protect critical data, and make infrastructure future-ready across hybrid-cloud architectures.

NetApp ONTAP 9 is the data management software that is used with the NetApp AFF A400 all-flash storage system in this solution design. ONTAP software offers secure unified storage for applications that read and write data over block- or file-access protocol storage configurations. These storage configurations range from high-speed flash to lower-priced spinning media or cloud-based object storage. ONTAP implementations can run on NetApp engineered FAS or AFF series arrays and in private, public, or hybrid clouds (NetApp Private Storage and NetApp Cloud Volumes ONTAP). Specialized implementations offer best-in-class converged infrastructure, featured here as part of the FlexPod AI solution or with access to third-party storage arrays (NetApp FlexArray virtualization). Together these implementations form the basic framework of the NetApp Data Fabric, with a common software-defined approach to data management, and fast efficient replication across systems. FlexPod and ONTAP architectures can serve as the foundation for both hybrid cloud and private cloud designs.

Read more about all the capabilities of ONTAP data management software here: https://www.netapp.com/us/products/data-management-software/ontap.aspx

ONTAP 9.12 brings additional enhancements in manageability, data protection, networking and security protocols, and SAN and object storage. It also includes updated hardware support, increased MetroCluster IP solution scale, and support for IP-routed MetroCluster IP backend connections. For more details, see the ONTAP 9.12.1 release notes: https://docs.netapp.com/us-en/cloud-volumes-ontap-9121-relnotes/

FlexClone

NetApp FlexClone technology enables instantaneous point-in-time copies of a FlexVol volume without consuming any additional storage until the cloned data changes from the original. FlexClone volumes add extra agility and efficiency to storage operations. They take only a few seconds to create and do not interrupt access to the parent FlexVol volume. FlexClone volumes use space efficiently, applying the ONTAP architecture to store only data that changes between the parent and clone. FlexClone volumes are suitable for testing or development environments, or any environment where progress is made by locking-in incremental improvements. FlexClone volumes also benefit any business process where you must distribute data in a changeable form without endangering the integrity of the original.
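The space-efficiency idea described above (a clone shares unchanged data with its parent and stores only what differs) can be illustrated with a toy copy-on-write model. This is a conceptual sketch only: the `Volume` class and its methods are hypothetical and do not represent how ONTAP or FlexClone is implemented.

```python
class Volume:
    """Toy block store illustrating copy-on-write clones (not ONTAP internals)."""

    def __init__(self, base=None):
        self._base = base if base is not None else {}  # shared point-in-time image, never mutated
        self._delta = {}                               # blocks written since that image was taken

    def write(self, block, data):
        self._delta[block] = data

    def read(self, block):
        if block in self._delta:
            return self._delta[block]
        return self._base.get(block)

    def clone(self):
        # Freeze the current view; parent and clone both reference it afterwards.
        # Only the block map is merged here, no block data is duplicated.
        frozen = {**self._base, **self._delta}
        self._base, self._delta = frozen, {}
        return Volume(base=frozen)

    def private_blocks(self):
        """Number of blocks this volume stores on its own since the last clone point."""
        return len(self._delta)


gold = Volume()
gold.write(0, "raw-images")
gold.write(1, "labels-v1")

work = gold.clone()                   # point-in-time copy, nothing extra stored yet
work.write(1, "labels-normalized")    # only this changed block is private to the clone

print(gold.read(1), "|", work.read(1), "|", work.private_blocks())
```

In this model the clone consumes additional space only for the one block it modified, and later writes to the parent do not alter what the clone sees.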

NetApp DataOps Toolkit

The NetApp DataOps Toolkit is a Python library that makes it easy for developers, data scientists, and data engineers to perform numerous data management tasks. These tasks include provisioning a new data volume or development workspace, cloning a data volume or development workspace almost instantaneously, and creating a NetApp Snapshot copy of a data volume or development workspace for traceability and baselining. This Python library can function as either a command-line utility or a library of functions that can be imported into any Python program or Jupyter Notebook.

The DataOps Toolkit supports Linux and macOS hosts. The toolkit must be used in conjunction with a NetApp data storage system or service. It simplifies various data management tasks that are executed by the data storage system or service. To facilitate this simplification, the toolkit communicates with the data storage system or service through an API.

The NetApp DataOps Toolkit for Kubernetes abstracts storage resources and Kubernetes workloads up to the data-science workspace level. These capabilities are packaged in a simple, easy-to-use interface that is designed for data scientists and data engineers. Using the familiar form of a Python program, the Toolkit enables data scientists and engineers to provision and destroy JupyterLab workspaces in just seconds. These workspaces can contain terabytes, or even petabytes, of storage capacity, enabling data scientists to store all their training datasets directly in their project workspaces. Gone are the days of separately managing workspaces and data volumes.

Figure 6.          NetApp Data Science Toolkit


AI/ML Use Cases - DataOps for Data Scientist

With the NetApp DataOps Toolkit, a data scientist can almost instantaneously create a space-efficient data volume that’s an exact copy of an existing volume, regardless of the size of the dataset. Data scientists can quickly create clones of datasets that they can reformat, normalize, and manipulate, while preserving the original “gold-source” dataset. Under the hood, these operations use the highly efficient and battle-tested NetApp FlexClone feature, but they can be performed by a data scientist without storage expertise. What used to take days or weeks (and the assistance of a storage administrator) now takes seconds.

Figure 7.          Clone of dataset


Data scientists can also save a space-efficient, read-only copy of an existing data volume. Based on the famed NetApp Snapshot technology, this functionality can be used to version datasets and implement dataset-to-model traceability. In regulated industries, traceability is a baseline requirement, and implementing it is extremely complicated with most other tools. With the NetApp DataOps Toolkit, it’s quick and easy.

More operations and capabilities are available and documented here: https://github.com/NetApp/netapp-data-science-toolkit
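Dataset-to-model traceability, as described above, amounts to recording which read-only dataset version a model was trained on. The following is a minimal sketch of such a record; the function, field names, and sample values are hypothetical and are not part of the NetApp DataOps Toolkit API. It assumes a snapshot has already been created and that its name is known.

```python
import hashlib
import json
from datetime import datetime, timezone


def traceability_record(model_name, volume_name, snapshot_name, hyperparameters):
    """Link a trained model to the dataset snapshot it was trained on."""
    record = {
        "model": model_name,
        "dataset_volume": volume_name,
        "dataset_snapshot": snapshot_name,
        "hyperparameters": hyperparameters,
    }
    # Deterministic identifier derived from the linkage itself (timestamp excluded).
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["record_id"] = hashlib.sha256(payload).hexdigest()[:12]
    record["created_utc"] = datetime.now(timezone.utc).isoformat()
    return record


# Hypothetical names for illustration only.
rec = traceability_record(
    model_name="weather-model-001",
    volume_name="training_data",
    snapshot_name="snap_before_training",
    hyperparameters={"learning_rate": 0.001, "epochs": 20},
)
print(json.dumps(rec, indent=2))
```

Storing a record like this alongside the model artifact lets an auditor identify the exact dataset version from the snapshot name.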

Solution Design

This chapter contains the following:

   Requirements

   Considerations

Requirements

The FlexPod AI with Cisco UCS and NetApp storage meets the following general design requirements:

   Resilient design across all layers of the infrastructure with no single point of failure.

   Scalable design with the flexibility to add compute capacity, storage, or network bandwidth as needed.

   Modular design that can be replicated to expand and grow as the needs of the business grow.

   Flexible design that can support different models of various components with ease.

   Simplified design with ability to integrate and automate with external automation tools.

   Cloud-enabled design which can be configured, managed, and orchestrated from the cloud using GUI or APIs.

To deliver a solution which meets all these design requirements, various solution components are connected and configured as explained in the following sections.

Physical Components

Table 1 lists the required physical components and hardware for FlexPod AI.

Table 1.      FlexPod AI Hardware Components

| Component | Hardware |
| --- | --- |
| Servers | Eight (8) Cisco UCS C240 M7 rack servers |
| Storage | NetApp AFF A400 |
| Network | Two (2) Cisco Nexus 93600CD-GX |

Table 2.      Cisco UCS C240 M7 Hardware Components

| Component | Hardware |
| --- | --- |
| Processor | Two (2) 4th Gen Intel® Xeon® Scalable Processor 6454S 32C/2.2GHz/270W |
| Memory | Sixteen (16) 64GB DDR5-4800 RDIMM |
| NIC (Network Interface Card) | Mellanox ConnectX-6 DX 2 x 100Gb Ethernet |
| GPU (Graphics Processing Unit) | NVIDIA Tesla A100-80GB GPU |

Table 3.      NetApp AFF A400 Components

| Component | Hardware |
| --- | --- |
| AFF Flash Array | NetApp All Flash AFF A400 Storage Array (4RU) |
| Capacity | 27.8TB (12 x 3.49TB NVMe SSD Drives) |
| Connectivity | 4 x 100Gb/s (2 x 100Gb per controller); data rate: 100 Gb/s Ethernet, PCI Express Gen3: SERDES @ 8.0GT/s, 16 lanes (MCX516A-CCAT); 1 Gb/s redundant Ethernet (management port) |

Software Components

Table 4 lists the software components and the versions required for FlexPod AI as tested and validated in this document.

Table 4.      FlexPod AI Software Components

| Component | Software version |
| --- | --- |
| Cisco Intersight | SaaS platform |
| Cisco UCS Server Firmware | 4.3(2.230207) |
| Host OS | Ubuntu 22.04 LTS |
| Mellanox ConnectX-6 NIC | 22.36.1010; MLNX OFED 5.8-1.1.2.1 |
| NVIDIA Tesla A100-80GB GPU | NVIDIA CUDA 12.2.2; NVIDIA Driver 535.104.05 |
| NetApp AFF A400 | ONTAP 9.12.1 |
| Cisco Nexus 93600CD-GX | NX-OS 10.3.3 |

Note:      See the Bill of Materials section for a complete list and corresponding PID.
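When reproducing this validation, it can help to compare the versions found in an environment against the validated versions in Table 4. The sketch below shows one possible way to do that; the dictionary keys and the `find_mismatches` helper are hypothetical labels chosen for illustration, and gathering the observed versions from each device is left outside the example.

```python
# Validated versions copied from Table 4; the keys are illustrative labels.
VALIDATED = {
    "ucs_server_firmware": "4.3(2.230207)",
    "host_os": "Ubuntu 22.04 LTS",
    "mlnx_ofed": "5.8-1.1.2.1",
    "nvidia_cuda": "12.2.2",
    "nvidia_driver": "535.104.05",
    "ontap": "9.12.1",
    "nxos": "10.3.3",
}


def find_mismatches(observed):
    """Return {component: (validated, observed)} for each differing or missing component."""
    return {
        name: (expected, observed.get(name))
        for name, expected in VALIDATED.items()
        if observed.get(name) != expected
    }


# Hypothetical observed environment: one driver differs from the validated version.
observed = dict(VALIDATED, nvidia_driver="535.54.03")
for name, (expected, found) in find_mismatches(observed).items():
    print(f"{name}: validated {expected}, found {found}")
```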

Required VLANs

Table 5 lists various VLANs configured for setting up the FlexPod environment including their specific usage.

Table 5.       FlexPod AI – VLAN Usage

| VLAN ID | Name | Usage |
| --- | --- | --- |
| 2 | Native-VLAN | Use VLAN 2 as Native VLAN instead of default VLAN (1) |
| 248 | OOB-MGMT-VLAN | Management VLAN to access and manage the servers |
| 110 | AI-ML-NFS_1 | NFS VLAN to access AI/ML NFS volume hosting HPC data |
| 160 | AI-ML-NFS_2 | NFS VLAN to access AI/ML NFS volume hosting HPC data |

Some of the key highlights of VLAN usage are as follows:

   Bare-metal servers are managed using the same in-band management VLAN: IB-MGMT-VLAN (248).

   Utilizing dedicated NFS VLANs for HPC and AI hosts provides path selection flexibility and the ability to configure specific QoS policies. You are encouraged to use separate, dedicated VLANs for NFS traffic.
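The VLAN plan in Table 5 can be kept as data and rendered into switch configuration lines. The following is a minimal sketch, assuming the general NX-OS `vlan <id>` and `name <name>` form; the `render_vlan_config` helper is hypothetical, and the output should be checked against the switch documentation before use.

```python
# VLAN IDs and names copied from Table 5.
VLANS = [
    (2, "Native-VLAN"),
    (248, "OOB-MGMT-VLAN"),
    (110, "AI-ML-NFS_1"),
    (160, "AI-ML-NFS_2"),
]


def render_vlan_config(vlans):
    """Render VLAN definitions as configuration lines, ordered by VLAN ID."""
    lines = []
    for vlan_id, name in sorted(vlans):
        if not 1 <= vlan_id <= 4094:
            raise ValueError(f"VLAN ID out of range: {vlan_id}")
        lines.append(f"vlan {vlan_id}")
        lines.append(f"  name {name}")
    return "\n".join(lines)


print(render_vlan_config(VLANS))
```

Keeping the plan in one structure makes it easier to apply the same definitions consistently to both switches.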

Physical Topology

Figure 8.          FlexPod AI Physical Topology

For this solution, we tested the left-side configuration shown in Figure 8, with Cisco UCS C-Series Rack Servers and a NetApp AFF A400 storage array connected to Cisco Nexus 93600CD-GX leaf switches in a Layer 2 configuration for single-rack testing. For large clusters spanning multiple racks and for spine/leaf networking best practices, refer to the Cisco Data Center Networking Blueprint for AI/ML Applications.

Note: This solution was tested with the Cisco VIC 15428 (4x 10/25/50G mLOM C-Series) providing a 10GbE connection to the ToR switch, which separates OS management traffic from the data traffic carried on the Mellanox ConnectX-6 DX Ethernet NIC.

Logical Topology

The single-rack topology of the FlexPod AI architecture highlighted in this CVD can be easily expanded to a larger cluster connecting hundreds of servers using a spine-leaf network design, where congestion management is enabled on the leaf and spine layers with Priority Flow Control (PFC) and Explicit Congestion Notification (ECN), as shown in Figure 9.

Figure 9.          FlexPod AI Logical Topology

Considerations

The information in this section is provided as a reference for cabling the physical equipment in a FlexPod environment. To simplify cabling requirements, a cabling diagram was used.

The cabling diagram in this section contains the details for the prescribed and supported configuration of the NetApp AFF A400 running NetApp ONTAP 9.12.1.

Tech tip

For any modifications of this prescribed architecture, consult the NetApp Interoperability Matrix Tool (IMT).

This document assumes that out-of-band management ports are plugged into an existing management infrastructure at the deployment site. These interfaces will be used in various configuration steps.

Be sure to use the cabling directions in this section as a guide.

The NetApp storage controller and disk shelves should be connected according to best practices for the specific storage controller and disk shelves. For disk shelf cabling, refer to NetApp Support.

Figure 10 details the cable connections used in the validation lab for the FlexPod topology based on the Cisco UCS C240 M7 stand-alone server managed via Intersight and the NetApp AFF A400 storage array. Two 100Gb uplinks connect as port-channels from each Cisco Nexus 93600CD-GX switch to the NetApp AFF controllers. Additional 1Gb management connections are needed for an out-of-band network switch that sits apart from the FlexPod infrastructure. Each Cisco UCS C240 M7 rack server and Cisco Nexus switch is connected to the out-of-band network switch, and each AFF controller has a connection to the out-of-band network switch. Layer 3 network connectivity is required between the Out-of-Band (OOB) and In-Band (IB) Management Subnets.

Figure 10.       Physical cabling for Cisco UCS C-Series

Install and Configure

This chapter contains the following:

   Network Switch Configuration

   NetApp ONTAP Storage Configuration

   Install Cisco UCS

   Cisco Intersight Configuration

   Post-OS Configuration

   Mellanox ConnectX-6 NIC Best Practice

   NVIDIA GPU Configuration for HPC Workload

Network Switch Configuration

Designing a network infrastructure for an HPC and AI cluster requires careful consideration to ensure efficient data transfer, low latency, and scalability. The network configuration should be tuned to match the specific requirements of your HPC and AI workloads, with QoS (Quality of Service) implemented to ensure efficient utilization and monitoring of the available network resources.

Please refer to Cisco Data Center Networking Blueprint for AI/ML Applications for more detailed documentation on best practices for lossless network design and configuration required for HPC and AI workload.

This section provides a detailed procedure for configuring the Cisco Nexus 93600CD-GX switches for use in a FlexPod AI/ML environment. The Cisco Nexus 93600CD-GX will be used for LAN switching in this solution. This configuration allows deployment of Cisco AI/ML platforms in bare-metal server configuration.

Tech tip

The following procedures describe how to configure the Cisco Nexus switches for use in a base FlexPod environment. This procedure assumes the use of Cisco Nexus 9000 NX-OS 10.3.3F.

The following procedure includes the setup of NTP distribution on both the mgmt0 port and the in-band management VLAN. The interface-vlan feature and ntp commands are used to set this up. This procedure also assumes that the default VRF is used to route the in-band management VLAN.

This procedure sets up an uplink virtual port channel (vPC) with the IB-MGMT and OOB-MGMT VLANs allowed.

This validation assumes that both switches have been reset to factory defaults by using the “write erase” command followed by the “reload” command.

Physical Connectivity

Follow the physical connectivity guidelines for FlexPod as explained in the Physical Topology section.

Initial Configuration

The following procedures describe the basic configuration of the Cisco Nexus switches for use in the FlexPod environment. This procedure assumes the use of Cisco Nexus 9000 10.2(5)M, the Cisco suggested Nexus switch release at the time of this validation.

Procedure 1.       Set Up Initial Configuration from a serial console

Step 1.      Set up the initial configuration for the Cisco Nexus 93600CD-GX switch.

Procedure 2.       Configure Global Settings on both Cisco Nexus Switches

Tech tip

For Remote Direct Memory Access (RDMA) over Converged Ethernet version 2 (RoCEv2) transport, the network must provide high throughput and low latency while avoiding traffic drops when congestion occurs.

The following procedure describes the global configuration of the Cisco Nexus switches, highlighting the RoCEv2, QoS, and no-drop settings. Refer to the Appendix for the full configuration.

Step 1.      Log in as the admin user to Cisco Nexus switch A and run the following commands to set the global configuration on the switch:

configure terminal

policy-map type network-qos qos_network

  class type network-qos c-8q-nq3

    mtu 9216

    pause pfc-cos 3

  class type network-qos c-8q-nq-default

    mtu 9216

class-map type qos match-any CNP

  match dscp 48

class-map type qos match-all ROCEv2

  match dscp 24,26

policy-map type qos QOS_MARKING

  class ROCEv2

    set qos-group 3

  class CNP

    set qos-group 7

  class class-default

    set qos-group 0

policy-map type queuing QOS_EGRESS_PORT

  class type queuing c-out-8q-q6

    bandwidth remaining percent 0

  class type queuing c-out-8q-q5

    bandwidth remaining percent 0

  class type queuing c-out-8q-q4

    bandwidth remaining percent 0

  class type queuing c-out-8q-q3

    bandwidth remaining percent 50

    random-detect minimum-threshold 150 kbytes maximum-threshold 3000 kbytes drop-probability 7 weight 0 ecn

  class type queuing c-out-8q-q2

    bandwidth remaining percent 0

  class type queuing c-out-8q-q1

    bandwidth remaining percent 0

  class type queuing c-out-8q-q-default

    bandwidth remaining percent 50

  class type queuing c-out-8q-q7

    priority level 1

system qos

  service-policy type network-qos qos_network

  service-policy type queuing output QOS_EGRESS_PORT

copy running-config startup-config

Step 2.      Log in as the admin user to Cisco Nexus switch B and repeat Step 1 to set up the global configuration.

Tech tip

Make sure to run copy run start to save the configuration on each switch after the configuration is completed.
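To reason about how the QOS_MARKING policy in Step 1 classifies traffic, its rules can be mirrored in a few lines of shell. This is only an illustrative model of the class maps shown above (DSCP 24 and 26 to qos-group 3, DSCP 48 to qos-group 7, everything else to qos-group 0). The function name qos_group_for_dscp is hypothetical and nothing here runs on the switch.

```shell
#!/bin/sh
# Hypothetical model of the QOS_MARKING policy: given a DSCP value,
# print the qos-group the policy shown in Step 1 would assign.
qos_group_for_dscp() {
    case "$1" in
        24|26) echo 3 ;;   # class ROCEv2 -> no-drop queue 3
        48)    echo 7 ;;   # class CNP    -> strict-priority queue 7
        *)     echo 0 ;;   # class-default
    esac
}

for dscp in 0 24 26 48; do
    printf 'dscp %s -> qos-group %s\n' "$dscp" "$(qos_group_for_dscp "$dscp")"
done
```

A table like this is a convenient cross-check when the host-side DSCP settings are chosen later in this chapter.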

NetApp ONTAP Storage Configuration

NetApp AFF A-Series Storage System

NetApp AFF A-Series is a family of all-flash storage systems by NetApp. The systems are designed to provide a high-performance, low-latency, and highly available storage solution suitable for a wide range of enterprise applications such as AI/ML, databases, and virtualization.

NetApp storage systems support a wide variety of disk shelves and disk drives. The complete list of disk shelves that are supported by the AFF A400 is available at the NetApp Support site.

Follow the physical installation procedures for the controllers found here: https://docs.netapp.com/us-en/ontap-systems/index.html.

When using SAS disk shelves with NetApp storage controllers, refer to: https://docs.netapp.com/us-en/ontap-systems/sas3/install-new-system.html for proper cabling guidelines.

When using NVMe drive shelves with NetApp storage controllers, refer to: https://docs.netapp.com/us-en/ontap-systems/ns224/hot-add-shelf.html for installation and servicing guidelines.

For more information about the setup, configuration, and deployment of the NetApp AFF A-Series storage system in a FlexPod solution, go to:

https://www.cisco.com/c/en/us/solutions/design-zone/data-center-design-guides/flexpod-design-guides.html

https://www.netapp.com/data-storage/flexpod/validated-designs/

Install Cisco UCS

This section contains the following:

   Physical Connectivity

Physical Connectivity

To manage a Cisco UCS C240 M7 Rack Server connected directly to a Cisco Nexus switch, connect the Cisco Integrated Management Controller (CIMC) port to the ToR (top of rack) switch and configure it, as shown in Figure 11.

Tech tip

The Cisco IMC management service is used only when the server is operating in Standalone Mode. If your C-Series server is integrated into a UCS system, you must manage it using a Cisco UCS Fabric Interconnect running in either UCS Manager (UCSM) mode or Intersight Managed Mode (IMM). For information about using UCS Manager, see the configuration guides listed in the Cisco UCS B-Series Servers Documentation Roadmap at http://www.cisco.com/go/unifiedcomputing/b-series-doc.

Figure 11.       Cisco UCS C240 M7 stand-alone rack server connectivity with Cisco Nexus Switches

Cisco Intersight Configuration

This section contains the following:

   Claim Target

   Pool Configuration

   Policy Configuration

   Template and Profile Configuration

   Install Operating System using Cisco Intersight

Cisco Intersight Infrastructure Service enables policy-managed infrastructure configuration for ease of management and monitoring. It also helps you remediate issues and stay ahead with proactive awareness of the Cisco UCS infrastructure.

Tech tip

Please visit Intersight Help Center for more details: https://us-east-1.intersight.com/help/saas/home

This section highlights the initial configuration required to claim a target in a new or existing Cisco Intersight account. Depending on the requirements and the intended configuration of the Cisco Intersight managed infrastructure, either create new pools, policies, templates, and profiles or leverage an existing configuration for faster deployment. GitHub repositories for automated FlexPod deployment and configuration can be found here: https://github.com/orgs/ucs-compute-solutions/repositories

Claim Target

This procedure details how to claim a Cisco UCS standalone server using Cisco Intersight.

Step 1.      Log into CIMC web UI https://<cimc_mgmt_ip> and enter your username and password.

Step 2.       Go to Admin > Device Connector.

Step 3.      Update settings input as appropriate and click Save.

Step 4.      Copy the Device ID and Claim Code; you will enter them when claiming the target in Cisco Intersight.

Step 5.      Log into Cisco Intersight; go to System > Targets. Click Claim a new Target.

Step 6.      Select Cisco UCS Server (standalone) in Compute/Fabric section. Click Start.

Step 7.      Enter the Device ID and Claim Code copied from Cisco IMC to claim the standalone server in Cisco Intersight.

Cisco CIMC and Cisco Intersight WebUI reporting Cisco UCS Rack Server claimed and connected is shown below:

Related image, diagram or screenshot

Related image, diagram or screenshot

Pool Configuration

Tech tip

For more information on licensing requirements for Infrastructure Services, see Infrastructure Services License.

You can either create the required pools and policies in advance and attach them to a server profile, or create them while progressing through the server profile/template creation.

Step 1.      Log into Cisco Intersight. Go to Infrastructure Service.

Step 2.      Select Pools and click Create Pool.

Step 3.      Select Resources. Click Start. Enter a name for the new Resource Pool. Select the Target Platform to add in the resource pool. Select by filtering the set of nodes to be part of the resource pool. For example, UCSC-C240-M7 as shown below:

Related image, diagram or screenshot

Step 4.      Create the  IP Pool for KVM access as shown below:

Related image, diagram or screenshot

Policy Configuration

This section provides the screenshots where input is required to complete the policy creation. The process remains the same, such as selection of organization, entering name for the policy, tags, and description. 

Step 1.      Go to Policies. Select type of policy to create.

Step 2.      Select BIOS and click Start.

Step 3.      We kept all settings as platform-default, except for the following BIOS policy parameters:

   LLC dead line – Disabled

   Intel Virtualization Technology – Disabled

   LLC prefetch – Disabled

   Workload configuration – Balanced

Tech tip

The selection of BIOS parameters determines whether the cluster is optimized for performance, for power savings, or for a balance of the two.

Please review Performance Tuning Best Practices Guide for Cisco UCS M7 Platforms: https://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-b-series-blade-servers/ucs-m7-platforms-wp.html

Step 4.      Create Boot order policy. Select UCS Standalone Server. Select Boot Mode. Click Add Boot Device. Select LocalDisk from the drop-down list for “Add Boot Device.” For the device name and Slot enter “MSTOR-RAID.”

Note:      Additional boot devices can be added as per the requirement.

Step 5.      Create the SSH policy.

Step 6.      Create the NTP policy.

Step 7.      Create the Virtual KVM policy.

Step 8.      Create the Storage policy to create RAID 1 on two M.2 SATA SSD with boot optimized RAID controller. Select M.2 RAID configuration. From the drop-down list select MSTOR-RAID-1(MSTOR-RAID).

Template and Profile Configuration

Step 1.      In Cisco UCS Infrastructure Services page, go to Templates. Click Create UCS Server Profile Template.

Step 2.      Enter a name for the Server Profile Template and select the Target Platform.

Step 3.      Select policies or create new to associate with the server profile template.

Step 4.      Review and click Derive Profiles.

Step 5.      Select the Server Assignment.

Step 6.      Edit the name for the server profile(s) to be derived from the template.

Step 7.      Click Deploy Server Profile. Monitor the Server profile deployment task.

Install Operating System using Cisco Intersight

Prerequisites

Create the NFS or HTTPS file share location to be consumed for the Operating System, Cisco UCS SCU, and the Cisco UCS HUU repository.

Step 1.      Download ISO and copy the path location to be added in Software Repository tab.

Step 2.      Log into Cisco Intersight. Go to System > Software Repository.

Step 3.      Go to OS Image Links and click Add OS Image Link.

Step 4.      Select NFS or CIFS or HTTP/s, enter file location, username, and password.

Step 5.      Enter the Details.

Step 6.      Go to SCU Links page System > Software Repository. Click Add SCU Link.

Step 7.      Enter the File Location, Username and Password. Click Next.

Step 8.      Review the Server Configuration Utility.

Step 9.      Go to the OS Configuration Files tab. Add the OS configuration file for Operating System Installation (optional).

Step 10.  Go to Infrastructure Service > Servers. Select system(s) to perform Install Operating System. Right-click on the ellipses and select Install Operating System.

Step 11.  The selected system(s) are already part of the Operating System Install task. Edit the list if required. Click Next.

Step 12.  Select the OS Image Link for the intended operating system to be installed.

Step 13.  Enter the configuration details for the Operating System to be installed as shown in the screenshot below. Repeat these steps for all system(s).

A screenshot of a computerDescription automatically generated

Step 14.  Select the Server Configuration Utility Image.

Step 15.  Select the Installation Target. We selected M.2 VD from the drop-down list.

Step 16.  Review the Operating System Installation Summary. Click Install.

Step 17.  Review the Installation workflow in execution.

Post-OS Configuration

This section contains the post-OS configuration steps:

   OS Configuration

   Ansible Automation

OS Configuration

Step 1.      Log into the Ubuntu OS management IP address configured during the Cisco Intersight based Install Operating System using the SSH client. Enter the username “ubuntu” and the password entered while setting up the custom configuration:

$ ssh ubuntu@10.29.148.171

Edit /etc/hosts

$ cat /etc/hosts

127.0.0.1 localhost

127.0.1.1 hpc-node1

 

10.29.148.171 hpc-node1

10.29.148.172 hpc-node2

10.29.148.173 hpc-node3

10.29.148.174 hpc-node4

10.29.148.175 hpc-node5

10.29.148.176 hpc-node6

10.29.148.177 hpc-node7

10.29.148.178 hpc-node8

 

# The following lines are desirable for IPv6 capable hosts

::1     ip6-localhost ip6-loopback

fe00::0 ip6-localnet

ff00::0 ip6-mcastprefix

ff02::1 ip6-allnodes

ff02::2 ip6-allrouters

Note:      The following steps require openssh-server. If it is not already installed, install it using "sudo apt install -y openssh-server".
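Because the node names and addresses above follow a regular pattern, the per-node entries can be generated rather than typed. The following is a minimal sketch under that assumption; the function name gen_hosts and its three arguments are hypothetical, and the output should be reviewed before it is added to /etc/hosts.

```shell
#!/bin/sh
# Hypothetical helper: print "IP hostname" lines for consecutive nodes.
# Assumes node N gets the address <prefix>.<first + N - 1>, as in the
# listing above (hpc-node1 = 10.29.148.171 ... hpc-node8 = 10.29.148.178).
gen_hosts() {
    prefix=$1      # first three octets, for example 10.29.148
    first=$2       # last octet of node 1, for example 171
    count=$3       # number of nodes
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%s.%s hpc-node%s\n' "$prefix" "$((first + i - 1))" "$i"
        i=$((i + 1))
    done
}

gen_hosts 10.29.148 171 8
```

Once the output looks right, it can be appended with, for example, gen_hosts 10.29.148 171 8 | sudo tee -a /etc/hosts.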

Step 2.      Add a new user:

$ sudo adduser <username>

Adding user `<username>' ...

Adding new group `<username>' (1001) ...

Adding new user `<username>' (1001) with group `<username>' ...

Creating home directory `/home/<username>' ...

Copying files from `/etc/skel' ...

New password: <Password!>

Retype new password: <Password!>

passwd: password updated successfully

Changing the user information for <username>

Enter the new value, or press ENTER for the default

        Full Name []: FirstName LastName

        Room Number []:

        Work Phone []:

        Home Phone []:

        Other []:

Is the information correct? [Y/n] Y

$ sudo usermod -aG sudo <username>

$ su - <username>

To run a command as administrator (user "root"), use "sudo <command>".

See "man sudo_root" for details.

Step 3.      Setup a password-less login:

Complete this process on each node so that every node can reach all other nodes without a password; the HPC cluster tests documented in the following section depend on it.

$ ssh-keygen -N '' -f ~/.ssh/id_rsa

$ for i in {1..8}; do echo "copying hpc-node$i"; ssh-copy-id -i ~/.ssh/id_rsa.pub ubuntu@hpc-node$i; done

Ansible Automation

Step 1.      Install ansible:

$ sudo apt update

$ sudo apt install software-properties-common

$ sudo add-apt-repository --update ppa:ansible/ansible    # press Enter when prompted

$ sudo apt update

$ sudo apt install -y ansible-core ansible

$ ansible --version

ansible [core 2.15.4]

  config file = /etc/ansible/ansible.cfg

  configured module search path = ['/home/hardipat/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']

  ansible python module location = /usr/lib/python3/dist-packages/ansible

  ansible collection location = /home/hardipat/.ansible/collections:/usr/share/ansible/collections

  executable location = /usr/bin/ansible

  python version = 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (/usr/bin/python3)

  jinja version = 3.0.3

  libyaml = True

Step 2.      Edit /etc/ansible/hosts:

$ sudo vi /etc/ansible/hosts

[hpcnodes]

hpc-node1 ansible_host=10.29.148.171

hpc-node2 ansible_host=10.29.148.172

hpc-node3 ansible_host=10.29.148.173

hpc-node4 ansible_host=10.29.148.174

hpc-node5 ansible_host=10.29.148.175

hpc-node6 ansible_host=10.29.148.176

hpc-node7 ansible_host=10.29.148.177

hpc-node8 ansible_host=10.29.148.178

 

[all:vars]

ansible_python_interpreter=/usr/bin/python3

Step 3.      Test ansible:

$ ansible-inventory --list -y

all:

  children:

    admin:

      hosts:

        10.29.148.22:

          ansible_python_interpreter: /usr/bin/python3

    hpcnodes:

      hosts:

        hpc-node1:

          ansible_host: 10.29.148.171

          ansible_python_interpreter: /usr/bin/python3

        hpc-node2:

          ansible_host: 10.29.148.172

          ansible_python_interpreter: /usr/bin/python3

        hpc-node3:

          ansible_host: 10.29.148.173

          ansible_python_interpreter: /usr/bin/python3

        hpc-node4:

          ansible_host: 10.29.148.174

          ansible_python_interpreter: /usr/bin/python3

        hpc-node5:

          ansible_host: 10.29.148.175

          ansible_python_interpreter: /usr/bin/python3

        hpc-node6:

          ansible_host: 10.29.148.176

          ansible_python_interpreter: /usr/bin/python3

        hpc-node7:

          ansible_host: 10.29.148.177

          ansible_python_interpreter: /usr/bin/python3

        hpc-node8:

          ansible_host: 10.29.148.178

          ansible_python_interpreter: /usr/bin/python3

$ ansible all -m command -a "uname -r"

hpc-node2 | CHANGED | rc=0 >>

5.15.0-84-generic

hpc-node4 | CHANGED | rc=0 >>

5.15.0-84-generic

Tech tip

To enable root login, set "PermitRootLogin yes" in /etc/ssh/sshd_config, restart the SSH service, and set a root password:

$ sudo passwd root

[sudo] password for ubuntu:

New password:

Retype new password:

passwd: password updated successfully

 

Step 4.      Setup NTP:

$ ansible all -m command -a "sudo timedatectl set-timezone America/Los_Angeles"

$ ansible all -m command -a "sudo timedatectl set-ntp false"

$ ansible all -m command -a "sudo timedatectl set-time hh:mm"

$ ansible all -m command -a "sudo timedatectl set-ntp true"

$ ansible all -m command -a "sudo apt install ntp -y"

Step 5.      Edit /etc/ntp.conf and copy on all nodes:

$ sudo vi /etc/ntp.conf

pool 72.163.32.44 iburst

$ ansible all -b -m copy -a "src=/etc/ntp.conf dest=/etc/ntp.conf"

Step 6.      Start NTP service:

$ ansible all -m command -a "sudo systemctl start ntp"

$ ansible all -m command -a "sudo systemctl enable ntp"

Step 7.      Install required packages:

$ ansible all -m command -a "sudo apt-get update"

$ ansible all -m command -a "sudo apt-get install build-essential -y"

$ ansible all -m command -a "sudo apt install -y pip sshpass unzip"

$ ansible all -m command -a "sudo apt install environment-modules -y"

$ ansible all -m command -a "sudo apt install dkms -y"

$ ansible all -m command -a "sudo apt install make cmake -y"

Step 8.      Update Grub by editing /etc/default/grub:

$ ansible all -b -m shell -a "sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*/& iommu=pt numa_balancing=disable pci=realloc=off processor.max_cstate=0/' /etc/default/grub"

$ ansible all -b -m shell -a "update-grub"
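The sed expression in Step 8 appends parameters inside the quoted value of GRUB_CMDLINE_LINUX_DEFAULT. Before running it across the cluster, the substitution can be tried on a scratch copy. The sketch below does that; the function name add_kernel_params and the sample file contents are assumptions for illustration, and it writes through a temporary file instead of using sed -i so that it does not depend on GNU sed.

```shell
#!/bin/sh
# Hypothetical helper: append kernel parameters inside the quoted
# GRUB_CMDLINE_LINUX_DEFAULT value of the given file.
add_kernel_params() {
    grub_file=$1
    params=$2
    # Match everything up to (but not including) the closing quote and
    # re-emit it ("&") followed by a space and the new parameters.
    sed "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*/& $params/" "$grub_file" > "$grub_file.new" &&
        mv "$grub_file.new" "$grub_file"
}

# Scratch file standing in for /etc/default/grub (sample contents only).
tmp=$(mktemp)
printf '%s\n' 'GRUB_TIMEOUT=0' 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > "$tmp"

add_kernel_params "$tmp" 'iommu=pt numa_balancing=disable pci=realloc=off processor.max_cstate=0'
cat "$tmp"
```

The substitution is not idempotent: running it twice appends the parameters twice, so check the file before re-running the step.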

Step 9.      Disable Nouveau by running following commands:

$ ansible all -b -m shell -a "echo 'blacklist nouveau' > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"

$ ansible all -b -m shell -a "echo 'options nouveau modeset=0' >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf"

$ ansible all -b -m shell -a "update-initramfs -u"

Tech tip

The previous steps require a reboot:

$ ansible all -m command -a "sudo reboot"

Step 10.  Disable firewall (or edit firewall configuration to allow NTP, NFS, other network traffic as appropriate):

$ ansible all -m shell -a "sudo ufw disable"

Step 11.  Disable SELinux, if it is installed (Ubuntu uses AppArmor by default, so /etc/selinux/config may not exist):

$ sudo vi /etc/selinux/config

SELINUX=disabled

$ ansible all -b -m copy -a "src=/etc/selinux/config dest=/etc/selinux/config"

Mellanox ConnectX-6 NIC Best Practice

This section highlights the steps required to configure the Mellanox ConnectX-6 DX Ethernet network interface card for Ubuntu OS:

   Mellanox OFED Driver Installation

   Configure Network Adapter Ports for Ubuntu

   Mellanox ConnectX-6 NIC best practice for PFC, ECN and DSCP

   Connect NetApp Storage provided FlexVolume(s) using NFS Mount

Mellanox OFED Driver Installation

Step 1.      Install Mellanox OFED for Mellanox ConnectX-6 DX Ethernet Network Interface Card. Download Mellanox OFED from the following URL: https://www.mellanox.com/page/mlnx_ofed_eula?mtag=linux_sw_drivers&mrequest=downloads&mtype=ofed&mver=MLNX_OFED-5.8-1.1.2.1&mname=MLNX_OFED_LINUX-5.8-1.1.2.1-ubuntu22.04-x86_64.tgz

Step 2.      Accept the EULA to download the Mellanox OFED software, then copy the archive to all nodes and install it:

$ scp MLNX_OFED_LINUX-5.8-1.1.2.1-ubuntu22.04-x86_64.tgz ubuntu@10.29.148.171:/home/ubuntu/

$ ansible all -m copy -a "src=/home/ubuntu/MLNX_OFED_LINUX-5.8-1.1.2.1-ubuntu22.04-x86_64.tgz dest=/home/ubuntu/MLNX_OFED_LINUX-5.8-1.1.2.1-ubuntu22.04-x86_64.tgz"

$ ansible all -m shell -a "tar xvf /home/ubuntu/MLNX_OFED_LINUX-5.8-1.1.2.1-ubuntu22.04-x86_64.tgz -C /home/ubuntu"

$ ansible all -m command -a "sudo /home/ubuntu/MLNX_OFED_LINUX-5.8-1.1.2.1-ubuntu22.04-x86_64/mlnxofedinstall --without-fw-update --force"

$ ansible all -m command -a "sudo /etc/init.d/openibd restart"

$ ansible all -m command -a "sudo reboot"

$ ansible all -m command -a "ofed_info -s"

hpc-node1 | CHANGED | rc=0 >>

MLNX_OFED_LINUX-5.8-1.1.2.1:

hpc-node2 | CHANGED | rc=0 >>

MLNX_OFED_LINUX-5.8-1.1.2.1:

Step 3.      Start MST:

$ sudo mst start

$ sudo mst status -v

MST modules:

------------

    MST PCI module is not loaded

    MST PCI configuration module loaded

PCI devices:

------------

DEVICE_TYPE             MST                                               PCI         RDMA            NET                       NUMA

ConnectX6DX(rev:0)      /dev/mst/mt4125_pciconf0.1    27:00.1   mlx5_1          net-ens2f1np1             0

ConnectX6DX(rev:0)      /dev/mst/mt4125_pciconf0      27:00.0   mlx5_0          net-ens2f0np0             0

Tech tip

For GPUDirect RDMA and GPUDirect Storage, see:

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/23.6.1/gpu-operator-rdma.html

A future version of this Cisco Validated Design will cover GPUDirect Storage once it has been tested and validated.

Configure Network Adapter Ports for Ubuntu

Step 1.      Run the following commands:

$ ansible all -m command -a "sudo apt update"

$ ansible all -m command -a "sudo apt install lldpad -y"

$ ansible all -m command -a "sudo modprobe 8021q"

$ ansible all -m command -a "sudo su -c 'echo "8021q" >> /etc/modules'"

$ ansible all -m command -a "sudo apt install vlan -y"

Step 2.      Edit /etc/netplan/00-installer-config.yaml to configure Mellanox NIC port with desired network configuration. For example, the following configuration is for Mellanox port1 (ens2f0np0) and port2 (ens2f1np1) connected to Cisco Nexus Switch A and B in Spine – Leaf architecture for data traffic.

$ cat /etc/netplan/00-installer-config.yaml

# This is the network config written by 'subiquity'

network:

  ethernets:

    eno5:

      addresses:

      - 10.29.148.171/24

      dhcp4: false

      nameservers:

        addresses:

        - 171.70.168.183

        search: []

      routes:

      - to: default

        via: 10.29.148.1

    ens2f0np0:

      dhcp4: false

    ens2f1np1:

      dhcp4: false

  vlans:

    ens2f0np0.110:

      id: 110

      link: ens2f0np0

      mtu: 9000

      addresses:

      - 192.168.110.171/24

      routes:

      - to: 192.168.110.0/24

        via: 192.168.110.1

        metric: 100

    ens2f1np1.160:

      id: 160

      link: ens2f1np1

      mtu: 9000

      addresses:

      - 192.168.160.171/24

      routes:

      - to: 192.168.160.0/24

        via: 192.168.160.1

        metric: 100

  version: 2
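When the same netplan file has to be produced for several nodes, the VLAN stanzas can be generated from the parent interface, VLAN ID, and host octet. The sketch below assumes, as in the example above, that the third octet of each storage subnet equals the VLAN ID (VLAN 110 uses 192.168.110.0/24). The function name netplan_vlan is hypothetical, and the routes block is omitted for brevity.

```shell
#!/bin/sh
# Hypothetical helper: print a netplan "vlans:" entry for one VLAN
# sub-interface. Assumes the subnet is 192.168.<vlan_id>.0/24.
netplan_vlan() {
    parent=$1       # parent interface, for example ens2f0np0
    vlan_id=$2      # VLAN ID, for example 110
    host_octet=$3   # last octet of this node, for example 171
    cat <<EOF
    ${parent}.${vlan_id}:
      id: ${vlan_id}
      link: ${parent}
      mtu: 9000
      addresses:
      - 192.168.${vlan_id}.${host_octet}/24
EOF
}

echo "  vlans:"
netplan_vlan ens2f0np0 110 171
netplan_vlan ens2f1np1 160 171
```

After editing the file on a node, the configuration is typically activated with sudo netplan apply.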

Mellanox ConnectX-6 NIC Best Practice for PFC, ECN and DSCP

Step 1.      Enable ECN (Explicit congestion notification) priority 3:

$ ansible all -b -m shell -a "echo 1 > /sys/class/net/ens2f0np0/ecn/roce_rp/enable/3"

$ ansible all -b -m shell -a "echo 1 > /sys/class/net/ens2f1np1/ecn/roce_rp/enable/3"

$ ansible all -b -m shell -a "echo 1 > /sys/class/net/ens2f0np0/ecn/roce_np/enable/3"

$ ansible all -b -m shell -a "echo 1 > /sys/class/net/ens2f1np1/ecn/roce_np/enable/3"

Step 2.      Set CNP L2 egress priority:

$ ansible all -b -m shell -a "echo 6 > /sys/class/net/ens2f0np0/ecn/roce_np/cnp_802p_prio"

$ ansible all -b -m shell -a "echo 6 > /sys/class/net/ens2f1np1/ecn/roce_np/cnp_802p_prio"

Step 3.      Map CNP priority to Differentiated Services Code Point (DSCP):

$ ansible all -b -m shell -a "echo 48 > /sys/class/net/ens2f0np0/ecn/roce_np/cnp_dscp"

$ ansible all -b -m shell -a "echo 48 > /sys/class/net/ens2f1np1/ecn/roce_np/cnp_dscp"
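Writes to files under /sys take effect only when the redirection itself runs in a root shell. The Ansible command module does not invoke a shell at all, so the shell module with privilege escalation is the safer choice for Steps 1 through 3. It is also worth reading each value back after writing it. The following is a minimal sketch of that write-then-verify pattern; the function name set_and_verify is hypothetical, and a scratch file stands in for the sysfs path so the example can run without a ConnectX adapter.

```shell
#!/bin/sh
# Hypothetical helper: write a value to a sysfs-style file and confirm
# that reading the file back returns the same value.
set_and_verify() {
    path=$1
    value=$2
    printf '%s\n' "$value" > "$path" || return 1
    [ "$(cat "$path")" = "$value" ]
}

# Scratch file standing in for
# /sys/class/net/<interface>/ecn/roce_np/cnp_dscp (root is needed on a real node).
demo=$(mktemp)
if set_and_verify "$demo" 48; then
    echo "cnp_dscp set to $(cat "$demo")"
else
    echo "failed to set cnp_dscp" >&2
fi
```

These sysfs values do not persist across reboots, so the steps need to be reapplied at boot time, for example from a startup script.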

Step 4.      Configure PFC (Priority-based Flow Control) - class 3:

$ ansible all -m command -a "sudo mlnx_qos -i ens2f0np0 --pfc 0,0,0,1,0,0,0,0"

$ ansible all -m command -a "sudo mlnx_qos -i ens2f1np1 --pfc 0,0,0,1,0,0,0,0"
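The --pfc argument above is a comma-separated list of eight flags, one per priority from 0 to 7, so 0,0,0,1,0,0,0,0 enables PFC on priority 3 only. A small helper can build the list from a priority number, which avoids miscounting positions. This is a sketch; the function name pfc_mask is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: build the eight-entry PFC flag list for mlnx_qos
# with a single priority (0-7) enabled.
pfc_mask() {
    enabled=$1
    mask=
    prio=0
    while [ "$prio" -le 7 ]; do
        if [ "$prio" -eq "$enabled" ]; then bit=1; else bit=0; fi
        mask="${mask}${mask:+,}${bit}"   # add a comma before every entry but the first
        prio=$((prio + 1))
    done
    printf '%s\n' "$mask"
}

pfc_mask 3    # prints 0,0,0,1,0,0,0,0
```

The result can be substituted into the command, for example sudo mlnx_qos -i ens2f0np0 --pfc "$(pfc_mask 3)". The enabled priority must match the pause pfc-cos value configured on the switches.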

Step 5.      Configure DSCP trust:

$ ansible all -m command -a "sudo mlnx_qos -i ens2f0np0 --trust dscp"

$ ansible all -m command -a "sudo mlnx_qos -i ens2f1np1 --trust dscp"

Step 6.      Enable TCP ECN. The command below applies the setting at runtime; add net.ipv4.tcp_ecn=1 to /etc/sysctl.conf to make it persistent:

$ ansible all -m command -a "sudo sysctl -w net.ipv4.tcp_ecn=1"

Step 7.      Set RoCE mode to v2:

$ ansible all -m command -a "sudo cma_roce_mode -d mlx5_0 -p 1 -m 2"

$ ansible all -m command -a "sudo cma_roce_mode -d mlx5_1 -p 1 -m 2"

Step 8.      Set DSCP value to 24 for RoCE traffic:

$ ansible all -m command -a "sudo cma_roce_tos -d mlx5_0 -t 24"

$ ansible all -m command -a "sudo cma_roce_tos -d mlx5_1 -t 24"
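When choosing the value for Step 8, keep in mind how DSCP relates to the IP Type of Service (ToS) byte: DSCP occupies the upper six bits and ECN the lower two, so ToS = (DSCP x 4) + ECN. The sketch below works through that arithmetic. It assumes the -t value is interpreted as the full ToS byte rather than as a DSCP code point, which is worth confirming against the MLNX_OFED documentation for the installed release. The function names are hypothetical.

```shell
#!/bin/sh
# ToS byte layout: DSCP in the upper six bits, ECN in the lower two bits.
# Hypothetical helpers for converting between the two representations.
dscp_to_tos() { echo $(( ($1 << 2) | ${2:-0} )); }   # args: DSCP [ECN bits]
tos_to_dscp() { echo $(( $1 >> 2 )); }               # arg:  ToS byte

echo "DSCP 24, ECN 0 -> ToS $(dscp_to_tos 24)"       # 96
echo "DSCP 26, ECN 2 -> ToS $(dscp_to_tos 26 2)"     # 106
echo "ToS 24         -> DSCP $(tos_to_dscp 24)"      # 6
```

Under that assumption, a ToS of 96 corresponds to DSCP 24, which the ROCEv2 class map on the switches matches, while a ToS of 24 corresponds to DSCP 6. Capturing a few RoCE packets and inspecting the DSCP field is a straightforward way to confirm which marking is in effect.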

Step 9.      Map RoCE traffic to priority 3:

$ ansible all -m command -a "sudo vconfig set_egress_map ens2f0np0.110 4 3"

$ ansible all -m command -a "sudo vconfig set_egress_map ens2f1np1.160 4 3"

Tech tip

The VLAN egress QoS map (socket priority to 802.1p priority) can also be set with ip link:

$ ansible all -m command -a "sudo ip link set ens2f0np0.110 type vlan egress-qos-map 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7"

$ ansible all -m command -a "sudo ip link set ens2f1np1.160 type vlan egress-qos-map 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7"

Step 10.  Performance tuning on Mellanox NIC:

$ ansible all -m command -a "sudo mlnx_tune -p HIGH_THROUGHPUT"

$ ansible all -m command -a "sudo ip link set ens2f0np0.110 type vlan egress-qos-map 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7"

$ ansible all -m command -a "sudo ip link set ens2f1np1.160 type vlan egress-qos-map 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7"

Connect NetApp Storage provided FlexVolume(s) using NFS Mount

Step 1.      Install nfs-common on client server:

$ sudo apt install nfs-common -y

Step 2.      Create mount directory:

$ sudo mkdir /opt/hpc -p

$ sudo mkdir /home/hpcdata -p

Step 3.      Edit /etc/fstab to add permanent NFS mount:

$ sudo vi /etc/fstab

192.168.110.5:/weather /opt/hpc nfs auto,noatime,nolock,bg,nfsvers=4.2,intr,tcp 0 0

192.168.110.5:/hpcdata /home/hpcdata/ nfs auto,noatime,nolock,bg,nfsvers=4.2,intr,tcp 0 0

Step 4.      Mount NFS volume added in /etc/fstab:

$ sudo mount -a

Step 5.      Validate mount point:

$ df -h
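Each /etc/fstab entry has six whitespace-separated fields: device, mount point, file system type, mount options, dump flag, and fsck pass number. As a rough Python sketch (the FstabEntry type and parse_fstab_line helper are illustrative names), an entry from Step 3 can be checked for a malformed field before running mount -a:

```python
from collections import namedtuple

FstabEntry = namedtuple("FstabEntry", "device mountpoint fstype options dump passno")


def parse_fstab_line(line):
    """Split one /etc/fstab line into its six whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    device, mountpoint, fstype, options, dump, passno = fields
    return FstabEntry(device, mountpoint, fstype, options.split(","), int(dump), int(passno))


entry = parse_fstab_line(
    "192.168.110.5:/weather /opt/hpc nfs auto,noatime,nolock,bg,nfsvers=4.2,intr,tcp 0 0"
)
print(entry.mountpoint)                # /opt/hpc
print("nfsvers=4.2" in entry.options)  # True
```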

NVIDIA GPU Configuration for HPC Workload

This section highlights the steps required to configure NVIDIA GPU with Ubuntu OS and HPC workload:

   NVIDIA CUDA Installation on Ubuntu 22.04 LTS

   Post Installation Steps for NVIDIA CUDA

   Sample CUDA Test

   Install NVIDIA HPC SDK

NVIDIA CUDA Installation on Ubuntu 22.04 LTS

Step 1.      Verify You Have a CUDA-Capable GPU:

$ ansible all -m shell -a "lspci | grep -i nvidia"

hpc-node4 | CHANGED | rc=0 >>

99:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)

hpc-node5 | CHANGED | rc=0 >>

99:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)

hpc-node3 | CHANGED | rc=0 >>

Step 2.      Verify You Have a Supported Version of Linux:

$ ansible all -m shell -a "uname -m && cat /etc/*release"

hpc-node5 | CHANGED | rc=0 >>

x86_64

DISTRIB_ID=Ubuntu

DISTRIB_RELEASE=22.04

DISTRIB_CODENAME=jammy

DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"

PRETTY_NAME="Ubuntu 22.04.3 LTS"

NAME="Ubuntu"

VERSION_ID="22.04"

VERSION="22.04.3 LTS (Jammy Jellyfish)"

VERSION_CODENAME=jammy

ID=ubuntu

ID_LIKE=debian

HOME_URL="https://www.ubuntu.com/"

SUPPORT_URL="https://help.ubuntu.com/"

BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"

PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"

UBUNTU_CODENAME=jammy

hpc-node4 | CHANGED | rc=0 >>

Step 3.      Verify the System Has gcc Installed:

$ ansible all -m shell -a "gcc --version"

hpc-node5 | CHANGED | rc=0 >>

gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Copyright (C) 2021 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

hpc-node4 | CHANGED | rc=0 >>

gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Copyright (C) 2021 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

hpc-node2 | CHANGED | rc=0 >>

Step 4.      Verify the system has the correct kernel headers and development packages installed:

$ ansible all -m shell -a "uname -r"

hpc-node4 | CHANGED | rc=0 >>

5.15.0-84-generic

hpc-node5 | CHANGED | rc=0 >>

5.15.0-84-generic

Step 5.      Install the kernel headers and development packages for the currently running kernel (single quotes ensure that uname -r is evaluated on each node rather than on the Ansible control host):

$ ansible all -m shell -a 'sudo apt-get install -y linux-headers-$(uname -r)'

Step 6.      Download NVIDIA CUDA based on the requirement. Please refer to the screenshot below:

A screenshot of a computerDescription automatically generated

Step 7.      Choose the Installation method.

Step 8.      Remove Outdated Signing Key:

$ ansible all -m shell -a "sudo apt-key del 7fa2af80"

Step 9.      Add the pin file to prioritize CUDA repository:

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin

$ scp cuda-ubuntu2204.pin ubuntu@10.29.148.171:/home/ubuntu/

$ ansible all -m copy -a "src=/home/ubuntu/cuda-ubuntu2204.pin dest=/home/ubuntu/cuda-ubuntu2204.pin"

$ ansible all -m command -a "sudo mv /home/ubuntu/cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600"

Step 10.  Install local repository on file system:

$ wget https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda-repo-ubuntu2204-12-2-local_12.2.2-535.104.05-1_amd64.deb

$ scp cuda-repo-ubuntu2204-12-2-local_12.2.2-535.104.05-1_amd64.deb ubuntu@10.29.148.171:/home/ubuntu/

$ ansible all -m copy -a "src=/home/ubuntu/cuda-repo-ubuntu2204-12-2-local_12.2.2-535.104.05-1_amd64.deb dest=/home/ubuntu/cuda-repo-ubuntu2204-12-2-local_12.2.2-535.104.05-1_amd64.deb"

$ ansible all -m shell -a "sudo dpkg -i /home/ubuntu/cuda-repo-ubuntu2204-12-2-local_12.2.2-535.104.05-1_amd64.deb"

$ ansible all -m command -a "sudo cp /var/cuda-repo-ubuntu2204-12-2-local/cuda-F73B257B-keyring.gpg /usr/share/keyrings/"

Step 11.  Update the Apt repository cache:

$ ansible all -m command -a "sudo apt-get update"

Step 12.  Install CUDA SDK:

$ ansible all -m command -a "sudo apt-get install cuda -y"

Step 13.  (optional) To include all GDS packages:

$ ansible all -m command -a "sudo apt-get install -y nvidia-gds"

Step 14.  Reboot the system:

$ ansible all -m command -a "sudo reboot"

Post Installation Steps for NVIDIA CUDA

Step 1.      Environment Setup:

$ ansible all -m shell -a 'export PATH=/usr/local/cuda-12.2/bin${PATH:+:${PATH}}'

Note:      When using the runfile installation method, the LD_LIBRARY_PATH variable needs to contain /usr/local/cuda-12.2/lib64 on a 64-bit system, or /usr/local/cuda-12.2/lib on a 32-bit system.

Step 2.      To change the environment variables for 64-bit operating systems:

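Note:      The following paths change when using a custom install path with the runfile installation method. A variable exported through an ad hoc Ansible shell task applies only to that task's shell; to make the variables persistent, add the same export lines to a login profile such as ~/.bashrc or a file under /etc/profile.d/ on each node.

$ ansible all -m shell -a 'export CUDA_HOME=/usr/local/cuda-12.2'

$ ansible all -m shell -a 'export PATH=${CUDA_HOME}/bin:${PATH}'

$ ansible all -m shell -a 'export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH'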

Note:      Recommended Actions – the following steps are recommended to verify the integrity of the installation.

Step 3.      Verify the driver version:

$ ansible all -m command -a "cat /proc/driver/nvidia/version"

hpc-node5 | CHANGED | rc=0 >>

NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.104.05  Sat Aug 19 01:15:15 UTC 2023

GCC version:  gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)

hpc-node4 | CHANGED | rc=0 >>

NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.104.05  Sat Aug 19 01:15:15 UTC 2023

GCC version:  gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)

hpc-node3 | CHANGED | rc=0 >>

Step 4.      Run nvidia-smi command:

A screen shot of a computerDescription automatically generated

Sample CUDA Test

Step 1.      Build and run the bandwidthTest sample from the NVIDIA cuda-samples repository (https://github.com/NVIDIA/cuda-samples):

hpc-node1:/home/ubuntu/cuda-samples/Samples/1_Utilities/bandwidthTest# ./bandwidthTest

[CUDA Bandwidth Test] - Starting...

Running on...

 

 Device 0: NVIDIA A100 80GB PCIe

 Quick Mode

 

 Host to Device Bandwidth, 1 Device(s)

 PINNED Memory Transfers

   Transfer Size (Bytes)        Bandwidth(GB/s)

   32000000                     25.0

 

 Device to Host Bandwidth, 1 Device(s)

 PINNED Memory Transfers

   Transfer Size (Bytes)        Bandwidth(GB/s)

   32000000                     24.9

 

 Device to Device Bandwidth, 1 Device(s)

 PINNED Memory Transfers

   Transfer Size (Bytes)        Bandwidth(GB/s)

   32000000                     549.7

 

Result = PASS

Note:      The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Install NVIDIA HPC SDK

Prerequisites

Step 1.      Download NVIDIA HPC SDK software: https://developer.nvidia.com/hpc-sdk-downloads

$ wget https://developer.download.nvidia.com/hpc-sdk/23.9/nvhpc_2023_239_Linux_x86_64_cuda_multi.tar.gz

$ scp nvhpc_2023_239_Linux_x86_64_cuda_multi.tar.gz ubuntu@10.29.148.171:/home/ubuntu/

Step 2.      Copy NVIDIA HPC SDK to all nodes:

$ ansible all -m copy -a "src=/home/ubuntu/nvhpc_2023_239_Linux_x86_64_cuda_multi.tar.gz dest=/home/ubuntu/nvhpc_2023_239_Linux_x86_64_cuda_multi.tar.gz"

$ ansible all -m shell -a "tar xpzf /home/ubuntu/nvhpc_2023_239_Linux_x86_64_cuda_multi.tar.gz"

Step 3.      Install NVIDIA HPC SDK (optionally – select where it gets installed):

$ ansible all -m shell -a "tar xpzf /home/ubuntu/nvhpc_2023_239_Linux_x86_64_cuda_multi.tar.gz"

Step 4.      Install NVIDIA HPC SDK:

$ su - root

$ cd nvhpc_2023_239_Linux_x86_64_cuda_multi/

$ sudo ./install

Step 5.      Press Enter to continue installing NVIDIA HPC SDK:

Step 6.      Select Install option:

An auto installation is appropriate for any scenario. The HPC SDK

configuration (localrc) is created at first use and stored in each user's

home directory.

Installation directory? [/opt/nvidia/hpc_sdk]

1  Single system install

2  Network install

3  Auto install

Please choose install option:

3

Step 7.      Select default directory or intended location:

Please specify the directory path under which the software will be installed.

The default directory is /opt/nvidia/hpc_sdk, but you may install anywhere

you wish, assuming you have permission to do so.

Installation directory? [/opt/nvidia/hpc_sdk]

Tech tip

Installing NVIDIA HPC SDK version 23.9 into /opt/nvidia/hpc_sdk

Making symbolic link in /opt/nvidia/hpc_sdk/Linux_x86_64

generating environment modules for NV HPC SDK 23.9 ... done.

Installation complete.

HPC SDK successfully installed into /opt/nvidia/hpc_sdk

If you use the Environment Modules package, that is, the module load

command, the NVIDIA HPC SDK includes a script to set up the

appropriate module files.

% module load /opt/nvidia/hpc_sdk/modulefiles/nvhpc/23.9

% module load nvhpc/23.9

Alternatively, the shell environment may be initialized to use the HPC SDK.

In csh, use these commands:

% setenv MANPATH "$MANPATH":/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/man

% set path = (/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/bin $path)

In bash, sh, or ksh, use these commands:

$ MANPATH=$MANPATH:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/man; export MANPATH

$ PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/bin:$PATH; export PATH

Once the 64-bit compilers are available, you can make the OpenMPI

commands and man pages accessible using these commands.

% set path = (/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/mpi/bin $path)

% setenv MANPATH "$MANPATH":/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/mpi/man

And the equivalent in bash, sh, and ksh:

$ export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/mpi/bin:$PATH

$ export MANPATH=$MANPATH:/opt/nvidia/hpc_sdk/Linux_x86_64/23.9/comm_libs/mpi/man

Please check https://developer.nvidia.com for documentation,

use of NVIDIA HPC SDK software, and other questions.

 

Step 8.      Install environment-modules:

$ ansible all -m shell -a "sudo apt-get install -y environment-modules"

$ cat /etc/profile.d/modules.sh

shell=$(/usr/bin/basename $(/bin/ps -p $$ -ocomm=))

 

if [ -f /usr/share/modules/init/$shell ]; then

   . /usr/share/modules/init/$shell

else

   . /usr/share/modules/init/sh

fi

Step 9.      Create environment variables to be set at the time of user login:

$ vi /etc/environment

# Use the HPC SDK toolkit compilers

export NVIDIA=/opt/hpc/nvidia/hpc_sdk

module use $NVIDIA/modulefiles

module load nvhpc

# But then override their choice of Open MPI to use the HPCX that is inside the HPC SDK

module use /opt/hpc/nvidia/hpc_sdk/Linux_x86_64/2023/comm_libs/12.2/hpcx/hpcx-2.15/modulefiles

module load hpcx

$ source /etc/environment

$ printenv

Step 10.  Verify that the modules configured above are loaded and that “mpirun” and “mpicc” are available:

$ which mpirun

/opt/hpc/nvidia/hpc_sdk/Linux_x86_64/23.7/comm_libs/12.2/hpcx/hpcx-2.15/ompi/bin/mpirun

$ which mpicc

/opt/hpc/nvidia/hpc_sdk/Linux_x86_64/23.7/comm_libs/12.2/hpcx/hpcx-2.15/ompi/bin/mpicc

Solution Validation

This chapter contains the following:

   What is GPUDirect?

What is GPUDirect?

NVIDIA GPUDirect RDMA (Remote Direct Memory Access) is a technology that enables a direct path for data exchange between the GPU and third-party peer devices using standard PCIe features. GPUDirect RDMA relies on the ability of NVIDIA GPUs to expose portions of device memory in a PCIe base address register region, as shown in Figure 12. The GPUDirect technology requires a PCIe Switch to facilitate direct memory transfer between NVIDIA NIC and GPU.

A PCIe switch is not strictly required to enable GPUDirect RDMA on a supported system. Please refer to How GPUDirect RDMA works, NVIDIA GPUDirect Storage design guide, and GPUDirect RDMA supported systems for more details.

If there is no PCIe switch between the network interface card and the NVIDIA GPU, data has to traverse the processor. The performance improvement achieved through GPUDirect RDMA depends on your use case and system configuration.

Figure 12.       GPUDirect Vs No GPUDirect

Related image, diagram or screenshot

Note:      A requirement for GPUDirect RDMA to work is that the NVIDIA GPU and the Mellanox Adapter share the same root complex through a PCIe switch.

Note:      For our solution, we could not use GPUDirect because the Cisco UCS C240 M7 rack server does not contain a PCIe switch between the CPU and the Mellanox Ethernet NIC and NVIDIA GPU, as shown in Figure 12.

NVIDIA HPC-X Software Toolkit Setup and Configuration

This section details the validation of the HPC/AI cluster previously configured.

   Prerequisites

   Test Result for NVIDIA HPC-X Software Toolkit

   NVIDIA HPC-X Software Toolkit Test Summary

Prerequisites

The NVIDIA HPC-X software toolkit is used to validate end-to-end connectivity.

Note:      We installed NVIDIA HPC-X software toolkit version 2.16 for the validation.

For more details, go to: https://docs.nvidia.com/networking/display/hpcxv216

Tech tip

MPI (Message Passing Interface) is a standard and widely used library and protocol for writing parallel applications in HPC.

“mpirun” is the launcher command provided by MPI implementations such as OpenMPI and MPICH, which supply the underlying libraries and runtime support for parallel programming. It launches, manages, and scales parallel applications in HPC environments, making it possible to harness the computational power of clusters and supercomputers for scientific simulations, data analysis, and other compute-intensive tasks.

“mpicc” is the compiler wrapper that streamlines compiling parallel C programs for HPC environments (MPI implementations provide companion wrappers such as mpicxx for C++), allowing developers to take full advantage of the parallel processing capabilities of these systems.

“mpich” is one of the primary implementations of the MPI standard. It provides a set of libraries, tools, and runtime environments that allow developers to create parallel applications in C, C++, and Fortran.

OpenACC (Open Accelerators) is an open standard for parallel programming of heterogeneous computer systems, including multi-core CPUs and GPUs (Graphics Processing Units). It provides a set of directives, libraries, and APIs that enable developers to accelerate their applications by offloading compute-intensive portions to accelerators, such as GPUs, while maintaining a single source code that can run on both CPUs and GPUs.

UCX (Unified Communication X) is an open-source communication library designed for high-performance computing (HPC) and other parallel computing environments. UCX is particularly valuable for applications that rely on efficient communication in HPC and parallel computing scenarios. It allows these applications to achieve high performance while maintaining portability across diverse hardware and networking environments.

Step 1.      Accept the EULA to download the NVIDIA HPC-X toolkit, which includes ClusterKit (the URL is quoted so that the shell does not interpret the & characters):

$ wget "http://www.mellanox.com/page/hpcx_eula?mrequest=downloads&mtype=hpc&mver=hpc-x&mname=v2.16.2/hpcx-v2.16-gcc-mlnx_ofed-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64.tbz"

Step 2.      Copy HPC-X software toolkit on all nodes:

$ scp hpcx-v2.16-gcc-mlnx_ofed-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64.tbz ubuntu@10.29.148.171:/home/ubuntu/

$ ansible all -m copy -a "src=/home/ubuntu/hpcx-v2.16-gcc-mlnx_ofed-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64.tbz dest=/home/ubuntu/hpcx-v2.16-gcc-mlnx_ofed-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64.tbz"

Step 3.      Extract HPC-X software toolkit:

$ ansible all -m shell -a "tar -xvf /home/ubuntu/hpcx-v2.16-gcc-mlnx_ofed-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64.tbz"

Step 4.      Update shell variable of the location of HPC-X installation:

$ cd /home/ubuntu/hpcx-v2.16-gcc-mlnx_ofed-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64

$ export HPCX_HOME=$PWD

Step 5.      Build and run applications with HPC-X. To load the OpenMPI-based package, run the following commands:

$ source $HPCX_HOME/hpcx-init.sh

$ hpcx_load

$ env | grep HPCX

$ mpicc $HPCX_MPI_TESTS_DIR/examples/hello_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_c

$ mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c

Hello, world, I am 0 of 2, (Open MPI v4.1.5rc2, package: Open MPI root@hpc-kernel-03 Distribution, ident: 4.1.5rc2, repo rev: v4.1.5rc1-17-gdb10576f40, Unreleased developer copy, 150)

Hello, world, I am 1 of 2, (Open MPI v4.1.5rc2, package: Open MPI root@hpc-kernel-03 Distribution, ident: 4.1.5rc2, repo rev: v4.1.5rc1-17-gdb10576f40, Unreleased developer copy, 150)

$ oshcc $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c

$ oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c

Hello, world, I am 1 of 2: http://www.open-mpi.org/ (version: 1.4)

Hello, world, I am 0 of 2: http://www.open-mpi.org/ (version: 1.4)

$ hpcx_unload

Step 6.      (Optional) If not already installed, install environment-modules, reload the shell profiles, and load the KNEM module:

$ ansible all -m shell -a "sudo apt install environment-modules -y"

$ ansible all -m shell -a "source ~/.bashrc"

$ ansible all -m shell -a "source ~/.profile"

$ ansible all -m shell -a "sudo modprobe knem"

Step 7.      Building HPC-X with Intel Compiler Suite:

Note:      As of version 1.7, HPC-X builds are no longer distributed based on the Intel compiler suite. However, after following the HPC-X deployment example below, HPC-X can subsequently be rebuilt from source with your Intel compiler suite as follows:

$ tar xfp ${HPCX_HOME}/sources/openmpi-gitclone.tar.gz

$ cd ${HPCX_HOME}/sources/openmpi-gitclone

$ ls -l ${HPCX_HOME}/sources/openmpi-gitclone.tar.gz

-rw-r--r-- 1 ubuntu ubuntu 18287983 Aug  3 13:59 /home/ubuntu/hpcx-v2.16-gcc-mlnx_ofed-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64//sources/openmpi-gitclone.tar.gz

$ cd ${HPCX_HOME}/sources/openmpi-gitclone

$ ./configure CC=icx CXX=icpx F77=ifort FC=ifort --prefix=${HPCX_HOME}/ompi-icc \

--with-hcoll=${HPCX_HOME}/hcoll \

--with-ucx=${HPCX_HOME}/ucx \

--with-platform=contrib/platform/mellanox/optimized \

2>&1 | tee config-icc-output.log

$ make -j32 all 2>&1 | tee build_icc.log && make -j24 install 2>&1 | tee install_icc.log

Step 8.      Load HPC-X environment from modules:

$ module use $HPCX_HOME/modulefiles

$ module load hpcx

$ mpicc $HPCX_MPI_TESTS_DIR/examples/hello_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_c

$ mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c

Hello, world, I am 1 of 2, (Open MPI v4.1.5rc2, package: Open MPI root@hpc-kernel-03 Distribution, ident: 4.1.5rc2, repo rev: v4.1.5rc1-17-gdb10576f40, Unreleased developer copy, 150)

Hello, world, I am 0 of 2, (Open MPI v4.1.5rc2, package: Open MPI root@hpc-kernel-03 Distribution, ident: 4.1.5rc2, repo rev: v4.1.5rc1-17-gdb10576f40, Unreleased developer copy, 150)

$ oshcc $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c

$ oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c

Hello, world, I am 1 of 2: http://www.open-mpi.org/ (version: 1.4)

Hello, world, I am 0 of 2: http://www.open-mpi.org/ (version: 1.4)

Step 9.      (Optional) To profile MPI API calls, run the following commands:

$ export IPM_KEYFILE=$HPCX_IPM_DIR/etc/ipm_key_mpi

$ export IPM_LOG=FULL

$ export LD_PRELOAD=$HPCX_IPM_DIR/lib/libipm.so

$ $HPCX_IPM_DIR/bin/ipm_parse -html outfile.xml

$ export IPM_ADD_BARRIER_TO_REDUCE=1

$ export IPM_ADD_BARRIER_TO_ALLREDUCE=1

$ export IPM_ADD_BARRIER_TO_GATHER=1

$ export IPM_ADD_BARRIER_TO_ALL_GATHER=1

$ export IPM_ADD_BARRIER_TO_ALLTOALL=1

$ export IPM_ADD_BARRIER_TO_ALLTOALLV=1

$ export IPM_ADD_BARRIER_TO_BROADCAST=1

$ export IPM_ADD_BARRIER_TO_SCATTER=1

$ export IPM_ADD_BARRIER_TO_SCATTERV=1

$ export IPM_ADD_BARRIER_TO_GATHERV=1

$ export IPM_ADD_BARRIER_TO_ALLGATHERV=1

$ export IPM_ADD_BARRIER_TO_REDUCE_SCATTER=1

Step 10.  Create hostfile.txt in /home/ubuntu/hpcx-v2.16-gcc-mlnx_ofed-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/clusterkit/bin/:

$ vi hostfile.txt

hpc-node1

hpc-node2

hpc-node3

….

hpc-node8

Step 11.  UCX intra-node communication uses the KNEM module, which improves the performance significantly. Make sure this module is loaded on your system:

$ sudo modprobe knem

Step 12.  When HPC-X is launched with Open MPI without a resource manager job environment (Slurm, PBS, and so on), or when it is launched from a compute node, the default rsh/ssh-based launcher is used. This launcher does not propagate environment variables to the compute nodes, so make sure the LD_LIBRARY_PATH variable from HPC-X is propagated as follows:

$ mpirun -x LD_LIBRARY_PATH -np 2 -H hpc-node1,hpc-node2,hpc-node3,hpc-node4 $HPCX_MPI_TESTS_DIR/examples/hello_c

Sample Command:

$ mpirun -x LD_LIBRARY_PATH -np 2 -H host1,host2  $HPCX_MPI_TESTS_DIR/examples/hello_c

Step 13.  After loading the HPC-X package, run ClusterKit either with the “mpirun” command or with the clusterkit script:

# mpirun

$ mpirun -x LD_LIBRARY_PATH -np 2 -H c240m7-13,c240m7-14  $HPCX_CLUSTERKIT_DIR/bin/clusterkit

Note:      ClusterKit runs by default in pairwise test cases, which requires at least two nodes to run.

# clusterkit script

Note:      Run ./clusterkit.sh -h for a list of options and parameters.

Test Results for NVIDIA HPC-X Software Toolkit

ClusterKit, part of the NVIDIA HPC-X software toolkit, is a multifaceted node assessment tool for high-performance clusters. Currently, ClusterKit is capable of testing latency, bandwidth, effective bandwidth, memory bandwidth, GFLOPS by node, per-rack collective performance, as well as bandwidth and latency between GPUs and local/remote memory. ClusterKit employs well-known techniques and tests to arrive at these performance metrics and is intended to give the user a general look at the health and performance of a cluster.

Step 1.      The following sample command was run to measure performance and validate the HPC/AI cluster configuration for this solution:

$ ./clusterkit.sh --ssh --hostfile hostfile.txt -y --hca_list "mlx5_0:1,mlx5_1:1" -z 15

Note:      mlx5_0 and mlx5_1 are two physical ports on ConnectX6-DX ethernet NIC installed on each Cisco UCS C240 M7 Rack Server

Step 2.      The test results generated by the HPC-X ClusterKit command, which runs for a 15-minute interval, can be found here:

~/hpcx-v2.16-gcc-mlnx_ofed-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/clusterkit/bin/20231103_134557$ cat bandwidth.txt

Cluster: Unknown

User: ubuntu

Testname: bandwidth

Date_and_Time: 2023/11/03 13:45:58

JOBID: 0

PPN: 128

Bidirectional: True

Skip_Intra_Node: True

HCA_Tag: Unknown

Technology: Unknown

 

Bandwidth_bidir (MB/s):

Message size: 8388608 B  Iterations: 10000

Rank:                        0       1       2       3       4       5       6       7

       0 (hpc-node1):      0.0 48829.3 48381.3 48377.6 48823.9 48840.8 48839.7 48842.8

       1 (hpc-node2):  48829.3     0.0 48393.2 48372.8 48841.4 48853.4 48856.3 48853.9

       2 (hpc-node3):  48381.3 48393.2     0.0 48347.1 48396.5 48387.9 48389.8 48382.1

       3 (hpc-node4):  48377.6 48372.8 48347.1     0.0 48369.2 48378.3 48367.7 48374.7

       4 (hpc-node5):  48823.9 48841.4 48396.5 48369.2     0.0 48839.0 48842.8 48839.5

       5 (hpc-node6):  48840.8 48853.4 48387.9 48378.3 48839.0     0.0 48857.7 48865.4

       6 (hpc-node7):  48839.7 48856.3 48389.8 48367.7 48842.8 48857.7     0.0 48859.3

       7 (hpc-node8):  48842.8 48853.9 48382.1 48374.7 48839.5 48865.4 48859.3     0.0

Minimum bandwidth: 48347.1 MB/s between hpc-node3 and hpc-node4

Maximum bandwidth: 48865.4 MB/s between hpc-node6 and hpc-node8

Average bandwidth: 48628.7 MB/s
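To relate the ClusterKit figures to port speeds, the MB/s values can be converted to Gbps. The Python sketch below assumes that ClusterKit reports MB as 10^6 bytes and that the bidirectional figure sums both directions across both 100 Gbps ports; neither assumption is stated in the output above, and the function name is illustrative:

```python
def mb_per_s_to_gbps(mb_per_s, bytes_per_mb=1_000_000):
    """Convert megabytes per second to gigabits per second (10^9 bits)."""
    return mb_per_s * bytes_per_mb * 8 / 1e9


average = mb_per_s_to_gbps(48628.7)
# Two 100 Gbps ports, counted in both directions because the test is bidirectional.
line_rate = 2 * 100 * 2

print(f"{average:.1f} Gbps")         # 389.0 Gbps
print(f"{average / line_rate:.1%}")  # 97.3%
```

Under these assumptions, the average bidirectional bandwidth is roughly 389 Gbps, or about 97 percent of the aggregate rate of two 100 Gbps ports in both directions.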

NVIDIA HPC-X Software Toolkit Test Summary

When running the ClusterKit test from the NVIDIA HPC-X software toolkit, we observed approximately 200 Gbps of bandwidth (duplex on two 100 Gbps ports; mlx5_0 and mlx5_1 are the two physical ports on the ConnectX-6 Dx Ethernet NIC installed in each Cisco UCS C240 M7 Rack Server) when measuring host connectivity bandwidth between the Cisco UCS C240 M7 servers and a pair of Cisco Nexus 93600CD-GX leaf switches.

The following screenshot shows the line-rate network traffic generated when running the NVIDIA HPC-X ClusterKit command above to measure network bandwidth over a longer duration. Line-rate traffic was monitored between the Cisco UCS C240 M7 servers, with their Mellanox ConnectX-6 Ethernet NIC ports, and the pair of Cisco Nexus 93600CD-GX leaf switches they connect to.

Figure 13.       Screenshot monitoring port utilization on Cisco Nexus 93600CD-GX leaf1(A) and leaf2(B) connected to Cisco UCS C240 M7 Rack Server with 2 x 100GbE Mellanox ConnectX6 Ethernet NIC

Related image, diagram or screenshot

Figure 14.       Grafana dashboard monitoring ingress and egress network bandwidth on Cisco Nexus 93600CD-GX leaf 1

A screenshot of a computerDescription automatically generated

Test

This section highlights test results for datasets of different sizes and application areas. Tests were executed to determine cluster scalability and the clock time required to complete each dataset on CPU only, and then repeated on systems with one or two GPUs per node.

CPU-only tests were executed with all logical cores utilized on each node by adding the parameter -np 128*x (where x is the number of nodes utilized).
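As a small worked example of the rank count for CPU-only runs, the Python sketch below multiplies the 128 logical cores per node by the node count for each cluster size tested; the helper name mpi_ranks is illustrative:

```python
LOGICAL_CORES_PER_NODE = 128  # per-node multiplier used with -np in the text


def mpi_ranks(nodes, cores_per_node=LOGICAL_CORES_PER_NODE):
    """Return the -np value for a CPU-only run that uses every logical core."""
    if nodes < 1:
        raise ValueError("at least one node is required")
    return cores_per_node * nodes


for nodes in (1, 2, 4, 8):
    print(nodes, mpi_ranks(nodes))
# 1 128
# 2 256
# 4 512
# 8 1024
```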

We captured system performance statistics such as CPU utilization (vmstat), GPU utilization (nvidia-smi and the NVIDIA GPU exporter Grafana WebUI console), and network port utilization on the Cisco Nexus switches connected to the Cisco UCS servers and NetApp A400 storage when running the same test in a clustered configuration.

We tested combinations of one, two, four, and eight nodes to showcase the linear scalability of the solution.

The test results show the clock time in seconds to complete each test scenario, whether running on CPU or GPU.

HPC Workload Dataset

Prerequisites

We tested the documented solution with HPC datasets representing the different application areas listed below:

Note:      The workloads listed are for proof-of-concept (PoC) purposes only.

1. The MiniWeather Mini App (Weather)

The miniWeather code mimics the basic dynamics seen in atmospheric weather and climate. The dynamics themselves are dry compressible, stratified, non-hydrostatic flows dominated by buoyant forces that are relatively small perturbations on a hydrostatic background state. The equations in this code themselves form the backbone of pretty much all fluid dynamics codes, and this particular flavor forms the base of all weather and climate modeling.

For more details: https://github.com/mrnorman/miniWeather#openacc-accelerator-threading

2. Minisweep (Nuclear Engineering)

Minisweep is a deterministic Sn radiation transport miniapp used for performance optimization and performance model evaluation on HPC architectures.

For more details: https://github.com/wdj/minisweep

3. HPGMG: High-performance Geometric Multigrid (Cosmology/Astrophysics)

HPGMG implements full multigrid (FMG) algorithms using finite-volume and finite-element methods. Different algorithmic variants adjust the arithmetic intensity and architectural properties that are tested. These FMG methods converge up to discretization error in one F-cycle, thus may be considered direct solvers. An F-cycle visits the finest level a total of two times, the first coarsening (8x smaller) 4 times, the second coarsening 6 times, etc.

For more details: https://bitbucket.org/hpgmg/hpgmg/src/master/

Tech tip

Run the following command to list all the parameters:

$ mpicc --help

A subset of the “mpicc --help” output is pasted below for reference:

Target-specific switches:

-acc[=gpu|host|multicore|[no]autopar|[no]routineseq|legacy|strict|verystrict|sync|[no]wait]

                    Enable OpenACC directives

    gpu             OpenACC directives are compiled for GPU execution only; please refer to -gpu for target specific options

    host            Compile for serial execution on the host CPU

    multicore       Compile for parallel execution on the host CPU

    [no]autopar     Enable (default) or disable loop autoparallelization within acc parallel

    [no]routineseq  Compile every routine for the device

    legacy          Suppress warnings about deprecated PGI accelerator directives

    strict          Issue warnings for non-OpenACC accelerator directives

    verystrict      Fail with an error for any non-OpenACC accelerator directive

    sync            Ignore async clauses

    [no]wait        Wait for each device kernel to finish

The workload datasets tested in this solution to create a baseline have minimum system requirements based on the application and dataset size. The intention is to target varied cluster sizes by node count, core count, and GPU count, and to observe their relative impact when running a benchmark application to determine overall cluster performance.

   The Small size dataset targets a single larger node or a small cluster of nodes using between 64 and 1024 ranks. It uses a maximum of 480GB of memory per benchmark.

   The Medium size dataset targets a mid-sized cluster of nodes using between 256 and 4096 ranks. It uses a maximum of 4TB of memory per benchmark.

   The Large size dataset targets a larger cluster of nodes using between 2048 and 32,768 ranks. It uses a maximum of 14.5TB of memory per benchmark.
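The rank ranges above overlap, so a given rank count can fall within more than one dataset size. The following Python sketch, with illustrative names, lists the dataset sizes whose stated range includes a given number of ranks; the ranges describe intended targets rather than enforced limits:

```python
# Rank ranges as listed above: (minimum ranks, maximum ranks)
DATASET_RANKS = {
    "small": (64, 1024),
    "medium": (256, 4096),
    "large": (2048, 32768),
}


def datasets_for(ranks):
    """Return the dataset sizes whose stated rank range includes `ranks`."""
    return [name for name, (low, high) in DATASET_RANKS.items() if low <= ranks <= high]


print(datasets_for(128))   # ['small']
print(datasets_for(1024))  # ['small', 'medium']
print(datasets_for(2048))  # ['medium', 'large']
```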

For this solution, datasets were downloaded and configured in the shared NFS volume on NetApp A400 storage, and NetApp FlexClone volumes were mounted on each node of the HPC cluster under test. We tested the following combinations for scale tests on HPC cluster workloads.

Table 6.      Configuration tested for scale tests

Scale Test                        1-node    2-node    4-node    8-node
C240M7 Server with 1x A100 GPU    Tested    Tested    Tested    Tested
C240M7 Server with 2x A100 GPU    Tested    Tested    Tested    Not tested
C240M7 Server with CPU only       Tested    Tested    Tested    Tested

Note:      We plan to extend this exercise to an 8-node HPC cluster with 2 x NVIDIA Tesla A100-80G GPUs per node in the future.

Application 1: miniWeather (Large Size Dataset)

Table 7 lists the results from scale out tests on miniWeather.

Table 7.      Scale test results with miniWeather large size dataset (total time in seconds)

Scale Test                        1-node     2-node     4-node    8-node
C240M7 Server with 1x A100 GPU    1048.16    518.98     246.16    109.27
C240M7 Server with 2x A100 GPU    518.34     245.61     105.85    NA
C240M7 Server with CPU only       3418.28    1726.53    865.28    436.14

Figure 15.       Cluster scalability when running Large miniWeather Dataset

Figure 16.       % Avg CPU utilization when running Large miniWeather Dataset only on CPU

Figure 17.       % Avg GPU utilization when running Large miniWeather Dataset on nodes with 1 NVIDIA A100-80G GPU

Figure 18.       % Avg GPU utilization when running Large miniWeather Dataset on nodes with 2 NVIDIA A100-80G GPUs

Figure 19.       % Avg GPU memory utilization when running Large miniWeather Dataset on nodes with 1 NVIDIA A100-80G GPU

Figure 20.       % Avg GPU memory utilization when running Large miniWeather Dataset on nodes with 2 NVIDIA A100-80G GPUs

Figure 21.       Sample Grafana dashboard showing GPU statistics

Figure 22.       Sample % Avg network utilization when running Large miniWeather Dataset on 8 nodes with 1 x A100-80G GPU per node

Observations

Figure 23.       8 Node HPC Cluster comparison with miniWeather large dataset test on 1 x NVIDIA A100-80G GPU vs Dual Socket CPU

The following observations were made from the test results obtained by running the miniWeather large dataset on the HPC cluster of 8 x Cisco UCS C240 M7 nodes defined in this solution study:

   The end-to-end FlexPod HPC cluster highlighted in this solution scaled linearly whether the test ran on CPUs only or with GPUs.

   Each node's resources (compute, storage, and network) were consumed evenly throughout the cluster test.

   GPU utilization and GPU memory utilization both reached 100%.

   No CPU cycles were consumed when the test ran on the NVIDIA A100-80G GPU.

   With one GPU per node, the eight-node cluster completed the test about 4x faster than the same eight-node cluster without GPUs (109.27 seconds versus 436.14 seconds in Table 7).

Application 2: Minisweep (Small Size Dataset)

Table 8 lists the results from scale out tests on Minisweep.

Table 8.      Scale test results with Minisweep small size dataset (total time in seconds)

Scale Test | 1-node | 2-node | 4-node | 8-node
C240M7 Server with 1x A100 GPU | 438.10 | 217.18 | 122.93 | 76.21
C240M7 Server with 2x A100 GPU | 216.99 | 130.58 | 87.10 | NA
C240M7 Server with CPU only | 1927.84 | 1045.48 | 742.04 | 435.59

Figure 24.       Cluster scalability when running Small Minisweep Dataset

Figure 25.       % Avg CPU utilization when running Small Minisweep Dataset only on CPU

Figure 26.       % Avg GPU utilization when running Small Minisweep Dataset on nodes with 1 NVIDIA A100-80G GPU

Figure 27.       % Avg GPU utilization when running Small Minisweep Dataset on nodes with 2 NVIDIA A100-80G GPUs

Figure 28.       % Avg GPU memory utilization when running Small Minisweep Dataset on nodes with 1 NVIDIA A100-80G GPU

Figure 29.       % Avg GPU memory utilization when running Small Minisweep Dataset on nodes with 2 NVIDIA A100-80G GPUs

Figure 30.       Sample % Avg network utilization when running Small Minisweep Dataset on 8 nodes with 1 x A100-80G GPU per node

Figure 31.       Sample Grafana dashboard showing GPU statistics

Observations

Figure 32.       8 Node HPC Cluster comparison with Minisweep small dataset test on 1 x NVIDIA A100-80G GPU vs Dual Socket CPU

The following observations were made from the test results obtained by running the Minisweep small dataset on the HPC cluster of 8 x Cisco UCS C240 M7 nodes defined in this solution study:

   The end-to-end FlexPod HPC cluster highlighted in this solution scaled linearly whether the test ran on CPUs only or with GPUs.

   Each node's resources (compute, storage, and network) were consumed evenly throughout the cluster test.

   Unlike the weather simulation exercise above, the Minisweep small dataset (Nuclear Engineering - Radiation Transport) did not fully load the GPUs: in the full cluster test we saw approximately 90% GPU utilization and 70% GPU memory utilization.

   No CPU cycles were consumed when the test ran on the NVIDIA A100-80G GPU.

   With one GPU per node, the eight-node cluster completed the test about 5.7x faster than the same eight-node cluster without GPUs (76.21 seconds versus 435.59 seconds in Table 8).

Application 3: HPGMG (Medium Size Dataset)

Table 9 lists the results from scale out tests on HPGMG.

Table 9.      Scale test results with HPGMG medium size dataset (total time in seconds)

Scale Test | 1-node | 2-node | 4-node | 8-node
C240M7 Server with 1x A100 GPU | NA | NA | 308.76 | 190.16
C240M7 Server with 2x A100 GPU | NA | 305.33 | 182.86 | NA
C240M7 Server with CPU only | 4524.42 | 2330.06 | 1229.69563 | 677.1356146
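The NA entries in Table 9 are easier to interpret when the GPU runs are grouped by the total number of GPUs in the cluster (GPUs per node multiplied by node count). The sketch below does that with the values from Table 9 only; the data layout and function names are illustrative. The 8-node configuration with 2 GPUs per node is left out because it was not part of the tested matrix (Table 6).

```python
# Run times in seconds from Table 9 (HPGMG, medium dataset), keyed by
# (GPUs per node, node count). None marks a run reported as NA.
HPGMG_GPU_RUNS = {
    (1, 1): None, (1, 2): None, (1, 4): 308.76, (1, 8): 190.16,
    (2, 1): None, (2, 2): 305.33, (2, 4): 182.86,
}


def group_by_total_gpus(runs):
    """Group runs by total GPU count across the cluster, smallest count first."""
    grouped = {}
    for (gpus_per_node, nodes), seconds in runs.items():
        total = gpus_per_node * nodes
        grouped.setdefault(total, []).append(((gpus_per_node, nodes), seconds))
    return dict(sorted(grouped.items()))


def smallest_completed_gpu_count(runs):
    """Smallest total GPU count for which a run time was reported."""
    return min(g * n for (g, n), seconds in runs.items() if seconds is not None)


if __name__ == "__main__":
    for total, entries in group_by_total_gpus(HPGMG_GPU_RUNS).items():
        for (gpus_per_node, nodes), seconds in entries:
            result = "NA" if seconds is None else f"{seconds:.2f} s"
            print(f"{total} GPUs total ({nodes} nodes x {gpus_per_node} GPU): {result}")
    print("Smallest completed GPU count:", smallest_completed_gpu_count(HPGMG_GPU_RUNS))
```

Grouped this way, the table shows reported times only for configurations with four or more GPUs in total, and configurations with the same total GPU count finish in similar times.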

Figure 33.       Cluster scalability when running Medium HPGMG Dataset

Figure 34.       % Avg CPU utilization when running Medium HPGMG Dataset only on CPU

Figure 35.       % Avg GPU utilization when running Medium HPGMG Dataset on nodes with 1 NVIDIA A100-80G GPU

Figure 36.       % Avg GPU utilization when running Medium HPGMG Dataset on nodes with 2 NVIDIA A100-80G GPUs

Figure 37.       % Avg GPU memory utilization when running Medium HPGMG Dataset on nodes with 1 NVIDIA A100-80G GPU

Figure 38.       % Avg GPU memory utilization when running Medium HPGMG Dataset on nodes with 2 NVIDIA A100-80G GPUs

Figure 39.       Sample % Avg network utilization when running Medium HPGMG Dataset on 8 nodes with 1 x A100-80G GPU per node

Figure 40.       Sample Grafana dashboard showing GPU statistics

Observations

Figure 41.       8 Node HPC Cluster comparison with HPGMG medium dataset test on 1 x NVIDIA A100-80G GPU vs Dual Socket CPU

The following observations were made from the test results obtained by running the HPGMG medium dataset on the HPC cluster of 8 x Cisco UCS C240 M7 nodes defined in this solution study:

   Depending on the workload and dataset size, more GPUs may be required, as was the case for the HPGMG medium dataset.

   The end-to-end FlexPod HPC cluster highlighted in this solution scaled linearly whether the test ran on CPUs only or with GPUs.

   Each node's resources (compute, storage, and network) were consumed evenly throughout the cluster test.

   Compared to the miniWeather and Minisweep exercises above, the HPGMG medium dataset, which targets the application areas of Cosmology, Astrophysics, and Combustion, showed 80% GPU utilization and 40% GPU memory utilization.

   Because the minimum memory required to execute the HPGMG medium dataset was higher than for the other two datasets, we were not able to run the test on a single system with one or two GPUs. Table 9 reports run times only for configurations with four or more GPUs in total.

   No CPU cycles were consumed when the test ran on the NVIDIA A100-80G GPU.

   With one GPU per node, the eight-node cluster completed the test about 3.5x faster than the same eight-node cluster without GPUs (190.16 seconds versus 677.14 seconds in Table 9).
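The GPU-versus-CPU ratios quoted in the three Observations sections can be reproduced directly from the 8-node columns of Tables 7, 8, and 9. The following is a small sketch using only those reported times; the names are illustrative.

```python
# 8-node total times in seconds: (CPU only, 1x A100 GPU per node),
# taken from Tables 7, 8, and 9 respectively.
EIGHT_NODE_TIMES = {
    "miniWeather (large)": (436.14, 109.27),
    "Minisweep (small)": (435.59, 76.21),
    "HPGMG (medium)": (677.1356146, 190.16),
}


def gpu_speedup(cpu_seconds: float, gpu_seconds: float) -> float:
    """How many times faster the GPU run finished compared with the CPU-only run."""
    return cpu_seconds / gpu_seconds


def speedups(times):
    """Map each application to its 8-node GPU-versus-CPU speedup."""
    return {app: gpu_speedup(cpu, gpu) for app, (cpu, gpu) in times.items()}


if __name__ == "__main__":
    for app, ratio in speedups(EIGHT_NODE_TIMES).items():
        print(f"{app:20s} {ratio:.2f}x faster with one GPU per node")
```

The ratios differ by application, which is consistent with the different GPU and GPU memory utilization levels noted for each workload.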

 

Conclusion

The amalgamation of High-Performance Computing (HPC) and Artificial Intelligence (AI) represents a powerful synergy that unleashes unprecedented computational capabilities, enabling us to tackle complex and data-intensive challenges with greater speed, accuracy, and efficiency. Combining CPUs and GPUs with a high-speed data fabric over an end-to-end 100GbE network is essential for achieving optimal performance, scalability, and responsiveness.

Each component matters for the following reasons, and together they bring the best of the HPC and AI worlds:

   Diverse Workload Support: CPUs are essential for handling diverse tasks, including system management, control flow, and sequential processes, making them crucial for both HPC and AI infrastructure.

   Parallel Processing: GPUs are vital for parallelizable workloads, such as deep learning and scientific simulations, where their massive parallel processing power accelerates complex calculations.

   Task Offloading: Combining CPUs and GPUs allows for intelligent task distribution, enabling CPUs to offload parallel workloads to GPUs for enhanced efficiency and speed.

   Optimal Performance: Together, CPUs and GPUs offer a balanced and high-performance computing environment, capable of handling a wide range of workloads seamlessly.

   Energy Efficiency: CPUs are generally more power-efficient for certain tasks and are important for overall system management. GPUs, on the other hand, excel in computational throughput. Combining the two can lead to better energy efficiency and performance.

   Fast Data Pipeline: Data-intensive HPC and AI workloads often involve large datasets. A 100GbE network provides a wide data pipeline that moves data between CPUs, GPUs, and storage rapidly and without bottlenecks, improving overall performance.

   Low Latency: Low-latency communication is crucial for HPC and AI workloads, as it reduces the time spent waiting for data transfers and results in faster overall processing.

   Scalability: High-speed networking supports the scaling of resources, enabling you to add more compute nodes, GPUs, or storage as needed for growing workloads.

   Resource Utilization: CPUs and GPUs are fully utilized as data moves quickly between them, minimizing idle times and maximizing the overall system efficiency.

In this solution study, we tested applications (use cases) targeted at weather simulation (miniWeather), Nuclear Engineering - Radiation Transport (Minisweep), and Cosmology, Astrophysics, and Combustion (HPGMG).

We documented the recommended tunable parameters for a balanced configuration across the compute, network, and storage components, and demonstrated near-linear scalability of the solution as the cluster grew from 1 to 8 nodes.

About the Authors

Hardik Patel, Technical Marketing Engineer, Cloud and Compute Product Group, Cisco Systems, Inc.

Hardik Patel is a Solution Architect in Cisco Systems' Cloud and Compute Engineering Group. Hardik has over 15 years of experience in datacenter solutions and technologies. He is currently responsible for the design and architecture of next-gen infrastructure solutions and performance in AI/ML and analytics. Hardik holds a Master of Science degree in Computer Science and various career-oriented certifications in virtualization, networking, and Microsoft technologies.

Tushar Patel, Distinguished Technical Marketing Engineer, Cloud and Compute Product Group, Cisco Systems, Inc.

Tushar Patel is a Solution Architect in Cisco Systems' Cloud and Compute Engineering Group. Tushar has 25+ years of experience in datacenter solutions and emerging technologies, focusing on databases, virtualization, clustering, and storage technologies. For the last 5 years, he has been leading Cisco UCS AI/ML strategy and solutions design for next-gen infrastructure.

Acknowledgements

For their support and contribution to the design, validation, and creation of this Cisco Validated Design, the authors would like to thank:

   Arthur Raefsky, Principal Software Engineer, Cisco Systems, Inc.

   Jeff Squyres, Principal Software Engineer, Cisco Systems, Inc.

   Vijay Durairaj, Technical Marketing Engineering, Cisco Systems, Inc.

   Rajendra Yogendra, Technical Marketing Engineering, Cisco Systems, Inc.

   Esteban Marin, Lead Software Engineer, Cisco Systems, Inc.

   Bobby Oomen, Sr. Manager FlexPod Solutions, NetApp

Appendix

This appendix contains the following:

   Appendix A - Bill of Materials

   Appendix B - Cisco Nexus 9000 Switch Configuration

   Appendix C - NetApp AFF A400 Storage Configuration

   Appendix D - Monitor GPU Utilization

   Appendix E - References used in this guide

   Appendix F - Recommended for you

Appendix A – Bill of Materials

Cisco UCS C240 M7

Table 10 lists the bill of materials for the Cisco UCS C240 M7.

Cisco Nexus 93600CD-GX

Table 11 lists the bill of materials for the Cisco Nexus 93600CD-GX

Table 10.   Bill of Material for Cisco UCS C240 M7

Part ID | Description | Qty
UCS-M7-MLB | UCS M7 RACK MLB | 1
DC-MGT-SAAS | Cisco Intersight SaaS | 1
DC-MGT-IS-SAAS-ES | Infrastructure Services SaaS/CVA - Essentials | 8
SVS-DCM-SUPT-BAS | Basic Support for DCM | 1
DC-MGT-UCSC-1S | UCS Central Per Server - 1 Server License | 8
UCSC-C240-M7SX | UCS C240M7 Rack w/oCPU, mem, drv, 2Uw24SFF HDD/SSD backplane | 8
UCSC-GPUAD-C240M7 | GPU AIR DUCT FOR C240M7 | 8
UCS-M2-960G-D | 960GB M.2 SATA Micron G2 SSD | 16
UCS-M2-HWRAID-D | Cisco Boot optimized M.2 Raid controller | 8
UCSX-TPM-002C-D | TPM 2.0, TCG, FIPS140-2, CC EAL4+ Certified, for servers | 8
UCSC-RAIL-D | Ball Bearing Rail Kit for C220 & C240 M7 rack servers | 8
CIMC-LATEST-D | IMC SW (Recommended) latest release for C-Series Servers. | 8
UCSC-HSLP-C220M7 | UCS C220 M7 Heatsink for & C240 GPU Heatsink | 16
UCSC-BBLKD-M7 | UCS C-Series M7 SFF drive blanking panel | 176
UCS-DDR5-BLK | UCS DDR5 DIMM Blanks | 128
UCSC-RISAB-24XM7 | UCS C-Series M7 2U Air Blocker GPU only | 8
CBL-FNVME-C240M7 | C240M7 NVMe CABLE, MB P-4 to BP (NVMe 3-4) | 8
UCSC-M2EXT-240-D | C240M7 2U M.2 Extender board | 8
UCS-P100CBL-240-D | C240M7 NVIDIA P100 /A100 /A40 / A16 / A30 Cable | 16
UCS-CPU-I6448H | Intel I6448H 2.4GHz/250W 32C/60MB DDR5 4800MT/s | 16
UCS-MRX64G2RE1 | 64GB DDR5-4800 RDIMM 2Rx4 (16Gb) | 128
UCSC-RIS1C-24XM7 | UCS C-Series M7 2U Riser 1C PCIe Gen5 (2x16) | 8
UCSC-RIS2C-24XM7 | UCS C-Series M7 2U Riser 2C PCIe Gen5 (2x16) (CPU2) | 8
UCSC-RIS3B-24XM7 | UCS C-Series M7 2U Riser 3B support rear SAS & NVMe Drives | 8
UCS-NVME4-1920-D | 1.9TB 2.5in U.2 15mm P5520 Hg Perf Med End NVMe | 32
UCSC-P-MCD100GF-D | Cisco-MLNX MCX623106AC-CDAT 2x100GbE QSFP56 PCIe NIC | 8
UCSC-GPUA100-80-D | TESLA A100, PASSIVE, 300W, 80GB | 8
NV-GRID-OPT-OUT-D | NVIDIA GRID SW OPT-OUT | 8
UCSC-GPUA100-80-D | TESLA A100, PASSIVE, 300W, 80GB | 8
NV-GRID-OPT-OUT-D | NVIDIA GRID SW OPT-OUT | 8
UCSC-PSU1-2300W-D | Cisco UCS 2300W AC Power Supply for Rack Servers Titanium | 16
CAB-C19-CBN | Cabinet Jumper Power Cord, 250 VAC 16A, C20-C19 Connectors | 16
UCS-SID-INFR-UNK-D | Unknown | 8
UCS-SID-WKL-UNK-D | Unknown | 8
CON-OSP-UCSCPC34 | SNTC-24X7X4OS UCS C240M7 Rack w/oCPU, mem, drv, 2Uw24S | 8

Table 11.   Bill of Material for Cisco Nexus 93600CD-GX

Part ID | Description | Qty
N9K-C93600CD-GX | Nexus 9300 with 28p 100G and 8p 400G | 2
NXK-AF-PI | Dummy PID for Airflow Selection Port-side Intake | 2
MODE-NXOS | Mode selection between ACI and NXOS | 2
NXOS-9.3.10 | Nexus 9500, 9300, 3000 Base NX-OS Software Rel 9.3.10 | 2
NXK-ACC-KIT-1RU | Nexus 3K/9K Fixed Accessory Kit, 1RU front and rear removal | 2
NXA-FAN-35CFM-PI | Nexus Fan, 35CFM, port side intake airflow | 12
NXA-PAC-1100W-PI2 | Nexus AC 1100W PSU - Port Side Intake | 4
CAB-C13-C14-AC | Power cord, C13 to C14 (recessed receptacle), 10A | 4
C1-SUBS-OPTOUT | OPT OUT FOR "Default" DCN Subscription Selection | 2
CON-SNC-N9KC936G | SNTC-NCD Nexus 9300 with 28p 100G and 8p 400G | 2

Appendix B – Cisco Nexus 9000 Switch Configuration

Procedure 1.       Virtual Port-Channel (vPC) Configuration

A port channel bundles individual links into a channel group to create a single logical link that provides the aggregate bandwidth of up to eight physical links. If a member port within a port channel fails, traffic previously carried over the failed link switches to the remaining member ports within the port channel. Port channeling also load balances traffic across these physical interfaces. The port channel stays operational as long as at least one physical interface within the port channel is operational. Using port channels, Cisco NX-OS provides wider bandwidth, redundancy, and load balancing across the channels.
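The behavior described above (flows spread across member links, and traffic moving to the remaining members when one fails) can be illustrated with a toy model. This is only a sketch: the CRC32-modulo hash, the link names, and the flow identifiers are hypothetical stand-ins and do not represent the load-balancing algorithm that Cisco NX-OS actually uses.

```python
import zlib


def pick_member(flow: str, members: list) -> str:
    """Map a flow identifier to one operational member link by hashing it."""
    if not members:
        raise ValueError("port channel is down: no operational member links")
    return members[zlib.crc32(flow.encode()) % len(members)]


def aggregate_bandwidth_gbps(members: list, link_speed_gbps: int = 100) -> int:
    """Aggregate bandwidth of the logical link: one link speed per operational member."""
    return len(members) * link_speed_gbps


if __name__ == "__main__":
    members = ["link-1", "link-2", "link-3", "link-4"]  # hypothetical member links
    flows = [f"10.0.0.{i}->10.0.1.{i}" for i in range(1, 9)]  # hypothetical flows

    print("Aggregate bandwidth:", aggregate_bandwidth_gbps(members), "Gbps")
    for flow in flows:
        print(flow, "->", pick_member(flow, members))

    # Simulate a member failure: the remaining links keep carrying all flows.
    survivors = [m for m in members if m != "link-2"]
    print("After failure:", aggregate_bandwidth_gbps(survivors), "Gbps")
    for flow in flows:
        print(flow, "->", pick_member(flow, survivors))
```

The port channel stays usable as long as the member list is non-empty, which mirrors the statement that it remains operational while at least one physical interface is up. A simple modulo hash also moves some flows that were on healthy links, which a real implementation may handle differently.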

In the Cisco Nexus Switch topology, the vPC feature is enabled to provide HA, faster convergence in the event of a failure, and greater throughput. The Cisco Nexus vPC configurations with the vPC domains and corresponding vPC names and IDs for this solution are listed in Table 12.

Table 12.   vPC Summary

vPC Domain | vPC Name | vPC ID
18 | Peer-Link | 1
18 | vPC Storage A | 13
18 | vPC Storage B | 14

As listed in Table 12, a single vPC domain with Domain ID 18 is created across the two Cisco Nexus switches to define vPC members that carry specific VLAN network traffic. In this topology, we defined a total of 11 vPCs.

vPC ID 1 is defined as the peer link for communication between the two Cisco Nexus switches. vPC IDs 13 and 14 are configured between the Cisco Nexus switches and the NetApp storage controllers. vPC IDs 71-78 are configured between the Mellanox ConnectX-6 Ethernet network interface card (NIC) ports on each Cisco UCS C240 M7 server and the Cisco Nexus switches.
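The vPC ID plan described above can be written down as a small inventory and checked for duplicates and for the stated total. This is a minimal sketch; the grouping labels and function name are illustrative, and only the ID values come from the text.

```python
# vPC IDs as described in this section, grouped by what they connect.
VPC_PLAN = {
    "peer-link": [1],
    "NetApp storage controllers": [13, 14],
    "Cisco UCS C240 M7 servers": list(range(71, 79)),  # vPC IDs 71-78, one per server
}


def all_vpc_ids(plan: dict) -> list:
    """Flatten the plan into a sorted list of IDs, rejecting duplicates."""
    ids = [vpc_id for group in plan.values() for vpc_id in group]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate vPC ID in plan")
    return sorted(ids)


if __name__ == "__main__":
    ids = all_vpc_ids(VPC_PLAN)
    print(f"{len(ids)} vPCs defined: {ids}")
```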

Figure 42.       Virtual Port-Channel on Cisco Nexus Switches

Related image, diagram or screenshot

Procedure 2.       Create Virtual Port-Channel Domain

Step 1.      Log in to the Cisco Nexus A switch as the admin user and run the following commands:

vpc domain 18

  peer-keepalive destination 10.29.148.12 source 10.29.148.11

  delay restore 150

  peer-gateway

  auto-recovery

  ip arp synchronize

  role priority 10

Step 2.      Log in to the Cisco Nexus B switch as the admin user and run the following commands:

vpc domain 18

  peer-keepalive destination 10.29.148.11 source 10.29.148.12

  delay restore 150

  peer-gateway

  auto-recovery

  ip arp synchronize

  role priority 20

interface Ethernet1/33

  description Peer link connected to Leaf3-Spine1

  switchport mode trunk

  switchport trunk allowed vlan 1,110,160,248

  priority-flow-control mode on

interface Ethernet1/34

  description Peer link connected to Leaf3-Spine1

  switchport mode trunk

  switchport trunk allowed vlan 1,110,160,248

  priority-flow-control mode on

interface Ethernet1/35

  description Peer link connected to Leaf3-Spine2

  switchport mode trunk

  switchport trunk allowed vlan 1,110,160,248

  priority-flow-control mode on

interface Ethernet1/36

  description Peer link connected to Leaf3-Spine2

  switchport mode trunk

  switchport trunk allowed vlan 1,110,160,248

Step 3.      Repeat this procedure on the Cisco Nexus B switch.
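The vPC domain stanzas for the two switches differ only in the direction of the peer-keepalive addresses and in the role priority. A short sketch such as the following can render both from one set of parameters and reduce copy-and-paste mistakes. The function name is illustrative, the values are the ones shown in this procedure, and the output is plain text that still has to be reviewed and applied manually.

```python
def vpc_domain_config(domain_id: int, local_ip: str, peer_ip: str, role_priority: int) -> str:
    """Render the vPC domain stanza for one switch of the pair."""
    lines = [
        f"vpc domain {domain_id}",
        f"  peer-keepalive destination {peer_ip} source {local_ip}",
        "  delay restore 150",
        "  peer-gateway",
        "  auto-recovery",
        "  ip arp synchronize",
        f"  role priority {role_priority}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Keepalive addresses and priorities as used for Nexus A and Nexus B above.
    nexus_a = vpc_domain_config(18, "10.29.148.11", "10.29.148.12", 10)
    nexus_b = vpc_domain_config(18, "10.29.148.12", "10.29.148.11", 20)
    print("! Cisco Nexus A")
    print(nexus_a)
    print("! Cisco Nexus B")
    print(nexus_b)
```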

Procedure 3.       Create Virtual Port-Channel between Cisco Nexus and NetApp Storage

Step 1.      Log in to the Cisco Nexus A switch as the admin user and run the following commands:

interface port-channel13

  description PC-NetApp-A

  switchport mode trunk

  switchport trunk allowed vlan 110,160

  spanning-tree port type edge trunk

  mtu 9216

  service-policy type qos input QOS_MARKING

  vpc 13

  no shutdown

interface port-channel14

  description PC-NetApp-B

  switchport mode trunk

  switchport trunk allowed vlan 110,160

  spanning-tree port type edge trunk

  mtu 9216

  service-policy type qos input QOS_MARKING

  vpc 14

  no shutdown

interface Ethernet1/27

  description FlexPod-A400-CT1:e5a

  switchport mode trunk

  switchport trunk allowed vlan 110,160

  mtu 9216

  channel-group 13 mode active

  no shutdown

interface Ethernet1/28

  description FlexPod-A400-CT2:e5a

  switchport mode trunk

  switchport trunk allowed vlan 110,160

  mtu 9216

  channel-group 14 mode active

  no shutdown

Step 2.      Repeat this procedure on Cisco Nexus B switch for FlexPod-A400-CT1:e5b and FlexPod-A400-CT2:e5b

Procedure 4.       Create Virtual Port-Channel between Cisco UCS C Series server and Cisco Nexus Switches

Step 1.      Log in to the Cisco Nexus A switch as the admin user and run the following commands:

interface Ethernet1/3-10

  switchport mode trunk

  switchport trunk allowed vlan 110

  priority-flow-control mode on

  spanning-tree port type edge trunk

  mtu 9216

  service-policy type qos input QOS_MARKING

  no shutdown

Step 2.      Repeat this procedure on the Cisco Nexus A switch for all C-Series rack servers with a Mellanox ConnectX-6 Dx Ethernet NIC connected to the Nexus 93600CD-GX.

Step 3.      Repeat this procedure on the Cisco Nexus B switch.

Appendix C – NetApp AFF A400 Configuration

NetApp ONTAP 9.12.1

Procedure 1.       Complete Configuration Worksheet

Before running the setup script, complete the Cluster setup worksheet in the NetApp ONTAP 9 Documentation Center. You must have access to the NetApp Support site to open the cluster setup worksheet.

Tech tip

It is beyond the scope of this document to explain the NetApp storage connectivity and infrastructure configuration in detail. For installation and setup instructions for the NetApp AFF A400 system, go to: https://docs.netapp.com/us-en/ontap-systems/a400/index.html

For more information, visit FlexPod Design Guides: https://www.cisco.com/c/en/us/solutions/design-zone/data-center-design-guides/flexpod-design-guides.html

Step 1.      For data storage across all nodes in the HPC cluster deployment, two large data aggregates (one aggregate on each storage node) were configured as shown below:

A screenshot of a computerDescription automatically generated

Step 2.      The screenshot below shows the storage VM (SVM, formerly known as a Vserver) configured as “HPC-SVM” for this solution:

A screen shot of a computer programDescription automatically generated

Step 3.      The SVM named “HPC-SVM” was configured to carry all NFS traffic for this HPC Data Analytics solution.

A screenshot of a computerDescription automatically generated

Step 4.      Detailed configuration for HPC-SVM is shown below:

A screen shot of a computerDescription automatically generated

Step 5.      The broadcast-domain was configured as “NFS-data” with MTU 9000 and assigned to default IPspace as shown below:

A computer screen shot of a black screenDescription automatically generated

Step 6.      The following screenshot shows the overview of the network configuration

A screenshot of a computerDescription automatically generated

Step 7.      The export policy “default” was configured with rules that allow the NFSv4 protocol for the client subnets of the UNIX systems.

Step 8.      Enable Cisco Discovery Protocol (CDP):

A400-CLUSTER::> node run -node * options cdpd.enable on

Step 9.      Enable Link Layer Discovery Protocol (LLDP) on all Ethernet ports:

A400-CLUSTER::> node run * options lldp.enable on
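The storage broadcast domain above uses an MTU of 9000, while the switch interfaces in Appendix B are configured with mtu 9216. A quick way to sanity-check that the two settings are compatible is to add the Ethernet framing overhead to the IP MTU and compare the result with the switch value. The sketch below assumes standard Ethernet overheads (14-byte header, 4-byte 802.1Q VLAN tag, 4-byte FCS) and treats the switch value conservatively as a limit on the whole frame; how a given platform counts headers against its MTU setting is an assumption here, not something stated in this guide.

```python
ETHERNET_HEADER_BYTES = 14  # destination MAC, source MAC, EtherType
DOT1Q_TAG_BYTES = 4         # 802.1Q VLAN tag on trunk ports
FCS_BYTES = 4               # frame check sequence


def max_frame_bytes(ip_mtu: int, vlan_tagged: bool = True) -> int:
    """Largest Ethernet frame produced by a host sending packets of size ip_mtu."""
    tag = DOT1Q_TAG_BYTES if vlan_tagged else 0
    return ip_mtu + ETHERNET_HEADER_BYTES + tag + FCS_BYTES


def fits(ip_mtu: int, switch_mtu: int, vlan_tagged: bool = True) -> bool:
    """True if the largest frame fits even when the switch limit covers the whole frame."""
    return max_frame_bytes(ip_mtu, vlan_tagged) <= switch_mtu


if __name__ == "__main__":
    storage_mtu = 9000  # NFS-data broadcast domain MTU
    switch_mtu = 9216   # mtu configured on the Nexus interfaces
    print("Largest tagged frame:", max_frame_bytes(storage_mtu), "bytes")
    print("Fits within switch MTU:", fits(storage_mtu, switch_mtu))
```

The same check can be applied to the MTU configured on the server NICs, which should match the storage side for jumbo frames to work end to end.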

Appendix D – Monitor GPU Utilization

To monitor GPU utilization on the HPC cluster nodes while running the different applications, we created a local Grafana dashboard for our solution test setup. This setup is for educational purposes and is recommended for lab/PoC use only.

Prerequisites

1.    Install and configure Grafana open-source server.

2.    Verify Grafana server is running:

$ sudo systemctl status grafana-server

3.    Install Prometheus and configure it to run as a service.

4.    Verify that Prometheus is installed, then start the service:

$ prometheus --version

$ sudo systemctl start prometheus

sys

Step 1.      Go to http://<Grafana_Server_IP_Address>:3000 and log in to Grafana with the default username and password (admin/admin).

A screenshot of a login screenDescription automatically generated

Step 2.      Change the password when prompted.

Step 3.      Add Prometheus as data source.

A screenshot of a computerDescription automatically generated

Step 4.      Add a new dashboard in Grafana. Go to Home > Dashboards > New > Import.

A screenshot of a computerDescription automatically generated

Step 5.      Copy the Grafana dashboard ID “14574” for NVIDIA GPU Metrics: https://grafana.com/grafana/dashboards/14574-nvidia-gpu-metrics/

Step 6.      Enter the dashboard ID under Import via grafana.com and click Load.

A screenshot of a black screenDescription automatically generated

Step 7.      Edit the /etc/prometheus/prometheus.yml file as shown in the example below:

$ sudo vi /etc/prometheus/prometheus.yml

# my global config

global:

  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.

  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

  # scrape_timeout is set to the global default (10s).

 

# Alertmanager configuration

alerting:

  alertmanagers:

    - static_configs:

        - targets:

          # - alertmanager:9093

 

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.

rule_files:

  # - "first_rules.yml"

  # - "second_rules.yml"

 

# Scrape configuration: the targets below are the GPU exporter endpoints

# (port 9835) on the eight cluster nodes.

scrape_configs:

  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.

  - job_name: "prometheus"

 

    # metrics_path defaults to '/metrics'

    # scheme defaults to 'http'.

 

    static_configs:

      - targets: ["10.29.148.171:9835"]

      - targets: ["10.29.148.172:9835"]

      - targets: ["10.29.148.173:9835"]

      - targets: ["10.29.148.174:9835"]

      - targets: ["10.29.148.175:9835"]

      - targets: ["10.29.148.176:9835"]

      - targets: ["10.29.148.177:9835"]

      - targets: ["10.29.148.178:9835"]
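Because the eight exporter targets are consecutive addresses on the same port, the static_configs entries can also be generated rather than typed by hand. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the output is meant to be pasted into prometheus.yml after review.

```python
import ipaddress


def exporter_targets(first_ip: str, count: int, port: int = 9835) -> list:
    """Return 'ip:port' strings for `count` consecutive IPv4 addresses."""
    start = ipaddress.IPv4Address(first_ip)
    return [f"{start + offset}:{port}" for offset in range(count)]


def static_configs_yaml(targets: list, indent: int = 6) -> str:
    """Render one '- targets: [...]' line per exporter, matching the layout used above."""
    pad = " " * indent
    return "\n".join(f'{pad}- targets: ["{target}"]' for target in targets)


if __name__ == "__main__":
    # First node address and node count as used in this solution.
    print(static_configs_yaml(exporter_targets("10.29.148.171", 8)))
```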

Step 8.      Download the NVIDIA GPU exporter (nvidia_gpu_exporter) from GitHub:

https://github.com/utkuozdemir/nvidia_gpu_exporter/releases

Step 9.      On each client node, install nvidia-gpu-exporter. For example:

$ sudo dpkg -i nvidia-gpu-exporter_1.2.0_linux_amd64.deb

Step 10.  Log in to the Grafana web UI and open the imported dashboard. Use the drop-down list next to GPU to switch between the monitored GPUs.

A screenshot of a computerDescription automatically generated
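Outside Grafana, the exporter output can also be inspected directly, because each exporter serves metrics in the Prometheus text exposition format. The sketch below parses such text and extracts per-GPU utilization. The sample text is hypothetical, and the metric name nvidia_smi_utilization_gpu_ratio and the uuid label are assumptions about the exporter's output that should be checked against a real /metrics response. The parser is deliberately simplified: it ignores optional timestamps and does not handle escaped characters or closing braces inside label values.

```python
import re

# Hypothetical sample of exporter output, used only to exercise the parser.
SAMPLE = """\
# TYPE nvidia_smi_utilization_gpu_ratio gauge
nvidia_smi_utilization_gpu_ratio{uuid="gpu-0"} 0.75
nvidia_smi_utilization_gpu_ratio{uuid="gpu-1"} 0.5
nvidia_smi_memory_used_bytes{uuid="gpu-0"} 6.0e+10
"""

SAMPLE_LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$"
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> list:
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # skip lines this simplified parser does not understand
        labels = dict(LABEL_PAIR.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def gpu_utilization_percent(samples, metric="nvidia_smi_utilization_gpu_ratio") -> dict:
    """Map each GPU (by its uuid label) to utilization in percent."""
    return {
        labels.get("uuid", "unknown"): value * 100
        for name, labels, value in samples
        if name == metric
    }


if __name__ == "__main__":
    print(gpu_utilization_percent(parse_metrics(SAMPLE)))
```

In practice the text would come from an HTTP GET against one of the exporter targets listed in prometheus.yml instead of the inline sample.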

Appendix E – References used in this guide

Anycast RP Technology White Paper: https://www.cisco.com/c/en/us/td/docs/ios/solutions_docs/ip_multicast/White_papers/anycast.html

Campus Network for High Availability Design Guide, Tuning for Optimized Convergence: https://www.cisco.com/c/en/us/td/docs/solutions/Enterprise/Campus/HA_campus_DG/hacampusdg.html#wp1107578

Campus Network for High Availability Design Guide: https://www.cisco.com/c/en/us/td/docs/solutions/Enterprise/Campus/HA_campus_DG/hacampusdg.html

Appendix F – Recommended for you

Cisco Unified Computing System: https://www.cisco.com/site/us/en/products/computing/servers-unified-computing-systems/index.html

Cisco UCS 6536 Fabric Interconnects: https://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs6536-fabric-interconnect-ds.html

Cisco UCS X9508 Chassis: https://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-x-series-modular-system/datasheet-c78-2472574.html

Cisco UCS X210c M7 Compute Node: https://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-x-series-modular-system/ucs-x210c-m7-compute-node-ds.html

Cisco UCS X440p PCIe Node: https://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-x-series-modular-system/ucs-x440p-pcle-node-ds.html

Cisco UCS C240 M7 Rack Server: https://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-c-series-rack-servers/ucs-c240-m7-rack-server-ds.html

Cisco UCS VIC 15000 Series Adapters: https://www.cisco.com/c/en/us/products/collateral/interfaces-modules/unified-computing-system-adapters/ucs-vic-15000-series-ds.html

Cisco Intersight Infrastructure Service: https://www.cisco.com/c/en/us/products/collateral/cloud-systems-management/intersight/intersight-ds.html

Cisco UCS Manager: https://www.cisco.com/c/en/us/products/servers-unified-computing/ucs-manager/index.html

NVIDIA GPU Cloud: https://www.nvidia.com/en-us/gpu-cloud/

NVIDIA AI Enterprise: https://www.nvidia.com/en-us/data-center/products/ai-enterprise/

NVIDIA HPC SDK: https://developer.nvidia.com/hpc-sdk

Cisco Nexus 9300-GX Series Switch: https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/nexus-9300-gx-series-switches-ds.html

Cisco Nexus 9336C-FX2 Switch: https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/datasheet-c78-742282.html

NetApp Data ONTAP: https://www.netapp.com/data-management/ontap-data-management-software/

NetApp AFF A400: https://www.netapp.com/data-storage/aff-a-series/aff-a400/

SpecHPC 2021 Benchmark Suite: https://www.spec.org/hpc2021/

Feedback

For comments and suggestions about this guide and related guides, join the discussion on Cisco Community at https://cs.co/en-cvds.

CVD Program

ALL DESIGNS, SPECIFICATIONS, STATEMENTS, INFORMATION, AND RECOMMENDATIONS (COLLECTIVELY, "DESIGNS") IN THIS MANUAL ARE PRESENTED "AS IS," WITH ALL FAULTS. CISCO AND ITS SUPPLIERS DISCLAIM ALL WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OR ARISING FROM A COURSE OF DEALING, USAGE, OR TRADE PRACTICE. IN NO EVENT SHALL CISCO OR ITS SUPPLIERS BE LIABLE FOR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, OR INCIDENTAL DAMAGES, INCLUDING, WITHOUT LIMITATION, LOST PROFITS OR LOSS OR DAMAGE TO DATA ARISING OUT OF THE USE OR INABILITY TO USE THE DESIGNS, EVEN IF CISCO OR ITS SUPPLIERS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

THE DESIGNS ARE SUBJECT TO CHANGE WITHOUT NOTICE. USERS ARE SOLELY RESPONSIBLE FOR THEIR APPLICATION OF THE DESIGNS. THE DESIGNS DO NOT CONSTITUTE THE TECHNICAL OR OTHER PROFESSIONAL ADVICE OF CISCO, ITS SUPPLIERS OR PARTNERS. USERS SHOULD CONSULT THEIR OWN TECHNICAL ADVISORS BEFORE IMPLEMENTING THE DESIGNS. RESULTS MAY VARY DEPENDING ON FACTORS NOT TESTED BY CISCO.

CCDE, CCENT, Cisco Eos, Cisco Lumin, Cisco Nexus, Cisco StadiumVision, Cisco TelePresence, Cisco WebEx, the Cisco logo, DCE, and Welcome to the Human Network are trademarks; Changing the Way We Work, Live, Play, and Learn and Cisco Store are service marks; and Access Registrar, Aironet, AsyncOS, Bringing the Meeting To You, Catalyst, CCDA, CCDP, CCIE, CCIP, CCNA, CCNP, CCSP, CCVP, Cisco, the Cisco Certified Internetwork Expert logo, Cisco IOS, Cisco Press, Cisco Systems, Cisco Systems Capital, the Cisco Systems logo, Cisco Unified Computing System (Cisco UCS), Cisco UCS B-Series Blade Servers, Cisco UCS C-Series Rack Servers, Cisco UCS S-Series Storage Servers, Cisco UCS X-Series, Cisco UCS Manager, Cisco UCS Management Software, Cisco Unified Fabric, Cisco Application Centric Infrastructure, Cisco Nexus 9000 Series, Cisco Nexus 7000 Series. Cisco Prime Data Center Network Manager, Cisco NX-OS Software, Cisco MDS Series, Cisco Unity, Collaboration Without Limitation, EtherFast, EtherSwitch, Event Center, Fast Step, Follow Me Browsing, FormShare, GigaDrive, HomeLink, Internet Quotient, IOS, iPhone, iQuick Study,  LightStream, Linksys, MediaTone, MeetingPlace, MeetingPlace Chime Sound, MGX, Networkers, Networking Academy, Network Registrar, PCNow, PIX, PowerPanels, ProConnect, ScriptShare, SenderBase, SMARTnet, Spectrum Expert, StackWise, The Fastest Way to Increase Your Internet Quotient, TransPath, WebEx, and the WebEx logo are registered trade-marks of Cisco Systems, Inc. and/or its affiliates in the United States and certain other countries. (LDW_P4)

All other trademarks mentioned in this document or website are the property of their respective owners. The use of the word partner does not imply a partnership relationship between Cisco and any other company. (0809R)
