Cisco UCS C-Series Rack Servers

Cisco usNIC Performance on C220 M3 with Intel E5 (v1) Processors


Key Findings
Introduction
Cisco usNIC
Performance Results
    Point-to-Point Benchmarks
        NetPIPE
        Intel MPI Benchmarks
System under Test Configuration
    System Hardware and Software
        Network Adapter
        Server System
        Network Switch
    BIOS Settings
        CPU Configuration
        Memory Configuration
    Operating System Settings
        Kernel Parameters
        CPU Governor
        Other
    Networking Settings
        Adapter Configuration
        Network Switch Configuration
    Network Topology
    Benchmarking Software
Conclusion
Appendix

Key Findings

Cisco VIC with usNIC technology achieves 2.13 microseconds ping-pong latency using Open MPI across a Cisco Nexus 3548 Switch.

Cisco VIC with usNIC technology achieves 1168 MBps of MPI ping-pong throughput.

Cisco VIC with usNIC technology achieves 2336 MBps of MPI Exchange and MPI SendRecv throughput.

Introduction

With the advent of highly dense multicore compute systems, the need for low-latency network communication between compute systems has become paramount. In addition, the cost of building such compute cluster systems, as well as the cost of managing them, has become a major consideration in the design and deployment of these systems.

This white paper presents a brief introduction to Cisco® user-space NIC (usNIC) technology and reports performance results from NetPIPE and the Intel MPI Benchmarks (IMB).

Cisco usNIC

Cisco usNIC is a low-latency interface on top of Cisco UCS® Virtual Interface Card (VIC) 1225. The interface consists of a set of software libraries that enable a data path bypassing the kernel. Cisco UCS VIC 1225 is a converged network adapter (CNA) that supports Ethernet NICs and Fibre Channel host bus adapters (HBAs).

Cisco usNIC enables ultra-low-latency network communication between nodes, making it well suited to latency-sensitive workloads such as high-performance computing (HPC) clusters.

Performance Results

Point-to-Point Benchmarks

NetPIPE

NetPIPE performs a simple ping-pong test between two nodes and reports half-round-trip (HRT) latencies in microseconds and throughput in MBps for a range of message sizes. Figure 1 shows a graph of the test results.
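NetPIPE's reported figures follow directly from timed round trips: half the measured round-trip time is the HRT latency, and the message size divided by that one-way time is the throughput. A minimal sketch of this arithmetic, with illustrative values (not measured results):

```python
def hrt_latency_us(rtt_s):
    # Half-round-trip (one-way) latency in microseconds
    return rtt_s / 2 * 1e6

def throughput_mbps(msg_bytes, rtt_s):
    # Bytes moved one way, divided by the one-way time, in MBps
    return msg_bytes / (rtt_s / 2) / 1e6

# Illustrative: a 1 MB message with a 2.0 ms round trip
print(hrt_latency_us(2.0e-3))              # → 1000.0 (microseconds)
print(throughput_mbps(1_000_000, 2.0e-3))  # → 1000.0 (MBps)
```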

The following mpirun command was used to run the NetPIPE test:

/opt/cisco/openmpi/bin/mpirun \
--host n1,n2 \
--mca btl usnic,self,sm \
--bind-to-core \

Figure 1. NetPIPE Latency and Throughput

Intel MPI Benchmarks

IMB runs a set of MPI tests between two nodes and reports half-round-trip (HRT) latencies in microseconds and throughput in MBps for message sizes from 2^0 through 2^22 bytes, inclusive.

The following tests were run:

PingPong, PingPing

Sendrecv, Exchange

Allreduce, Reduce, Reduce_scatter

Gather, Gatherv, Scatter, Scatterv

Allgather, Allgatherv, Alltoall, Alltoallv, Bcast

For more information on the benchmarks, refer to the IMB user guide link in the appendix.
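The throughput figures these tests report are not all directly comparable: PingPong counts traffic in one direction at a time, while Sendrecv and Exchange overlap both directions and so report roughly double the PingPong number (compare the 1168-MBps and 2336-MBps key findings). A minimal sketch of this accounting, with illustrative values; see the IMB user guide for the exact definitions:

```python
def pingpong_mbps(msg_bytes, time_s):
    # One message crosses the link in time_s (half round trip)
    return msg_bytes / time_s / 1e6

def sendrecv_mbps(msg_bytes, time_s):
    # Each rank both sends and receives a message in time_s,
    # so twice the bytes are counted per operation
    return 2 * msg_bytes / time_s / 1e6

# Illustrative: a 1 MB message moved in 1.0 ms
print(pingpong_mbps(1_000_000, 1.0e-3))  # → 1000.0 (MBps)
print(sendrecv_mbps(1_000_000, 1.0e-3))  # → 2000.0 (MBps)
```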

The following mpirun command was used to run the IMB tests:

/opt/cisco/openmpi/bin/mpirun \
--host n1,n2 \
--mca btl usnic,self,sm \
--bind-to-core \
/opt/IMB-MPI1 -iter 10000,10000,10000

Figures 2, 3, and 4 present IMB ping-pong latencies and throughput for a range of message sizes using Cisco usNIC. The message sizes are split into small, medium, and large ranges to allow a closer look at each trend.

Figure 2. IMB PingPong Small Message

Figure 3. IMB PingPong Medium Message

Figure 4. IMB PingPong Large Message

The remaining figures present consolidated results from the other IMB performance tests.

Figure 5. PingPong, PingPing

Figure 6. SendRecv, Exchange

Figure 7. Allreduce, Reduce, Reduce_scatter

Figure 8. Gather, Gatherv, Scatter, Scatterv

Figure 9. Allgather, Allgatherv, Alltoall, Alltoallv, Bcast

System under Test Configuration

System Hardware and Software

Network Adapter

Hardware: Cisco UCS VIC 1225 CNA

Firmware: 2.1(2.127)

Driver software: enic-

usNIC software:

Server System

Hardware: Cisco UCS C220 M3 Rack Server with Intel® E5-2690 processor with 1600-MHz DDR3 memory

Firmware: Cisco UCS C-Series Software, Release 1.5(2)

Network Switch

Hardware: Cisco Nexus® 3548 Switch

Software: Release 6.0(2)A1(1a)

BIOS Settings

The following BIOS settings were used for this performance testing:

CPU Configuration

Intel Hyper-Threading technology: Disabled

Number of enabled cores: All

Execute disable: Disabled

Intel VT: Enabled

Intel VT-d: Enabled

Intel VT-d coherency support: Enabled

Intel VT-d ATS support: Enabled

CPU performance: HPC

Hardware prefetcher: Enabled

Adjacent cache line prefetcher: Enabled

DCU streamer prefetch: Enabled

DCU IP prefetcher: Enabled

Power technology: Custom

Enhanced Intel SpeedStep® technology: Enabled

Intel Turbo Boost technology: Enabled

Processor power state C6: Disabled

Processor power state C1 enhanced: Disabled

Frequency floor override: Disabled

P-STATE coordination: HW_ALL

Energy performance: Performance

Memory Configuration

Select memory RAS: Maximum performance

DRAM clock throttling: Performance

NUMA: Enabled

Low-voltage DDR mode: Performance mode

DRAM refresh rate: 2x

Channel interleaving: Auto

Rank interleaving: Auto

Patrol scrub: Disabled

Demand scrub: Enabled

Altitude: 300M

Other settings were left at default values.

Operating System Settings

Kernel Parameters

The following kernel parameter enables support for the Intel I/O memory management unit (IOMMU): intel_iommu=on

The following parameter turns off the Intel CPU idle driver: intel_idle.max_cstate=0

The following kernel parameter explicitly disables the CPU C1 and C1E states: idle=poll

Note: Using the idle=poll kernel parameter can result in increased power consumption. Use with caution.

The above kernel parameters were configured in the /etc/grub.conf file.
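For example, the parameters are appended to the kernel line in /etc/grub.conf; in this sketch, the kernel version and root device are placeholders, not the tested values:

```
kernel /vmlinuz-<version> ro root=<root-device> intel_iommu=on intel_idle.max_cstate=0 idle=poll
```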

CPU Governor

The OS CPU governor was configured for "performance": in the file /etc/sysconfig/cpuspeed, the GOVERNOR variable was set to "performance".



Other

The test nodes were also configured to operate in runlevel 3 to avoid unnecessary background processes.

In the file /etc/inittab, the following line replaced the OS default:

id:3:initdefault:

SELinux was disabled. In the file /etc/selinux/config, the following line replaced the OS default:

SELINUX=disabled

Networking Settings

Adapter Configuration

The following vNIC adapter configuration was used:

MTU: 9000

Number of VF instances: 16

Interrupt coalescing timer: 0

The vNIC is directly configured from the onboard Cisco Integrated Management Controller (IMC).

Network Switch Configuration

The Cisco Nexus 3548 was configured with the following settings:

Flow control: On

No Drop mode: Enabled

Pause: Enabled

Network MTU: 9216

WARP mode: Enabled

To enable flow control on the Nexus 3548:

configure terminal
interface ethernet 1/1-48
flowcontrol receive on
flowcontrol send on

To enable No Drop mode and Pause:

configure terminal
class-map type network-qos class1
match qos-group 1
policy-map type network-qos my_network_policy
class type network-qos class1
pause no-drop
system qos
service-policy type network-qos my_network_policy
show running-config ipqos
configure terminal
class-map type qos class1
match cos 2
policy-map type qos my_qos_policy
class type qos class1
set qos-group 1
system qos
service-policy type qos input my_qos_policy

To enable MTU 9216:

configure terminal
policy-map type network-qos jumbo
class type network-qos class-default
mtu 9216
system qos
service-policy type network-qos jumbo

To enable WARP mode:

configure terminal
hardware profile forwarding-mode warp
copy running-config startup-config

Note 1: The above configuration is specific to the system under test presented here and may not be directly applicable to all use cases. Please consult your local network administrator or refer to the Cisco Nexus 3548 Command Reference (see the link in the appendix) for more information.

Note 2: The above configuration also enables send/receive flow control and No Drop mode. This prevents the switch from dropping packets by combining port buffer management with network pause. These settings are specific to this test; for some applications it may be preferable not to enable send/receive flow control and No Drop mode.

Network Topology

Two nodes were connected to a single Cisco Nexus 3548 Switch. The Cisco Nexus 3548 is an ultra-low-latency switch from Cisco that is well suited to low-latency network messaging. For more details, refer to the product page link for the switch in the appendix.

Figure 10 shows the network topology that was used.

Figure 10. Network Topology

Benchmarking Software

NetPIPE version 3.7.1 was used for testing point-to-point latency and for a throughput comparison between Cisco usNIC and kernel TCP.

Intel MPI Benchmarks (IMB) version 3.2.4 was used for testing. Refer to the links in the appendix for more information about this software.

Conclusion

With a small-message MPI ping-pong latency of 2.13 microseconds and a maximum ping-pong throughput of 1168 MBps, Cisco usNIC on the Cisco UCS VIC 1225, combined with the Cisco Nexus 3548, enables a full HPC stack capable of both low latency and high throughput. It is therefore a compelling approach to running HPC workloads with Open MPI on standard Ethernet networks.

Appendix

Cisco UCS VIC 1225 CNA product page:

Cisco UCS C220 M3 Rack Server product page:

Cisco Nexus 3548 Switch product page:

Cisco Nexus 3548 Command Reference:

NetPIPE homepage:

Intel MPI Benchmarks (IMB):

IMB user guide is available at: