This paper presents a networking performance comparison between Cisco Data Center Virtual Machine Fabric Extender (VM-FEX) and VMware vSwitch network connectivity technologies using the Cisco UCS® Virtual Interface Card (VIC) 1240 on a Cisco UCS B200 M3 Blade Server. The following observations are presented:
• Cisco VM-FEX technology can transmit or receive 9.8 Gbps of uni-directional TCP network throughput while utilizing 44.80 percent system CPU for transmit and 65.60 percent system CPU for receive.
• Cisco VM-FEX uses 16 percent lower system CPU for transmit and 30.4 percent lower system CPU for receive compared to VMware vSwitch for the same amount of bandwidth.
• Cisco VM-FEX uses 65.60 percent of system CPU for transmit and receive while driving 10.89 Gbps of bi-directional TCP network throughput compared to VMware vSwitch, which uses 81.60 percent of system CPU while driving only 7.97 Gbps.
• Cisco VM-FEX takes 36 percent less time for an average round trip compared to VMware vSwitch.
• Cisco VM-FEX offers over 40 percent reduction in latency compared to VMware vSwitch.
Cisco VM-FEX technology is a Cisco innovation that allows VMs to bypass the hypervisor networking stack and access the network directly.
Cisco VM-FEX utilizes the capability to create multiple vNICs in combination with VMware VMDirectPath and Intel® VT-d technologies. This, in turn, allows the VMs to bypass the hypervisor for their networking connectivity by allowing direct access to the underlying adapter hardware (see Figure 1). This approach avoids the overhead of the hypervisor software networking stack, resulting in lower system CPU utilization and higher networking throughput.
Cisco VM-FEX therefore enables the VMs to support higher networking traffic capacity and be more responsive, especially where TCP networking is used and/or the application is CPU-bound.
Cisco VM-FEX uses the Cisco UCS Virtual Interface Card 1240 for hardware connectivity. The Cisco UCS 1240 VIC is a 4 x 10 Gbps-capable networking adapter.
The Cisco VIC can also be used with vSwitch. The VIC is capable of supporting multiple, independent Peripheral Component Interconnect Express (PCIe) devices. The user can create multiple virtual network interface cards (vNICs), which in this case are PCIe devices, and associate them with one or more vSwitches to distribute interrupt load across multiple CPU cores if desired. Cisco VM-FEX, however, attaches these independent PCIe devices (vNICs) directly to the VM, bypassing the hypervisor networking stack.
The Cisco VIC is therefore a uniquely flexible networking adapter that offers both scalability and performance without compromise.
Figure 1. VM Network Connectivity with Cisco VM-FEX
For the purpose of the benchmarking effort presented in this paper, only one vNIC was used for the vSwitch under consideration. Our goal was a simple, straightforward yet fair performance comparison between a VM using a Cisco VM-FEX derived vNIC and a VM using a VMware vSwitch derived vNIC (see Figure 2).
Figure 2. VM Network Connectivity with VMware vSwitch
This paper focuses on networking performance with a single VM configured with a single vCPU and a single virtual NIC (vNIC). In the case of Cisco VM-FEX, the vNIC is an actual hardware PCIe device plumbed directly into the VM. In the case of VMware vSwitch, the vNIC is a virtualized network driver (VMXNet3).
TCP Stream Performance Results
The VMware esxtop tool presents CPU utilization for each individual CPU core, as well as all the cores in the system. Core utilization percent is the percentage of an individual CPU core that is used. A core utilization percent value of less than or equal to 100 percent denotes a single core and 1600 percent denotes all 16 cores.
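As a small illustration of the scale described above, the following Python sketch converts a core utilization percent into a share of total system CPU capacity. The function name and the 16-core default are assumptions based on the test system in this paper; they are not part of esxtop.

```python
def system_share(core_util_percent, num_cores=16):
    """Convert a core utilization percent (100 = one fully used core)
    into a percentage of the total CPU capacity of the system."""
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    return core_util_percent / num_cores


# 1600 percent corresponds to all 16 cores being fully used.
print(system_share(1600.0))

# A value below 100 percent fits within a single core and is therefore
# only a small share of the whole 16-core system.
print(round(system_share(65.60), 2))
```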
Uni-directional TCP Performance
In this test, a single TCP stream is sent from the source VM to the destination VM.
Transmit CPU utilization (TX CPU%) was captured on the hypervisor hosting the source VM and receive CPU utilization (RX CPU%) was captured on the hypervisor hosting the destination VM (see Table 1).
For TCP transmit, Cisco VM-FEX consumes 16 percent less CPU when compared to VMware vSwitch. For the same stream on the receive side, Cisco VM-FEX consumes 30.4 percent less CPU when compared to vSwitch (see Figure 3).
Table 1. Uni-directional TCP Performance
Uni-directional TCP Performance (Single vCPU)-8192B Payload / 9000B MTU
Figure 3. CPU Utilization Difference between VM-FEX and vSwitch for Uni-directional TCP
Figure 3 shows the CPU utilization difference between VM-FEX and vSwitch. For the same amount of network bandwidth, VM-FEX clearly consumes less CPU. This is simply because with VM-FEX, the traffic stream does not have to traverse the hypervisor networking stack on the sender or on the receiver. By avoiding the software overhead while performing direct memory access (DMA) from the VM to the hardware vNIC, VM-FEX can save valuable CPU cycles.
Bi-directional TCP Performance
In this test, TCP streams were sent to and from the source VM simultaneously. Bi-directional TCP traffic requires more CPU than uni-directional TCP traffic.
In this test, both transmit CPU utilization (TX CPU%) and receive CPU utilization (RX CPU%) were captured on the hypervisor hosting the source VM (see Table 2).
The bandwidth metric was captured from within the source VM and verified against the esxtop network view of the hypervisor (Figure 4). The RTT (round trip time) was captured within the source VM (Figure 5).
Table 2. Bi-directional TCP Performance
Bi-directional TCP Performance (Single vCPU)-8192B Payload / 9000B MTU
Avg RTT (usecs)
Figure 4. CPU Utilization Difference between VM-FEX and vSwitch for Bi-directional TCP
Cisco VM-FEX consumes less CPU while delivering higher bandwidth.
From a CPU utilization perspective, Cisco VM-FEX consumes 65.60 percent of a single CPU compared to 81.60 percent consumed by VMware vSwitch for the same workload.
Cisco VM-FEX also delivers up to 10.89 Gbps of bi-directional bandwidth, compared to VMware vSwitch, which delivers up to 7.97 Gbps while consuming more CPU.
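One way to combine the two bi-directional figures quoted above is to express them as bandwidth delivered per percent of a single core. The Python sketch below does this arithmetic using only the numbers stated in the text; the metric itself and the function name are illustrative assumptions, not a measure reported by the test tools.

```python
# Bi-directional figures quoted in the text above.
results = {
    "Cisco VM-FEX": {"gbps": 10.89, "cpu_percent": 65.60},
    "VMware vSwitch": {"gbps": 7.97, "cpu_percent": 81.60},
}


def gbps_per_cpu_percent(gbps, cpu_percent):
    """Bandwidth delivered per percent of one CPU core (illustrative metric)."""
    return gbps / cpu_percent


efficiency = {name: gbps_per_cpu_percent(**r) for name, r in results.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.3f} Gbps per percent of one core")

# Ratio of the two efficiencies (VM-FEX relative to vSwitch).
ratio = efficiency["Cisco VM-FEX"] / efficiency["VMware vSwitch"]
print(f"VM-FEX / vSwitch efficiency ratio: {ratio:.2f}")
```

Under this metric, the quoted figures work out to a ratio of about 1.7 in favor of VM-FEX.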
In addition, as Figure 5 shows, the average RTT for a request/response exchange with an 8192B TCP payload is significantly lower with Cisco VM-FEX than with VMware vSwitch. This is the combined effect of using fewer CPU cycles and bypassing the hypervisor networking stack for traffic flows.
Figure 5. RTT Difference between VM-FEX and vSwitch
TCP Transactional Performance
TCP Latency Performance
Both the source and destination VMs were configured for low latency (see the section
The actual latency numbers were captured inside the source VM. The netperf TCP_RR test was used to derive the latency numbers.
TCP_RR is a request/response test in which the source VM sends a packet to the destination VM and waits to receive the packet back before sending the same packet again. Each send/receive operation is a single TCP_RR transaction.
The netperf tool reports the total number of such send/receive transactions per second. From the number of TCP_RR transactions per second, the latency is derived as follows:
Latency = (1000000 / TCP_RR_transactions) / 2
There are 1000000 microseconds in a second, so dividing 1000000 by the number of TCP_RR transactions per second gives the RTT of a single transaction in microseconds. Dividing that value by 2 gives the one-way latency of a single transaction (Figure 6).
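The latency formula above can be sketched in a few lines of Python. The function names and the example transaction rate are hypothetical and chosen only for illustration; the rate is not a measured result from this paper.

```python
MICROSECONDS_PER_SECOND = 1_000_000


def rtt_usecs(transactions_per_second):
    """Round-trip time of one transaction, in microseconds."""
    if transactions_per_second <= 0:
        raise ValueError("transactions_per_second must be positive")
    return MICROSECONDS_PER_SECOND / transactions_per_second


def one_way_latency_usecs(transactions_per_second):
    """One-way latency: half of the round-trip time, in microseconds."""
    return rtt_usecs(transactions_per_second) / 2


# Hypothetical netperf TCP_RR result of 20000 transactions per second.
example_rate = 20_000
print(rtt_usecs(example_rate))              # 50.0
print(one_way_latency_usecs(example_rate))  # 25.0
```

Note that halving the RTT assumes the two directions contribute equally to the round trip.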
Figure 6. TCP_RR Test
The TCP_RR test reports on multiple send/receive operations over a single persistent TCP connection. Table 3 shows the performance results.
Table 3. Latency Results for VM-FEX and vSwitch
Cisco VM-FEX (usecs)
VMware vSwitch (usecs)
Across the board, Cisco VM-FEX delivers lower latency compared to VMware vSwitch. With packet payloads of 1 byte through 512 bytes, Cisco VM-FEX offers up to 46 percent lower latency (Figure 7).
Figure 7. TCP Latency Comparison between VM-FEX and vSwitch
TCP Connect/Close Performance
TCP connect/close transaction results were derived using the netperf TCP_CC test. This test reports the number of TCP connect/close transactions per second (Figure 8). There are no request/response operations within the connect/close operations.
TCP_CC Latency = (1000000 / Number of TCP_CC transactions) / 2.
Figure 8. TCP_CC Test
The TCP_CC test shows the performance results for TCP connection setup and close (Figure 9).
Figure 9. Connect/Close Performance Comparison between VM-FEX and vSwitch
TCP Connect/Request/Response/Close Performance
TCP Connect/Request/Response/Close results were derived using the netperf TCP_CRR test. This test reports the number of TCP connect/request/response/close transactions per second (Figure 10).
TCP_CRR Latency = (1000000 / Number of TCP_CRR transactions) / 2.
Figure 10. TCP_CRR Test
The TCP_CRR test shows the times for TCP connection setup, request send, response receive, and connection close (Figure 11). This sequence is similar to what happens in a typical HTTP exchange.
Figure 11. TCP Connect/Request/Response/Close Performance Comparison between VM-FEX and vSwitch
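For readers who want to run the three transactional tests described in this section (TCP_RR, TCP_CC, and TCP_CRR), the following Python sketch assembles plausible netperf command lines. The exact commands used in this testing are not given in the paper. The options shown (-H, -t, -l, and the test-specific -r) are standard netperf options, while the host address, duration, payload size, and the helper function name are placeholders chosen for illustration.

```python
TRANSACTIONAL_TESTS = ("TCP_RR", "TCP_CC", "TCP_CRR")


def netperf_command(test, host, duration_secs=30, payload_bytes=None):
    """Build a netperf argument list for one transactional test.

    This is an illustrative helper, not the command used in the paper.
    """
    if test not in TRANSACTIONAL_TESTS:
        raise ValueError(f"unsupported test: {test}")
    cmd = ["netperf", "-H", host, "-t", test, "-l", str(duration_secs)]
    # TCP_CC has no request/response phase, so sizes apply only to
    # TCP_RR and TCP_CRR.
    if payload_bytes is not None and test != "TCP_CC":
        cmd += ["--", "-r", f"{payload_bytes},{payload_bytes}"]
    return cmd


# 192.0.2.10 is a documentation-only placeholder address.
for test in TRANSACTIONAL_TESTS:
    print(" ".join(netperf_command(test, "192.0.2.10", payload_bytes=512)))
```

The helper only builds the argument list; it does not run netperf, so it can be inspected or adapted before use on a real test bed.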
For both vSwitch and VM-FEX, two separate ESX hosts were each configured with a source and destination VM. Both VMs were isolated on a common physical Layer 2 network, as shown in Figures 12 and 13.
Figure 12. Cisco VM-FEX Fabric Topology
With Cisco VM-FEX, both VMs bypass the hypervisor networking stack. However, with VMware vSwitch, traffic between the source and destination VMs has to traverse two sets of networking stacks, which has implications for performance.
Figure 13. VMware vSwitch Fabric Topology
Identical compute hardware was used for both vSwitch and VM-FEX performance testing (see Table 4).
Table 4. Hardware Configuration Used in Testing
Cisco UCS B200 M3
Cisco UCS 1240 Virtual Interface Card
Intel® Xeon E5-2690 @ 2.93 GHz/Core
16GB x 8 1600 MHz DDR3 RAM
Default BIOS configurations were used. By default, the Cisco UCS B200 M3 has Intel VT-x and VT-d extensions enabled. These options are available under the Advanced CPU Configuration section of the BIOS. Enabling these options is a mandatory requirement for VM-FEX functionality.
The default adapter configuration was used.
Table 5 shows the VM configuration settings used in testing.
Table 5. VM Configuration Settings Used in Testing
Number of vCPUs
Number of vNICs
Note that Cisco VM-FEX requires the VMXNet3 guest network device driver. Even though VM-FEX bypasses the hypervisor, it still relies on VMXNet3 to bring up and manage the device, and also during the VMware vMotion VM migration process. Once the VM is active, the device is relinquished by the hypervisor and attached directly to the VM. When the user initiates vMotion, the device is reattached to the hypervisor using VMXNet3, the VM is migrated to the target host, and the device is then detached from the hypervisor and reattached to the VM for direct access.
Low-Latency Configuration for VMware vSwitch
Table 6 shows the low-latency configuration for vSwitch.
Table 6. vSwitch Low-Latency Configuration
VM Guest Settings
Additionally, from within the ESX console, the following command was used to ensure the interrupt coalescing timer for vmnic0 was set to 0 (turned off).
ethtool -C vmnic0 rx-usecs 0
Low-Latency Configuration for Cisco VM-FEX
Table 7 shows the low-latency configuration for Cisco VM-FEX.
Table 7. VM-FEX Low-Latency Configuration
VM Guest Settings
Additionally, the interrupt coalescing timer was set to 0 (turned off) in the VIC Adapter Policy in UCSM. This is the Adapter Policy assigned to the dynamic vNICs created using UCSM.
The 'monitor_control.halt_desched = False' option configures the hypervisor to never de-schedule the VM process. This can result in higher CPU utilization.