
Cisco SFS 7000 Series InfiniBand Server Switches

Parallel Computing and the Message Passing Interface

White Paper

Parallel computing frequently relies on message passing to exchange information between computational units. In high-performance computing, the most common message passing technology is the Message Passing Interface (MPI), for which an open-source implementation is being developed with support from Cisco Systems® and other vendors. This document introduces parallel computing and basic MPI concepts, and describes how the behavior of parallel applications and the communications patterns they use may affect network design.

INTRODUCTION

Traditionally, parallel computing was considered to be at the highest end of computing, addressing the most demanding types of computational problems, such as supercomputer simulations of complex systems. The systems modeled include weather and climate, chemical and nuclear reactions, biological interactions, seismic activity, and aerodynamics. Until recently, these highly specialized applications were considered to be outside the realm of mainstream enterprise computing.
One obstacle to wider adoption of high-performance applications is the expense of supercomputers. The advent of clustering software made possible a new kind of virtual supercomputer, made up of multiple networked, industry-standard computers, which could execute parallel applications at a fraction of the cost of traditional supercomputers. The development of computer clusters linked with high-speed networks has radically changed the economics of high-performance computing (HPC) systems. HPC applications are now attractive to many enterprises, not just large research labs, because these applications can help create competitive advantages by reducing time-to-information and enabling more reliable investment and business decision-making.
The most common form of HPC application centers on message passing: the ability to pass data and instruction messages within an HPC cluster. One message passing technology, MPI, is used extensively within HPC clusters because of its rich functionality, performance, and standardized support for a broad range of supercomputing platforms. This document is intended as a brief introduction to the concepts of and motivation for parallel computing and MPI, and to how an application's use of MPI communications can affect network design.

Parallel Computing

Applications written for serial computation use a single processor that solves a problem using a series of instructions, executed one after the other by the CPU. As an example of a serial computation, consider the way a pocket calculator is used to solve a simple equation:
To calculate r = m x n + p x q
substitute m = 100, n = 200, p = 300, and q = 400:
Time +0, multiply: 100 x 200 = 20,000
Time +1, multiply: 300 x 400 = 120,000
Time +2, add: 20,000 + 120,000 = 140,000
By contrast, parallel computing uses multiple compute resources to execute many instructions simultaneously. Specifically, the computation is split into multiple parts that are then calculated independently on different processors. The results of each part are then combined to provide the final result. Applying parallel computing to this example, one master calculator distributes different parts of the equation to two slave calculators, which then perform their calculations in parallel. Once the slave calculators have computed their results, the master calculator takes the output from each slave calculator and combines them to provide the answer:
To calculate r = m x n + p x q
Calculator 1 calculates m x n = r1 and Calculator 2 calculates p x q = r2; the master calculator then performs the final calculation:
r1 + r2 = r
Time +0, Calculator 1, multiply: 100 x 200 = 20,000 = r1
Time +0, Calculator 2, multiply: 300 x 400 = 120,000 = r2
Time +1, Master Calculator, add: 20,000 + 120,000 = 140,000
This trivial example shows that when computed in parallel, the same computations and final result are achieved faster (i.e., two timesteps vs. three timesteps). Although two or three timesteps may seem inconsequential, applying the same parallel computation techniques to much larger applications can result in the same kind of speedups. "Perfect" speedups, where an application operates N times faster (as compared to serial execution) when running on N processors, are not uncommon, although the exact performance improvement due to parallel techniques is strongly tied to the application and its behavior.
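The calculator example can be sketched in a few lines of Python. This is only an illustration of the structure of the computation, not MPI code: the function names are hypothetical, and the two "slave calculators" are modeled as threads in a standard-library thread pool, which shows how the work is split and recombined but is not expected to give a real speedup for arithmetic this small.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply(a, b):
    """Work performed by one calculator."""
    return a * b


def serial(m, n, p, q):
    """One calculator, three timesteps."""
    r1 = multiply(m, n)  # Time +0
    r2 = multiply(p, q)  # Time +1
    return r1 + r2       # Time +2


def parallel(m, n, p, q):
    """Master plus two calculators, two timesteps."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(multiply, m, n)  # Time +0, Calculator 1
        f2 = pool.submit(multiply, p, q)  # Time +0, Calculator 2
        return f1.result() + f2.result()  # Time +1, master adds


def ideal_speedup(serial_steps, parallel_steps):
    """Ratio of timesteps, ignoring any message passing cost."""
    return serial_steps / parallel_steps


if __name__ == "__main__":
    print(serial(100, 200, 300, 400))
    print(parallel(100, 200, 300, 400))
    print(ideal_speedup(3, 2))
```

Both functions return the same result; only the number of timesteps differs, which is where the ratio of three to two comes from.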
Figure 1 generalizes this example and illustrates how parallel computing provides a significant performance gain in terms of time-to-information. Note that some time is spent passing messages between processors; depending on the frequency of data synchronization between nodes, this time can have a significant effect on application run time.

Figure 1. Serial and Parallel Processing

One factor in determining whether an application can benefit from parallel computing is the extent to which the application can be broken up into discrete pieces of work that can be processed more-or-less independently. If the application cannot be parallelized, then parallel computing will provide no benefit.
Another class of application known as parametric execution can benefit from clustered or parallel computing resources. In operation, a parametric execution runs the same application many times but with different input sets (that is, varying the parameters). Monte Carlo simulations, which calculate the probability of an event based on multiple scenarios with different input criteria, are a common example. Each process computes its result based upon the input data received and returns the result to the master node. Hence, parametric execution achieves the results of many runs by running them simultaneously.
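A parametric execution can be sketched as follows. This is a minimal illustration under stated assumptions rather than a production simulation: the event being estimated (a random point in the unit square landing inside the quarter circle), the function names, and the use of a thread pool in place of cluster nodes are all choices made for this example. Each run receives a different input parameter (here, a random seed) and returns its result to the caller, which plays the role of the master node.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def estimate_probability(seed, samples=20000):
    """One run: estimate the probability that a random point in the
    unit square falls inside the quarter circle of radius 1."""
    rng = random.Random(seed)  # private generator, so runs are independent
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside / samples


def parametric_run(seeds, samples=20000):
    """Run the same application once per seed and collect the results."""
    with ThreadPoolExecutor(max_workers=len(seeds)) as pool:
        results = list(pool.map(lambda s: estimate_probability(s, samples), seeds))
    mean = sum(results) / len(results)
    return mean, results


if __name__ == "__main__":
    mean, results = parametric_run([1, 2, 3, 4])
    print(results, mean)
```

Because the runs share no data, they need no communication until the results are collected, which is why this class of application maps well onto clusters.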

Other Considerations

An important consideration regarding parallelization is whether the application is tightly or loosely coupled with other processes. If the application is tightly coupled, processes need to exchange data at frequent intervals, and communications latency may become a predominant design factor. If the application is loosely coupled, information need not be exchanged as often. As a result, larger data sets may be exchanged, and thus bandwidth may become a predominant factor. Therefore, the degree of coupling directly influences which network technology to use (Gigabit Ethernet, 10 Gigabit Ethernet, or InfiniBand) for exchanging information between processes.
Although raw CPU processing power (millions or billions of floating-point operations per second) is one of the most often discussed performance characteristics within the context of high-performance computing, physical memory constraints are another aspect of computing large data sets. A single computer has a finite amount of memory resources. If the computer is required to process large data sets, performance may be compromised as a result of swapping data between memory and hard disk. For large data sets, parallel computing offers the ability to use the memory of multiple computers within an HPC cluster. This improves the efficiency of the computation by reducing the data set size on each node, and consequently reduces the overall CPU and I/O overhead associated with memory paging.
The development of cost-effective, scalable, multiprocessor computer architectures that combine high-performance networks (such as InfiniBand, Gigabit Ethernet, and 10 Gigabit Ethernet) with message-passing standards indicates that cluster computing may be the future of high-performance computing.
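The trade-off between latency and bandwidth can be made concrete with a simple cost model. The sketch below assumes that the time to deliver one message is a fixed latency plus the message size divided by the bandwidth; this linear model, the function names, and the sample figures (10 microseconds and 1 GB/s) are illustrative assumptions, not measurements of any particular network technology.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed model: fixed latency plus size divided by bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def latency_fraction(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Share of the transfer time that is due to latency alone."""
    total = transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


if __name__ == "__main__":
    LATENCY = 10e-6   # hypothetical 10 microsecond latency
    BANDWIDTH = 1e9   # hypothetical 1 GB/s link

    # Tightly coupled: many small messages, so latency dominates.
    print(latency_fraction(64, LATENCY, BANDWIDTH))
    # Loosely coupled: few large messages, so bandwidth dominates.
    print(latency_fraction(10_000_000, LATENCY, BANDWIDTH))
```

Under these assumed figures, nearly all of the time for a 64-byte message is latency, whereas for a 10 MB message latency is a negligible share of the total.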

Parallel Computing and Networking

Parallel computing is most often associated with high-performance computing, where multiple computers (sometimes thousands) are used to run a single application. The most common paradigm for parallel computing is the use of message passing to exchange information between processes.
As discussed earlier, when messages are passed between nodes (Figure 1), some time is spent transmitting these messages, and depending on the frequency of the data synchronization between processes, that factor can have a significant effect on total application run time. It is critically important to understand how the application works with respect to interprocess communications patterns and the frequency of updates, because these affect the performance and design of the parallel application, the design of the interconnecting network, and the choice of network technology.
With traditional transport protocols such as TCP/IP, the CPU is responsible both for managing how data is moved between I/O and memory and for transport protocol processing. The effect of this is that time spent communicating between nodes is time not spent on processing the application. Therefore, minimizing communications time is a key consideration for certain classes of applications.
Message latency is the total time taken to send a message that has a zero-length payload from one process to another. This includes all latencies incurred through direct memory access (DMA) copies within the node, network interface card (NIC) latencies, transmission protocol latencies, and network switching delays. Protocol-stack offload techniques, such as TCP offload engines (TOE) for Gigabit Ethernet and 10 Gigabit Ethernet and remote direct memory access (RDMA) for InfiniBand, can be used to offload data movement processing from the CPU to dedicated network hardware. Such techniques generally allow faster message passing, lower total latencies, and leave more of the node's main CPU cycles for application processing, thereby increasing processing efficiency.

MPI

MPI is an industry standard consisting of two documents, MPI-1 and MPI-2, that specify a library of functions to enable the passing of messages between nodes within a parallel computing environment. Although the MPI standards specify over 300 functions, a large class of applications can perform all required message passing using fewer than 10 common MPI functions.
MPI is "middleware" software that sits between the application and the network hardware. It provides a portable mechanism to enable messages to be exchanged between processes regardless of the underlying network or parallel computational environment. As such, implementations of the MPI standard use underlying communications stacks such as TCP or UDP over IP, InfiniBand, or Myrinet to communicate between processes.
MPI offers a rich set of functions that can be combined in simple or complex ways to solve any type of parallel computation. The ability to exchange messages enables instructions or data to be passed between nodes to distribute data sets for calculation. MPI has been implemented on a wide variety of platforms, operating systems, and cluster and supercomputer architectures. Because of its rich functionality and flexibility, MPI has become the de facto standard within HPC for building message passing applications.
There are many software implementations of the MPI standard; some are derived from open-source projects while others are commercial products. MPICH, from Argonne National Laboratory, was one of the first open-source implementations and was therefore considered to be a reference implementation. Many projects were forked from the original MPICH code base, such as MVAPICH from The Ohio State University, which was specifically targeted at InfiniBand networks. Every HPC vendor includes some implementation with their solutions. Cisco® has joined the Open MPI project (http://www.open-mpi.org), which is a next-generation implementation of the MPI standard, optimized for both Ethernet and InfiniBand networks.

MPI Basics: Point-to-Point Communications

MPI provides point-to-point communications between two processes: one sender and one receiver. Each message consists of an MPI signature and a body. The signature, analogous to a packet header, includes a source and destination process, a communicator field, and a tag for classifying messages. The communicator field specifies a group of processes and a unique context in which both the source and destination processes are members, and the tag further classifies messages within the single communicator. MPI guarantees to deliver messages of the same signature in order. MPI messages are typed to support both contiguous and noncontiguous data in a consistent fashion and resolve data format issues, such as 64-bit and 32-bit representations and big-endian and little-endian format differences.
The simplest form of sending in MPI is the blocking mode send. The MPI_SEND function is used by the sender to specify a buffer pointing to the body of the message and its corresponding signature. MPI_SEND returns when it is safe for the application to reuse the message buffer. This does not necessarily imply that the message has been sent or received, however, only that the buffer is available for reuse. For example, MPI or the underlying network may have copied the message to an internal buffer, or the message may have actually been sent. A synchronous mode send can be used when guarantees about the receiver are required; a synchronous send does not return until the receiver has posted a matching receive and started to receive the message.
A blocking receive is performed by calling the MPI_RECV function. MPI_RECV blocks the application until a matching message is received into the buffer. The receiving process specifies a particular message signature (source, destination, communicator, and tag) that is compared with incoming messages. If there is a match, that message is then received into the target buffer. The receiver may specify wildcard values for the source and/or tag portions of the message signature. See Figure 2.
MPI also supports nonblocking modes of send and receive operations. Using nonblocking modes, the application starts the communication and later polls MPI for completion. This separation allows MPI to potentially make progress "in the background," even while the application is off doing other work, and so improves efficiency by overlapping communication and computation. Not all MPI implementations are capable of making true background progress (known as asynchronous progress), but most are capable of at least some degree of progress while not in MPI functions.
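The matching rule described above can be illustrated with a small simulation. This sketch does not call an MPI library: the Message class, the queue of pending messages, and the helper names are hypothetical, and the constants ANY_SOURCE and ANY_TAG are stand-ins for the MPI_ANY_SOURCE and MPI_ANY_TAG wildcards defined by the standard. A real MPI_RECV would block until a match arrives, whereas this sketch returns None when nothing matches.

```python
from dataclasses import dataclass

ANY_SOURCE = -1  # stand-in for MPI_ANY_SOURCE (real value is implementation-defined)
ANY_TAG = -1     # stand-in for MPI_ANY_TAG


@dataclass
class Message:
    source: int   # rank of the sending process
    dest: int     # rank of the receiving process
    comm: str     # communicator name (hypothetical representation)
    tag: int      # user-chosen classification
    body: object  # the payload


def matches(msg, dest, comm, source=ANY_SOURCE, tag=ANY_TAG):
    """True if msg's signature matches the receive request."""
    return (msg.dest == dest
            and msg.comm == comm
            and source in (ANY_SOURCE, msg.source)
            and tag in (ANY_TAG, msg.tag))


def receive(pending, dest, comm, source=ANY_SOURCE, tag=ANY_TAG):
    """Remove and return the earliest matching message, or None."""
    for i, msg in enumerate(pending):
        if matches(msg, dest, comm, source, tag):
            del pending[i]
            return msg
    return None


if __name__ == "__main__":
    pending = [Message(1, 0, "world", 7, "a"),
               Message(2, 0, "world", 9, "b"),
               Message(1, 0, "world", 7, "c")]
    print(receive(pending, 0, "world", source=2).body)
    print(receive(pending, 0, "world", tag=7).body)
```

Scanning the pending list from the front means that two messages with the same signature are always received in the order in which they arrived, which mirrors the ordering guarantee described for point-to-point messages.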
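The start-then-poll pattern can be sketched without an MPI library. In the example below, a standard-library thread stands in for the MPI implementation making progress in the background, and the Future object plays the role that a request handle plays in MPI (where the corresponding calls would be functions such as MPI_ISEND and MPI_TEST). The function names and the artificial delay are assumptions made for illustration only.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_transfer(payload, delay_s=0.05):
    """Stand-in for a message in flight on the network."""
    time.sleep(delay_s)
    return payload


def overlap(payload, work_items):
    """Start a transfer, compute while it is in flight, then poll."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        request = pool.submit(slow_transfer, payload)  # start the communication
        computed = sum(x * x for x in work_items)      # useful work meanwhile
        while not request.done():                      # poll for completion
            time.sleep(0.001)
        return request.result(), computed


if __name__ == "__main__":
    received, computed = overlap([1, 2, 3], range(10))
    print(received, computed)
```

The benefit depends on how much computation is available to hide the transfer; if there is none, the nonblocking form simply degenerates into waiting.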

Figure 2. MPI Point-to-Point Send and Receive

MPI Basics: Collective Communications

MPI also supports the exchange of data with arbitrary subgroups of processes using communicators and collective operations.
Communicators make it possible for parallel applications to be logically segmented into subgroups of processes that may be used to enable local data processing. An additional benefit of using communicators is that they can be used to segment a task, which can reduce the complexity and improve the readability of a program and provide fault separation. Note that processes can belong to multiple communicators simultaneously, as shown in Figure 3. The process on the first node in each node column in Figure 3 is shown in two communicators: Alpha and its respective column communicator.
MPI provides native collective operations for different communications patterns (broadcast, scatter, gather, and so on) that can be used to communicate messages between members of a communicator. The use of collectives hides the complexity of different communications models from the programmer, which makes programming simpler and allows the MPI implementation to use the most efficient algorithms for a particular communication pattern. Although collective communications send data to processes within a specific communicator, collectives do not use the tag mechanism for classifying messages. Instead, applications that use collective operations must take care that they execute in the same order in all processes involved in the computation.
The concept of communicators can be extended to support hierarchies that allow processing to be localized in specific subgroups of nodes to increase efficiencies for network resource usage (Figure 3). In addition, this approach can reduce interprocess latency. Understanding which communicator groups and collectives are being used is important when designing HPC networks, because different communicator groups and collectives can affect messaging traffic patterns. Understanding the different traffic patterns may allow for some level of oversubscription in the core-facing connections without reducing performance.

Figure 3. Hierarchical MPI Collectives

By using hierarchical communicators as depicted, the bandwidth required can be reduced significantly. The principle of this design is that the subgroup communicators "Beta" through "Epsilon" each use local broadcast communications, which limits the network distance that a broadcast needs to traverse and significantly reduces network overhead (Figure 3). The local result can then be shared between the subgroups using a broadcast collective within the "Alpha" communicator group.
The use of this type of agglomeration is dependent on the application being used and the problem being solved. In some cases, the application may allow localization of data and, consequently, greater oversubscription in core links than other applications would allow.
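The saving can be estimated by counting how many copies of a broadcast message cross the core-facing links. The sketch below is a rough model under stated assumptions: the subgroup sizes are hypothetical, the subgroup names between "Beta" and "Epsilon" are assumed, and the flat case assumes a naive broadcast in which the root sends one copy to every other process. Real MPI implementations typically use more efficient algorithms, so the flat figure should be read as a worst case.

```python
def flat_core_crossings(groups, root_group):
    """Naive broadcast: one copy per process outside the root's subgroup."""
    return sum(size for name, size in groups.items() if name != root_group)


def hierarchical_core_crossings(groups, root_group):
    """One copy per remote subgroup; each subgroup then rebroadcasts locally."""
    return sum(1 for name in groups if name != root_group)


if __name__ == "__main__":
    # Hypothetical cluster: four subgroups of eight processes each.
    groups = {"Beta": 8, "Gamma": 8, "Delta": 8, "Epsilon": 8}
    print(flat_core_crossings(groups, "Beta"))
    print(hierarchical_core_crossings(groups, "Beta"))
```

In this model the number of core crossings depends only on the number of subgroups, not on their size, which is the property that makes some oversubscription of core-facing links tolerable.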
An interesting feature of the MPI-1 standard, which takes advantage of agglomeration, is the concept of topology-aware communications groups, in which communicators can be formed based on topological affinities. This capability can be used to enable optimal communications and scheduling based on network location. For example, some MPI implementations will detect the network topology shown in Figure 3 and will automatically structure collective operations in a hierarchical fashion, making it unnecessary for the user to do so.

Collective Communication Patterns

MPI collectives support different operations (broadcast, reduce, scatter, and gather) to send and receive data from other processes within the communicator.
The MPI broadcast operation (Figure 4) enables a root process to send a message to all processes in the communicator. The MPI scatter operation (Figure 5) provides one-to-all communications, where the end result is that each process ends up with a portion of the original message.

Figure 4. MPI Broadcast

Figure 5. MPI Scatter

The MPI gather operation (Figure 6) is the reverse of MPI scatter; it provides all-to-one communications in which each process in the communicator sends the contents of its send buffer to the root process, where the contents are stored in rank order. A variation of gather is the all-gather routine, which gathers the contents of all the processes within the communicator so the results are resident in every process (Figure 7).

Figure 6. MPI Gather

Figure 7. MPI All Gather

Another class of collective MPI operations is reductions, which allow the combination of data from each process (Figure 8). The final result is stored at the root process. MPI defines several built-in operations (global sum, global product, bitwise AND, bitwise OR, and so on), but also allows for user-defined operations.
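The data movement performed by each collective can be summarized with plain Python lists, where the element at index r represents what the process of rank r holds. This is a sketch of the semantics only, not of how an MPI library implements them; the function names are hypothetical, and scatter assumes the data divides evenly among the processes.

```python
import operator
from functools import reduce as fold


def broadcast(value, size):
    """Root's value ends up at every rank."""
    return [value for _ in range(size)]


def scatter(data, size):
    """Each rank receives an equal, contiguous portion of root's data."""
    if len(data) % size != 0:
        raise ValueError("sketch assumes data divides evenly")
    chunk = len(data) // size
    return [data[r * chunk:(r + 1) * chunk] for r in range(size)]


def gather(per_rank):
    """Root receives every rank's buffer, stored in rank order."""
    return [item for buf in per_rank for item in buf]


def allgather(per_rank):
    """Like gather, but every rank ends up with the full result."""
    combined = gather(per_rank)
    return [list(combined) for _ in per_rank]


def reduce_to_root(per_rank_values, op=operator.add):
    """Combine one value per rank with op; the result is held at root."""
    return fold(op, per_rank_values)


if __name__ == "__main__":
    parts = scatter([1, 2, 3, 4, 5, 6], 3)
    print(parts)
    print(gather(parts))
    print(reduce_to_root([1, 2, 3, 4]))
```

Passing a different binary function as op corresponds to choosing another built-in or user-defined reduction operation.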

Figure 8. MPI Reduce

The example depicted in Figure 9 describes how a hypothetical MPI application works with respect to calculating the results of a data set. Within the MPI communicator, each process is assigned a rank. In this example, a "master" process on node 0 is assigned rank 0. Using the pseudocode depicted, the processes first discover the rank that they have within the communicator, and from that rank they can determine the subset of data from the file a() that they will be processing. The master process in this instance has divided the data into three equal parts and is using MPI scatter. The file in this instance may be downloaded from an external Network File System (NFS) or Parallel File System (PFS) server, or from the master node.

Figure 9. Conceptual Parallel Processing Model

Each node then processes the assigned data sets and produces a result set, which is returned to the master node with an MPI gather. The ensuing result file is then written to memory within the master node.
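The model in Figure 9 can be approximated in a few lines. Because the figure's pseudocode is not reproduced here, the sketch below is an assumption about its general shape rather than a transcription: the function names and the per-element work (squaring) are hypothetical, three processes are simulated in a loop, and each simulated process slices a shared list where a real program would receive only its own portion through MPI scatter.

```python
def my_slice(rank, size, total):
    """Index range owned by a rank, assuming total divides evenly by size."""
    chunk = total // size
    start = rank * chunk
    return start, start + chunk


def square(x):
    """Hypothetical per-element computation."""
    return x * x


def rank_program(rank, size, a, work=square):
    """What each process does once it has discovered its rank."""
    start, stop = my_slice(rank, size, len(a))
    return [work(x) for x in a[start:stop]]


def run_model(a, size=3, work=square):
    """Simulate all ranks, then gather the partial results at rank 0."""
    partial = [rank_program(rank, size, a, work) for rank in range(size)]
    return [y for part in partial for y in part]  # rank order, as in gather


if __name__ == "__main__":
    print(my_slice(1, 3, 6))
    print(run_model([1, 2, 3, 4, 5, 6]))
```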

MPI and Fault Tolerance

In high-performance computing, application run times can be measured in many hours, days, or even weeks. Because HPC clusters are constructed using large numbers of processors, the likelihood of a failure grows with the number of nodes in the cluster. To address this challenge, various fault-tolerant techniques are now being incorporated into mainstream MPI implementations.
Although the MPI-2 standard allows processes to be dynamically added to or removed from a cluster while the application is running, few applications are able to make use of this capability. Many MPI applications instead protect themselves with strategies such as checkpointing: at defined points the application saves its data so that, in the event of a node failure, the computation can be recovered from the most recent checkpoint. Note that checkpointing places demands on storage and on access to storage, so both the amount of data being saved and the bandwidth required to save it need to be considered.
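Application-level checkpointing can be sketched as follows. This is a minimal single-process illustration, not a technique taken from any particular MPI implementation: the state layout, the file format (JSON), the checkpoint interval, and the fail_at parameter used to simulate a node failure are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temporary file, then rename, so a crash in the middle
    of a save never leaves a partially written checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, steps, checkpoint_every=10, fail_at=None):
    """Sum the step numbers, resuming from the last checkpoint if any."""
    state = load_checkpoint(path)
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.json")
        try:
            run(path, 100, fail_at=57)
        except RuntimeError:
            pass
        print(load_checkpoint(path)["step"])  # work after step 50 was lost
        print(run(path, 100))                 # resumes and completes
```

The checkpoint interval is the trade-off discussed above: checkpointing more often loses less work on failure but writes more data to storage.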
While no "silver bullet" solution is available yet that solves all reliability problems in all scenarios, research in large scale parallel fault tolerance is an active field of study. This indirectly highlights one of the advantages of the HPC community: academic and industry efforts are closely tied together. As such, the research community directly feeds industry with the latest techniques and cutting-edge algorithms, which are then turned into stable products that benefit the entire HPC ecosystem.

MPI Availability and Certification

MPI was jointly developed as a standard message passing interface by a consortium of academic and industry computer scientists to meet the requirements for heterogeneous compute environments and is widely available from a variety of sources. As enterprises adopt high-performance computing, certification of hardware drivers with specific MPI stacks is of particular relevance. Cisco Systems has certified the InfiniBand host channel adapters (HCAs) with several HPC applications and MPI variants, details of which are available at http://www.cisco.com/go/hpc.
The number of MPI implementations that are available has created problems for broad HPC adoption. As MPI implementations have proliferated, different versions of MPI have become less likely to interoperate, and they may exhibit different performance characteristics. Another problem is that features available in one version may not be available in another, or may not be supported on a particular operating system version, or may not support the underlying hardware.
This fragmentation has also made application software vendor certification problematic as different operating systems, hardware, and other aspects need to be qualified, often without any support available to assist users. This can make deploying an HPC application a daunting prospect, because selecting an MPI implementation may require trade-offs between what functions the application requires and what the underlying hardware and operating system support.
These problems are widely recognized as a barrier to broad adoption of high-performance computing. To address these problems, the Open MPI Project (http://www.open-mpi.org) was formed to provide a consolidated, production-quality, open-source implementation of MPI to simplify adoption and deployment. Recognizing that different users want different levels of flexibility, Open MPI provides comprehensive tuning and extensibility features for those who want to customize MPI to their requirements, but also provides a reasonable set of defaults for users who want dependable basic functions. Open MPI is fundamentally based on a plug-in architecture, and the flexibility that this offers is extremely valuable: it allows users to adapt to whatever back-end systems are required (schedulers, compilers, applications, network fabrics, and so on) to meet their particular needs.
Although Open MPI had broad research and academic involvement, it lacked commercial services and support. Cisco Systems became the first commercial vendor to join the Open MPI Project to promote and develop a commercial-quality MPI implementation and is providing development, testing, and support resources. Cisco also made commitments to develop comprehensive services and support for enterprises that adopt Open MPI. Because of the value that Open MPI offers in terms of consolidation and support, other industry vendors have since joined the Open MPI Project.

SUMMARY

The MPI API is a standard that enables the construction of parallel applications in a flexible and portable fashion. MPI enables the description of different communication patterns that enable the most efficient usage of resources, from an interprocess communications perspective, to optimize the performance of an HPC cluster.
Some MPI implementations have a native ability to use RDMA hardware, such as InfiniBand HCAs or Ethernet RDMA-aware NICs, to make the most efficient use of compute resources. These implementations also provide a small-footprint, high-performance, low-latency access mechanism to compute resources. These attributes have led to the adoption of MPI as the de facto standard for most HPC environments.
MPI is now widely used for enterprise-class applications across many markets, including weapons development, oil and energy, life sciences, financial markets, manufacturing, high tech, and digital media. Cisco Systems provides a suite of HPC solutions for Ethernet and InfiniBand using Cisco Catalyst® multilayer switching and Cisco SFS 7000 Series InfiniBand Server Switches. Cisco Systems' InfiniBand product line includes InfiniBand HCAs that have been certified with many commercially available applications and MPI stacks.