Message Passing Interface


The Message Passing Interface (MPI) is a language-independent communications API for parallel programs, with defined semantics and flexible implementations; it does not define the protocol by which its operations are performed, in the sense that sockets do for TCP/IP or other layer-4-and-below models in the ISO/OSI Reference Model. It is consequently a layer-5-and-above set of interfaces, although implementations can cover most layers of the reference model, with sockets over TCP/IP as a common transport used inside the implementation. MPI's dual goals are high performance (scalability) and high portability. Programmer productivity is not one of MPI's key goals, and the interface is generally considered low-level: it expresses parallelism explicitly rather than implicitly. MPI is considered successful in achieving high performance and high portability, and is often criticized for its low-level qualities, but there is at present no effective replacement, so it remains a crucial part of parallel programming. MPI is not sanctioned by any major standards body, but nonetheless has worldwide practical acceptance.

MPI is a de facto standard for communication among the processes that model a parallel program on a distributed memory system. These programs are often mapped to clusters, to actual distributed memory supercomputers, and to other environments. However, the principal MPI-1 model has no shared memory concept, and MPI-2 has only a limited distributed shared memory concept, used in one portion of that set of extensions.

Most MPI implementations consist of a specific set of routines (an API) callable from Fortran, C, or C++ and, by extension, from any language capable of interfacing with such routine libraries. The advantages of MPI over older message passing libraries are portability (because MPI has been implemented for almost every distributed memory architecture) and speed (because each implementation is in principle optimized for the hardware on which it runs). MPI is also supported on shared-memory and NUMA (Non-Uniform Memory Access) architectures, where it serves both as an important portability layer and as a way to achieve high performance in applications that are naturally owner-computes oriented.

MPI is a specification, not an implementation. MPI has a Language Independent Specification (LIS) for its function calls, plus language bindings. The first MPI standard specified ANSI C and Fortran-77 bindings together with the LIS. The draft of this standard was presented at the Supercomputing '93 conference (November 1993) and finalized soon thereafter. About 128 functions comprise the MPI-1.2 standard as it is now defined.

Two versions of the standard are currently popular: version 1.2, which emphasizes message passing and has a static runtime environment (a fixed-size world), and MPI-2.1, which adds features such as scalable file I/O, dynamic process management, collective communication involving two groups of processes, and C++ language bindings. MPI-2's LIS specifies over 500 functions and provides language bindings for ANSI C, ANSI Fortran (Fortran 90), and ANSI C++. Interoperability of objects defined in MPI was also added to allow easier mixed-language message passing programming. A side effect of MPI-2 standardization (completed in 1996) was clarification of the MPI-1 standard, creating the MPI-1.2 level.

It is important to note that MPI-1.2 programs, now deemed "legacy MPI-1 programs," still work under the MPI-2 standard although some functions have been deprecated. This is important since many older programs use only the MPI-1 subset.

MPI is often compared with PVM, which was a popular distributed environment and message passing system developed in 1989, and which was one of the systems that motivated the need for standard parallel message passing systems. Most computer science students who study parallel programming are taught both Pthreads and MPI programming as complementary programming models.

Functionality overview

The MPI interface is meant to provide essential virtual topology, synchronization, and communication functionality between a set of processes (that have been mapped to nodes/servers/computer instances) in a language-independent way, with language-specific syntax (bindings), plus a few features that are language-specific. MPI programs always work with processes, although people commonly talk about processors. For maximum performance, one process is typically assigned per processor (or, more recently, per core) as part of the mapping activity; this mapping happens at runtime, through the agent that starts the MPI program, normally called mpirun or mpiexec.
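
For illustration, launching four processes of a compiled MPI program might look like the following; the exact launcher name and flags vary by implementation (mpiexec -n is the form recommended by MPI-2, while many implementations also accept mpirun -np), and my_mpi_program is only a placeholder name:

mpiexec -n 4 ./my_mpi_program
mpirun -np 4 ./my_mpi_program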

Such functions include, but are not limited to, point-to-point rendezvous-type send/receive operations, choosing between a Cartesian or graph-like logical process topology, exchanging data between process pairs (send/receive operations), combining partial results of computations (gather and reduce operations), synchronizing nodes (barrier operation), as well as obtaining network-related information such as the number of processes in the computing session, the identity of the processor to which a process is mapped, neighboring processes accessible in a logical topology, and so on. Point-to-point operations come in synchronous, asynchronous, buffered, and ready forms, to allow both relatively stronger and weaker semantics for the synchronization aspects of a rendezvous-send. Many outstanding operations are possible in asynchronous mode in most implementations.
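
The following sketch, which is only illustrative and not taken from the standard, shows a typical collective pattern: each process contributes a partial value, MPI_Reduce sums the contributions onto rank 0, and MPI_Barrier then synchronizes all processes in the communicator.

/* Sketch: summing one value per process with MPI_Reduce, then synchronizing
   with MPI_Barrier.  Illustrative only; error handling omitted. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int rank, size;
  double partial, total = 0.0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  partial = (double) rank;                 /* each process contributes its own partial result */
  MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD); /* sum arrives on rank 0 */
  MPI_Barrier(MPI_COMM_WORLD);             /* synchronize all processes in the communicator */

  if (rank == 0)
    printf("Sum of ranks 0..%d is %g\n", size - 1, total);

  MPI_Finalize();
  return 0;
}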

MPI guarantees that asynchronous messages make progress independent of subsequent MPI calls made by user processes (threads). This rule is often neglected in practical implementations, but it is an important underlying principle when one considers using asynchronous operations. The relative value of overlapping communication and computation, asynchronous versus synchronous transfers, and low-latency versus low-overhead communication remain important controversies in the MPI user and implementer communities, although recent advances in multi-core architecture are likely to revive such debates. MPI-1 and MPI-2 both enable implementations that overlap communication and computation well, but practice and theory differ. MPI also specifies thread-safe interfaces, with cohesion and coupling strategies that help avoid manipulating unsafe hidden state within the interface. As such, it is relatively easy to write multithreaded point-to-point MPI code, and some implementations support such code. Multithreaded collective communication is best accomplished by using multiple copies of communicators, as described below.
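
As a hedged illustration of asynchronous point-to-point operations (assuming at least two processes; not taken from the standard text), the following program posts nonblocking MPI_Irecv/MPI_Isend calls, performs unrelated local computation while the transfers may progress, and completes both requests with MPI_Waitall before using the received data.

/* Sketch: overlapping communication and computation with nonblocking
   (asynchronous) point-to-point operations.  Illustrative only. */
#include <mpi.h>
#include <stdio.h>

#define N 1000

int main(int argc, char *argv[])
{
  int rank, size, other, i;
  double sendbuf[N], recvbuf[N], local_work = 0.0;
  MPI_Request reqs[2];
  MPI_Status stats[2];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  for (i = 0; i < N; i++)
    sendbuf[i] = rank + i;

  if (size >= 2 && rank < 2) {
    other = 1 - rank;                      /* ranks 0 and 1 exchange buffers */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[1]);

    for (i = 0; i < N; i++)                /* unrelated local work, overlapped with the transfers */
      local_work += i * 0.001;

    MPI_Waitall(2, reqs, stats);           /* both requests must complete before recvbuf is used */
    printf("%d: first received value %g, local work %g\n", rank, recvbuf[0], local_work);
  }

  MPI_Finalize();
  return 0;
}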

Concept Overview

Concepts of MPI-1 and MPI-2 are given next.

Communicator overview

Although MPI has many functions, a few concepts are especially important; taken a few at a time, they help people learn MPI quickly and decide what functionality to use in their application programs.

Communicators are groups of processes in the MPI session; each communicator gives its processes a rank order and its own virtual communication fabric for point-to-point operations, as well as an independent address space for collective communication. MPI also has explicit groups, but these are mainly useful for organizing and reorganizing subsets of processes before a new communicator is made. MPI supports single-group intracommunicator operations and bi-partite (two-group) intercommunicator communication. In MPI-1, single-group operations are most prevalent; bi-partite operations find their biggest role in MPI-2, where their use is expanded to include collective communication and dynamic process management.

Communicators can be partitioned using several MPI commands. These include MPI_COMM_SPLIT, a graph-coloring-type operation that is commonly used to derive topological and other logical subgroupings efficiently.
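
As an illustrative sketch (not from the standard text), the following program uses MPI_Comm_split to partition MPI_COMM_WORLD into two subcommunicators by the parity of each world rank; the color argument selects the subgroup and the key argument orders ranks within it.

/* Sketch: partitioning MPI_COMM_WORLD with MPI_Comm_split.
   Processes with the same color end up in the same subcommunicator. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int world_rank, color, sub_rank, sub_size;
  MPI_Comm subcomm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  color = world_rank % 2;                  /* even ranks form one group, odd ranks the other */
  MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &subcomm);

  MPI_Comm_rank(subcomm, &sub_rank);
  MPI_Comm_size(subcomm, &sub_size);
  printf("world rank %d -> color %d, rank %d of %d in subcommunicator\n",
         world_rank, color, sub_rank, sub_size);

  MPI_Comm_free(&subcomm);                 /* release the derived communicator */
  MPI_Finalize();
  return 0;
}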

Point-to-point Basics

This section needs to be developed.

Collective Basics

This section needs to be developed.

One-sided Communication (MPI-2)

This section needs to be developed.

Collective Extensions (MPI-2)

This section needs to be developed.

Dynamic Process Management (MPI-2)

This section needs to be developed.

MPI I/O (MPI-2)

This section needs to be developed.

Miscellaneous Improvements of MPI-2

This section needs to be developed.

Guidelines for Writing Multithreaded MPI-1 and MPI-2 programs

This section needs to be developed.

Implementations

'Classical' Cluster and Supercomputer Implementations

The implementation language for MPI is different in general from the language or languages it seeks to support at runtime. Most MPI implementations are done in a combination of C, C++ and assembly language, and target C, C++, and Fortran programmers. However, the implementation language and the end-user language are in principle always decoupled.

The initial implementation of the MPI 1.x standard was MPICH, from Argonne National Laboratory (correctly pronounced M-P-I-C-H, not as a single syllable) and Mississippi State University. IBM was also an early implementor of the MPI standard, and most supercomputer companies of the early 1990s either commercialized MPICH or built their own implementation of the MPI 1.x standard. LAM/MPI from the Ohio Supercomputer Center was another early open implementation. Argonne National Laboratory has continued developing MPICH for over a decade, and now offers MPICH 2, an implementation of the MPI-2.1 standard. LAM/MPI and a number of other MPI efforts recently merged to form a new worldwide project, the Open MPI implementation, though this name does not imply any connection with a special form of the standard. There are many other efforts derived from MPICH, LAM, and other works, too numerous to name here. Recently, Microsoft added an MPI effort to their Cluster Computing Kit (2005), based on MPICH 2. MPI has become, and remains, a vital interface for concurrent programming.

Many Linux distributions include MPI (MPICH and/or LAM, as particular examples), but it is best to get the newest versions from the MPI developers' sites. Many vendors offer specialized open source versions of MPICH, LAM, and/or Open MPI that provide better performance and stability.

Besides the mainstream of MPI programming for high performance, MPI has been used widely with Python, Perl, and Java, and these communities are growing. MATLAB-based MPI implementations appear in many forms, but no consensus on a single way of using MPI with MATLAB yet exists. The next sections detail some of these efforts.

There are at least five known attempts to implement MPI for Python: mpi4py, PyPar, PyMPI, MYMPI, and the MPI submodule of ScientificPython. PyMPI is notable because it is a variant Python interpreter, making the multi-node application the interpreter itself rather than code that the interpreter runs. PyMPI implements most of the MPI specification and automatically works with compiled code that needs to make MPI calls. PyPar, MYMPI, and ScientificPython's module are all designed to work like a typical module, used with nothing but an import statement; they leave it to the coder to decide when and where the call to MPI_Init belongs.

The OCamlMPI module implements a large subset of MPI functions and is in active use in scientific computing. To give a sense of its maturity, it was reported on caml-list that an eleven-thousand-line OCaml program was "MPI-ified" using the module, with an additional 500 lines of code and slight restructuring, and ran with excellent results on up to 170 nodes of a supercomputer.

Although Java does not have an official MPI binding, there have been several attempts to bridge Java with MPI, with varying degrees of success and compatibility. One of the first attempts was Bryan Carpenter's mpiJava, essentially a collection of JNI wrappers around a local C MPI library, resulting in a hybrid implementation with limited portability that also has to be recompiled against the specific MPI library being used.

However, this original project also defined the mpiJava API (a de facto MPI API for Java that follows the equivalent C++ bindings closely), which subsequent Java MPI projects adopted. A less widely used alternative is the MPJ API, designed to be more object-oriented and closer to Sun Microsystems' coding conventions. Beyond the API used, Java MPI libraries either depend on a local MPI library or implement the message passing functions in Java, while some, such as P2P-MPI, also provide peer-to-peer functionality and allow mixed-platform operation (e.g. mixed Linux and Windows clusters).

Some of the most challenging parts of any MPI implementation for Java arise from the language's own limitations and peculiarities, such as the lack of proper pointers and of a linear memory address space for its objects, which make transferring multi-dimensional arrays and complex objects inefficient. The usual workarounds involve transferring one line at a time and/or performing explicit de-serialization and casting at both the sending and receiving ends, simulating C- or Fortran-like arrays with a one-dimensional array, and simulating pointers to primitive types with single-element arrays, resulting in programming styles quite foreign to Java's conventions.


Example program

Here is "Hello World" in MPI written in C. In this example, we send a "hello" message to each processor, manipulate it trivially, send the results back to the main process, and print the messages out.

/*
 "Hello World" Type MPI Test Program
*/
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define BUFSIZE 128
#define TAG 0

int main(int argc, char *argv[])
{
  char idstr[32];
  char buff[BUFSIZE];
  int numprocs;
  int myid;
  int i;
  MPI_Status stat; 
 
  MPI_Init(&argc,&argv); /* all MPI programs start with MPI_Init; all 'N' processes exist thereafter */
  MPI_Comm_size(MPI_COMM_WORLD,&numprocs); /* find out how big the SPMD world is */
  MPI_Comm_rank(MPI_COMM_WORLD,&myid); /* and find out this process's rank */

  /* At this point, all the programs are running equivalently, the rank is used to
     distinguish the roles of the programs in the SPMD model, with rank 0 often used
     specially... */
  if(myid == 0)
  {
    printf("%d: We have %d processors\n", myid, numprocs);
    for(i=1;i<numprocs;i++)
    {
      sprintf(buff, "Hello %d! ", i);
      MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
    }
    for(i=1;i<numprocs;i++)
    {
      MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
      printf("%d: %s\n", myid, buff);
    }
  }
  else
  {
    /* receive from rank 0: */
    MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
    sprintf(idstr, "Processor %d ", myid);
    strcat(buff, idstr);
    strcat(buff, "reporting for duty\n");
    /* send to rank 0: */
    MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
  }

  MPI_Finalize(); /* MPI programs end with MPI_Finalize; this is a weak synchronization point */
  return 0;
}

It is important to note that the runtime environment for the MPI implementation used (often invoked as mpirun or mpiexec) spawns multiple copies of this program text, with the total number of copies determining the number of process ranks in MPI_COMM_WORLD, an opaque descriptor for communication among the set of processes. A Single-Program-Multiple-Data (SPMD) programming model is thereby facilitated. Each process has its own rank, knows the total number of processes in the world, and can communicate with the others either by point-to-point (send/receive) communication or by collective communication among the group. It is enough for MPI to provide an SPMD-style program with MPI_COMM_WORLD, its own rank, and the size of the world for algorithms to decide what to do based on rank. In more realistic examples, additional I/O to the outside world is of course needed. MPI does not guarantee how POSIX I/O, as used in the example, behaves on a given system, but it commonly does work, at least from rank 0. Even when it works, POSIX I/O such as printf() is not particularly scalable and should be used sparingly.
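
For illustration, with a typical implementation the example might be compiled and launched as follows; the compiler wrapper name mpicc, the launcher invocation, and the source file name hello_mpi.c are common conventions rather than part of the standard. Run with four processes, rank 0 would print output along these lines:

mpicc hello_mpi.c -o hello_mpi
mpiexec -n 4 ./hello_mpi

0: We have 4 processors
0: Hello 1! Processor 1 reporting for duty
0: Hello 2! Processor 2 reporting for duty
0: Hello 3! Processor 3 reporting for duty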

MPI uses the notion of process rather than processor, as the example above illustrates. The copies of the program are mapped to processors by the MPI runtime environment. In that sense, the parallel machine can map to a single physical processor, or to N processors, where N is the total number available, or to something in between. For maximal parallel speedup, more physical processors are used, but the ability to separate the mapping from the design of the program is essential both during development and in practical situations where resources are limited. Note also that this example adjusts its behavior to the size of the world N, so it scales to whatever size is given at runtime; there is no separate compilation for each degree of concurrency, although different decisions might be taken internally depending on the absolute amount of concurrency provided to the program.

Adoption of MPI-2

While adoption of MPI-1.2 has been universal, including on almost all cluster computing, acceptance of MPI-2.1 has been more limited. Some of the reasons follow.

1. While MPI-1.2 emphasizes message passing and a minimal, static runtime environment, full MPI-2 implementations include I/O and dynamic process management, and the middleware is substantially larger. Furthermore, most sites that use batch scheduling systems cannot support dynamic process management. Parallel I/O, by contrast, is well accepted as a key value of MPI-2 (a minimal sketch of its use appears after this list).

2. Many legacy MPI-1.2 programs were already developed by the time MPI-2 came out and work fine. The threat of lost portability from using MPI-2 functions kept people away from the enhanced standard for many years, though this has lessened in the mid-2000s with wider support for MPI-2.

3. Many MPI-1.2 applications use only a subset of that standard (16-25 functions). This minimalism of use contrasts with the much larger set of functionality afforded by MPI-2.
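
As a hedged illustration of the parallel I/O mentioned in point 1 (the file name ranks.dat and the program as a whole are placeholders, not drawn from the standard text, and a full MPI-2 implementation is assumed), each process below writes its rank into a disjoint slot of a shared file using MPI-2 I/O routines.

/* Sketch: each process writes one integer to a shared file at an offset
   determined by its rank, using MPI-2 parallel I/O.  Illustrative only. */
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank;
  MPI_File fh;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_File_open(MPI_COMM_WORLD, "ranks.dat",              /* "ranks.dat" is a placeholder name */
                MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  MPI_File_write_at(fh, (MPI_Offset)(rank * sizeof(int)), /* disjoint offsets: no contention */
                    &rank, 1, MPI_INT, MPI_STATUS_IGNORE);
  MPI_File_close(&fh);

  MPI_Finalize();
  return 0;
}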

Other inhibiting factors can be cited as well, although these may amount more to perception and belief than fact. MPI-2 has been well supported in free and commercial implementations since at least the early 2000s, with some implementations arriving earlier than that.

The Future of MPI

There are several schools of thought on this. The MPI Forum has been dormant for a decade, but maintains its mailing list. Recently (late 2006), the mailing list was revived for the purpose of clarifying MPI-2 issues and possibly defining a new standard level.

1. MPI as a legacy interface is guaranteed to exist at the MPI-1.2 and MPI-2.1 levels for many years to come. Like Fortran, it is ubiquitous in technical computing, taught everywhere, and used everywhere. The body of free and commercial products that require MPI helps ensure that this will continue indefinitely, as will new ports of the existing free and commercial implementations to new target platforms.

2. Architectures are changing, with greater internal concurrency (multi-core), better fine-grain concurrency control (threading, affinity), and more levels of memory hierarchy. This has already yielded separate, complementary standards for SMP programming, namely OpenMP. In the future, however, both massive scale and multi-granular concurrency expose limitations of the MPI standard, which is only tangentially friendly to multithreaded programming and does not specify enough about how multithreaded programs should be written. While thread-capable MPI implementations do exist, multithreaded message passing applications remain few. Achieving multi-level concurrency entirely within MPI is both a challenge and an opportunity for the standard going forward.

3. The number of functions is huge, though, as noted above, the number of concepts is relatively small. Given that many users do not use the majority of MPI-2's capabilities, a future standard might be smaller and more focused, or might define profiles that let different users get what they need without waiting for a complete implementation suite, or for all of that code to be validated from a software engineering point of view.

4. Grid computing and virtual grid computing fit awkwardly with MPI's handling of static and dynamic process management. While it is possible to force the MPI model to work on a grid, the assumption of a fault-free, long-running virtual machine beneath the MPI program is a forced one in a grid environment. Grids may want to instantiate MPI APIs between sets of running processes, but multi-level middleware that addresses concurrency, faults, and message traffic is needed. Fault-tolerant MPIs and grid MPIs have been attempted, but the original design of MPI itself constrains what can be done.

5. People want a higher-productivity interface; MPI programs are often described as the assembly language of parallel programming. Reaching that goal, whether through semi-automated compilation, through model-driven architecture and component engineering, or both, would mean that MPI has to evolve and, in some sense, move into the background. These efforts, some well funded by DARPA and others underway in academic groups worldwide, have yet to produce a consensus that can fundamentally displace MPI's key values: performance, portability, and ubiquitous support.

See also

  • IPython allows MPI applications to be steered interactively.

References

This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.