Software: MPI

:warning: This page is still being written. Please get in touch with the admins if you have any questions or suggestions.

About MPI

MPI (Message Passing Interface) is a standard for communication between independent processes, usually over a network. MPI can be used to write programs that solve problems by working on many CPU cores in parallel, even if those CPU cores are spread across many different machines. To use MPI, programs have to be linked with a software library that provides the standard MPI functions. The MPI library we use on our cluster is called OpenMPI.

For a general introduction to MPI programming, see https://mpitutorial.com/tutorials/.
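As a quick illustration of "linking with an MPI library", compiling an MPI code on the cluster typically looks like the sketch below. The source file name is only a placeholder; the openmpi module is the library discussed further down this page.

```bash
# Load the cluster's Open MPI module (provides the mpicc wrapper and mpirun)
module load openmpi

# Compile an MPI program; mpicc adds the MPI headers and libraries automatically
mpicc hello_mpi.c -o hello_mpi
```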

Using MPI effectively requires a basic understanding of the hardware in our machines, and how programs are executed on that hardware. See Tutorial: Parallel Computing. That page also explains the difference between MPI and multithreading/multiprocessing, which are types of shared memory parallelism.

In a nutshell, the point of MPI is to allow programs to communicate between physically separate machines, over a network. An MPI program starts one or more processes on each machine; every process runs the same program, but each one can only see its own 'local' memory.

However, because the processes can send messages to each other over the network using MPI, they can still work together to solve a bigger problem. For example, they can each read in part of a big file, use that data to compute something, and then share the result with each other.

Launching MPI Jobs

Like all jobs on the cluster, MPI jobs should be submitted to the Slurm batch queue using sbatch. The Slurm environment and the commands to start the job should be provided in a shell script. See our page on Slurm for more details.

For MPI jobs, the batch script needs to contain three things:

  1. A set of sbatch directives to allocate cluster resources for the job in a way that MPI can use efficiently;
  2. Environment variables and modules that configure the MPI environment;
  3. A job command line that starts your computation using an MPI-aware job controller (mpirun or srun).

The following sections walk through these three topics.
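Before going into the details, here is a minimal sketch of a complete MPI batch script showing all three parts. The node and task counts and the program name are placeholders; adjust them for your own job.

```bash
#!/bin/bash
# 1. sbatch directives allocating resources for the MPI job
#SBATCH --job-name=mpi_example
#SBATCH --nodes=2                # number of machines
#SBATCH --ntasks-per-node=16     # MPI processes to start on each machine (placeholder)
#SBATCH --time=01:00:00

# 2. Modules and environment variables that configure MPI
module purge
module load openmpi

# 3. Start the computation with an MPI-aware launcher
mpirun ./my_mpi_program          # or: srun ./my_mpi_program (see below)
```

Submit the script with sbatch in the usual way (for example, sbatch mpi_job.sh); Slurm allocates the nodes, and the launcher on the last line starts one MPI process per allocated task.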

SBATCH directives for MPI jobs
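As a rough guide, MPI jobs are described to Slurm in terms of tasks, where each task corresponds to one MPI process. The numbers below are placeholders and depend on your code and the nodes you request.

```bash
#SBATCH --nodes=2                # how many machines to use
#SBATCH --ntasks-per-node=16     # MPI processes (tasks) per machine (placeholder)
#SBATCH --cpus-per-task=1        # cores per MPI process; >1 only for hybrid MPI+threads codes
```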

MPI environment variables
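At a minimum, the job script should set up the same MPI installation that your code was built with. A minimal sketch for a code compiled against the cluster's openmpi module:

```bash
# Start from a clean module environment, then load the cluster's MPI library
module purge
module load openmpi

# If your code comes from a conda environment instead (e.g. mpi4py),
# activate that environment here rather than loading the openmpi module.
```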

The job command line

Inside your job script, there are two commands that can be used to launch MPI jobs on the cluster:

  1. Using mpirun
  2. Using srun

For the time being, the correct choice depends on how your code was compiled. The bottom line is:

  1. Use mpirun if you compiled your MPI code yourself using the MPI libraries provided on the cluster;
  2. Use srun if you are running code that was compiled against other MPI libraries, including Python codes that use the mpi4py module or anything you have installed via the conda package manager.

mpirun options for MPI
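A minimal sketch, assuming your executable was built against the cluster's openmpi module (the program name is a placeholder):

```bash
module load openmpi                        # provides the matching mpirun

# Start one MPI process per Slurm task across the nodes allocated to the job
mpirun -np $SLURM_NTASKS ./my_mpi_program
```

With Open MPI, mpirun can usually detect the Slurm allocation by itself, so the -np option can often be omitted.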

srun options for MPI
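A minimal sketch for a conda-installed mpi4py code (the environment and script names are placeholders):

```bash
# Activate the conda environment that provides mpi4py and your code
conda activate my_env

# srun starts one Python process per Slurm task and connects them via MPI
srun python my_script.py
```

Depending on how Slurm is configured, you may also need srun's --mpi option (for example --mpi=pmix) to select the right MPI plugin; ask the admins if the default does not work.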

FAQs

Q. For serial jobs we always use srun. What is the difference between srun and mpirun?

There is not a big difference.

mpirun is a job launcher that creates an environment in which processes running on a cluster of machines can talk to each other using the MPI protocol. It is independent of batch systems such as Slurm.

srun is an all-purpose job launcher for jobs running under Slurm. It works for both serial and parallel jobs. Setting up the 'plumbing' that connects tasks on different machines is very similar to what mpirun does, but srun also does extra work to implement Slurm's resource management and job control.

If everything on our cluster were set up correctly, srun would completely replace mpirun. However, things are not set up perfectly yet. With our current setup, srun will fail to launch codes that have been compiled against the libraries provided by module load openmpi. Since most codes run on the cluster are linked with those libraries, they have to be started with mpirun (the mpirun command itself is provided by our openmpi module). In terms of functionality and performance there is little or no difference between the two.

However, we have found that at least some MPI-capable codes built by the conda package manager (specifically those that use Python's mpi4py library) will fail if they are launched by the cluster's mpirun, but run fine when launched by srun. We are still investigating exactly why this is the case.

In future, the system will be configured in such a way that srun will be the only option.

Q. What if I wrote my own Python code using the mpi4py package? Should I use mpirun or srun?

If you installed mpi4py using conda, probably srun, because your code will be linked against the MPI libraries that conda provides. If you installed mpi4py using pip (or built it yourself), it may be linked against our cluster's MPI libraries, in which case you will probably need mpirun.
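If you are not sure which case applies, you can ask mpi4py which MPI library it was actually built against (Get_library_version is part of mpi4py's standard API):

```bash
# Print the MPI implementation that mpi4py is linked to
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"
```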