Parallelization

The Matjes code has several levels of parallelization to allow for a significant speedup on clusters. At the moment, we have implemented a two-level Message Passing Interface (MPI) parallelization or CUDA, and one level of OpenMP.

Crash course

There are many courses online about the different parallelization schemes. We only give a very short introduction here so that a beginner can get started with the topic and use the basic parallelizations. The parallelization procedures are divided into two families: those that share the memory (OpenMP for example) and those that do not (MPI). When the parallelization shares memory, there is no need for a communicator to exchange information. The compiler only reserves some memory regions and executes the calculations on separate cores, but on the same node. In the figure below, the OpenMP parallelization corresponds to the red links between the different cores within node i and node j BUT NOT between the nodes.
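
As a minimal sketch of the shared-memory idea (written in generic C/OpenMP purely for illustration, not taken from the Matjes sources), all threads of one node fill the same array without any message passing:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];           /* one array, visible to every thread of the node */

    /* The iterations are distributed over the threads of a single node;
       all threads read and write the same memory, so no messages are sent. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 1.0 / (i + 1.0);

    printf("filled %d entries with up to %d threads\n", N, omp_get_max_threads());
    return 0;
}
```

All threads of the node access `a` directly; the equivalent distributed-memory version (see the MPI sketch below) needs explicit messages instead.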

On the other hand, the MPI parallelization corresponds to a parallelization between the nodes. In practice, it means that the nodes do not share the cache memory (where the calculations are done before the results are stored on the hard drive). Therefore, for the program to coordinate the nodes, the main process (called the master) has to send information from node i to node j so that both nodes can work on a different chunk of data. This operation involves communication through the network. When the operation is done, the master has to gather the results of the calculations.
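
A minimal distributed-memory counterpart, again a generic C/MPI sketch rather than Matjes code, shows the scatter/compute/gather pattern described above (the chunk size of 1000 is an arbitrary example):

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const int chunk = 1000;                 /* elements handled per process */
    double *data = NULL;
    if (rank == 0) {                        /* the master owns the full array */
        data = malloc(chunk * nproc * sizeof(double));
        for (int i = 0; i < chunk * nproc; i++)
            data[i] = 1.0;
    }

    /* send a different chunk to each process (network communication) */
    double *mine = malloc(chunk * sizeof(double));
    MPI_Scatter(data, chunk, MPI_DOUBLE, mine, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double local = 0.0;                     /* each process works on its own memory */
    for (int i = 0; i < chunk; i++)
        local += mine[i];

    /* gather the partial results back on the master */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);

    free(mine);
    free(data);
    MPI_Finalize();
    return 0;
}
```

Such a program is compiled with an MPI wrapper compiler (typically `mpicc`) and started through the launcher described in the next paragraph.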

In practice, when the code is compiled with MPI and/or OpenMP parallelization, the launcher mpirun or mpiexec (other names are possible) takes care of the number of CPUs assigned to the execution. On large clusters, workload managers (usually Slurm) are used to tell the operating system how many processes should be dedicated to the OpenMP and/or MPI tasks. All these variables are architecture dependent and you should contact your system administrator for more information.
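
Because the number of MPI processes and OpenMP threads is decided entirely by the launcher and the workload manager, it can be useful to check what was actually assigned. A small hybrid test program (again only a generic sketch, not part of Matjes) could look like this:

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* each MPI process reports how many OpenMP threads it may use */
    printf("MPI rank %d of %d: up to %d OpenMP threads\n",
           rank, nproc, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```

Launched for example as `OMP_NUM_THREADS=4 mpirun -np 2 ./check`, it should report 2 MPI ranks with up to 4 threads each; under Slurm the corresponding numbers are usually controlled with options such as `--ntasks` and `--cpus-per-task` (check the documentation of your cluster).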


OpenMP parallelization

The OpenMP parallelization is used when large sums are calculated. Large typically means on the order of one million terms, which occurs for example when the lattice size is larger than 1000x1000. It can also be used to accelerate the FFTW. If the sums are too small, the code actually slows down when OpenMP is used, because of the overhead needed to set up the threads and reserve memory for each of them. Do a simple test on your architecture to evaluate whether this parallelization is worth it.
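
One way to perform such a test is to time the same sum with and without the OpenMP reduction, for a loop length comparable to your lattice size. The following is a generic C sketch (not the Matjes code itself); adjust `N` to your problem size:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000   /* adjust to something comparable to your lattice size */

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++)
        a[i] = 1.0 / (i + 1.0);

    /* serial reference */
    double t0 = omp_get_wtime();
    double s_serial = 0.0;
    for (int i = 0; i < N; i++)
        s_serial += a[i];
    double t_serial = omp_get_wtime() - t0;

    /* OpenMP reduction: worthwhile only if the sum is large enough
       to hide the thread start-up overhead */
    t0 = omp_get_wtime();
    double s_omp = 0.0;
    #pragma omp parallel for reduction(+:s_omp)
    for (int i = 0; i < N; i++)
        s_omp += a[i];
    double t_omp = omp_get_wtime() - t0;

    printf("serial: %.3e s, OpenMP (%d threads): %.3e s (sums %.1f / %.1f)\n",
           t_serial, omp_get_max_threads(), t_omp, s_serial, s_omp);
    return 0;
}
```

If the OpenMP time is not clearly smaller than the serial one for your typical problem size, it is better to run with a single thread.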

CUDA parallelization

to come

MPI parallelization

The MPI parallelization is the most advanced one because it is used the most often. It has the advantage of splitting the memory over the MPI processes, which allows for larger calculations. The MPI communication is split into 2 levels. These 2 levels are used especially in the Monte Carlo simulations or in the dynamics simulations at non-zero temperature. In that case the ------ parallelization splits the temperatures over the N_inner processes and the ---- parallelization splits the Hamiltonian between the N_outer processes. The code checks that N_inner * N_outer is smaller than or equal to the total number of processors requested for the simulation.
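
The two-level layout can be illustrated with a generic `MPI_Comm_split` sketch. This only shows the communicator arithmetic, it is not the actual Matjes implementation, and the value `N_inner = 2` is an arbitrary example:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    const int N_inner = 2;              /* example value: 2 ranks per inner group */
    if (nproc % N_inner != 0) {         /* N_inner * N_outer must fit in the total */
        if (rank == 0) fprintf(stderr, "nproc must be a multiple of N_inner\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    const int N_outer = nproc / N_inner;

    /* inner communicator: ranks that could e.g. share the temperature loop */
    MPI_Comm inner;
    MPI_Comm_split(MPI_COMM_WORLD, rank / N_inner, rank, &inner);

    /* outer communicator: one rank of each inner group, e.g. for the Hamiltonian split */
    MPI_Comm outer;
    MPI_Comm_split(MPI_COMM_WORLD, rank % N_inner, rank, &outer);

    int irank, orank;
    MPI_Comm_rank(inner, &irank);
    MPI_Comm_rank(outer, &orank);
    printf("global %d -> inner %d/%d, outer %d/%d\n",
           rank, irank, N_inner, orank, N_outer);

    MPI_Comm_free(&inner);
    MPI_Comm_free(&outer);
    MPI_Finalize();
    return 0;
}
```

Started with, e.g., `mpirun -np 8`, this sketch builds 4 inner groups of 2 ranks each, so that the product of the two levels matches the total number of processes.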