Parallelization and checkpointing

BPP implements multithreading via POSIX threads (pthreads) and currently supports medium-grained parallelization across loci. Loci are distributed evenly among threads, and each thread processes its assigned loci sequentially for each move in an MCMC iteration. Consequently, the number of threads cannot exceed the number of loci in the dataset.
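As a rough illustration of this scheme (a minimal sketch assuming a contiguous block partition, not BPP's actual code), the following C snippet computes the range of loci each thread would receive:

/* Sketch: even block distribution of nloci loci over nthreads threads.
   Values are hypothetical; BPP's internal partitioning may differ. */
#include <stdio.h>

int main(void)
{
  int nloci = 1000, nthreads = 4;
  for (int t = 0; t < nthreads; t++)
  {
    int start = t * nloci / nthreads;          /* first locus of thread t */
    int end   = (t + 1) * nloci / nthreads;    /* one past the last locus */
    printf("thread %d: loci %d-%d (%d loci)\n", t, start, end - 1, end - start);
  }
  return 0;
}

With 1000 loci and 4 threads, for example, each thread processes 250 loci per move.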

Modern multiprocessor computers typically use the Non-Uniform Memory Access (NUMA) architecture, in which memory access time depends on the location of the memory relative to the processor. Consider, for example, a NUMA multiprocessor system with four CPUs: each CPU may comprise several cores and has its own local memory, which is faster to access than the local memory of another processor.

Currently BPP does not take full advantage of the NUMA memory layout, as all memory accessed throughout the run of the program is allocated locally to the processor on which the first (master) thread is running.

Therefore, to achieve optimal performance it is important to ensure that all threads are allocated on cores of the same processor.

Because operating systems typically distribute the processing workload evenly across processors, the performance of BPP can degrade substantially when its threads are spread over an increasing number of processors (not to be confused with cores).

To alleviate this issue, we have implemented core pinning, i.e. each thread is pinned to a particular core.
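For readers curious about the mechanism, here is a minimal sketch of how a thread can be pinned to a core with pthreads on Linux; it is an illustration, not BPP's actual implementation:

/* Sketch: pin the calling thread to a given (0-based) core.
   Returns 0 on success, an error number on failure. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_to_core(int core)
{
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(core, &cpuset);
  return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}

Each worker thread would call such a function once at startup with its assigned core number; note that the operating system numbers cores from 0.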

Multithreading is enabled by specifying the threads variable in the control file, which has the format

threads = N A B

where N is the number of threads to be used, A is the starting core/thread number, and B is the stride (or step), so the N threads will be assigned to cores A,A+B,...,A+(N-1)B. Parameters A and B are optional, and their default values are 1.

threads = 4       # equivalent to threads = 4 1 1
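For instance, using both optional parameters, the hypothetical setting

threads = 8 5 2   # 8 threads pinned to cores 5,7,9,11,13,15,17,19

assigns the threads to every other core starting from core 5, following the A,A+B,...,A+(N-1)B rule above.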

The lscpu program, part of util-linux and available on most GNU/Linux distributions, can be used to inspect the topology of the system, such as the number of processors, cores and threads.
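On macOS, where lscpu is not available, the core counts can be obtained with sysctl (the NUMA layout, however, is not reported in the same way):

sysctl hw.physicalcpu hw.logicalcpu   # number of physical cores and hardware threads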

Example 1 - Lenovo ThinkSystem SR850 with four Intel Xeon Gold 6154 CPUs.

lscpu | egrep 'NUMA|Thread|Core'

We observe the following output:

Thread(s) per core:    2
Core(s) per socket:    18
NUMA node(s):          4
NUMA node0 CPU(s):     0-17,72-89
NUMA node1 CPU(s):     18-35,90-107
NUMA node2 CPU(s):     36-53,108-125
NUMA node3 CPU(s):     54-71,126-143

There are four processors (NUMA nodes), and each processor comprises 18 physical cores, each of which can execute two threads (via hyperthreading). In BPP's 1-based numbering, cores 1-18 belong to CPU 1, cores 19-36 to CPU 2, cores 37-54 to CPU 3 and cores 55-72 to CPU 4; the remaining 72 logical cores (73-144) are the hyperthread siblings of the physical cores. Note that lscpu numbers CPUs from 0, while the threads option in BPP numbers cores from 1.

Thus the following uses all 18 cores of the second processor

threads = 18 19

or equivalently

threads = 18 19 1

Example 2 - Dell PowerEdge T640 with two Intel Xeon Gold 5118 CPUs.

lscpu | egrep 'NUMA|Thread|Core'

Thread(s) per core:  2
Core(s) per socket:  12
NUMA node(s):        2
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47

In this case, the cores of the two processors are not enumerated sequentially but are interleaved. The setting

threads = 4 1 2

will specify the first four cores of CPU 1.

In contrast, the option threads = 4 1 1 (or equivalently threads = 4) would use the first two cores of CPU 1 and the first two cores of CPU 2; since this spreads the threads across both processors, it should be avoided.
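To check that the threads ended up on the intended cores, the affinity of each thread of a running BPP process can be listed with taskset (also part of util-linux); the snippet below assumes the process is named bpp and inspects its most recently started instance:

for tid in /proc/$(pgrep -n bpp)/task/*; do taskset -cp "${tid##*/}"; done

taskset prints the affinity list of each thread using the operating system's 0-based core numbers.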

Checkpointing

Checkpointing can be enabled using the option checkpoint. There are two formats:

checkpoint = X      # Creates a single checkpoint after the X-th MCMC sample (including burnin)
checkpoint = X Y    # As above, plus additional checkpoints every Y samples thereafter
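An interrupted run can then be restarted from a checkpoint file with the --resume switch. The file name below is illustrative; BPP reports the actual name when the checkpoint is written:

bpp --resume out.1.chk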

Next: Contact, questions, support