Parallel Processing

Introduction

The best direct solver currently available is PyPARDISO. The PARDISO solver, however, does not run on ARM processors such as those in the newer Apple Silicon Macs.

To remedy this, EMerge includes a fast implementation of the SuperLU solver that is automatically activated when running on an ARM chip. SuperLU does not use PARDISO's distributed processing algorithm, nor does it optimize the algebraic factorization, which makes it slightly slower. However, because it only uses a single thread, you can distribute a frequency sweep over multiple cores.
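As a rough illustration (not EMerge's internal code), the sketch below shows the kind of single-threaded SuperLU factorization that each worker performs for one frequency point, together with a simple ARM check. The test matrix and the detection logic are stand-ins for illustration only; SciPy's splu is used here because it wraps SuperLU.

import platform

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# Simple ARM check (Apple Silicon reports 'arm64', Linux ARM 'aarch64')
on_arm = platform.machine().lower() in ("arm64", "aarch64")
if on_arm:
    print("ARM detected: a single-threaded SuperLU-style solver would be used here.")

# Stand-in sparse system for a single frequency point: A x = b
A = sp.random(500, 500, density=0.01, format="csc") + 10 * sp.identity(500, format="csc")
b = np.ones(500)

lu = splu(A)     # single-threaded SuperLU factorization (via SciPy)
x = lu.solve(b)  # back-substitution for this frequency point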

How to set it up

To run parallel processes, simply set the argument parallel=True in the frequency_domain method:

import emerge as em

m = em.Simulation3D(...)
# Generate your geometry
# Set up your physics
data = m.mw.frequency_domain(parallel=True, njobs=3) 

RAM Management

By default, when you set parallel=True, the system matrix for each frequency point is pre-assembled and stored in RAM. As a rule of thumb, every 100k degrees of freedom (DoF) yields a sparse matrix occupying approximately 100 MB of RAM, so a 100k DoF problem with a 100-point frequency sweep takes up about 10 GB. Especially with large sweeps and larger systems this can quickly exhaust your RAM. There are two ways to prevent this. First, the optional frequency_groups argument specifies exactly how many frequency points to pre-assemble before solving. Setting it to an integer multiple of the number of threads used is the most efficient choice.

data = m.mw.frequency_domain(parallel=True, njobs=3, frequency_groups=6)  # pre-assembles 6 matrices at a time and solves them over 3 threads
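To make the rule of thumb above concrete, here is a small sketch that estimates the pre-assembly RAM footprint from the DoF count and the number of frequency points. The helper name estimate_sweep_ram_gb is hypothetical, and the 100 MB per 100k DoF figure is the approximation stated above, not an exact measurement.

# Rough estimate of the RAM needed for pre-assembled matrices, using the
# ~100 MB per 100k DoF rule of thumb (an approximation, not an exact figure).
def estimate_sweep_ram_gb(n_dof: int, n_frequencies: int) -> float:
    mb_per_point = (n_dof / 100_000) * 100.0  # ~100 MB per 100k DoF
    return mb_per_point * n_frequencies / 1000.0

print(estimate_sweep_ram_gb(100_000, 100))  # ~10.0 GB, matching the example above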

Alternatively, if the systems get very large, you can cache them on the hard drive. This incurs a serious IO penalty and is only advised for very large problems. In this case you can set harddisc_threshold and harddisc_path. The threshold determines the number of degrees of freedom beyond which the system automatically writes the CSR matrices to the hard drive. The cache directory is a new path, given by harddisc_path, created relative to the Python script. The solver automatically cleans up all CSR matrix files after completing the simulation, and if the directory at harddisc_path is empty afterwards, it also removes the newly created directory.
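For example, a call with disk caching enabled could look like the following. The threshold and path values are placeholder choices for illustration, not defaults from the library.

data = m.mw.frequency_domain(
    parallel=True,
    njobs=3,
    harddisc_threshold=500_000,  # example value: cache CSR matrices to disk above 500k DoF
    harddisc_path="em_cache",    # example name: folder created relative to the Python script
)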

Multi-threaded vs Multi-process

By default, EMerge runs distributed solves using Python's multithreading capabilities. The most time-consuming step in a solution is solving the system of equations Ax=b, and the SuperLU implementation releases the GIL during this step, so multithreading is usually the most efficient option. If you want, you can also run your simulation in parallel using multiprocessing. To do this you have to guard your entry point, because spawned worker processes re-import your script: wrap your script in a main function and call it from an if __name__ == "__main__": block, like so:

import emerge as em

def main():
   
    m = em.Simulation3D(...)
    # setup here
    m.mw.frequency_domain(parallel=True, multi_processing=True)

if __name__ == "__main__":
    main()