Optimizing SuperLU Solver - FennisRobert/EMerge GitHub Wiki
## Introduction
One of the core philosophies of EMerge is that it should be easy to use and install without complicated package dependencies. Unfortunately, because high-performance libraries differ in compatibility, this is not always as easy as we would hope: different Python versions on different operating systems with different package managers may produce different results.
As far as we know, SuperLU works on all operating systems. However, the SuperLU shipped with Scipy is about 2 to 3 times faster when Scipy is built against the OpenBLAS backend than against other backends. The level of multi-threaded optimisation also varies, and getting SuperLU to work with OpenBLAS is not always easy.
We are working on writing the solvers such that EMerge works "out of the box" as much as possible. The following instructions should help you optimize your installation.
## Checking the current backend
You can check which Scipy BLAS dependencies are detected by calling the `fem.superlu_info()` function. With a supported dependency in use, the output looks something like this:
```
Library info:
- user API: blas
- Internal API: openblas
- Num threads: 4
```
With MKL instead of OpenBLAS:

```
Library info:
- user API: blas
- Internal API: mkl
- Num threads: 4
```
With OpenBLAS installed, the build-dependency section looks like this:

```
Scipy BLAS Build Dependencies info:
- name: openblas
- found: True
- version: 0.3.21.dev
- detection method: pkgconfig
- include directory: /usr/local/include
- lib directory: /usr/local/lib
- openblas configuration: USE_64BITINT= DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS= NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= SANDYBRIDGE MAX_THREADS=3
- pc file directory: /usr/local/lib/pkgconfig
```
Without OpenBLAS installed it looks like this:
```
Scipy BLAS Build Dependencies info:
- name: Accelerate
- found: True
- version: unknown
- detection method: extraframeworks
- include directory: unknown
- lib directory: unknown
- openblas configuration: unknown
- pc file directory: unknown
```
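If you don't have EMerge installed yet, similar runtime information can be gathered directly with the `threadpoolctl` package. This is a sketch of an equivalent check, assuming `threadpoolctl` is installed (`pip install threadpoolctl`):

```python
import numpy as np  # importing numpy loads the BLAS library into the process
from threadpoolctl import threadpool_info

# One entry per threading library (BLAS, OpenMP, ...) loaded by numpy/scipy
for lib in threadpool_info():
    print("Library info:")
    print(" - user API:", lib["user_api"])
    print(" - Internal API:", lib["internal_api"])
    print(" - Num threads:", lib["num_threads"])
```

The `internal_api` field is the one to look at: `openblas` there means the fast backend is active.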
## Windows
For Intel machines, Windows appears to pick the right backend for a standard Scipy installation through pip or conda. For AMD processors, the performance is currently unknown.
## MacOS (Intel and ARM)
On older Intel-based MacOS systems, pip is known to not always install Scipy with the right backend. For that reason we propose the following recipe, courtesy of Reddit user /u/Swipecat. It is verified to also work on ARM systems. The improvement over Accelerate is marginal but real (about 25% faster).
For Python versions <=3.11, first uninstall the existing scipy and numpy builds:
```
pip uninstall scipy
pip uninstall numpy
```
Then install the Anaconda nightly builds of scipy and numpy:
```
pip install --pre -i https://pypi.anaconda.org/scipy-wheels-nightly/simple scipy
pip install --pre -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy
```
If this doesn't work, you might have to prefix the commands with `python -m`:
```
python -m pip install --pre -i https://pypi.anaconda.org/scipy-wheels-nightly/simple scipy
python -m pip install --pre -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy
```
This installation is known to sometimes print the following line in your terminal when importing emerge:

```
/bin/sh: lscpu: command not found
```

As far as we know, this is harmless.
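After reinstalling, you can verify which BLAS backend the new Scipy build was compiled against. This is a sketch using `scipy.show_config`, whose dictionary mode is only available in newer Scipy releases:

```python
import scipy

try:
    # Scipy >= 1.11 can return its build configuration as nested dicts
    cfg = scipy.show_config(mode="dicts")
    blas = cfg["Build Dependencies"]["blas"]
    print("BLAS name:", blas["name"])  # "openblas" when the recipe worked
    print("found:", blas["found"])
except TypeError:
    # Older Scipy versions only print the configuration to stdout
    scipy.show_config()
```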
## Linux
EMerge is known to work on Linux machines with PyPardiso on Intel hardware. SuperLU's OpenBLAS performance on Linux still needs to be verified.
## Compatibility table
SuperLU runs on all operating systems. The following table summarises the known compatibility of the direct solver libraries with different operating systems and CPU architectures.
| OS | Arch. | PyPARDISO | SuperLU | SuperLU (OpenBLAS) |
|---|---|---|---|---|
| Windows | Intel | ✅ | ✅ | ✅ |
| Windows | AMD | ✅ | ✅ | ✅ |
| MacOS | Intel | ✅ | ✅ | Python <=3.11 |
| MacOS | ARM | ❌ | ✅ | Python <=3.11 |
| Linux | Intel | ✅ | ✅ | ✅ |
| Linux | AMD | ❓ | ✅ | ❓ |
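To stay portable across the rows of this table, a solver wrapper can try PyPardiso first and fall back to Scipy's SuperLU. This is an illustrative sketch, not EMerge's actual solver-selection code:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_direct(A, b):
    """Solve A x = b with PARDISO when available, otherwise SuperLU."""
    try:
        from pypardiso import spsolve  # only importable on supported platforms
        return spsolve(sp.csr_matrix(A), b)   # PARDISO expects CSR
    except ImportError:
        return spla.spsolve(sp.csc_matrix(A), b)  # SuperLU expects CSC

# Small demonstration system
A = sp.csc_matrix(np.array([[4.0, 1.0], [1.0, 3.0]]))
b = np.array([1.0, 2.0])
x = solve_direct(A, b)
print("residual:", np.linalg.norm(A @ x - b))
```

The try/except keeps the code working on MacOS ARM, where PyPardiso is unavailable.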
## Performance Metrics
The relative performance of PARDISO, SuperLU and OpenBLAS-backed SuperLU on a larger frequency sweep depends on multiple factors: the number of available cores, RAM, number of degrees of freedom (DOF), cache size and so on all play a role in how well each library performs. In general, for large frequency sweeps on smaller problems, parallel SuperLU processes are faster than PARDISO because they use the cores more efficiently. On larger problems with more DOFs, PARDISO outperforms plain SuperLU; however, SuperLU with an OpenBLAS backend tends to outperform PARDISO slightly. The following tests were done on a MacBook Pro 2019 with an Intel CPU.
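As a rough way to reproduce such measurements on your own machine, the snippet below times a SuperLU factorization of a random sparse system through Scipy. The matrix here is a synthetic stand-in, not an actual EMerge FEM matrix:

```python
import time
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 2000
rng = np.random.default_rng(0)
# Random sparse matrix, made diagonally dominant so the solve is well behaved
A = sp.random(n, n, density=0.001, random_state=rng, format="csc")
A = (A + sp.identity(n, format="csc") * n).tocsc()
b = rng.standard_normal(n)

t0 = time.perf_counter()
lu = splu(A)          # SuperLU factorization; speed depends on the linked BLAS
x = lu.solve(b)
print(f"factor + solve: {time.perf_counter() - t0:.3f} s")
print("residual:", np.linalg.norm(A @ x - b))
```

Running this before and after switching BLAS backends gives a quick sanity check of the speedup on your hardware.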
### Problem with 35kDOF
- SuperLU (4 threads, OpenBLAS): 55 seconds
- PARDISO: 70 seconds
- SuperLU (4 threads, Accelerate): 85 seconds
PARDISO does not benefit from running in multiple parallel threads because it already uses multiple cores internally for the solution process.