Notes on Conference Call on Dual Serial MPI builds - easybuilders/easybuild GitHub Wiki

Date

September 14th, 2020

Participants

(in no particular order)

  • Mikael Ohman
  • Davide Vanzo
  • Sam Moors
  • Bennet Fauber
  • Maxime Boissonneault
  • Kenneth Hoste
  • Bart Oldeman
  • Jörg Saßmannshausen
  • Alex Domingo

Notes

  • Description of the scope and goals of the project. No questions raised.

Point about the packages that should have dual serial/MPI builds:

  • HDF5 already done
  • Boost is done, pending some changes to its easyblock. However, there are no issues with the serial build of Boost.
  • Maxime and Bart suggest that next in line should be FFTW and VTK

Point about packages that can be moved to lower toolchains by depending on serial packages

  • Packages that sit in gompi/iimpi because of a dependency on Boost or HDF5 are easy to spot
  • It will be more complicated with those packages sitting in foss/intel, because then we need to verify that those indeed cannot use MPI and BLAS/LAPACK
  • We should start by checking what has already been done in Compute Canada
  • Reviewers of easyconfigs in EB usually check if the package can be moved to lower toolchains, so the number of misplaced packages in foss/intel should not be large
  • We could also consider that certain software supporting MPI can be built without it if MPI is not useful, as the dependencies will be automatically disabled
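
The first check in this list could be sketched as follows. This is a minimal illustration only: the package inventory, the names AppA to AppD and the helper candidates_for_lower_toolchain are hypothetical and not part of EasyBuild. It lists packages in gompi/iimpi whose dependencies are all libraries with a serial build, which makes them candidates for review, not for automatic relocation.

```python
# Hypothetical inventory: package -> (current toolchain, direct dependencies).
# All entries are made up for illustration.
PACKAGES = {
    "AppA": ("gompi", ["Boost"]),
    "AppB": ("gompi", ["HDF5", "OpenMPI"]),
    "AppC": ("foss", ["HDF5", "OpenBLAS"]),
    "AppD": ("iimpi", ["Boost", "HDF5"]),
}

# Libraries for which a serial build is assumed to be available.
DUAL_BUILD_LIBS = {"Boost", "HDF5"}

# Toolchains that add only MPI on top of a compiler.
MPI_TOOLCHAINS = {"gompi", "iimpi"}


def candidates_for_lower_toolchain(packages, dual_libs):
    """Return packages in gompi/iimpi whose dependencies are all dual-build libraries."""
    found = []
    for name, (toolchain, deps) in sorted(packages.items()):
        if toolchain in MPI_TOOLCHAINS and deps and set(deps) <= dual_libs:
            found.append(name)
    return found


print(candidates_for_lower_toolchain(PACKAGES, DUAL_BUILD_LIBS))
```

Packages in foss/intel are deliberately left out of this sketch, since they also require verifying that BLAS/LAPACK is not used.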

Point on naming scheme

  • Using different package names is needed if the serial and MPI builds of a given package should be loaded at the same time
  • The requirements on naming scheme might be different depending on the module system (traditional vs hierarchical)
  • In Compute Canada serial/MPI modules use different names, but their modules cannot be loaded at the same time (controlled through Lmod)
  • If both modules can be loaded at the same time, then we might face a situation where a given package supporting MPI will load (for instance) Boost-MPI, but at the same time that package might depend on a non-MPI library that loads Boost-serial. Then what happens? Will there be conflicts?
    • Relying on path precedence in LD_LIBRARY_PATH can lead to trouble: if the wrong library is loaded, it can have missing or incompatible symbols
    • Order also matters for the lookup of symbols in RPATH
    • Boost is a good case because libraries are already well separated, file names of shared objects are different
    • Serial packages should have no issues if they load the libraries of the MPI variant, as those libraries might just have additional symbols
    • We should look at how Cray handles serial and MPI packages, as they already have such a split
    • There is the option to rename symbols to avoid collision between the serial and MPI libraries, but that adds a whole new level of complexity
    • It is not clear if such issues will indeed arise. If they do, the easy solution is to also add dual builds of those libraries needed by packages with MPI support, to ensure that all loaded dependencies use MPI
  • Having X.serial and X.MPI builds that can be loaded at the same time will be helpful for complex workflows that combine MPI and non-MPI packages; otherwise we might end up in a full split of the tree. This might be especially important with bio workflows.
  • Users might want to load X.serial and X.MPI at the same time mostly for convenience, not because it is really required
  • The non-MPI and MPI libraries cannot be hot swapped, as this can lead to more unforeseen problems. The library stack should be consistent top to bottom.
  • There is a risk of walking towards a split of the whole tree between serial and MPI builds. For instance, a large package such as TensorFlow might have a few components using MPI, but many of the packages depending on TensorFlow cannot use any of the MPI features. Will this require a split of TensorFlow.serial and TensorFlow.MPI? It's difficult to foresee the extent of it.
  • One of the core goals is to provide users with an MPI-free environment if MPI is not needed
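
The path-precedence concern raised in this list can be illustrated with a small simulation. This is a sketch under stated assumptions: the directories, the file name libX.so and the helpers resolve and shadowed are hypothetical, and the lookup is a simplified first-match search, not the dynamic linker's full algorithm.

```python
# Hypothetical library directories of two modules loaded at the same time.
CONTENTS = {
    "/apps/X.serial/lib": {"libX.so"},
    "/apps/X.MPI/lib": {"libX.so", "libX_mpi.so"},
}

# Search order, as it might appear in LD_LIBRARY_PATH (first entry wins).
SEARCH_PATH = ["/apps/X.serial/lib", "/apps/X.MPI/lib"]


def resolve(soname, search_path, contents):
    """Return the first directory in search_path that provides soname, or None."""
    for directory in search_path:
        if soname in contents.get(directory, set()):
            return directory
    return None


def shadowed(search_path, contents):
    """Map each file provided by more than one directory to its providers, in lookup order."""
    providers = {}
    for directory in search_path:
        for soname in sorted(contents.get(directory, set())):
            providers.setdefault(soname, []).append(directory)
    return {name: dirs for name, dirs in providers.items() if len(dirs) > 1}


print(resolve("libX.so", SEARCH_PATH, CONTENTS))
print(shadowed(SEARCH_PATH, CONTENTS))
```

In this made-up layout an MPI application asking for libX.so would get the serial copy, while libX_mpi.so has a unique name and is unaffected, which mirrors why Boost is described as a good case.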

Point on module visibility

  • The modules of serial builds could be made hidden as these modules are only intended to be used as dependencies
  • Both modules are visible in Compute Canada, this allows users developing code to not load MPI if it is not needed
  • As long as the description of modules is clear enough, it should not be a problem to have both modules visible.

Point on toolchain of the serial builds

  • Performance of dual libraries might be critical, but the cases considered so far will perform as well in GCC-based toolchains as in ICC-based toolchains
  • Using GCCcore for the serial builds could be very useful to reduce the number of new easyconfigs and provide dependencies at even lower levels
  • Using GCCcore might cause trouble if its MPI counterpart uses Intel compilers. In such a case loading both modules at the same time will probably raise linking issues.
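
The compiler mismatch mentioned in the last point could be flagged with a check like the one below. It is only a sketch: the toolchain-to-compiler mapping is a simplification assumed for illustration, and the packages X and Y and the helper mixed_compiler_pairs are hypothetical.

```python
# Assumed mapping from toolchain name to compiler family (illustrative only).
COMPILER_FAMILY = {
    "GCCcore": "GCC",
    "GCC": "GCC",
    "gompi": "GCC",
    "foss": "GCC",
    "iimpi": "Intel",
    "intel": "Intel",
}

# Hypothetical dual builds: package -> (toolchain of serial build, toolchain of MPI build).
DUAL_BUILDS = {
    "X": ("GCCcore", "gompi"),
    "Y": ("GCCcore", "iimpi"),
}


def mixed_compiler_pairs(dual_builds):
    """Return packages whose serial and MPI builds use different compiler families."""
    flagged = []
    for name, (serial_tc, mpi_tc) in sorted(dual_builds.items()):
        if COMPILER_FAMILY[serial_tc] != COMPILER_FAMILY[mpi_tc]:
            flagged.append(name)
    return flagged


print(mixed_compiler_pairs(DUAL_BUILDS))
```

A flagged package is not necessarily broken; it only marks a pair where loading both modules at the same time deserves a closer look.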

Concluding remarks

  • It is desirable that the serial and MPI modules of a given package can be loaded at the same time

    • Pro: Minimize the amount of splitting in the easyconfig tree. Only the dual builds of some selected libraries will be needed.
    • Pro: Users will gain a lot more flexibility to mix non-MPI and MPI modules
    • Con: Loading both modules might cause linking issues with dependent packages. However, since we are choosing which libraries will be made dual, this can be tailored to our needs and problems will be analyzed on a case-by-case basis.
  • Packages will have a suffix for serial builds and another suffix for MPI builds. Tentative names X.serial and X.MPI.

    • Pro: Having explicit suffixes avoids wrong assumptions from the users. It is important that the chosen suffixes are as self-explanatory as possible.
    • Con: The base name of the package loses its purpose. The default package name could be used to load either the serial or the MPI build based on the state of the environment.
  • Both the serial and MPI modules will be visible by default. Making any of those modules invisible should be decided by each site.

  • The toolchain of the serial builds will be decided on a per package basis.

    • Pro: In cases where GCCcore is suitable, it will be used to provide a lower-level dependency
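
The idea noted under the suffix decision, that the default package name could select the serial or the MPI build based on the state of the environment, might look like the sketch below. Everything here is an assumption made for illustration: the helper resolve_default, the rule of checking for a loaded MPI module, and the module names and versions are not something agreed in the call.

```python
def resolve_default(name, loaded_modules, mpi_modules=("OpenMPI", "impi")):
    """Pick the serial or MPI variant of `name` based on what is already loaded.

    loaded_modules holds strings such as "OpenMPI/4.0.3"; only the part
    before the first "/" is compared against mpi_modules.
    """
    loaded_names = {module.split("/", 1)[0] for module in loaded_modules}
    suffix = "MPI" if loaded_names & set(mpi_modules) else "serial"
    return f"{name}.{suffix}"


print(resolve_default("X", ["GCC/9.3.0", "OpenMPI/4.0.3"]))
print(resolve_default("X", ["GCCcore/9.3.0"]))
```

With this rule a user who never loads an MPI module stays in an MPI-free environment, in line with the core goal stated in the notes.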