BLIS evaluation - easybuilders/easybuild GitHub Wiki

BLIS evaluation

Practical

In scope

  • BLIS + libFLAME (LAPACK)
  • gobff vs foss
  • iibff vs intel
  • also FFTW?

Notes meeting 20210408

(meeting cancelled)


Notes meeting 20210325

  • Kenneth

    • looked at FlexiBLAS a bit, half-working easyblock + easyconfig for it
    • we could use FlexiBLAS as a toolchain component?
      • collapses
      • pick BLAS/LAPACK to use at runtime, but this is a global setting (via $FLEXIBLAS)
      • also complicates testing, for which BLAS backends do we run the numpy tests for example?
    • Bart: single-threaded performance is probably more important than multi-threaded for BLAS?
    • also supports profiling which BLAS/LAPACK functions are used
  • Jure: BLAS testing, all BLAS functions available in blas-tester evaluated

  • Sam: CP2K benchmarking

    • H2O-128 benchmark
    • 30% performance hit with goblf vs foss (on Intel Skylake)...
    • some profiling done, seems to be mostly dgemm?
    • ~10% difference with direct dgemm benchmark (matrix size 1k-8k)
      • mostly agrees with Jure's results
    • ~20% with larger matrices (matrix size 10k)
      • so 30% perf hit mostly due to dgemm?
    • could look into FlexiBLAS profiling support to figure out dgemm matrix sizes used by CP2K
    • should try to reproduce this via numpy, and also check on AMD Rome...
  • next meeting: Thu April 8th at 2.15pm CEST
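
The direct dgemm comparison Sam mentions (matrix sizes 1k to 10k) could be reproduced via numpy along these lines. This is only a sketch, not the script behind the numbers in these notes: the helper names (gemm_gflops, best_time, bench_numpy_dgemm) are made up for illustration, and it assumes numpy is linked against the BLAS library under test.

```python
import time


def gemm_gflops(n, seconds):
    """GFLOP/s for an (n x n) times (n x n) product, counting 2*n^3 flops."""
    return 2.0 * n ** 3 / seconds / 1e9


def best_time(func, repeats=3):
    """Call func() `repeats` times and return the shortest wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def bench_numpy_dgemm(n, repeats=3):
    """Time a float64 matrix product through numpy and report GFLOP/s."""
    # Imported here so the two helpers above also work without numpy.
    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random((n, n))
    return gemm_gflops(n, best_time(lambda: a @ b, repeats))
```

Running bench_numpy_dgemm(n) for n in (1000, 2000, 4000, 8000, 10000) once per toolchain (foss, goblf, gobff) would give numbers that can be compared against the 10% and 20% differences noted above.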


Notes meeting 20210318

  • Åke

  • Kenneth

    • numpy benchmarking: https://github.com/easybuilders/blis-eval/tree/main/apps/python
      • OpenBLAS better than BLIS on low core counts (except 1)
      • MKL is very jumpy on AMD (core pinning?)
    • need to re-check pinning for MKL, and also check without $OMP_* for pinning
      • Maxim: export KMP_AFFINITY=granularity=fine,scatter
    • toolchain for AMD forks for BLIS/libFLAME/ScaLAPACK/FFTW: gobff/2021.03-amd
      • numpy tests keep failing
    • TODO:
      • non-x86_64 (Arm, POWER)
      • complete results
      • better pinning for MKL
      • other functions?
  • Maxim

    • very similar results for foss/2020b and gobff/2020b on Broadwell/Haswell
    • not sure if these runs are reliable, too short
  • Sebastian

    • testing with AMD forks (on top of JSC toolchains)
    • libFLAME issues don't seem to be fixed
    • problem reported with eigensolver in libFLAME, not fixed in AMD's fork
    • naming of upstream BLIS vs AMD BLIS
      • standard BLIS: libblis.a is multi-threaded when MT is enabled (or serial when MT is not enabled)
      • AMD's fork uses a suffix for the MT build: libblis-mt.a...
      • should we rename multi-threaded standard BLIS?
      • assess perf. diff of serial vs multi-threaded BLIS
    • HMNS would have to be changed too, to allow multiple different BLAS libraries in the same "branch"
      • clash of module names for gomkl vs foss installations
      • two possible solutions:
        • add another level in the hierarchy for BLAS+LAPACK+FFTW
          • copy HierarchicalMNS to a new MNS that adds an extra level
        • use versionsuffix to discriminate between default BLAS library (e.g. BLIS) and others (e.g. -mkl)
          • "fork" HierarchicalMNS to customize module name (add -mkl or -blis)
  • Sam

    • CP2K with goblf (BLIS, LAPACK, no libFLAME):
      • fixes all extra failed tests (summary now looks exactly the same as with foss)
      • performance tests underway - need to use $BLIS_NUM_THREADS?
        • 30% slower on Skylake (4 cores) at first sight?
        • CP2K popt with $OMP_NUM_THREADS
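
To keep thread counts and pinning consistent across the runs discussed above, the environment for each benchmark could be built in one place. A minimal sketch, assuming the benchmark is launched as a subprocess: the function name benchmark_env and the "blis"/"mkl" labels are hypothetical, the variable names come from the notes above. Whether $BLIS_NUM_THREADS is actually needed on top of $OMP_NUM_THREADS is still the open question from Sam's item, so both are set here.

```python
import os


def benchmark_env(threads, blas, base_env=None):
    """Return a copy of the environment with thread settings for one run.

    `blas` is a hypothetical label ("blis", "mkl", or anything else) that
    only selects which extra variables get set.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(threads)
    if blas == "blis":
        # Open question in the notes: may be redundant with OMP_NUM_THREADS.
        env["BLIS_NUM_THREADS"] = str(threads)
    elif blas == "mkl":
        # Maxim's suggested pinning setting for MKL runs.
        env["KMP_AFFINITY"] = "granularity=fine,scatter"
    return env
```

The result can be passed as env= to subprocess.run() when starting the benchmark, so the settings do not leak into the calling shell.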

Notes meeting 20210311


Notes meeting 20210304


Notes meeting 20210225


Notes meeting 20210218

  • new BLIS-based toolchains

    • BLIS moved to GCCcore because it doesn't like being built with Intel compilers (see https://github.com/flame/blis/pull/372)
    • gobff/2020b, iibff/2020b (+ gomkl/2020b), to be included with EasyBuild v4.3.3
  • BLAS test suite (Åke)

  • Sam tested https://github.com/xianyi/BLAS-Tester

    • ran into linking errors when using BLIS
      gcc -I./include -DAdd_ -DStringSunStyle -DATL_OS_Linux -DTHREADNUM=4 -DF77_INTEGER=int -fopenmp -m64 -O3 -o ./bin/xsl1blastst sl1blastst.o ATL_sf77rotg.o ATL_sf77rot.o ATL_sf77rotmg.o ATL_sf77rotm.o ATL_sf77swap.o ATL_sf77scal.o ATL_sf77copy.o ATL_sf77axpy.o ATL_sf77dot.o ATL_sdsf77dot.o ATL_dsf77dot.o ATL_sf77nrm2.o ATL_sf77asum.o ATL_sf77amax.o ATL_sf77rotgf.o ATL_sf77rotf.o ATL_sf77rotmgf.o ATL_sf77rotmf.o ATL_sf77swapf.o ATL_sf77scalf.o ATL_sf77copyf.o ATL_sf77axpyf.o ATL_sf77dotf.o ATL_sdsf77dotf.o ATL_dsf77dotf.o ATL_sf77nrm2f.o ATL_sf77asumf.o ATL_sf77amaxf.o ATL_sf77aminf.o ATL_flushcache.o ATL_sinfnrm.o ATL_rand.o ATL_svdiff.o ATL_sf77amin.o ./refblas/librefblas.a /apps/brussel/CO7/skylake/software/BLIS/0.8.0-GCCcore-10.2.0/lib/libblis.so -lm -lgfortran -lpthread
      ATL_sf77amin.o:ATL_f77amin.c:function OPENBLAS_sf77amin: error: undefined reference to 'isamin_'
      collect2: error: ld returned 1 exit status
      make: *** [xsl1blastst] Error 1
      
    • Åke may be able to help with that...
      • Use NO_EXTENSION=1
      • And one can set TEST_BLAS=-lblis to make it simpler
  • Sebastian starting with low-level benchmarks on JUWELS (Skylake partition)

  • Sam is looking into building CP2K with gobff

    • already includes a regression test
    • default: popt, should also look into psmp

Notes meeting 20210210

Tasks

  • correctness checking

    • run netlib BLAS/LAPACK tests (Åke)
    • netlib BLAS tests with BLIS
    • netlib LAPACK tests with BLIS+LAPACK
    • netlib LAPACK tests with BLIS+libFLAME
    • also https://github.com/xianyi/BLAS-Tester (Sam): does not work with BLIS
  • low-level performance testing (Sebastian)

  • gearshift FFTW benchmark (ask Miguel?)

    • Kenneth: see also PR for Christian with FFTW app

Toolchains

  • Sebastian, Kenneth
  • gobff/2020a + 2020b (PR is ready)
    • foss with OpenBLAS replaced by BLIS+libFLAME+FFTW
    • compare with foss + gomkl
      • custom gobff-amd (patched BLIS+libFLAME+FFTW)
  • iibff
    • intel with MKL replaced by BLIS+libFLAME+FFTW
  • FFTW 3.3.9 is out

Test systems

  • TODO: collect exact hardware info per site in blis-eval

    • CPU model numbers, see lscpu output
    • memory channels (hwloc?, sudo dmidecode -t memory)
    • STREAM benchmark results
      • see Åke's custom version (more exact timings)
  • AMD Rome

    • HPC-UGent (doduo): Rome
    • EMBL (Jure): Rome + Naples
    • Compute Canada (Bart): Rome (single-node)
    • JSC: Rome
    • Azure (Davide): various Rome SKUs (124-core, 120 usable)
  • Intel

    • HPC-UGent (Kenneth): Haswell, Skylake, Cascade Lake
    • VUB (Sam): Ivy Bridge, Haswell, Broadwell, Skylake
    • EMBL (Jure): Skylake
    • SURF: Cascade Lake
    • Compute Canada: same, KNL
    • Umeå (Åke): Broadwell, Skylake, (KNL)
    • JSC: Skylake
    • Azure (Davide): various (incl. special)
  • other

    • Arm (Kenneth @ AWS)
    • POWER9 (Kenneth?, via UBirm.)
  • Bart: 6248 vs 6248R makes a big difference...
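
For the TODO on collecting exact hardware info per site, the CPU part could be scripted so every site reports the same fields. A rough sketch, assuming lscpu is available and prints English "Key: value" lines; the function names and the default field list are illustrative choices, not something agreed on in the meeting.

```python
import subprocess


def parse_lscpu(text):
    """Turn lscpu-style 'Key:   value' lines into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing ':' survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def collect_cpu_info(fields=("Model name", "Socket(s)", "Core(s) per socket")):
    """Run lscpu and keep only the requested fields (None if missing)."""
    result = subprocess.run(
        ["lscpu"], capture_output=True, text=True, check=True
    )
    parsed = parse_lscpu(result.stdout)
    return {field: parsed.get(field) for field in fields}
```

The dict returned by collect_cpu_info() could be dumped to a file per site in blis-eval; memory channel info via dmidecode needs root and is left out here.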

Applications

  • HPL (Bart)
  • CP2K (Sam, Robert)
    • Sam has some experience with this
    • h2o_128 benchmark included in CP2K
  • VASP
    • too dependent on their shitty code
    • fair amount in BLAS, most in FFTW
    • Åke: may not be a good fit for this effort...
    • Åke has a test suite (correctness) + benchmarks (with some scientific validation)
  • numpy/scipy test suites (Kenneth)
  • QuantumESPRESSO (Robert, Sebastian)
    • standard benchmarks

Notes

  • previous experiments by Bart
    Some HPL results (could be improved upon)
    T/V            N    NB   P   Q    Time (s)      GFLOPS   CPU, BLAS library
    ----------------------------------------------------------------------------------------
    WR11C2R4  128000   384   8   8      678.88   2.059e+03   7452, MKL 2020.1
    WR12R2R4  177000   192   8   8     1528.47   2.419e+03   7452, MKL 2020.0, MKL_DEBUG_CPU_TYPE=5
    WR12R2R4  168960   232   4   4     1370.64  2.3461e+03   7452, AMD BLIS
    WR12R2R4  177000   232   4   4     1629.23  2.2691e+03   7452, OpenBLAS
    
    • newer MKL versions have custom kernels for AMD Rome
    • $MKL_DEBUG_CPU_TYPE no longer works with MKL 2020.1 (and is generally unsafe on AMD Rome)
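
The GFLOPS figures in the HPL results above follow from N and the run time, which is handy for sanity-checking new runs. A small sketch, assuming the usual HPL operation count of 2/3 N^3 plus a lower-order 3/2 N^2 term (the lower-order term does not change the digits shown in the table); the function name hpl_gflops is made up for illustration.

```python
def hpl_gflops(n, seconds):
    """GFLOP/s for an HPL run of problem size n that took `seconds`.

    Assumes the operation count (2/3)*n^3 + (3/2)*n^2.
    """
    flops = (2.0 / 3.0) * n ** 3 + 1.5 * n ** 2
    return flops / seconds / 1e9
```

For example, hpl_gflops(128000, 678.88) comes out at roughly 2059, in line with the first MKL row.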