Notes from EasyBuild maintainer summit 2021 - easybuilders/easybuild GitHub Wiki

EasyBuild maintainer summit 2021 (Tue-Wed Sept 28-29 2021)

Attendees

Day 1

  • Åke
  • Alan
  • Alex
  • Bart
  • Bob
  • Caspar
  • Damian
  • Davide
  • Fotis
  • Kenneth
  • Lars
  • Miguel
  • Mikael
  • Sam
  • Sebastian
  • Simon

Day 2

  • Adam
  • Åke
  • Alan
  • Alex
  • Bart
  • Bob
  • Fotis
  • Kenneth
  • Lars
  • Miguel
  • Mikael
  • Sebastian
  • Sam
  • Simon

Topics

Missing contributions in upstream forks

  • raised by Alan
  • mostly due to visualisation stuff that's done very differently at JSC
    • creates problems w.r.t. contributing back
    • OpenGL wrapper around Mesa (next to X11)
    • plan was to create a PR for that (but person responsible didn't find time yet)
    • (Mikael) quite a complex bundle of stuff, could/should be broken up?
    • (Alan) some additional magic on top to ensure right stuff gets used
    • LINK TO PR?
  • (Alan) also: Julia (see CSCS setup)
    • not upstream yet, also used at JSC
    • (Bart) work done by CSCS involved too much custom stuff, we ended up with this:
      ```
      modluafooter = '''
      append_path("JULIA_DEPOT_PATH", ":")
      append_path("JULIA_LOAD_PATH", ":")
      '''
      ```
    • (Bart) there is a JuliaPackage easyblock to install Julia packages as extensions; https://github.com/easybuilders/CSCS/tree/master/easybuild/easyblocks has three Julia-related easyblocks: julia.py, juliapackage.py, and juliabundle.py
  • (Damian) also Jupyter
  • (Kenneth) why aren't these things being contributed back?
    • (Damian) partially due to differences in dependencies in local setups
      • sticking closer to upstream could help
    • people are mostly focused on what's needed for own site
  • (Bart) ComputeCanada has a couple of customized easyblocks
    • contributing those changes back takes time and effort
    • internal stuff is sometimes not "good enough" for upstream
    • example: GPU offloading in GCC, used to live in custom easyconfigs, now integrated in main gcc.py easyblock, so available by default in GCCcore.
  • (Kenneth) could pairing up with a maintainer from another site help to get stuff upstreamed?
    • Bart + CSCS for Julia?
    • Mikael + JSC for visualisation stuff?
  • once you fork, it gradually becomes more difficult to contribute back
  • "Perfect is the enemy of good"
    • (Fotis) see easybuild.experimental repository
  • (Sebastian) asked people at JSC to open PRs to upstream stuff
    • nobody actually did, only got requests to help out with other stuff
    • there's definitely a barrier there...
    • people are not familiar with GitHub integration (they're more familiar with GitLab)
    • extending GitHub integration with support for GitLab could help
    • quite different toolchains at JSC
      • partially due to Parastation MPI (could be custom toolchain upstream)
      • different names/versions for standard toolchains like gompi/gomkl
      • raises the bar for easy contribution upstream
    • (Damian) differences in toolchains grew historically
      • toolchain versions are tied to JSC stages
    • (Damian) JSC could work on sticking closer to upstream toolchain definitions
      • toolchains seem to be more up-to-date upstream currently?
  • (Caspar) similar problems at SURFsara before
    • changing internal workflow a lot helped
    • open PRs more quickly, use --from-pr to install (don't wait for merge)
    • any reason why JSC doesn't take this approach?
      • (Damian) probably just workflow we're used to
      • new stage is an opportunity to change this and increase overlap with upstream
  • (Sebastian) also thinking about custom toolchain for AMD
    • AOCC, BLIS/libFLAME, ...
    • (Damian) biggest concern is easyconfigs using compiler-only toolchain like GCCcore
    • (Damian) things like SciPy-bundle are handled differently at JSC to avoid using MPI toolchain
    • (Mikael) relates to "diamond" toolchains with compiler+BLAS/LAPACK subtoolchain (no MPI)
    • (Bart) FFT MPI is actually rarely used
      • (Åke) VASP?
      • (Bart) checked in detail, seems not
      • could be a separate package with only FFTW wrappers for MKL
      • (Damian) MKL is installed in two places: system + full toolchain (incl. FFTW wrappers)
        • same version, so doesn't really cause problems
  • (Kenneth) can we make changes centrally to make it easier for sites to contribute back?
    • (Caspar) more structured/faster discussion on new toolchains?
      • they're very fundamental, quicker turnaround time needed there
    • (Damian) UCX is at system level at JSC (along with CUDA)
      • so using system compiler to build UCX
      • also related to Parastation, where vendor tests with OS compiler
    • (Åke) hooks are the place to implement more or less simple site-specific changes
      • could be done such that no changes are needed to easyconfig, so they can be sent upstream easily
    • (Alex) two types of divergence: easyconfigs vs toolchains/easyblocks
      • too hard to follow easyconfig differences
      • chasing down more significant changes, such as easyblock and toolchain differences, is more feasible
  • (Kenneth) working group on defining common toolchains
    • try to stick to original timeframe (a by Jan, b by July)
  • should we decouple version of GCCcore subtoolchain from GCC version?
    • allow easier updating to different GCC while still in develop
    • allow easier divergence for sites that want to
    • already done at JSC to some extent with MPI with module naming scheme (OpenMPI/4.1)
      • does create some tension with reproducibility
  • (Damian) plan to use FlexiBLAS for future (foss) toolchains?
    • what about ScaLAPACK and FFTW?
    • any problems with mixing BLIS for BLAS (via FlexiBLAS) vs FFTW via MKL?
    • different library names, so they don't conflict
    • (Bart) planning to look into support for MKL backend for FlexiBLAS
  • (Alan) FlexiBLAS provides a lot of flexibility
    • MPI thing doesn't seem to be as mature?
    • Alan reached out to them, no response?
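
Åke's point about hooks can be illustrated with a minimal sketch of a site hooks file, assuming it is passed to EasyBuild via `--hooks`. It applies the Julia module footer quoted above without touching the easyconfig itself. The `SITE_LUA_FOOTERS` table and the `site_lua_footer` helper are hypothetical names introduced for illustration, and applying this particular footer via a hook is an assumption, not something any site confirmed doing.

```python
# Hypothetical site hooks file, e.g. used via: eb --hooks=/path/to/site_hooks.py ...
# The footer is the one quoted in these notes; applying it through a hook
# (instead of in the easyconfig) is an illustrative assumption.

JULIA_FOOTER = '''
append_path("JULIA_DEPOT_PATH", ":")
append_path("JULIA_LOAD_PATH", ":")
'''

# Site-specific Lua module footers, keyed by software name (hypothetical table).
SITE_LUA_FOOTERS = {
    'Julia': JULIA_FOOTER,
}


def site_lua_footer(name):
    """Return the site-specific Lua module footer for a software name, or None."""
    return SITE_LUA_FOOTERS.get(name)


def parse_hook(ec, *args, **kwargs):
    """Hook run by EasyBuild after parsing an easyconfig; 'ec' is the parsed easyconfig."""
    footer = site_lua_footer(ec.name)
    if footer is not None:
        # Overrides any footer set in the easyconfig; a real hook may prefer to append.
        ec['modluafooter'] = footer
```

Keeping the site-specific part in a lookup table means the easyconfig stays identical to what would be sent upstream, which is the property Åke was after.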

Keeping up with incoming contributions

Easyconfig PRs

  • some PRs take significantly more time

    • new contributors
    • complex software
  • will most likely break our easyconfig PR record again this year

    • 85% are through our GitHub integration

    • about 500 open for last few months
    • 100 different contributors for those...so no silver bullet

  • allow reviewer to make "trivial" changes to PRs to get them merged

    • fixing code style issues, adding sanity check command, ...
    • anything that doesn't have impact on how the software is installed
    • contributors usually don't mind
  • auto-closing inactive PRs => stalebot GitHub Action (https://github.com/probot/stale)

    • right timeframes?
    • auto-tagging PRs as "stale"
  • be stricter about PRs using system toolchain?

    • relies on OS dependencies, limited value in EB ecosystem
  • active easyconfig maintainers are ~10

    • finding more people with the right skills and level of attention is hard
  • would it have helped to have a "rulebook" for maintainers?

    • (Sebastian...as a recently integrated maintainer) didn't feel that would be necessary
      • Had contributor experience
      • Attended biweekly meeting
      • Open to asking other maintainers questions
    • (Caspar) would be nice to have a checklist for merging PRs
      • Can take time to get back into things when you have not done maintenance for a while
      • The "rules" may change over time (cfr. CUDAcore/CUDA, Python versionsuffix)
    • (xx) make the bot spew out a --review-pr output
      • only comparing against most similar easyconfig in develop
  • deprecate toolchains older than 2019

    • make EB produce a warning
    • start closing PRs that use the toolchains
  • make --new-pr ask the user to submit a test report?

  • let maintainer mark a PR as approved for testing

    • let bot auto-test in "standard" environments
  • have bot mark PRs as stale

  • do we need extra labels to make it easier to find PRs to work on?

    • single 'status' label, partially managed automatically by the bot?
  • extra labels to mark PR ready for auto testing by generoso?

  • any maintainers not using Octobox yet should start using it

  • CI should fail over patches with no comments on top

  • --new-pr can/should check a couple of common things (checksum, top comment patch)

    • trigger an automatic --check-contrib?
      • does not do exactly the same as the CI... but we should fix that!
  • test suite errors could be improved sometimes

    • see failure message when code style check fails
  • auto-reply from bot with checklist of things for contributor

    • request test report after 1 day (if CI is passing)
  • (Adam) Real application testing
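
The idea of checking patches for a comment on top (in CI or from `--new-pr`) could look roughly like the sketch below. It is not the check used in the EasyBuild test suite; `patch_has_description` and the list of diff markers are assumptions made for illustration.

```python
def patch_has_description(patch_text):
    """Return True if some non-blank text precedes the first diff header line.

    Simplification: any non-blank line that does not look like a diff header
    counts as a description (including e-mail style headers from git format-patch).
    """
    diff_markers = ('diff ', '--- ', '+++ ', 'Index: ')
    for line in patch_text.splitlines():
        if line.startswith(diff_markers):
            # reached the actual diff without seeing any description
            return False
        if line.strip():
            return True
    # empty file, or only blank lines
    return False
```

A check like this only looks at the first non-blank line, so it is cheap enough to run on every patch file touched by a PR.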

Easyblocks/framework PRs

  • Reviewing is (usually) more time-consuming
  • Should we have a CI check that verifies that a new feature appears in documentation?
  • Need to document requirements for maintainers to know when it's ok to merge
  • Would training help?
    • It's hard to find your way around
    • An overview of the structure
    • Workflow from eb command to parsing easyconfig to easyblock, etc.
    • People who could help out with this (in order of being familiar with framework): Kenneth, Alan, Bart, Åke
    • potential topics
  • (Alex) sometimes reviewer requests additional changes that are "out of scope" for that PR
  • What about easyblocks?
    • What are the requirements for test reports of easyblock PRs?
      • You cannot (currently) ask the bot to test an easyblock PR
    • Nice to know what easyconfigs are touched by an easyblock
      • ambertools was a case that used the amber easyblock and broke when this got updated
      • bot could check this for non-generic?
    • Have a webpage or similar function where you can enter an easyblock name, which will then generate a list of the other easyblocks that use it as their base
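
The "which easyblocks build on this one" lookup could be prototyped along the lines below. This is a sketch under assumptions: the classes are toy stand-ins rather than real easyblocks, `derived_easyblocks` is a hypothetical helper, and `__subclasses__()` only sees classes that have already been imported.

```python
def derived_easyblocks(base):
    """Return sorted names of all classes deriving (directly or indirectly) from base."""
    found = set()
    stack = [base]
    while stack:
        cls = stack.pop()
        for sub in cls.__subclasses__():
            if sub.__name__ not in found:
                found.add(sub.__name__)
                stack.append(sub)
    return sorted(found)


# Toy stand-ins for easyblocks, for illustration only.
class GenericBlock:
    pass


class EB_Foo(GenericBlock):
    pass


class EB_FooTools(EB_Foo):
    pass


class EB_Unrelated:
    pass
```

With these toy classes, asking for the easyblocks derived from `EB_Foo` yields only `EB_FooTools`, which is the kind of answer that would have flagged the ambertools case before the amber easyblock was changed.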

Keeping up with incoming issues

  • Slack is not searchable so we should keep away from using it for issues
  • (Adam) issue template is good thing
    • (Kenneth) should be optional
    • add suggestion to ask on Slack in issue template
    • Things to request in template
      • eb --show-config
      • name of easyconfig
      • ...
  • Add to docco to use the general easybuild repo for new issues, we will move them into the correct repo
  • Document better how to interpret build logs and how to find the actual problem
    • Search for _step etc
  • See troubleshooting in the docco
  • (Kenneth) arch-tagging? will it help?
  • (Kenneth) Maintainer sprint sessions bi-weekly, on non-eb-bi-weekly weeks

Revamp of EasyBuild documentation

  • Docs not being updated when we add features (in framework in particular)
  • Current syntax is RST, differences with markdown are enough to be annoying
  • Workflow with readthedocs is also a bit annoying if you want to preview
  • Tutorial uses mkdocs...and is in a separate repository
    • Allows for easy and instant local preview
  • Should we also be hosting the docs on GitHub?
  • Fotis: GitLab does direct rendering of .rst
    • Alan: but the Sphinx stuff it won't deal with
  • Move to another format is the biggest jump
    • rst-to-myst looks like a good help here
  • Starting point
    • create a repo and do a page or two
    • then do a cry for help, looking for volunteers for 1 page at a time
      • Need a decent starting point
      • Should have decent CI and contribution docs in place
    • Will also need to port automated docs
    • ACTION: Look for volunteers to help kickstart this

Support and testing on non-x86_64 platforms

  • ARM and POWER are secondary platforms

    • Don't hold back a PR if Intel/AMD work
      • Open an issue though for non-working archs
    • Do have access to both archs (ARM through EESSI)
      • Doing these checks introduces additional latency
      • Don't have to require these, but can add the capability to the bot to request tests there
    • Can we do a Gentoo-style tagging so we know what works where
      • Keeping track of this in easyconfigs is a maintenance nightmare
      • Can use regression tests to at least document this
    • Alex: what about blacklisting stuff
      • keeping track of known issues
        • in easyconfigs? maintenance burden
        • from regression tests as part of the release in the same way that indexing is done?
          • Will delay a release
            • Could do this step afterwards as part of the docs
            • Still need the ability for EB to pick that up
        • Would documentation be a better place to keep track of known issues?
          • cfr. FlexiBLAS trouble on POWER
  • Consensus on:

    • treating Arm and POWER as secondary platforms
    • don't block PRs because of test failures on Arm/POWER
    • document known issues (on non-x86 platforms) and let 'eb' pick them up and print warnings?
  • Can we get some (cloud) resources to support arch testing?

    • JUSUF Cloud is perhaps an option for AMD, Sebastian will investigate
    • Can we ask vendors for hardware?
      • Admin is the challenge there
      • Cloud credits are perhaps a better option
        • Would cover ARM as well
  • Fotis: How do other projects manage builds for multiple architectures?

  • expand boegelbot to test PRs

    • POWER9 (emulated) at OSUOSL
    • Graviton2 aarch64 @ AWS (using EESSI credits)
    • aarch64 @ fosshost (to be requested)
    • AMD + GPU @ JSC (JUSUF Cloud) [Sebastian]
    • bot account @ Mikael's infrastructure?
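
The consensus item on documenting known issues and letting `eb` print warnings could be sketched as a simple lookup keyed by software name and CPU architecture. Nothing like this exists in EasyBuild as far as these notes say; `KNOWN_ISSUES`, `known_issue_warnings`, and the wording of the note are hypothetical, with the FlexiBLAS/POWER entry taken from the example mentioned above.

```python
import platform

# Hypothetical table of documented known issues: (software name, architecture) -> note.
KNOWN_ISSUES = {
    ('FlexiBLAS', 'ppc64le'): "known trouble on POWER, see documentation",
}


def known_issue_warnings(names, arch=None):
    """Return warning messages for software with a documented issue on this architecture."""
    # default to the architecture of the machine we are running on
    arch = arch or platform.machine()
    warnings = []
    for name in names:
        note = KNOWN_ISSUES.get((name, arch))
        if note is not None:
            warnings.append("WARNING: %s on %s: %s" % (name, arch, note))
    return warnings
```

Because the table lives outside the easyconfigs, it could be generated from regression test results after a release, which avoids the maintenance burden raised in the discussion.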

Outlook to EasyBuild 5.0

  • start working on this in a 5.x branch?
  • Opportunity to change things we are not happy with
  • HMNS has a couple of issues
    • robot can be broken if people do not have easyconfigs in the robot search path
    • building your own software on top of someone else's stack is cumbersome, you need to fiddle with MODULEPATH
    • HMNS is non-unique which makes handling something like gomkl difficult (module clashes with foss)
    • Non-unique names in HMNS for things like OpenBLAS (stuff installed with foss vs gomkl toolchain) => separate HMNS with one extra level for math libs or add versionsuffix for non-foss toolchains
  • (Bart) Having a bootstrap location for dependencies required for bootstrapping toolchains
  • (Bart) Kill incomplete implementation of support for .yeb easyconfigs (YAML syntax)
  • (Bart) cleanup in:
    • easyblocks (old software versions)
    • 32-bit support in framework
    • macOS support
      • Kenneth: let's not, basic functionality works
  • Deprecating Python 2.7 makes sense but dropping support right now is probably too much
    • Can also deprecate 3.5
  • What about Lmod?
    • Deprecating Lmod 7 might be a good idea
  • Can we default to use depends_on for dependencies (with Lmod)?
  • Drop support for ancient Tcl-only implementation?
  • new features for 5.0
    • (Alex) versionless dependencies, let EB use what's available
      • (Åke) not a fan of this...
      • (Kenneth) adding the feature in framework and using it in easyconfigs in the central repo are two different things
      • Can we leverage some of the code of --try-update-deps?
    • (Sam) support for only specifying partial versions (Python 3.6.*)
      • should that be reflected in generated module file, or not?
    • (Alex) separate metadata files for easyconfigs (homepage, checksums, etc)
      • metadata and checksums should be separate files (due to updates needed in PRs)
      • (Kenneth) could help with maintenance?
        • (Mikael) will it really?
      • having things across multiple files may cause trouble in some contexts
        • example: checksum added in checksums.json, only copying easyconfig file
    • (Fotis) reduce the need for conditionals in easyblocks
      • missing feature that results in lots of if statements
      • some kind of lookup table to avoid if/else blocks
      • exploiting repetitive patterns can be indication of a missing feature
  • (Simon) cleaning up of easyconfigs for old software versions
    • cfr. bintray cleanup
    • mostly stuff with system toolchain?
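
Sam's suggestion of partial versions (such as Python 3.6.*) comes down to matching a pattern against the versions that are available and picking one. The sketch below assumes purely numeric, dot-separated versions and picks the newest match; `resolve_partial_version` and `version_key` are hypothetical helpers, not framework functions.

```python
from fnmatch import fnmatchcase


def version_key(version):
    """Turn '3.6.15' into (3, 6, 15) so versions compare numerically."""
    return tuple(int(part) for part in version.split('.'))


def resolve_partial_version(pattern, available):
    """Return the newest available version matching a glob pattern, or None."""
    matches = [v for v in available if fnmatchcase(v, pattern)]
    if not matches:
        return None
    return max(matches, key=version_key)
```

Whether the generated module file should record the pattern or the resolved version is the open question from the discussion; this sketch only covers the resolution step.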

Code-of-conduct for EasyBuild community

  • Contacts
    • separate committee
    • documented
    • group contact + individual contacts

New generoso cluster at CSCS

  • Most of the details are in the slides
  • Bot is currently installing in shared space
    • Should do an install in /tmp first
    • Should automatically do the installation in the shared space after the PR is merged
      • automation is hard as there are a lot of corner cases
      • can we use singularity to do the installation in an overlay?
        • EESSI uses fuse overlay (not singularity overlay)
  • would like to start testing in singularity containers
    • should create a repository where we define these test environments (hosted on GitHub)
    • will allow us to test in multiple environments/OSes
    • (Åke) have Ubuntu Focal minimal containers available
  • would like to have access to logs

Action items

  • Crush the curve of open (easyconfig) PRs

  • Try to empower contributors more to make PRs ready

    • Improve errors for failing CI tests (cfr. code style for easyconfigs)
    • Make CI fail over common issues (like missing comments on top of patches)
  • Make life of maintainers easier

    • document requirements to merge PRs (in different repos)
    • Make boegelbot add comment with output of eb --review-pr (single easyconfig)
    • auto-label with PR status (new, CI passes, ...)
  • working group for migrating docs to mkdocs

  • expand farm of test platforms (boegelbot)

  • code-of-conduct

    • Alan & Kenneth follow up
    • odd number of committee members (3?)
    • PR for code-of-conduct that all maintainers should agree on
  • EasyBuild 5.0

    • project to track progress on major targets to tackle for EasyBuild 5.0
    • set up 5.x branches
  • issue template for reporting bugs/questions/...
