Sync meeting 2025-05-13

MultiXscale WP1+WP5 sync meetings


Next meetings

  • Tue 10 June 2025 09:30 CEST
  • Tue 1 July 2025 09:30 CEST
    • rescheduled from 8 July
  • Tue 12 August 2025 09:30 CEST
  • Tue 9 September 2025 09:30 CEST
    • 3 days before review

Agenda/notes 2025-05-13

attending:

  • Petra (NIC)
  • Richard, Thomas (UiB)
  • Caspar (SURF)
  • Lara (UGent)
  • Helena, Susana, Arturo (Do IT Now)
  • Pedro (RUG)
  • Julián (BSC)

General

  • Petra has started as Project Manager for MultiXscale
  • overview of MultiXscale planning
  • upcoming project review
    • final date: Fri 12 Sept 2025 (confirmed)
      • Agenda & list of participants are on OneDrive
        • also some emails sent during last week
      • Is it confirmed that WP1/WP5 leads can attend remotely?
        • Caspar is on holiday :D
        • confirmed (?) by PM, focus will be on scientific WPs
  • amendment update
    • status?
    • was signed by NIC
    • waiting for response (25/45 days)

Deliverables due M30

  • Upcoming deliverables (M30 - June 2025):
    • Discussed D1.4, D1.5 and D5.3 on 2025-05-09
      • see notes @ https://hackmd.io/WHVWZR3ITu20oJkLZZqd0w
      • all deliverables in very good shape
      • next: review + update
      • goal: have them (camera-)ready by end of May, for handover to the Steering Committee for final review
      • next meeting May 28
    • D6.3 Interim report on Community outreach, Education, and Training => NIC (Petra)
      • mostly written (by Neja)
      • review by Alan?

WP status updates

  • WP status updates
    • [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
      • [UGent] T1.1 Stable (EESSI) - due M12+M24
      • [RUG] T1.2 Extending support (starts M9, due M30)
        • NVIDIA Grace: essentially done (100% minus epsilon)
          • Lara & Richard are working on LAMMPS
          • small remaining issue: we don't yet have a service account
          • Do we have a non-personal bot that can be used to keep installations for NVIDIA Grace in sync?
            • Not yet, only personal bots for Thomas/Richard
            • A service account would be needed; should we ask JSC? => Thomas will ask
        • AMD ROCm: (Pedro, Bob)
          • Status: close to having a ROCm-LLVM easyconfig (see the sketch after this list)
            • builds were picking up something from the host, but this should be fixed now
          • Next few weeks, try to build rest of the toolchain
            • likely not before end of webinar series (June 2nd)
          • (?) Alan will look into exposing AMD drivers, similar to what we do for NVIDIA drivers
          • (?) Alan will look into EasyBuild toolchain (is it needed, what would be in there, can it share stuff with the regular LLVM toolchain?)
          • Not started yet or no results yet:
              • then, in the following weeks, look into higher-level software (GROMACS etc.)
          • Goal: ROCm available by end of June 2025, probably in dev.eessi.io
          • Notes at https://github.com/EESSI/meetings/wiki/Sync-meeting-on-EESSI-ROCm-support-(2025-04-04)
          • Tiger team meeting Friday (4 April); maybe one more has been held since April 4?
          • Bob met with AMD ROCm developer April 22
            • someone from AMD reached out to EasyBuild (via Slack)
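As a point of reference for the ROCm-LLVM easyconfig work mentioned above, here is a minimal sketch of what such an EasyBuild easyconfig might look like. This is not the actual work-in-progress easyconfig: the version, toolchain, dependency versions and configure options below are illustrative assumptions only.

```python
# Illustrative sketch of a ROCm-LLVM easyconfig; all versions, dependencies
# and options here are assumptions, not the actual work-in-progress easyconfig.
easyblock = 'CMakeMake'

name = 'ROCm-LLVM'
version = '6.1.2'  # hypothetical version

homepage = 'https://github.com/ROCm/llvm-project'
description = "ROCm fork of the LLVM compiler infrastructure"

toolchain = {'name': 'GCCcore', 'version': '13.3.0'}  # hypothetical toolchain

source_urls = ['https://github.com/ROCm/llvm-project/archive/refs/tags/']
sources = ['rocm-%(version)s.tar.gz']

builddependencies = [
    ('CMake', '3.29.3'),   # hypothetical versions
    ('Python', '3.12.3'),
]

# LLVM requires an out-of-source CMake build
separate_build_dir = True

# enable only the subprojects and targets needed for AMD GPU support
configopts = "-DLLVM_ENABLE_PROJECTS='clang;lld' -DLLVM_TARGETS_TO_BUILD='AMDGPU;X86'"

sanity_check_paths = {
    'files': ['bin/clang', 'bin/lld'],
    'dirs': ['include', 'lib'],
}

moduleclass = 'compiler'
```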
        • Caspar and Bob are building stacks for Cascadelake and Icelake (>75%)
        • Resumed work on A64FX stack (building on Deucalion) (50 %)
      • [SURF] T1.3 Test suite - due M12+M24
        • v0.6.0 released: breaking change in constants => required ReFrame configs to be updated (see the illustrative config fragment after this list)
        • Caspar gave a talk on EESSI test suite at EUM'25 (https://easybuild.io/eum25/#program)
          • Hands-on in 20 minutes was tight, but feedback showed at least a handful of people tried (and some managed) to run the test suite
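To illustrate what "ReFrame configs had to be updated" means in practice, below is a hypothetical fragment of a ReFrame site configuration that references the test suite's constants. The import path, constant names and all system details are placeholders; the exact v0.6.0 change is described in the release notes.

```python
# Hypothetical ReFrame site-config fragment for the EESSI test suite.
# The import below, constant names, and all system details are placeholders;
# a breaking change in the suite's constants ripples into fragments like this.
from eessi.testsuite.constants import FEATURES, GPU  # assumed import path

site_configuration = {
    'systems': [
        {
            'name': 'example_cluster',   # placeholder system
            'descr': 'Example EESSI test system',
            'hostnames': ['login.*'],
            'partitions': [
                {
                    'name': 'gpu',
                    'scheduler': 'slurm',
                    'launcher': 'mpirun',
                    'access': ['-p gpu'],
                    'environs': ['default'],
                    # partition features come from the suite's constants, so
                    # renamed or restructured constants require edits here
                    'features': [FEATURES[GPU]],
                    'max_jobs': 4,
                },
            ],
        },
    ],
    'environments': [
        {'name': 'default'},
    ],
}
```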
      • [BSC] T1.4 RISC-V (starts M13)
        • Working on LLVM in riscv.eessi.io; currently 346 tests fail (0.25% of the total test suite) => maybe check with Davide, he might be able to help
          • Some of the test failures are not specific to RISC-V
          • still working on LLVM (some issues even on non-RISC-V)
          • in contact with Davide (CECAM), work on Intel/AMD CPUs
          • objective now is a minimal installation
        • (same as last month) Bob has a build bot running on RISC-V, but it has some issues (smee client crashes)
          • Currently, tarballs passed manually to Bob for ingestion
          • Bob's plan is to push things to software.eessi.io, but not sure if we will still do that for 2023.06
      • [SURF] T1.5 Consolidation (starts M25)
        • Improve NVIDIA GPU support
          • CUDA sanity check (PR to EasyBuild framework by Caspar, Kenneth/Caspar/Alan reviewing)
          • Formalize supported CUDA Compute Capability & CPU architecture combinations (a sketch of such a mapping follows this list)
            • for CC70, CC80, CC90 it's formalized
            • building of software for CPU+GPU combinations has started
              • first PR covering CUDA, UCX, UCC, OSU benchmarks built, ingested and merged
              • these also include builds for Icelake and Cascadelake stack
            • there was a problem building CUDA-Samples version 12.1 for CC70
              • don't build that for now for CC70
              • but build it for CC80 and CC90
              • look into building a newer version of CUDA-Samples for all CCxx
              • if that is in place, add custom modules or an Lmod hook so that CUDA-Samples 12.1 on CC70 points to the newer version
            • not yet decided how to deal with minor CCyy versions
          • (?) Log / document which combinations are supported natively, and which are cross-compiled
          • see also https://gitlab.com/eessi/support/-/issues/142
            • we should have a strategy in case something doesn't work
            • some strategy is emerging, for example around the CUDA-Samples issue with CC70
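A minimal sketch of how the supported compute capability (CC) and CPU target combinations could be recorded in one place, including per-package exceptions like the CUDA-Samples 12.1/CC70 case above. The notes only confirm CC70/CC80/CC90 and the CUDA-Samples exception; the CPU target names and the rest of the mapping below are illustrative assumptions, not the project's authoritative list.

```python
# Sketch of a formalized CC <-> CPU target mapping with per-package exceptions.
# Only CC70/CC80/CC90 and the CUDA-Samples case come from the notes above;
# everything else in these tables is an illustrative assumption.
SUPPORTED_CUDA_CC = {
    # CPU target (EESSI-style arch subdirectory) -> supported compute capabilities
    'x86_64/intel/icelake':     ['7.0', '8.0', '9.0'],
    'x86_64/intel/cascadelake': ['7.0', '8.0', '9.0'],
    'aarch64/nvidia/grace':     ['9.0'],  # hypothetical entry
}

# per-package exceptions, e.g. CUDA-Samples 12.1 not being built for CC 7.0
EXCLUDED_BUILDS = {
    ('CUDA-Samples', '12.1'): ['7.0'],
}

def is_supported(cpu_target: str, cc: str, package: tuple[str, str] | None = None) -> bool:
    """Check whether a CPU target / compute capability (/ package) combination is supported."""
    if cc not in SUPPORTED_CUDA_CC.get(cpu_target, []):
        return False
    if package is not None and cc in EXCLUDED_BUILDS.get(package, []):
        return False
    return True

print(is_supported('x86_64/intel/icelake', '7.0'))                            # True
print(is_supported('x86_64/intel/icelake', '7.0', ('CUDA-Samples', '12.1')))  # False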
        • some efforts to automate building a stack for a new CPU microarchitecture
          • see work on building stack for Icelake and Cascadelake
          • question is how to ensure that new stacks are built as closely as possible to how the existing stacks were built (see the comparison sketch after this list)
          • when building a new version of EESSI we want to make such efforts easier
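One way to sanity-check that a newly built stack matches the existing ones is to diff the sets of installed module files across CPU targets. The sketch below follows EESSI's documented CVMFS directory layout; the stack version and the two CPU targets are just examples.

```python
# Compare the module files available for two CPU targets in the EESSI tree,
# to spot gaps when bootstrapping a stack for a new microarchitecture.
# Directory layout follows EESSI's documentation; targets/version are examples.
import os

EESSI_PREFIX = '/cvmfs/software.eessi.io/versions/2023.06/software/linux'

def modules_for(cpu_target: str) -> set[str]:
    """Collect relative paths of module files installed for one CPU target."""
    modules_dir = os.path.join(EESSI_PREFIX, cpu_target, 'modules', 'all')
    found = set()
    for root, _dirs, files in os.walk(modules_dir):
        for fname in files:
            if fname.endswith('.lua'):
                found.add(os.path.relpath(os.path.join(root, fname), modules_dir))
    return found

reference = modules_for('x86_64/amd/zen2')       # an existing stack
candidate = modules_for('x86_64/intel/icelake')  # the newly built stack
for module in sorted(reference - candidate):
    print(f'missing from new stack: {module}')
```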
    • [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
      • [UGent] T5.1 Support portal - due M12
      • [SURF] T5.2 Monitoring/testing (starts M9)
        • finalising dashboard work and working on deliverable 5.3
        • working on the "identity matrix" (a small sketch follows this list)
          • shows systems on X, packages on Y
          • quick overview of which modules work on which system
          • idea is to integrate this into EESSI documentation
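A toy version of the overview described above: systems on one axis, modules on the other, with a pass/fail cell per combination. All names and results below are placeholders; real data would come from the monitoring/test results rather than a hard-coded dict.

```python
# Toy version of the modules-vs-systems overview matrix; all data is fake.
results = {
    ('GROMACS/2024.1', 'cluster-a'): True,    # placeholder results
    ('GROMACS/2024.1', 'cluster-b'): False,
    ('LAMMPS/2Aug2023', 'cluster-a'): True,
}

systems = sorted({system for (_module, system) in results})
modules = sorted({module for (module, _system) in results})

# print one row per module, one column per system
print('module'.ljust(20) + ''.join(system.ljust(12) for system in systems))
for module in modules:
    row = ''.join(
        ('OK' if results.get((module, system)) else '--').ljust(12)
        for system in systems
    )
    print(module.ljust(20) + row)
```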
      • [UiB] T5.3 community contributions (bot) - due M12
      • [UGent] T5.4 support/maintenance (starts M13)
        • New rotation proposal for July+August has been sent out
          • some issue in July
          • Lara follows up on this
        • continuing work on improvements for the bot
          • reducing chattiness of the bot
          • bundling staging PRs (very important for reducing load when building full matrix of CPU+GPU combinations)
          • reversing filter match logic for bot build commands
          • idea to add a `bot: deploy` command
    • [UB] WP6 Community outreach, education, and training
      • webinar series on EESSI in May-June 2025
      • EESSI-related talk @ HPCKP'25?
        • could fit into industry training => Susana will talk to Alan about this
          • not yet decided
      • VSC (Austria) is considering switching to EESSI
        • help them, open a support issue, eventually a blog post? => Progress?
          • Kenneth had a chat with them during EuroHPC Summit, needs to reply to their email for follow-up
        • for next sync: update on progress and decide about promoting this (blog or other)
      • Code-of-the-month: April 30
        • ESPResSo
      • CECAM workshop, April 8-11
        • Modelling
    • [HPCNow] WP7 Dissemination, Exploitation & Communication

Other topics

  • Thomas: blog post about the Grace stack?

Notes of previous meetings

see https://github.com/multixscale/meetings/wiki