Sync meeting 2023-05-25 with CernVM-FS developers on "Best Practices for CernVM-FS on HPC" tutorial

Best Practices for CernVM-FS in HPC

  • online tutorial, focused on (Euro)HPC system administrators
  • aiming for Fall 2023 (Sept-Oct-Nov)
  • collaboration between MultiXscale/EESSI partners and CernVM-FS developers
  • tutorial under cvmfs-contrib + improvements to CernVM-FS docs
  • similar approach to introductory tutorial by Kenneth & Bob in 2021, see https://cvmfs-contrib.github.io/cvmfs-tutorial-2021/
  • format: tutorial website (+ CVMFS docs) + accompanying slide deck

Initial meeting (2023-05-25)

Attending:

  • CernVM-FS team: Laura, Valentin, Jakob
  • EESSI/MultiXscale: Alan, Thomas, Kenneth (excused: Bob)

Topics

  • [10-15min] What is CernVM-FS (for HPC sysadmins)
    • Incl. short project history
    • Used at NERSC, CSCS + EuroHPC sites (JSC, Vega in Slovenia)
      • Thomas: how about user communities such as Elixir?
    • Use cases
      • Software distribution (+ auxiliary data like ML models, geometry files)
      • Data distribution (not the focus of this tutorial) => see high-frequency trading (HFT) talk + LIGO
  • Accessing an existing CernVM-FS repo
    • Client configuration: local cache, important parameters, how to update configuration (full remount vs client reload); see config sketch below
      • autofs vs alternatives (/etc/fstab, cvmfsexec in userspace via namespaces, ...)
    • Squid proxy: why, how (see squid.conf sketch below)
      • squid per rack (for example)
      • auto-discovery: DNS load balancing, WPAD standard
    • Private Stratum-1 mirror: why, how (see sketch below)
      • Reduce latency, protect against network disconnects, control which Stratum servers clients connect to, reduce load towards the Stratum-0, etc.
    • Public Stratum-1 servers
      • talk to your network team w.r.t. rate limiting, security features/scanning, etc.
    • Rule of thumb for # squids, etc.
  • Handling special cases
    • Diskless worker nodes
      • loopback device cache, see CVMFS docs (1 file on shared FS per node per cache) and the sketch below
    • Offline worker nodes
      • Private Stratum-1 and/or squid required
    • Alien cache (see sketch below)
      • Prefetching (warm cache)
      • Downside: metadata access is file-by-file => increased load on the shared FS
    • Security: root requirements by CVMFS, and how to limit them (namespaces, etc.)
    • Not recommended to use NFS export: explain why
    • Syncing CVMFS repo to another filesystem (like NFS, etc.)
      • see shrinkwrap utility
      • drawbacks: lots of files, metadata load, etc.
    • Present the alternatives: mounting CVMFS proper vs the various workarounds (with gradually degrading UX w.r.t. CVMFS features)
    • In-place update of software in CVMFS repo is discouraged
      • Could be problematic for multi-node jobs
      • Manually controlled mode for (asynchronous) client catalog updates (for example at job startup)
  • Troubleshooting & debugging
    • CVMFS logs + common problems
    • CVMFS stats: cache hit rate, usage stats, etc.
    • Debugging slow startup of applications
    • Monitoring of CVMFS client + squids
      • (for CVMFS >= 2.11) CVMFS can expose its internal stats in InfluxDB or a custom format
    • Manually mounting in debug mode => live logging of what the client does (see troubleshooting sketch below)
  • Performance
    • Startup performance (cold/warm cache): benchmark impact on performance of applications like HPL, OpenFOAM, etc.
    • Jitter by CVMFS daemon (FUSE module)
      • cvmfs* processes are there as long as mount is there
      • nothing special about CVMFS, similar with any FUSE module
    • libfuse3 should be preferred over fuse2
    • Use of CDN (vs GeoAPI); see server ordering sketch below
      • Can be controlled via Python script that determines order of Stratum-1's
      • One S1 is contacted to provide ordered list of S1's
      • Or just use different client configuration when using a different CDN
  • Different storage backends for CVMFS
    • S3, etc.
    • Also for (private) Stratum-1
  • Containers
    • Accessing CVMFS repos via Apptainer/Singularity
      • Easiest via bind mount of host /cvmfs into the container (see sketch below)
      • CVMFS in container (cfr. EESSI client container)
    • Ingesting container images in CVMFS repos
  • BACKUP: Getting started with CVMFS from scratch
    • Creating a CernVM-FS repo + setting up Stratum-0: how, important aspects, ... (see sketch below)
      • Alan: Who's the audience here? I think it is too involved to get into this unless we expect people to create rather than consume repositories. At the very least, I think it should be last rather than first.
    • Static configuration vs configuration repo: why, how, ...
    • Setting up Stratum-1 mirror server: why, how, ...
    • Squid proxy: why, how, ...
    • Local cache (node) configuration
    • Point to https://cvmfs-contrib.github.io/cvmfs-tutorial-2021 for details
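
Illustrative sketches

The sketches below illustrate some of the topics above. Hostnames, paths, and repository names are site-specific placeholders, and details should be double-checked against the CernVM-FS documentation before they end up in the tutorial.

Client configuration (local cache, proxy, applying changes): a minimal sketch of /etc/cvmfs/default.local plus the commands to apply it, assuming a site squid and an example repository name:

```ini
# /etc/cvmfs/default.local (all values are site-specific examples)
CVMFS_HTTP_PROXY="http://squid1.example.org:3128;http://squid2.example.org:3128"
CVMFS_QUOTA_LIMIT=20000               # local cache size limit in MB
CVMFS_CACHE_BASE=/var/lib/cvmfs       # location of the local client cache
CVMFS_REPOSITORIES=repo.example.org   # repositories to mount
```

```bash
sudo cvmfs_config setup    # one-time client setup (autofs integration)
sudo cvmfs_config reload   # apply configuration changes without a full remount
```

As an alternative to autofs, a repository can be mounted statically via an /etc/fstab entry like `repo.example.org /cvmfs/repo.example.org cvmfs defaults,_netdev,nodev 0 0`.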
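
Squid proxy: a minimal squid.conf sketch, restricted to an assumed internal network range (ACLs, cache sizes, and the number of squids need to be tuned per site):

```
# /etc/squid/squid.conf (minimal sketch, site-specific values)
http_port 3128
acl cluster_nodes src 10.0.0.0/8              # assumed internal network range
http_access allow cluster_nodes
http_access deny all
cache_mem 4096 MB                             # in-memory cache
maximum_object_size 1024 MB
cache_dir ufs /var/spool/squid 50000 16 256   # ~50 GB disk cache
```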
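
Private Stratum-1: the core steps are creating the replica with cvmfs_server add-replica and keeping it in sync with periodic cvmfs_server snapshot runs; a sketch assuming an example Stratum-0 URL and key location:

```bash
# create a replica of an upstream repository (example URL and key path)
sudo cvmfs_server add-replica -o $(whoami) \
    http://stratum0.example.org/cvmfs/repo.example.org /etc/cvmfs/keys/example.org/

# keep the replica in sync, e.g. via cron every few minutes
cvmfs_server snapshot -a   # snapshot all replicas hosted on this server
```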
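
Diskless worker nodes: a sketch of a per-node loopback cache file on a shared filesystem (assuming /shared is the shared FS), so that only a single file per node lives on the shared FS:

```bash
# create and mount a per-node loopback cache (example size and paths)
truncate -s 20G /shared/cvmfs-cache/$(hostname).img
mkfs.ext4 -F /shared/cvmfs-cache/$(hostname).img
mkdir -p /var/lib/cvmfs
mount -o loop /shared/cvmfs-cache/$(hostname).img /var/lib/cvmfs   # matches CVMFS_CACHE_BASE
```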
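
Alien cache: a sketch of the relevant client settings, assuming an example shared directory; the alien cache is not managed by the client's cache manager, hence the quota limit of -1:

```ini
# /etc/cvmfs/default.local (alien cache sketch, example path)
CVMFS_ALIEN_CACHE=/shared/cvmfs-alien-cache
CVMFS_SHARED_CACHE=no     # required when using an alien cache
CVMFS_QUOTA_LIMIT=-1      # the client does not clean up an alien cache
```

The alien cache can be pre-populated (warmed) with the cvmfs_preload utility; see the CernVM-FS documentation for its options.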
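
Troubleshooting & debugging: a few client-side commands that this section is likely to cover (repository name is an example):

```bash
cvmfs_config chksetup                    # sanity-check the client setup
cvmfs_config probe repo.example.org      # try to mount and access the repository
cvmfs_config stat -v repo.example.org    # cache usage, hit rate, proxy/server in use
sudo cvmfs_talk -i repo.example.org internal affairs   # detailed internal counters

# live debug logging: set CVMFS_DEBUGLOG=/tmp/cvmfs-debug.log in
# /etc/cvmfs/default.local, then reload the configuration:
sudo cvmfs_config reload
```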
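
Stratum-1 ordering (GeoAPI vs CDN): a sketch of the domain-level client settings that determine which servers are used and whether they are geographically ordered (example hostnames):

```ini
# /etc/cvmfs/domain.d/example.org.local (example hostnames)
CVMFS_SERVER_URL="http://s1-eu.example.org/cvmfs/@fqrn@;http://s1-us.example.org/cvmfs/@fqrn@"
CVMFS_USE_GEOAPI=yes   # let a Stratum-1 order the list by geographic proximity
```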
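
Containers: bind-mounting the host /cvmfs into an Apptainer/Singularity container is the simplest option; a sketch with an example image and repository:

```bash
# make all repositories mounted on the host visible inside the container
apptainer exec --bind /cvmfs:/cvmfs myimage.sif ls /cvmfs/repo.example.org
```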
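
Getting started from scratch (backup topic): creating a repository and publishing content boils down to a handful of cvmfs_server commands; a sketch with an example repository name:

```bash
sudo cvmfs_server mkfs -o $USER repo.example.org   # create repository + Stratum-0
cvmfs_server transaction repo.example.org          # open a transaction
cp -r /path/to/software /cvmfs/repo.example.org/   # stage content
cvmfs_server publish repo.example.org              # publish a new revision
```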

Practical

  • new repo under cvmfs-contrib (cvmfs-contrib/cvmfs-tutorial-hpc-best-practices)
    • set up tutorial structure
    • assign sections to people to write them out
    • ask for input/feedback from CVMFS devs
      • email
      • Mattermost, dedicated channel
      • GitHub handles: @vvolkl (Valentin), @HereThereBeDragons (Laura), @jblomer (Jakob)
        • maybe via GitHub team @fs-team
    • input from Dave Dykstra (Fermilab) + Ryan Taylor (The Alliance)

Notes

  • Valentin: communication channels to promote (LHC)
  • Jakob: how to maintain tutorial contents
  • Alan: use CI to ensure that used commands still work
    • maybe add some badges, "Succeeds with CVMFS/2.10"
  • Jakob: umbrella under which tutorial will be held could matter
    • could help to park it under EESSI or MultiXscale umbrella
  • Use case is distributing software only?
  • next sync meeting
    • Fri 7 July 2023, 10:00 CEST