Sync meeting 2023-05-25 with CernVM-FS developers on Best Practices for CernVM-FS on HPC tutorial
Best Practices for CernVM-FS in HPC
- online tutorial, focused on (Euro)HPC system administrators
- aiming for Fall 2023 (Sept-Oct-Nov)
- collaboration between MultiXscale/EESSI partners and CernVM-FS developers
- tutorial under cvmfs-contrib + improvements to CernVM-FS docs
- similar approach to introductory tutorial by Kenneth & Bob in 2021, see https://cvmfs-contrib.github.io/cvmfs-tutorial-2021/
- format: tutorial website (+ CVMFS docs) + accompanying slide deck
Initial meeting (2023-05-25)
Attending:
- CernVM-FS team: Laura, Valentin, Jakob
- EESSI/MultiXscale: Alan, Thomas, Kenneth (excused: Bob)
Topics
- [10-15min] What is CernVM-FS (for HPC sysadmins)
- Incl. short project history
- Used at NERSC, CSCS + EuroHPC sites (JSC, Vega in Slovenia)
- Thomas: how about user communities such as Elixir?
- Use cases
- Software distr. (+ aux. data like ML models, geometry files)
- Data distr. (not the focus of this tutorial) => see HFT trading talk + LIGO
- Accessing an existing CernVM-FS repo
- Client configuration: local cache, important parameters, how to update configuration (full remount vs client reload); see sketch below
- autofs vs alternatives (/etc/fstab, cvmfsexec in user space via namespaces, ...)
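Rough sketch of a minimal client setup; the proxy URLs, cache size and the probed repository are placeholders/examples, not agreed values:

```
# minimal /etc/cvmfs/default.local (example values, adjust per site)
sudo tee /etc/cvmfs/default.local > /dev/null << 'EOF'
CVMFS_HTTP_PROXY="http://squid1.example.org:3128|http://squid2.example.org:3128"
CVMFS_QUOTA_LIMIT=40000   # local cache size in MB
EOF

sudo cvmfs_config setup                  # set up autofs-based mounting under /cvmfs
cvmfs_config probe cvmfs-config.cern.ch  # verify that a repository can be mounted
sudo cvmfs_config reload                 # hot-reload config changes (no full remount)

# alternative to autofs: static mount via /etc/fstab, e.g.
# cvmfs-config.cern.ch /cvmfs/cvmfs-config.cern.ch cvmfs defaults,_netdev,nodev 0 0
```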
- Squid proxy: why, how (see sketch below)
- squid per rack (for example)
- auto-discovery: DNS load balancing, WPAD standard
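Sketch of a bare-bones squid.conf along the lines of what the CernVM-FS docs suggest; the client IP range and cache sizes are placeholders to tune per site:

```
# /etc/squid/squid.conf (example values)
http_port 3128

# only allow clients from the cluster network (placeholder range)
acl cluster_nodes src 10.0.0.0/16
http_access allow cluster_nodes
http_access deny all

# sizing for CVMFS objects
cache_mem 1024 MB
maximum_object_size 1024 MB
maximum_object_size_in_memory 128 KB
cache_dir ufs /var/spool/squid 50000 16 256   # ~50 GB disk cache
```

Clients then list the proxies via CVMFS_HTTP_PROXY, with `|` separating load-balanced proxies and `;` separating fail-over groups.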
- Private Stratum-1 mirror: why, how (see sketch below)
- Reduce latency, protect against network disconnects, control which Stratum-1 servers clients connect to, load balancing towards Stratum-0, etc.
- Public S1
- talk to your network team w.r.t. rate limiting, security features/scanning, etc.
- Rule of thumb for # squids, etc.
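Sketch of setting up a private Stratum-1 replica; the repository name, Stratum-0 URL and key location are placeholders:

```
# on the Stratum-1 host (cvmfs-server + web server installed)
sudo cvmfs_server add-replica -o $USER \
    http://stratum0.example.org/cvmfs/repo.example.org /etc/cvmfs/keys/example.org/
sudo cvmfs_server snapshot repo.example.org   # pull a snapshot from the Stratum-0
# run "cvmfs_server snapshot -a" periodically (e.g. via cron) to keep replicas up to date
```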
- Handling special cases
- Diskless workernodes
- loopback device, see CVMFS docs (one loopback file on the shared FS per node, used as local cache); sketch below
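One possible shape of a per-node loopback cache on a shared filesystem; paths and size are placeholders, run as root:

```
# create a per-node cache file on the shared filesystem (~20 GB example)
dd if=/dev/zero of=/shared/cvmfs-cache/$(hostname).img bs=1M count=20000
mkfs.ext4 -F /shared/cvmfs-cache/$(hostname).img

# mount it where the client keeps its local cache (CVMFS_CACHE_BASE)
mkdir -p /var/lib/cvmfs
mount -o loop /shared/cvmfs-cache/$(hostname).img /var/lib/cvmfs
```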
- Offline workernodes
- Private Stratum-1 and/or squid required
- Alien cache (see config sketch below)
- Prefetching (warm cache)
- Downside: metadata access is file-by-file => increased load on shared FS
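Sketch of an alien cache setup with optional preloading; the cache path and Stratum-1 URL are placeholders, and the exact cvmfs_preload options should be checked against the docs:

```
# /etc/cvmfs/default.local -- use a shared, externally managed ("alien") cache
CVMFS_ALIEN_CACHE=/shared/cvmfs-alien-cache
CVMFS_SHARED_CACHE=no    # alien cache cannot be combined with the shared local cache
CVMFS_QUOTA_LIMIT=-1     # the client does no cleanup/quota management in an alien cache

# optionally warm the alien cache from a Stratum-1
cvmfs_preload -u http://stratum1.example.org/cvmfs/repo.example.org \
              -r /shared/cvmfs-alien-cache
```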
- Security: root requirements by CVMFS, and how to limit them (namespaces, etc.)
- Not recommended to use NFS export: explain why
- Syncing CVMFS repo to another filesystem (like NFS, etc.)
- see shrinkwrap utility (sketch below)
- drawbacks: lots of files, metadata load, etc.
- Present the alternatives: mounting CVMFS proper vs workarounds (with gradually degrading UX w.r.t. CVMFS features)
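Rough shape of a shrinkwrap export; repository name, spec file and target directory are placeholders, and the exact flags should be double-checked against the cvmfs_shrinkwrap documentation:

```
# export the paths listed in the spec file to a plain directory tree (e.g. on NFS)
sudo cvmfs_shrinkwrap -r repo.example.org \
    --src-config repo.example.org.config \
    --spec-file repo.example.org.spec \
    --dest-base /export/cvmfs -f posix -j 8
```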
- In-place update of software in CVMFS repo is discouraged
- Could be problematic for multi-node jobs
- Manually controlled mode for (asynchronous) client catalog updates, for example at job startup (see sketch below)
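One way this could look: pin clients to a fixed repository state and refresh catalogs on demand; the parameter values and repository name are examples, and the exact cvmfs_talk invocation should be verified:

```
# /etc/cvmfs/default.local -- pin clients to a repository state (example values)
CVMFS_REPOSITORY_DATE=2023-05-25T00:00:00Z   # mount the last snapshot before this timestamp
# (alternatively pin an exact root catalog with CVMFS_ROOT_HASH)

# on demand, e.g. from a job prologue: ask the client to look for new catalogs
sudo cvmfs_talk -i repo.example.org remount
```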
- Troubleshooting & debugging
- CVMFS logs + common problems
- CVMFS stats: cache hit rate, usage stats, etc.
- Debugging slow startup of applications
- Monitoring of CVMFS client + squids
- (for CVMFS >= 2.11) expose CVMFS internal stats in InfluxDB format or a custom format
- Manually mounting in debug mode => live logging of what the client does (see sketch below)
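Handy starting points for inspecting a client; the repository name is a placeholder:

```
cvmfs_config stat -v repo.example.org                  # mount status, cache usage, hit rate
sudo cvmfs_talk -i repo.example.org internal affairs   # detailed internal counters
sudo cvmfs_config chksetup                             # sanity-check the client setup

# verbose debug logging: set in /etc/cvmfs/default.local, then remount the repository
# CVMFS_DEBUGLOG=/tmp/cvmfs-debug.log
```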
- Performance
- Startup performance (cold/warm cache): benchmark impact on application performance (HPL, OpenFOAM, etc.)
- depends on job mix & cold/hot local node cache
- cfr. https://github.com/EESSI/filesystem-layer/issues/151
- switching the Stratum-1 to a non-standard port could help (e.g. 8000 instead of the default 80)
- Jitter by CVMFS daemon (FUSE module)
- `cvmfs*` processes are there as long as the mount is there
- nothing special about CVMFS, similar with any FUSE module
- libfuse3 should be preferred over fuse2
- Use of CDN (vs GeoAPI)
- Can be controlled via Python script that determines order of Stratum-1's
- One S1 is contacted to provide ordered list of S1's
- Or just use a different client configuration when using a different CDN (see sketch below)
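Example of pinning the Stratum-1/CDN order in the client configuration; the URLs are placeholders:

```
# /etc/cvmfs/domain.d/example.org.local (or default.local) -- example values
CVMFS_SERVER_URL="http://cdn.example.org/cvmfs/@fqrn@;http://stratum1.example.org/cvmfs/@fqrn@"
CVMFS_USE_GEOAPI=no   # keep the listed order instead of Geo-API based sorting
```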
- Different storage backends for CVMFS
- S3, etc. (see sketch below)
- Also for (private) Stratum-1
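Sketch of an S3-backed repository; host, bucket, credentials and repository name are placeholders:

```
# /etc/cvmfs/s3.conf (example values)
CVMFS_S3_HOST=s3.example.org
CVMFS_S3_BUCKET=cvmfs-repo
CVMFS_S3_ACCESS_KEY=<access key>
CVMFS_S3_SECRET_KEY=<secret key>

# create a repository that stores its objects in S3; -w is the externally visible URL
sudo cvmfs_server mkfs -s /etc/cvmfs/s3.conf \
     -w http://s3.example.org/cvmfs-repo repo.example.org
```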
- Containers
- Accessing CVMFS repos via Apptainer/Singularity
- Easiest via bind mount of host /cvmfs into the container (see example below)
- CVMFS in container (cfr. EESSI client container)
- Ingesting container images in CVMFS repos
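Examples for both directions; image path, tarball and repository name are placeholders:

```
# bind-mount the host's /cvmfs into an Apptainer/Singularity container
apptainer exec --bind /cvmfs /path/to/container.sif ls /cvmfs/cvmfs-config.cern.ch

# ingest an unpacked container image tarball into a repository (on the release manager / Stratum-0)
sudo cvmfs_server ingest --tar_file image.tar --base_dir images/myimage/ repo.example.org
```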
- BACKUP: Getting started with CVMFS from scratch
- Creating a CernVM-FS repo + setting up Stratum-0: how, important aspects, ... (see sketch below)
- Alan: Who's the audience here? I think it is too involved to get into this unless we expect people to create rather than consume repositories. At the very least, I think it should be last rather than first.
- Static configuration vs configuration repo: why, how, ...
- Setting up Stratum-1 mirror server: why, how, ...
- Squid proxy: why, how, ...
- Local cache (node) configuration
- Point to https://cvmfs-contrib.github.io/cvmfs-tutorial-2021 for details
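For reference, the bare-bones flow for creating and publishing to a repository; names and paths are placeholders:

```
# on the Stratum-0 / release manager machine (cvmfs-server + Apache installed)
sudo cvmfs_server mkfs -o $USER repo.example.org   # create a new repository

cvmfs_server transaction repo.example.org          # open a transaction
cp -r /path/to/new/software /cvmfs/repo.example.org/
cvmfs_server publish repo.example.org              # publish the changes as a new revision
```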
Practical
- new repo under cvmfs-contrib (cvmfs-contrib/cvmfs-tutorial-hpc-best-practices)
- set up tutorial structure
- assign sections to people to write them out
- ask for input/feedback from CVMFS devs
- Mattermost, dedicated channel
- GitHub handles: @vvolkl (Valentin), @HereThereBeDragons (Laura), @jblomer (Jakob)
- maybe via GitHub team @fs-team
- input from Dave Dykstra (Fermilab) + Ryan Taylor (The Alliance)
Notes
- Valentin: communication channels to promote (LHC)
- Jakob: how to maintain tutorial contents
- Alan: use CI to ensure that used commands still work
- maybe add some badges, "Succeeds with CVMFS/2.10"
- Jakob: umbrella under which tutorial will be held could matter
- could help to park it under EESSI or MultiXscale umbrella
- Use case is distributing software only?
- next sync meeting
- Fri 7 July 2023, 10:00 CEST