PyTorch working group - easybuilders/easybuild GitHub Wiki
EasyBuild PyTorch working group
Goal
How can we streamline the process of supporting new PyTorch versions?
Action points
(based on discussion on 20251009 - by KH)
- [Kenneth] implement support for `--test-step-mode` in framework
  - `skip`, `minimal`, `basic` (default), `full` as possible values?
  - `--test-step-mode=skip` as replacement for `--skip-test-step`?
- [Alexander] make PyTorch easyblock aware of it
  - at least for `basic` mode
    - current PyTorch test step would then correspond to `full`
- [Alex?] define clear policy w.r.t. merging PyTorch easyconfig PRs
- minimal set of test installations that should pass
- break up process to add support for new PyTorch versions
- experimental easyconfig (passing basic test step)
- mature easyconfig (passing full test step)
- report problems in issue, follow up in subsequent PRs (adding more patch files, etc.)
- [Kenneth] implement support for experimental easyconfigs
  - `experimental` easyconfig parameter that can be set to `True`
  - don't include those easyconfig files in EasyBuild release?
  - auto-add `-EXPERIMENTAL` to module name + auto-hide module file
    - for PyTorch: passing `basic` test step is sufficient to merge as experimental easyconfig file
- [Loris? Alex?] evaluate performance benefits of from-source PyTorch installation
- vs container images
- vs `pip install torch` installation
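
A minimal sketch of how the proposed `--test-step-mode` values could be modelled, assuming the four values listed above and treating the existing `--skip-test-step` flag as an alias for `skip`. The names `TestStepMode` and `resolve_test_step_mode` are hypothetical and are not part of the EasyBuild framework.

```python
from enum import Enum
from typing import Optional


class TestStepMode(Enum):
    """Proposed values for --test-step-mode (hypothetical, not an existing EasyBuild API)."""
    SKIP = 'skip'
    MINIMAL = 'minimal'
    BASIC = 'basic'
    FULL = 'full'


# 'basic' is the proposed default
DEFAULT_MODE = TestStepMode.BASIC


def resolve_test_step_mode(mode: Optional[str] = None, skip_test_step: bool = False) -> TestStepMode:
    """Combine the proposed --test-step-mode value with the legacy --skip-test-step flag."""
    if mode is None:
        # no explicit mode: the legacy flag maps to 'skip', otherwise use the default
        return TestStepMode.SKIP if skip_test_step else DEFAULT_MODE
    resolved = TestStepMode(mode)  # raises ValueError for unknown values
    if skip_test_step and resolved is not TestStepMode.SKIP:
        raise ValueError("--skip-test-step conflicts with --test-step-mode=%s" % mode)
    return resolved
```

An easyblock could then branch on the resolved mode, for example running only a reduced set of tests for `basic` and the current full test step for `full`.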
Meetings
20251113
(Thu 13 Nov'25 15:00 CET)
attendees:
- ...
Notes
- ...
20251009 - kickoff meeting
attendees:
- Alex Domingo (Vrije Universiteit Brussel, Belgium)
- Loris Ercole (CECAM)
- Alexander Grund, a.k.a. Flamefire (ZIH, TU Dresden, Germany)
- Kenneth Hoste (HPC-UGent, Belgium)
- Adam Huffman (University of Oxford, UK)
- Emmanuel Kiefer (LuxProvide, Luxembourg)
- Jure Pečar (EMBL, Germany)
- Lara Peeters (HPC-UGent, Belgium)
- Jörg Saßmannshausen (Imperial College London, UK)
Summary of current situation
- PyTorch test suite is a PITA
- it's taking forever...
- can be over 24h on older systems (AMD Rome @ HPC-UGent)
- @ TU Dresden: over 33h on 64-core AMD Rome with 6x H100
- handful of flaky tests (sometimes even hanging tests)
- failing tests shouldn't be ignored, because they can indicate real problems
- can point to problem with wrong dependency version (or to a bug in a dependency being used)
- growing complexity in build
- cfr. Triton (which by itself is a small nightmare)
- depends on specific versions of NCCL & cuDNN
- PyTorch easyblock is quite complex
- easyblock PR #3803 helps a bit by adding CI for `get_test_results` function used by PyTorch easyblock
Impact
- lots of wasted time
- failing tests => not completing installations
- work to try and fix tests
- no recent PyTorch versions supported by EasyBuild
Ideas/questions
- does it really make sense to run full PyTorch test suite to verify an installation?
- something more lightweight?
- => "integration test" scripts found by Alex
- run more reasonable battery of tests by default
- support for running full test suite for those that want to
- only "end-to-end" tests, maybe in combination with small part of PyTorch test suite
- many tests in PyTorch test suite are:
- only testing a niche feature
- flaky/poorly written/make assumptions
- it's pretty easy to run PyTorch test suite for an existing PyTorch installation
- PyTorch test suite consists of a bunch of groups of tests
- we could identify ones that are reasonable to run in default mode
- try to focus on core features
- framework feature to opt-in to running full PyTorch test suite
  - `eb --test-step-mode=intense PyTorch.eb`
  - `--skip-test-step` could (eventually) be replaced with `--test-step-mode=skip`
- list set of tests in separate file like `test-pytorch-2.7-intense.yml`
  - `eb --test-step-input=PyTorch:/tmp/test-pytorch-2.7-intense.yml`
  - `test_step_input = { 'basic': 'basic.yml', 'intense': 'intense.yml', }`
- support for specifying a max. time for test step
eb --test-step-max-time=1h PyTorch.eb
- where would we get reasonably accurate info on this?
- depends on hardware, available resources, PyTorch version, etc.
- collecting timing info for tests would be helpful, so each site can figure out for themselves which excessively long tests to skip
- maybe update PyTorch easyblock to allow:
python -m easybuild.easyblocks.pytorch run-test-suite
- set clear target to include new PyTorch easyconfig in test suite
- at least 2 (common) GPU generations
- jsc-zen3 test bot w/ A100
- H100 somewhere?
- don't block PR when a couple of PyTorch tests fail for some people
- streamline process to get PyTorch easyconfigs merged
- PyTorch is becoming a common dependency, so lots of other easyconfig PRs are being blocked...
- clear policy on what should be achieved before merge would help
- try to get more people up to speed on how to maintain PyTorch easyblock/easyconfigs
- have a way to quickly merge updated PyTorch easyconfig in repo, but not included in EasyBuild release
  - separate `experimental/easyconfigs` folder?
    - only let EasyBuild pick up on it when it's told to be allowed
    - `eb --use-experimental-easyconfigs`
    - also `-EXPERIMENTAL` as `versionsuffix`?
      - add automatically to module name being installed?
      - also make it a hidden module file?
  - `experimental = True` in easyconfig file
    - don't include these in EasyBuild release
    - also auto-add `-EXPERIMENTAL` to install path?
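
The per-site timing idea mentioned above could be sketched as follows. This assumes a site has recorded the duration (in seconds) of each test group in an earlier run, and lists the groups that exceed a chosen limit as candidates to skip. The function name, the data layout and the example timings are all hypothetical.

```python
from typing import Dict, List


def slow_test_groups(durations: Dict[str, float], max_seconds: float) -> List[str]:
    """Return test groups whose recorded duration exceeds max_seconds, slowest first."""
    slow = [name for name, secs in durations.items() if secs > max_seconds]
    return sorted(slow, key=lambda name: durations[name], reverse=True)


# made-up timings, for illustration only
example = {'test_nn': 5400.0, 'test_ops': 12600.0, 'test_autograd': 900.0}
print(slow_test_groups(example, max_seconds=3600))
```

Since run times depend on hardware and PyTorch version, such a list would have to be generated per site rather than shipped with an easyconfig.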
- initially use pre-built wheels for PyTorch
  - from PyPI? from NVIDIA (for CUDA-aware installs)?
  - `PyTorch` wrapper to allow for in-place update from wheel to from-source installation?
    - only really works for pure Python packages that only do `import torch`
    - Alex' experiment with `torchvision`
      - see https://github.com/easybuilders/easybuild/issues/921#issuecomment-3386620931
      - playing with wheels vs from-source installations of PyTorch/torchvision
      - using pre-built wheel of PyTorch with torchvision from-source on top works, but you then can't swap to a from-source PyTorch
        - only affects stuff that links to `libtorch.so`
        - so swapping a PyTorch install from pre-built wheel with a from-source build implies also reinstalling torchvision & co
        - bundle stuff together that links to PyTorch library
          - `PyTorch-bundle-PyPI` (wheel installs) vs `PyTorch-bundle-EasyBuild` (from source builds)
        - how can we identify things that link to PyTorch library?
      - using pre-built wheel didn't show any significant performance degradation
- how well does our from-source installation perform vs prebuilt binary wheels?
- especially on GPU
- depends on which CUDA kernels are being used, which GPU is being used
- "pure PyTorch" scripts that Alex found can be useful here
- can we get input from PyTorch developers?
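
On the question of how to identify things that link to the PyTorch library: one possible approach is to inspect the `NEEDED` entries of each shared object in an installation. The sketch below assumes `readelf` (binutils) is available and only looks at direct dependencies. The helper names are hypothetical, and the `libtorch*` naming pattern is an assumption based on `libtorch.so`.

```python
import re
import subprocess
from typing import List

# matches lines such as: 0x0000000000000001 (NEEDED)  Shared library: [libtorch.so]
NEEDED_RE = re.compile(r'\(NEEDED\)\s+Shared library: \[([^\]]+)\]')
# libtorch.so, libtorch_cpu.so, ... (assumed naming pattern, optional version suffix)
LIBTORCH_RE = re.compile(r'^libtorch(_[a-z0-9_]+)?\.so(\.|$)')


def needed_libraries(readelf_output: str) -> List[str]:
    """Extract the NEEDED library names from the output of 'readelf -d'."""
    return NEEDED_RE.findall(readelf_output)


def is_libtorch(libname: str) -> bool:
    """Check whether a library name looks like one of the libtorch libraries."""
    return LIBTORCH_RE.match(libname) is not None


def links_to_libtorch(path: str) -> bool:
    """Check whether the shared object at 'path' directly depends on a libtorch library."""
    res = subprocess.run(['readelf', '-d', path], capture_output=True, text=True, check=True)
    return any(is_libtorch(lib) for lib in needed_libraries(res.stdout))
```

Running `links_to_libtorch` over all `*.so` files of a package such as torchvision would give a first indication of whether it needs to be rebuilt when the underlying PyTorch installation is swapped.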
Notes
- issue with having to use other libuv @ LuxProvide
- not using internal libuv in tensorpipe
- for specific problems/questions: open issue/discuss in Slack
- is there a way to figure out the versions of dependencies expected by PyTorch?
- yes, sort of, see comments in recent PyTorch PRs like easyconfigs PR #23923
- can we extract versions that were used from pre-built wheels?
- maybe a range of versions?
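
As a starting point for extracting versions from pre-built wheels: a wheel is a zip archive whose `*.dist-info/METADATA` file lists the declared Python dependencies as `Requires-Dist` entries. The sketch below reads those entries. It only covers dependencies declared as Python packages (possibly with pinned versions or version ranges), not libraries bundled inside the wheel. The function name is hypothetical.

```python
import zipfile
from email.parser import Parser
from typing import List


def wheel_requirements(wheel_path: str) -> List[str]:
    """Return the Requires-Dist entries declared in a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as whl:
        # the metadata lives in '<name>-<version>.dist-info/METADATA'
        meta_names = [n for n in whl.namelist() if n.endswith('.dist-info/METADATA')]
        if not meta_names:
            raise ValueError("no METADATA file found in %s" % wheel_path)
        # METADATA uses an email-header-like format, so the stdlib email parser can read it
        metadata = Parser().parsestr(whl.read(meta_names[0]).decode('utf-8'))
    return metadata.get_all('Requires-Dist') or []
```

Each returned entry is a requirement string (package name plus optional version specifier), which could be compared against the dependency versions used in an easyconfig.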
Action points
- (Kenneth/Alex) implement support for `--test-step-mode` in framework
  - `skip`, `minimal`, `basic` (default), `full` as possible values?
  - `--test-step-mode=skip` as replacement for `--skip-test-step`?
- (Alexander) make PyTorch easyblock aware of it
  - at least for `basic` mode
    - current PyTorch test step would then correspond to `full`
- define clear policy w.r.t. merging PyTorch easyconfig PRs
- minimal set of test installations that should pass
- break up process to add support for new PyTorch versions
- experimental easyconfig (passing basic test step)
- mature easyconfig (passing full test step)
- report problems in issue, follow up in subsequent PRs (adding more patch files, etc.)
- implement support for experimental easyconfigs
  - `experimental` easyconfig parameter that can be set to `True`
  - don't include those easyconfig files in EasyBuild release?
  - auto-add `-EXPERIMENTAL` to module name + auto-hide module file
    - for PyTorch: passing `basic` test step is sufficient to merge as experimental easyconfig file
- evaluate performance benefits of from-source PyTorch installation
- vs container images
- vs `pip install torch` installation
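
The "auto-add `-EXPERIMENTAL` to module name + auto-hide module file" action point could look roughly like this. It is a minimal sketch that ignores the toolchain part of the module name and assumes the usual convention that a module whose version starts with a dot is treated as hidden. The function name is hypothetical and this is not existing EasyBuild behaviour.

```python
def experimental_module_name(name: str, version: str, versionsuffix: str = '',
                             experimental: bool = False) -> str:
    """Build a 'name/version' module name; experimental ones get a suffix and are hidden."""
    full_version = version + versionsuffix
    if experimental:
        # leading dot hides the module file, '-EXPERIMENTAL' marks it in the name
        full_version = '.' + full_version + '-EXPERIMENTAL'
    return '%s/%s' % (name, full_version)


print(experimental_module_name('PyTorch', '2.7.0', experimental=True))
```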
Next meeting
- Thu 13 Nov'25 15:00 CET
- OK for Xavier, Alex, Jörg (TBC: Alexander)