EESSI GPU support 2023 11 21 - EESSI/meetings GitHub Wiki
EESSI GPU support sync meeting (2023-11-21)
attending: Kenneth + Caspar + Alan
- https://github.com/EESSI/software-layer/issues/375
- open PRs
  - PR #368
  - PR #381
  - combining these two PRs should be sufficient to get GPU support working, if the GPU driver is recent enough
- combine both PRs into one branch to experiment with
  ```
  mkdir -p /tmp/$USER
  cd /tmp/$USER
  git clone https://github.com/EESSI/software-layer
  cd software-layer
  # fetch Alan's branches
  git remote add alan https://github.com/ocaisa/software-layer
  git fetch alan
  # create 'gpu' branch, and merge PR branches into it
  git checkout -b gpu
  git merge alan/host_injections_cuda
  git merge alan/cuda_install
  ```
- steps:
  - start container in read-write mode + prepared to install CUDA + access GPU
    ```
    ./eessi_container.sh -m shell --access rw --nvidia all
    ```
  - install CUDA 12.1.1 in `/cvmfs/pilot.eessi-hpc.org/host_injections` by running the `install_cuda_host_injections.sh` script:
    ```
    source /cvmfs/pilot.eessi-hpc.org/versions/2023.06/init/bash
    gpu_support/nvidia/install_cuda_host_injections.sh 12.1.1
    ```
  - install CUDA/12.1.1 (runtime only) + CUDA samples in EESSI
    - update `eessi-2023.06-eb-4.8.2-2023a.yml` to use `--from-pr 19189` (unless it has been already)
    - follow the steps at https://www.eessi.io/docs/adding_software/debugging_failed_builds/ and run
      ```
      eb --easystack eessi-2023.06-eb-4.8.2-2023a.yml --robot
      ```
      (note that this will not update the lmodrc file, which is done by `EESSI-pilot-install-software.sh`)
      OR
    - first update `EESSI-pilot-install-software.sh` script to hardcode use of `eessi-2023.06-eb-4.8.2-2023a.yml` easystack file:
      ```
      for easystack_file in eessi-2023.06-eb-4.8.2-2023a.yml; do
      ```
      and then run `./install_software_layer.sh` in container
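The "unless it has been already" check on the easystack file can be scripted. The following is only a minimal sketch: the function name `easystack_has_from_pr` is hypothetical (it is not part of the software-layer repository), and it assumes the PR is recorded in the easystack file as a `from-pr: <number>` line.

```shell
# Hypothetical helper (not part of the software-layer repository):
# succeed if the easystack file already mentions the given EasyBuild PR.
#   $1: path to the easystack file
#   $2: PR number
# Assumption: the PR is specified on a line of the form "from-pr: <number>".
easystack_has_from_pr() {
    local easystack_file="$1" pr="$2"
    # require a non-digit (or end of line) after the number,
    # so that PR 1918 does not match "from-pr: 19189"
    grep -Eq "from-pr:[[:space:]]*${pr}([^0-9]|\$)" "$easystack_file"
}

# Example use (file name and PR number taken from the notes above):
#   easystack_has_from_pr eessi-2023.06-eb-4.8.2-2023a.yml 19189 \
#       || echo "easystack file still needs from-pr 19189"
```

The helper only reports whether the PR is mentioned somewhere in the file; it does not check which easyconfig entry the option is attached to.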
- TODO
  - script to create the file required by the Lmod hook (cfr. lmodrc file) is still missing, so this needs to be done manually to make it work (or tweak the module to bypass the check)
    - should be separate PR to add scripts in `software-layer/scripts/`
    - bot/build.sh can be updated to also deploy scripts in EESSI repo
    - this script should create symlinks for all libraries shipped with GPU driver, based on:
      ```
      ldconfig -p | awk '{print $1 " " $NF}' > libs.txt
      curl -O https://raw.githubusercontent.com/apptainer/apptainer/main/etc/nvliblist.conf
      grep '\.so$' nvliblist.conf | xargs -I {} grep {} libs.txt
      ```
  - placeholder page in docs that we can point to from Lmod load hook: https://eessi.io/docs/gpu
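One possible shape for the symlink script from the TODO list is sketched below. It is not the script that will end up in `software-layer/scripts/`: the function name `link_driver_libs` and the choice of target directory are assumptions, and it matches library names by prefix rather than grepping whole lines as the pipeline in the notes does.

```shell
# Hypothetical helper: symlink GPU driver libraries into a target directory.
#   $1: file with "<library name> <full path>" pairs, one per line
#       (as produced by the ldconfig/awk pipeline in the notes)
#   $2: list of library names to look for (e.g. Apptainer's nvliblist.conf)
#   $3: directory in which to create the symlinks (location is an assumption)
link_driver_libs() {
    local libs_file="$1" liblist_file="$2" target_dir="$3"
    local pattern name path
    mkdir -p "$target_dir" || return 1
    # only consider entries ending in .so; this skips comments and binaries
    grep '\.so$' "$liblist_file" | while read -r pattern; do
        while read -r name path; do
            case "$name" in
                # prefix match, so libcuda.so also covers libcuda.so.1
                "$pattern"*) ln -sf "$path" "$target_dir/$name" ;;
            esac
        done < "$libs_file"
    done
}

# Example use, after creating libs.txt and downloading nvliblist.conf:
#   link_driver_libs libs.txt nvliblist.conf /path/to/symlink/dir
```

If the same library name is listed more than once in `libs.txt` (for example 32-bit and 64-bit variants), the last entry wins, so a real script would likely need to filter on architecture first.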
Caspar's replication steps:
```
# Using /tmp results in "WARNING: 'nodev' mount option set on /tmp, it could be a source of failure during build process"
cd /scratch-shared/casparl
git clone https://github.com/EESSI/software-layer
cd software-layer
git remote add alan https://github.com/ocaisa/software-layer
git fetch alan
# Creating a fresh branch from the main branch now gives a ton of conflicts.
# It's easier to start from this branch, then merge cuda_install into it
git checkout -b gpu --track alan/host_injections_cuda
git merge alan/cuda_install
# Make sure we don't pick up on EasyBuild from the host later on
module purge
# See if pointing SINGULARITY_TMPDIR and -g away from /tmp resolves the "/cvmfs/.../ is a read only file system" issue
SINGULARITY_TMPDIR=/scratch-shared/casparl/singularity.tmpdir ./eessi_container.sh -m shell --access rw --nvidia all -g /scratch-shared/casparl/
```
- Follow steps at https://www.eessi.io/docs/adding_software/debugging_failed_builds/ to start prefix and source EESSI environment
- `module load EasyBuild/4.8.2`
```
# Install CUDA 12.1.1 in /cvmfs/pilot.eessi-hpc.org/host_injections
gpu_support/nvidia/install_cuda_host_injections.sh 12.1.1
export WORKDIR=$(mktemp --directory --tmpdir=/tmp -t eessi-debug.XXXXXXXXXX)
source configure_easybuild
eb CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb --robot --from-pr 19189
```
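The `nodev` warning mentioned in Caspar's notes can be anticipated before picking a working directory. The sketch below is only an illustration: both function names are hypothetical, and `dir_is_nodev` assumes `findmnt` from util-linux is available on the host.

```shell
# Hypothetical helper: succeed if a comma-separated list of mount options
# (e.g. "rw,nosuid,nodev,relatime") contains 'nodev'.
mount_opts_have_nodev() {
    case ",$1," in
        *,nodev,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: succeed if the filesystem that holds directory $1
# is mounted with 'nodev'. Assumes findmnt (util-linux) is installed.
dir_is_nodev() {
    mount_opts_have_nodev "$(findmnt -n -o OPTIONS --target "$1")"
}

# Example use:
#   dir_is_nodev /tmp && echo "/tmp is mounted nodev, use another directory"
```

Wrapping the option list in commas before matching keeps the check from firing on options that merely contain the string `nodev` as part of a longer name.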