How to Install Slurm with Ansible and How to Use Slurm - myelintek/deepops GitHub Wiki
1. Install Slurm
Initialize the environment
git clone https://github.com/myelintek/deepops.git
cd deepops/
Generate an SSH key and set up passwordless (keyless) login
ssh-keygen
ssh-copy-id [host]
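If the cluster has several nodes, the key can be copied and verified in a loop; a minimal sketch, assuming hypothetical hostnames node1 and node2 (replace with the hosts in your inventory):
# Copy the public key to each node and confirm passwordless login works
for host in node1 node2; do
  ssh-copy-id "$host"
  ssh "$host" hostname
done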
Install prerequisites and create the default configuration files
./scripts/setup.sh
Create the inventory
- Add the Slurm controller/login host to the slurm-master group
- Add the Slurm worker/compute hosts to the slurm-node group
- Example: config/inventory.yaml
all:
  hosts:
    node1:
      ansible_host: [HOST_IP]
      ip: [HOST_IP]
      access_ip: [HOST_IP]
      ansible_user: [USERNAME]
      ansible_sudo_pass: [PASSWORD]
      ansible_ssh_private_key_file: [KEY_FULLNAME]  # /home/ubuntu/.ssh/id_rsa
slurm-master:
  hosts:
    node1:
slurm-nfs:
  hosts:
    node1:
slurm-node:
  hosts:
    node1:
slurm-cache:
  children:
    slurm-master:
slurm-nfs-client:
  children:
    slurm-node:
slurm-metric:
  children:
    slurm-master:
slurm-login:
  children:
    slurm-master:
slurm-cluster:
  children:
    slurm-master:
    slurm-node:
    slurm-cache:
    slurm-nfs:
    slurm-metric:
    slurm-login:
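Before running any playbook it is worth confirming that Ansible can parse the inventory and reach every host; a quick check, assuming the inventory above:
# List the hosts and groups Ansible resolves from the inventory
ansible-inventory -i config/inventory.yaml --list
# Ping every host (each should answer "pong")
ansible -i config/inventory.yaml all -m ping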
Install the NVIDIA driver with Ansible
ansible-playbook -i config/inventory.yaml playbooks/nvidia-software/nvidia-driver.yml -e nvidia_driver_package_state="latest"
# NOTICE: the nodes will reboot after the driver is installed
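After the nodes come back up, the driver can be checked remotely with an ad-hoc Ansible command; a minimal sketch:
# Run nvidia-smi on every compute node to confirm the driver loaded
ansible -i config/inventory.yaml slurm-node -a "nvidia-smi"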
(Optional) Upgrade the driver with Ansible
# Edit the DeepOps configuration, e.g.:
nvidia_driver_ubuntu_branch: "510"
# Upgrade driver
ansible-playbook playbooks/nvidia-software/nvidia-driver.yml [-l <list-of-nodes>]
Install Slurm
ansible-playbook -l slurm-cluster -e '{"slurm_force_rebuild": true}' -i config/inventory.yaml playbooks/slurm-cluster/slurm.yml
# NOTICE: the nodes will reboot after the initial installation; a re-deploy does not trigger a reboot
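Once the playbook finishes and the nodes are back, a quick sanity check (assuming the Slurm client tools are on the PATH of the login/controller node):
# Controller and node daemons should be active
systemctl status slurmctld   # on the slurm-master host
systemctl status slurmd      # on each slurm-node host
# All nodes should appear and be in an idle state
sinfo
scontrol show nodes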
Install Slurm add-ons - Singularity
# To enable the Singularity installation option, edit config/group_vars/slurm-cluster.yml
# and change
#   slurm_cluster_install_singularity: false
# to
#   slurm_cluster_install_singularity: true
ansible-playbook -i config/inventory.yaml playbooks/slurm-addons.yml --become
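To confirm the add-on was deployed, check the Singularity version on the compute nodes; a sketch using an ad-hoc Ansible command and the same install path used later in this guide:
# Singularity should report a version on every compute node
ansible -i config/inventory.yaml slurm-node -a "/usr/local/bin/singularity --version"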
2. How to Use Slurm
Save the hpc-benchmarks container image as hpc-benchmarks:21.4-hpl.sif.
cd ~
mkdir nvidia
cd nvidia
sudo singularity pull --docker-login hpc-benchmarks:21.4-hpl.sif docker://nvcr.io/nvidia/hpc-benchmarks:21.4-hpl
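The pull prompts for registry (NGC) credentials because of --docker-login; once it finishes, the resulting image can be checked before use:
# Confirm the image file was written and look at its metadata
ls -lh hpc-benchmarks:21.4-hpl.sif
singularity inspect hpc-benchmarks:21.4-hpl.sif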
Edit a .dat example that runs on a single GPU (Ps and Qs must both be 1); a sketch for sizing Ns follows the file below.
HPL-1GPU.dat (placed at /home/ubuntu/nvidia/dat-files/HPL-1GPU.dat)
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
10960 Ns
1 # of NBs
288 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
1 Ps
1 Qs
16.0 threshold
1 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
2 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
3 2 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 0 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
192 swapping threshold
1 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
0 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
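The problem size Ns mainly determines how much GPU memory HPL uses (roughly Ns² × 8 bytes for the double-precision matrix). A hedged sketch for estimating an Ns that fits a given memory budget, rounded down to a multiple of the block size NBs; the 16 GiB and 90% figures are hypothetical values, not requirements:
# Estimate Ns for a GPU with MEM_GIB of memory, using ~90% of it, NB=288 as in the file above
MEM_GIB=16
awk -v mem="$MEM_GIB" -v nb=288 'BEGIN {
  n = sqrt(mem * 0.90 * 1024 * 1024 * 1024 / 8);   # largest N whose matrix fits the budget
  printf "suggested Ns ~= %d\n", int(n / nb) * nb   # round down to a multiple of NB
}'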
(Optional) How to run the HPC benchmark from a Singularity shell
/usr/local/bin/singularity shell --nv "/home/ubuntu/nvidia/hpc-benchmarks:21.4-hpl.sif"
cd /workspace/hpl-linux-x86_64
./xhpl /home/ubuntu/nvidia/dat-files/HPL-1GPU.dat
Write the script to be submitted with sbatch
hpc-test.sh
#!/bin/bash
#SBATCH --job-name=hpc-test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#/usr/local/bin/singularity run --nv "/home/[USER]/nvidia/hpc-benchmarks:21.4-hpl.sif" /workspace/hpl-linux-x86_64/xhpl /home/[USER]/nvidia/dat-files/HPL-1GPU.dat
/usr/local/bin/singularity run --nv "/home/ubuntu/nvidia/hpc-benchmarks:21.4-hpl.sif" /workspace/hpl-linux-x86_64/xhpl /home/ubuntu/nvidia/dat-files/HPL-1GPU.dat
Run the HPC test with sbatch
sbatch hpc-test.sh
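After submitting, the job can be tracked and its output inspected; by default Slurm writes stdout/stderr to slurm-<jobid>.out in the submission directory:
# Watch the job in the queue until it starts running
squeue -u "$USER"
# Follow the job output (replace <jobid> with the id printed by sbatch)
tail -f slurm-<jobid>.out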
3. NOTE
Basic Slurm commands
# Show the current cluster/partition status
sinfo
# Show the current job queue
squeue
# Submit a batch job
sbatch [script.sh]
# Run a job directly
srun --partition=debug --ntasks=4 ./my_program
# Cancel a job
scancel job_id
# Query the job accounting history
sacct
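For GPU jobs it is often convenient to request an interactive shell on a compute node, or to ask sacct for specific columns; two common patterns (the field list is just an example):
# Interactive shell with one GPU allocated
srun --gres=gpu:1 --pty bash
# Accounting summary with selected columns
sacct --format=JobID,JobName,Partition,State,Elapsed,ExitCode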
How to enter the hpc-benchmarks container
docker image pull nvcr.io/nvidia/hpc-benchmarks:25.02
docker run -i -t nvcr.io/nvidia/hpc-benchmarks:25.02 /bin/bash
# Sample .dat file paths inside the container
# ./hpcg-linux-x86_64/sample-dat
# ./hpl-linux-x86_64/sample-dat
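To edit a sample file on the host instead of inside the container, it can be copied out with docker cp; a sketch assuming the samples sit under /workspace in the image (path may differ between image versions):
# Create a stopped container, copy the sample .dat files out, then remove it
id=$(docker create nvcr.io/nvidia/hpc-benchmarks:25.02)
docker cp "$id":/workspace/hpl-linux-x86_64/sample-dat ./hpl-sample-dat
docker rm "$id"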
How to find older HPC-benchmarks documentation
search "https://catalog.ngc.nvidia.com/orgs/nvidia/containers/hpc-benchmarks"