How to install Slurm via Ansible and how to use Slurm - myelintek/deepops GitHub Wiki

1. Install Slurm

Initialize the environment

git clone https://github.com/myelintek/deepops.git
cd deepops/

Generate an SSH key and set up keyless (passwordless) login

ssh-keygen
ssh-copy-id [host]
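
To confirm that keyless login works before running Ansible, a quick check such as the following can be used ([host] is the same placeholder as above):

# Should print the remote hostname without prompting for a password
ssh [host] hostname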

Install prerequisites and create the default configuration files

./scripts/setup.sh

Create the inventory

  • Add the Slurm controller/login host to the slurm-master group
  • Add the Slurm worker/compute hosts to the slurm-node group
  • Example: config/inventory.yaml
all:
  hosts:
    node1:
      ansible_host: [HOST_IP]
      ip: [HOST_IP]
      access_ip: [HOST_IP]
      ansible_user: [USERNAME]
      ansible_sudo_pass: [PASSWORD]
      ansible_ssh_private_key_file: [KEY_FULLNAME] #/home/ubuntu/.ssh/id_rsa

slurm-master:
  hosts:
    node1:
slurm-nfs:
  hosts:
    node1:
slurm-node:
  hosts:
    node1:
slurm-cache:
  children:
    slurm-master:
slurm-nfs-client:
  children:
    slurm-node:
slurm-metric:
  children:
    slurm-master:
slurm-login:
  children:
    slurm-master:
slurm-cluster:
  children:
    slurm-master:
    slurm-node:
    slurm-cache:
    slurm-nfs:
    slurm-metric:
    slurm-login:
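
(Optional) Before installing anything, the inventory can be sanity-checked with an Ansible ad-hoc ping:

# Every host in config/inventory.yaml should reply with "pong"
ansible all -i config/inventory.yaml -m ping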

Install the NVIDIA driver via Ansible

ansible-playbook -i config/inventory.yaml playbooks/nvidia-software/nvidia-driver.yml -e nvidia_driver_package_state="latest"
# NOTICE: the nodes will reboot after the driver is installed
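
Once the nodes come back up, the driver installation can be verified with nvidia-smi, for example via an ad-hoc command (a suggested check, not part of the playbook):

# Each node should report its driver version and GPUs
ansible all -i config/inventory.yaml -a "nvidia-smi"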

(Optional) Upgrade the driver via Ansible

# Edit the DeepOps configuration, e.g.:
nvidia_driver_ubuntu_branch: "510"

# Upgrade driver
ansible-playbook -i config/inventory.yaml playbooks/nvidia-software/nvidia-driver.yml [-l <list-of-nodes>]

Install Slurm

ansible-playbook -l slurm-cluster -e '{"slurm_force_rebuild": true}' -i config/inventory.yaml playbooks/slurm-cluster/slurm.yml

# NOTICE: nodes will reboot after the installation; a re-deploy does not trigger a reboot
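
After the playbook finishes and the nodes have rebooted, the cluster can be checked from the login/controller node (a suggested sanity check; the service names assume the standard slurmctld/slurmd systemd units):

# Compute nodes should be listed in an "idle" state
sinfo

# Daemons should be active
systemctl status slurmctld   # on the slurm-master host
systemctl status slurmd      # on the slurm-node hosts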

Install Slurm addons - Singularity

# To enable the Singularity install option, edit config/group_vars/slurm-cluster.yml
# and change
#   slurm_cluster_install_singularity: false
# to
#   slurm_cluster_install_singularity: true

ansible-playbook -i config/inventory.yaml playbooks/slurm-addons.yml --become
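
To confirm the addon was installed on the compute nodes (a suggested check):

# Should print the installed Singularity version on every compute node
ansible slurm-node -i config/inventory.yaml -a "singularity --version"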

2. How to use Slurm

Save the hpc-benchmarks container image as hpc-benchmarks:21.4-hpl.sif.

cd ~
mkdir nvidia
cd nvidia
sudo singularity pull --docker-login hpc-benchmarks:21.4-hpl.sif docker://nvcr.io/nvidia/hpc-benchmarks:21.4-hpl
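
--docker-login prompts interactively for registry credentials; for nvcr.io the username is $oauthtoken and the password is an NGC API key. A non-interactive alternative (a sketch, assuming an NGC API key is available and sudo is allowed to keep the environment with -E) is to pass the credentials through environment variables:

# <NGC_API_KEY> is a placeholder for your own NGC API key
export SINGULARITY_DOCKER_USERNAME='$oauthtoken'
export SINGULARITY_DOCKER_PASSWORD=<NGC_API_KEY>
sudo -E singularity pull hpc-benchmarks:21.4-hpl.sif docker://nvcr.io/nvidia/hpc-benchmarks:21.4-hpl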

Edit a .dat example that runs on a single GPU (Ps and Qs must both be 1)

HPL-1GPU.dat (placed at /home/ubuntu/nvidia/dat-files/HPL-1GPU.dat)

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
10960        Ns
1            # of NBs
288          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
1            Ps
1            Qs
16.0         threshold
1            # of panel fact
0 1 2        PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2 8          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
0 1 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
3 2          BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1 0          DEPTHs (>=0)
1            SWAP (0=bin-exch,1=long,2=mix)
192          swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
0            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
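
A note on sizing (an illustrative rule of thumb, not from the original page): HPL factors a single double-precision N x N matrix, so it needs roughly 8 * N^2 bytes; the Ns=10960 above therefore uses only about 1 GB and is best treated as a smoke test. A larger Ns for a given memory budget can be estimated as follows:

# Target ~16 GB (illustrative), rounded down to a multiple of NB=288
awk 'BEGIN { nb=288; n=int(sqrt(16e9/8)); print n - n % nb }'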

(Optional) How to run the HPC benchmark from a Singularity shell

/usr/local/bin/singularity shell --nv "/home/ubuntu/nvidia/hpc-benchmarks:21.4-hpl.sif"

cd /workspace/hpl-linux-x86_64
./xhpl /home/ubuntu/nvidia/dat-files/HPL-1GPU.dat

Write the script to be run via sbatch

hpc-test.sh

#!/bin/bash
#SBATCH --job-name=hpc-test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1

#/usr/local/bin/singularity run --nv "/home/[USER]/nvidia/hpc-benchmarks:21.4-hpl.sif" /workspace/hpl-linux-x86_64/xhpl /home/[USER]/nvidia/dat-files/HPL-1GPU.dat
/usr/local/bin/singularity run --nv "/home/ubuntu/nvidia/hpc-benchmarks:21.4-hpl.sif" /workspace/hpl-linux-x86_64/xhpl /home/ubuntu/nvidia/dat-files/HPL-1GPU.dat

Run the HPC test via sbatch

sbatch hpc-test.sh
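
After submission, the job can be monitored and its output inspected; by default Slurm writes the job's stdout to slurm-<jobid>.out in the submission directory:

# Check the job's state in the queue
squeue -u $USER

# Read the HPL results once the job completes (<jobid> as reported by sbatch)
cat slurm-<jobid>.out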

3. NOTE

Basic Slurm commands

# Check the current cluster status
sinfo

# Check the current job queue
squeue

# Submit a batch job
sbatch [script.sh]

# Run a job directly
srun --partition=debug --ntasks=4 ./my_program

# Cancel a job
scancel job_id

# Query job accounting records
sacct
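
For a single job, sacct can be narrowed down with a job ID and a custom field list (an illustrative example; the field list is arbitrary):

sacct -j <job_id> --format=JobID,JobName,Partition,State,Elapsed,ExitCode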

How to get inside the hpc-benchmarks container

docker image pull nvcr.io/nvidia/hpc-benchmarks:25.02
docker run -i -t nvcr.io/nvidia/hpc-benchmarks:25.02 /bin/bash

# Sample .dat file paths inside the container
# ./hpcg-linux-x86_64/sample-dat
# ./hpl-linux-x86_64/sample-dat
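
To reuse those sample files on the host, they can be copied out of the image with docker cp (a sketch; the container name hpcb and the destination path are arbitrary, and relative container paths resolve against the image's working directory):

docker create --name hpcb nvcr.io/nvidia/hpc-benchmarks:25.02
docker cp hpcb:hpl-linux-x86_64/sample-dat ./hpl-sample-dat
docker rm hpcb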

How to find archived HPC-Benchmarks documentation pages

https://archive.org/

Search for "https://catalog.ngc.nvidia.com/orgs/nvidia/containers/hpc-benchmarks":

https://web.archive.org/web/20220301000000*/https://catalog.ngc.nvidia.com/orgs/nvidia/containers/hpc-benchmarks