安装docker - lyulyul/shine-cluster GitHub Wiki
https://docs.docker.com/engine/install/ubuntu/
安装最新版就好
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
运行sudo docker run hello-world
验证
考虑设置log容量上限( https://medium.com/%E7%A8%8B%E5%BC%8F%E8%A3%A1%E6%9C%89%E8%9F%B2/docker-container-log-%E6%96%87%E4%BB%B6%E9%99%90%E5%88%B6-1ab8559f1308 ),防止一个log文件占用几百GB:
$ ll -h
total 183G
drwx--x--- 4 root root 4.0K Jun 26 12:23 ./
drwx--x--- 5 root root 4.0K Jun 26 12:10 ../
-rw-r----- 1 root root 183G Jun 26 12:14 0c0301244ef07e438912fad1aa66884ba3a1cc18af7e030d1a7d23009f3-json.log
drwx------ 2 root root 4.0K Jun 22 12:46 checkpoints/
-rw------- 1 root root 2.7K Jun 26 12:23 config.v2.json
-rw-r--r-- 1 root root 1.7K Jun 26 12:23 hostconfig.json
drwx--x--- 2 root root 4.0K Jun 22 12:46 mounts/
准备rootless docker
https://docs.docker.com/engine/security/rootless/
sudo apt-get install -y dbus-user-session
sudo systemctl disable --now docker.service docker.socket
cgroup and GPU
add systemd.unified_cgroup_hierarchy=1
sudo update-grub
运行docker info | grep Cgroup
Cgroup Version应该是2。
测试
从aha登陆,用sit或srun进入计算节点。
第一次运行
loginctl enable-linger $USER
dockerd-rootless-setuptool.sh install
dockerd-rootless-setuptool.sh install
运行成功的话会输出两句export PATH=???
和export DOCKER_HOST=???
的话,记住这两句话,后面要用。
每次运行
loginctl enable-linger $USER
export PATH=???
export DOCKER_HOST=???
Design Decisions
rootless docker Usage有两种模式With systemd (Highly recommended)和Without systemd。
用户依赖slurm进入计算节点,是不经过pam_systemd的。这时候如果直接运行dockerd-rootless-setuptool.sh install
会报错“systemd not detected, dockerd daemon needs to be started manually”。参见https://unix.stackexchange.com/q/587674/77141
有三种选择:
- run without systemd
- login with pam_systemd
- enable user lingering
我们一开始采用的是方法1(https://github.com/gqqnbig/shine-cluster/issues/116#issuecomment-1094205685),每次进入计算节点用户都要重新创建$XDG_RUNTIME_DIR,否则dockerd-rootless.sh会报错。有时甚至重新创建还是报错。
方法2:
pam_systemd.so is known to not play nice with Slurm's usage of cgroups. It is recommended that you disable it or possibly add pam_slurm_adopt.so after pam_systemd.so.
https://gitlab.unige.ch/hpc/slurm/-/tree/aa61233bb271fd09774c614b94edc2b71062f538/contribs/pam_slurm_adopt https://slurm.schedmd.com/pam_slurm_adopt.html
看起来有点难搞
方法3: enable user lingering会令服务器一开机,该用户的用户级systemd自动启动,即使用户还没有登录。这个模式会加重服务器负担。
以后如果问题明显,考虑类似 https://github.com/gqqnbig/shine-cluster/blob/migrate/login/lib/systemd/system/free-memory.timer 的自动清理。