docker, cluster, mlops - feliyur/exercises GitHub Wiki
| command | description |
|---|---|
| `docker run hello-world` | Sanity check. |
| `docker ps [-a]` | List containers (`-a` includes stopped ones). |
| `docker image ls` | List images. |
| `docker run -it <image-name>:<tag>` | Run a command prompt within an image. |
| `docker commit <container id> <new image name>` | Save a container's state as a new image; the container id can be taken from `docker ps`. |
| `docker start <container name> && docker exec -it <container name> <command>` | Restart a stopped container and run a command within it. |
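As a sketch of how these commands fit together (the container and image names here are hypothetical, and this requires a running docker daemon):

```shell
# Start an interactive container from an image, giving it a name
docker run -it --name my-ubuntu ubuntu:22.04 bash
# ... install packages inside, then exit ...

# Find the (now exited) container's id/name
docker ps -a

# Save the container's current state as a new image
docker commit my-ubuntu my-ubuntu:configured

# Later: restart the container and get a shell inside it
docker start my-ubuntu && docker exec -it my-ubuntu bash
```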
Running docker without sudo (from https://askubuntu.com/questions/477551/how-can-i-use-docker-without-sudo):

```
# Add the docker group if it doesn't already exist:
sudo groupadd docker

# Add the connected user "$USER" to the docker group. Change the user name to match
# your preferred user if you do not want to use your current user:
sudo gpasswd -a $USER docker
```

Then either run `newgrp docker` or log out and back in to activate the group changes. Run `docker run hello-world` to check that docker works without sudo.
Taken from here: http://blog.fx.lv/2017/08/running-gui-apps-in-docker-containers-using-vnc/
First, make sure the NVIDIA driver is installed and recognizes the GPU (e.g. by running `nvidia-smi`).
```
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
# NOTE: apt-key is deprecated and produces a warning as of Ubuntu 22.04.
# This will need to be updated to use the gpg command instead.
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
```
Install nvidia-docker2:

```
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
```
Run a base image:

```
docker run -it --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```
Image alternatives:
- base: minimal option with the essential CUDA runtime
- runtime: more fully-featured option that includes the CUDA math libraries and NCCL for cross-GPU communication
- devel: everything from runtime, plus headers and development tools for creating custom CUDA images
The image can then be used as the base in a Dockerfile:

```
FROM nvidia/cuda:11.4.0-base-ubuntu20.04
RUN apt-get update
RUN apt-get install -y python3 python3-pip
RUN pip install tensorflow-gpu
COPY tensor-code.py .
ENTRYPOINT ["python3", "tensor-code.py"]
```
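Assuming the Dockerfile above sits next to `tensor-code.py` (the image tag `tensor-app` is hypothetical), building and running it might look like:

```shell
# Build the image from the current directory's Dockerfile
docker build -t tensor-app .

# Run it with GPU access; the ENTRYPOINT launches tensor-code.py
docker run --gpus all tensor-app
```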
If a different base image is needed, CUDA support can be added manually; see the link above or https://stackoverflow.com/questions/25185405/using-gpu-from-a-docker-container/64422438#64422438
Setting up a ClearML server (using docker) and bringing it up: https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_linux_mac

The server starts on http://localhost:8080. Go to the profile page (top-right button, or http://localhost:8080/profile), add credentials, and copy them as input into `clearml-init` (below).

Locally:

```
pip install clearml
clearml-init
```
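Once `clearml-init` has written the local config, a minimal experiment sketch (project and task names here are hypothetical) might look like:

```python
from clearml import Task

# Registers this run with the ClearML server configured by clearml-init
task = Task.init(project_name="examples", task_name="first-experiment")

# Hyperparameters connected this way appear in the web UI
params = task.connect({"lr": 0.01, "epochs": 5})

# Scalars reported here are plotted on the server
logger = task.get_logger()
for epoch in range(params["epochs"]):
    logger.report_scalar("loss", "train", value=1.0 / (epoch + 1), iteration=epoch)
```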
| command | description |
|---|---|
| `bsub` | Submit a job. Can either provide full arguments or a .bsub script file. |
| `bjobs` | List user jobs. `bjobs -l <job id>` displays details about a job. Use `-w` or `-W` for untruncated output. |
| `bkill <job id>` | Kill a job. |
| `battach -L /bin/bash <job id>` | Attach to a running interactive session. |
| `blimits -u <username>` | Check compute resource quotas for a user. |
| `bqueues`, `qstat` | Show available queues and their running / pending job counts. |
| `btop <job id>` | Move a pending job to the top of the (per-user) scheduling order. |
| `bpeek <job id>` | View stdout from a job. `-f` follows the output with tail. |
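A minimal .bsub script for `bsub` might look like the sketch below (queue name, resource values, and file names are hypothetical; check `bqueues` and `blimits` for what applies locally):

```shell
#!/bin/bash
#BSUB -J my-job            # job name
#BSUB -q gpu               # queue name (hypothetical)
#BSUB -n 4                 # number of cores
#BSUB -W 2:00              # wall-clock limit hh:mm
#BSUB -o my-job.%J.out     # stdout file (%J expands to the job id)
#BSUB -e my-job.%J.err     # stderr file

python3 train.py
```

Submit it with `bsub < my-job.bsub` (bsub reads the script from stdin so that the `#BSUB` directives are parsed).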
| command | module | description |
|---|---|---|
| `iquota`, `quota_advisor` | `quota` | |
| `ncdu` | | |
| `mc`, `tmux`, `gcc`, `boost`, `cuda`, `conda` | | |

| command | description |
|---|---|
| `/usr/lpp/mmfs/bin/mmlsquota -j <drive> --block-size G rng-gpu01`<br>`/usr/lpp/mmfs/bin/mmlsattr -L <drive>` | Check quota |