Investigating NVIDIA NGC and LXD containerisation
If you look back through the NGC releases, PyTorch 1.6.0 was released on 28th July 2020 and NGC release 20.07-py3 came out on the 29th July, so it's fair to assume that's the NGC PyTorch container corresponding to PyTorch 1.6.0.
The URL for this is:
nvcr.io/nvidia/pytorch:20.07-py3
Simos says you can put docker in an LXD container like so:
lxc launch ubuntu:x dockerbox -c security.nesting=true
- This image is Ubuntu 16.04 LTS (ubuntu:x = xenial)
- [I presume he then omits the step of getting a shell in the container, lxc shell dockerbox]
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce
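Before snapshotting, a quick sanity check inside the container is worthwhile (these are standard Docker/systemd commands, not from Simos's post):

```bash
# inside the dockerbox container
sudo docker --version          # confirm the CLI installed
sudo systemctl status docker   # confirm the daemon is running
```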
At this point you might want to open another terminal to save the progress so far with a snapshot
(note: you can't do this from within the dockerbox instance, obviously)
lxc snapshot dockerbox docker-ready
lxc publish dockerbox/docker-ready --alias ubudocker
- N.B. publish is a 'local' action: it doesn't go online, it just adds to your local images (shown by lxc image list)
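Having published it, you can later launch a fresh container straight from that local image (standard LXD usage, not part of Simos's post):

```bash
# start a new container from the locally-published image
lxc launch ubudocker dockerbox2
```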
However, from what I can see, this will default to the vfs storage driver, Docker's name for directory-based storage (the LXD equivalent is dir, and it's very slow!). Instead we want to use the ZFS storage driver, just like the host's LXD storage pool is using.
This is mentioned at the end of Simos's blog post:
An issue here is that the Docker storage driver is vfs, instead of aufs or overlay2. The Docker package in the Ubuntu repositories (name: docker.io) has modifications to make it work better with LXD. There are a few questions here on how to get a different storage driver and a few things are still unclear to me.
So we also want to add:
apt install docker.io
(whose package metadata lists)
Recommends: ca-certificates, cgroupfs-mount | cgroup-lite, git, pigz, ubuntu-fan, xz-utils, apparmor
Suggests: aufs-tools, btrfs-progs, debootstrap, docker-doc, rinse, zfs-fuse | zfsutils
Stop docker
service docker stop
Then I was expecting to just create (or edit) the JSON file at /etc/docker/daemon.json to be:
{
"storage-driver": "zfs"
}
And restart (service docker start) - however this didn't work: the engine wouldn't start.
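When the daemon refuses to start like this, its logs usually say why; a generic way to look (nothing specific to this setup):

```bash
# inside the container: see why the docker daemon failed to start
journalctl -u docker --no-pager | tail -n 50
```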
I came across this, which links to this issue on LXD, which culminates in user mlow saying:
I was able to get the zfs backend working inside a zfs-backed LXD container without any obvious ill-effects...
The LXD container must be privileged, then simply having the device:
dev-zfs:
  mode: "0666"
  path: /dev/zfs
  type: unix-char
Will get things going - docker creates its new datasets relative to the LXD container's root dataset:
...However, I think it's better to create a dedicated dataset for docker. It needs to be mounted somewhere on the host (I'm using /data/docker for this example), and then passed through to the LXD container. Then LXD's ZFS tree doesn't get messed up by docker:
(This is an example YAML preseed file)
var-lib-docker:
  path: /var/lib/docker
  source: /data/docker
  type: disk
This would then let you use it with lxd init --preseed for a brand-new LXD setup, but since I've already set up LXD I'd risk overwriting my config by doing so.
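For reference, a preseed like that would be applied to a brand-new host roughly as follows (assuming it's saved as preseed.yaml; as above, do not run this against an already-initialised LXD):

```bash
# DANGER: only on a fresh LXD install - the preseed replaces existing config
lxd init --preseed < preseed.yaml
```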
Stephane Graber then chimes in
The latter is very much preferred especially from a security standpoint :)
You can do it all through LXD by creating a second storage pool using any of the supported backends except for ZFS.
lxc storage create docker dir
lxc storage volume create docker my-container
lxc config device add my-container docker disk pool=docker source=my-container path=/var/lib/docker
Note that he uses dir (equivalent to vfs) type storage, which you don't want, but he says it has to be "any supported backend except for ZFS", e.g. BTRFS would work (here dockerbox4 is a container I've freshly created, which is now running):
lxc storage create docker btrfs
lxc storage volume create docker dockerbox4
lxc config device add dockerbox4 docker disk pool=docker source=dockerbox4 path=/var/lib/docker
⇣
Storage pool docker created
Storage volume dockerbox4 created
Device docker added to dockerbox4
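Before touching Docker's config you can double-check the attachment with standard LXD commands:

```bash
lxc storage list                   # the new 'docker' pool, listed as btrfs
lxc storage volume list docker     # containing the 'dockerbox4' volume
lxc config device show dockerbox4  # attached at /var/lib/docker
```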
Then you again run service docker stop and create (or edit) the JSON file at /etc/docker/daemon.json to be:
{
"storage-driver": "btrfs"
}
and this time service docker start works!
docker info now shows that Docker inside the LXD container is using the btrfs storage driver.
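For example, the relevant line (give or take formatting):

```bash
sudo docker info | grep -i "storage driver"
# Storage Driver: btrfs
```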
Again you might want to save the progress so far with a snapshot
lxc snapshot dockerbox4 docker-btrfs
lxc publish dockerbox4/docker-btrfs --alias ubudockerbtrfs
Then the rest of Simos's example was:
sudo docker run hello-world
sudo docker run -it ubuntu bash
so this is where you want to run or pull the NGC image.
The guide for NGC CLI client setup (via) says to run:
apt install unzip
wget -O ngccli_cat_linux.zip https://ngc.nvidia.com/downloads/ngccli_cat_linux.zip && unzip -o ngccli_cat_linux.zip && chmod u+x ngc
md5sum -c ngc.md5
echo "export PATH=\"\$PATH:$(pwd)\"" >> ~/.bash_profile && source ~/.bash_profile
ngc config set
The defaults are:
Enter API key [no-apikey]. Choices: [<VALID_APIKEY>, 'no-apikey']:
Enter CLI output format type [ascii]. Choices: [ascii, csv, json]:
Successfully saved NGC configuration to /root/.ngc/config
I recommend you take the steps to register for NGC and get an API key (I may be imagining it, but I got the impression while staring at iftop that the pull was faster when signed in).
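Separately from ngc config set, you can also authenticate Docker itself against the NGC registry with the same API key (the documented NGC pattern: the username is literally $oauthtoken and the password is the key):

```bash
# log docker in to nvcr.io using the NGC API key
sudo docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key>
```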
Again you might want to save the progress so far with a snapshot
lxc snapshot dockerbox4 docker-ngc
lxc publish dockerbox4/docker-ngc --alias ubu16ngc
You should now get a result from ngc --version like "NGC Catalog CLI 1.26.0", and be able to run:
ngc registry image pull nvcr.io/nvidia/pytorch:20.07-py3
- (It's clear from the output that this is just a wrapped docker pull)
Without going through the extra steps to use a proper storage driver, the download speed for this pull was rate-limited by the file write speed, and then once it completed there was an error (and no image saved) along the lines of 'out of memory' (even though sufficient memory had, in theory, been allocated). Don't use directory storage drivers for multi-GB images!
In this case, the PyTorch image is almost 6 GB, and the download speed is an order of magnitude higher with BTRFS. Phew!
In this case I expect the container image to correspond to PyTorch 1.6.0, and can also pull it via docker:
sudo docker pull nvcr.io/nvidia/pytorch:20.07-py3
sudo docker info
You could then re-snapshot this as ubu_pytorch_1_6 to save the image (bearing in mind docker pull takes quite a while), but note that it'll be quite large...
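Following the same snapshot/publish pattern as before, that would look something like this (the snapshot name is my suggestion, not from the original notes):

```bash
lxc snapshot dockerbox4 pytorch-1-6
lxc publish dockerbox4/pytorch-1-6 --alias ubu_pytorch_1_6
```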
At the end of the process, ngc prints out:
Status: Downloaded newer image for nvcr.io/nvidia/pytorch:20.07-py3
root@dockerbox4:~# docker image list
REPOSITORY TAG IMAGE ID CREATED SIZE
nvcr.io/nvidia/pytorch 20.07-py3 08c218778a87 6 months ago 11.9GB
However this image has some requirements, namely nvidia-smi (i.e. the NVIDIA driver) and CUDA:
root@dockerbox4:~# docker inspect nvcr.io/nvidia/pytorch:20.07-py3
⇣
[
{
"Id": "sha256:08c218778a87fbd61a76ac47943ed76a4fda2b4e63d34adf9b771ae518d988d8",
"RepoTags": [
"nvcr.io/nvidia/pytorch:20.07-py3"
],
...
"Config": {
"Hostname": "",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"ExposedPorts": {
"6006/tcp": {},
"8888/tcp": {}
},
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"PATH=/opt/conda/bin:/opt/cmake-3.14.6-Linux-x86_64/bin/:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin",
"CUDA_VERSION=11.0.194",
"CUDA_DRIVER_VERSION=450.51.05",
"CUDA_CACHE_DISABLE=1",
"_CUDA_COMPAT_PATH=/usr/local/cuda/compat",
"ENV=/etc/shinit_v2",
"BASH_ENV=/etc/bash.bashrc",
"NVIDIA_REQUIRE_CUDA=cuda>=9.0",
"NCCL_VERSION=2.7.6",
"CUBLAS_VERSION=11.1.0.229",
"CUFFT_VERSION=10.2.0.218",
"CURAND_VERSION=10.2.1.218",
"CUSPARSE_VERSION=11.1.0.218",
"CUSOLVER_VERSION=10.5.0.218",
"NPP_VERSION=11.1.0.218",
"NVJPEG_VERSION=11.1.0.218",
"CUDNN_VERSION=8.0.1.13",
"TRT_VERSION=7.1.3.4",
"NSIGHT_SYSTEMS_VERSION=2020.3.2.6",
"NSIGHT_COMPUTE_VERSION=2020.1.1.8",
"DALI_VERSION=0.23.0",
"DALI_BUILD=1396141",
"LD_LIBRARY_PATH=/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
"NVIDIA_VISIBLE_DEVICES=all",
"NVIDIA_DRIVER_CAPABILITIES=compute,utility,video",
"MOFED_VERSION=4.6-1.0.1",
"IBV_DRIVERS=/usr/lib/libibverbs/libmlx5",
"OPENUCX_VERSION=1.6.1",
"OPENMPI_VERSION=3.1.6",
"LIBRARY_PATH=/usr/local/cuda/lib64/stubs:",
"PYTORCH_BUILD_VERSION=1.6.0a0+9907a3e",
"PYTORCH_VERSION=1.6.0a0+9907a3e",
"PYTORCH_BUILD_NUMBER=0",
"NVIDIA_PYTORCH_VERSION=20.07",
"NVM_DIR=/usr/local/nvm",
"JUPYTER_PORT=8888",
"TENSORBOARD_PORT=6006",
"TORCH_CUDA_ARCH_LIST=5.2 6.0 6.1 7.0 7.5 8.0+PTX",
"COCOAPI_VERSION=2.0+nv0.4.0",
"PYTHONIOENCODING=utf-8",
"LC_ALL=C.UTF-8",
"NVIDIA_BUILD_ID=14714849.1"
],
...
"Entrypoint": [
"/usr/local/bin/nvidia_entrypoint.sh"
],
"OnBuild": null,
"Labels": {
"com.nvidia.build.id": "14714849.1",
"com.nvidia.build.ref": "73eb5774179a9587153717e6cda1a5136b4fd436",
"com.nvidia.cublas.version": "11.1.0.229",
"com.nvidia.cuda.version": "9.0",
"com.nvidia.cudnn.version": "8.0.1.13",
"com.nvidia.cufft.version": "10.2.0.218",
"com.nvidia.curand.version": "10.2.1.218",
"com.nvidia.cusolver.version": "10.5.0.218",
"com.nvidia.cusparse.version": "11.1.0.218",
"com.nvidia.nccl.version": "2.7.6",
"com.nvidia.npp.version": "11.1.0.218",
"com.nvidia.nsightcompute.version": "2020.1.1.8",
"com.nvidia.nsightsystems.version": "2020.3.2.6",
"com.nvidia.nvjpeg.version": "11.1.0.218",
"com.nvidia.pytorch.version": "1.6.0a0+9907a3e",
"com.nvidia.tensorrt.version": "7.1.3.4",
"com.nvidia.volumes.needed": "nvidia_driver"
}
},
Some things to note in this config are:
- "com.nvidia.pytorch.version": "1.6.0a0+9907a3e" - of course, PyTorch is v1.6.0 (as I'd hoped)
- "com.nvidia.cuda.version": "9.0" - although, given CUDA_VERSION=11.0.194 and NVIDIA_REQUIRE_CUDA=cuda>=9.0 in the Env above, this label looks like the minimum required CUDA version rather than the bundled one
- "com.nvidia.volumes.needed": "nvidia_driver" - means you need the NVIDIA driver to run this image
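Once a suitable driver and the NVIDIA container runtime are in place wherever Docker is running, launching this image would look roughly like the following (a sketch assuming Docker >= 19.03 with nvidia-container-toolkit installed; the ports are the Jupyter and TensorBoard ports from ExposedPorts above):

```bash
# assumes the NVIDIA driver and nvidia-container-toolkit are installed
sudo docker run --gpus all -it --rm \
  -p 8888:8888 -p 6006:6006 \
  nvcr.io/nvidia/pytorch:20.07-py3
```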
The table here shows which driver package to use for which CUDA version, but the older drivers won't support newer hardware, so you should still choose the most recent drivers.
NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2
The Unix driver archive page gives the most recent driver (currently 460.39)
Again in the LXD container:
wget https://uk.download.nvidia.com/XFree86/Linux-x86_64/460.39/NVIDIA-Linux-x86_64-460.39.run
chmod +x NVIDIA-Linux-x86_64-460.39.run
./NVIDIA-Linux-x86_64-460.39.run
This failed, saying there was a conflict with the other GPU already using the driver.