Investigating NVIDIA NGC and LXD containerisation
If you look back through the NGC releases, PyTorch 1.6.0 was released on 28th July 2020 and NGC release 20.07-py3 came out on the 29th July, so it's fair to assume that's the NGC PyTorch container corresponding to PyTorch 1.6.0.
The URL for this is:
nvcr.io/nvidia/pytorch:20.07-py3
Simos says you can put docker in an LXD container like so:
lxc launch ubuntu:x dockerbox -c security.nesting=true
- This image is Ubuntu 16.04 LTS (ubuntu:x = xenial)
- [I presume he then omits the step of getting a shell in the container, lxc shell dockerbox]
sudo apt-get update
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install docker-ce
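Before snapshotting, a quick sanity check inside the container is worthwhile (these are standard Docker/systemd commands, not from Simos's post):

```bash
# inside the dockerbox container
sudo docker --version          # confirm the CLI installed
sudo systemctl status docker   # confirm the daemon is running
```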
At this point you might want to open another terminal to save the progress so far with a snapshot
(note: you can't do this from within the dockerbox instance, obviously)
lxc snapshot dockerbox docker-ready
lxc publish dockerbox/docker-ready --alias ubudocker
- N.B. publish is a 'local' action: it doesn't go online, it just adds to your local images (shown by lxc image list)
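Having published it, you can later launch a fresh container straight from that local image (standard LXD usage, not part of Simos's post):

```bash
# start a new container from the locally-published image
lxc launch ubudocker dockerbox2
```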
However, from what I can see, this will default to the vfs storage driver, Docker's name for directory-based storage (the LXD equivalent is dir, and it's very slow!). Instead we want to use the ZFS storage driver, just like the host's LXD storage pool is using.
This is mentioned at the end of Simos's blog post:
An issue here is that the Docker storage driver is vfs, instead of aufs or overlay2. The Docker package in the Ubuntu repositories (name: docker.io) has modifications to make it work better with LXD. There are a few questions here on how to get a different storage driver and a few things are still unclear to me.
So we also want to add:
apt install docker.io
(whose package metadata lists)
Recommends: ca-certificates, cgroupfs-mount | cgroup-lite, git, pigz, ubuntu-fan, xz-utils, apparmor
Suggests: aufs-tools, btrfs-progs, debootstrap, docker-doc, rinse, zfs-fuse | zfsutils
Stop docker
service docker stop
Then I was expecting to just create (or edit) the JSON file at /etc/docker/daemon.json to be:
{
"storage-driver": "zfs"
}
And restart (service docker start) - however this didn't work: the engine wouldn't start.
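When the daemon refuses to start like this, its logs usually say why; a generic way to look (nothing specific to this setup):

```bash
# inside the container: see why the docker daemon failed to start
journalctl -u docker --no-pager | tail -n 50
```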
I came across this, which links to this issue on LXD, which culminates in user mlow saying:
I was able to get the zfs backend working inside a zfs-backed LXD container without any obvious ill-effects...
The LXD container must be privileged, then simply having the device:
dev-zfs:
  mode: "0666"
  path: /dev/zfs
  type: unix-char
Will get things going - docker creates its new datasets relative to the LXD container's root dataset:
...However, I think it's better to create a dedicated dataset for docker. It needs to be mounted somewhere on the host (I'm using /data/docker for this example), and then passed through to the LXD container. Then LXD's ZFS tree doesn't get messed up by docker:
(This is an example YAML preseed file)
var-lib-docker:
  path: /var/lib/docker
  source: /data/docker
  type: disk
This would then let you use it with lxd init --preseed for a brand-new LXD setup, but since I've already set up LXD I'd risk overwriting my config by doing so.
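For reference, a preseed like that would be applied to a brand-new host roughly as follows (assuming it's saved as preseed.yaml; as above, do not run this against an already-initialised LXD):

```bash
# DANGER: only on a fresh LXD install - the preseed replaces existing config
lxd init --preseed < preseed.yaml
```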
Stephane Graber then chimes in
The latter is very much preferred especially from a security standpoint :)
You can do it all through LXD by creating a second storage pool using any of the supported backends except for ZFS.
lxc storage create docker dir
lxc storage volume create docker my-container
lxc config device add my-container docker disk pool=docker source=my-container path=/var/lib/docker
Note that he uses dir (equivalent to vfs) type storage, which you don't want, but he says it has to be "any supported backend except for ZFS", e.g. BTRFS would work (here dockerbox4 is a container I've freshly created, which is now running):
lxc storage create docker btrfs
lxc storage volume create docker dockerbox4
lxc config device add dockerbox4 docker disk pool=docker source=dockerbox4 path=/var/lib/docker
⇣
Storage pool docker created
Storage volume dockerbox4 created
Device docker added to dockerbox4
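Before touching Docker's config you can double-check the attachment with standard LXD commands:

```bash
lxc storage list                   # the new 'docker' pool, listed as btrfs
lxc storage volume list docker     # containing the 'dockerbox4' volume
lxc config device show dockerbox4  # attached at /var/lib/docker
```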
Then you again run service docker stop and create (or edit) the JSON file at /etc/docker/daemon.json to be:
{
"storage-driver": "btrfs"
}
and this time service docker start works!
docker info now shows that Docker inside the LXD container is using the btrfs storage driver.
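For example, the relevant line (give or take formatting):

```bash
sudo docker info | grep -i "storage driver"
# Storage Driver: btrfs
```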
Again you might want to save the progress so far with a snapshot
lxc snapshot dockerbox4 docker-btrfs
lxc publish dockerbox4/docker-btrfs --alias ubudockerbtrfs
Then the rest of Simos's example was:
sudo docker run hello-world
sudo docker run -it ubuntu bash
so this is where you want to run or pull the NGC image.
The guide for NGC CLI client setup (via) says to run:
apt install unzip
wget -O ngccli_cat_linux.zip https://ngc.nvidia.com/downloads/ngccli_cat_linux.zip && unzip -o ngccli_cat_linux.zip && chmod u+x ngc
md5sum -c ngc.md5
echo "export PATH=\"\$PATH:$(pwd)\"" >> ~/.bash_profile && source ~/.bash_profile
ngc config set
The defaults are:
Enter API key [no-apikey]. Choices: [<VALID_APIKEY>, 'no-apikey']:
Enter CLI output format type [ascii]. Choices: [ascii, csv, json]:
Successfully saved NGC configuration to /root/.ngc/config
I recommend you take the steps to register for NGC and get an API key (I may be imagining it, but I got the impression while staring at iftop that the pull was faster when signed in).
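Separately from ngc config set, you can also authenticate Docker itself against the NGC registry with the same API key (the documented NGC pattern: the username is literally $oauthtoken and the password is the key):

```bash
# log docker in to nvcr.io using the NGC API key
sudo docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key>
```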
Again you might want to save the progress so far with a snapshot
lxc snapshot dockerbox4 docker-ngc
lxc publish dockerbox4/docker-ngc --alias ubu16ngc
You should now get a result from ngc --version like "NGC Catalog CLI 1.26.0", and be able to run:
ngc registry image pull nvcr.io/nvidia/pytorch:20.07-py3
- (It's clear from the output that this is just a wrapped docker pull)
Without going through the extra steps to use a proper storage driver, the download speed for this pull was rate-limited by the file write speed, and then once it completed there was an error (and no image saved) along the lines of 'out of memory' (even though sufficient memory had, in theory, been allocated). Don't use directory storage drivers for multi-GB images!
In this case, the PyTorch image is almost 6 GB, and the download speed is an order of magnitude higher with BTRFS. Phew!
In this case I expect the container image to correspond to PyTorch 1.6.0, and can also pull it via docker:
sudo docker pull nvcr.io/nvidia/pytorch:20.07-py3
sudo docker info
You could then re-snapshot this as ubu_pytorch_1_6 to save the image (bearing in mind docker pull takes quite a while), but note that it'll be quite large...
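Following the same snapshot/publish pattern as before, that would look something like this (the snapshot name is my suggestion, not from the original notes):

```bash
lxc snapshot dockerbox4 pytorch-1-6
lxc publish dockerbox4/pytorch-1-6 --alias ubu_pytorch_1_6
```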
At the end of the process, ngc prints out:
Status: Downloaded newer image for nvcr.io/nvidia/pytorch:20.07-py3
root@dockerbox4:~# docker image list
REPOSITORY TAG IMAGE ID CREATED SIZE
nvcr.io/nvidia/pytorch 20.07-py3 08c218778a87 6 months ago 11.9GB
However this image has some requirements, namely nvidia-smi (i.e. the NVIDIA driver) and CUDA:
root@dockerbox4:~# docker inspect nvcr.io/nvidia/pytorch:20.07-py3
⇣
[
{
"Id": "sha256:08c218778a87fbd61a76ac47943ed76a4fda2b4e63d34adf9b771ae518d988d8",
"RepoTags": [
"nvcr.io/nvidia/pytorch:20.07-py3"
],
...
"Config": {
"Hostname": "",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"ExposedPorts": {
"6006/tcp": {},
"8888/tcp": {}
},
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"PATH=/opt/conda/bin:/opt/cmake-3.14.6-Linux-x86_64/bin/:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin",
"CUDA_VERSION=11.0.194",
"CUDA_DRIVER_VERSION=450.51.05",
"CUDA_CACHE_DISABLE=1",
"_CUDA_COMPAT_PATH=/usr/local/cuda/compat",
"ENV=/etc/shinit_v2",
"BASH_ENV=/etc/bash.bashrc",
"NVIDIA_REQUIRE_CUDA=cuda>=9.0",
"NCCL_VERSION=2.7.6",
"CUBLAS_VERSION=11.1.0.229",
"CUFFT_VERSION=10.2.0.218",
"CURAND_VERSION=10.2.1.218",
"CUSPARSE_VERSION=11.1.0.218",
"CUSOLVER_VERSION=10.5.0.218",
"NPP_VERSION=11.1.0.218",
"NVJPEG_VERSION=11.1.0.218",
"CUDNN_VERSION=8.0.1.13",
"TRT_VERSION=7.1.3.4",
"NSIGHT_SYSTEMS_VERSION=2020.3.2.6",
"NSIGHT_COMPUTE_VERSION=2020.1.1.8",
"DALI_VERSION=0.23.0",
"DALI_BUILD=1396141",
"LD_LIBRARY_PATH=/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64",
"NVIDIA_VISIBLE_DEVICES=all",
"NVIDIA_DRIVER_CAPABILITIES=compute,utility,video",
"MOFED_VERSION=4.6-1.0.1",
"IBV_DRIVERS=/usr/lib/libibverbs/libmlx5",
"OPENUCX_VERSION=1.6.1",
"OPENMPI_VERSION=3.1.6",
"LIBRARY_PATH=/usr/local/cuda/lib64/stubs:",
"PYTORCH_BUILD_VERSION=1.6.0a0+9907a3e",
"PYTORCH_VERSION=1.6.0a0+9907a3e",
"PYTORCH_BUILD_NUMBER=0",
"NVIDIA_PYTORCH_VERSION=20.07",
"NVM_DIR=/usr/local/nvm",
"JUPYTER_PORT=8888",
"TENSORBOARD_PORT=6006",
"TORCH_CUDA_ARCH_LIST=5.2 6.0 6.1 7.0 7.5 8.0+PTX",
"COCOAPI_VERSION=2.0+nv0.4.0",
"PYTHONIOENCODING=utf-8",
"LC_ALL=C.UTF-8",
"NVIDIA_BUILD_ID=14714849.1"
],
...
"Entrypoint": [
"/usr/local/bin/nvidia_entrypoint.sh"
],
"OnBuild": null,
"Labels": {
"com.nvidia.build.id": "14714849.1",
"com.nvidia.build.ref": "73eb5774179a9587153717e6cda1a5136b4fd436",
"com.nvidia.cublas.version": "11.1.0.229",
"com.nvidia.cuda.version": "9.0",
"com.nvidia.cudnn.version": "8.0.1.13",
"com.nvidia.cufft.version": "10.2.0.218",
"com.nvidia.curand.version": "10.2.1.218",
"com.nvidia.cusolver.version": "10.5.0.218",
"com.nvidia.cusparse.version": "11.1.0.218",
"com.nvidia.nccl.version": "2.7.6",
"com.nvidia.npp.version": "11.1.0.218",
"com.nvidia.nsightcompute.version": "2020.1.1.8",
"com.nvidia.nsightsystems.version": "2020.3.2.6",
"com.nvidia.nvjpeg.version": "11.1.0.218",
"com.nvidia.pytorch.version": "1.6.0a0+9907a3e",
"com.nvidia.tensorrt.version": "7.1.3.4",
"com.nvidia.volumes.needed": "nvidia_driver"
}
},
Some things to note in this config are:
- "com.nvidia.pytorch.version": "1.6.0a0+9907a3e" - of course, PyTorch is v1.6.0 (as I'd hoped)
- "com.nvidia.cuda.version": "9.0" - although, given CUDA_VERSION=11.0.194 and NVIDIA_REQUIRE_CUDA=cuda>=9.0 in the Env above, this label looks like the minimum required CUDA version rather than the bundled one
- "com.nvidia.volumes.needed": "nvidia_driver" - means you need the NVIDIA driver to run this image
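Once a suitable driver and the NVIDIA container runtime are in place wherever Docker is running, launching this image would look roughly like the following (a sketch assuming Docker >= 19.03 with nvidia-container-toolkit installed; the ports are the Jupyter and TensorBoard ports from ExposedPorts above):

```bash
# assumes the NVIDIA driver and nvidia-container-toolkit are installed
sudo docker run --gpus all -it --rm \
  -p 8888:8888 -p 6006:6006 \
  nvcr.io/nvidia/pytorch:20.07-py3
```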
The table here shows which driver package to use for which CUDA version, but the older drivers won't support newer hardware, so you should still choose the most recent drivers.
NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2
The Unix driver archive page gives the most recent driver (currently 460.39)
Again in the LXD container:
wget https://uk.download.nvidia.com/XFree86/Linux-x86_64/460.39/NVIDIA-Linux-x86_64-460.39.run
chmod +x NVIDIA-Linux-x86_64-460.39.run
./NVIDIA-Linux-x86_64-460.39.run
This failed, saying there was a conflict with the other GPU already using the driver.