Setup K3s in LXD
Steps to perform:
- Bump up the kernel limits on the Host machine:
sudo sysctl -n -w fs.inotify.max_user_instances=1048576
sudo sysctl -n -w fs.inotify.max_queued_events=1048576
sudo sysctl -n -w fs.inotify.max_user_watches=1048576
sudo sysctl -n -w vm.max_map_count=262144
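These sysctl changes do not survive a host reboot. One way to persist them, assuming a host that reads /etc/sysctl.d/ (the file name 90-lxd-k8s.conf below is just a placeholder), is a drop-in file:
# Write the limits to a sysctl.d drop-in so they are re-applied on every boot
cat <<'EOF' | sudo tee /etc/sysctl.d/90-lxd-k8s.conf
fs.inotify.max_user_instances = 1048576
fs.inotify.max_queued_events = 1048576
fs.inotify.max_user_watches = 1048576
vm.max_map_count = 262144
EOF
# Reload all sysctl configuration files
sudo sysctl --system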
- Launch an LXD container and add the following to the container profile: https://www.qblocks.cloud/host/lxc/k3d-profile-example.txt (a rough sketch of the key settings follows below)
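The linked file is the source of truth for the profile. As a rough, illustrative sketch only: running Kubernetes inside LXD typically requires nesting (and often privileged mode) on the container. The profile name, image, and container name below are placeholders, not part of the linked example:
# Illustrative only -- use the settings from k3d-profile-example.txt
lxc profile create k3d
lxc profile set k3d security.nesting true
lxc profile set k3d security.privileged true
lxc launch ubuntu:22.04 k3s-container --profile default --profile k3d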
- Create rc.local inside the container and add the following content:
sudo vim /etc/rc.local
#!/bin/bash
apparmor_parser --replace /var/lib/snapd/apparmor/profiles/snap.microk8s.*
exit 0
sudo chmod +x /etc/rc.local
- Reboot the Container
More details here: https://ubuntu.com/blog/running-kubernetes-inside-lxd
The above steps should make the container ready for K3D support.
Once the host and container have been configured to support Kubernetes inside LXD, k3d can be installed in the container to run Kubernetes.
Install K3D in LXD container:
1. Install kubectl:
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
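To confirm the kubectl binary installed correctly:
# Should print the installed client version
kubectl version --client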
2. Install K3D:
wget -q -O - https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash
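To confirm the k3d install:
# Prints the installed k3d version
k3d version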
Steps to bring up a K3s cluster inside the LXC container
- K3s is a lightweight Kubernetes distribution built for edge deployments
1. Make sure nvidia-smi runs successfully inside the container
2. Install nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
3. Make sure the default runtime for Docker is nvidia. Confirm with the command below and restart Docker; an example daemon.json is sketched after this step
cat /etc/docker/daemon.json
sudo systemctl restart docker
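If nvidia is not already the default runtime, a minimal daemon.json along the lines below should work with the toolkit installed in step 2 (back up any existing /etc/docker/daemon.json before overwriting it):
# Example only: make nvidia the default Docker runtime
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker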
4. Now, run the K3s cluster using the Docker runtime. By default, k3s prefers the containerd runtime, but for the GPU to work we need nvidia as the default runtime, which the steps above configured in Docker
sudo curl -sfL https://get.k3s.io | sh -s - --docker
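Because the cluster was started with --docker, its workload containers are managed by Docker rather than containerd; once the cluster is up, the k3s system pods (coredns, traefik, etc.) should show up as Docker containers:
# k3s-managed pods appear here when the Docker runtime is in use
sudo docker ps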
5. Make sure the k3s cluster is up and running (all pods should reach Running or Completed)
sudo k3s kubectl get pods --all-namespaces
6. Install the NVIDIA device plugin DaemonSet for K3s. This makes the host GPU available to the k3s cluster
sudo k3s kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
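The DaemonSet pod name needed for the log check in the next step can be listed from the kube-system namespace:
# Find the nvidia-device-plugin pod created by the DaemonSet
sudo k3s kubectl get pods -n kube-system | grep nvidia-device-plugin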
7. Check the logs of the nvidia-device-plugin pod to confirm the GPUs are detected
sudo k3s kubectl logs <daemon set pod name> -n kube-system
I0812 05:23:47.267089 1 main.go:154] Starting FS watcher.
I0812 05:23:47.267213 1 main.go:161] Starting OS watcher.
I0812 05:23:47.267548 1 main.go:176] Starting Plugins.
I0812 05:23:47.267563 1 main.go:234] Loading configuration.
I0812 05:23:47.267689 1 main.go:242] Updating config with default resource matching patterns.
I0812 05:23:47.267884 1 main.go:253]
Running with config:
{
"version": "v1",
"flags": {
"migStrategy": "none",
"failOnInitError": false,
"nvidiaDriverRoot": "/",
"gdsEnabled": false,
"mofedEnabled": false,
"plugin": {
"passDeviceSpecs": false,
"deviceListStrategy": [
"envvar"
],
"deviceIDStrategy": "uuid",
"cdiAnnotationPrefix": "cdi.k8s.io/",
"nvidiaCTKPath": "/usr/bin/nvidia-ctk",
"containerDriverRoot": "/driver-root"
}
},
"resources": {
"gpus": [
{
"pattern": "*",
"name": "nvidia.com/gpu"
}
]
},
"sharing": {
"timeSlicing": {}
}
}
I0812 05:23:47.267893 1 main.go:256] Retreiving plugins.
I0812 05:23:47.268313 1 factory.go:107] Detected NVML platform: found NVML library
I0812 05:23:47.268378 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0812 05:23:47.279615 1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0812 05:23:47.280859 1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0812 05:23:47.283115 1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
8. Validate that the GPU is detected by the K3s cluster node
sudo k3s kubectl describe nodes | grep nvidia
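A more targeted check is to print the node's allocatable resources, which should include nvidia.com/gpu with a non-zero count once the device plugin has registered:
# Look for "nvidia.com/gpu" in the output
sudo k3s kubectl get nodes -o jsonpath='{.items[*].status.allocatable}'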
9. If the GPU is recognised and the DaemonSet is not throwing errors, it is time to do a test run and make sure a pod can access the GPU. Make sure to run this container only on a node with a GPU.
- Create a YAML file named gputest.yaml with the following content:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1-ubuntu18.04
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
- Depending on the CUDA version running inside the container, you need to pick the appropriate image from https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/cuda-sample/tags to test
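nvidia-smi reports the driver version and the highest CUDA version it supports; the chosen cuda-sample tag should not require a newer CUDA version than that:
# The header line shows "Driver Version" and "CUDA Version"
nvidia-smi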
- Run the GPU pod:
sudo k3s kubectl apply -f gputest.yaml
sudo k3s kubectl logs gpu-pod
- On a successful run, the logs will look like the output below:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
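Once the test passes, the test pod can be removed:
# Delete the test pod created from gputest.yaml
sudo k3s kubectl delete -f gputest.yaml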
**This confirms the K3s cluster was able to detect the GPU and that pods are able to run code on the GPU inside the Kubernetes cluster.**