GPU Resource Management on the VDK

Problem statement

GPUs are very expensive to acquire. If you are using Kubeflow to run your ML training, you will need GPUs in your k8s cluster.

We want to ensure the following:

  1. Teams are only using their fair share of the available resources 🤝
  2. Available GPUs don't become fragmented across nodes 💻🧩
  3. Resources should never sit idle if there is actual demand for them 🚫 🏭

Teams are only using their fair share of the available resources 🤝

Problem Scenario 1: The greedy team 🐷🖥️

There are 4 GPU nodes with 10 CUDA devices each on your k8s cluster, for a total of 40 available CUDA devices.

There are 2 teams sharing this cluster. You tell them to use only 20 CUDA devices each.

However, Team A consumes 30 CUDA devices, leaving Team B able to access only 10 of its allocated 20.

This is really frustrating for Team B, and there is no automated or obvious recourse they can take to reclaim the 10 CUDA devices they are missing.

Available GPUs don't become fragmented across nodes 💻🧩

Problem Scenario 2: Fragmentation hell 😵‍💫

There are 4 GPU nodes with 10 CUDA devices each on your k8s cluster, for a total of 40 available CUDA devices.

There are 2 teams sharing this cluster. You tell them to use only 20 CUDA devices each.

Both teams respect this. Team A works with smaller models and Team B works with bigger models.

Team A deploys 6 training jobs, each using 3 CUDA devices, for a total of 18 consumed CUDA devices.

Team B goes to deploy a model which needs 8 CUDA devices. Of course, they expect this to run; officially they have 20 CUDA devices free. However, the job fails to start.

Why does it fail to start?

When we look at the pod layout on the nodes, we see:

| Node name | Pod state | Resource state | Has resources to run Team B's job? |
|-----------|-----------|----------------|------------------------------------|
| Node 1 | Running 2 Team A jobs | 6 CUDA devices in use, 4 free | No |
| Node 2 | Running 2 Team A jobs | 6 CUDA devices in use, 4 free | No |
| Node 3 | Running 1 Team A job | 3 CUDA devices in use, 7 free | No |
| Node 4 | Running 1 Team A job | 3 CUDA devices in use, 7 free | No |

This is really frustrating. Because a pod's GPU request must be satisfied entirely by a single node, no node above can schedule Team B's job. We want the scheduler to pack jobs onto the same nodes where possible so that a single block of 8 GPUs stays free for Team B to consume. The ideal pod layout would look like:

| Node name | Pod state | Resource state | Has resources to run Team B's job? |
|-----------|-----------|----------------|------------------------------------|
| Node 1 | Running 3 Team A jobs | 9 CUDA devices in use, 1 free | No |
| Node 2 | Running 3 Team A jobs | 9 CUDA devices in use, 1 free | No |
| Node 3 | Empty | 0 CUDA devices in use, 10 free | Yes, this node can take Team B's job |
| Node 4 | Empty | 0 CUDA devices in use, 10 free | Yes, this node can take Team B's job |
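
For reference, here is a minimal sketch of how Team B's job would request those devices, assuming the standard `nvidia.com/gpu` extended resource exposed by the NVIDIA device plugin (the pod name and image are hypothetical):

```yaml
# Hypothetical Team B training pod; the name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: team-b-training
spec:
  containers:
    - name: trainer
      image: registry.example.com/team-b/trainer:latest
      resources:
        limits:
          nvidia.com/gpu: 8   # all 8 devices must come from a single node
```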

Resources should never sit idle if there is actual demand for them 🚫 🏭

Problem Scenario 3: Blocked from using available resources

There are 4 GPU nodes with 10 CUDA devices each on your k8s cluster, for a total of 40 available CUDA devices.

There are 2 teams sharing this cluster. You tell them to use only 20 CUDA devices each.

Team A has used up all of its GPUs on training.

Team B is focusing on one model at the moment and often isn't using most of its allocated resources.

It is senseless to leave those GPUs sitting idle. Team A should be able to consume resources that belong to the other team, with the understanding that they could be pushed off them at any time. As long as they checkpoint their training often, this shouldn't be an issue.
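
Plain Kubernetes pod priority and preemption is one mechanism that could support this; the following is only a sketch, not part of the current VDK design. Over-quota "borrowed GPU" jobs could run under a low PriorityClass (the class name and value here are hypothetical), so that in-quota jobs scheduled with a higher priority preempt them whenever they need the devices back:

```yaml
# Hypothetical priority class for jobs borrowing another team's GPUs.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: over-quota-preemptible
value: 100                  # lower than the class used for in-quota jobs
preemptionPolicy: Never     # borrowed-GPU jobs never evict other pods themselves
globalDefault: false
description: "Over-quota training jobs; may be preempted whenever the owning team needs the GPUs."
```

Over-quota pods would then set `priorityClassName: over-quota-preemptible`, and frequent checkpointing keeps preemption cheap.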

Existing solutions 🛠️

RUN:AI (https://www.run.ai/)

Run:AI supports all of these use cases, but it's not open source. We would be an open-source alternative.

Build it yourself 🏗️

Some of this functionality could be built by a k8s admin. For example, they could assign a namespace to each team and create a ResourceQuota per namespace. This addresses problem 1, but it doesn't address fragmentation or allow running over quota.
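
For example, a quota like the following (assuming the standard `nvidia.com/gpu` extended resource and a hypothetical `team-a` namespace) would cap Team A at 20 CUDA devices and prevent the greedy-team scenario:

```yaml
# Hypothetical quota capping the team-a namespace at 20 CUDA devices.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a
spec:
  hard:
    requests.nvidia.com/gpu: "20"   # pods requesting GPUs beyond this total are rejected
```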

This approach will cost your company a lot of money, since it holds onto extra resources because of fragmentation and does not support running over quota.

Why is VDK well positioned?

Current VDK control plane implementation:

  1. Currently it is a system that monitors your Kubernetes cluster. It listens for jobs being created, finishing, etc.
  2. When deploying a job, it dynamically builds the k8s YAML files.

This is powerful functionality to build on:

  1. This allows it to dynamically deal with fragmentation. At deploy time we can look at which nodes have free resources and select the most appropriate node by dynamically adding node affinity to the job right before it is deployed (see the sketch after this list).
  2. The same is true for running over quota. If there are free resources, we can let a team run extra work. When a team which still has unused quota goes to deploy a job and there are no free resources on the cluster, we can scan the cluster and kill a job of a team which is over quota.
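
As an illustration of point 1, the control plane could merge a node affinity term like the following into the generated pod spec. This is only a sketch; `gpu-node-3` stands for whichever node is picked at deploy time based on the observed free GPUs:

```yaml
# Affinity snippet injected into the job's pod spec right before deployment.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - gpu-node-3   # hypothetical node chosen for best packing
```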