ML Training on VDK
The problem 🧩
Users don't know where or how to run their machine learning jobs. This problem is particularly exacerbated when it comes to distributed training jobs. At the moment there are multiple solutions.
The ideal solution would meet the following requirements:
- Easy to run a simple ML job 🚀. When I make changes to my code or dataset, I should be able to start a new training run with one or two commands
- Hugging Face/PyTorch tutorials can be run as is. We shouldn't need to modify the code for it to run on the solution
- The team doesn't need to learn a new product and its terminology, and there should be no vendor lock-in 🚫🤯
- Must support distributed training to handle training larger models 🌐
- Visibility into running jobs 👁️🗨️
- Sensible configuration out of the box 📦✅
- I write the code in IntelliJ/PyCharm like I am used to. I use Git like I am used to. I have many classes, way too many to fit in a notebook. 💻🔧
- Auditable 🔍
- Reproducible 🔄
Existing solutions 🔍
Manually provision GPUs
This is the approach most teams will use to get started. It's messy and manual, but it doesn't require deep knowledge of any extra frameworks or tools, and it is guaranteed to work.
Drawbacks ⚠️
- Messy
- Manual
- Ad hoc
Ray
This is the approach that absolutely massive teams will go with. Ray is extremely efficient, with great support, great docs, etc.
Drawbacks ⚠️
- Ray requires a cluster to be set up on your behalf
- Code needs to be custom-written for Ray
- Has its own domain terminology, which is actually quite complicated because the problems it tackles are complicated
- Ray has a high barrier to entry
My opinion: 'Ray is to machine learning what Spark is to data processing.'
Why do I compare Ray and Spark?
I have a lot of Spark experience. However, whenever I need to mess around with data now, I just use pandas. I use it because it's much easier to get started with. My own machine is so powerful now that it covers 100% of the use cases I run into on side projects.
I also feel this is the case across the board. More and more people are choosing pandas over Spark for batch processing every day. They are choosing it because they don't have to set up a cluster, they don't need to write Spark-specific code, and they don't need to learn Spark concepts.
I think the same will be true for ML engineers. Most people building models will be following tutorials from PyTorch or Hugging Face. They will want to run the code locally for one batch to see that everything works, then slowly iterate on the model. They will want to run the code as is, and won't want to have to change it before they can run it on actual quality hardware, GPUs, etc.
On this page you can see the changes that need to be made to run on Ray: https://docs.ray.io/en/latest/train/getting-started-pytorch.html
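As a rough illustration of the kind of changes that page describes, below is a sketch (not taken from the Ray docs verbatim) of an ordinary PyTorch loop wrapped for Ray Train. The model, data, and scaling numbers are placeholders; the point is the Ray-specific wrapping (`prepare_model`, `prepare_data_loader`, `TorchTrainer`) that a plain tutorial script does not have.

```python
# Sketch: a plain PyTorch loop adapted for Ray Train (model/data are toy placeholders).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # Ordinary PyTorch setup...
    model = nn.Linear(10, 1)
    dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
    loader = DataLoader(dataset, batch_size=8)

    # ...but Ray asks you to wrap the model and dataloader so it can place them
    # on the right device and configure distributed data parallelism.
    model = ray.train.torch.prepare_model(model)
    loader = ray.train.torch.prepare_data_loader(loader)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()


# The loop is no longer run directly; it is handed to a TorchTrainer together
# with Ray-specific scaling configuration, and Ray schedules the workers.
trainer = TorchTrainer(train_func, scaling_config=ScalingConfig(num_workers=2, use_gpu=False))
trainer.fit()
```

None of this is difficult, but it is exactly the kind of framework-specific rewrite the requirements above want to avoid.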
Kubeflow
Kubeflow is a good approach. It is definitely still complicated to set up. On first impression we would say that Kubeflow has a lower barrier to entry:
- It doesn't require us to modify our code to run on the platform. All our existing projects can run as they are.
- We don't need to learn domain specific terminology and concepts to get up and running
However, this is not true. Kubeflow runs on Kubernetes. It has created its own Kubernetes CRDs and its own Kubernetes operators. With Kubeflow we need to learn about these CRDs to declare our job. We also need to build a Docker image and submit both to the cluster. Even for those with k8s experience this will be tedious; for those without k8s experience it will be impossible.
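To make that concrete, here is a minimal sketch of what declaring and submitting a Kubeflow PyTorchJob CRD looks like from Python. This is not VDK code; the image, namespace, and job name are placeholders, and it assumes the Kubeflow training operator and its CRDs are already installed on the cluster.

```python
# Sketch: declaring a PyTorchJob CRD instance and submitting it with the k8s client.
# All names (job, namespace, image) are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()

pytorch_job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "example-training", "namespace": "default"},
    "spec": {
        "pytorchReplicaSpecs": {
            "Master": {
                "replicas": 1,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [
                    {"name": "pytorch", "image": "my-registry/my-training-image:latest"}
                ]}},
            },
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [
                    {"name": "pytorch", "image": "my-registry/my-training-image:latest"}
                ]}},
            },
        }
    },
}

# Submitting the job is a custom-object create against the kubeflow.org API group.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="default",
    plural="pytorchjobs",
    body=pytorch_job,
)
```

The ML engineer also has to build and push `my-registry/my-training-image:latest` themselves, which is where most of the friction lives.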
Drawbacks ⚠️
- The onus is on the ML engineer to package YAML and Docker images for k8s deployment. Very far from single-click deployments
Summary of existing solutions
There isn't one solution that meets all our needs. Kubeflow is a fundamental part of the VMware reference architecture for gen AI. It's also arguably the closest to an ideal solution.
Why is VDK positioned to tackle this problem?
As discussed in the Kubeflow section, Kubeflow has created k8s CRDs and operators that enable it to run distributed ML training. These can be installed separately from the rest of the Kubeflow stack.
Fundamentally, what the VDK control plane does is package users' Python code into a Docker image and k8s YAML and run it. If the control plane were to package code into Kubeflow CRDs instead, VDK would support distributed ML training.
If the customer also has a Kubeflow instance running on their cluster, we can report the job status into that system too. This really provides the customer with the best of both worlds and a great solution.
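For contrast with the Ray example above, here is a sketch of how the same tutorial-style training code could sit inside an ordinary VDK data job step, assuming the existing `run(job_input)` step convention. The step file name and its body are illustrative, not an existing example; the point is that the PyTorch code itself stays untouched and the control plane handles the packaging.

```python
# 10_train_model.py - a hypothetical step inside a regular VDK data job.
# The training code is plain PyTorch; VDK only needs the run() entry point.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # Toy model and data standing in for a PyTorch/Hugging Face tutorial.
    model = nn.Linear(10, 1)
    dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
    loader = DataLoader(dataset, batch_size=8)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    for batch_x, batch_y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()

    # Persist the trained weights; where exactly they go is out of scope here.
    torch.save(model.state_dict(), "model.pt")
```

Whether the control plane wraps this into a plain k8s job or a Kubeflow PyTorchJob would then be a deployment-time decision, not something the ML engineer has to write by hand.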
Risks 🚧
- Despite what the docs say, we aren't able to get the Kubeflow operators running without the rest of Kubeflow. The installation would become too complex if we needed to install Kubeflow as part of VDK; it needs to be optional
- The VDK control plane is set up to run cron jobs. Supporting a different type of job definition could require an enormous amount of work
- It deviates too much from our skill set. Issues could arise that we aren't able to support because of a lack of distributed-training knowledge on the team
Rewards 🏆
- There is real value in this. Being able to run your ML training on your existing k8s infra is a game changer. In previous companies I would certainly have installed VDK if it had had this feature
- It's a unique situation where we can provide massive value from only a small amount of work, because Kubeflow has built the operators that handle all the complexity. It would seem crazy not to try to capitalise on this
- From a VDK marketing perspective it will make our jobs much easier
- Blogs on distributed training will be much more interesting
- Traffic to the project will jump
- Customer interest will jump
- As a by-product, VDK becomes more than just a platform to run ETL tasks on. It becomes a tool to run complex Python/k8s projects on, which is 100 times more powerful. It positions VDK well for the future.
- We could even do things like install the Spark operator on the cluster and then handle Spark jobs in a similar manner
Next steps
- Package a job and run it on Kubeflow
- Run that same job on a k8s installation with only the operators installed
- Have the VDK control plane package a PyTorch job for us into a Docker image.
- Have VDK deploy this Docker image and the custom YAML to the cluster. Make sure everything runs correctly.
- Update the control plane to be able to find this type of job and monitor its health (see the sketch below).
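For the last step, the health of a submitted job could in principle be read back from the same CRD that declared it. A minimal sketch, assuming the job was submitted as a Kubeflow PyTorchJob named `example-training` (a placeholder name) and that the control plane has kubeconfig access to the cluster:

```python
# Sketch: polling a PyTorchJob's status conditions via the k8s API.
# Not existing VDK control-plane code; names are placeholders.
from kubernetes import client, config

config.load_kube_config()

job = client.CustomObjectsApi().get_namespaced_custom_object(
    group="kubeflow.org",
    version="v1",
    namespace="default",
    plural="pytorchjobs",
    name="example-training",
)

# The training operator records conditions such as Created, Running,
# Succeeded and Failed on the job object as it progresses.
for condition in job.get("status", {}).get("conditions", []):
    print(condition["type"], condition["status"], condition.get("message", ""))
```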