# Machine Learning
I have experimented with machine learning across various platforms, and the right choice largely hinges on your requirements: factors such as the need for AutoML or a feature store come into play. Personally, my primary criterion is access to cost-effective yet powerful GPUs, so the decision comes down to whether to use a managed ML platform like AWS SageMaker or GCP Vertex AI, or to manage the compute myself.
A simple price comparison for a 16GB T4 NVIDIA GPU machine (ml.g4dn.xlarge) on SageMaker ($0.7364/hour) versus the equivalent EC2 instance ($0.526/hour) underscored that a managed service was unnecessary for my needs. SageMaker boasts fancy features like AutoML and data wrangling, but on-demand EC2 instances boot quickly and deliver a perfectly good overall development experience. The only remaining consideration was package management on the bare-metal machine, for which Lambda Stack + Anaconda emerged as the ideal solution. This combination can be installed on any Linux machine and automatically manages all required machine learning and deep learning packages. Lambda Stack has also solved the challenge of accessing GPUs inside a Docker container, making it my exclusive software stack for machine learning.
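After a Lambda Stack install, a quick sanity check confirms that the drivers and GPU are visible. This is a minimal sketch assuming the CUDA-enabled PyTorch build that Lambda Stack ships (any CUDA-enabled PyTorch behaves the same); it works on bare metal or inside a GPU-enabled container:

```python
# Minimal GPU visibility check after a Lambda Stack install.
# Assumes a CUDA-enabled PyTorch, such as the one Lambda Stack bundles.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))  # e.g. "Tesla T4"
    print("CUDA build:", torch.version.cuda)
```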
Additionally, Lambda Labs offers more cost-effective GPU machines than AWS, making it my preferred choice unless I have free AWS credits. I still use EC2 machines occasionally for inference, but only when heavy GPU usage is not required. A 24GB A10 NVIDIA GPU instance (g5.8xlarge) costs $2.448/hour on AWS, whereas a comparable machine on Lambda Labs is priced at only $0.75/hour. Lambda Labs is also one of the few platforms offering H100s and H200s, a sign of their commitment to keeping up with the latest hardware.
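To make the price gaps concrete, here is a back-of-the-envelope sketch using the hourly rates quoted above; the 720-hour month is my own assumption, and real bills depend on actual usage:

```python
# Back-of-the-envelope monthly costs from the hourly rates quoted above.
# The 720-hour month is an assumption, not a billing rule.
HOURS_PER_MONTH = 720

rates = {
    "SageMaker ml.g4dn.xlarge (16GB T4)": 0.7364,
    "EC2 g4dn.xlarge (16GB T4)": 0.526,
    "EC2 g5.8xlarge (24GB A10)": 2.448,
    "Lambda Labs A10 (24GB)": 0.75,
}

for name, hourly in rates.items():
    print(f"{name}: ${hourly}/hr -> ${hourly * HOURS_PER_MONTH:,.2f}/month")

# Relative savings for the two comparisons made in the text.
print(f"EC2 vs SageMaker (T4): {1 - 0.526 / 0.7364:.0%} cheaper")
print(f"Lambda vs EC2 (A10): {1 - 0.75 / 2.448:.0%} cheaper")
```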
In the realm of experiment tracking, MLflow excels at basic tracking and layers several MLOps features on top. Weights & Biases (W&B), however, caters specifically to deep learning experiments and offers more advanced research features. W&B elevates experiment tracking and note-taking, and it reports insightful data on CPU and GPU usage. This is particularly valuable in reinforcement learning, where understanding resource utilization during simulation is crucial. W&B also integrates seamlessly with frameworks like stable-baselines3.
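As an illustration, here is a minimal sketch of tracking a stable-baselines3 run with W&B via its bundled SB3 callback; the project name, environment, and hyperparameters are hypothetical, and it assumes `wandb`, `stable-baselines3`, and Gymnasium are installed with a W&B account configured:

```python
# Sketch: tracking a stable-baselines3 PPO run with Weights & Biases.
# Project name, environment, and hyperparameters are hypothetical.
import wandb
from wandb.integration.sb3 import WandbCallback
from stable_baselines3 import PPO

config = {"policy": "MlpPolicy", "env": "CartPole-v1", "total_timesteps": 25_000}

run = wandb.init(
    project="rl-experiments",  # hypothetical project name
    config=config,
    sync_tensorboard=True,     # forward SB3's TensorBoard logs to W&B
)

model = PPO(
    config["policy"],
    config["env"],
    verbose=1,
    tensorboard_log=f"runs/{run.id}",  # W&B picks these logs up
)
model.learn(
    total_timesteps=config["total_timesteps"],
    callback=WandbCallback(model_save_path=f"models/{run.id}", verbose=2),
)
run.finish()
```

Alongside the logged metrics, the run page then shows the CPU/GPU utilization charts mentioned above without any extra instrumentation.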
Nonetheless, I prefer the model registry and serving capabilities offered by MLflow. You can either host MLflow yourself or leverage the pre-hosted version in your Databricks deployment, which lets you track external experiments and register models seamlessly.
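A minimal sketch of that workflow is below, assuming the Databricks CLI is already configured (for a self-hosted server, point the tracking URI at its URL instead); the experiment path and registered model name are hypothetical:

```python
# Sketch: logging an external experiment to a Databricks-hosted MLflow
# server and registering the model. Experiment path and model name are
# hypothetical; a self-hosted server would use its URL as the tracking URI.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

mlflow.set_tracking_uri("databricks")                 # or e.g. "http://my-mlflow:5000"
mlflow.set_experiment("/Users/me@example.com/quant")  # hypothetical path

X, y = make_classification(n_samples=500, random_state=0)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100).fit(X, y)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Registering under a name makes the version available in the registry
    # (and from there to MLflow's serving tooling).
    mlflow.sklearn.log_model(model, "model", registered_model_name="quant-rf")
```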