Machine Learning on local or Cloud based NVidia or Apple GPUs - ObrienlabsDev/blog GitHub Wiki
Introduction
This blog details various configurations for running machine learning software for LLM or general AI applications on a variety of hardware, including NVidia professional workstation GPUs (locally or in the cloud) and local Apple ARM hardware.
Performance
Batch size variations among GPUs (shorter time per iteration is better). Note that a dual-GPU setup performs well only for large batch sizes over 1024, and that performance correlates with GPU core count: in this case 32768 cores for the dual RTX-4090 (2 x 16384).
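The comparison above comes down to timing one iteration at several batch sizes. The following is a minimal sketch of such a timing harness, not the benchmark used for the results on this page. The names `time_per_iteration` and `cpu_stand_in_step` are hypothetical, and the workload is a pure-Python CPU stand-in; on real hardware the `step` callable would run one training or inference iteration on the GPU.

```python
import time


def time_per_iteration(step, batch_sizes, repeats=3):
    """Return {batch_size: average seconds per call of step(batch_size)}."""
    results = {}
    for batch_size in batch_sizes:
        step(batch_size)  # warm-up call, excluded from the timing
        start = time.perf_counter()
        for _ in range(repeats):
            step(batch_size)
        results[batch_size] = (time.perf_counter() - start) / repeats
    return results


def cpu_stand_in_step(batch_size, features=64):
    """Hypothetical stand-in workload: one dot product per sample in the batch."""
    weights = [0.5] * features
    total = 0.0
    for i in range(batch_size):
        row = [float(i % 7)] * features
        total += sum(w * x for w, x in zip(weights, row))
    return total


if __name__ == "__main__":
    timings = time_per_iteration(cpu_stand_in_step, [256, 512, 1024, 2048])
    for batch_size, seconds in timings.items():
        per_sample_us = seconds / batch_size * 1e6
        print(f"batch={batch_size:5d}  {seconds * 1e3:8.2f} ms/iter  {per_sample_us:8.2f} us/sample")
```

Reporting time per sample next to time per iteration makes it easier to see at which batch size extra hardware starts to pay off.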
Quickstart
- https://obrienlabs.medium.com/running-the-larger-google-gemma-7b-35gb-llm-for-7x-inference-performance-gain-8b63019523bb
- https://github.com/ObrienlabsDev/machine-learning/issues
- https://github.com/ObrienlabsDev/blog/issues/13
- https://github.com/ObrienlabsDev/blog/issues/9
Setup
Architecture
DevOps
Example ML Systems
2023 Lenovo P1 Gen 6 : i7-13800H 64G and NVidia RTX-3500 Ada AD104 5120 cores 12G 192bit VRam
- Order the 64G laptop not the 96G version for now https://forums.lenovo.com/t5/ThinkPad-P-and-W-Series-Mobile-Workstations/P1-Gen6-Bricked-after-BSOD-second-laptop-with-the-same-problem/m-p/5254145?page=2#6148028
2020 Lenovo P17 Gen 1 : Xeon W-10855M 128G and NVidia Quadro RTX-5000 TU104 Turing 3072 cores 16G 256bit VRam
2023 Custom : i9-13900K 192G and Dual NVidia RTX-4090 MSI Suprim Liquid X
2023 Custom : i9-13900K 128G and Dual NVidia RTX-A4500 with NVidia RTX-A4000
2021 Lenovo X1 Carbon Gen 9 : Intel GPU
Google Cloud Workstation : NVidia L4 GPU
Google Pixel 6 : Google TPU
Links
PMLE Training
PMLE Notes
- Machine Learning Crash Course https://developers.google.com/machine-learning/crash-course/representation/cleaning-data
- learn gradient ascent and expand the partial derivative section - "the negative of the gradient vector points into the valley" https://developers.google.com/machine-learning/crash-course/reducing-loss/gradient-descent
- deep field before deep learning https://esahubble.org/images/heic0611b/ https://simbad.u-strasbg.fr/simbad/sim-id?Ident=Hubble+Ultra+Deep+Field
- https://en.wikipedia.org/wiki/Comparison_of_deep_learning_software
- tree classifier using area under the curve - https://dmip.webs.upv.es/papers/ICML2002presentation.pdf - the greater AUC means better positive/negative classification
- XGBoost - https://xgboost.readthedocs.io/en/stable/tutorials/model.html https://www.analyticsvidhya.com/blog/2018/09/an-end-to-end-guide-to-understand-the-math-behind-xgboost/#:~:text=XGBoost%20is%20a%20machine%20learning,won%20several%20machine%20learning%20competitions.
- https://codelabs.developers.google.com/vertex_notebook_executor#0
- https://www.tensorflow.org/guide/tpu#distribution_strategies
- TPU nodes(gRPC)/VMs(ssh) and twisted topology https://cloud.google.com/tpu/docs/system-architecture-tpu-vm
- TPU V4 up to 2048 TPU cores - https://cloud.google.com/tpu/docs/supported-tpu-configurations
- JAX Autograd (automated gradient function) and XLA (Accelerated Linear Algebra - see cuBLAS) https://jax.readthedocs.io/en/latest/
- https://neptune.ai/blog/retraining-model-during-deployment-continuous-training-continuous-testing
- hashing or homomorphic encryption https://fastdatascience.com/sensitive-data-machine-learning-model/
- TensorFlow Data Validation and Pandas https://www.tensorflow.org/tfx/data_validation/get_started
- TensorFlow from Google Brain https://en.wikipedia.org/wiki/TensorFlow#TensorFlow
- Batch and Streaming data processing https://beam.apache.org/
- https://medium.com/mlpoint/pandas-for-machine-learning-53846bc9a98b
- training with mini-batch gradient descent https://towardsdatascience.com/batch-mini-batch-stochastic-gradient-descent-7a62ecba642a
- https://en.wikipedia.org/wiki/Regularization_%28mathematics%29
- training with L1 regularization (prevent overfitting) https://towardsdatascience.com/regularization-in-deep-learning-l1-l2-and-dropout-377e75acc036
- small normalized wide dataset (reduce feature scaling for training) https://developers.google.com/machine-learning/data-prep/transform/normalization
- PCA https://www.analyticsvidhya.com/blog/2022/07/principal-component-analysis-beginner-friendly/
- reduce ML latency https://cloud.google.com/architecture/minimizing-predictive-serving-latency-in-machine-learning#optimizing_models_for_serving
- https://www.tensorflow.org/guide/keras/serialization_and_saving
- https://cloud.google.com/vertex-ai/docs/model-registry/introduction
- https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
- https://cloud.google.com/vertex-ai/docs/workbench/managed/schedule-managed-notebooks-run-quickstart
- https://cloud.google.com/vertex-ai/docs/pipelines/run-pipeline
- https://cloud.google.com/architecture/setting-up-mlops-with-composer-and-mlflow
- https://cloud.google.com/tpu/docs/intro-to-tpu#when_to_use_tpus
- https://www.tensorflow.org/tutorials/distribute/multi_worker_with_ctl
- https://cloud.google.com/dlp/docs/transformations-reference#transformation_methods
- https://cloud.google.com/blog/products/identity-security/next-onair20-security-week-session-guide
- https://cloud.google.com/tensorflow-enterprise/docs/overview
- https://developers.google.com/machine-learning/crash-course/representation/cleaning-data
- https://developers.google.com/machine-learning/testing-debugging/metrics/interpretic
- https://developers.google.com/machine-learning/crash-course/feature-crosses/video-lecture
- https://cloud.google.com/vertex-ai/docs/training/hyperparameter-tuning-overview
- https://cloud.google.com/automl-tables/docs/evaluate#evaluation_metrics_for_regression_models
- https://developers.google.com/machine-learning/glossary#baseline
- https://cloud.google.com/ai-platform/training/docs/training-at-scale
- https://cloud.google.com/ai-platform/training/docs/machine-types#scale_tiers
- https://cloud.google.com/vertex-ai/docs/training/distributed-training
- https://cloud.google.com/ai-platform/training/docs/overview#distributed_training_structure
- https://cloud.google.com/vertex-ai/docs/featurestore/overview#benefits
- https://cloud.google.com/architecture/ml-on-gcp-best-practices#model-deployment-and-serving
- https://cloud.google.com/memorystore/docs/redis/redis-overview
- https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-overview
- https://cloud.google.com/vertex-ai/docs/ml-metadata/introduction
- https://cloud.google.com/vertex-ai/docs/pipelines/visualize-pipeline
- https://cloud.google.com/vertex-ai/docs/model-monitoring/overview
- https://cloud.google.com/architecture/best-practices-for-ml-performance-cost
- https://www.tensorflow.org/lite/performance/model_optimization
- https://www.tensorflow.org/tutorials/images/transfer_learning
- https://developers.google.com/machine-learning/glossary#calibration-layer
- https://developers.google.com/machine-learning/testing-debugging/common/overview
- https://cloud.google.com/bigquery-ml/docs/preventing-overfitting
- https://www.tensorflow.org/tutorials/keras/overfit_and_underfit
- https://cloud.google.com/architecture/implementing-deployment-and-testing-strategies-on-gke
- https://docs.seldon.io/projects/seldon-core/en/latest/analytics/routers.html
- https://www.tensorflow.org/tutorials/customization/custom_layers
- https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda
- https://cloud.google.com/vertex-ai/docs/ml-metadata/tracking
- https://cloud.google.com/architecture/ml-on-gcp-best-practices#operationalized-training
- https://cloud.google.com/architecture/ml-on-gcp-best-practices#organize-your-ml-model-artifacts
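Several of the notes above (gradient descent, mini-batch training, L1 regularization) can be tied together in one small example. The following is a minimal sketch in plain Python, assuming a one-feature linear model `y ~ w*x + b` and a mean squared error loss. The function name `minibatch_sgd_l1`, the toy data, and all hyperparameter values are illustrative assumptions and are not taken from the linked articles.

```python
import random


def minibatch_sgd_l1(xs, ys, lr=0.1, l1=0.0, batch_size=4, epochs=1000, seed=0):
    """Fit y ~ w*x + b by mini-batch gradient descent with an optional L1 penalty on w."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new mini-batch split every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Partial derivatives of the mean squared error over this mini-batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Subgradient of the penalty l1 * |w| (taken as 0 at w == 0).
            grad_w += l1 * ((w > 0) - (w < 0))
            # Step along the negative gradient, i.e. "into the valley".
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    # Toy, noise-free data generated from y = 3x + 1 (an assumption for illustration).
    xs = [i / 7 for i in range(8)]
    ys = [3 * x + 1 for x in xs]
    print("no penalty:", minibatch_sgd_l1(xs, ys))
    print("l1 = 0.2  :", minibatch_sgd_l1(xs, ys, l1=0.2))
```

The L1 term pulls `w` toward zero, so the penalized fit ends up with a smaller weight than the unpenalized one; on real data that shrinkage is what limits overfitting.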
Hardware
AD102 RTX-4090 Ada Consumer
- 24GB, 384-bit, 1008 GB/s, 16384 cores, 76B transistors, 1344 GTexels/s
AD104 RTX-3500 Ada Mobile Workstation P1Gen6 2023
- 12GB, 192-bit, 432 GB/s, 5120 cores, 35B transistors, 319 GTexels/s
RTX-A4500 Ampere Workstation 2021
- 20GB
RTX-A4000 Ampere Workstation 2021
- 16GB
RTX-5000 Lenovo P17Gen1 2020
- 16GB
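The bus width and bandwidth figures listed for the two Ada GPUs are related by bandwidth = (bus width in bits / 8) x per-pin data rate. The sketch below uses only the numbers from the hardware list above to back out the implied per-pin rate and the combined core count of a dual RTX-4090 setup. The helper name `implied_pin_rate_gbps` and the `GPUS` dictionary are hypothetical, introduced here for illustration.

```python
def implied_pin_rate_gbps(bandwidth_gb_s, bus_width_bits):
    """Per-pin data rate (Gbit/s) implied by bandwidth = (bus width / 8) * per-pin rate."""
    return bandwidth_gb_s * 8 / bus_width_bits


# Figures copied from the hardware list above.
GPUS = {
    "RTX-4090 (AD102)": {"bandwidth_gb_s": 1008, "bus_bits": 384, "cores": 16384},
    "RTX-3500 Ada (AD104)": {"bandwidth_gb_s": 432, "bus_bits": 192, "cores": 5120},
}

if __name__ == "__main__":
    for name, spec in GPUS.items():
        rate = implied_pin_rate_gbps(spec["bandwidth_gb_s"], spec["bus_bits"])
        print(f"{name}: {rate:.1f} Gbit/s per pin, {spec['cores']} cores")
    dual_cores = 2 * GPUS["RTX-4090 (AD102)"]["cores"]
    print(f"dual RTX-4090: {dual_cores} cores")
```

A check like this is a quick way to catch transcription errors when copying spec-sheet numbers into a comparison list.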