Machine Learning on Spark - sedgewickmm18/microk8s-spark-pulsar-etc GitHub Wiki

Various topics related to ML on Spark

Small to medium sized data

This is the typical IoT setting: each individual I/O point contributes little data, but all of them together can fill up terabytes.

The article Boosting Parallelism for ML in Python using scikit-learn, joblib & PySpark presents three recipes: a Spark-enabled joblib backend for scikit-learn, distributed inference with already-trained models, and the native [Spark ML](https://spark.apache.org/mllib/) library.

Alternatively, Ibotta created sk-dist for this use case.

See also the medium article Train sklearn 100x faster (2019) as a reference.

Deep learning

I assume either training on a local workstation or Kubeflow-based pipelines, which should be good enough for our use cases. If we want to make the best use of Spark, Uber's Horovod is an option.

The medium article Distributed Deep Learning Training with Horovod on Kubernetes provides an overview over Horovod.

Regardless of the model training component, Pulsar provides direct access to historic data.

Spark DAGs

Similar to the Monitor 1.0 pipeline, we provide a master wrapper function to build the execution graph (in effect a sequence of functions). To generate a proper Spark DAG, we have to take individual I/O points into account and build the execution graph dynamically to benefit from distributed processing.

Spark internals are explained in the gitbook Spark DAG Scheduler,

and the reference Spark dynamic DAG is a lot slower and different from hard coded DAG serves as an example of how to create a DAG dynamically.
