Analytics Service data pipeline and training a model
Introduction
The AS pipeline is designed to work with new data, i.e. data it has not seen before: jobs run periodically, pick up new data, and perform the desired computations.
This approach works well for computations that do not depend on order or on historic data, for example determining operational hours, which is essentially a sum. It does not work for operations that
1. do not satisfy the associative law,
2. require more data than is presented in a single run, and
3. need specific data, i.e. data from a particular time frame.
Backtracking
allows the pipeline to pull in more than just the most recent data in order to cope with "stragglers", i.e. events that arrive late. This would address requirements (1) and (2) but not (3), and it comes at a cost, because more data needs to be pulled from the database and passed to all functions active for the particular entity type.
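To make this concrete, here is a minimal illustrative sketch (not the actual AS implementation; the function name and the six-hour backtrack default are made up) of how a backtrack period widens the time frame of data pulled in each periodic run:

```python
from datetime import datetime, timedelta

def retrieval_window(last_run_ts: datetime, now: datetime,
                     backtrack: timedelta = timedelta(hours=6)):
    """Return the (start, end) time frame for the data pull of a periodic run.

    Reaching back by `backtrack` re-processes late-arriving ("straggler")
    events, at the cost of pulling and re-computing more data.
    """
    return last_run_ts - backtrack, now

start, end = retrieval_window(datetime(2020, 5, 1, 12, 0), datetime(2020, 5, 1, 13, 0))
print(start, end)  # 2020-05-01 06:00:00  2020-05-01 13:00:00
```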
Training a model requires (2) and (3):
You need much more data for training than for predicting, and you need to select the training data carefully. Furthermore, training is a one-time effort: do it once and redo it only when there are significant changes, for example when a piece of equipment is replaced.
As a result, model training should be handled outside of the regular periodic job scheduling. Approaches like asynchronous job execution have been proposed to address model training.
Poor man's approach to model training
Mike A. provided a set of functions called estimators, based on gradient-boosted regression. He rolled model training, model evaluation with cross-validation, saving the trained model to Cloud Object Store, and finally predicting with the trained model into a single piece of code.
We could take advantage of the estimator group of functions if we could instantiate an estimator and provide it with its intended training data.
The following notebook shows how to instantiate a local pipeline connected to a tenant's database.
It sets up a dummy entity type to access the training data by columns: in the example we try to predict temperature from pressure and store the prediction in the column "predict".
Executing the pipeline/job controller instance pulls the training data for the specified time frame from the database.
It then trains a number of models with different hyperparameters and stores the best model in the tenant's Cloud Object Store bucket, from where it is pulled for the regular periodic pipeline runs.
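For illustration, here is a minimal sketch of what such a notebook could contain, assuming the open-source iotfunctions package; the credentials file name, the dummy entity-type name, the SimpleRegressor estimator, and the chosen time frame are placeholders and may differ from the actual notebook:

```python
import datetime as dt
import json
from sqlalchemy import Column, Float
from iotfunctions.db import Database
from iotfunctions.metadata import EntityType
from iotfunctions.estimator import SimpleRegressor  # assumed estimator sample function

# tenant credentials (database + Cloud Object Store) exported from the AS UI
with open('credentials_as.json', encoding='utf-8') as f:
    credentials = json.loads(f.read())

db = Database(credentials=credentials)
db_schema = None  # use the default schema

# dummy entity type exposing the training data by column; in practice it would
# be named after the table that already holds the training data
entity = EntityType('training_dummy', db,
                    Column('pressure', Float()),
                    Column('temperature', Float()),
                    SimpleRegressor(features=['pressure'],
                                    targets=['temperature'],
                                    predictions=['predict']),
                    **{'_timestamp': 'evt_timestamp',
                       '_db_schema': db_schema})

# run the pipeline locally for the training time frame (start_ts/end_ts assumed);
# the best model is written to the tenant's Cloud Object Store bucket
entity.exec_local_pipeline(start_ts=dt.datetime(2020, 1, 1),
                           end_ts=dt.datetime(2020, 3, 1))
```

Running this locally covers the steps described above: pulling the training data for the time frame, training and cross-validating candidate models, and persisting the best one.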
Here is a snippet from the job log output showing model training and the first steps of cross-validation.
Short term approach
I propose to turn the notebook above into a command-line tool that asks for parameters like the time frame, sets up the environment, and drives model training locally. The code and supporting libraries would be shipped as a self-sufficient Docker container.
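A rough sketch of what such a command-line entry point could look like (the script structure and all flag names are assumptions, not an existing tool):

```python
import argparse
import datetime as dt

def main():
    parser = argparse.ArgumentParser(description='Train an AS model locally')
    parser.add_argument('--credentials', required=True,
                        help='path to the tenant credentials JSON file')
    parser.add_argument('--entity-type', required=True,
                        help='entity type holding the training data')
    parser.add_argument('--start', required=True, type=dt.datetime.fromisoformat,
                        help='start of the training time frame, e.g. 2020-01-01T00:00')
    parser.add_argument('--end', required=True, type=dt.datetime.fromisoformat,
                        help='end of the training time frame')
    args = parser.parse_args()

    # here the notebook logic would be invoked: set up the Database and the
    # dummy EntityType, then run the local pipeline over [start, end] to
    # train the model and persist it to Cloud Object Store
    print(f'training on {args.entity_type} data from {args.start} to {args.end}')

if __name__ == '__main__':
    main()
```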
Longer term approach
Longer term we should provide a REST call to schedule one-time jobs (as opposed to the regular periodic cron jobs). There should be an extension to the function's UI class to equip the one-time job with parameters, for example the time frame for the training data, and with a rule for when to actually run the training job, for example based on the Kubernetes cluster's CPU and memory load. With these additional parameter templates the dashboard could prompt users for the regular parameters used for prediction as well as for the one-time parameters used for training.
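Purely as an illustration of how such a one-time job might be scheduled, the sketch below assumes a hypothetical endpoint and payload; none of the URL, fields, or headers are an existing API:

```python
import requests

# hypothetical request body for a one-time training job
payload = {
    'jobType': 'one_time_training',
    'entityType': 'equipment',
    'parameters': {
        'training_start': '2020-01-01T00:00:00Z',
        'training_end': '2020-03-01T00:00:00Z'
    },
    # defer execution until the cluster has spare CPU/memory capacity
    'schedule': {'condition': 'low_cluster_load'}
}

resp = requests.post('https://<as-api-host>/api/jobs/onetime',  # placeholder URL
                     json=payload,
                     headers={'X-API-KEY': '<api key>', 'X-API-TOKEN': '<api token>'})
resp.raise_for_status()
print(resp.json())
```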