How to turn a regression function into a BaseEstimatorFunction - sedgewickmm18/mmfunctions GitHub Wiki
Introduction
In general, Maximo Asset Monitoring pipeline functions either add columns to a pandas dataframe (BaseTransformer) or aggregate dataframes over periods of time (BaseAggregator).
The BaseEstimatorFunction class serves multiple purposes:
- First of all, it is a subclass of BaseTransformer that adds new columns to a pandas dataframe; these new columns hold the predictions.
- Since predictions do not work without models, it keeps track of existing models and loads and instantiates them when invoked. It supports multiple tenants, entity types and aggregation granularities.
- It initiates training runs when no model exists yet (or on request) and persists models in ObjectStore.
- It allows for multiple sets of hyperparameters and uses cross-validation to find the best model, according to a metric you specify (typically r2_score).
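Conceptually, the hyperparameter search described above resembles scikit-learn's randomized cross-validation; the sketch below uses plain scikit-learn to show how several candidate hyperparameter sets are scored by cross-validated r2_score and the best model is kept. All data and parameter values here are illustrative, not taken from the pipeline itself.

```python
# Sketch of cross-validated model selection over several hyperparameter
# sets, scored with r2_score - the kind of selection BaseEstimatorFunction
# performs internally. Synthetic data only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, r2_score

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 3, 4],
    'learning_rate': [0.05, 0.1, 0.2],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    scoring=make_scorer(r2_score),
    random_state=0,
)
search.fit(X, y)
print(search.best_score_)  # cross-validated r2 of the best candidate
```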
How to provide your own estimator
First, derive it from BaseEstimatorFunction, which provides all the infrastructure to deal with your models.
Then make sure to set self.auto_train to False. Otherwise BaseEstimatorFunction automatically initiates a model training run in the pipeline's kubernetes cluster when it is invoked for the first time.
Why this is bad
The pipeline cluster is set up to support multiple tenants with many different entity types and aggregation granularities. Each combination of tenant, entity type and aggregation granularity is reflected as a cron job. So typically many pods will be sharing the kubernetes cluster, each vying for CPU and memory. And training runs tend to consume lots of both, so you can expect some OOM killer action when the pipeline is hosting a training run by accident.
The long-term proposal is to schedule training runs on a dedicated system, either kubernetes or even PowerAI, whichever approach is more cost-effective, or to delegate them to Watson Studio's AutoAI. Until this is available, you can train a model on your local laptop.
How to train a pipeline regressor on your local laptop
Just take the GradientBoostingInPipeline notebook found here as an example. It prepares a pandas dataframe with some test data and builds the pipeline metadata, i.e. an entity type, for it. Then it instantiates a regressor function instance, sets auto_train to True, links it to the entity type and executes the pipeline locally.
Caveat: Before running it, you need to copy the credentials for database and object store access (used for model load/save) from the Maximo Asset Monitoring dashboard into the second cell, replacing the expression credentials={}. Furthermore, note that there is no correlation between the input variables and the dependent variable - the target is in fact independent of the inputs, so prediction fails.
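The failure mode mentioned in the caveat can be reproduced with plain scikit-learn: when the target is independent of the inputs, even a strong regressor cannot beat predicting the mean, and the held-out r2 score reflects that. The data below is synthetic and only illustrates the point.

```python
# When the dependent variable is independent of the inputs, a regressor
# learns nothing transferable and the test-set r2 is typically at or
# below zero.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.RandomState(42)
X = rng.normal(size=(500, 4))   # input variables
y = rng.normal(size=500)        # independent of X

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
score = r2_score(y_test, model.predict(X_test))
print(score)
```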
How to write your own regression function
First of all, make sure your regression function's API is compatible with sklearn's Regressor API and supports RandomizedSearchCV-based model selection. The same applies to the scorer: your metric has to be compatible with sklearn's r2_score etc.
For an example, have a look at the GBMRegressor class found here.
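The compatibility requirement above can be sketched with a deliberately trivial custom regressor: deriving from scikit-learn's BaseEstimator and RegressorMixin supplies get_params/set_params and an r2-based score method, which is what RandomizedSearchCV needs. The estimator, its shrinkage hyperparameter and the data are all illustrative, not part of the mmfunctions code.

```python
# A minimal regressor compatible with the sklearn Regressor API:
# fit/predict, get_params/set_params (via BaseEstimator) and an
# r2-based .score() (via RegressorMixin), so RandomizedSearchCV-based
# model selection works on it.
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import RandomizedSearchCV

class ShrunkenMeanRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, shrinkage=1.0):
        self.shrinkage = shrinkage  # stored as-is, per sklearn convention

    def fit(self, X, y):
        # "Train" by memorizing a shrunken mean of the target.
        self.mean_ = np.mean(y) * self.shrinkage
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = rng.normal(loc=3.0, size=100)

search = RandomizedSearchCV(ShrunkenMeanRegressor(),
                            {'shrinkage': [0.5, 0.9, 1.0, 1.1]},
                            n_iter=4, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```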