A guide to SAIL Pipeline
Introduction
SAIL Pipeline extends the capabilities of the Scikit-learn pipeline to incremental learning and adds compatibility with machine learning models from River, Keras, and PyTorch via SAIL wrappers.
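A minimal sketch of how such a pipeline might be assembled and trained incrementally is shown below. The import path `sail.pipeline.SAILPipeline` and the exact `partial_fit`/`predict` signatures are assumptions made for illustration (the Scikit-learn pieces are real); the SAIL repository's own examples are the authoritative reference, and a River, Keras or PyTorch model could be substituted via the corresponding SAIL wrapper.

```python
# A minimal sketch of incremental training with a SAIL-style pipeline.
# NOTE: the import path `sail.pipeline.SAILPipeline` and the exact
# `partial_fit`/`predict` signatures are assumptions for illustration;
# consult the SAIL repository's examples for the definitive API.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

from sail.pipeline import SAILPipeline  # assumed import path

# Chain a pre-processing step with an incremental estimator using the
# familiar Scikit-learn (name, step) convention. A River, Keras or
# PyTorch model could be substituted here via the SAIL wrappers.
pipeline = SAILPipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", SGDRegressor()),
])

rng = np.random.default_rng(0)

# Stream the data in mini-batches instead of fitting on a full dataset.
for _ in range(10):
    X_batch = rng.normal(size=(32, 4))
    y_batch = X_batch @ np.array([1.0, -2.0, 0.5, 3.0])
    pipeline.partial_fit(X_batch, y_batch)

predictions = pipeline.predict(rng.normal(size=(5, 4)))
```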
The architecture figure describes the SAIL AutoML pipeline and gives an overview of its framework design and components. The framework offers an end-to-end AutoML workflow, driven by a SAIL pipeline, that covers data ingestion, cleaning, feature selection, model selection, incremental training, scoring, and monitoring for data drift. The workflow spans the full AutoML life cycle, from collecting and engineering the data to fitting the chosen estimator while handling data drift. It is made intelligent through a robust set of APIs offered by the SAIL framework, and the individual components adhere to SAIL's modelling patterns, which keeps them efficient when executed through the pipeline APIs. The pipeline can also run distributed using the Ray ecosystem.
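The sketch below illustrates the distributed-execution idea by dispatching independent batches through placeholder pipeline stages as Ray tasks. Only the core Ray task API (`ray.init`, `@ray.remote`, `ray.get`) is real here; the stage functions are hypothetical and do not reflect how SAIL itself partitions pipeline work across Ray workers.

```python
# A minimal sketch of running independent pipeline work as Ray tasks.
# The stage functions below are hypothetical placeholders; only the Ray
# task API itself (ray.init, @ray.remote, ray.get) is real.
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def preprocess(batch: np.ndarray) -> np.ndarray:
    # Placeholder for a data-cleaning / transformation stage.
    return (batch - batch.mean(axis=0)) / (batch.std(axis=0) + 1e-9)

@ray.remote
def score(batch: np.ndarray) -> float:
    # Placeholder for a scoring / monitoring stage.
    return float(np.abs(batch).mean())

batches = [np.random.normal(size=(64, 4)) for _ in range(8)]

# Each batch is pre-processed and scored in parallel across Ray workers.
cleaned = [preprocess.remote(b) for b in batches]
scores = ray.get([score.remote(c) for c in cleaned])
print(scores)
```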
The architecture diagram shows three base components: (a) SAIL Pipeline, (b) SAIL Model, and (c) Ray API (Philipp 2017). The SAIL Pipeline acts as the backbone that produces production-ready incremental models: it streamlines training concerns such as chaining pre-processing tasks, model optimisation, incremental training, and monitoring. The pipeline works with a range of standard transformers and pre-processors for data transformation tasks. The Transformer API accepts a mix and match of multiple transformers from libraries such as River (Montiel 2021) and Scikit-learn (Pedregosa 2011), and is responsible for ingesting data from high-frequency, high-volume data sources such as RES Systems. The pipeline can also be executed exclusively for pre-processing tasks. The data cleaning component handles data manipulation and cleaning jobs, while feature selection is performed through the Float Python library integrated into SAIL.
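To make the "mix and match" idea concrete, the sketch below shows the two conventions the Transformer API has to bridge: a batch-oriented Scikit-learn pre-processor and a streaming River scaler that learns one sample at a time. The SAIL wrapper layer that unifies them behind one interface is not shown; only the Scikit-learn and River calls are real.

```python
# A minimal sketch of the kinds of transformers the pipeline can mix:
# a batch-oriented Scikit-learn pre-processor and a streaming River scaler.
# How SAIL wraps these behind a single Transformer API is not shown here.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from river import preprocessing

X_batch = np.random.normal(loc=5.0, scale=2.0, size=(100, 3))

# Scikit-learn transformers operate on whole mini-batches.
sk_scaler = MinMaxScaler()
X_scaled = sk_scaler.fit_transform(X_batch)

# River transformers learn one sample (a dict of features) at a time.
river_scaler = preprocessing.StandardScaler()
for row in X_scaled:
    x = {f"x{i}": v for i, v in enumerate(row)}
    river_scaler.learn_one(x)
    x_out = river_scaler.transform_one(x)
```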
The Evaluation API uses a consistent measurement process to quantify the quality of model predictions on the test data: the output is evaluated against the ground truth using various metrics. It also monitors for data drift and, when drift is detected, triggers model selection to re-evaluate the model parameters.
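The sketch below illustrates the kind of prequential (test-then-train) loop this implies: each incoming batch is first scored against the ground truth, then used to update the model, and a rolling error check stands in for the drift monitor that would trigger model selection. The drift rule and the `trigger_model_selection` hook are simplified placeholders, not SAIL's actual Evaluation API.

```python
# A simplified prequential (test-then-train) evaluation loop with a naive
# drift check. The drift rule and the model-selection trigger below are
# illustrative placeholders, not SAIL's actual Evaluation API.
from collections import deque

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error

model = SGDRegressor()
recent_errors = deque(maxlen=20)   # rolling window of per-batch errors
baseline = None

def trigger_model_selection():
    # Placeholder: in SAIL this would re-evaluate the model parameters.
    print("Drift suspected: re-evaluating model parameters...")

rng = np.random.default_rng(1)
for step in range(200):
    X = rng.normal(size=(32, 4))
    # Simulate a concept drift halfway through the stream.
    coef = np.array([1.0, -2.0, 0.5, 3.0]) if step < 100 else np.array([-1.0, 2.0, -0.5, -3.0])
    y = X @ coef + rng.normal(scale=0.1, size=32)

    if step > 0:
        # Test-then-train: score on the batch before learning from it.
        error = mean_absolute_error(y, model.predict(X))
        recent_errors.append(error)
        if baseline is None and len(recent_errors) == recent_errors.maxlen:
            baseline = np.mean(recent_errors)
        elif baseline is not None and np.mean(recent_errors) > 2 * baseline:
            trigger_model_selection()
            baseline = None  # reset the baseline after re-evaluation

    model.partial_fit(X, y)
```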