Getting started with SAIL AutoML - IBM/sail GitHub Wiki
The SAIL AutoML Pipeline, as shown in the above figure, aims to simplify and accelerate high-frequency data ingestion, data engineering and model training, while also handling model selection and drift detection. The AutoML Pipeline acquires high-frequency raw RES data from sources such as wind and solar farms, performs the necessary transformations, and collects it in a format suitable for stream-based incremental training. It also ensures the quality and usability of the data in the subsequent stages of the pipeline life cycle. It automates and facilitates the different steps of the pipeline, allowing users to focus more on transformer selection and model definitions, and less on pipeline operations. Calling fit on a pipeline instance triggers the sequential calling of the fit and transform methods of each step in the pipeline, each time passing the transformed data from one step to the input of the next. For the last step, only the fit method is called. Similarly, when predict is called on a pipeline, the transform method is called on each step and predict on the IML model.
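The fit/predict convention described above can be sketched as a minimal pipeline class. This is an illustrative sketch of the convention, not SAIL's actual implementation; the scikit-learn transformer and estimator used as steps are stand-ins.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

class MiniPipeline:
    """Minimal sketch of the fit/predict chaining convention."""
    def __init__(self, steps):
        self.steps = steps                      # list of (name, obj); last step is the estimator

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)     # fit then transform each intermediate step
        self.steps[-1][1].fit(X, y)             # last step: only fit is called
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)               # transform only at inference time
        return self.steps[-1][1].predict(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X[:, 0] - X[:, 1]
pipe = MiniPipeline([("scale", StandardScaler()), ("model", SGDRegressor(random_state=0))])
pipe.fit(X, y)
preds = pipe.predict(X)
```

Each intermediate step sees the output of the previous step, so the estimator always trains and predicts on fully transformed data.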
Ingestion and Data Cleaning
The ingestion framework ingests data from multiple RES systems. The stream ingestion component loads data as soon as it is recognized, guaranteeing that all relevant information is available for training and analysis.
Stream Ingestion
In SAIL, the primary data formats on which the components, including transformers and estimators, operate are Pandas DataFrames (pandas 2020) and NumPy (J. 2020) arrays. Thus, any file type that can be materialized as a DataFrame or array is supported. Stream ingestion in the SAIL Pipeline collects, processes and analyses streaming data in real time. It can be done using Apache Kafka, the River Stream API, or local file storage, reading data into a Pandas DataFrame or NumPy arrays from either CSV or Parquet files. The collected data is available for further processing in the data pipeline.
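For the local-file case, stream ingestion can be approximated with chunked reading, which yields one DataFrame per batch. The column names below are hypothetical RES telemetry, and the in-memory buffer stands in for an actual CSV file on local storage.

```python
import io
import pandas as pd

# Hypothetical RES telemetry; in practice this would be a CSV or Parquet file.
raw = io.StringIO(
    "timestamp,wind_speed,power\n"
    "1,5.1,120\n2,5.3,125\n3,6.0,140\n4,5.8,138\n"
)

batches = []
for chunk in pd.read_csv(raw, chunksize=2):   # each chunk is a DataFrame ready for the pipeline
    batches.append(chunk)
```

Each chunk can then be passed through the data cleaning and transformation steps as an independent streaming batch.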
Data Cleaning and Transformation
[Figure: data cleaning and transformation component (images/data_cleaning_transformation.png)]
This step performs data engineering tasks by chaining a sequence of operations, or steps, into a single object that solves a particular task. Data cleaning and engineering in the SAIL pipeline is a cumulative rolling operation accomplished via stateful transformers. In the context of incremental learning, a stateful transformer remembers statistics learned from previous data and applies them cumulatively to new data. Transformers maintain moving statistics as more data is fed into the AutoML Pipeline, and are updated with new data insights as the underlying model gains more knowledge. An example data transformation component, shown in the above figure, consists of a sequence of transformers followed by an estimator.
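The stateful-transformer behaviour can be seen in scikit-learn's partial_fit API, which SAIL-style incremental pipelines build on: statistics accumulate across batches instead of being recomputed from the latest batch alone.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.partial_fit(np.array([[1.0], [2.0]]))   # statistics from the first batch
scaler.partial_fit(np.array([[3.0], [4.0]]))   # cumulatively updated with the second batch
# scaler.mean_ now reflects all four samples seen, not just the latest batch
```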
Feature Selection
SAIL AutoML Feature Selection can automatically select relevant features from the dataset and remove irrelevant or redundant features. This step helps to improve model performance and reduce overfitting. It uses Float (Kasneci 2022), a modular Python framework for standardized evaluations of online learning methods. Float provides implementations of popular feature selection algorithms and wrappers for River feature selection. Feature Selection is optional and can be excluded on demand.
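To illustrate the idea of online feature selection, here is a minimal sketch that maintains running per-feature variances (Welford's algorithm) and drops near-constant features. This is not Float's API; it only shows the incremental pattern such selectors follow.

```python
import numpy as np

class IncrementalVarianceSelector:
    """Keeps features whose running variance exceeds a threshold (illustrative sketch)."""
    def __init__(self, threshold=1e-8):
        self.threshold = threshold
        self.n = 0
        self.mean = None
        self.m2 = None

    def partial_fit(self, X):
        X = np.asarray(X, dtype=float)
        if self.mean is None:
            self.mean = np.zeros(X.shape[1])
            self.m2 = np.zeros(X.shape[1])
        for row in X:                               # Welford's online update per feature
            self.n += 1
            delta = row - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (row - self.mean)
        return self

    @property
    def variances_(self):
        return self.m2 / max(self.n - 1, 1)

    def transform(self, X):
        return np.asarray(X)[:, self.variances_ > self.threshold]

sel = IncrementalVarianceSelector()
sel.partial_fit([[1.0, 0.0], [2.0, 0.0]])
sel.partial_fit([[3.0, 0.0], [4.0, 0.0]])
kept = sel.transform([[5.0, 0.0]])   # the constant second column is dropped
```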
Distributed Pipeline Selection
Pipeline Selection is triggered once at the beginning of training and every time data drift occurs. At the beginning of the training operation, the AutoML Pipeline performs hyperparameter tuning through the model selection component. It automatically evaluates and selects the most appropriate model for the given dataset and task. Pipeline Selection is distributed and is achieved through the Ray Tune Scikit-Learn API - GridSearchCV and TuneSearchCV. Model hyperparameters that need tuning are set at the time of pipeline creation. The component automatically searches for the optimal combination of hyperparameters to maximize pipeline performance. Ray makes use of the RayDMatrix for efficiency. It utilizes the Ray Trainer, which applies a distribution strategy appropriate to the model family: a synchronous distributed training strategy where each worker trains a copy of the model on a partition of the data, with model weights synchronized after each batch. Ray Tune then searches over the space of possible hyperparameters using the allocated actors.
As shown in the above figure, batches of streaming data, after passing through data cleaning and transformation, undergo the Pipeline Selection operation. Ray Tune performs distributed hyperparameter tuning over the series of prospective models. The pipeline with the best parameters is used as a base model to instantiate incremental training operations. The base pipeline is incrementally trained on the subsequent data batches until there is data drift, after which Pipeline Selection produces the next base pipeline. The process is explained in the Monitoring section.
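A minimal sketch of the hyperparameter search that produces a base pipeline is shown below. It uses scikit-learn's GridSearchCV with an illustrative parameter grid; in SAIL the search runs through Ray Tune's scikit-learn API (TuneGridSearchCV/TuneSearchCV), which is designed as a drop-in replacement for the scikit-learn class, so swapping the class distributes the search across Ray workers.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a transformed streaming batch
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hyperparameters to tune are declared at pipeline-creation time
param_grid = {"alpha": [1e-4, 1e-2], "learning_rate": ["constant", "invscaling"]}

search = GridSearchCV(SGDRegressor(eta0=0.01, random_state=0), param_grid, cv=3)
search.fit(X, y)
best = search.best_estimator_   # base model for subsequent incremental training
```

The best estimator then serves as the base pipeline, incrementally trained (e.g. via partial_fit) on later batches until drift triggers a new search.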
Score and Evaluation
The trained AutoML pipeline is scored on a separate validation dataset to evaluate the model's performance. It provides many ready-to-use metrics from the River library. At the end of the evaluation, the AutoML pipeline produces an evaluation report on predictions for the given regression or classification task. The best pipeline after the evaluation can be retrained on new data, making it possible to update its weights with new data patterns and information.
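River's metrics follow an incremental update()/get() pattern, which is what makes them suitable for streaming evaluation. The pure-Python class below mirrors that pattern for mean absolute error; it is an illustrative sketch, not River's implementation.

```python
class IncrementalMAE:
    """Streaming mean absolute error, mirroring River's update()/get() metric pattern."""
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, y_true, y_pred):
        self.n += 1
        self.total += abs(y_true - y_pred)   # accumulate absolute error per prediction
        return self

    def get(self):
        return self.total / self.n if self.n else 0.0

mae = IncrementalMAE()
for y_true, y_pred in [(3.0, 2.5), (5.0, 5.5), (4.0, 4.0)]:
    mae.update(y_true, y_pred)
score = mae.get()   # (0.5 + 0.5 + 0.0) / 3
```

Because the metric is updated one prediction at a time, the evaluation report can be produced at any point in the stream without revisiting earlier data.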
Monitoring
Monitoring is an important aspect of the SAIL AutoML operation. After every incremental training step on a streaming batch, the pipeline checks for data drift. When the pipeline observes significant data drift, it resets the model parameters and performs model selection on the next data collection. This process performs a new hyperparameter search using the updated data insights and produces a new base model. Otherwise, if the data drift is below the user-defined threshold, the AutoML pipeline continues incremental training using the last best model.
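The monitoring control flow can be sketched as a loop over batches. The callables `select_pipeline` and `detect_drift` are hypothetical stand-ins for SAIL's distributed pipeline selection and drift detection components.

```python
def run_automl_loop(batches, select_pipeline, detect_drift):
    """Sketch of the monitoring loop: re-run pipeline selection on drift,
    otherwise keep training the current best pipeline incrementally.

    `select_pipeline(X, y)` and `detect_drift(pipeline, X, y)` are
    hypothetical callables, not SAIL API.
    """
    pipeline = None
    for X, y in batches:
        if pipeline is None or detect_drift(pipeline, X, y):
            # first batch, or significant drift: reset and re-run the
            # hyperparameter search to produce a new base pipeline
            pipeline = select_pipeline(X, y)
        else:
            # drift below threshold: keep training the last best pipeline
            pipeline.partial_fit(X, y)
    return pipeline
```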
Drift Detection
Various drift detectors are available through the River package. The AutoML Pipeline works with any custom or built-in drift detection algorithm.
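As an example of a custom detector, here is a minimal Page-Hinkley sketch. It is not River's implementation; it only illustrates the update-per-observation interface such detectors expose: each call returns whether drift has been flagged.

```python
class PageHinkley:
    """Minimal Page-Hinkley drift detector (illustrative sketch).

    Signals drift when observations rise persistently above their
    running mean by more than `threshold`.
    """
    def __init__(self, delta=0.005, threshold=10.0):
        self.delta = delta          # tolerance for small fluctuations
        self.threshold = threshold  # drift alarm level
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0
        self.min_cum = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n    # running mean of the stream
        self.cum += x - self.mean - self.delta   # cumulative positive deviation
        self.min_cum = min(self.min_cum, self.cum)
        return (self.cum - self.min_cum) > self.threshold

detector = PageHinkley(threshold=5.0)
stable = [detector.update(0.0) for _ in range(100)]   # stationary stream: no alarms
shifted = [detector.update(1.0) for _ in range(50)]   # mean jumps: drift is flagged
```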