Step-by-step guide to AutoML training - IBM/sail GitHub Wiki

Model Definition

Define and create instances of incremental models to tune with SAIL AutoML. Choose any model from sail/models.

logistic_reg = LogisticRegression(optimizer=optim.SGD(0.1))
random_forest = AdaptiveRandomForestClassifier(n_models=10)

Create SAIL Pipeline

Define the pipeline steps as a list of (name, transformer) tuples. The last element of the list must be an estimator, either a classifier or a regressor. The estimator can be an incremental model or the string "passthrough", which marks a placeholder that is replaced with different estimators during the tuning process. SAILPipeline takes this list as its steps argument.

  • Choose transformers from sail/transformers.
  • Currently, SAIL supports mixing and matching transformers from the river and scikit-learn packages.
steps = [
    ("Imputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("standard_scalar", StandardScaler()),
    ("classifier", "passthrough"),
]
sail_pipeline = SAILPipeline(steps=steps)

Define the hyper-parameter grid to explore with SAILAutoPipeline.

  • Here, “passthrough” means the step is skipped, i.e. no transformer is applied.
  • The grid can be passed as a list of dictionaries.
params_grid = [
    {
        "classifier": [logistic_reg],
        "classifier__l2": [0.1, 0.9],
        "classifier__intercept_init": [0.2, 0.5],
    },
    {
        "classifier": [random_forest],
        "classifier__n_models": [5, 10],
        "Imputer": ["passthrough"],
    },
]
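
To see how many candidate configurations a grid like this produces, the sketch below expands a list of grid dictionaries into individual parameter sets. It assumes the grid is expanded as a Cartesian product within each dictionary, in the style of scikit-learn's ParameterGrid; the helper expand_grid is hypothetical and not part of SAIL, and plain strings stand in for the model instances.

```python
from itertools import product


def expand_grid(params_grid):
    """Expand a list of grid dictionaries into individual configurations.

    Hypothetical helper for illustration only; it is not part of SAIL.
    """
    configs = []
    for grid in params_grid:
        keys = sorted(grid)
        # Cartesian product of the candidate values within one dictionary.
        for values in product(*(grid[key] for key in keys)):
            configs.append(dict(zip(keys, values)))
    return configs


# Placeholder strings stand in for the model instances defined earlier.
example_grid = [
    {
        "classifier": ["logistic_reg"],
        "classifier__l2": [0.1, 0.9],
        "classifier__intercept_init": [0.2, 0.5],
    },
    {
        "classifier": ["random_forest"],
        "classifier__n_models": [5, 10],
        "Imputer": ["passthrough"],
    },
]

configs = expand_grid(example_grid)
print(len(configs))  # 2 * 2 from the first dictionary + 2 from the second = 6
for config in configs:
    print(config)
```

Each dictionary is expanded independently, so the logistic regression parameters are never combined with the random forest ones.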

Create an instance of the SAILAutoPipeline

auto_pipeline = SAILAutoPipeline(
    pipeline=sail_pipeline,
    pipeline_params_grid=params_grid,
    search_method=SAILTuneGridSearchCV,
    search_method_params={
        "max_iters": 1,
        "early_stopping": False,
        "mode": "max",
        "scoring": "accuracy",
        "pipeline_auto_early_stop": False,
        "keep_best_configurations": 2
    },
    search_data_size=1000,
    incremental_training=True,
    scoring=metrics.Accuracy,
    drift_detector=ADWIN(delta=0.001),
    pipeline_strategy="DetectAndIncrement",
)

As shown above, the SAILAutoPipeline class takes the following parameters:

  • pipeline: instance of the SAIL Pipeline.
  • pipeline_params_grid: parameter grid for hyper-parameter tuning.
  • search_method: tuning method from Ray Tune, mainly TuneGridSearchCV and TuneSearchCV (the example above passes SAILTuneGridSearchCV).
  • search_method_params: search parameters to pass to the tuning method.
  • search_data_size: size of the data batch to use for tuning (1000 in the example above).
  • incremental_training: continue incremental learning in SAILAutoPipeline after the best pipeline is selected.
  • scoring: the scoring metric to track cumulative evaluation of the best pipeline. It must be from river.metrics.
  • drift_detector: instance of a drift detector from the River library.
  • pipeline_strategy: one of the Pipeline Strategies.

Collect data and start training

  • Invoke the SAILAutoPipeline.train() method with the input features (X) and target variable (y).
  • Additionally, classifier__classes, containing all eligible class labels, must be passed if incremental training is enabled and the final estimator is a classifier.
import pandas as pd

X = pd.read_csv("datasets/agrawal.csv").head(50000)
y = X["class"]
X.drop("class", axis=1, inplace=True)

y_preds = []
y_true = []
batch_size = 50

start = 0
for end in range(batch_size, 2001, batch_size):
    X_train = X.iloc[start:end]
    y_train = y.iloc[start:end]

    # search_data_size is 1000, so predictions are collected only after that point.
    if end > 1000:
        # Predict before training on the batch, so it is scored as unseen data.
        preds = auto_pipeline.predict(X_train)
        y_preds.extend(list(preds))
        y_true.extend(list(y_train))

    auto_pipeline.train(X_train, y_train, classifier__classes=[1, 0])
    start = end
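
The loop above follows a test-then-train pattern: once the warm-up data has been consumed, each batch is first used for prediction and only then for training. The sketch below reproduces that pattern with a toy stand-in so it can be run without SAIL. MajorityClassModel, the synthetic stream, and warmup_size are all assumptions made for illustration; they are not part of the SAIL API.

```python
from collections import Counter


class MajorityClassModel:
    """Hypothetical stand-in for an incremental model.

    It predicts the most frequent label seen so far.
    """

    def __init__(self):
        self.counts = Counter()

    def train(self, X_batch, y_batch):
        self.counts.update(y_batch)

    def predict(self, X_batch):
        label = self.counts.most_common(1)[0][0]
        return [label] * len(X_batch)


# Toy stream: 200 samples, label 1 for three out of every four samples.
X = [[i] for i in range(200)]
y = [0 if i % 4 == 0 else 1 for i in range(200)]

model = MajorityClassModel()
warmup_size = 100  # plays the role of search_data_size in this sketch
batch_size = 50
y_true, y_preds = [], []

start = 0
for end in range(batch_size, len(X) + 1, batch_size):
    X_batch, y_batch = X[start:end], y[start:end]
    if end > warmup_size:
        # Test first: the model has not seen this batch yet.
        y_preds.extend(model.predict(X_batch))
        y_true.extend(y_batch)
    # Then train on the same batch.
    model.train(X_batch, y_batch)
    start = end

accuracy = sum(t == p for t, p in zip(y_true, y_preds)) / len(y_true)
print(f"scored {len(y_true)} samples, accuracy = {accuracy:.2f}")
```

Because every scored batch is predicted before the model trains on it, the collected y_true and y_preds reflect performance on data the model had not yet seen.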

Get classification report of the SAIL Pipeline

from sklearn.metrics import confusion_matrix, classification_report
print(classification_report(y_true, y_preds))

images/report.png

Plot confusion matrix

import numpy as np
import seaborn as sns

# Divide by the total count so each cell shows its share of all predictions.
cf_matrix = confusion_matrix(y_true, y_preds)
sns.heatmap(cf_matrix / np.sum(cf_matrix), annot=True, fmt=".2%", cmap="Blues")

images/confusion_matrix.png
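
To make the normalization in the heatmap call concrete, the sketch below builds a small confusion matrix by hand and divides every cell by the total number of samples. It uses made-up labels and plain Python rather than scikit-learn, but follows the same layout as confusion_matrix: rows are true labels and columns are predicted labels.

```python
from collections import Counter

# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_preds = [1, 0, 0, 1, 0, 1, 1, 0]
labels = [0, 1]

# Count each (true, predicted) pair; rows are true labels, columns are predictions.
pair_counts = Counter(zip(y_true, y_preds))
cf_matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

# Divide by the grand total so the cells sum to 1.
total = sum(sum(row) for row in cf_matrix)
normalized = [[cell / total for cell in row] for row in cf_matrix]

print(cf_matrix)   # [[3, 1], [1, 3]]
print(normalized)  # [[0.375, 0.125], [0.125, 0.375]]
```

With this normalization, the diagonal cells add up to the overall accuracy (0.75 here).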