Step-by-step guide to AutoML training - IBM/sail GitHub Wiki
Model Definition
Define and create instances of incremental models to tune with SAIL AutoML. Choose any model from sail/models.
logistic_reg = LogisticRegression(optimizer=optim.SGD(0.1))
random_forest = AdaptiveRandomForestClassifier(n_models=10)
Create SAIL Pipeline
Define the pipeline steps as a list of (name, transformer) pairs. The last element of the list must be an estimator, either a classifier or a regressor. It can be an incremental model or the string "passthrough", which marks a placeholder that is replaced with different estimators during the tuning process. SAILPipeline takes this list as its steps argument.
- Choose transformers from sail/transformers.
- SAIL currently supports mixing and matching transformers from the river and scikit-learn packages.
steps = [
    ("Imputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("standard_scalar", StandardScaler()),
    ("classifier", "passthrough"),
]
sail_pipeline = SAILPipeline(steps=steps)
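To make the role of "passthrough" concrete, here is a minimal sketch of how a step list could be resolved for one candidate configuration. It is plain Python rather than SAIL code: `resolve_steps` and the string stand-ins for transformers are hypothetical names introduced for illustration, and the real SAILPipeline may handle these cases differently.

```python
# Illustrative only: a toy model of a pipeline step list, not SAIL's implementation.

def resolve_steps(steps, overrides=None):
    """Return the active (name, object) pairs for one configuration.

    steps     -- list of (name, object) pairs; the last one is the estimator slot
    overrides -- dict mapping a step name to a replacement object
    """
    overrides = overrides or {}
    resolved = [(name, overrides.get(name, obj)) for name, obj in steps]

    *transformers, (est_name, estimator) = resolved
    if estimator == "passthrough":
        raise ValueError(f"step '{est_name}' still needs a concrete estimator")

    # A transformer set to "passthrough" is skipped entirely.
    active = [(n, t) for n, t in transformers if t != "passthrough"]
    return active + [(est_name, estimator)]


steps = [
    ("Imputer", "mean-imputer"),      # strings stand in for real objects
    ("standard_scalar", "scaler"),
    ("classifier", "passthrough"),    # placeholder filled in during tuning
]

config = {"classifier": "random-forest", "Imputer": "passthrough"}
print([name for name, _ in resolve_steps(steps, config)])
```

In this toy model the estimator slot must be filled before the pipeline can run, while a transformer slot set to "passthrough" simply drops out of the chain.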
Define the hyperparameter grid to explore with SAILAutoPipeline.
- Here, "passthrough" as the value of a transformer step means that the step is skipped for that configuration.
- The grid can also be passed as a list of dictionaries, each describing a separate search space.
params_grid = [
    {
        "classifier": [logistic_reg],
        "classifier__l2": [0.1, 0.9],
        "classifier__intercept_init": [0.2, 0.5],
    },
    {
        "classifier": [random_forest],
        "classifier__n_models": [5, 10],
        "Imputer": ["passthrough"],
    },
]
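The grid above uses the `step__param` naming convention (step name, double underscore, parameter name) and a list of dictionaries. As a rough sketch, the snippet below enumerates the candidates the way an exhaustive grid search commonly does, as scikit-learn's ParameterGrid does for a list of dictionaries. It assumes SAILTuneGridSearchCV follows the same convention; `expand_grid` is a hypothetical helper, and strings stand in for the estimator instances.

```python
from itertools import product


def expand_grid(params_grid):
    """Yield one dict per candidate configuration, grid by grid."""
    for grid in params_grid:
        keys = list(grid)
        # Cartesian product of the value lists within a single dictionary.
        for values in product(*(grid[k] for k in keys)):
            yield dict(zip(keys, values))


# Strings stand in for the estimator instances used in the guide.
params_grid = [
    {
        "classifier": ["logistic_reg"],
        "classifier__l2": [0.1, 0.9],
        "classifier__intercept_init": [0.2, 0.5],
    },
    {
        "classifier": ["random_forest"],
        "classifier__n_models": [5, 10],
        "Imputer": ["passthrough"],
    },
]

candidates = list(expand_grid(params_grid))
print(len(candidates))  # 4 from the first dict + 2 from the second
for candidate in candidates:
    print(candidate)
```

Values are combined only within a dictionary, never across dictionaries, so the logistic regression parameters are never paired with the random forest.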
Create an instance of the SAILAutoPipeline
auto_pipeline = SAILAutoPipeline(
    pipeline=sail_pipeline,
    pipeline_params_grid=params_grid,
    search_method=SAILTuneGridSearchCV,
    search_method_params={
        "max_iters": 1,
        "early_stopping": False,
        "mode": "max",
        "scoring": "accuracy",
        "pipeline_auto_early_stop": False,
        "keep_best_configurations": 2
    },
    search_data_size=1000,
    incremental_training=True,
    scoring=metrics.Accuracy,
    drift_detector=ADWIN(delta=0.001),
    pipeline_strategy="DetectAndIncrement",
)
As shown above, the SAILAutoPipeline class takes the following parameters:
- pipeline: an instance of SAILPipeline.
- pipeline_params_grid: the parameter grid for hyperparameter tuning.
- search_method: the tuning method from Ray Tune, mainly TuneGridSearchCV and TuneSearchCV (the example above uses SAILTuneGridSearchCV).
- search_method_params: search parameters to pass to the tuning method.
- search_data_size: the amount of data to use for tuning (1000 in the example above).
- incremental_training: whether to continue incremental learning in SAILAutoPipeline after the best pipeline is selected.
- scoring: the scoring metric used to track the cumulative evaluation of the best pipeline. It must come from river.metrics.
- drift_detector: an instance of a drift detector from the River library.
- pipeline_strategy: one of the Pipeline Strategies, for example "DetectAndIncrement" as used above.
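The scoring parameter tracks a cumulative evaluation, meaning the score is updated as new batches arrive rather than recomputed from scratch. The sketch below shows that idea with a small stand-alone class. `RunningAccuracy` is a hypothetical illustration, loosely modelled on the update-then-read style of streaming metrics. It is not the river.metrics.Accuracy implementation.

```python
class RunningAccuracy:
    """Cumulative accuracy updated one batch at a time (illustrative stand-in)."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, y_true, y_pred):
        # Fold a new batch of labels and predictions into the running counts.
        for truth, pred in zip(y_true, y_pred):
            self.correct += int(truth == pred)
            self.total += 1
        return self

    def get(self):
        # Fraction of correct predictions seen so far; 0.0 before any update.
        return self.correct / self.total if self.total else 0.0


score = RunningAccuracy()
score.update([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 correct
score.update([0, 0], [0, 1])              # 1 of 2 correct
print(score.get())                        # 4 of 6 correct overall
```

Only two counters are stored, so the metric stays cheap no matter how long the stream runs.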
Collect data and start training
- Invoke the SAILAutoPipeline.train() method with the input features (X) and the target variable (y).
- If incremental training is enabled and the final estimator is a classifier, you must also pass classifier__classes, containing all eligible class labels.
import pandas as pd

# Load the first 50,000 rows and separate the target column.
X = pd.read_csv("datasets/agrawal.csv").head(50000)
y = X["class"]
X.drop("class", axis=1, inplace=True)

y_preds = []
y_true = []
batch_size = 50

start = 0
for end in range(50, 2001, batch_size):
    X_train = X.iloc[start:end]
    y_train = y.iloc[start:end]

    # Predict only after the first 1000 samples (search_data_size) have been seen.
    if end > 1000:
        preds = auto_pipeline.predict(X_train)
        y_preds.extend(list(preds))
        y_true.extend(list(y_train))

    auto_pipeline.train(X_train, y_train, classifier__classes=[1, 0])
    start = end
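The loop above follows a test-then-train pattern. Every batch is passed to train(), and batches that end after the search window are first passed to predict(). The stand-alone sketch below reproduces only the index arithmetic, so the number of batches and scored rows can be checked without SAIL or the dataset. `batch_bounds` is a hypothetical helper.

```python
def batch_bounds(n_rows, batch_size):
    """Return (start, end) index pairs covering n_rows in fixed-size batches."""
    return [(start, min(start + batch_size, n_rows))
            for start in range(0, n_rows, batch_size)]


search_data_size = 1000  # same value as in the guide
bounds = batch_bounds(n_rows=2000, batch_size=50)

# Batches ending after the search window are predicted first, then trained on.
scored = [(s, e) for s, e in bounds if e > search_data_size]

print(len(bounds))                    # number of batches passed to train()
print(len(scored))                    # number of batches also passed to predict()
print(sum(e - s for s, e in scored))  # rows that end up in y_true / y_preds
```

With these settings, the first half of the 2000 rows feeds the search and the second half contributes to the classification report.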
Get classification report of the SAIL Pipeline
from sklearn.metrics import confusion_matrix, classification_report
print(classification_report(y_true, y_preds))
Plot confusion matrix
import numpy as np
import seaborn as sns

cf_matrix = confusion_matrix(y_true, y_preds)
# Divide by the grand total so each cell shows its share of all predictions.
sns.heatmap(cf_matrix / np.sum(cf_matrix), annot=True, fmt='.2%', cmap='Blues')
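Dividing the confusion matrix by its grand total turns the counts into shares of all predictions, which is what the heatmap displays as percentages. The following dependency-free sketch shows the same computation on a small made-up label set. The helper names and the sample labels are illustrative and are not taken from the agrawal dataset.

```python
def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def normalize_by_total(matrix):
    """Divide every cell by the grand total, so all cells sum to 1."""
    total = sum(sum(row) for row in matrix)
    return [[cell / total for cell in row] for row in matrix]


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred, labels=[0, 1])
shares = normalize_by_total(counts)
print(counts)
print(shares)
```

Normalizing by the grand total differs from normalizing each row. The cells sum to 1 across the whole matrix, so the diagonal adds up to the overall accuracy.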