lightGBM

Simple usage

Classifier

from lightgbm import LGBMClassifier
model = LGBMClassifier(n_jobs=-1, random_state=0, num_leaves=2**7-1, boosting_type="dart")
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric='logloss',
          verbose=True,
          early_stopping_rounds=10)
y_pred = model.predict(X_test)
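
If class probabilities are needed rather than hard labels (for example to compute AUC), predict_proba can be called on the same fitted model; a small follow-up to the snippet above:

y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class in the binary case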

Regressor

from lightgbm import LGBMRegressor
model = LGBMRegressor(n_jobs=-1, random_state=0, num_leaves=2**7-1, boosting_type="dart")
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric='mse',
          verbose=True,
          early_stopping_rounds=10)
y_pred = model.predict(X_test, num_iteration=model.best_iteration_)

Parameters

Instance

  • num_leaves (int, optional (default=31)) – Maximum tree leaves for base learners.
  • n_estimators (int, optional (default=100)) – Number of boosted trees to fit.
  • boosting_type (string, optional (default='gbdt')) – ‘gbdt’, traditional Gradient Boosting Decision Tree. ‘dart’, Dropouts meet Multiple Additive Regression Trees. ‘goss’, Gradient-based One-Side Sampling. ‘rf’, Random Forest.
  • importance_type (string, optional (default='split')) – The type of feature importance to be filled into feature_importances_. If ‘split’, result contains numbers of times the feature is used in a model. If ‘gain’, result contains total gains of splits which use the feature.
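
A minimal sketch of how these constructor parameters fit together, reusing the X_train/y_train from the examples above; with importance_type='gain', feature_importances_ reports total split gain instead of split counts:

from lightgbm import LGBMClassifier

model = LGBMClassifier(boosting_type='gbdt', n_estimators=200, num_leaves=63,
                       importance_type='gain', random_state=0)
model.fit(X_train, y_train)
print(model.feature_importances_)  # total gain contributed by each feature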

Fit

  • eval_metric (string, list of strings, callable or None, optional (default=None)) – If string, it should be a built-in evaluation metric to use. If callable, it should be a custom evaluation metric, see note below for more details. In either case, the metric from the model parameters will be evaluated and used as well. Default: ‘l2’ for LGBMRegressor, ‘logloss’ for LGBMClassifier, ‘ndcg’ for LGBMRanker.
  • categorical_feature (list of strings or int, or 'auto', optional (default='auto')) – Categorical features. If list of int, interpreted as indices. If list of strings, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values.
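
A sketch of the callable form of eval_metric, using the sklearn-API signature (y_true, y_pred) -> (eval_name, eval_result, is_higher_better); the rmse function here is only for illustration, since an equivalent built-in metric already exists:

import numpy as np
from lightgbm import LGBMRegressor

def rmse(y_true, y_pred):
    # custom metric must return (name, value, is_higher_better)
    return 'rmse', np.sqrt(np.mean((y_true - y_pred) ** 2)), False

model = LGBMRegressor(n_jobs=-1, random_state=0)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric=rmse)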

Use case

Plot Feature Importance

from lightgbm import plot_importance
import matplotlib.pyplot as plt
plot_importance(model, figsize=(10, 10))
plt.show()

Plot Learning History

from lightgbm import plot_metric
plot_metric(model, figsize=(10, 10))  # requires a model fit with eval_set so the evaluation history is recorded

Select Feature From Importance

from lightgbm import LGBMClassifier

model=LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=32, colsample_bytree=0.2,
            reg_alpha=3, reg_lambda=1, min_split_gain=0.01, min_child_weight=40)

# Feature Selection
from sklearn.feature_selection import SelectFromModel
featureSelector = SelectFromModel(model, threshold='1.25*median')
featureSelector.fit(X_train, y_train)
featureSelectorSupport = featureSelector.get_support()
selectedColumns = X_train.loc[:, featureSelectorSupport].columns.tolist()  # boolean mask mapped back to column names
X_train = X_train[selectedColumns]
X_test = X_test[selectedColumns]
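
After both sets are reduced to the selected columns, the same model can simply be refit on the smaller feature set, e.g.:

model.fit(X_train, y_train)
y_pred = model.predict(X_test)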