Models and Evaluation

Evaluation metrics:

  • Root Mean Square Error (RMSE): RMSE is the square root of the Mean Squared Error (MSE), which is the average of the squared differences between predicted and actual values. It penalizes large errors more heavily, making it sensitive to outliers.

  • Mean Absolute Error (MAE): MAE is the average of the absolute differences between predicted and actual values. It gives a more intuitive sense of the average prediction error but does not penalize outliers as heavily as RMSE.

  • R-squared (Coefficient of Determination): R² is a statistical metric used to assess the goodness of fit of a model. Its value is at most 1: a value of 1 means the model fits the data perfectly, with no difference between predicted and actual values, while a value of 0 means the model does no better than always predicting the mean of the target. On test data R² can even be negative, meaning the model fits worse than the mean predictor (as the Decision Tree does below).

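For reference, with $y_i$ the actual value, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the actual values, and $n$ the number of samples, the three metrics are defined as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
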
Models:

To solve this regression task, we used the following models:

  • Linear regression: we started with linear regression because it is a simple and interpretable model. It serves as a baseline and shows how well the target variable can be explained by a linear combination of the input features.
  • We then tried Decision Trees and Random Forests because they are good at capturing non-linear relationships and feature interactions.
  • We also tried boosting algorithms (XGBoost, Explainable Boosting Machine (EBM), and AdaBoost) because they are powerful for improving predictive performance. A sketch of the full model set follows this list.
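
A minimal sketch of how this model set can be built, assuming scikit-learn, xgboost, and the interpret package for the EBM (`X_train` and `y_train` are hypothetical names for the training data, not taken from the project code):

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from interpret.glassbox import ExplainableBoostingRegressor

# One entry per model family tried in this project
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
    "AdaBoost": AdaBoostRegressor(random_state=42),
    "EBM": ExplainableBoostingRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)  # assumes X_train / y_train already exist
```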

Discussion:

After implementing the models above, linear regression and the decision tree gave very low R² scores (0.05 and -0.3 respectively), which means these models fit the data very poorly. Random forest gave good results on the training data but not on the test data, which means it overfits: the model is too complex, so it fits the training data very well but cannot generalize to new, unseen data. The boosting models generalized better, but we still observed some overfitting. To further study the generalization ability of each model on unseen data, we ran k-fold cross-validation with the cross_validate function from the scikit-learn library. Finally, to reduce the gap between training and test performance, we fine-tuned the hyperparameters with the grid search technique (a sketch follows this paragraph). XGBoost was the best model.
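
A minimal sketch of this evaluation loop for XGBoost, assuming scikit-learn and the hypothetical `X`/`y` arrays from above (the parameter grid shown is illustrative, not the grid actually used in the project):

```python
from sklearn.model_selection import cross_validate, GridSearchCV
from xgboost import XGBRegressor

# 5-fold cross-validation on the three metrics used in this section
cv_results = cross_validate(
    XGBRegressor(random_state=42),
    X, y,
    cv=5,
    scoring=["neg_root_mean_squared_error", "neg_mean_absolute_error", "r2"],
    return_train_score=True,  # compare train vs. test scores to spot overfitting
)

# Grid search over an illustrative hyperparameter grid
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    XGBRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)  # best settings and their RMSE
```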

| Model | MAE | RMSE | R² |
| --- | --- | --- | --- |
| Linear Regression | 0.2262 | 0.3677 | 0.05 |
| Decision Tree | 0.2807 | 0.4309 | -0.3 |
| Random Forest | 0.2027 | 0.3238 | 0.26 |
| XGBoost | 0.205 | 0.3276 | 0.26 |
| AdaBoost | 0.2307 | 0.3334 | 0.22 |
| EBM | 0.2096 | 0.3236 | 0.26 |

AutoML:

In order to find the best model automatically, we also tried the AutoML feature of Azure Machine Learning.

To do so, we first created our workspace and then launched an AutoML task. We fed the task with the data coming from the preprocessing steps with outliers kept (with_outliers.csv). After about 45 minutes, we got the results. Our settings were the following: automatic validation set creation and a random 70%/30% split between the training and test sets. The metric proposed by Azure that best fit our task was the normalized RMSE (NRMSE). A sketch of such a configuration is given below.
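
A minimal sketch of this setup with the Azure ML Python SDK v1 (the datastore path, experiment name, label column, and timeout are hypothetical, not taken from the project):

```python
from azureml.core import Workspace, Dataset, Experiment
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()  # reads the config.json downloaded from the Azure portal

# Load the preprocessed data and split it 70% / 30%
data = Dataset.Tabular.from_delimited_files(
    path=(ws.get_default_datastore(), "with_outliers.csv")  # hypothetical datastore path
)
train_ds, test_ds = data.random_split(percentage=0.7, seed=42)

automl_config = AutoMLConfig(
    task="regression",
    primary_metric="normalized_root_mean_squared_error",  # the NRMSE metric
    training_data=train_ds,
    label_column_name="average_rating",  # hypothetical target column name
    experiment_timeout_hours=1,
)

run = Experiment(ws, "book-rating-automl").submit(automl_config, show_output=True)
```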

The best results came from two ensemble methods: VotingEnsemble and StackEnsemble.

The models used in these ensemble methods were the first five that gave the best results:

  1. Random Forest with a MinMaxScaler preprocessor (Random Forest hyperparameters: "bootstrap": false, "max_features": 0.7, "n_estimators": 25)
  2. Random Forest with a MinMaxScaler preprocessor (Random Forest hyperparameters: "bootstrap": true, "max_features": "sqrt", "n_estimators": 25)
  3. LightGBM with a MaxAbsScaler preprocessor (LightGBM hyperparameters: "min_data_in_leaf": 20)
  4. Random Forest with a StandardScalerWrapper preprocessor ("with_mean": true, "with_std": true) (Random Forest hyperparameters: "bootstrap": true, "max_features": 0.4, "min_samples_leaf": 0.005080937188890647, "min_samples_split": 0.0008991789964660114, "n_estimators": 50)
  5. XGBoostRegressor with a MaxAbsScaler preprocessor (XGBoostRegressor hyperparameters: "tree_method": "auto")

| Model | NRMSE | MAE | RMSE | R² |
| --- | --- | --- | --- | --- |
| 1 | 0.06112 | 0.20715 | 0.30561 | 0.19516 |
| 2 | 0.061216 | 0.20780 | 0.30608 | 0.19277 |
| 3 | 0.061575 | 0.20283 | 0.30787 | 0.18066 |
| 4 | 0.061804 | 0.20793 | 0.30902 | 0.17753 |
| 5 | 0.062687 | 0.20392 | 0.31343 | 0.15162 |
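
As an illustration, the five base models above roughly correspond to the following scikit-learn pipelines (a local approximation, not the exact Azure implementation; StandardScaler stands in for Azure's StandardScalerWrapper):

```python
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler
from xgboost import XGBRegressor

# Local stand-ins for the five AutoML base models, in the order listed above
base_models = [
    make_pipeline(MinMaxScaler(),
                  RandomForestRegressor(bootstrap=False, max_features=0.7, n_estimators=25)),
    make_pipeline(MinMaxScaler(),
                  RandomForestRegressor(bootstrap=True, max_features="sqrt", n_estimators=25)),
    make_pipeline(MaxAbsScaler(), LGBMRegressor(min_data_in_leaf=20)),
    make_pipeline(StandardScaler(with_mean=True, with_std=True),
                  RandomForestRegressor(bootstrap=True, max_features=0.4,
                                        min_samples_leaf=0.005080937188890647,
                                        min_samples_split=0.0008991789964660114,
                                        n_estimators=50)),
    make_pipeline(MaxAbsScaler(), XGBRegressor(tree_method="auto")),
]
```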

Finally, the ensemble methods tested by the AutoML task were:

  • VotingEnsemble, a weighted average of the base models' predictions (weights: model 1: 0.2, model 2: 0.266, model 3: 0.1333, model 4: 0.0667, model 5: 0.3333); a sketch follows the table below
  • StackEnsemble

| Model | NRMSE | MAE | RMSE | R² |
| --- | --- | --- | --- | --- |
| VotingEnsemble | 0.059761 | 0.19929 | 0.29880 | 0.23028 |
| StackEnsemble | 0.060298 | 0.19996 | 0.30149 | 0.21649 |
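
A minimal sketch of how such a weighted voting ensemble combines predictions, using scikit-learn's VotingRegressor as a stand-in for Azure's VotingEnsemble and reusing the hypothetical `base_models` list from the sketch above:

```python
from sklearn.ensemble import VotingRegressor

# Weighted average of the five base models' predictions, using the AutoML-reported weights
voting = VotingRegressor(
    estimators=[(f"m{i}", m) for i, m in enumerate(base_models, start=1)],
    weights=[0.2, 0.266, 0.1333, 0.0667, 0.3333],
)
voting.fit(X_train, y_train)      # hypothetical training data
predictions = voting.predict(X_test)
```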

For the VotingEnsemble model, the most important features were the following: