Project Workflow 2 - sideround/project-ml-onlineshop GitHub Wiki

1. Data cleaning, visualization, correlation, transformation

Clean column errors with a function
Create random column (to be used in the feature selection)
Column transformation
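
The random column step can be sketched as follows. This is a minimal illustration, not the project's code: the helper name `add_random_column`, the column name `random`, and the toy data are all assumptions. The idea is that a column of pure noise gives a baseline against which real feature importances can be compared later.

```python
import numpy as np
import pandas as pd


def add_random_column(df: pd.DataFrame, name: str = "random", seed: int = 42) -> pd.DataFrame:
    """Return a copy of df with an extra column of uniform noise in [0, 1)."""
    rng = np.random.default_rng(seed)
    out = df.copy()  # leave the caller's frame untouched
    out[name] = rng.random(len(out))
    return out


# Hypothetical toy data; the real project columns are not listed on this page.
toy = pd.DataFrame({"price": [10.0, 12.5, 9.9], "visits": [3, 7, 1]})
toy_with_noise = add_random_column(toy)
```

Fixing the seed keeps the noise column reproducible between runs, which makes later feature rankings easier to compare.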

Export the cleaned data to CSV before filling missing values

Fill missing values (ffill)

Export clean_no_nan to CSV
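
A small sketch of the forward-fill step, assuming pandas and a made-up frame (the column names `a` and `b` are placeholders). It also shows one thing worth checking before the export: forward fill cannot fill a value that has no earlier row to copy from, so leading NaNs survive.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps, including one in the very first row.
raw = pd.DataFrame(
    {
        "a": [np.nan, 1.0, np.nan, 4.0],
        "b": ["x", None, None, "y"],
    }
)

# Forward fill: each missing value takes the last valid value above it.
filled = raw.ffill()

# Count what is still missing; the first row of "a" has nothing above it.
remaining = filled.isna().sum()

# With no path, to_csv returns the CSV text; pass a path such as
# "clean_no_nan.csv" to write a file instead.
csv_text = filled.to_csv(index=False)
```

If `remaining` is non-zero for any column, the exported file will still contain missing values despite its name.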

2. ML Modeling

Encoding data

Export encoded data to CSV
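
This page does not say which encoder is used. As one possible approach, the sketch below one-hot encodes a categorical column with `pandas.get_dummies`; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column.
toy = pd.DataFrame({"month": ["Feb", "Mar", "Feb"], "visits": [3, 7, 1]})

# One-hot encode only the categorical column; numeric columns pass through.
encoded = pd.get_dummies(toy, columns=["month"], dtype=int)
```

An ordinal or label encoding would be an equally valid choice for tree-based models; one-hot is shown only because it needs no extra dependencies.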

define X and y
Balance data (undersample NearMiss)
Split data (train_test_split)
Scale data (StandardScaler)
Run each model
Write a short description of each model
- Do the feature selection with RandomForest
- Run the models with only the selected features (this will probably be done in the pipeline)
Visualize model results
Hyperparameter tuning for each model with GridSearch
Run each model with best parameters
Visualize model results
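
The steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's implementation: the data is synthetic, the feature names are placeholders, the parameter grid is arbitrary, and the NearMiss undersampling step (from imbalanced-learn) is left out to keep the example scikit-learn only. Feature selection here keeps every feature whose RandomForest importance beats the random noise column.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
rng = np.random.default_rng(0)
X = np.column_stack([X, rng.random(len(X))])  # last column is pure noise
feature_names = [f"f{i}" for i in range(6)] + ["random"]

# Split first, then fit the scaler on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Feature selection: keep features more important than the noise column.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train_s, y_train)
importances = dict(zip(feature_names, forest.feature_importances_))
selected = [
    name
    for name in feature_names
    if name != "random" and importances[name] > importances["random"]
]
cols = [feature_names.index(name) for name in selected]

# Hyperparameter tuning on the selected features only.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X_train_s[:, cols], y_train)
test_accuracy = grid.score(X_test_s[:, cols], y_test)
```

`grid.best_params_` then holds the combination to use when rerunning the model with its best parameters.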

3. ML Modeling (pipeline and deep learning)

Create a pipeline and try all models
Try with different scalers
Try with oversampling (SMOTE)
Other changes (e.g. different train/test split percentages)
Cross-validation with and without KFold
Run one deep learning model
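
A rough sketch of the pipeline part, assuming scikit-learn and synthetic data. The two scalers and two models are examples only, not the project's final list, and the SMOTE and deep learning steps are not covered here. Each scaler and model pair is wrapped in a `Pipeline` and scored with an explicit `KFold` splitter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the encoded dataset (assumption).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Example candidates; swap in whichever scalers and models are being compared.
scalers = {"standard": StandardScaler(), "minmax": MinMaxScaler()}
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Explicit splitter: 5 shuffled folds with a fixed seed.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for scaler_name, scaler in scalers.items():
    for model_name, model in models.items():
        pipe = Pipeline([("scale", scaler), ("model", model)])
        # The scaler is refit inside every fold, so no test-fold statistics leak.
        scores = cross_val_score(pipe, X, y, cv=kfold)
        results[(scaler_name, model_name)] = scores.mean()
```

Passing an integer such as `cv=5` instead of the `KFold` object lets scikit-learn pick its default splitter, which is one way to read "without KFold".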

1st part - Sosa (with Isaak if he has finished the function)
2nd part - Kristina with Sosa (Sosa taking care of visualizations)
3rd part - Pau and Jota
Readme - Isaak
Presentation - ?