Project Workflow 2 - sideround/project-ml-onlineshop GitHub Wiki

1. Data cleaning, visualization, correlation, transformation

  • Clean column errors with a function
  • Create a random column (to be used in feature selection)
  • Column transformation
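A minimal sketch of the three cleaning steps above, assuming pandas and NumPy. The column names, the `clean_columns` and `add_random_column` helpers, and the log transform are hypothetical illustrations, not the project's actual function or data.

```python
import numpy as np
import pandas as pd


def clean_columns(df: pd.DataFrame, text_cols: list[str]) -> pd.DataFrame:
    """Hypothetical cleaning helper: tidy column names, strip text, drop duplicates."""
    out = df.copy()
    # Normalise headers: " Product Duration " -> "product_duration"
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    # Remove stray whitespace in the listed text columns
    for col in text_cols:
        out[col] = out[col].str.strip()
    return out.drop_duplicates().reset_index(drop=True)


def add_random_column(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Add a uniform random column to compare against in feature selection."""
    out = df.copy()
    rng = np.random.default_rng(seed)
    out["random"] = rng.random(len(out))
    return out


# Toy data standing in for the real shop dataset (assumption)
raw = pd.DataFrame({
    " Product Duration ": [10.0, 10.0, 250.0, None],
    "Visitor Type": [" New", " New", "Returning ", "New"],
})

clean = add_random_column(clean_columns(raw, text_cols=["visitor_type"]))

# Example column transformation: log1p to reduce the skew of a duration column
clean["product_duration_log"] = np.log1p(clean["product_duration"])
```

Any feature that later ranks below the `random` column is a candidate for removal, since it carries no more signal than noise.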

Save the cleaned data to CSV (before filling missing values)

  • Fill missing values (ffill)

Save clean_no_nan to CSV
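The two CSV exports and the forward fill could look roughly like this with pandas. The toy DataFrame, the temporary output folder and the `clean.csv` file name are assumptions; only `clean_no_nan` and `ffill` come from the plan above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical output folder; the project would use its own data directory
out_dir = Path(tempfile.mkdtemp())

# Toy stand-in for the cleaned dataset, still containing missing values
clean = pd.DataFrame({
    "month": ["Feb", None, "Mar", None],
    "bounce_rate": [0.20, None, 0.10, 0.05],
})

# Export 1: cleaned data, missing values left as they are
clean.to_csv(out_dir / "clean.csv", index=False)

# Forward fill: each missing value takes the last valid value above it
clean_no_nan = clean.ffill()

# Export 2: cleaned data after filling
clean_no_nan.to_csv(out_dir / "clean_no_nan.csv", index=False)

# Sanity check: count of values that are still missing
remaining = int(clean_no_nan.isna().sum().sum())
```

Note that `ffill` cannot fill a missing value in the first row, because there is nothing above it to copy, so checking `remaining` before exporting is worthwhile.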

2. ML Modeling

  • Encoding data

Save the encoded data to CSV
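One possible way to do the encoding step, assuming one-hot encoding for nominal columns and 0/1 integers for booleans. The column names are hypothetical and the project may have used a different encoder.

```python
import pandas as pd

# Toy stand-in for clean_no_nan (hypothetical columns)
clean_no_nan = pd.DataFrame({
    "visitor_type": ["New", "Returning", "New"],
    "weekend": [False, True, False],
    "revenue": [True, False, False],
})

# One-hot encode the nominal column; dtype=int gives 0/1 instead of True/False
encoded = pd.get_dummies(clean_no_nan, columns=["visitor_type"], dtype=int)

# Convert boolean columns to 0/1 integers
bool_cols = ["weekend", "revenue"]
encoded[bool_cols] = encoded[bool_cols].astype(int)

# With no path, to_csv returns the CSV text; pass a file path to write to disk
csv_text = encoded.to_csv(index=False)
```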

  • Define X and y
  • Balance data (undersampling with NearMiss)
  • Split data (train_test_split)
  • Scale data (StandardScaler)
  • Run each model
  • Short description about each model
    • In RandomForest, do the feature selection
    • Run the models with only the selected features (probably done in the pipeline)
  • Visualize model results
  • Hyperparameter tuning for each model with GridSearch
  • Run each model with best parameters
  • Visualize model results

3. ML Modeling (pipeline and deep learning)

  • Create a pipeline and try all models
  • Try with different scalers
  • Try with oversampling (SMOTE)
  • Other changes (e.g. different train/test split percentages)
  • Cross-validation with and without KFold
  • Run one deep learning model

  • 1st part - Sosa (with Isaak, if he has finished the function)
  • 2nd part - Kristina with Sosa (Sosa taking care of visualizations)
  • 3rd part - Pau and Jota
  • Readme - Isaak
  • Presentation - ?