Project Workflow - sideround/project-ml-onlineshop GitHub Wiki

Information about the copies of the dataset:

  • original = raw (no changes)
  • data = copy of the original
  • clean = clean data, no errors, correct data types (objects and floats), with NaNs
  • clean_no_nan = clean data, NaNs filled with ffill
  • df_encoded = all categorical variables transformed to numeric
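As a rough illustration of how these copies relate to each other, here is a minimal sketch using pandas. The toy rows and the column names `ProductRelated_Duration` and `Month` are assumptions made up for the example (only `Revenue` appears in these notes), and the real data file is not named here.

```python
import pandas as pd

# Hypothetical toy rows standing in for the raw export (assumed columns).
original = pd.DataFrame({
    "ProductRelated_Duration": ["10.5", "bad", None, "42.0"],
    "Month": ["Feb", "Feb", "Mar", None],
    "Revenue": [False, True, False, False],
})

data = original.copy()  # working copy; `original` stays untouched

clean = data.copy()
# Coerce to float; unparseable entries become NaN instead of raising.
clean["ProductRelated_Duration"] = pd.to_numeric(
    clean["ProductRelated_Duration"], errors="coerce"
)

# Forward fill: each NaN takes the value of the previous row.
clean_no_nan = clean.ffill()

# Categorical column to 0/1 indicator columns, boolean target to 0/1.
df_encoded = pd.get_dummies(clean_no_nan, columns=["Month"], dtype=int)
df_encoded["Revenue"] = df_encoded["Revenue"].astype(int)
```

Note that a forward fill leaves a NaN in place when the very first row of a column is missing, so it may be worth checking `clean_no_nan.isna().sum()` afterwards.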

Workflow:

  • Jota - Data transformation/engineering, with small changes based on what we saw in the data visualization:
    • operating system: values 5-8 carry little information and show no relation to Revenue = True. Change 5, 6, 7, 8 to 'other'.
    • browser: change values 3, 7, 9, 11, 12, 13 to 'other'
    • TrafficType
  • Kristina - One-hot encoding for the categorical variables
    • change Revenue to 0 and 1 (label encoding), along with the other boolean variables
    • encode the remaining categorical variables with one-hot encoding
  • Create a copy of the encoded dataframe and export it to CSV so that each person can work on their own copy
  • Isaac - correlation check (heatmap) and class balance of the target
  • pairplot of numerical variables (after one-hot encoding, all of them should be numerical)
    • zoom in on the pairplot: plot only the variables that are interesting for us
  • Sosa - Split into X (features) and y (target), following the iris example: X, y = iris.data, iris.target
  • Split into train and test
    • from sklearn.model_selection import train_test_split
    • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
  • Scale (in the first code-along the split was done after scaling, but Paula said to split into train and test before scaling)
    • as an example, standardizing to zero mean and unit variance: X_scaled = preprocessing.scale(X)
  • Pau - First model to try: Logistic Regression, on the data as we have it now (with outliers, and missing data filled with ffill).
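The steps above could be strung together roughly as follows. This is only a sketch on synthetic data: the column names `OperatingSystems`, `Browser` and `Duration`, the random values, and the helper `group_rare` are all assumptions for illustration, not the team's actual code. The scaler is fitted on the training rows only, which is one way to respect the "split before scaling" advice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for clean_no_nan (made-up values, assumed column names).
df = pd.DataFrame({
    "OperatingSystems": rng.integers(1, 9, n),
    "Browser": rng.integers(1, 14, n),
    "Duration": rng.exponential(100.0, n),
    "Revenue": rng.random(n) < 0.2,
})

def group_rare(series, rare_codes):
    """Hypothetical helper: codes as strings, listed codes become 'other'."""
    return series.astype(str).where(~series.isin(rare_codes), "other")

df["OperatingSystems"] = group_rare(df["OperatingSystems"], [5, 6, 7, 8])
df["Browser"] = group_rare(df["Browser"], [3, 7, 9, 11, 12, 13])

# Boolean target to 0/1, remaining categoricals to one-hot columns.
df["Revenue"] = df["Revenue"].astype(int)
df_encoded = pd.get_dummies(df, columns=["OperatingSystems", "Browser"], dtype=int)

X = df_encoded.drop(columns="Revenue")
y = df_encoded["Revenue"]

# Split first, then scale.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
print("test accuracy:", model.score(X_test_scaled, y_test))
```

For the correlation heatmap, `df_encoded.corr()` would give the matrix to plot. Accuracy on random data like this is meaningless; the point is only the order of the steps.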

To do for Tuesday/Wednesday

Hyperparameter tuning and other changes:

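  • Balance the Revenue classes (stratify the split, downsample the majority class)
  • Create more variables: cluster the sessions into 3 types and use the cluster as a new variable (this calls for a clustering method such as k-means; kNN is a supervised classifier, not a clustering method)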
  • Try with and without outliers
  • Try different ways of handling NaNs
  • Try different models with different parameters. Each person to try 3 models
  • We will need to minimize the false positives in the confusion matrix (that is, aim for high precision)
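A few of these ideas could be sketched as below, again on synthetic data. The features `f1` and `f2`, the made-up target rule, and the column name `session_type` are assumptions for illustration. k-means is used for the 3 session types, which is a swapped-in technique (kNN does not produce clusters).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 500
# Synthetic, imbalanced stand-in for the encoded data (assumed columns).
df = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
df["Revenue"] = (df["f1"] + rng.normal(scale=0.5, size=n) > 1.0).astype(int)

X = df[["f1", "f2"]]
y = df["Revenue"]

# stratify=y keeps the Revenue ratio the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# New variable: 3 session types from k-means, fitted on training rows only.
# Kept as a plain integer here for brevity; one-hot encoding it would be
# the more careful choice, since the labels have no order.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_train)
X_train = X_train.assign(session_type=kmeans.labels_)
X_test = X_test.assign(session_type=kmeans.predict(X_test))

# Downsample the majority class in the training set only.
train = X_train.assign(Revenue=y_train)
pos = train[train["Revenue"] == 1]
neg = train[train["Revenue"] == 0].sample(n=len(pos), random_state=1)
balanced = pd.concat([pos, neg]).sample(frac=1, random_state=1)  # shuffle

model = LogisticRegression(max_iter=1000).fit(
    balanced.drop(columns="Revenue"), balanced["Revenue"]
)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
precision = precision_score(y_test, y_pred, zero_division=0)
print("false positives:", fp, "precision:", precision)
```

The test set is left imbalanced on purpose, so the false-positive count reflects the class ratio the model would see in practice.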