Project Workflow - sideround/project-ml-onlineshop GitHub Wiki

original = raw (no changes)
data = copy of the original
clean = cleaned data: errors fixed, correct data types (objects and floats), NaNs still present
clean_no_nan = clean data with NaNs filled by forward fill (ffill)
df_encoded = all categorical variables transformed to numeric
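
A minimal sketch of how the first four names could relate, assuming pandas. The tiny dataframe and its column names (`Browser`, `BounceRates`) are made up for illustration; they stand in for the real raw file, which is not shown here.

```python
import pandas as pd

# Tiny stand-in for the raw dataset (hypothetical values and column names).
original = pd.DataFrame({
    "Browser": ["1", "2", "2", "1"],
    "BounceRates": ["0.2", "bad", "0.1", None],
})

# data: a copy, so `original` is never modified
data = original.copy()

# clean: fix data types and keep NaNs (unparseable numbers become NaN)
clean = data.copy()
clean["BounceRates"] = pd.to_numeric(clean["BounceRates"], errors="coerce")

# clean_no_nan: each NaN is replaced by the last valid value above it.
# A NaN in the very first row would stay, since there is nothing above it to copy.
clean_no_nan = clean.ffill()

print(clean_no_nan)
```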

Jota - Data transformation/engineering, with small changes based on what we saw in the data visualization:
- operating system: values 5-8 carry little information and show no relation to Revenue = True. Change 5, 6, 7, 8 to 'other'.
- browser: change values 3, 7, 9, 11, 12, 13 to 'other'
- TrafficType
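
One possible way to group the rare values listed above, assuming pandas. The column names `OperatingSystems` and `Browser` and the helper `group_as_other` are assumptions for illustration; adjust them to the actual dataframe. No values are given for TrafficType, so it is left out.

```python
import pandas as pd

# Hypothetical sample; the real data would come from the cleaned dataframe.
df = pd.DataFrame({
    "OperatingSystems": [1, 2, 5, 8, 3],
    "Browser": [1, 3, 2, 13, 7],
})

def group_as_other(series, rare_values):
    """Return the column as strings, with every value in rare_values replaced by 'other'."""
    as_text = series.astype(str)
    rare_text = [str(v) for v in rare_values]
    # where() keeps values for which the condition is True and replaces the rest
    return as_text.where(~as_text.isin(rare_text), "other")

df["OperatingSystems"] = group_as_other(df["OperatingSystems"], [5, 6, 7, 8])
df["Browser"] = group_as_other(df["Browser"], [3, 7, 9, 11, 12, 13])

print(df)
```

The columns end up as strings, so they are treated as categorical in the later one-hot encoding step.
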
Kristina - One hot encoding for the categorical variables
- change Revenue to 0 and 1 (label encoding), and do the same for the other boolean variables
- encode the remaining categorical variables with one-hot encoding
Create a copy of the encoded dataframe and export it to CSV so that everyone can work on their own copy
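
A rough sketch of the encoding step, assuming pandas. The sample columns (`Weekend`, `VisitorType`, `BounceRates`) and the file name `df_encoded.csv` are placeholders, not confirmed names from the project.

```python
import pandas as pd

# Hypothetical sample of the cleaned data.
clean_no_nan = pd.DataFrame({
    "Revenue": [True, False, False],
    "Weekend": [False, True, False],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor"],
    "BounceRates": [0.2, 0.0, 0.1],
})

df_encoded = clean_no_nan.copy()

# Boolean columns -> 0/1
for col in ["Revenue", "Weekend"]:
    df_encoded[col] = df_encoded[col].astype(int)

# Remaining categorical columns -> one 0/1 column per category
df_encoded = pd.get_dummies(df_encoded, columns=["VisitorType"], dtype=int)

# Export so that each person can load their own copy
df_encoded.to_csv("df_encoded.csv", index=False)
my_copy = pd.read_csv("df_encoded.csv")
```
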
Isaac - correlation check (heatmap) and class balancing
pairplot of numerical variables (after one-hot encoding, all variables should be numerical)
- zoom in on the pairplot: plot only the variables that are interesting for us
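
A small sketch of the correlation check, assuming pandas. It computes the matrix that a heatmap would display and picks the variables most correlated with Revenue as candidates for the zoomed-in pairplot. The sample columns and the choice of the top 2 are assumptions.

```python
import pandas as pd

# Hypothetical sample of the encoded data.
df_encoded = pd.DataFrame({
    "PageValues": [0.0, 10.0, 0.0, 25.0, 5.0, 0.0],
    "BounceRates": [0.20, 0.00, 0.15, 0.01, 0.05, 0.18],
    "Revenue": [0, 1, 0, 1, 1, 0],
})

# Pearson correlation matrix: the numbers behind the heatmap
corr = df_encoded.corr()

# Rank the features by absolute correlation with the target
ranked = corr["Revenue"].drop("Revenue").abs().sort_values(ascending=False)
interesting = list(ranked.index[:2]) + ["Revenue"]

# With seaborn imported as sns, the plots could then be drawn as:
#   sns.heatmap(corr, annot=True)
#   sns.pairplot(df_encoded[interesting], hue="Revenue")
print(ranked)
```
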
Sosa - Split into X data and y target (same pattern as the iris example: X, y = iris.data, iris.target)
Split into train and test
- from sklearn.model_selection import train_test_split
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
Scale (in the first code-along the split was done after scaling, but Paula said to split into train and test before scaling)
- as an example, standardized to zero mean and unit variance: X_scaled = preprocessing.scale(X)
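
A sketch of the split-then-scale order, using scikit-learn's `StandardScaler` in place of `preprocessing.scale` so that the mean and standard deviation are learned from the training set only and then reused on the test set. The random data, `stratify=y`, and `random_state=0` are assumptions added for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 sessions, 3 numeric features, 15% positive class
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = np.array([0] * 85 + [1] * 15)

# 1) Split first (stratify keeps the class ratio similar in both parts)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# 2) Then scale: fit on train only, apply the same transform to test
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Scaling before the split would let test-set statistics leak into the training data, which is the reason for splitting first.
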
Pau - First model to try: Logistic Regression, on the data we have (with outliers, and missing data filled with ffill).
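
A minimal baseline sketch with scikit-learn, assuming synthetic data in place of the real X and y. The pipeline, `max_iter=1000`, and `random_state=0` are illustrative choices, not settings taken from the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The pipeline fits the scaler on the training data only, then the model
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```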

To do for Tuesday/Wednesday

Balance the Revenue classes (stratify data, do downsampling)
Create more variables: cluster the sessions into 3 types and use the cluster as a new variable (kNN is a classifier, so a clustering method such as k-means is needed here)
Try with and without outliers
Try different ways of handling NaNs
Try different models with different parameters. Each person to try 3 models
We will need to minimize false positives in the confusion matrix
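
Three of the items above could be sketched as small helpers, assuming pandas and scikit-learn. The function names are hypothetical. k-means is used for the clustering step as an assumption, because kNN is a classifier and does not produce clusters.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix

def downsample_majority(df, target="Revenue", random_state=0):
    """Randomly drop rows so every class has as many rows as the smallest class."""
    n_min = df[target].value_counts().min()
    parts = [grp.sample(n=n_min, random_state=random_state)
             for _, grp in df.groupby(target)]
    return pd.concat(parts).sample(frac=1, random_state=random_state)  # shuffle

def add_session_cluster(X_train, X_test, n_clusters=3, random_state=0):
    """Fit k-means on the training rows and append the cluster id as a new column.

    The id is a category, so it would usually be one-hot encoded afterwards.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    km.fit(X_train)
    return (np.column_stack([X_train, km.labels_]),
            np.column_stack([X_test, km.predict(X_test)]))

def false_positive_count(y_true, y_pred):
    """Sessions predicted as Revenue = 1 that were actually Revenue = 0."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(fp)

# Usage on made-up data
df = pd.DataFrame({"x": range(10), "Revenue": [0] * 8 + [1] * 2})
balanced = downsample_majority(df)

X_tr = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]], dtype=float)
X_te = np.array([[0.5, 0.5], [19.0, 1.0]])
X_tr_new, X_te_new = add_session_cluster(X_tr, X_te)

fp = false_positive_count([0, 0, 1, 1], [1, 0, 1, 0])
```

Downsampling would normally be applied to the training set only, so that the test set keeps the real class ratio.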