Project Workflow - sideround/project-ml-onlineshop GitHub Wiki
Information about the copies of the dataset:
- original = raw (no changes)
- data = copy of the original
- clean = clean data, no errors, correct data types (objects and floats), with NaNs
- clean_no_nan = clean data, NaNs filled with ffill
- df_encoded = all categorical variables transformed to numeric
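
As a rough illustration, the chain of copies listed above could be built like this. This is only a sketch: the tiny DataFrame stands in for the real CSV, and the column names (Month, BounceRates) are assumptions, not necessarily what the notebooks use.

```python
import pandas as pd

# Hypothetical stand-in for the raw CSV; the real file has many more columns.
original = pd.DataFrame({
    "Month": ["Feb", "Mar", None, "May"],
    "BounceRates": ["0.2", "0.0", None, "0.05"],
    "Revenue": [False, True, False, True],
})

data = original.copy()  # working copy, `original` stays untouched

# clean: correct data types, NaNs still present
clean = data.copy()
clean["BounceRates"] = pd.to_numeric(clean["BounceRates"], errors="coerce")

# clean_no_nan: forward fill, each NaN takes the value of the previous row
clean_no_nan = clean.ffill()

# df_encoded: booleans to 0/1, categorical columns to one hot columns
df_encoded = pd.get_dummies(clean_no_nan, columns=["Month"], dtype=int)
df_encoded["Revenue"] = df_encoded["Revenue"].astype(int)
```

Note that ffill cannot fill a NaN in the very first row, so it is worth checking clean_no_nan.isna().sum() on the real data.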
Workflow:
- Jota - Data transformation/engineering, with small changes based on what we saw in the data visualization:
- operating system: values 5-8 carry little information and show no relation to Revenue = True. Change 5, 6, 7, 8 to 'other'.
- browser: change values 3, 7, 9, 11, 12, 13 to 'other'
- TrafficType
- Kristina - One hot encoding for the categorical variables
- change Revenue to 0 and 1 (label encoding), and do the same for the other boolean variables
- encode the remaining categorical variables with one hot encoding
- Create a copy of the encoded dataframe and export it to CSV so that each person can work on their own copy
- Isaac - correlation check (heatmap) and class balancing
- pairplot of numerical variables (after one hot encoding all should be numerical)
- zoom in on the pairplot: plot only the variables that are interesting for us
- Sosa - Split into X data and y target (as in the iris example: X, y = iris.data, iris.target)
- Split into train and test
- from sklearn.model_selection import train_test_split
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
- Scale (in the first code-along the split was done after scaling, but Paula said to split into train and test first)
- as an example, standardized to zero mean and unit variance: X_scaled = preprocessing.scale(X)
- Pau - First model to try: Logistic Regression, using the data we have (outliers kept, missing data filled with ffill).
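
The workflow steps above can be sketched end to end as follows. This is a minimal sketch, not the team's actual notebook: the data is synthetic, the column names (OperatingSystems, Browser, PageValues, Weekend) and random_state are assumptions, and StandardScaler is used instead of preprocessing.scale so that the scaling is fitted on the training set only, in line with splitting before scaling.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for clean_no_nan; column names are assumptions.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "OperatingSystems": rng.integers(1, 9, n),
    "Browser": rng.integers(1, 14, n),
    "PageValues": rng.exponential(5.0, n),
    "Weekend": rng.random(n) < 0.25,
})
df["Revenue"] = df["PageValues"] + rng.normal(0, 3, n) > 8

# Jota: group the uninformative category codes into 'other'
df["OperatingSystems"] = df["OperatingSystems"].astype(str).replace(["5", "6", "7", "8"], "other")
df["Browser"] = df["Browser"].astype(str).replace(["3", "7", "9", "11", "12", "13"], "other")

# Kristina: booleans to 0/1, remaining categoricals to one hot columns
df[["Weekend", "Revenue"]] = df[["Weekend", "Revenue"]].astype(int)
df_encoded = pd.get_dummies(df, columns=["OperatingSystems", "Browser"], dtype=int)

# Sosa: split into X and y, then train/test, then scale
X = df_encoded.drop(columns="Revenue")
y = df_encoded["Revenue"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler().fit(X_train)      # mean/std learned from train only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # test reuses the train statistics

# Pau: first model, logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
accuracy = model.score(X_test_scaled, y_test)
```

Fitting the scaler on the training set only keeps information from the test set out of the model, which is the reason for splitting first.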
To do for Tuesday/Wednesday
Hyperparameter tuning and other changes:
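- Balance the Revenue classes (stratify the split, downsample the majority class)
- Create more variables: cluster the sessions (e.g. with k-means; kNN itself is a classifier, not a clustering method) to get 3 types of sessions as a new variable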
- Try with and without outliers
- Try different ways of handling NaNs
- Try different models with different parameters. Each person to try 3 models
- We will need to minimize the false positives in the confusion matrix
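
A possible way to combine the balancing and the false-positive check from the list above is sketched here. It is an illustration under assumptions: the data is synthetic, the feature names are made up, Revenue = 1 is assumed to be the minority class, and precision is used as the metric because it goes up as false positives go down.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the encoded data (names are assumptions).
rng = np.random.default_rng(1)
n = 1000
X = pd.DataFrame({
    "PageValues": rng.exponential(5.0, n),
    "ExitRates": rng.random(n) * 0.2,
})
y = (X["PageValues"] + rng.normal(0, 3, n) > 10).astype(int).rename("Revenue")

# Stratified split: train and test keep the same Revenue proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Downsample the majority class in the training set only
train = X_train.assign(Revenue=y_train)
minority = train[train["Revenue"] == 1]   # assumed to be the smaller class
majority = train[train["Revenue"] == 0].sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(balanced.drop(columns="Revenue"), balanced["Revenue"])
y_pred = model.predict(X_test)

# Confusion matrix layout in scikit-learn: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = precision_score(y_test, y_pred)  # tp / (tp + fp)
```

The test set is left imbalanced on purpose, so the confusion matrix reflects the class proportions the model would see on real sessions.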