16 04 Preprocessing and pipelines

01 Preprocessing data

  • Scikit-learn will not accept categorical features by default
    • Convert to dummy variables
  • scikit-learn: OneHotEncoder() (see the sketch after this list)
  • pandas: get_dummies()
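  • A minimal OneHotEncoder sketch (the toy array is hypothetical; assumes scikit-learn 0.20+, where OneHotEncoder accepts string categories):
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical feature
X = np.array([['US'], ['Europe'], ['Asia'], ['US']])

enc = OneHotEncoder()
# fit_transform returns a sparse matrix; densify it for display
print(enc.fit_transform(X).toarray())
print(enc.categories_)  # columns are ordered alphabetically: Asia, Europe, US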

Encoding dummy variables

  • df_origin = pd.get_dummies(df, drop_first=True): drop_first=True drops the first dummy column, which is redundant with the others (see the sketch below)
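  • A toy sketch of drop_first (the frame and column names here are hypothetical):
import pandas as pd

# Hypothetical frame with one categorical column, 'origin'
df = pd.DataFrame({'mpg': [18.0, 24.0, 31.0],
                   'origin': ['US', 'Europe', 'Asia']})

# 'origin_Asia' is dropped: it is implied when the other two dummies are 0
df_origin = pd.get_dummies(df, drop_first=True)
print(df_origin.columns.tolist())  # ['mpg', 'origin_Europe', 'origin_US']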

02 Handling missing data

  • Different datasets encode missing values in different ways. Sometimes it may be a '9999', other times a 0.
    • df['col_name'] = df['col_name'].replace(0, np.nan) (assigning back is more reliable than calling replace with inplace=True on the column, which recent pandas may not propagate to the DataFrame)
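  • A small sketch of the sentinel-to-NaN step (the column and values are hypothetical):
import numpy as np
import pandas as pd

# Hypothetical measurements where a physically impossible 0 marks a missing reading
df = pd.DataFrame({'insulin': [94, 0, 168, 0, 88]})

# Replace the sentinel with NaN so pandas and scikit-learn treat it as missing
df['insulin'] = df['insulin'].replace(0, np.nan)
print(df['insulin'].isna().sum())  # 2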

Imputing missing data

  • Making an educated guess about the missing data
  • Example: Using the mean of the non-missing entries
# Import the imputer (Imputer from sklearn.preprocessing was removed in
# scikit-learn 0.22; SimpleImputer from sklearn.impute is its replacement)
import numpy as np
from sklearn.impute import SimpleImputer

# Setup the imputation transformer: fill missing entries with the column mean
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X)
X = imp.transform(X)
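  • On a toy matrix the effect is easy to check (the values below are hypothetical):
import numpy as np
from sklearn.impute import SimpleImputer

# The NaN in column 0 is filled with that column's mean: (1 + 7) / 2 = 4
X_toy = np.array([[1.0, 2.0],
                  [np.nan, 3.0],
                  [7.0, 6.0]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imp.fit_transform(X_toy))
# [[1. 2.]
#  [4. 3.]
#  [7. 6.]]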

Imputing with a pipeline

  • In a pipeline, every step but the last must be a transformer; the last step must be an estimator, such as a classifier or a regressor.
# Import necessary modules
import numpy as np
from sklearn.impute import SimpleImputer  # replaces the Imputer removed in scikit-learn 0.22
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Setup the pipeline steps: steps
# ('most_frequent' imputes the mode, a sensible choice for categorical features)
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
         ('SVM', SVC())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

# Compute metrics
print(classification_report(y_test, y_pred))

03 Centering and Scaling

  • Features on larger scales can unduly influence the model.

Ways to normalize your data

  • Standardization: subtract the mean and divide by the standard deviation, so all features are centered around zero and have variance one (sketched below)
  • Min-max scaling: subtract the minimum and divide by the range, so values fall between 0 and 1
  • Can also normalize so the data ranges from -1 to 1
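  • The first two transforms on a toy array (the numbers are hypothetical):
import numpy as np

x = np.array([1.0, 2.0, 3.0, 10.0])

# Standardization: (x - mean) / std -> zero mean, unit standard deviation
x_std = (x - x.mean()) / x.std()
print(round(x_std.mean(), 10), x_std.std())  # 0.0 1.0

# Min-max scaling: (x - min) / range -> values in [0, 1]
x_mm = (x - x.min()) / (x.max() - x.min())
print(x_mm.min(), x_mm.max())  # 0.0 1.0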

Scaling in scikit-learn

  • from sklearn.preprocessing import scale
  • from sklearn.preprocessing import StandardScaler
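  • scale is the one-off functional form; StandardScaler applies the same column-wise transform but can be fit once and reused inside a pipeline. A quick sketch on a hypothetical matrix:
import numpy as np
from sklearn.preprocessing import scale

X_toy = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])

# Each column is standardized independently to zero mean and unit variance
X_scaled = scale(X_toy)
print(X_scaled.mean(axis=0))  # [0. 0.]
print(X_scaled.std(axis=0))   # [1. 1.]
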
# Import the necessary modules
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]
        
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)

# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))  # 0.7700680272108843
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test))) # 0.6979591836734694