16 04 Preprocessing and pipelines
01 Preprocessing data
- Scikit-learn will not accept categorical features by default
- Convert them to dummy variables:
  - scikit-learn: OneHotEncoder()
  - pandas: get_dummies()
Encoding dummy variables
df_origin = pd.get_dummies(df, drop_first=True)  # drop_first avoids redundant dummy columns
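For comparison, a minimal sketch of the scikit-learn route on a made-up DataFrame (df_toy and its 'origin' column are invented for illustration; sparse_output is the scikit-learn >= 1.2 spelling, older versions call the parameter sparse):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df_toy = pd.DataFrame({'origin': ['US', 'Europe', 'Asia', 'US']})

# Return a dense array instead of a sparse matrix
ohe = OneHotEncoder(sparse_output=False)
encoded = ohe.fit_transform(df_toy[['origin']])
print(ohe.get_feature_names_out())  # ['origin_Asia' 'origin_Europe' 'origin_US']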
02 Handling missing data
- Different datasets encode missing values in different ways. Sometimes it may be a '9999', other times a 0.
import numpy as np
df['col_name'] = df['col_name'].replace(0, np.nan)  # assignment is safer than inplace on a column
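A quick way to verify the replacement worked, sketched on a hypothetical column where 0 stands in for 'missing':

import numpy as np
import pandas as pd

df_toy = pd.DataFrame({'col_name': [3.1, 0.0, 2.7, 0.0, 5.0]})
df_toy['col_name'] = df_toy['col_name'].replace(0, np.nan)
print(df_toy['col_name'].isnull().sum())  # 2 missing values now flagged as NaN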
Imputing missing data
- Making an educated guess about the missing data
- Example: Using the mean of the non-missing entries
# Import the imputation transformer
# (sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it)
import numpy as np
from sklearn.impute import SimpleImputer

# Setup the imputation transformer: imp
# strategy='mean' fills each column with its mean; 'most_frequent' suits categorical features
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X)
X = imp.transform(X)
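A toy run to see what the imputer actually does (the array is invented; strategy='mean' matches the example above):

import numpy as np
from sklearn.impute import SimpleImputer

X_toy = np.array([[1.0, 2.0],
                  [np.nan, 4.0],
                  [7.0, np.nan]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imp.fit_transform(X_toy))
# NaNs are filled with each column's mean:
# [[1. 2.]
#  [4. 4.]
#  [7. 3.]]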
Imputing with a pipeline
- In a pipeline, each step but the last must be a transformer, and the last must be an estimator such as a classifier or regressor.
# Import necessary modules
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Setup the pipeline steps: steps
# 'most_frequent' fills each column with its mode, a sensible choice for categorical features
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
         ('SVM', SVC())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the training set
pipeline.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

# Compute metrics
print(classification_report(y_test, y_pred))
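A pipeline also combines cleanly with cross-validation: each split refits the imputation step on that fold's training portion only, so no information from the held-out fold leaks into preprocessing. A minimal sketch, assuming X and y are already loaded:

from sklearn.model_selection import cross_val_score

# Each CV fold refits the whole pipeline (imputer + SVM) on the training portion
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())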
03 Centering and Scaling
- Features on larger scales can unduly influence the model.
Ways to normalize your data
- Standardization: subtract the mean and divide by the standard deviation, so features are centered on zero with unit variance
- Min-max scaling: subtract the minimum and divide by the range, so features lie between 0 and 1
- Can also normalize so that the data ranges from -1 to +1 (see the sketch below)
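The arithmetic on a toy feature, to make the two rescalings above concrete (values invented for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# Standardization: subtract the mean, divide by the standard deviation
x_std = (x - x.mean()) / x.std()

# Min-max scaling: subtract the minimum, divide by the range
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_std)     # approx [-1.342 -0.447  0.447  1.342]
print(x_minmax)  # approx [0.     0.333  0.667  1.   ]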
Scaling in scikit-learn
from sklearn.preprocessing import scale           # function: standardize an array in one call
from sklearn.preprocessing import StandardScaler  # transformer: the pipeline-friendly equivalent
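A quick check that scale() does what the definition above promises, on an invented feature matrix X_demo:

import numpy as np
from sklearn.preprocessing import scale

X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0]])

X_scaled = scale(X_demo)
print(X_scaled.mean(axis=0))  # approx [0. 0.] -> each column centered on zero
print(X_scaled.std(axis=0))   # [1. 1.]        -> unit standard deviation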
Scaling in a pipeline
# Import the necessary modules
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)

# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Compute and print metrics: k-NN is distance-based, so unscaled large-range features dominate
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))      # 0.7700680272108843
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test))) # 0.6979591836734694