Preprocessing - setiamanlhc/python-snippet-code GitHub Wiki
Data Conversion using apply function
When the function passed to df.apply() with axis=1 returns the whole row (as in the example below), the result is a DataFrame.
When the function returns a single value such as row['column_name'], the result is a Series, which can be assigned to one column.
def convert_inch_to_cm(row):
    inch_to_cm = 2.54
    row['Father'] = row['Father'] * inch_to_cm
    row['Mother'] = row['Mother'] * inch_to_cm
    row['Height'] = row['Height'] * inch_to_cm
    return row

df = df.apply(convert_inch_to_cm, axis=1)
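The difference between the two return styles can be checked on a tiny frame. The sketch below is only illustrative: the sample values and the names df_demo, to_cm_row and height_cm are made up for this example.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the snippet above.
df_demo = pd.DataFrame({"Father": [70.0, 68.0], "Height": [72.0, 65.0]})

def to_cm_row(row):
    # Returning the whole row makes apply() build a DataFrame.
    row = row.copy()
    row["Father"] = row["Father"] * 2.54
    row["Height"] = row["Height"] * 2.54
    return row

def height_cm(row):
    # Returning a single value makes apply() build a Series.
    return row["Height"] * 2.54

as_frame = df_demo.apply(to_cm_row, axis=1)
as_column = df_demo.apply(height_cm, axis=1)

print(type(as_frame).__name__)   # DataFrame
print(type(as_column).__name__)  # Series
```

The Series result is the form to use when only one column needs to be created or replaced, for example df_demo['Height_cm'] = as_column.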
Data conversion using Vectorized functions
# Convert heights from inches to centimeters
df[['Father', 'Mother', 'Height']] = df[['Father', 'Mother', 'Height']] * 2.54
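As a rough check that the vectorized form gives the same numbers as a row-wise apply, the following sketch converts a small made-up frame both ways. The sample data and the names df_demo, cols and INCH_TO_CM are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data with one non-numeric column left untouched.
df_demo = pd.DataFrame({
    "Father": [70.0, 68.0],
    "Mother": [65.0, 64.0],
    "Height": [72.0, 65.0],
    "Gender": ["M", "F"],
})

cols = ["Father", "Mother", "Height"]
INCH_TO_CM = 2.54

# Vectorized: one multiplication over the whole block of columns.
vectorized = df_demo.copy()
vectorized[cols] = vectorized[cols] * INCH_TO_CM

# Row-wise: the same arithmetic, applied one row at a time.
row_wise = df_demo.copy()
row_wise[cols] = row_wise[cols].apply(lambda row: row * INCH_TO_CM, axis=1)

print(vectorized)
```

Both approaches produce the same values; the vectorized form is usually preferred because it avoids calling a Python function once per row.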
Data conversion using map
mapping_dict_gender = {
    'M': 'M',
    'F': 'F',
    'female': 'F',
    'male': 'M'
}
df['Gender'] = df['Gender'].map(mapping_dict_gender)
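One thing to watch with map and a dictionary: any value that is not a key in the dictionary becomes NaN. The sketch below uses a made-up Series to show this, plus one possible way (an assumption, not part of the original snippet) to normalize case before mapping.

```python
import pandas as pd

# Hypothetical values; "Male" is deliberately not a key in the dictionary.
gender = pd.Series(["M", "female", "Male", "F", "male"])
mapping_dict_gender = {"M": "M", "F": "F", "female": "F", "male": "M"}

mapped = gender.map(mapping_dict_gender)

# Values that had no matching key are now NaN; list the originals.
unmapped = gender[mapped.isna()].unique()
print(unmapped)

# One possible fix: lower-case everything first, then map lower-case keys.
lower_map = {"m": "M", "f": "F", "male": "M", "female": "F"}
normalized = gender.str.lower().map(lower_map)
print(normalized.tolist())
```

Checking the unmapped values before overwriting the column helps avoid silently losing data.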
Create Dummies Data
Create dummy (one-hot) columns and concatenate them with the main data to convert categorical columns into numerical columns.
import pandas as pd

cols_categorical_to_transform = [
    'hotel',
    'market_segment',
    'is_repeated_guest',
    'reserved_room_type',
    'country'
]
# drop_first=True drops the first category of each column to avoid redundant dummy columns
df_dummies = pd.get_dummies(df[cols_categorical_to_transform], drop_first=True)
df2 = pd.concat([df, df_dummies], axis=1)
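A small worked example may make the resulting columns easier to picture. The data and the names df_demo, cat_cols and encoded are made up. Unlike the snippet above, this sketch also drops the original text columns after encoding, which is an assumption about what a downstream model would need.

```python
import pandas as pd

# Hypothetical booking data with two text columns and one numeric column.
df_demo = pd.DataFrame({
    "hotel": ["City", "Resort", "City"],
    "market_segment": ["Online", "Direct", "Corporate"],
    "adr": [100.0, 80.0, 120.0],
})
cat_cols = ["hotel", "market_segment"]

# One dummy column per category, minus the first category of each column.
dummies = pd.get_dummies(df_demo[cat_cols], drop_first=True)

# Replace the text columns with their dummy columns.
encoded = pd.concat([df_demo.drop(columns=cat_cols), dummies], axis=1)
print(encoded.columns.tolist())
```

Note that get_dummies only encodes object, string, or category columns by default; a column stored as integers passes through unchanged unless it is listed in the columns argument.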
Concatenate and Merging
df = pd.concat([df1,df2,df3])
df = pd.merge(df1, df2, how='left', left_on='key1', right_on='key2')  # match df1['key1'] against df2['key2']
df = pd.merge(df1, df2, how='inner', on=['key1', 'key2'])  # merge on two keys present in both frames
# Left join using the DataFrame method: keeps every row of df2 and adds matching columns from df
df3 = df2.merge(df, how='left', on=['Class', 'Seat_No'])
# Outer join: also keeps rows of df (the right side) that have no match in df2
df3 = df2.merge(df, how='outer', on=['Class', 'Seat_No'])
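The difference between a left and an outer join is easiest to see on two small frames. The sketch below uses made-up frames named seats and scores, and the indicator=True option of merge to label where each row came from.

```python
import pandas as pd

# Hypothetical data: (A, 2) exists only on the left, (B, 2) only on the right.
seats = pd.DataFrame({
    "Class": ["A", "A", "B"],
    "Seat_No": [1, 2, 1],
    "Name": ["Ann", "Ben", "Cy"],
})
scores = pd.DataFrame({
    "Class": ["A", "B", "B"],
    "Seat_No": [1, 1, 2],
    "Score": [90, 75, 60],
})

# Left join: every row of seats is kept; unmatched rows get NaN in Score.
left = seats.merge(scores, how="left", on=["Class", "Seat_No"])

# Outer join: rows from both sides are kept; _merge records the origin.
outer = seats.merge(scores, how="outer", on=["Class", "Seat_No"], indicator=True)

print(left)
print(outer)
```

The _merge column takes the values left_only, right_only, or both, which makes it a quick way to audit unmatched keys.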
Create a new Dataset using Scaler function
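StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so every feature ends up with mean 0 and unit variance. It puts features on a comparable scale but does not remove or clip outliers. The example below uses the Iris dataset. The features variable holds the list of numerical columns, that is, every column of the Iris dataset except 'species'.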
from sklearn.preprocessing import StandardScaler
features = list(df.columns)
features.remove('species')
print(features)
# Instantiate the scaler (from the 'Recipe')
scaler_std = StandardScaler()
# Storing the standardized features into dff
dff = scaler_std.fit_transform(df[features])
#Create a new Dataframe
dff = pd.DataFrame(dff, columns=features)
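To see what the standardization computes, the same transformation can be sketched with plain pandas: z = (x - mean) / std, using the population standard deviation (ddof=0), which is the estimator StandardScaler uses. The sample values and the names df_demo and z are made up for illustration.

```python
import pandas as pd

# Hypothetical Iris-like data; only the column layout mirrors the example above.
df_demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 6.0, 5.1],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})
features = [c for c in df_demo.columns if c != "species"]

# Subtract the column mean, then divide by the population std (ddof=0).
means = df_demo[features].mean()
stds = df_demo[features].std(ddof=0)
z = (df_demo[features] - means) / stds

# After standardization each column has mean ~0 and std ~1.
print(z.mean().round(6))
print(z.std(ddof=0).round(6))
```

Note that pandas' own std() defaults to ddof=1, so ddof=0 has to be passed explicitly to match the scaler's output.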
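Using Pivot Table
Calculate the mean and minimum 'ap_hi' for each combination of 'active' and 'cardio', with one column per 'gender_label' value.
df.pivot_table(values='ap_hi', index=['active', 'cardio'], columns='gender_label', aggfunc=['mean', 'min'])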
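The shape of the result is easier to understand on a few rows. The sketch below runs the same pivot_table call on made-up data; the frame df_demo and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical records; only the column names follow the call above.
df_demo = pd.DataFrame({
    "active": [0, 0, 1, 1, 1, 1],
    "cardio": [0, 1, 0, 0, 1, 1],
    "gender_label": ["male", "female", "male", "male", "female", "female"],
    "ap_hi": [120, 140, 110, 130, 150, 160],
})

table = df_demo.pivot_table(
    values="ap_hi",
    index=["active", "cardio"],
    columns="gender_label",
    aggfunc=["mean", "min"],
)
print(table)

# Rows are (active, cardio) pairs; columns are (aggregation, gender) pairs.
mean_male_active = table.loc[(1, 0), ("mean", "male")]
print(mean_male_active)
```

Combinations with no records, such as inactive females without cardio in this sample, show up as NaN.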
Unpivot Table
Convert wide format data to long format data.
df_long = df.melt(id_vars=['Channel_Label', 'Region_Label'],
                  value_vars=['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen'],
                  var_name='product_type',
                  value_name='sales')
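A minimal sketch of the wide-to-long conversion, using a made-up two-row frame with only two product columns. The names wide and long_df and all values are hypothetical.

```python
import pandas as pd

# Hypothetical wide data: one row per customer, one column per product.
wide = pd.DataFrame({
    "Channel_Label": ["Retail", "Horeca"],
    "Region_Label": ["Lisbon", "Porto"],
    "Fresh": [100, 200],
    "Milk": [10, 20],
})

# Each (row, product column) pair becomes its own row in the long format.
long_df = wide.melt(
    id_vars=["Channel_Label", "Region_Label"],
    value_vars=["Fresh", "Milk"],
    var_name="product_type",
    value_name="sales",
)
print(long_df)
```

Two input rows with two value columns give four output rows, with the id columns repeated for each product.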

Creating Pair plot
Plot the pairwise relationships between the numerical columns, with points colored by 'TARGET CLASS'.
import seaborn as sns
sns.pairplot(df, hue='TARGET CLASS', palette='coolwarm')