Preprocessing - setiamanlhc/python-snippet-code GitHub Wiki

Data Conversion using apply function

When the function passed to df.apply(..., axis=1) returns the whole row (as in the example below), df.apply() returns a DataFrame.

When the function returns a single value such as row['column_name'], df.apply() returns a Series, which can be assigned to one column.

def convert_inch_to_cm(row):
    inch_to_cm = 2.54
    row['Father'] = row['Father'] * inch_to_cm
    row['Mother'] = row['Mother'] * inch_to_cm
    row['Height'] = row['Height'] * inch_to_cm
    return row

df = df.apply(convert_inch_to_cm, axis=1)
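
The difference between the two return styles can be checked on a small frame. This is a minimal sketch: the sample values and the helper names convert_row and height_in_cm are made up for illustration, and only the column names come from the example above.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the example above.
df = pd.DataFrame({
    'Father': [70.0, 68.0],
    'Mother': [65.0, 63.0],
    'Height': [72.0, 66.0],
})

def convert_row(row):
    # Returns the whole row, so apply() builds a DataFrame.
    row = row.copy()
    row['Height'] = row['Height'] * 2.54
    return row

def height_in_cm(row):
    # Returns a single value, so apply() builds a Series.
    return row['Height'] * 2.54

as_frame = df.apply(convert_row, axis=1)
as_series = df.apply(height_in_cm, axis=1)

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series
```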

Data conversion using Vectorized functions

# Convert heights from inches to cm
df[['Father', 'Mother', 'Height']] = df[['Father', 'Mother', 'Height']] * 2.54
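
As a rough check that the vectorized form gives the same numbers as a row-wise apply, the sketch below runs both on a made-up frame. The 'Kids' column and all values are assumptions added only to show that unselected columns are left untouched.

```python
import pandas as pd

# Hypothetical data; 'Kids' is an extra column that should not be converted.
df = pd.DataFrame({
    'Father': [70.0, 68.0],
    'Mother': [65.0, 63.0],
    'Height': [72.0, 66.0],
    'Kids': [2, 3],
})
cols = ['Father', 'Mother', 'Height']

# Vectorized: one multiplication over the selected block of columns.
vectorized = df.copy()
vectorized[cols] = vectorized[cols] * 2.54

# Row-wise: the same arithmetic, called once per row.
row_wise = df.copy()
row_wise[cols] = row_wise[cols].apply(lambda row: row * 2.54, axis=1)

same = bool(((vectorized[cols] - row_wise[cols]).abs() < 1e-9).all().all())
print(same)
```

On large frames the vectorized form is usually much faster, because the row-wise version calls a Python function for every row.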

Data conversion using map

mapping_dict_gender = {
    'M': 'M',
    'F': 'F',
    'female': 'F',
    'male': 'M'
}

df['Gender'] = df['Gender'].map(mapping_dict_gender)
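
One thing to watch with map: any value that is not a key in the dictionary becomes NaN. The sketch below uses made-up values (including a 'Male' spelling that the dictionary does not cover) to show one way of spotting unmapped values, and an optional normalization step that is an assumption rather than part of the original snippet.

```python
import pandas as pd

# Hypothetical input with one spelling ('Male') missing from the mapping.
df = pd.DataFrame({'Gender': ['M', 'female', 'Male', 'F']})
mapping_dict_gender = {'M': 'M', 'F': 'F', 'female': 'F', 'male': 'M'}

mapped = df['Gender'].map(mapping_dict_gender)

# Values that the dictionary did not cover end up as NaN.
unmapped = df.loc[mapped.isna(), 'Gender'].unique().tolist()
print(unmapped)  # ['Male']

# Optional: normalize case and whitespace first so fewer keys are needed.
lower_mapping = {'m': 'M', 'f': 'F', 'male': 'M', 'female': 'F'}
normalized = df['Gender'].str.strip().str.lower().map(lower_mapping)
print(normalized.tolist())  # ['M', 'F', 'M', 'F']
```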

Create Dummies Data

Create dummy (one-hot) columns and concatenate them with the main data to convert categorical columns into numerical columns.

cols_categorical_to_transform = [
    'hotel',
    'market_segment',
    'is_repeated_guest',
    'reserved_room_type',
    'country'
]

df_dummies = pd.get_dummies(df[cols_categorical_to_transform], drop_first=True)

df2 = pd.concat([df, df_dummies], axis=1)
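
Two details are easy to miss here. By default pd.get_dummies only encodes object, string, or category columns, so a numeric flag such as 'is_repeated_guest' passes through unchanged unless it is named in the columns argument. The original categorical columns also stay in the frame after the concat unless they are dropped. The sketch below shows one possible way to handle both; the sample values and the 'adr' column are made up.

```python
import pandas as pd

# Hypothetical sample; 'adr' stands in for the remaining numeric columns.
df = pd.DataFrame({
    'hotel': ['City', 'Resort', 'City'],
    'is_repeated_guest': [0, 1, 0],
    'adr': [100.0, 80.0, 120.0],
})
cols_categorical_to_transform = ['hotel', 'is_repeated_guest']

# Naming the columns explicitly forces the numeric flag to be encoded too.
df_dummies = pd.get_dummies(
    df[cols_categorical_to_transform],
    columns=cols_categorical_to_transform,
    drop_first=True,
)

# Drop the original categorical columns so only the encoded versions remain.
df2 = pd.concat([df.drop(columns=cols_categorical_to_transform), df_dummies], axis=1)
print(df2.columns.tolist())  # ['adr', 'hotel_Resort', 'is_repeated_guest_1']
```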

Concatenate and Merging

df = pd.concat([df1, df2, df3])  # stack the rows of df1, df2 and df3
df = pd.merge(df1, df2, how='left', left_on='key1', right_on='key2')  # match df1['key1'] against df2['key2']
df = pd.merge(df1, df2, how='inner', on=['key1', 'key2'])  # merge on two keys present in both frames

# Merge using the DataFrame method. A left join keeps every row of df2 (the left frame)
# and adds the matching columns from df; rows without a match get NaN.
df3 = df2.merge(df, how='left', on=['Class', 'Seat_No'])

# An outer join also keeps the rows of df (the right frame) that have no match in df2.
df3 = df2.merge(df, how='outer', on=['Class', 'Seat_No'])
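
To see which rows each join keeps, the merge can be run with indicator=True, which adds a '_merge' column naming the source of every row. The two small frames below are made up for illustration; only the key names 'Class' and 'Seat_No' come from the snippet above.

```python
import pandas as pd

# Hypothetical frames sharing the keys 'Class' and 'Seat_No'.
df2 = pd.DataFrame({'Class': ['A', 'A'], 'Seat_No': [1, 2], 'Name': ['Ann', 'Bob']})
df = pd.DataFrame({'Class': ['A', 'B'], 'Seat_No': [2, 1], 'Score': [90, 75]})

# Left join: every row of df2 survives; unmatched rows get NaN for 'Score'.
left = df2.merge(df, how='left', on=['Class', 'Seat_No'], indicator=True)

# Outer join: rows that exist only in df are kept as well.
outer = df2.merge(df, how='outer', on=['Class', 'Seat_No'], indicator=True)

print(len(left), len(outer))  # 2 3
print(outer[['Class', 'Seat_No', '_merge']])
```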

Create a new Dataset using Scaler function

StandardScaler standardizes each feature to zero mean and unit variance, so that features measured on different scales contribute comparably. It does not remove or clip outliers. The example below uses the Iris dataset. The features variable holds the names of the numerical columns, that is, every column of the Iris dataset except 'species'.

from sklearn.preprocessing import StandardScaler

features = list(df.columns)
features.remove('species')
print(features)

# Instantiate the scaler (from the 'Recipe')
scaler_std = StandardScaler()

# Store the standardized features (a NumPy array) in dff
dff = scaler_std.fit_transform(df[features])

# Create a new DataFrame, keeping the original row index
dff = pd.DataFrame(dff, columns=features, index=df.index)
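
The effect of the scaler can be verified on a tiny frame. This is a sketch on three made-up rows with Iris-style column names, not the real dataset. It passes index=df.index when rebuilding the DataFrame so that the 'species' column can be re-attached row by row; that step is an addition for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Three made-up rows with Iris-style column names and a non-default index.
df = pd.DataFrame(
    {
        'sepal_length': [5.1, 7.0, 6.3],
        'petal_length': [1.4, 4.7, 6.0],
        'species': ['setosa', 'versicolor', 'virginica'],
    },
    index=[10, 20, 30],
)
features = [c for c in df.columns if c != 'species']

# fit_transform returns a NumPy array, so column names and index are re-attached.
scaled = StandardScaler().fit_transform(df[features])
dff = pd.DataFrame(scaled, columns=features, index=df.index)
dff['species'] = df['species']

# Each scaled column has mean ~0 and population standard deviation ~1.
means = dff[features].mean()
stds = dff[features].std(ddof=0)
print(means.round(6).tolist(), stds.round(6).tolist())
```

Without index=df.index the new frame gets a fresh 0..n-1 index, and assigning df['species'] to it would align on the wrong labels whenever df does not use the default index.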

Using Pivot Table

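Calculate the average and minimum 'ap_hi' of male and female for each group of 'active' and 'cardio'.

df.pivot_table(values='ap_hi', index=['active', 'cardio'], columns='gender_label', aggfunc=['mean', 'min'])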
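
The shape of the result is easier to read on a small example. The five rows below are invented; only the column names follow the call above. The resulting columns form a two-level index of (aggregation, gender_label).

```python
import pandas as pd

# Hypothetical records; only the column names follow the call above.
df = pd.DataFrame({
    'active': [0, 0, 1, 1, 1],
    'cardio': [0, 0, 1, 1, 1],
    'gender_label': ['female', 'male', 'female', 'male', 'male'],
    'ap_hi': [110, 120, 140, 130, 150],
})

table = df.pivot_table(
    values='ap_hi',
    index=['active', 'cardio'],
    columns='gender_label',
    aggfunc=['mean', 'min'],
)
print(table)

# Individual cells are addressed by a (row key, column key) pair of tuples.
mean_male = table.loc[(1, 1), ('mean', 'male')]
min_male = table.loc[(1, 1), ('min', 'male')]
print(mean_male, min_male)
```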

Unpivot Table

Convert wide format data to long format data.

df_long = df.melt(id_vars=['Channel_Label', 'Region_Label'], 
                  value_vars=['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen'], 
                  var_name='product_type', 
                  value_name='sales')
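
A reduced version of the same call shows what the long format looks like. The frame below is made up and keeps only two of the product columns; each original row turns into one row per product.

```python
import pandas as pd

# Hypothetical wide-format data with two of the product columns.
df = pd.DataFrame({
    'Channel_Label': ['Retail', 'Hotel'],
    'Region_Label': ['Lisbon', 'Porto'],
    'Fresh': [100, 200],
    'Milk': [10, 20],
})

df_long = df.melt(
    id_vars=['Channel_Label', 'Region_Label'],
    value_vars=['Fresh', 'Milk'],
    var_name='product_type',
    value_name='sales',
)
print(df_long.shape)  # (4, 4)
print(df_long)
```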

Creating Pair plot

Plot the pairwise relationships between the numerical columns, with points colored by 'TARGET CLASS'.

import seaborn as sns

sns.pairplot(df, hue='TARGET CLASS', palette='coolwarm')