7.2.1. Data Wrangling - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki

Data Pre-processing

The process of converting or mapping data from the initial 'raw' form into another format, in order to prepare the data for further analysis.

Missing Values

Missing values occur when no data value is stored for a variable (feature) in an observation.
They may be represented as '?', 'N/A', 0, or just a blank cell.
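Before any of the strategies below can be applied, placeholder markers usually need to be converted into a value pandas recognizes as missing. A minimal sketch, assuming a hypothetical toy DataFrame in which '?' and 'N/A' mark missing entries (the column names and values are illustrative, not taken from the course dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: '?' and 'N/A' stand in for missing entries.
df = pd.DataFrame({
    "price": ["13495", "?", "16500", "N/A"],
    "fuel": ["gas", "diesel", "?", "gas"],
})

# Map the placeholder strings to NaN so pandas treats them as missing.
df = df.replace(["?", "N/A"], np.nan)

# Count the missing values in each column.
missing_counts = df.isnull().sum()
print(missing_counts)
```

A 0 that really means "missing" is harder to detect automatically, because it is also a valid number; that case generally needs knowledge of the data source.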

Dealing with missing data

Check with the data collection source

Drop the missing values

drop the variable
drop the data entry

Replace the missing values

replace it with an average (of similar datapoints)
replace it by frequency
replace it based on other functions

Leave it as missing data

Question

How would you deal with missing values for categorical data?

replace the missing value with the mode of the particular column
~~replace the missing value with the mean of the particular column~~
replace the missing value with the value that appears most often of the particular column

Correct. The value that appears most often is called the mode; the mean is not defined for categorical data.
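The mode-based replacement from the question can be sketched as follows. The column name 'num-of-doors' and its values are assumptions made for illustration, and `fillna()` is used here as one common way to fill in missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry.
df = pd.DataFrame({"num-of-doors": ["four", "two", "four", np.nan, "four"]})

# mode() ignores NaN and returns a Series; take its first entry.
most_frequent = df["num-of-doors"].mode()[0]

# Fill the missing entry with the most frequent category.
df["num-of-doors"] = df["num-of-doors"].fillna(most_frequent)
print(df)
```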

Drop missing values in Python

Use dataframe.dropna()
- axis=0 : drops the entire row
- axis=1 : drops the entire column

df.dropna(subset=['column_name'], axis=0, inplace=True)

df = df.dropna(subset=['column_name'], axis=0)
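To see the difference between the two axis settings, here is a small sketch on a hypothetical DataFrame (column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'price' and 'horsepower' each have one missing value.
df = pd.DataFrame({
    "make": ["audi", "bmw", "dodge"],
    "price": [13495.0, np.nan, 16500.0],
    "horsepower": [111.0, 154.0, np.nan],
})

# axis=0: drop only the rows where 'price' is missing.
rows_with_price = df.dropna(subset=["price"], axis=0)

# axis=1: drop every column that contains at least one missing value.
complete_columns = df.dropna(axis=1)

print(rows_with_price)
print(complete_columns)
```

Note that dropping rows leaves gaps in the index; calling `reset_index(drop=True)` afterwards is one way to renumber it.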

Replace missing values in Python

Use dataframe.replace(missing_value, new_value)

mean = df['column_name'].mean()
df['column_name'] = df['column_name'].replace(np.nan, mean)
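A runnable version of the mean replacement, assuming a hypothetical 'normalized-losses' column with one missing entry. The result of `replace()` has to be assigned back, since by default it returns a new Series rather than modifying the column:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with one missing value.
df = pd.DataFrame({"normalized-losses": [100.0, np.nan, 140.0]})

# mean() skips NaN by default, so this is (100 + 140) / 2 = 120.
mean = df["normalized-losses"].mean()

# Assign the result back to the column to keep the change.
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)
print(df)
```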

Data Formatting in Python

Data Formatting

Data are usually collected from different places and stored in different formats.
Bringing data into a common standard of expression allows users to make meaningful comparisons.

Applying calculations to an entire column

Example: Convert 'mpg' to 'L/100km' in Car dataset

df['city-mpg'] = 235 / df['city-mpg']

df.rename(columns={'city-mpg': 'city-L/100km'}, inplace=True)
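The conversion above as a self-contained sketch, using made-up mpg values. The constant 235 is the rounded conversion factor used in the notes (the more precise figure for US gallons is about 235.2):

```python
import pandas as pd

# Hypothetical fuel-economy values in miles per gallon.
df = pd.DataFrame({"city-mpg": [21.0, 47.0, 23.5]})

# The division is applied element-wise to the whole column.
df["city-mpg"] = 235 / df["city-mpg"]

# Rename the column so its name matches the new unit.
df = df.rename(columns={"city-mpg": "city-L/100km"})
print(df)
```

Higher mpg maps to lower L/100km, so the ordering of the cars is reversed after the conversion.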

Incorrect data types

Sometimes the wrong data type is assigned to a feature.

Correcting data types

To identify datatypes:
- Use dataframe.dtypes to identify the data type of each column (dtypes is an attribute, not a method).
To convert data types:
- Use dataframe.astype() to convert data type.
Example: Convert data type to integer in column 'price'

df['price'] = df['price'].astype('int')
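A short sketch of the check-then-convert workflow, assuming a hypothetical 'price' column that was read in as strings:

```python
import pandas as pd

# Hypothetical prices stored as strings, as often happens after reading a file.
df = pd.DataFrame({"price": ["13495", "16500", "13950"]})

# Inspect the current types (no parentheses: dtypes is an attribute).
print(df.dtypes)

# Convert the column to integers so that arithmetic works on it.
df["price"] = df["price"].astype("int")
print(df.dtypes)
```

`astype('int')` raises an error if the column still contains missing values or non-numeric strings such as '?', so those are typically handled first.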

Data Normalization in Python

Data Normalization

Bring feature values that span different ranges onto a uniform scale, so that no feature dominates an analysis just because its numbers are larger.

Methods of normalizing data

Several approaches for normalization:
1. Simple Feature scaling
2. Min-Max
3. Z-score

Simple Feature Scaling in Python

Example:

df['length'] = df['length']/df['length'].max()

Min-Max in Python

Example:

df['length'] = ((df['length'] - df['length'].min()) /
                (df['length'].max() - df['length'].min()))

Z-score in Python

Example:

df['length'] = (df['length']-df['length'].mean())/df['length'].std()
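To compare the three methods side by side, the sketch below applies each to the same hypothetical 'length' values (the numbers are chosen only to keep the arithmetic easy to follow):

```python
import pandas as pd

# Hypothetical lengths.
length = pd.Series([150.0, 175.0, 200.0])

# Simple feature scaling: divide by the maximum, so the largest value becomes 1.
simple = length / length.max()

# Min-max: shift and scale so values run from 0 to 1.
min_max = (length - length.min()) / (length.max() - length.min())

# Z-score: subtract the mean and divide by the standard deviation.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
z_score = (length - length.mean()) / length.std()

print(pd.DataFrame({"simple": simple, "min_max": min_max, "z_score": z_score}))
```

For these values, simple scaling gives 0.75, 0.875, 1.0; min-max gives 0, 0.5, 1; and the z-scores are -1, 0, 1, which typically land between about -3 and 3 on real data.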

Binning in Python

Binning

Binning: grouping a set of numerical values into a set of "bins"
Converts numeric variables into categorical variables
Example: "price" is a feature ranging from 5,000 to 45,500; grouping it into a few bins gives a better representation of price

Binning in Python pandas

Example

bins = np.linspace(min(df['price']), max(df['price']), 4)

group_names = ['Low', 'Medium', 'High']

df['price-binned'] = pd.cut(df['price'], bins, labels=group_names, include_lowest=True)
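The example above run end to end on a few hypothetical prices spanning the 5,000 to 45,500 range. Three bins need four edges, which is why `np.linspace` is asked for 4 points:

```python
import numpy as np
import pandas as pd

# Hypothetical prices covering the range mentioned in the notes.
df = pd.DataFrame({"price": [5000, 12000, 20000, 33000, 45500]})

# Four equally spaced edges: 5000, 18500, 32000, 45500.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)

group_names = ["Low", "Medium", "High"]

# include_lowest=True puts the minimum price into the first bin.
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names,
                            include_lowest=True)
print(df)
```

Without `include_lowest=True`, the lowest price would fall outside the first interval and be marked as missing.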

Turning categorical variables into quantitative variables in Python

Categorical Variables

Problem:

Most statistical models cannot take objects/strings as input

Categorical → Numeric

Solution:

Add dummy variables for each unique category
Assign 0 or 1 in each category

Dummy variables in Python pandas

Use pandas.get_dummies() method
Convert categorical variables to dummy variables (0 or 1)

pd.get_dummies(df['fuel'])
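A small sketch of the dummy-variable step on a hypothetical 'fuel' column. Recent pandas versions return True/False columns by default, so `dtype=int` is passed here to get the 0/1 values described above; joining the result back with `pd.concat` is one possible follow-up, not something the notes prescribe:

```python
import pandas as pd

# Hypothetical fuel types.
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One new column per unique category, holding 1 where the row matches.
dummies = pd.get_dummies(df["fuel"], dtype=int)

# Attach the dummy columns next to the original data.
df = pd.concat([df, dummies], axis=1)
print(df)
```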

Lesson Summary

In this lesson, you have learned how to:

Identify and Handle Missing Values: Drop rows with incomplete information and impute missing data using the mean values.

Understand Data Formatting: Wrangle features in a dataset and make them meaningful for data analysis.

Apply normalization to a data set: Understand why feature scaling is relevant to your data and how normalization and standardization affect your data analysis differently.