7.2.1.Data Wrangling - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki
Data Pre-processing
- The process of converting or mapping data from the initial 'raw' form into another format, in order to prepare the data for further analysis.
Missing Values
- Missing values occur when no data value is stored for a variable (feature) in an observation
- Could be represented as '?', 'N/A', 0 or just a blank cell
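As a minimal sketch of how such placeholders might be detected, the snippet below maps '?' and 'N/A' to NaN and counts the missing entries per column. The toy DataFrame and its column names (`horsepower`, `price`) are assumptions made for illustration, not part of the course dataset shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: '?' and 'N/A' stand in for missing values.
raw = pd.DataFrame({
    "horsepower": ["111", "?", "154", "N/A"],
    "price": [13495, 16500, np.nan, 13950],
})

# Map the placeholder strings to NaN so pandas recognises them as missing.
cleaned = raw.replace(["?", "N/A"], np.nan)

# Count missing values per column.
missing_counts = cleaned.isnull().sum()
print(missing_counts)
```

A 0 that encodes "missing" cannot be told apart from a genuine zero automatically, so whether to convert it is a judgment call that depends on the feature.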
Dealing with missing data
Check with the data collection source
Drop the missing values
- drop the variable
- drop the data entry
Replace the missing values
- replace it with an average (of similar datapoints)
- replace it by frequency
- replace it based on other functions
Leave it as missing data
Question
How would you deal with missing values for categorical data?
- Replace the missing value with the mode of the particular column
- Replace the missing value with the mean of the particular column
- Replace the missing value with the value that appears most often in the particular column
Correct. This is called the mode.
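The mode-based replacement from the question above could look roughly like this in pandas. The column name `num-of-doors` and its values are hypothetical, and `fillna` is used here as one possible way to fill the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry.
df = pd.DataFrame({"num-of-doors": ["four", "two", "four", np.nan, "four"]})

# mode() returns a Series (ties are possible), so take the first entry.
most_common = df["num-of-doors"].mode()[0]

# Fill the missing entries with the most frequent category.
df["num-of-doors"] = df["num-of-doors"].fillna(most_common)
print(df)
```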
Drop missing values in Python
- Use dataframe.dropna()
- axis=0: drops the entire row
- axis=1: drops the entire column
df.dropna(subset=['column_name'], axis=0, inplace=True)
df = df.dropna(subset=['column_name'], axis=0)
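To make the effect of `axis` concrete, here is a small sketch on a made-up DataFrame (the columns `make`, `price`, and `stroke` are assumptions for illustration). It drops rows with a missing `price`, then separately drops every column that contains a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two columns.
df = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "price": [13495.0, np.nan, 16500.0],
    "stroke": [2.68, 3.47, np.nan],
})

# axis=0: drop rows where 'price' is missing (other columns may still hold NaN).
rows_dropped = df.dropna(subset=["price"], axis=0)

# axis=1: drop every column that contains at least one missing value.
cols_dropped = df.dropna(axis=1)

# Dropping rows leaves gaps in the index; reset it if a clean 0..n-1 index is wanted.
rows_dropped = rows_dropped.reset_index(drop=True)
print(rows_dropped)
print(cols_dropped)
```

The two forms shown above (with `inplace=True`, or assigning the result back to `df`) should have the same effect; without either, `dropna` returns a new DataFrame and leaves the original unchanged.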
Replace missing values in Python
- Use dataframe.replace(missing_value, new_value)
mean = df['column_name'].mean()
df['column_name'] = df['column_name'].replace(np.nan, mean)
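A runnable version of the mean replacement, on a made-up column (`normalized-losses` is a hypothetical name chosen for illustration), might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing entries.
df = pd.DataFrame({"normalized-losses": [164.0, np.nan, 158.0, np.nan]})

# mean() skips NaN by default, so it averages only the known values.
mean = df["normalized-losses"].mean()

# replace() returns a new Series, so assign the result back to the column.
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)
print(df)
```

Because `replace` does not modify the column in place, the result has to be assigned back (or the change is lost).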
Data Formatting in Python
Data Formatting
- Data are usually collected from different places and stored in different formats.
- Bringing data into a common standard of expression allows users to make meaningful comparison.
Applying calculations to an entire column
- Example: Convert 'mpg' to 'L/100km' in Car dataset
df['city-mpg'] = 235 / df['city-mpg']
df.rename(columns={'city-mpg': 'city-L/100km'}, inplace=True)
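The conversion above can be tried on a few made-up values as follows. The sample numbers are assumptions; 235 is the rounded constant used in the example (for US gallons the factor is roughly 235.2).

```python
import pandas as pd

# Hypothetical fuel-economy values in miles per gallon.
df = pd.DataFrame({"city-mpg": [21, 19, 24]})

# L/100km = 235 / mpg, applied to the whole column at once.
df["city-mpg"] = 235 / df["city-mpg"]

# Rename the column so its name matches the new unit.
df.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)
print(df)
```

Renaming right after the calculation keeps the column label from silently advertising the old unit.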
Incorrect data types
- Sometimes the wrong data type is assigned to a feature.
Correcting data types
- To identify data types: use dataframe.dtypes (an attribute, so it is written without parentheses).
- To convert data types: use dataframe.astype().
- Example: Convert data type to integer in column 'price'
df['price'] = df['price'].astype('int')
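As a small illustration of both steps, the sketch below inspects the types of a made-up DataFrame in which `price` was read in as text, then converts it. The data and the `make` column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data where 'price' was loaded as strings.
df = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "price": ["13495", "16500", "13950"],
})

# dtypes is an attribute, so there are no parentheses.
print(df.dtypes)

# Convert the string column to integers.
df["price"] = df["price"].astype("int")
print(df.dtypes)
```

The conversion raises an error if the column still contains placeholders such as '?' or missing values, so it is usually done after the missing data has been handled.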
Data Normalization in Python
Data Normalization
- Bring the values of features with different ranges onto a uniform scale.
Methods of normalizing data
- Several approaches for normalization:
- Simple Feature scaling
- Min-Max
- Z-score
Simple Feature Scaling in Python
Example:
df['length'] = df['length']/df['length'].max()
Min-Max in Python
Example:
df['length'] = (df['length']-df['length'].min()) / (df['length'].max()-df['length'].min())
Z-score in Python
Example:
df['length'] = (df['length']-df['length'].mean())/df['length'].std()
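To compare the three approaches side by side, the sketch below applies each of them to the same made-up `length` values (the numbers are assumptions chosen so the results are easy to check by hand).

```python
import pandas as pd

# Hypothetical 'length' values.
length = pd.Series([150.0, 175.0, 200.0], name="length")

# Simple feature scaling: divide by the maximum; the largest value becomes 1.
simple = length / length.max()

# Min-max: subtract the minimum and divide by the range; results lie in [0, 1].
min_max = (length - length.min()) / (length.max() - length.min())

# Z-score: subtract the mean and divide by the standard deviation.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
z_score = (length - length.mean()) / length.std()

print(pd.DataFrame({"simple": simple, "min_max": min_max, "z_score": z_score}))
```

Simple scaling and min-max keep the values within a fixed interval, while z-scores are centred on 0 and typically fall between about -3 and 3.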
Binning in Python
Binning
- Binning: grouping a set of numerical values into a set of "bins"
- Converts numeric variables into categorical variables
- Example: "price" is a feature ranging from 5,000 to 45,500; grouping it into a few bins (such as low, medium, and high) gives a better representation of price
Binning in Python pandas
Example
bins = np.linspace(min(df['price']), max(df['price']), 4)
group_names = ['Low', 'Medium', 'High']
df['price-binned'] = pd.cut(df['price'], bins, labels=group_names, include_lowest=True)
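Here is the same binning recipe run on a handful of made-up prices (the values are assumptions that span the 5,000 to 45,500 range mentioned above). Four equally spaced edges from `np.linspace` define three bins of equal width.

```python
import numpy as np
import pandas as pd

# Hypothetical prices covering the range described above.
df = pd.DataFrame({"price": [5000, 12000, 20000, 30000, 45500]})

# 4 equally spaced edges -> 3 equal-width bins.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True makes the first bin include the minimum price itself.
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)

print(bins)
print(df["price-binned"].value_counts())
```

Without `include_lowest=True`, the smallest price would fall outside the first bin and be labelled as missing, because `pd.cut` intervals are open on the left by default.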
Turning categorical variables into quantitative variables in Python
Categorical Variables
Problem:
- Most statistical models cannot take objects/strings as input
Categorical → Numeric
Solution:
- Add dummy variables for each unique category
- Assign 0 or 1 in each category
Dummy variables in Python pandas
- Use the pandas.get_dummies() function
- Convert categorical variables to dummy variables (0 or 1)
pd.get_dummies(df['fuel'])
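As a short sketch, the snippet below builds dummy variables for a made-up `fuel` column and attaches them to the original DataFrame. The sample values, the `dtype=int` argument, and the `pd.concat` step are choices made for this illustration.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One new column per unique category; dtype=int gives 0/1 values
# (recent pandas versions return booleans by default).
dummies = pd.get_dummies(df["fuel"], dtype=int)

# Attach the dummy columns alongside the original data.
df = pd.concat([df, dummies], axis=1)
print(df)
```

Each row has a 1 in exactly one of the new columns, marking the category it belongs to.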
Lesson Summary
In this lesson, you have learned how to:
Identify and Handle Missing Values: Drop rows with incomplete information and impute missing data using the mean values.
Understand Data Formatting: Wrangle features in a dataset and make them meaningful for data analysis.
Apply normalization to a data set: Understand the relevance of feature scaling and how normalization and standardization affect your data analysis in different ways.