7.2.1.Data Wrangling - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki

Data Pre-processing

  • The process of converting or mapping data from the initial 'raw' form into another format, in order to prepare the data for further analysis.

Missing Values

  • Missing values occur when no data value is stored for a variable (feature) in an observation.
  • They may be represented as '?', 'N/A', 0, or just a blank cell.
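Before missing values can be dropped or replaced, the placeholder markers usually need to be converted to NaN so that pandas treats them as missing. The following is a minimal sketch on a hypothetical toy table (the column names and values are made up for illustration). It does not convert 0, because 0 is often a legitimate value and has to be judged per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: '?' and 'N/A' stand in for missing entries.
raw = pd.DataFrame({
    "horsepower": ["111", "?", "154", "N/A"],
    "fuel": ["gas", "diesel", "?", "gas"],
})

# Map the placeholder strings to NaN so pandas recognises them as missing.
cleaned = raw.replace(["?", "N/A"], np.nan)

# Count the missing entries in each column.
missing_counts = cleaned.isna().sum()
print(missing_counts)
```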

Dealing with missing data

Check with the data collection source

Drop the missing values

  • drop the variable
  • drop the data entry

Replace the missing values

  • replace it with an average (of similar datapoints)
  • replace it with the most frequent value (the mode)
  • replace it based on other functions

Leave it as missing data

Question

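How would you deal with missing values for categorical data?

  • replace the missing value with the mode of the particular column
  • replace the missing value with the mean of the particular column
  • replace the missing value with the value that appears most often in the particular column

Correct: replace it with the value that appears most often in the column. This value is called the mode, so the first and third options describe the same approach. A mean cannot be computed for categorical data.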
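As a rough illustration of mode imputation for a categorical column, here is a small sketch. The column name and values are hypothetical, and it uses fillna() rather than the replace() call shown later in these notes; either works for this purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry.
df = pd.DataFrame({"num-of-doors": ["four", "two", "four", np.nan, "four"]})

# mode() ignores NaN and returns a Series (there can be ties); take the first entry.
most_common = df["num-of-doors"].mode()[0]

# Fill the missing entries with the most frequent category.
df["num-of-doors"] = df["num-of-doors"].fillna(most_common)
print(df)
```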

Drop missing values in Python

  • Use dataframe.dropna()
    • axis=0 : drops the entire row
    • axis=1 : drops the entire column
# Either modify df in place:
df.dropna(subset=['column_name'], axis=0, inplace=True)
# or, equivalently, assign the result back to df:
df = df.dropna(subset=['column_name'], axis=0)
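To make the difference between the two axis settings concrete, here is a small self-contained sketch. The DataFrame contents are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'price' and 'length' each have one missing value.
df = pd.DataFrame({
    "make": ["alfa", "audi", "bmw"],
    "price": [13495.0, np.nan, 16500.0],
    "length": [168.8, 171.2, np.nan],
})

# axis=0: drop the rows where 'price' is missing; NaN in other columns is kept.
rows_with_price = df.dropna(subset=["price"], axis=0)

# Dropping rows leaves gaps in the index, so renumber it.
rows_with_price = rows_with_price.reset_index(drop=True)

# axis=1: drop every column that contains at least one NaN.
complete_columns = df.dropna(axis=1)

print(rows_with_price)
print(complete_columns)
```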

Replace missing values in Python

  • Use dataframe.replace(missing_value, new_value)
mean = df['column_name'].mean()
# replace() returns a new Series, so assign the result back to the column
df['column_name'] = df['column_name'].replace(np.nan, mean)
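The same idea as a runnable sketch on hypothetical data. Note that mean() skips NaN by default, so the average is taken over the observed values only.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing entries.
df = pd.DataFrame({"normalized-losses": [164.0, np.nan, 158.0, np.nan, 122.0]})

# Mean of the observed values; NaN entries are skipped by default.
mean = df["normalized-losses"].mean()

# replace() returns a new Series, so the result is assigned back to the column.
df["normalized-losses"] = df["normalized-losses"].replace(np.nan, mean)
print(df)
```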

Data Formatting in Python

Data Formatting

  • Data are usually collected from different places and stored in different formats.
  • Bringing data into a common standard of expression allows users to make meaningful comparison.

Applying calculations to an entire column

  • Example: Convert 'mpg' to 'L/100km' in Car dataset
df['city-mpg'] = 235 / df['city-mpg']

df.rename(columns={'city-mpg': 'city-L/100km'}, inplace=True)
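A self-contained version of this conversion might look like the sketch below. The mpg values are hypothetical, and 235 is the rounded conversion factor used in these notes (the more precise figure for US gallons is about 235.215).

```python
import pandas as pd

# Hypothetical fuel-economy values in miles per gallon.
df = pd.DataFrame({"city-mpg": [21, 27, 47]})

# The division is applied element-wise to the whole column.
df["city-mpg"] = 235 / df["city-mpg"]

# Rename the column so its name matches the new unit.
df = df.rename(columns={"city-mpg": "city-L/100km"})
print(df)
```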

Incorrect data types

  • Sometimes the wrong data type is assigned to a feature.

Correcting data types

  • To identify datatypes:

    • Use dataframe.dtypes to identify data types (dtypes is an attribute, not a method, so it takes no parentheses).
  • To convert data types:

    • Use dataframe.astype() to convert data type.
  • Example: Convert data type to integer in column 'price'

df['price'] = df['price'].astype('int')
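A short sketch of checking and then correcting a type, assuming a hypothetical 'price' column that was read in as strings:

```python
import pandas as pd

# Hypothetical case: prices were loaded as strings instead of numbers.
df = pd.DataFrame({"price": ["13495", "16500", "13950"]})

# dtypes is an attribute; it lists the type of every column.
print(df.dtypes)

# Convert the column to integers so arithmetic works as expected.
df["price"] = df["price"].astype("int")
print(df.dtypes)

total = df["price"].sum()
```

astype() raises an error if any value cannot be converted (for example a leftover '?'), so missing-value markers need to be handled first.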

Data Normalization in Python

Data Normalization

  • Bring the values of features with different ranges onto a common scale.

Methods of normalizing data

  • Several approaches for normalization:
    1. Simple Feature scaling
    2. Min-Max
    3. Z-score
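Written as formulas, the three approaches are as follows. The symbols are notation introduced here for illustration: x_old is the original value, x_min and x_max are the column minimum and maximum, and μ and σ are the column mean and standard deviation.

```latex
\begin{align*}
\text{Simple feature scaling:} \quad & x_{\text{new}} = \frac{x_{\text{old}}}{x_{\max}} \\
\text{Min-Max:} \quad & x_{\text{new}} = \frac{x_{\text{old}} - x_{\min}}{x_{\max} - x_{\min}} \\
\text{Z-score:} \quad & x_{\text{new}} = \frac{x_{\text{old}} - \mu}{\sigma}
\end{align*}
```

Min-Max always maps values into the range 0 to 1, simple feature scaling does so for non-negative data, and the Z-score produces values centered on 0.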

Simple Feature Scaling in Python

Example:

df['length'] = df['length']/df['length'].max()

Min-Max in Python

Example:

# The outer parentheses let the expression continue onto the next line
df['length'] = ((df['length'] - df['length'].min()) /
                (df['length'].max() - df['length'].min()))

Z-score in Python

Example:

df['length'] = (df['length']-df['length'].mean())/df['length'].std()
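To compare the three methods side by side, the sketch below applies each one to the same hypothetical 'length' column and stores the results in separate columns instead of overwriting the original.

```python
import pandas as pd

# Hypothetical car lengths.
df = pd.DataFrame({"length": [150.0, 160.0, 170.0, 180.0, 200.0]})
x = df["length"]

# Simple feature scaling: divide by the maximum.
df["length-scaled"] = x / x.max()

# Min-Max: shift by the minimum, then divide by the range.
df["length-minmax"] = (x - x.min()) / (x.max() - x.min())

# Z-score: subtract the mean, then divide by the standard deviation.
df["length-zscore"] = (x - x.mean()) / x.std()

print(df)
```

pandas computes std() as the sample standard deviation (ddof=1) by default, so the Z-scores can differ slightly from those of tools that use the population standard deviation.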

Binning in Python

Binning

  • Binning: grouping numerical values into a set of "bins"
  • Converts numeric variables into categorical variables
  • Example: "price" is a feature ranging from 5,000 to 45,500; grouping it into a few bins (such as low, medium, and high) gives a better representation of price

Binning in Python pandas

Example

bins = np.linspace(min(df['price']), max(df['price']), 4)

group_names = ['Low', 'Medium', 'High']

df['price-binned'] = pd.cut(df['price'], bins, labels=group_names, include_lowest=True)
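Here is the same binning recipe as a self-contained sketch. The prices are hypothetical, chosen to span the 5,000 to 45,500 range mentioned above. Four equally spaced edges define three bins.

```python
import numpy as np
import pandas as pd

# Hypothetical prices spanning the range discussed above.
df = pd.DataFrame({"price": [5000, 12000, 20000, 30000, 45500]})

# Four equally spaced edges (5000, 18500, 32000, 45500) give three bins.
bins = np.linspace(df["price"].min(), df["price"].max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True makes the first bin include the minimum price itself.
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)

# Count how many cars fall into each bin.
counts = df["price-binned"].value_counts()
print(df)
print(counts)
```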

Turning categorical variables into quantitative variables in Python

Categorical Variables

Problem:

  • Most statistical models cannot take objects/strings as input

Categorical → Numeric

Solution:

  • Add dummy variables for each unique category
  • Assign 0 or 1 in each category

Dummy variables in Python pandas

  • Use pandas.get_dummies() method
  • Convert categorical variables to dummy variables (0 or 1)
pd.get_dummies(df['fuel'])
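A small sketch of the full round trip on a hypothetical 'fuel' column: create the dummy columns, attach them to the DataFrame, and drop the original text column. The dtype=int argument is an optional choice made here so that the output shows 0 and 1 (depending on the pandas version, the default may be True/False instead).

```python
import pandas as pd

# Hypothetical categorical feature.
df = pd.DataFrame({"fuel": ["gas", "diesel", "gas"]})

# One new column per unique category, holding 1 where the row matches, else 0.
dummies = pd.get_dummies(df["fuel"], dtype=int)

# Attach the dummy columns and drop the original text column.
df = pd.concat([df, dummies], axis=1).drop(columns="fuel")
print(df)
```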

Lesson Summary

In this lesson, you have learned how to:

Identify and Handle Missing Values: Drop rows with incomplete information and impute missing data using the mean values.

Understand Data Formatting: Wrangle features in a dataset and make them meaningful for data analysis.

Apply normalization to a data set: Understand why feature scaling matters and how normalization and standardization affect your data analysis differently.