11 03 Advanced data problems - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.

Uniformity

Column Unit
Temperature 32C is also 89.6F
Weight 70kg is also 11 st
Date 26-11-2019 is also 26, Nov, 2019

Uniform currencies

# Find values of acct_cur that are equal to 'euro'
acct_eu = banking['acct_cur'] == 'euro'

# Convert acct_amount where it is in euro to dollars
banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1

# Unify acct_cur column by changing 'euro' values to 'dollar'
banking.loc[acct_eu, 'acct_cur'] = 'dollar'

# Print unique values of acct_cur
assert banking['acct_cur'].unique() == 'dollar'

Uniform dates

  • pandas.to_datetime()
    • Can recognize most formats automatically
    • Sometimes fails with erroneous or unrecognizable formats
    • infer_datetime_format = True : attempt to infer format of each date
    • errors = 'coerce' : return NA for rows where conversion failed
  • .dt.strftime('%d-%m-%Y')
# Print the header of account_opend
print(banking['account_opened'].head())

# Convert account_opened to datetime
banking['account_opened'] = pd.to_datetime(banking['account_opened'],
                                           # Infer datetime format
                                           infer_datetime_format = True,
                                           # Return missing value for error
                                           errors = 'coerce') 

# Get year of account opened
banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')

# Print acct_year
print(banking['acct_year'] )

Cross filed validation

The use of multiple fields in a dataset to sanity check data integrity

Catch inconsistency

# Store fund columns to sum against
fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']

# Find rows where fund_columns row sum == inv_amount
inv_equ = banking[fund_columns].sum(axis = 1) == banking['inv_amount']

# Store consistent and inconsistent data
consistent_inv = banking[inv_equ]
inconsistent_inv = banking[~inv_equ]

# Store consistent and inconsistent data
print("Number of inconsistent investments: ", inconsistent_inv.shape[0])

# Store today's date and find ages
today = dt.date.today()
ages_manual = today.year - banking['birth_date'].dt.year

# Find rows where age column == ages_manual
age_equ = ages_manual == banking['age']

# Store consistent and inconsistent data
consistent_ages = banking[age_equ]
inconsistent_ages = banking[~age_equ]

# Store consistent and inconsistent data
print("Number of inconsistent ages: ", inconsistent_ages.shape[0])

What to do?

  • Dropping data.
  • Set to missing and impute.
  • Apply rules from domain knowledge.

Completeness

  • Missing data.

Missingno

  • Useful package for visualizing and understanding missing data

Missing type

  • Missing completely at Random
    • No systematic relationship between missing data and other values
    • Data entry errors when inputting data
  • Missing at Random
    • Systematic relationship between missing data and other observed values
    • Missing ozone data for high temperatures
  • Missing not at Random
    • Systematic relationship between missing data and other unobserved values
    • Missing temperature values for high temperatures
# Print number of missing values in banking
print(banking['inv_amount'].isna().sum()) #13

# Visualize missingness matrix
msno.matrix(banking)
plt.show()

# Isolate missing and non missing values of inv_amount
missing_investors = banking[banking['inv_amount'].isna()]
investors = banking[~banking['inv_amount'].isna()]

By using .desribe() on both missing_investors and investors: - The inv_amount is missing only for young customers, since the average age in missing_investors is 22 and the maximum age is 25.

# Sort banking by age and visualize
banking_sorted = banking.sort_values('age')
msno.matrix(banking_sorted)
plt.show()

How to deal with?

Simple approaches:

  1. Drop missing data : df.dropna(subset=['CO2'])
  2. Impute with statistical measures (mean, median, mode..) : df.fillna({'CO2': df['CO2].mean()})

More complex approaches:

  1. Imputing using an algorithmic approach
  2. Impute with machine learning models

Drop missing value and impute missing value

# Drop missing values of cust_id
banking_fullid = banking.dropna(subset = ['cust_id'])

# Compute estimated acct_amount
acct_imp = banking_fullid['inv_amount']*5

# Impute missing acct_amount with corresponding acct_imp
banking_imputed = banking_fullid.fillna({'acct_amount':acct_imp})

# Print number of missing values
print(banking_imputed.isna().sum())