11 01 Common Data Problems

Data type constraints

DataType      Example            Python DataType
Text          name               String
Integers      # Subscribers      int
Decimals      Temperature        float
Binary        Is married         bool
Dates         Order dates        datetime
Categories    Marriage status    category
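
A quick way to check and fix these mappings is .dtypes plus .astype(); a minimal sketch on a made-up DataFrame (the column names below are illustrative, not from the course):

import pandas as pd

# Hypothetical example: every column is read in as a string
df = pd.DataFrame({'name': ['Ann', 'Bob'],
                   'subscribers': ['10', '25'],
                   'marriage_status': ['married', 'single']})
print(df.dtypes)   # all three columns show up as object

# Convert each column to its intended type
df['subscribers'] = df['subscribers'].astype('int')
df['marriage_status'] = df['marriage_status'].astype('category')
print(df.dtypes)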

String to Integers (Summing Strings)

# Remove $ from Revenue column
sales['Revenue'] = sales['Revenue'].str.strip('$')
sales['Revenue'] = sales['Revenue'].astype('int')
# Verify that Revenue is now an integer
assert sales['Revenue'].dtype == 'int'  # raises an AssertionError if the conversion failed
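
Why the conversion matters: summing a string column concatenates the values instead of adding them. A small sketch with made-up data:

import pandas as pd

sales = pd.DataFrame({'Revenue': ['$10', '$25', '$5']})

# Summing the raw strings concatenates them
print(sales['Revenue'].sum())                                # '$10$25$5'

# After stripping '$' and casting to int, the sum is numeric
print(sales['Revenue'].str.strip('$').astype('int').sum())   # 40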

Numeric or categorical

# Print the information of ride_sharing
print(ride_sharing.info())

# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25760 entries, 0 to 25759
Data columns (total 9 columns):
duration           25760 non-null object
station_A_id       25760 non-null int64
station_A_name     25760 non-null object
station_B_id       25760 non-null int64
station_B_name     25760 non-null object
bike_id            25760 non-null int64
user_type          25760 non-null int64
user_birth_year    25760 non-null int64
user_gender        25760 non-null object
dtypes: int64(5), object(4)
memory usage: 2.0+ MB
None
count    25760.000000
mean         2.008385
std          0.704541
min          1.000000
25%          2.000000
50%          2.000000
75%          3.000000
max          3.000000
Name: user_type, dtype: float64
# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')

# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'

# Print new summary statistics 
print(ride_sharing['user_type_cat'].describe())
count     25760
unique        3
top           2
freq      12972
Name: user_type_cat, dtype: int64
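
The first .describe() treated user_type as a number, so statistics like the mean (~2.0) are meaningless for what is really a coded category; after the conversion, .describe() reports count, unique, top and freq instead. To see the full distribution, one option (a small sketch using the column created above) is:

# Frequency of each user type after the conversion
print(ride_sharing['user_type_cat'].value_counts())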

Data range constraints

How to deal with out of range data?

  • Dropping data
# Drop values using filtering
movies = movies[movies['avg_rating'] <= 5]

# Drop values using .drop()
movies.drop(movies[movies['avg_rating'] > 5].index, inplace = True)
  • Setting custom minimums and maximums
  • Treating values as missing and imputing them (see the sketch after this list)
  • Setting a custom value depending on business assumptions
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5
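
A minimal sketch of the "treat as missing and impute" option, assuming the same movies DataFrame; imputing with the column mean is just one possible rule, not the course's:

import numpy as np

# Mark out-of-range ratings as missing, then impute with the column mean
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = np.nan
movies['avg_rating'] = movies['avg_rating'].fillna(movies['avg_rating'].mean())
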
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27

# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

# Print tire size description
print(ride_sharing['tire_sizes'].describe())

Date range example

# Imports (not shown in the course snippet)
import datetime as dt
import pandas as pd

# Convert ride_date to datetime
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date'])

# Save today's date
today = dt.date.today()

# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today

# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())
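
One caveat, not from the course: recent pandas versions warn about (or reject) comparing a datetime64 column with a plain datetime.date, so keeping today as a pandas Timestamp is safer:

# Safer variant: today as a pandas Timestamp (midnight today)
today = pd.Timestamp.today().normalize()
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today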

Uniqueness constraints

Finding duplicates

The .duplicated() method

  • subset: list of column names to check for duplication.
  • keep: whether to keep the first ('first'), the last ('last'), or all (False) duplicate values.
  • Returns a boolean Series denoting duplicate rows.
# Find duplicates
duplicates = ride_sharing.duplicated(subset='ride_id', keep=False)

# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')

# Print relevant columns of duplicated_rides
print(duplicated_rides[['ride_id', 'duration', 'user_birth_year']])

The output contains both complete duplicates (e.g. ride_id 55, identical in every column) and incomplete duplicates (e.g. ride_id 33, where duration differs):

       ride_id  duration  user_birth_year
    22       33        10             1979
    39       33         2             1979
    53       55         9             1985
    65       55         9             1985
    74       71        11             1997
    75       71        11             1997
    76       89         9             1986
    77       89         9             2060

Treating duplicates

The .drop_duplicates() method

  • subset: list of column names to check for duplication.
  • keep: whether to keep the first ('first'), the last ('last'), or all (False) duplicate values.
  • inplace: drop duplicated rows directly inside the DataFrame without creating a new object (True).

The .groupby() and .agg() method

For incomplete duplicates, group by the identifying column and aggregate the conflicting columns (e.g. keep the minimum birth year and the mean duration), as the code below does.
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()

# Create statistics dictionary for aggregation function
statistics = {'user_birth_year':'min' , 'duration': 'mean'}

# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()

# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0