11 01 Common Data Problems
Data type constraints
Data type | Example | Python data type |
---|---|---|
Text | name | String |
Integers | #Subscribers | int |
Decimals | Temperature | float |
Binary | Is married | bool |
Dates | Order dates | datetime |
Categories | Marriage status | category |
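A minimal sketch (with a hypothetical DataFrame, not from the course) of how these types map onto pandas dtypes and how to convert between them:
import pandas as pd

# Hypothetical data illustrating the table above
df = pd.DataFrame({
    'name': ['Ann', 'Bob'],                      # text -> object (str)
    'subscribers': ['10', '25'],                 # integers stored as strings
    'temperature': [21.5, 19.0],                 # decimals -> float
    'is_married': [True, False],                 # binary -> bool
    'order_date': ['2021-01-05', '2021-02-17'],  # dates stored as strings
    'marriage_status': ['married', 'single'],    # categories stored as strings
})

print(df.dtypes)                                                   # inspect current dtypes
df['subscribers'] = df['subscribers'].astype('int')                # string -> int
df['order_date'] = pd.to_datetime(df['order_date'])                # string -> datetime
df['marriage_status'] = df['marriage_status'].astype('category')   # string -> category
print(df.dtypes)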
String to Integers (Summing Strings)
# Remove $ from Revenue column
sales['Revenue'] = sales['Revenue'].str.strip('$')
sales['Revenue'] = sales['Revenue'].astype('int')
# Verify that Revenue is now an integer
assert sales['Revenue'].dtype == 'int' # if not true, will throw AssertionError
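If the column also contains thousands separators or missing values, pd.to_numeric is a more forgiving alternative (a sketch with hypothetical data, not part of the course code):
import pandas as pd

# Hypothetical messy revenue column: '$' signs, commas and a missing value
sales = pd.DataFrame({'Revenue': ['$1,200', '$850', None]})

# Strip the symbols, then convert; missing values stay NaN instead of raising an error
cleaned = sales['Revenue'].str.replace('[$,]', '', regex=True)
sales['Revenue'] = pd.to_numeric(cleaned)
print(sales['Revenue'].dtype)  # float64 because of the NaN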
Numeric or categorical
# Print the information of ride_sharing
print(ride_sharing.info())
# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25760 entries, 0 to 25759
Data columns (total 9 columns):
duration 25760 non-null object
station_A_id 25760 non-null int64
station_A_name 25760 non-null object
station_B_id 25760 non-null int64
station_B_name 25760 non-null object
bike_id 25760 non-null int64
user_type 25760 non-null int64
user_birth_year 25760 non-null int64
user_gender 25760 non-null object
dtypes: int64(5), object(4)
memory usage: 2.0+ MB
None
count 25760.000000
mean 2.008385
std 0.704541
min 1.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 3.000000
Name: user_type, dtype: float64
# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')
# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'
# Print new summary statistics
print(ride_sharing['user_type_cat'].describe())
count 25760
unique 3
top 2
freq 12972
Name: user_type_cat, dtype: int64
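After the conversion, .describe() reports frequency statistics (count, unique, top, freq) appropriate for a categorical column, instead of the misleading numeric summary (mean, std, quartiles) computed above.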
Data range constraints
How to deal with out of range data?
- Dropping data
# Drop values using filtering
movies = movies[movies['avg_rating'] <= 5]
# Drop values using .drop()
movies.drop(movies[movies['avg_rating'] > 5].index, inplace = True)
- Setting custom mins and max
- Treat as missing and impute (see the sketch after this list)
- Setting custom value depending on business assumptions
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5
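For the "treat as missing and impute" option, a minimal sketch using simple median imputation (the course does not prescribe a specific imputation method):
import numpy as np

# Mark out-of-range ratings as missing, then impute with the column median
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = np.nan
movies['avg_rating'] = movies['avg_rating'].fillna(movies['avg_rating'].median())
The ride_sharing exercise below applies the capping approach to tire sizes.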
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')
# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27
# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')
# Print tire size description
print(ride_sharing['tire_sizes'].describe())
Date range example
import datetime as dt
import pandas as pd

# Convert ride_date to datetime
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date'])
# Save today's date
today = dt.date.today()
# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today
# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())
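A quick sanity check after the fix (a sketch; assumes the imports above):
# No ride should be dated in the future any more
assert ride_sharing['ride_dt'].max() <= pd.Timestamp(today)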
Uniqueness constraints
Finding duplicates
The .duplicated() method
- subset: list of column names to check for duplication.
- keep: whether to keep the first ('first'), last ('last'), or all (False) duplicate values.
- Returns a boolean Series denoting duplicate rows.
# Find duplicates
duplicates = ride_sharing.duplicated(subset='ride_id', keep=False)
# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')
# Print relevant columns of duplicated_rides
print(duplicated_rides[['ride_id','duration','user_birth_year']])
Output: both complete duplicates (e.g. ride_id 55, identical rows) and incomplete duplicates (e.g. ride_id 33, where the durations differ)
ride_id duration user_birth_year
22 33 10 1979
39 33 2 1979
53 55 9 1985
65 55 9 1985
74 71 11 1997
75 71 11 1997
76 89 9 1986
77 89 9 2060
Treating duplicates
The .drop_duplicates() method
- subset: list of column names to check for duplication.
- keep: whether to keep the first ('first'), last ('last'), or all (False) duplicate values.
- inplace: drop duplicated rows directly inside the DataFrame without creating a new object (True).

The .groupby() and .agg() methods
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()
# Create statistics dictionary for aggregation function
statistics = {'user_birth_year':'min' , 'duration': 'mean'}
# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()
# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates == True]
# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
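.drop_duplicates() removes the complete duplicates; the remaining incomplete duplicates (same ride_id but conflicting values) are then collapsed into a single row per ride_id by grouping and aggregating, here keeping the minimum user_birth_year and the mean duration.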