11 02 Text and categorical data problems - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki
Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.
Membership constraints
- Predefined finite set of categories.
Finding consistency
# Print categories DataFrame
print(categories)
# Print unique values of survey columns in airlines
print('Cleanliness: ', airlines['cleanliness'].unique(), "\n") # turns out to have an inconsistent category
print('Safety: ', airlines['safety'].unique(), "\n")
print('Satisfaction: ', airlines['satisfaction'].unique(), "\n")
inconsistent_categories = set(study['blood_type']).difference(categories['blood_type'])
: return a set with elements in A but not in B
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])
# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)
# Print rows with inconsistent category
print(airlines[cat_clean_rows])
# Print rows with consistent categories only
print(airlines[~cat_clean_rows])
Categorical variables
Type of errors
- value inconsistency
- Inconsistent fields:
married
,Maried
- Trailing white spaces:
married
,married
- Inconsistent fields:
- Collapsing too many categories to few
- Creating new groups
- Mapping groups to new ones
- Making sure data is of type
category
Value consistency
.str.lower()
/.str.upper()
.str.strip()
Collapsing data into categories
pd.qcut()
pd.cut()
.replace(mapping)
# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())
['Asia' 'Canada/Mexico' 'West US' 'East US' 'Midwest US' 'EAST US'
'Middle East' 'Europe' 'eur' 'Central/South America'
'Australia/New Zealand' 'middle east']
['Hub' 'Small' ' Hub' 'Medium' 'Large' 'Hub ' ' Small'
'Medium ' ' Medium' 'Small ' ' Large' 'Large ']
# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower()
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})
# Remove white spaces from `dest_size`
airlines['dest_size'] = airlines['dest_size'].str.strip()
Remapping categories
# Create ranges for categories
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']
# Create wait_type column
airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges,
labels = label_names)
# Create mappings and replace
mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday',
'Thursday': 'weekday', 'Friday': 'weekday',
'Saturday': 'weekend', 'Sunday': 'weekend'}
airlines['day_week'] = airlines['day'].replace(mappings)
Cleaning text data
Common text data problems
- Data inconsistency
+96171679912
or009817169912
- Fixed length violations
- Passwords need to be at least 8 characters
- Typos
.str.replace()
:
# Replace "Dr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Dr.","")
phones.loc[phones['Phone number'].str.len() < 10, 'Phone number'] = np.nan
- regular expressions