11 04 Record linkage - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

Record linkage is a powerful technique used to merge multiple datasets together, used when values have typos or different spellings. In this chapter, you'll learn how to link records by calculating the similarity between strings—you’ll then use your new skills to join two restaurant review datasets into one clean master dataset.

Comparing strings

Simple string comparison

# Let's compare between two strings
from fuzzywuzzy import fuzz

# Compare reeding vs reading
fuzz.WRatio('Reeding', 'Reading')  
#86

### Partial Strings and different ordering
```python3
# Partial string comparison
fuzz.WRatio('Houston Rockets', 'Rockets') # 90

# Partial string comparison with different order
fuzz.WRatio('Houston Rockets vs Los Angeles Lakers', 'Lakers vs Rockets') # 86

Comparison with arrays

process.extract(): a string, an array to compare, a number of matches to return

# Import process
from fuzzywuzzy import process

# Define string and array of possible matches
string = "Houston Rockets vs Los Angeles Lakers"
choices = pd.Series(['Rockets vs Lakers', 'Lakers vs Rockets',
                        'Houson vs Los Angeles', 'Heat vs Bulls']) 
process.extract(string, choices, limit = 2)

[('Rockets vs Lakers', 86, 0), ('Lakers vs Rockets', 86, 1)] # the similar string, similarity, index in the original array

Collaspsing categories

# For each correct category
for state in categories['state']:
  # Find potential matches in states with typoes
    matches = process.extract(state, survey['state'], limit = survey.shape[0])

  # for each potential_match 
    for potential_match in matches:
      # if high similarity score
      if potential_match[1] >= 80:
         # Replace typo with correct category
          survey.loc[survey['state'] == potential_match[0], 'state'] = state

Generating pairs

Blocking

Create pairs on matching column, to reduce the possible number of pairs.

import recordlinkage

# Create an indexer and object and find possible pairs
indexer = recordlinkage.Index()

# Block pairing on cuisine_type
indexer.block('cuisine_type')

# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)

Comparing the DataFrames

# Create a comparison object
comp_cl = recordlinkage.Compare()

# Find exact matches on city, cuisine_types 
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('cuisine_type', 'cuisine_type', label = 'cuisine_type')

# Find similar matches of rest_name
comp_cl.string('rest_name', 'rest_name', label='name', threshold = 0.8) 

# Get potential matches and print
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new) # same order as pairs
print(potential_matches)

Print out potential_matches, the columns are the columns being compared, with values being 1 for a match, and 0 for not a match for each pair of rows in your DataFrames. To find potential matches, you need to find rows with more than matching value in a column.

# Find the pairs we want
potential_matches[potential_matches.sum(axis = 1) => 3]

Linking dataframes

Get the indices

.index.get_level_values()

Linking DataFrames

# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis=1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

# Append non_dup to restaurants
full_restaurants = restaurants.append(non_dup)
print(full_restaurants)