KBB Fuzzy Matching HOH IDs - LeoLedesma237/LeoWebsite GitHub Wiki

This script may not be used often because the matches it produces can be incorrect.

Essentially, the matching process depends on the testers correctly spelling the head of household's first and last name and correctly entering their date of birth. If there are typos, children from the same household cannot be matched to each other.
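
To illustrate how sensitive this kind of matching is to spelling, here is a minimal sketch that compares string distances. The IDs are hypothetical (the real HOH_ID format is not shown on this page), and it assumes the stringdist package is installed, which is normally the case when fuzzyjoin is.

```r
library(stringdist)

# Hypothetical IDs built from a name and a date of birth
correct   <- "JOHNBANDA01011980"
swapped   <- "JONHBANDA01011980"  # two adjacent letters transposed
wrong_dob <- "JOHNBANDA01011981"  # one digit of the birth year differs

# An exact comparison fails as soon as one character differs
exact_match <- correct == swapped

# Optimal string alignment (OSA) counts an adjacent swap as one edit
osa_swapped <- stringdist(correct, swapped, method = "osa")
osa_dob     <- stringdist(correct, wrong_dob, method = "osa")

# Plain Levenshtein distance counts the same swap as two edits
lv_swapped <- stringdist(correct, swapped, method = "lv")

print(c(osa_swapped = osa_swapped, osa_dob = osa_dob, lv_swapped = lv_swapped))
```

A small distance only suggests that two IDs may refer to the same head of household; it does not prove it, which is why the output still needs manual review.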

This script is broken down into five parts:

  • Part 1: General data loading and cleaning
  • Part 2: Splitting the data into DD and non DD children
  • Part 3: Graphing screening coordinates
  • Part 4: Fuzzy matching HOH_IDs between DD and non DD children
  • Part 5: Saving the output of the fuzzy matching

Part 1: General data loading and cleaning

  • The CFM2_4 and CFM5_17 datasets are loaded so that GPS coordinates can be joined onto the data.
# Quality control script using Fuzzy Matching. 
library(readxl)
library(tidyverse)
library(ggplot2)
library(fuzzyjoin)
library(openxlsx)

# Set working directory
setwd("~/KBB_new_2/1_screener/final_data")

# Load in the data
No.Matches.Within.HOH <- read_excel("3) HOH No Matches (level 1).xlsx")

# Select variables of interest
No.Matches.Within.HOH <- select(No.Matches.Within.HOH, HOH_ID, Name_of_the_Village, Date_of_Evaluation, Child_ID, KBB_DD_status)

# Load in cleaned processed CFM data
setwd("~/KBB_new_2/1_screener/processed_data")

CFM2_4.location <- read.csv("CFM2_4_clean.csv") %>% select(Child_ID, GPS.lat, GPS.long)
CFM5_17.location <- read.csv("CFM5_17_clean.csv") %>% select(Child_ID, GPS.lat, GPS.long)

Binded.data.location <- rbind(CFM2_4.location, CFM5_17.location)

# Introduce the location information into the No Matches dataset
No.Matches.Within.HOH <- No.Matches.Within.HOH %>%
  left_join(Binded.data.location, by = "Child_ID")
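
Because rbind() stacks the two CFM files, a Child_ID that appeared in both would duplicate rows after the left_join(). The following is a minimal sketch of that check on toy data; the IDs and coordinates are made up, and the object names (screener, location) are placeholders rather than names from the script.

```r
library(dplyr)

# Toy stand-ins for the screener data and the stacked location data
screener <- tibble(Child_ID = c("C001", "C002", "C003"))
location <- tibble(
  Child_ID = c("C001", "C002", "C002"),  # C002 appears twice
  GPS.lat  = c(-16.80, -16.81, -16.81),
  GPS.long = c(26.98, 27.01, 27.01)
)

# Child_IDs that occur more than once in the location data
dup_ids <- location %>%
  count(Child_ID) %>%
  filter(n > 1) %>%
  pull(Child_ID)

# Keep one row per Child_ID before joining so the row count is preserved
location_unique <- distinct(location, Child_ID, .keep_all = TRUE)
joined <- left_join(screener, location_unique, by = "Child_ID")

# Children without coordinates keep NA in GPS.lat and GPS.long
missing_gps <- joined %>%
  filter(is.na(GPS.lat)) %>%
  pull(Child_ID)
```

Rows with missing coordinates are worth noting, because any later filter based on complete.cases() would drop them.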

Part 2: Split the data into DD and non DD children

# Split the data into DD and no DD
DD.No.Matches <- filter(No.Matches.Within.HOH, KBB_DD_status == "Yes")
noDD.No.Matches <- filter(No.Matches.Within.HOH, KBB_DD_status == "No")
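
The two filter() calls keep only rows whose KBB_DD_status is exactly "Yes" or "No", so a row with a missing or differently spelled status would land in neither group. Below is a small sanity-check sketch on toy data (all values hypothetical).

```r
library(dplyr)

toy <- tibble(
  Child_ID      = c("C001", "C002", "C003", "C004"),
  KBB_DD_status = c("Yes", "No", NA, "yes")  # NA and lowercase "yes" are deliberate
)

dd   <- filter(toy, KBB_DD_status == "Yes")
nodd <- filter(toy, KBB_DD_status == "No")

# Rows that ended up in neither group
unassigned <- anti_join(toy, bind_rows(dd, nodd), by = "Child_ID")
print(unassigned)
```

If unassigned is empty on the real data, the split has not silently lost any children.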

Part 3: Graph data screening location

  • This plot can indicate whether two villages are potentially the same, since they would share the same location coordinates.
# Print a graph with location
Binded.data.location %>%
  ggplot(aes(x = GPS.long, y = GPS.lat)) + # longitude on x and latitude on y, as on a map
  geom_point(color = "blue") +
  labs(title = "Data collection Location in Choma") +
  theme_linedraw()
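
The same question can be checked in a table rather than by eye. The sketch below uses made-up village names and coordinates, and assumes the village name and coordinates sit in the same data frame (as they do in No.Matches.Within.HOH after Part 1). Rounding to three decimal places, roughly 100 m, is an arbitrary choice.

```r
library(dplyr)

toy <- tibble(
  Name_of_the_Village = c("Mwanza", "Mwanza", "Mwaanza", "Siamaluba"),
  GPS.lat  = c(-16.8012, -16.8014, -16.8011, -16.9500),
  GPS.long = c(26.9803, 26.9801, 26.9804, 27.1000)
)

# Group screenings by rounded coordinates and list the village names seen there
shared_location <- toy %>%
  mutate(lat_r = round(GPS.lat, 3), long_r = round(GPS.long, 3)) %>%
  group_by(lat_r, long_r) %>%
  summarise(
    n_villages = n_distinct(Name_of_the_Village),
    villages   = paste(sort(unique(Name_of_the_Village)), collapse = ", "),
    .groups    = "drop"
  ) %>%
  filter(n_villages > 1)  # keep only locations with more than one village name

print(shared_location)
```

Locations that list more than one village name are candidates for a spelling variant of the same village.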

Part 4: Fuzzy matching HOH_IDs between DD and non DD children

# Take the HOH_IDs for both groups and fuzzy match them.
fuzzy.matched.results <- stringdist_join(DD.No.Matches, noDD.No.Matches, 
                                            by='HOH_ID', #match based on HOH_ID
                                            mode='left', #use left join
                                            method = "osa", #use optimal string alignment (OSA) distance
                                            max_dist=3, 
                                            distance_col='dist') %>%
  filter(complete.cases(.)) #drop rows with any NA, including DD children with no match
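
A toy run of the same call shows the shape of the output. The IDs and coordinates are hypothetical. The sketch also filters on the distance column instead of complete.cases(), which is a deliberate swap for illustration: complete.cases() discards matched rows as well when any other column (for example a GPS coordinate) is missing.

```r
library(dplyr)
library(fuzzyjoin)

dd <- tibble(
  HOH_ID   = c("JOHNBANDA01011980", "MARYPHIRI05051975"),
  Child_ID = c("C001", "C002"),
  GPS.lat  = c(NA, -16.80)  # missing coordinate on the row that will match
)
nodd <- tibble(
  HOH_ID   = c("JONHBANDA01011980", "PETERZULU09091990"),
  Child_ID = c("C101", "C102"),
  GPS.lat  = c(-16.81, -16.95)
)

# Same arguments as the script; shared column names get .x and .y suffixes
joined <- stringdist_join(dd, nodd,
                          by = "HOH_ID",
                          mode = "left",
                          method = "osa",
                          max_dist = 3,
                          distance_col = "dist")

# Keep rows where a match was found, regardless of other missing values
matched_by_dist <- filter(joined, !is.na(dist))

# The script's approach: keep rows with no missing value in any column
matched_complete <- filter(joined, complete.cases(joined))
```

Here matched_by_dist keeps the JOHN/JONH pair, while matched_complete is empty because the matched row has a missing GPS.lat.x. Whether that difference matters depends on how complete the real coordinates are.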

Part 5: Saving the results

# Save the results as an Excel workbook
setwd("~/KBB_new_2/Fuzzy matching")

write.xlsx(fuzzy.matched.results, "Potential HOH IDs Fuzzy Matched in HOH No Matches (level 1).xlsx")