KBB CFM2‐4 and CFM5‐17 DD Data Work‐Flow Part 1 - LeoLedesma237/LeoWebsite GitHub Wiki

Overview

The goal of this script is to essentially clean the data before visual matching. This makes the process much easier and categorizes children by how far they make it down the pipeline. Screened data would end up falling into one of these 4 categories:

  • Incorrect Screeners
  • Excluded
  • HOH No Matches
  • HOH at least one potential match

There are 5/6 parts to the script.

  • Part 1: General data cleaning
  • Part 1.5: (Optional) Manipulating the time of the data
  • Part 2: Identifying correct/incorrect CFM screener used
  • Part 3: Creating datasets with/without excluded children
  • Part 4: Identify HOH with no matches and with at least one potential match
  • Part 5: Save each data respectively (each is mutually exclusive)

There is also a Quality Control section where visual data inspection confirms the code from these chunks are functioning as intended.

Visualizing the WorkFlow

KBB WorkFlow

Part 1: General data cleaning

The script below will be

  1. Loading in the scored CFM2-4 and CFM5-17 datasets
  2. Selecting and renaming variables that are present in both datasets
  3. Add a screener type variable
  4. Bind both datasets together
library(tidyverse)
library(readxl)
library(lubridate)
library(openxlsx) # To save excel files with multiple tabs

# Set working directory
setwd("~/KBB_new_2/1_screener/processed_data")

# Load in all data to be matched
CFM2_4 <- read.csv("CFM2_4_clean.csv")
CFM5_17 <- read.csv("CFM5_17_clean.csv")

# Select the variables that both datasets have in common
CFM2_4 <- CFM2_4 %>%
  select(HOH_ID, 
         Date_of_Evaluation, 
         Evaluator_ID, 
         Name_of_the_Village,
         Location_Type,
         HOH_First_Name,
         HOH_Last_Name,
         Respondant_First_Name,
         Respondant_Last_Name,
         Respondant_relationship,
         Child_First_Name,
         Child_Last_Name,
         Child_Gender,
         Child_age,
         Child_Date_of_Birth,
         BF,
         BM,
         glasses,
         hearing.aid,
         walking.equipment,
         Seeing = CF3_Seeing,
         Hearing = CF6_Hearing,
         Walking = CF10_Walking,
         Physical_difficulty_type,
         CFM_DD,
         KBB_CFM_DD,
         CFM_DD_type,
         KBB_CFM_DD_type,
         Epilepsy,
         KBB_DD_status,
         Excluded,
         Child_ID)

CFM5_17 <- CFM5_17 %>%
  select(HOH_ID, 
         Date_of_Evaluation, 
         Evaluator_ID, 
         Name_of_the_Village,
         Location_Type,
         HOH_First_Name,
         HOH_Last_Name,
         Respondant_First_Name,
         Respondant_Last_Name,
         Respondant_relationship,
         Child_First_Name,
         Child_Last_Name,
         Child_Gender,
         Child_age,
         Child_Date_of_Birth,
         BF,
         BM,
         glasses,
         hearing.aid,
         walking.equipment,
         Seeing = CF3_Seeing,
         Hearing = CF6_Hearing,
         Walking = CF13_Walking_500,
         Physical_difficulty_type,
         CFM_DD,
         KBB_CFM_DD,
         CFM_DD_type,
         KBB_CFM_DD_type,
         Epilepsy,
         KBB_DD_status,
         Excluded,
         Child_ID)

# Create a variable indicating the screener type
CFM2_4$Screener.Type <- rep("CFM2_4", nrow(CFM2_4))
CFM5_17$Screener.Type <- rep("CFM5_17", nrow(CFM5_17))

# Bind the datasets
Binded.data <- rbind(CFM2_4, CFM5_17)

Part 1.5: (Optional) Manipulating the time of the data

This could be used if we want to reconstruct reports from earlier times. Be very careful with this code- if forgotten that it is in place it will prevent new numbers from being processed by the following code chunks.

# Halt children after this date
Day = '10'
Month = '09'
Year =  '2023'

#Binded.data <- Binded.data %>% 
#  filter(Date_of_Evaluation < paste(Year, Month, Day, sep="-"))

Part 2: Identify correct/incorrect CFM screener used

There is a small possibility that some of the children screened were done so with an incorrect CFM screener. The code below checks to see if that is the case. Only those with correct screeners will continue with further data processing since their data is reliable. Both correct and incorrect entries will be saved in respective objects. We created a function that used if else statements to see if each CFM screener had the age expected range. To do this we implemented mapply function in conjunction with mutate.

# Correct screener used?
Screener.Test.fun <- function(Age, Screener.Type) {
  
  if(is.na(Age)) {
    return("Age is missing")
    
  }
  
  if(Screener.Type == "CFM2_4") {
    
    if(between(Age,1.8,5)) {
      return("Correct: CFM2-4")
    } else if(Age > 5) {
      return("Incorrect: Should have used CFM5-17")
    } else {
      return("Incorrect: Too young for the study")
    }
    
    
  } else if(Screener.Type == "CFM5_17") {
    
    if(between(Age,4.5,19)) {
      return("Correct: CFM5-17")
    } else if(Age > 19) {
      return("Incorrect: Too old for the study")
      
    } else {
      return("Incorrect: Should have used CFM2-4")
      
    }
  }
  
}

Binded.data <- Binded.data %>%
  mutate(Screener.Test = mapply(Screener.Test.fun, Child_age, Screener.Type))

# Print a data frame with the frequency of screener errors
data.frame(Frequency = cbind(table(Binded.data$Screener.Test)))

# Save the correct screeners 
Correct.Screeners <- Binded.data %>%
  filter(Screener.Test %in% c("Correct: CFM2-4", "Correct: CFM5-17"))

# Save the incorrect screeners
Incorrect.Screeners <- Binded.data %>%
  filter(!(Screener.Test %in% c("Correct: CFM2-4", "Correct: CFM5-17")))

Part 3: Create a datasets with/without excluded children

Children that have a 'Yes' for the excluded label will be saved in one object and those with a 'No' will be saved in another. The latter will go to the next step.

# Save excluded children
Excluded.Children <- Correct.Screeners %>%
  filter(Excluded == "Yes")

# Save only children that do not have an excluded status
Not.Excluded.Children <- Correct.Screeners %>%
  filter(Excluded == "No")

Part 4: Identify HOH with no matches and with at least one potential match

From the children that do not have an exclusion label, we can nest our data by the HOH_IDs we created in previous scripts. This groups our data by HOH and allows us to run functions within these households. The code below is identifying households that have at least one DD and one non-DD pair and those that do not. These pairs are potential matches since they need to be investigated visually to see if they are half-siblings or full siblings.

# Nest the data by HOH_IDs
Not.Excluded.Children.Nested <- Not.Excluded.Children %>%
  group_by(HOH_ID) %>%
  nest()

# Run a function using the map from Purr to see if there are DD and non-DD status
Not.Excluded.Children.Mapped <- Not.Excluded.Children.Nested %>%
  mutate(table = map(.x = data, .f = ~table(.x$KBB_DD_status))) %>%
  mutate(matches = map(.x = table, .f = ~length(.x) == 2))

# Return household with at least one potential match 
HOH.Matches <- Not.Excluded.Children.Mapped[do.call(c,Not.Excluded.Children.Mapped$matches),]

# Return households with no matches
HOH.No.Matches <- Not.Excluded.Children.Mapped[!do.call(c,Not.Excluded.Children.Mapped$matches),]

# Unnest bothdatasets so they can be saved as CSVs
HOH.Matches.unnested <- unnest(HOH.Matches, data)
HOH.No.Matches.unnested <- unnest(HOH.No.Matches, data)

# Remove unneeded variables
HOH.Matches.unnested <- select(HOH.Matches.unnested, -c(table, matches))
HOH.No.Matches.unnested <- select(HOH.No.Matches.unnested, -c(table, matches))

Part 5: Save each data respectively (each is mutually exclusive)

Each of these datasets are mutually exclusive, meaning that a child in one dataset will not (hopefully) be present in any the other datasets. The variable 'Overall.Summary' was created to describe which category a child belongs to. Additionally, potential matches have "To be Determined" since their classification will be done by hand in the next script:

We will also save all of the children in CFM2-4 and CFM5-4 dataset binded before any data manipulation to use as quality control for the next script.

## Add an overall summary to each of the datasets
Incorrect.Screeners$Overall.Summary <- "Incorrect Screener"
Excluded.Children$Overall.Summary <- "Excluded Children"
HOH.No.Matches.unnested$Overall.Summary <- "Unable to Match"
HOH.Potential.Matches.unnested$Overall.Summary <- "To be Determined"

# Set the save directory
setwd("~/KBB_new_2/1_screener/final_data")

# Save level one datasets
write.xlsx(list(data = Incorrect.Screeners), file =  "1) Incorrect Screeners (level 1).xlsx")
write.xlsx(list(data = Excluded.Children), file =  "2) Excluded Children (level 1).xlsx")
write.xlsx(list(data = HOH.No.Matches.unnested), file =  "3) HOH No Matches (level 1).xlsx")
write.xlsx(list(data = HOH.Potential.Matches.unnested), file =  "4) HOH Potential Matches (level 1).xlsx")
write.xlsx(list(data = Binded.data), file = "All Children.xlsx")

Quality Control

Mutual Exclusive Datasets

The four mutually exclusive datasets were reconstructed and compared to the original 'Binded.data'. The reconstructed dataset contained no duplicates for Child_ID and had the same number of rows as 'Binded.data'.

# The number of rows for the original binded dataset
nrow(Binded.data)

# Bind the mutually exclusive datasets back into one
reconstructed.data <- rbind(Incorrect.Screeners,
                            Excluded.Children,
                            HOH.No.Matches.unnested,
                            HOH.Potential.Matches.unnested)

# Obtain their row numbers- should match Binded.data
nrow(reconstructed.data)

# Check for duplicates for the Binded Child IDs
sum(duplicated(Binded.data$Child_ID))

# Check for duplicates in the reconstructed data
sum(duplicated(reconstructed.data$Child_ID))

# Check for any missing data for Child ID
sum(is.na(Binded.data$Child_ID))
⚠️ **GitHub.com Fallback** ⚠️