Two parameter IRT model (2PL) with Missing Data

Overview

Today we will run the same model on different versions of a single dataset. This dataset was used in Two parameter IRT model (2PL) and appeared to produce good estimates, so we will treat it as the gold standard and see how well the following four datasets compare to it.

  • dataset1: Contains 10% of data missing at random
  • dataset2: Contains 10% of data missing but systematically, with a higher bias toward the later items
  • dataset3: Contains 30% of data missing at random
  • dataset4: Contains 30% of data missing but systematically, with a higher bias toward the later items

Step 1: Generating Data

Generate the gold-standard dataset

# Gold-standard dataset (no missing data)
# Set seed for reproducibility
set.seed(123)

# Number of subjects and items
n_subjects <- 150
n_items <- 25

# Step 1: Simulate subject abilities (theta) from N(0, 1)
theta <- rnorm(n_subjects, mean = 0, sd = 1)

# Step 2: Define item parameters
# Difficulties (b) ranging from -2 to +2, evenly spaced
b <- seq(-2, 2, length.out = n_items)
# Discrimination (a) ranging from 0.5 to 2.0, slightly varied
a <- runif(n_items, min = 0.5, max = 2.0)

# Step 3: Generate response probabilities using 2PL formula
probs <- matrix(NA, nrow = n_subjects, ncol = n_items)
for (i in 1:n_subjects) {
  for (j in 1:n_items) {
    probs[i, j] <- 1 / (1 + exp(-a[j] * (theta[i] - b[j])))
  }
}

# Step 4: Simulate binary responses (0s and 1s) based on probabilities
response_data2 <- matrix(rbinom(n_subjects * n_items, 1, probs), 
                         nrow = n_subjects, ncol = n_items)

# Step 5: Create data frame with IDs
ids <- 1:n_subjects
ds2 <- rep(c('dataset2'), length(ids))
dummy_data2 <- data.frame(ID = ids, data = ds2, response_data2)
colnames(dummy_data2) <- c("ID", "data", paste0("Item", 1:n_items))
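
For reference, Step 3 above implements the standard 2PL response probability, where $\theta_i$ is subject $i$'s ability and $a_j$ and $b_j$ are item $j$'s discrimination and difficulty:

$$P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + e^{-a_j(\theta_i - b_j)}}$$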

Generate four datasets with different conditions of missing data

# Create dummy_data2a with ~10% random missing data
dummy_data2a <- dummy_data2

# Generate 375 random row and column indices for Item1 to Item25 (columns 3 to 27)
rows <- sample(1:150, 375, replace = TRUE)
cols <- sample(3:27, 375, replace = TRUE)

# Introduce NA at selected positions
for (i in 1:375) {
  dummy_data2a[rows[i], cols[i]] <- NA
}

# Create dummy_data2c with ~30% random missing data
dummy_data2c <- dummy_data2

# Generate 1400 random row and column indices for Item1 to Item25 (columns 3 to 27)
# (more draws than the nominal 1125 to offset duplicate positions from sampling with replacement)
rows <- sample(1:150, 1400, replace = TRUE)
cols <- sample(3:27, 1400, replace = TRUE)

# Introduce NA at selected positions
for (i in 1:1400) {
  dummy_data2c[rows[i], cols[i]] <- NA
}


# Create dummy_data2b with ~10% systematic missing data
dummy_data2b <- dummy_data2

# Generate indices for systematic missing data
# Columns 3 to 14 (Item1 to Item12): ~100 NAs
rows_early <- sample(1:150, 100, replace = TRUE)
cols_early <- sample(3:14, 100, replace = TRUE)

# Columns 15 to 27 (Item13 to Item25): ~275 NAs, with higher probability for later columns
# Weighted sampling for columns 15 to 27
cols_late <- sample(15:27, 275, replace = TRUE, prob = seq(1, 13, length.out = 13))
rows_late <- sample(1:150, 275, replace = TRUE)

# Introduce NA for early columns (Item1 to Item12)
for (i in 1:100) {
  dummy_data2b[rows_early[i], cols_early[i]] <- NA
}

# Introduce NA for later columns (Item13 to Item25)
for (i in 1:275) {
  dummy_data2b[rows_late[i], cols_late[i]] <- NA
}
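
To get a feel for the weighted sampling above, the expected number of the 275 draws landing on each later item (Item13 through Item25, weights 1 through 13) is roughly:

# Expected draws per later item in the ~10% systematic condition
round(275 * (1:13) / sum(1:13))

So Item13 receives only about 3 of the draws while Item25 receives close to 40, producing the intended bias toward missingness on the later items.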


# Create dummy_data2d with ~30% systematic missing data
dummy_data2d <- dummy_data2

# Generate indices for systematic missing data
# Columns 3 to 14 (Item1 to Item12): 500 draws
rows_early <- sample(1:150, 500, replace = TRUE)
cols_early <- sample(3:14, 500, replace = TRUE)

# Columns 15 to 27 (Item13 to Item25): 1000 draws, with higher probability for later columns
# Weighted sampling for columns 15 to 27
cols_late <- sample(15:27, 1000, replace = TRUE, prob = seq(1, 13, length.out = 13))
rows_late <- sample(1:150, 1000, replace = TRUE)

# Introduce NA for early columns (Item1 to Item12)
for (i in 1:500) {
  dummy_data2d[rows_early[i], cols_early[i]] <- NA
}

# Introduce NA for later columns (Item13 to Item25)
for (i in 1:1000) {
  dummy_data2d[rows_late[i], cols_late[i]] <- NA
}
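
Because the row and column indices are sampled with replacement, some positions are drawn more than once, so the realized missing rates land near (rather than exactly at) the nominal 10% and 30% targets. A quick check, not part of the original code, could be:

# Proportion of NAs among the item columns (columns 3 to 27) of each dataset
sapply(list(dummy_data2a, dummy_data2b, dummy_data2c, dummy_data2d),
       function(d) round(mean(is.na(d[, -c(1, 2)])), 3))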

Step 2: Getting Descriptives of NA by Each Dataset

  • We need to confirm that our data look the way we intended, or at least come close!

# dplyr and ggplot2 (used below) are loaded via the tidyverse
library(tidyverse)

# Count the NAs per item (columns 3 to 27 hold the item responses) for each dataset
binded_data <- data.frame(
  Na = colSums(is.na(dummy_data2[, -c(1, 2)])),
  Item = 1:n_items,
  dataset = "No NAs") %>%
  rbind(
    data.frame(
      Na = colSums(is.na(dummy_data2a[, -c(1, 2)])),
      Item = 1:n_items,
      dataset = "10% (Random)")
  ) %>%
  rbind(
    data.frame(
      Na = colSums(is.na(dummy_data2b[, -c(1, 2)])),
      Item = 1:n_items,
      dataset = "10% (Systematic)")
  ) %>%
  rbind(
    data.frame(
      Na = colSums(is.na(dummy_data2c[, -c(1, 2)])),
      Item = 1:n_items,
      dataset = "30% (Random)")
  ) %>%
  rbind(
    data.frame(
      Na = colSums(is.na(dummy_data2d[, -c(1, 2)])),
      Item = 1:n_items,
      dataset = "30% (Systematic)")
  )

# Graph the data
Item_lvl <- c("No NAs","10%  (Random)", "10% (Systematic)", "30% (Random)","30% (Systematic)")

binded_data %>%
  mutate(dataset = factor(dataset, levels = Item_lvl)) %>%
  ggplot(aes(x = Item, y = Na)) +
  geom_point() +
  facet_grid(dataset ~ .) +
  theme_classic()

# Obtain the percentage of missing data
binded_data %>%
  group_by(dataset) %>%
  summarise(percent_na = round(sum(Na) / (n_items * n_subjects), 2) * 100)
Percentage of Missing Data by Datasets
[Screenshot: table of the overall percentage of missing data in each dataset]
Number of Missing Data by Item for Each Dataset
[Screenshot: plot of NA counts by item, faceted by dataset]
  • From the first table, we see that we roughly have the percentage of NAs in each dataset that we wanted, so that is good.
  • In the figure, we see that for the (Random) datasets the NAs are spread fairly evenly across the items, while for the (Systematic) datasets there are more NAs in the later items than in the earlier ones.
  • With this table and figure, we have confirmed that our datasets look the way we intended.

Step 3: Running the IRT Model (2PL) on Each Dataset

Running IRT for each dataset

data_list <- list(dummy_data2 = dummy_data2,
                  dummy_data2a = dummy_data2a,
                  dummy_data2b = dummy_data2b,
                  dummy_data2c = dummy_data2c,
                  dummy_data2d = dummy_data2d)

# Redefined here (same as above) so this chunk can run on its own
Item_lvl <- c("No NAs", "10% (Random)", "10% (Systematic)", "30% (Random)", "30% (Systematic)")

# Run IRT
library(ltm)

IRT_coefficients_list <- list()
theta_values_list <- list()

for(ii in 1:length(data_list)) {
  
  # Extract current dataset
  current_dataset <- data_list[[ii]]
  
  # Remove the first two columns (they are not item responses)
  current_dataset2 <- current_dataset[, -c(1,2)]
  
  # Run the analysis
  irt_mod <-  ltm(current_dataset2 ~ z1, IRT.param = TRUE)
  irt_mod_summ <- summary(irt_mod)
  
  # Extract the coefficients (the estimates)
  coeff <- as.data.frame(irt_mod_summ$coefficients)
  coeff$var <- row.names(coeff)
  coeff$dataset <- Item_lvl[ii]
  
  # Extract theta values and save them into a dataframe
  theta_scores <- factor.scores(irt_mod, resp.patterns = current_dataset2, method = "EAP")
  theta <- data.frame(theta = round(theta_scores$score.dat$z1, 3),
                      dataset = Item_lvl[ii])
  
  # Save the outputs into a list
  IRT_coefficients_list[[ii]] <- coeff
  theta_values_list[[ii]] <- theta
}


# Bind the irt outputs for each dataset together
binded_irt <- do.call(rbind, IRT_coefficients_list)
binded_theta <- do.call(rbind, theta_values_list)
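
As an optional check on any single fit (not shown on this page), the ltm package can plot item characteristic curves directly; for example, for the last model fitted in the loop (the 30% systematic dataset):

# Item characteristic curves for the most recently fitted model (a quick sketch)
plot(irt_mod, type = "ICC")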

Plotting the Discrimination and Difficulty Estimates

# Let's plot the estimates from the IRT models
item_order <- paste0("Item",25:1)

binded_irt %>%
  mutate(estimates = ifelse(grepl("Dffclt",var), "b (difficulty)", "a (discrimination)"),
         Items = sub(".*\\.", "", var),
         Items = factor(Items, levels = item_order),
         dataset = factor(dataset, levels = Item_lvl)) %>%
  ggplot(aes(x = Items, y = value)) +
  geom_point() +
  facet_grid(estimates~dataset, scales = "free_y") +
  theme_linedraw() +
  coord_flip() +
  labs(y = "Estimates")
Discrimination and Difficulty Estimates for each Dataset
[Screenshot: plot of the a and b estimates by item, faceted by dataset]

The estimates from the 10% missing-data datasets look nearly identical to those from the dataset without any NAs. We can start to see some problems with 30% missing data, but overall it doesn't look too bad. The main concern is Item 24 in the 30% (Random) dataset.

Standard Errors of the Estimates for each Dataset
[Screenshot: plot of the standard errors of the a and b estimates by item, faceted by dataset]
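
The code for the standard-error figure is not included on this page; a sketch of how it could be produced from binded_irt, assuming the std.err column name returned by summary(), is:

binded_irt %>%
  mutate(estimates = ifelse(grepl("Dffclt", var), "b (difficulty)", "a (discrimination)"),
         Items = sub(".*\\.", "", var),
         Items = factor(Items, levels = item_order),
         dataset = factor(dataset, levels = Item_lvl)) %>%
  ggplot(aes(x = Items, y = std.err)) +
  geom_point() +
  facet_grid(estimates ~ dataset, scales = "free_y") +
  theme_linedraw() +
  coord_flip() +
  labs(y = "Standard Error")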

Step 4: Extracting Ability Performance ($\theta$)

For efficiency, we already extracted the theta scores inside the for loop from the previous section using the factor.scores() function, so here we only need to plot them.

binded_theta %>%
  mutate(dataset = factor(dataset, levels = Item_lvl)) %>%
  ggplot(aes(x = dataset, y = theta)) +
  geom_violin() +
  geom_jitter(width = .15) +
  labs(y = expression(theta~"(theta)")) +
  theme_classic()
Estimated Ability Between Datasets
[Screenshot: violin plots of the estimated theta values for each dataset]

When examining the distributions of the theta values across the datasets, they all look reasonable and approximately normal.
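
A quick numeric check of those distributions (a sketch, not shown on the original page):

binded_theta %>%
  group_by(dataset) %>%
  summarise(mean_theta = round(mean(theta), 2),
            sd_theta = round(sd(theta), 2))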

Theta Correlations Across Datasets
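
One way to quantify how similar the ability estimates are across the datasets is to correlate them directly. A sketch of how this could be done from binded_theta, assuming subjects appear in the same row order within each dataset:

# Reshape to one column of theta values per dataset, then correlate
theta_wide <- binded_theta %>%
  group_by(dataset) %>%
  mutate(ID = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = dataset, values_from = theta)

# Pairwise correlations of the estimated abilities across datasets
round(cor(theta_wide[, -1]), 2)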

Conclusion

The investigation into the discrimination and difficulty estimates, their standard errors, and the theta values shows that IRT models (2PL) can handle 10% missing data, whether random or systematic, very well. This held true in our dataset of 150 observations, so any datasets we analyze in the future with this sample size or larger should be fairly robust to minor amounts of missing data.