Two parameter IRT model (2PL)

Here we will discuss the two parameter IRT model (2PL) by working through some examples in R. We will create two datasets: one whose item responses are chosen completely at random (binary: 0s and 1s), and another whose item responses follow a more consistent pattern generated from the 2PL model itself (also binary: 0s and 1s).

Part 1: Creating the datasets

Copy this chunk to generate random data

# Dataset 1
# Set seed for reproducibility
set.seed(123)

# Number of subjects and items
n_subjects <- 150
n_items <- 25

# Create a matrix of random binary responses (0s and 1s)
response_data1 <- matrix(rbinom(n_subjects * n_items, 1, 0.5), 
                        nrow = n_subjects, ncol = n_items)

# Create ID column
ids <- 1:n_subjects
ds1 <- rep(c('dataset1'), length(ids))

# Combine IDs with response data into a data frame
dummy_data1 <- data.frame(ID = ids, data=ds1, response_data1)

# Name the item columns for clarity
colnames(dummy_data1) <- c("ID", "data",paste0("Item", 1:n_items))

Copy this chunk to generate more structured (less random) data

# Dataset 2
# Step 1: Simulate subject abilities (theta) from N(0, 1)
theta <- rnorm(n_subjects, mean = 0, sd = 1)

# Step 2: Define item parameters
# Difficulties (b) ranging from -2 to +2, evenly spaced
b <- seq(-2, 2, length.out = n_items)
# Discrimination (a) ranging from 0.5 to 2.0, slightly varied
a <- runif(n_items, min = 0.5, max = 2.0)

# Step 3: Generate response probabilities using 2PL formula
probs <- matrix(NA, nrow = n_subjects, ncol = n_items)
for (i in 1:n_subjects) {
  for (j in 1:n_items) {
    probs[i, j] <- 1 / (1 + exp(-a[j] * (theta[i] - b[j])))
  }
}

# Step 4: Simulate binary responses (0s and 1s) based on probabilities
response_data2 <- matrix(rbinom(n_subjects * n_items, 1, probs), 
                        nrow = n_subjects, ncol = n_items)

# Step 5: Create data frame with IDs
ids <- 1:n_subjects
ds2 <- rep(c('dataset2'), length(ids))
dummy_data2 <- data.frame(ID = ids, data = ds2, response_data2)
colnames(dummy_data2) <- c("ID", "data", paste0("Item", 1:n_items))

Part 2: Visualizing the data

We can merge these datasets together with rbind() and then calculate the proportion of correct responses by item and dataset. We can then plot these proportions to get a sense of what randomly generated data look like compared to data that were built to be more meaningful.

# Packages used for reshaping and plotting
library(dplyr)
library(tidyr)
library(ggplot2)

item_order <- paste0("Item", 25:1)

rbind(dummy_data1, dummy_data2) %>%
  pivot_longer(-c(ID,data), names_to = "Items", values_to = "Correct") %>%
  group_by(data, Items)%>%
  summarise(mean_Correct = mean(Correct)) %>%
  mutate(Items = factor(Items, levels = item_order)) %>%
  ggplot(aes(x = Items, y = mean_Correct)) +
  facet_grid(~data) +
  coord_flip() +
  geom_point() +
  theme_classic()
Proportion of correct responses by item between datasets

We can see that in dataset1 the mean proportion of correct responses is very similar across items; there is very little variability, with values restricted to roughly 0.45 to 0.55. For dataset2, the mean proportions vary greatly around 0.50, with the earliest items (simulated to be easy) above 0.75 and the latest items (simulated to be hard) below 0.25, showing much greater variability across items.
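If we want to put numbers on this, here is a quick optional check (a sketch reusing the same dplyr/tidyr pipeline as above) that summarises the spread of the item proportions in each dataset:

# Summarise the range and spread of the per-item proportions in each dataset
rbind(dummy_data1, dummy_data2) %>%
  pivot_longer(-c(ID, data), names_to = "Items", values_to = "Correct") %>%
  group_by(data, Items) %>%
  summarise(mean_Correct = mean(Correct), .groups = "drop") %>%
  group_by(data) %>%
  summarise(min = min(mean_Correct), max = max(mean_Correct), sd = sd(mean_Correct))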

Part 3: Running an Item Response Theory (IRT) Model

Running the model

# load the ltm package
library(ltm)

# Run the analysis for the first dataset
irt_data1 <- dummy_data1[, -c(1,2)]  # Exclude the ID and data columns
irt_model1 <- ltm(irt_data1 ~ z1, IRT.param = TRUE)
irt_model1_summary <- summary(irt_model1)

# Run the analysis for the second dataset
irt_data2 <- dummy_data2[, -c(1,2)]  # Exclude the ID and data columns
irt_model2 <- ltm(irt_data2 ~ z1, IRT.param = TRUE)
irt_model2_summary <- summary(irt_model2)

# Prepare the model outputs to be merged (rbinded)
coeff_1 <- as.data.frame(irt_model1_summary$coefficients)
coeff_2 <- as.data.frame(irt_model2_summary$coefficients)
coeff_1$var <- row.names(coeff_1)
coeff_2$var <- row.names(coeff_2)
coeff_1$dataset <- "dataset1"
coeff_2$dataset <- "dataset2"

Plotting the output estimates

# plot the 'a' and 'b'
rbind(coeff_1, coeff_2) %>%
  mutate(estimates = ifelse(grepl("Dffclt",var), "b (difficulty)", "a (discrimination)"),
         Items = sub(".*\\.", "", var),
         Items = factor(Items, levels = item_order)) %>%
  ggplot(aes(x = Items, y = value)) +
  geom_point() +
  facet_grid(estimates~dataset, scales = "free_x") +
  coord_flip() +
  theme_linedraw() +
  labs(y = "Estimates")
Output Estimates of IRT on the datasets

In the figure above, we plot the two item parameter estimates obtained by fitting the same 2PL model to each dataset.

- a (discrimination) represents how well an item can discriminate between good and poor performers; it is the slope of the logistic curve at $\theta = b$, where the 2PL gives the probability of a correct response as $P(\theta) = \frac{1}{1 + e^{-a(\theta - b)}}$. The higher this value, the better the item differentiates. As a rough guide, discrimination below 0.8 is low, between 0.8 and 1.3 is medium, and anything above that is high. In real tests, $a$ typically falls between 0.5 and 2.5.
- b (difficulty) represents the ability level ($\theta$) at which the probability of success is 50%, so it captures the difficulty of an item. If people with an ability level one standard deviation below the mean have a 50% chance of answering an item correctly, it is an easy item. Conversely, if people with an ability level one standard deviation above the mean have a 50% chance of answering it correctly, that item would be considered difficult. $b$ typically ranges between -3 and +3 in real tests.
- $b$ is expressed on the $\theta$ scale; because $\theta$ is standardized with mean 0 and standard deviation 1, $b$ values can be read much like z-scores, even though they are not z-scores of observed scores. $a$ is more of a scaling factor, so it is unitless. A quick numerical check of these interpretations is sketched just below this list.

Comparing dataset1 to dataset2, we can clearly see that for discrimination, most items in dataset1 are low-discriminating, while most items in dataset2 are moderately or highly discriminating. This indicates that dataset2 has many more items that can tell good and poor performers apart. For difficulty, most items in dataset1 are estimated as difficult, with a few items coming out as very easy or very hard. For dataset2, the difficulty estimates are much more spread out, increasing in fairly even increments.
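As a small, self-contained sanity check of the interpretations above (using made-up illustrative parameter values, not values from the fitted models), we can confirm that the 2PL probability at $\theta = b$ is exactly 0.5 and that the slope of the curve at that point equals $a/4$:

# Illustrative 2PL check with made-up parameter values (not from the fitted models)
a_demo <- 1.5
b_demo <- 0.5
p_2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

p_2pl(b_demo, a_demo, b_demo)   # probability at theta = b is exactly 0.5

# Numerical slope at theta = b; should be close to a/4 = 0.375
(p_2pl(b_demo + 1e-5, a_demo, b_demo) - p_2pl(b_demo - 1e-5, a_demo, b_demo)) / (2e-5)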
Output Standard Errors of IRT on the datasets

We can also plot the standard errors of the estimates for a (discrimination) and b (difficulty). Estimates are only as good as their standard errors, since these indicate how certain we can be about each estimate; if a standard error is too large, we cannot really be sure what the estimate is. For dataset1, the standard errors for some items are abysmal, with one value going well past 100. For dataset2, the standard errors are much more homogeneous and restricted, which is good!
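The code for this standard-error plot follows the same pattern as the estimates plot; a minimal sketch is shown below, reusing coeff_1 and coeff_2 and assuming the std.err column that summary() returns alongside value:

# Sketch: plot the standard errors of the a and b estimates by item and dataset
rbind(coeff_1, coeff_2) %>%
  mutate(estimates = ifelse(grepl("Dffclt", var), "b (difficulty)", "a (discrimination)"),
         Items = sub(".*\\.", "", var),
         Items = factor(Items, levels = item_order)) %>%
  ggplot(aes(x = Items, y = std.err)) +
  geom_point() +
  facet_grid(estimates ~ dataset, scales = "free_x") +
  coord_flip() +
  theme_linedraw() +
  labs(y = "Standard Errors")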

Part 4: Extracting Ability Performance ($\theta$)

We can extract ability (theta) from the model with the factor.scores() function.

# Extract theta values (performance)
# Passing resp.patterns returns one score per subject, in the same row order as the data
theta_values1 <- factor.scores(irt_model1, resp.patterns = irt_data1, method = "EB")$score.dat
theta_values2 <- factor.scores(irt_model2, resp.patterns = irt_data2, method = "EB")$score.dat

# Introduce theta values into the datasets
dummy_data1_theta <- data.frame(ID = dummy_data1$ID, dataset = "dataset1", Theta = theta_values1$z1)
dummy_data2_theta <- data.frame(ID = dummy_data2$ID, dataset = "dataset2", Theta = theta_values2$z1)

# Plot the ability estimates
rbind(dummy_data1_theta, dummy_data2_theta) %>%
  ggplot(aes(x = dataset, y = Theta)) +
  geom_violin() +
  geom_jitter(width = .15) +
  labs(y = expression(theta~"(theta)")) +
  theme_classic()

Estimated Ability Between Datasets

Here we notice something interesting about the estimated ability scores. For the first dataset, the values are more restricted than in dataset2 and appear truncated: we do not really see tails, and the points simply cut off around the 1.5 mark. In contrast, dataset2 has a much larger range with clear tails. This indicates that the second dataset gave the model much better variability with which to discriminate performance, which is a good thing.
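To quantify this, here is a quick optional summary (a sketch reusing the combined data frame from above) of the range and spread of the estimated abilities in each dataset:

# Summarise the range and spread of the estimated abilities per dataset
rbind(dummy_data1_theta, dummy_data2_theta) %>%
  group_by(dataset) %>%
  summarise(min = min(Theta), max = max(Theta), sd = sd(Theta))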

Part 5: Checking Model Fit and Assumptions

Unidimensionality: this assumption states that the test measures only one single trait. In this case, we can say our data consist only of items that correspond to the latent trait 'working memory'. To test unidimensionality, we can use an exploratory or confirmatory factor analysis; when we run this analysis, we expect one clear dominant factor. If we see several strong factors, that is problematic.

library(psych)

# Parallel analysis (good for suggesting the number of factors)
# (drop the ID and data columns before running it)
fa.parallel(dummy_data1[, -c(1, 2)], fa = "fa", fm = "ml")
fa.parallel(dummy_data2[, -c(1, 2)], fa = "fa", fm = "ml")
Dataset1 Test for Unidimensionality
Dataset2 Test for Unidimensionality

Actually, there is an even better option: the ltm package has a dedicated test for this assumption, unidimTest().

### Assumptions

# Testing for unidimensionality with ltm's unidimTest()
# (null hypothesis: a single latent trait underlies the items,
#  so a small p-value suggests unidimensionality is violated)
unidim1 <- unidimTest(irt_model1)
unidim2 <- unidimTest(irt_model2)
print(unidim1)
print(unidim2)

If the printed p-value is not significant, the unidimensionality assumption is not rejected for that dataset.