Getting Started with Categorical Confirmatory Factor Analysis
Here we will simulate some dummy data, run a categorical confirmatory factor analysis, and then interpret the results.
```r
# Load packages
library(MASS)
library(lavaan)

# Set seed
set.seed(123)
N <- 500

# Simulate correlated latent traits
Sigma <- matrix(c(1, 0.5, 0.4,
                  0.5, 1, 0.3,
                  0.4, 0.3, 1), nrow = 3)
latent_factors <- mvrnorm(N, mu = c(0, 0, 0), Sigma = Sigma)
colnames(latent_factors) <- c("F1", "F2", "F3")

# Ordinal item generator: item_response = factor_loading * latent_factor + error,
# then cut into the categories 0/1/2 at the given thresholds
make_ordinal <- function(theta, loading = 1, thresholds = c(-0.5, 0.5)) {
  z <- loading * theta + rnorm(length(theta))
  cut(z, breaks = c(-Inf, thresholds, Inf), labels = 0:2)
}

# 5 items per factor
items <- data.frame(
  item1  = make_ordinal(latent_factors[, "F1"], 1),
  item2  = make_ordinal(latent_factors[, "F1"], 0.9),
  item3  = make_ordinal(latent_factors[, "F1"], 0.8),
  item4  = make_ordinal(latent_factors[, "F1"], 1.1),
  item5  = make_ordinal(latent_factors[, "F1"], 1),
  item6  = make_ordinal(latent_factors[, "F2"], 1),
  item7  = make_ordinal(latent_factors[, "F2"], 1.2),
  item8  = make_ordinal(latent_factors[, "F2"], 0.9),
  item9  = make_ordinal(latent_factors[, "F2"], 1),
  item10 = make_ordinal(latent_factors[, "F2"], 1.1),
  item11 = make_ordinal(latent_factors[, "F3"], 1),
  item12 = make_ordinal(latent_factors[, "F3"], 0.9),
  item13 = make_ordinal(latent_factors[, "F3"], 1.2),
  item14 = make_ordinal(latent_factors[, "F3"], 1),
  item15 = make_ordinal(latent_factors[, "F3"], 1)
)

# Convert to ordered factors
items[] <- lapply(items, ordered)
```
We will now walk through the code step by step to show why this simulation works. We start by creating a matrix (`Sigma`) that encodes 1) the number of factors in the data; 2) their variances; and 3) how strongly they correlate with each other. Here we created three factors, each with a variance of 1 and with moderate correlations to one another.
Then we use the `mvrnorm()` function from MASS to produce a dataset with 500 rows (observations) and three columns (one per factor). These values are the theta values, i.e., the latent trait scores. They were drawn to match the covariance structure we set with `Sigma`, and the mean of each factor was set to 0 via the `mu` argument.
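As a quick sanity check (not part of the original script), we can verify that the simulated theta values reproduce the correlation structure we specified:

```r
# The sample correlations should be close to the values in Sigma
round(cor(latent_factors), 2)
```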
Next we create a custom function named `make_ordinal()`, which takes a vector of theta values, multiplies it by a factor loading that we specify, and adds random noise. The resulting continuous values are then cut into the categories 0, 1, or 2 using the two thresholds. This is done five times per factor, simulating responses on 15 items, which will be the data we run the categorical CFA on.
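To see the thresholding in action (a small illustration, not part of the original script), here is the same `cut()` logic applied to a few fixed theta values with the noise term removed:

```r
# With loading = 1 and no noise: values below -0.5 become 0,
# values between -0.5 and 0.5 become 1, and values above 0.5 become 2
theta_demo <- c(-2, -0.4, 0, 0.6, 2)
cut(theta_demo, breaks = c(-Inf, -0.5, 0.5, Inf), labels = 0:2)
# [1] 0 1 1 2 2
# Levels: 0 1 2
```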
*(Screenshots: the Sigma correlation matrix; theta values for each factor; ordinal items generated from the theta values.)*
```r
# Visualizing the data (the tidyverse is needed for piping, reshaping, and plotting)
library(tidyverse)

# Count how often each response category was chosen for each item
items_summarised <- items %>%
  pivot_longer(item1:item15, names_to = "items", values_to = "response") %>%
  mutate(factors = case_when(
    items %in% paste0("item", 1:5)   ~ "F1",
    items %in% paste0("item", 6:10)  ~ "F2",
    items %in% paste0("item", 11:15) ~ "F3"
  )) %>%
  mutate(response = as.numeric(as.character(response))) %>%  # keep the 0/1/2 coding
  group_by(factors, items, response) %>%
  summarise(response_freq = n())

# Stacked bar chart of response frequencies per item
item_order <- paste0("item", 15:1)
items_summarised %>%
  mutate(response = factor(response),
         items = factor(items, levels = item_order)) %>%
  ggplot(aes(x = items, y = response_freq, fill = response)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_classic()

# Raw response counts per item
sapply(items, table)
```
*(Screenshots: distribution of responses by item; averaged within-factor item correlations.)*
```r
# Measurement model: the first loading per factor is fixed to 1 to set the scale
model <- '
  F1 =~ 1*item1 + item2 + item3 + item4 + item5
  F2 =~ 1*item6 + item7 + item8 + item9 + item10
  F3 =~ 1*item11 + item12 + item13 + item14 + item15
'

# Fit with robust weighted least squares, declaring every item as ordered
fit <- cfa(model, data = items, estimator = "WLSMV", ordered = colnames(items))

summary(fit, standardized = TRUE, fit.measures = TRUE)  # full output with fit indices
parameterEstimates(fit)                                 # unstandardized estimates
standardizedSolution(fit)                               # standardized estimates
inspect(fit, "r2")                                      # R^2 per item
```
*(Screenshots: unstandardized parameter estimates, part 1 and part 2.)*
We start by looking at the unstandardized factor loadings of each item on its corresponding factor (`=~`). The first item of each factor was fixed to 1 to set the latent scale. Then we see something new: the `|` operator. This gives the threshold at which a response crosses into the next sequential category for that item. Since each item has three response categories, there are two thresholds per item. The next operator, `~~`, gives the residual variance of an item or the unstandardized variance of a factor; these rows also show the covariances between the factors. Then come the scaling factors (`~*~`); under lavaan's default delta parameterization for ordered indicators, these are fixed to 1 so that each item's underlying latent response variable has unit variance. Lastly, the `~1` operator represents the intercepts/means; notice they have been fixed to 0.
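If it helps to see which operators the fitted model actually contains, we can tabulate them (a small check, not in the original write-up):

```r
# Count each parameter type in the fitted model:
# =~ loadings, | thresholds, ~~ (co)variances, ~*~ scaling factors, ~1 intercepts
table(parameterEstimates(fit)$op)
```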
Even though the output is fairly extensive, we really only care about the same pieces we examined in a regular CFA: the factor loadings and the variances and covariances between the factors. The residual variances of the items matter as well, but they become more interpretable once the model is standardized.
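Since those are the rows of interest, one convenient shortcut (a small sketch, assuming dplyr is available from the earlier `library(tidyverse)` call) is to filter the parameter table down to just the loadings and (co)variances:

```r
# Keep only the loadings (=~) and the variances/covariances (~~)
parameterEstimates(fit) %>%
  filter(op %in% c("=~", "~~")) %>%
  select(lhs, op, rhs, est, se, pvalue)
```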
*(Screenshots: standardized factor loadings; standardized residual variances, factor variances, and factor correlations; item R².)*
This is the more meaningful part of the output; here the `|`, `~1`, and `~*~` rows are redundant. Looking at the standardized factor loadings, we see that each item has a strong relationship with its respective factor. The variance of each factor is now set to 1, which is what we want. The residual variances of the items give an even better sense of how well each item is explained by its factor; we can think of them as 1 - R². Lastly, the standardized covariances between the factors are their correlations. Notice how similar they are to what we programmed into `Sigma` back in step 1.
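We can check the 1 - R² reading directly (a quick verification, not part of the original script): each item's standardized residual variance should equal one minus its squared standardized loading, matching what `inspect(fit, "r2")` reports.

```r
std <- standardizedSolution(fit)
loadings  <- std[std$op == "=~", ]
residuals <- std[std$op == "~~" & std$lhs == std$rhs & std$lhs %in% colnames(items), ]

# For each item: residual variance = 1 - loading^2 (and R^2 = loading^2)
data.frame(item         = loadings$rhs,
           loading      = loadings$est.std,
           residual     = residuals$est.std,
           one_minus_r2 = 1 - loadings$est.std^2)
```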
```r
# Plot it
library(semPlot)
semPaths(fit,
         what = "std",          # Standardized estimates
         whatLabels = "std",    # Show estimated values
         layout = "tree",       # Hierarchical layout
         edge.label.cex = 1,    # Edge label size
         node.label.cex = 1,    # Node label size
         sizeMan = 7,           # Width of item rectangles
         sizeMan2 = 4,          # Height of item rectangles
         sizeLat = 10,          # Size of factor ovals
         edge.color = "black",  # Black edges for clarity
         nodeLabels = c("Item 1", "Item 2", "Item 3", "Item 4", "Item 5",
                        "Item 6", "Item 7", "Item 8", "Item 9", "Item 10",
                        "Item 11", "Item 12", "Item 13", "Item 14", "Item 15",
                        "Factor 1", "Factor 2", "Factor 3"),  # Custom labels
         residuals = TRUE,      # Show residual variances
         exoCov = TRUE,         # Show factor correlations
         intercepts = FALSE,    # Exclude intercepts
         title = FALSE,         # Custom title added later
         curve = 3.75,          # Curved arrows for correlations
         rotation = 2,          # Vertical orientation for items
         normalize = FALSE,     # Mitigates overlap between indicators
         weighted = FALSE,      # Uniform edge thickness
         mar = c(3, 15, 3, 10)) # Margins: bottom, left, top, right
```
*(Screenshot: path diagram of the categorical CFA, standardized.)*
```r
# Run a Graded Response Model (one per factor)
library(ltm)  # provides grm() and factor.scores(); dplyr::select is used explicitly to avoid masking
F1_irt <- grm(dplyr::select(items, item1:item5),   constrained = FALSE, Hessian = TRUE, control = list(iter.max = 200))
F2_irt <- grm(dplyr::select(items, item6:item10),  constrained = FALSE, Hessian = TRUE, control = list(iter.max = 200))
F3_irt <- grm(dplyr::select(items, item11:item15), constrained = FALSE, Hessian = TRUE, control = list(iter.max = 200))

# Extract the thresholds (the Extrmt1/Extrmt2 columns of the coefficient matrix)
thresh_discr_F1 <- coef(F1_irt, simplify = TRUE)
thresh_F1 <- thresh_discr_F1[, c("Extrmt1", "Extrmt2")]
thresh_discr_F2 <- coef(F2_irt, simplify = TRUE)
thresh_F2 <- thresh_discr_F2[, c("Extrmt1", "Extrmt2")]
thresh_discr_F3 <- coef(F3_irt, simplify = TRUE)
thresh_F3 <- thresh_discr_F3[, c("Extrmt1", "Extrmt2")]

irt_thresh <- data.frame(rbind(thresh_F1, thresh_F2, thresh_F3))
irt_thresh$item <- row.names(irt_thresh)
irt_thresh$model <- "irt_GRM"

# Obtain threshold estimates from the CFA (the op == "|" rows)
cfa_thresh <- parameterEstimates(fit) %>%
  filter(op == "|") %>%
  select(lhs, rhs, est) %>%
  pivot_wider(names_from = rhs, values_from = est) %>%
  rename(item = lhs, Extrmt1 = t1, Extrmt2 = t2)
cfa_thresh$model <- "cat_CFA"

# Now combine the two models' thresholds and plot them side by side
rbind(irt_thresh, cfa_thresh) %>%
  mutate(item = rep(1:15, 2),
         item = factor(item)) %>%
  pivot_longer(cols = c(Extrmt1, Extrmt2), names_to = "threshold_type", values_to = "value") %>%
  ggplot(aes(x = item, y = value, shape = threshold_type, color = model)) +
  geom_point(size = 2.5) +
  #facet_grid(threshold_type ~ ., scales = "free_y") +
  theme_linedraw()
```
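One caveat when reading this comparison: the CFA thresholds live on the standardized latent response scale, while the GRM's `Extrmt` parameters live on the theta scale, so the two sets are not automatically in the same metric. Below is a hedged sketch of the usual conversion (in the style of Kamata & Bauer, 2008; it assumes lavaan's default delta parameterization and a standard normal theta):

```r
# Map standardized CFA estimates into IRT-style parameters (probit metric):
#   difficulty      b = tau / lambda                  (theta units, comparable to Extrmt)
#   discrimination  a = lambda / sqrt(1 - lambda^2)   (times ~1.7 to approximate the logistic GRM metric)
std <- standardizedSolution(fit)
lam <- std %>% filter(op == "=~") %>% transmute(item = rhs, lambda = est.std)
tau <- std %>% filter(op == "|")  %>% transmute(item = lhs, threshold = rhs, tau = est.std)

tau %>%
  left_join(lam, by = "item") %>%
  mutate(b        = tau / lambda,
         a_probit = lambda / sqrt(1 - lambda^2))
```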
```r
# Comparing theta values: factor scores from the CFA...
theta_scores_CFA <- data.frame(lavPredict(fit, type = "lv", method = "EBM"))
theta_scores_CFA$model <- "cat_CFA"

# ...and EAP factor scores from the three GRMs
theta_scores_GRM <- data.frame(
  F1 = factor.scores(F1_irt, resp.patterns = dplyr::select(items, item1:item5),   method = "EAP")$score.dat$z1,
  F2 = factor.scores(F2_irt, resp.patterns = dplyr::select(items, item6:item10),  method = "EAP")$score.dat$z1,
  F3 = factor.scores(F3_irt, resp.patterns = dplyr::select(items, item11:item15), method = "EAP")$score.dat$z1  # F3_irt, not F2_irt
)
theta_scores_GRM$model <- "irt_GRM"

# Bind them together to plot
rbind(theta_scores_CFA, theta_scores_GRM) %>%
  pivot_longer(-model,
               names_to = "Factor",
               values_to = "Theta_values") %>%
  ggplot(aes(x = Theta_values)) +
  geom_histogram() +
  facet_grid(model ~ Factor) +
  theme_linedraw()
```
*(Screenshots: estimated threshold comparisons between the CFA and IRT models; theta values compared between models.)*