Getting Started with Categorical Exploratory Factor Analysis (EFA) in Lavaan

Overview

This page covers how to run and interpret a categorical Exploratory Factor Analysis (EFA) in R using lavaan. The focus is on lavaan syntax, since the same syntax can later be used to build more complex models, such as CFA-EFA hybrid (ESEM) models. We will work through the topic by generating data, running EFA on it with lavaan, and then repeating the analysis with established functions from the psych package to confirm that the lavaan code and output are correct. Each section also includes a parallel analysis step, which is important to run so you have an idea of how many factors to expect from your data.

Part 1: Generating EFA for a single factor

We will start by generating data that a single factor can best explain (the code below was drafted with Grok). First we create values for a standardized factor, which here means drawing a normally distributed vector with mean = 1 and standard deviation = 1. Building the factor this way means the means of the item scores we generate next will roughly match the factor loadings we specify; changing the factor mean from 1 to 0 would not change the factor loadings recovered by the EFA model. Next we create a vector of factor loadings, which represent the standardized relationship between the factor and the corresponding item. We want our data to have 6 items, so 6 factor loadings are specified: c(0.8, 0.75, 0.7, 0.85, 0.65, 0.7). These values indicate moderate to strong relationships with the factor. We then create a dataset with 6 columns and 1000 observations, where each column is the factor vector $\times$ the factor loading specified for that column, plus some error, rnorm(n, 0, 0.5). In other words, we take the factor values, weaken them by the specified loadings, and add a little noise for each column; when we run the EFA we should recover loadings close to the ones we specified.
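
To make that recipe concrete, the loop below implements the following generating equation (notation is mine: $F_i$ is person $i$'s factor score, $\lambda_j$ is item $j$'s loading, and $\varepsilon_{ij}$ is the noise term with sd = 0.5):

$$\text{Item}_{ij} = \lambda_j \, F_i + \varepsilon_{ij}, \qquad F_i \sim N(1,\, 1), \qquad \varepsilon_{ij} \sim N(0,\, 0.5^2)$$

with $\lambda = (0.8,\ 0.75,\ 0.7,\ 0.85,\ 0.65,\ 0.7)$.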

However, we are interested in a categorical EFA, so the current numeric values for each item are inappropriate. We need to convert these numeric values into integers on a 5-point Likert scale. We can do this with the cut() function, setting the argument breaks = 5. This takes each continuous variable (each item separately) and bins its responses into one of 5 equal-width categories, which we can then use to ordinally recode the numeric responses into integers from 1 to 5.
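
Before applying this to the whole dataset, here is a tiny illustration of what cut() with breaks = 5 does on a toy vector I made up (not the item data): it splits the range of the input into 5 equal-width bins and, with labels = FALSE, returns the bin number for each value.

# Toy example: bin a small continuous vector into 5 equal-width categories
x <- c(-1.2, 0.3, 0.9, 1.8, 2.5, 3.4)
cut(x, breaks = 5, labels = FALSE, include.lowest = TRUE)
# Returns 1 2 3 4 5 5 for this vector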

Generating and Exploring the data

# Load required packages
library(lavaan)  # For ESEM and SEM
library(psych)   # For standalone EFA
library(MASS)    # For data generation

# Set seed for reproducibility
set.seed(123)

# Generate synthetic dataset with one-factor structure
# 6 items, 5-point Likert scale, 1000 respondents
n <- 1000
latent_factor <- rnorm(n, mean = 1, sd = 1)  # Single latent factor
loadings <- c(0.8, 0.75, 0.7, 0.85, 0.65, 0.7)  # Item loadings

# Generate continuous item responses
items_cont <- matrix(NA, nrow = n, ncol = 6)
for (i in 1:6) {
  items_cont[, i] <- loadings[i] * latent_factor + rnorm(n, 0, 0.5)
}
# Investigate the dataset created
class(items_cont)
dim(items_cont)
round(colMeans(items_cont), 2)  # column means sit near the specified loadings because the factor mean is 1

# Convert to 5-point Likert scale (1 to 5)
items <- matrix(NA, nrow = n, ncol = 6)
for (i in 1:6) {
  items[, i] <- as.numeric(cut(items_cont[, i], breaks = 5, labels = FALSE, include.lowest = TRUE))
}

# Create data frame
data_one_factor <- as.data.frame(items)
names(data_one_factor) <- paste0("Item", 1:6)

Below we can quickly compare the means of each item before and after the integer transformation, along with the first few rows of both versions of the data.
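
A minimal sketch of how that comparison can be reproduced from the console, assuming the items_cont matrix and data_one_factor data frame created above:

# Means of the Likert-coded items (compare with the continuous means printed earlier)
round(colMeans(data_one_factor), 2)

# First few rows of the continuous and Likert-coded versions
head(round(items_cont, 2))
head(data_one_factor)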

Figure: the data visualized as numeric values and as integers (the two versions tell essentially the same story)

Running parallel analysis

Something really useful we can do as a precursor to the EFA, because with real data we will not know how many factors underlie it, is to run a parallel analysis. We can do this with the fa.parallel() function from the psych package, which essentially tells us how many factors are present in our data. How does it work? We start by supplying the correlation matrix for our data; below we use the polychoric() function, also from the psych package, to create it. We then specify how many observations were used to create this matrix (n.obs = n, here 1000), the type of correlation (cor = "poly" for polychoric), the factoring method (fm = "wls", weighted least squares, which is well suited to categorical data), and how we want the eigenvalues calculated (fa = "fa", because we want common-factor extraction rather than principal components).

# Run a parallel analysis
fa_result <- fa.parallel(x = polychoric(data_one_factor)$rho,
                         n.obs = n, cor = "poly", fm = "wls", fa = "fa") # wls = weighted least squares

# View the output of eigenvalues
fa_result

The information below from the table and the scree plot is the same. There are as many produced eigenvalues as there are items in our dataset. The first eigenvalue from the real data (4.10) is much larger than the corresponding simulated one (0.45), while the next real eigenvalue (0.07) is much smaller than its simulated counterpart (0.29). Having exactly one real eigenvalue exceed its simulated counterpart indicates that there is one factor in the data. Therefore, in the upcoming EFA code we specify a single factor.
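
If you would rather read these values off the returned object than the plot, fa.parallel() stores the suggested number of factors, and recent versions of psych also store the observed and simulated eigenvalues (the component names below are assumptions based on the psych documentation and may differ across versions):

# Number of factors suggested by the parallel analysis
fa_result$nfact

# Observed vs. mean simulated eigenvalues from the factor solution
round(fa_result$fa.values, 2)
round(fa_result$fa.sim, 2)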

Figure: parallel analysis table output and scree plot

Fitting an EFA Model

This next section is all about fitting an EFA model to our data. We will do this using two approaches: the first uses lavaan syntax, which is important to fully understand so we can build more complicated models later on, and the second uses well-established functions from the psych package as a cross-check. We will start with lavaan, which is the more involved of the two.

lavaan requires us to create a character object that specifies the model of interest. Within this object we write efa("efa1")*f1, an EFA block telling the model we want to construct a single factor from our data; the =~ operator indicates which items the factor will be constructed from, and Item1 + Item2 + Item3 + Item4 + Item5 + Item6 names those items. Once the model specification is saved as an object, we can use the sem() function, where we specify the model (the character object), the data (our dataframe), the estimator (WLSMV, weighted least squares mean- and variance-adjusted, which is what we use when handling categorical data), that our items are ordinal (ordered = TRUE, or we can list the names of the ordinal columns), and the rotation we want (geomin, an oblique rotation, which we use when we are okay with multiple factors correlating with each other).

The fa() function, on the other hand, is much more relaxed. We specify the data (it goes into the 'r' argument), how many factors to estimate ('nfactors = 1'), the type of correlation ('cor = "poly"' for polychoric), the rotation ('oblimin'), and the factoring method ('fm = "wls"', which is again weighted least squares).

Lastly, once both models are fitted and saved into objects, we can view their output using summary() and print() respectively.

# ESEM: Specify EFA model with one factor using lavaan
esem_model <- '
  # EFA block for one factor
  efa("efa1")*f1 =~ Item1 + Item2 + Item3 + Item4 + Item5 + Item6
'

# Fit ESEM model with WLSMV for categorical data
esem_fit <- sem(
  model = esem_model,
  data = data_one_factor,
  estimator = "WLSMV",
  ordered = TRUE,  # Treat items as ordinal
  rotation = "geomin"  # Oblique rotation
)

# Standalone EFA with psych package
fa_fit <- fa(
  r = data_one_factor,
  nfactors = 1,
  cor = "poly",  # Polychoric correlations for categorical data
  rotate = "oblimin",  # Oblique rotation
  fm = "wls"  # Weighted least squares
)

# Show the output of the EFA from the lavaan and psych approach
summary(esem_fit, fit.measures = TRUE, standardized = TRUE)
print(fa_fit)

To keep this simple, the factor loadings produced by both approaches are essentially the same, which validates that our lavaan/sem() approach is doing what we believe it should.
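
To put the two sets of loadings side by side instead of scrolling through both outputs, something like the following sketch works (the lavaan loadings are pulled from the standardized solution via lavInspect(), the psych loadings from the fitted fa object):

# Standardized loadings from the lavaan/ESEM fit and the psych EFA, side by side
lambda_lavaan <- lavInspect(esem_fit, what = "std")$lambda
lambda_psych  <- unclass(fa_fit$loadings)
round(cbind(lavaan = as.numeric(lambda_lavaan),
            psych  = as.numeric(lambda_psych)), 2)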

Figure: factor loadings from the sem() output and the fa() output

Part 2: Generating EFA for two factors

Now that we have an understanding of what is going on in our data and the models specified above, let's add to it by generating a dataset that contains two factors. How will we do this? We can use the mvrnorm() function from the MASS package, which generates the values for two or more factors at once. We first specify the number of observations we want each factor to have; to stay consistent we use 1000. Next we specify the means, and this time we keep them at 0 to produce truly standardized vectors. We then specify the 'Sigma' argument, which represents the relationship between the two factors we are about to produce (I call them factors, but really they are just normally distributed vectors). The values we set off the diagonal of this matrix (they must be equal) represent the strength of the correlation between the two variables. If we wanted them independent of each other (defeating the purpose of this function) we could set it to 0; here we will have them weakly correlated at r = .3.

We will then specify a matrix (loadings) containing the factor loadings of each item on both factors. Since we expect this data to fit two factors nicely, some items will load highly on one factor and very low on the other, and vice versa for the remaining items. Notice that the loadings within a row do not need to sum to 1.00, but their sum should not exceed it; preferably keep it below 1.00.

Item creation (numeric): What happens next is a bit tricky to understand. The items ARE the outcome, and, as in every other instance where we simulate data, the outcome is an equation representing the relationships across predictors. Here the predictors are the factor scores: each item is built by multiplying each factor's values by that item's loading on it, summing across the two factors, and adding noise (see the equation below). This process is repeated 8 times, which generates a dataset of 8 columns with 1000 observations; when we later load these items onto two factors, each item should load high on one factor and low on the other.
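
In equation form (same notation as the one-factor sketch earlier, with a second subscript on $\lambda$ indexing the factor), each of the 8 items is generated as

$$\text{Item}_{ij} = \lambda_{j1}\, f_{1i} + \lambda_{j2}\, f_{2i} + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0,\, 0.5^2)$$

where $(f_{1i}, f_{2i})$ is drawn from a bivariate standard normal with correlation .3.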

Item creation (categorical): Now that we have numeric values for each item, we again use the cut() function with 'breaks = 5' to bin the numeric values into categories labeled 1 to 5. These new values represent our data as ordinal, which is what we need to run a categorical EFA.

Below is the code to generate the data; we won't explore it in detail because, from the example above, you should already be able to picture what it looks like.

Generating the Data


# Generate synthetic dataset with two-factor structure
# 8 items, 5-point Likert scale, 1000 respondents
n <- 1000
# Two correlated latent factors (correlation = 0.3)
latent_factors <- mvrnorm(n = n, mu = c(0, 0), Sigma = matrix(c(1, 0.3, 0.3, 1), 2, 2))
f1 <- latent_factors[, 1]
f2 <- latent_factors[, 2]


# Loadings: Items 1-4 on f1, Items 5-8 on f2, with small cross-loadings
loadings <- matrix(c(
  0.80, 0.10,  # Item1: high on f1, low on f2
  0.75, 0.15,  # Item2
  0.70, 0.10,  # Item3
  0.85, 0.05,  # Item4
  0.10, 0.80,  # Item5: low on f1, high on f2
  0.15, 0.75,  # Item6
  0.05, 0.70,  # Item7
  0.10, 0.85   # Item8
), ncol = 2, byrow = TRUE)

# Generate continuous item responses
items_cont <- matrix(NA, nrow = n, ncol = 8)
for (i in 1:8) {
  items_cont[, i] <- loadings[i, 1] * f1 + loadings[i, 2] * f2 + rnorm(n, 0, 0.5)
}


# Convert to 5-point Likert scale (1 to 5)
items <- matrix(NA, nrow = n, ncol = 8)
for (i in 1:8) {
  items[, i] <- as.numeric(cut(items_cont[, i], breaks = 5, labels = FALSE, include.lowest = TRUE))
}

# Create data frame
data_two_factor <- as.data.frame(items)
names(data_two_factor) <- paste0("Item", 1:8)
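
A quick sanity check on the generated data, assuming the objects from the block above are still in the workspace: the empirical correlation between the two simulated factor vectors should land near the .3 we put into Sigma, and each Likert-coded item should have responses spread across all five categories.

# Empirical correlation between the two simulated factors (should be near 0.3)
round(cor(f1, f2), 2)

# Response distribution for one of the Likert-coded items
table(data_two_factor$Item1)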

Running parallel analysis

Now let's run a parallel analysis using the fa.parallel() function. We input the polychoric correlation matrix of our data and then specify the number of observations behind it (n.obs = n, here 1000), the correlation type (cor = "poly" for polychoric), the factoring method (fm = "wls", weighted least squares, which is well suited to categorical data), and how we want the eigenvalues calculated (fa = "fa" for common-factor extraction rather than principal components).

# Run a parallel analysis
fa_result2 <- fa.parallel(x = polychoric(data_two_factor)$rho,
                         n.obs = n, cor = "poly", fm = "wls", fa = "fa") # wls = weighted least squares

# View the output of eigenvalues
fa_result2

We can see from the information below that our dataset of 8 items is explained very well by two factors.

Figure: parallel analysis table output and scree plot

Why are there two factors?

If we look closely at the polychoric correlation matrix of our data, we can see something very interesting. Here each correlation is rounded to one decimal place to make the pattern easy to see:

# Viewing the polychoric correlation matrix
round(polychoric(data_two_factor)$rho,1)

Below, the items are arranged to make viewing easier. Items 1, 2, 3, and 4 were created to load highly on factor 1, while Items 5, 6, 7, and 8 were created to load highly on factor 2. In addition, every item was created from the sum of a strong loading on one factor and a weak loading on the other, and the two factors were themselves created to correlate. The takeaway is that no item was created from one factor alone, so Items 1-4 also relate to Items 5-8, just not as strongly as they relate to the items sharing their factor. We can see this very clearly in the matrix below: the correlations among Items 1-4 are all about 0.7, the correlations among Items 5-8 are between 0.7 and 0.8, and the correlations between items that load strongly on different factors are between 0.3 and 0.4. This is clear evidence that there are two factors and that all items are correlated with each other to a degree, with that degree driven by which factor each item belongs to.
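
As a back-of-the-envelope check on those numbers (using only the loadings, the factor correlation of .3, and the error variance $0.5^2 = 0.25$ we simulated with, and leaning on the fact that polychoric correlations estimate the correlations of the underlying continuous items), the model-implied correlation between two items is

$$\operatorname{corr}(\text{Item}_i, \text{Item}_j) = \frac{\lambda_i^{\top} \Phi\, \lambda_j}{\sqrt{\left(\lambda_i^{\top} \Phi\, \lambda_i + 0.25\right)\left(\lambda_j^{\top} \Phi\, \lambda_j + 0.25\right)}}, \qquad \Phi = \begin{pmatrix} 1 & 0.3 \\ 0.3 & 1 \end{pmatrix}.$$

Plugging in Item1 (0.80, 0.10) and Item2 (0.75, 0.15) gives roughly .73 for a within-factor pair, while Item1 and Item5 (0.10, 0.80) give roughly .37 for a cross-factor pair, which is exactly the 0.7-ish versus 0.3-0.4 pattern in the matrix below.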

Figure: polychoric correlations of our 8 items

Fitting an EFA Model

Now we can jump into the good part. We start by specifying the model in lavaan syntax, where this time we look for two factors instead of one. To do this, we write an EFA block that tests two factors; it is identical to what we did above except that we now include efa("efa1")*f2 and add it to the original efa("efa1")*f1. The names f1 and f2 are arbitrary labels; we could replace them with anything. Afterwards we use the =~ operator, indicating that these two factors will be measured by the items, and then Item1 + Item2 + Item3 + Item4 + Item5 + Item6 + Item7 + Item8 lists the items we want to use. The sem() call does not change; it is essentially the same as in the previous example, for the same reasons. For fa(), however, we do specify that we want two factors via the 'nfactors = 2' argument.

# ESEM: Specify EFA model with two factors using lavaan
esem_model <- '
  # EFA block for two factors
  efa("efa1")*f1 +
  efa("efa1")*f2 =~ Item1 + Item2 + Item3 + Item4 + Item5 + Item6 + Item7 + Item8
'

# Fit ESEM model with WLSMV for categorical data
esem_fit <- sem(
  model = esem_model,
  data = data_two_factor,
  estimator = "WLSMV",
  ordered = TRUE,  # Treat items as ordinal
  rotation = "geomin"  # Oblique rotation
)

# Standalone EFA with psych package
fa_fit <- fa(
  r = data_two_factor,
  nfactors = 2,
  cor = "poly",  # Polychoric correlations for categorical data
  rotate = "oblimin",  # Oblique rotation
  fm = "wls"  # Weighted least squares
)

# Summarize EFA results
summary(esem_fit, fit.measures = TRUE, standardized = TRUE)

# Summarize standalone EFA results
print(fa_fit)

Again, to keep this simple, the factor loadings produced by both approaches are essentially the same, which validates that our lavaan/sem() approach is doing what we believe it should. We can also see how our generated dataset (the 8 ordinal items) provides clean evidence for two factors, where some items represent one factor, other items represent the other, and the two factors are correlated with each other.

Figure: factor loadings from the sem() output and the fa() output
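
One more thing worth checking is how close both approaches come to the factor correlation of .3 that we built into the data. A sketch, assuming the fitted objects from the block above:

# Latent factor correlation from the lavaan/ESEM fit
lavInspect(esem_fit, what = "cor.lv")

# Factor correlation matrix from the psych EFA (available because oblimin is an oblique rotation)
round(fa_fit$Phi, 2)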

Part 3: Generating EFA for three factors

We will not spend too much time describing the code below because it overlaps heavily with the previous section. This time we generate three factors, again using the mvrnorm() function. We give each factor a mean of 0 and set every off-diagonal value of 'Sigma' to .3, which produces three normally distributed variables that all correlate with each other at roughly .3 (the sample correlations will not be exactly .3, but they get closer as the sample size grows).

The factor loadings are now specified in a matrix with three columns, since we expect three factors, and 12 rows, since we will create 12 items. As before, the sum of the loadings in each row never exceeds 1. We then repeat the earlier steps: the items are the OUTCOME, so each is generated by summing the factors multiplied by their respective loadings in that row of the loading matrix; this is done 12 times to generate 12 items with 1000 observations. Afterwards the cut() function with 'breaks = 5' converts the numeric values into the integers 1 through 5, which makes the data ready for categorical EFA.

Generating the Data

# Set seed
set.seed(123)

# Generate synthetic dataset with three-factor structure
# 12 items, 5-point Likert scale, 1000 respondents
n <- 1000
Sigma <- matrix(c(1, 0.3, 0.3, 0.3, 1, 0.3, 0.3, 0.3, 1), 3, 3)
latent_factors <- mvrnorm(n, mu = c(0, 0, 0), Sigma = Sigma)

f1 <- latent_factors[, 1]
f2 <- latent_factors[, 2]
f3 <- latent_factors[, 3]

# Loadings: Items 1-4 on f1, Items 5-8 on f2, Items 9-12 on f3, with small cross-loadings
loadings <- matrix(c(
  0.80, 0.10, 0.05,  # Item1: high on f1, low on f2, f3
  0.75, 0.15, 0.10,  # Item2
  0.70, 0.10, 0.15,  # Item3
  0.85, 0.05, 0.10,  # Item4
  0.10, 0.80, 0.05,  # Item5: low on f1, high on f2, low on f3
  0.15, 0.75, 0.10,  # Item6
  0.05, 0.70, 0.15,  # Item7
  0.10, 0.85, 0.05,  # Item8
  0.05, 0.10, 0.80,  # Item9: low on f1, f2, high on f3
  0.10, 0.15, 0.75,  # Item10 
  0.15, 0.05, 0.70,  # Item11
  0.10, 0.10, 0.85   # Item12
), ncol = 3, byrow = TRUE)

# Generate continuous item responses
items_cont <- matrix(NA, nrow = n, ncol = 12)
for (i in 1:12) {
  items_cont[, i] <- loadings[i, 1] * f1 + loadings[i, 2] * f2 + loadings[i, 3] * f3 + rnorm(n, 0, 0.5)
}

# Convert to 5-point Likert scale (1 to 5)
items <- matrix(NA, nrow = n, ncol = 12)
for (i in 1:12) {
  items[, i] <- as.numeric(cut(items_cont[, i], breaks = 5, labels = FALSE, include.lowest = TRUE))
}

# Create data frame
data_three_factor <- as.data.frame(items)
names(data_three_factor) <- paste0("Item", 1:12)

Running parallel analysis

Now let's run a parallel analysis using the fa.parallel() function. We input the polychoric correlation matrix of our data and then specify the number of observations behind it (n.obs = n, here 1000), the correlation type (cor = "poly" for polychoric), the factoring method (fm = "wls", weighted least squares, which is well suited to categorical data), and how we want the eigenvalues calculated (fa = "fa" for common-factor extraction rather than principal components).

# Run a parallel analysis
fa_result3 <- fa.parallel(x = polychoric(data_three_factor)$rho,
                          n.obs = n, cor = "poly", fm = "wls", fa = "fa") # wls = weighted least squares

# View the output of eigenvalues
fa_result3

We can see from the information below that our dataset of 12 items is explained very well by three factors.

Figure: parallel analysis table output and scree plot

Why are there three factors?

If we look closely at the polychoric correlation matrix of our data, we can see something very interesting. Here each correlation is rounded to one decimal place to make the pattern easy to see:

# Viewing the polychoric correlation matrix
round(polychoric(data_three_factor)$rho,1)

Below, the items are arranged to make viewing easier and were edited in PowerPoint to add color aids. The items are ordered by the factor that makes up most of their variance (again, our items are outcomes generated from the factors we created). Items belonging to the same factor (items within a given color) are all highly correlated with each other (0.7-0.8), while items belonging to different factors are only moderately correlated (0.4-0.5). This tells us that the data looks like it has three factors and that these factors are correlated with each other, which is what we want for a categorical EFA.

Figure: polychoric correlations of our 12 items

Fitting an EFA Model

Again, to fit the EFA model in lavaan, we introduce another efa("efa1")*f3 within the EFA block and add it to the previous ones. We use the =~ operator followed by the items we want the three factors constructed from: Item1 + Item2 + Item3 + Item4 + Item5 + Item6 + Item7 + Item8 + Item9 + Item10 + Item11 + Item12. Once the model is specified we run it through sem(). For the fa() approach, we just change 'nfactors = 3'.

# ESEM: Specify EFA model with three factors using lavaan
esem_model <- '
  # EFA block for three factors
  efa("efa1")*f1 +
  efa("efa1")*f2 +
  efa("efa1")*f3 =~ Item1 + Item2 + Item3 + Item4 + Item5 + Item6 + Item7 + Item8 + Item9 + Item10 + Item11 + Item12
'

# Fit ESEM model with WLSMV for categorical data
esem_fit <- sem(
  model = esem_model,
  data = data_three_factor,
  estimator = "WLSMV",
  ordered = TRUE,  # Treat items as ordinal
  rotation = "geomin"  # Oblique rotation
)

# Standalone EFA with psych package
fa_fit <- fa(
  r = data_three_factor,
  nfactors = 3,
  cor = "poly",  # Polychoric correlations for categorical data
  rotate = "oblimin",  # Oblique rotation
  fm = "wls"  # Weighted least squares
)

# Summarize ESEM results
summary(esem_fit, fit.measures = TRUE, standardized = TRUE)

# Summarize standalone EFA results
print(fa_fit)

Again, to keep this simple, the factor loadings produced by both approaches are essentially the same, which validates that our lavaan/sem() approach is doing what we believe it should. We can also see how our generated dataset (the 12 ordinal items) provides clean evidence for three factors.

Figure: factor loadings from the sem() output and the fa() output
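
Since fit.measures = TRUE was requested in summary(), you can also pull a few global fit indices for the three-factor solution straight from the fitted lavaan object (a sketch; with WLSMV, lavaan also reports scaled/robust versions of these alongside the standard ones):

# A few global fit measures from the three-factor ESEM fit
round(fitMeasures(esem_fit, c("cfi", "tli", "rmsea", "srmr")), 3)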