Introduction to Path Analysis Using Lavaan
Overview
This wiki page introduces the topic of path analysis through lavaan. We will cover the topic in detail, beginning with simple regression, moving to multiple regression, and then using path analysis to build more complex models, including mediation.
Part 1: Simple Regression
This section illustrates that path analysis is essentially an extension of regression: we can obtain the same estimates, and most of the same information, that we would get by just using lm(). First we explain the dataset, which we will use for this example and more up ahead. The dataset contains three continuous variables (x, y, z). They have been programmed through the mvrnorm() function from the MASS package to have a set correlation with each other. To keep the sample estimates true to what we programmed, a sample size of 5000 was chosen; sample sizes under 1000 risk deviating noticeably from what we intended. Here is the code to generate the data.
Generating the data
# Load in packages
library(tidyverse)
library(MASS)
library(lavaan)
library(semPlot)
library(semptools)
# Set seed
set.seed(123)
# Set population parameters
mu <- c(50, 5, 100) # Realistic means for x, y, z
Sigma <- matrix(c(9, 3.6, 2.4,
3.6, 2.25, 0.75,
2.4, 0.75, 4 ), nrow = 3)
# Set sample size
n <- 5000
# Generate observations
data <- as.data.frame(mvrnorm(n, mu = mu, Sigma = Sigma))
# Name the columns
colnames(data) <- c("x", "y", "z")
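As a side note, a covariance matrix like Sigma can be built from standard deviations and a target correlation matrix. A minimal sketch (the sds and correlations below are simply read off the Sigma used above):

```r
# Build Sigma from SDs and target correlations
sds <- c(3, 1.5, 2)                           # SDs of x, y, z (sqrt of 9, 2.25, 4)
R   <- matrix(c(1.00, 0.80, 0.40,
                0.80, 1.00, 0.25,
                0.40, 0.25, 1.00), nrow = 3)  # target correlation matrix
diag(sds) %*% R %*% diag(sds)                 # reproduces the Sigma above
```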
Running Simple Regression Through Lavaan
Here we specify and run three models. Using lavaan syntax, we write each model out in a character string and then run it through the sem() function. We can then obtain summaries of these models to see the estimated parameters, plus other information about model fit. Since these models are just-identified, there is no reason to investigate model fit indices.
# Specify the models
mod1_1 <- '
y ~ x
'
mod1_2 <- '
y ~ z
'
mod1_3 <- '
z ~ x
'
# Run the models
fit1_1 <- sem(mod1_1, data)
fit1_2 <- sem(mod1_2, data)
fit1_3 <- sem(mod1_3, data)
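Before plotting, it is worth glancing at one of the fits. A minimal example (output omitted here):

```r
# Quick look at one fit; standardized = TRUE prints the standardized
# solution alongside the unstandardized estimates
summary(fit1_1, standardized = TRUE, rsquare = TRUE)
```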
Visualizing the Relationships
We can easily visualize relationships by generating path diagrams. However, this does require a fair amount of code to make the final product look visually pleasing. That is what the next section of code presents.
# View the standardized solutions on a path diagram
sr_list <- list(fit1.1 = fit1_1, fit1.2 = fit1_2, fit1.3 = fit1_3)
# Visualize the path diagrams
m_1 <- matrix(c(NA, NA, NA,
"x", NA, "y",
NA, NA, NA), byrow = TRUE, 3, 3)
m_2 <- matrix(c(NA, NA, NA,
"z", NA, "y",
NA, NA, NA), byrow = TRUE, 3, 3)
m_3 <- matrix(c(NA, NA, NA,
"x", NA, "z",
NA, NA, NA), byrow = TRUE, 3, 3)
# Save the matrices into a list
m_list <- list(m_1, m_2, m_3)
# Print out the path diagrams (standardized)
for(ii in 1:length(sr_list)) {
# Set plot margins explicitly
par(mar = c(10, 5, 6, 5)) # Increased bottom margin for title space (D L U R)
plot <- semPaths(sr_list[[ii]],
whatLabels = "std", # standardized estimates
edge.label.cex = 1.6, # Makes loadings and errors larger
layout = m_list[[ii]], # Uses specified matrix layout
sizeMan = 10, # Size of boxes
border.width = 3, # Thickness of box borders
label.cex = 2, # Size of letters inside boxes
edge.color = "black", # Makes all lines black
edge.width = 3, # Make the lines thicker
curve = 2, # Add curvature to double-headed arrows
mar = c(10, 6, 6, 7.5) # Ensure semPaths respects margins (D L U R)
)
# Add Asterisks
plot2 <- mark_sig(plot, sr_list[[ii]])
plot(plot2)
}
First let's review the descriptives of our data. We see that there are three variables (x, y, z) and that they each have different means and standard deviations. In the covariance matrix, we can see that the variables with larger standard deviations produce larger variances (the values on the diagonal), and we can see how the variables covary with each other. Nothing is zero, so they are all related to each other to some extent. In the correlation matrix we can see exactly the strength of the correlations among the variables: the highest correlation is between x & y and the weakest correlation is between z & y.
Descriptives | Covariance Matrix | Correlation Matrix |
---|---|---|
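The tables above can be reproduced with a few base-R calls (a sketch; the exact values depend on the simulated draw):

```r
# Descriptives, covariance matrix, and correlation matrix for x, y, z
summary(data)        # means and quartiles
apply(data, 2, sd)   # standard deviations
round(cov(data), 2)  # covariances (variances on the diagonal)
round(cor(data), 2)  # correlations
```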
In a simple regression, the unstandardized beta coefficient is the covariance between the two variables divided by the variance of the predictor. For example, in model 1 the covariance between x and y is 3.56 and the variance of x is 8.93, so 3.56/8.93 = 0.40. The residual variance for the same example is 0.81, which is considerably smaller than y's original variance of 2.23. Therefore, if we divide the remaining residual variance (0.81) by the original variance (2.23) we get 0.81/2.23 = 36%, which subtracted from 1 produces $R^2$.
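These two hand calculations are easy to verify in R (values match up to rounding):

```r
# Slope = covariance / predictor variance, for model 1 (y ~ x)
cov(data$x, data$y) / var(data$x)   # ~ 3.56 / 8.93 = 0.40
coef(lm(y ~ x, data))["x"]          # same slope from lm()

# R^2 as 1 minus the ratio of residual variance to total variance
res <- resid(lm(y ~ x, data))       # residuals of the simple regression
1 - var(res) / var(data$y)          # ~ 1 - 0.81/2.23 = 0.64
```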
Model 1 (Unstandardized) | Model 2 (Unstandardized) | Model 3 (Unstandardized) |
---|---|---|
When examining the standardized beta coefficients, we see that they are literally the correlations! This makes sense, but it is pretty cool to see it visually. It happens because we are viewing unadjusted slopes; these are expected to change when running multiple regression, since the estimate for each predictor will then control for the other predictors' effects on the outcome. (A quick check of this identity appears below the table.)
Model 1 (Standardized) | Model 2 (Standardized) | Model 3 (Standardized) |
---|---|---|
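Checking the slope-equals-correlation identity takes two lines:

```r
# The standardized slope from fit1_1 should (essentially) equal the raw
# correlation between x and y
standardizedSolution(fit1_1)  # est.std for the y ~ x row
cor(data$x, data$y)           # same value
```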
Part 2: Multiple Regression
In the previous example we viewed the relationship between a single predictor and an outcome variable: for unstandardized beta coefficients, the slope relates to the covariance and variances of the variables, and for standardized beta coefficients, the slopes represent correlations. Now we will add some complexity by running multiple regression, meaning that at least two predictors will be present in the model. However, something important was learned when running multiple regression on the previous dataset: the correlations between the variables can be used to anticipate how well the model will run.

For example, the correlation between x and y is pretty strong (r = .80), the correlation between z and y is weak (r = .20), and there is a moderate correlation between x and z (r = .40). Where am I going with this? The whole point of multiple regression is to produce adjusted slopes that represent the relationship between a predictor and the outcome after controlling for the effects of the other predictor. In other words, multiple regression generally produces estimates that are weaker (though not always) than their simple-regression counterparts, because some of that predictive power is already being captured by the other predictor. Therefore, an issue arose when the regression y ~ x + z was fitted: the $R^2$ of that model remained essentially the same as for y ~ x, and the adjusted slope of z became negative. This is known as a suppression effect (see the quick check after this paragraph). To prevent it, we will now simulate new data where the correlations between x and y and between z and y are moderate while the correlation between x and z is lower, creating models where the suppression effect (when y is the outcome) is unlikely to manifest.
Generating the data
# Load in packages
library(tidyverse)
library(MASS)
library(lavaan)
library(semPlot)
library(semptools)
# Set seed
set.seed(123)
# Set population parameters
mu <- c(50, 5, 100) # Realistic means for x, y, z
Sigma <- matrix(c(
9.00, 2.70, 1.20,
2.70, 2.25, 1.65,
1.20, 1.65, 4.00), nrow = 3)
# Set sample size
n <- 5000
# Generate observations
data <- as.data.frame(mvrnorm(n, mu = mu, Sigma = Sigma))
# Name the columns
colnames(data) <- c("x", "y", "z")
Running Multiple Regression Through Lavaan
Now we are going to use the same notation as before to specify the models. This time we add the '+' sign to indicate the inclusion of another predictor in the model. We also write out the covariance between the two predictors (the '~~' operator). Keep in mind that lavaan estimates this by default anyway; however, by manually specifying the relationship, lavaan will also run a hypothesis test on it to indicate whether the correlation is significant.
mod2_1 <- '
y ~ x + z
x ~~ z
'
mod2_2 <- '
z ~ x + y
x ~~ y
'
mod2_3 <- '
x ~ z + y
z ~~ y
'
# Run the models
fit2_1 <- sem(mod2_1, data)
fit2_2 <- sem(mod2_2, data)
fit2_3 <- sem(mod2_3, data)
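Since much of the discussion below revolves around $R^2$, it may help to know (an aside, not part of the original walkthrough) that lavaan can report it for each outcome directly:

```r
# R^2 for the outcome in each model, straight from the fitted objects
lavInspect(fit2_1, "rsquare")
lavInspect(fit2_2, "rsquare")
lavInspect(fit2_3, "rsquare")
```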
Visualizing the Relationships
Now that we have specified and fit our models, we can go ahead and generate path diagrams. Again, it is much easier to understand the relationships among variables through path diagrams than through a table, so we will continue to present them this way.
# View the standardized solutions on a path diagram
sr2_list <- list(fit2.1 = fit2_1, fit2.2 = fit2_2, fit2.3 = fit2_3)
# Visualize the path diagrams
m_1 <- matrix(c("z", NA, NA,
NA, NA, "y",
"x", NA, NA), byrow = TRUE, 3, 3)
m_2 <- matrix(c("y", NA, NA,
NA, NA, "z",
"x", NA, NA), byrow = TRUE, 3, 3)
m_3 <- matrix(c("z", NA, NA,
NA, NA, "x",
"y", NA, NA), byrow = TRUE, 3, 3)
# Save the matrices into a list
m_list <- list(m_1, m_2, m_3)
# Print out the path diagrams (standardized)
for(ii in 1:length(sr2_list)) {
# Set plot margins explicitly
par(mar = c(10, 5, 6, 5)) # Increased bottom margin for title space (D L U R)
plot <- semPaths(sr2_list[[ii]],
whatLabels = "est", # unstandardized estimates
edge.label.cex = 1.75, # Makes loadings and errors larger
layout = m_list[[ii]], # Uses specified matrix layout
sizeMan = 10, # Size of boxes
border.width = 3, # Thickness of box borders
label.cex = 1.4, # Size of letters inside boxes
edge.color = "black", # Makes all lines black
edge.width = 3, # Make the lines thicker
curve = 2, # Add curvature to double-headed arrows
mar = c(10, 6, 6, 7) # Ensure semPaths respects margins (D L U R)
)
# Add Asterisks
plot2 <- mark_sig(plot, sr2_list[[ii]])
plot(plot2)
}
Here are the new descriptives of the data, which we can explain in more detail. The most important change is in the simulated correlation matrix. In this version, we now have two variables that are nicely correlated with the outcome y but only moderately correlated with each other. This will produce estimates that represent the relationship between each predictor and the outcome after controlling for the effects of the other predictor, and it will also guarantee a larger $R^2$. However, for fun and for educational purposes, each variable will take a turn as the outcome; variables x and z will also each be modeled by the other two predictors. From those results we should expect problems, and this goes back to the correlation table.
Starting with z ~ x + y, we can see that the correlation between x and z is weak (r = .20) while the correlation between y and z is nice (r = .55). Additionally, x and y are nicely correlated with each other (r = .59), so what will happen? A suppression effect will occur, because the variance in z that x is trying to explain will be captured by y, since 1) y correlates more highly with z than x does and 2) both predictors (x and y) are correlated.
Now moving to x ~ y + z, we see that the correlation between x and y is nice (r = .59) and the correlation between x and z is weak (r = .20). Additionally (and not surprisingly at this point), there is a nice correlation among the predictors (r = .55). Thus, we should expect another suppression effect. Why? Because whatever variance in x z was trying to explain, y can already explain, since 1) y correlates more strongly with x than z does and 2) y and z are correlated.
Descriptives | Covariance Matrix | Correlation Matrix |
---|---|---|
There is not too much to mention here, other than confirmation that the suppression effect has taken place.
Model 1 (Unstandardized) | Model 2 (Unstandardized) | Model 3 (Unstandardized) |
---|---|---|
Here is where we get into the good stuff. There is a new equation we now need to know: the formula for the standardized residual error from a multiple regression with two predictors:

$$\mathrm{Standardized\ Residual\ Error} = 1 - (\beta_1^2 + \beta_2^2 + 2\beta_1\beta_2 r_{12})$$
This formula is awesome because, in the path diagrams below, the correlation between the two predictors is what stays fixed in the model, which essentially forces the beta estimates to adjust in order to account for the variance explained in the outcome. We can use this formula to arrive at the same residual variances shown in all of the path diagrams below (see the check after this paragraph). The main takeaway is that adding predictors to a model may improve model fit, and here we have cases where this does and does not work, along with the reasoning for each.
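A minimal sketch of that check for y ~ x + z (fit2_1), using lavaan's standardizedSolution():

```r
# Verify the standardized residual variance formula for fit2_1
std <- standardizedSolution(fit2_1)
b1  <- std$est.std[std$lhs == "y" & std$rhs == "x"]  # std slope of x
b2  <- std$est.std[std$lhs == "y" & std$rhs == "z"]  # std slope of z
r12 <- std$est.std[std$lhs == "x" & std$rhs == "z"]  # predictor correlation
1 - (b1^2 + b2^2 + 2 * b1 * b2 * r12)  # should equal the standardized
                                       # residual variance of y
```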
Model 1 (Standardized) | Model 2 (Standardized) | Model 3 (Standardized) |
---|---|---|
Part 3: Non-Recursive (Reciprocal) Models
Here is where things start to get interesting. We will use the same data that was generated for multiple regression.
Running Non-Recursive Models Through Lavaan
Part 4: Mediation Analysis
A mediation analysis is a statistical technique that tests whether the effect of predictor 'x' on outcome 'y' can be partially explained through a mediator variable 'm'. The idea is that if mediation is taking place, then the product of the unadjusted effect of 'x' on 'm' (path a) and the adjusted effect of 'm' on 'y' (path b) should be significant. Additionally, we can calculate a mediation proportion by dividing that product by the unadjusted (total) effect of 'x' on 'y' (path c), which can also be derived by adding its adjusted counterpart (the direct effect, aka c') to the product of path a and path b.
In terms of regression, we can write it out like this:

$$M = aX + \epsilon_1$$

$$Y = bM + c'X + \epsilon_2$$

$$Y = cX + \epsilon_3$$
Generating the data
# load in packages
library(tidyverse)
library(ggplot2)
library(mediation)
library(lavaan)
library(semPlot)
library(semptools)
# Generate dummy data
set.seed(123) # For reproducibility
n <- 200 # Sample size
# Simulate data
x <- rnorm(n, mean = 10, sd = 2) # Study hours
m <- 0.6 * x + rnorm(n, mean = 0, sd = 1.5) # Knowledge, influenced by X
y <- 0.4 * m + 0.3 * x + rnorm(n, mean = 0, sd = 1.2) # Exam score, influenced by M and X
# Combine into a data frame
dat <- data.frame(x, m, y)
Descriptives |
---|
Approach 1: Running 3 regressions
Technically, regression models 2 and 3 alone are sufficient.
# Run a mediation analysis using 3 regressions
mod1 <- lm(y ~ x, dat)
mod2 <- lm(m ~ x, dat)
mod3 <- lm(y ~ x + m, dat)
# Retrieve the slopes of interest
(c <- round(coef(mod1),3)["x"])
(a <- round(coef(mod2),3)["x"])
(b <- round(coef(mod3),3)["m"])
(`c'` <- round(coef(mod3),3)["x"])
(indirect_effect <- round(a * b,2))
(total_effect <- round(a * b + `c'`,3))
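As discussed later, this approach yields no p-value for the indirect effect a*b on its own. One classical workaround, not part of the original walkthrough, is a Sobel z-test built from the same two models (a sketch; the bootstrap approaches below are generally preferred):

```r
# Sobel z-test for the indirect effect, using mod2 (m ~ x) and mod3 (y ~ x + m)
a_est <- coef(mod2)["x"]
a_se  <- summary(mod2)$coefficients["x", "Std. Error"]
b_est <- coef(mod3)["m"]
b_se  <- summary(mod3)$coefficients["m", "Std. Error"]
sobel_z <- (a_est * b_est) / sqrt(b_est^2 * a_se^2 + a_est^2 * b_se^2)
2 * pnorm(abs(sobel_z), lower.tail = FALSE)  # two-tailed p-value
```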
Approach 2: Running a mediation analysis through mediate()
Requires mod2 and mod3 from above to work. The pro here is that it uses bootstrapping; the con is that the output labels are unusual and can be hard to interpret. The mediate() function becomes available through the 'mediation' package. Additionally, you need to specify which variable is the predictor of interest (treatment) and which one is the mediator.
# Run a mediation analysis using mediation() function
med_mod <- mediate(mod2, mod3, treat = "x", mediator = "m", boot = TRUE, sims = 1000)
summary(med_mod)
Approach 3: Running a mediation analysis through lavaan
This code is not intuitive, but it is very neat! Here we specify the same two regressions required for Approach 1 and Approach 2; they appear under the '# Direct effects' line. This tells lavaan to generate the estimates (slopes) just as a typical regression and multiple regression would. Next, the := operator is used to tell lavaan to calculate something after the regressions are completed. In this case, we tell it to calculate the indirect effect, which is the product of path a and path b (we labeled those estimates in the '# Direct effects' portion). Then we use := again to calculate the total effect, which is the indirect effect plus the direct effect of the main predictor (path c') on the outcome. Lastly, we calculate the mediation proportion by using := to divide the indirect effect (path a * path b) by the total effect (path c).
med_lav <- '
# Direct effects
m ~ a*x # path a
y ~ b*m + c*x # path b and path c' (direct effect, labeled c here)
# Indirect effects
ind := a*b
# Total effect
total := c + (a*b)
# Mediation Proportion
Prop.med := ind/total
'
# Fit the model
fit <- sem(med_lav, data = dat)
# Summary of results
summary(fit, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)
parameterEstimates(fit)
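As an aside (an option I am adding, not shown in the original wiki), lavaan can also bootstrap the standard errors, making the inference for ind and total analogous to mediate()'s boot = TRUE:

```r
# Refit with bootstrapped standard errors (1000 draws)
fit_boot <- sem(med_lav, data = dat, se = "bootstrap", bootstrap = 1000)
# Percentile bootstrap confidence intervals for all parameters,
# including the := quantities
parameterEstimates(fit_boot, boot.ci.type = "perc")
```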
Mediation Output from all Approaches
We see below that all approaches lead to the same estimates; however, some approaches show more information than others. For example, the main limitation of the three- (or really two-) regression approach is that the indirect effect of 0.208 does not have a p-value associated with it, so we cannot know whether it is significant (the Sobel sketch above is one workaround). This is not an issue for the mediate() and lavaan approaches. When it comes to mediate(), the output only shows the indirect effect (ACME), the direct effect (ADE), the total effect (Total Effect), and the mediation proportion (Prop. Mediated); ignoring the awkward labels, it does not show us path a or path b separately, just their product. With lavaan, we get all the information we want: it is nicely labeled, all estimated parameters have p-values associated with them, and we can also calculate the proportion of variance explained by the model, which is a nice bonus. Thus, the lavaan approach, while a bit awkward to specify, produces the best model output.
Separate Regressions | mediation() | lavaan |
---|---|---|
Visualizing the mediation
Additionally, by using lavaan functions, we can take the models produced by sem() and run them through semPaths() from the 'semPlot' package and mark_sig() from the 'semptools' package to create nice visuals! Below we showcase the mediation analysis.
# Plot our the path diagram
m <- matrix(c(NA, "m", NA,
NA, NA, NA,
"x", NA, "y"), byrow = TRUE, 3, 3)
med_plot <- semPaths(fit,
whatLabels = "est", # standardized + pretty
edge.label.cex = 1.75, # Makes loadings and errors larger
layout = m, # Uses specified matrix layout
sizeMan = 10, # Size of boxes
border.width = 3, # Thickness of box borders
label.cex = 1.4, # Size of letters inside boxes
edge.color = "black", # Makes all lines black
edge.width = 3, # Make the lines thicker
mar = c(5,5,6,5)) # D L U R
# Add asterisks to significant estimates
med_plot2 <- mark_sig(med_plot, fit)
plot(med_plot2)
Visualizing Mediation Analysis |
---|