How to run a multilevel meta-analysis in R
The goal of this wiki page is to detail how to run a multilevel meta-analysis in R. When extracting data from multiple studies, there is a considerable chance that at least one of those studies reported nested data. What is nested data? These are observations that are not fully independent. This occurs when a study contains multiple outcomes from the same participants (e.g., scores on two or more cognitive tasks) or multiple outcomes from different participants (two or more distinct samples, each with their own effect sizes). In these cases, it would be completely inappropriate to ignore this aspect of the data and run a regular meta-analysis.
Why is it bad to ignore this? A regular meta-analysis assumes that all observations are independent of each other, and it uses that assumption to calculate standard errors for its estimates, such as the pooled effect size. When observations are not independent, for example when multiple effect sizes come from the same subjects, ignoring this dependency produces standard errors that are smaller than they should be, which inflates the risk of a type I error (rejecting the null hypothesis when the null is actually true). Correctly specifying a multilevel model, which accounts for the dependencies among effect sizes from studies with nested data, will produce accurate estimates and correct standard errors!
There is a second reason to run a multilevel meta-analysis: it partitions the variance of the effect sizes into a between-study component and a within-study component. For the between-study part, imagine averaging the effect sizes within each study so that every study has a single effect size, and then asking how similar these study-level effect sizes are to each other. If they are similar, the effect of stress on cognition tends to be consistent across studies. Realistically that is unlikely; these study-level effect sizes will vary, and that variation is what lets us test moderators to understand why. The within-study part asks how different the effect sizes are within each study. If this value is small, the multiple effect sizes reported by a study are very similar to each other (even if they differ from other studies). Again, this is unlikely; effect sizes within a study will usually differ from each other, and moderators can help identify why that is happening.
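One compact way to write this down (a sketch using standard three-level meta-analysis notation, not copied from the metafor documentation):

$$\hat{\theta}_{ij} = \mu + u_i + u_{ij} + \varepsilon_{ij}, \qquad u_i \sim N(0, \tau^2_{\text{between}}), \quad u_{ij} \sim N(0, \tau^2_{\text{within}}), \quad \varepsilon_{ij} \sim N(0, v_{ij})$$

Here $\hat{\theta}_{ij}$ is the j-th effect size from study i, $\mu$ is the pooled effect, $u_i$ is the study-level deviation (its variance $\tau^2_{\text{between}}$ is the between-study variance), $u_{ij}$ is the within-study deviation ($\tau^2_{\text{within}}$), and $\varepsilon_{ij}$ is sampling error with known variance $v_{ij}$.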
Above I mentioned moderators in the context of explaining between-study and within-study variance. The moderators used to explain one type of variance or the other look quite different from each other. For between-study variance we use study-level moderators, which represent characteristics that vary across studies but are identical (or nearly identical) within a study (e.g., each study was run in a different country, but all effect sizes within a study were obtained in the same country). For within-study variance we want effect size-level moderators, which tell us how effect sizes vary within a study (e.g., each effect size in a study corresponds to a different cognitive task). Identifying both study-level and effect size-level moderators is important because it helps us understand our data in a more nuanced way!
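To make the distinction concrete, here is a tiny hypothetical sketch (the studies, countries, and tasks below are made up and are not part of the dataset we generate later): 'country' is a study-level moderator because it is constant within each study, while 'task' is an effect size-level moderator because it changes from row to row within a study.
library(tibble)
# Hypothetical illustration: country is constant within a study (study-level),
# task varies within a study (effect size-level)
tribble(
  ~study_id, ~country,  ~task,
  "S1",      "USA",     "Memory",
  "S1",      "USA",     "Attention",
  "S2",      "Canada",  "Memory",
  "S2",      "Canada",  "Executive"
)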
Anyway, for the following examples we will be modeling effect sizes from dummy data generated by AI. Our research interests are to:
- a) investigate the effect of stress on cognition by calculating a pooled effect size with its corresponding p-value
- b) evaluate whether the effect of stress in our data is heterogeneous
- c) partition the variance of the effect sizes into between-study and within-study components
- d) test study-level and effect size-level moderators to explain away the heterogeneity
We will generate a dataset that contains all of the information we need to run a multilevel meta-analysis. In the code, different values were specified when creating the outcome variable. The data are simulated so that stress has a negative effect on cognition, higher testosterone within a sample acts as a protective factor, performance varies by the cognitive task completed, and some random variability is added.
# Load required packages
library(tidyverse)
library(metafor)
# Set seed for reproducibility
set.seed(124)
#####
######### Part 1: Simulate the data
#####
# Parameters
n_studies <- 10
stress_effect <- -6
testos_effect <- 0.4
memory_effect <- -1
attention_effect <- 2
sd_study_effect <- 1.5 # Changes variability due to studies
sd_sample_effect <- 3.5 # Changes variability due to within studies
sigma <- 0.5 # Noise introduced
# Setting up soft coding parameters
study_id <- paste0("S", 1:n_studies) # Study IDs as S1, S2, ..., S10
study_id <- rep(study_id, times = sample(1:3, n_studies, replace = TRUE))
n_effects <- length(study_id)
n_study <- length(unique(study_id))
# Study-level random effect (sd = sd_study_effect)
study_effect <- rnorm(n_studies, 0, sd = sd_study_effect)
# Generate dataset with NAs to keep variables in a nice order
data <- data.frame(
effect_id = 1:n_effects,
study_id = study_id,
group = NA,
subject_group = NA,
study_effect = NA, # New column for study-level random effect
sample_effect = NA, # New column for within-study random effect
task_type = sample(c("Memory", "Attention", "Executive"), n_effects, replace = TRUE),
sample_type = NA,
age = NA,
memory = NA,
attention = NA,
testos_change = NA,
n_pre = round(runif(n_effects, 20, 100)),
n_post = NA,
mean_pre = runif(n_effects, 70, 90),
mean_post = NA,
sd_pre = runif(n_effects, 10, 20),
sd_post = runif(n_effects, 10, 20)
)
# Introduce values for memory and attention (dummy coding)
data <- data %>%
mutate(memory = ifelse(task_type == "Memory", 1, 0),
attention = ifelse(task_type == "Attention", 1, 0))
# Introduce sample type information (non-predictive variable)
# Count how many effect sizes each study contributes (used to index studies below)
study_length <- data %>% group_by(study_id) %>% summarise(n = length(effect_id))
for (i in 1:n_study) {
# Pick a military branch for this study
current_military_status <- sample(c("Marine", "Airforce", "Navy", "Army", "Seals"), 1)
# Assign that branch to every effect size from the study
data$sample_type[data$study_id == study_length$study_id[i]] <- current_military_status
}
# Specify that for each effect size, 0-15% of subjects drop out by the stress (post) condition
for (i in 1:n_effects) {
data$n_post[i] <- round(data$n_pre[i] * (1 - runif(1, 0, 0.15)))
}
# Introduce the effect of testosterone- fix this value for each study
for (i in 1:n_studies) {
data$testos_change[data$study_id == paste0("S", i)] <- runif(1, -20, 20)
data$study_effect[data$study_id == paste0("S", i)] <- study_effect[i] # Assign study-level random effect
}
# Choose some studies to have effect sizes from the same sample (same subjects)
same_subject_studies <- sample(paste0("S", 1:n_studies), 5)
for (i in same_subject_studies) {
n_tasks <- sum(data$study_id == i)
if (n_tasks > 1) {
# Assign group A for all tasks in same-subject study
data$group[data$study_id == i] <- "A"
data$subject_group[data$study_id == i] <- paste0(i, "_A")
# Within-study random effect (sd = sd_sample_effect), single value for same subjects
sample_effect <- rnorm(1, 0, sd = sd_sample_effect)
data$sample_effect[data$study_id == i] <- sample_effect
} else {
data$group[data$study_id == i] <- "A"
data$subject_group[data$study_id == i] <- paste0(i, "_A")
data$sample_effect[data$study_id == i] <- 0 # No within-study effect for single-task studies
}
}
# Make the remaining studies have effect sizes from different samples within the study
other_studies <- setdiff(paste0("S", 1:n_studies), same_subject_studies)
for (i in other_studies) {
n_tasks <- sum(data$study_id == i)
# Assign group letters A, B, C, ... for each subject group within the study
data$group[data$study_id == i] <- LETTERS[1:n_tasks]
for (j in 1:n_tasks) {
data$subject_group[data$study_id == i & data$group == LETTERS[j]] <- paste0(i, "_", LETTERS[j])
data$sample_effect[data$study_id == i & data$group == LETTERS[j]] <- rnorm(1, 0, sd = sd_sample_effect)
}
}
# Introduce age for each sample (non-predictive variable)
unique_subject_group <- unique(data$subject_group)
n_samples <- length(unique_subject_group)
for(i in 1:n_samples) {
# Introduce a random age
data$age[data$subject_group == unique_subject_group[i]] <- rnorm(1, mean = 20, sd = 8)
}
# Generating the mean outcome
data <- data %>%
mutate(mean_post = mean_pre + stress_effect + testos_effect * testos_change + memory_effect * memory + attention_effect * attention +
study_effect + sample_effect + rnorm(n(), mean = 0, sd = sigma)) # n() draws one noise value per effect size (rnorm(1) would add the same constant to every row)
# Calculate Hedges' g and variance
for (i in 1:n_effects) {
n1 <- data$n_pre[i]
n2 <- data$n_post[i]
m1 <- data$mean_pre[i]
m2 <- data$mean_post[i]
sd1 <- data$sd_pre[i]
sd2 <- data$sd_post[i]
sd_pooled <- sqrt(((n1 - 1)*sd1^2 + (n2 - 1)*sd2^2) / (n1 + n2 - 2))
# Hedges' small-sample correction J = 1 - 3/(4*df - 1), with df = n1 + n2 - 2
data$Hedges_g[i] <- (m2 - m1) / sd_pooled * (1 - 3 / (4 * (n1 + n2 - 2) - 1))
data$Hedges_g_var[i] <- (n1 + n2) / (n1 * n2) + (data$Hedges_g[i]^2) / (2 * (n1 + n2))
}
# Data cleaning (rounding)
rounded_data <- round(select(data, where(is.numeric)), 2)
data[, names(rounded_data)] <- rounded_data
# Additional data cleaning (remove variables we don't need)
data2 <- data %>%
select(-memory, -attention, -mean_pre, -mean_post, -sd_pre, -sd_post)
# Add effect size by study and within study
data2 <- data2 %>%
group_by(study_id) %>%
mutate(between_study_mean = round(mean(Hedges_g),2)) %>%
as.data.frame()
# Print out the data
data2
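Before interpreting anything, it can help to confirm the nesting structure we just built. A quick check (a minimal sketch using the data2 object created above):
# How many effect sizes does each study contribute, and from which subject groups?
data2 %>% count(study_id, subject_group)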
Below we can see the full dataset that was generated. Let's start by explaining some of the variables of interest:
- `effect_id`: A unique identifier for each effect size in the data (there are 19 of them). Unique does not mean independent.
- `study_id`: The study each effect size was extracted from. Notice the same study_id appears multiple times, showing that nesting is clearly taking place (there are 10 studies total).
- `study_effect`: A way to visualize how studies can realistically vary: each study has its own unique influence on the effect size. This is how we should think about studies in our real data.
- `sample_effect`: A visualization of how different samples can have their own influence on the effect size. Imagine that one sample just happens to be much smarter than another sample within the same study; we would expect their effect sizes to differ, and that is roughly what is shown here. The same sample could also perform differently across tasks (e.g., getting tired after the first task), but that is not modeled here to keep things simple.
- `task_type`: An effect size-level moderator representing the different tasks subjects completed within a study.
- `sample_type`: A study-level moderator; each study happened to recruit subjects from a single military branch.
- `age`: An effect size-level moderator; every sample across all studies has a different mean age.
- `testos_change`: For this example it is a study-level moderator, but realistically it would behave more like a hybrid between a study-level and an effect size-level moderator, since different samples within the same study should produce different testosterone changes. That is not shown here for simplicity's sake.
- `Hedges_g`: Our effect size, representing the effect of stress on cognition. Negative values indicate that stress leads to poorer cognition, because the effect size is calculated as mean_stress - mean_baseline; if mean_stress is smaller than mean_baseline, the value is negative.
- `Hedges_g_var`: The sampling variance of each effect size. This is completely different from between-study and within-study variance, and it is required for the model to run correctly.
- `between_study_mean`: Included to further illustrate how the pooled effect size for each study can vary across studies by a good margin.
*(Screenshot: The full dataset)*
# Run multilevel meta-analysis
mult_lvl_mod <- rma.mv(
yi = Hedges_g,
V = Hedges_g_var,
random = ~ 1 | study_id/effect_id, # This is saying (right to left), all effect sizes are nested within studies!
data = data,
method = "REML"
)
# Summary of results
print("Multilevel Meta-Analysis Results:")
summary(mult_lvl_mod)
# Forest plot
forest(mult_lvl_mod, main = "Multilevel Meta-Analysis: Effect of Stress on Cognition")
Below is the model output after running the multilevel meta-analysis. The output is fairly short, but there is a lot to learn from it. First, the Variance Components: here the total variability of the effect sizes is partitioned into study-level and within-study (effect size-level) influences. Start with the top row and notice that 'nlvls' is 10. This is because we told the model that there are 10 studies, not 19 (the number of rows in the data), which is what allows it to estimate standard errors correctly. Next, look at the values under 'estim'. These are the variance estimates; comparing the study-level variance (0.0899) to the effect size-level variance (0.0084) shows that most of the variability in our effect sizes is due to differences across studies, not within them! This tells us we should focus on testing study-level moderators to see whether any of them can explain away that variance.
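If you want that split as proportions rather than raw variances, here is a minimal sketch (it uses the sigma2 slot of the fitted rma.mv object and ignores sampling variance, so it is not a full I² decomposition):
# Proportion of the estimated random-effects variance at each level
# sigma2[1] = between-study (study_id), sigma2[2] = within-study (effect_id within study_id)
round(mult_lvl_mod$sigma2 / sum(mult_lvl_mod$sigma2), 3)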
Additionally, we get a Test for Heterogeneity, which, when significant (as it is here), indicates that there is heterogeneity in our data and that running moderation analyses is appropriate. In practice, we should expect this test to be significant in almost every meta-analysis.
Lastly, at the bottom we get the pooled effect size of -0.2387, which is significant: 0 is not between its lower (-0.4523) and upper (-0.0250) bounds, and it is flagged with an asterisk. This value means that on average, across all studies, we should expect stress to have a weak but significant negative effect on cognition.
*(Screenshot: Model Output)*
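To see the standard-error issue from the earlier discussion in action, one option (a sketch, not part of the original analysis) is to fit a regular random-effects model that ignores the nesting and compare its standard error to the multilevel one; with dependent effect sizes the naive standard error will usually be too small.
# Naive model: treats all 19 effect sizes as independent
naive_mod <- rma(yi = Hedges_g, vi = Hedges_g_var, data = data, method = "REML")
# Compare standard errors of the pooled effect
c(naive_se = naive_mod$se, multilevel_se = mult_lvl_mod$se)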
For aesthetic purposes, we can also plot our findings as a forest plot. Notice, however, that the plot lists 19 rows as if there were 19 independent studies. This is not the case: we have 19 effect sizes, but they are nested within 10 studies.
*(Screenshot: Forest Plot)*
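One way to make the nesting visible in the plot (a sketch; the row labels chosen here are arbitrary) is to refit the model with study labels attached via the slab argument, so each row is labeled by its study and effect ID rather than being numbered as if it were its own study.
# Refit with row labels that show which study each effect size belongs to
mult_lvl_mod_lab <- rma.mv(
  yi = Hedges_g,
  V = Hedges_g_var,
  random = ~ 1 | study_id/effect_id,
  data = data,
  method = "REML",
  slab = paste(data$study_id, data$effect_id, sep = "_")
)
forest(mult_lvl_mod_lab)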
# Sample type moderation analysis
update(mult_lvl_mod, mods = ~ sample_type)              # dummy coding (Airforce as reference)
update(mult_lvl_mod, mods = ~ factor(sample_type) - 1)  # no intercept; can also be achieved with + 0
pvals <- c(0.7439, 0.0014, 0.8920, 0.0023, 0.9206)      # p-values from the no-intercept output (Airforce, Army, Marine, Navy, Seals)
p.adjust(pvals, method = "holm")
# Testosterone moderation analysis
update(mult_lvl_mod, mods = ~ testos_change)
# Sample type and testosterone moderation analysis
update(mult_lvl_mod, mods = ~ factor(sample_type) - 1 + testos_change) # can also be achieved with + 0
Okay, so from the variance components of the model output we know that most of the variability in our effect sizes is due to study-level effects (0.0899 vs 0.0084). We should therefore follow up by running moderation analyses with study-level moderators. As mentioned, this dummy dataset has two of them: sample type (each study recruited from a single military branch) and testosterone (everyone's testosterone change within the same study is identical).
First we look at the effect of 'sample_type'. We know that effect sizes differ across the 10 studies, and these studies differ in that some collected data only on Marines, others only on members of the Navy, and so on. So it is worth asking whether the effect of stress on cognition differs based on the military branch the study was done on. For example, Marines tend to go through very intense stress exposure, partly to weed out those who cannot handle extreme environments, so their training might produce very high stress and, in turn, a larger average decrease in cognitive performance. Members of the Army still engage in stressful activities, but these might be considerably milder than what Marines go through, so the effect size of stress on cognition could differ between the two groups. This moderation analysis can test exactly that kind of question.
The default moderation test uses dummy coding. This means that one level of the factor (Airforce) becomes the reference group, and the remaining levels (Army, Marine, Navy, Seals) are compared to that reference. Before exploring that, let's look at the other parts of the moderation output. First, the between-study variance has decreased from 0.0899 (in the meta-analysis model) to 0.0344, which shows that this moderator explains away some of that variance. It has not explained all of it, since the 'Test for Residual Heterogeneity' is still significant. Next, we get an omnibus test, the 'Test of Moderators (coefficients 2:5)'. Its p-value is significant, which means that 'sample_type' does moderate the relationship between stress and cognition (the effect size). However, on closer inspection of the 'Model Results', none of the individual estimates are significant. The intercept is the mean effect size for the Airforce, which is -0.0900. The remaining estimates, like -0.4965 for Army, represent how different each branch's effect size is from the Airforce's, and the same logic applies to the remaining branches. None of these estimates are significant, which tells us that, statistically speaking, all branches had effect sizes similar to the Airforce's. This is a somewhat odd question to explore, though: why should we care about how every branch compares specifically to the Airforce? That choice of comparison is essentially arbitrary.
*(Screenshot: Moderation Analysis Military Branch (Dummy Coding))*
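If you are ever unsure which level ends up as the reference group, a quick check (a minimal sketch): R orders factor levels alphabetically by default, which is why Airforce is the reference here.
# The first level listed is the reference group in the dummy-coded model ("Airforce")
levels(factor(data$sample_type))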
A better approach is to change the specification of the moderation analysis to remove the dummy coding. Dummy coding is not inherently a bad thing, and it makes sense in some contexts. Imagine that instead of 'Airforce' the reference group was 'Civilian'; the resulting estimates would show how each military branch's effect size differs from civilians', which would be an interesting question! But our data do not include a group like that, so it makes more sense to produce estimates that tell us the effect size for each military branch and whether it is significantly different from 0. We can do this by removing the intercept from the model, which gives the output below.
With this tweak, the omnibus 'Test of Moderators' changes (it now covers all five coefficients) and becomes more strongly significant, because it now tests whether any of the five branch-specific effect sizes differ from 0. This does not mean the model explains more variance, since 1) the variance components are the same and 2) the 'Test for Residual Heterogeneity' is identical. But now the 'Model Results' show the effect size for each military branch directly, and each hypothesis test asks whether that branch's effect size differs from 0. For Army and Navy, the respective effect sizes of -0.5865 and -0.5648 are significant, since a) their lower and upper bounds do not include 0 and b) their p-values are significant (note the asterisks). HOWEVER, we are not done! This output does not control for multiple comparisons. Essentially five separate tests were run, which increases the chance of a type I error: a finding that looks significant purely by chance and would not replicate in another study. So we take the p-values and apply a correction (here, Holm), which makes the p-values larger to guard against type I errors. After doing this, the p-values for Army (p = 0.0070) and Navy (p = 0.0092) are still significant. This concludes our moderation analysis for 'sample_type'.
*(Screenshots: Moderation Analysis Military Branch (No Intercept); Corrected P-values using Holm)*
The next study-level moderator to investigate is testosterone. Since it is a continuous variable, the process is more straightforward. For this moderator, the between-study variance component decreases quite dramatically, from 0.0899 to 0.0119. Additionally, the 'Test for Residual Heterogeneity' is no longer significant (p = 0.0731), indicating that there is no longer meaningful heterogeneity left in the data, which is a great thing. As expected from the drop in between-study variance and the non-significant heterogeneity test, the 'Test of Moderators (coefficient 2)' comes back significant (p < .0001), indicating that testosterone explains away a substantial amount of the heterogeneity in the effect sizes. In the 'Model Results' there is a positive and significant slope for testosterone (0.0209). This means that the more testosterone a sample produces after stress, the less negative the effect of stress on their cognitive performance.
*(Screenshot: Moderation Analysis Testosterone)*
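To translate the slope into something more tangible, here is a minimal sketch (testos_mod is simply a named copy of the testosterone moderator model fitted above, and the three testosterone-change values are arbitrary):
# Predicted pooled effect sizes at a low, average, and high testosterone change
testos_mod <- update(mult_lvl_mod, mods = ~ testos_change)
predict(testos_mod, newmods = c(-15, 0, 15))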
With a sufficient number of studies in the meta-analysis, there is no need to test moderators one at a time as we did above. Instead, we can enter them all into one model, which we have done below. Here is the model output with both sample type (intercept removed) and testosterone. Following the same steps as before, the between-study variance component is now 0, so the differences across studies have been completely explained by this model. The test for residual heterogeneity is not significant (p = 0.2331) and the Test of Moderators is significant (p < .0001). Turning to the 'Model Results', Army and Marine both have significant pooled effect sizes, meaning these groups show worse cognitive performance after stress, while the same finding is not present for the Airforce, Navy, or Seals. Lastly, testosterone is also significant, indicating that groups with higher testosterone overall had better cognitive performance after stress.
*(Screenshot: Full Moderation Analysis (Uncorrected P-values))*
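Since the combined model again tests one effect size per branch, the same multiple-comparison logic from before applies. A minimal sketch (combined_mod is a named copy of the model fitted above; this assumes the coefficients appear in the order written in the formula, with the five branch effects first):
# Holm-correct the branch-specific p-values from the combined moderator model
combined_mod <- update(mult_lvl_mod, mods = ~ factor(sample_type) - 1 + testos_change)
round(p.adjust(combined_mod$pval[1:5], method = "holm"), 4) # first five coefficients are the branches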
In this section we will modify the simulation parameters (shown below) so that the task-type effects are much stronger and more of the random variability comes from differences between studies rather than from samples within studies:
n_studies <- 10
stress_effect <- -6
testos_effect <- 0.3
memory_effect <- -10
attention_effect <- 10
sd_study_effect <- 3 # Changes variability due to studies
sd_sample_effect <- 1 # Changes variability due to within studies
sigma <- 0.5 # Noise introduced