Introduction to Power Analysis for Linear Mixed-Effects Models
Overview
The goal of this wiki page is to explain how to do power analysis for linear mixed-effects models. We will start by generating data for which a linear mixed-effects model is appropriate, run the model, and then use functions to see whether our model has enough power and, if not, what sample size would be required to achieve it!
A Quick Review in Linear Mixed-Effects Models Notation
A quick review: linear mixed-effects models are used when we have data that are nested within some random cluster. This means our observations are not fully independent of each other; there is some correlation induced by the cluster that needs to be accounted for. Failing to do so will result in underestimated standard errors for the fixed effects and therefore inflated Type I error rates. We call this cluster random to differentiate it from a fixed effect. For a fixed effect, we are interested in the predictive power of a factor on the outcome. For a random effect, we are using the variable to partition the variance of the outcome into a part related to the cluster's influence and a part related to the individual. Being able to separate the variance attributable to the individual from that attributable to context/environmental factors is important, because then we can see which factors are predictive of either!
An empty linear mixed-effects model takes on the following formula:
$$Y_{ij} = \gamma_{00} + u_{0j} + r_{ij}$$
Where:
- $Y_{ij}$: the outcome, i.e. the score of individual i in group/cluster j
- $\gamma_{00}$: the fixed intercept, representing the grand mean of Y across all individuals and clusters
- $u_{0j}$: the random effect of being in group j (its deviation from the grand mean), usually assumed $u_{0j} \sim N(0, \tau^2)$
- $r_{ij}$: the residual, the portion unexplained by the model for individual i in group/cluster j, assumed $r_{ij} \sim N(0, \sigma^2)$
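To make this notation concrete, here is a minimal sketch that generates data from exactly this empty model (the values are made up for illustration: 30 clusters of 5 observations, $\gamma_{00} = 50$, $\tau = 5$, $\sigma = 5$):

```r
# Minimal sketch: simulate Y_ij = gamma_00 + u_0j + r_ij
set.seed(1)
n_groups <- 30                                      # number of clusters (made-up value)
n_per    <- 5                                       # observations per cluster
gamma_00 <- 50                                      # grand mean
u_0j  <- rnorm(n_groups, mean = 0, sd = 5)          # one random effect per cluster
group <- rep(1:n_groups, each = n_per)              # cluster membership
r_ij  <- rnorm(n_groups * n_per, mean = 0, sd = 5)  # individual-level residual
Y_ij  <- gamma_00 + u_0j[group] + r_ij              # everyone in a cluster shares its u_0j
```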
Now that we understand what the model looks like, we can introduce predictors into the model and contextualize it with our dummy data example. We will use this formula moving forward:
$$Y_{ij} = \gamma_{00} + \beta_1 x_{1ij} + \beta_2 x^*_{2ij} + u_{0j} + r_{ij}$$
Where:
- $Y_{ij}$: child i in household j's score on the outcome variable (academic performance)
- $\gamma_{00}$: the expected score of a child who does not have DD and is at the grand mean of age
- $\beta_1 x_{1ij}$: the fixed effect of DD status (noDD = 0; DD = 1)
- $\beta_2 x^*_{2ij}$: the fixed effect of grand-mean-centered age, where $x^*_{2ij} = age_{ij} - \overline{age}$
- $u_{0j}$: the random effect of being in cluster j
- $r_{ij}$: the unexplained variance (residual) for child i after taking into account the effects of age, DD status, and cluster j on the outcome
Generating the data
We will be generating a dataset that contains four variables:
- fam_id: This is the household ID, indicating that two children live in the same household
- age: This is the child's age (4-18 years)
- dd: This is whether the child has DD or not (0 = noDD, 1 = DD)
- score: This is the outcome variable of interest (academic performance)
Next we need to specify the relationship of the fixed and random effects to the outcome variable. We will therefore generate data according to a model with the following coefficients (note that the simulation keeps age uncentered, so the intercept of 50 corresponds to a noDD child at age 0):
Our Final Random Intercepts Model:
$$Y_{ij} = 50 - 2\,dd_{ij} + 0.5\,age_{ij} + u_{0j} + r_{ij}$$
Distribution of the random effects:
$$u_{0j} \sim N(0, \tau^2)$$
$$r_{ij} \sim N(0, \sigma^2)$$
Notice that we are not placing any coefficients in front of the random-effect portion of the model, because it would not make sense to do so. These variables capture the deviation of cluster j's influence from the grand mean of 50, and the error for child i, which is the deviation of the observed value from the predicted value. Instead, we describe these variables by how much they vary: here the variance of the cluster effect is $\tau^2 = 5^2$ and the variance of the residual is $\sigma^2 = 5^2$ (although when we generate the data we will specify the standard deviations). The strength of the random effects is therefore determined by the variances we choose for them.
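To make this concrete with the coefficients above: a DD child aged 12, living in a household with draw $u_{0j} = 3$ and residual draw $r_{ij} = -1$ (illustrative values), would score $50 - 2 + 0.5(12) + 3 - 1 = 56$.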
Generating the Data in R
Below we generate data that should reproduce the estimates of this model. The dataset will have 200 observations: 100 household IDs with two children in each (one with DD and one without). Ages will be uniformly distributed between 4 and 18 years.
# Load in packages
library(lme4)        # fitting mixed-effects models
library(lmerTest)    # p-values and degrees of freedom for lmer models
library(simr)        # simulation-based power analysis
library(dplyr)       # data manipulation
library(ggplot2)     # plotting (used in the visualization section)
library(effectsize)  # t_to_eta2() for effect sizes
# Set seed
set.seed(123)
# Parameters
n_families <- 100 # Number of families
children_per_fam <- 2 # Number of children within a family
N <- n_families * children_per_fam # Total sample size
# Simulate family IDs
fam_id <- rep(1:n_families, each = children_per_fam) # Generating family IDs
# Simulate age (uniform 4–18)
age <- runif(N, min = 4, max = 18) # Generating an age variable
# Simulate DD status: within each family, one DD and one noDD
dd <- rep(c(0,1), times = n_families) # generating a DD status variable
# Fixed‐effect coefficients
beta_0 <- 50 # gamma_00: expected score for a noDD child at age = 0 (age is left uncentered in this simulation)
beta_dd <- -2 # The effect of dd = 1, aka having DD
beta_age <- 0.5 # The effect of age (expected increase in the outcome by a one unit increase in age)
# Random intercepts for family (distribution)
sd_fam <- 5 # family-level SD (how influential the household is)
u_fam <- rnorm(n = n_families, mean = 0, sd = sd_fam)
rand_int <- u_fam[fam_id] # Make sure each child within a family has the same family effect (aka effect of j)
# Residual error (distribution)
sd_res <- 5 # The standard deviation of the residual error (part not explained by the model)
epsilon <- rnorm(N, sd = sd_res)
# Generating our outcome from the specified effects and distribution of random effects
score <- beta_0 + beta_dd * dd + beta_age * age + rand_int + epsilon
# Put into a data frame
dat <- data.frame(
fam_id = factor(fam_id),
age = age,
dd = factor(dd, labels = c("noDD","DD")),
score = score
)
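Before visualizing, a quick optional sanity check that the simulated data look as intended:

```r
# Quick sanity checks (optional)
head(dat)                              # inspect the first few rows
dat %>%
  group_by(dd) %>%
  summarize(mean_score = mean(score))  # DD group mean should be roughly 2 points lower
```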
Visualizing our data
We can look at how the different variables in our dataset relate to our outcome. Visualizing the data is a great way to confirm that what we intended actually manifested in the data.
# Visualize the data
ggplot(dat, aes(x = dd, y = score)) +
geom_boxplot() +
geom_jitter(width = .2, size = 1) +
theme_classic() +
labs(title = "Visualizing the effect of DD status on score")
ggplot(dat, aes(x = age, y = score)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "black") +
theme_classic() +
labs(title = "The relationship between age and score")
dat %>%
group_by(fam_id) %>%
summarize(mean_score = mean(score)) %>%
mutate(fam_id = forcats::fct_reorder(fam_id, mean_score)) %>%
ggplot(aes(x = mean_score, y = fam_id)) +
geom_point(size = 1) +
labs(x = "Household Mean Score",
y = "Household (ordered)",
title = "Household Means of Child Scores") +
theme_classic()
*(Plots, left to right: effect of DD status (fixed effect); effect of age (fixed effect); household means, ordered (random effect).)*
- Plot 1: Here we can see the clear effect of DD status. We specified the model so that being DD produces a mean difference of -2 compared to noDD. Comparing the box plots between groups, the mean of the DD group looks lower than that of the noDD group, as expected.
- Plot 2: Here we see the relationship between age (continuous) and score (continuous). The line of best fit on the scatter plot shows a positive relationship between the two, which matches the coefficient we specified for the effect of age on the outcome.
- Plot 3: This plot shows that average performance differs substantially across households: one household averages ~40 while another averages ~70.
Note: All three plots above show the unadjusted relationships between a predictor and the outcome. The corresponding model-based plots would look different, because there the relationship between a predictor and the outcome is adjusted for the other predictors and the random effects in the model.
Running the model
Now that we fully understand what is going on in our data, we can run our model to see if the estimates from the output match our expectations.
First we will run an empty model, where only the random effects are present. This model gives us the intraclass correlation coefficient (ICC), a proportion telling us how much of the total variance is attributable to the cluster level: $ICC = \tau^2 / (\tau^2 + \sigma^2)$. To calculate this by hand, we take the group-level variance and divide it by the sum of the group-level and individual-level variances.
# run an empty model
m0 <- lmer(score ~ (1 | fam_id), data = dat)
performance::icc(m0)
summary(m0)
*(Screenshot: empty model output.)*
Here we see the output of the empty model. We can tell that it is empty because there are no predictors; the only fixed effect present is the intercept. Taking the group-level variance (13.92) and the individual-level variance (38.38), the ICC = 13.92 / (38.38 + 13.92) = 0.266. Notice that this value matches the output of the `icc()` function.
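The same calculation can be done programmatically by pulling the variance components out of the model; a small sketch using lme4's `VarCorr()`:

```r
# Compute the ICC by hand from the fitted variance components
vc     <- as.data.frame(VarCorr(m0))
tau2   <- vc$vcov[vc$grp == "fam_id"]    # between-household variance
sigma2 <- vc$vcov[vc$grp == "Residual"]  # within-household (residual) variance
tau2 / (tau2 + sigma2)                   # should match performance::icc(m0)
```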
Next we can run the full model with our predictors of interest.
m1 <- lmer(score ~ dd + age + (1 | fam_id), data = dat)
summary(m1)
*(Screenshot: full model output.)*

| Model Parameter | Variable Type | What We Specified | What the Model Estimated |
|---|---|---|---|
| $\gamma_{00}$ | fixed effect | 50 | 48.7378 |
| $\beta_1 x_{1ij}$ | fixed effect | -2 | -2.7953 |
| $\beta_2 x^*_{2ij}$ | fixed effect | 0.5 | 0.6202 |
| $u_{0j}$ | random effect | SD = 5 | 4.495 |
| $r_{ij}$ | random effect | SD = 5 | 5.115 |
Overall it's not a perfect replication of what we specified, and it is not supposed to be! The estimates are quite close to the values we specified, showing that our model works and that we created the data as intended!
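We can also pull the estimates out directly rather than reading them off the summary; a minimal sketch (`parm = "beta_"` restricts `confint()` to the fixed effects):

```r
# Extract the fixed effects and quick Wald confidence intervals
fixef(m1)                                     # compare to 50, -2, and 0.5
confint(m1, parm = "beta_", method = "Wald")  # do the generating values fall inside?
```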
Effect Size
Effect sizes for these models are very tricky, and I still have not figured out the single correct approach. Several articles on the internet claim we should not even bother; however, that is not how I roll. The problem with calculating effect sizes for these models is that effect sizes from traditional simple or multiple regression assume the variance is not partitioned into cluster and individual contributions; all variance is treated as individual-level. Therefore, when we use those calculations here, the values represent how much of the individual-level variance the model explains, while variance explained at the group level is ignored or captured inaccurately. These are the limitations we live with, and we will proceed.
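That said, one widely used option for "variance explained" in mixed models is the marginal/conditional $R^2$ of Nakagawa and colleagues, which the performance package implements:

```r
# Marginal R2: variance explained by the fixed effects alone
# Conditional R2: variance explained by the fixed and random effects together
performance::r2(m1)
```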
Another easy function we can use is `t_to_eta2()` from the effectsize package, which takes a t-statistic and its degrees of freedom and converts them into partial eta squared. We can do this below using the values from our model summary.
# Calculate effect size (with the limitation noted above)
t_to_eta2(-3.853, 97.5455)   # t and df for the DD effect, from summary(m1)
t_to_eta2(5.462, 164.9347)   # t and df for the age effect, from summary(m1)
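Rather than copying the t-values by hand, we can pull them straight from the coefficient table (a sketch that assumes lmerTest is loaded, since it supplies the `df` column):

```r
# Pull t-values and degrees of freedom from the coefficient table
coefs <- coef(summary(m1))
t_to_eta2(coefs["ddDD", "t value"], coefs["ddDD", "df"])  # DD effect
t_to_eta2(coefs["age",  "t value"], coefs["age",  "df"])  # age effect
```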
*(Screenshot: partial eta squared of our fixed effects.)*
Calculating Power for the Fixed Effects
It is always good to know whether we have enough power in our data to detect an effect. Power analysis for linear mixed-effects models is also tricky, and it is typically done through data simulation. Here is how it works: we use the `powerSim()` function from the `simr` package and give it our fitted model, and it calculates the power for a single fixed effect. The argument `"t"` tells the function to simulate t-tests, which makes sense since that is what our model computed for both the dichotomous and the continuous predictor.
# Power Analysis for the fixed effects
# Power for ddDD
power_dd <- powerSim(m1, test = fixed("ddDD", "t"), nsim = 1000)
print(power_dd)
# Power for age
power_age <- powerSim(m1, test = fixed("age", "t"), nsim = 1000)
print(power_age)
*(Screenshots: power analysis output for DD; power analysis output for age.)*
These outputs indicate that we have more than enough power to detect our unstandardized beta coefficients as significant. What these functions are really doing is simulating 1000 datasets from the model (that is what `nsim = 1000` specifies), refitting the model to each, and counting how many times the effect of dd or age, respectively, comes out significant. The proportion of simulations that produce a significant result is the power! Sufficient power is conventionally defined as a proportion of .80, which we have reached; thus our model is sufficiently powered to detect our two fixed effects.
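Conceptually, the simulate-refit-count loop looks something like the rough sketch below. This is only an illustration of the idea, not simr's exact algorithm; for simplicity it declares significance when |t| > 1.96 instead of using simr's own test machinery:

```r
# Rough manual version of what powerSim() does under the hood (illustrative only)
nsim  <- 100
tvals <- replicate(nsim, {
  y_sim <- simulate(m1)[[1]]               # draw a new outcome from the fitted model
  m_sim <- refit(m1, y_sim)                # refit the same model to the simulated outcome
  coef(summary(m_sim))["ddDD", "t value"]  # test statistic for the DD effect
})
mean(abs(tvals) > 1.96)                    # proportion "significant" = estimated power
```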
Calculating the Sample Size Needed to Reach .80 Power
In the real world we will often come across data that are underpowered due to a small sample size or a small effect size. We would expect this, for example, when analyzing preliminary data while data collection is ongoing. Therefore, it is important to understand how this works. First, the data we simulated are heavily overpowered due to the large sample size, so we will start by subsetting the data, rerunning the model, and (hopefully) finding that our predictors are underpowered.
We start by selecting 20 households at random, which decreases our sample size from 200 down to 40.
# Subset the data to decrease our N (N = 40)
new_cluster_n <- 20
small_fam_id <- sample(unique(dat$fam_id), new_cluster_n, replace = FALSE)
small_dat <- filter(dat, fam_id %in% small_fam_id)
nrow(small_dat)
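A quick check that each sampled household still contributes both of its children:

```r
# Each of the 20 sampled households should still have exactly 2 children
# (droplevels removes the 80 unsampled households from the factor)
table(droplevels(small_dat$fam_id))
```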
Next we re-run the mixed-effects model:
# Run the second model
m2 <- lmer(score ~ dd + age + (1 | fam_id), data = small_dat)
summary(m2)
*(Screenshot: model output for the subset.)*
We see that this had a pretty big effect on our estimates, changing their values drastically and making one of our predictors non-significant. The change in the estimates is a direct result of sampling error from picking a small subset of the data (while still keeping both children from each sampled household).
Setting that aside, we can inspect how the effect sizes changed using the information from this model's output.
# Calculate effect size (same limitation as before)
t_to_eta2(-2.729, 19.3276)   # t and df for the DD effect, from summary(m2)
t_to_eta2(1.602, 27.5501)    # t and df for the age effect, from summary(m2)
*(Screenshot: effect sizes for the subset model.)*
Interestingly, the effect size for the DD effect increased drastically, because its slope nearly doubled purely through random sampling error (chance). On the other side, the effect of age was drastically reduced, also due to sampling error. Therefore, when we run the power analyses, we might see decent power for the DD effect but poor power for age.
# Estimate the power of each predictor
power_dd_small <- powerSim(m2, test = fixed("ddDD", "t"), nsim = 1000)
print(power_dd_small)
power_age_small <- powerSim(m2, test = fixed("age", "t"), nsim = 1000)
print(power_age_small)
*(Screenshots: power analysis for DD and for age, on the subset.)*
We see now that both of our predictors are underpowered, though the DD effect is getting close to .80. This gives us a good opportunity to simulate data and calculate what sample size is needed for sufficient power. We will do this using the functions `extend()` and `powerCurve()`. These functions are kind of interesting. `extend()` basically increases the sample size for a model! It does not collect new observations; rather, it modifies the model object so that it simulates what the model would have looked like if it had been fit to a larger sample, at whatever size we specify. After this new object (model) is created, we can use `powerCurve()` to fit multiple simulated models of varying sample sizes, and it reports how much power each has to detect a specified predictor.
Important: when we mention sample size here, we mean the sample size of the level-two cluster. The models therefore report power values for simulated models with specified cluster sample sizes. The `powerCurve()` function has an argument named `breaks`, which we use to specify the cluster sample sizes at which power is evaluated, i.e. how many models to test and by how much the cluster sample size increases between them (e.g. we chose steps of 10). This is capped at the largest cluster sample size we specified in the `extend()` function (e.g. we chose 200).
Below we extend the model with `extend()` to a cluster sample size of 200 and then use `powerCurve()` to calculate the power at each simulated cluster sample size.
# Extend the model to 200 clusters (adjust as needed)
cluster_n <- 200
model_n_increase_by <- 10
m2_extended <- extend(m2, along = "fam_id", n = cluster_n) # Simulate a model from a larger dataset
# Now run powerCurve
power_curve_dd <- powerCurve(
m2_extended,
test = fixed("ddDD", "t"),
along = "fam_id",
breaks = seq(10, cluster_n, by = model_n_increase_by),
nsim = 100
)
print(power_curve_dd)
power_curve_age <- powerCurve(
m2_extended,
test = fixed("age", "t"),
along = "fam_id",
breaks = seq(10, cluster_n, by = model_n_increase_by),
nsim = 100
)
print(power_curve_age)
*(Screenshots: powerCurve() output for DD status; powerCurve() output for age.)*
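Before wrapping up: simr also provides a plot method for power-curve objects, which makes the trend easier to read than the printed tables:

```r
# Visualize power as a function of the number of households
plot(power_curve_dd)
plot(power_curve_age)
```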
Conclusion: Even though we specified the sample size in clusters, the output of `powerCurve()` is in terms of the overall sample size needed to have enough power for each fixed effect. It seems that for DD status, roughly 10 more children would reach a power of .80. For age, however, given its very small effect size in this subset, closer to 100-120 children would be required to reach a power of .80.