ANCOVA Overview - Private-Projects237/Statistics GitHub Wiki

Overview

This page explains the main concepts of ANCOVA. We will start with a simple example and include graphs to make the ideas behind ANCOVA more concrete. The gist of ANCOVA is that it is a statistical approach that controls for the effect of a variable that is not the main predictor (the covariate) on the outcome variable. The best way to think about this is in terms of variance. When you collect data on an outcome variable, scores will deviate from the mean, and that variance can be attributed to many different sources. Some of it may be explained by your predictor variable, and other parts may be explained by another variable. By accounting for the covariate, ANCOVA reduces the unexplained (residual) variance of the outcome, which lets you examine the relationship between your predictor of interest and the outcome with more precision.

Example 1: When does ANCOVA work?

We will use code in R to create dummy data. This dummy data contains three variables:

  • group: factor variable with two levels (A, B)
  • prior_knowledge: continuous variable quantifying prior knowledge
  • final_exam_score: continuous variable measuring performance on the final exam
# Set seed for reproducibility
set.seed(123)

# Number of observations
n <- 50

# Generate group variable (A or B)
group <- rep(c("A", "B"), each = n/2)

# Generate final exam scores (dependent variable)
final_exam_score <- ifelse(group == "A", 
                           rnorm(n/2, mean = 80, sd = 5),
                           rnorm(n/2, mean = 72, sd = 5))
  
# Generate a variable for prior knowledge (covariate)
prior_knowledge <- ifelse(group == "A", 
                          rnorm(n/2, mean = 5, sd = 5),
                          rnorm(n/2, mean = 0, sd = 5)) + final_exam_score

# Collect the variables into a data frame (used by all the code below)
data <- data.frame(group, prior_knowledge, final_exam_score)

This chunk of R code creates a dataset of 50 observations (n = 50; the code was generated with ChatGPT and then modified). There are two groups (A and B), and the outcome variable (final_exam_score) was created by drawing random numbers from a normal distribution whose mean depends on group membership. This guarantees a group effect on the outcome variable. Lastly, the covariate was created by taking the outcome variable and adding random values drawn from another normal distribution. This produces a variable that is related to the outcome but has some noise added, so the two are not identical (r = .84).
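The correlation quoted above (r = .84) can be checked directly. A minimal sketch, assuming the simulated variables have been gathered into a data frame called `data`, as the plotting code below expects:

```r
# Quick check of the covariate-outcome relationship described above,
# assuming `data` holds group, prior_knowledge, and final_exam_score
cor(data$prior_knowledge, data$final_exam_score)   # roughly .84 with this seed
```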

Data visualization

# Load the libraries used for plotting (%>% comes from dplyr)
library(dplyr)
library(ggplot2)

# Plotting the outcome variable based on the main predictor (categorical)
data %>%
  ggplot(aes(x = group, y = final_exam_score)) +
  stat_summary(fun = "mean",
               geom = "bar",
               color = "black",
               fill = "white",
               size = 1,
               width =.5 ) +
  stat_summary(fun.data = mean_cl_normal,
               geom = "errorbar",
               width = .2, size = 1) +
  geom_jitter(width = .1,
              size = 1) +
  labs(title = "Group Differences (unadjusted) of Final Exam Score",
       x = "Group Status",
       y = "Final Exam Score") +
  scale_y_continuous(expand = c(0,0), limits = c(0,100)) +
  theme_classic() 

Figure 1: Notice the group mean differences for the outcome variable. For group A the mean is around 80 and for group B it is lower, around 72. This is no coincidence: these are the means we built into the simulation code. The dots on the bars indicate individual scores within each group. Lastly, the error bars are 95% confidence intervals, indicating the range in which the true population mean of the outcome for each group is likely to lie.
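The intervals that `mean_cl_normal` draws can also be computed by hand: each one is the group mean plus or minus a t critical value times the standard error of the mean. A sketch, assuming `data` holds the simulated variables:

```r
# Sketch: the t-based 95% CIs behind the error bars in Figure 1
library(dplyr)

data %>%
  group_by(group) %>%
  summarise(
    m     = mean(final_exam_score),
    se    = sd(final_exam_score) / sqrt(n()),   # standard error of the mean
    lower = m - qt(.975, n() - 1) * se,         # t critical value, df = n - 1
    upper = m + qt(.975, n() - 1) * se
  )
```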

# Plotting the outcome variable against the covariate
data %>%
  ggplot(aes(x = prior_knowledge, y = final_exam_score)) +
  geom_point() +
  theme_classic() +
  labs(title = "Prior Knowledge (Covariate) and Final Exam\nScore Scatterplot",
      x = "Prior Knowledge",
      y = "Final Exam Score") +
  geom_smooth(method = "lm",
              se = FALSE,
              color = "black")

Figure 2: An important aspect that is not visible in Figure 1 is the relationship between this other variable (the covariate) and the outcome. The scatterplot shows that the two variables covary: as one increases, the other tends to increase as well. There is a relationship here that should be accounted for in any statistical model we use... but that's not all (we will come back to this later).

Creating ANOVA and ANCOVA models

To fully appreciate the benefits of ANCOVA, it is best to compare its output to that of a regular ANOVA. Our research interest for the data we created is the relationship between group membership (A and B) and final exam scores (the outcome). We will look at this relationship in a model that does not control for the covariate (anova_model) and in one that does (ancova_model).

# Perform ANOVA
anova_model <- aov(final_exam_score ~ group, data = data)

# Perform ANCOVA
ancova_model <- aov(final_exam_score ~ group + prior_knowledge, data = data)

# Summary of the models
summary(anova_model)
summary(ancova_model)

# Calculate raw means
aggregate(final_exam_score ~ group, data = data, mean)

# Calculate adjusted means (requires the emmeans package)
library(emmeans)
adj_means <- emmeans(ancova_model, "group")
summary(adj_means)
(Images: model summaries and group means)

Table 1: Starting with the model summaries. As a quick review, ANOVA partitions variance into the between-groups sum of squares (SSB), the within-groups sum of squares (SSW), and the total sum of squares (SST). SSB and SSW both appear in the output, and SST is their sum. SSB is how much variance in the outcome is explained by group membership; SSW is the residual variance that the groups leave unexplained. Looking at our ANOVA output, SSB is 670 and SSW is 1044. Dividing each by its degrees of freedom gives the mean squares, and their ratio is the F statistic that tells us whether the group predictor is significant; in our ANOVA model it is. However, when we compare it to our ANCOVA model, we notice that the SSB stays the same at 670 but the SSW changes quite dramatically: it decreased from 1044 to 530, and the covariate 'prior_knowledge' has a sum of squares of 530. This indicates that the covariate is explaining, and thereby controlling for, some of the variance in the outcome. The direct result (in our case) is that the relationship between group and the outcome gets much stronger: notice that the F statistic nearly doubles. Lastly, since the SSW has been reduced, we can conclude that the ANCOVA is the better model. As for the group means: a model without a covariate leaves the raw group means of the outcome unchanged, but once a covariate is controlled for, the group means shift because they become adjusted means. Notice that this occurs here as well.
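The arithmetic described above can be reproduced directly from the fitted model. A sketch, assuming `anova_model` has been fit as in the chunk above; MSB divided by MSW is exactly the F value that `summary(anova_model)` reports:

```r
# Sketch: recovering the sums of squares and the F statistic for group by hand
tab <- summary(anova_model)[[1]]   # the ANOVA table as a data frame

SSB <- tab$`Sum Sq`[1]             # between-groups sum of squares
SSW <- tab$`Sum Sq`[2]             # within-groups (residual) sum of squares
SST <- SSB + SSW                   # total sum of squares

MSB <- SSB / tab$Df[1]             # mean square for group
MSW <- SSW / tab$Df[2]             # mean square error
F_group <- MSB / MSW               # should match tab$`F value`[1]
```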

Another thing to mention is that ANCOVA can dramatically decrease the standard errors by controlling for the effects of a covariate, as is visible below. This improves the precision of our estimates of the population group means for the outcome. DISCLAIMER: Evidence has presented itself that the adjusted-means bar graph below might be wrong!
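One way to see the gain in precision directly is to request estimated marginal means from both models and compare their SE columns. A sketch, assuming `anova_model` and `ancova_model` from the chunk above:

```r
# Sketch: comparing the precision of the group means across the two models;
# the SE column for the ANCOVA means should be noticeably smaller
library(emmeans)

summary(emmeans(anova_model,  "group"))   # raw means, larger SEs
summary(emmeans(ancova_model, "group"))   # adjusted means, smaller SEs
```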

(Images: original bar graph and adjusted-means bar graph)

Example 2: When does ANCOVA not work?