Math Underlying a two‐way ANOVA - Private-Projects237/Statistics GitHub Wiki

Overview

Here we will be taking a look at the math that underlies a two-way ANOVA. Additionally we will create a function that will be able to spit out step by step the calculation for each part.

Underlying Math Behind a Two-Way ANOVA

We are essentially taking the variance of the outcome and then identifying what proportion of the outcome is explained by the factors in the model vs what is not. To do this we need to calculate five types of sums of squares:

  1. Total Sums of Squares
  2. Sums of Squares for FactorA
  3. Sums of Squares for FactorB
  4. Sums of Squares for FactorA x FactorB
  5. Residual (Error) Sums of Squares

Then we will need to calculate four types of degrees of freedom:

  1. Degrees of Freedom for FactorA
  2. Degrees of Freedom for FactorB
  3. Degrees of Freedom for the A x B Interaction
  4. Degrees of Freedom for the Residuals

Then we will use the Sum of Squares and Degrees of Freedom to Calculate Mean Squares

  1. Mean Squares for FactorA
  2. Mean Squares for FactorB
  3. Mean Squares for the A x B Interaction
  4. Mean Squares for the Residuals

Lastly we will use the Mean Squares to calculate the F statistic

  1. F for FactorA
  2. F for FactorB
  3. F for the A x B Interaction

Custom Function

The function below is pretty big but we will be breaking down its components to explain exactly how it is calculating each component needed for the ANOVA results. We can essentially calculate everything except the p-values.

Comprehensive_two_way_ANOVA <- function(Outcome,FactorA, FactorB){
  
  # Generate a dataset
  dat <- data.frame(Outcome, FactorA, FactorB)
  
  # Calculate all the means needed for the ANOVA
  grand_mean = mean(Outcome)
  FactorA_means <- aggregate(Outcome~FactorA, dat, mean)
  FactorB_means <- aggregate(Outcome~FactorB, dat, mean)
  cell_means <- aggregate(Outcome~FactorA + FactorB, dat, mean)
  
  # Combine the calculated means into one dataset plus sample size
  df <- cell_means %>%
    rename(cell_means = Outcome) %>%
    mutate(grand_mean = grand_mean) %>%
    left_join(FactorA_means %>% rename(FactorA_mean = Outcome), by = "FactorA") %>%
    left_join(FactorB_means %>% rename(FactorB_mean = Outcome), by = "FactorB") %>%
    left_join(
      dat %>%
        group_by(FactorA, FactorB) %>%
        summarise(n = n(), .groups = "drop"),
      by = c("FactorA", "FactorB")
    )
  
  # Calculate the sums of squares
  SS_total <- sum((Outcome - grand_mean)^2)

  # Calculate SS of squares individaually
  df2 <- df %>%
    group_by(FactorA) %>%
    reframe(n = sum(n),
              FactorA_mean, 
              grand_mean) %>%
    unique %>%
    mutate(
      mean_diff = FactorA_mean - grand_mean,
      mean_diff_sq = mean_diff^2,
      mean_diff_sq_x_n = mean_diff_sq * n,
      FactorA_SS = sum(mean_diff_sq_x_n)
    )
  
  df3 <- df %>%
    group_by(FactorB) %>%
    reframe(n = sum(n),
            FactorB_mean, 
            grand_mean) %>%
    unique %>%
    mutate(
      mean_diff = FactorB_mean - grand_mean,
      mean_diff_sq = mean_diff^2,
      mean_diff_sq_x_n = mean_diff_sq * n,
      FactorB_SS = sum(mean_diff_sq_x_n)
    )
    
  df4 <- df %>%
   mutate(mean_diff_plus_grand = cell_means - FactorA_mean - FactorB_mean + grand_mean,
          mean_diff_plus_grand_sq = mean_diff_plus_grand^2,
          mean_diff_plus_grand_sq_x_n = mean_diff_plus_grand_sq*n,
           Interaction_SS = sum(mean_diff_plus_grand_sq_x_n))
  
  # Calculate residual sum of squares
  SS_residual = SS_total - df2$FactorA_SS[1] - df3$FactorB_SS[1] - df4$Interaction_SS[1]
  
  # degrees of freedom
  df_FactorA <- nrow(FactorA_means) -1
  df_FactorB <- nrow(FactorB_means) -1
  df_interaction <- (nrow(FactorA_means) -1) * (nrow(FactorB_means)-1)
  df_residual <- nrow(dat) - (df_FactorA + df_FactorB + df_interaction + 1)
  df_total <- length(Outcome) - 1
  
  # Create a dataframe to show how degrees of freedom are calculated
  df5 <- data.frame(
    Source = c("FactorA", "FactorB", "FactorA:FactorB", "Residual", "Total"),
    Equation = c("a-1","b-1","(a-1)(b-1)", "n-(a*b)","n-1"),
    df = c(df_FactorA, df_FactorB, df_interaction, df_residual, df_total)
  )
  
  # Calculate Mean Squares
  MS_FactorA<- df2$FactorA_SS[1] / df_FactorA
  MS_FactorB <- df3$FactorB_SS[1] / df_FactorB
  MS_interaction <- df4$Interaction_SS[1] / df_interaction
  MS_residual <- SS_residual / df_residual
  
  # Calculate F statistics
  F_FactorA <- MS_FactorA / MS_residual
  F_FactorB <- MS_FactorB / MS_residual
  F_iteraction <- MS_interaction / MS_residual
  
  # Create an ANOVA table
  df6 <- data.frame(Source = c("FactorA", "FactorB","FactorA:FactorB", "Residuals", "Total"),
                    Df = c(df_FactorA, df_FactorB, df_interaction, df_residual, NA),
                    SS = c(round(df2$FactorA_SS[1],3),
                           round(df3$FactorB_SS[1],3),
                           round(df4$Interaction_SS[1],3),
                           round(SS_residual[1],3),
                           round(SS_total[1],3)),
                    MS = c(MS_FactorA, MS_FactorB, MS_interaction, MS_residual, NA),
                    F_stat = c(F_FactorA, F_FactorB, F_iteraction, NA, NA))
  
  # Return list
  return_list <- list(
    All_means = df,
    FactorA_SS = data.frame(df2),
    FactorB_SS = data.frame(df3),
    Interaction_SS = df4,
    Degrees_of_Freedom = df5,
    Anova_Table = df6
  )
  
  return(return_list)
}

Practice 1 (two-way ANOVA)

Generating data

# Set seed
set.seed(123)

# Factors
group <- factor(rep(c("G1", "G2", "G3"), each = 20))  # 3 levels of Factor A
sex <- factor(rep(rep(c("M", "F"), each = 10), times = 3))  # 2 levels of Factor B

# True means
group_effects <- c(G1 = 50, G2 = 55, G3 = 60)
sex_effects <- c(M = 0, F = 3)  # Females score 3 higher
interaction_effects <- matrix(c(0, 0, 0,   # G1:M, G2:M, G3:M
                                0, 2, -2),  # G1:F, G2:F, G3:F
                              nrow = 3, byrow = FALSE)  # group × sex

# Generate outcome
outcome <- numeric(length(group))
for (i in 1:length(group)) {
  g <- group[i]
  s <- sex[i]
  outcome[i] <- group_effects[g] + sex_effects[s] +
    interaction_effects[which(levels(group)==g), which(levels(sex)==s)] +
    rnorm(1, 0, 5)  # Add noise
}

# Data frame
dat <- data.frame(outcome, group, sex)
head(dat)

Running the Comprehensive_two_way_ANOVA() function

# Run the cumstom function
Comprehensive_two_way_ANOVA(dat$outcome, dat$group, dat$sex)
Comprehensive_two_way_ANOVA() Output
Screenshot 2025-04-20 at 3 50 42 AM

The main advantage of the custom function is that it shows you the descriptive statistics needed to produce the ANOVA output. It first start by returning a table of means. These means include the marginal means of FactorA and FactorB, these are the means for each level within a factor, and it includes cell means, which are the means created from all combination of levels from both FactorA and FactorB. Additionally we need to know the sample size. These means can easily be obtained by using the aggregate() function.

Next, we need to start calculating the sum of squares.

  1. Total sum of squares: This is the easiest one to calculate, take each outcome value and subtract it by the grand mean of the outcome. Afterwards, square each value and then take their sum.
  2. Factor A sum of squares: Take the marginal means of FactorA and subtract them by the grand mean. Next, square the value and multiply them by their respective sample size for that sub group. Afterwards sum of the values.
  3. Factor B sum of squares: Same as the FactorA SS
  4. Interaction sum of squares: This is by far the trickiest one. Start with the cell mean and subtract it by its corresponding marginal mean of FactorA, then subtract its corresponding marginal mean of FactorB, then sum the grand mean. Afterwards, square the values and multiply them by their corresponding sub group sample size. Lastly, sum up the values.
  5. Residual sum of squares: This is the easiest one, it is just the total sum of squares subtracted by the other three sum of squares.

Next, we need to calculate the degrees of freedom.

  1. FactorA df: Take the number of levels from FactorA and subtract it by 1
  2. FactorB df: Take the number of levels from FactorB and subtract it by 1
  3. A x B Interaction df: Multiply the number of levels of FactorA by the number of levels of Factor B
  4. Residual df: Take the number of observations and subtract it by the FactorA df, FactorB df, A x B Interaction df, and minus 1.

We can use the sums of squares and the degrees of freedom to calculate mean squares.

  1. FactorA Mean Squares: FactorA SS / FactorA df
  2. FactorB Mean Squares: FactorB SS / FactorB df
  3. A x B Interaction Mean Squares: A x B Interaction SS / A x B Interaction df
  4. Residual Mean Squares: Residual SS / Residual df

Lastly, we can use the mean squares to calculate the F statistics

  1. Factor A F Statistic: FactorA MS / Residual MS
  2. Factor B F Statistic: FactorB MS / Residual MS
  3. A x B Interaction F Statistic: A X B Interaction MS / Residual MS
⚠️ **GitHub.com Fallback** ⚠️