Math Underlying a two‐way ANOVA - Private-Projects237/Statistics GitHub Wiki
Here we will be taking a look at the math that underlies a two-way ANOVA. Additionally we will create a function that will be able to spit out step by step the calculation for each part.
We are essentially taking the variance of the outcome and then identifying what proportion of the outcome is explained by the factors in the model vs what is not. To do this we need to calculate five types of sums of squares:
- Total Sums of Squares
- Sums of Squares for FactorA
- Sums of Squares for FactorB
- Sums of Squares for FactorA x FactorB
- Residual (Error) Sums of Squares
Then we will need to calculate four types of degrees of freedom:
- Degrees of Freedom for FactorA
- Degrees of Freedom for FactorB
- Degrees of Freedom for the A x B Interaction
- Degrees of Freedom for the Residuals
Then we will use the Sum of Squares and Degrees of Freedom to Calculate Mean Squares
- Mean Squares for FactorA
- Mean Squares for FactorB
- Mean Squares for the A x B Interaction
- Mean Squares for the Residuals
Lastly we will use the Mean Squares to calculate the F statistic
- F for FactorA
- F for FactorB
- F for the A x B Interaction
The function below is pretty big but we will be breaking down its components to explain exactly how it is calculating each component needed for the ANOVA results. We can essentially calculate everything except the p-values.
Comprehensive_two_way_ANOVA <- function(Outcome,FactorA, FactorB){
# Generate a dataset
dat <- data.frame(Outcome, FactorA, FactorB)
# Calculate all the means needed for the ANOVA
grand_mean = mean(Outcome)
FactorA_means <- aggregate(Outcome~FactorA, dat, mean)
FactorB_means <- aggregate(Outcome~FactorB, dat, mean)
cell_means <- aggregate(Outcome~FactorA + FactorB, dat, mean)
# Combine the calculated means into one dataset plus sample size
df <- cell_means %>%
rename(cell_means = Outcome) %>%
mutate(grand_mean = grand_mean) %>%
left_join(FactorA_means %>% rename(FactorA_mean = Outcome), by = "FactorA") %>%
left_join(FactorB_means %>% rename(FactorB_mean = Outcome), by = "FactorB") %>%
left_join(
dat %>%
group_by(FactorA, FactorB) %>%
summarise(n = n(), .groups = "drop"),
by = c("FactorA", "FactorB")
)
# Calculate the sums of squares
SS_total <- sum((Outcome - grand_mean)^2)
# Calculate SS of squares individaually
df2 <- df %>%
group_by(FactorA) %>%
reframe(n = sum(n),
FactorA_mean,
grand_mean) %>%
unique %>%
mutate(
mean_diff = FactorA_mean - grand_mean,
mean_diff_sq = mean_diff^2,
mean_diff_sq_x_n = mean_diff_sq * n,
FactorA_SS = sum(mean_diff_sq_x_n)
)
df3 <- df %>%
group_by(FactorB) %>%
reframe(n = sum(n),
FactorB_mean,
grand_mean) %>%
unique %>%
mutate(
mean_diff = FactorB_mean - grand_mean,
mean_diff_sq = mean_diff^2,
mean_diff_sq_x_n = mean_diff_sq * n,
FactorB_SS = sum(mean_diff_sq_x_n)
)
df4 <- df %>%
mutate(mean_diff_plus_grand = cell_means - FactorA_mean - FactorB_mean + grand_mean,
mean_diff_plus_grand_sq = mean_diff_plus_grand^2,
mean_diff_plus_grand_sq_x_n = mean_diff_plus_grand_sq*n,
Interaction_SS = sum(mean_diff_plus_grand_sq_x_n))
# Calculate residual sum of squares
SS_residual = SS_total - df2$FactorA_SS[1] - df3$FactorB_SS[1] - df4$Interaction_SS[1]
# degrees of freedom
df_FactorA <- nrow(FactorA_means) -1
df_FactorB <- nrow(FactorB_means) -1
df_interaction <- (nrow(FactorA_means) -1) * (nrow(FactorB_means)-1)
df_residual <- nrow(dat) - (df_FactorA + df_FactorB + df_interaction + 1)
df_total <- length(Outcome) - 1
# Create a dataframe to show how degrees of freedom are calculated
df5 <- data.frame(
Source = c("FactorA", "FactorB", "FactorA:FactorB", "Residual", "Total"),
Equation = c("a-1","b-1","(a-1)(b-1)", "n-(a*b)","n-1"),
df = c(df_FactorA, df_FactorB, df_interaction, df_residual, df_total)
)
# Calculate Mean Squares
MS_FactorA<- df2$FactorA_SS[1] / df_FactorA
MS_FactorB <- df3$FactorB_SS[1] / df_FactorB
MS_interaction <- df4$Interaction_SS[1] / df_interaction
MS_residual <- SS_residual / df_residual
# Calculate F statistics
F_FactorA <- MS_FactorA / MS_residual
F_FactorB <- MS_FactorB / MS_residual
F_iteraction <- MS_interaction / MS_residual
# Create an ANOVA table
df6 <- data.frame(Source = c("FactorA", "FactorB","FactorA:FactorB", "Residuals", "Total"),
Df = c(df_FactorA, df_FactorB, df_interaction, df_residual, NA),
SS = c(round(df2$FactorA_SS[1],3),
round(df3$FactorB_SS[1],3),
round(df4$Interaction_SS[1],3),
round(SS_residual[1],3),
round(SS_total[1],3)),
MS = c(MS_FactorA, MS_FactorB, MS_interaction, MS_residual, NA),
F_stat = c(F_FactorA, F_FactorB, F_iteraction, NA, NA))
# Return list
return_list <- list(
All_means = df,
FactorA_SS = data.frame(df2),
FactorB_SS = data.frame(df3),
Interaction_SS = df4,
Degrees_of_Freedom = df5,
Anova_Table = df6
)
return(return_list)
}
# Set seed
set.seed(123)
# Factors
group <- factor(rep(c("G1", "G2", "G3"), each = 20)) # 3 levels of Factor A
sex <- factor(rep(rep(c("M", "F"), each = 10), times = 3)) # 2 levels of Factor B
# True means
group_effects <- c(G1 = 50, G2 = 55, G3 = 60)
sex_effects <- c(M = 0, F = 3) # Females score 3 higher
interaction_effects <- matrix(c(0, 0, 0, # G1:M, G2:M, G3:M
0, 2, -2), # G1:F, G2:F, G3:F
nrow = 3, byrow = FALSE) # group × sex
# Generate outcome
outcome <- numeric(length(group))
for (i in 1:length(group)) {
g <- group[i]
s <- sex[i]
outcome[i] <- group_effects[g] + sex_effects[s] +
interaction_effects[which(levels(group)==g), which(levels(sex)==s)] +
rnorm(1, 0, 5) # Add noise
}
# Data frame
dat <- data.frame(outcome, group, sex)
head(dat)
# Run the cumstom function
Comprehensive_two_way_ANOVA(dat$outcome, dat$group, dat$sex)
Comprehensive_two_way_ANOVA() Output |
---|
![]() |
The main advantage of the custom function is that it shows you the descriptive statistics needed to produce the ANOVA output. It first start by returning a table of means. These means include the marginal means of FactorA and FactorB, these are the means for each level within a factor, and it includes cell means, which are the means created from all combination of levels from both FactorA and FactorB. Additionally we need to know the sample size. These means can easily be obtained by using the aggregate()
function.
Next, we need to start calculating the sum of squares.
- Total sum of squares: This is the easiest one to calculate, take each outcome value and subtract it by the grand mean of the outcome. Afterwards, square each value and then take their sum.
- Factor A sum of squares: Take the marginal means of FactorA and subtract them by the grand mean. Next, square the value and multiply them by their respective sample size for that sub group. Afterwards sum of the values.
- Factor B sum of squares: Same as the FactorA SS
- Interaction sum of squares: This is by far the trickiest one. Start with the cell mean and subtract it by its corresponding marginal mean of FactorA, then subtract its corresponding marginal mean of FactorB, then sum the grand mean. Afterwards, square the values and multiply them by their corresponding sub group sample size. Lastly, sum up the values.
- Residual sum of squares: This is the easiest one, it is just the total sum of squares subtracted by the other three sum of squares.
Next, we need to calculate the degrees of freedom.
- FactorA df: Take the number of levels from FactorA and subtract it by 1
- FactorB df: Take the number of levels from FactorB and subtract it by 1
- A x B Interaction df: Multiply the number of levels of FactorA by the number of levels of Factor B
- Residual df: Take the number of observations and subtract it by the FactorA df, FactorB df, A x B Interaction df, and minus 1.
We can use the sums of squares and the degrees of freedom to calculate mean squares.
- FactorA Mean Squares: FactorA SS / FactorA df
- FactorB Mean Squares: FactorB SS / FactorB df
- A x B Interaction Mean Squares: A x B Interaction SS / A x B Interaction df
- Residual Mean Squares: Residual SS / Residual df
Lastly, we can use the mean squares to calculate the F statistics
- Factor A F Statistic: FactorA MS / Residual MS
- Factor B F Statistic: FactorB MS / Residual MS
- A x B Interaction F Statistic: A X B Interaction MS / Residual MS