Math Underlying a one‐way ANOVA - Private-Projects237/Statistics GitHub Wiki

Overview

Here we look at the math that underlies a one-way ANOVA. We also build a function that prints each part of the calculation step by step.

Underlying Math Behind a One-Way ANOVA

A one-way ANOVA takes the total variation in the outcome and splits it into the part explained by a single factor in the model and the part that is left over. To do this we need to calculate three types of sums of squares:

  1. Total Sums of Squares
  2. Sums of Squares for FactorA
  3. Residual (Error) Sums of Squares

Then we will need to calculate two types of degrees of freedom:

  1. Degrees of Freedom for FactorA
  2. Degrees of Freedom for the Residuals

Then we will use the sums of squares and degrees of freedom to calculate mean squares:

  1. Mean Squares for FactorA
  2. Mean Squares for the Residuals

Lastly, we will use the mean squares to calculate the F statistic:

  1. F for FactorA
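
The quantities listed above can be written compactly as follows. This is a summary sketch under assumed notation that the page does not define itself: a is the number of levels of FactorA, n is the total number of observations, n_j is the sample size of level j, y_ij is observation i in level j, \bar{y}_j is the mean of level j, and \bar{y} is the grand mean.

```latex
\begin{align*}
SS_{\text{total}}    &= \sum_{j=1}^{a} \sum_{i=1}^{n_j} \left( y_{ij} - \bar{y} \right)^2 \\
SS_{A}               &= \sum_{j=1}^{a} n_j \left( \bar{y}_j - \bar{y} \right)^2 \\
SS_{\text{residual}} &= SS_{\text{total}} - SS_{A}
                      = \sum_{j=1}^{a} \sum_{i=1}^{n_j} \left( y_{ij} - \bar{y}_j \right)^2 \\
df_{A}               &= a - 1, \qquad df_{\text{residual}} = n - a \\
MS_{A}               &= \frac{SS_{A}}{df_{A}}, \qquad
MS_{\text{residual}}  = \frac{SS_{\text{residual}}}{df_{\text{residual}}} \\
F_{A}                &= \frac{MS_{A}}{MS_{\text{residual}}}
\end{align*}
```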

Custom Function

The function below is long, but we will break it down to explain exactly how it calculates each component needed for the ANOVA results. It calculates everything except the p-value.
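
For completeness, the missing p-value can be obtained from an F statistic and its two degrees of freedom with base R's pf(), which returns probabilities of the F distribution. The sketch below is not part of the custom function; F_stat, df_factor and df_resid are made-up inputs chosen only for illustration.

```r
# Hypothetical inputs for illustration only (not results from this page)
F_stat    <- 27  # F statistic for the factor
df_factor <- 2   # numerator degrees of freedom (a - 1)
df_resid  <- 6   # denominator degrees of freedom (n - a)

# The p-value of the ANOVA F test is the upper-tail probability
# of the F distribution beyond the observed F statistic
p_value <- pf(F_stat, df1 = df_factor, df2 = df_resid, lower.tail = FALSE)
p_value
```

Setting lower.tail = FALSE asks for P(F > F_stat) directly, which avoids computing 1 - pf(...) by hand.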

library(dplyr)  # provides %>%, group_by(), reframe() and mutate(); reframe() needs dplyr >= 1.1.0

Comprehensive_one_way_ANOVA <- function(Outcome, FactorA){
  
  # Generate a dataset
  dat <- data.frame(Outcome, FactorA)
  
  # Calculate all the means needed for the ANOVA
  grand_mean = mean(Outcome)
  FactorA_means <- aggregate(Outcome~FactorA, dat, mean)
  
  # Calculate the sums of squares
  SS_total <- sum((Outcome - grand_mean)^2)
  
  # Calculate the FactorA sum of squares step by step
  df <- dat %>%
    group_by(FactorA) %>%
    reframe(n = length(FactorA),
            FactorA_mean = mean(Outcome), 
            grand_mean) %>%
    unique() %>%
    mutate(
      mean_diff = FactorA_mean - grand_mean,
      mean_diff_sq = mean_diff^2,
      mean_diff_sq_x_n = mean_diff_sq * n,
      FactorA_SS = sum(mean_diff_sq_x_n)
    )
  
  # Calculate residual sum of squares
  SS_residual = SS_total - df$FactorA_SS[1]
  
  # degrees of freedom
  df_FactorA <- nrow(FactorA_means) -1
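  # Residual df is n - a; count the groups actually present so that a
  # character (non-factor) FactorA or unused factor levels do not break it
  df_within <- nrow(dat) - nrow(FactorA_means)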
  df_total <- length(Outcome) - 1
  
  # Create a dataframe to show how degrees of freedom are calculated
  df2 <- data.frame(
    Source = c("FactorA", "Residual", "Total"),
    Equation = c("a-1","n-a","n-1"),
    df = c(df_FactorA, df_within, df_total)
  )
  
  # Calculate Mean Squares
  MS_FactorA <- df$FactorA_SS[1] / df_FactorA
  MS_residual <- SS_residual / df_within
  
  # Calculate F statistics
  F_FactorA <- MS_FactorA / MS_residual
  
  # Create an ANOVA table
  df3 <- data.frame(Source = c("FactorA", "Residuals", "Total"),
                    Df = c(df_FactorA, df_within, df_total),
                    SS = c(round(df$FactorA_SS[1],3),
                           round(SS_residual[1],3),
                           round(SS_total[1],3)),
                    MS = c(MS_FactorA, MS_residual, NA),
                    F_stat = c(F_FactorA, NA, NA))
  
  # Return list
  return_list <- list(
    All_means = df,
    Degrees_of_Freedom = data.frame(df2),
    Anova_Table = data.frame(df3)
  )
  
  return(return_list)
}

Practice 1 (one-way ANOVA)

Generating data

# set seed
set.seed(123)  

# Group setup
group_sizes <- c(10, 10, 10)  # 10 subjects per group
group_names <- c("A", "B", "C")
group_means <- c(10, 12, 14)
sd <- 2  # standard deviation (same for all groups)

# Create data
group <- rep(group_names, times = group_sizes)
score <- c(
  rnorm(group_sizes[1], mean = group_means[1], sd = sd),
  rnorm(group_sizes[2], mean = group_means[2], sd = sd),
  rnorm(group_sizes[3], mean = group_means[3], sd = sd)
)

# Create data frame
dat <- data.frame(
  subject = 1:sum(group_sizes),
  group = factor(group, levels = group_names),
  score = score
)

Running the Comprehensive_one_way_ANOVA() function

# Run the custom function
Comprehensive_one_way_ANOVA(dat$score, dat$group)
Comprehensive_one_way_ANOVA() Output
[Screenshot of the output returned by Comprehensive_one_way_ANOVA()]

The main advantage of the custom function is that it shows the descriptive statistics needed to produce the ANOVA output. It starts by returning a table of means. This table includes the mean of the outcome for each level of FactorA, the sample size of each level, and the grand mean. The level means can easily be obtained with the aggregate() function.
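
As a small illustration of how aggregate() produces these descriptive statistics, here is a sketch on a toy dataset. The data frame toy and its values are hypothetical and chosen so the arithmetic is easy to check by hand; they are not the simulated data used on this page.

```r
# Toy data (hypothetical): three groups with three observations each
toy <- data.frame(
  score = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  group = factor(rep(c("A", "B", "C"), each = 3))
)

# Mean of the outcome for each level of the factor
group_means <- aggregate(score ~ group, data = toy, FUN = mean)

# Sample size for each level of the factor
group_n <- aggregate(score ~ group, data = toy, FUN = length)

# Grand mean of the outcome, ignoring group membership
grand_mean <- mean(toy$score)

group_means  # level means: A = 2, B = 5, C = 8
group_n      # 3 observations per level
grand_mean   # 5
```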

Next, we need to calculate the sums of squares.

  1. Total sum of squares: Subtract the grand mean of the outcome from each outcome value, square each difference, and then sum the squares.
  2. Factor A sum of squares: Subtract the grand mean from the mean of each FactorA level, square each difference, and multiply it by the sample size of that level. Afterwards, sum the results.
  3. Residual sum of squares: This is the easiest one. It is the total sum of squares minus the Factor A sum of squares.
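
The three steps above can be sketched in base R as follows. The vectors score and group are a hypothetical toy dataset, not the simulated data from this page, and the variable names are illustrative.

```r
# Toy data (hypothetical): level means are 2, 5 and 8, grand mean is 5
score <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
group <- factor(rep(c("A", "B", "C"), each = 3))

grand_mean <- mean(score)
group_mean <- tapply(score, group, mean)    # mean of each level
group_n    <- tapply(score, group, length)  # sample size of each level

# 1. Total SS: squared deviations of every value from the grand mean
SS_total <- sum((score - grand_mean)^2)                  # 60

# 2. Factor SS: squared deviations of level means from the grand mean,
#    weighted by the sample size of each level
SS_factor <- sum(group_n * (group_mean - grand_mean)^2)  # 54

# 3. Residual SS: what is left after removing the factor SS
SS_residual <- SS_total - SS_factor                      # 6

# Cross-check: squared deviations of every value from its own level mean
# (ave() returns the level mean for each observation)
SS_residual_direct <- sum((score - ave(score, group))^2)
```

The cross-check works because the residual sum of squares is also the within-group variation, so both routes should give the same number.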

Next, we need to calculate the degrees of freedom.

  1. FactorA df: The number of levels in FactorA minus 1.
  2. Residual df: The number of observations minus the number of levels in FactorA.
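
A minimal sketch of the degrees-of-freedom arithmetic, again on a hypothetical grouping vector with three levels and nine observations:

```r
# Toy grouping variable (hypothetical): 3 levels, 9 observations
group <- factor(rep(c("A", "B", "C"), each = 3))

a <- length(unique(group))  # number of levels actually present
n <- length(group)          # number of observations

df_factor   <- a - 1  # 2
df_residual <- n - a  # 6
df_total    <- n - 1  # 8

# The factor and residual df always add up to the total df
df_factor + df_residual == df_total
```

Counting levels with length(unique(group)) is a deliberate choice here: levels() returns NULL for a character vector and also counts factor levels that have no observations, either of which would give the wrong residual df.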

We can use the sums of squares and the degrees of freedom to calculate mean squares.

  1. FactorA Mean Squares: FactorA SS / FactorA df
  2. Residual Mean Squares: Residual SS / Residual df
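
The mean squares are plain divisions, as the sketch below shows on a hypothetical toy dataset. It also illustrates one way to read the residual mean square: when all groups have the same size, it equals the average of the within-group variances.

```r
# Toy data (hypothetical, equal group sizes)
score <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
group <- factor(rep(c("A", "B", "C"), each = 3))

n <- length(score)
a <- length(unique(group))

# Sums of squares
SS_total    <- sum((score - mean(score))^2)
SS_residual <- sum((score - ave(score, group))^2)
SS_factor   <- SS_total - SS_residual

# Mean squares: each sum of squares divided by its degrees of freedom
MS_factor   <- SS_factor / (a - 1)    # 54 / 2 = 27
MS_residual <- SS_residual / (n - a)  # 6 / 6 = 1

# With equal group sizes, the residual MS is the mean of the group variances
mean_group_var <- mean(tapply(score, group, var))
```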

Lastly, we can use the mean squares to calculate the F statistic.

  1. Factor A F Statistic: FactorA MS / Residual MS
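
One way to sanity-check the hand calculation is to compare it with base R's aov(). The sketch below does this on a hypothetical toy dataset; the object names are illustrative and the data are not the simulated data from this page.

```r
# Toy data (hypothetical)
toy <- data.frame(
  score = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  group = factor(rep(c("A", "B", "C"), each = 3))
)

n <- nrow(toy)
a <- length(unique(toy$group))

# Hand calculation of the F statistic
SS_total    <- sum((toy$score - mean(toy$score))^2)
SS_residual <- sum((toy$score - ave(toy$score, toy$group))^2)
SS_factor   <- SS_total - SS_residual
F_manual    <- (SS_factor / (a - 1)) / (SS_residual / (n - a))

# Same model fitted with the built-in function
fit     <- aov(score ~ group, data = toy)
aov_tab <- summary(fit)[[1]]        # ANOVA table as a data frame
F_aov   <- aov_tab[["F value"]][1]  # first row is the factor

c(manual = F_manual, aov = F_aov)   # both are 27 for this toy data
```

The table returned by summary() also contains a "Pr(>F)" column, which is the p-value that the hand calculation leaves out.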