Math Underlying a one‐way ANOVA - Private-Projects237/Statistics GitHub Wiki
Here we look at the math that underlies a one-way ANOVA. We also create a function that prints each part of the calculation step by step.
We are essentially taking the total variability of the outcome and identifying what proportion of it is explained by a single factor in the model versus what is left over. To do this we need to calculate three types of sums of squares:
- Total Sums of Squares
- Sums of Squares for FactorA
- Residual (Error) Sums of Squares
Then we will need to calculate two types of degrees of freedom:
- Degrees of Freedom for FactorA
- Degrees of Freedom for the Residuals
Then we will use the sums of squares and degrees of freedom to calculate mean squares:
- Mean Squares for FactorA
- Mean Squares for the Residuals
Lastly, we will use the mean squares to calculate the F statistic:
- F for FactorA
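In symbols, the steps listed above can be written as follows. The notation is an assumption introduced here for illustration: $y_{ij}$ is observation $i$ in level $j$ of FactorA, $\bar{y}_{j}$ is the mean of level $j$, $\bar{y}$ is the grand mean, $n_j$ is the number of observations in level $j$, $a$ is the number of levels, and $n$ is the total number of observations.

```latex
\begin{align*}
SS_{\text{Total}}    &= \sum_{j=1}^{a} \sum_{i=1}^{n_j} \left( y_{ij} - \bar{y} \right)^2 \\
SS_{\text{FactorA}}  &= \sum_{j=1}^{a} n_j \left( \bar{y}_{j} - \bar{y} \right)^2 \\
SS_{\text{Residual}} &= SS_{\text{Total}} - SS_{\text{FactorA}}
                      = \sum_{j=1}^{a} \sum_{i=1}^{n_j} \left( y_{ij} - \bar{y}_{j} \right)^2 \\
df_{\text{FactorA}}  &= a - 1, \qquad df_{\text{Residual}} = n - a \\
MS_{\text{FactorA}}  &= \frac{SS_{\text{FactorA}}}{df_{\text{FactorA}}}, \qquad
MS_{\text{Residual}}  = \frac{SS_{\text{Residual}}}{df_{\text{Residual}}} \\
F_{\text{FactorA}}   &= \frac{MS_{\text{FactorA}}}{MS_{\text{Residual}}}
\end{align*}
```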
The function below is pretty big but we will be breaking down its components to explain exactly how it is calculating each component needed for the ANOVA results. We can essentially calculate everything except the p-values.
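# The function uses %>%, group_by(), reframe() and mutate() from dplyr (reframe() needs dplyr >= 1.1.0)
library(dplyr)
Comprehensive_one_way_ANOVA <- function(Outcome, FactorA){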
# Generate a dataset
dat <- data.frame(Outcome, FactorA)
# Calculate all the means needed for the ANOVA
grand_mean = mean(Outcome)
FactorA_means <- aggregate(Outcome~FactorA, dat, mean)
# Calculate the sums of squares
SS_total <- sum((Outcome - grand_mean)^2)
# Calculate the FactorA sum of squares step by step (one row per level)
df <- dat %>%
group_by(FactorA) %>%
reframe(n = length(FactorA),
FactorA_mean = mean(Outcome),
grand_mean) %>%
unique() %>%
mutate(
mean_diff = FactorA_mean - grand_mean,
mean_diff_sq = mean_diff^2,
mean_diff_sq_x_n = mean_diff_sq * n,
FactorA_SS = sum(mean_diff_sq_x_n)
)
# Calculate residual sum of squares
SS_residual = SS_total - df$FactorA_SS[1]
# degrees of freedom
df_FactorA <- nrow(FactorA_means) -1
# Count the levels actually present; levels() returns NULL for a character vector
df_within <- nrow(dat) - nrow(FactorA_means)
df_total <- length(Outcome) - 1
# Create a dataframe to show how degrees of freedom are calculated
df2 <- data.frame(
Source = c("FactorA", "Residual", "Total"),
Equation = c("a-1","n-a","n-1"),
df = c(df_FactorA, df_within, df_total)
)
# Calculate Mean Squares
MS_FactorA <- df$FactorA_SS[1] / df_FactorA
MS_residual <- SS_residual / df_within
# Calculate F statistics
F_FactorA <- MS_FactorA / MS_residual
# Create an ANOVA table
df3 <- data.frame(Source = c("FactorA", "Residuals", "Total"),
Df = c(df_FactorA, df_within, df_total),
SS = c(round(df$FactorA_SS[1],3),
round(SS_residual[1],3),
round(SS_total[1],3)),
MS = c(MS_FactorA, MS_residual, NA),
F_stat = c(F_FactorA, NA, NA))
# Return list
return_list <- list(
All_means = df,
Degrees_of_Freedom = data.frame(df2),
Anova_Table = data.frame(df3)
)
return(return_list)
}
# set seed
set.seed(123)
# Group setup
group_sizes <- c(10, 10, 10) # 10 subjects per group
group_names <- c("A", "B", "C")
group_means <- c(10, 12, 14)
sd <- 2 # standard deviation (same for all groups)
# Create data
group <- rep(group_names, times = group_sizes)
score <- c(
rnorm(group_sizes[1], mean = group_means[1], sd = sd),
rnorm(group_sizes[2], mean = group_means[2], sd = sd),
rnorm(group_sizes[3], mean = group_means[3], sd = sd)
)
# Create data frame
dat <- data.frame(
subject = 1:sum(group_sizes),
group = factor(group, levels = group_names),
score = score
)
# Run the custom function
Comprehensive_one_way_ANOVA(dat$score, dat$group)
Figure: Comprehensive_one_way_ANOVA() output
The main advantage of the custom function is that it shows the descriptive statistics needed to produce the ANOVA output. It starts by returning a table of means. These include the grand mean and the marginal means of FactorA, which are the means of the outcome within each level of that factor. We also need the sample size of each level. The means can easily be obtained with the aggregate() function.
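As a minimal sketch of that first step, the snippet below computes the grand mean, the per-level means, and the per-level sample sizes with aggregate(). The data frame `toy` and its values are hypothetical, chosen only so the results are easy to check by hand.

```r
# Hypothetical toy data: three levels with three observations each
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 3)),
  score = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)

# Grand mean of the outcome
grand_mean <- mean(toy$score)

# Mean of the outcome within each level of the factor
group_means <- aggregate(score ~ group, data = toy, FUN = mean)

# Number of observations within each level
group_n <- aggregate(score ~ group, data = toy, FUN = length)

print(grand_mean)
print(group_means)
print(group_n)
```

In both aggregate() results the summary value is stored in a column that keeps the outcome's name (`score`), so `group_n$score` holds the counts rather than scores.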
Next, we need to start calculating the sum of squares.
- Total sum of squares: Subtract the grand mean of the outcome from each outcome value, square each difference, and sum the squares.
- Factor A sum of squares: Subtract the grand mean from each marginal mean of FactorA, square each difference, and multiply it by the sample size of that level. Then sum the resulting values.
- Residual sum of squares: Once the other two are known, this is the simplest one: it is the total sum of squares minus the Factor A sum of squares.
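The three sums of squares described above can be sketched in base R as follows. The vectors `toy_score` and `toy_group` are hypothetical data, not the simulated data from this page. The direct residual calculation at the end is an extra check added here, not something the custom function does.

```r
# Hypothetical toy data: three levels with three observations each
toy_score <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
toy_group <- factor(rep(c("A", "B", "C"), each = 3))

grand_mean  <- mean(toy_score)
group_means <- tapply(toy_score, toy_group, mean)    # one mean per level
group_n     <- tapply(toy_score, toy_group, length)  # one sample size per level

# Total SS: squared deviations of every observation from the grand mean
SS_total <- sum((toy_score - grand_mean)^2)

# Factor SS: squared deviations of level means from the grand mean,
# weighted by the sample size of each level
SS_factor <- sum(group_n * (group_means - grand_mean)^2)

# Residual SS by subtraction
SS_residual <- SS_total - SS_factor

# Check: squared deviations of every observation from its own level mean
# (ave() returns the level mean for each observation)
SS_residual_direct <- sum((toy_score - ave(toy_score, toy_group))^2)

print(c(total = SS_total, factor = SS_factor, residual = SS_residual))
```

For these toy values the total is 60, the factor part is 54, and the residual is 6, and the subtraction and the direct calculation agree.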
Next, we need to calculate the degrees of freedom.
- FactorA df: Take the number of levels of FactorA and subtract 1.
- Residual df: Take the number of observations and subtract the number of levels of FactorA.
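A short sketch of the degrees of freedom, again on hypothetical toy data. Counting levels with `length(unique(...))` is a choice made for this example so that it also works when the grouping variable is a character vector or a factor with unused levels.

```r
# Hypothetical toy data: three levels with three observations each
toy_score <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
toy_group <- rep(c("A", "B", "C"), each = 3)  # character vector on purpose

n_obs    <- length(toy_score)         # n: number of observations
n_levels <- length(unique(toy_group)) # a: number of levels present in the data

df_factor   <- n_levels - 1           # a - 1
df_residual <- n_obs - n_levels       # n - a
df_total    <- n_obs - 1              # n - 1

# The factor and residual df always add up to the total df
stopifnot(df_factor + df_residual == df_total)

print(c(factor = df_factor, residual = df_residual, total = df_total))
```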
We can use the sums of squares and the degrees of freedom to calculate mean squares.
- FactorA Mean Squares: FactorA SS / FactorA df
- Residual Mean Squares: Residual SS / Residual df
Lastly, we can use the mean squares to calculate the F statistic.
- Factor A F Statistic: FactorA MS / Residual MS
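The last two steps can be sketched as below and cross-checked against R's built-in aov(). The data are hypothetical. The p-value line goes one step beyond the custom function on this page: it assumes the usual approach of taking the upper tail of the F distribution with pf().

```r
# Hypothetical toy data: three levels with three observations each
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 3)),
  score = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)

grand_mean  <- mean(toy$score)
level_means <- ave(toy$score, toy$group)  # level mean repeated for each observation

# Sums of squares (summing over observations applies the n-per-level weighting)
SS_factor   <- sum((level_means - grand_mean)^2)
SS_residual <- sum((toy$score - level_means)^2)

# Degrees of freedom
df_factor   <- nlevels(toy$group) - 1
df_residual <- nrow(toy) - nlevels(toy$group)

# Mean squares: each sum of squares divided by its degrees of freedom
MS_factor   <- SS_factor / df_factor
MS_residual <- SS_residual / df_residual

# F statistic: ratio of the two mean squares
F_stat <- MS_factor / MS_residual

# Assumption beyond the source: p-value from the upper tail of the F distribution
p_value <- pf(F_stat, df_factor, df_residual, lower.tail = FALSE)

# Cross-check against the built-in one-way ANOVA
aov_table <- summary(aov(score ~ group, data = toy))[[1]]
stopifnot(isTRUE(all.equal(F_stat,  aov_table[["F value"]][1])))
stopifnot(isTRUE(all.equal(p_value, aov_table[["Pr(>F)"]][1])))

print(c(MS_factor = MS_factor, MS_residual = MS_residual, F = F_stat, p = p_value))
```

If the hand calculation and aov() ever disagree, the stopifnot() calls raise an error, which makes this a convenient sanity check when adapting the steps to other data.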