ANOVA

Analysis of Variance:

Click for R-Script


Analysis of Variance (ANOVA) is a parametric statistical technique used to compare the means of different groups. It is similar in application to techniques such as the t-test and z-test, in that it compares means and the variation around them. However, ANOVA is best applied where more than two populations or samples are to be compared.
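For orientation, here is a minimal one-way ANOVA sketch in R. It uses the built-in PlantGrowth dataset rather than the data in the linked R-Script, so treat it as an illustration only.

```r
# One-way ANOVA: does mean plant weight differ across the three treatment groups?
data(PlantGrowth)

fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)   # reports the F statistic and p-value for the group effect
```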


Assumptions

The use of this parametric statistical technique involves certain key assumptions, including the following:

  1. Independence of cases: the observations of the dependent variable should be independent of one another, and the sample should be selected randomly. There should not be any pattern in the selection of the sample.

  2. Normality: the distribution of each group should be normal. The Kolmogorov-Smirnov or Shapiro-Wilk test may be used to confirm the normality of each group.

  3. Homogeneity: the variance within each group should be the same across groups. Levene’s test is used to test the homogeneity of variances between groups.

If the data satisfy the above assumptions, then analysis of variance (ANOVA) is an appropriate technique for comparing the means of two or more populations.
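A minimal sketch of checking these assumptions in R, again on the built-in PlantGrowth data; shapiro.test() is in base R, while leveneTest() comes from the car package, which is assumed to be installed.

```r
# Normality: Shapiro-Wilk test applied within each group
by(PlantGrowth$weight, PlantGrowth$group, shapiro.test)

# Homogeneity of variances: Levene's test from the car package
# install.packages("car")   # uncomment if the package is not installed
library(car)
leveneTest(weight ~ group, data = PlantGrowth)
```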

Steps for Calculation

  1. Calculate the sample means for each of our samples as well as the mean for all of the sample data.
  2. Calculate the sum of squares of error. Within each sample, square the deviation of each data point from its sample mean. The sum of all of these squared deviations is the sum of squares of error, abbreviated SSE.
  3. Calculate the sum of squares of treatment. We square the deviation of each sample mean from the overall mean and weight each squared deviation by the size of its sample. The sum of these weighted squared deviations is the sum of squares of treatment, abbreviated SST.
  4. Calculate the degrees of freedom. The overall number of degrees of freedom is one less than the total number of data points in our sample, or n - 1. The number of degrees of freedom of treatment is one less than the number of samples used, or m - 1. The number of degrees of freedom of error is the total number of data points, minus the number of samples, or n - m.
  5. Calculate the mean square of error. This is denoted MSE = SSE/(n - m).
  6. Calculate the mean square of treatment. This is denoted MST = SST/(m - 1).
  7. Calculate the F statistic. This is the ratio of the two mean squares we calculated: F = MST/MSE. A worked sketch in R follows this list.
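The steps above translate almost line for line into R. The sketch below works through them by hand on the PlantGrowth data and cross-checks the result against aov(); the variable names are illustrative, not taken from the linked R-Script.

```r
x <- PlantGrowth$weight   # all observations
g <- PlantGrowth$group    # group labels

n <- length(x)      # total number of data points
m <- nlevels(g)     # number of samples (groups)

# Step 1: sample means and overall mean
grand_mean  <- mean(x)
group_means <- tapply(x, g, mean)
group_sizes <- tapply(x, g, length)

# Step 2: sum of squares of error (within-group deviations)
SSE <- sum((x - group_means[g])^2)

# Step 3: sum of squares of treatment (between-group deviations,
# each weighted by its sample size)
SST <- sum(group_sizes * (group_means - grand_mean)^2)

# Steps 4-6: degrees of freedom and mean squares
MSE <- SSE / (n - m)
MST <- SST / (m - 1)

# Step 7: the F statistic
F_stat <- MST / MSE
F_stat

# Cross-check against R's built-in ANOVA table
summary(aov(x ~ g))
```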

Points to consider

Balanced experiments (those with an equal sample size for each treatment) are relatively easy to interpret; unbalanced experiments offer more complexity. For single-factor (one-way) ANOVA, the adjustment for unbalanced data is easy, but the unbalanced analysis lacks both robustness and power. The simplest techniques for handling unbalanced data restore balance by either discarding data or synthesizing the missing data. More complex techniques use regression.

ANOVA is (in part) a test of statistical significance. The American Psychological Association (and many other organizations) holds the view that simply reporting statistical significance is insufficient and that reporting confidence bounds is preferred.

While ANOVA is conservative (in maintaining a significance level) against multiple comparisons in one dimension, it is not conservative against comparisons in multiple dimensions.

A common mistake is to use ANOVA (or Kruskal–Wallis) to analyze ordered groups, e.g. a time sequence (changes over months), disease severity (mild, moderate, severe), or distance from a set point (10 km, 25 km, 50 km). Data in three or more ordered groups defined by the researcher should instead be analyzed by linear trend estimation.
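As a hedged illustration of that alternative, a linear trend across ordered groups can be tested in R either through the polynomial contrasts that ordered factors get by default, or by regressing on the numeric scores directly; the distances and responses below are made up for the example.

```r
# Illustrative data: a response measured at three ordered distances (km)
distance <- factor(rep(c(10, 25, 50), each = 5), ordered = TRUE)
response <- c(5.1, 4.8, 5.3, 5.0, 4.9,
              4.2, 4.5, 4.1, 4.4, 4.3,
              3.6, 3.9, 3.5, 3.8, 3.7)

# Ordered factors use polynomial contrasts by default, so the "distance.L"
# coefficient in the summary is the linear-trend component
summary(lm(response ~ distance))

# Equivalent idea: regress directly on the numeric distance scores
summary(lm(response ~ as.numeric(as.character(distance))))
```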

Click for R-Script