Principal Component Analysis in R - erynmcfarlane/StatsGenLabProtocols GitHub Wiki
A short PCA tutorial in R using the iris dataset
Intro to Factor Analysis and PCA.pdf
My main guide for conducting a PCA with the packages below is https://www.sthda.com/english/wiki/wiki.php?id_contents=7851. It goes into much more detail than I do and covers essentially everything here.
Additionally, this is a cool interactive visualization of how a dataset is transformed when running it through a PCA: https://setosa.io/ev/principal-component-analysis/. There is an example with two dimensions and also three dimensions (extra helpful because most brains short circuit at this point).
load packages
library(FactoMineR)
library(factoextra)
view dataset
View(iris)
We can see in this dataset that there are 4 quantitative outcome variables (sepal length, sepal width, petal length, petal width) and one qualitative variable (species). This is not a very high-dimensional dataset, which is the setting PCA is most often used for, but for the sake of simplicity we will use it as our example. Each quantitative outcome variable is a potential "dimension" in our PCA.
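You can confirm that breakdown in code with base R alone (nothing here is package-specific; iris ships with R):

```r
# Inspect the iris dataset: four numeric measurements plus one factor
str(iris)
sapply(iris, class)    # the four Sepal/Petal columns are numeric; Species is a factor
nlevels(iris$Species)  # three species: setosa, versicolor, virginica
```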
conduct the principal component analysis
iris_pca <- PCA(iris[,-5], scale.unit = TRUE, graph = FALSE)
Here, we conduct the analysis on the iris dataset while omitting the column with our qualitative outcome variable (species), because PCA can only operate on numeric values. However, this is still useful information! We can use it later when we visualize our PCA biplot.
Note: you can set the graph argument to TRUE if you want to see a biplot with just the variables (vectors), but this does not show your data points (individual observations).
iris_pca$eig
iris_pca$var
In the FactoMineR package, to view part of your PCA object, you need to further specify what you would like to see with a $[insert specific thing]. If you just run the name of your object, a handy summary listing the available results will print. I specified $eig to view the eigenvalues for the principal components and $var for the results for the outcome variables.
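As a sanity check on those eigenvalues, the same numbers can be reproduced with base R's prcomp (a sketch; prcomp stores the standard deviation of each PC, so the eigenvalues are the squared standard deviations):

```r
# Base-R equivalent of PCA(iris[,-5], scale.unit = TRUE)
base_pca <- prcomp(iris[, -5], scale. = TRUE)

# Eigenvalue of each PC = variance along that PC = sdev^2
eigenvalues <- base_pca$sdev^2
eigenvalues

# Percentage of variance explained, matching the second column of iris_pca$eig
100 * eigenvalues / sum(eigenvalues)
```

For a PCA on standardized variables, the eigenvalues sum to the number of variables (here, 4), which is a quick way to check that nothing went wrong.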
scree plot
fviz_eig(iris_pca, addlabels = TRUE)
A scree plot can be used to visualize the variation explained by successive principal components (% variation explained by PC1, % more variation explained by PC2 etc.). In my opinion, it's nicer to look at than the output from iris_pca$eig, while conveying the same message.
Traditionally, you retain all the principal components whose eigenvalues are >1 (the Kaiser criterion). You can also make this judgement call by looking for the "elbow" of the plot: find the PC (on the x-axis) at which the line levels off, and retain everything before that point. Those PCs explain most of the variation in the data while keeping everything parsimonious (i.e. it's good enough!). There have been debates about which method is better, but that is beyond my scope of knowledge.
Fun fact: a "scree" is a sloped mass of loose rocks at the base of a cliff, resembling how the line drops off in the plot.
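The eigenvalue > 1 rule is also easy to apply programmatically. A minimal sketch, again using base R's prcomp on the scaled iris variables:

```r
base_pca <- prcomp(iris[, -5], scale. = TRUE)
eigenvalues <- base_pca$sdev^2

# Kaiser criterion: retain the PCs whose eigenvalue exceeds 1
which(eigenvalues > 1)  # for iris, only PC1 clears the bar (PC2 falls just short)
```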
variable contributions
fviz_contrib(iris_pca, choice="var", axes = 1)
fviz_contrib(iris_pca, choice="var", axes = 2)
We can also visualize the contribution of each variable to each principal component. In other words, how much of the variance in the principal component is explained by each variable. If we are using factor analysis language, this is similar to factor loadings. We are only looking at contributions to PC1 and PC2 because those are the ones represented in our biplot, but you can look at the other PCs by setting the value in the axes argument to your PC of interest.
Again, this is a nicer way of looking at the output from iris_pca$var that also organizes the contributions from highest to lowest. This is helpful when you have lots of outcome variables.
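If you'd like to see where those percentages come from: for a PCA on standardized variables, a variable's percent contribution to a PC works out to its squared loading times 100 (the loadings in each column of prcomp's rotation matrix have unit length, so contributions within a PC sum to 100). A sketch you can check against iris_pca$var$contrib:

```r
base_pca <- prcomp(iris[, -5], scale. = TRUE)

# % contribution of each variable to each PC = squared loading x 100
contrib <- 100 * base_pca$rotation^2

round(contrib[, 1:2], 1)  # contributions to PC1 and PC2
colSums(contrib)          # each column sums to 100
```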
plot on a biplot
fviz_pca_biplot(iris_pca,
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")) +
ggtitle("Iris Dataset")
Here, we visualize the results of the PCA in a biplot. This biplot has only two axes, as the name would suggest, so the data points are represented along only two dimensions (i.e. two principal components). This is fine for our iris dataset, because when we look at our eigenvalues, 1-2 PCs are enough to explain most of the variation in our data. But for higher-dimensional datasets, two dimensions may not be enough for visualization, and some patterns in the data may be lost when viewed in only two dimensions.
Additionally, I added the argument col.var to colour-code the vectors (variables) in the plot to visually represent that information. Another potentially helpful argument is repel = TRUE, which offsets the numeric label of each data point so that labels don't overlap. However, with a lot of data points, this can get visually overwhelming. It's also not really necessary if you don't intend to draw conclusions about any one data point.
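If you don't have factoextra handy, base R's biplot gives a rougher version of the same picture (scores plus variable arrows) straight from a prcomp fit:

```r
base_pca <- prcomp(iris[, -5], scale. = TRUE)

# Quick-and-dirty biplot: observations as row numbers, variables as arrows
biplot(base_pca, main = "Iris Dataset")
```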
grouping in the biplot
fviz_pca_ind(iris_pca,
label="none",
habillage=iris$Species,
addEllipses=TRUE,
ellipse.level=0.95) +
ggtitle("Iris Dataset Grouped by Species")
Finally, our qualitative outcome variable can come in handy. We can group our data points in the biplot by some categorization to inform us of any group-specific patterns. In our case, this would be the species of each iris individual. This could also be something like treatment type in experimental manipulations (e.g. Treatment A, Treatment B, Control).
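A numeric complement to the grouped plot (a sketch, not part of the factoextra workflow): averaging the PC scores within each species shows how far apart the group centres sit along PC1 and PC2.

```r
base_pca <- prcomp(iris[, -5], scale. = TRUE)

# Mean PC1/PC2 score per species -- roughly the centres of the ellipses in the plot
scores <- data.frame(base_pca$x[, 1:2], Species = iris$Species)
aggregate(cbind(PC1, PC2) ~ Species, data = scores, FUN = mean)
```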