14. Principal Component Analysis (PCA) - raytonghk/genepiper GitHub Wiki

Principal Component Analysis (PCA) is one of the most popular exploratory multivariate analyses, probably because of its long history (Pearson 1901) and intuitive principle: matrix operations applied to the original data set of quantitative variables yield new synthetic variables, called principal components, that capture the most variability in the data. PCA is generally the first multivariate approach explained in data analysis manuals. However, this prominence may not always be justified in ecology. Because PCA preserves the Euclidean distance among objects, care should be taken when using it on a data set with many zeroes, as is often the case for microbiome data. As described in detail by Legendre & Gallagher (2001), PCA run on such data sets can generate severe artefacts such as the horseshoe effect (see Legendre & Legendre 2012; ter Braak & Smilauer 2015, for examples). With this artefact, objects at opposite ends of the environmental gradient appear close to each other in the ordination space (Novembre & Stephens 2008). Although the horseshoe effect can be partially reduced by applying a chord or Hellinger transformation to the original data values before running PCA (Legendre & Gallagher 2001), other multivariate approaches such as correspondence analysis (CA) or principal coordinates analysis (PCoA, also known as metric multidimensional scaling) have been progressively preferred over PCA.
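The Hellinger transformation mentioned above takes the square root of the relative abundances in each sample. The following is a minimal Python/NumPy sketch of that idea on a made-up count table; the function name `hellinger` and the toy data are illustrative assumptions, not GenePiper code.

```python
import numpy as np

def hellinger(counts):
    """Hellinger-transform a samples x taxa count matrix (rows = samples)."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Guard against empty samples so we never divide by zero.
    row_sums[row_sums == 0] = 1.0
    return np.sqrt(counts / row_sums)

# Hypothetical toy table: 3 samples x 4 taxa, with many zeroes.
toy_counts = np.array([
    [10, 0, 0, 30],
    [0, 5, 15, 0],
    [4, 4, 0, 0],
])
transformed = hellinger(toy_counts)
```

After the transformation, the Euclidean distance between two rows equals the Hellinger distance between the corresponding samples, which is why a PCA run on the transformed table is less sensitive to shared zeroes.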

Load Data and Subsample

Analysis starts by loading the data in the Load Data panel. Users select the project and the data label to load the saved data. After the data are loaded, subsampling can be done in the Filter panel if needed. See our tutorial 07. Subsetting Data for the usage of the Filter panel.

Parameters

Taxonomic Rank For Agglomeration: Users may specify the taxonomic rank for the analysis here. Taxa points are named according to the selected taxonomic rank.

Abundance Type: choose between Raw Count (original read counts), Rarefied Count (read counts randomly subsampled in each sample down to the lowest total count among samples) and Relative Abundance (counts divided by the sum of counts in each sample).
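As a rough illustration of the difference between the abundance types, the sketch below computes relative abundances and rarefies a toy table to the smallest sample depth. The helper names `relative_abundance` and `rarefy`, the seed, and the data are assumptions made for this example; GenePiper's own implementation may differ.

```python
import numpy as np

def relative_abundance(counts):
    """Divide each count by its sample total (assumes no empty samples)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def rarefy(counts, seed=0):
    """Randomly subsample every sample, without replacement, to the lowest depth."""
    counts = np.asarray(counts, dtype=int)
    rng = np.random.default_rng(seed)
    depth = counts.sum(axis=1).min()
    n_taxa = counts.shape[1]
    rarefied = np.zeros_like(counts)
    for i, row in enumerate(counts):
        reads = np.repeat(np.arange(n_taxa), row)  # one entry per read
        kept = rng.choice(reads, size=depth, replace=False)
        rarefied[i] = np.bincount(kept, minlength=n_taxa)
    return rarefied

# Hypothetical raw counts: 3 samples x 4 taxa with depths 40, 20 and 8.
raw = np.array([
    [10, 0, 0, 30],
    [0, 5, 15, 0],
    [4, 4, 0, 0],
])
rel = relative_abundance(raw)
rar = rarefy(raw)
```

Rarefied counts depend on the random draw, so fixing the seed is what makes the subsampling reproducible in this sketch.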

Scale Variables?: check this option to scale each variable to unit variance before the analysis.
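What scaling to unit variance does can be sketched as follows. This is an illustrative Python/NumPy example with a hypothetical helper `center_and_scale` and made-up values; it assumes the sample standard deviation (n - 1 denominator) is used.

```python
import numpy as np

def center_and_scale(table, scale=True):
    """Center each column; optionally divide by its sample standard deviation."""
    table = np.asarray(table, dtype=float)
    centered = table - table.mean(axis=0)
    if not scale:
        return centered
    sd = centered.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # constant columns stay at zero instead of becoming NaN
    return centered / sd

# Two hypothetical variables measured on very different scales.
toy = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 200.0],
    [4.0, 500.0],
])
unscaled = center_and_scale(toy, scale=False)
scaled = center_and_scale(toy, scale=True)
```

Without scaling, the second column would dominate the first principal component simply because its variance is much larger; after scaling, both columns contribute on an equal footing.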

Results

PCA results are usually displayed as a two- or three-dimensional scatter plot, where each axis corresponds to a chosen principal component, and each object is plotted based on its corresponding PC values.

Output tab provides the summary of the PCA result, and the resulting object can be downloaded as an RDS file by clicking the Download PCA button.

Sample tab provides the table of the ordination scores of samples, and the table can be downloaded as a tab-delimited table by clicking the Download button. By definition, the first PC axis of the PCA output represents the largest gradient of variability in the data set, the PC2 axis represents the second largest, and so forth, until all data set variability has been accounted for. Each object (sample) can thus be given a new set of coordinates in the principal components space, and the distribution of samples in that space reflects how similar the samples are in their variable values.
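To make the idea of sample scores concrete, here is a minimal sketch of PCA via singular value decomposition in Python/NumPy. The function `pca` and the toy table are assumptions for illustration only; the sign of each axis is arbitrary and may differ between implementations.

```python
import numpy as np

def pca(table):
    """Return sample scores, eigenvalues and axes (columns) of a centered PCA."""
    X = np.asarray(table, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                          # coordinates of samples on each PC
    eigenvalues = s ** 2 / (X.shape[0] - 1)  # variance carried by each PC
    return scores, eigenvalues, Vt.T

# Hypothetical table: 5 samples x 3 variables.
toy = np.array([
    [2.0, 0.5, 1.0],
    [4.0, 1.0, 0.0],
    [6.0, 2.5, 1.5],
    [8.0, 3.0, 0.5],
    [10.0, 4.5, 2.0],
])
scores, eigenvalues, axes = pca(toy)
```

When all components are kept, the Euclidean distance between two samples in the score table equals their distance in the original table, which is the sense in which PCA preserves Euclidean distance.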

Loading tab provides the table of the loadings of taxa, and the table can be downloaded as a tab-delimited table by clicking the Download button. The amount of variance accounted for by each principal component is given by its eigenvalue. Eigenvalues derived from a PCA are generally considered meaningful when they are larger than the average of all eigenvalues (Legendre & Legendre, 2012). The cumulative percentage of variance accounted for by the largest components indicates what proportion of the total variance is depicted by the ordination. The correlation between a synthetic variable (principal component) and an original variable (taxon) is referred to as the loading of that taxon on the given axis; high absolute loadings identify the taxa that contribute most to the variation in the data set.
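The quantities described above can be sketched in a few lines of Python/NumPy: the proportion of variance per component, the cumulative proportion, the larger-than-average rule for eigenvalues (often called the Kaiser-Guttman criterion), and the correlations between taxa and components. The function `summarize_pca` and its output keys are hypothetical names, and the sketch assumes no variable is constant.

```python
import numpy as np

def summarize_pca(table):
    """Eigenvalue summary and variable-to-PC correlations for a centered PCA."""
    X = np.asarray(table, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigenvalues = s ** 2 / (X.shape[0] - 1)
    explained = eigenvalues / eigenvalues.sum()
    sd = X.std(axis=0, ddof=1)  # assumed non-zero for every variable
    # corr(variable j, PC k) = v_jk * sqrt(eigenvalue_k) / sd_j
    correlations = Vt.T * np.sqrt(eigenvalues) / sd[:, None]
    return {
        "scores": U * s,
        "eigenvalues": eigenvalues,
        "explained": explained,
        "cumulative": np.cumsum(explained),
        "above_average": eigenvalues > eigenvalues.mean(),
        "correlations": correlations,
    }

# Hypothetical table: 6 samples x 3 taxa.
toy = np.array([
    [2.0, 0.5, 1.0],
    [4.0, 1.0, 0.0],
    [6.0, 2.5, 1.5],
    [8.0, 3.0, 0.5],
    [10.0, 4.5, 2.0],
    [12.0, 5.0, 3.0],
])
summary = summarize_pca(toy)
```

Rows of `correlations` are taxa and columns are components, so scanning a column for large absolute values points to the taxa that drive that axis.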

Permanova tab provides an additional statistical analysis, Permutational Multivariate Analysis of Variance (PERMANOVA; Anderson 2001), to test whether the selected grouping of samples is significant. Significance levels (P values) are obtained through permutation.

Group Column: select a character column from the sample data table for the grouping of samples.
Distance Method: select a distance/dissimilarity measure for the Permanova analysis:
- UniFrac (unweighted) - requires phylogenetic tree
- Weighted UniFrac - requires phylogenetic tree
- Bray-Curtis - commonly used for biological data
- Gower
- Jaccard
- Kulczynski
- Horn-Morisita
- Binomial
- Cao
- Chao

See the vegdist R documentation (Dissimilarity Indices for Community Ecologists) and Lozupone & Knight (2005) for details.
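The logic of a one-factor PERMANOVA can be sketched as follows: compute a pseudo-F ratio from the squared dissimilarities within and between groups, then compare it with the values obtained after shuffling the group labels. This Python/NumPy example uses Bray-Curtis dissimilarity on made-up counts; the function names, the seed and the number of permutations are assumptions, and this is not the vegan code that GenePiper calls.

```python
import numpy as np

def bray_curtis(table):
    """Pairwise Bray-Curtis dissimilarity (assumes no pair of empty samples)."""
    X = np.asarray(table, dtype=float)
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = np.abs(X[i] - X[j]).sum() / (X[i] + X[j]).sum()
    return D

def pseudo_f(D, groups):
    """Pseudo-F ratio of Anderson (2001) for a single grouping factor."""
    groups = np.asarray(groups)
    n, labels, sq = len(groups), np.unique(groups), D ** 2
    ss_total = sq[np.triu_indices(n, k=1)].sum() / n
    ss_within = 0.0
    for g in labels:
        idx = np.flatnonzero(groups == g)
        sub = sq[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_between = ss_total - ss_within
    return (ss_between / (len(labels) - 1)) / (ss_within / (n - len(labels)))

def permanova(D, groups, n_perm=999, seed=0):
    """Return the observed pseudo-F and its permutation P value."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    observed = pseudo_f(D, groups)
    # Small tolerance so that ties are not lost to floating-point rounding.
    hits = sum(
        pseudo_f(D, rng.permutation(groups)) >= observed * (1 - 1e-12)
        for _ in range(n_perm)
    )
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical counts: two clearly different groups of four samples each.
toy = np.array([
    [10, 8, 0, 0], [12, 9, 1, 0], [9, 11, 0, 1], [11, 10, 0, 0],
    [0, 1, 10, 9], [1, 0, 12, 8], [0, 0, 9, 11], [0, 1, 11, 10],
])
labels = ["A"] * 4 + ["B"] * 4
D = bray_curtis(toy)
f_stat, p_value = permanova(D, labels)
```

With only eight samples the number of distinct label arrangements is small, so the smallest attainable P value is limited by the design as well as by the number of permutations.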

2D Plot tab provides the ordination plot of PCA. User may download the plot by clicking the Download button, with options to specify the file name and dimension of the figure.

3D Plot tab provides a three-dimensional plot of the PCA. Users may explore the ordination space interactively by clicking and dragging. Hover over a data point to display its x, y, z coordinates. Tools located at the top right corner of the 3D plot let users download the plot as a PNG figure, zoom, pan, rotate (orbital or turntable), or reset the view.

Graphic Parameters

Graphic Parameters panel provides different options for the 2D Plot and 3D Plot when the corresponding tab of the result panel is selected.

With `2D Plot` tab selected:

Sample tab provides the options for the plotting and labelling of the sample points (as dots). User may select the symbol and label size, which applies instantly in the ordination.

Loading tab provides options for plotting and labelling of the taxa loadings. User may also customise the line width and label size. Note that the Label loading option is disabled when there are more than 1000 taxa, to avoid crashing the application.

Plot Axis tab provides options to select the plot axes. Users have to specify exactly two axes for the 2D plot. While it is typical to show PC1-vs-PC2 and/or PC2-vs-PC3 scatter plots, any two principal components can be chosen for visualisation.

Group tab provides options to display additional environmental features for the sample points. User may select a variable from the sample data table in the Group Column pull-down menu, which instantly assigns colours to the sample points and labels in the ordination. The Convex Hull, Spider and Ellipse functions overlay further information about the classification or grouping of sample points on the ordination. The convex hull of a set of points is the smallest convex polygon that encloses all of the points in the set; convex means that the polygon has no corner that bends inwards. Spider connects all points of a group to the group centroid, as in a spider web, and the group name is shown at the centroid of each web. User may customise the spider line width and label size. Ellipse adds ellipses showing the standard deviation or standard error area of each group at a user-specified confidence level. User may also customise the ellipse line width.
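The convex hull definition above can be illustrated with Andrew's monotone chain algorithm, a standard way to compute the hull of 2D points. The sketch below is plain Python with made-up coordinates standing in for the PC1/PC2 scores of one group; it is not the routine GenePiper uses for drawing.

```python
def convex_hull(points):
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when the turn o -> a -> b is counter-clockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()  # drop points that would make the polygon bend inwards
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

# Hypothetical scores of one sample group on two axes.
group_scores = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0.5)]
hull = convex_hull(group_scores)
```

Samples lying inside the polygon, such as the two interior points here, are not part of the hull outline.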

Envfit tab provides another option to overlay environmental information onto the ordination, an approach known as indirect gradient analysis (see our tutorial that provides an overview of the multivariate and ordination analyses). Any environmental feature (variable/column) in the sample data table can be fitted onto the coordinates of the selected axes via the envfit function from the vegan package. A text summary of the envfit results is shown and can be downloaded by clicking Download Envfit. User may select the Plot Envfit? option to add the fitted features to the plot, and should select the features of interest in the Factor tab. For numeric features, the fitted vectors are shown as arrows. The arrow points in the direction of most rapid change in the environmental variable, often called the direction of the gradient. The length of the arrow is proportional to the correlation between the ordination and the environmental variable, often called the strength of the gradient. For categorical features (character variables), the centroid of each factor level is plotted as a dot. User may customise the dot and label size, and the length and width of the arrows.
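The arrow for a numeric feature can be understood as the result of regressing that feature on the ordination axes. The sketch below, written in Python/NumPy in the spirit of vector fitting, returns the unit direction of the gradient, the squared correlation, a permutation P value and an arrow scaled by the correlation. The function `fit_vector`, the toy scores and the exactly linear toy variable are assumptions for illustration; vegan's envfit may differ in details.

```python
import numpy as np

def fit_vector(scores, env, n_perm=999, seed=0):
    """Fit one numeric variable onto ordination axes by least squares."""
    scores = np.asarray(scores, dtype=float)
    env = np.asarray(env, dtype=float)
    Xc = scores - scores.mean(axis=0)

    def regress(y):
        yc = y - y.mean()
        coef, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
        fitted = Xc @ coef
        return coef, (fitted @ fitted) / (yc @ yc)  # coefficients, r squared

    coef, r2 = regress(env)
    direction = coef / np.linalg.norm(coef)  # direction of the gradient
    rng = np.random.default_rng(seed)
    hits = sum(regress(rng.permutation(env))[1] >= r2 for _ in range(n_perm))
    p_value = (hits + 1) / (n_perm + 1)
    arrow = direction * np.sqrt(r2)          # length reflects the correlation
    return direction, r2, p_value, arrow

# Hypothetical sample scores on two axes and an exactly linear variable.
toy_scores = np.array([
    [-2.0, -1.0], [-1.0, 1.0], [0.0, 0.0], [1.0, -1.0],
    [2.0, 1.0], [0.0, 2.0], [0.0, -2.0],
])
toy_env = 3.0 * toy_scores[:, 0] + 4.0 * toy_scores[:, 1]
direction, r2, p_value, arrow = fit_vector(toy_scores, toy_env)
```

Because the toy variable is an exact linear function of the two axes, the fit is perfect here; real environmental variables usually give a smaller squared correlation and a shorter arrow.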

User may further customise the plot in the Title and Axis tabs before exporting the figure.

With `3D Plot` tab selected:

Sample tab provides the options for the plotting and labelling of the sample points (as dots), which applies instantly in the ordination.

Plot Axis tab provides options to select the plot axes. Users have to specify exactly three axes for the 3D plot.

Group tab provides options to display additional environmental features for the sample points. User may select a variable from the sample data table in the Group Column pull-down menu, which instantly assigns colours to the sample points and labels in the ordination.

References:

Anderson MJ (2001) A new method for non-parametric multivariate analysis of variance. Austral Ecology, 26: 32–46.
Becker RA, Chambers JM & Wilks AR (1988) The New S Language. Wadsworth & Brooks/Cole.
Excoffier L, Smouse PE & Quattro JM (1992) Analysis of molecular variance inferred from metric distances among DNA haplotypes: Application to human mitochondrial DNA restriction data. Genetics, 131: 479–491.
Legendre P & Anderson MJ (1999) Distance-based redundancy analysis: Testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69: 1–24.
Legendre P & Gallagher ED (2001) Ecologically meaningful transformations for ordination of species data. Oecologia, 129: 271–280.
Legendre P & Legendre LFJ (2012) Numerical ecology. Vol. 24. Elsevier.
Lozupone C & Knight R (2005) UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology, 71(12): 8228–8235.
Mardia KV, Kent JT & Bibby JM (1979) Multivariate Analysis. London: Academic Press.
McArdle BH & Anderson MJ (2001) Fitting multivariate models to community data: A comment on distance-based redundancy analysis. Ecology, 82: 290–297.
Novembre J & Stephens M (2008) Interpreting principal component analyses of spatial population genetic variation. Nature Genetics, 40: 646–649.
Oksanen J (2007) Multivariate analysis of ecological communities in R: vegan tutorial. Univ. of Oulu, Oulu.
Oksanen J (2015) Vegan: an introduction to ordination.
Paliy O & Shankar V (2016) Application of multivariate statistical techniques in microbial ecology. Molecular Ecology, 25(5): 1032–1057.
Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2: 559–572.
Ramette A (2007) Multivariate analyses in microbial ecology. FEMS Microbiology Ecology, 62(2): 142–160.
ter Braak CJF & Smilauer P (2015) Topics in constrained and unconstrained ordination. Plant Ecology, 216: 683–696.
Venables WN & Ripley BD (2002) Modern Applied Statistics with S. Springer-Verlag.
Warton DI, Wright TW & Wang Y (2012) Distance-based multivariate analyses confound location and dispersion effects. Methods in Ecology and Evolution, 3: 89–101.