Diversity analyses - uic-ric/uic-ric.github.io GitHub Wiki

Diversity analyses come in two main modes.

  • Alpha diversity - Measure of diversity within a sample. For example, are there many features that are evenly distributed, or are there only one or two main features?
  • Beta diversity - Measure of dissimilarity between samples. How similar or dissimilar are two given samples based on the features present and their relative abundances?

Alpha diversity - Diversity within a sample.

Alpha diversity analyses compute a diversity index for each sample. The computed diversity indices are then compared with any experimental factors/covariates to determine if there is a significant effect/difference in the diversity (feature complexity of the sample) associated with the experimental design.

Overview of results

Typically the following files are provided for an alpha diversity analysis performed by RIC.

  • HTML report - Contains the results of the analyses including the following key items.
    • Description of the methods performed, e.g. diversity index computed and statistical tests.
    • Results from any statistical tests. Typically we perform the following.
      • Modelling of the computed diversity indices as a function of all factors using a generalized linear model (GLM).
      • Group and pairwise non-parametric tests using Kruskal-Wallis and Mann-Whitney/Wilcoxon tests.
    • Dot/box plots of the diversity indices plotted as a function of the experimental factors/covariates
    • List of other files in the results
  • PDF plot file - Contains a copy of all plots in the report.
  • _values.txt - Tab separated values (tsv) files with the computed diversity indices for each of the samples.
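
As a rough illustration of the group test mentioned above, the following sketch computes a Kruskal-Wallis H statistic on per-sample diversity indices and estimates a p value by permuting the group labels. The data are made up, the function names are hypothetical, the tie correction factor is omitted, and the permutation p value is a stand-in for the chi-square approximation that a statistics package would normally use. It is not the code used to produce the RIC reports.

```python
import random

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_h(values, labels):
    """Kruskal-Wallis H statistic (tie correction factor omitted)."""
    n = len(values)
    ranks = average_ranks(values)
    rank_sums, sizes = {}, {}
    for r, g in zip(ranks, labels):
        rank_sums[g] = rank_sums.get(g, 0.0) + r
        sizes[g] = sizes.get(g, 0) + 1
    total = sum(rank_sums[g] ** 2 / sizes[g] for g in rank_sums)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

def permutation_p(values, labels, n_perm=999, seed=1):
    """Observed H plus the fraction of label shuffles with H at least as large."""
    rng = random.Random(seed)
    observed = kruskal_h(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if kruskal_h(values, shuffled) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical Shannon indices for three groups of four samples each.
values = [1.1, 1.3, 1.2, 1.4, 2.1, 2.3, 2.2, 2.4, 3.1, 3.3, 3.2, 3.4]
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
h_stat, p_value = permutation_p(values, labels)
print(f"H = {h_stat:.3f}, permutation p = {p_value:.3f}")
```

Because the test works on ranks rather than the raw index values, it does not assume the diversity indices are normally distributed.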

Diversity indices

A number of different diversity indices can be used to compute a quantitative diversity measure. Each method expresses or weights different aspects of the diversity of a sample. While some indices, e.g. the Shannon index, are very good at computing a usable diversity index in most situations, other methods may be more applicable based upon the specific goals or hypotheses of your project.

Index                            Description
Shannon                          Also known as Shannon's entropy. Value increases with both richness and evenness.
Simpson                          Based on the probability that two entities taken at random from the dataset of interest represent the same type. When expressed as one minus this probability, the value increases with both richness and evenness.
Inverse Simpson                  Reciprocal of the probability that two entities taken at random represent the same type.
Fisher's Alpha                   Estimated alpha parameter of Fisher's logarithmic series.
Richness                         Number of species (unique entities/features) in a sample.
Pielou's evenness                Measure of the evenness within a sample.
Faith's Phylogenetic Diversity   Sum of the total phylogenetic branch lengths for features in a sample. Requires a phylogenetic tree.
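
To make the indices above more concrete, here is a minimal sketch that computes several of them from the raw feature counts of one sample, using the standard textbook formulas. The function name and the example counts are hypothetical. The Simpson value is assumed to be the "one minus the probability" form, which is the form that increases with richness and evenness. Fisher's alpha and Faith's Phylogenetic Diversity are left out because they need an iterative fit and a phylogenetic tree, respectively.

```python
import math

def alpha_diversity(counts):
    """Compute several alpha diversity indices from raw feature counts of one sample."""
    counts = [c for c in counts if c > 0]  # absent features do not contribute
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no observed features")
    props = [c / total for c in counts]    # relative abundances p_i

    richness = len(counts)                          # number of observed features, S
    shannon = -sum(p * math.log(p) for p in props)  # H' = -sum(p_i * ln p_i)
    same_type = sum(p * p for p in props)           # chance two random draws match
    # Pielou's evenness J = H' / ln(S); reported as 0.0 when only one feature is present.
    pielou = shannon / math.log(richness) if richness > 1 else 0.0
    return {
        "richness": richness,
        "shannon": shannon,
        "simpson": 1.0 - same_type,          # assumed "1 - D" form
        "inverse_simpson": 1.0 / same_type,
        "pielou": pielou,
    }

# Two hypothetical samples with the same richness but different evenness.
even = alpha_diversity([25, 25, 25, 25])
skewed = alpha_diversity([97, 1, 1, 1])
for name, result in (("even", even), ("skewed", skewed)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

Both samples contain four features, so richness is identical, but the Shannon, Simpson and Pielou values are all lower for the sample dominated by a single feature.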

Beta diversity - Dissimilarity between samples.

Overview of results

Typically the following files are provided for a beta diversity analysis performed by RIC.

  • HTML report - Contains the results of the analyses including the following key items.
    • Description of the methods performed, e.g. diversity metric computed and statistical tests.
    • Results from any statistical tests. Typically we perform the following. See Statistical tests for more information about these tests.
      • ADONIS/PERMANOVA - Models the computed dissimilarity indices as a function of the experimental factors/covariates.
      • ANOSIM - Non-parametric test of dissimilarity comparing two or more defined groups. This test is used for both pairwise comparisons as well as general group tests.
    • Ordination plots using either Non-metric Multidimensional Scaling (NMDS) or Principal Coordinate Analysis (PCoA) to help simplify the high-dimensional data from the computed dissimilarity matrix. See Visualizations for more information about these techniques.
    • List of other files in the results
  • PDF plot file - Contains a copy of all plots in the report.
  • _values.txt - Tab separated values (tsv) file with the computed dissimilarity indices for the samples. This file contains a symmetrical matrix, a.k.a. distance matrix, of the values with sample IDs on the columns and rows; each intersecting cell holds the computed dissimilarity index between the two samples.
  • _plotdata.txt - Tab separated values (tsv) file with plot coordinates for the ordination plots.
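
The symmetric matrix described for the _values.txt file can be sketched as follows. Bray-Curtis is used here only as an example metric, since this page does not say which metric a given analysis uses. The sample IDs, counts, function names and exact tsv layout are all hypothetical.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: 0 for identical samples, 1 for no shared features."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(a) + sum(b)
    return numerator / denominator if denominator else 0.0

def dissimilarity_matrix(samples):
    """Build a symmetric matrix (nested dicts) keyed by sample ID."""
    ids = sorted(samples)
    matrix = {i: {j: bray_curtis(samples[i], samples[j]) for j in ids} for i in ids}
    return ids, matrix

def to_tsv(ids, matrix):
    """Render the matrix with sample IDs on both the header row and first column."""
    lines = ["\t".join([""] + ids)]
    for i in ids:
        lines.append("\t".join([i] + [f"{matrix[i][j]:.3f}" for j in ids]))
    return "\n".join(lines)

# Hypothetical feature counts for three samples (same feature order in each list).
samples = {"S1": [10, 0, 5], "S2": [8, 2, 5], "S3": [0, 12, 3]}
ids, matrix = dissimilarity_matrix(samples)
print(to_tsv(ids, matrix))
```

The diagonal is always 0 (a sample compared with itself) and the value for S1 vs. S2 equals the value for S2 vs. S1, which is why only one triangle of the matrix carries independent information.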

Details about the various diversity metrics available can be found at the following sites.

Statistical tests

Due to the nature of the dissimilarity matrix, i.e. values are computed between pairs of samples rather than as one value per sample, the following statistical tests are typically used to compare the computed dissimilarity indices with experimental factors/covariates.

PERMANOVA/ADONIS test

The PERMANOVA (Permutational Multivariate Analysis of Variance Using Distance Matrices) test, called ADONIS in the vegan R package, is a method for partitioning dissimilarity/distance matrices among sources of variation and fitting linear models to dissimilarity/distance matrices. It uses a permutation test to assess significance. The key features of this test are that it can...

  • Utilize a multi-factor model to adjust for the effects of one factor while testing for the effect of another factor.
  • Assess possible interactions between different factors. More information about interaction terms can be found at Interaction terms.
  • Test both categorical/discrete factors, e.g. Group A vs. B vs. C, as well as continuous covariates, e.g. age, BMI, weight.
  • Compute an R2 that expresses the fraction of dissimilarity that can be explained by a particular factor or term. This can be helpful to assess the relative effect sizes of different factors/covariates.
  • Compute a p value, shown as Pr(>F), using a permutation test. It estimates how often an F statistic at least as large as the observed one would arise if the factor had no effect, i.e. if the true R2 were 0.

The following is an example of output from the PERMANOVA/ADONIS test.

Factor Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
Group 3 0.521 0.174 2.511 0.127 0.001
Residuals 52 3.594 0.069 NA 0.873 NA
Total 55 4.115 NA NA 1.000 NA

In this example a single factor (Group) was tested and found to have an R2 of 0.127 that was significant, p=0.001. What counts as a "large" or "small" R2 varies with the particular system, e.g. model system, human samples or environmental data. In this particular case, these were clinical data and this could be considered a "strong" R2 value. The table also contains an item labeled "Residuals"; its R2 (0.873 here) accounts for all dissimilarity not explained by the factors/covariates given.
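
The columns of the table are linked by simple arithmetic: R2 for Group is 0.521 / 4.115, or about 0.127, and F.Model is (0.521 / 3) / (3.594 / 52), or about 2.51. The sketch below shows how those quantities and the permutation p value can be derived from a dissimilarity matrix for the single-factor case. It is an illustration with made-up data and hypothetical function names, not the vegan implementation, and it does not cover multi-factor models, interactions or continuous covariates.

```python
import random

def sums_of_squares(dist, labels):
    """Return (SS_total, SS_within, number of groups) from squared dissimilarities."""
    n = len(labels)
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    groups = {}
    for idx, g in enumerate(labels):
        groups.setdefault(g, []).append(idx)
    ss_within = 0.0
    for members in groups.values():
        pair_sum = sum(dist[i][j] ** 2
                       for pos, i in enumerate(members) for j in members[pos + 1:])
        ss_within += pair_sum / len(members)
    return ss_total, ss_within, len(groups)

def pseudo_f(dist, labels):
    """Pseudo-F statistic and R2 for one categorical factor."""
    n = len(labels)
    ss_total, ss_within, n_groups = sums_of_squares(dist, labels)
    ss_between = ss_total - ss_within
    f_stat = (ss_between / (n_groups - 1)) / (ss_within / (n - n_groups))
    return f_stat, ss_between / ss_total

def permanova(dist, labels, n_perm=999, seed=1):
    """Observed F and R2, plus the fraction of label shuffles with F at least as large."""
    rng = random.Random(seed)
    f_obs, r2 = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled)[0] >= f_obs - 1e-12:
            hits += 1
    return f_obs, r2, (hits + 1) / (n_perm + 1)

# Hypothetical data: ten samples placed on a line, two well separated groups.
positions = [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]
dist = [[abs(a - b) for b in positions] for a in positions]
labels = ["A"] * 5 + ["B"] * 5
f_stat, r2, p_value = permanova(dist, labels)
print(f"F = {f_stat:.2f}, R2 = {r2:.3f}, Pr(>F) = {p_value:.3f}")
```

Because the p value comes from shuffling labels, its smallest possible value is 1 / (n_perm + 1), which is why a result such as p=0.001 is common when 999 permutations are used.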

ANOSIM test

The Analysis of similarities (ANOSIM) test is a non-parametric test of group differences in a dissimilarity/distance matrix. Unlike the PERMANOVA/ADONIS test, only a single categorical/discrete factor, i.e. set of groups, can be tested. The output from the ANOSIM test is an R value and a p value.

  • The R value provides a sense of the degree of separation between the groups, where 0 or less indicates the groups overlap completely and 1 indicates the groups are perfectly separated.
  • The p value is computed using a permutation test and estimates how often an R at least as large as the observed one would arise if there were no separation between the groups, i.e. if the true R were 0.

In the beta diversity analysis results generated by RIC, we use the ANOSIM test for basic group tests of any categorical/discrete factors as well as pairwise tests between the different groups or combinations of factors.
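
As a sketch of how the R and p values described above can be obtained, the code below ranks all pairwise dissimilarities, compares the mean rank of between-group pairs with that of within-group pairs, and then shuffles the group labels to estimate a p value. The data and function names are hypothetical, and this is a simplified illustration rather than the implementation used for the RIC reports.

```python
import random

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # mean of the 1-based ranks i+1..j+1
        i = j + 1
    return ranks

def anosim_r(dist, labels):
    """R = (mean between-group rank - mean within-group rank) / (number of pairs / 2)."""
    n = len(labels)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if labels[i] == labels[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if labels[i] != labels[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (len(pairs) / 2)

def anosim(dist, labels, n_perm=999, seed=1):
    """Observed R plus the fraction of label shuffles with R at least as large."""
    rng = random.Random(seed)
    r_obs = anosim_r(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if anosim_r(dist, shuffled) >= r_obs - 1e-12:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Hypothetical data: ten samples on a line, two groups with no overlap.
positions = [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]
dist = [[abs(a - b) for b in positions] for a in positions]
labels = ["A"] * 5 + ["B"] * 5
r_value, p_value = anosim(dist, labels)
print(f"R = {r_value:.3f}, p = {p_value:.3f}")
```

In this toy example every between-group dissimilarity is larger than every within-group dissimilarity, so R reaches its maximum of 1. Real data rarely separate this cleanly.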

The following is an example set of pairwise ANOSIM comparisons with two different factors, injury status (Naive vs. Sham vs. Injury) and diet (inulin vs. control).

GroupA GroupB R p.value
Naive.control Naive.inulin -0.019 0.604
Naive.control Sham.control 0.130 0.022
Naive.control Injury.control -0.018 0.569
Naive.inulin Sham.inulin -0.014 0.487
Naive.inulin Injury.inulin 4.20e-4 0.380
Sham.control Sham.inulin -0.054 0.949
Sham.control Injury.control 0.060 0.091
Sham.inulin Injury.inulin 0.013 0.300
Injury.control Injury.inulin 0.004 0.373

Similar to the R2 in the PERMANOVA/ADONIS test, what constitutes a "large" or "small" R depends on the particular system, e.g. model system, human samples or environmental data. In this particular case, the data were from a study of a host-associated microbiome in a model system, and an R=0.130 could be considered a "moderate" R value. If these were environmental samples, one would expect a higher degree of dissimilarity between samples due to more variation in an environmental setting, and an R=0.130 would not be considered nearly as strong.

Visualizations

Non-metric Multidimensional Scaling (NMDS) and Principal Coordinate Analysis (PCoA) are the two main methods used to reduce a high-dimensional dissimilarity/distance matrix into something that can be plotted in a few (2 or 3) dimensions. These two techniques generate the ordination plots from the dissimilarity matrix in different ways, and while the results may be mostly similar, there can be slight differences.

Non-metric Multidimensional Scaling (NMDS)

The goal of the NMDS algorithm is to project points from a multi-dimensional space into a fixed number of dimensions, e.g. 2 or 3 dimensions. This can be thought of as a "flattening" of the multidimensional space. In some ways, this is akin to the projection of places/features on the round Earth onto a two-dimensional map. Due to this process, the resulting "projection" may be skewed and apparent distances on the "projection" do not always directly correspond to the actual distances between points/features. Thus, if two points are a certain distance apart on the plot and two other points are twice that distance apart on the plot, one CANNOT assume that the second pair of samples was twice as far apart in the original dissimilarity matrix.

The amount of "skew" in the distances is typically assessed using a stressplot. This plots the Ordination distance (distance between two given points on the plot) as a function of the Observed dissimilarity (distance between the two entities in the original dissimilarity matrix) for all possible pairs of points. While a direct, linear relationship between the Ordination distance and the Observed dissimilarity is very unlikely, the fit in the stress plot, as measured by R2, can be used to assess how well the NMDS plot depicts the computed dissimilarity matrix. Values of 0.9 or higher can be considered very good.
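
The pairs of values behind a stress plot can be sketched as follows: for every pair of samples, take the observed dissimilarity and the distance between the two points on the ordination plot, then measure how well one tracks the other. The R2 computed here is the squared Pearson correlation of a simple linear fit, which is an assumption about which fit is meant. The coordinates, matrices and function names are made up for illustration.

```python
import math

def stress_plot_points(coords, dissim):
    """Observed dissimilarity and ordination distance for every pair of samples."""
    observed, ordination = [], []
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            observed.append(dissim[i][j])
            ordination.append(math.dist(coords[i], coords[j]))
    return observed, ordination

def linear_fit_r2(x, y):
    """Squared Pearson correlation between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Hypothetical 2D ordination coordinates for four samples.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Case 1: dissimilarities are exactly half the plotted distances (no skew).
faithful = [[0.5 * math.dist(p, q) for q in coords] for p in coords]
r2_faithful = linear_fit_r2(*stress_plot_points(coords, faithful))

# Case 2: one pair is far closer in the matrix than it appears on the plot.
skewed = [row[:] for row in faithful]
skewed[0][3] = skewed[3][0] = 0.1
r2_skewed = linear_fit_r2(*stress_plot_points(coords, skewed))

print(f"faithful R2 = {r2_faithful:.3f}, skewed R2 = {r2_skewed:.3f}")
```

The first case gives an R2 of 1 because the plot distances are a perfect rescaling of the matrix, while the second case shows how a single badly represented pair can pull the fit down.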

The end result is that while there may be some skew in distances presented in the plot, one is guaranteed a set number of dimensions, e.g. 2 or 3, for the final visualization.

Principal Coordinate Analysis (PCoA)

This algorithm is related to the Principal Component Analysis (PCA) method used to generate ordination plots from data with multiple features, e.g. gene expression data. The main difference is that PCoA utilizes a dissimilarity/distance matrix computed between each of the entities (samples) rather than the original feature table for the samples. Similar to PCA, it computes a set of eigenvectors to set a series of orthogonal axes such that the first axis (PC1) of the ordination space describes the greatest amount of dissimilarity, the second axis (PC2) describes the second greatest amount of dissimilarity, and so on.

The use of eigenvectors results in an ordination space in which the original distance between points is preserved; however, there is no limit on the number of dimensions in the ordination space. Thus, only a portion of the dissimilarity can be depicted if only using the first 2 or 3 ordination axes, e.g. PC1, PC2 and PC3. A scree plot can be used to assess the amount of dissimilarity that is depicted in the first couple of dimensions and how much dissimilarity is "hidden" in the higher ordination axes, e.g. PC4, PC5, PC6, etc. The scree plot displays the relative eigenvalues (fraction of dissimilarity depicted) for each axis. If the first few axes, e.g. PC1 and PC2, account for the majority of the relative eigenvalues, then a PCoA plot of the first few axes would be a good representation of the computed dissimilarities.
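
A minimal sketch of the PCoA steps, under some simplifying assumptions, is shown below: the squared dissimilarities are double-centred, the leading eigenvalues and eigenvectors are extracted, and each eigenvalue is divided by the sum of all eigenvalues to get the relative eigenvalues shown in a scree plot. To stay dependency-free it uses power iteration with deflation instead of a full eigen solver, and it assumes the dissimilarities behave like Euclidean distances (no large negative eigenvalues). All function names and data are hypothetical.

```python
import math

def double_center(dist):
    """Gower-centre the matrix of -0.5 * squared dissimilarities."""
    n = len(dist)
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]   # equals the column means (symmetric)
    grand_mean = sum(row_mean) / n
    return [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]

def top_eigenpairs(matrix, k, iters=500):
    """Leading k eigenpairs of a symmetric matrix via power iteration with deflation."""
    n = len(matrix)
    b = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [1.0 + 0.1 * i for i in range(n)]          # arbitrary start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):                              # remove the component just found
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return pairs

def pcoa(dist, k=2):
    """Return sample coordinates on the first k axes and their relative eigenvalues."""
    b = double_center(dist)
    total = sum(b[i][i] for i in range(len(b)))         # trace = sum of all eigenvalues
    pairs = top_eigenpairs(b, k)
    relative = [lam / total for lam, _ in pairs]
    coords = [[math.sqrt(max(lam, 0.0)) * vec[i] for lam, vec in pairs]
              for i in range(len(b))]
    return coords, relative

# Hypothetical samples at the corners of a 4 x 2 rectangle.
points = [(0, 0), (4, 0), (0, 2), (4, 2)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords, relative = pcoa(dist)
print("relative eigenvalues:", [round(r, 3) for r in relative])
```

For this rectangle the first axis carries 80% of the dissimilarity and the second the remaining 20%, so a two-axis plot depicts everything. With real data the first two axes usually capture less, which is what the scree plot is meant to reveal.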

The end result is that while some fraction of the computed dissimilarities may not be depicted in only a few dimensions, PCoA preserves the original dissimilarity in the new ordination space and will "prioritize" the dimensions of the dissimilarity space.
