Beta Diversity - Michael-D-Preston/PrestonLab GitHub Wiki
Introduction
Beta diversity is a measure of community similarity (or difference) across multiple samples along a gradient. It is usually explored through ordination methods, which aim to represent the relationships between your samples in 2- or 3-dimensional space instead of the vast multitude of dimensions that exist in the real data. For a much more in-depth conversation about ordination please read (or skim...) Ordination Methods - an overview. There are many ordination methods, including principal coordinate analysis (PCoA), nonmetric multidimensional scaling (NMDS), and principal component analysis (PCA), but the main thing to know is that ordination methods typically rely on some kind of distance matrix, which records how far apart every pair of samples is. There are many different distance measures; common ones in biology include Jaccard and Bray-Curtis, but we care about the robust clr-transformed Aitchison distance plotted by PCA.
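To make "distance matrix" a bit more concrete, here is a minimal Python sketch that builds two small pairwise matrices from a made-up count table. The sample names, the counts, and the function names are all assumptions for illustration only, and in practice you would let a package compute these for you.

```python
def jaccard_distance(a, b):
    """Jaccard distance on presence/absence: 1 - shared taxa / taxa seen in either sample."""
    present_a = {i for i, count in enumerate(a) if count > 0}
    present_b = {i for i, count in enumerate(b) if count > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


def bray_curtis_distance(a, b):
    """Bray-Curtis dissimilarity on abundances: summed absolute differences / summed totals."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / total


# Hypothetical count table: rows are samples, columns are four taxa.
samples = {
    "S1": [10, 0, 5, 0],
    "S2": [8, 2, 0, 0],
    "S3": [0, 0, 1, 9],
}
names = list(samples)

# A distance matrix is just every sample compared against every other sample.
jaccard_matrix = [[jaccard_distance(samples[r], samples[c]) for c in names] for r in names]
bray_matrix = [[bray_curtis_distance(samples[r], samples[c]) for c in names] for r in names]

for name, row in zip(names, bray_matrix):
    print(name, [round(value, 2) for value in row])
```

The ordination method then tries to draw the samples so that distances on the plot match the matrix as closely as possible.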
PCA vs PCoA vs NMDS
The minutiae of this are a bit hairy, but put simply, PCA is a form of PCoA that specifically uses Euclidean distances. PCoA/PCA differ from NMDS because PCoA/PCA use the raw dissimilarity values (calculated with whichever distance measure was chosen), whereas NMDS converts those dissimilarity values into ranks, which are then compared against each other. (This is a gross oversimplification.) With microbiome data, PCA plots tend to be a bit more stable than PCoA plots, especially with subsetted data (i.e. the presence/absence of rare species doesn't change the results, so this method is also protected from the sparsity of our data).
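A rough Python sketch of the "NMDS uses ranks" idea, keeping the same oversimplification: the dissimilarity values and the helper name `rank_dissimilarities` are made up for illustration, and the iterative fitting that NMDS does afterwards is not shown.

```python
def rank_dissimilarities(dissimilarities):
    """Replace each pairwise dissimilarity with its rank (1 = most similar pair).

    Tied values share the average of the ranks they span.
    """
    ordered = sorted(dissimilarities.items(), key=lambda item: item[1])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        average_rank = (i + 1 + j + 1) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = average_rank
        i = j + 1
    return ranks


# Hypothetical dissimilarities between three samples.
raw = {("S1", "S2"): 0.36, ("S1", "S3"): 0.92, ("S2", "S3"): 1.00}

# Same ordering of pairs, wildly different magnitudes.
stretched = {("S1", "S2"): 0.10, ("S1", "S3"): 0.50, ("S2", "S3"): 40.0}

raw_ranks = rank_dissimilarities(raw)
stretched_ranks = rank_dissimilarities(stretched)
print(raw_ranks)
print(raw_ranks == stretched_ranks)
```

Both tables give the same ranks, so a rank-based method would treat them the same, while PCoA/PCA would see the raw values and draw two very different plots.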
Distances
Briefly, the Aitchison distance is a pure Euclidean distance calculated on data that have been clr transformed; therefore, the Aitchison distance used with PCA (since we are using a Euclidean distance) fully addresses the compositionality of the sample. As well, the Aitchison distance does not decrease if additional taxa are observed, meaning there is higher reproducibility across studies/primer sets. However, because you read Compositionality and You, you know that the clr transformation (using pseudocounts) does not address sparsity very well, so we will use the robust clr transformation before applying the Euclidean distance in our PCA. This is referred to as Robust PCA (RPCA). As quick comparisons, Bray-Curtis uses abundance data (very obviously wrong with compositional datasets), and Jaccard uses presence/absence data (a poor choice for our sparse dataset).
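Here is a small Python sketch of the transformations mentioned above, under a few assumptions: the function names and the toy counts are made up, and `rclr` follows my reading of the robust clr (log and center only the nonzero counts, leave zeros as missing). The matrix-completion step that RPCA uses to deal with those missing values is not shown.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio: log of each (count + pseudocount) minus the sample's mean log."""
    logs = [math.log(count + pseudocount) for count in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def rclr(counts):
    """Robust clr sketch: log and center only the observed (nonzero) counts.

    Zeros are left as None (missing) instead of being filled with a pseudocount.
    """
    observed_logs = [math.log(count) for count in counts if count > 0]
    if not observed_logs:
        return [None] * len(counts)
    mean_log = sum(observed_logs) / len(observed_logs)
    return [math.log(count) - mean_log if count > 0 else None for count in counts]


def aitchison_distance(a, b, pseudocount=1.0):
    """Aitchison distance: plain Euclidean distance between clr-transformed samples."""
    return math.dist(clr(a, pseudocount), clr(b, pseudocount))


# Same composition at two sequencing depths (no zeros, so no pseudocount needed).
shallow = [1, 2, 4]
deep = [10, 20, 40]
depth_distance = aitchison_distance(shallow, deep, pseudocount=0.0)
print(depth_distance)

# A sparse sample: the clr result depends on which pseudocount you happened to pick.
sparse = [4, 0, 1, 0]
clr_with_one = clr(sparse, pseudocount=1.0)
clr_with_half = clr(sparse, pseudocount=0.5)
rclr_values = rclr(sparse)
print(clr_with_one)
print(clr_with_half)
print(rclr_values)
```

The first comparison shows why clr handles compositionality (sequencing depth drops out, so the distance is effectively zero), and the sparse sample shows why the pseudocount is uncomfortable: an arbitrary choice changes the transformed values, whereas the robust version never touches the zeros.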
Standard shapes and sadness
Sometimes PCAs create regular shapes due to artifacts of how they are calculated. A common shape is the horseshoe, whereby dissimilar samples are placed close together because they have very, VERY few features in common. Go read this paper to understand the phenomenon: Uncovering the Horseshoe Effect in Microbial Analyses
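As a rough illustration of the "very few common features" problem, the Python sketch below builds hypothetical samples along a gradient, where each sample contains a sliding window of taxa, and measures Jaccard distance from the first sample. The window setup and names are assumptions for illustration; this only shows distances hitting a ceiling, not the ordination itself.

```python
def jaccard_distance(set_a, set_b):
    """Jaccard distance between two sets of taxa."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def window_sample(start, width):
    """Hypothetical sample: contains taxa start, start + 1, ..., start + width - 1."""
    return set(range(start, start + width))


# Eight samples along a gradient; each one is shifted by one taxon from the last.
width = 3
samples = [window_sample(start, width) for start in range(8)]

# Distance from the first sample to every sample along the gradient.
distances_from_first = [jaccard_distance(samples[0], sample) for sample in samples]
print([round(value, 2) for value in distances_from_first])
```

Once two samples share no taxa, the distance is stuck at its maximum, so a sample that is far along the gradient looks no more different than one that is only moderately far. The ordination has to squeeze those equal distances into the plot somehow, which is one way to think about where the bend comes from.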
So what's to be done?
Follow the first link for a relatively simple RPCA that you can run with minimal computing power OR read on...
RPCA Link
RPCA citations
Phyloseq:
phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. Paul J. McMurdie and Susan Holmes (2013) PLoS ONE 8(4):e61217
MicrobiotaProcess:
Shuangbin Xu, Li Zhan, Wenli Tang, Qianwen Wang, Zehan Dai, Lang Zhou, Tingze Feng, Meijun Chen, Tianzhi Wu, Erqiang Hu, Guangchuang Yu. MicrobiotaProcess: A comprehensive R package for deep mining microbiome. The Innovation. 2023, 4(2):100388. doi: 10.1016/j.xinn.2023.100388
RPCA conceptually
Martino, C. et al. A Novel Sparse Compositional Technique Reveals Microbial Perturbations. mSystems 4, (2019)
phylo-RPCA
Time for things to get complicated!!! Oh yeah! Something I've neglected to mention so far is that when these samples are compared against each other, each taxon is just a floating object with no relation to any other taxon. So, for example, say you have three samples: one sample is 5 different breeds of dog, one sample is 5 other breeds of dog, and the last sample is 5 different breeds of cat. Since none of these samples have any specific taxa in common, a regular RPCA will place them equally far away from each other. BUT we know that different breeds of dog are all still dogs, so two of those samples should actually be rather close on our RPCA. This is where phylo-RPCA comes in. It takes the phylogenetic information from our taxa tables and the comparative abundances of a normal RPCA and uses them to construct a complete phylo-RPCA with all the available metadata. Please read or skim Compositionally Aware Phylogenetic Beta-Diversity Measures Better Resolve Microbiomes Associated with Phenotype before trying the next tutorial. NOTE: depending on your samples/size of dataset, your regular computer might not have the juice to do this and you'll have to use the HPC. It is also worth mentioning that this method is MUCH newer, having been published in 2022, and thus it only has 3 citations; however, the stats in the article supporting its validity are strong enough for me to recommend its use.
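The dog and cat example can be sketched in a few lines of Python. To be loud about the assumptions: the breed lists and the `parent` lookup are made up, Jaccard is swapped in for simplicity, and collapsing breeds to "dog" or "cat" is NOT what phylo-RPCA actually does (it works with the tree itself). The only point is to show how knowing ancestry changes which samples look close.

```python
def jaccard_distance(set_a, set_b):
    """Jaccard distance between two sets of taxa."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical samples: five breeds each, with no breed shared between samples.
samples = {
    "dogs_1": {"beagle", "poodle", "husky", "boxer", "corgi"},
    "dogs_2": {"pug", "collie", "akita", "whippet", "dalmatian"},
    "cats": {"siamese", "persian", "bengal", "sphynx", "ragdoll"},
}

# Hypothetical stand-in for a phylogeny: each breed's parent group.
parent = {breed: "dog" for breed in samples["dogs_1"] | samples["dogs_2"]}
parent.update({breed: "cat" for breed in samples["cats"]})

pairs = [("dogs_1", "dogs_2"), ("dogs_1", "cats"), ("dogs_2", "cats")]

# Treating every breed as an unrelated floating object.
breed_level = {pair: jaccard_distance(samples[pair[0]], samples[pair[1]]) for pair in pairs}

# Letting the breeds inherit their parent group before comparing.
collapsed = {name: {parent[breed] for breed in breeds} for name, breeds in samples.items()}
group_level = {pair: jaccard_distance(collapsed[pair[0]], collapsed[pair[1]]) for pair in pairs}

print(breed_level)
print(group_level)
```

At the breed level every pair is maximally distant, while with ancestry included the two dog samples land on top of each other and the cats stay far away.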
phylo-RPCA link
Out of the frying pan eh? Time to introduce QIIME2
If you haven't gathered by now, when the code isn't working my professionalism within this documentation kinda takes a hit. QIIME2 broke me. You gotta do what you gotta do for the phylo-RPCA, but here's a little rant bc I was angry.
QIIME2 SUCKS. It's harder to read, it's command line so inherently less user friendly, but what annoys me most (ignoring the fact I have sunk days into QIIME2 now) is that it kinda forces you into just using QIIME2. There is very little functionality or documentation on how to import data into QIIME2 unless you are starting from the raw sequence files. QIIME2 has the ability to run up-to-date versions of DADA2, so technically you can just rerun your whole analysis to run the phylo-RPCA, and if you do, power to you, but currently it ain't worth it. In theory you can take your QIIME2 data and convert it to R easily, but trust is low right now.
After you have QIIME2 up and running, you can work on the phylo-RPCA.