Alpha Diversity measurements and tests

By Angus Ball

vinyl scratch sound

YOU'RE NOW LISTENING TO

car crash sound

102.3

elephant sound

REAL ALPHA DIVERSITY FM

explosion

WHERE WE PLAY NOTHING BUT DIVERSITY, DIVERSITY, AND MORE DIVERSITY

glass shattering sound

police siren

THIS AIN'T YOUR GRANNY'S STATISTICAL ANALYSIS

Shannon entropy by Shannon 1948 starts playing

(PS, this is actually really funny if you've just spent the last hour+ reading papers on the minutiae of alpha diversity metrics)

Introduction

Alpha diversity summarizes the diversity within a single environment by combining ecological richness (the number of taxonomic groups) and evenness (how equally abundance is spread across those groups). Unfortunately, alpha diversity measurements are complicated, from the choice of which metric to use (Shannon versus inverse Simpson, etc.) to, once again... the problems with compositionality.
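
To make richness and evenness concrete, here is a minimal Python sketch of the plain "plug-in" versions of the two metrics named above. The read counts are made up for illustration, and this is the naive calculation, not what DivNet does.

```python
import math

def richness(counts):
    """Number of taxa with at least one read."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Plug-in Shannon index: H = -sum(p_i * ln(p_i))."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def inverse_simpson(counts):
    """Plug-in inverse Simpson index: 1 / sum(p_i ** 2)."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts)

# Hypothetical read counts: same richness (4 taxa), very different evenness.
even_sample = [25, 25, 25, 25]
uneven_sample = [97, 1, 1, 1]

for name, counts in [("even", even_sample), ("uneven", uneven_sample)]:
    print(name, richness(counts),
          round(shannon(counts), 3),
          round(inverse_simpson(counts), 3))
```

Both samples have the same richness, but the uneven one scores lower on both indices, which is the evenness part of alpha diversity at work.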

Specifically, "the library sizes (i.e. number of reads) can dominate the biology in determining the result of the diversity analysis". As well, common diversity metrics (such as Shannon's) assume they are calculated from the entire population, which, for microbiome data where rare species are so often lost, is simply not true. All in all, alpha diversity metrics tend to underestimate the quote-unquote "true" alpha diversity.
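
The library-size effect can be seen in a small simulation. The sketch below assumes a hypothetical community of 500 taxa with Zipf-like abundances (a few common taxa, many rare ones), "sequences" it at two made-up depths, and computes the naive plug-in Shannon index each time. The community, depths, and seed are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter

def shannon(counts):
    """Plug-in Shannon index: H = -sum(p_i * ln(p_i))."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical "true" community: abundance of each taxon proportional to 1 / rank.
N_TAXA = 500
weights = [1.0 / rank for rank in range(1, N_TAXA + 1)]
true_shannon = shannon(weights)

def sequence(depth, rng):
    """Draw `depth` reads from the community; return counts of the observed taxa."""
    reads = rng.choices(range(N_TAXA), weights=weights, k=depth)
    return list(Counter(reads).values())

rng = random.Random(2023)  # fixed seed so the sketch is repeatable
shallow = sequence(50, rng)    # small library
deep = sequence(5000, rng)     # large library

print("true community:", N_TAXA, "taxa, Shannon", round(true_shannon, 2))
print("50 reads:      ", len(shallow), "taxa, Shannon", round(shannon(shallow), 2))
print("5000 reads:    ", len(deep), "taxa, Shannon", round(shannon(deep), 2))
```

Same community, different answers: a library of 50 reads can never show more than 50 taxa or a Shannon index above ln(50), no matter what the biology is, so the number of reads ends up driving the result.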

DivNet

Personally, I've found DivNet to be the best package for addressing at least some of these issues and, with 52 citations on Web of Science since its release in 2020, in papers from Nature to Frontiers in Microbiology, I'm not the only one.

Amy Willis is the lead for this kind of research, so you should probably give her papers a read: "Rarefaction, Alpha Diversity, and Statistics" and "Estimating diversity in networked ecological communities".

Either way, why is DivNet good?

It accounts for compositionality
It uses covariate information
It accounts for sampling depth
It is more sensitive to very rare species (singleton count level) than other packages

Why is DivNet bad?

It estimates diversity poorly when datasets are based on LV-models
It has a higher standard error than other models (of significantly different design)
It is more sensitive to very rare species (singleton count level) than other packages (yeah, I can justify it either way)

Covariate information

By adding covariate information to the samples, we can estimate how many rare taxa were lost through measurement error and include them in the diversity metric. For example, if we have two soil samples and a water sample, in theory the two soil samples will have the same diversity because they come from the same location/population. But because of subsampling and the random chance associated with next-gen sequencing, the two soil samples will inevitably end up with slightly different species (particularly when looking at rare taxa). BUT since we know they are subsections of the same population, we can assume that all the rare taxa present in soil sample 1 are also present in soil sample 2 and vice versa.

Therefore DivNet uses this covariate information (i.e. soil versus water samples) to leverage all the observed reads across samples when determining the species richness of a specific type of sample. This means that, with this method, all the soil samples share one diversity estimate (as do all the water samples), in exchange for greater accuracy. (You can of course add more covariate information: this soil sample is from plot A, this one is from plot B, etc.)
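
The intuition of borrowing information across samples that share a covariate can be sketched in a few lines of Python. To be loud about it: this is NOT DivNet's algorithm (DivNet fits a statistical model to the counts), it is only a toy that pools observed taxa by covariate. All sample names, taxa, and counts are made up.

```python
# Hypothetical read counts per taxon for three samples, each tagged with a covariate.
samples = {
    "soil_1": {"covariate": "soil",
               "counts": {"taxonA": 120, "taxonB": 30, "taxonC": 1, "taxonD": 0}},
    "soil_2": {"covariate": "soil",
               "counts": {"taxonA": 110, "taxonB": 35, "taxonC": 0, "taxonD": 2}},
    "water_1": {"covariate": "water",
                "counts": {"taxonA": 5, "taxonB": 0, "taxonC": 0, "taxonE": 90}},
}

def observed_taxa(counts):
    """Taxa with at least one read in a sample."""
    return {taxon for taxon, n in counts.items() if n > 0}

# Naive approach: each sample on its own.
per_sample = {name: observed_taxa(s["counts"]) for name, s in samples.items()}

# Toy pooling: samples with the same covariate share their observed taxa.
pooled = {}
for name, s in samples.items():
    pooled.setdefault(s["covariate"], set()).update(per_sample[name])

for name, taxa in per_sample.items():
    print(name, "alone:", len(taxa), "taxa")
for covariate, taxa in pooled.items():
    print(covariate, "pooled:", len(taxa), "taxa")
```

Each soil sample misses one rare taxon that the other one caught, so on their own they each show 3 taxa, while the pooled soil group shows 4. The cost is the one described above: both soil samples now get the same answer.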

Final note

All in all, DivNet seems to be a good, widely used, well-documented balance: new tech that addresses serious concerns (i.e. compositionality) but that has its own share of issues, as other programs do.

Link

Alpha diversity, this time not silly!

Deprecated, i.e. don't use

Citations

DivNet:

Willis A, Martin B (2023). DivNet: Diversity Estimation in Networked Ecological Communities. R package version 0.4.0.

phyloseq:

McMurdie PJ, Holmes S (2013). “phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data.” PLoS ONE, 8(4), e61217.

breakaway:

Willis A, Martin B, Trinh P, Teichman S, Clausen D, Barger K, Bunge J (2022). breakaway: Species Richness Estimation and Modeling. R package version 4.8.4, https://CRAN.R-project.org/package=breakaway.

speedyseq:

McLaren M (2023). speedyseq: Faster implementations of phyloseq functions. R package version 0.5.3.9018, https://github.com/mikemc/speedyseq, https://mikemc.github.io/speedyseq.

magrittr:

Bache S, Wickham H (2022). magrittr: A Forward-Pipe Operator for R. R package version 2.0.3, https://CRAN.R-project.org/package=magrittr.

tibble:

Müller K, Wickham H (2023). tibble: Simple Data Frames. R package version 3.2.1, https://CRAN.R-project.org/package=tibble.

tidyverse:

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.