Differential abundance - Michael-D-Preston/PrestonLab GitHub Wiki

Introduction

Because of the unique characteristics of microbiome datasets, determining differential abundance is SUPER hard, and honestly a bit untrustworthy... Science has not yet decided on the best way to perform differential abundance analysis on microbiome datasets, so be very careful when analyzing these results. Right now the best solution, I think, is ANCOM-BC-2 (the 2 is specific); however, new research on this topic is being published all the time, so stay cognizant.

Reading list

Read me for comparison of methods (Note: these papers are out of date; for example, they don't include ANCOM-BC-2):

Microbiome differential abundance methods produce different results across 38 datasets

The accuracy of absolute differential abundance analysis from relative count data

Read me for ALDEx2:

ALDEx2: Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis

This is useful when trying to interpret ALDEx2 (it explains Bland-Altman, volcano and effect plots): Displaying Variation in Large Datasets: Plotting a Visual Summary of Effect Sizes

Read me for ANCOM-BC:

ANCOM-BC: Analysis of compositions of microbiomes with bias correction

ANCOM-BC part 2: Analysis of microbial compositions: a review of normalization and differential abundance analysis

ANCOM-BC-2, read me

ANCOM-BC-2

So what's ANCOM-BC-2's deal?

  • Multiple pairwise comparisons!! This may not sound like a big deal, but this is the first program I've encountered that promises to do all the comparisons you care about across 3+ groups, rather than just comparing everything to a single reference group and leaving you to fiddle around with it. (It'll also do binary comparisons.)

  • Pattern analysis over ordered study groups (effectively pairwise comparisons, but with a fuzzier image)

  • ANCOM-BC-2 will also account for taxon-specific bias. Remember how choosing your PCR primers was so important, because they preferentially amplify certain taxa? ANCOMBC2 promises to account for this in your data.

  • (This one is really cool) ANCOMBC2 will perform a sensitivity analysis on the choice of pseudo-count. See, the pseudo-count can change your results (especially for rare taxa, as they are more biased by the pseudo-count), leading to an inflated FDR. ANCOMBC2 will run the analysis with multiple pseudo-counts, see how they affect the results, and filter out taxa that are significantly affected by the pseudo-count.

  • ANCOMBC2 knows about compositionality and how scary it is. See, we use relative abundance because microbiome data is compositional, but a change in the absolute abundance of a single taxon can alter the relative abundances of all taxa (this is shockingly not great). Therefore, a hypothesis test comparing relative abundances is not equivalent to one comparing absolute abundances. Relative abundance (the sampling fraction) can be affected by the microbial load in the system and by library size. Library size is simple: with too small a library you don't hit the important parts of your rarefaction curve and you miss out on species. Microbial load is a bit more complex (and explained well in the ANCOMBC paper). In short, two ecosystems/samples can have the same relative abundance of a taxon and the same library size, but their absolute population size, or microbial load, is different. If you don't account for this then you'd say the two ecosystems are the same (wrong). ANCOMBC accounts for this by introducing a sample-specific offset term that bias-corrects for the microbial load and compositionality of microbiome data.

  • Performs worse when sample size is small (n = 5), and performs well when n >= 10

  • Accounts for structural zeros. Structural zeros are defined as taxa present in one sample type and not present in another (i.e. imagine comparing samples from a jungle to a forest: there'll just be some species that straight up don't exist in one environment or the other, and this is expected). You can't really do stats on these structural zeros because the taxon is just not there in one sample type, so we'll have to examine them differently (and subjectively). Please give Naught all zeros in sequence count data are the same a skim; it has some stuff to say about zeros (if you couldn't guess by the title). Structural zeros are called biological zeros in this paper. Now, this being said, the way these structural zeros are defined is a bit suspect. See, if your library size in some samples is too small it won't pick up rare taxa at all! So these rare taxa might be erroneously picked up as structural zeros (when they should be called sampling zeros). However, in my humble opinion, using the same sequencing tech and primers should mitigate the issues with sampling zeros (and technical zeros), as you'd expect equal variance among samples. ANCOMBC skirts some of these issues by first filtering taxa so that they have to exist in a certain proportion of samples (10% by default), to get around rare taxa getting picked up just because they are rare.
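To make the compositionality bullet above a bit more concrete, here is a minimal Python sketch. Everything in it is made up for illustration (the taxon names, the counts, the helper `relative_abundance`), and it has nothing to do with how ANCOM-BC-2 is implemented. It only shows two things: one taxon blooming drags down the relative abundance of taxa that never changed, and two samples with very different microbial loads can look identical in relative terms.

```python
# Toy illustration of compositionality. All numbers are hypothetical.

def relative_abundance(counts):
    """Convert absolute counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

# Hypothetical absolute abundances in one ecosystem.
before = {"taxon_A": 100, "taxon_B": 100, "taxon_C": 200}

# Only taxon_A blooms; taxon_B and taxon_C are unchanged in absolute terms.
after = {"taxon_A": 500, "taxon_B": 100, "taxon_C": 200}

# Same proportions as `before`, but ten times the microbial load.
dense = {taxon: n * 10 for taxon, n in before.items()}

rel_before = relative_abundance(before)
rel_after = relative_abundance(after)
rel_dense = relative_abundance(dense)

for taxon in before:
    print(taxon, rel_before[taxon], rel_after[taxon], rel_dense[taxon])
```

In the printout, taxon_B and taxon_C appear to drop after the bloom even though their absolute counts did not move, and `dense` is indistinguishable from `before` despite carrying ten times the load.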
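The pseudo-count bullet above can also be sketched with a toy example. To be clear, this is not ANCOM-BC-2's actual sensitivity analysis: the function `log_fold_change`, the counts, and the pseudo-count values are all assumptions chosen for illustration. The sketch just recomputes a naive log fold change under several pseudo-counts and compares how much the estimate moves for a rare taxon versus an abundant one.

```python
import math

# Hypothetical pseudo-counts to try; not the values any package uses.
pseudo_counts = [0.1, 0.5, 1.0]

def log_fold_change(group_a, group_b, pseudo_count):
    """Difference in mean log(count + pseudo_count) between two groups."""
    mean_a = sum(math.log(x + pseudo_count) for x in group_a) / len(group_a)
    mean_b = sum(math.log(x + pseudo_count) for x in group_b) / len(group_b)
    return mean_b - mean_a

def spread(groups):
    """Range of the log fold change across all pseudo-counts tried."""
    lfcs = [log_fold_change(groups[0], groups[1], pc) for pc in pseudo_counts]
    return max(lfcs) - min(lfcs)

# Made-up counts for one rare and one abundant taxon in two groups.
rare = ([0, 1, 0, 2], [3, 0, 4, 1])
abundant = ([900, 1100, 1000, 950], [2100, 1900, 2000, 2050])

print("rare taxon spread:", spread(rare))
print("abundant taxon spread:", spread(abundant))
```

The rare taxon's estimate swings noticeably with the pseudo-count while the abundant taxon's barely moves, which is the kind of instability the sensitivity analysis is meant to flag.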
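For the structural zeros bullet above, here is a rough sketch of the two ideas in plain Python. It uses a deliberately simplified rule (a taxon counts as a structural zero in a group if it is zero in every sample of that group), which is an assumption for illustration and not the package's exact criterion. The sample data, the helper names, and the variable `min_prevalence` are hypothetical; only the 10% default comes from the text above.

```python
# Hypothetical count data: 10 samples per group, 3 taxa.
groups = {
    "jungle": [{"taxon_A": 10 + i, "taxon_B": 0, "taxon_C": 0}
               for i in range(10)],
    "forest": [{"taxon_A": 8 + i, "taxon_B": 5 + i,
                "taxon_C": 1 if i == 0 else 0}
               for i in range(10)],
}
taxa = ["taxon_A", "taxon_B", "taxon_C"]
all_samples = [s for samples in groups.values() for s in samples]

min_prevalence = 0.10  # fraction of samples a taxon must appear in

def prevalence(taxon):
    """Fraction of all samples in which the taxon has a nonzero count."""
    present = sum(1 for s in all_samples if s[taxon] > 0)
    return present / len(all_samples)

def structural_zero_groups(taxon):
    """Groups where the taxon is zero in every sample (simplified rule)."""
    return [name for name, samples in groups.items()
            if all(s[taxon] == 0 for s in samples)]

# Step 1: drop taxa that are too rare to say anything about.
kept = [t for t in taxa if prevalence(t) >= min_prevalence]

# Step 2: flag structural zeros among the taxa that survived the filter.
structural = {t: structural_zero_groups(t) for t in kept}

print("kept:", kept)
print("structural zeros:", structural)
```

Here taxon_C shows up in only 1 of 20 samples, so the prevalence filter removes it before it can be flagged as a structural zero in the jungle group, while taxon_B (absent from every jungle sample, common in the forest) is flagged.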

Link

ANCOMBC2 Comparing two groups

ANCOMBC2 Comparing three groups

ALDEx2, valid to use, but use ANCOM-BC-2