Step 7.0: Subsampling and diversity analyses - shenjean/diversity GitHub Wiki

Subsampling

Subsampling is important before proceeding to diversity analyses. It randomly samples all your libraries to the same depth, so that each sample has the same number of sequences, x. The x value is user-specified, and samples with fewer than x sequences are dropped from the diversity analysis. Most diversity metrics are sensitive to differences in sampling depth across samples, which is why every sample is brought to the same depth. Higher x values give you better coverage, so the diversity sampled in your analyses is more representative of the actual "true" diversity. The choice of x is therefore a balance between coverage and the number of samples, because choosing a high x value usually means eliminating more samples from your analyses.
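To make the idea concrete, here is a minimal sketch of subsampling a single sample without replacement, written with awk. It is an illustration only and not QIIME 2's implementation: the function name `rarefy_sample`, the two-column input layout (feature ID, count) and the seed argument are all assumptions made for this example. In practice QIIME 2 does the subsampling for you.

```shell
# Illustration only: subsample ONE sample to depth x without replacement.
# $1 = tab-separated file with feature ID and count (assumed layout)
# $2 = depth x
# $3 = random seed (optional, defaults to 1)
rarefy_sample() {
  awk -F'\t' -v depth="$2" -v seed="${3:-1}" '
    BEGIN { srand(seed + 0) }
    {
      c = $2 + 0                                   # force a numeric count
      for (k = 0; k < c; k++) reads[n++] = $1      # one array entry per sequence
    }
    END {
      if (n < depth + 0) exit                      # fewer than x sequences: sample is dropped
      for (i = 0; i < depth + 0; i++) {            # partial Fisher-Yates shuffle
        j = i + int(rand() * (n - i))
        tmp = reads[i]; reads[i] = reads[j]; reads[j] = tmp
        picked[reads[i]]++
      }
      for (f in picked) printf "%s\t%d\n", f, picked[f]
    }
  ' "$1"
}

# Example: rarefy_sample one_sample_counts.tsv 1000 42
```

The output counts always add up to x, and a sample with fewer than x sequences produces no output at all, which mirrors how such samples are dropped from the analysis.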

The choice of x can be made by visualizing the pe.dada2.qzv file generated by the feature-table summarize command or the otu_table.txt generated by the export and biom convert commands.

  • If using pe.dada2.qzv, visualize the file with Qiime2View and click on the Interactive Sample Detail tab.
  • If using otu_table.txt, sum the numbers in each column. Each column sum is the total feature (ASV) count, i.e. the number of sequences, for that sample.
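For the otu_table.txt route, the column sums can be computed with awk. This is a sketch that assumes the layout usually written by `biom convert --to-tsv`: a `# Constructed from biom file` comment line, a `#OTU ID` header row, and then one row per feature with one numeric column per sample and no trailing taxonomy column. The function name `sum_sample_counts` is made up for this example.

```shell
# Sum every sample column of a tab-separated feature table and list the
# samples from the highest to the lowest total.
sum_sample_counts() {
  awk -F'\t' '
    /^#OTU ID/ { for (i = 2; i <= NF; i++) name[i] = $i; ncol = NF; next }
    /^#/       { next }                                  # skip other comment lines
    { for (i = 2; i <= ncol; i++) total[i] += $i }       # add this feature to each sample
    END { for (i = 2; i <= ncol; i++) printf "%s\t%d\n", name[i], total[i] }
  ' "$1" | sort -t "$(printf '\t')" -k2,2nr
}

# Example: sum_sample_counts otu_table.txt
```

If your table was exported with a taxonomy column, that column would need to be removed first, since this sketch treats every column after the first as a sample.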

Either way, you should get a table of sample IDs and their feature (ASV) counts. For example:

Sample ID	Feature Count
BCB11T	28505
BCB17SR	27585
BCB17TL	27476
BCB9TL	26731
BCB17TR	24175
BCB17SL	20863
BCB16TL	19217
BCB11H	17584
BCB9HR	15251
BCB16SL	12269
BCB17HL	3885
BCB16HL	3

Typically, subsampling at a depth (x) below 1,000 sequences is not recommended, and a higher depth is better as long as you can afford to lose the samples that fall below it. In the example above, you could subsample at x=3885 and eliminate only the one sample with 3 sequences (BCB16HL), or subsample at x=10,000 and eliminate two samples (BCB17HL and BCB16HL). The choice of x is subjective, but you can compare sampling depths with rarefaction curves.
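The trade-off between depth and retained samples can be checked quickly from the table of sample IDs and feature counts. The sketch below assumes that table has been saved as a two-column, tab-separated file; the file name `sample_counts.tsv` and the function name `samples_dropped_at_depth` are hypothetical.

```shell
# Report how many samples are kept or dropped at a candidate depth.
# $1 = tab-separated file: sample ID, feature count (a header row is tolerated)
# $2 = candidate subsampling depth x
samples_dropped_at_depth() {
  awk -F'\t' -v depth="$2" '
    $2 !~ /^[0-9]+(\.[0-9]+)?$/ { next }         # skip the header or malformed rows
    { total++ }
    $2 + 0 < depth + 0 { dropped++; ids = ids sep $1; sep = "," }
    END {
      printf "depth=%d kept=%d dropped=%d [%s]\n", depth, total - dropped, dropped, ids
    }
  ' "$1"
}

# Example: compare the two depths discussed above
# for x in 3885 10000; do samples_dropped_at_depth sample_counts.tsv "$x"; done
```

A sample whose count equals x exactly is kept, because only samples with fewer than x sequences are dropped.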

Rarefaction analysis

Rarefaction curves allow you to visually evaluate your sampling depth. In the commands below, we set the maximum rarefaction depth (--p-max-depth) to each of the two candidate subsampling depths, 3885 and 10000. The maximum depth can be set as high as the largest feature count found in any sample in the dataset. You can also supply a metadata file (which allows you to group the curves by metadata category) and a phylogenetic tree, and specify a minimum rarefaction depth and the diversity metrics you want to test with the rarefaction curves.

See documentation here: https://docs.qiime2.org/2024.5/plugins/available/diversity/alpha-rarefaction/

Here, we will just do the bare minimum for the two candidate depths:

qiime diversity alpha-rarefaction --i-table pe.dada2.qza --p-max-depth 3885 --o-visualization rarefaction.3885.qzv
qiime diversity alpha-rarefaction --i-table pe.dada2.qza --p-max-depth 10000 --o-visualization rarefaction.10000.qzv

Again, you can visualize the rarefaction .qzv files with Qiime2View. Rarefaction curves typically rise rapidly and then plateau as the sequencing depth increases, meaning that the abundant species have been sampled and only rare species remain to be found. Choose a sampling depth where the curves have leveled off, not one where they are still climbing.
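The optional inputs mentioned earlier (metadata file, phylogenetic tree, minimum depth, chosen metrics) can be combined in one call. The sketch below wraps such a call in a shell function. The function name, the minimum depth of 100, the 20 steps and the output file name are illustrative choices, and the file names are the ones used elsewhere on this page. Check `qiime diversity alpha-rarefaction --help` for the options available in your QIIME 2 version.

```shell
# Sketch: alpha rarefaction with optional inputs, for a given maximum depth.
# $1 = maximum rarefaction depth
run_alpha_rarefaction() {
  qiime diversity alpha-rarefaction \
    --i-table pe.dada2.qza \
    --i-phylogeny rooted-tree.qza \
    --m-metadata-file metadata.txt \
    --p-min-depth 100 \
    --p-max-depth "$1" \
    --p-steps 20 \
    --p-metrics observed_features \
    --p-metrics shannon \
    --p-metrics faith_pd \
    --o-visualization "rarefaction.$1.full.qzv"
}

# Example: run_alpha_rarefaction 10000
```

The faith_pd metric needs the tree, so drop both that metric and the --i-phylogeny line if you do not have one.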

You can also perform beta diversity rarefaction.
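As a sketch of what that could look like, the function below calls `qiime diversity beta-rarefaction` at a chosen sampling depth. The function name, the choice of Bray-Curtis as the metric, UPGMA as the clustering method and the output file name are assumptions made for this example. Confirm the available options with `qiime diversity beta-rarefaction --help`.

```shell
# Sketch: beta diversity rarefaction at a given sampling depth.
# $1 = sampling depth
run_beta_rarefaction() {
  qiime diversity beta-rarefaction \
    --i-table pe.dada2.qza \
    --p-metric braycurtis \
    --p-clustering-method upgma \
    --m-metadata-file metadata.txt \
    --p-sampling-depth "$1" \
    --o-visualization "beta-rarefaction.$1.qzv"
}

# Example: run_beta_rarefaction 10000
```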

Core diversity analysis

QIIME2 automates the calculation of multiple alpha and beta diversity indices with just a single command. Here, the rarefaction depth is specified with the --p-sampling-depth option. The commands also take your metadata file (--m-metadata-file option), which is a required input.

These commands write an output directory containing .qza artifacts for each diversity metric (alpha diversity vectors, distance matrices and PCoA results) along with .qzv Emperor plots that you can visualize with Qiime2View. From the visualizations, you can get an idea, but NOT statistical evidence, of how your samples cluster or group with your metadata. You need to test for statistical associations between metadata categories and diversity data (see next step).

If you have a tree:

qiime diversity core-metrics-phylogenetic --i-phylogeny rooted-tree.qza --i-table pe.dada2.qza --m-metadata-file metadata.txt --p-sampling-depth 10000 --output-dir core-diversity-phylogenetic-10000

If you don't have a tree:

qiime diversity core-metrics --i-table pe.dada2.qza --m-metadata-file metadata.txt --p-sampling-depth 10000 --output-dir core-diversity-10000
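If you want to run the same analysis at several depths, the two cases above can be wrapped in one helper that picks the command based on whether the tree file is present. This is only a convenience sketch. The function name `run_core_metrics` is hypothetical, and it assumes the file names used on this page are in the current directory.

```shell
# Sketch: run core diversity metrics at a given depth, using the
# phylogenetic variant only when rooted-tree.qza exists.
# $1 = sampling depth
run_core_metrics() {
  if [ -f rooted-tree.qza ]; then
    qiime diversity core-metrics-phylogenetic \
      --i-phylogeny rooted-tree.qza \
      --i-table pe.dada2.qza \
      --m-metadata-file metadata.txt \
      --p-sampling-depth "$1" \
      --output-dir "core-diversity-phylogenetic-$1"
  else
    qiime diversity core-metrics \
      --i-table pe.dada2.qza \
      --m-metadata-file metadata.txt \
      --p-sampling-depth "$1" \
      --output-dir "core-diversity-$1"
  fi
}

# Example: for x in 3885 10000; do run_core_metrics "$x"; done
```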