QIIME2 Brief Tutorial - migred/metagenomics21 GitHub Wiki

Introduction

To carry out this analysis we are going to use QIIME2 environment (https://qiime2.org) that integrates multiple tools to carry out the analysis of the sequences. Unlike what happened in previous versions, in this environment there are two kinds of files and plugins are used to analyze them:

Artifacts

Data produced or handled by QIIME 2 is stored as QIIME 2 artifacts. A QIIME 2 artifact contains data and metadata. Metadata describe peculiarities of the data, such as type, format and how they were generated (source). An artifact in QIIME 2 normally presents the extension.qza.

Since QIIME 2 works with artifacts rather than data files (e.g. FASTA files), these artifacts must be created by importing tools. This process can be performed at any time during the proccess of analysis, although it is usually done at the beginning of the pipeline. In addition, QIIME 2 features tools for exporting data from an artifact. See the import guide in the QIIME 2 page for more details.

Visualizations

Visualizations is the other type of data generated in QIIME 2. These files are stored with the extension.qzv. As with artifacts, they also contain metadata. In addition, these files can be stored and shared so that collaborators can visualize the results. Unlike artifacts, visualizations are the definitive output files of an analysis and can represent, among others, a table of statistical results, an interactive visualization, static images, or a combination of all of them. As they are definitive results they cannot be used as input for other analyses in QIIME 2.

Trick/Recommendation

You can use https://view.qiime2.org to display QIIME 2 artifact and visualizations files content. Thus, you can share the data with collaborators who do not have the program installed.

Plugins

Analyses in QIIME 2 are performed using what are called plugins. To perform the desired analyses it is necessary to install the appropriate plugins. For example, if you want to demultiplex raw readings, you must use the q2-demux plugin, or if you want to perform analysis of alpha and beta diversity you can use the q2-diversity plugin.

It is advisable to visit the plugin availability page to look for existing plugins and the future plugins page to keep track of what is being developed.

Information about the data to analyze in this practice

To demonstrate the potential of this pipeline I have downloaded the sequencing data from a study that intend to evaluate host genotypes and soil type impacts on soybean rhizosphere microbiome communities PRJNA474716. To achieve this goal, 6 soybean cultivars were grown in two different soil types in the greenhouse. At flowering stage, rhizosphere samples were collected and DNA were extracted using MoBio soil DNA extraction kit. 16S rRNA gene amplicon sequencing were used to profile microbial community composition and structures

Starting dataset

Fastq folder: folder with paired end raw readings in fastq format
SraRunTable.xlsx and metadata: information of the content of each file
primers.txt: information on the primers used in PCR and the region they amplify
samplemanifest: list of files used for this tutorial

Pipeline

In the diagram below (https://docs.qiime2.org/2022.2/tutorials/overview/#let-s-get-oriented-flowcharts) you can find the basic pipeline

Reminder. Whenever we are going to use the tools of QIIME 2 we must activate its environment with the following command.

conda activate qiime2-2020.11

2020.11: refers to the QIIME2 version we are going to use (November 2020).

Importing samples

QIIME 2 allows you to import the samples one by one but it is more convenient to import the sequences using a "Manifest" file that indicates the location of each file and the identifier of the corresponding sample. In our case we assume that the sequences are in the fastq directory and the Manifest file is "samplemanifest".

sample-id,absolute-filepath,direction
# Lines starting with '#' are ignored and can be used to create
# "comments" or even "comment out" entries
SRR7265256,$PWD/fastq/SRR7265256_1.fastq,forward
SRR7265340,$PWD/fastq/SRR7265340_1.fastq,forward
SRR7265251,$PWD/fastq/SRR7265251_1.fastq,forward
SRR7265247,$PWD/fastq/SRR7265247_1.fastq,forward

To generate the first artifact with the reading sequences we execute the following command:

qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
                   --input-path samplemanifest \
                   --output-path paired-end-demux.qza \
                   --input-format PairedEndFastqManifestPhred33

It is useful to generate a summary of the import process. In this way we can check how many sequences were obtained per sample and also it provides a distribution chart of the qualities of the sequence by position (similar to the one obtained with Fastqc). To get it we execute the following command.

qiime demux summarize --i-data paired-end-demux.qza --o-visualization paired-end-demux.qzv

You can view the .qzv files using the plugin tools view. The command will open a tab in the browser with the content of the .qzv file. Alternatively, you can use the https://view.qiime2.org page to view the content.

qiime tools view paired-end-demux.qzv

Metadata file

Each sample is usually associated with experimental metadata (e.g. treatment, replication, type of organism, etc.). In order for QIIME 2 to be successful in matching the samples with their metadata, we need to prepare a metadata file. This information is stored in a tabular text file, in which the first line indicates the different fields (sample name, treatment, replication...), the first column corresponding to the name of the sample. Next you can see the content of the metadata file that we are going to use in practice:

#SampleID	Bases	Bytes	collection_date	env_biome	env_feature	env_material	Host	Strain_replicate	Plant	Source	     Sample_Name	Instrument	lat_lon	Library_Name	LibraryLayout	LibrarySelection	LibrarySource	Organism	replicate	   SRA_Study	BioProject	BioSample	Experiment
SRR7265256	39369745	23864338	05-Sep-2016	Greenhouse	Pot	Rhizosphere	Soybean	SOJ_01	SOJ	Agr	AgCV501	Illumina MiSeq	35.90 N 83.96 W	Ag_CV5_1_S33_L001_R1_001	PAIRED	PCR	METAGENOMIC	plant metagenome	CV5_01	SRP149820	PRJNA474716	SAMN09371320	SRX4169424
SRR7265285	46541724	29150348	26-Aug-2016	Greenhouse	Pot	Rhizosphere	Soybean	DRT_03	DRT	For	ForCV203	Illumina MiSeq	36.01 N 85.12 W	For_CV2_3_S59_L001_R1_001	PAIRED	PCR	METAGENOMIC	plant metagenome	CV2_03	SRP149820	PRJNA474716	SAMN09371364	SRX4169395

As you can see the first column is called #SampleID (# is optional) which is the term used to name the samples. And the rest of the columns indicate the different metadata. It is essential that this file has a suitable format. For that reason the QIIME 2 platform has developed a plugin called Keemei (https://keemei.qiime2.org/) that allows the use of Google Sheets to generate metadata files. Optionally you can use any spreadsheet program to generate the table and export it to text format using tabulators as a column marker.

ASVs determination using DADA2

DADA2 is an R package designed to determine ASVs using the following steps.

Quality analysis to know the clipping threshold
Sequence filtering and trimming. It is convenient to ensure that the Paired End reads overlap at least 12 nucleotides and that the quality of the read does not fall below 25. Also, to avoid problems in the determination of chimeras it is convenient to eliminate the first nucleotides as they will normally correspond to the primers that have been used in the 16S amplification.
Determination of the error pattern. As mentioned above, DADA will differentiate sequences that differ by only one nucleotide as different ASVs. Therefore, it is convenient to train the algorithm with a percentage of the reads to determine the variations by transition (changes of a purine nucleotide for another purine) or by transversion (a purine base for another pyrimidine base or vice versa) depending on the quality of the sequence.
Determination of reads with distinctive sequences
Paired-ends joining (so far we have analysed Forward and Reverse sequences separately).
Elimination of chimeras. Chimeric sequences are identified if they can be exactly reconstructed by combining a left and a right segment from two more abundant "parent" sequences (this is why it is important to remove the oligonucleotides from the primers, to avoid false positives).

All these steps are executed in QIIME2 with a single command:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trim-left-f 23 \
  --p-trunc-len-f 265 \
  --p-trim-left-r 21 \
  --p-trunc-len-r 220 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-denoising-stats stats.qza \
  --p-n-threads 2 \
  --p-n-reads-learn 15000

Where:

--i-demultiplexed-seqs paired-end-demux.qza: artifact name with sequences.
--p-trim-left-f 23: number of nucleotides to be trimmed by 5' in Forward reads
--p-trunc-len-f 265: position from which we trim the Forward reads
--p-trim-left-r 21: number of nucleotides to be trimmed by 5' in the Reverse reads
--p-trunc-len-r 220: position from which to trim the Reverse reads
--o-representative-sequences rep-seqs.qza: name of the output file where the sequences of each ASV are stored
--o-table table.qza: name of the output file name of the table where the abundance of each ASV per sample is stored
--o-denoising-stats stats.qza: name of the output file of the table where the information about the evolution of the process is stored
--p-n-threads 2: number of processes that can run in parallel
--p-n-reads-learn 15000: number of sequences that will be used to determine the error rate (the higher the number, the better the model but the longer the computation time, 25% of the total number of sequences may be enough).

We can visualise the results by running the following commands:

qiime metadata tabulate \
  --m-input-file stats.qza \
  --o-visualization stats.qzv

qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table.qzv \
  --m-sample-metadata-file metadata

qiime feature-table tabulate-seqs \
  --i-data rep-seqs.qza \
  --o-visualization rep-seqs.qzv

But instead of using this commands I prefer to run DADA2 in R from the very beginning so we are going to follow the protocol you can find in this file dada.rmd

Importing data from DADA output into QIIME2

As output of the script in R we get three files:

rep-seqs.fna: containing the nucleotide sequences of each ASV
seqtab-nochim: table containing the abundance of ASVs in each sample
stats.txt: a log table showing how the reads have been filtered in each step

To follow with the analysis we must convert the sequences file and the abundance file in artifact files so that we can use them in QIIME2. To convert the sequences we have to use the command qiime tools import (don't forget to activate qiime2 environment in terminal):

qiime tools import \
      --input-path rep-seqs.fna \
      --type 'FeatureData[Sequence]' \
      --output-path rep-seqs.qza

As we are introducing ASV sequences the --type argument must be 'FeatureData[Sequence]'.

Before importing the table of ASVs we have to modify a little bit the file.

echo -n "#OTU Table" | cat - seqtab-nochim.txt > biom-table.txt

With this pipe we send to the standar input a line containing "#OTU Table", then we add the content of seqtab-nochim.txt and finally we save it as biom-table.txt. Use the command head to compare seqtab-nochim.txt and biom-table.txt

TTables, especially those containing large amounts of data, are not easily interpretable by computers. To speed up access to them and quick retrieval of data, we have to index them. Biom type files are just that, small databases where the abundance data of each ASV in each sample is organised. Let's convert our table into such a file using the biom command.

 biom convert -i biom-table.txt -o table.biom --table-type="OTU table" --to-hdf5

As output we get the file table.biom with the information organized for QIIME2. I know that I am using "OTU table" instead of "ASV table" but it is a standard.

Now we are in conditions to import our ASVs table into QIIME2.

qiime tools import \
      --input-path table.biom \
      --type 'FeatureTable[Frequency]' \
      --input-format BIOMV210Format \
      --output-path table.qza

In this case in the --type argument we are indicating the nature of the date (a frequency table). Here feature is a synonym for ASV.

Once we have the artifact we can get the visualizaton files using the command qiime feature-table.

qiime feature-table summarize \
      --i-table table.qza \
      --o-visualization table.qzv \
      --m-sample-metadata-file metadata

qiime feature-table tabulate-seqs \
      --i-data rep-seqs.qza \
      --o-visualization rep-seqs.qzv

For the ASVs table we have to introduce as argument the metadata file so QIIME2 can separate properly the ASV among samples. Let's have a look to table.qzv and rep-seqs.qzv.

Removal of singletons and low-frequency ASVs (optional)

Based on the above visualisation, a threshold can be set to determine how frequent an ASV has to be (and optionally how many samples must have the ASV) to keep it from the results tables. One of the possible options would be to eliminate those ASVs that have a frequency below 0.1% of the mean depth. This threshold would exclude those ASVs that are due to Illumina sequencing errors (0.1% of reads). To calculate the threshold, find the mean frequency per sample, multiply that value by 0.001 and round the result. In this case, the average frequency is 75665 sequences so the cut-off would be 75.

To filter the table we are using the command qiime feature-table filter-features:

qiime feature-table filter-features --i-table table.qza \
                                    --p-min-frequency 75 \
                                    --p-min-samples 1 \
                                    --o-filtered-table table_filt.qza

In --p-min-frequency we must introduce our threshold and in --p-min-samples we introduce the minimum number of samples that a feature must be observed in to be retained.

Since the nucleotide sequences of the ASVs are in another artefact (rep-seqs.qza) it is appropriate to exclude low frequency ASVs from the representative sequences file.

qiime feature-table filter-seqs --i-data rep-seqs.qza \
                                --i-table table_filt.qza \
                                --o-filtered-data rep-seqs_filt.qza

We can generate a new visualization file of the table running the following qiime feature-table summarize:

qiime feature-table summarize \
      --i-table table_filt.qza \
      --o-visualization table_filt.qzv

Phylogenetic distances determination using MAFFT and FastTree

To get an idea of the similarity/dissimilarity of the ASVs to each other, we aligned the representative sequences using the MAFFT algorithm for alignment and FastTree to build the phylogenetic tree. We have to use the command qiime phylogeny align-to-tree-mafft-fasttree

qiime phylogeny align-to-tree-mafft-fasttree \
                --i-sequences rep-seqs_filt.qza \
                --o-alignment aligned-rep-seqs.qza \
                --o-masked-alignment masked-aligned-rep-seqs.qza \
                --o-tree unrooted-tree.qza \
                --o-rooted-tree rooted-tree.qza

Being:

--i-sequences rep-seqs_filt.qza: file with the sequences of the ASVs.
--o-alignment aligned-rep-seqs.qza: output file with the aligned sequences

--o-masked-alignment masked-aligned-rep-seqs.qza: In the 16S alignments, hypervariable regions can appear that lead to noise in the analysis. With this parameter we set the output file of the masked sequences.

--o-tree unrooted-tree.qza: output file with the calculated tree (without root)

--o-rooted-tree rooted-tree.qza: output file of the rooted tree. One way to add a root to the tree is to set it to the branch with the largest tip-to-tip (midpoint of the largest tip-to-tip).

This rooted-tree.qza is the artifact we need to run the diversity analyses.

Taxonomic assignment

At this moment we have the reads grouped as ASVs and this can give us information on population diversity, but what interests us is to also know the taxonomic assignment of each ASV. In order to do this, we carry out the following steps.

How to get taxonomic databases

Depending on the amplicon we are using we need a database of sequences and their corresponding taxonomic assignations. The most common sources are found in:

Unite database (https://unite.ut.ee):a database and sequence management environment centered on the eukaryotic nuclear ribosomal ITS region. This database is the preferred when analysing fungi.
Silva database (https://www.arb-silva.de): provides a dataset of aligned small (16S/18S) and large subunit (23S/28S) ribosomal RNA sequences for all thre domains of life (Bacteria, Archaea and Eukarya). Current version is 138.1 but since 132 the tree of life in this database has been arranged to fit the new knowledge coming from genome comparisons, so certain taxons have changed. For example, Burkholderia formerly in beta-proteobacteria is found in gamma-proteobacteria. Thus, results have to be revised in order to arcertain those changes.

Additional information on databases can be found in QIIME2 page https://docs.qiime2.org/2022.2/data-resources/

Last SILVA version for QIIME2 is 138.1 and can be obtained in https://docs.qiime2.org/2022.2/data-resources/. THe database is curated and ready to be used. However, there is a new plugin (RESCRIPt)has been developed to get the last versions and generate the corresponding artifacts (https://forum.qiime2.org/t/processing-filtering-and-evaluating-reference-sequence-data-with-rescript/15494#heading--second-header).

To install rescript within qiime2 environment run the following commands (see https://github.com/bokulich-lab/RESCRIPt)

conda install -c conda-forge -c bioconda -c qiime2 -c defaults xmltodict
pip install git+https://github.com/bokulich-lab/RESCRIPt.git

We must first get the database, in this case v. 138 SSURef_NR99 (99% homology)

qiime rescript get-silva-data \
      --p-version '138' \
      --p-target 'SSURef_NR99' \
      --p-include-species-labels \
      --o-silva-sequences silva-138-ssu-nr99-seqs.qza \
      --o-silva-taxonomy silva-138-ssu-nr99-tax.qza

Now we must remove sequences containing ambiguous bases.

qiime rescript cull-seqs \
      --i-sequences silva-138-ssu-nr99-seqs.qza \
      --o-clean-sequences silva-138-ssu-nr99-seqs-cleaned.qza

SILVA database contain partial sequences of the SSU, but depending on the Kingdom the length of 'good' sequences can vary. To remove small and unuseful sequences we can filter them according to their taxon with the following command that will remove rRNA gene sequences that do not meet the following criteria: Archaea (16S) >= 900 bp, Bacteria (16S) >= 1200 bp, and any Eukaryota (18S) >= 1400 bp.

qiime rescript filter-seqs-length-by-taxon \
      --i-sequences silva-138-ssu-nr99-seqs-cleaned.qza \
      --i-taxonomy silva-138-ssu-nr99-tax.qza \
      --p-labels Archaea Bacteria Eukaryota \
      --p-min-lens 900 1200 1400 \
      --o-filtered-seqs silva-138-ssu-nr99-seqs-filt.qza \
      --o-discarded-seqs silva-138-ssu-nr99-seqs-discard.qza

Finally, SILVA can contain identical sequences but with different taxonomy that should be dereplicated to avoid ambigous assignation. I recommend to read the consideration on dereplication in https://forum.qiime2.org/t/processing-filtering-and-evaluating-the-silva-database-and-other-reference-sequence-data-with-rescript/15494#heading--second-header.

qiime rescript dereplicate \
      --i-sequences silva-138-ssu-nr99-seqs-filt.qza  \
      --i-taxa silva-138-ssu-nr99-tax.qza \
      --p-rank-handles 'silva' \
      --p-mode 'uniq' \
      --o-dereplicated-sequences silva-138-ssu-nr99-seqs-derep-uniq.qza \
      --o-dereplicated-taxa silva-138-ssu-nr99-tax-derep-uniq.qza

silva-138-ssu-nr99-seqs-derep-uniq.qza and silva-138-ssu-nr99-tax-derep-uniq.qza will be equivalent to the 85_otus.qza and ref-taxonomy.qza we are using below.

Assignment database training

For this session we are going to use a small database in order to accelerate the taxonomic assignment process in the workshop. In particular we are going to use the Greegenes database at 85% (the sequences are grouped to a homology of 85%). The most appropriate would have been to use the 99% SILVA database.

We import the information into the corresponding artifacts. First the sequences as FeatureData[Sequence].

qiime tools import \
      --type 'FeatureData[Sequence]' \
      --input-path 85_otus.fasta \
      --output-path 85_otus.qza

Then the taxonomic assignment data as FeatureData[Taxonomy].

qiime tools import \
     --type 'FeatureData[Taxonomy]' \
     --input-format HeaderlessTSVTaxonomyFormat \
     --input-path 85_otu_taxonomy.txt \
     --output-path ref-taxonomy.qza

The database consists of complete marker sequences, in this case the 16S sequence of ribosomal RNA. However, the amplicons only cover a small fraction of the size of the gene. In order to avoid errors in the assignment, we will train our classifier to contain only the part of the sequence we have amplified. In our tutorial dataset, primers 341F (5′-CCTACGGGGGNGGCWGCAG-3′) and 785R (5′-GACTACHVGGGTATCTAATCC-3′) that amplify the V3-V4 region (about 440 nucleotides long) have been used. Therefore, we will extract from the sequencing artefact the corresponding region with the following command.

qiime feature-classifier extract-reads \
      --i-sequences 85_otus.qza \
      --p-f-primer CCTACGGGNGGCWGCAG \
      --p-r-primer GACTACHVGGGTATCTAATCC \
      --p-min-length 250 \
      --p-max-length 490 \
      --o-reads ref-seqs.qza

As output, we will have ref-seqs.qza containing the sequences from 85_otus.qza that can be amplified using those primers allowing a length range from 100 to 490 nts.

We can now generate the final classifier, in which we associate the trimmed database sequences with their corresponding taxonomic assignment using qiime feature-classifier fit-classifier-naive-bayes command. This step is taking some time as we are training a statistically robust model that will be able to predict the taxonomic assignation for an ASV.

qiime feature-classifier fit-classifier-naive-bayes \
      --i-reference-reads ref-seqs.qza \
      --i-reference-taxonomy ref-taxonomy.qza \
      --o-classifier classifier.qza

Taxonomic assignation

With our classifier already generated, we proceed to do the taxonomic assignment of the representative sequences of each ASV. The information will be stored in the taxa directory in the classification.qza artefact file.

qiime feature-classifier classify-sklearn --i-reads rep-seqs_filt.qza \
                                          --i-classifier classifier.qza \
                                          --p-n-jobs 2 \
                                          --output-dir taxa

A value of two in --p-n-jobs indicates the system to parralelize the process so 2 ASVs can be analised at the same time.

The assignment table can be exported to a tabular file (taxonomy.tsv) in which there is a column with the identifier of each ASV, another column with the corresponding taxonomic assignment and a last one in which the degree of confidence of the assignment is reflected (the closer to 1 the better). To do this, we execute the following command:

qiime tools export --input-path taxa/classification.qza --output-path taxa

However, what we are interested in is knowing the abundance of each ASV in each case and having a graph that represents it (it is important to indicate the metadata file so that each ASV can be grouped in the corresponding sample).

qiime taxa barplot --i-table table_filt.qza \
                   --i-taxonomy taxa/classification.qza \
                   --m-metadata-file metadata \
                   --o-visualization taxa/taxa_barplot.qzv

When viewing the file taxa/taxa_barplot.qzv, the following screen is displayed

We will be able to change at will:

The taxonomic level of representation (those assignments that cannot be resolved across taxa will appear as _ in the unresolved taxon):
- Level 1: Kingdom
- Level 2: Phylum
- Level 3: Class
- Level 4: Order
- Level 5: Family
- Level 6: Genus
- Level 7: Species
Sort the samples against the abundance of a taxon or by the different metadata we have entered in the metadata.txt file.
Change colour palettes (though the number of colours are few)

However, the graph shows the proportion of ASVs for all samples but we can group by the categories we have set in the Metadata file. For example we can group against the plant variable:

qiime feature-table group --i-table table_filt.qza \
                          --p-axis sample \
                          --p-mode sum \
                          --m-metadata-file metadata \
                          --m-metadata-column Plant \
                          --o-grouped-table table_filt_Plant.qza

With --p-mode we choose the mode of grouping the samples.

sum will sum the frequencies across all samples or features within a group
mean-ceiling will take the ceiling of the mean of these frequencies
median-ceiling will take the ceiling of the median of these frequencies

We can plot the result. But now the SampleID is not the SRR file but the content of Plant column, ('Bulk' , 'DRT', Fresh', 'SOJ'). Let's prepare a new metadata file (metadataPlant) containing just those #SampleID. Once we have it we can run the command to plot the result.

qiime taxa barplot --i-table table_filt_Plant.qza \
                   --i-taxonomy taxa/classification.qza \
                   --m-metadata-file metadataPlant \
                   --o-visualization taxa/taxa_barplot_Plant.qzv

Study of the Diversity

In an indirect way, we have already visualised the variations in the diversity of the samples, but by means of a qiime taxa barplot, though we do not have the statistical values that can tell us if the differences are significant. To obtain a statistical value of the diversity of the samples in the study, we have to take into account two concepts:

Alpha diversity: this refers to the variation of populations within each sample. We can use it as an indicator to determine if the depth of the study (see number of reads obtained) covers all the taxa in the sample.
Beta Diversity: refers to the variation in populations between samples. It can be used to determine if samples are very similar, or if population differences are significant.

Rarefaction curves

The depth or number of reads in a sample does not necessarily represent the entire biodiversity of a sample. To determine the degree of representability, several tests can be performed. One of them is the rarefaction test, in which random samples are taken from an experiment by modifying the depth to determine the number of ASVs that appear. There are three scenarios

Anne Schöler. Biology and Fertility of Soils. July 2017, Volume 53, Issue 5, pp 485–489

With the data obtained we can plot a curve (number of reads vs. number of ASVs) that will grow as the number of reads increases. We expect this growth to stop at a point where an increase in the number of reads will no longer provide information on biodiversity. To obtain these curves we will use the following command.

qiime diversity alpha-rarefaction --i-table table_filt.qza \
                                  --p-max-depth 122000 \
                                  --p-steps 20 \
                                  --i-phylogeny rooted-tree.qza \
                                  --m-metadata-file metadata \
                                  --o-visualization rarefaction_curves.qzv

In the --p-max-depth parameter we have to indicate the maximum depth of reads to be evaluated. Therefore we have to enter the number of reads that the most abundant sample contains. To do this check table.qzv. With --p-steps we indicate the number of samples the command has to do. In this case as we are evaluating aroun 123000 sequences, the command will count the number of ASVs each 6000 sequences.

Core Diversity determination

QIIME2 has the "diversity" plugin with which most of the currently used diversity indexes can be obtained (see https://forum.qiime2.org/t/alpha-and-beta-diversity-explanations-and-commands/2282). Let's run a command with the minimum argumentsto obtain alpha indexes such as Shannon and beta indexes such as Unifrac, Jaccard...

qiime diversity core-metrics-phylogenetic --i-table table_filt.qza \
                                          --i-phylogeny rooted-tree.qza \
                                          --p-sampling-depth 25000 \
                                          --m-metadata-file metadata \
                                          --p-n-jobs-or-threads 2 \
                                          --output-dir diversity

In this case as we want to compare all the samples in the same conditions as we cannot use the reads amount in the larger sample as sampling depth value, we must use the smaller sample as reference.

As output we will get the diversity directory where we can find different distances tables and their corresponding representations on PCoA (Principal Coordinate Analysis) plots.

The Kruskal-Wallis test (by William Kruskal and W. Allen Wallis) is a non-parametric method that can be used to determine whether two samples have the same diversity according to their Shannon index. Intuitively, it is identical to ANOVA but in this case the data are replaced by categories. It is not the only test to check whether two populations are different. QIIME2 has other pairwise comparisons that are explained in https://docs.qiime2.org/2022.11/tutorials/longitudinal/.

qiime diversity alpha-group-significance --i-alpha-diversity diversity/shannon_vector.qza \
                                     --m-metadata-file metadata \
                                     --o-visualization diversity/shannon_compare_groups.qzv

However, this test only indicates whether there are differences between samples in terms of their richness of diversity and does not take into account their composition (beta diversity). If we want to check the pairwise differences among samples we can the command qiime diversity beta-group-significance visualizer as follows,

qiime diversity beta-group-significance --i-distance-matrix diversity/weighted_unifrac_distance_matrix.qza \
                                        --m-metadata-file metadata \
                                        --m-metadata-column Plant \
                                        --o-visualization diversity/weighted_unifrac_plant_significance.qzv \
                                        --p-method permanova \
                                        --p-pairwise

In this case we are using as distance matrix the unweighted unifrac data and we are comparing samples by Plant column, using pairwise PERMANOVA analysis (https://en.wikipedia.org/wiki/Permutational_analysis_of_variance). Additional methods available are ANOSIM (https://en.wikipedia.org/wiki/Permutational_analysis_of_variance) and PERMDISP that is usually run wiht PERMANOVA to check whether the dissimilarities found in PERMANOVA were due to large within-group dispersions. Finally, there is a critical parameter to compute the p-value, --p-permutations, that is, the number of permutations. I have not included it in the command as by default is running 999 permutations, but we can increase that value to improve p-value determination. WHen describing the test in an article you must include the number of permutations, in a similar way you do when describing the bootstrap number for a phylogenetic tree.

What about noise?

Though, we have filtered to remove low abundance seqs, we might have lost biological information or or the threshold used may not have been appropriate or sufficient. Instead of eliminating ASVs by abundance we could also eliminate them by their presence throughout all samples. Por ejemplo podríamos eliminar aquellas que no aparecieran en al menos un tercio de las muestras (~5).

qiime feature-table filter-features \
      --i-table table.qza \
      --p-min-samples 5 \
      --o-filtered-table feature-frequency-filtered-table.qza

Try to repeat the diversity analyses including taxonomy plots and rarefaction curves using feature-frequency-filtered-table.qza