Amplicon SOP v2 (qiime2 2018.6) - LangilleLab/microbiome_helper GitHub Wiki

Please note that out current efforts for distributing virtual environments are focused on shotgun metagenomics analyses instead of amplicon analyses. If you are interested in getting an updated virtual environment for running a QIIME 2 workflow you can download a docker image available from the QIIME 2 website. Our virtual box image can also be used, but is not for the latest version of QIIME 2.

This workflow is based on the QIIME2 tutorials and is meant for users who want to quickly run their amplicon data through the Microbiome Helper virtual box and for internal use. There are more detailed descriptions and tutorials on the QIIME2 website, which we recommend you check out!

If you use this workflow make sure to keep track of the commands you use locally as this page will be updated over time (see "revisions" above for earlier versions).

This workflow assumes that 4 CPUs are available. Also, it's assumed that the input is raw paired-end MiSeq data in demultiplexed FASTQ format located within a folder called raw_data. The filenames are expected to look like this: 105CHE6WT_S325_L001_R1_001.fastq.gz, where each field separated by an _ character corresponds to: (1) the sample name, (2) the sample# on the run, (3) the lane number, (4) the read number (R1 = forward, R2 = reverse), and (5) the set number. You may need to re-name your files to match this format; however, QIIME2 accepts many different input formats so the format and naming scheme you're using may also be supported.

1. First steps

1.1 Format metadata file

You can format the metadata file (which is in a compatible format to the QIIME1 mapping files) in a spreadsheet and validate the format using a google spreadsheet plugin as described on the QIIME2 website. The only required column is the sample id column, which should be first. All other columns should correspond to sample metadata. Below, we use the default filename of metadata.txt.

1.2 Activate conda environment

You should run this workflow in a conda environment, which makes sure the correct version of the Python packages required by QIIME2 are being used. You can activate this conda environment with this command (this is the qiime2 version on our virtual box images, but it should work with other versions as well):

source activate qiime2-2018.6

1.3 Inspect read quality

Run FastQC to allow manual inspection of sequence quality. This is important as a sanity check that your reads are of reasonable quality and to determine how your reads should be trimmed in downstream steps. In particular, you'll want to determine at what length the reads can be trimmed so that low quality bases near the end of the reads are discarded while sufficient reads remain overlapping between the forward and reverse reads. You can run FastQC with this command:

mkdir fastqc_out
fastqc -t 4 raw_data/* -o fastqc_out/

1.4 Trim primers with cutadapt

Screen out reads that do not begin with primer sequence and remove primer sequence from reads using cutadapt. The below primers correspond to the 16S V6-V8 region (Bacteria-specific primer set).

mkdir primer_trimmed_fastqs

parallel --link --jobs 4 \
  'cutadapt \
    --pair-filter any \
    --no-indels \
    --discard-untrimmed \
    -g ACGCGHNRAACCTTACC \
    -G ACGGGCRGTGWGTRCAA \
    -o primer_trimmed_fastqs/{1/} \
    -p primer_trimmed_fastqs/{2/} \
    {1} {2} \
    > primer_trimmed_fastqs/{1/}_cutadapt_log.txt' \
  ::: raw_data/*_R1_*.fastq.gz ::: raw_data/*_R2_*.fastq.gz

If using 16S V4/V5 primers, use these options in the above command:

    -g GTGYCAGCMGCCGCGGTAA \
    -G CCGYCAATTYMTTTRAGTTT \

If using 16S V6/V8 Archaeal primers, use these options in the above command:

    -g TYAATYGGANTCAACRCC \
    -G CRGTGWGTRCAAGGRGCA \

If using 18S V4 primers, use these options in the above command:

    -g CYGCGGTAATTCCAGCTC \
    -G AYGGTATCTRATCRTCTTYG \

If using Fungal ITS2 primers, use these options in the above command:

    -g GTGAATCATCGAATCTTTGAA \
    -G TCCTCCGCTTATTGATATGC \

Each sample's logfile created above can be parsed to put the counts of passing reads into a single table (cutadapt_log.txt by default).

parse_cutadapt_logs.py -i primer_trimmed_fastqs/*log.txt

Lastly, remove the intermediate cutadapt logfiles:

rm primer_trimmed_fastqs/*log.txt

1.5 Import primer trimmed FASTQs as QIIME2 artifact

To keep the directory clean you can put the artifact files in a new directory

mkdir reads_qza
    
qiime tools import --type SampleData[PairedEndSequencesWithQuality] \
                   --input-path primer_trimmed_fastqs \
                   --output-path reads_qza/reads_trimmed.qza \
                   --source-format CasavaOneEightSingleLanePerSampleDirFmt

1.6 Generating table of sample depths at each step

The script qiime2_fastq_lengths.py, which is part of the Microbiome Helper repo, will output a table of the number of reads in each FASTQ found within a set of QZA files. In other words, it's a way to summarize how read depths change at multiple stages along the workflow.

The command takes on this structure (where at least one QZA file containing FASTQ files needs to be specified):

qiime2_fastq_lengths.py [QZA1] [QZA2] ... [QZAN] --proc 4 -o read_counts.tsv

At this stage the main 2 pipelines you can use are based on either deblur or DADA2.

2A. Prepping and running deblur workflow (Option 1):

2A.1 Join paired-end reads

Currently deblur doesn't support unjoined paired-end reads. Forward and reverse reads can be joined with VSEARCH:

qiime vsearch join-pairs --i-demultiplexed-seqs reads_qza/reads_trimmed.qza \
                         --o-joined-sequences reads_qza/reads_trimmed_joined.qza

2A.2 Filter out low-quality reads.

This command will filter out low-quality reads based on the default options.

qiime quality-filter q-score-joined --i-demux reads_qza/reads_trimmed_joined.qza \
                                    --o-filter-stats filt_stats.qza \
                                    --o-filtered-sequences reads_qza/reads_trimmed_joined_filt.qza

2A.3 Running deblur

Run the deblur workflow to correct reads and get amplicon sequence variants (ASVs). Note that the below command will retain singletons, which would have been filtered out unless we set --p-min-reads 1, and is for 16S sequences only. For other amplicon regions, you can either use the denoise-other option in the command and specify a reference database of sequences to use for positive filtering (as in the below versions for 18S and ITS) or use DADA2 as described below (the former still being preferred due to speed considerations).

Note that you will need to trim all sequences to the same length with the --p-trim-length option if you get an error. In order to determine the correct length to trim down to, run the following QC summary on the joined reads:

qiime demux summarize --i-data reads_qza/reads_trimmed_joined_filt.qza --o-visualization reads_trimmed_joined_filt_summary.qzv

You would then use the generated visualization (Interactive Quality Plot tab) in order to select a length to trim back to that maintains the largest/acceptable quantity of reads. Finally, input that length in the deblur option --p-trim-length below and rerun the command.

qiime deblur denoise-16S --i-demultiplexed-seqs reads_qza/reads_trimmed_joined_filt.qza \
                         --p-trim-length -1 \
                         --p-sample-stats \
                         --p-jobs-to-start 4 \
                         --p-min-reads 1 \
                         --output-dir deblur_output

For the 18S version (note the first command imports the deblur database and only needs to be run once on your system):

qiime tools import \
  --input-path /home/shared/rRNA_db/gb203_pr2_all_10_28_99p_clean.fasta \
  --output-path /home/shared/rRNA_db/gb203_pr2_all_10_28_99p_clean.qza \
  --type 'FeatureData[Sequence]'

qiime deblur denoise-other --i-demultiplexed-seqs reads_qza/reads_trimmed_joined_filt.qza \
                           --i-reference-seqs /home/shared/rRNA_db/gb203_pr2_all_10_28_99p_clean.qza \
                           --p-trim-length -1 \
                           --p-sample-stats \
                           --p-jobs-to-start 4 \
                           --p-min-reads 1 \
                           --output-dir deblur_output

For the ITS version (note the first command imports the deblur database and only needs to be run once on your system):

qiime tools import \
  --input-path /home/shared/rRNA_db/UNITE_sh_refs_qiime_ver7_dynamic_20.11.2016.fasta \
  --output-path /home/shared/rRNA_db/UNITE_sh_refs_qiime_ver7_dynamic_20.11.2016.qza \
  --type 'FeatureData[Sequence]'

qiime deblur denoise-other --i-demultiplexed-seqs reads_qza/reads_trimmed_joined_filt.qza \
                           --i-reference-seqs /home/shared/rRNA_db/UNITE_sh_refs_qiime_ver7_dynamic_20.11.2016.qza \
                           --p-trim-length -1 \
                           --p-sample-stats \
                           --p-jobs-to-start 4 \
                           --p-min-reads 1 \
                           --output-dir deblur_output

2B. Running DADA2 workflow (Option 2):

Running the DADA2 pipeline is more straight-forward than the above, but is slower.

2B.1 Running DADA2

qiime dada2 denoise-paired --i-demultiplexed-seqs reads_qza/reads_trimmed.qza \
                           --p-trunc-len-f 270 \
                           --p-trunc-len-r 210 \
                           --p-max-ee 3 \
                           --p-n-threads 4 \
                           --output-dir dada2_output

2B.2 Convert stats QZA to TSV

You should take a look at the read count table to see how many reads were retained at each step of the DADA2 pipeline:

qiime tools export dada2_output/denoising_stats.qza --output-dir dada2_output

3. Filtering resultant table

3.1 Summarize output table

Once a denoising pipeline has been run you can summarize the output table with the below command, which will create a visualization artifact for your to view. You should use this visualization to decide how to filter your table and also if you want to rarefy your data (which simplifies your analyses, but throws out data!) you can choose a cut-off based on this file.

qiime feature-table summarize --i-table deblur_output/table.qza --o-visualization deblur_output/deblur_table_summary.qzv

3.2 Filter out rare ASVs

Based on the above summary visualization you can choose a cut-off for how frequent a variant needs to be (and optionally how many samples need to have the variant) for it to be retained. One possible choice would be to remove all ASVs that have a frequency of less than 0.1% of the mean sample depth. This cut-off excludes ASVs that are likely due to MiSeq bleed-through between runs (reported by Illumina to be 0.1% of reads). To calculate this cut-off you would identify the mean sample depth in the above visualization, multiply it by 0.001, and round to the nearest integer.

Once you've determined how you would like to filter your table you can do so with this command (X is a placeholder for your choice):

qiime feature-table filter-features --i-table deblur_output/table.qza \
                                    --p-min-frequency X \
                                    --p-min-samples 1 \
                                    --o-filtered-table deblur_output/deblur_table_filt.qza

Since the ASVs will be in a separate FASTA file you can exclude the low frequency ASVs from the sequence file with this command:

qiime feature-table filter-seqs --i-data deblur_output/representative_sequences.qza \
                                --i-table deblur_output/deblur_table_filt.qza \
                                --o-filtered-data deblur_output/rep_seqs_filt.qza

Finally, you can make a new summary of the filtered abundance table:

qiime feature-table summarize --i-table deblur_output/deblur_table_filt.qza --o-visualization deblur_output/deblur_table_filt_summary.qzv

4. Build quick phylogeny with FastTree

4.1 Making multiple-sequence alignment

We'll need to make a multiple-sequence alignment of the ASVs before running FastTree. First, we'll make a folder for the output files.

mkdir tree_out

We'll use MAFFT to make a de novo multiple-sequence alignment of the ASVs.

qiime alignment mafft --i-sequences deblur_output/rep_seqs_filt.qza \
                      --p-n-threads 4 \
                      --o-alignment tree_out/rep_seqs_filt_aligned.qza

4.2 Filtering multiple-sequence alignment

Variable positions in the alignment need to be masked before FastTree is run, which can be done with this command:

qiime alignment mask --i-alignment tree_out/rep_seqs_filt_aligned.qza \
  --o-masked-alignment tree_out/rep_seqs_filt_aligned_masked.qza

4.3 Running FastTree

Finally FastTree can be run on this masked multiple-sequence alignment:

qiime phylogeny fasttree --i-alignment tree_out/rep_seqs_filt_aligned_masked.qza \
                         --p-n-threads 4 \
                         --o-tree tree_out/rep_seqs_filt_aligned_masked_tree

4.4 Add root to tree

FastTree returns an unrooted tree. One basic way to add a root to a tree is to add it add it at the midpoint of the largest tip-to-tip distance in the tree, which is done with this command:

qiime phylogeny midpoint-root --i-tree tree_out/rep_seqs_filt_aligned_masked_tree.qza \
                              --o-rooted-tree tree_out/rep_seqs_filt_aligned_masked_tree_rooted.qza

5. Generate rarefaction curves

A key quality control step is to plot rarefaction curves for all of your samples to determine if you performed sufficient sequencing. The below command will generate these plots (X is a placeholder for the maximum depth in your dataset, which you can determine by running the summarize command above).

qiime diversity alpha-rarefaction --i-table deblur_output/deblur_table_filt.qza \
                                  --p-max-depth X \
                                  --p-steps 20 \
                                  --i-phylogeny tree_out/rep_seqs_filt_aligned_masked_tree_rooted.qza \
                                  --m-metadata-file metadata.txt \
                                  --o-visualization rarefaction_curves.qzv

For some reason, the QIIME2 default in the above curves with the metadata file (which you can see in the visualization) is to not give you the option of seeing each sample's rarefaction curve individually (even though this is the default later on in stacked barplots!), only the "grouped" curves by each metadata type. As it can be quite important in data QC to see if you have inconsistent samples, we need to rerun the above command, but this time omitting the metadata file (use the same X for the maximum depth, as above).

qiime diversity alpha-rarefaction --i-table deblur_output/deblur_table_filt.qza \
                                  --p-max-depth X \
                                  --p-steps 20 \
                                  --i-phylogeny tree_out/rep_seqs_filt_aligned_masked_tree_rooted.qza \
                                  --o-visualization rarefaction_curves_eachsample.qzv

6. Assign taxonomy

You can assign taxonomy to your ASVs using a Naive-Bayes approach implemented in the scikit learn Python library and the SILVA database. Note that we have trained classifiers for a few different amplicon regions already (which are available in the /home/shared/taxa_classifiers folder), but you will need to generate your own if your region of interest isn't there. The classifier filename below is for the V6/V8 B969F-BA1406R region. The regions that we have trained classifiers for are:

16S V4/V5 region (classifier_silva_132_99_16S_V4.V5_515F_926R.qza)
16S V6/V8 region (classifier_silva_132_99_16S_V6.V8_B969F_BA1406R.qza)
16S V6/V8 region targeting archaea (classifier_silva_132_99_16S_V6.V8_A956F_A1401R.qza)
16S V3/V4 region targeting cyanobacteria (classifier_silva_132_99_16S_V3.V4_CYA359F_CYA781R.qza)
18S V4 region (classifier_silva_132_99_18S_V4_E572F_E1009R.qza)
Full ITS (classifier_sh_refs_qiime_ver7_99_s_01.12.2017_ITS.qza)

Note that this is a memory intensive job. If you receive an error related to insufficient memory (and if you cannot increase your memory usage) then you can look into the --p-reads-per-batch option and set this to be lower than the default (which is dynamic depending on sample depth and the number of threads) and also try running the command with fewer jobs (e.g. set --p-n-jobs 1).

qiime feature-classifier classify-sklearn --i-reads deblur_output/rep_seqs_filt.qza \
                                          --i-classifier /home/shared/taxa_classifiers/classifier_silva_132_99_16S_V6.V8_B969F_BA1406R.qza \
                                          --p-n-jobs 4 \
                                          --output-dir taxa

As with all QZA files, you can export the output file to take a look at the classifications and confidence scores:

mkdir taxa

qiime tools export taxa/classification.qza --output-dir taxa

A more useful output is the interactive stacked bar-charts of the taxonomic abundances across samples, which can be output with this command:

qiime taxa barplot --i-table deblur_output/deblur_table_filt.qza \
                   --i-taxonomy taxa/classification.qza \
                   --m-metadata-file metadata.txt \
                   --o-visualization taxa/taxa_barplot.qzv

Now the QIIME2 default in the above barplots is to plot each sample individually (opposite of what it was for the rarefaction curves!). However, in the visualizer you can label the samples according to their metadata types, but the plots are not summed together according to their metadata (as we could do in QIIME1 using the summarize_taxa_through_plots.py script), so we have to rerun the above command after running a new command first to group the samples by metadata - this creates a new feature table (you have to do this for each metadata category) which becomes the new input for the above same taxa barplot command (again, run each time for each metadata category). Note that CATEGORY here below is a placeholder for the text label of your category of interest from the metadata file.

qiime feature-table group --i-table deblur_output/deblur_table_filt.qza \
                          --p-axis sample \
                          --p-mode sum \
                          --m-metadata-file metadata.txt \
                          --m-metadata-column CATEGORY \
                          --o-grouped-table deblur_output/deblur_table_filt_CATEGORY.qza

qiime taxa barplot --i-table deblur_output/deblur_table_filt_CATEGORY.qza \
                   --i-taxonomy taxa/classification.qza \
                   --m-metadata-file metadata.txt \
                   --o-visualization taxa/taxa_barplot_CATEGORY.qzv

7. Calculating diversity metrics and generating ordination plots

Common alpha and beta-diversity metrics can be calculated with a single command in QIIME2. In addition, ordination plots (such as PCoA plots for weighted UniFrac distances) will be generated automatically as well. This command will also rarefy all samples to the sample sequencing depth before calculating these metrics (X is a placeholder for the lowest reasonable sample depth; samples with depth below this cut-off will be excluded).

qiime diversity core-metrics-phylogenetic --i-table deblur_output/deblur_table_filt.qza \
                                          --i-phylogeny tree_out/rep_seqs_filt_aligned_masked_tree_rooted.qza \
                                          --p-sampling-depth X \
                                          --m-metadata-file metadata.txt \
                                          --p-n-jobs 4 \
                                          --output-dir diversity

You can then produce an alpha-diversity Shannon metric boxplot comparing the different categories in your metadata file using the following command (note that all the individual Shannon values for each sample are inside the shannon_vector.qza input file, if you wish to see them, and QIIME2 has many different metrics if you wish to use them).

qiime diversity alpha-group-significance --i-alpha-diversity diversity/shannon_vector.qza \
                                         --m-metadata-file metadata.txt \
                                         --o-visualization diversity/shannon_compare_groups.qzv

8. Identifying deferentially abundant features with ANCOM

ANCOM is one method to test for difference in the relative abundance of features between sample groupings. It is a compositional approach that makes no assumptions about feature distributions. However, it requires that all features have non-zero abundances so a pseudocount of 1 first needs to be added:

qiime composition add-pseudocount --i-table deblur_output/deblur_table_filt.qza \
                                  --o-composition-table deblur_output/deblur_table_filt_pseudocount.qza

Then ANCOM can be run with this command; note that CATEGORY is a placeholder for the text label of your category of interest from the metadata file:

qiime composition ancom --i-table deblur_output/deblur_table_filt_pseudocount.qza \
                        --m-metadata-file metadata.txt \
                        --m-metadata-column CATEGORY \
                        --output-dir ancom_output

9. Exporting the final abundance, profile and sequence files

Lastly, to get the BIOM file (with associated taxonomy), the STAMP profile file (.spf) and FASTA file (one per ASV) for your dataset to plug into other programs you can use the commands below. Note that there are a couple of file manipulations included here since the conversion scripts expect certain slightly different headers than those produced by QIIME2 by default and certain SILVA taxonomy labeling problems need to be fixed for STAMP:

# For FASTA:
mkdir deblur_output_exported

qiime tools export deblur_output/rep_seqs_filt.qza --output-dir deblur_output_exported

# For BIOM w/taxonomy:
sed -i -e '1 s/Feature/#Feature/' -e '1 s/Taxon/taxonomy/' taxa/taxonomy.tsv
qiime tools export deblur_output/deblur_table_filt.qza --output-dir deblur_output_exported
biom add-metadata -i deblur_output_exported/feature-table.biom -o deblur_output_exported/feature-table_w_tax.biom --observation-metadata-fp taxa/taxonomy.tsv --sc-separated taxonomy
biom convert -i deblur_output_exported/feature-table_w_tax.biom -o deblur_output_exported/feature-table_w_tax.txt --to-tsv --header-key taxonomy

# For STAMP profile:
source deactivate
biom_to_stamp.py -m taxonomy deblur_output_exported/feature-table_w_tax.biom > deblur_output_exported/feature-table_w_tax.spf