BIOC4010 16S Tutorial - LangilleLab/microbiome_helper GitHub Wiki

This is a 16S tutorial based on the use of QIIME 2 and our Amplicon SOP v2 (see sidebar). It is meant as a self guided walk through that shows how to analyze a typical 16S dataset along with questions to ensure you understand the commands you are running.This tutorial was created for the class BIOC4010: Bioinformatics at Dalhousie University, but can be used by anyone.

Requirements

  • Access to a command line shell. If on Mac you can use the already installed program "Terminal". If on Windows, check to see if you already have "PowerShell" installed. If so, you should be all set. Otherwise, on Windows you can install Git for Windows (https://gitforwindows.org/), which includes a nice command line terminal.

**Note: if you are doing this tutorial and you are NOT in the BIOC4010 class you will need to install your own version of QIIME2 (version 2022.2).

If this is your first time using the command-line you may want to consider taking a basic unix tutorial like this one: http://korflab.ucdavis.edu/bootcamp.html

Background

Amplicon sequencing is a common method of identifying which taxa are present in a sample based on amplified marker genes. This approach contrasts with shotgun metagenomics where all the DNA in a sample is sequenced. The most common marker gene used for prokaryotes is the 16S ribosomal RNA gene. It features regions that are conserved among these organisms, as well as variable regions that allow distinction among organisms. These characteristics make this gene useful for analyzing microbial communities at reduced cost compared to shotgun metagenomics approaches. Only a subset of variable regions are generally sequenced for amplicon studies and you will see them referred to using syntax like "V3-V4" (i.e. variable regions 3 to 4) in scientific papers. For this tutorial we will process V6-V8 amplified sequences.

This tutorial dataset was originally used in a project to determine whether knocking out the protein chemerin affects gut microbial composition. Originally 116 mouse samples acquired from two different facilities were used for this project (only a subset of samples were used in this tutorial dataset, for simplicity). Metadata associated with each sample is indicated in the mapping file (map.txt). In this mapping file the genotypes of interest can be seen: wildtype (WT), chemerin knockout (chemerin_KO) and chemerin receptor knockout (CMKLR1_KO). Also of importance are the two source facilities: "BZ" and "CJS". It is generally a good idea to include as much metadata as possible, since this data can easily be explored later on.

Getting set up

Open your terminal window and ensure that you are on the machine that has QIIME2 installed (this may mean ssh'ing to a remote server). For those in BIOC4010, the command to ssh will be provided in Brightspace.

Activate your QIIME2 conda environment

conda activate qiime2-2022.2

Note: If you get an error than you do not have QIIME2 installed properly, or are not logged into the correct server

Note2: You will need to run the above commands every time you restart your terminal

Create your tutorial directory, replace "<your_name>" with your actual first name (no spaces allowed). For example, for me I would call it "morgan_lab":

mkdir ~/<your_name>_lab/

cd ~/<your_name>_lab/

Download the tutorial data:

cp ~/16S_chemerin_tutorial.zip ./

unzip 16S_chemerin_tutorial.zip

cd 16S_chemerin_tutorial

Browse data

You can see what is in this directory with the ls command:

ls  

You should see two names: "raw_data" is the directory containing all the sequencing files, which we are going to process. "map.txt" is a text file which contains metadata about the samples. We can look at it with the less command:

less -S map.txt

You can see the first column is the sample IDs and each additional column contains "metadata" about these samples. The "source" column describes which mouse facility the samples came from, the "Cage#" indicates which mouse cage the sample originated from (mice are often co-housed), and the last column contains information about whether the genotype of the mouse from which the sample is from.

Here is what the first 5 lines should look like:

#SampleID	source	Cage#	genotype
105CHE6WT	BZ	7	WT
106CHE6WT	BZ	7	WT
107CHE6KO	BZ	7	chemerin_KO
108CHE6KO	BZ	7	chemerin_KO

Q1) Based on the source column, how many samples are from each of the "BZ" and "CJS" source facilities?

Type "q" to exit less once you are done viewing the file.

Now lets list all of the FASTQ files:

ls raw_data/*fastq.gz

You can easily count the number of FASTQ files in the directory with this command:

ls raw_data/*fastq.gz | wc -l

Q2) Why isn't the number of FASTQ files equal to the number of samples?

1. First steps in QIIME pipeline

1.1 Set up variables that point to NCORES and METADATA

Below we will have commands that rely on our metadata file (e.g "map.txt"). We are going to simply save it now to a convenient name so the commands below will work for all users.

Be sure to replace "<your_name>" with the same name you used above

METADATA="/home/student/<yourname>_lab/16S_chemerin_tutorial/map.txt"

Several commands throughout this workflow can run on multiple cores in parallel. How many cores to use in these cases will be saved to the NCORES variable defined below. We set this variable to 2 below.

NCORES=2

Note that both of these commands will need to be run again if you log out and log back in later. In other words these variables are removed when you are disconnected

1.2 Import FASTQs as QIIME 2 artifact

To standardize QIIME 2 analyses and to keep track of provenance (i.e. a list of what commands were previously run to produce a file) a special format is used for all QIIME 2 input and output files called an "artifact" (with the extension QZA). The first step is to import the raw reads as a QZA file.

mkdir reads_qza
    
qiime tools import \
   --type SampleData[PairedEndSequencesWithQuality] \
   --input-path raw_data/ \
   --output-path reads_qza/reads.qza \
   --input-format CasavaOneEightSingleLanePerSampleDirFmt

All of the FASTQs are now in the single artifact file reads_qza/reads.qza. This file format can be a little confusing at first, but it is actually just a zipped folder. You can manipulate and explore these files better with the qiime tools utilities (e.g. peek and view).

1.3 Trim primers with cutadapt

Screen out reads that do not begin with primer sequence and remove primer sequence from reads using the cutadapt QIIME 2 plugin. The below primers correspond to the 16S V6-V8 region (Bacteria-specific primer set).

qiime cutadapt trim-paired \
   --i-demultiplexed-sequences reads_qza/reads.qza \
   --p-cores $NCORES \
   --p-front-f ACGCGHNRAACCTTACC \
   --p-front-r ACGGGCRGTGWGTRCAA \
   --p-discard-untrimmed \
   --p-no-indels \
   --o-trimmed-sequences reads_qza/reads_trimmed.qza

If you received the error --p-cores option requires an argument make sure that you set the NCORES bash variable as described above.

Summarize trimmed FASTQs

You can run the demux summarize command after trimming the reads to get a report of the number of reads per sample and quality distribution across the reads.

qiime demux summarize \
   --i-data reads_qza/reads_trimmed.qza \
   --o-visualization reads_qza/reads_trimmed_summary.qzv

Note that we gave the output file above the extension .qzv since this a special type of artifact file - a visualization. You can look at the visualization by uploading the file to the QIIME2 view website and clicking on the Interactive Quality Plot tab at the top of the page.

If you are working on a remote server via ssh (which we are in BIOC4010), then you need to download these .qzv files before uploading to QIIME Viewer. There are many ways to do this, but one way is to use the scp command in a new terminal window. Before the next command open a NEW terminal window where you are not connected to the server

scp [email protected]:~/<your_name>_lab/16S_chemerin_tutorial/reads_qza/reads_trimmed_summary.qzv ./

Then on your computer find where this file is and upload it on the QIIME2 Viewer webpage: https://view.qiime2.org/

Q3) What is the median number of sequences in the forward reads?

Choose the "Interactive Quality Plot" tab on the top left of the viewer.

Q4) By looking at the plots for forward and reverse reads, are the forward or the reverse reads of higher quality?

2. Denoising the reads into amplicon sequence variants

At this stage the main 2 pipelines you can use are based on either deblur or DADA2. Below we will describe the commands for running deblur. See here if you are interested in running DADA2.

2.1 Join paired-end reads

Currently deblur doesn't support unjoined paired-end reads. Forward and reverse reads can be joined with VSEARCH as shown below.

qiime vsearch join-pairs \
   --i-demultiplexed-seqs reads_qza/reads_trimmed.qza \
   --o-joined-sequences reads_qza/reads_trimmed_joined.qza

2.2 Filter out low-quality reads

This command will filter out low-quality reads based on the default options.

qiime quality-filter q-score \
   --i-demux reads_qza/reads_trimmed_joined.qza \
   --o-filter-stats filt_stats.qza \
   --o-filtered-sequences reads_qza/reads_trimmed_joined_filt.qza

It is a good idea at this point just to verify that there haven't been any substantial losses of reads, before going through the whole ASV process, at either the joining or quality-filtering steps above:

qiime demux summarize \
   --i-data reads_qza/reads_trimmed_joined_filt.qza \
   --o-visualization reads_qza/reads_trimmed_joined_filt_summary.qzv

Using the file you just created, download it, and upload to the QIIME Viewer to answer the following two questions.

Q5) What sample has the most remaining reads after filtering and how many reads are in that sample?

Q6) What sample has the least remaining reads after filtering and how many reads are in that sample?

2.3 Running deblur

Run the deblur workflow to correct reads and get amplicon sequence variants (ASVs). Note that the below command will retain singletons, which would have been filtered out unless we set --p-min-reads 1, and is for 16S sequences only. For other amplicon regions, you can either use the denoise-other option in the command and specify a reference database of sequences to use for positive filtering (as in the below versions for 18S and ITS) or use DADA2 as described below (the former still being preferred due to speed considerations).

Note that you will need to trim all sequences to the same length with the --p-trim-length option. We are going to trim the sequences to a length of 400.

qiime deblur denoise-16S \
   --i-demultiplexed-seqs reads_qza/reads_trimmed_joined_filt.qza \
   --p-trim-length 400 \
   --p-sample-stats \
   --p-jobs-to-start $NCORES \
   --p-min-reads 1 \
   --output-dir deblur_output

2.4 Summarizing deblur output

Once a denoising pipeline has been run you can summarize the output table with the below command, which will create a visualization artifact for you to view.

qiime feature-table summarize \
   --i-table deblur_output/table.qza \
   --o-visualization deblur_output/deblur_table_summary.qzv

3. Assign taxonomy to ASVs

You can assign taxonomy to your ASVs using a Naive-Bayes approach implemented in the scikit learn Python library and the SILVA database.

3.1 Build or acquire taxonomic classifier

This approach requires that a classifier be trained in advance on a reference database. We recommend users use a widely used classifier to help ensure there are no unexpected issues with the Naive-Bayes model.

The full-length 16S/18S classifier can be downloaded from the QIIME 2 website (silva-138-99-nb-classifier.qza for the latest classifier). This file has already been downloaded and placed on the server.

3.2 Run taxonomic classification

You can run the taxonomic classification with this command, which is one of the longest running and most memory-intensive command of the SOP. If you receive an error related to insufficient memory (and if you cannot increase your memory usage) then you can look into the --p-reads-per-batch option and set this to be lower than the default (which is dynamic depending on sample depth and the number of threads) and also try running the command with fewer jobs (e.g. set --p-n-jobs 1).

qiime feature-classifier classify-sklearn \
   --i-reads deblur_output/representative_sequences.qza \
   --i-classifier /home/shared/taxa_classifiers/qiime2-2022.2_classifiers/silva-138-99-nb-classifier.qza \
   --p-n-jobs $NCORES \
   --output-dir taxa

Note: We accidentally moved the necessary file in /home/shared. You can skip the running of this step by copying the file from another location using the following command:

cp -r ~/taxa ./

As with all QZA files, you can export the output file to take a look at the classifications and confidence scores:

qiime tools export \
   --input-path taxa/classification.qza --output-path taxa

4. Filtering resultant table

Filtering the denoised table is an important step of microbiome data analysis. You can see more details on this process in the QIIME 2 filtering tutorial. However, for this tutorial we are simply going to skip over any filtering steps and copy our table to the final filtered copy.

cp deblur_output/table.qza deblur_output/deblur_table_final.qza

4.4 Subset and summarize filtered table

Once we have our final filtered table we will need to subset the QZA of ASV sequences to the same set. You can exclude the low frequency ASVs from the sequence file with this command:

qiime feature-table filter-seqs \
   --i-data deblur_output/representative_sequences.qza \
   --i-table deblur_output/deblur_table_final.qza \
   --o-filtered-data deblur_output/rep_seqs_final.qza

Finally, you can make a new summary of the final filtered abundance table:

qiime feature-table summarize \
   --i-table deblur_output/deblur_table_final.qza \
   --o-visualization deblur_output/deblur_table_final_summary.qzv

5. Build tree with SEPP QIIME 2 plugin

SEPP is one method for placing short sequences into a reference phylogenetic tree. This is a useful way of determining a phylogenetic tree for your ASVs. For 16S data you can do this with q2-fragment-insertion using the below command:

qiime fragment-insertion sepp \
   --i-representative-sequences deblur_output/rep_seqs_final.qza \
   --i-reference-database /home/shared/rRNA_db/16S/sepp-refs-gg-13-8.qza \
   --o-tree asvs-tree.qza \
   --o-placements insertion-placements.qza \
   --p-threads $NCORES

Note that if you do not already have this file locally you will need to download sepp-refs-gg-13-8.qza as specified in the fragment-insertion instructions.

If this step takes too long you can simply copy the files I already ran using the following command. Note: You can cancel your previous command with "ctrl+c" (hold control button and type c).

cp ~/asvs-tree.qza ~/insertion-placements.qza ./

6. Generate rarefaction curves

A key quality control step is to plot rarefaction curves for all of your samples to determine if you performed sufficient sequencing. The below command will generate these plots. Normally you would set the --p-max-depth to sample with the highest sequencing depth. For this tutorial that should be 69. Also note that the $METADATA bash variable was defined at the first step of this SOP and simply points to the metadata file (e.g. map.txt).

qiime diversity alpha-rarefaction \
   --i-table deblur_output/deblur_table_final.qza \
   --p-max-depth 69 \
   --p-steps 10 \
   --i-phylogeny asvs-tree.qza \
   --m-metadata-file $METADATA \
   --o-visualization rarefaction_curves.qzv

Using the file you just created and the QIIME Viewer, answer the following question. You can change the "Sample Metadata Column" dropdown to analyze different groupings of the samples.

Q7) What genotype has the highest shannon diversity?

7. Generate stacked barchart of taxa relative abundances

A more useful output is the interactive stacked bar-charts of the taxonomic abundances across samples, which can be output with this command:

qiime taxa barplot \
   --i-table deblur_output/deblur_table_final.qza \
   --i-taxonomy taxa/classification.qza \
   --m-metadata-file $METADATA \
   --o-visualization taxa/taxa_barplot.qzv

Use this file and the QIIME viewer to make a stacked bar chart. Set the Taxonomic Level at 6 (which is genus), and sort the samples by genotype.

Q8) Download the stacked bar chart and legend and include it in your report.

8. Calculating diversity metrics and generating ordination plots

Common alpha and beta-diversity metrics can be calculated with a single command in QIIME 2. In addition, ordination plots (such as PCoA plots for weighted UniFrac distances) will be generated automatically as well. This command will also rarefy all samples to the sample sequencing depth before calculating these metrics. Normally you would use the lowest reasonable sample depth; samples with depth below this cut-off will be excluded. For this simple tutorial we will use the low cutoff of 10.

qiime diversity core-metrics-phylogenetic \
   --i-table deblur_output/deblur_table_final.qza \
   --i-phylogeny asvs-tree.qza \
   --p-sampling-depth 10 \
   --m-metadata-file $METADATA \
   --p-n-jobs-or-threads $NCORES \
   --output-dir diversity

You can then produce boxplots comparing the different categories in your metadata file. For example, to create boxplots comparing the Shannon alpha-diversity metric you can use this command:

qiime diversity alpha-group-significance \
   --i-alpha-diversity diversity/shannon_vector.qza \
   --m-metadata-file $METADATA \
   --o-visualization diversity/shannon_compare_groups.qzv

Q9) Is the Shannon diversity significantly different (e.g. p<0.05) between the genotype groups?

Browse the file diversity/weighted_unifrac_emperor.qzv using QIIME Viewer. This will create a PCoA plot. Using the "Select a Colour Category" choose source. Rotate the viewer (by clicking and dragging) to show the separation between the red and blue points.

Q10) Save the PCoA plot and include it in your report.

Lastly, lets see if that separation is statistically significant using a PERMANOVA test:

qiime diversity beta-group-significance  \
  --i-distance-matrix diversity/weighted_unifrac_distance_matrix.qza  \
  --m-metadata-file $METADATA  \
  --m-metadata-column source  \
  --o-visualization diversity/unifrac_source_compare_groups.qzv

Q11) Is the weighted unifrac beta-diversity significantly different between the "source" groups (e.g. p-value < 0.05)?

Q12) Is the weighted unifrac beta-diversity significantly different between the "genotype" groups (e.g. p-value < 0.05)? (Hint: change the command above to answer this question.)

Reconnecting

If you get disconnected from the server or you need to come back to it later. You will need to do the following commands again before picking up where you left off.

ssh [email protected]
conda activate qiime2-2022.2
cd ~/<your_name>_lab/16S_chemerin_tutorial
METADATA="/home/student/<yourname>_lab/16S_chemerin_tutorial/map.txt"
NCORES=2

Other resources

There are many other possible QIIME 2 analyses that we recommend you look into. You may also find these other resources useful:

References

⚠️ **GitHub.com Fallback** ⚠️