The DADA2 Pipeline for 16S Amplicon Paired-End Reads - Yanmei-Zhang/DADA2_SchillingLab GitHub Wiki
Welcome to the dada2_SchillingLab wiki! This page gives a step-by-step illustration of the dada2 16S pipeline run in RStudio through the interactive HPC resource. It is helpful if you are new to this pipeline and want a graphic view of each step. It assumes that you have set up the work environment as shown in the dada2_SchillingLab tutorial. Log into an interactive HPC resource such as a NICE session and open the terminal there. You can also run the pipeline on your own computer once you have all the software and packages set up on your device, but it will run slowly on a large dataset.
The example data shown here were sequenced on a MiSeq 2x300 run at the University of Minnesota Genomic Center (UMGC) using their dual-index protocol.
In most microbial species, the 16S fourth hypervariable (V4) region is approximately 254 bp and deviates from this length by only a few base pairs. If the quality of your reverse reads is low, or you lose too many reads when merging the forward and reverse reads, it is acceptable to use only the forward reads; an example of that workflow is also included.
Part 1: Load the software from the HPC server
# Load RStudio and start it
module load rstudio/2022.02.3 # use the latest available version of RStudio; this version loads R/4.1.0
rstudio
Part 2: dada2 workflow processed by R/4.1.0
Now you are in the RStudio graphical view. Navigate to the working folder in RStudio by clicking on the Files tab (in this example: "~/dada2_tutorial"), then click dada2_pipeline_16s.R to open the R script.
0. Get ready
# Load dada2 and other required packages
library(dada2)
packageVersion("dada2")
## [1] ‘1.20.0’
library(ggplot2)
The example data is located in the tutorial folder you downloaded in the setup step. You can also download it here.
# Define data path
data.path <- "~/dada2_tutorial/example/16S" # Change it to YOUR data folder
list.files(data.path)
## [1] "Q23_16S_S27_R1_001.fastq.gz" "Q23_16S_S27_R2_001.fastq.gz"
## [3] "Q24_16S_S39_R1_001.fastq.gz" "Q24_16S_S39_R2_001.fastq.gz"
## [5] "Q25_16S_S51_R1_001.fastq.gz" "Q25_16S_S51_R2_001.fastq.gz"
## [7] "Q26_16S_S63_R1_001.fastq.gz" "Q26_16S_S63_R2_001.fastq.gz"
## [9] "Q27_16S_S75_R1_001.fastq.gz" "Q27_16S_S75_R2_001.fastq.gz"
## [11] "Q28_16S_S87_R1_001.fastq.gz" "Q28_16S_S87_R2_001.fastq.gz"
# Generate matched lists of the forward and reverse read files, as well as parsing out the sample name.
fnFs <- sort(list.files(data.path, pattern = "_R1_001.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(data.path, pattern = "_R2_001.fastq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
head(sample.names)
## [1] "Q23" "Q24" "Q25" "Q26" "Q27" "Q28"
sample.namesR <- sapply(strsplit(basename(fnRs), "_"), `[`, 1)
if(!identical(sample.names, sample.namesR)) stop("Forward and reverse files do not match.")
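The `sapply(strsplit(...), `[`, 1)` idiom above is compact but a little cryptic; it can be unpacked on a single filename from this dataset (standalone base-R sketch):

```r
# Split one filename on underscores and take the first piece, i.e. the sample name.
fname <- "Q23_16S_S27_R1_001.fastq.gz"
parts <- strsplit(fname, "_")[[1]]  # "Q23" "16S" "S27" "R1" "001.fastq.gz"
parts[1]                            # "Q23"
```

`sapply(..., `[`, 1)` simply applies this "take element 1" step to every filename at once.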
# Set up a work path in YOUR directory where you want the processed data; you could reuse the data path, but it is nice to keep the raw data folder untouched and store the processed data in a separate folder.
work.path <- "~/dada2_tutorial/dada2_16S" # Change it to YOUR work path
# Set up names of sub directories to stay organized
filter.fp <- file.path(work.path, "01_filter")
table.fp <- file.path(work.path, "02_tabletax")
1. Run dada2 pipeline
1.1 Inspect Read Quality Profiles
From the dada2 tutorial:
In gray-scale is a heat map of the frequency of each quality score at each base position. The median quality score at each position is shown by the green line, and the quartiles of the quality score distribution by the orange lines. The red line shows the scaled proportion of reads that extend to at least that position (this is more useful for other sequencing technologies, as Illumina reads are typically all the same length, hence the flat red line).
# Inspect read quality profiles
# If there are 10 or fewer samples, plot them all; otherwise plot 10 randomly selected samples
if(length(fnFs) <= 10) {
  fwd_qual_plots <- plotQualityProfile(fnFs)
  rev_qual_plots <- plotQualityProfile(fnRs)
} else {
  rand_samples <- sample(size = 10, 1:length(fnFs)) # grab 10 random samples to plot
  fwd_qual_plots <- plotQualityProfile(fnFs[rand_samples])
  rev_qual_plots <- plotQualityProfile(fnRs[rand_samples])
}
if(!dir.exists(work.path)) dir.create(work.path)
if(!dir.exists(filter.fp)) dir.create(filter.fp)
ggsave(paste0(filter.fp, "/16s_fwd_qual_10.png"), fwd_qual_plots)
ggsave(paste0(filter.fp, "/16s_rev_qual_10.png"), rev_qual_plots)
fwd_qual_plots
rev_qual_plots

# Put filtered reads into separate sub-directories for big data workflow
subF.fp <- file.path(filter.fp, "filt_F")
subR.fp <- file.path(filter.fp, "filt_R")
dir.create(subF.fp)
dir.create(subR.fp)
1.2 Filter and Trim
Before choosing sequence variants, we want to trim reads where their quality scores begin to drop (the truncLen and truncQ values) and remove any low-quality reads that remain after trimming (the maxEE value).
From dada2 tutorial:
If there is only one part of any amplicon bioinformatics workflow on which you spend time considering the parameters, it should be filtering! The parameters ... are not set in stone, and should be changed if they don’t work for your data. If too few reads are passing the filter, increase maxEE (perhaps especially on the reverse reads) and/or reduce truncQ. If quality drops sharply at the end of your reads, reduce truncLen. If your reads are high quality and you want to reduce computation time in the sample inference step, reduce maxEE.
Considerations for your own data:
THESE PARAMETERS ARE NOT OPTIMAL FOR ALL DATASETS. Make sure you determine the trimming and filtering parameters appropriate for your own data.
Remember, though: when choosing truncLen for paired-end reads you must maintain overlap after truncation in order to merge them later. This tutorial uses 2x300 V4 sequence data, so the forward and reverse reads overlap completely and our trimming can be guided entirely by the quality scores. The trimLeft = c(FWD_PRIMER_LEN, REV_PRIMER_LEN) argument of dada2's filtering function is used here to remove the primers.
# Filter and Trim
filtFs <- file.path(subF.fp, paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(subR.fp, paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft=c(19,20), truncLen=c(240,220),
maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=TRUE) # On Windows set multithread=FALSE
head(out)
## reads.in reads.out
## Q23_16S_S27_R1_001.fastq.gz 70330 62401
## Q24_16S_S39_R1_001.fastq.gz 50068 44031
## Q25_16S_S51_R1_001.fastq.gz 54056 47972
## Q26_16S_S63_R1_001.fastq.gz 33477 29365
## Q27_16S_S75_R1_001.fastq.gz 75873 66094
## Q28_16S_S87_R1_001.fastq.gz 64221 57029
1.3 Infer Sequence Variants and Merge
In this part of the pipeline, dada2 will learn to distinguish sequencing errors from biological differences using a subset of our data as a training set. After it has learned the error rates, we reduce the size of the dataset by collapsing all identical reads into "unique sequences" (dereplication). Then, using the dereplicated data and error rates, dada2 infers the sequence variants (ASVs) in our data. Finally, we merge the corresponding forward and reverse reads to create a list of fully denoised sequences and build a sequence table from the result.
Learn the Error Rates
# Set seed to ensure that randomized steps can be replicated
set.seed(100)
# Learn the Error Rates
errF <- learnErrors(filtFs, multithread = TRUE)
## 67823132 total bases in 306892 reads from 6 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread = TRUE)
## 61378400 total bases in 306892 reads from 6 samples will be used for learning the error rates.
errF_plot <- plotErrors(errF, nominalQ = TRUE)
errR_plot <- plotErrors(errR, nominalQ=TRUE)
ggsave(paste0(filter.fp, "/16s_errF_plot.png"), errF_plot)
ggsave(paste0(filter.fp, "/16s_errR_plot.png"), errR_plot)
We want to make sure that the machine-learning algorithm is learning the error rates properly. In the plots below, the red line represents what we expect the learned error rates to look like for each of the 16 possible base transitions (A->A, A->C, A->G, etc.), and the black line and grey dots represent the observed error rates. If the black and red lines are very far apart, it may be a good idea to increase the nbases parameter, which lets the algorithm train on a larger portion of your data and may improve the fit.
errF_plot
errR_plot

Dereplication, Sequence Inference, and Merging of paired-end Reads
In this part of the pipeline, dada2 makes decisions about assigning sequences to ASVs (called "sequence inference"). One major parameter of the core function dada() changes how samples are handled during sequence inference: pool can be set to pool = FALSE (default), pool = TRUE, or pool = "pseudo". For details on parameter choice, see below and the further information and explanation in the dada2 tutorial.
Details
pool = FALSE: Sequence information is not shared between samples. Fast processing time, less sensitivity to rare taxa.
pool = "pseudo": Sequence information is shared in a separate "prior" step. Intermediate processing time, intermediate sensitivity to rare taxa.
pool = TRUE: Sequence information from all samples is pooled together. Slow processing time, most sensitivity to rare taxa.
Default: SAMPLES NOT POOLED
For simple communities, or when you do not need high sensitivity to rare taxa
# make lists to hold the loop output
mergers <- vector("list", length(sample.names))
names(mergers) <- sample.names
ddF <- vector("list", length(sample.names))
names(ddF) <- sample.names
ddR <- vector("list", length(sample.names))
names(ddR) <- sample.names
# For each sample, get a list of merged and denoised sequences
for(sam in sample.names) {
  cat("Processing:", sam, "\n")
  # Dereplicate forward reads
  derepF <- derepFastq(filtFs[[sam]])
  # Infer sequences for forward reads
  ddF[[sam]] <- dada(derepF, err = errF, multithread = TRUE)
  # Dereplicate reverse reads
  derepR <- derepFastq(filtRs[[sam]])
  # Infer sequences for reverse reads
  ddR[[sam]] <- dada(derepR, err = errR, multithread = TRUE)
  # Merge reads together
  merger <- mergePairs(ddF[[sam]], derepF, ddR[[sam]], derepR)
  mergers[[sam]] <- merger
}
rm(derepF); rm(derepR)
Alternative: SAMPLES POOLED
For complex communities, or when you want to preserve rare taxa. For the pseudo-pooling alternative, swap pool = TRUE with pool = "pseudo".
# Dereplicate identical reads
derepFs <- derepFastq(filtFs, verbose = TRUE)
derepRs <- derepFastq(filtRs, verbose = TRUE)
# Name the derep-class objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names
# Sample Inference
dadaFs <- dada(derepFs, err = errF, multithread = TRUE, pool = TRUE)
## 6 samples were pooled: 306892 reads in 79664 unique sequences.
dadaRs <- dada(derepRs, err = errR, multithread = TRUE, pool = TRUE)
## 6 samples were pooled: 306892 reads in 82240 unique sequences.
# Inspecting the returned dada-class object:
dadaFs[[1]]
## dada-class: object describing DADA2 denoising results
## 3038 sequence variants were inferred from 18196
## input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
# Merge paired reads
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
## 57688 paired-reads (in 2719 unique pairings) successfully merged out of 59846 (in 3739 pairings) input.
## 38127 paired-reads (in 2979 unique pairings) successfully merged out of 40858 (in 4204 pairings) input.
## 44090 paired-reads (in 3027 unique pairings) successfully merged out of 45714 (in 3888 pairings) input.
## 26496 paired-reads (in 2692 unique pairings) successfully merged out of 27697 (in 3379 pairings) input.
## 62307 paired-reads (in 2539 unique pairings) successfully merged out of 63964 (in 3183 pairings) input.
## 52983 paired-reads (in 2687 unique pairings) successfully merged out of 54734 (in 3610 pairings) input.
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
# Construct Sequence Table
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
## [1] 6 4583
# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))
# Remove sequences much longer or shorter than the expected amplicon length
seqtab2 <- seqtab[, nchar(colnames(seqtab)) %in% 250:256]
dim(seqtab2)
## [1] 6 4271
# Inspect the length distribution again after filtering
table(nchar(getSequences(seqtab2)))
1.4 Remove Chimeras and Summary of Reads
Although dada2 has corrected indel errors and substitutions, there may still be chimeric sequences in our dataset (sequences derived from two different parent templates that became fused together during PCR). To identify chimeras, we search for rare sequence variants that can be reconstructed by combining left-hand and right-hand segments from two more-abundant "parent" sequences. After removing chimeras, we will use a taxonomy database and a trained classifier algorithm to assign names to our sequence variants.
# Remove Chimeras
seqtab.nochim <- removeBimeraDenovo(seqtab2, method="consensus", multithread=TRUE, verbose=TRUE)
## Identified 97 bimeras out of 4271 input sequences.
dim(seqtab.nochim)
## [1] 6 4174
table(nchar(getSequences(seqtab.nochim)))
## 250 251 252 253 254 255 256
## 10 9 37 3622 428 55 13
# Print the percentage of our sequences that were not chimeric.
sum(seqtab.nochim)/sum(seqtab2)
## [1] 0.978298
# Track reads through the pipeline
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
## input filtered denoisedF denoisedR merged nonchim
## Q23 70330 62401 60639 60588 57688 49915
## Q24 50068 44031 41805 41874 38127 36810
## Q25 54056 47972 46409 46418 44090 39657
## Q26 33477 29365 28185 28205 26496 23896
## Q27 75873 66094 64496 64584 62307 48063
## Q28 64221 57029 55446 55416 52983 46166
Looks good! We kept the majority of our raw reads, and there is no overly large drop associated with any single step.
From dada2 tutorial: This is a great place to do a last sanity check. Outside of filtering (depending on how stringent you want to be) there should be no step in which a majority of reads are lost. If a majority of reads were removed as chimeric, you may need to revisit the removal of primers, as the ambiguous nucleotides in unremoved primers interfere with chimera identification. If a majority of reads failed to merge, the culprit could also be unremoved primers, but could also be due to biological length variation in the sequenced ITS region that sometimes extends beyond the total read length resulting in no overlap.
# Save file
if(!dir.exists(table.fp)) dir.create(table.fp)
write.table(t(seqtab.nochim), paste0(table.fp, "/dada2_16s_counts.txt"), sep="\t", quote=F, row.names=T)
write.table(track , paste0(table.fp, "/dada2_16s_track.txt"), sep="\t", quote=F, row.names = T)
1.5 Assign Taxonomy
Here is a link to download the database we need. In this tutorial, we will assign taxonomy with the SILVA database v138.1.
# Assign taxonomy
taxa <- assignTaxonomy(seqtab.nochim, "~/dada2_tutorial/db_files/silva_nr99_v138.1_train_set.fa.gz", multithread=TRUE, tryRC = TRUE)
# Let’s inspect the taxonomic assignments:
taxa.print <- taxa # Removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print)
# Make species level assignments based on exact matching
taxa.plus <- addSpecies(taxa, "~/dada2_tutorial/db_files/silva_species_assignment_v138.1.fa.gz", verbose=TRUE)
# Let’s inspect the taxonomic assignments:
taxa.plus.print <- taxa.plus # Removing sequence rownames for display only
rownames(taxa.plus.print) <- NULL
head(taxa.plus.print)
## Kingdom Phylum Class
## [1,] "Bacteria" "Proteobacteria" "Alphaproteobacteria"
## [2,] "Bacteria" "Proteobacteria" "Alphaproteobacteria"
## [3,] "Bacteria" "Proteobacteria" "Alphaproteobacteria"
## [4,] "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## [5,] "Bacteria" "Acidobacteriota" "Acidobacteriae"
## [6,] "Bacteria" "Proteobacteria" "Alphaproteobacteria"
## Order Family
## [1,] "Rickettsiales" "Mitochondria"
## [2,] "Rhizobiales" "Beijerinckiaceae"
## [3,] "Rhizobiales" "Xanthobacteraceae"
## [4,] "WD260" NA
## [5,] "Acidobacteriales" "Acidobacteriaceae (Subgroup 1)"
## [6,] "Rhodospirillales" "Magnetospirillaceae"
## Genus Species
## [1,] NA NA
## [2,] "Methylovirgula" "ligni"
## [3,] NA NA
## [4,] NA NA
## [5,] "Terracidiphilus" NA
## [6,] "Telmatospirillum" NA
write.table(taxa.plus, paste0(table.fp, "/dada2_16s_taxa.txt"), sep="\t", quote=F, row.names = T)
seqtable.taxa <- cbind('#seq'=rownames(taxa), t(seqtab.nochim), taxa)
write.table(seqtable.taxa, paste0(table.fp, "/dada2_16s_counts.taxon.txt"), sep="\t", quote=F, row.names=F)
seqtable.taxa.plus <- cbind('#seq'=rownames(taxa.plus), t(seqtab.nochim), taxa.plus)
write.table(seqtable.taxa.plus, paste0(table.fp, "/dada2_16s_counts.taxon.species.txt"), sep="\t", quote=F, row.names=F)
save.image(file= paste0(work.path, "/dada2_16s.RData"))
Summary of output files:
dada2_16s_counts.txt - a tab-delimited sequence-by-sample (i.e. ASV) table
dada2_16s_taxa.txt - a tab-delimited file showing taxonomy information
dada2_16s_counts.taxon.species.txt - a tab-delimited file with sequences as rows, samples as columns, and the last few columns showing the taxonomy of each ASV sequence
dada2_16s_track.txt - a tab-delimited file tracking the number of reads retained at each step
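All of these outputs are plain tab-delimited text, so they read back into R (or other tools) later. A small round-trip sketch, with toy data standing in for the real count table:

```r
# Write a toy count table the same way as above, then read it back.
toy_counts <- data.frame(Q23 = c(10L, 5L), Q24 = c(7L, 2L),
                         row.names = c("ASVseq1", "ASVseq2"))
tmp <- tempfile(fileext = ".txt")
write.table(toy_counts, tmp, sep = "\t", quote = FALSE, row.names = TRUE)
back <- read.table(tmp, sep = "\t", header = TRUE)  # row names auto-detected
back$Q23  # 10 5, matching the original
```

Because the header row has one fewer field than the data rows, read.table automatically uses the first column as row names, which is why the saved tables reload cleanly.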