The dada2 Pipeline for 16S Amplicon Paired-End Reads - Yanmei-Zhang/DADA2_SchillingLab GitHub Wiki

Welcome to the dada2_SchillingLab wiki! This page gives a step-by-step illustration of the dada2 16S pipeline run in RStudio through an interactive HPC resource. It is helpful if you are new to this pipeline and want a graphic view of each step. It assumes that you have set up the work environment as shown in the dada2_SchillingLab tutorial. You can log into an interactive HPC resource such as a NICE session and open the terminal there. You can also run the pipeline on your own computer once you have all the software and packages set up, but it will run slowly on large datasets.

The example shown here was sequenced on a MiSeq 2x300 run at the University of Minnesota Genomics Center (UMGC) using their dual-index protocol.

In most microbial species, the 16S fourth hypervariable (V4) region is approximately 254 bp and deviates from this length by only a few base pairs. If the quality of your reverse reads is low, or you lose too many reads when merging the forward and reverse reads, it is acceptable to use only the forward reads; an example for that situation is also included here.
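If you do go forward-reads-only, the paired-end steps below collapse into single-end calls. A minimal sketch follows; the paths and parameters are illustrative (they mirror the paired-end settings used later in this tutorial), and the dada2 calls are only attempted if the package is available:

```r
# Forward-read-only variant (sketch): filter, denoise, and tabulate the R1
# reads without a merge step. Paths and parameters here are illustrative.
data.path <- "~/dada2_tutorial/example/16S"
fnFs <- sort(list.files(data.path, pattern = "_R1_001.fastq.gz", full.names = TRUE))
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
filtFs <- file.path(data.path, "filt_F_only", paste0(sample.names, "_F_filt.fastq.gz"))

if (requireNamespace("dada2", quietly = TRUE) && length(fnFs) > 0) {
  # Single-end filtering: one trimLeft/truncLen/maxEE value instead of a pair
  out <- dada2::filterAndTrim(fnFs, filtFs, trimLeft = 19, truncLen = 240,
                              maxN = 0, maxEE = 2, truncQ = 2, rm.phix = TRUE,
                              compress = TRUE, multithread = TRUE)
  errF   <- dada2::learnErrors(filtFs, multithread = TRUE)
  dadaFs <- dada2::dada(filtFs, err = errF, multithread = TRUE)
  seqtab <- dada2::makeSequenceTable(dadaFs)  # no mergePairs() step
}
```

The rest of the pipeline (chimera removal, taxonomy) proceeds unchanged on the resulting sequence table.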

Part 1: Load the software from the HPC server

# Load RStudio and start it
module load rstudio/2022.02.3 # use the latest version of RStudio; this version loads R/4.1.0
rstudio

Part 2: dada2 workflow in R/4.1.0

Now you are in the RStudio graphical interface. You can navigate to the working folder in RStudio by clicking on the Files tab (in this example: "~/dada2_tutorial"). Then click dada2_pipeline_16s.R to open the R script.

0. Get ready

# Load dada2 and other required packages
library(dada2)
packageVersion("dada2")
## [1] ‘1.20.0’
library(ggplot2)

The example data is located in the tutorial folder you downloaded in the setup step. You can also download it here.

# Define data path
data.path <- "~/dada2_tutorial/example/16S" # Change it to YOUR data folder
list.files(data.path)

## [1] "Q23_16S_S27_R1_001.fastq.gz" "Q23_16S_S27_R2_001.fastq.gz"
## [3] "Q24_16S_S39_R1_001.fastq.gz" "Q24_16S_S39_R2_001.fastq.gz"
## [5] "Q25_16S_S51_R1_001.fastq.gz" "Q25_16S_S51_R2_001.fastq.gz"
## [7] "Q26_16S_S63_R1_001.fastq.gz" "Q26_16S_S63_R2_001.fastq.gz"
## [9] "Q27_16S_S75_R1_001.fastq.gz" "Q27_16S_S75_R2_001.fastq.gz"
## [11] "Q28_16S_S87_R1_001.fastq.gz" "Q28_16S_S87_R2_001.fastq.gz"

# Generate matched lists of the forward and reverse read files, and parse out the sample names.
fnFs <- sort(list.files(data.path, pattern = "_R1_001.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(data.path, pattern = "_R2_001.fastq.gz", full.names = TRUE))

# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
head(sample.names)
## [1] "Q23" "Q24" "Q25" "Q26" "Q27" "Q28"

sample.namesR <- sapply(strsplit(basename(fnRs), "_"), `[`, 1)
if(!identical(sample.names, sample.namesR)) stop("Forward and reverse files do not match.")
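The `` `[`, 1 `` idiom above takes the first underscore-delimited field of each file name. If your files follow a different naming scheme, adjust the split accordingly; a sketch with hypothetical file names:

```r
# Sample-name parsing for a hypothetical alternative naming scheme,
# e.g. "soil-A_run2_R1.fastq.gz" where the sample ID is "soil-A"
files <- c("soil-A_run2_R1.fastq.gz", "soil-B_run2_R1.fastq.gz")
sample.names.alt <- sapply(strsplit(basename(files), "_"), `[`, 1)
sample.names.alt  # "soil-A" "soil-B"

# If the sample ID itself contains underscores, strip a known suffix instead
files2 <- c("site_1_R1_001.fastq.gz", "site_2_R1_001.fastq.gz")
sub("_R1_001\\.fastq\\.gz$", "", basename(files2))  # "site_1" "site_2"
```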

# Set up a work path in YOUR directory where you want the processed data; you can use the same path the data is in, but it is nice to keep the data folder untouched and store the processed data in a separate folder.
work.path <- "~/dada2_tutorial/dada2_16S" # Change it to YOUR work path

# Set up names of subdirectories to stay organized
filter.fp <- file.path(work.path, "01_filter") 
table.fp <- file.path(work.path, "02_tabletax") 

1. Run dada2 pipeline

1.1 Inspect Read Quality Profiles

From the dada2 tutorial:

In gray-scale is a heat map of the frequency of each quality score at each base position. The median quality score at each position is shown by the green line, and the quartiles of the quality score distribution by the orange lines. The red line shows the scaled proportion of reads that extend to at least that position (this is more useful for other sequencing technologies, as Illumina reads are typically all the same length, hence the flat red line).

# Inspect read quality profiles
# If the number of samples is 10 or less, plot them all, otherwise, just plot 10 randomly selected samples
if( length(fnFs) <= 10) {
  fwd_qual_plots <- plotQualityProfile(fnFs)
  rev_qual_plots <- plotQualityProfile(fnRs)
} else {
  rand_samples <- sample(size = 10, x = seq_along(fnFs)) # grab 10 random samples to plot
  fwd_qual_plots <- plotQualityProfile(fnFs[rand_samples])
  rev_qual_plots <- plotQualityProfile(fnRs[rand_samples])
}

if(!dir.exists(work.path)) dir.create(work.path)
if(!dir.exists(filter.fp)) dir.create(filter.fp)
ggsave(paste0(filter.fp, "/16s_fwd_qual_10.png"), fwd_qual_plots)
ggsave(paste0(filter.fp, "/16s_rev_qual_10.png"), rev_qual_plots)

Forward and reverse read quality profile plots (saved as 16s_fwd_qual_10.png and 16s_rev_qual_10.png)

# Put filtered reads into separate sub-directories for big data workflow
subF.fp <- file.path(filter.fp, "filt_F") 
subR.fp <- file.path(filter.fp, "filt_R") 
dir.create(subF.fp)
dir.create(subR.fp)

1.2 Filter and Trim

Before choosing sequence variants, we want to trim reads where their quality scores begin to drop (the truncLen and truncQ values) and remove any low-quality reads that remain after trimming (the maxEE value).

From dada2 tutorial:

If there is only one part of any amplicon bioinformatics workflow on which you spend time considering the parameters, it should be filtering! The parameters ... are not set in stone, and should be changed if they don’t work for your data. If too few reads are passing the filter, increase maxEE (perhaps especially on the reverse reads) and/or reduce truncQ. If quality drops sharply at the end of your reads, reduce truncLen. If your reads are high quality and you want to reduce computation time in the sample inference step, reduce maxEE.
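To build intuition for maxEE: a read's expected number of errors is the sum of the per-base error probabilities implied by its quality scores, EE = sum(10^(-Q/10)). A toy calculation with made-up quality scores:

```r
# Expected errors (EE) from Phred quality scores: EE = sum(10^(-Q/10)).
# A base at Q30 contributes 0.001 expected errors; low-quality tails dominate.
ee <- function(q) sum(10^(-q / 10))
q_good <- rep(30, 240)                  # hypothetical uniformly good read
q_tail <- c(rep(30, 200), rep(10, 40))  # same read with a low-quality tail
ee(q_good)  # 0.24 expected errors: passes maxEE = 2
ee(q_tail)  # 4.2 expected errors: the 40 Q10 bases alone contribute 4.0
```

This is why truncating low-quality tails (truncLen) rescues reads that would otherwise fail the maxEE filter.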

Considerations for your own data:

THESE PARAMETERS ARE NOT OPTIMAL FOR ALL DATASETS. Make sure you determine the trim and filtering parameters appropriate for your data.

Remember, though, when choosing truncLen for paired-end reads you must maintain overlap after truncation in order to merge them later. This tutorial uses 2x300 V4 sequence data, so the forward and reverse reads overlap completely and trimming can be guided entirely by the quality scores. The trimLeft = c(FWD_PRIMER_LEN, REV_PRIMER_LEN) argument of dada2's filtering function is used here to remove the primers.
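A quick sanity check on the truncation lengths used below: after primer removal, truncLen = c(240, 220) with trimLeft = c(19, 20) keeps 221 and 200 biological bases, which for a ~254 bp V4 amplicon leaves far more than the ~12 bp of overlap mergePairs requires by default:

```r
# Check that the chosen truncation lengths leave enough overlap to merge.
trunc_len  <- c(240, 220)  # truncLen used below (forward, reverse)
primer_len <- c(19, 20)    # trimLeft used below (primer lengths)
amplicon_len <- 254        # expected biological length of the V4 region
kept <- trunc_len - primer_len       # biological bases kept per read
overlap <- sum(kept) - amplicon_len  # bases of forward/reverse overlap
overlap  # 167 bp, comfortably above mergePairs' default minOverlap of 12
```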

# Filter and Trim
filtFs <- file.path(subF.fp, paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(subR.fp, paste0(sample.names, "_R_filt.fastq.gz"))

names(filtFs) <- sample.names
names(filtRs) <- sample.names

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft=c(19,20), truncLen=c(240,220),
                     maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE) # On Windows set multithread=FALSE
head(out)
##                             reads.in reads.out
## Q23_16S_S27_R1_001.fastq.gz    70330     62401
## Q24_16S_S39_R1_001.fastq.gz    50068     44031
## Q25_16S_S51_R1_001.fastq.gz    54056     47972
## Q26_16S_S63_R1_001.fastq.gz    33477     29365
## Q27_16S_S75_R1_001.fastq.gz    75873     66094
## Q28_16S_S87_R1_001.fastq.gz    64221     57029

1.3 Infer Sequence Variants and Merge

In this part of the pipeline dada2 will learn to distinguish error from biological differences using a subset of our data as a training set. After it understands the error rates, we will reduce the size of the dataset by combining all identical sequence reads into "unique sequences". Then, using the dereplicated data and error rates, dada2 will infer the sequence variants (ASVs) in our data. Finally, we will merge the corresponding forward and reverse reads to create a list of the fully denoised sequences and create a sequence table from the result.

Learn the Error Rates

# Set seed to ensure that randomized steps can be replicated
set.seed(100)
# Learn the Error Rates
errF <- learnErrors(filtFs, multithread = TRUE)
## 67823132 total bases in 306892 reads from 6 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread = TRUE)
## 61378400 total bases in 306892 reads from 6 samples will be used for learning the error rates.
errF_plot <- plotErrors(errF, nominalQ = TRUE) 
errR_plot <- plotErrors(errR, nominalQ=TRUE)

ggsave(paste0(filter.fp, "/16s_errF_plot.png"), errF_plot)
ggsave(paste0(filter.fp, "/16s_errR_plot.png"), errR_plot)

We want to make sure that the machine-learning algorithm is learning the error rates properly. In the plots below, the red line represents what we expect the learned error rates to look like for each of the 16 possible base transitions (A->A, A->C, A->G, etc.), and the black line and grey dots represent the observed error rates. If the black and red lines are very far apart, it may be a good idea to increase the nbases parameter. This allows the machine-learning algorithm to train on a larger portion of your data and may help improve the fit.
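The red line is simply the error rate implied by the nominal definition of the Phred quality score, plotted on a log scale; for a few assumed Q values:

```r
# Nominal (red-line) error rates implied by the Phred quality score:
# a base at Q20 should be wrong 1% of the time, at Q30 0.1% of the time.
nominal_err <- function(q) 10^(-q / 10)
nominal_err(c(20, 30, 40))  # 0.01 0.001 1e-04
```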

Forward and reverse error-rate plots (saved as 16s_errF_plot.png and 16s_errR_plot.png)

Dereplication, Sequence Inference, and Merging of paired-end Reads

In this part of the pipeline, dada2 will make decisions about assigning sequences to ASVs (called "sequence inference"). There is a major parameter option in the core function dada() that changes how samples are handled during sequence inference. The pool parameter can be set to pool = FALSE (default), pool = TRUE, or pool = "pseudo". For details on parameter choice, please see below, and further information and explanation in the dada2 tutorial.

Details
pool = FALSE: Sequence information is not shared between samples. Fast processing time, less sensitivity to rare taxa.
pool = "pseudo": Sequence information is shared in a separate "prior" step. Intermediate processing time, intermediate sensitivity to rare taxa.
pool = TRUE: Sequence information from all samples is pooled together. Slow processing time, most sensitivity to rare taxa.

Default: SAMPLES NOT POOLED

For simple communities or when you do not need high sensitivity for rare taxa

# make lists to hold the loop output
mergers <- vector("list", length(sample.names))
names(mergers) <- sample.names
ddF <- vector("list", length(sample.names))
names(ddF) <- sample.names
ddR <- vector("list", length(sample.names))
names(ddR) <- sample.names

# For each sample, get a list of merged and denoised sequences
for(sam in sample.names) {
    cat("Processing:", sam, "\n")
    # Dereplicate forward reads
    derepF <- derepFastq(filtFs[[sam]])
    # Infer sequences for forward reads
    ddF[[sam]] <- dada(derepF, err = errF, multithread = TRUE)
    # Dereplicate reverse reads
    derepR <- derepFastq(filtRs[[sam]])
    # Infer sequences for reverse reads
    ddR[[sam]] <- dada(derepR, err = errR, multithread = TRUE)
    # Merge reads together
    merger <- mergePairs(ddF[[sam]], derepF, ddR[[sam]], derepR)
    mergers[[sam]] <- merger
}
rm(derepF); rm(derepR)

Alternative: SAMPLES POOLED

For complex communities, or when you want to preserve rare taxa. Alternative: swap pool = TRUE for pool = "pseudo".

# Dereplicate identical reads
derepFs <- derepFastq(filtFs, verbose = TRUE)
derepRs <- derepFastq(filtRs, verbose = TRUE)
# Name the derep-class objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names

# Sample Inference
dadaFs <- dada(derepFs, err = errF, multithread = TRUE, pool = TRUE)
## 6 samples were pooled: 306892 reads in 79664 unique sequences.
dadaRs <- dada(derepRs, err = errR, multithread = TRUE, pool = TRUE)
## 6 samples were pooled: 306892 reads in 82240 unique sequences.
# Inspecting the returned dada-class object:
dadaFs[[1]]
## dada-class: object describing DADA2 denoising results
## 3038 sequence variants were inferred from 18196 
## input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

# Merge paired reads
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
## 57688 paired-reads (in 2719 unique pairings) successfully merged out of 59846 (in 3739 pairings) input.
## 38127 paired-reads (in 2979 unique pairings) successfully merged out of 40858 (in 4204 pairings) input.
## 44090 paired-reads (in 3027 unique pairings) successfully merged out of 45714 (in 3888 pairings) input.
## 26496 paired-reads (in 2692 unique pairings) successfully merged out of 27697 (in 3379 pairings) input.
## 62307 paired-reads (in 2539 unique pairings) successfully merged out of 63964 (in 3183 pairings) input.
## 52983 paired-reads (in 2687 unique pairings) successfully merged out of 54734 (in 3610 pairings) input.
# Inspect the merger data.frame from the first sample
head(mergers[[1]])

# Construct Sequence Table
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
## [1]    6 4583
# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))

# Remove non-target-length sequences (the V4 amplicon is expected to be ~250-256 bp)
seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% 250:256]
dim(seqtab2)
## [1]    6 4271
# Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab2)))

1.4 Remove Chimeras and Summary of Reads

Although dada2 has corrected indel errors and substitutions, there may still be chimeric sequences in our dataset (sequences derived from two different parent templates that became fused together during PCR and/or sequencing). To identify chimeras, we will search for rare sequence variants that can be reconstructed by combining left-hand and right-hand segments from two more abundant "parent" sequences. After removing chimeras, we will use a reference taxonomy database and a trained classifier algorithm to assign names to our sequence variants.

# Remove Chimeras 
seqtab.nochim <- removeBimeraDenovo(seqtab2, method="consensus", multithread=TRUE, verbose=TRUE)
## Identified 97 bimeras out of 4271 input sequences.
dim(seqtab.nochim)
## [1]    6 4174
table(nchar(getSequences(seqtab.nochim)))
##  250  251  252  253  254  255  256 
##   10    9   37 3622  428   55   13 
# Print the fraction of our sequences that were not chimeric.
sum(seqtab.nochim)/sum(seqtab2)
## [1] 0.978298

# Track reads through the pipeline
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If you used the sample-by-sample loop above, replace dadaFs and dadaRs with ddF and ddR
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)
##     input filtered denoisedF denoisedR merged nonchim
## Q23 70330    62401     60639     60588  57688   49915
## Q24 50068    44031     41805     41874  38127   36810
## Q25 54056    47972     46409     46418  44090   39657
## Q26 33477    29365     28185     28205  26496   23896
## Q27 75873    66094     64496     64584  62307   48063
## Q28 64221    57029     55446     55416  52983   46166

Looks good! We kept the majority of our raw reads, and there is no overly large drop associated with any single step.

From dada2 tutorial: This is a great place to do a last sanity check. Outside of filtering (depending on how stringent you want to be) there should be no step in which a majority of reads are lost. If a majority of reads were removed as chimeric, you may need to revisit the removal of primers, as the ambiguous nucleotides in unremoved primers interfere with chimera identification. If a majority of reads failed to merge, the culprit could also be unremoved primers, but could also be due to biological length variation in the sequenced ITS region that sometimes extends beyond the total read length resulting in no overlap.
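A quick way to quantify "no large drop" is to express each column of track as a fraction of the input reads. A self-contained sketch using the Q23 counts from the table above:

```r
# Per-step read retention as a fraction of input, using the Q23 counts above
track_q23 <- c(input = 70330, filtered = 62401, denoisedF = 60639,
               denoisedR = 60588, merged = 57688, nonchim = 49915)
round(track_q23 / track_q23["input"], 3)
# For the full matrix: round(track / track[, "input"], 3) shows every sample;
# watch for any single step that loses a majority of reads.
```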

# Save file
if(!dir.exists(table.fp)) dir.create(table.fp)
write.table(t(seqtab.nochim), paste0(table.fp, "/dada2_16s_counts.txt"), sep="\t", quote=F, row.names=T)
write.table(track , paste0(table.fp, "/dada2_16s_track.txt"), sep="\t", quote=F, row.names = T)

1.5 Assign Taxonomy

Here is a link to download the database we need. In this tutorial, we will assign taxonomy with the Silva database (v138.1).

# Assign taxonomy
taxa <- assignTaxonomy(seqtab.nochim, "~/dada2_tutorial/db_files/silva_nr99_v138.1_train_set.fa.gz", multithread=TRUE, tryRC = TRUE)

# Let’s inspect the taxonomic assignments:
taxa.print <- taxa # Removing sequence rownames for display only
rownames(taxa.print) <- NULL
head(taxa.print)

# Make species level assignments based on exact matching
taxa.plus <- addSpecies(taxa, "~/dada2_tutorial/db_files/silva_species_assignment_v138.1.fa.gz", verbose=TRUE)

# Let’s inspect the taxonomic assignments:
taxa.plus.print <- taxa.plus # Removing sequence rownames for display only
rownames(taxa.plus.print) <- NULL
head(taxa.plus.print)
##       Kingdom    Phylum            Class                
## [1,] "Bacteria" "Proteobacteria"  "Alphaproteobacteria"
## [2,] "Bacteria" "Proteobacteria"  "Alphaproteobacteria"
## [3,] "Bacteria" "Proteobacteria"  "Alphaproteobacteria"
## [4,] "Bacteria" "Proteobacteria"  "Gammaproteobacteria"
## [5,] "Bacteria" "Acidobacteriota" "Acidobacteriae"     
## [6,] "Bacteria" "Proteobacteria"  "Alphaproteobacteria"
##      Order              Family                          
## [1,] "Rickettsiales"    "Mitochondria"                  
## [2,] "Rhizobiales"      "Beijerinckiaceae"              
## [3,] "Rhizobiales"      "Xanthobacteraceae"             
## [4,] "WD260"            NA                              
## [5,] "Acidobacteriales" "Acidobacteriaceae (Subgroup 1)"
## [6,] "Rhodospirillales" "Magnetospirillaceae"           
##      Genus              Species
## [1,] NA                 NA     
## [2,] "Methylovirgula"   "ligni"
## [3,] NA                 NA     
## [4,] NA                 NA     
## [5,] "Terracidiphilus"  NA     
## [6,] "Telmatospirillum" NA  
   
write.table(taxa.plus, paste0(table.fp, "/dada2_16s_taxa.txt"), sep="\t", quote=F, row.names = T)

seqtable.taxa <- cbind('#seq'=rownames(taxa), t(seqtab.nochim), taxa)
write.table(seqtable.taxa, paste0(table.fp, "/dada2_16s_counts.taxon.txt"), sep="\t", quote=F, row.names=F)

seqtable.taxa.plus <- cbind('#seq'=rownames(taxa.plus), t(seqtab.nochim), taxa.plus)
write.table(seqtable.taxa.plus, paste0(table.fp, "/dada2_16s_counts.taxon.species.txt"), sep="\t", quote=F, row.names=F)

save.image(file= paste0(work.path, "/dada2_16s.RData"))

Summary of output files:

dada2_16s_counts.txt - a tab-delimited sequence-by-sample (i.e. ASV) count table

dada2_16s_taxa.txt - a tab-delimited file showing taxonomy information

dada2_16s_counts.taxon.species.txt - a tab-delimited file with sequences as rows, samples as columns, and the last few columns showing the taxonomy of each ASV sequence

dada2_16s_track.txt - a tab-delimited file tracking the number of reads retained at each step of the pipeline
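These tab-delimited outputs can be read straight back into R for downstream analysis (for example, building a phyloseq object). A self-contained round-trip sketch with a toy count table (the counts and names are made up for illustration):

```r
# Round-trip sketch: write a small count table the same way as above,
# then read it back. Counts and names here are made up for illustration.
counts <- matrix(c(10L, 0L, 3L, 7L), nrow = 2,
                 dimnames = list(c("ASV_1", "ASV_2"), c("Q23", "Q24")))
f <- file.path(tempdir(), "dada2_16s_counts_demo.txt")
# col.names = NA writes a blank header cell over the row names,
# which keeps the header aligned with the data columns
write.table(counts, f, sep = "\t", quote = FALSE, col.names = NA)
counts_back <- as.matrix(read.table(f, sep = "\t", header = TRUE, row.names = 1))
all.equal(counts, counts_back)  # TRUE
```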