DADA2

Introduction

After you get your data from Genome Quebec, we first need to process it to take out all the junk sequences and put it into a usable format. This is where the program DADA2 comes in. It does all this for you (yay). So, ready to start the DADA2 pipeline? Well, I'll present our lab's process; however, no one can tell you how to do something better than the people who made the product. I'll be going through everything in detail, but the writers of this package also have tutorials on the DADA2 page (the intro tutorial, an ITS-specific tutorial, big data, pooling samples, etc.).

A quick note on computational speed and computers

When I wrote the "Hey, you need access to the supercomputer at UNBC" thing I was surely mistaken. You CAN run this on your home laptop, it's just going to take a while... These kinds of programs require two things: CPUs and memory. More (or faster) CPUs increase speed, and you need a certain amount of memory to be able to physically run DADA2 (because of how the data is stored within commands). DADA2's whole thing is that its memory requirements are actually quite low, so you can just run it on your home computer; it'll just take a bit. See, the HPC (or the Linux box I'm using) has 12 CPUs, which means it can do most of these processes 12x faster than your single-CPU computer. When I ran Kenzie's 27 NextSeq samples it took about 20-22 CPU hours. This means it'll take your single-CPU computer around 20-22 hours to run the whole DADA2 pipeline, and it'll take me about two hours on the 12-CPU Linux box.

Overview of process

Once you download your data off of Genome Quebec or (insert other sequencing facility here), it can be in two forms: multiplexed or demultiplexed. You'll need the demultiplexed data.

A quick note on demultiplexing

Sometimes your samples are "multiplexed"; you can imagine this as a different kind of zip folder around your files. MOST companies will demultiplex your samples before giving them to you, which is what Genome Quebec does, so you don't have to worry about demultiplexing! If your samples are multiplexed, then there are a variety of programs happy to demultiplex them for you. But how do you tell if your samples are demultiplexed? If you have a couple of files for each sample you sent off, your data is most likely demultiplexed. If you have one massive file that contains all your sample data, your data is likely multiplexed.

Oh the files you'll see!

Genome Quebec gives you 4 files per sample. These files more or less follow the format samplename_I1.fastq.gz, samplename_I2.fastq.gz, samplename_R1.fastq.gz, and samplename_R2.fastq.gz. Files with a 1 are the forward read; files with a 2 are the reverse read. I refers to an index read (these are necessary for the demultiplexing process, but we don't necessarily care about them). R refers to the reads; these files contain the actual sequencing data you care about.

Starting the pipeline

First you'll bring your samples into R and sort them into the forward and reverse reads per sample. Then you have to remove any reads with ambiguous bases (i.e. N's), because DADA2 specifically can't process reads containing N's. DADA2 also wants you to remove all the primers from your reads.
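A minimal sketch of these first steps, assuming your demultiplexed files live in a folder called my_run and follow the Genome Quebec naming scheme above (the path and the pattern strings are the bits you'd swap for your own run):

```r
library(dada2)

path <- "~/my_run"  # hypothetical folder of demultiplexed fastq.gz files
fnFs <- sort(list.files(path, pattern = "_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern = "_R2.fastq.gz", full.names = TRUE))

# Pre-filter only the N-containing reads for now (maxN = 0);
# the real quality filtering happens later in the pipeline
fnFs.filtN <- file.path(path, "filtN", basename(fnFs))
fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE)
```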

Why remove primers?

Depending on where you send your samples, they may or may not already have the primers removed from your reads (Genome Quebec removed the primers in the first ITS data set I processed, but ALWAYS check). Having primers floating around will make it harder to analyze your data: since every sequence will start with the exact same stretch of bases, assigning taxonomy gets harder. PLUS, the first couple base pairs (bp) of a read usually have poor quality, so removing the primers conveniently removes these bases too. P.S. With 16S reads you can just remove the primers in the "forward" direction of the read, i.e. if the primer length is 10, you remove the first 10 base pairs from the forward and reverse reads and then everything is fine. BUT if you work with ITS data you have to worry about really short ITS sequences. See, Illumina MiSeq will sequence the first 200-250 bp of your amplicon. With 16S data there is no way this covers the entire 16S region, but with ITS data it can (some fungal species have ITS2 regions between 160-390 bp). When the amplicon is shorter than the read, the sequencer reads right through it into the opposite primer. Therefore you must also check for and remove primers in the opposite orientation of your read (i.e. in the forward read you must remove the reverse complement of the reverse primer, and vice versa).
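The DADA2 ITS tutorial handles this with cutadapt called from inside R. A hedged sketch, assuming cutadapt is installed on your machine and continuing from the fnFs.filtN/fnRs.filtN files above — the primer sequences and the cutadapt path are placeholders you'd replace with your own:

```r
library(Biostrings)

FWD <- "GTGARTCATCGAATCTTTG"   # placeholder forward primer -- use your own
REV <- "TCCTCCGCTTATTGATATGC"  # placeholder reverse primer -- use your own
FWD.RC <- as.character(reverseComplement(DNAString(FWD)))
REV.RC <- as.character(reverseComplement(DNAString(REV)))

cutadapt <- "/usr/local/bin/cutadapt"  # wherever cutadapt lives on your machine
fnFs.cut <- file.path(path, "cutadapt", basename(fnFs.filtN))
fnRs.cut <- file.path(path, "cutadapt", basename(fnRs.filtN))
dir.create(file.path(path, "cutadapt"))

# Trim FWD off the front of R1 and the reverse complement of REV off its tail
# (the read-through case), plus the mirror image for R2
for (i in seq_along(fnFs.filtN)) {
  system2(cutadapt, args = c("-g", FWD, "-a", REV.RC,
                             "-G", REV, "-A", FWD.RC,
                             "-n", 2,  # remove up to 2 adapters per read
                             "-o", fnFs.cut[i], "-p", fnRs.cut[i],
                             fnFs.filtN[i], fnRs.filtN[i]))
}
```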

Back to the pipeline and looking at reads

After you're sure you've removed all the primers, you can take a look at the quality of the reads. Read quality has a couple of different attributes: firstly, read length (the longer the better!), and secondly, read quality score. The quality score is a metric of how likely each base call is to be correct, and higher is better. The scale is logarithmic: a quality of 10 means a wrong base is reported 10% of the time, a quality of 20 means a wrong base is reported 1% of the time, and so on. Check out me! You'll see what some example read-quality plots look like in the tutorial.
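DADA2 has a built-in plotting function for this. A minimal sketch, continuing from the primer-trimmed files above:

```r
# Visualize per-base quality for the first two samples; the green line is the
# mean quality score at each read position
plotQualityProfile(fnFs.cut[1:2])
plotQualityProfile(fnRs.cut[1:2])

# The quality scale is Phred: P(wrong base) = 10^(-Q/10)
10^(-c(10, 20, 30, 40) / 10)  # 0.1, 0.01, 0.001, 0.0001
```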

Filtering the reads

Since a lot of the reads suck, we have to filter them. There are quite a few ways to do this, and exactly how depends on your barcode/end goal. When filtering, we first want to remove the reads that are just horrid (e.g. quality score = 2, contains N's, etc.). Then we remove reads that have too much sequencing error but aren't implicitly bad. This is done by setting the maximum number of expected base errors allowed per read, which is actually more accurate than just removing bases with low quality scores. It's recommended you remove reads with 2 or more expected errors, but in the reverse read (which is generally of worse quality than the forward read) it's publishable to go up to an expected error of 5. You may also want to cut your sequences to certain lengths; however, do NOT do this for ITS sequences as they are naturally different lengths, and be careful: cut too far and there won't be any overlap between the forward and reverse reads! It's normal to lose a lot of reads during this process, up to 75%, as long as you have enough reads to do analysis later. If you lose buttloads of reads, you can try relaxing the filtering parameters and seeing how many more reads you save. It's a delicate balance between good quality sequences and read counts. In general though, if you have enough reads to do analysis (somewhere on the scale of thousands), prioritize the good-quality part of the equation.
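A hedged sketch of this step for ITS data (so no truncLen), using the maxEE = c(2, 5) thresholds just described; the minLen value is an assumption to drop tiny leftover fragments:

```r
filtFs <- file.path(path, "filtered", basename(fnFs.cut))
filtRs <- file.path(path, "filtered", basename(fnRs.cut))

out <- filterAndTrim(fnFs.cut, filtFs, fnRs.cut, filtRs,
                     maxN = 0,          # no ambiguous bases
                     maxEE = c(2, 5),   # max expected errors: forward, reverse
                     truncQ = 2,        # truncate at the first Q2 base
                     minLen = 50,       # assumed floor; no truncLen for ITS!
                     rm.phix = TRUE, compress = TRUE, multithread = TRUE)
head(out)  # reads.in vs reads.out per sample -- check how many you lost
```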

Error learning

DADA2 then uses your data to determine the error rates of the different transitions between bases (e.g. A→G). These transitions happen naturally during PCR and sequencing, so by using your own dataset you get your sample-specific transition rates (this is the best part of DADA2: different sequencing runs can have massively different transition rates). These transition rates are then used to distinguish real biological sequence variants from sequencing errors. The two alternatives are, firstly, using standard transition rates (this is bad because, as above, different sequencing runs can have massively different transition rates), or secondly, just counting the number of mismatches (this is bad because the different transitions have different error rates, so not every mismatch is created equal). (This also means samples from different sequencing runs should be processed independently and combined only during later steps, e.g. at the ASV table or after taxonomic assignment.)
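A minimal sketch; learnErrors fits the transition-rate model per run, and plotErrors lets you eyeball the fit:

```r
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)

# Sanity check: the fitted error rates (black line) should decrease smoothly
# as quality increases and roughly track the observed points
plotErrors(errF, nominalQ = TRUE)
```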

Error learning with NextSeq data

Life cannot be simple. I started learning this material on the precipice of the end of MiSeq and the beginning of NextSeq. Why is this important? Well, MiSeq and NextSeq are two methods of performing next-generation sequencing (i.e. getting the data you need for DADA2); however, NextSeq produces many more reads (an order of magnitude more) at similar cost. NextSeq is the future and MiSeq is getting phased out. On the dollar this is altogether a good thing, BUT (there's always a but) NextSeq uses binned quality scores, unlike MiSeq, which produced raw quality scores. This binning is done to reduce data size.

So what's binning? When you sequence a nucleotide, it has a quality score associated with it based on the likelihood that the correct nucleotide was recorded (i.e. if your sequencer reads a "T" but is unsure it's actually a "T", that base gets a low quality score). These quality scores are on an effectively continuous scale for MiSeq data, but for NextSeq data they are binned into a few discrete values (e.g. a Q score of Q0 stays in the Q0 bin, Q scores between Q0-Q15 become Q12, Q15-Q30 become Q26, and Q scores over 30 become Q34). This is a fine way to do things when you consider how we filter poor sequences out, but a problem arises when we try to learn the error rates in DADA2: DADA2 wants continuous Q scores, and these categorical Q scores really mess up the fit, giving a stepwise pattern in the error rate estimations. The current solution is to force the error rate estimations to be smooth and monotonic rather than stepwise/categorical. This seems a bit ham-fisted, but alas, you gotta do what you gotta do.
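There's no official DADA2 fix for this yet; here's a minimal sketch of one community-style workaround, which wraps DADA2's default loessErrfun and then forces the fitted rates to decrease monotonically with quality. The function name and the cummax trick are illustrative, not canon — check the DADA2 GitHub issues for the current recommended version:

```r
# Hypothetical wrapper around DADA2's default error-fitting function
loessErrfun_mod <- function(trans) {
  err <- loessErrfun(trans)  # the stock loess fit (stepwise on binned Q scores)
  # Force each transition's error rate to be non-increasing as Q increases,
  # smoothing out the steps the bins create
  t(apply(err, 1, function(x) rev(cummax(rev(x)))))
}

errF <- learnErrors(filtFs, errorEstimationFunction = loessErrfun_mod,
                    multithread = TRUE)
plotErrors(errF, nominalQ = TRUE)  # the fit should now look less staircase-y
```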

Unique sequences

After DADA2 knows how to tell sequences apart, it can figure out how many unique sequences there are. Then DADA2 compares the unique forward and reverse sequences and merges them into specific paired reads. Some reads won't successfully merge. That's okay! Some of the sequences are going to be from weird places; if they don't merge, they were probably just non-specific amplification of DNA. From this merge, DADA2 determines how many unique ASV's there are.
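A minimal sketch of the denoise-and-merge step, continuing from the error models above:

```r
# Denoise each direction using the per-run error models, then merge the pairs
dadaFs <- dada(filtFs, err = errF, multithread = TRUE)
dadaRs <- dada(filtRs, err = errR, multithread = TRUE)
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose = TRUE)

# Rows are samples, columns are unique ASVs
seqtab <- makeSequenceTable(mergers)
dim(seqtab)
```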

ASV's and OTU's

Operational taxonomic units (OTU's) are sequencing reads clustered together at a fixed similarity threshold (usually 97% identity). Amplicon sequence variants (ASV's) are sequences that are unique down to single-nucleotide differences. There are a variety of benefits to using ASV's over OTU's (especially if you think the species concept at a bacterial level is a bit silly to begin with). Firstly, the 97% similarity used by OTU's is an arbitrary choice based on observations, which means it is open to discussion (would 99% identity be better? 99.5%? and so on). Secondly, as DNA analysis pipelines get better (stronger denoising algorithms, etc.), ASV's can determine which sequences are biologically relevant with more confidence, so a wider net for which sequences are the "same" is unnecessary. Thirdly, in comparison studies ASV's are better than or equal to OTU's at discerning ecological patterns. Fourthly (and very importantly), ASV's from different runs and methods can be merged into a single comparison (there are important caveats to this), whereas OTU's can't. For a quick in-depth read, read me.
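That fourth point works because an ASV is defined by its exact sequence, so tables built from separately processed runs line up on shared sequences. A minimal sketch, assuming you saved each run's ASV table as an .rds file (the file names are hypothetical):

```r
st1 <- readRDS("run1_seqtab.rds")  # hypothetical per-run ASV tables
st2 <- readRDS("run2_seqtab.rds")
seqtab.all <- mergeSequenceTables(st1, st2)  # identical ASVs collapse together
```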

Chimeras

After generating your ASV's you want to remove the chimeras. Chimeras (as their namesake would suggest) are artifact sequences stitched together from two different parent sequences, mostly formed during PCR when a partially extended fragment primes a different template. Chimeras are found using likelihoods! If you have a low-abundance ASV that can be created by mixing the sequences of two highly abundant ASV's, it's likely a chimera. It's normal for chimeras to be a large fraction of the unique ASV's, but IMPORTANTLY they should only be a small fraction of the total reads in a sample.
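A minimal sketch, with quick checks against both of the fractions just mentioned:

```r
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)

# Lots of unique ASVs can be chimeric, but few total reads should be:
ncol(seqtab.nochim) / ncol(seqtab)  # fraction of ASVs kept
sum(seqtab.nochim) / sum(seqtab)    # fraction of reads kept (should be near 1)
```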

Assigning taxonomy

Congrats! Getting here means you have successfully gone through DADA2; now it's time to take your ASV's and assign taxonomy to them. Taxonomy is assigned by comparing your ASV's to a curated sequence database. This means that, firstly, the database you choose is VERY important for your biological interpretations. There are specific databases that are better at identifying species for certain applications (e.g. a database just for human pathogens), and there are more wide-ranging general databases. If you are working with ITS sequences you'll likely use the UNITE database. Secondly, since these databases are curated, not every ASV exists in them yet. Especially with soil samples, it's normal to have a lot of "well, this organism is a fungus for sure" or "well, it's an ascomycete, but what species? Who knows!". It's important not to remove these sequences from your dataset; they exist and are alive, we just don't know who they are yet!!
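A minimal sketch using assignTaxonomy against a UNITE release; the file path is a placeholder for wherever you saved the UNITE FASTA, and tryRC also checks reverse complements:

```r
unite.ref <- "~/dbs/sh_general_release_dynamic.fasta"  # placeholder UNITE path
taxa <- assignTaxonomy(seqtab.nochim, unite.ref,
                       multithread = TRUE, tryRC = TRUE)
head(unname(taxa))  # peek at assignments without the long sequence rownames
```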

Phyloseq

Once you assign taxonomy you'll have to save your data and parse it into a phyloseq object; after that you can start analyzing data to your heart's content!!
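A minimal sketch of building that object, assuming a hypothetical sample_metadata.csv whose row names match your sample names:

```r
library(phyloseq)

samdf <- read.csv("sample_metadata.csv", row.names = 1)  # hypothetical metadata
ps <- phyloseq(otu_table(seqtab.nochim, taxa_are_rows = FALSE),
               sample_data(samdf),
               tax_table(taxa))
ps  # summary of the combined object
```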

The ...code

First you should start off with a test run on just a single sample on your home computer. This'll get you used to the code, and I explain what's going on. This is an ITS workflow on MiSeq data: DADA2 ITS Miseq test run

This is an ITS workflow on NextSeq data: DADA2 ITS Nextseq test run

And when you're ready, you can use the true script with the correct code for running on the UNBC HPC for ITS sequences: Big ITS DADA2 DATA

18S sequences are a bit different; here are the test and real code chunks:

Here's an 18S example for long-read sequencing: DADA2 18S long read run

Want a quick reference for IDtaxa? Here you go: Assigning taxonomy with IDTAXA

References

For this section please reference:

ShortRead:

M. Morgan, S. Anders, M. Lawrence, P. Aboyoun, H. Pagès, and R. Gentleman (2009): "ShortRead: a Bioconductor package for input, quality assessment and exploration of high-throughput sequence data". Bioinformatics 25:2607-2608.

BioStrings:

Pagès H, Aboyoun P, Gentleman R, DebRoy S (2023). Biostrings: Efficient manipulation of biological strings. R package version 2.68.1. doi:10.18129/B9.bioc.Biostrings, https://bioconductor.org/packages/Biostrings.

DADA2:

Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP (2016). “DADA2: High-resolution sample inference from Illumina amplicon data.” Nature Methods, 13, 581-583. doi:10.1038/nmeth.3869 https://doi.org/10.1038/nmeth.3869.

Phyloseq:

phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data. Paul J. McMurdie and Susan Holmes (2013) PLoS ONE 8(4):e61217.

Cutadapt:

Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), pp. 10-12. doi:10.14806/ej.17.1.200

If you are working with ITS data please cite the database you use. In the tutorial I use UNITE

UNITE Database (make sure this is the one you actually use):

Abarenkov, Kessy; Zirk, Allan; Piirmann, Timo; Pöhönen, Raivo; Ivanov, Filipp; Nilsson, R. Henrik; Kõljalg, Urmas (2023): UNITE general FASTA release for Fungi 2. Version 18.07.2023. UNITE Community.

Kenzie, use this one: Abarenkov, Kessy; Zirk, Allan; Piirmann, Timo; Pöhönen, Raivo; Ivanov, Filipp; Nilsson, R. Henrik; Kõljalg, Urmas (2024): UNITE general FASTA release for eukaryotes 2. Version 04.04.2024. UNITE Community. https://doi.org/10.15156/BIO/2959335