Checking your 16S reads for technical sequences - kviljoen/16S-rDNA-dada2-pipeline GitHub Wiki
This interactive exercise is designed to help you check for technical sequences (most commonly primers) in your reads in an intuitive manner. Here we use a single sample as an example, but you should check multiple randomly selected samples in your data. This exercise should help you determine the parameters --trimFor --trimRev --truncFor --truncRev for the dada2 pipeline. If primers have already been removed from your data you can set --trimFor and --trimRev to 0 and ignore --truncFor and --truncRev. Alternatively you can use a pattern-matching primer removal software such as cutadapt and perform FastQC before and after primer stripping to confirm complete removal of primer sequences.
Perform alignment between a few randomly selected read pairs from your data
You can use the head
command to quickly get test reads:
e.g.
head W097-V1_R1.fastq
@M00707:218:000000000-ABK1D:1:1101:16025:1375 1:N:0:5
CCATAGTGCCAGCCGCCGCGGTAATACGGAAGGTCCAGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCTGTTTGTTAAGCGTGTTGTGAAATGTAAGAGCTCAACTTTTAGATTGCAGCGCGAACTGGCAGACTTGAGTGCGCACAACGTAGGCGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGACGAACTCCGATTGCGAAGGCAGCTTACGGGAGCGCAACTGACGCTAAA
head W097-V1_R2.fastq
@M00707:218:000000000-ABK1D:1:1101:16025:1375 2:N:0:5
GGCGGGGACTACACGGGTATCTAATCCTGTTCGATACCCGCACCTTCGAGCTTTAGCGTCAGTTGCGCTCCCGTAAGCTGCCTTCGCAATCGGAGTTCGTCATGATATCTAAGCATTTCACCGCTACACCATGAATTCCGCCTACGTTGTGCGCACTCAAGTCTGCCAGTTCGCGCTGCAATCTAAAAGTTGAGCTCTTACATTTCACAACACGCTTAACAAACAGCCTACGCTCCCTTTAAACCCAATAA
Now take these matching R1 and R2 reads above and see how R1 aligns with the reverse complement (RC) of R2: To get the reverse complement of R2 you can use this tool: https://www.bioinformatics.org/sms/rev_comp.html
The RC of R2 is:
TTATTGGGTTTAAAGGGAGCGTAGGCTGTTTGTTAAGCGTGTTGTGAAATGTAAGAGCTCAACTTTTAGATTGCAGCGCGAACTGGCAGACTTGAGTGCGCACAACGTAGGCGGAATTCATGGTGTAGCGGTGAAATGCTTAGATATCATGACGAACTCCGATTGCGAAGGCAGCTTACGGGAGCGCAACTGACGCTAAAGCTCGAAGGTGCGGGTATCGAACAGGATTAGATACCCGTGTAGTCCCCGCC
The alignment should look something like this:
From the alignment we can see there’s about a 200bp overlap when aligning R1 and R2 (note that this may vary slightly but should be constant for the majority of reads). Next let’s try figure out if you still have primer sequences in your reads. For the example we’re using the primers:
- R1: NNNNNGTGCCAGCMGCCGCGGTAA
- R2: NNNNNGGACTACHVGGGTWTCTAAT
In the above we see ambiguity codes (e.g. N, H, V, M) in addition to A, T, C and G. For now for simplicity we’ll just grab a section of R1 with no ambiguity codes i.e. ‘GCCGCGGTAA’ Take this string and in your unix terminal use the command ‘egrep’ to look for and highlight this sequence if present in your reads. The command in this case would be:
egrep --colour GCCGCGGTAA W097-V1_R1.fastq
From this we can clearly see that R1 primer sequences are still present in our reads. You’ll notice that they’re not all in exactly the same place across reads, but the majority should be the same, e.g. in this example N=24 is the number of nucleotides from the start of R1 to the end of the primer sequence. We would therefore want to trim off N=24 in our pipeline (so if you’re using the Nextflow dada2 pipeline you would set --trimFor to 24). The above doesn’t tell us how commonly reads have the primer sequence though, lets have a look with the following command that counts the number of occurrences of the specified pattern GCCGCGGTAA:
grep -o GCCGCGGTAA raw/W097-V1_R1.fastq | wc -l
247830
And how many reads in total in this file?
grep -o @M00 raw/W097-V1_R1.fastq | wc -l
254884
So about 97% of reads have an exact match for this sequence (more if you allow for mismatches). Sometimes, depending on the length of the insert we’ll get ‘adapter readthrough’ where the R2 primer is also read into the R1 read after our desired insert.. If we look at the tail end of the R1 read, we would take the RC of the R2 primer, which is ATTAGAWACCCBDGTAGTCCNNNNN to check for adapter/primer ‘readthrough’. Note that this won’t necessarily be present in all reads. Below an example where, unlike the first alignment of 250bp reads, where we did not see adapter readthrough (because the insert was long enough), in the example below the insert is shorter than expected in which case we will see adapter readthrough, with primer seqs highlighted in yellow.
If we use the section ‘ATTAGA’ from the RC of R2 primer we see several sequences with adapter readthrough (but not all will have this):
egrep --colour ATTAGA W097-V1_R1.fastq
From this we can see that we do see adapter readthrough from the R2 primer in at least some of our R1 reads - but in what proportion? Lets check:
grep -o ATTAGA raw/W097-V1_R1.fastq | wc -l
448
So quite a small proportion of reads have adapter readthrough in this sample.
Now lets check for primers in our R2 reads, where the R2 primer is GGACTACHVGGGTWTCTAAT. Lets just grab the first few bases GGACTAC (to avoid ambiguous bases):
egrep --colour GGACTAC W097-V1_R2.fastq
Here again we see that for the R2 primer, judging from this sample, we need to trim off 25bp (highlighted section) from our reads. Let's see what proportion of reads contain R2 primer:
grep -o GGACTAC raw/W097-V1_R2.fastq | wc -l
240297
Total reads:
grep -o @M00 raw/W097-V1_R2.fastq | wc -l
254884
So again, about 94% of reads with exact matches to or R2 primer test segment.
Let's also check for primer readthrough in the R2 reads. First get the RC of R1 (TTACCGCGGCKGCTGGCACNNNNN) and then grep in R2 reads:
egrep --colour TTACCGCGGC W097-V1_R2.fastq
Again we see adapter readthrough at the tail-end of (some) of the R2 reads (highlighted section), let’s check the proportion of reads affected:
egrep -o TTACCGCGGC raw/W097-V1_R2.fastq | wc -l
So very small proportion affected here.
In summary, primer trimming is not always straightforward. There are two main strategies for trimming primers a) Pattern matching approaches like trimmomatic and cutadapt and b) absolute length truncation as implemented in our dada2 pipeline. Both approaches have pros and cons, pattern matching can fail to remove sections that you actually did want removed whereas absolute length truncation may trim reads that you didn’t want trimmed. The trick is to look at your data carefully, starting with FastQC and to come up with an acceptable solution. Never just assume default values/approaches will be appropriate for your data, dig deeper and make sure you know what you’re doing. Also important to keep in mind is that you may want to trim the tail end of seqs based on quality patterns you see from FastQC (common trend to see decreased sequencing quality towards the tail end of reads, especially R2).
In the example above, based on what we’ve seen, I would suggest trimming as follows:
--trimFor 24 (headcrop R1 read) --trimRev 25 (headcrop R2 read) --truncFor 251 (tailcrop R1 read) --truncRev 251 (tailcrop R2 read)
These settings will first trim all reads to 251 (most will already be 251). This setting also depends on sample quality - use a more aggressive trim if you see a distinct quality drop towards the tail of your reads. If we think about our two examples above where we a) aligned two standard 250bp sequences and b) aligned two shorter 118bp sequences (in which case we saw adapter readthrough) the settings truncFor and truncRev will only affect the > 251bp sequences, which means that our shorter 118bp reads will still contain adapter readthroughs (but this is ok for minority of sequences + the primer is not a technical sequence, it IS biologically relevant). Next, trimFor and trimRev will remove R1 primers from the start of our R1 reads (with trimFor) and R2 primers from the start of our R2 read (with trimRev) non-specifically.
Lastly it’s always a good idea to recheck quality with FastQC after trimming and filtering reads.