3C. Day 2 : Quality control - bioinfokushwaha/Livestock_Genomics GitHub Wiki
Raw read trimming and adapter removal
It is a fact that on Illumina sequencers, the quality of a base pair is linked to its position in the read, bases in the later cycles of the sequencing process have a lower average quality than the earliest cycles (as was observed in the QC report above). This effect depends on the sequencing facility and on the chemistry used and it is only recently that sequencing aligners have integrated methods to correct for this - and not all alignment software does so. A common approach to increase the mapping rate of reads is to trim (remove) low quality bases from the 3โ end until the quality reaches a user-selected Phred-quality threshold. A threshold of 20 is widely accepted as it corresponds to a base call error of 1 in a 100, which is approximately the inherent technical error rate of the Illumina sequencing platform.
An additional issue encountered with Illumina sequencing is the presence of partial adapter sequences within sequenced reads. This arises when the sample fragment size has a large variance and fragments shorter than the sequencer read-length are sequenced. As the resulting reads may contain a significant part of the adapter - a bp sequence of artificial origin - earlier generation alignment software ( those that do not use Maximum Exact Matching and require global alignment of an entire read) may not be able to map such reads. Being able to identify the adapter-like sequences at the end of the read and clip/trim them - a process termed adapter removal - may consequently significantly increase the aligned read proportion.
There are numerous tools available to perform either or both of these tasks (quality trimming and adapter removal). Here, we exemplify using Trimmomatic (Bolger, Lohse, and Usadel 2014), a tool that does both. The command line to run the first two samples is as follows:
# Creating new folder for trimming files
mkdir Trim
# Runing trimming on reads
java -jar /Bioinfo/Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 brain_R1.fq.gz brain_R2.fq.gz Trim/brain"_PE_R1.fq" Trim/brain"_SE_R1.fq" Trim/brain"_PE_R2.fq" Trim/brain"_SE_R2.fq" \
ILLUMINACLIP:/bioinfo/Trimmomatic-0.38/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 \
MINLEN:36
ls -lh Trim/
Cutadpt
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.Cleaning your data in this way is often required: Reads from small-RNA sequencing contain the 3โ sequencing adapter because the read is longer than the molecule that is sequenced. Amplicon reads start with a primer sequence. Poly-A tails are useful for pulling out RNA from your sample, but often you donโt want them to be in your reads. Cutadapt helps with these trimming tasks by finding the adapter or primer sequences in an error-tolerant way. It can also modify and filter reads in various ways.
cutadapt -a TGGAATTCTCGGGTGCCAAGG --trimmed-only --minimum-length=18 brain_R1.fq.gz --cut=4 -o Trim/Cutadapt_trim".fq"
ls -lh Trim/
Assignment-1:
- Download: Assignment-2
- Open in web browser
- Compare Raw and Trim multiQC file
- Write down observation
- Guess type of sequencing data
Assignment-2:
- Download:Assignment-3
- Open in web browser
- Compare Raw and Trim multiQC file
- Write down observation
- Guess type of sequencing data