09 Preprocessing and quality evaluation of Illumina short reads - saltpinna/Genome_analysis_project GitHub Wiki

FastQC results before trimming

Forward sequence:

The forward reads had high quality scores at all base positions, with Phred scores between 32 and 40. Quality dips slightly towards the beginning and end of the reads.

Reverse sequence:

The reverse reads have poor quality at the ends of the reads, and they show the same dip in quality towards the beginning of the reads as the forward reads.

Trimming with Trimmomatic

Because Phred scores decreased towards the ends of the reads, trimming was performed with Trimmomatic using the LEADING, TRAILING and MINLEN settings. The script is called Trimmomatic_script.sh and can be found under code/scripts. The reads provided with the article had in fact already been trimmed, so this step was not needed to remove adapters. Instead, trimming was performed to increase the quality scores of the reads used for assembly.

FastQC results after trimming

Forward sequence:

After trimming, the forward reads show higher Phred scores, varying between 36 and 38 and sitting around 38 for most positions.

Reverse sequence:

The reverse reads no longer show the decrease in quality towards the ends of the reads that was seen before trimming, and the Phred scores are centered more closely around 36, the threshold used in the script.

Questions

How many reads have been discarded after trimming?

According to the log in the Slurm output file, 4 reads were discarded after trimming. The reads had already been trimmed, which explains why so few were discarded entirely.

How can this affect your future analyses and results?

If too many reads are dropped, we risk losing too much sequence information, which can lead to gaps in our alignment. This can in turn cause problems in later analysis steps. For example, if part of the genome is not assembled properly because too many reads were dropped during preprocessing, the RNA reads cannot be mapped to the genes in that part of the assembly, and we might conclude that a gene is not expressed even though it actually is.

How is the quality of your data after trimming?

After trimming, the lowest Phred score in the reads is 36, so the reads score at or above that value throughout, which indicates very high quality.
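To put the score of 36 in perspective, a Phred score Q is defined so that the probability of a wrong base call is 10^(-Q/10). The short sketch below converts a few scores to error probabilities. The function name `phred_to_error_prob` is chosen here for illustration and is not part of the project scripts.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to the probability of a wrong base call."""
    return 10 ** (-q / 10)


# Compare the threshold used here (36) with some commonly quoted scores.
for q in (20, 30, 36, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.6f} (about 1 in {round(1 / p)})")
```

A score of 36 therefore corresponds to roughly one expected error in 4,000 base calls.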

What do the LEADING, TRAILING and SLIDINGWINDOW options do?

LEADING trims bases from the beginning of a read for as long as their Phred quality score is below the specified threshold. TRAILING does the same thing at the end of the read. SLIDINGWINDOW slides across the read from the 5' end using a window of a specified size, calculates the average quality of the bases in the window, and cuts the read once that average falls below the specified threshold.
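The behaviour of these options can be illustrated with a minimal Python sketch that works on a list of per-base Phred scores. This is a simplified illustration of the ideas described above, not Trimmomatic's actual implementation: in particular, the exact position at which Trimmomatic cuts inside a failing window may differ. The function names and the example scores are made up for this sketch.

```python
def leading(quals, threshold):
    """Drop bases from the start of the read while their quality is below the threshold."""
    start = 0
    while start < len(quals) and quals[start] < threshold:
        start += 1
    return quals[start:]


def trailing(quals, threshold):
    """Drop bases from the end of the read while their quality is below the threshold."""
    end = len(quals)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return quals[:end]


def sliding_window(quals, window_size, threshold):
    """Cut the read at the first window whose mean quality is below the threshold.

    Simplification: everything from the start of the failing window onwards is removed.
    """
    for start in range(len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < threshold:
            return quals[:start]
    return quals


def passes_minlen(quals, min_length):
    """Return True if the trimmed read is still long enough to keep."""
    return len(quals) >= min_length


# Hypothetical per-base Phred scores for a single read.
quals = [12, 30, 38, 38, 37, 38, 36, 20, 18, 10]

after_leading = leading(quals, 20)
after_window = sliding_window(after_leading, 4, 30)
print(after_leading)
print(trailing(quals, 20))
print(after_window)
print(passes_minlen(after_window, 5))
```

The steps are chained in the same way Trimmomatic applies them, in the order given, so a read that becomes too short after quality trimming is only dropped when the MINLEN check runs last.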