Preprocessing - sellwe/Genome-Analysis GitHub Wiki

To evaluate the quality of the Illumina DNA short reads for the Illumina/Nanopore extra step, I first ran FastQC on the raw paired-end reads (forward and reverse).

Raw quality control output from the .html files:

Both the forward and reverse files look fine. The paired reads are 90 bp long (the Illumina sequencing length). Neither file contains adapter content or overrepresented sequences. The second file dips slightly into the yellow (Phred score < 26) for per base sequence quality, but both files still fulfill the Q20 standard.

Even though the FastQC reports looked good, I decided to trim the reads with Trimmomatic anyway, as this is common practice for increasing quality and removing potential adapter content. I also wanted to run it for the sake of experience.

I ran Trimmomatic first, then ran FastQC again to see whether the quality changed. For trimming with Trimmomatic I used the settings SLIDINGWINDOW:4:20 MINLEN:50. This scans each read with a "sliding window" of 4 bp and trims the rest of the read once the average Phred score within the window drops below 20. If the remaining read is shorter than 50 bp after trimming, the read is removed.
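The trimming rule described above can be illustrated with a minimal Python sketch. This is a simplified model of the idea, not Trimmomatic's actual implementation (which may handle the bases around the cut point differently); the function names `sliding_window_cut` and `survives` are made up for this example.

```python
def sliding_window_cut(quals, window=4, min_avg=20):
    """Return the number of leading bases to keep.

    Simplified model: scan from the 5' end and cut at the start of the
    first window whose average Phred score is below min_avg.
    """
    for start in range(len(quals) - window + 1):
        # Compare sums instead of averages to stay in integer arithmetic.
        if sum(quals[start:start + window]) < min_avg * window:
            return start
    return len(quals)


def survives(quals, window=4, min_avg=20, min_len=50):
    """Model of MINLEN: keep the read only if enough bases remain."""
    return sliding_window_cut(quals, window, min_avg) >= min_len


# A 90 bp read whose quality collapses after base 60 is trimmed but kept.
late_drop = [30] * 60 + [10] * 30
# A 90 bp read whose quality collapses after base 40 becomes too short.
early_drop = [30] * 40 + [10] * 50

print(sliding_window_cut(late_drop), survives(late_drop))
print(sliding_window_cut(early_drop), survives(early_drop))
```

In this model a read is cut only when a whole window falls below the threshold on average, so a single low-quality base surrounded by good ones does not trigger trimming.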

Quality after trimming:

How much trimming was done:

Total reads pre trimming: 1666667 * 2 (forward and reverse) = 3333334

Surviving pairs post trimming: 1466331

Forward unpaired post trimming (R1): 181610

Reverse unpaired post trimming (R2): 5896

Total reads after trimming: 1466331 + 1466331 + 181610 + 5896 = 3120168

Discarded reads: 3333334 - 3120168 = 213166 reads, which is 6.4% of the input. Some of these come from pairs where both reads were dropped, and the rest are reads whose mate survived as an unpaired read. This is not a lot, which is to be expected since the quality was already good.
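The arithmetic above can be checked with a few lines of Python. The counts are the ones reported above; the variable names are only for illustration, and the split of discarded reads into the two categories is derived from those counts rather than read from the Trimmomatic log.

```python
# Counts reported by Trimmomatic (taken from the numbers above).
input_pairs = 1666667
both_surviving = 1466331
forward_only = 181610   # R1 unpaired: reverse mate was dropped
reverse_only = 5896     # R2 unpaired: forward mate was dropped

reads_before = input_pairs * 2
reads_after = both_surviving * 2 + forward_only + reverse_only
discarded = reads_before - reads_after
discarded_pct = 100 * discarded / reads_before

# Pairs where neither read survived (derived, not reported directly).
pairs_both_dropped = input_pairs - both_surviving - forward_only - reverse_only
# Reads dropped while their mate survived as an unpaired read.
mates_lost = forward_only + reverse_only

print(f"Reads before trimming: {reads_before}")
print(f"Reads after trimming:  {reads_after}")
print(f"Discarded reads:       {discarded} ({discarded_pct:.1f}%)")
print(f"  from pairs dropped entirely: {pairs_both_dropped * 2}")
print(f"  single mates dropped:        {mates_lost}")
```

The two sub-categories add up to the total number of discarded reads, which is a quick sanity check on the reported counts.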

As expected, the quality is now even better, indicating a successful trimming process. However, we get a warning for all of the trimmed files:

This is also expected: trimming produces reads of variable length, which triggers the "Sequence Length Distribution" warning. Most reads are still 90 bp long, some are 89 bp, and a few are between 50 and 89 bp, with the shortest being 50 bp due to MINLEN:50. As the rest looked fine, I will keep using these trimmed files going forward with the analysis.
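The length distribution behind this warning can also be tabulated directly from a FASTQ file. The following is a rough sketch, assuming standard four-line FASTQ records without wrapped sequences; the function name `length_distribution` and the toy records are made up for this example.

```python
from collections import Counter


def length_distribution(fastq_lines):
    """Count read lengths in an iterable of FASTQ lines.

    Assumes four lines per record, with the sequence on the second line.
    """
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # sequence line of each record
            counts[len(line.strip())] += 1
    return counts


# Toy input with two records of length 8 and one of length 5.
example = [
    "@read1\n", "ACGTACGT\n", "+\n", "IIIIIIII\n",
    "@read2\n", "ACGTA\n", "+\n", "IIIII\n",
    "@read3\n", "TTTTACGT\n", "+\n", "IIIIIIII\n",
]

for length, n in sorted(length_distribution(example).items()):
    print(length, n)
```

For a real file, an open file handle can be passed in place of the list (for gzipped files, one opened with `gzip.open(path, "rt")`).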

For R2_unpaired I get an additional warning:

It appears that some tiles gave lower quality scores than others, meaning that the imaging process across the flow cell tiles was inconsistent. But as the rest of the quality scores are good overall, this shouldn't impact further analyses.

R2_unpaired also doesn't contain many sequences (5896) compared to the other files, so I doubt it will have a large negative impact on the assembly quality. And since SPAdes performs better the more data it gets, I will move forward with all the remaining read data post trimming for the SPAdes assembly.