3. Reads Quality Control - lovisalittbrand/Genome_Analysis GitHub Wiki

Introduction

The purpose of quality control on the reads is to reduce the risk of erroneous interpretations in the assembly. Since the assembly is built from the reads, their quality directly affects the quality of the assembly. Including poor-quality reads is therefore likely to result in a poor assembly, which in turn will affect all downstream analyses.

Method

The quality check of the reads is done using the software FastQC through UPPMAX. The parameters used when running FastQC are the following:

-t 2: Specifying the number of threads I want to run
-f fastq: Specifying the format of the RNA and DNA reads
-o ~/analyses/01_preprocessing/reads_QC_DNA or ~/analyses/01_preprocessing/reads_QC_RNA: Specifying the output directory of the result

The files containing the reads are also given as input when calling FastQC. This step is done for both the trimmed DNA reads and the untrimmed RNA reads. Trimming of RNA will be done in a later step, where an additional quality check will also be done to evaluate the result of the trimming. The script for this analysis can be found here.
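As an illustration of how the parameters above fit together, the call can be sketched as a small Python helper that only assembles the argument list. This is a minimal sketch and not the script linked above: the function name `build_fastqc_command` and the input file name are hypothetical, and only the flags `-t`, `-f` and `-o` described in this section are used.

```python
import os
from typing import List, Sequence


def build_fastqc_command(read_files: Sequence[str], out_dir: str, threads: int = 2) -> List[str]:
    """Assemble a FastQC argument list from the parameters described above."""
    if not read_files:
        raise ValueError("at least one file of reads is required")
    return [
        "fastqc",
        "-t", str(threads),                 # number of threads
        "-f", "fastq",                      # format of the input reads
        "-o", os.path.expanduser(out_dir),  # output directory ("~" is expanded)
        *read_files,                        # files containing the reads
    ]


# Hypothetical input file name, used only for illustration.
example = build_fastqc_command(["reads_1.fastq.gz"], "~/analyses/01_preprocessing/reads_QC_RNA")
print(" ".join(example))
```

The resulting list could be passed to `subprocess.run(example, check=True)` on a system where FastQC is installed and the output directory already exists.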

Result

The output of FastQC consists of an HTML-report containing a set of valuable metrics for each file of reads that was given as an input.

DNA reads

The quality of all DNA reads was quite similar. All statistics were marked as good (green) except Per sequence GC content (red) and Sequence Length Distribution (orange). The plot of GC content per read did not resemble the theoretical distribution. This is because the theoretical distribution assumes that the data come from a single genome. Since we have a mix of genomes, this statistic will probably never be perfect. Figure 1 below shows the Per Base Sequence Quality for SRR43421291, where the quality is good even at the end of the read, where it otherwise tends to decrease. All reports assessing the quality of the DNA reads can be found here. In conclusion, I would say that the quality of the DNA reads is good.

Quality DNA

Figure 1: Per Base Sequence Quality DNA
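To illustrate why a mix of genomes can disturb the Per sequence GC content statistic discussed above, the following minimal sketch computes the GC fraction per read. The reads are made up for illustration and do not come from the data set. Pooling reads from a low-GC and a high-GC source gives two separate groups of values instead of a single peak.

```python
from typing import List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Made-up reads standing in for two genomes with different GC content.
low_gc_reads: List[str] = ["ATATATGCAT", "TTAAGCATAT"]
high_gc_reads: List[str] = ["GCGCGGCCAT", "CCGGGCGCTA"]

# Pooled values cluster in two groups rather than around one mean.
pooled = sorted(gc_fraction(read) for read in low_gc_reads + high_gc_reads)
print(pooled)
```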

RNA reads

The quality of the RNA reads was much worse than that of the DNA reads. The majority of the metrics had a red or orange marking. The Per base sequence quality plot showed that the quality scores dropped considerably toward the end of the reads, down to a level that is not acceptable. This pattern is visible in figure 2, which shows the result for SRR4342137. In conclusion, I would say that the quality of the RNA reads is not acceptable.

RNA Quality

Figure 2: Per Base Sequence Quality RNA
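The kind of per-position trend shown in figure 2 can be approximated with a short sketch that averages the quality scores at each position across reads. This is a simplification: FastQC draws a box plot per position, while this example only computes the mean. It also assumes Phred+33 encoded quality strings, and the input strings are made up.

```python
from typing import List


def mean_quality_per_position(quality_strings: List[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position (assumes Phred+33 encoding)."""
    if not quality_strings:
        return []
    max_len = max(len(q) for q in quality_strings)
    means = []
    for pos in range(max_len):
        # Only reads long enough to cover this position contribute.
        scores = [ord(q[pos]) - offset for q in quality_strings if pos < len(q)]
        means.append(sum(scores) / len(scores))
    return means


# Made-up quality strings whose scores drop toward the end of the read.
print(mean_quality_per_position(["II5", "I5#"]))
```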

Another interesting metric to analyze was the Adapter Content, where all RNA reads showed high signals of the Illumina Universal Adapter (figure 3). The DNA reads, in comparison, showed no signal from any adapter. The signal is therefore explained by the RNA reads not being trimmed, and this result will probably change once they are trimmed. Additionally, the RNA reads had a number of overrepresented sequences consisting of both adapters and PCR primers. This is in contrast to the DNA reads, which had no overrepresented sequences, as the primers are also removed during the trimming. All reports assessing the quality of the RNA reads can be found here.

Figure 3: Adapter content in RNA read

Questions

What is the structure of a FASTQ file? How is the quality of the data stored in the FASTQ files? How are paired reads identified?

FASTQ files have a general structure in which each read is described by 4 lines [1]:

1. The first line starts with @ and is followed by an Illumina identifier. The identifier contains /1 or /2 when the read belongs to a pair, and may contain other information about the run or cluster.
2. The second line contains the nucleotide sequence.
3. The third line contains a separator, +, which may be followed by the same Illumina identifier as in 1.
4. The fourth line contains the base quality scores for the sequence given in 2. Each score is stored as a single ASCII character, so the line has the same number of characters as there are letters in the sequence.

The quality score of a base is defined as Q = -10 log10(e), where e is the estimated probability that the base call is wrong. A higher Q corresponds to a lower probability of an error (Q = 30 means an error probability of 1 in 1000), and bases with too low Q may not be usable.
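The structure described above can be illustrated with a minimal parser for a single record. This is a sketch under stated assumptions: the quality characters are taken to be Phred+33 encoded (the offset is not stated in this page), the record is made up, and the function names are hypothetical.

```python
from typing import Dict, List

PHRED_OFFSET = 33  # assumption: Phred+33 encoding of the quality characters


def parse_fastq_record(lines: List[str]) -> Dict[str, object]:
    """Parse one 4-line FASTQ record into identifier, mate, sequence and scores."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record consists of exactly 4 lines")
    header, seq, sep, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not sep.startswith("+"):
        raise ValueError("line 1 must start with '@' and line 3 with '+'")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lines must have the same length")
    read_id = header[1:]
    # A trailing /1 or /2 marks which read of a pair this record is.
    mate = int(read_id[-1]) if read_id.endswith(("/1", "/2")) else None
    scores = [ord(ch) - PHRED_OFFSET for ch in qual]
    return {"id": read_id, "mate": mate, "sequence": seq, "scores": scores}


def error_probability(q: int) -> float:
    """Invert Q = -10 * log10(e) to get the error probability e."""
    return 10 ** (-q / 10)


# Made-up record for illustration.
record = parse_fastq_record(["@read1/1\n", "ACGT\n", "+\n", "II5#\n"])
print(record["mate"], record["scores"])
print([error_probability(q) for q in record["scores"]])
```

In this made-up record the last base has Q = 2, which corresponds to an error probability of about 0.63, so that base call carries very little information.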

How is the quality of your data?

The quality of the DNA and RNA data is discussed above. To summarize, the quality of the DNA reads was acceptable, whereas the quality of the RNA reads was not. For the RNA reads, the base quality scores decreased significantly as the read progressed, reaching a level below what is acceptable.

What can generate the issues you observe in your data? Can these cause any problems during subsequent analyses?

One cause of these issues is the presence of Illumina adapters still remaining on the sequences. The adapters contain, for example, primer binding sites, indexes and binding sites to the flow cell, which are all crucial for the technology. They can become a problem in later analyses, such as the mapping, where they might interfere. To avoid affecting downstream analyses it is therefore necessary to trim the RNA reads. The software Trimmomatic can remove Illumina adapters as well as regions with too low quality scores. [2]
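The idea behind the trimming can be sketched in a few lines. This is not how Trimmomatic works internally, since it uses more tolerant matching and several trimming modes. The sketch only cuts a read at the first exact occurrence of an adapter and then removes low-quality bases from the 3' end. The adapter sequence is an assumption (the common start of Illumina TruSeq adapters), as are the Phred+33 encoding and the threshold of 20.

```python
from typing import Tuple

# Assumption: common start of Illumina TruSeq adapters, used for illustration.
ADAPTER = "AGATCGGAAGAGC"


def trim_adapter(seq: str, qual: str, adapter: str = ADAPTER) -> Tuple[str, str]:
    """Cut the read at the first exact adapter match (no mismatches allowed)."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def trim_low_quality_tail(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Made-up read: 8 genomic bases, then the adapter, then 3 extra bases.
read = "ACGTACGT" + ADAPTER + "TTT"
print(trim_adapter(read, "I" * len(read)))
print(trim_low_quality_tail("ACGTAC", "IIII##"))
```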

References

1. Illumina. 2020. Sequencing Quality Scores. WWW-document 2020-: https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/quality-scores.html. Retrieved 2020-05-26.
2. Adapter trimming: Why are adapter sequences trimmed from only the 3' ends of reads? WWW-document 2020-: https://support.illumina.com/bulletins/2016/04/adapter-trimming-why-are-adapter-sequences-trimmed-from-only-the--ends-of-reads.html. Retrieved 2020-05-26.