Preprocessing and Quality Control - MaryamDost/GenomeAnalysis GitHub Wiki

Processing and Quality Control

Leptospirillum ferriphilum transcriptome is sequenced by Illumina HiSeq2500 obtaining paired-read data. The raw reads are available on SRA and contains two files per run. A necessary step in the RNA-Seq workflow is to pre-process the raw data e.g., take the FASTQ files received from the sequencing facility and assess the quality of the sequence reads.

To examine quality metrics, FastQC was performed for the raw data and for the trimmed data. FastQC is a tool used to provide an overview of basic quality control metrics. The result from FastQC was then summarized by MultiQC, which is a reporting tool that parses summary statistics from results and log files generated by other bioinformatics tools.

The detailed script for how to run FastQC for the row reads is found Code directory, for further understanding of its methods and result investigation its documentation.

After controlling the quality of the raw reads, the reads were trimmed using Trimomatic. However, this was performed for one file only for practice purposes, the trimmed reads were already provided by TAs.

Trimomatic tool works well with FASTQ reads, the detailed script on how to run in is found in Code directory. More information about the tool and how to run it can also be found on its manual. The quality of the provided trimmed reads were then checked by FastQC running this script (which is coded the same as before but with input path to trimmed data).

As mentioned, the result from FastQC in both cases above was then summarized by running MultiQC. This tool was installed on local computer with the help av git commands provided here. The program then runs simply by this command: "multiqc .".

Result

Data from raw reads

Tabel1: General statistics in raw reads

Sample	%Duplicates	%G	Sample	%Duplicates	%G
ERR2036629_1	96.3	56.0	ERR2117288_1	96.3	56.0
ERR2036629_2	95.8	56.0	ERR2117288_2	95.8	56.0
ERR2036630_1	96.8	56.0	ERR2117289_1	96.8	56.0
ERR2036630_2	96.1	56.0	ERR2117289_2	96.1	56.0
ERR2036631_1	82.5	52.0	ERR2117290_1	82.5	52.0
ERR2036631_2	81.7	52.0	ERR2117290_2	81.7	52.0
ERR2036632_1	87.9	52.0	ERR2117291_1	87.9	52.0
ERR2036632_2	87.3	51.0	ERR2117291_2	87.3	51.0
ERR2036633_1	96.7	56.0	ERR2117292_1	96.7	56.0
ERR2036633_2	96.1	56.0	ERR2117292_2	96.1	56.0

Figure 1: The mean quality value across each base position in the raw reads

Figure 2: Adapter content for all runs in row reads

Figure 3: The distribution of GC content in raw reads

Figur 4: The number of reads with average quality scores in raw reads

Data from trimmed reads

Figure 5:The mean quality value across each base position in the trimmed reads

Figure 6: Adapter content for all runs in trimmed reads

Figure 7: The distribution of GC content in raw reads**

Figur 8: The number of reads with average quality scores in trimmed reads

Discussion

The “Per sequence quality scores” plot gives the average quality score. We see that the majority of our reads have a high average quality score with no large bumps at the lower quality values.

“Per base sequence content” gives a FAIL for RNA-seq data which is due to hexamer priming that occurs during RNA-seq library preparation. “Per Sequence GC Content” has typically a roughly normal distribution of GC content. However, our curve looks very bumpy and we have only 4 sequences that has passed the test. Warnings in this module usually indicate a problem with the library. Sharp peaks are the result of a specific contaminant (adapter dimers for example), which may well be picked up by the overrepresented sequences module. This might also be due to sequences having different GC content in different regions. What we could do is to separate the reads based on GC content and analyses all peaks individually. However, this step is not required for this project.

"Sequence Duplication Levels-plot for the sequence is not looking good which might depend to different factors. Contamination, over-sequencing a lot of PCR or the presence of adapter is some of the factors the effects this curve. In our case it is hard to say exactly the reason behind it being not good. It could be due to one or many of these factors acting together.

“Per Base N Content” tells us if there some uncalled bases at the library so this curve is looking good.

Overall, the trimmed sequence is good enough to be able to generate some reliable results.

Lab manual questions

What is the structure of a FASTQ file?

A FASTQ file is a text file that contains the sequence data and uses four lines per sequence. The first line begins with a '@' character followed by a sequence identifier with the information about the sequencing run and the cluster. Line two is the sequence (the base calls; A, C, T, G and N) and line three begins with a '+' separator and is optionally followed by the same se-quence identifier (and any description like sequence length) again. The fourth line contains the base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.

How is the quality of the data stored in the FASTQ files? How are paired reads identified?

As mention in previous answer the quality of data is stored as Phred quality score in fourth line of a sequence entry in the file. The Phred score is difined by the equation Q = -10 log10 P, a property which is logarithmically related to the base-calling error probabilities. The higher the score is the higher is the probability that the base is assigned correctly.

The FASTQ format encodes phred scores as ASCII characters. For the paired reads, the re-verse reads are found in a second file that has the same order as the first file, with the same headers.

How is the quality of your data?

Figure 1 shows clearly that our data needed to be trimmed to get rid of adapters, as it shows bad mean quality score at the ends of the sequences. We can see in figure 2 that our raw data contains a high number of adapters and in figure 4 we see that the Phres score is quite law for many of the sequences. As conclusion we can say that I quality of the raw read are not good and needs to be processed.

What can generate the issues you observe in your data? Can these causes any problems during subsequent analyses?

The issues in figure 1 is that the quality score is low at both ends which mean that sequence is still attached to adapters so we need to get rid av them. We see that the trimmning has worked and after trimming the number of adapters has decreased a lot. The phred score got also higher after the Data was trimmed.

The per base sequence content plot does not look good, which is probably due to the fact that we are dealing with RNA sequencing data so this we can ignore.

One issue that is concerning the is the Gc content, with has lots large bumps and far from be-ing a smooth curve. This could fairly indicate substantial contamination, but could also be caused by biased selection during sample preparation (e.g. low quality sample, low yield, PCR amplification bias).

The plot for sequence duplication levels is also not good. Some factors that could affect the duplication levels, which may also be relevant in our case, are over-sequencing, a lot of PCR, the presence of the adapters and the bump on the right side can also be caused by contami-nation. High duplication levels are to be expected in RNA-seq libraries since to capture lowly expressed transcripts it can be necessary to oversequence. This is common in smaller RNA-seq libraries which is what we are dealing with here.

Overall, the quality of the raw reads is not bad as some the problems were corrected by trimming the reads and the problems which were caused due to the type of library, we are using is hard to correct. We have now seen another piece of evidence indicating that we are working on a contaminated sample and therefore must take this in consideration while doing further analysis and evaluating the results.

How many reads have been discarded after trimming?

ERR2036629 for raw reads has a total sequence 24278349 and for the trimmed one (paired forward) we got 21793091 which is a difference of 2485258 e.g., a 10.24% of the reads were discarded.

How can this affect your future analyses and results?

This will improve our results. The law quality reads might have error that will affect the align-ment.

How is the quality of your data after trimming?

After trimming the quality of the data got much better, which is a reasonable result because adapters negatively impact sequencing data quality, and may even cause a run to stop prema-turely (hence the red lines). Figure 5 and 8 show how the score are improved after trimming and show that there is no subset of reads with particularly low quality. The per base sequence content has also improved significantly since we have trimmed the beginning of the reads.

What do the LEADING, TRAILING and SLIDINGWINDOW options do?

LEADING removes low quality bases from the beginning. As long as a base has a value below the chosen threshold the base will be removed and the next base will be investigated.

TRAILING does the same as LEADING but from the other end (the tail) of the read.

SLIDINGWINDOW trimmer will cut the leftmost position in the window where the average quality drops below the threshold and remove the rest of the read. However if there is low quality in the very beginning of the read then it will fail the minimum length tests and be removed completely.