1. Preprocessing - Sara-SL/GenomeAnalysis GitHub Wiki

Method

As a first step of the project I wanted to quality check the data that I had been given. I had been given whole genome sequencing (wgs) data and rna seq data that had already been trimmed by our TAs. Only one sample of the rna seq data had not been trimmed. I therefore expected the quality of my data to be good.

Quality check

FastQC

I started off by createing soft links to the given data(wgs, raw rna_seq and trimmed rna_seq) and I manually renamed the files at the same time. I saved all wgs_data in one folder, the raw rna_seq data in one folder and the trimmed rna_seq data in one folder. Then I created three different .sh scripts that each loops through one of the folders, run fastqc on all of the files in the folder and outputs the result in the 1_quality_check folder. Then I run the following commands:

salloc -A g2020008 -p core -n 2 -t 00:40:00

module load bioinfo-tools

module load FastQC

cd git/GenomAnalysis/Scripts/

sh fastqc_wgs.sh

sh fastqc_rna_seq_raw.sh

sh fastqc_rna_seq_trimmed.sh

MultiQC

I got a lot of output files from FastQC so instead of manually looking trough all of them I used MultiQC to get a summarized overview of the quality of the data. I did this by typing the commands:

module load MultiQC

multiqc -o 1_quality_check/multiqc_wgs/ 1_quality_check/fastqc_wgs/

multiqc -o 1_quality_check/multiqc_rna_seq/trimmed/ 1_quality_check/fastqc_rna_seq/trimmed/

multiqc -o 1_quality_check/multiqc_rna_seq/raw/ 1_quality_check/fastqc_rna_seq/raw/

Where the last parameter was the folder containing all output files from FastQC

Trimming

The quality of the raw data looked okey but to remove adapters and other artifacts I run Trimmomatic with the following commands:

module load trimmomatic

java -jar $TRIMMOMATIC_HOME/trimmomatic.jar PE -trimlog rna_seq_trimLogFile HL_CS17_45.1.fastq.gz HL_CS17_45.2.fastq.gz HL_CS17_45.trimmed_1P.fastq.gz HL_CS17_45.trimmed_1U.fastq.gz HL_CS17_45.trimmed_2P.fastq.gz HL_CS17_45.trimmed_2U.fastq.gz ILLUMINACLIP:$TRIMMOMATIC_HOME/adapters/TruSeq3-PE.fa:2:30:10 LEADING:17 TRAILING:17

I put LEADING and TRAILING to 17 since that was the quality used in the paper. The ILLUMINACLIP values was set according to an example in the trimmomatic manual.

Quality check 2

After the trimming I quality checked the generated files using FastQC. I did this by creating a new .sh script and run the following commands:

module load bioinfo-tools

module load FastQC

cd git/GenomAnalysis/Scripts/

sh fastqc_HL_CS17_45.sh

The unpaired reverse strand failed to process since it was empty from the Trimmomatic run. It was therefore removed. The rest of the output files were manually checked.

Result

Figure 1: MultiQC result of given trimmed wgs data

MultiQC_wgs

Figure 2: MultiQC result of given trimmed rna_seq data

MultiQC_rna_trim

In figure 1 and 2 we can see the following:

Per Base Sequence Quality: All green show that we in average have good base call.

Per Sequence Quality Scores: All green show that no significant proportion of the sequences in a run have overall low quality.

Per Base Sequence Content: Some yellow and red flags but when manually looking at theses libraries (all data can be found here) none goes over 50% which shows that there is no highly biased position and the per base sequence content quality is not too bad.

Per Sequence GC Content: Quite many read and yellow flags which could indicate a contaminated library or some other kinds of biased subset.

Per Base N Content: All green show that the analysis pipeline was able to interpret the data well enough to make valid base calls

Sequence Length Distribution: All yellow show that there is some variation in sequence length but all lie between 95-100bp so no big difference in length.

Duplicate Sequences: Many yellow and red flags where some have a high level of duplication. This could indicate some kind of enrichment bias (eg PCR over amplification).

Overrepresented Sequences: Some yellow flags in the wgs_data set and only red and yellow flags in the rna_seq_data set. When manually looking at the wgs_data set, unpaired forward strand of 8to10kb library are most overrepresented. When manually looking at the rna_seq_data set, unpaired reverse strand of both forelimb and hindlimb from stage CS17, individual 42, are overrepresented. Finding that a single sequence is very overrepresented in the set either means that it is highly biologically significant, or indicates that the library is contaminated, or not as diverse as you expected.

Figure 3: FastQC result of given raw rna_seq data - forward strand

1_raw

Figure 4: FastQC result of trimmed rna_seq data - paired forward strand

1P_trim

Figure 5: FastQC result of given raw rna_seq data - reverse strand

2_raw

Figure 6: FastQC result of trimmed rna_seq data - paired reverse strand

2P_trim

As can be seen in figure 3 and 5 the base calls falls into the orange area towards the end of a read. This is common since the quality of calls on most platforms will degrade as the run progresses. Though comparing figure 3 vs 4 and 5 vs 6 we can se that the per base sequence quality improved after trimming the rna_seq library HL_CS17_45.

Discussion

To have good quality from the beginning of the analysis is very important to get reliable results. If you put garbage in you will get garbage out. From the result we can see that the overall quality is good even though some parts indicate that there might be some kind of biased subset like contamination present or some kind of enrichment bias. If there is contamination in the sample it is likely because of a non perfect sample prep. To do a perfect sample prep is hard and since we are working with mammals there isn't endless of sample to redo the sample prep because of ethical reasons. This means that one might need to accept small defects in the result but also keep it in mind when trying to biologically interpret the results.

Questions

1. What is the structure of a FASTQ file?

Line 1 begins with the '@' character followed by a sequence ID and a description(optional).
Line 2 contains the sequence.
Line 3 begins with a '+' character and is sometimes followed by the sequence ID and description.
Line 4 encodes the quality value for each letter in the sequence in Line 2. Each symbol represent the quality value of the corresponding base so there are equal number of symbols as letters in the sequence.

The structure of a FASTQ file containing one sequence might look like this

@SEQ_ID

GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

+

!''((((+))%%%++)(%%%%).1*-+*''))**55CCF>>>>>>CCCCCCC65

The symbols are increasing in quality from left to right in the following order:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~

2. How is the quality of the data stored in the FASTQ files? How are paired reads identified?

As mentioned in Q1, the quality of the data is stored as symbols after each sequence. The original Sanger FASTQ files also allowed the sequence and quality strings to be split over multiple lines, but this is generally discouraged as it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string).