3B. Day 2: NGS data:Quality Assessment, Control and Mapping (Lecture cum Hands on Session) - bioinfokushwaha/Livestock_Genomics GitHub Wiki

1. High throughput genomic data processing and analysis workflow

image

2. Sequencing reads and type

image

3. Quality Assessment: Raw Data Pre-processing

3.1 FastQ: Sequencing data format

image

2.2 Basic Statistics

Most metrics within this section are self-explanatory. For PE reads, the total number of sequences should match the forward and reverse read files. It is good practice to take note of the FASTQ Phred encoding, as some downstream tools require the user to specify whether Phred64 or Phred33 encoding should be used. Finally, the %GC should lie within the expected values for the sample species per base sequence quality

The Phred scale quality represents the probability p that the corresponding base call is incorrect.

A Phred score Q is an integer mapping of p where: Q=−10⋅log10(p)

image

Briefly, a Phred score of 10 corresponds to one error in every 10 base calls or 90% accuracy; a Phred score of 20 to one error in every 100 base calls or 99% accuracy. The maximum Phred score is 40 (41 for Illumina version 1.8+ encoding).

https://github.com/bioinfokushwaha/Workshop/blob/master/Image/encoding.png

Overall sequence quality

This section details the quality distribution at the read level, in contrast to the quality per base position of the previous section. If the data is of good quality, the histogram will be skewed to the right. This FastQC section details the Phred scaled quality as a function of the position in the read. It is very common to observe a quality decrease as a function of the read length.

https://github.com/bioinfokushwaha/Workshop/blob/master/Image/PSSQ.png

3.3 Per base sequence content

In this section, the average proportion of individual bases (A, C, G and T) is plotted as a line across the length of the reads. The 12 first bases often show a bias that is characteristic of Illumina RNA-Seq data. This is in contrast with the DNA-Seq protocol, which does not show the same bias. The difference between protocols lies in three additional steps performed during the conversion of mRNA to cDNA, which is subsequently sequenced as if it were genomic DNA. Several hypotheses have been proposed as to the cause of this bias: during reverse transcription of the captured cDNA, random hexamer primers are used and these may introduce a positional bias of the reads; artifacts from end repair; and possibly a tenuous sequence specificity of the polymerase may each play a role either singularly in, most likely, in combination.

https://github.com/bioinfokushwaha/Workshop/blob/master/Image/PBSC4.png

3.4 Per sequence GC content

The plot in this section (see Figure 2 for an example) represents the distribution of GC content per read, where the data (red curve) is expected to approximately follow the theoretical distribution (blue curve). If the curve presents a shoulder in a region of high GC content, this is usually an indication that rRNA is present in the sample. However, it may also represent contamination by an organism with a higher GC content (such as bacteria or fungi). In contrast, a peak on the left-hand side would indicate enrichment for A/T-rich sequences. In particular a sharp peak for very low GC content (in the 0-3 range) is usually indicative of the sequencing of the mRNA poly-A tails. If this plot still shows issues after quality and rRNA filtering, additional steps would have to be taken to filter contaminants.

https://github.com/bioinfokushwaha/Workshop/blob/master/Image/GC_content.png

A comparison of the “theoretical” and “observed” GC distribution, the blue and read lines of FastQC “Per sequence GC content”. A. Examples of “observed” GC distribution with a poly-A enrichment (green), rRNA enrichment (red) or no (black) bias. B. The corresponding “theoretical” curve that FastQC would devise from such read GC content distribution.

3.5 Per base N content

This plot shows the fraction of indistinguishable bases as a function of the base position in the reads. In high-quality sequence data this is expected to be close to zero. Deviations from the expected values indicate problems during the sequencing.

https://github.com/bioinfokushwaha/Workshop/blob/master/Image/Ncontent.png

3.6 Sequence length distribution

This section shows the distribution of read lengths. Prior to trimming, there should be exactly one peak located at the sequenced read length.

https://github.com/bioinfokushwaha/Workshop/blob/master/Image/LD.png

3.7 Overrepresented sequences

This table shows sequences that are present at unusually large frequency in the reads. These are most commonly sequencing adapters and will be identified as such. If unidentified sequences are detailed these may originate from rRNA or other contaminants, in which case contaminant filtering should be considered. Often a BLAST search of the unidentified sequence using the NCBI nt database will be informative.

https://github.com/bioinfokushwaha/Workshop/blob/master/Image/ORS.png

3.8 Kmer content

The final plot of the FastQC report details the occurrence of kmers - nucleotide sequences of fixed k length - that were present at a higher-than-expected frequency as a function of their position within the read. Commonly, the early bases show the aforementioned Illumina sequencing bias (see section d), whereas the last bases may show enrichment for sequencing adapters.

https://github.com/bioinfokushwaha/Workshop/blob/master/Image/Kmer.png

3.9 Sequence duplication level

This plot represents the level of duplicate sequences in the library. FastQC assumes that the library is diverse, with even representation of all sequences, it assumes a uniform coverage as would usually be obtained for DNA-Seq experiments. However, this assumption is not valid for RNA-Seq/chipseq/ enriched libraries , which have a large dynamic range, possibly containing a million fold difference between lowly and highly expressed genes. As a result it is common to observe high duplication levels for sequences originating from highly expressed genes.

https://github.com/bioinfokushwaha/Workshop/blob/master/Image/SD.png

It is worth noting that before version 0.11 of FastQC, all duplication levels >= 10 were aggregated into a single bin. In more recent version this has been made more comprehensive in order to provide a more accurate representation of the data. This check for sequences that occur more frequently than expected in your data. It also checks any sequences it finds against a small database of known sequences. In this case it has found that a small number of reads 4000 out of 600000 appear to contain a sequence used in the preparation for the library. A typical cause is that the original DNA was shorter than the length of the read - so the sequencing overruns the actual DNA and runs into the adaptors used to bind it to the flow cell. At this level there is nothing to worry about - they will be trimmed in later stages.

4. QA : Hands-on session

Copy the read Data from Day2 folder (you have been instructed where to find these). Simply follow the instruction to pre-process the data. We downsize the original raw data to a manageable amount. To do so adapt the following instructions:

# First copy the data from day2 folder

$cp ../Day2/brain_R* ./
$zless brain_R1.fq.gz 


# read the compressed forward and reverse fastq file, piping the result in sed to extract the 
# first 151968 lines (equivalent to 37k sequences, as one sequence entry always span 4 lines
# and finally compressing it as a new file 

zcat brain_R1.fq.gz |sed -n 1,100000p | gzip -c > Brain_sub_R1.fq.gz
zcat brain_R2.fq.gz |sed -n 1,100000p | gzip -c > Brain_sub_R2.fq.gz

Note: It is just a example of subset the data.

Raw data FastQC

Commonly, Raw data is delivered in FASTQ format. The first pre-processing step is to assess the quality of the raw data received from the sequencing facility. it is essential that some initial QC assessments be performed. Most importantly, one should check the overall sequence quality, the GC percentage distribution ( the proportion of guanine and cytosine bp across the reads) and the presence / absence of overrepresented sequences. FastQC has become a de-facto standard for performing this task. A FastQC command line example is as follows:


# First, make an output directory called "Raw  and FastQC" and a sub-directory of
$mkdir -p FastQC/FastQC_Raw
  
# Then you can run fastqc on the file(s) you have selected. FastQC can run
# a number of files in parallel (see the -t option)

$fastqc -o FastQC/FastQC_Raw *.fq.gz
  
# Finally, also run multiqc to summarize all data in one plot
$mkdir -p MultiQC/MultiQC_Raw

$multiqc FastQC/FastQC_Raw/ -o MultiQC/MultiQC_Raw


# FastQC output is a zip archive and an HTML document under different sections describing the specific metrics that # were analyzed. These sections are detailed below. In order to view the html page that was created,

## Copy your fastQC and multiqc raw and trim files from remote server to ur desktop
1. Through scp command (recommended)
mkdir QC && cd
scp -r [email protected]:~/NGSWorkshop_2024/sandeep/FastQC* ./
scp -r [email protected]:~/NGSWorkshop_2024/sandeep/MultiQC* ./


2. Through filezilla 

FastQC Results

Sample data download

Assignment-1:

  1. How many reads does the file contain?
  2. What is the length of the reads ?
  3. Are there any positions with low sequence quality ?
  4. What could be the cause of the failure in the per base sequence content plot ?
  5. Which FASTQC module allows you to confirm this suspicion ?
  6. What does this module tell you ?
  7. What about sequence duplication levels ?
  8. Why did the per base sequence quality plot raise a failure ?
  9. Why does the Sequence duplication levels modules give a failure ?
  10. Estimation of coverage for E. coli K-12 strain substr. MG1655?