BMA231 III: QC and mapping - bcfgothenburg/HT24 GitHub Wiki

Course: HT24 Next Generation Sequencing data analysis with clinical applications (BMA231)


The aim of these exercises is to introduce you to the QC, filtering and mapping of NGS data. You will be using Galaxy to do the different analyses with a focus on DNA and RNA sequencing data



Data

You will be working with portions of real data sets to illustrate the overall workflow. The samples can be downloaded from Canvas.

DNA samples

The data is from a patient with Papillion-Lefèvre Syndrome (PLS). PLS is a very rare disorder of autosomal recessive inheritance distinguished by palmar plantar hyperkeratosis and early onset of periodontitis affecting dentition.

The data has been kindly provided by Johan Bylund, Professor at the Institute of Odontology at University of Gothenburg.

  • DNA_chr11_R1.fastq.gz and DNA_chr11_R2.fastq.gz

RNA samples

The data is from an article, where Pierce et. al. compared bulk RNA sequencing from pedriatic and adult patients with COVID-19. They provided direct evidence of a more vigorous early mucosal immune response in children compared with adults, suggesting that this contributes to favorable clinical outcomes.

  • RNA_chr18_R1.fastq.gz and RNA_chr18_R2.fastq.gz

You find this files in CANVAS under the NGS intro module:

Galaxy set up

You will be using Galaxy.

  • Log in into your account
  • Create a history called QC and mapping practical
    • You do this in the right History panel

Upload your data

  • Click on Upload on the left menu
  • The Upload from Disk or Web to .. window will open

  • Click on Choose local file

  • Select your Fastq files (*_R1.fastq.gz and *_R2.fastq.gz)

  • Click on Start

    Click here for output
  • Wait until the data has been uploaded

    • You will see that both of your jobs were added on the History panel
      Click here for output

Fastq files QC

FastQC - per sample reporting

The first thing to do when you receive data is to check its quality. FastQC is a program that generates general statistics from high throughput data (and pipelines).

  • Search for FastQC in the Tools panel

  • Click on the tool

  • Set the following parameters as indicated:

    • Raw read data from your current history
      • Select the icon for Multiple datasets
      • Select your four fastq files
    • Click Run Tool
      Click here for output
  • You will see how the jobs are queued and run in the History panel

    Click here for output

For each sample, there will be 2 results:

  1. One labeled as Webpage: FastQC on data X: Webpage

    • This is the html report of FastQC
    • Click on the eye icon to have a look
      Click here for output
  2. And the other labeled as RawData: FastQC on data X: RawData

    • This contains the data to make the plots in the html report
    • Click on the eye icon to have a look

We could check each file individually, however it is time consuming and thus it is convenient to merge all the reports and stats files. We can achieve this by using MultiQC.

MultiQC - merging reports

  • Search for MultiQC in the Tools panel

  • Click on the tool

  • Set the following parameters as indicated:

    • Which tool was used generate logs? --> FastQC
    • FastQC output --> Insert FastQC output
      • Type of FastQC output? --> Raw data
      • Select the 4 jobs annotated as FastQC on data X: RawData
    • Report title --> Raw fastq
    • Custom comment --> Raw fastq
    • Click Run Tool
      Click here for output

There are 2 results:

  1. One labeled as Webpage: MultiQC on data X, data X and others: Webpage

    • Click on the eye icon
      The html report of MultitQC will be displayed.
    • As you can see all the samples are now summarized in one single report. Handy when you need to analyze several samples
      Click here for output

Q1. What can you conclude about the quality of the sequencing data? You can either look at the individual html reports from FastQC or a the report from MultiQC
Make clear distinctions between DNA and RNA data while you focus on Basic Statistics, Per base sequence quality, Per sequence quality scores, Per base sequence content, Sequence Length Distribution, Sequende Duplication Levels and Adapter Content.

Click here for output
0.2M - DNA - 151 bp
0.3M - RNA - 101 bp

DNA few duplicates, RNA high duplicates % as expected
typical sequence quality curve, all good quality
DNA seems to be fragmented with sonication, while RNA seems to be fragmented enzymatically
A little of N's at the begining of the DNA reads
RNA have adapters since they are of different lengths
  1. And the other labeled as Stats: MultiQC on data X, data X, and others: Stats

    • This contains the data to make the plots in the html reports
    • Click on the eye icon to have a look

Quality trimming with TrimGalore!

There are a lot of tools that can help us to trim our data based on quality, either by removing the low quality ends, possible adapters left, etc. TrimGalore! offers both quality and adapter trimming as well as quality control.

  • Search for Trim Galore! in the Tools panel

  • Click on the tool

  • Set the following parameters as indicated:

    • Is this library paired- or single-end? --> Paired-end
    • Reads in FASTQ format
      • Select the icon for Multiple datasets
      • Select both R1 fastq files (*_R1.fastq.gz)
    • Reads in FASTQ format
      • Select the icon for Multiple datasets
      • Select both R2 fastq files (*_R2.fastq.gz)
    • Click Run Tool
      Click here for output

Q2. Why did we select a paired-end library?

Click here for output
because we sequenced the 2 ends of the DNA fragments:  R1 ->  <- R2

Q3. What are the thresholds used for filtering?
Hint: click on the tool again and under Trim Galore! advanced settings display the Full parameter list

Click here for output
Trim low-quality ends from reads in addition to adapter removal (Enter phred quality score threshold): 20
Overlap with adapter sequence required to trim a sequence: 1
Maximum allowed error rate: 0.1
Discard reads that became shorter than length N: 20

Now, let's compare statistics from before and after the filtering step:

  1. Run FastQC on the filtered dataset:

    • Search for FastQC in the Tools panel

    • Click on the tool

    • Set the following parameters as indicated:

      • Raw read data from your current history
        • Select the icon for Multiple datasets
        • Select your four filtered fastq files:
          • Trim Galore! on data X and data X: trimmed reads pair 1
          • Trim Galore! on data X and data X: trimmed reads pair 2
      • Click Run Tool
        Click here for output
  2. Run MultiQC

    • Search for MultiQC in the Tools panel

    • Click on the tool

    • Set the following parameters as indicated:

      • Which tool was used generate logs? --> FastQC
      • FastQC output --> Insert FastQC output
        • Type of FastQC output? --> Raw data
        • Select the 4 jobs annotated as FastQC on data X: RawData
      • Report title --> Filtered fastq
      • Custom comment --> Filtered fastq
      • Click Run Tool
        Click here for output

Q4. What differences do you see between the raw files and the filtered ones?

Click here for output
Just less data, not a real change in quality 
since we already had a good dataset and it is just a small portion

Mapping filtered reads

BWA - aligner for DNA samples

Here we will use BWA, an aligner based on the Burrows-Wheeler transform that has been optimized for aligning short reads. BWA mem is one of the algorithms that can be used.

NOTE: Remember to use the DNA samples!

  • Search for Map with BWA-MEM in the Tools panel

  • Click on the tool

  • Set the following parameters as indicated:

    • Will you select a reference genome from your history or use a built-in index? --> Use a built-in genome index
    • Using reference genome --> Human Dec. 2013 (GRCh38/hg38)
    • Single or Paired-end reads --> Paired
    • Select first set of reads --> Trim Galore! on data X and data X: trimmed read pair 1
    • Select second set of reads --> Trim Galore! on data X and data X: trimmed read pair 2
    • Click Run Tool
      Click here for output

You will get 1 output file: Map with BWA-MEM on data X: (mapped reads in BAM format)

  • Click on the eye icon
    You will see the alignment data, or at least part of it.

    • The first part starts with an @ and it is part of the header

    • The second part is the detailed information of how each read is mapped to the genome

      Click here for output

STAR - aligner for RNA samples

There are several short read aligners that can handle the non-contiguous transcript structure of RNAseq experiments. Here we will be using STAR.

NOTE: Remember to use the RNA samples!

  • Search for RNA Star in the Tools panel

  • Click on the tool

  • Set the following parameters as indicated:

    • Single-end or paired-end reads --> Paired-end (as individual datasets)
    • RNA-Seq FASTQ/FASTA file, forward reads --> Trim Galore! on data X on data X: trimmed reads pair 1
    • RNA-Seq FASTQ/FASTA file, reverse reads --> Trim Galore! on data x on data X: trimmed reads pair 2
    • Custom or built-in reference genome --> Use a built-in index
    • Reference genome with or without an annotation --> use genome reference without builtin gene-model and no not provide a gtf
    • Select reference genome --> Human Dec. 2013 (GRCh38/hg38) (hg38)
    • Click Run Tool
      Click here for output

You will get 3 output files:

  1. One labeled as log: RNA STAR on data X and data X: log This file contains all the statistics of the mapping

    • Click on the eye icon to have a look
      Click here for output
  2. The second one, labeled as splice.junctions.bed: RNA STAR on data X and data X: splice junctions.bed

    This is a bed file (tab delimited file) with information about where the splice junctions where found according to the mapped information. Remember that a splice junction is an intron-exon boundarie in pre-RNA.

    • Click on the eye icon to have a look
      Click here for output
  3. And the third file labeled as mapped.bam: RNA STAR on data X and data X: mapped.bam

    • Clik on the eye icon
      You will see the alignment data, or at least part of it.
      Click here for output

Statistics with samtools

Since it is difficult to visualize this data, let's create first some statistics with flagstat and idxstats.

  1. flagstat

    • Search for Samtools flagstat in the Tools panel

    • Click on the tool

    • Set the following parameters as indicated:

      • BAM File to report statistics of?
      • Select the icon for Multiple datasets
      • Select both alignments --> Map with BWA-MEM on data X: (mapped reads in BAM format) and RNA STAR on data X and data X: mapped.bam
    • Click Run Tool

      Click here for output
    • You will get 2 output files labeled as: Samtools flagstat on data X

    • Click on the eye icon and have a look

      Click here for output
  2. idxstats

    • Search for Samtools idxstats in the Tools panel

    • Click on the tool

    • Set the following parameters as indicated:

      • BAM file
      • Select the icon for Multiple datasets
      • Select both alignments: Map with BWA-MEM on data X: (mapped reads in BAM format) and RNA STAR on data X and data X: mapped.bam
    • Click Run Tool

      Click here for output
    • You will get 2 output file labeled as: Samtools idxstats on data X

    • Click on the eye icon to have a look

      Click here for output

Let's use MultiQC to gather all these statistics:

  • Search for MultiQC in the Tools panel

  • Click on the tool

  • Set the following parameters as indicated:

    • Which tool was used generate logs? --> Samtools

    • Click on Insert Samtools output

    • Type of Samtools output? --> idxstats

    • Samtools idxstats ouput --> Samtools idxstats on data X

    • Click on Insert Samtools output

    • Type of Samtools Output? --> flagstat

    • Samtools flagstat ouput --> Samtools flagstat on data X

    • Click on Insert Results

    • Which tool was used generate logs? --> STAR

    • Click on Insert STAR output

    • Type of STAR output? --> Log

    • Samtools flagstat ouput --> RNA STAR on data X and data X:log

    • Click Run Tool

      Click here for output

Inspect the MultiQC report briefly and answer the following:

Click here for output

Q5. Is the percentage of mapped sequences acceptable?

Click here for output
0.4M  RNA
0.8M  DNA

it is difficult to get the exact numbers in multiqc however, they are so similar that we can infer that most of them were mapped, so it is ok!

Q6. Looking at the XY counts, can you determine if your samples are from a female or a male? Have a look at the following graph for comparison:

Click here for output
This graph, is an example of how to visually distinguish (allegedly) between female and male. 
Since the X and Y chromosome share some sequence similarity, we will always have some reads mapping
to both chromosomes (unless you set the algorithms to unique hits). However males will have a larger
portion of mapping reads. In this example the first sample is a male and the second a female.

Taking this into account, our samples seem to be males

Q7. Which chromosomes were sequenced for each sample?

Click here for output
chr11 DNA
chr18 RNA



Developed by Marcela Dávila, 2017. Modified by Vanja Börjesson, 2021. Modified by Marcela Dávila, 2022. Updated by Marcela Dávila, 2023.

⚠️ **GitHub.com Fallback** ⚠️