BMA231 III: QC and mapping - bcfgothenburg/HT24 GitHub Wiki
Course: HT24 Next Generation Sequencing data analysis with clinical applications (BMA231)
The aim of these exercises is to introduce you to the QC, filtering and mapping of NGS data. You will be using Galaxy to do the different analyses with a focus on DNA and RNA sequencing data
You will be working with portions of real data sets to illustrate the overall workflow. The samples can be downloaded from Canvas.
The data is from a patient with Papillion-Lefèvre Syndrome (PLS). PLS is a very rare disorder of autosomal recessive inheritance distinguished by palmar plantar hyperkeratosis and early onset of periodontitis affecting dentition.
The data has been kindly provided by Johan Bylund, Professor at the Institute of Odontology at University of Gothenburg.
-
DNA_chr11_R1.fastq.gz
andDNA_chr11_R2.fastq.gz
The data is from an article, where Pierce et. al. compared bulk RNA sequencing from pedriatic and adult patients with COVID-19. They provided direct evidence of a more vigorous early mucosal immune response in children compared with adults, suggesting that this contributes to favorable clinical outcomes.
-
RNA_chr18_R1.fastq.gz
andRNA_chr18_R2.fastq.gz
You find this files in CANVAS under the NGS intro
module:
You will be using Galaxy.
- Log in into your account
- Create a history called
QC and mapping practical
- You do this in the right
History panel
- You do this in the right
- Click on
Upload
on the left menu
-
The
Upload from Disk or Web to ..
window will open -
Click on
Choose local file
-
Select your Fastq files (
*_R1.fastq.gz
and*_R2.fastq.gz
) -
Click on
Start
Click here for output
-
Wait until the data has been uploaded
- You will see that both of your jobs were added on the
History panel
Click here for output
- You will see that both of your jobs were added on the
The first thing to do when you receive data is to check its quality. FastQC is a program that generates general statistics from high throughput data (and pipelines).
-
Search for
FastQC
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Raw read data from your current history
- Select the icon for
Multiple datasets
- Select your four fastq files
- Select the icon for
- Click
Run Tool
Click here for output
-
Raw read data from your current history
-
You will see how the jobs are queued and run in the
History panel
Click here for output
For each sample, there will be 2 results:
-
One labeled as Webpage:
FastQC on data X: Webpage
- This is the
html
report ofFastQC
- Click on the
eye icon
to have a lookClick here for output
- This is the
-
And the other labeled as RawData:
FastQC on data X: RawData
- This contains the data to make the plots in the
html
report - Click on the
eye icon
to have a look
- This contains the data to make the plots in the
We could check each file individually, however it is time consuming and thus it is convenient to merge all the reports and stats files. We can achieve this by using MultiQC.
-
Search for
MultiQC
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Which tool was used generate logs? -->
FastQC
-
FastQC output --> Insert FastQC output
- Type of FastQC output? --> Raw data
- Select the 4 jobs annotated as
FastQC on data X: RawData
-
Report title -->
Raw fastq
-
Custom comment -->
Raw fastq
- Click
Run Tool
Click here for output
-
Which tool was used generate logs? -->
There are 2 results:
-
One labeled as Webpage:
MultiQC on data X, data X and others: Webpage
- Click on the
eye icon
Thehtml
report ofMultitQC
will be displayed. - As you can see all the samples are now summarized in one single report. Handy when you need to analyze several samples
Click here for output
- Click on the
Q1. What can you conclude about the quality of the sequencing data? You can either look at the individual html reports from FastQC or a the report from MultiQC
Make clear distinctions between DNA and RNA data while you focus onBasic Statistics
,Per base sequence quality
,Per sequence quality scores
,Per base sequence content
,Sequence Length Distribution
,Sequende Duplication Levels
andAdapter Content
.
Click here for output
0.2M - DNA - 151 bp
0.3M - RNA - 101 bp
DNA few duplicates, RNA high duplicates % as expected
typical sequence quality curve, all good quality
DNA seems to be fragmented with sonication, while RNA seems to be fragmented enzymatically
A little of N's at the begining of the DNA reads
RNA have adapters since they are of different lengths
-
And the other labeled as Stats:
MultiQC on data X, data X, and others: Stats
- This contains the data to make the plots in the
html
reports - Click on the
eye icon
to have a look
- This contains the data to make the plots in the
There are a lot of tools that can help us to trim our data based on quality, either by removing the low quality ends, possible adapters left, etc. TrimGalore! offers both quality and adapter trimming as well as quality control.
-
Search for
Trim Galore!
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Is this library paired- or single-end? -->
Paired-end
-
Reads in FASTQ format
- Select the icon for
Multiple datasets
- Select both R1 fastq files (
*_R1.fastq.gz
)
- Select the icon for
-
Reads in FASTQ format
- Select the icon for
Multiple datasets
- Select both R2 fastq files (
*_R2.fastq.gz
)
- Select the icon for
- Click
Run Tool
Click here for output
-
Is this library paired- or single-end? -->
Q2. Why did we select a
paired-end
library?
Click here for output
because we sequenced the 2 ends of the DNA fragments: R1 -> <- R2
Q3. What are the thresholds used for filtering?
Hint: click on the tool again and underTrim Galore! advanced settings
display the Full parameter list
Click here for output
Trim low-quality ends from reads in addition to adapter removal (Enter phred quality score threshold): 20
Overlap with adapter sequence required to trim a sequence: 1
Maximum allowed error rate: 0.1
Discard reads that became shorter than length N: 20
Now, let's compare statistics from before and after the filtering step:
-
Run FastQC on the filtered dataset:
-
Search for
FastQC
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Raw read data from your current history
- Select the icon for
Multiple datasets
- Select your four filtered fastq files:
Trim Galore! on data X and data X: trimmed reads pair 1
Trim Galore! on data X and data X: trimmed reads pair 2
- Select the icon for
- Click
Run Tool
Click here for output
-
Raw read data from your current history
-
-
Run MultiQC
-
Search for
MultiQC
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Which tool was used generate logs? -->
FastQC
-
FastQC output --> Insert FastQC output
- Type of FastQC output? --> Raw data
- Select the 4 jobs annotated as
FastQC on data X: RawData
-
Report title -->
Filtered fastq
-
Custom comment -->
Filtered fastq
- Click
Run Tool
Click here for output
-
Which tool was used generate logs? -->
-
Q4. What differences do you see between the raw files and the filtered ones?
Click here for output
Just less data, not a real change in quality
since we already had a good dataset and it is just a small portion
Here we will use BWA, an aligner based on the Burrows-Wheeler transform that has been optimized for
aligning short reads. BWA mem
is one of the algorithms that can be used.
NOTE: Remember to use the DNA samples!
-
Search for
Map with BWA-MEM
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Will you select a reference genome from your history or use a built-in index? -->
Use a built-in genome index
-
Using reference genome -->
Human Dec. 2013 (GRCh38/hg38)
-
Single or Paired-end reads -->
Paired
-
Select first set of reads -->
Trim Galore! on data X and data X: trimmed read pair 1
-
Select second set of reads -->
Trim Galore! on data X and data X: trimmed read pair 2
- Click
Run Tool
Click here for output
-
Will you select a reference genome from your history or use a built-in index? -->
You will get 1 output file: Map with BWA-MEM on data X: (mapped reads in BAM format)
-
Click on the
eye icon
You will see the alignment data, or at least part of it.-
The first part starts with an
@
and it is part of the header -
The second part is the detailed information of how each read is mapped to the genome
Click here for output
-
There are several short read aligners that can handle the non-contiguous transcript structure of RNAseq experiments. Here we will be using STAR.
NOTE: Remember to use the RNA samples!
-
Search for
RNA Star
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Single-end or paired-end reads -->
Paired-end (as individual datasets)
-
RNA-Seq FASTQ/FASTA file, forward reads -->
Trim Galore! on data X on data X: trimmed reads pair 1
-
RNA-Seq FASTQ/FASTA file, reverse reads -->
Trim Galore! on data x on data X: trimmed reads pair 2
-
Custom or built-in reference genome -->
Use a built-in index
-
Reference genome with or without an annotation -->
use genome reference without builtin gene-model and no not provide a gtf
-
Select reference genome -->
Human Dec. 2013 (GRCh38/hg38) (hg38)
- Click
Run Tool
Click here for output
-
Single-end or paired-end reads -->
You will get 3 output files:
-
One labeled as log:
RNA STAR on data X and data X: log
This file contains all the statistics of the mapping- Click on the
eye icon
to have a lookClick here for output
- Click on the
-
The second one, labeled as splice.junctions.bed:
RNA STAR on data X and data X: splice junctions.bed
This is a bed file (tab delimited file) with information about where the splice junctions where found according to the mapped information. Remember that a splice junction is an intron-exon boundarie in pre-RNA.
- Click on the
eye icon
to have a lookClick here for output
- Click on the
-
And the third file labeled as mapped.bam:
RNA STAR on data X and data X: mapped.bam
- Clik on the
eye icon
You will see the alignment data, or at least part of it.
Click here for output
- Clik on the
Since it is difficult to visualize this data, let's create first some statistics with flagstat
and idxstats
.
-
flagstat
-
Search for
Samtools flagstat
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
- BAM File to report statistics of?
- Select the icon for
Multiple datasets
- Select both alignments -->
Map with BWA-MEM on data X: (mapped reads in BAM format)
andRNA STAR on data X and data X: mapped.bam
-
Click
Run Tool
Click here for output
-
You will get 2 output files labeled as:
Samtools flagstat on data X
-
Click on the
eye icon
and have a lookClick here for output
-
-
idxstats
-
Search for
Samtools idxstats
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
- BAM file
- Select the icon for
Multiple datasets
- Select both alignments:
Map with BWA-MEM on data X: (mapped reads in BAM format)
andRNA STAR on data X and data X: mapped.bam
-
Click
Run Tool
Click here for output
-
You will get 2 output file labeled as:
Samtools idxstats on data X
-
Click on the
eye icon
to have a lookClick here for output
-
Let's use MultiQC
to gather all these statistics:
-
Search for
MultiQC
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Which tool was used generate logs? -->
Samtools
-
Click on Insert Samtools output
-
Type of Samtools output? -->
idxstats
-
Samtools idxstats ouput -->
Samtools idxstats on data X
-
Click on Insert Samtools output
-
Type of Samtools Output? -->
flagstat
-
Samtools flagstat ouput -->
Samtools flagstat on data X
-
Click on Insert Results
-
Which tool was used generate logs? -->
STAR
-
Click on Insert STAR output
-
Type of STAR output? -->
Log
-
Samtools flagstat ouput -->
RNA STAR on data X and data X:log
-
Click
Run Tool
Click here for output
-
Inspect the MultiQC
report briefly and answer the following:
Click here for output
Q5. Is the percentage of mapped sequences acceptable?
Click here for output
0.4M RNA
0.8M DNA
it is difficult to get the exact numbers in multiqc however, they are so similar that we can infer that most of them were mapped, so it is ok!
Q6. Looking at the XY counts, can you determine if your samples are from a female or a male? Have a look at the following graph for comparison:
Click here for output
This graph, is an example of how to visually distinguish (allegedly) between female and male.
Since the X and Y chromosome share some sequence similarity, we will always have some reads mapping
to both chromosomes (unless you set the algorithms to unique hits). However males will have a larger
portion of mapping reads. In this example the first sample is a male and the second a female.
Taking this into account, our samples seem to be males
Q7. Which chromosomes were sequenced for each sample?
Click here for output
chr11 DNA
chr18 RNA
Developed by Marcela Dávila, 2017. Modified by Vanja Börjesson, 2021. Modified by Marcela Dávila, 2022. Updated by Marcela Dávila, 2023.