BMA231 III: QC and mapping - bcfgothenburg/HT24 GitHub Wiki
Course: HT24 Next Generation Sequencing data analysis with clinical applications (BMA231)
The aim of these exercises is to introduce you to the QC, filtering and mapping of NGS data. You will be using Galaxy to do the different analyses with a focus on DNA and RNA sequencing data

You will be working with portions of real data sets to illustrate the overall workflow. The samples can be downloaded from Canvas.
The data is from a patient with Papillion-Lefèvre Syndrome (PLS). PLS is a very rare disorder of autosomal recessive inheritance distinguished by palmar plantar hyperkeratosis and early onset of periodontitis affecting dentition.
The data has been kindly provided by Johan Bylund, Professor at the Institute of Odontology at University of Gothenburg.
- 
DNA_chr11_R1.fastq.gzandDNA_chr11_R2.fastq.gz
The data is from an article, where Pierce et. al. compared bulk RNA sequencing from pedriatic and adult patients with COVID-19. They provided direct evidence of a more vigorous early mucosal immune response in children compared with adults, suggesting that this contributes to favorable clinical outcomes.
- 
RNA_chr18_R1.fastq.gzandRNA_chr18_R2.fastq.gz
You find this files in CANVAS under the NGS intro module:
You will be using Galaxy.
- Log in into your account
- Create a history called QC and mapping practical- You do this in the right History panel
 
- You do this in the right 
- Click on Uploadon the left menu
- 
The Upload from Disk or Web to ..window will open
- 
Click on Choose local file
- 
Select your Fastq files ( *_R1.fastq.gzand*_R2.fastq.gz)
- 
Click on StartClick here for output  
- 
Wait until the data has been uploaded - You will see that both of your jobs were added on the History panelClick here for output  
 
- You will see that both of your jobs were added on the 
The first thing to do when you receive data is to check its quality. FastQC is a program that generates general statistics from high throughput data (and pipelines).
- 
Search for FastQCin theTools panel
- 
Click on the tool 
- 
Set the following parameters as indicated: - 
Raw read data from your current history 
 - Select the icon for Multiple datasets
- Select your four fastq files
 
- Select the icon for 
- Click Run ToolClick here for output  
 
- 
Raw read data from your current history 
- 
You will see how the jobs are queued and run in the History panelClick here for output  
For each sample, there will be 2 results:
- 
One labeled as Webpage: FastQC on data X: Webpage- This is the htmlreport ofFastQC
- Click on the eye iconto have a lookClick here for output  
 
- This is the 
- 
And the other labeled as RawData: FastQC on data X: RawData- This contains the data to make the plots in the htmlreport
- Click on the eye iconto have a look
 
- This contains the data to make the plots in the 
We could check each file individually, however it is time consuming and thus it is convenient to merge all the reports and stats files. We can achieve this by using MultiQC.
- 
Search for MultiQCin theTools panel
- 
Click on the tool 
- 
Set the following parameters as indicated: - 
Which tool was used generate logs? --> FastQC
- 
FastQC output --> Insert FastQC output
- Type of FastQC output? --> Raw data
- Select the 4 jobs annotated as FastQC on data X: RawData
 
- 
Report title --> Raw fastq
- 
Custom comment --> Raw fastq
- Click Run ToolClick here for output  
 
- 
Which tool was used generate logs? --> 
There are 2 results:
- 
One labeled as Webpage: MultiQC on data X, data X and others: Webpage- Click on the eye icon
 Thehtmlreport ofMultitQCwill be displayed.
- As you can see all the samples are now summarized in one single report. Handy when you need to analyze several samples
Click here for output  
 
- Click on the 
Q1. What can you conclude about the quality of the sequencing data? You can either look at the individual html reports from FastQC or a the report from MultiQC
Make clear distinctions between DNA and RNA data while you focus onBasic Statistics,Per base sequence quality,Per sequence quality scores,Per base sequence content,Sequence Length Distribution,Sequende Duplication LevelsandAdapter Content.
Click here for output
0.2M - DNA - 151 bp
0.3M - RNA - 101 bp
DNA few duplicates, RNA high duplicates % as expected
typical sequence quality curve, all good quality
DNA seems to be fragmented with sonication, while RNA seems to be fragmented enzymatically
A little of N's at the begining of the DNA reads
RNA have adapters since they are of different lengths
- 
And the other labeled as Stats: MultiQC on data X, data X, and others: Stats- This contains the data to make the plots in the htmlreports
- Click on the eye iconto have a look
 
- This contains the data to make the plots in the 
There are a lot of tools that can help us to trim our data based on quality, either by removing the low quality ends, possible adapters left, etc. TrimGalore! offers both quality and adapter trimming as well as quality control.
- 
Search for Trim Galore!in theTools panel
- 
Click on the tool 
- 
Set the following parameters as indicated: - 
Is this library paired- or single-end? --> Paired-end
- 
Reads in FASTQ format
- Select the icon for Multiple datasets
- Select both R1 fastq files (*_R1.fastq.gz)
 
- Select the icon for 
- 
Reads in FASTQ format
- Select the icon for Multiple datasets
- Select both R2 fastq files (*_R2.fastq.gz)
 
- Select the icon for 
- Click Run ToolClick here for output  
 
- 
Is this library paired- or single-end? --> 
Q2. Why did we select a
paired-endlibrary?
Click here for output
because we sequenced the 2 ends of the DNA fragments:  R1 ->  <- R2
Q3. What are the thresholds used for filtering?
Hint: click on the tool again and underTrim Galore! advanced settingsdisplay the Full parameter list
Click here for output
Trim low-quality ends from reads in addition to adapter removal (Enter phred quality score threshold): 20
Overlap with adapter sequence required to trim a sequence: 1
Maximum allowed error rate: 0.1
Discard reads that became shorter than length N: 20
 
Now, let's compare statistics from before and after the filtering step:
- 
Run FastQC on the filtered dataset: - 
Search for FastQCin theTools panel
- 
Click on the tool 
- 
Set the following parameters as indicated: - 
Raw read data from your current history 
 - Select the icon for Multiple datasets
- Select your four filtered fastq files:
- Trim Galore! on data X and data X: trimmed reads pair 1
- Trim Galore! on data X and data X: trimmed reads pair 2
 
 
- Select the icon for 
- Click Run ToolClick here for output  
 
- 
Raw read data from your current history 
 
- 
- 
Run MultiQC - 
Search for MultiQCin theTools panel
- 
Click on the tool 
- 
Set the following parameters as indicated: - 
Which tool was used generate logs? --> FastQC
- 
FastQC output --> Insert FastQC output
- Type of FastQC output? --> Raw data
- Select the 4 jobs annotated as FastQC on data X: RawData
 
- 
Report title --> Filtered fastq
- 
Custom comment --> Filtered fastq
- Click Run ToolClick here for output  
 
- 
Which tool was used generate logs? --> 
 
- 
Q4. What differences do you see between the raw files and the filtered ones?
Click here for output
Just less data, not a real change in quality 
since we already had a good dataset and it is just a small portion
Here we will use BWA, an aligner based on the Burrows-Wheeler transform that has been optimized for
aligning short reads. BWA mem is one of the algorithms that can be used.
NOTE: Remember to use the DNA samples!
- 
Search for Map with BWA-MEMin theTools panel
- 
Click on the tool 
- 
Set the following parameters as indicated: - 
Will you select a reference genome from your history or use a built-in index? --> Use a built-in genome index
- 
Using reference genome --> Human Dec. 2013 (GRCh38/hg38)
- 
Single or Paired-end reads --> Paired
- 
Select first set of reads --> Trim Galore! on data X and data X: trimmed read pair 1
- 
Select second set of reads --> Trim Galore! on data X and data X: trimmed read pair 2
- Click Run ToolClick here for output  
 
- 
Will you select a reference genome from your history or use a built-in index? --> 
You will get 1 output file: Map with BWA-MEM on data X: (mapped reads in BAM format)
- 
Click on the eye icon
 You will see the alignment data, or at least part of it.- 
The first part starts with an @and it is part of the header
- 
The second part is the detailed information of how each read is mapped to the genome Click here for output  
 
- 
There are several short read aligners that can handle the non-contiguous transcript structure of RNAseq experiments. Here we will be using STAR.
NOTE: Remember to use the RNA samples!
- 
Search for RNA Starin theTools panel
- 
Click on the tool 
- 
Set the following parameters as indicated: - 
Single-end or paired-end reads --> Paired-end (as individual datasets)
- 
RNA-Seq FASTQ/FASTA file, forward reads --> Trim Galore! on data X on data X: trimmed reads pair 1
- 
RNA-Seq FASTQ/FASTA file, reverse reads --> Trim Galore! on data x on data X: trimmed reads pair 2
- 
Custom or built-in reference genome --> Use a built-in index
- 
Reference genome with or without an annotation --> use genome reference without builtin gene-model and no not provide a gtf
- 
Select reference genome --> Human Dec. 2013 (GRCh38/hg38) (hg38)
- Click Run ToolClick here for output  
 
- 
Single-end or paired-end reads --> 
You will get 3 output files:
- 
One labeled as log: RNA STAR on data X and data X: logThis file contains all the statistics of the mapping- Click on the eye iconto have a lookClick here for output  
 
- Click on the 
- 
The second one, labeled as splice.junctions.bed: RNA STAR on data X and data X: splice junctions.bedThis is a bed file (tab delimited file) with information about where the splice junctions where found according to the mapped information. Remember that a splice junction is an intron-exon boundarie in pre-RNA. - Click on the eye iconto have a lookClick here for output  
 
- Click on the 
- 
And the third file labeled as mapped.bam: RNA STAR on data X and data X: mapped.bam- Clik on the eye icon
 You will see the alignment data, or at least part of it.
 Click here for output  
 
- Clik on the 
Since it is difficult to visualize this data, let's create first some statistics with flagstat and idxstats.
- 
flagstat- 
Search for Samtools flagstatin theTools panel
- 
Click on the tool 
- 
Set the following parameters as indicated: - BAM File to report statistics of?
- Select the icon for Multiple datasets
- Select both alignments --> Map with BWA-MEM on data X: (mapped reads in BAM format)andRNA STAR on data X and data X: mapped.bam
 
- 
Click Run ToolClick here for output  
- 
You will get 2 output files labeled as: Samtools flagstat on data X
- 
Click on the eye iconand have a lookClick here for output  
 
- 
- 
idxstats- 
Search for Samtools idxstatsin theTools panel
- 
Click on the tool 
- 
Set the following parameters as indicated: - BAM file
- Select the icon for Multiple datasets
- Select both alignments: Map with BWA-MEM on data X: (mapped reads in BAM format)andRNA STAR on data X and data X: mapped.bam
 
- 
Click Run ToolClick here for output  
- 
You will get 2 output file labeled as: Samtools idxstats on data X
- 
Click on the eye iconto have a lookClick here for output  
 
 
- 
Let's use MultiQC to gather all these statistics:
- 
Search for MultiQCin theTools panel
- 
Click on the tool 
- 
Set the following parameters as indicated: - 
Which tool was used generate logs? --> Samtools
- 
Click on Insert Samtools output 
- 
Type of Samtools output? --> idxstats
- 
Samtools idxstats ouput --> Samtools idxstats on data X
- 
Click on Insert Samtools output 
- 
Type of Samtools Output? --> flagstat
- 
Samtools flagstat ouput --> Samtools flagstat on data X
- 
Click on Insert Results 
- 
Which tool was used generate logs? --> STAR
- 
Click on Insert STAR output 
- 
Type of STAR output? --> Log
- 
Samtools flagstat ouput --> RNA STAR on data X and data X:log
- 
Click Run ToolClick here for output  
 
 
- 
Inspect the MultiQC report briefly and answer the following:
Click here for output
 
Q5. Is the percentage of mapped sequences acceptable?
Click here for output
0.4M  RNA
0.8M  DNA
it is difficult to get the exact numbers in multiqc however, they are so similar that we can infer that most of them were mapped, so it is ok!
Q6. Looking at the XY counts, can you determine if your samples are from a female or a male? Have a look at the following graph for comparison:
 
Click here for output
This graph, is an example of how to visually distinguish (allegedly) between female and male. 
Since the X and Y chromosome share some sequence similarity, we will always have some reads mapping
to both chromosomes (unless you set the algorithms to unique hits). However males will have a larger
portion of mapping reads. In this example the first sample is a male and the second a female.
Taking this into account, our samples seem to be males
Q7. Which chromosomes were sequenced for each sample?
Click here for output
chr11 DNA
chr18 RNA
Developed by Marcela Dávila, 2017. Modified by Vanja Börjesson, 2021. Modified by Marcela Dávila, 2022. Updated by Marcela Dávila, 2023.