NGS I: QC - bcfgothenburg/HT24 GitHub Wiki
Course: HT24 Analysis of next generation sequencing data (SC00204)
The purpose of this exercise is to introduce you to common tools to assess the quality of sequencing data and filter it accordingly.
Our data involves sequencing data from different sequencing applications:
- http://onlinelibrary.wiley.com/doi/10.1002/mgg3.115/full
- https://www.encodeproject.org/: ENCFF121FBT, ENCFF931PQC, ENCFF836ILC, ENCFF263KEN, ENCFF831GJV, ENCFF017GWO
- http://www.ebi.ac.uk/ena: ERR1523947, ERR1523948, ERR1523949
- Other data courtesy of JBP and CJ
Connect to the server using MobaXterm (PC users) or your local Terminal (Mac users), using the credentials provided:
ssh -Y your_account@remote_server
Modules are a great way to keep different versions of the same program available. Load the following modules so we can access the programs we will be using without needing to write the absolute path. Use module load program_name/version for the following programs (you can list all of them on the same line, for less typing):
fastqc/0.12.1
multiqc/1.14
bowtie2/2.5.1
fastqscreen/0.15.3
fastx/0.0.14
trimgalore/0.6.10
prinseq/0.20.4
- If you would like to know which modules are loaded, type
module list
- If you want to remove a module (maybe you are using the incorrect version) use
module unload program_name/version
- And if you want to know which programs are installed on the server, use
module avail
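As a minimal sketch of the step above, the seven modules can be loaded in one command. The module names and versions are copied from the list; the check for the module command is only there so the snippet does nothing harmful on a machine without a module system.

```shell
# Module names and versions are taken from the list above.
modules="fastqc/0.12.1 multiqc/1.14 bowtie2/2.5.1 fastqscreen/0.15.3 fastx/0.0.14 trimgalore/0.6.10 prinseq/0.20.4"

if command -v module >/dev/null 2>&1; then
    # $modules is intentionally unquoted so each name becomes its own argument
    module load $modules
    module list
else
    echo "module command not available here; would run: module load $modules"
fi
```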
The first thing to do when you receive data is to check its quality and the composition of the sample. There are different tools for doing so, let's try a couple.
FastQC is a program that generates general statistics from high throughput data (and pipelines). It creates an HTML report.
- Create a directory called Fastq
- Create a soft link to the samples you will be analyzing in this directory. The data is under /home/courses/NGS/QC/Fastq. Soft links are a special type of file that serves as a reference to another file or directory; this avoids having several copies of the same data, saving space. Just keep in mind that some programs do not work with soft links:
  ln -s file1 link1
- Create a directory called FastQC
- Run fastqc on your samples. (Remember that you can run any tool using -h to check how to run it):
  fastqc -h
- Inspect the resulting html files. You can use firefox file_name to open the files from the server. Alternatively, you can copy them to your computer and inspect them locally
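The steps above could be sketched as follows. This is one possible way to do it, not the only one. It assumes you start in your home directory and that the sample files end in .fastq or .fastq.gz (the file pattern is an assumption; adjust it to the actual file names). The -o option of fastqc sets the output directory.

```shell
# Assumes the current directory is your home directory on the course server.
mkdir -p Fastq FastQC

# Link every file from the shared course folder instead of copying it.
for f in /home/courses/NGS/QC/Fastq/*; do
    [ -e "$f" ] || continue        # skip if the folder is missing or empty
    ln -sf "$f" Fastq/
done

# Run FastQC on the linked files, writing the reports to FastQC/.
if command -v fastqc >/dev/null 2>&1 && ls Fastq/*.fastq* >/dev/null 2>&1; then
    fastqc -o FastQC Fastq/*.fastq*
else
    echo "fastqc or input files not found; nothing was run"
fi
```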
Q. Is your data of good quality? What kind of sequencing data do you have?
As you noticed, it is a little tedious to inspect every output file one by one. MultiQC is a tool that summarizes different types of analysis into a single report.
- Run multiqc -h to know how to run the tool
- Run it on your samples
- Inspect the output file
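A minimal sketch of the MultiQC run, assuming the FastQC reports were written to the FastQC directory. The output directory name MultiQC is a hypothetical choice made for this example; -o sets where the report is written.

```shell
# MultiQC is a hypothetical name for the report directory.
mkdir -p MultiQC

if command -v multiqc >/dev/null 2>&1 && [ -d FastQC ]; then
    # Scan FastQC/ for reports and write one summary report into MultiQC/
    multiqc FastQC -o MultiQC
else
    echo "multiqc or the FastQC directory not found; would run: multiqc FastQC -o MultiQC"
fi
```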
Q. Which sample has the highest number of reads? And the lowest? Are there any samples with higher duplication levels? Do all the samples have an acceptable adapter content?
With FastQ Screen you can check that your libraries contain the genomes they are supposed to have, along with PhiX, vectors or other contaminants commonly seen in sequencing experiments.
To run fastq_screen we need to decide which aligner will be used as well as which databases (or organisms) we want to screen against. For this exercise we will be using bowtie2 as the aligner and Human, Mouse, PhiX and Leishmania as the genomes to scan our samples against. The Human and Mouse genomes are already on the server together with their corresponding bowtie2 indexes. We will practice setting up the PhiX and the Leishmania databases with 2 different approaches.
- Go to your home directory
- Create a directory called db
- Follow these steps within that directory:
  - PhiX:
    - iGenomes are a collection of reference sequences and annotation files for commonly analyzed organisms
    - Go to the webpage and download the PhiX - Illumina - RTA build with wget webaddress
    - Uncompress it with tar -xvzf file
    - Check that the index files are at PhiX/Illumina/RTA/Sequence/Bowtie2Index/genome/*
      - Remember that the indexes are a collection of files with the extension .bt2
  - Leishmania:
    - Ensembl is a huge source of genomic data (among others)
    - Find your way to the Leishmania infantum webpage in Ensembl
    - Download Leishmania_infantum_gca_900500625.LINF.dna.toplevel.fa.gz, which contains all chromosomes for this protist. Use wget webaddress
    - Uncompress using gunzip
    - Build the bowtie2 index using bowtie2-build (run bowtie2-build -h to see how)
      - Set the bt2_index_base identical to the reference_in; sometimes bowtie2 has problems finding the index if you name it differently
    - Check that the *bt2 files are created
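As an illustration of the Leishmania part, the index build could look like the sketch below. It assumes the downloaded file was saved inside the db directory and that you run it from your home directory. The download itself is left out because the web address has to be looked up in Ensembl. bowtie2-build takes the reference FASTA first and the index base name second; here both are the same, as suggested above.

```shell
mkdir -p db
fasta=db/Leishmania_infantum_gca_900500625.LINF.dna.toplevel.fa

# Uncompress the download if it is still gzipped.
if [ -f "$fasta.gz" ]; then
    gunzip "$fasta.gz"
fi

if command -v bowtie2-build >/dev/null 2>&1 && [ -f "$fasta" ]; then
    # usage: bowtie2-build <reference_in> <bt2_index_base>
    bowtie2-build "$fasta" "$fasta"
    # The index is a set of files ending in .bt2
    ls "$fasta"*.bt2
else
    echo "bowtie2-build or $fasta not found; nothing was built"
fi
```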
Now that we have all the data needed, we need to tell fastq_screen where it can find the indexes.
- Copy the fastq_screen.conf.example from /home/courses/NGS/QC
- Remember to set the correct permissions so you can modify the file
  - Use chmod
- Open the file with nedit
  - You can also use vim if you are more comfortable with this text editor
- Modify as follows:
  - Locate the BOWTIE2 line and correct the path
    - You can find out the path of a program by using whereis program_name
    - Uncomment the line
  - Set the THREADS 8 line to 1
    - Though using 8 threads makes things quicker, we may overload the server!
  - Set the DATABASE Human to /home/db/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome
    - Uncomment the line
  - Set the DATABASE Mouse to /home/db/Mus_musculus/GRCm38/Bowtie2Index/GRCm38
    - Uncomment the line
  - Set the DATABASE PhiX to /your_home_directory/db/PhiX/Illumina/RTA/Sequence/Bowtie2Index/genome
    - Uncomment the line
  - Add a new line for Leishmania so it looks like:
    DATABASE<tab>Leishmania<tab>/your_home_directory/path_to/Leishmania_infantum_gca_900500625.LINF.dna.toplevel.fa
- Go to your home directory
- Create a directory called FastQScreen
- Run fastq_screen on all your samples, selecting the flags you think are important
  - Don't forget to use bowtie2 as aligner
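The configuration and the run could be sketched as below. Several things here are assumptions: the local file name fastq_screen.conf, the use of $HOME in place of /your_home_directory, the Leishmania index sitting directly under db, and the .fastq* file pattern. The snippet only prints what the four DATABASE lines might look like; editing the BOWTIE2 and THREADS lines is still done by hand in the editor. The options --aligner, --conf and --outdir are listed in the fastq_screen help.

```shell
mkdir -p FastQScreen
conf=fastq_screen.conf          # hypothetical name for your local copy
home_dir="${HOME:-/your_home_directory}"

# Copy the example configuration and make it writable (if it is reachable).
if [ -f /home/courses/NGS/QC/fastq_screen.conf.example ]; then
    cp /home/courses/NGS/QC/fastq_screen.conf.example "$conf"
    chmod u+w "$conf"
fi

# What the DATABASE lines could look like after editing (tab-separated).
db_lines=$(
    printf 'DATABASE\tHuman\t%s\n' /home/db/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome
    printf 'DATABASE\tMouse\t%s\n' /home/db/Mus_musculus/GRCm38/Bowtie2Index/GRCm38
    printf 'DATABASE\tPhiX\t%s\n' "$home_dir/db/PhiX/Illumina/RTA/Sequence/Bowtie2Index/genome"
    printf 'DATABASE\tLeishmania\t%s\n' "$home_dir/db/Leishmania_infantum_gca_900500625.LINF.dna.toplevel.fa"
)
printf '%s\n' "$db_lines"

# Screen all samples with bowtie2 once the configuration has been edited.
if command -v fastq_screen >/dev/null 2>&1 && [ -f "$conf" ] \
        && ls Fastq/*.fastq* >/dev/null 2>&1; then
    fastq_screen --aligner bowtie2 --conf "$conf" --outdir FastQScreen Fastq/*.fastq*
else
    echo "fastq_screen, $conf or input files not found; nothing was run"
fi
```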
Once you are done, have a look at the output files.
Q. Do you have any contamination in your samples? Does the composition of the samples make sense?
Once we know the quality of our samples, we have an idea of what kind of filtering or trimming we need to do.
For this exercise, we will be using data from the AS_2 sample (exome data). To make things faster we will focus only on chr21.
- Copy chr21_dna_R1.fastq.gz and chr21_dna_R2.fastq from /home/courses/NGS/QC/Fastq to your Fastq directory
There are several protocols for keeping only good-quality reads, and it is recommended to perform this step even when the data looks fine. Again, there are different tools to do this; some are specific to certain types of data and others are quite general.
Let's run some of them, trying to use the same options whenever possible.
Don't forget to create FastQC plots to evaluate whether the filtering had an impact on the quality of the sample.
The FASTX-Toolkit is one of the first toolkits developed for analyzing and pre-processing FASTA/FASTQ files. With it we can generate graphs, remove adapters, trim based on quality, etc.
- Go to your home directory
- Create a directory called Filtering
- Create another directory called Filter_fastx
To make our results comparable to those of the next tools, you have to run several tools (you can pipe them into one single command line or run them independently one after the other)
- Use: fastx_clipper, fastq_quality_filter and fastq_quality_trimmer
  - You can use AGATCGGAAGAGC as an adapter sequence
  - Save the results under Filter_fastx
- Run fastqc with the resulting fastq files and save the results in the Filtering directory
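One way to chain the three tools is sketched below for the R1 file only. The thresholds (quality 20, 80 percent of bases, minimum length 30) and the output file name are assumptions chosen for illustration, not course-prescribed values. -Q33 tells the toolkit that the qualities are Phred+33 encoded, which is assumed here. The same pipe would be repeated for R2; note that filtering each file on its own can leave the two files with different read sets.

```shell
mkdir -p Filtering Filter_fastx
reads=Fastq/chr21_dna_R1.fastq.gz
out=Filter_fastx/chr21_dna_R1.fastx.fastq     # hypothetical output name

if command -v fastx_clipper >/dev/null 2>&1 && [ -f "$reads" ]; then
    # clip the adapter, drop low-quality reads, then trim low-quality ends
    gunzip -c "$reads" \
        | fastx_clipper -Q33 -a AGATCGGAAGAGC -l 30 \
        | fastq_quality_filter -Q33 -q 20 -p 80 \
        | fastq_quality_trimmer -Q33 -t 20 -l 30 -o "$out"
    # quality report for the filtered reads
    fastqc -o Filtering "$out"
else
    echo "fastx tools or $reads not found; nothing was run"
fi
```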
Trim Galore is a wrapper for quality and adapter trimming that makes use of cutadapt and fastqc on the fly.
- Go to your home directory
- Create a directory called Filter_galore
- Run trim_galore on all your samples
  - Save the results under Filter_galore
- Run fastqc with the resulting fastq files and save the results in the Filtering directory
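A possible trim_galore call for the chr21 pair is sketched below. The quality and length thresholds are assumptions picked to mirror the fastx example values. The output pattern *.fq* is also an assumption about how trim_galore names its trimmed files.

```shell
mkdir -p Filtering Filter_galore
r1=Fastq/chr21_dna_R1.fastq.gz
r2=Fastq/chr21_dna_R2.fastq

if command -v trim_galore >/dev/null 2>&1 && [ -f "$r1" ] && [ -f "$r2" ]; then
    # --paired keeps both mates in sync; a pair is dropped if either read gets too short
    trim_galore --paired --quality 20 --length 30 --output_dir Filter_galore "$r1" "$r2"
    # quality reports for the trimmed reads
    fastqc -o Filtering Filter_galore/*.fq*
else
    echo "trim_galore or input files not found; nothing was run"
fi
```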
PRINSEQ is a tool that generates summary statistics of sequence and quality data and is used to filter, reformat and trim next-generation sequence data.
- Go to your home directory
- Create a directory called Filter_prinseq
- Run prinseq-lite on all your samples
  - Save the results under Filter_prinseq
- Run fastqc with the resulting fastq files and save the results in the Filtering directory
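A sketch of a comparable prinseq-lite run follows. The thresholds and the chr21_good output prefix are assumptions made for this example. Depending on the installation the script may be called prinseq-lite or prinseq-lite.pl, so both are tried. As far as I know this version expects uncompressed FASTQ, so R1 is unpacked first.

```shell
mkdir -p Filtering Filter_prinseq
# The script may be installed under either name.
prinseq=$(command -v prinseq-lite || command -v prinseq-lite.pl || true)
r1=Fastq/chr21_dna_R1.fastq
r2=Fastq/chr21_dna_R2.fastq

# Unpack R1 into the working directory if only the gzipped file exists.
if [ ! -f "$r1" ] && [ -f "$r1.gz" ]; then
    gunzip -c "$r1.gz" > Filter_prinseq/chr21_dna_R1.fastq
    r1=Filter_prinseq/chr21_dna_R1.fastq
fi

if [ -n "$prinseq" ] && [ -f "$r1" ] && [ -f "$r2" ]; then
    # filter on mean quality, trim low-quality 3' ends, drop short reads
    "$prinseq" -fastq "$r1" -fastq2 "$r2" \
        -min_qual_mean 20 -trim_qual_right 20 -min_len 30 \
        -out_good Filter_prinseq/chr21_good -out_bad null
    # quality reports for the reads that passed
    fastqc -o Filtering Filter_prinseq/chr21_good_*.fastq
else
    echo "prinseq-lite or input files not found; nothing was run"
fi
```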
Now compare the results by running multiqc on the files saved under the Filtering directory
Q. Was there any improvement in the quality of the reads? Which tool do you think performed better?
Created by Marcela Dávila, 2017. Updated by Marcela Dávila, 2023. Updated by Marcela Dávila, 2024