Course: HT24 Analysis of next generation sequencing data (SC00024)

The purpose of this exercise is to introduce you to common tools to assess the quality of sequencing data and filter it accordingly.

The Data

Our data comes from several different sequencing applications:

http://onlinelibrary.wiley.com/doi/10.1002/mgg3.115/full
https://www.encodeproject.org/: ENCFF121FBT, ENCFF931PQC, ENCFF836ILC, ENCFF263KEN, ENCFF831GJV, ENCFF017GWO
http://www.ebi.ac.uk/ena: ERR1523947, ERR1523948, ERR1523949
Other data courtesy of JBP and CJ

The server

Connect to the server using MobaXterm (PC users) or your local Terminal (Mac users), using the credentials provided:

    ssh -Y your_account@remote_server

Modules make it possible to keep several versions of the same program on one server. Load the following modules so that the programs we will use are available without typing their absolute paths. Use module load program_name/version for each of the following programs (you can list them all on one line to save typing):

 fastqc/0.12.1
 multiqc/1.14
 bowtie2/2.5.1
 fastqscreen/0.15.3
 fastx/0.0.14
 trimgalore/0.6.10
 prinseq/0.20.4

If you would like to know which modules are loaded, type module list
If you want to remove a module (maybe you loaded the wrong version), use module unload program_name/version
And if you want to know which programs are installed on the server, use module avail
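
The module commands above can be sketched as follows. This assumes a bash shell on the course server; the `modules` variable is only a convenience introduced here, not part of the module system.

```shell
# All tools for this exercise, with the versions listed above.
modules="fastqc/0.12.1 multiqc/1.14 bowtie2/2.5.1 fastqscreen/0.15.3 fastx/0.0.14 trimgalore/0.6.10 prinseq/0.20.4"

if command -v module >/dev/null 2>&1; then
  module load $modules          # unquoted on purpose: one argument per module
  module list                   # show what is loaded now
  module unload multiqc/1.14    # remove one module, e.g. a wrong version
  module load multiqc/1.14      # and load it again
  module avail                  # list everything installed on the server
else
  echo "module command not found: run this on the course server" >&2
fi
```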

Quality check

The first thing to do when you receive data is to check its quality and the composition of the sample. There are different tools for doing so; let's try a couple.

FastQC is a program that generates general statistics from high throughput data (and pipelines). It creates an HTML report.

Create a directory called Fastq
Create soft links in this directory to the samples you will be analyzing. The data is under /home/courses/NGS/QC/Fastq. A soft link is a special type of file that points to another file or directory; linking avoids keeping several copies of the same data and saves space. Keep in mind that some programs do not work with soft links:

ln -s file1 link1
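
As a minimal sketch of the linking step, the hypothetical helper `link_samples` below links every file from a source directory into a destination directory. Linking all files, rather than a hand-picked subset, is an assumption; adjust it to the samples you were assigned. Run it from your home directory.

```shell
# link_samples SRC DEST: create one soft link in DEST for each file in SRC.
# SRC should be an absolute path, otherwise the links will not resolve.
link_samples() {
  src=$1
  dest=$2
  mkdir -p "$dest"
  for f in "$src"/*; do
    [ -e "$f" ] || continue                 # nothing to link if SRC is empty or missing
    ln -sf "$f" "$dest/$(basename "$f")"
  done
}

link_samples /home/courses/NGS/QC/Fastq Fastq
ls -l Fastq    # links are listed as: name -> target
```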

Create a directory called FastQC
Run fastqc on your samples. (Remember that you can run any tool using -h to check how to run it):

fastqc -h
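
One possible invocation, run from your home directory, is sketched below. The pattern `Fastq/*.fastq*` assumes that your sample file names contain `.fastq`; check with `ls Fastq` and adjust it if they do not.

```shell
mkdir -p FastQC

if command -v fastqc >/dev/null 2>&1; then
  # -o sets the output directory; one HTML report and one zip file per input file.
  fastqc -o FastQC Fastq/*.fastq*
else
  echo "fastqc not found: load the fastqc module first" >&2
fi
```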

Inspect the resulting html files. You can use firefox file_name to open the files from the server. Alternatively you can copy them to your computer and inspect them locally

Q. Is your data of good quality? What kind of sequencing data do you have?

As you noticed, it is a little tedious to inspect every output file one by one. MultiQC is a tool that summarizes different types of analysis into a single report.

Run multiqc -h to know how to run the tool
Run it on your samples
Inspect the output file
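
A minimal sketch, assuming the FastQC results are in a directory called FastQC and that you run the command from your home directory. The output directory name MultiQC is an arbitrary choice made here.

```shell
report=MultiQC/multiqc_report.html   # default report name inside the output directory

if command -v multiqc >/dev/null 2>&1; then
  # Scan the FastQC directory and write the summary report into MultiQC/.
  multiqc FastQC -o MultiQC
  echo "Open the report with: firefox $report"
else
  echo "multiqc not found: load the multiqc module first" >&2
fi
```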

Q. Which sample has the highest number of reads, and which the lowest? Are there any samples with higher duplication levels? Do all the samples have an acceptable adapter content?

Library composition

With FastQ Screen you can check that your libraries contain the genomes they are supposed to contain, along with PhiX, vectors or other contaminants commonly seen in sequencing experiments.

To run fastq_screen we need to decide which aligner to use and which databases (or organisms) to screen against. For this exercise we will use bowtie2 as the aligner and Human, Mouse, PhiX and Leishmania as the genomes to scan our samples against. The Human and Mouse genomes are already on the server together with their corresponding bowtie2 indexes. We will practice setting up the PhiX and Leishmania databases with 2 different approaches.

Go to your home directory
Create a directory called db
Follow these steps within that directory:
- PhiX:
  1. iGenomes are a collection of reference sequences and annotation files for commonly analyzed organisms
  2. Go to the webpage and download the PhiX - Illumina - RTA build with wget webaddress
  3. Uncompress it with tar -xvzf file
  4. Check that the index files are at PhiX/Illumina/RTA/Sequence/Bowtie2Index/genome/*
    - Remember that the indexes are a collection of files with the extension .bt2
- Leishmania
  1. Ensembl is a huge source for genomic data (among others)
  2. Find your way to the Leishmania infantum webpage in Ensembl
  3. Download Leishmania_infantum_gca_900500625.LINF.dna.toplevel.fa.gz which contains all chromosomes for this protist. Use wget webaddress
  4. Uncompress using gunzip
  5. Build the bowtie2 index using bowtie2-build -h
    - Set the bt2_index_base identical to the reference_in; bowtie2 sometimes has trouble finding the index if you name it differently
  6. Check that *bt2 files are created
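
The Leishmania indexing steps might look like the sketch below, run inside your db directory once the FASTA file has been downloaded. The download address is deliberately left out; find it on the Ensembl page as described above.

```shell
fa=Leishmania_infantum_gca_900500625.LINF.dna.toplevel.fa

# Uncompress the download if it is still gzipped.
[ -f "$fa.gz" ] && gunzip "$fa.gz"

if [ -f "$fa" ] && command -v bowtie2-build >/dev/null 2>&1; then
  # Usage: bowtie2-build reference_in bt2_index_base
  # Both arguments are the same, so the index files are named after the FASTA file.
  bowtie2-build "$fa" "$fa"
  ls "$fa".*.bt2     # the index files should be listed here
else
  echo "FASTA file or bowtie2-build missing: download the genome and load bowtie2 first" >&2
fi
```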

Now that we have all the data needed, we need to tell fastq_screen where it can find the indexes.

Copy the fastq_screen.conf.example from /home/courses/NGS/QC
Remember to set the correct permission so you can modify the file
- Use chmod
Open the file with nedit
- you can also use vim if you are more comfortable with this text editor
Modify as follows:
- Locate the BOWTIE2 line and correct the path
  - You can find out the path of a program by using whereis program_name
  - Uncomment the line
- Set the THREADS 8 line to 1
  - Though using 8 threads makes things quicker, we may overload the server!
- Set the DATABASE Human to /home/db/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome
  - Uncomment the line
- Set the DATABASE Mouse to /home/db/Mus_musculus/GRCm38/Bowtie2Index/GRCm38
  - Uncomment the line
- Set the DATABASE PhiX to /your_home_directory/db/PhiX/Illumina/RTA/Sequence/Bowtie2Index/genome
  - Uncomment the line
- Add a new line for Leishmania so it looks like:
DATABASE<tab>Leishmania<tab>/your_home_directory/path_to/Leishmania_infantum_gca_900500625.LINF.dna.toplevel.fa
Go to your home directory
Create a directory called FastQScreen
Run fastq_screen on all your samples, selecting the flags you think are important
- Don't forget to use bowtie2 as aligner
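
A sketch of the final check and run, under a few assumptions: your edited configuration is saved as fastq_screen.conf in your home directory (a file name chosen here), the sample file names contain `.fastq`, and you run the commands from your home directory. Only the most basic flags are shown; see fastq_screen --help for the rest.

```shell
conf=fastq_screen.conf     # assumed name of your edited copy of the example file

# Show the active (uncommented) settings. After the edits above there should be
# one BOWTIE2 line, THREADS set to 1, and four DATABASE lines.
[ -f "$conf" ] && grep -E '^(BOWTIE2|THREADS|DATABASE)' "$conf"

mkdir -p FastQScreen

if [ -f "$conf" ] && command -v fastq_screen >/dev/null 2>&1; then
  fastq_screen --conf "$conf" --aligner bowtie2 --outdir FastQScreen Fastq/*.fastq*
else
  echo "configuration file or fastq_screen missing: check the steps above" >&2
fi
```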

Once you are done, have a look at the output files.

Q. Do you have any contamination in your samples? Does the composition of the samples make sense?

Quality filtering

Once we know the quality of our samples, we have an idea of what kind of filtering or trimming is needed. For this exercise, we will be using data from the AS_2 sample (exome data). To make things faster we will focus only on chr21.

Copy chr21_dna_R1.fastq.gz and chr21_dna_R2.fastq from /home/courses/NGS/QC/Fastq to your Fastq directory

There are several protocols for keeping only good-quality reads, and it is recommended to perform this step even when the data looks fine. Again, there are different tools for this: some are specific to certain types of data and others are quite general.

Let's run some of them trying to use the same options (whenever possible).

Don't forget to create FastQC plots, to evaluate if the filtering had an impact on the quality of the sample.

FastX

This is one of the first toolkits developed for analyzing and pre-processing FASTA/FASTQ files. We can generate graphs, remove adapters, trim based on quality, etc.

Go to your home directory
Create a directory called Filtering
Create another directory called Filter_fastx

To make the results comparable with those of the next tools, you have to run several FastX tools (you can pipe them into a single command line or run them one after the other)

Use: fastx_clipper, fastq_quality_filter and fastq_quality_trimmer
- You can use AGATCGGAAGAGC as an adapter sequence
- Save the results under Filter_fastx
Run fastqc with the resulting fastq files and save the results in the Filtering directory
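
One way to chain the three FastX tools is sketched below for the R1 file. The thresholds (quality 20, minimum length 20, 80 percent of bases) and the output file name are illustrative assumptions, not values prescribed by the exercise, and Filter_fastx is assumed to sit next to Filtering in your home directory. The -Q33 flag tells the tools that qualities use the Phred+33 encoding.

```shell
mkdir -p Filtering Filter_fastx

reads=Fastq/chr21_dna_R1.fastq.gz
out=Filter_fastx/chr21_dna_R1.fastx.fastq    # hypothetical output name

if [ -f "$reads" ] && command -v fastx_clipper >/dev/null 2>&1; then
  # FastX tools read uncompressed FASTQ from standard input, so decompress first.
  # 1) clip the adapter, 2) trim low-quality ends, 3) drop low-quality reads.
  gunzip -c "$reads" \
    | fastx_clipper -Q33 -a AGATCGGAAGAGC -l 20 \
    | fastq_quality_trimmer -Q33 -t 20 -l 20 \
    | fastq_quality_filter -Q33 -q 20 -p 80 -o "$out"
  fastqc -o Filtering "$out"
else
  echo "input file or FastX tools missing: check the steps above" >&2
fi
```

Repeat the same commands for the R2 file. FastX processes each file independently, so after filtering the two files may no longer contain the same set of reads.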

Trim Galore!

This is a wrapper for quality and adapter trimming, and makes use of cutadapt and fastqc on the fly.

Go to your home directory
Create a directory called Filter_galore
Run trim_galore on all your samples
- Save the results under Filter_galore
Run fastqc with the resulting fastq files and save the results in the Filtering directory
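
A possible paired-end run on the chr21 files is sketched below, from your home directory. The thresholds are assumptions chosen to match a quality cutoff of 20 and a minimum length of 20; the input file names are the ones given earlier in this exercise.

```shell
mkdir -p Filtering Filter_galore

r1=Fastq/chr21_dna_R1.fastq.gz
r2=Fastq/chr21_dna_R2.fastq      # name as given in the exercise text

if [ -f "$r1" ] && [ -f "$r2" ] && command -v trim_galore >/dev/null 2>&1; then
  # --paired keeps both mates in sync; the adapter is auto-detected by default.
  trim_galore --paired --quality 20 --length 20 -o Filter_galore "$r1" "$r2"
  # Paired output files end in _val_1.fq and _val_2.fq (plus .gz for gzipped input).
  fastqc -o Filtering Filter_galore/*_val_[12].fq*
else
  echo "input files or trim_galore missing: check the steps above" >&2
fi
```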

PRINSEQ

This is a tool that generates summary statistics of sequence and quality data and is used to filter, reformat and trim next-generation sequence data.

Go to your home directory
Create a directory called Filter_prinseq
Run prinseq-lite on all your samples
- Save the results under Filter_prinseq
Run fastqc with the resulting fastq files and save the results in the Filtering directory
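
A sketch of a comparable PRINSEQ run on the chr21 files follows. Several things are assumptions: the thresholds, the output prefix, and the program name, which may be prinseq-lite or prinseq-lite.pl depending on the installation. The R1 file is decompressed to a temporary copy before being passed to the tool.

```shell
mkdir -p Filtering Filter_prinseq

# Use whichever program name is available on this server.
prinseq=$(command -v prinseq-lite || command -v prinseq-lite.pl || true)

r1=Fastq/chr21_dna_R1.fastq.gz
r2=Fastq/chr21_dna_R2.fastq      # name as given in the exercise text

if [ -n "$prinseq" ] && [ -f "$r1" ] && [ -f "$r2" ]; then
  gunzip -c "$r1" > Filter_prinseq/R1.tmp.fastq
  # Trim low-quality 3' ends, then drop reads that are short or low in mean quality.
  "$prinseq" -fastq Filter_prinseq/R1.tmp.fastq -fastq2 "$r2" \
      -trim_qual_right 20 -min_qual_mean 20 -min_len 20 \
      -out_good Filter_prinseq/chr21_dna_prinseq -out_bad null
  rm Filter_prinseq/R1.tmp.fastq
  # Pairs that passed are written to <prefix>_1.fastq and <prefix>_2.fastq.
  fastqc -o Filtering Filter_prinseq/chr21_dna_prinseq_[12].fastq
else
  echo "input files or prinseq-lite missing: check the steps above" >&2
fi
```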

Now compare the results by running multiqc on the files saved under the Filtering directory

Q. Was there any improvement in the quality of the reads? Which tool do you think performed better?

Home: Analysis of next generation sequencing data (SC00024)

Created by Marcela Dávila, 2017. Updated by Marcela Dávila, 2023. Updated by Marcela Dávila, 2024

NGS I: QC - bcfgothenburg/HT24 GitHub Wiki
