Unix VI: NGS data quality - BDC-training/VT25 GitHub Wiki
In this exercise, we are going to inspect some NGS raw data to determine if the data is of good quality
Copy Fly_R1.fastq.gz
and Fly_R2.fastq.gz
from /home/courses/Unix/files/
Have a look at the files.
As a quick reminder, fastq files look like this (a more thorough description here).
@ERR030882.1 HWI-BRUNOP16X_0001:3:1:3055:1085#0/1
NNTCAAGTTTCCGTTTTAAACACCTCTTCCTGTGTGGCTCGCTTGTTTTC
+
##(('(((''1,-.-33348FFFFFFFFFF,8/1/77777FFFFFFFF<<
Q1. How many sequences are there in Read1 and Read2?
Let's look at the quality.
Remember that the fourth row stores the quality of the sequence read.
#
indicates a low nucleotide quality (=2).
Q2. How many sequences are there where all the nucleotides in a read are
#
? In other words, how many sequences we will be discarding because we can not trust that sequence due to its really low quality?
If we say that sequences with more than 20 consecutive #
's are bad quality reads,
Q3. Which is the percentage of bad quality reads, for Read1 and Read2?
To inspect the overall quality of the sequences, we will use FastQC, a quality control application for high throughput sequence data. To install it:
- Go to the webpage
- Click on Download Now
- Download the Win/Linux zip file, using
wget
- Uncompress it using
unzip
- Go into the newly
FastQC
folder
To run the program we need to have java installed in the server, which is already! So load the module:
module load java/jdk17
Then follow the instructions in the INSTALL.txt
file. Read under the Running FastQC Interactively
section for Linux
. This will open the interactive version of the program (i.e. a window should pop with FastQC).
Click to see window

Now that we know it runs, close the window. We will be using the non-interactive version.
First let's create an alias
for this tool in our ~/.bashrc
, so you can use it wherever you are.
- Open your
~/.bashrc
with a text editor - Type:
alias run_fastqc='/home/path_to_the_fastqc_folder/fastqc'
- Save the file and close it
- Source the file:
source ~/.bashrc
To know how to use the tool run:
run_fastqc -h
For this exercise, run fastqc
as indicated in the SYNOPSIS part of the help you just displayed. seqfile1 is Fly_R1.fastq.gz
and seqfile2 is Fly_R2.fastq.gz
.
run_fastqc seqfile1 seqfile2 .. seqfileN
Once you have run the program, open the html
reports and have a quick look:
firefox name_of_the_file.html &
There are different programs to filter our data based on quality. One of them is TrimGalore!
. Let's install it:
- Go the webpage
- Click on Download Now
- Click on Trim_Galore (hoster at Github)
- Download the
Source code (tar.gz)
withwget
- Untar the tarball
- Go into the uncompressed
TrimGalore-X.X.X
folder - Display the content of the folder
- You will see that thre is an executable file names
trim_galore
- You will see that thre is an executable file names
- Run the program using
--help
- This is also an easy program to install. However
cutadapt
, a program that removes adapter sequences andfastqc
. If you try to runtrim_galore
, you will get a warning. For our casecutadapt
is already installed so we only need to load the module:
- This is also an easy program to install. However
module load cutadapt/4.4
Since the tool is already an executable program, we just need to add it to our .bashrc
so we can access it from anywhere.
- Open your
~/.bashrc
with a text editor and add the following line (don´t forget to add the correct path):
export PATH=/home/path_to_TrimGalore:$PATH
- Save the document and close it
- Run this command to execute the export command you just added:
source ~/.bash
For this exercise we will run the tool in the background since it takes around 16 mins to complete, in the meantime you can either take a break or continue with the next tool. Run trim_galore
using both fastq
files with the following arguments (add the &
sign at the end of your command to run it in the background):
* quality 20
* length 30
* paired
* fastqc True
Q4. How many sequences were removed? Hint: look at the
fastq_trimming_report.txt files
As a comparison, let's filter the sequences with fastq_quality_filter
, a program from the
FASTX-Toolkit, which is already installed in the server.
- Load the module:
module load fastx/
Type fastq_quality_filter -h
to known how to run the tool. Use the following settings:
* Minimum quality score to keep: 20
* Minimum percent of bases that must have 30 quality: 80
* -Q 33
* output file
* Verbose (to get a report)
Q5. How many sequences were removed?
Run run_fastqc
on the filtered files.
Once you have all the files (raw, filtered and the corresponding reports from FastQC)
Q6. Which program gave better quality sequences? Where you able to improve the quality of the raw sequences? Which tool would you use?
Developed by Marcela Dávila, 2018. Updated by Marcela Dávila, 2018.