Unix VI: NGS data quality - BDC-training/VT25 GitHub Wiki

Course: VT25 Unix applied to genomic data (SC00036)

In this exercise, we are going to inspect some NGS raw data to determine if the data is of good quality

Example of FastQC output

Copy Fly_R1.fastq.gz and Fly_R2.fastq.gz from /home/courses/Unix/files/

Have a look at the files.

As a quick reminder, fastq files look like this (a more thorough description here).

    @ERR030882.1 HWI-BRUNOP16X_0001:3:1:3055:1085#0/1
    NNTCAAGTTTCCGTTTTAAACACCTCTTCCTGTGTGGCTCGCTTGTTTTC
    +
    ##(('(((''1,-.-33348FFFFFFFFFF,8/1/77777FFFFFFFF<<

Q1. How many sequences are there in Read1 and Read2?

Let's look at the quality.

Remember that the fourth row stores the quality of the sequence read. # indicates a low nucleotide quality (=2).

Q2. How many sequences are there where all the nucleotides in a read are #? In other words, how many sequences we will be discarding because we can not trust that sequence due to its really low quality?

If we say that sequences with more than 20 consecutive #'s are bad quality reads,

Q3. Which is the percentage of bad quality reads, for Read1 and Read2?

To inspect the overall quality of the sequences, we will use FastQC, a quality control application for high throughput sequence data. To install it:

Go to the webpage
Click on Download Now
Download the Win/Linux zip file, using wget
Uncompress it using unzip
Go into the newly FastQC folder

To run the program we need to have java installed in the server, which is already! So load the module:

module load java/jdk17

Then follow the instructions in the INSTALL.txt file. Read under the Running FastQC Interactively section for Linux. This will open the interactive version of the program (i.e. a window should pop with FastQC).

Click to see window

Now that we know it runs, close the window. We will be using the non-interactive version.

First let's create an alias for this tool in our ~/.bashrc, so you can use it wherever you are.

Open your ~/.bashrc with a text editor
Type: alias run_fastqc='/home/path_to_the_fastqc_folder/fastqc'
Save the file and close it
Source the file: source ~/.bashrc

To know how to use the tool run:

run_fastqc -h

For this exercise, run fastqc as indicated in the SYNOPSIS part of the help you just displayed. seqfile1 is Fly_R1.fastq.gz and seqfile2 is Fly_R2.fastq.gz.

run_fastqc seqfile1 seqfile2 .. seqfileN

Once you have run the program, open the html reports and have a quick look:

firefox name_of_the_file.html &

There are different programs to filter our data based on quality. One of them is TrimGalore!. Let's install it:

Go the webpage
Click on Download Now
Click on Trim_Galore (hoster at Github)
Download the Source code (tar.gz) with wget
Untar the tarball
Go into the uncompressed TrimGalore-X.X.X folder
Display the content of the folder
- You will see that thre is an executable file names trim_galore
Run the program using --help
- This is also an easy program to install. However cutadapt, a program that removes adapter sequences and fastqc. If you try to run trim_galore, you will get a warning. For our case cutadapt is already installed so we only need to load the module:

module load cutadapt/4.4

Since the tool is already an executable program, we just need to add it to our .bashrc so we can access it from anywhere.

Open your ~/.bashrc with a text editor and add the following line (don´t forget to add the correct path):

export PATH=/home/path_to_TrimGalore:$PATH

Save the document and close it
Run this command to execute the export command you just added:

source ~/.bash

For this exercise we will run the tool in the background since it takes around 16 mins to complete, in the meantime you can either take a break or continue with the next tool. Run trim_galore using both fastq files with the following arguments (add the & sign at the end of your command to run it in the background):

  * quality 20 
  * length  30 
  * paired 
  * fastqc  True

Q4. How many sequences were removed? Hint: look at the fastq_trimming_report.txt files

As a comparison, let's filter the sequences with fastq_quality_filter, a program from the FASTX-Toolkit, which is already installed in the server.

Load the module:

 module load fastx/

Type fastq_quality_filter -h to known how to run the tool. Use the following settings:

  * Minimum quality score to keep: 20
  * Minimum percent of bases that must have 30 quality: 80
  * -Q 33
  * output file
  * Verbose (to get a report)

Q5. How many sequences were removed?

Run run_fastqc on the filtered files.

Once you have all the files (raw, filtered and the corresponding reports from FastQC)

Q6. Which program gave better quality sequences? Where you able to improve the quality of the raw sequences? Which tool would you use?

Home: Unix applied to genomic data

Developed by Marcela Dávila, 2018. Updated by Marcela Dávila, 2018.

Unix VI: NGS data quality - BDC-training/VT25 GitHub Wiki

Course: VT25 Unix applied to genomic data (SC00036)

Home: Unix applied to genomic data

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️