Unix VII: Comparing two quantification methods - BDC-training/VT25 GitHub Wiki
Here we will compare two ways of generating expression gene levels from RNAseq data. For this exercise you need to know that the sequencing was paired-end and directional. The organism is Drosophila melanogaster, the fruit fly.
First, we need to map our reads (Fly_R1.fastq.gz
and Fly_R2.fastq.gz
) to the fly genome. No need to copy them since we won't be actually running the following command line.
Bowtie2 is one of many mappers that can be used to align sequencing reads to long reference sequences. The tools is already in the cluster, so load the module to use it:
module load bowtie2/2.5.1
If you type bowtie2
you will get a pretty big list of different parameters!.
Q1. What would be a minimum command line to run bowtie2 using
Fly_R1.fastq.gz
andFly_R2.fastq.gz
as input read files? You can useBowtie2Index/genome
as the Index filename prefix. Add the command line you used
Instead, copy the alignment file Fly.sam.gz
from /home/courses/Unix/files/
. If you don't have it already.
The SAM
format harbors information on where and how each read is mapped in relation to the genome. Here is more complete explanation of the SAM format. Have look at it!
Q2. How many reads are mapped to the chromosome
3R
?
The next step is to compress this data type into a BAM file, the binary format of SAM. For that we will be using samtools, a set of tools to manipulate alignments.
- Load the tool module:
module load samtools/1.17
-
By typing
samtools
you will get a list of all the subcommands that can be used. For instanceview
will help us to convert files to different formats (SAM<->BAM<->CRAM). -
Run the following command:
time samtools view -b --threads 8 Fly.sam | samtools sort -o Fly.sort.bam - --threads 8
Q3. What does this cmd line mean?
To use other tools on our alignment file, we need to index it. Run samtools index
on the bam
file.
Click for code
samtools index Fly.sort.bam
Now let's calculate some statistics. Run idxstats
, a tool that reports alignment summary statistics:
Click for code
samtools idxstats Fly.sort.bam
The column names are: reference sequence name
, sequence length
, # mapped read-segments
and # unmapped read-segments
.
Q4. Are the number of mapped reads in
3R
the same as you got in Q2?
Now we need to count how many reads lie within each gene, so we can quantify their expression level. There are of course, several ways/algorithms that count the number of reads mapped to a specific gene. In this exercise we compare two different counting approaches: htseq and featureCounts. How these algorithms work is out of the scope of this course, but ask if you want to know more!.
Let's install featureCounts
that is part of the Subread package, a high-performance read alignment, quantification and mutation discovery tool.
- Go to the webpage
- Under
Download and installation
clickLatest version
- Download
subread-2.0.X-source.tar.gz
usingwget
- Untar the file
- Go into the newly created
subread
directory - Follow the installation steps for
Linux OS
- Read the first 2 paragraphs
Have a look under the bin
directory, you should see featureCounts
as an executable program along with other programs.
- To run
featureCounts
via the command line without having to type the absolute path, open your~/.bashrc
with a text editor and add the following line (don´t forget to add the correct path):
export PATH=/home/path_to_featureCounts_installation/bin:$PATH
- Save the document and close it
- Run this command to execute the export command you just added:
source ~/.bashrc
Typing featureCounts
will give you information on how to run the tool. For this exercise copy the genes.gtf
file from /home/courses/Unix/files/
, this file contains the list of genes in gtf
format. Then use the Fly.sort.bam
alignment file as input together with the following arguments:
-a Name of an annotation file: genes.gtf
-o Name of output file, use Fly.fc
-p Specifying the alignment file contains paired-end reads
--countReadPairs To count fragments, instead of reads
Have a look at the output files. Visit the subread
webpage to understand the ouput.
The quantification with htseq
takes some time!, so we calculated it for you. Copy the Fly.htseq
file and explore it.
Would the different algorithms give similar results? One way to answer this is by comparing the top 50 most expressed genes. To do this we need to have them in the same format.
Reformat Fly.fc
so you have it in the same format as Fly.htseq
and select the top 50 most expressed genes for each approach.
Q5. How many highly covered genes are shared by the two methods? Hint: use
comm
Developed by Marcela Dávila, 2018. Modified by Marcela Dávila, 2021. Updated by Marcela Dávila, 2024.