Unix VII: Comparing two quantification methods - BDC-training/VT25 GitHub Wiki

Course: VT25 Unix applied to genomic data (SC00036)


Here we will compare two ways of generating expression gene levels from RNAseq data. For this exercise you need to know that the sequencing was paired-end and directional. The organism is Drosophila melanogaster, the fruit fly.


Alignment


First, we need to map our reads (Fly_R1.fastq.gz and Fly_R2.fastq.gz) to the fly genome. No need to copy them since we won't be actually running the following command line.

Bowtie2 is one of many mappers that can be used to align sequencing reads to long reference sequences. The tools is already in the cluster, so load the module to use it:

module load bowtie2/2.5.1

If you type bowtie2 you will get a pretty big list of different parameters!.

Q1. What would be a minimum command line to run bowtie2 using Fly_R1.fastq.gz and Fly_R2.fastq.gz as input read files? You can use Bowtie2Index/genome as the Index filename prefix. Add the command line you used

Please don't run! Since it can take a long time!

Instead, copy the alignment file Fly.sam.gz from /home/courses/Unix/files/. If you don't have it already.

The SAM format harbors information on where and how each read is mapped in relation to the genome. Here is more complete explanation of the SAM format. Have look at it!

Q2. How many reads are mapped to the chromosome 3R?

The next step is to compress this data type into a BAM file, the binary format of SAM. For that we will be using samtools, a set of tools to manipulate alignments.

  • Load the tool module:
module load samtools/1.17
  • By typing samtools you will get a list of all the subcommands that can be used. For instance view will help us to convert files to different formats (SAM<->BAM<->CRAM).

  • Run the following command:

time samtools view -b --threads 8 Fly.sam | samtools sort -o Fly.sort.bam - --threads 8

Q3. What does this cmd line mean?

To use other tools on our alignment file, we need to index it. Run samtools index on the bam file.

Click for code
samtools index Fly.sort.bam

Now let's calculate some statistics. Run idxstats, a tool that reports alignment summary statistics:

Click for code
samtools idxstats Fly.sort.bam

The column names are: reference sequence name, sequence length, # mapped read-segments and # unmapped read-segments.

Q4. Are the number of mapped reads in 3R the same as you got in Q2?

Now we need to count how many reads lie within each gene, so we can quantify their expression level. There are of course, several ways/algorithms that count the number of reads mapped to a specific gene. In this exercise we compare two different counting approaches: htseq and featureCounts. How these algorithms work is out of the scope of this course, but ask if you want to know more!.

Let's install featureCounts that is part of the Subread package, a high-performance read alignment, quantification and mutation discovery tool.

  • Go to the webpage
  • Under Download and installation click Latest version
  • Download subread-2.0.X-source.tar.gz using wget
  • Untar the file
  • Go into the newly created subread directory
  • Follow the installation steps for Linux OS
    • Read the first 2 paragraphs

Have a look under the bin directory, you should see featureCounts as an executable program along with other programs.

  • To run featureCounts via the command line without having to type the absolute path, open your ~/.bashrc with a text editor and add the following line (don´t forget to add the correct path):
export PATH=/home/path_to_featureCounts_installation/bin:$PATH
  • Save the document and close it
  • Run this command to execute the export command you just added:
source ~/.bashrc

Typing featureCounts will give you information on how to run the tool. For this exercise copy the genes.gtf file from /home/courses/Unix/files/, this file contains the list of genes in gtf format. Then use the Fly.sort.bam alignment file as input together with the following arguments:

-a                  Name of an annotation file: genes.gtf 
-o                  Name of output file, use Fly.fc
-p                  Specifying the alignment file contains paired-end reads
--countReadPairs    To count fragments, instead of reads

Have a look at the output files. Visit the subread webpage to understand the ouput.

The quantification with htseq takes some time!, so we calculated it for you. Copy the Fly.htseq file and explore it.

Would the different algorithms give similar results? One way to answer this is by comparing the top 50 most expressed genes. To do this we need to have them in the same format.

Reformat Fly.fc so you have it in the same format as Fly.htseq and select the top 50 most expressed genes for each approach.

Q5. How many highly covered genes are shared by the two methods? Hint: use comm



Developed by Marcela Dávila, 2018. Modified by Marcela Dávila, 2021. Updated by Marcela Dávila, 2024.

⚠️ **GitHub.com Fallback** ⚠️