Sorting reads - ccsstudentmentors/tutorials GitHub Wiki

This step is easy!

All we need to do is take the aligned read data produced by Tophat (or STAR) and sort it. Right now, that data is just in whatever order the alignment program figured out its genomic site in. To efficiently count how many reads we have for each gene (which will be our next step), the program needs the reads to be sorted in a particular way. We have two choices for how to sort the reads: by their position on the chromosome, or by the name of the read.

Intuitively, sorting by position may seem like the far more useful mode to choose, but this tutorial is actually going to use sorting by name instead. The reason for this is because the program that does the counting over genomic intervals (HTSeq-count) is written in a way that handles paired ended reads in a way that uses memory much more efficiently when the data are name sorted than if they are position sorted. If your data is single-ended, you are welcome to sort them either way, both methods will use an equal amount of memory. We will use name sorting in this tutorial since it works for both.

To sort the bam file data, we need to load a suite of tools called samtools using the following command:

module load samtools/0.1.19

You may have already loaded samtools to peek at the bam file using its 'view' command. But now we are going to use its 'sort' tool.

You can find the full manual on samtools' commands here.

Sorting the Data

Head to the following folder and create a job file for sorting the bam files. We will again use the vi editor for this.

cd ~/RNASeq/thisProject/
vi bamSort.job

Now, go into the vi editor's 'insert' mode (hit the i button) and enter the following:

#!/bin/bash
#BSUB -J BamSort
#BSUB -o BamSort.out
#BSUB -e BamSort.err
#BSUB -W 80:00
#BSUB -q general
#BSUB -n 1

samtools sort -n ./Control1/accepted_hits.bam ./Control1/Control1_accepted_hits.nameSorted
samtools sort -n ./Control2/accepted_hits.bam ./Control2/Control2_accepted_hits.nameSorted
samtools sort -n ./Control3/accepted_hits.bam ./Control3/Control3_accepted_hits.nameSorted

The first seven lines of this bamSort job file are essentially the same as they were in the Tophat job file.
The naming of the job and its standard output and error files has just been updated to reflect this new task.

Lines 9-11 show examples of running the samtools sort command on the outputs of tophat alignment for three control samples. In each case, we call the executable file 'samtools' first. The computer knows where this file is because we loaded the samtools module earlier. Within the samtools file, there are numerous tools that can be run, so the next thing we do on each line is to specify that we want to use the 'sort' tool.

  • By default, the 'sort' tool sorts the reads based on their position on chromosomes. But we want to sort the reads based on their names (so read pairs show up together). To do that, we enter the '-n' parameter.
  • After that, we tell the program where our input bam file is.
  • Finally, we tell it where we want our output sorted bam file to be and what we want it to be named.

Note that samtools will append a '.bam' to whatever you write as the output name, so there is no point in adding it here in the job file.

So, the name of the output file in line #9 of the example will really be 'Control1_accepted_hits.nameSorted.bam' and it will be in the Control1 folder.

A few notes about running this job file: I wrote this file with the expectation that it would be run (submitted to the cluster) from the main project folder (~/RNASeq/thisProject/). You do not have to do it this way -- just adjust the input and output file locations accordingly (or use absolute path names each time such as ~/RNASeq/thisProject/Control1/accepted_hits.bam). I wrote this file to sort the bam files from each sample one after another. You could also make the sorting of each bam file its own job file to cut down on run time.

Now you are ready to submit this job to the cluster with the following command:

bsub < bamSort.job

While the job is running, you will see lots of little bam files appear in the folder where your input is. Those are just intermediate files used to help the sorting process. They will get cleaned up by the end. Once the job has finished running. You should see the output file that you specified.

Now you are ready to proceed to counting reads that aligned to each gene!