Counting Reads - ccsstudentmentors/tutorials GitHub Wiki

Now that you have sorted your aligned read bam files, it is time to count how many reads aligned to the genomic regions belonging to each gene.

While cufflinks takes a very intense statistical approach to resolving different isoforms and calculating differential expression, gene level analysis is much more straightforward: We align each read that was produced by the HiSeq machine for this experiment to the genome. (Done in previous tutorial)

Then we ask if the genomic location it aligned to is an exon of a gene (this is our current step). If so, we give a 'point' to that gene. We repeat this counting process for all of our tens of millions of reads for each sample. At the end, for each sample, we have a list of genes and their corresponding number of read counts (points) for this sample.

Then we use the replicate samples of our control and experimental conditions to calculate the average number of read counts for each gene for the control and experimental groups, and the variance of the counts around that average.

What happens next is basically just performing a big statistical comparison (similar to an ANOVA) to ask if the number of read counts assigned to each particular gene changed significantly between the control and experimental groups, while controlling for the different total number of reads in each sample.

Ok, so counting reads in this way is really difficult and annoying to do, but fortunately someone (Simon Anders to be precise) has written an awesome program that does all this for us. That program is called HTSeq-count.

Counting reads over gene intervals

Using the vi editor, let's create a new job file for counting these reads:

cd ~/RNASeq/thisProject/
vi getReadCounts.job

Now, go into vi editor's 'insert' mode by hitting the i button, and enter the following:

#!/bin/bash
#BSUB -J getReadCounts
#BSUB -o getReadCounts.out
#BSUB -e getReadCounts.err
#BSUB -W 80:00
#BSUB -q general
#BSUB -n 1

# To count based on Ensemble Gene IDs (ex. ENSG00000001) 
/share/opt/python/2.7.3/bin/htseq-count -f bam -r name -i gene_id Control1/Control1_accepted_hits.nameSorted.bam ~/RNASeq/GRCh38/GRCh38.gtf > ReadCounts/Control1.geneIDCounts.txt

# To count based on the official gene symbols (ex. MAPK1) 
/share/opt/python/2.7.3/bin/htseq-count -f bam -r name -i gene_name Control1/Control1_accepted_hits.nameSorted.bam ~/RNASeq/GRCh38/GRCh38.gtf > ReadCounts/Control1.geneNameCounts.txt

The first seven lines of the job file are the standard job file header, updated so the naming matches this application.

There are two examples of how to run the HTSeq-count script here. Essentially the only difference is whether you want to deal with the genes using their Ensembl gene IDs or their official gene symbols. Either way is fine and both will work for downstream applications such as EdgeR or DESeq. However, I recommend sticking with the geneIDs for now and then converting over to the gene names later (which we will cover). There are two main reasons for doing it this way: First, the gene names can sometimes be ambiguous whereas the gene IDs never are, and we want to limit the number of steps with any ambiguiuty in them. Second, using the geneIDs allows us to more easily import this data into DESeq2 (it should theoretically also work for the gene names, but it doesn't really, possibly because of reason 1).

So, what does our code mean?

The first thing we do in both of these examples is to state that we want to execute the htseq-count python script by giving its absolute location to the computer.
After that, we use the -f parameter to tell it what format our input data is in, which is bam format.
Then, we use the -r parameter to tell it how our reads were sorted (you can choose between 'name' as shown here or 'pos' if you sorted by positon). If you did sort by position and have paired ended reads, I recommend running this job on Pegasus' 'bigmem' queue to ensure you don't run out of memory. Otherwise, just stick with the 'general' queue as we are in this example.
Next, we use the -i parameter to tell the script what name to use to group exons together. This essentially boils down to picking between using exons that belong to a particular 'gene_id' or exons that belong to a particular 'gene_name'. So select one of those options.
Then, we tell the script where to find our sorted input bam file.
And finally, we tell the script where to put our output. We use the '>' symbol to funnel the output into this specified file. Otherwise, it would have spat the output out to the console.

Select which way you want to run this script, and submit it to Pegasus with this command:

bsub < getReadCounts.job

As always, you can create a separate job file for each sample to ensure things proceed as quickly as possible (each one should complete in about 1-2 hours, done in parallel at the same time) or you can submit all of the tasks as a single job that will perform the read counting in serial (takes 1-2 hours per sample, one after the other, so takes longer).

When it is done, you will see your output file in the specified location. It will be a text file with two columns and many thousands of rows. The first column will be the gene name or gene ID. The second column will be the count of reads for that gene found in this sample.

If you used geneIDs for your naming scheme in HTSeq-count, you are eventually going to want to convert those geneIDs back into official gene symbols. So, we are going to prepare a file with the relevant information to make that possible using our genome annotation gtf file. First, head to the directory with the gtf file:

cd ~/RNASeq/GRCh38/

This next step is easy to do, but might be tricky to standardize for all future cases. So, all I can say is that these lines of code work for standard Ensembl GTF files as they are at the time of writing. Basically, we need to pull out the information in the 'gene_id' and 'gene_name' fields on each line of the gtf file. In each example that I have seen, these pieces of information have been the first and sixth entries in quotation marks, respectively, on each line that describes an exon.

This allows us to use a unix utility command such as 'awk' (which basically manipulates columns and rows in a table) and tell it to treat quotation marks as column delimiters and then extract the information in the second and twelfth columns and print them to a file. First, we need to extract only the lines of the gtf file that are describing exons (lines that describe other features such as entire genes or specific transcripts have different patterns). For this, we will use the shell script below:

grep "exon" Homo_sapiens.GRCh38.79.gtf > allExons.gtf

Next, we pull out the two pieces of information that we want from each line:

awk -F '"' -v OFS='\t' '{print $2,$12}' allExons.gtf > allGeneIDsToGeneNames.txt

Then, we remove the duplicate entries from this list (because we have an entry for every exon, but we only need one for each gene):

sort -u allGeneIDsToNames.txt > GeneIDtoNameConversionTable.txt

What's next?

Congratulations! You have written shell scripts in unix/linux and you are done running jobs on Pegasus!

Now use your FTP program (FileZilla, most likely) and download these gene count files (and the GeneIDtoNameConverstionTable.txt file, if applicable) to your personal computer.

To continue following this tutorial, head to the section on Differential Expression Analysis.