Lab 05: Read Counting - ryandkuster/EPP_575_RNA_25 GitHub Wiki

Counting Reads

Today we'll use the program featureCounts to count the number of reads mapped to individual genes in our reference genome.

Using featureCounts

Steps

Before we begin

Go ahead and request resources:

cd $RNA
bash 01_tour/srun.sh

1. Prepare a counts directory

Change to your analysis directory and run the following to link the bam files from the previous STAR mapping results along with the gff file.

cd $RNA
mkdir 05_count
cd 05_count

ln -s /lustre/isaac24/proj/UTK0386/completed/03_mapping_star/*bam .
ln -s /lustre/isaac24/proj/UTK0386/data/reference/genomic_modified.gff .

Load and run featureCounts

module load anaconda3/2024.06
conda activate /lustre/isaac24/proj/UTK0386/conda/featurecounts

featureCounts \
    -p \
    -s 2 \
    -T 8 \
    -t gene \
    -g ID \
    -F GTF \
    -a genomic_modified.gff \
    -o Col_0h_rep1.counts.txt \
    Col_0h_rep1.Aligned.sortedByCoord.out.bam

We can investigate the number of genes in our count file using:

wc -l Col_0h_rep1.counts.txt

Count all samples from STAR, how many do we have?

ln -s ../03_mapping_star/*.bam .

If you managed to run all the samples, there should be 6 bams.

module load anaconda3/2024.06
conda activate /lustre/isaac24/proj/UTK0386/conda/featurecounts

featureCounts \
    -p \
    -s 2 \
    -T 8 \
    -t gene \
    -g ID \
    -F GTF \
    -a genomic_modified.gff \
    -o combined.counts.txt \
    *bam

If you have multiple files, a for loop can be employed, but the output count files will have to be stitched together later, which can be extra, unnecessary work.

conda deactivate

Using HTSeq (optional, it may take awhile)

Helpful links:

The general inputs for HTSeq are as follows:

  • --format= specify alignment file type (e.g. "bam")
  • --order= tell htseq the manner in which the alignment file has been sorted (e.g. "pos")
  • --stranded= count reads dependent on forward/reverse strand-specific orientation (defaut: "no")
  • --type= the gff3/gtf feature to count by (e.g. "gene")
  • --idattr= the gff3/gtf column feature used to group the desired type

Note:

HTSeq allows for highly customizable count approaches for many data types. Depending on the library, you may need to adjust these arguments. If in doubt, read the docs.

Steps:

Load HTSeq using conda

module load anaconda3/2024.06
conda activate /lustre/isaac24/proj/UTK0386/conda/htseq

Make sure htseq-count is functional

htseq-count --help

Run HTseq

htseq-count \
  --format=bam \
  --order=pos \
  --stranded=no \
  --type=gene \
  --idattr=ID \
  Col_0h_rep1.Aligned.sortedByCoord.out.bam \
  genomic_modified.gff \
  > Col_0h_rep1.counts.txt \
  2> Col_0h_rep1.out

--stranded is an important parameter and it will depend on the library preparation method for your sample. Typically, this is "no", but always check the publication and raw data experimental documentation. In the case of our CPK28 study, the tool Salmon was used to detect that the library is "ISR", meaning "inward" pointing reads, "stranded", and "reverse" (meaning the forward read comes from the reverse strand). However, the formatting of the gff prevents us from using strandedness with the htseq.