Read counting - sellwe/Genome-Analysis GitHub Wiki

Read counting is done with HTSeq, using each samples .bam files containing the RNA malignment to the genome produced by the BWA, and the .bff annotation file from Prokka. HTseq gives the number of reads that were mapped to each CDS in every sample. For HTSeq i used the setting "-s yes" as the RNA-seqs are strand specific (aligned to the same strand as the assembly).

HTSeqs output will look something like:

Giving the raw counts per gene. In order to visualize the distribution of reads per gene across all of the samples within each condition I plotted histograms in R:

As we see there is a huge spread as there are few genes that are very highly expressed and the majority has below 10.000 counts.

log10 transformation helps visualizing the spread better (log10+1 to include genes with 0 counts):

As we can see, there are many genes with low expression that end up in the bins that have 0-10 counts:

Except for the bins with very low expression, the count distribution pretty much follows a normal distribution. Most genes have around 1.000-10.000 counts, and a few are heavily expressed at > million counts. But still we see that many genes expressed less than 10 counts.

Percentage of genes with more than 1 count:

Percentage of genes with more than 10 counts:

For the differential expression analysis these will be filtered out. I will use a threshold of genes with minimum of 10 reads in at least three replicates, which is common practice in order to reduce noise.