Analysis 9: Expression analysis using HTSeq - cecilia-andersson/Genome_Analysis_Project GitHub Wiki
I took the .gff file from the Prokka annotation for each bin, removed the FASTA sequences from it, and then ran htseq-count on the mapped reads for each bin (for each location) together with the cleaned annotation file.
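The FASTA-removal step can be sketched in Python as follows. This is a minimal illustration, not the command that was actually used for this analysis: it assumes the sequences sit after a `##FASTA` directive line, as they do in Prokka's GFF output. The function name `strip_fasta_section` and the example records are hypothetical.

```python
def strip_fasta_section(gff_text: str) -> str:
    """Return only the annotation part of a GFF3 file.

    Everything from the '##FASTA' directive onward (the embedded
    sequences) is dropped; all lines before it are kept unchanged.
    """
    kept = []
    for line in gff_text.splitlines():
        if line.strip() == "##FASTA":
            break  # the sequence section starts here
        kept.append(line)
    return "\n".join(kept) + "\n"


# Made-up miniature GFF for illustration only (not data from this project).
example = (
    "##gff-version 3\n"
    "##sequence-region contig_1 1 3000\n"
    "contig_1\tProdigal:2.6\tCDS\t10\t400\t.\t+\t0\tID=GENE_00001\n"
    "##FASTA\n"
    ">contig_1\n"
    "ATGCATGC\n"
)

cleaned = strip_fasta_section(example)
print(cleaned, end="")
```

Prokka annotations usually label features as `CDS` with an `ID` attribute, so htseq-count's `--type` and `--idattr` options generally need to be set to match. The exact options used for this analysis are not recorded here.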
For each gene in a bin, HTSeq counts the number of RNA reads mapped to that gene, which gives a rough idea of the expression for that organism in the given environment. In most bins, most genes had no mapped reads, and those that were expressed usually had apparently low levels (1-10 reads per gene). Below is an example of the output for bin 10, which had generally low expression. A few bins had a much larger number of reads for each expressed gene, indicating that there may have been more individuals belonging to these bins in the collected samples. One example is bin 43, also shown below.
[Screenshot: htseq-count output for bin 10, generally low expression]
[Screenshot: htseq-count output for bin 43, generally more expression]
To think about:
- What is the distribution of the counts per gene? Are most genes expressed? How many counts would indicate that a gene is expressed?
Distribution of counts per gene: Most expressed genes have a low number of counts, between 1 and 20 reads. However, some genes have over 1000 reads mapped to them.
Are most genes expressed: Most genes are not expressed in bins with few mapped reads overall. However, in bins with a high volume of reads, most genes do seem to be expressed, at least at low levels.
How many counts would indicate that a gene is expressed: A single read means the gene is expressed at some very low level, but determining whether a gene is expressed enough to perform a function is more difficult.
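One way to look at the distribution of counts per gene is to tally genes into count ranges. The sketch below assumes the usual htseq-count output layout (one `gene_id<TAB>count` row per feature, followed by summary rows whose names start with `__`, such as `__no_feature`). The function name `summarize_counts`, the range boundaries, and the example rows are illustrative assumptions, not results from this project.

```python
from collections import Counter


def summarize_counts(lines):
    """Tally genes into count ranges from htseq-count style rows."""
    bins = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        gene_id, count = line.split("\t")
        if gene_id.startswith("__"):
            continue  # skip summary rows such as __no_feature
        n = int(count)
        if n == 0:
            bins["0"] += 1
        elif n <= 20:
            bins["1-20"] += 1
        elif n <= 1000:
            bins["21-1000"] += 1
        else:
            bins[">1000"] += 1
    return dict(bins)


# Made-up rows for illustration only (not real counts from any bin).
example = [
    "GENE_00001\t0",
    "GENE_00002\t3",
    "GENE_00003\t0",
    "GENE_00004\t17",
    "GENE_00005\t1520",
    "__no_feature\t250",
    "__ambiguous\t4",
]

summary = summarize_counts(example)
total = sum(summary.values())
for label in ("0", "1-20", "21-1000", ">1000"):
    n = summary.get(label, 0)
    print(f"{label:>8}: {n} genes ({n / total:.0%})")
```

The range boundaries follow the values mentioned in the answers above (1 to 20 reads, and over 1000 reads). They are a convenience for summarizing, not a threshold for calling a gene expressed.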
- In the metagenomics project, the data doesn’t offer enough statistical power for a differential expression analysis. Why not? What can you still tell from the data only from the read counts?
The data doesn't offer enough statistical power for a differential expression analysis because the samples are a mixture of genetic sequences from dozens of different organisms, so each individual species receives too few reads for differences in expression between the two populations (D1 and D3) to be significant. From the read counts alone, you can still gain some understanding of which genes are being expressed in the different environments, which can be useful in further describing previously unknown species.
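As a rough illustration of what the read counts alone can show, the sketch below lists which genes of a bin have reads in D1 only, in D3 only, or in both. It assumes htseq-count style rows (`gene_id<TAB>count`, with summary rows prefixed by `__`). The function name `expressed_genes`, the `min_count` threshold, and the example rows are hypothetical and not taken from the project data.

```python
def expressed_genes(lines, min_count=1):
    """Return the set of gene IDs with at least `min_count` mapped reads."""
    genes = set()
    for line in lines:
        gene_id, count = line.strip().split("\t")
        if gene_id.startswith("__"):
            continue  # skip summary rows such as __no_feature
        if int(count) >= min_count:
            genes.add(gene_id)
    return genes


# Made-up rows for one bin in each sample (illustration only).
d1_rows = ["GENE_00001\t5", "GENE_00002\t0", "GENE_00003\t12", "__no_feature\t90"]
d3_rows = ["GENE_00001\t2", "GENE_00002\t7", "GENE_00003\t0", "__no_feature\t40"]

d1 = expressed_genes(d1_rows)
d3 = expressed_genes(d3_rows)

only_d1 = sorted(d1 - d3)
only_d3 = sorted(d3 - d1)
both = sorted(d1 & d3)

print("D1 only:", only_d1)
print("D3 only:", only_d3)
print("Both:   ", both)
```

This is a presence/absence comparison only. With so few reads per species, a gene missing from one list may simply not have been sampled, so the result is descriptive and not a test of differential expression.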