Differential Expression Analysis - sellwe/Genome-Analysis GitHub Wiki
For the differential expression analysis, we want to see if the gene expression is different between the two categories: nutrient rich BH and nutrient poor serum. The goal is to identify genes that help E.faecium survive in human serum, so we are interested in finding genes that are upregulated. I did this by using DESeq2 in R. Im using the read counts data from HTSeq. I also merged them with the .tsv output file from Prokka, in order to add the functional annotation for each CDS in the same dataframe.
In order to find out the expression difference between the two classes i merged the replicates from the same growth conditions together. 3 BH and 3 Serum.
For determining the significanty differentially expressed genes i used adjusted p-value < 0.001 and the |log2FoldChange| > 1. The adjusted p-value (or q-value) is used for statistical significance. padj is the corrected p-values for multiple testing methods such as FDR correction with Benjamin-Hochberg, or even Bonferroni correction. With FDR-correction the q-value is calculated by ranking the genes based on their simple p-values from smaller to larger, and take this rank of significance into consideration in the equation. If we used the simple p-value and threshold 0.05 we would get 5% false positives for thousands of genes, which is unwanted for this magnitude of analysis.
log2foldchange is the difference in expression between the two conditions. Here it is based on the expression of serum relative to bh, meaning that log2F > 0 the gene is upregulated in serum, and log2F < 0 the gene is downregulated in serum. log2FC is the most biologically relevant metric in differential expression, but it doesnt say anything about the statistical power (the study used fold change <0.5 or >2, which is mathematically the same thing).
Metric | Value |
---|---|
Total genes | 2733 |
Significant DE genes (padj < 0.001, abs(log2FC) > 1) | 1210 |
Percentage of significant DE genes | 44.27 % |
# Upregulated genes in Serum | 616 |
# Downregulated genes in Serum | 594 |
The paper found that 27.8% of genes were differentially expressed. Interestingly, i got a higher fraction (44.27%) based on the same significance level (padj < 0.001, |log2FC| > 1). This could be due to difference in methods used. Since i only used Canu on Pacbio reads for my assembly i might have more fragmented reads, whereas the study used PacBio reads, Illumina reads for polishing and MinION 2D NanoPore reads with Spades for plasmid gap closure. If their assembly was more complete than mine that would impact the other downstream analyses. My fragmented genes might have inflated the DE expression analysis with false positves.
PC-analysis:
The Principle Component Analysis shows clear clustering of the two conditions, with 100% of the variance attributed to the growth conditions. This shows that there are no other factors playing a role, and indicating that the study was well structured.
Heatmap:
The heatmap was based on the top30 genes sorted on padj. Same thing is seen here, clear division between the samples in BH and serum.
Simple MAplot:
The MAplot (default settings) shows high differential expression
A similar observation can be made from this Volcano plot:
The volcano plot uses both padj and log2FC for significance. Here we see that many of the most upregulated genes in the serum condition (positive log2FC) belong to the pur gene family. This is in accordance with the paper, which found that the most differentially expressed genes when growing in human serum was genes involved in purine biosynthesis, and especially purD, purH, purL, which were upregulated in serum.
In the resulting R dataframe, if we only sorting by top log2foldchange (and not padj) I got these top results:
gene_id | gene_name | log2fc | padj | functional_annotation |
---|---|---|---|---|
EJCDLKNP_00593 | purS | 9.389864 | 6.056846e-295 | Phosphoribosylformylglycinamidine synthase subunit PurS |
EJCDLKNP_00592 | purQ | 9.136631 | 7.688300e-74 | Phosphoribosylformylglycinamidine synthase subunit PurQ |
EJCDLKNP_00594 | purC | 9.007313 | 1.698686e-118 | Phosphoribosylaminoimidazole-succinocarboxamide synthase |
EJCDLKNP_00591 | purL | 8.909482 | 3.349962e-90 | Phosphoribosylformylglycinamidine synthase subunit PurL |
EJCDLKNP_00590 | purF | 8.773152 | 3.291780e-156 | Amidophosphoribosyltransferase |
EJCDLKNP_00589 | purM | 8.404809 | 3.101008e-260 | Phosphoribosylformylglycinamidine cyclo-ligase |
EJCDLKNP_00587 | purH | 8.340637 | 6.928812e-171 | Bifunctional purine biosynthesis protein PurH |
EJCDLKNP_00586 | purD | 8.257291 | 1.311576e-182 | Phosphoribosylamine--glycine ligase |
EJCDLKNP_00588 | purN | 8.183691 | 1.139693e-262 | Phosphoribosylglycinamide formyltransferase |
Where we see the same thing. The most biologically relevant differentially expressed genes are in the pur gene family. From the log2FC we can calculate that these genes are expressed between 2^8.183691=291 and 2^9.389864=671 times more in the serum condition than the BH condition.
This is also in accordance with the broader literature, showing that purine synthesis is crucial for bacteria to grow in human serum. And in many cases, having the ability to synthesize nucleotides is also linked to virulence in other bacterial species. The authors propose that this pathway are among protentional targets for development of new antibiotics as it is vital for growth in the human bloodstream.
If we instead sort the dataframe by both log2FC and padj the top 15 genes are:
Gene ID | Gene Name | log2FC | padj | Functional Annotation |
---|---|---|---|---|
EJCDLKNP_00533 | — | 9.389864 | 0.10062093 | hypothetical protein |
EJCDLKNP_00314 | gspA_1 | 7.454470 | 0.14034599 | General stress protein A |
EJCDLKNP_00317 | — | 7.424159 | 0.14905973 | hypothetical protein |
EJCDLKNP_02310 | dppE | 7.329229 | 0.08386896 | Dipeptide-binding protein DppE |
EJCDLKNP_00315 | gspA_2 | 7.303975 | 0.16407970 | General stress protein A |
EJCDLKNP_02312 | oppB | 7.147213 | 0.10507268 | Oligopeptide transport system permease protein OppB |
EJCDLKNP_00668 | folT_1 | 6.970776 | 0.12251398 | Folate transporter FolT |
EJCDLKNP_02313 | dppC | 6.640900 | 0.09920920 | Dipeptide transport system permease protein DppC |
EJCDLKNP_02314 | oppD_2 | 6.421884 | 0.07855191 | Oligopeptide transport ATP-binding protein OppD |
EJCDLKNP_02316 | — | 6.419434 | 0.07883059 | IS200/IS605 family transposase ISEfa4 |
EJCDLKNP_02315 | oppF_2 | 6.413264 | 0.08358524 | Oligopeptide transport ATP-binding protein OppF |
EJCDLKNP_01553 | yclN | 6.120752 | 0.11614826 | Petrobactin import system permease protein YclN |
EJCDLKNP_00196 | lacC_1 | -6.051703 | 0.10482648 | Tagatose-6-phosphate kinase |
EJCDLKNP_01556 | yclQ | 6.005511 | 0.08744946 | Petrobactin-binding protein YclQ |
EJCDLKNP_01554 | yclO | 5.956161 | 0.09981891 | Petrobactin import system permease protein YclO |
Here it seems as stress response genes and transport system genes are also upregulated. Which would make sense in a more nutrient poor environment where the cell is under a lot of stress and it needs to maximize the efficiency of its uptake and transport system. The only downreuglated gene here is lacC_1, transcribing Tagatose-6-phosphate kinase. This enzyme is involved in the metabolic pathway of converting galactose 6-phosphate to pyruvate. And it would make sense that this is downregulated as there arent any galactose human serum or the human bloodstream. Maybe its downregulated in order to conserve as much energy as possible, or its very upregulated in the BH growth condition where galactose is plentyful, so the relative expression is very different.
In conclusion, i found a higher amount of differentially expressed genes than the study. But i did identify similar genes and genes from similar gene families as the most significant and biologically relevant to the study.