Differential Expression Analysis - sellwe/Genome-Analysis GitHub Wiki

For the differential expression analysis, we want to see if the gene expression is different between the two categories: nutrient rich BH and nutrient poor serum. The goal is to identify genes that help E.faecium survive in human serum, so we are interested in finding genes that are upregulated. I did this by using DESeq2 in R. Im using the read counts data from HTSeq. I also merged them with the .tsv output file from Prokka, in order to add the functional annotation for each CDS in the same dataframe.

Differential expression

In order to find out the expression difference between the two classes i merged the replicates from the same growth conditions together. 3 BH and 3 Serum.

For determining the significanty differentially expressed genes i used adjusted p-value < 0.001 and the |log2FoldChange| > 1. The adjusted p-value (or q-value) is used for statistical significance. padj is the corrected p-values for multiple testing methods such as FDR correction with Benjamin-Hochberg, or even Bonferroni correction. With FDR-correction the q-value is calculated by ranking the genes based on their simple p-values from smaller to larger, and take this rank of significance into consideration in the equation. If we used the simple p-value and threshold 0.05 we would get 5% false positives for thousands of genes, which is unwanted for this magnitude of analysis.

log2foldchange is the difference in expression between the two conditions. Here it is based on the expression of serum relative to bh, meaning that log2F > 0 the gene is upregulated in serum, and log2F < 0 the gene is downregulated in serum. log2FC is the most biologically relevant metric in differential expression, but it doesnt say anything about the statistical power (the study used fold change <0.5 or >2, which is mathematically the same thing).

Metric Value
Total genes 2733
Significant DE genes (padj < 0.001, abs(log2FC) > 1) 1210
Percentage of significant DE genes 44.27 %
# Upregulated genes in Serum 616
# Downregulated genes in Serum 594

The paper found that 27.8% of genes were differentially expressed. Interestingly, i got a higher fraction (44.27%) based on the same significance level (padj < 0.001, |log2FC| > 1). This could be due to difference in methods used. Since i only used Canu on Pacbio reads for my assembly i might have more fragmented reads, whereas the study used PacBio reads, Illumina reads for polishing and MinION 2D NanoPore reads with Spades for plasmid gap closure. If their assembly was more complete than mine that would impact the other downstream analyses. My fragmented genes might have inflated the DE expression analysis with false positves.

Clustering

PC-analysis:

image

The Principle Component Analysis shows clear clustering of the two conditions, with 100% of the variance attributed to the growth conditions. This shows that there are no other factors playing a role, and indicating that the study was well structured.

Heatmap:

image

The heatmap was based on the top30 genes sorted on padj. Same thing is seen here, clear division between the samples in BH and serum.

Plots

Simple MAplot:

image

The MAplot (default settings) shows high differential expression

A similar observation can be made from this Volcano plot:

image

The volcano plot uses both padj and log2FC for significance. Here we see that many of the most upregulated genes in the serum condition (positive log2FC) belong to the pur gene family. This is in accordance with the paper, which found that the most differentially expressed genes when growing in human serum was genes involved in purine biosynthesis, and especially purD, purH, purL, which were upregulated in serum.

In the resulting R dataframe, if we only sorting by top log2foldchange (and not padj) I got these top results:

gene_id gene_name log2fc padj functional_annotation
EJCDLKNP_00593 purS 9.389864 6.056846e-295 Phosphoribosylformylglycinamidine synthase subunit PurS
EJCDLKNP_00592 purQ 9.136631 7.688300e-74 Phosphoribosylformylglycinamidine synthase subunit PurQ
EJCDLKNP_00594 purC 9.007313 1.698686e-118 Phosphoribosylaminoimidazole-succinocarboxamide synthase
EJCDLKNP_00591 purL 8.909482 3.349962e-90 Phosphoribosylformylglycinamidine synthase subunit PurL
EJCDLKNP_00590 purF 8.773152 3.291780e-156 Amidophosphoribosyltransferase
EJCDLKNP_00589 purM 8.404809 3.101008e-260 Phosphoribosylformylglycinamidine cyclo-ligase
EJCDLKNP_00587 purH 8.340637 6.928812e-171 Bifunctional purine biosynthesis protein PurH
EJCDLKNP_00586 purD 8.257291 1.311576e-182 Phosphoribosylamine--glycine ligase
EJCDLKNP_00588 purN 8.183691 1.139693e-262 Phosphoribosylglycinamide formyltransferase

Where we see the same thing. The most biologically relevant differentially expressed genes are in the pur gene family. From the log2FC we can calculate that these genes are expressed between 2^8.183691=291 and 2^9.389864=671 times more in the serum condition than the BH condition.

This is also in accordance with the broader literature, showing that purine synthesis is crucial for bacteria to grow in human serum. And in many cases, having the ability to synthesize nucleotides is also linked to virulence in other bacterial species. The authors propose that this pathway are among protentional targets for development of new antibiotics as it is vital for growth in the human bloodstream.

If we instead sort the dataframe by both log2FC and padj the top 15 genes are:

Gene ID Gene Name log2FC padj Functional Annotation
EJCDLKNP_00533 9.389864 0.10062093 hypothetical protein
EJCDLKNP_00314 gspA_1 7.454470 0.14034599 General stress protein A
EJCDLKNP_00317 7.424159 0.14905973 hypothetical protein
EJCDLKNP_02310 dppE 7.329229 0.08386896 Dipeptide-binding protein DppE
EJCDLKNP_00315 gspA_2 7.303975 0.16407970 General stress protein A
EJCDLKNP_02312 oppB 7.147213 0.10507268 Oligopeptide transport system permease protein OppB
EJCDLKNP_00668 folT_1 6.970776 0.12251398 Folate transporter FolT
EJCDLKNP_02313 dppC 6.640900 0.09920920 Dipeptide transport system permease protein DppC
EJCDLKNP_02314 oppD_2 6.421884 0.07855191 Oligopeptide transport ATP-binding protein OppD
EJCDLKNP_02316 6.419434 0.07883059 IS200/IS605 family transposase ISEfa4
EJCDLKNP_02315 oppF_2 6.413264 0.08358524 Oligopeptide transport ATP-binding protein OppF
EJCDLKNP_01553 yclN 6.120752 0.11614826 Petrobactin import system permease protein YclN
EJCDLKNP_00196 lacC_1 -6.051703 0.10482648 Tagatose-6-phosphate kinase
EJCDLKNP_01556 yclQ 6.005511 0.08744946 Petrobactin-binding protein YclQ
EJCDLKNP_01554 yclO 5.956161 0.09981891 Petrobactin import system permease protein YclO

Here it seems as stress response genes and transport system genes are also upregulated. Which would make sense in a more nutrient poor environment where the cell is under a lot of stress and it needs to maximize the efficiency of its uptake and transport system. The only downreuglated gene here is lacC_1, transcribing Tagatose-6-phosphate kinase. This enzyme is involved in the metabolic pathway of converting galactose 6-phosphate to pyruvate. And it would make sense that this is downregulated as there arent any galactose human serum or the human bloodstream. Maybe its downregulated in order to conserve as much energy as possible, or its very upregulated in the BH growth condition where galactose is plentyful, so the relative expression is very different.

In conclusion, i found a higher amount of differentially expressed genes than the study. But i did identify similar genes and genes from similar gene families as the most significant and biologically relevant to the study.

⚠️ **GitHub.com Fallback** ⚠️