Differential expression analysis - MaryamDost/GenomeAnalysis GitHub Wiki
Differential expression analysis
Differential expression analysis is a process that takes the normalized read count data and performs statistical analysis to discover quantitative changes in expression levels between experimental groups. In this project we conduct the differences in the expression of different genes between cultures in different conditions: chemostat and bioleaching.
For this analysis R packages and DESeq2 library were used. I ran R in UPPMAX, the steps which conducted the analysis and created the PCA and MA-plots is scripted and found here. Heatmap was created online and grep
was used to find genes in file and compare the results that are presenten in table 2 and 3.
Results
![]
Figure 1: PCA Plot; the bottom axes read PCA scores of the samples (dots) and left axes is the loading plot; how strongly each characteristic influences the principal components.
Figure 2: Heatmap; force each gene to have zero mean and standard deviation=1, shows the comparisons between genes. Pink low score, white middle score and green shows high scores
Figure 3: MA plot; scatter plot of log2 fold changes (M, on the y-axis) versus the average expression signal (A, on the x-axis). Points away from M=0 and with high A-values ares significant, the ones above M=0 are upregulated and the ones below are downregulated.
Table 1: the comparison of project results with the results presented in the papper and its bilaga
Protein (Prokka) | Gene(prokka&eggNOG) | function (eggNOG) | Paper result | |
---|---|---|---|---|
LFT_00035 | Cation efflux sytem protein CusA | Prokka: cusA_1 eggNOG: czcA | silver ion transport | heavy metal efflux pump, CzcA family |
LFT_00250 | cheZ | Chemotaxis phosphatase | cheZ; chemotaxis regulator, protein phosphatase for CheY; K03414 chemotaxis protein CheZ (A) | |
LFT_00398 | Glutamine-fructose-6-phosphate aminotransferase [isomerizing] | glmS_1 | Catalyzes the first step in hexosamine metabolism, converting fructose-6P into glucosamine-6P using glutamine as a nitrogen source | glutamine-fructose-6-phosphate transaminase (isomerizing) |
LFT_00419 | hypothetical protein | Type I phosphodiesterase nucleotide pyrophosphatase | Type I phosphodiesterase/nucleotide pyrophosphatase | |
LFT_00432 | hypothetical protein | epsH | Putative glycosyltransferase EpsH | Glycosyltransferase involved in cell wall bisynthesis ; Glycan biosynthesis and metabolism;Lipopolysaccharide biosynthesis |
LFT_00433 | hypothetical protein | ntpI | ATP hydrolysis coupled proton transport | |
LFT_00447 | Beta-barrel | bepA_2 | ||
LFT_00544 | Adenylylsulfate reductase subunit beta | aprB | aprB: adenylylsulfate reductase, beta subunit | |
LFT_00628 | L-ectoine synthase | ectC | Catalyzes the circularization of gamma-N-acetyl- alpha,gamma-diaminobutyric acid (ADABA) to ectoine (1,4,5,6- tetrahydro-2-methyl-4-pyrimdine carboxylic acid), which is an excellent osmoprotectant | putative ectoine synthase; L-ectoine synthase |
LFT_00761 | RNA polymerase sigma factor FliA | fliA | fliA; RNA polymerase, sigma 28 (sigma F) factor; RNA polymerase sigma factor for flagellar operon FliA (A) | |
LFT_00765 | Flagellar biosynthetic protein FlhB | flhB | Required for formation of the rod structure in the basal body of the flagellar apparatus. Together with FliI and FliH, may constitute the export apparatus of flagellin | flhB; flagellin export apparatus, substrate specificity protein; flagellar biosynthetic protein FlhB (A) |
LFT_00769 | hypothetical protein | fliZ | fliZ flagellar reglatory | flagellar protein FliO/FliZ |
LFT_00771 | hypothetical protein | fliL | Controls the rotational direction of flagella during chemotaxis | fliL; flagellar biosynthesis protein; flagellar FliL protein (A) |
LFT_00785 | Signal transduction histidine-protein kinase AtoS | Prokka: atoS eggNOG: flrB | phosphorelay sensor kinase activity | atoS; sensory histidine kinase in two-component regulatory system with AtoC; two-component system, NtrC family, sensor histidine kinase AtoS (A) & sensory box sensor histidine kinase; two-component system, sensor histidine kinase FlrB (A) |
LFT_01065 | hypothetical protein | yeiH | Membrane protein |
Table 2: some other intresstion genes discussed in the papper
Gene/gene group mention in paper | Genes annotated | The functional annotation of the annotated genes |
---|---|---|
Cytochrome (reposible for energy conservation) | I annotated 6 genes (3 of them were presented as hypothetical protein by prokka);They annotated 3 genes | ccoN: oxidoreductase activity, acting on a heme group of donors, oxygen as acceptor; ccpA: Cytochrome c551 peroxidase |
RubisCO (Carbon dioxide fixation) Ribulose bisphosphate carboxylase large | mtnW | product=2%2C3-diketo-5-methylthiopentyl-1-phosphate enolase |
Nif_specific | Alla expect nifU | |
nasA | Major Facilitator Superfamily | Nitrate transporter |
Discussion
PCA plot (figure 2) looks good as the chemostat samples are clustered on the left and the bioleaching samples are clustered on the right side in the plot. The differences between cultures are clearly much higher than the differences between the replicates. However, we have only 5 samples (3 samples in chemostat and 2 samples in bioleaching) to compare the gene expression between them. I feel like it is not enough data to be able to draw definite conclusions on gene expression for all genes between cultures but enough to mark some differences. These differences in gene expressions are also visible in heatmap (figure 2).
In the produced heatmap, Z-scores of the gene expression measurements are used instead of log2-transformed. Z-scores are centered and normalized, so it is easier to interpret the color as standard deviations from the mean, which make is easier to have an intuitive idea of the relative variation of that value. If we want to show differences between samples, it is good to make Z-score by genes.
The MA plot (figure 3) visualizes gene expression datasets identifying gene expression changes from the two different conditions in terms of log fold change (M) on Y-axis and log of the mean of normalized expression counts of chemostat and bioleaching samples (A) on X-axis. Genes the are clustered around M=0 are genes with similar expression, we are not going to see significant differences between these expression between the two conditions. The points away from M=0 and with high A values are genes which interests us the most.
MA plot does not show which genes are statistically significant, we will need Volcano plot for that. I did not know this when producing the result so I choose them manually by finding significant genes with decent adjusted p values and logfold value, comparing it with the results from the paper. As seen in table 1 almost all genes were match with there results. I identified a gene bepA_2 (beta-barrel) which interestingly was not identified by them. When I search for ectC gene amongst their annotation I recognize that they have annotated a gene with structural annotation, L-ectoine synthase but no gene name; eggNOG identified the gene as ectC and its function as following: Catalyzes the circularization of gamma-N-acetyl- alpha,gamma-diaminobutyric acid (ADABA) to ectoine (1,4,5,6- tetrahydro-2-methyl-4-pyrimdine carboxylic acid), which is an excellent osmoprotectant.
Other interesting genes that were mentioned in paper we also identified(except nifU), some of these are presented in table 2.
Lab-manual Questions
- If your expression results differ from those in the published article, why could it be?
Any steps in this project could cause a difference in the results. In this project, not all steps were followed in paper (eg annotation refinement) and in each step different programs and parameters were used to obtain approximated results.
If we consider mapping steps, the difference may be due to mapping ambiguity (see mapping), which means the read matches in different regions. The difference in results would than be due to the softwares choosing to keep different matches.
- How do the different samples and replicates cluster together?
PCA plot shows different samples and replicates clustering together. Between the two culture conditions there is 87% difference and they are clustered together along the x-axis, The y axis visualizes the differences between biological replicates, a 8% of the total variance.
- What effect and implications has the p-value selection in the expression results?
For example, if we use a p-value of 0.05, 5% of the tests will hold a significant p-value which is obtained by chance. Since differential expression in 2525 genes are tested simultaneously, about 125 of these tests would be false positives.
- What is the q-value and how does it differ from the p-value? Which one should you use to determine if the result is statistically significant?
q-value provides a means to control the positive false discovery rate (pFDR). In this project, the p-values are corrected using the FDR method. This will minimize the number of false positives; thus, the p-value is the one that should be used to determine the statistical significance of the results.
- Do you need a normalization step? What would you normalize against? Does DESeq do it?
Normalization is necessary to make accurate comparisons of gene expression between samples. The DESeq2 package is designed for normalization and makes use of empirical Bayes techniques to estimate priors for log fold.
DESeq corrects for library size by normalizing the total number of reads so it is the same for all the samples. In this project we are working with 2 sample of one condition and 3 sample of the other condition, which in total is only 5 samples. Normalization with help us by making the counts for each gene comparable between samples
- What would you do to increase the statistical power of your expression analysis?
We could increase sample size or sequencing depth to increase statistical power. In our case both ways will have a significant change but if the sequencing depth reaches 20 million read then increasing the sample size will be more significant.
Reference
Ching Travers ,Huang Sijia, Garmirehttps Lana X. (2011) Power analysis and sample size estimation for RNA-Seq differential expression://www.ncbi.nlm.nih.gov/pmc/articles/PMC4201821/ Bedrehttps Renesh (2021) MA plot to visualize gene expression data using Python://www.reneshbedre.com/blog/ma.html