08 Differential Expression analysis - saltpinna/Genome_analysis_project GitHub Wiki

Differential expression analysis was performed using DESeq2 in R; the script can be found under code/Deseq2_script.r. A Python script was then used to extract the 10 most down- and upregulated genes, together with their gene names and annotations from the annotation file. This script can be found under code/finding_min_max_regulated.py, and the result is a CSV file containing the following table, which lists the most down- and upregulated genes in Serum compared to BH.
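The extraction step can be sketched roughly as follows. This is a minimal illustration, not the contents of code/finding_min_max_regulated.py: the gene IDs and values are made up, the `gene_id` column name and the significance filter (`alpha`) are assumptions, and only the `log2FoldChange` and `padj` column names follow DESeq2's standard results table.

```python
import csv
import io

# Hypothetical DESeq2 results export; gene IDs and values are invented for illustration.
RESULTS_CSV = """gene_id,log2FoldChange,padj
gene_a,3.2,0.0001
gene_b,-2.5,0.002
gene_c,0.4,0.8
gene_d,-4.1,0.00005
gene_e,5.0,NA
gene_f,1.7,0.03
"""


def top_regulated(rows, n=10, alpha=0.05):
    """Return (most_down, most_up) lists of significant genes ranked by log2 fold change."""
    significant = []
    for row in rows:
        # DESeq2 writes NA for genes it could not test or filtered out; skip those.
        if row["padj"] == "NA" or row["log2FoldChange"] == "NA":
            continue
        if float(row["padj"]) < alpha:
            significant.append((row["gene_id"], float(row["log2FoldChange"])))
    # Sort ascending: most negative fold changes first.
    ranked = sorted(significant, key=lambda item: item[1])
    down = [gene for gene in ranked if gene[1] < 0][:n]
    up = [gene for gene in reversed(ranked) if gene[1] > 0][:n]
    return down, up


rows = list(csv.DictReader(io.StringIO(RESULTS_CSV)))
down, up = top_regulated(rows, n=2)
print("Most downregulated:", down)
print("Most upregulated:", up)
```

Here `gene_e` has the largest fold change but is skipped because its adjusted p-value is missing, which is one reason to filter before ranking.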

Questions

If your expression results differ from those in the published article, why could it be?

The article did not report any downregulated genes, so those are difficult to compare, but the upregulated genes identified here were also found to be upregulated in the article. Nine of the ten most upregulated genes according to the differential expression analysis participate in purine biosynthesis, which is the same gene cluster and biological pathway that was identified in the article. Some of the genes reported as upregulated in the article were not found to be differentially expressed in this experiment. This could be due to different algorithms being used for the expression analysis, or to failed assembly in those regions of the genome.

How do the different samples and replicates cluster together?

Looking at the plot in the previous section on read counting, the three replicates of each condition cluster together quite well. The BH and serum samples, however, cluster apart from each other.

What effect and implications has the p-value selection in the expression results?

The p-value is the probability of obtaining a result at least as extreme as the one observed if there were no real difference in expression, that is, by chance alone. A high p-value means that the observed result could easily have arisen by chance. A low p-value means that this is very unlikely, so the result is considered statistically significant. Lowering the p-value threshold in the differential expression analysis therefore yields fewer hits, since fewer genes are deemed statistically significant, but also fewer false positives.
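The effect of the threshold on the number of hits can be illustrated with a small sketch. The p-values below are hypothetical and not taken from this analysis.

```python
# Hypothetical p-values for eight genes; not taken from this analysis.
p_values = [0.0004, 0.003, 0.012, 0.03, 0.049, 0.08, 0.2, 0.6]


def count_hits(p_values, threshold):
    """Count genes whose p-value falls below the chosen significance threshold."""
    return sum(1 for p in p_values if p < threshold)


# A stricter (lower) threshold never yields more hits than a looser one.
hits = {t: count_hits(p_values, t) for t in (0.1, 0.05, 0.01, 0.001)}
for threshold, n_hits in hits.items():
    print(f"threshold {threshold}: {n_hits} hits")
```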

What is the q-value and how does it differ from the p-value? Which one should you use to determine if the result is statistically significant?

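The q-value is a p-value adjusted for multiple testing and is a measure of the false discovery rate (FDR): the expected proportion of false positives among all genes called significant at that threshold. The p-value, in contrast, only describes the chance of a false positive in a single test. Since thousands of genes are tested at once in a differential expression analysis, the q-value (the adjusted p-value, padj in DESeq2) should be used to determine statistical significance.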
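As a sketch of how p-values are adjusted for multiple testing, the Benjamini-Hochberg procedure can be written in a few lines. DESeq2 uses this method by default for its `padj` column, but the function below is an illustrative reimplementation with hypothetical input values, not DESeq2's code, and it leaves out DESeq2's independent filtering step.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(p_values)
print(q_values)
```

Each adjusted value is at least as large as the raw p-value, so fewer genes pass a given threshold after adjustment.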

Do you need a normalization step? What would you normalize against? Does DESeq do it?

To compare read counts with each other, they need to be normalized. Factors that can be normalized for are sequencing depth, gene length and RNA composition. DESeq2, which was used in this study, compares the counts of each gene between conditions, one gene at a time, so there is no need to normalize for gene length. DESeq2 does, however, normalize for sequencing depth and RNA composition through its per-sample size factors.
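DESeq2's normalization is based on the median-of-ratios method. The sketch below reimplements the idea in plain Python on a made-up count table; the gene names and counts are hypothetical, and the real calculation is done inside DESeq2 (`estimateSizeFactors`).

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts maps each gene to a list of raw counts, one per sample.
    Returns one size factor per sample.
    """
    n_samples = len(next(iter(counts.values())))
    # Log geometric mean per gene; genes with a zero count in any sample are skipped.
    log_geo_means = {}
    for gene, row in counts.items():
        if all(c > 0 for c in row):
            log_geo_means[gene] = sum(math.log(c) for c in row) / len(row)
    factors = []
    for j in range(n_samples):
        # Ratio of each gene's count to its geometric mean, on the log scale.
        log_ratios = [math.log(counts[g][j]) - lgm for g, lgm in log_geo_means.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Hypothetical counts: the second sample was sequenced twice as deeply as the first.
counts = {
    "gene_a": [100, 200],
    "gene_b": [50, 100],
    "gene_c": [10, 20],
    "gene_d": [0, 5],
}
factors = size_factors(counts)
normalized = {g: [c / f for c, f in zip(row, factors)] for g, row in counts.items()}
print(factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale. Taking the median makes the estimate robust to a few very highly expressed genes, which is how differences in RNA composition are handled.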

What would you do to increase the statistical power of your expression analysis?

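Lowering the p-value and q-value thresholds would make the analysis more stringent and reduce the number of false positives, but it would not increase the statistical power; fewer true differences would be detected. To increase the statistical power of the expression analysis, I would instead add more biological replicates per condition and, to a lesser extent, increase the sequencing depth.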
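As a rough illustration of how statistical power responds to the number of replicates and to the significance threshold, the sketch below uses a two-sample z-test under a normal approximation. This is a deliberate simplification and not DESeq2's negative binomial model; the effect size and standard deviation are hypothetical.

```python
from statistics import NormalDist


def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with known, equal sd."""
    z = NormalDist()
    # Standard error of the difference between the two group means.
    se = sd * (2 / n_per_group) ** 0.5
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect / se
    # Probability that the test statistic falls outside +/- z_crit.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Hypothetical effect of 1 log2 unit with a standard deviation of 1.
power_by_n = {n: approx_power(effect=1.0, sd=1.0, n_per_group=n) for n in (3, 6, 12)}
strict = approx_power(effect=1.0, sd=1.0, n_per_group=3, alpha=0.01)
for n, power in power_by_n.items():
    print(f"{n} replicates per group: power = {power:.2f}")
print(f"3 replicates per group, alpha = 0.01: power = {strict:.2f}")
```

In this approximation, power grows with the number of replicates per group and shrinks when the significance threshold is made stricter.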