Assignment 2 Journal - bcb420-2023/Angela_Uzelac GitHub Wiki

Assignment 2 - Differential Gene Expression and Over-Representation Analysis

Objective

document progress while working on assignment 2

Duration

Day 1: 5 hrs

Day 2: 5 hrs

Day 3: 7 hrs

Day 4: 3 hrs

Day 5: 5 hrs

Day 6: 10 hrs

Procedure

re-run previous assignment
use write.table to save the final normalized counts to a file
got an error with ComplexHeatmap::Heatmap(), so tried pheatmap, stats::heatmap(), and gplots::heatmap.2() instead
whenever I try to specify the color or color ramp I get this error: "Error in color[1] : object of type 'closure' is not subsettable" (this message usually means a function was passed where a vector of colors was expected, e.g. a palette function that was never called)
the gene of interest here is actually IGF2 (conveniently this is also its HGNC symbol), Ensembl ID ENSG00000167244
we have this gene in the table of normalized counts, so can use it for a t-test of the null hypothesis against the alternative
found this info at GeneCards
take expression data for only IGF2
separate samples into control, bipolar, and schizophrenia
t(): pass a df or matrix, returns the transpose (flipped sideways, rows become cols and cols become rows)
combined the bipolar and schizophrenia samples into one disease group, then ran a t-test between the control and disease groups
t-test: an inferential statistical test used to determine whether there is a significant difference between the means of two groups. T-tests are used when the data are approximately normally distributed and the population variances are unknown. The test takes a sample from each of the two groups and assumes a null hypothesis that the two means are equal. Based on the resulting p-value, the null hypothesis is either rejected or not rejected. If it is rejected, the observed difference between the groups is unlikely to be due to chance alone.
therefore we reject the null hypothesis
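The steps above (keep only IGF2, split the samples by condition, transpose, run the t-test) can be sketched roughly as follows. This is a minimal sketch on made-up data: the matrix values, the gene "OTHER", and the sample name prefixes are all hypothetical, and it uses R's default Welch t-test, which may not match the exact call used in the assignment.

```r
# Toy normalized-count matrix: rows are genes, columns are samples.
# All values and sample names are made up for illustration.
set.seed(1)
counts <- matrix(
  rnorm(18, mean = 5),
  nrow = 2,
  dimnames = list(
    c("IGF2", "OTHER"),
    c(paste0("ctrl_", 1:3), paste0("bipolar_", 1:3), paste0("scz_", 1:3))
  )
)

# Keep only the IGF2 row; drop = FALSE keeps it as a 1-row matrix.
igf2 <- counts["IGF2", , drop = FALSE]

# Split the columns by condition using the (hypothetical) name prefixes.
control <- igf2[, grepl("^ctrl_", colnames(igf2)), drop = FALSE]
disease <- igf2[, grepl("^(bipolar|scz)_", colnames(igf2)), drop = FALSE]

# t() turns each 1 x n row into an n x 1 column (samples become rows).
control_col <- t(control)
disease_col <- t(disease)

# Welch two-sample t-test (R's default, var.equal = FALSE).
# Null hypothesis: mean IGF2 expression is equal in both groups.
result <- t.test(as.numeric(control_col), as.numeric(disease_col))
print(result$p.value)
```

A p-value below the chosen threshold (commonly 0.05) would be the basis for rejecting the null hypothesis; with random toy data like this, no particular outcome is expected.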

Notes:

the first question is actually referring to the p-values in the edgeR QLF test
the FDR column is actually the adjusted p-value so that's the answer for the second question
want upregulation in disease and downregulation in control
annotation source means the data source, like GO BP (biological process)
ORA method means the tool, like g:Profiler
the volcano plot will be x axis: logFC, y axis: -log10(p-value)
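The volcano plot layout described in the last note can be sketched in base R. This is only an illustration on random data: the data frame `res` is a stand-in for a results table with logFC, PValue and FDR columns (the names edgeR uses), and the 0.05 cutoff and colors are assumptions, not the values used in the assignment.

```r
# Stand-in for a differential expression results table (random values).
set.seed(42)
res <- data.frame(
  logFC  = rnorm(200),
  PValue = runif(200)
)
# Adjusted p-values; Benjamini-Hochberg is assumed here.
res$FDR <- p.adjust(res$PValue, method = "BH")

# y axis of the volcano plot.
neg_log_p <- -log10(res$PValue)

# Label each gene as up, down, or not significant (assumed cutoff 0.05).
status <- ifelse(res$PValue < 0.05 & res$logFC > 0, "up",
          ifelse(res$PValue < 0.05 & res$logFC < 0, "down", "ns"))
cols <- c(up = "red", down = "blue", ns = "grey")

plot(res$logFC, neg_log_p,
     col = cols[status], pch = 19,
     xlab = "logFC", ylab = "-log10(p-value)",
     main = "Volcano plot (toy data)")

# legend() has to come after plot() on the same device.
legend("topright", legend = names(cols), col = cols, pch = 19, cex = 0.8)
```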

Roadblocks:

complex heatmap not working - too many genes, used pheatmap instead, but also cut down on the number of genes
not getting good results for the heatmap and volcano plot; investigated and saw that the bipolar samples were too variable, so I chose to work only with the schizophrenia samples
this also cut down the number of significantly differentially expressed genes
legends were not being created on the plots; R was saying "plot.new has not been called". For one of them I decreased the size of the legend and it worked. For the other I had to use the plot function from the graphics package instead of base
all the log fold changes were negative: the model should include an intercept instead of using + 0 and having both conditions as separate columns in the model
always run the QLF test on the disease coefficient, not the control, or else the fold-change signs will be backwards
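The last two roadblocks come down to how the design matrix is built. The sketch below uses only base R's model.matrix() on a hypothetical six-sample layout (three control, three scz); the edgeR calls appear only as comments and the object name `fit` is an assumption, so treat it as an illustration rather than the model used in the assignment.

```r
# Hypothetical sample layout; control is set as the reference level.
condition <- factor(
  c("control", "control", "control", "scz", "scz", "scz"),
  levels = c("control", "scz")
)

# With an intercept: column 2 is the scz-vs-control difference.
design_intercept <- model.matrix(~ condition)
print(colnames(design_intercept))

# Without an intercept (+ 0): one column per group, each a group mean.
design_no_intercept <- model.matrix(~ 0 + condition)
print(colnames(design_no_intercept))

# In edgeR (not run here; `fit` would come from glmQLFit):
#   glmQLFTest(fit, coef = "conditionscz")   # intercept design
#   glmQLFTest(fit, contrast = c(-1, 1))     # no-intercept design
# Testing a single column of the no-intercept design asks whether a
# group mean is zero rather than comparing the groups, which may be
# why every value came out negative.
```

Keeping control as the first factor level is what makes a positive logFC mean higher expression in disease.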

Referencing

added the following citations to the bibliography2.bib file in bibtex format:
- g:Profiler
- the associated paper
- all packages used (pheatmap, edgeR); ComplexHeatmap and circlize were left out since they are not used
had to find those citations, just typed citation("pheatmap") in R
for edgeR had to do print(citation("edgeR"), bibtex=TRUE)
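As an alternative to copying the printed citation by hand, the BibTeX text can be generated and written to a file with toBibtex() from the utils package. A minimal sketch: it cites the built-in stats package so it runs without extra installs, and writes to a temporary file as a stand-in for bibliography2.bib.

```r
# Convert a package citation to BibTeX; "stats" is used only because it
# ships with R. Swap in "pheatmap" or "edgeR" if they are installed.
bib <- toBibtex(citation("stats"))

# Temporary file as a stand-in for the real .bib file.
bib_file <- tempfile(fileext = ".bib")

# append = TRUE adds the entry to the end of an existing bibliography.
cat(bib, file = bib_file, sep = "\n", append = TRUE)

print(readLines(bib_file))
```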

Conclusion and Outlook

have to put the bibliography in the repo too
have to have all the paths start with ./ so that they're relative
pull the repo by pressing the green Code button and downloading the zip, then run this command:
docker run --rm -it -v ${PWD}:/home/rstudio/projects --user rstudio risserlin/bcb420-base-image /usr/local/bin/R -e "rmarkdown::render('/home/rstudio/projects/file_name.rmd', output_file='/home/rstudio/projects/file_name.html')" > processing_output_filename