# Assignment 2 - Differential Gene Expression and Over-Representation Analysis

## Objective

• document progress while working on assignment 2

Day 1: 5 hrs

Day 2: 5 hrs

Day 3: 7 hrs

Day 4: 3 hrs

Day 5: 5 hrs

Day 6: 10 hrs

## Procedure

• re-run previous assignment

• use write.table to save the data from the final normalized counts data into a file
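A minimal sketch of the save step, assuming the normalized counts are a numeric matrix with gene IDs as row names. The object name `normalized_counts`, its values, and the output path are hypothetical stand-ins, not the real assignment data.

```r
# Hypothetical stand-in for the final normalized count matrix (genes x samples).
normalized_counts <- matrix(
  c(10.5, 3.2, 8.1, 4.4),
  nrow = 2,
  dimnames = list(c("geneA", "geneB"), c("sample1", "sample2"))
)

# Assumed output location; a temp dir keeps the sketch side-effect free.
out_file <- file.path(tempdir(), "normalized_counts.txt")

# Tab-separated, unquoted. col.names = NA writes an empty header cell above
# the row names so the header lines up with the data columns.
write.table(normalized_counts, file = out_file, sep = "\t",
            quote = FALSE, col.names = NA)

# Read it back to confirm the round trip preserves the matrix.
reloaded <- as.matrix(read.table(out_file, header = TRUE, sep = "\t",
                                 row.names = 1, check.names = FALSE))
```

Reading the file back with `row.names = 1` is a quick way to check that the gene IDs ended up in the first column rather than being shifted under a sample name.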

• got an error with ComplexHeatmap::Heatmap() so tried using pheatmap, stats::heatmap, and gplots::heatmap.2() instead

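• whenever I try to specify the color or color ramp I get this error: "Error in color[1] : object of type 'closure' is not subsettable". In R this message usually means a function was passed where a vector was expected (here, probably a palette function instead of a vector of colors)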
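One possible cause of the "closure is not subsettable" error, sketched with base R only. This is an assumption about what went wrong, since the notes do not record the exact call: `colorRampPalette()` returns a function, and a plotting function that indexes into its color argument will fail if it receives that function instead of a vector of colors.

```r
# colorRampPalette() returns a palette *function*, not a vector of colours.
make_palette <- colorRampPalette(c("blue", "white", "red"))

# Indexing the function itself fails; capture the message instead of stopping.
err <- tryCatch(make_palette[1], error = function(e) conditionMessage(e))

# Calling the function with the number of colours wanted gives a character
# vector of hex codes, which is what a vector-valued colour argument expects.
heat_colors <- make_palette(100)
head(heat_colors, 3)
```

Under that assumption, passing `heat_colors` (the vector) rather than `make_palette` (the function) as the color argument should avoid the error.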

• the gene of interest here is IGF2 (conveniently, this is also its HGNC symbol), Ensembl ID ENSG00000167244

• found this info at GeneCards

• IGF2 is in the table of normalized counts, so we can run a t-test on it: the null hypothesis is that mean IGF2 expression is the same in the two groups, the alternative is that it differs

• take the expression data for IGF2 only

• separate samples into control, bipolar, and schizophrenia

• t(): pass a df or matrix, returns the transpose (flipped sideways, rows become cols and cols become rows)

• combined the bipolar and schizophrenia sample data into one set of disease samples, then ran a t-test between the control and disease samples

• t-test: A t-test is an inferential statistic used to determine whether there is a significant difference between the means of two groups. T-tests are used when the data are approximately normally distributed and the population variances are unknown. The test takes a sample from each of the two groups and assumes a null hypothesis that the two means are equal. Based on the resulting p-value, the null hypothesis is either rejected or not rejected. If it is rejected, the observed difference between the means is unlikely to be due to chance alone.

• therefore we reject the null hypothesis
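The steps above (subset to IGF2, transpose, split into groups, run the t-test) can be sketched as follows. The expression values are simulated and the sample names (`control_1`, `bipolar_1`, `scz_1`, ...) are hypothetical, so the p-value here says nothing about the real data.

```r
set.seed(1)

# Simulated one-row expression table for IGF2 (NOT the real counts).
samples <- c(paste0("control_", 1:6), paste0("bipolar_", 1:6), paste0("scz_", 1:6))
igf2 <- matrix(c(rnorm(6, 5, 1), rnorm(6, 6, 1), rnorm(6, 6.5, 1)),
               nrow = 1, dimnames = list("IGF2", samples))

# t() turns the 1 x 18 row into an 18 x 1 column: one row per sample.
igf2_by_sample <- t(igf2)

# Split by the (assumed) naming convention of the sample IDs.
is_control   <- grepl("^control", rownames(igf2_by_sample))
control_expr <- igf2_by_sample[is_control, "IGF2"]
disease_expr <- igf2_by_sample[!is_control, "IGF2"]  # bipolar + schizophrenia

# Welch two-sample t-test (R's default, does not assume equal variances).
result <- t.test(disease_expr, control_expr)
result$p.value
```

A p-value below the chosen significance level (commonly 0.05) would lead to rejecting the null hypothesis of equal means.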

Notes:

• the first question is actually referring to the p-values in the edgeR QLF test
• the FDR column is actually the adjusted p-value so that's the answer for the second question
• want upregulation in disease and downregulation in control
• annotation source means the data source, like GO biological process (GO:BP)
• ORA method means the tool, like g:Profiler
• the volcano plot will be x axis: logFC, y axis: -log10(p-value)
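A small sketch of how the adjusted p-values and the volcano plot coordinates relate. The table is made up; its column names mimic the `logFC`, `PValue`, and `FDR` columns that edgeR's `topTags()` reports, where `FDR` is the Benjamini-Hochberg adjusted p-value by default. Using -log10 for the y axis is the usual convention and is assumed here.

```r
# Made-up results table for illustration only.
res <- data.frame(
  logFC  = c(2.1, -1.8, 0.2, -0.1, 1.5),
  PValue = c(0.0001, 0.002, 0.4, 0.9, 0.01),
  row.names = paste0("gene", 1:5)
)

# Benjamini-Hochberg correction for multiple testing.
res$FDR <- p.adjust(res$PValue, method = "BH")

# Volcano plot y coordinate: small p-values end up high on the plot.
res$neg_log10_p <- -log10(res$PValue)

# Label genes: positive logFC = up in disease (relative to control).
# The 0.05 threshold is an assumed cutoff.
res$direction <- ifelse(res$FDR < 0.05 & res$logFC > 0, "up",
                 ifelse(res$FDR < 0.05 & res$logFC < 0, "down", "ns"))
res
```

The plot itself would then be something like `plot(res$logFC, res$neg_log10_p)`, colored by `res$direction`.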

• ComplexHeatmap was not working (too many genes), so I used pheatmap instead and also cut down the number of genes
• I was not getting good results for the heatmap or the volcano plot. After investigating, I saw that the bipolar samples were too variable, so I chose to work only with the schizophrenia samples
• this also cut down the number of significantly differentially expressed genes
• legends were not being created on the plots; the error was "plot.new has not been called". For one plot I decreased the size of the legend and it worked. For the other I had to call the plot function from the graphics package explicitly
• all the log fold changes were negative: the model should have an intercept (`~ condition`) instead of `+ 0` with both conditions as separate columns in the model
• always run the QLF test on the disease coefficient, not the control, or else the fold changes will have the opposite sign
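The intercept point above can be illustrated with `model.matrix()` alone. The sample labels are hypothetical; what matters is that "control" is the reference level of the factor.

```r
# Hypothetical sample labels; "control" is made the reference level.
condition <- factor(c("control", "control", "control", "scz", "scz", "scz"),
                    levels = c("control", "scz"))

# No intercept: one column per group. Each coefficient describes that
# group's own expression level, not a difference between groups.
design_no_intercept <- model.matrix(~ 0 + condition)

# With intercept: "(Intercept)" is the control baseline and "conditionscz"
# is the scz-versus-control difference, i.e. the log fold change of interest.
design_intercept <- model.matrix(~ condition)

colnames(design_no_intercept)  # "conditioncontrol" "conditionscz"
colnames(design_intercept)     # "(Intercept)" "conditionscz"
```

With the intercept design, testing the `conditionscz` coefficient (for example via the `coef` argument of edgeR's `glmQLFTest()`) gives disease-versus-control fold changes. Testing a single group column of the no-intercept design instead compares that group's level against zero, which would explain fold changes that all share one sign.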

### Referencing

• added the following citations to the bibliography2.bib file in bibtex format:
• g:Profiler
• the associated paper
• all packages used (pheatmap, ComplexHeatmap, circlize, edgeR); ComplexHeatmap and circlize are not used in the final version
• to find those citations, I just typed `citation("pheatmap")` in R
• for edgeR I had to do `print(citation("edgeR"), bibtex=TRUE)` to get the BibTeX entry
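As an alternative to copying the printed entry by hand, a citation can be converted with `toBibtex()` and written to a file. This sketch cites R itself so that it runs without extra packages. The output path is a temporary stand-in for `bibliography2.bib`.

```r
# citation() with no argument cites R itself; swap in e.g. citation("edgeR")
# once that package is installed.
r_citation <- citation()

# toBibtex() converts the citation object into lines of BibTeX.
bib_lines <- toBibtex(r_citation)

# Assumed output path; in the repo this would be ./bibliography2.bib.
bib_file <- file.path(tempdir(), "bibliography2.bib")
writeLines(bib_lines, bib_file)

readLines(bib_file, n = 1)
```

Entries generated this way may come without a citation key, so a key usually has to be added by hand before the entry can be cited in the R Markdown file.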

## Conclusion and Outlook

• have to put the bibliography file in the repo too
• all the paths have to start with ./ so that they're relative
• to reproduce: download the repo as a zip using the green Code button, unzip it, then from that folder run the command `docker run --rm -it -v ${PWD}:/home/rstudio/projects --user rstudio risserlin/bcb420-base-image /usr/local/bin/R -e "rmarkdown::render('/home/rstudio/projects/file_name.rmd', output_file='/home/rstudio/projects/file_name.html')" > processing_output_filename`