A1 Journal - bcb420-2024/Dien

Download data and assess quality

Downloaded the raw counts data set from the accession page.
Saved the data frame as a CSV file in the current directory. The download only runs if the file doesn't already exist.
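
A minimal sketch of the "only write if the file isn't there yet" guard, written in Python for illustration (the analysis itself was knitted from R). The function name `save_counts_once` and the file name in the usage note are hypothetical, and the rows passed in stand for whatever table was downloaded from the accession page.

```python
import csv
from pathlib import Path


def save_counts_once(rows, path):
    """Write rows to a CSV file only if the file is not already there.

    Returns True if the file was written, False if the write was skipped.
    """
    path = Path(path)
    if path.exists():
        return False  # already saved on an earlier run, so do nothing
    with path.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return True
```

Calling `save_counts_once(rows, "raw_counts.csv")` twice writes the file the first time and leaves it untouched the second time, so re-knitting doesn't repeat the download step.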
To assess data quality, I calculated:
- Number of samples
- Number of conditions
- Reads per sample (mean and standard deviation)
- Genes per sample (mean and standard deviation)
Plotted these using ggplot2.
The samples are coded, so I had to manually create a vector of informative sample names for easier interpretation.
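
The summary statistics above can be sketched as follows, in Python for illustration. The sample codes, informative names and counts are toy values, not values from the data set. "Genes per sample" is assumed here to mean the number of genes with a non-zero count in that sample.

```python
from statistics import mean, stdev


def summarize_counts(counts, sample_codes, sample_labels):
    """Summarize a counts table given as {gene: [count per sample]}.

    sample_labels maps each coded sample name to (informative name, condition).
    """
    columns = range(len(sample_codes))
    reads = [sum(row[i] for row in counts.values()) for i in columns]
    # Assumption: a gene is "present" in a sample if its count is above zero.
    genes = [sum(1 for row in counts.values() if row[i] > 0) for i in columns]
    conditions = {sample_labels[code][1] for code in sample_codes}
    return {
        "samples": [sample_labels[code][0] for code in sample_codes],
        "n_samples": len(sample_codes),
        "n_conditions": len(conditions),
        "reads_mean": mean(reads),
        "reads_sd": stdev(reads),
        "genes_mean": mean(genes),
        "genes_sd": stdev(genes),
    }


# Toy data: hypothetical codes, names and counts.
codes = ["S1", "S2", "S3", "S4"]
labels = {
    "S1": ("EC_solo_rep1", "EC_solo"),
    "S2": ("EC_solo_rep2", "EC_solo"),
    "S3": ("EC_human_rep1", "EC_human_pericyte"),
    "S4": ("EC_human_rep2", "EC_human_pericyte"),
}
toy_counts = {
    "geneA": [10, 12, 8, 10],
    "geneB": [0, 3, 5, 0],
    "geneC": [90, 85, 87, 94],
}
summary = summarize_counts(toy_counts, codes, labels)
print(summary)
```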

Sequencing depth seems consistent, with little deviation between samples.
The number of genes is also similar across samples.
There are three conditions. The control is solo-cultured endothelial cells (ECs). The test conditions are ECs cocultured with human pericytes and ECs cocultured with bovine pericytes. Each condition is sampled at three timepoints, with three replicates per timepoint, giving 3 x 3 x 3 = 27 samples in total.
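
To double-check the sample count, the design can be enumerated as a small sketch (Python, for illustration). The condition names are shortened from the description above, and the timepoint labels `t1` to `t3` are placeholders, since the actual timepoints aren't listed here.

```python
from itertools import product

conditions = ["EC_solo", "EC_human_pericyte", "EC_bovine_pericyte"]
timepoints = ["t1", "t2", "t3"]  # placeholder labels, not the real timepoints
replicates = [1, 2, 3]

# One entry per sample: every condition at every timepoint, in triplicate.
design = [
    {"condition": c, "timepoint": t, "replicate": r}
    for c, t, r in product(conditions, timepoints, replicates)
]
print(len(design))
```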
No genes were duplicated, i.e. no gene had its expression measured twice in one sample.
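
A duplicate check of this kind can be sketched as below (Python, for illustration). The helper name `duplicated_ids` is hypothetical. It just counts how often each identifier appears in the gene column.

```python
from collections import Counter


def duplicated_ids(gene_ids):
    """Return the gene identifiers that occur more than once, in first-seen order."""
    counts = Counter(gene_ids)
    return [gene for gene, n in counts.items() if n > 1]
```

An empty result means every gene appears exactly once in the table.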

First, I tried the hgnc package, which includes functions to convert Entrez IDs to HUGO IDs. However, its data set takes too long to load, since it loads other gene identifiers as well.
Knitting also failed with an SSL error whenever I used functions from the hgnc package:

OpenSSL SSL_read: unexpected eof while reading, errno 0
Backtrace:
 1. hgnc (local) 
 6. base (local) 
Execution halted

The data set from the hgnc package also doesn't contain mappings for gene symbols that start with LOC, which are likely pseudogenes or genes that have not been extensively studied. Around 30% of the genes were unmapped, but searching online for the LOC + Entrez ID gene name did bring up the gene.
Switched to the org.Hs.eg.db package instead. It mapped more than 99% of the Entrez IDs, since it includes the LOC genes.
Knitting also ran without error.
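
The coverage comparison can be sketched as below (Python, for illustration). The two dictionaries are toy lookup tables standing in for the two annotation sources, not their real contents. `7157` (TP53) and `672` (BRCA1) are real Entrez IDs, while `100000001` is a made-up ID used only to show the LOC + Entrez ID naming.

```python
def mapping_coverage(entrez_ids, lookup):
    """Return (mapped, unmapped, fraction mapped) for a list of Entrez IDs."""
    mapped = {gid: lookup[gid] for gid in entrez_ids if gid in lookup}
    unmapped = [gid for gid in entrez_ids if gid not in lookup]
    fraction = len(mapped) / len(entrez_ids) if entrez_ids else 0.0
    return mapped, unmapped, fraction


# Toy lookup tables (hypothetical contents).
without_loc = {"7157": "TP53", "672": "BRCA1"}
with_loc = dict(without_loc)
with_loc["100000001"] = "LOC100000001"  # made-up LOC entry

ids = ["7157", "672", "100000001"]
_, missing, coverage_without = mapping_coverage(ids, without_loc)
_, _, coverage_with = mapping_coverage(ids, with_loc)
print(missing, coverage_without, coverage_with)
```

Listing the unmapped IDs, rather than only the percentage, is what made it clear that the missing genes were the LOC ones.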

Possible normalization options: TMM using edgeR or RLE using DESeq.
Both methods are quite similar and the results should not differ much. Both assume that most genes are not differentially expressed (DE).
Method chosen: TMM using edgeR.
Have to specify groups (so that replicates are grouped together).
Made before and after plots with ggplot2 to compare the count distributions.
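
To show what the TMM scaling factor does, here is a simplified sketch in Python. It is not edgeR's implementation: it computes the factor for one sample against a reference sample that is already chosen, whereas edgeR's `calcNormFactors` also picks the reference sample itself and rescales the factors so that their product is 1. The function names and toy counts are hypothetical. The trim fractions (30% of log ratios, 5% of abundances) follow the TMM defaults.

```python
import math


def _trimmed_indices(values, trim):
    """Indices left after dropping the lowest and highest `trim` fraction."""
    order = sorted(range(len(values)), key=values.__getitem__)
    cut = int(len(values) * trim)
    return set(order[cut:len(order) - cut])


def tmm_factor(obs, ref, logratio_trim=0.3, sum_trim=0.05):
    """Simplified TMM scaling factor of sample `obs` against sample `ref`."""
    n_obs, n_ref = sum(obs), sum(ref)
    rows = []
    for y, r in zip(obs, ref):
        if y == 0 or r == 0:
            continue  # log ratio is undefined for zero counts
        p_obs, p_ref = y / n_obs, r / n_ref
        m = math.log2(p_obs / p_ref)        # log fold change (M value)
        a = 0.5 * math.log2(p_obs * p_ref)  # mean log abundance (A value)
        # Inverse of the approximate variance of M, used as the weight.
        w = 1.0 / ((n_obs - y) / (n_obs * y) + (n_ref - r) / (n_ref * r))
        rows.append((m, a, w))
    keep = _trimmed_indices([row[0] for row in rows], logratio_trim)
    keep &= _trimmed_indices([row[1] for row in rows], sum_trim)
    kept = [rows[i] for i in keep]
    if not kept:
        return 1.0
    weighted_mean = sum(m * w for m, _, w in kept) / sum(w for _, _, w in kept)
    return 2 ** weighted_mean


# Toy example: nine unchanged genes plus one gene that dominates the library.
ref = [100] * 10
obs = [100] * 9 + [10000]
factor = tmm_factor(obs, ref)
effective_library_size = sum(obs) * factor
print(factor, effective_library_size)
```

In the toy example the single dominant gene is trimmed away, so the effective library size of `obs` comes out close to that of `ref`. This is where the assumption that most genes are non-DE comes in: the factor is estimated from the genes that remain after trimming.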