Entry 8: Working with Data (week 5 lecture notes) - bcb420-2025/Izumi_Ando GitHub Wiki

🕊️: the lecture notes have code snippets that are not included in my personal notes. I will refer to those and try them on my own dataset in an R notebook.

Normalization - high level overview

normalization is done to control variation caused by technical details
take note of the differences between samples (techniques used, who handled the samples)
cell lines can reduce biological variation compared to using samples from different patients in the same state

RPKM, FPKM
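As a reminder of what RPKM actually computes, here is a minimal sketch of the standard formula (reads per kilobase of transcript per million mapped reads). FPKM uses the same arithmetic but counts fragments (read pairs) instead of individual reads. The function name and the numbers are made up for illustration only.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    # 1e9 = 1e3 (bp -> kb) * 1e6 (reads -> millions of reads)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# hypothetical numbers, only to show the arithmetic
value = rpkm(read_count=500, gene_length_bp=2_000, total_mapped_reads=10_000_000)
print(value)  # 25.0
```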
normalization is sometimes done by distribution (normal, bimodal, Poisson, power law); your dataset will probably resemble one of these
common methods are different per data type!
for microarray: quantile normalization is common
for RNA-seq: trimmed mean of M-values (TMM), relative log expression (RLE)
TMM: implemented in edgeR, compares each sample against a reference sample
RLE: implemented in DESeq, compares each gene's count against that gene's geometric mean across samples
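To make the RLE idea more concrete, here is a rough sketch of median-of-ratios size factors in plain Python. This is my own simplified version, not the DESeq code: the function name and toy counts are assumptions, and genes with a zero in any sample are simply skipped.

```python
import math
from statistics import median


def rle_size_factors(counts):
    """counts: list of genes, each a list of raw counts (one per sample)."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue  # a zero count has no finite log, so skip this gene
        # log of the gene's geometric mean across samples (the pseudo-reference)
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # size factor = median ratio to the pseudo-reference, per sample
    # (raises StatisticsError if every gene was skipped)
    return [math.exp(median(r)) for r in log_ratios]


# toy data (made up): sample 2 is sequenced exactly twice as deep as sample 1
toy_counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 5],
]
factors = rle_size_factors(toy_counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in toy_counts]
```

Dividing each sample's counts by its size factor puts the two toy samples on the same scale, which is the point of the normalization.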
Question: in the video, a few distributions are shown; what are the x and y axes for these?
Question: what is an MA plot?
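A partial answer to the MA plot question, as far as I understand it: each gene becomes one point, with M (the log ratio between two samples) on the y axis and A (the average log expression) on the x axis. A minimal sketch of the two values follows; the function name and the pseudocount of 1 (to avoid taking the log of zero) are my own assumptions.

```python
import math


def ma_values(x, y, pseudocount=1.0):
    """Return (M, A) for one gene measured in two samples x and y."""
    log_x = math.log2(x + pseudocount)
    log_y = math.log2(y + pseudocount)
    m = log_x - log_y          # log ratio, plotted on the y axis
    a = 0.5 * (log_x + log_y)  # mean log expression, plotted on the x axis
    return m, a


# hypothetical counts for one gene in two samples
m, a = ma_values(7, 1)
print(m, a)  # 2.0 2.0
```

Genes that do not change between the two samples should sit near M = 0, so a cloud that drifts away from that line hints that normalization is still needed.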

if you have an Ensembl ID, you can tell whether it refers to a gene or a transcript by checking for the "G" (ENSG... is a gene, ENST... is a transcript)
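A small sketch of that check in code. The pattern assumes the usual stable ID layout (ENS, an optional species prefix such as MUS, a feature letter, 11 digits, an optional version suffix) and only handles genes, transcripts and proteins; the helper name is hypothetical.

```python
import re

# ENS + optional species prefix + feature letter + 11 digits + optional ".version"
ENSEMBL_ID = re.compile(r"^ENS([A-Z]*?)(G|T|P)(\d{11})(\.\d+)?$")
FEATURES = {"G": "gene", "T": "transcript", "P": "protein"}


def ensembl_feature_type(stable_id):
    """Return 'gene', 'transcript' or 'protein', or None if the ID does not match."""
    match = ENSEMBL_ID.match(stable_id)
    if match is None:
        return None  # other feature types (e.g. exons) are not covered here
    return FEATURES[match.group(2)]


for example_id in ["ENSG00000139618", "ENST00000380152", "ENSMUSG00000017167.5"]:
    print(example_id, ensembl_feature_type(example_id))
```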
remove low counts (different ways / rules to do this)
if you are going to use edgeR's rule for removing genes that are not expressed by all members of a group (ex: control), make sure you are defining your groups clearly
check the differences in the distribution of your data (density) before and after removing low counts
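One possible low-count rule, sketched in plain Python: keep a gene only if its counts per million (CPM) exceed a threshold in at least as many samples as the smallest group. This is loosely modeled on the CPM rule of thumb used with edgeR, but it is not edgeR's `filterByExpr`; the threshold of 1 CPM, the function names and the toy counts are all assumptions.

```python
def counts_per_million(counts):
    """counts: list of genes, each a list of raw counts (one per sample)."""
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [[row[j] * 1e6 / lib_sizes[j] for j in range(n_samples)] for row in counts]


def keep_expressed(counts, groups, min_cpm=1.0):
    """True for genes with CPM > min_cpm in at least (smallest group size) samples."""
    smallest_group = min(groups.count(g) for g in set(groups))
    cpm = counts_per_million(counts)
    return [sum(v > min_cpm for v in row) >= smallest_group for row in cpm]


# made-up counts: 4 genes x 4 samples, each library sums to exactly 1,000,000
groups = ["control", "control", "treated", "treated"]
toy_counts = [
    [500000, 480000, 510000, 495000],
    [0, 1, 0, 0],
    [499990, 519999, 489990, 505000],
    [10, 0, 10, 0],
]
keep = keep_expressed(toy_counts, groups)
filtered = [row for row, k in zip(toy_counts, keep) if k]
print(keep)  # [True, False, True, True]
```

This is why the group definitions matter: the smallest group size sets how many samples a gene must be expressed in to survive the filter.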
you need to tell edgeR your groups to do certain computations related to normalization
you can inspect the separation of your datasets using a multidimensional scaling (MDS) plot after normalization. if it looks too "unseparated" you can consider changing your groupings if possible.
calculate dispersion and plot the biological coefficient of variation (BCV)
Question: why do we need to see the dispersion?
all of this is to make sure your datasets follow your tool's assumptions! (for example, edgeR assumes a negative binomial distribution)
Question: what is the exhaustive list of assumptions we need to account for??
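Going back to the dispersion question above: under the negative binomial model the variance is mean + dispersion * mean^2, and the BCV is the square root of the dispersion. The sketch below is a crude method-of-moments estimate for one gene within one group, just to see the relationship. It is not how edgeR estimates dispersion, and it assumes the replicates have comparable library sizes; the function names and counts are hypothetical.

```python
import math
from statistics import mean, variance


def moment_dispersion(replicate_counts):
    """Rough dispersion from var = mu + dispersion * mu^2 (negative binomial)."""
    mu = mean(replicate_counts)
    if mu == 0:
        return 0.0
    s2 = variance(replicate_counts)  # sample variance across replicates
    # variance at or below the mean looks Poisson-like: no extra dispersion
    return max(0.0, (s2 - mu) / (mu * mu))


def bcv(replicate_counts):
    """Biological coefficient of variation = sqrt(dispersion)."""
    return math.sqrt(moment_dispersion(replicate_counts))


# made-up replicate counts for one gene
noisy_gene = [80, 100, 120]
quiet_gene = [99, 100, 101]
print(bcv(noisy_gene), bcv(quiet_gene))
```

A BCV of roughly 0.17 for the first toy gene would mean its true expression varies by about 17% between replicates beyond what counting noise alone explains.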

lots of databases have gene annotations, NCBI/Ensembl/UniProt (more protein heavy) all talk to each other
Ensembl BioMart is helpful, no code necessary, but for this course code is necessary
some datasets may not have gene identifiers originally because data generation can rely on different identifiers
sometimes gene names get lost as datasets get updated (weird)
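A minimal sketch of mapping identifiers while keeping track of the ones that get lost. The lookup dictionary is a tiny hand-written stand-in for a real annotation export (for example from BioMart); the function name, the version suffix and the last ID are made up for illustration.

```python
def map_ids(ensembl_ids, lookup):
    """Map Ensembl gene IDs to symbols; return (mapped, unmapped)."""
    mapped, unmapped = {}, []
    for raw_id in ensembl_ids:
        base_id = raw_id.split(".")[0]  # drop a version suffix such as ".15"
        symbol = lookup.get(base_id)
        if symbol:
            mapped[raw_id] = symbol
        else:
            unmapped.append(raw_id)  # keep these so lost names stay visible
    return mapped, unmapped


# stand-in annotation table
lookup = {"ENSG00000139618": "BRCA2", "ENSG00000141510": "TP53"}
# the ".15" version and the last ID are hypothetical
ids = ["ENSG00000139618.15", "ENSG00000141510", "ENSG99999999999"]
mapped, unmapped = map_ids(ids, lookup)
print(mapped)
print(unmapped)
```

Reporting the unmapped IDs separately, instead of silently dropping them, makes it easier to notice when names went missing after a dataset or annotation update.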