Entry 8: Working with Data (week 5 lecture notes) - bcb420-2025/Izumi_Ando GitHub Wiki
🕊️: the lecture notes have code snippets not included in my personal notes. will refer to those and try them on my own dataset in an R notebook.
Normalization - high level overview
- normalization is done to control variation caused by technical details
- take note of the differences in each sample (techniques, who handled the data)
- cell lines can reduce the amount of biological differences compared to using samples from different patients in the same state
Methods
- RPKM, FPKM
- sometimes done by distribution (normal, bimodal, Poisson, power law); your dataset will probably mimic one of these
- common methods are different per data type!
- for microarray: quantile normalization is common
- for RNA seq: trimmed mean of m-values (TMM), relative log expression (RLE)
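- TMM: implemented in edgeR, compares between samples (each sample is scaled against a reference sample)
- RLE: implemented in DESeq, compares each gene's count to that gene's geometric mean across samples, then takes the median ratio per sample
- Question: in the video, there are a few distributions being shown, what are the x and y axes for these?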
- Question: what is an MA plot?
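
To make the RLE idea above concrete, here is a minimal sketch of median-of-ratios size factors in plain Python. It is a conceptual illustration only, not the DESeq or edgeR code; the function name `rle_size_factors` and the toy counts are made up for this example.

```python
import math
from statistics import median


def rle_size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    Returns one scaling factor per sample.
    """
    n_samples = len(counts[0])
    # Genes with a zero count are skipped because log(0) is undefined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of each gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Ratio of this sample's count to the gene's geometric mean (on the log scale).
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 5],  # has a zero, so it is ignored when estimating the factors
]
factors = rle_size_factors(counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
```

After dividing by the size factors, the first three genes have the same value in both samples, which is what should happen if the only difference between the samples is sequencing depth.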
Normalizing Our Dataset
- if you have an Ensembl ID, you can tell whether it is a gene (rather than, say, a transcript) by checking if it has a "G" (human gene IDs start with ENSG, transcript IDs with ENST)
- remove low counts (different ways / rules to do this)
- if you are going to use edgeR's rule for removing genes that are not expressed by all members of a group (ex: control), make sure you are defining your groups clearly
- check the differences in the distribution of your data (density) before and after removing low counts
- you need to tell edgeR your groups to do certain computations related to normalization
- you can inspect the separation of your samples using a multidimensional scaling (MDS) plot after normalization. if it looks too "unseparated" you can consider changing your groupings if possible.
- calculate dispersion and plot the biological coefficient of variation (BCV, the square root of the dispersion)
- Question: why do we need to see the dispersion?
- all of this is to make sure your datasets follow your tool's assumptions! (for example, edgeR assumes a negative binomial distribution)
- Question: what is the exhaustive list of assumptions we need to account for??
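
The low-count filtering step above can be sketched in plain Python. This is only an illustration of one common rule (keep a gene if its counts per million pass a threshold in at least as many samples as the smallest group has); it is not edgeR's implementation, and the threshold of 1 CPM, the function names, and the toy counts are all assumptions for this example.

```python
def cpm(counts):
    """Counts per million for a genes x samples table (list of rows)."""
    lib_sizes = [sum(col) for col in zip(*counts)]  # total counts per sample
    return [[c / size * 1e6 for c, size in zip(row, lib_sizes)] for row in counts]


def keep_expressed(counts, groups, min_cpm=1.0):
    """Return one True/False per gene: True means the gene is kept."""
    # A gene must pass the threshold in at least this many samples,
    # which is why the group labels have to be defined clearly.
    min_samples = min(groups.count(g) for g in set(groups))
    return [sum(v > min_cpm for v in row) >= min_samples for row in cpm(counts)]


# Hypothetical data: 5 genes x 4 samples, two groups of two samples each.
groups = ["control", "control", "treated", "treated"]
counts = [
    [900000, 950000, 1000000, 980000],  # highly expressed everywhere
    [50, 40, 60, 45],                   # moderately expressed everywhere
    [0, 0, 30, 25],                     # expressed only in the treated group
    [0, 5, 0, 0],                       # passes in a single sample only
    [0, 0, 0, 0],                       # never observed
]
keep = keep_expressed(counts, groups)
filtered = [row for row, k in zip(counts, keep) if k]
```

Note that the third gene survives even though it is silent in the controls, because it passes in as many samples as the smallest group contains. Changing the group labels changes `min_samples`, so it can change which genes are kept.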
Identifier mapping
- lots of databases have gene annotations; NCBI/Ensembl/UniProt (more protein heavy) all talk to each other
- Ensembl BioMart is helpful, no code necessary, but for this course code is necessary
- some datasets may not have gene identifiers originally because data generation can rely on different identifiers
- sometimes gene names get lost as datasets get updated (weird)
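
As a rough sketch of what identifier mapping involves, the Python below strips version suffixes from Ensembl IDs, checks the feature type from the ID prefix, and maps IDs to gene symbols while keeping track of the ones that do not map. The lookup table here is a tiny hand-written stand-in for a real annotation source such as BioMart; the helper names, the version suffix, and the unmapped ID are made up for illustration.

```python
def strip_version(ensembl_id):
    """Drop a trailing version suffix, e.g. 'ENSG00000141510.18' -> 'ENSG00000141510'."""
    return ensembl_id.split(".", 1)[0]


def feature_type(ensembl_id):
    """Guess the feature type from the ID prefix (human-style prefixes only)."""
    prefixes = {"ENSG": "gene", "ENST": "transcript", "ENSP": "protein"}
    return prefixes.get(ensembl_id[:4], "unknown")


def map_ids(ids, lookup):
    """Map raw IDs to symbols; return (mapped dict, list of unmapped IDs)."""
    mapped, unmapped = {}, []
    for raw in ids:
        key = strip_version(raw)
        if key in lookup:
            mapped[raw] = lookup[key]
        else:
            unmapped.append(raw)  # keep these visible instead of silently dropping them
    return mapped, unmapped


# Stand-in lookup table (a real one would come from an annotation database).
lookup = {
    "ENSG00000141510": "TP53",
    "ENSG00000012048": "BRCA1",
}
# The version suffix and the last ID are placeholders for this example.
ids = ["ENSG00000141510.18", "ENSG00000012048", "ENSG00000999999"]
mapped, unmapped = map_ids(ids, lookup)
```

Reporting the unmapped IDs separately makes it easier to notice when names have been lost between annotation releases.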