Entry 8: Working with Data (week 5 lecture notes) - bcb420-2025/Izumi_Ando GitHub Wiki

🕊️: the lecture notes have code snippets not included in my personal notes. will refer to those and try them on my own dataset in an R notebook.

Normalization - high level overview

  • normalization is done to control variation caused by technical details
  • take note of the differences between samples (techniques used, who handled the data)
  • cell lines show less biological variation than samples taken from different patients in the same state

Methods

  • RPKM, FPKM
  • sometimes done by distribution (normal, bimodal, Poisson, power law); your dataset will probably resemble one of these
  • common methods are different per data type!
  • for microarray: quantile normalization is common
  • for RNA-seq: trimmed mean of M-values (TMM), relative log expression (RLE)
  • TMM: implemented in edgeR, compares between samples
  • RLE: implemented in DESeq, compares between genes
  • Question: in the video, there are a few distributions being shown, what are the x and y axes for these?
  • Question: what is an MA plot?
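
A rough sketch of the median-of-ratios idea behind RLE, to make that bullet more concrete. This is plain Python for illustration only, not DESeq's or edgeR's actual code; the function name `rle_size_factors` and the toy count matrix are made up for this example.

```python
import math
from statistics import median


def rle_size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    Returns one scaling factor per sample.
    """
    n_samples = len(counts[0])
    # Only genes with a nonzero count in every sample have a defined log.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene is expressed in every sample")

    # Per-gene reference: log of the geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]

    factors = []
    for j in range(n_samples):
        # Log ratio of this sample to the reference, gene by gene.
        log_ratios = [math.log(row[j]) - ref
                      for row, ref in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],  # skipped: zero count in sample 1
]
factors = rle_size_factors(toy_counts)
normalized = [[c / f for c, f in zip(row, factors)] for row in toy_counts]
```

On the toy data the second factor comes out twice the first, so dividing each column by its factor puts the two samples on the same scale.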

Normalizing Our Dataset

  • if you have an Ensembl ID, you can tell whether it refers to a gene (rather than a transcript) by checking for the "G" in the prefix (e.g. ENSG for genes vs. ENST for transcripts)
  • remove low counts (different ways / rules to do this)
  • if you are going to use edgeR's rule for removing genes that are not expressed by all members of a group (ex: control), make sure you are defining your groups clearly
  • check the differences in the distribution of your data (density) before and after removing low counts
  • you need to tell edgeR your groups to do certain computations related to normalization
  • you can inspect the separation of your datasets using a multidimensional scaling (MDS) plot after normalization. if it looks too "unseparated" you can consider changing your groupings if possible.
  • calculate dispersion and plot the biological coefficient of variation (BCV), which is the square root of the dispersion
  • Question: why do we need to see the dispersion?
  • all of this is to make sure your datasets follow your tool's assumptions! (for example, edgeR assumes a negative binomial distribution)
  • Question: what is the exhaustive list of assumptions we need to account for??
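
A minimal sketch of one low-count filtering rule, to show why the group definitions matter. It is plain Python, not edgeR itself. The rule used here (keep a gene if it is above 1 count per million in at least as many samples as the smallest group has) is one commonly used rule of thumb and is an assumption, as are the function names and the toy data.

```python
from collections import Counter


def counts_per_million(counts):
    """counts: dict mapping gene -> list of raw counts (one per sample)."""
    n_samples = len(next(iter(counts.values())))
    # Library size = total counts in each sample.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [c / lib_sizes[j] * 1e6 for j, c in enumerate(row)]
        for gene, row in counts.items()
    }


def filter_low_counts(counts, groups, min_cpm=1.0):
    """Keep genes with CPM > min_cpm in at least `smallest group size` samples.

    groups: one group label per sample, in column order.
    """
    min_samples = min(Counter(groups).values())
    cpm = counts_per_million(counts)
    return {
        gene: row
        for gene, row in counts.items()
        if sum(value > min_cpm for value in cpm[gene]) >= min_samples
    }


# Hypothetical toy data: two control and two treated samples.
toy_groups = ["control", "control", "treated", "treated"]
toy_counts = {
    "geneA": [500, 400, 450, 480],
    "geneB": [0, 1, 0, 0],  # detected in a single sample only
    "geneC": [500, 600, 550, 520],
}
kept = filter_low_counts(toy_counts, toy_groups)
```

Here the smallest group has two samples, so geneB (detected in only one sample) is dropped. If the groups were defined differently, the required number of samples would change, and so would the set of genes kept.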

Identifier mapping

  • lots of databases have gene annotations, NCBI/Ensembl/UniProt (more protein heavy) all talk to each other
  • Ensembl BioMart is helpful and can be used without code, but for this course code is necessary
  • some datasets may not have gene identifiers originally because data generation can rely on different identifiers
  • sometimes gene names get lost as datasets get updated (weird)
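
A small sketch of what identifier mapping boils down to, including the case where some IDs do not map. It is plain Python with a tiny hard-coded lookup table standing in for a real annotation export (for example from BioMart). The function names, the table and the unmapped placeholder ID are assumptions for illustration.

```python
def strip_version(ensembl_id):
    """Drop a trailing version suffix, e.g. 'ENSG00000141510.18' -> 'ENSG00000141510'."""
    return ensembl_id.split(".", 1)[0]


def map_ids(ids, id_to_symbol):
    """Map Ensembl gene IDs to symbols.

    Returns (mapped, unmapped): a dict of original ID -> symbol, and a list
    of the IDs that had no entry in the lookup table.
    """
    mapped, unmapped = {}, []
    for raw_id in ids:
        symbol = id_to_symbol.get(strip_version(raw_id))
        if symbol:
            mapped[raw_id] = symbol
        else:
            unmapped.append(raw_id)
    return mapped, unmapped


# Stand-in for a real annotation table (only two genes, for illustration).
EXAMPLE_TABLE = {
    "ENSG00000141510": "TP53",
    "ENSG00000012048": "BRCA1",
}

# 'ENSG00000000000.1' is a made-up placeholder ID that is not in the table.
example_ids = ["ENSG00000141510.18", "ENSG00000012048", "ENSG00000000000.1"]
mapped, unmapped = map_ids(example_ids, EXAMPLE_TABLE)
```

Keeping the unmapped IDs in their own list makes it easy to count how many genes were lost in the mapping step, rather than dropping them silently.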