4 Data Normalization - bcb420-2023/Helena_Jovic GitHub Wiki

Lecture Notes

Objective

  • Learn about the first steps of the analysis
  • Importance of normalization. Why do we need to normalize our data?

Time Management

Date Started: 2023-02-02
Date Completed: 2023-02-02
Estimated Time: 1h
Actual Time: 1h

Procedure

  1. Watch Lecture 4 Normalization Part 2

Notes

  • Normalization adjusts the data so that samples, and other datasets, can be compared directly, and it accounts for known and unknown issues in the data.
  • It is essential: without it, downstream results may be meaningless.

Variations
TECHNICAL

  • Caused by instrumental and experimental variation
  • Samples are run on different days, by different people, with slightly different reagents
  • We want to control for these factors as best we can, but with larger experiments there will always be some level of variation
  • Read depth, gene length, ...
  • This is what we want to correct for
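As a small illustration of correcting for read depth, here is a minimal sketch of counts-per-million (CPM) scaling. The toy count matrix and the function name `counts_per_million` are hypothetical, made up for this example, and CPM is only one simple way to adjust for library size; it does not address gene length.

```python
import numpy as np

def counts_per_million(counts):
    """Scale each sample (column) so that its counts sum to one million."""
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0)  # total reads per sample
    return counts / library_sizes * 1e6

# Hypothetical toy matrix: 3 genes (rows) x 2 samples (columns).
# Sample 2 is sample 1 sequenced exactly twice as deeply.
raw = np.array([[10, 20],
                [30, 60],
                [60, 120]])

cpm = counts_per_million(raw)
print(cpm)
```

After scaling, the two columns are identical: the only difference between the samples was sequencing depth, which is technical rather than biological.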

BIOLOGICAL

  • Comes from differences between samples or between conditions
  • These are the things we are interested in

Normalization
To see the changes caused by a specific condition, we need to normalize the data. If we don't, technical variation can mask the biological signal. Many normalization methods rely on well-characterized distributions, such as the normal distribution and the Poisson distribution; box plots and density plots are ways of visualizing a distribution, not distributions themselves. Normalization-by-distribution methods include, for example, Z-score normalization.
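A minimal sketch of Z-score normalization, assuming the scores are computed per gene (row) across samples; the notes do not say which axis is used, so that choice, the toy values, and the function name `z_score_rows` are assumptions for illustration.

```python
import numpy as np

def z_score_rows(values):
    """Centre each row at 0 and scale it to unit sample standard deviation."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=1, keepdims=True)
    std = values.std(axis=1, ddof=1, keepdims=True)
    std[std == 0] = 1.0  # constant rows: avoid division by zero, result is all zeros
    return (values - mean) / std

# Hypothetical expression values: 2 genes (rows) x 3 samples (columns)
expr = np.array([[2.0, 4.0, 6.0],
                 [5.0, 5.0, 5.0]])

z = z_score_rows(expr)
print(z)
```

Each value is expressed as the number of standard deviations from its gene's mean, so genes with very different expression levels end up on a common scale.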

After normalizing, check what has changed by plotting the data before and after on the same scale. The differences often appear at the edges of the distribution, where the data have been cleaned up, while the main body of the distribution stays the same. Give every plot a title so it is easy to tell which one is which.
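The before-and-after check can also be done numerically. The sketch below computes, per sample, the five values a box plot would draw, on the same log2 scale for both versions of the data. The simulated counts, the CPM scaling used as the "normalized" version, and the function name `five_number_summary` are all assumptions for illustration.

```python
import numpy as np

def five_number_summary(matrix):
    """Per-sample (column) min, Q1, median, Q3, max: the values a box plot draws."""
    return np.percentile(matrix, [0, 25, 50, 75, 100], axis=0)

rng = np.random.default_rng(0)
# Hypothetical counts: sample 2 is sample 1 sequenced three times as deeply.
base = rng.poisson(lam=50, size=(1000, 1)) + 1  # +1 keeps log2 defined
raw = np.hstack([base, base * 3])

log_raw = np.log2(raw)                           # before normalization
log_cpm = np.log2(raw / raw.sum(axis=0) * 1e6)   # after CPM scaling

before = five_number_summary(log_raw)
after = five_number_summary(log_cpm)

print("Before normalization (log2 counts):")
print(before)
print("After normalization (log2 CPM):")
print(after)
```

Before scaling, every summary value of sample 2 is shifted up by log2(3) relative to sample 1; after scaling, the two columns agree, while the shape of each distribution is unchanged.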

Conclusion

  • Normalization is an important step in the data analysis pipeline and can drastically affect your downstream results.
  • Normalization methods have inherent assumptions and you need to make sure that your data fits with the assumptions made.
  • Normalization shouldn't drastically change your data but should help to control for technical variation.

Reference

Isserlin, Ruth. (2023). Week 4 Normalization and Identifier Mapping. University of Toronto.

A1: Apply Normalization

Objective

  • Prepare a Notebook that will produce a clean, normalized dataset that will be used for the remaining assignments in this course.

Time Management

Date Started: 2022-02-12
Date Completed: 2022-02-13
Estimated Time: 3 hours
Actual Time: 5 hours

Workflow

  • Used the TMM normalization technique, following the procedure outlined in Lecture 4 Part 2
  • Included box, density, and MDS plots to compare the original dataset with the normalized dataset
  • Did not find any outliers
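To make the idea behind TMM (trimmed mean of M-values) concrete, here is a simplified sketch for one sample against one reference sample. It is not the implementation used in the assignment and will not reproduce a real package's output exactly: the quantile-based trimming, the default trim fractions, the toy data, and the function name `tmm_factor` are assumptions chosen for illustration.

```python
import numpy as np

def tmm_factor(sample, ref, m_trim=0.30, a_trim=0.05):
    """Simplified TMM scaling factor of `sample` relative to `ref`."""
    sample = np.asarray(sample, dtype=float)
    ref = np.asarray(ref, dtype=float)
    n_s, n_r = sample.sum(), ref.sum()           # library sizes

    keep = (sample > 0) & (ref > 0)              # log ratios need non-zero counts
    s, r = sample[keep], ref[keep]

    m = np.log2((s / n_s) / (r / n_r))           # M: log2 fold change per gene
    a = 0.5 * np.log2((s / n_s) * (r / n_r))     # A: mean log2 abundance per gene

    # Keep only genes in the central range of both M and A (the "trimming")
    m_lo, m_hi = np.quantile(m, [m_trim, 1 - m_trim])
    a_lo, a_hi = np.quantile(a, [a_trim, 1 - a_trim])
    central = (m >= m_lo) & (m <= m_hi) & (a >= a_lo) & (a <= a_hi)

    # Precision weights: inverse of the approximate variance of each M value
    w = 1.0 / ((n_s - s) / (n_s * s) + (n_r - r) / (n_r * r))
    return 2 ** np.average(m[central], weights=w[central])

# Hypothetical data: 100 genes with 100 reads each in the reference;
# in the sample, 5 genes are strongly up-regulated and the rest are unchanged.
ref = np.full(100, 100.0)
sample = ref.copy()
sample[:5] *= 10

factor = tmm_factor(sample, ref)
print(factor, factor * sample.sum())
```

In this toy case the five up-regulated genes inflate the sample's library size from 10000 to 14500. Trimming removes them, so the factor comes out as 10000/14500, and multiplying the library size by the factor recovers 10000, the depth implied by the unchanged genes. This relies on the assumption that most genes are not differentially expressed.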