3 Data Cleaning

Lecture Notes

Objective

  • Watch: Data Exploration & Data standards - what are they and what are they good for?
  • Example Data Set: BiomaRt example
  • Data Normalization: Why do we need to normalize our data?
  • What different types of distributions are there and why does this matter?
  • Identifier Mapping

Time Management

Date Started: 2023-02-02
Date Completed: 2023-02-02
Estimated Time: 2h
Actual Time: 2h

Procedure

  1. Watch lecture 4 Data Cleaning Part 1

  2. Report notes and questions.

Notes

  • Need to read the paper related to the dataset
  • Will be using RNASeq data

Critical elements forming data standards
GEO is the Gene Expression Omnibus; submissions record, for example, what technology was used to generate the data and what database was used to align it. The idea behind the standards is to require the user uploading data to include certain information that is critical for understanding the raw data:

  1. Raw data for each assay (e.g., CEL or FASTQ files)
  2. Final processed (normalized) data for the set of assays in the study (e.g., the gene expression data count matrix used to draw the conclusions in the study)
  3. Essential sample annotation (e.g., tissue, sex, and age) and the experimental factors and their values (e.g., compound and dose in a dose response study)
  4. Experimental design including sample data relationships (e.g., which raw data file relates to which sample, which assays are technical, which are biological replicates)
  5. Sufficient annotation of the array or sequence features examined (e.g., gene identifiers, genomic coordinates)
  6. Essential laboratory and data processing protocols (e.g., what normalization method has been used to obtain the final processed data)

Inconsistencies in Bioinformatic Data
In the context of MICROARRAY data: the standards were updated as the technology evolved, so the result is somewhat scrambled. How data were stored introduced further inconsistencies, and the guidelines leave room for different interpretations of the data standards. Each submission is inspected by GEO curators for content integrity, but without a controlled vocabulary one experimenter will express things differently than another. Reconciling this is difficult even computationally, so the data often have to be checked manually. In the context of PROTEOMIC data: the Proteomics Standards Initiative (PSI) defines its own controlled vocabulary for each standard it releases.

Example Dataset = GSE70072
Title of the paper - Apoptosis enhancing drugs overcome innate platinum resistance in CA125 negative tumor initiating populations of high grade serous ovarian cancer. Very briefly in my own words: this study investigates two cell populations isolated from high-grade serous ovarian cancer, CA125+ve cells and CA125-ve cells. The authors hypothesize that the CA125-ve cells are stem-like cells and drive the resurgence of the cancer after treatment, because the chemo drugs kill the CA125+ve cells and thereby enrich for CA125-ve cells. They further demonstrate how a combination of two chemo drugs given together could help to kill both CA125+ve and CA125-ve cells.

  • Need to have the right markers for the right data.

Procedure

  1. Get the GEO description of the dataset: gse <- getGEO("GSE70072", GSEMatrix = FALSE)
  2. Get information about the platform associated with the dataset: current_gpl <- names(GPLList(gse))[[1]] and current_gpl_info <- Meta(getGEO(current_gpl))
  3. Important for the assignment: format the report in Markdown, with embedded R expressions surrounded by grave accents (backticks).
  4. Get the expression data: sfiles = getGEOSuppFiles('GSE70072'); fnames = rownames(sfiles); cal125_expr = read.delim(fnames[1], header = TRUE, check.names = FALSE). Setting check.names = FALSE keeps the negative and positive signs in the data frame's column names. These steps are collected in the sketch below.
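Collected together, the steps above look roughly like this; a sketch assuming the GEOquery package is installed, with the object names following the lecture example:

```r
library(GEOquery)

# 1. GEO description of the dataset
gse <- getGEO("GSE70072", GSEMatrix = FALSE)

# 2. Platform (GPL) associated with the dataset and its metadata
current_gpl <- names(GPLList(gse))[[1]]
current_gpl_info <- Meta(getGEO(current_gpl))

# 4. Supplementary expression data
sfiles <- getGEOSuppFiles("GSE70072")
fnames <- rownames(sfiles)
# check.names = FALSE keeps the "+" / "-" signs in the CA125 column names
cal125_expr <- read.delim(fnames[1], header = TRUE, check.names = FALSE)
```

In the R Markdown report, values such as current_gpl_info$title can then be embedded inline with backtick-r expressions (step 3).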

Cleaning the data

  • How many unique genes do we have?
  • Are there any non-genes in our dataset? If so, what are they?
  • Can we exclude them?
  1. Define the groups. Given the lack of a specified format, how can we go about getting the information that pertains to each sample? Looking at the column names we see "Pt.A.CA125-" and "pt.F.CA125-", so we can infer that Pt.A means Patient A. We then clean the data by splitting the column names and summarizing (see the sketch after this list). What are "Y_RNA", "SNOR", etc.? We don't necessarily need to worry about them now, and we don't need to filter them out (yet).

  2. Filtering. Filtering out weakly expressed and non-informative features is very important. Make sure we resolve duplicate issues and get a summarized count for each gene, and only keep features whose counts per million are greater than 1 (see the filtering step in the sketch below).
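A minimal sketch of these two cleaning steps, assuming edgeR is installed, that the first column of cal125_expr holds gene identifiers, and that the column names follow the "Pt.A.CA125-" pattern noted above; the group-size cutoff is an edgeR-style assumption, not a value from the notes:

```r
library(edgeR)

# 1. Define the groups: split column names like "Pt.A.CA125-" into
#    patient and cell-type parts (first column assumed to hold gene IDs)
counts <- cal125_expr[, 2:ncol(cal125_expr)]
samples <- data.frame(
  t(sapply(colnames(counts),
           function(x) unlist(strsplit(x, split = "\\."))[c(2, 3)])))
colnames(samples) <- c("patient", "cell_type")

# 2. Filter weakly expressed features: keep genes with counts per million
#    greater than 1 in at least as many samples as the smallest group
#    (the group-size cutoff is an assumption)
cpms <- cpm(as.matrix(counts))
rownames(cpms) <- cal125_expr[, 1]
keep <- rowSums(cpms > 1) >= min(table(samples$cell_type))
cal125_expr_filtered <- cal125_expr[keep, ]
```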

Conclusion

  • Discussed data standards, and where they are lacking when it comes to raw expression data. We have to work with what is out there. Choosing a good data set is very important!!
  • Learned about the dataset of interest
  • Played around with an example dataset

A1: Clean the data and map to HUGO symbols

Objective

  • Prepare a Notebook that will produce a clean, normalized dataset that will be used for the remaining assignments in this course.

Time Management

Date Started: 2023-02-12
Date Completed: 2023-02-13
Estimated Time: 3 hours
Actual Time: 5 hours

Workflow

  • The dataset already included HGNC gene symbols and Ensembl IDs; I decided to map the Ensembl IDs anyway, using Ensembl as the tool for identifier mapping (a hedged sketch follows this list).
  • This resulted in 670 identifiers that did not map to a current HUGO symbol, which is about 5% of the total number of genes. I decided to remove these genes.
  • The dataset contained 3 HGNC symbols that each map to multiple identifiers (CYB561D2 has 2 duplicates, HSPA14 has 4, and COG8 has 2). I decided to keep them in the dataset so as not to harm the analysis: removing duplicates can introduce differences where they don't exist and can bias some algorithms. Since there are so few, I think either keeping or removing them would have been fine.
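A sketch of the identifier-mapping step, assuming biomaRt is the tool used for the Ensembl mapping and that the dataset's Ensembl gene IDs are in a vector called ensembl_ids (both that name and the mart call are illustrative, not taken from the notebook):

```r
library(biomaRt)

# Connect to the Ensembl human gene dataset
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Map Ensembl gene IDs to current HGNC symbols
# (ensembl_ids is a placeholder for the dataset's Ensembl identifiers)
id_map <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                filters    = "ensembl_gene_id",
                values     = ensembl_ids,
                mart       = ensembl)

# Identifiers with no current HGNC symbol come back empty; these
# (~670 here, about 5% of genes) were removed from the dataset
mapped   <- id_map$ensembl_gene_id[id_map$hgnc_symbol != ""]
unmapped <- setdiff(ensembl_ids, mapped)
```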

Issues

  • When downloading the supplemental files from GEO using the getGEOSuppFiles function, I continuously ran into issues where the second file would take too long to download and the script would quit. This happened using Docker as well. I solved this by using the filter_regex = ".txt" parameter in my code, to only download the relevant counts file for this assignment.