3 Data Cleaning
Lecture Notes
Objective
- Watch: Data Exploration & Data standards - what are they and what are they good for?
- Example Data Set: biomaRt example
- Data Normalization: Why do we need to normalize our data?
- What different types of distributions are there and why does this matter?
- Identifier Mapping
Time Management
Date Started: 2023-02-02
Date Completed: 2023-02-02
Estimated Time: 2h
Actual Time: 2h
Procedure
- Watch lecture 4 Data Cleaning Part 1
- Report notes and questions.
Notes
- Need to read the paper related to the dataset
- Will be using RNASeq data
Critical elements forming data standards
GEO is the Gene Expression Omnibus (data included: what technology was used to generate the data, what database was used to align the data?). The idea behind the standards is to require the user uploading data to include certain information that is critical for understanding the raw data:
- Raw data for each assay (e.g., CEL or FASTQ files)
- Final processed (normalized) data for the set of assays in the study (e.g., the gene expression data count matrix used to draw the conclusions in the study)
- Essential sample annotation (e.g., tissue, sex, and age) and the experimental factors and their values (e.g., compound and dose in a dose response study)
- Experimental design including sample data relationships (e.g., which raw data file relates to which sample, which assays are technical, which are biological replicates)
- Sufficient annotation of the array or sequence features examined (e.g., gene identifiers, genomic coordinates)
- Essential laboratory and data processing protocols (e.g., what normalization method has been used to obtain the final processed data)
Inconsistencies in Bioinformatic Data
In the context of MICROARRAY data: standards were updated as the technology was updated, so they ended up a bit scrambled, and data storage introduced further inconsistencies in bioinformatics. Different interpretations of the data standards and the subjectivity of the guidelines add to the problem. Each submission is inspected by GEO curators for content integrity, but without a controlled vocabulary, one experimenter will express things differently than another. Reconciling this is a difficult feat, even computationally; the data often needs to be gone through manually.

In the context of PROTEOMIC data: the Proteomics Standards Initiative (PSI) has its own controlled vocabulary for each standard that it releases.
Example Dataset = GSE70072
Title of the paper: Apoptosis enhancing drugs overcome innate platinum resistance in CA125 negative tumor initiating populations of high grade serous ovarian cancer
Very briefly in my own words: this study investigates two cell populations isolated from high-grade serous ovarian cancer, CA125+ve cells and CA125-ve cells. The authors hypothesize that the CA125-ve cells are stem-like cells and are the cause of the resurgence of cancer following treatment, since the chemo drugs kill all the CA125+ve cells, enriching for CA125-ve cells. They further demonstrate how a combination of two chemo drugs given together could help to kill both CA125+ve and CA125-ve cells.
- Need to have the right markers for the right data.
Procedure
- Get the GEO description of dataset
```r
library(GEOquery)  # provides getGEO() and related GEO accessors
gse <- getGEO("GSE70072", GSEMatrix = FALSE)
```
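A couple of quick checks on the returned object (a sketch; `Meta()` and `GSMList()` are standard GEOquery accessors, and the `title` field is an assumption about what the series header contains):

```r
Meta(gse)$title       # series title from the GEO header (assumed field)
length(GSMList(gse))  # number of samples in the series
```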
- Information about platform association with dataset
```r
current_gpl <- names(GPLList(gse))[[1]]       # first platform accession
current_gpl_info <- Meta(getGEO(current_gpl)) # metadata for that platform
```
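A few fields worth pulling out of the platform metadata (a sketch; check `names(current_gpl_info)` for the fields actually present in this record):

```r
current_gpl_info$title             # platform name
current_gpl_info$organism          # organism the platform targets
current_gpl_info$last_update_date  # when the platform record was last updated
```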
- Important for assignment: write it up in Markdown, with embedded R expressions surrounded by backticks (grave accents).
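For instance, inline R expressions in an R Markdown notebook look like this (a minimal sketch, reusing the platform metadata fields assumed above):

```markdown
This platform is `r current_gpl_info$title`, last updated `r current_gpl_info$last_update_date`.
```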
- Get the expression data
```r
sfiles <- getGEOSuppFiles("GSE70072")
fnames <- rownames(sfiles)
ca125_expr <- read.delim(fnames[1], header = TRUE, check.names = FALSE)
```
`check.names = FALSE` preserves the negative and positive signs in the column names of the data frame (otherwise R would rewrite them as syntactically valid names).
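A quick structural check on the loaded counts (a sketch, assuming the `ca125_expr` data frame from above):

```r
dim(ca125_expr)             # features x (annotation + sample) columns
head(colnames(ca125_expr))  # expect sample names like "Pt.A.CA125-"
```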
Cleaning the data
- How many unique genes do we have?
- Are there any non-genes in our dataset? If so, what are they?
- Can we exclude them?
- Define the groups: Given the lack of specification in the format, how can we go about getting the information that pertains to each sample? Looking at the column names, we see "Pt.A.CA125-" and "pt.F.CA125-", so we can infer that Pt.A means Patient A. We then clean the data by splitting the column names and summarizing (see the sketch after this list). What are "Y_RNA", "SNOR", etc.? We don't necessarily need to worry about them now, and we don't need to filter them out (yet).
- Filtering: Filtering out weakly expressed and non-informative features is very important. Make sure we resolve the duplicate issues and get a summarized count for each gene, and only keep those that are greater than 1 (again, see the sketch after this list).
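A minimal sketch of these cleaning steps, assuming the `ca125_expr` data frame loaded above (the column layout, the group-splitting regexes, and the counts-per-million cutoff are illustrative assumptions, not verbatim lecture code):

```r
library(edgeR)  # cpm() for counts-per-million filtering

# How many unique genes do we have? (assumes the first column
# of ca125_expr holds gene symbols)
gene_col <- ca125_expr[[1]]
length(unique(gene_col))
head(sort(table(gene_col), decreasing = TRUE))  # most-duplicated symbols

# Define the groups: split sample column names like "Pt.A.CA125-"
# into patient ("Pt.A") and cell type ("CA125-")
sample_cols <- colnames(ca125_expr)[-1]
samples <- data.frame(
  patient   = sub("\\.CA125[+-]$", "", sample_cols),
  cell_type = sub("^.*\\.", "", sample_cols),
  row.names = sample_cols
)

# Filtering: keep features with more than 1 count per million in at
# least as many samples as the smallest group
counts    <- as.matrix(ca125_expr[, sample_cols])
min_group <- min(table(samples$cell_type))
keep      <- rowSums(cpm(counts) > 1) >= min_group
ca125_expr_filtered <- ca125_expr[keep, ]
```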
Conclusion
- Discussed data standards, and where they are lacking when it comes to raw expression data. We have to work with what is out there. Choosing a good data set is very important!!
- Learned about the dataset of interest
- Played around with an example dataset
A1: Clean the data and map to HUGO symbols
Objective
- Prepare a Notebook that will produce a clean, normalized dataset that will be used for the remaining assignments in this course.
Time Management
Date Started: 2023-02-12
Date Completed: 2023-02-13
Estimated Time: 3 hours
Actual Time: 5 hours
Workflow
- The dataset already included HGNC gene symbols and Ensembl IDs; I decided to map the Ensembl IDs anyway, using Ensembl as a tool for identifier mapping (a sketch follows this list).
- This resulted in 670 identifiers that did not match to current HUGO mapping, which is about 5% of the total number of genes. I decided to remove these genes.
- The dataset contained 3 genes with duplicate HGNC symbol mappings (CYB561D2 has 2 duplicates, HSPA14 has 4 duplicates, COG8 has 2 duplicates). I decided to keep them in the dataset so as not to harm the analysis: removing duplicates can introduce differences where they don't exist and can bias some algorithms. Since there are so few, I think keeping them or removing them would have been fine.
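A minimal sketch of the Ensembl-to-HUGO mapping with the biomaRt package; `useEnsembl()` and `getBM()` are standard biomaRt calls, while the `ensembl_ids` vector and its contents here are illustrative placeholders:

```r
library(biomaRt)

# Connect to the human gene dataset on Ensembl
ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# Illustrative placeholder for the dataset's Ensembl IDs
ensembl_ids <- c("ENSG00000012048", "ENSG00000139618")

# Map Ensembl gene IDs to current HGNC symbols
id_map <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                filters    = "ensembl_gene_id",
                values     = ensembl_ids,
                mart       = ensembl)

# IDs that come back with no HGNC symbol (the ~670 removed above)
mapped   <- id_map$ensembl_gene_id[id_map$hgnc_symbol != ""]
unmapped <- setdiff(ensembl_ids, mapped)
```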
Issues
- When downloading the supplemental files from GEO using the `getGEOSuppFiles` function, I continuously ran into issues where the second file would take too long to download and the script would quit. This happened using Docker as well. I solved this by using the `filter_regex = ".txt"` parameter in my code, to only download the relevant counts file for this assignment (see below).
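In code, the fix looks roughly like this (`filter_regex` is the getGEOSuppFiles parameter described above):

```r
# Download only the .txt counts file instead of every supplementary file
sfiles <- getGEOSuppFiles("GSE70072", filter_regex = ".txt")
```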