Assignment #1 - bcb420-2023/Helena_Jovic GitHub Wiki

Select an Expression Data Set

Objective

Select an expression dataset

Time Management

Date Started: 2022-02-09
Date Completed: 2022-02-12
Estimated Time: 3 hours
Actual Time: 10 hours

Procedure

Select an Expression Data Set. Choose a dataset of native, healthy human cells or tissue.
Choose an interesting experiment. Their expression response to the experimental conditions must reflect some biological property. Ideally, this will be a physiological response of some sort. It is your task to reflect on this question and choose accordingly.
Make sure the coverage is as complete as possible. Experiments that measure expression for only a small subset of genes are not suitable.
Choose high-quality experiments. The experiments should be performed with biological replicates (the more the better). It also should be performed with mature experimental platforms, according to best-practice procedures; therefore we should choose recent experiments (not older than ten years). As above, contact me for special permission if you want to deviate from this requirement.
Claim the dataset on the dataset signup page of the Student Wiki.

Course Notes

GEO contains expression data collected from a variety of technologies.
Choose a gene expression platform, such as microarrays.
We want expression datasets with good coverage; not much older than ten years (quality!); with sufficient numbers of replicates; collected under interesting conditions; mapped to unique human gene identifiers.

Workflow

Choose dataset

Visited the GEO website (https://www.ncbi.nlm.nih.gov/geo/) and navigated to the Browse Contents tab, and clicked on "Series".
Filtered my query search in the Builder using the following filters: (((((count[Description]) AND txt[Description]) AND homo sapiens[Organism]) AND ("2013/01/01"[Publication Date] : "3000"[Publication Date]))) AND HIV[Title].
This yielded a total of 5 results, and I chose the first result Accession Number: GSE184320 "Loss of skin and mucosal CXCR3+ resident memory T cells causes irreversible tissue-confined immunodeficiency in HIV" because it seemed the most interesting to me.
The dataset was submitted on Sep 16, 2021, was sequenced on the Illumina HiSeq 4000 (Homo sapiens) platform, and lists 28 samples, with three replicates in each group.
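The search above can be recorded as a reproducible query string. The sketch below only assembles the term and a URL for NCBI's E-utilities `esearch` endpoint against the `gds` (GEO DataSets) database; it does not send a request. The helper name `build_geo_query` is hypothetical, and it is an assumption that the field tags behave the same through E-utilities as in the web Builder.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_query(title_keyword, organism="homo sapiens", start_date="2013/01/01"):
    """Assemble the search term from the filters listed in the workflow (hypothetical helper)."""
    clauses = [
        "count[Description]",
        "txt[Description]",
        f"{organism}[Organism]",
        f'("{start_date}"[Publication Date] : "3000"[Publication Date])',
        f"{title_keyword}[Title]",
    ]
    return " AND ".join(clauses)


term = build_geo_query("HIV")
# Assumption: "gds" is the right E-utilities database for GEO series searches.
url = ESEARCH_URL + "?" + urlencode({"db": "gds", "term": term})
print(term)
print(url)
```

Keeping the term in code makes it easy to rerun the same search later with a different title keyword or date range.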

Read paper associated with selected dataset:

Citation: Saluzzo S, Pandey RV, Gail LM, Dingelmaier-Hovorka R et al. Delayed antiretroviral therapy in HIV-infected individuals leads to irreversible depletion of skin- and mucosa-resident memory T cells. Immunity 2021 Dec 14;54(12):2842-2858.e5. PMID: 34813775

Issues

I had issues with the 'GEOmetadb.sqlite' file when following the procedure described in Lecture 3: Finding Expression Data, so I searched for my dataset manually. The first dataset I selected was not suitable for the assignment because it had no supplemental file containing count data.

Clean the data and map to HUGO symbols

Objective

Prepare a Notebook that will produce a clean, normalized dataset that will be used for the remaining assignments in this course.

Time Management

Date Started: 2022-02-12
Date Completed: 2022-02-13
Estimated Time: 3 hours
Actual Time: 5 hours

Workflow

The dataset already included HGNC gene symbols and Ensembl IDs, but I decided to map the Ensembl IDs anyway, using Ensembl as the tool for identifier mapping.
This resulted in 670 identifiers that did not match the current HUGO mapping, which is about 5% of the total number of genes. I decided to remove these genes.
The dataset contained 3 HGNC symbols that each appear more than once (CYB561D2 has 2 duplicates, HSPA14 has 4 duplicates, COG8 has 2 duplicates). I decided to keep them in the dataset so as not to harm the analysis. Removing duplicates can introduce differences where they don't exist and can bias some algorithms. Since there are so few, I think either keeping or removing them would have been fine.
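The two checks in this workflow, counting identifiers without a current symbol and finding symbols that occur more than once, can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's code: the Ensembl IDs and gene symbols are placeholders, and `summarize_mapping` is a hypothetical helper.

```python
from collections import Counter

# Placeholder rows: (Ensembl ID, HGNC symbol returned by the mapping, or None).
# These identifiers are invented for illustration only.
mapped = [
    ("ENSG_A", "GENE1"),
    ("ENSG_B", "GENE2"),
    ("ENSG_C", "GENE2"),  # second Ensembl ID pointing at the same symbol
    ("ENSG_D", None),     # no current symbol found
    ("ENSG_E", "GENE3"),
]


def summarize_mapping(rows):
    """Return unmapped IDs, symbols seen more than once, and the unmapped percentage."""
    unmapped = [ens for ens, sym in rows if not sym]
    symbol_counts = Counter(sym for _, sym in rows if sym)
    duplicated = {sym: n for sym, n in symbol_counts.items() if n > 1}
    pct_unmapped = 100 * len(unmapped) / len(rows)
    return unmapped, duplicated, pct_unmapped


unmapped, duplicated, pct_unmapped = summarize_mapping(mapped)
print(unmapped, duplicated, f"{pct_unmapped:.1f}%")
```

Reporting the duplicated symbols with their counts makes the keep-or-remove decision explicit rather than implicit in the filtering step.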

Issues

When downloading the supplemental files from GEO using the getGEOSuppFiles function, I repeatedly ran into an issue where the second file would take too long to download and the script would quit. This happened using Docker as well. I solved this by using the filter_regex = ".txt" parameter in my code, so that only the relevant counts file for this assignment is downloaded.
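One detail about the pattern is worth illustrating. Assuming the filter is treated as a regular expression searched for anywhere in each file name, the unescaped dot in ".txt" matches any character, so the pattern is slightly looser than it looks. The Python sketch below shows the difference on hypothetical file names (the real supplementary file names are not listed in this journal); `filter_files` is an illustrative helper, not part of GEOquery.

```python
import re

# Hypothetical supplementary file names, invented for illustration.
files = [
    "example_raw_counts.txt.gz",
    "example_RAW.tar",
    "example_txt_notes.pdf",
]


def filter_files(names, pattern):
    """Keep names in which the regular expression matches anywhere."""
    return [name for name in names if re.search(pattern, name)]


loose = filter_files(files, ".txt")     # "." matches any character, including "_"
strict = filter_files(files, r"\.txt")  # escaped dot matches a literal "."
print(loose)
print(strict)
```

For a listing with only one text file both patterns select the same file, so the looser pattern is harmless in practice; escaping the dot only matters when other file names happen to contain "txt".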

Apply Normalization

Objective

Prepare a Notebook that will produce a clean, normalized dataset that will be used for the remaining assignments in this course.

Time Management

Date Started: 2022-02-12
Date Completed: 2022-02-13
Estimated Time: 3 hours
Actual Time: 5 hours

Workflow

Used the TMM normalization technique, following the procedure outlined in lecture 4 part 2.
Included box, density and MDS plots to compare the original dataset with the normalized dataset.
Did not find any outliers.
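To make the idea behind TMM (trimmed mean of M-values) concrete, here is a deliberately simplified, unweighted sketch for one sample against one reference sample. It is not the notebook's code and not a reimplementation of any package: the full method, as implemented in edgeR for example, also applies precision weights and rescales the factors across samples, and both steps are omitted here. The function name, the toy counts and the default trim fractions (30% on M, 5% on A) are assumptions for illustration.

```python
import math


def simple_tmm_factor(sample, reference, trim_m=0.30, trim_a=0.05):
    """Unweighted trimmed mean of log2 ratios between two count vectors."""
    n_s, n_r = sum(sample), sum(reference)
    m_vals, a_vals = [], []
    for s, r in zip(sample, reference):
        if s > 0 and r > 0:  # log ratios are undefined for zero counts
            p_s, p_r = s / n_s, r / n_r
            m_vals.append(math.log2(p_s / p_r))        # M: log fold change
            a_vals.append(0.5 * math.log2(p_s * p_r))  # A: average log abundance
    n = len(m_vals)

    def keep_by_rank(values, trim):
        # Indices that survive trimming the lowest and highest `trim` fraction.
        order = sorted(range(n), key=lambda i: values[i])
        cut = int(math.floor(n * trim))
        return set(order[cut:n - cut])

    kept = keep_by_rank(m_vals, trim_m) & keep_by_rank(a_vals, trim_a)
    if not kept:
        return 1.0
    mean_m = sum(m_vals[i] for i in kept) / len(kept)
    return 2 ** mean_m


reference = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
doubled = [2 * c for c in reference]   # deeper sequencing, same composition
skewed = reference[:-1] + [1000]       # one gene dominates the library

print(simple_tmm_factor(doubled, reference))
print(simple_tmm_factor(skewed, reference))
```

The doubled sample gets a factor of about 1 because only depth changed, while the skewed sample gets a factor below 1: the single dominant gene is trimmed away, and the factor reflects how much it deflated the proportions of every other gene.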

Issues

Interpret, and document

Objective

Answer all questions required for assignment 1

Time Management

Date Started: 2022-02-13
Date Completed: 2022-02-13
Estimated Time: 30 minutes for answering questions
Actual Time: 30 minutes + time spent on previous tasks in assignment 1

Procedure

Read the paper
Complete all data cleaning and normalization techniques

Questions and Answers

What are the control and test conditions of the dataset?

The control condition in this dataset is the skin of healthy, HIV-negative controls (patient = HC, cell_type = SKIN). The test conditions are the skin samples of two cohorts of people living with HIV (PLWH): HIV “late ART” (HIVLA) and HIV “early ART” (HIVEA) (patient = A, cell_type = SKIN and cell_type = PBMC).

Why is the dataset of interest to you?

The skin is an important barrier against infections and cancer, and it is protected by a type of immune cell called resident memory T (Trm) cells. These cells are important for fighting infections and preventing cancer in the skin. People with HIV can have a weakened immune system, which puts them at a higher risk for certain types of cancer, including skin and mucosal cancers. The study found that people with HIV who were diagnosed late (when their immune system was already weak) had a permanent reduction in the number of Trm cells in their skin, even if they were taking medicine to treat their HIV. However, people who were diagnosed and started treatment earlier had only a temporary reduction in Trm cells and eventually reconstituted (rebuilt) their Trm cell population. This highlights the importance of early diagnosis and treatment of HIV in order to prevent skin-confined immunodeficiency and reduce the risk of HPV-related malignancies.

Were there expression values that were not unique for specific genes? How did you handle these?

Yes, there are 3 HGNC symbols that are each mapped to by more than one Ensembl ID (CYB561D2 has 2 duplicates, HSPA14 has 4 duplicates, COG8 has 2 duplicates). I have decided to keep them in the dataset so as not to harm the analysis. Removing duplicates can introduce differences where they don't exist and can bias some algorithms.

Were there expression values that could not be mapped to current HUGO symbols?

Yes, there are 491 identifiers that are not matched to the current HUGO mapping, which is 3.9% of the total number of genes (not including genes of low count).

How many outliers were removed?

No outliers were removed from the dataset. Original and normalized boxplots and density plots did not show significant variation to indicate the presence of any outliers.

How did you handle replicates?

There are a maximum of 5 biological replicates for each of the three conditions in which the samples are tested. I grouped the replicates under the conditions they were tested in.
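Grouping replicates by condition can be sketched as below. The sample IDs and the `condition` field are hypothetical; only the condition labels (HC, HIVLA, HIVEA) and the SKIN cell type come from the dataset description above, and the real metadata layout may differ.

```python
from collections import defaultdict

# Hypothetical sample annotations, invented for illustration.
samples = [
    {"id": "S1", "condition": "HC", "cell_type": "SKIN"},
    {"id": "S2", "condition": "HC", "cell_type": "SKIN"},
    {"id": "S3", "condition": "HIVLA", "cell_type": "SKIN"},
    {"id": "S4", "condition": "HIVEA", "cell_type": "SKIN"},
    {"id": "S5", "condition": "HIVEA", "cell_type": "SKIN"},
]


def group_replicates(rows):
    """Collect sample IDs under their (condition, cell_type) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["condition"], row["cell_type"])].append(row["id"])
    return dict(groups)


groups = group_replicates(samples)
for key, ids in groups.items():
    print(key, len(ids), ids)
```

Printing the group sizes is a quick way to confirm that every condition has enough replicates before normalization and downstream analysis.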

What is the final coverage of your dataset?

A total of 46500 genes were removed from the dataset. The final dataset represents 20% of the original dataset. The number of samples remains 16.
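As a back-of-the-envelope check, the two figures above imply approximate totals if both are taken at face value. The helper below is hypothetical, and because the reported numbers are probably rounded, the results are estimates rather than exact counts from the notebook.

```python
def implied_totals(removed, kept_fraction):
    """Infer original and retained gene counts from the removed count and the kept share."""
    original = removed / (1 - kept_fraction)
    return original, original - removed


# Assumption: exactly 46500 genes removed and exactly 20% retained.
original, kept = implied_totals(46500, 0.20)
print(round(original), round(kept))
```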

Conclusion

A total of 46500 genes were removed from the dataset. The final dataset represents 20% of the original dataset. The number of samples remains 16.