Entry 3.2: Dataset Selection - bcb420-2025/Chloe_Calica GitHub Wiki

Objective: Outline my process of obtaining a dataset to be used for Assignment #1

Estimated Time: 1 hour

Actual Time: 1 hr 45 mins (Took longer because I had to read the papers and look at the dataset carefully)

Selection Process

Navigated to the GEO website.
Ran a broad search for the term "Lyme Disease".
Added the following filters to narrow down search:
- Organism: Homo sapiens
- Publication Year: From 2020/01/01 onward, so published within the last 5 years
- Study Type: Expression profiling by high throughput sequencing
At this point, only ten entries were left, so I looked at each one more closely and eliminated any that were single-cell rather than bulk RNA-seq or that had no publication listed.
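The filters above can also be expressed as a single GEO DataSets search string. The following is a minimal sketch, not necessarily the exact query that was run: the field tags (`[Organism]`, `[DataSet Type]`, `[Publication Date]`) and the open-ended `3000` upper date bound are assumptions about GEO's advanced search syntax, and `build_geo_query` is a hypothetical helper name.

```python
def build_geo_query(term, organism, study_type, start_date, end_date="3000"):
    """Join a free-text term with field-tagged filters (hypothetical helper).

    The field tags used here are assumed, not confirmed against GEO's docs.
    """
    parts = [
        f'"{term}"',                                   # broad search term
        f'"{organism}"[Organism]',                     # organism filter
        f'"{study_type}"[DataSet Type]',               # study type filter
        f"{start_date}:{end_date}[Publication Date]",  # date range filter
    ]
    return " AND ".join(parts)


query = build_geo_query(
    term="Lyme Disease",
    organism="Homo sapiens",
    study_type="expression profiling by high throughput sequencing",
    start_date="2020/01/01",
)
print(query)
```

The resulting string could be pasted into the GEO DataSets search box; the single-cell and publication checks were still done by hand on the remaining entries.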
Only two datasets were left after the previous filtering step:
- Genome-wide transcriptome analysis of human cell models exposed to Borrelia burgdorferi
  - Accession: GSE194294
  - Samples: human primary cell line (HUVEC) and an immortalized cell line (HEK-293) exposed to Borrelia burgdorferi strain B31
  - Number of Samples: 12
  - Type of Data: Raw counts
- Changes in gene expression of human monocyte derived dendritic cells exposed to live Borrelia burgdorferi or LTA
  - Accession: GSE211551
  - Samples: monocyte-derived dendritic cells from healthy donors cultured with live Borrelia burgdorferi (2 strains) or stimulated with LTA
  - Number of Samples: 91
  - Type of Data: Normalized

I decided to go with the first dataset mentioned above (GSE194294), pending approval from Professor Isserlin.
I chose this one because its size is more manageable (12 vs. 91 samples) and its data are provided as raw counts, whereas the other dataset is already normalized.
I do prefer the samples in the other dataset because they come from human donors rather than cell lines, and they are specific to dendritic cells. However, it has far more samples, and I'm afraid I may not be able to process all of them at once, since the data are stored in a single file rather than separated by condition.
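The trade-off above can be summarized as a small ranking sketch. This is only an illustration of the reasoning, assuming two criteria in priority order (raw counts first, then fewer samples); the `candidates` list and `rank_key` function are hypothetical names, and the values come from the two entries listed earlier.

```python
# The two datasets that survived filtering, with the properties compared above.
candidates = [
    {"accession": "GSE194294", "n_samples": 12, "data_type": "raw counts"},
    {"accession": "GSE211551", "n_samples": 91, "data_type": "normalized"},
]


def rank_key(dataset):
    """Sort raw-count datasets first (False < True), then by fewer samples."""
    return (dataset["data_type"] != "raw counts", dataset["n_samples"])


chosen = min(candidates, key=rank_key)
print(chosen["accession"])
```

Biological relevance (donor-derived cells vs. cell lines) is deliberately left out of the key, since that preference was outweighed by the practical constraints.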