Entry 3.2: Dataset Selection - bcb420-2025/Chloe_Calica GitHub Wiki

Objective: Outline my process of obtaining a dataset to be used for Assignment #1

Estimated Time: 1 hour

Actual Time: 1 hr 45 mins (Took longer because I had to read the papers and look at the dataset carefully)

Selection Process

  1. Navigated to the GEO website.
  2. Ran a broad search for the term "Lyme Disease".
  3. Added the following filters to narrow down search:
    • Organism: Homo sapiens
    • Publication Year: From 2020/01/01 onward, so no more than 5 years old
    • Study Type: Expression profiling by high throughput sequencing
    • At this point, only ten entries were left, so I looked at each of them more closely and eliminated those that were single-cell rather than bulk RNA-seq or that had no publication listed.
  4. Only two datasets were left after the previous filtering step:
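
As a rough sketch, the search term and filters from steps 2 and 3 could be combined into a single GEO DataSets query string. This is only an illustration of the idea, not how I actually ran the search (I used the website's filter panel). The function name `build_geo_query` is made up for this example, and the field tags (`[Organism]`, `[DataSet Type]`, `[PDAT]`) are written from memory of GEO's advanced search syntax, so they should be checked against the GEO documentation before relying on them.

```python
from urllib.parse import urlencode


def build_geo_query(term, organism, study_type, start_date, end_date="3000"):
    """Join a free-text term and field-tagged filters with AND.

    Hypothetical helper: the field tags below are assumptions based on
    GEO DataSets advanced search syntax and are not verified here.
    """
    parts = [
        term,
        f'"{organism}"[Organism]',
        f'"{study_type}"[DataSet Type]',
        # "3000" is used as an open-ended upper bound for the date range
        f'("{start_date}"[PDAT] : "{end_date}"[PDAT])',
    ]
    return " AND ".join(parts)


query = build_geo_query(
    term="Lyme disease",
    organism="Homo sapiens",
    study_type="Expression profiling by high throughput sequencing",
    start_date="2020/01/01",
)

# URL-encode the query so it can be pasted into a browser
search_url = "https://www.ncbi.nlm.nih.gov/gds/?" + urlencode({"term": query})
print(query)
print(search_url)
```

The manual step (removing single-cell datasets and entries without a publication) is not captured here, since that required reading each entry.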

Final Dataset

  • I decided to go with the first dataset mentioned above (GSE194294), pending approval from Professor Isserlin.
  • I chose this one because its size is more manageable (12 samples vs. 91) and its data is provided as raw counts, whereas the other dataset's data is already normalized.
  • I do prefer the other dataset's samples because they are human samples rather than cell lines. That dataset is also more specific to dendritic cells. However, it has far too many samples, and I'm afraid that I may not be able to process all of them at once since the data is stored in a single file rather than separated per condition.
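
One quick way to sanity-check the "raw counts vs. normalized" distinction is to test whether every value in a counts table is a non-negative whole number. The sketch below is a heuristic only. It assumes a tab-delimited file with one header row and a gene identifier in the first column, with no missing values. The function names and the toy data are made up for illustration and are not taken from GSE194294.

```python
import csv
import io


def read_count_values(handle, delimiter="\t"):
    """Yield the numeric cells of each data row.

    Assumes the first row is a header and the first column is a gene ID.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header row if there is one
    for row in reader:
        yield row[1:]


def looks_like_raw_counts(rows):
    """Return True if every value is a non-negative whole number."""
    for row in rows:
        for value in row:
            number = float(value)
            if number < 0 or not number.is_integer():
                return False
    return True


# Toy tables (invented values) standing in for downloaded count files
raw_table = io.StringIO("gene\ts1\ts2\nGENE_A\t10\t0\nGENE_B\t253\t17\n")
norm_table = io.StringIO("gene\ts1\ts2\nGENE_A\t1.52\t0.0\nGENE_B\t7.98\t4.1\n")

print(looks_like_raw_counts(read_count_values(raw_table)))
print(looks_like_raw_counts(read_count_values(norm_table)))
```

A table that passes this check is not guaranteed to be raw counts (rounded normalized values would also pass), so the dataset's description on GEO is still the main source of truth.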