3. Assignment 1: Data set selection and initial Processing - bcb420-2022/RuoXuan_Wang GitHub Wiki

Objective

Prepare an RNotebook that produces a clean, normalized dataset.

Duration

Time estimated: 10 h; taken: 24+ h
Date started: 2022-02-17; date completed: 2022-02-25
I also had to get my laptop fixed in the interim. I expected to still finish on time, but that did not work out.

Progress

Tasks:

  1. Select an Expression Data Set
  • searched https://www.ncbi.nlm.nih.gov/gds for a dataset, since the GEOmetadb functions were not running efficiently
  • used the query "(((Expression profiling by high throughput sequencing[DataSet Type]) AND Homo sapiens[Organism]) AND ("2017"[UDAT] : "3000"[UDAT]) AND neurological[All Fields])"
  • clicked through to confirm replicates and gene coverage
  • chose GSE157852
  • Read the paper
  2. Clean the data and map to HUGO symbols
  • 1 – Download the data ...

    • use the GEOquery Bioconductor package
    • to download only when necessary, check whether the directory already exists, since downloading creates one (also consider the case where it exists but is empty?)
  • 2 – Assess ...

    • what overview statistics?
    • mark control and test conditions for normalization
  • 3 – Map ...

    • already have HUGO gene symbols as row identifiers
    • should I do another check with biomaRt?
    • unmapped rows are LOCxxxx
    • do any rows map to more than one symbol, or do several rows share the same symbol? Checked for duplicates: none found
  • 4 – Clean ...

    • remove outliers
    • filter low counts
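The mapping checks and the low-count filter above can be sketched in base R. This is a minimal illustration on a made-up count matrix, not the code from my notebook: the gene rows, sample names, the `LOC000001` identifier, and the "more than 1 CPM in at least 3 samples" threshold are all assumptions chosen for the example.

```r
# Toy count matrix standing in for the real data (all values are made up).
# Rows are gene symbols, columns are samples.
counts <- matrix(
  c(3000000, 3100000, 2900000, 3050000, 2950000, 3000000,
          0,       1,       0,       2,       0,       1,
        300,     310,     290,     305,     295,     300,
         40,      35,      45,      38,      42,      41),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("ACE2", "LOC000001", "TTR", "GAPDH"),
    paste0("sample", 1:6)
  )
)

symbols <- rownames(counts)

# Rows without a proper symbol carry LOC-style identifiers.
unmapped <- grepl("^LOC[0-9]+$", symbols)

# Symbols that appear on more than one row (empty if there are none).
dup_symbols <- unique(symbols[duplicated(symbols)])

# Counts per million, computed by hand from the column (library) sums.
cpm_manual <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Keep genes with more than 1 CPM in at least `min_samples` samples.
# The value 3 is an assumed smallest group size.
min_samples <- 3
keep <- rowSums(cpm_manual > 1) >= min_samples
filtered <- counts[keep, , drop = FALSE]

cat("unmapped rows:", sum(unmapped), "\n")
cat("duplicated symbols:", length(dup_symbols), "\n")
cat("genes kept:", nrow(filtered), "of", nrow(counts), "\n")
```

Reporting the number of rows before and after filtering also gives the figure needed for the final coverage question.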
  3. Apply Normalization
  • normalize by distribution, using Trimmed Mean of M-values (TMM)
    • designed for RNA-seq data
    • based on the hypothesis that most genes are not differentially expressed
    • normalizes across samples
  • compared box plots, then density plots, before and after normalization
  • surprised that there seemed to be no difference between the box plots
    • checked the paper: they normalized the raw counts, but I thought the file contained raw counts?
  • same result with the density plots
  • did I make a mistake in choosing this dataset? But I like this topic
  • it's a shame that normalization turned out to be unnecessary
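One way to catch this earlier would be to test whether the downloaded matrix looks like raw counts before normalizing. The sketch below is an assumption-laden illustration: `looks_like_raw_counts` and `normalize_tmm` are hypothetical helper names, and the whole-number test is only a heuristic (normalized values that were rounded would still pass). The TMM step uses edgeR's `DGEList`, `calcNormFactors`, and `cpm`, and is only attempted when edgeR is installed.

```r
# Returns TRUE when every value is a non-negative whole number,
# which is what raw read counts should look like (heuristic only).
looks_like_raw_counts <- function(m) {
  m <- as.matrix(m)
  all(!is.na(m)) && all(m >= 0) && all(m == round(m))
}

# TMM normalization via edgeR, returning normalized counts per million.
# `counts` is a genes x samples matrix, `groups` a vector of condition labels.
normalize_tmm <- function(counts, groups) {
  if (!requireNamespace("edgeR", quietly = TRUE)) {
    stop("edgeR is needed for TMM normalization")
  }
  d <- edgeR::DGEList(counts = counts, group = groups)
  d <- edgeR::calcNormFactors(d, method = "TMM")
  edgeR::cpm(d)
}

# Made-up matrices: one that looks raw, one that has been rescaled.
raw_like <- matrix(c(10, 0, 25, 3), nrow = 2)
scaled_like <- raw_like / 1.7

is_raw <- looks_like_raw_counts(raw_like)
is_scaled_raw <- looks_like_raw_counts(scaled_like)
cat("raw_like passes:", is_raw, "\n")
cat("scaled_like passes:", is_scaled_raw, "\n")
```

If the check fails, the values have probably been transformed already, which would explain box plots and density plots that barely change after TMM.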
  4. Interpret and document
  • Answer listed questions:

What are the control and test conditions of the dataset?
Why is the dataset of interest to you?
Were there expression values that were not unique for specific genes? How did you handle these?
Were there expression values that could not be mapped to current HUGO symbols?
How many outliers were removed?
How did you handle replicates?
What is the final coverage of your dataset?

Submitting

  • knit to html isn't working:

Quitting from lines 46-59 (BCB420_A1.Rmd)
Error in read.table(file = file, header = header, sep = sep, quote = quote, : object 'filenames' not found

  • tried generalizing the file lookup; may have to use an explicit file name
  • added an else clause that does so
  • ran:

docker run --rm -it -v "$(pwd)":/home/rstudio/projects --user rstudio risserlin/bcb420-base-image /usr/local/bin/R -e "rmarkdown::render('/home/rstudio/projects/name_of_rmd.Rmd',output_file='/home/rstudio/projects/name_of_html.html')" > processing_output_filename
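The "download only when necessary" check and the `filenames` error can be handled together by making sure the file list is built whether or not a download happens. The sketch below is one possible shape for that, not the code in my notebook: `get_supp_filenames` and `fake_downloader` are hypothetical names, and the downloader is passed in as an argument so the logic can be tried without network access. In the real notebook the downloader could be `GEOquery::getGEOSuppFiles`, which accepts a `baseDir` argument and creates a directory named after the accession.

```r
# Return the supplementary file paths for `gse`, downloading only when the
# accession directory is missing or empty. `downloader` is any function that
# takes (gse, baseDir) and writes files into file.path(baseDir, gse).
get_supp_filenames <- function(gse, base_dir = getwd(), downloader) {
  gse_dir <- file.path(base_dir, gse)
  # list.files() returns character(0) for a missing or empty directory.
  existing <- list.files(gse_dir, full.names = TRUE)
  if (length(existing) == 0) {
    downloader(gse, baseDir = base_dir)
    existing <- list.files(gse_dir, full.names = TRUE)
  }
  existing
}

# Stand-in for the real download, used only to exercise the logic offline.
# The file name "counts.txt.gz" is made up.
fake_downloader <- function(gse, baseDir) {
  dir.create(file.path(baseDir, gse), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(baseDir, gse, "counts.txt.gz"))
  invisible(NULL)
}

tmp <- tempfile("geo_")
dir.create(tmp)

# First call downloads; second call must reuse the existing directory.
filenames <- get_supp_filenames("GSE157852", tmp, fake_downloader)
filenames_again <- get_supp_filenames(
  "GSE157852", tmp,
  function(...) stop("should not download twice")
)
cat("files found:", basename(filenames), "\n")
```

Because `filenames` is assigned from the directory listing in both cases, it exists on a fresh knit as well as on a rerun.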

  • tried to link GitHub to RStudio, but it wasn't working
  • got a personal access token and it finally worked

Conclusion and outlook

  • I was really interested in this experiment, but the data was already normalized and I didn't realize it until I tried to normalize it myself. I decided not to switch to another dataset, because Professor Isserlin said we should pick one we are interested in. Also, I assume the remaining assignments are just as important.
  • I wasn't quite sure what to do about the 3 conditions; I will probably ask about it for Assignment 2.
  • This assignment was a mess in terms of process, but I think my RNotebook turned out OK.
  • I will get started on the next required journal entry (due March 1). I may also ask whether we need to include our notes here.

References

  1. https://www.ncbi.nlm.nih.gov/gds
  2. Isserlin, R. (2022, February 13). BCB420 - Computational Systems Biology - Lecture 4 - Exploring the data and basics of Normalization. Toronto; Quercus.
  3. Jacob, F., Pather, S. R., Huang, W. K., Zhang, F., Wong, S., Zhou, H., Cubitt, B., Fan, W., Chen, C. Z., Xu, M., Pradhan, M., Zhang, D. Y., Zheng, W., Bang, A. G., Song, H., Carlos de la Torre, J., & Ming, G. L. (2020). Human Pluripotent Stem Cell-Derived Neural Cells and Brain Organoids Reveal SARS-CoV-2 Neurotropism Predominates in Choroid Plexus Epithelium. Cell Stem Cell, 27(6), 937–950.e9. https://doi.org/10.1016/j.stem.2020.09.016