3. Assignment 1: Data set selection and initial Processing - bcb420-2022/RuoXuan_Wang GitHub Wiki

Objective

Prepare an R Notebook that produces a clean, normalized dataset.

Duration

Time estimated: 10 h; taken: 24+ h
date started: 2022-02-17; date completed: 2022-02-25
also had to get my laptop fixed in the interim; expected to still finish on time, but apparently not

Progress

Tasks:

Select an Expression Data Set

went to https://www.ncbi.nlm.nih.gov/gds to search for a dataset, since the GEOmetadb functions were not working efficiently
used the query "(((Expression profiling by high throughput sequencing[DataSet Type]) AND Homo sapiens[Organism]) AND ("2017"[UDAT] : "3000"[UDAT]) AND neurological[All Fields])"
clicked through to confirm replicates and gene coverage
chose GSE157852
Read the paper

Clean the data and map to HUGO symbols

1 – Download the data ...
- use the GEOquery Bioconductor package
- to download only when necessary, check whether the data directory already exists, since downloading creates one (also consider the case where it exists but is empty?)
2 – Assess ...
- what overview statistics?
- mark control and test conditions for normalization
3 – Map ...
- already have HUGO gene symbols as row identifiers
- should I do another check with biomaRt?
- unmapped rows are LOCxxxx
- do any rows map to more than one symbol, or to the same symbol? Checked for duplicates: none
4 – Clean ...
- Removing outliers
- filter low counts
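
The download, duplicate, and low-count checks listed above can be sketched in a few lines. The notebook itself does this in R; the Python below is only an illustration of the logic, not the code that was run. The function names are hypothetical, and the thresholds (CPM above 1 in at least 3 samples) are assumed values for illustration, not ones recorded in these notes.

```python
import os


def needs_download(data_dir):
    """Return True when data_dir is missing or exists but is empty."""
    return not os.path.isdir(data_dir) or len(os.listdir(data_dir)) == 0


def duplicated_symbols(symbols):
    """Return the sorted gene symbols that appear on more than one row."""
    seen, dupes = set(), set()
    for symbol in symbols:
        if symbol in seen:
            dupes.add(symbol)
        seen.add(symbol)
    return sorted(dupes)


def filter_low_counts(counts, min_cpm=1.0, min_samples=3):
    """Keep genes whose CPM exceeds min_cpm in at least min_samples samples.

    counts maps gene symbol -> list of raw counts, one entry per sample.
    Assumes every sample has a non-zero library size.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size: total counts in each sample (column sums).
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    kept = {}
    for gene, row in counts.items():
        cpm = [row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
        if sum(value > min_cpm for value in cpm) >= min_samples:
            kept[gene] = row
    return kept
```

Treating an empty directory the same as a missing one covers the "consider if empty?" case: an interrupted download then triggers a fresh attempt instead of being skipped.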

Apply Normalization

normalize by distribution and use Trimmed Mean of M-values
- specialized for RNA-seq
- based on the hypothesis that most genes are not differentially expressed
- normalizes across the samples
used box plots and then density plots
surprised that there seemed to be no difference between the plots before and after normalization
- checked paper, they normalized raw counts, but I thought the file was raw counts?
same with density plots
made a mistake in choosing the dataset? but I like this topic
it's a shame that normalization was unnecessary
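
To make the TMM idea concrete, here is a minimal Python sketch of a scaling factor for one sample against a reference sample. It is a simplification and not what the notebook ran (in R, TMM is normally computed by edgeR's `calcNormFactors`): the mean here is unweighted, whereas the real method uses precision weights and rescales the factors across samples. The function name `tmm_factor` is hypothetical; the trim fractions follow the usual defaults of 30% on log fold changes and 5% on abundances.

```python
import math


def tmm_factor(sample, ref, trim_m=0.30, trim_a=0.05):
    """Simplified (unweighted) TMM scaling factor of sample relative to ref."""
    n_s, n_r = sum(sample), sum(ref)
    m_vals, a_vals = [], []
    for y_s, y_r in zip(sample, ref):
        if y_s > 0 and y_r > 0:  # log ratios are undefined for zero counts
            p_s, p_r = y_s / n_s, y_r / n_r
            m_vals.append(math.log2(p_s / p_r))        # M: log fold change
            a_vals.append(0.5 * math.log2(p_s * p_r))  # A: mean log abundance
    n = len(m_vals)

    def middle(values, trim):
        """Gene indices left after trimming a fraction `trim` from each tail."""
        order = sorted(range(n), key=lambda i: values[i])
        cut = int(n * trim)
        return set(order[cut:n - cut])

    # Keep only genes that survive both the M trim and the A trim.
    keep = middle(m_vals, trim_m) & middle(a_vals, trim_a)
    if not keep:
        return 1.0
    return 2 ** (sum(m_vals[i] for i in keep) / len(keep))
```

When two samples have the same composition (for example, one is just sequenced twice as deeply), every M value is 0 and the factor is 1. Factors close to 1 across all samples would leave box plots and density plots looking essentially unchanged.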

Interpret, and document

Answer listed questions:

What are the control and test conditions of the dataset?
Why is the dataset of interest to you?
Were there expression values that were not unique for specific genes? How did you handle these?
Were there expression values that could not be mapped to current HUGO symbols?
How many outliers were removed?
How did you handle replicates?
What is the final coverage of your dataset?

Submitting

knit to html isn't working:

Quitting from lines 46-59 (BCB420_A1.Rmd)
Error in read.table(file = file, header = header, sep = sep, quote = quote, : object 'filenames' not found

tried generalizing the file lookup; may have to use an explicit file name
added an else clause that does so
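
One plausible reading of the error (an assumption, since the notebook chunk is not shown here) is that `filenames` was only assigned in the branch that runs when the data is downloaded, so a knit that skips the download never defines it. The pattern of assigning it in both branches looks roughly like this. The notebook is in R; this Python sketch only illustrates the control flow, and `resolve_data_files` and `fallback_name` are hypothetical names.

```python
import os


def resolve_data_files(data_dir, fallback_name):
    """Return data file paths so that `filenames` is defined on every path."""
    if os.path.isdir(data_dir) and os.listdir(data_dir):
        # Directory already populated: use whatever files are present.
        filenames = sorted(os.path.join(data_dir, f) for f in os.listdir(data_dir))
    else:
        # Nothing found: fall back to an explicitly named file.
        filenames = [os.path.join(data_dir, fallback_name)]
    return filenames
```
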
ran

docker run --rm -it -v "$(pwd)":/home/rstudio/projects --user rstudio risserlin/bcb420-base-image /usr/local/bin/R -e "rmarkdown::render('/home/rstudio/projects/name_of_rmd.Rmd',output_file='/home/rstudio/projects/name_of_html.html')" > processing_output_filename

tried to link GitHub to RStudio and it wasn't working
got a personal access token and it finally worked

Conclusion and outlook

I was really interested in this experiment, but the data was already normalized and I didn't realize until I tried to normalize it. I decided to not switch to another dataset, because Professor Isserlin said we should be interested. Also, I assume the remaining assignments are just as important.
Wasn't quite sure what to do about the 3 conditions, will probably ask about it for Assignment 2.
This assignment was a mess in terms of the process, but I think my RNotebook turned out ok.
I will get started on the next required journal entry (due March 1). Maybe also ask whether we need to include our notes here.

References

https://www.ncbi.nlm.nih.gov/gds
Isserlin, R. (2022, February 13). BCB420 - Computational Systems Biology - Lecture 4 - Exploring the data and basics of Normalization. Toronto; Quercus.
Jacob, F., Pather, S. R., Huang, W. K., Zhang, F., Wong, S., Zhou, H., Cubitt, B., Fan, W., Chen, C. Z., Xu, M., Pradhan, M., Zhang, D. Y., Zheng, W., Bang, A. G., Song, H., Carlos de la Torre, J., & Ming, G. L. (2020). Human Pluripotent Stem Cell-Derived Neural Cells and Brain Organoids Reveal SARS-CoV-2 Neurotropism Predominates in Choroid Plexus Epithelium. Cell Stem Cell, 27(6), 937–950.e9. https://doi.org/10.1016/j.stem.2020.09.016