2. Assignment 1 - bcb420-2022/Inika_Prasad GitHub Wiki

2.3 Submitting & reflections

  • A lot of the work in this assignment was just choosing the dataset, and I spent about 3-4 hours picking out my first dataset. When I realized it wouldn't work, I spent another 4 hours picking another one and getting acquainted with it.
  • The sample code from the lectures was very useful: it helped me figure out which functions and packages to use.
  • Polishing the notebook by adding figure legends, headings, a table of contents, and making colours aesthetically pleasing took a surprising amount of time in the end: almost 1.5 hours.
  • Making sure everything is correctly uploaded and linked on GitHub: do I upload the .Rmd and .html files to my repository, and then link the .Rmd file from the StudentWiki page?
  • I made most of my notes about the assignment in the R notebook. How can I use this journal better for next time?

R Notebook for Assignment 1

R Notebook (HTML) knitted for Assignment 1

2.2 Working with the data: Normalization

Normalization by library size:

  • Total count normalization: each value in a sample divided by the total reads for that run. Transcript length NOT factored in.
  • RPKM (reads per kilobase of transcript per million mapped reads). Transcript length factored in.
  • FPKM (fragments per kilobase of transcript per million mapped reads). Transcript length factored in.
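
To make the difference between these concrete for myself, here is a minimal sketch in Python (for illustration only; the notebook itself is in R). The gene names, counts and lengths are made up, and the function names are my own, not from any package.

```python
def total_count_normalize(counts):
    """Divide each gene's count by the sample's total reads (library size)."""
    total = sum(counts.values())
    return {gene: c / total for gene, c in counts.items()}


def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads.

    count / (length / 1e3) / (total / 1e6) simplifies to
    count * 1e9 / (length * total).
    """
    total = sum(counts.values())
    return {
        gene: c * 1e9 / (lengths_bp[gene] * total)
        for gene, c in counts.items()
    }


# Hypothetical toy sample: names, counts and lengths are invented.
counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

fractions = total_count_normalize(counts)
rpkm_values = rpkm(counts, lengths_bp)

for gene in counts:
    print(gene, round(fractions[gene], 3), round(rpkm_values[gene], 1))
```

geneB has three times the reads of geneA but is also three times as long, so total count normalization ranks it higher while RPKM gives both the same value. FPKM uses the same arithmetic but counts fragments (read pairs) instead of individual reads.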

Normalization by distribution. Assumptions:

  1. Differentially expressed and non-differentially expressed genes behave the same way. Technical variations in the data will affect both.
  2. The data are roughly balanced: a gene up-regulated in one sample is correspondingly down-regulated in the other. Similar numbers of genes are up- and down-regulated.
  • Z-score normalization: rescales values to mean 0 and standard deviation 1 (the result is a standard normal distribution only if the data were normally distributed to begin with)
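
A minimal sketch of z-scoring, in Python for illustration. The expression values are made up, and I use the population standard deviation here; as far as I know R's sd() divides by n - 1 instead, so the numbers would differ slightly there.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD (assumption, see note above)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


# Hypothetical expression values for one gene across eight samples.
expression = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = z_scores(expression)
print(scaled)
```

The shape of the distribution is unchanged; only its centre and spread are.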

Specialized normalization methods for RNA-Seq. Assumption: most genes are not differentially expressed.

  • TMM: Trimmed Mean of M-values. edgeR package
  • RLE: Relative Log Expression. DESeq package.
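
As I understand it, RLE comes down to a "median of ratios" size factor per sample. A minimal sketch in Python (for illustration only, not the package's code; the function name and the toy counts are my own):

```python
from math import exp, log
from statistics import median


def rle_size_factors(count_matrix):
    """count_matrix maps gene -> list of counts, one entry per sample."""
    n_samples = len(next(iter(count_matrix.values())))
    # Only genes seen in every sample have a defined log geometric mean.
    usable = [row for row in count_matrix.values() if all(c > 0 for c in row)]
    log_geo_means = [sum(log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene to its geometric mean across samples.
        log_ratios = [log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(exp(median(log_ratios)))
    return factors


# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = {
    "geneA": [100, 200],
    "geneB": [50, 100],
    "geneC": [30, 60],
    "geneD": [10, 80],  # the one gene that really changes
    "geneE": [0, 5],    # dropped: zero in one sample
}
factors = rle_size_factors(toy_counts)
print(factors)
```

Because the median ignores the one gene that really changes, the two size factors differ by a factor of 2, which is the sequencing depth difference. This only works because most genes are not differentially expressed.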

More about the TMM Method:

  • based on the M vs A plot
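
A simplified sketch of the idea, in Python for illustration. M is the log2 ratio of a gene's share of reads between a sample and a reference, and A is its average log2 abundance; genes with extreme M or A are trimmed before averaging. The trim fractions (30% of M, 5% of A) are the defaults I believe edgeR uses. This version takes a plain mean instead of a precision-weighted one and uses a hand-picked reference, so it is not what the package actually computes.

```python
from math import log2


def tmm_factor(sample, reference, m_trim=0.3, a_trim=0.05):
    """Simplified (unweighted) trimmed mean of M-values scaling factor."""
    n_s, n_r = sum(sample), sum(reference)
    m_a = []
    for y_s, y_r in zip(sample, reference):
        if y_s > 0 and y_r > 0:  # log ratios need nonzero counts
            p_s, p_r = y_s / n_s, y_r / n_r
            m_a.append((log2(p_s / p_r), 0.5 * log2(p_s * p_r)))
    n = len(m_a)

    def kept(values, trim):
        # Indices that survive trimming `trim` of the genes from each end.
        order = sorted(range(n), key=lambda i: values[i])
        cut = int(n * trim)
        return set(order[cut:n - cut])

    keep = kept([m for m, _ in m_a], m_trim) & kept([a for _, a in m_a], a_trim)
    if not keep:
        raise ValueError("no genes left after trimming")
    mean_m = sum(m_a[i][0] for i in keep) / len(keep)
    return 2 ** mean_m


# Hypothetical counts: identical except for one strongly up-regulated gene.
reference = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
sample = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]

factor = tmm_factor(sample, reference)
effective_library_size = sum(sample) * factor
print(round(factor, 4), round(effective_library_size, 1))
```

The one up-regulated gene inflates the sample's library size from 550 to 1450, but it is trimmed away, so the factor scales the effective library size back to 550 and the other nine genes come out unchanged.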

2.1 Unsuitable dataset: choosing anew

Issue: GSE113863 associated with "An assessment of prognostic immunity markers in breast cancer" is not suitable for BCB420. Rationale: After trying to access the microarray data that much of the paper is based on, I realized that the authors compiled it from a variety of publications. This compiled data is not available on the GEO page, and rebuilding it myself would mean combining data from various experiments and making sure the result is robust and suitable.

The only data available on the GEO page is for Targeted Microarray seq, which looks at a panel of 72 genes in 483 patients. This is not suitable for further analysis in this BCB420 assignment.

Result: Therefore, I am choosing a new dataset.

  1. Gene expression analysis suggests immunological changes of peripheral blood monocytes in the progression of patients with coronary artery disease, GSE166780, HiSeq X Ten (Homo sapiens), 2021-02-15. Cited by 2 articles. 8 x 3 = 24 samples. .tar supplementary file. GEO

  2. Effects of IL-4 conditioning on human umbilical cord blood-derived mast cells, GSE165804, Illumina NextSeq 500 (Homo sapiens), 2021-01-29. Cited by 7 articles. 2 x 2 = 4 samples. GEO

  3. Transcriptome-wide profiling of palmitic acid-exposed astrocytes reveals widespread immunometabolic dysregulation, GSE166500, Illumina HiSeq 2000 (Homo sapiens), 2021-02-09. 5x2x3 = 30 samples, No publication

  4. Transcriptomic characteristics and impaired immune function of patients who retest positive for SARS-CoV-2 RNA, GSE166253, HiSeq X Ten (Homo sapiens), 2021-02-05. 26 samples. The associated paper is not linked on GEO and vice versa. Cited by: unavailable.

  5. Analysis of the transcriptome and DNA methylome in response to acute and recurrent low glucose in human primary astrocytes (RNA-Seq), GSE166847, Illumina HiSeq 2500 (Homo sapiens), 2021-02-16.

  6. Gene expression after knockdown of transcription factors in human iPSC-derived cardiac myocytes, GSE166823, NextSeq 550 (Homo sapiens), 2021-02-16. 11 experimental groups including 1 control. 11 x 3 samples. Publication unavailable

Decision: Option 1 (GSE166780) seems like the most suitable. It has a good number of samples, an extensive and thorough publication associated with it, and a supplementary file with raw data.

After accessing the Option 1 (GSE166780) data, it appears to be cleaned and normalized already. Gene identifiers are mapped, and the data accessed via the GEO accession number in R are identical to the supplementary file, which contains FPKM (fragments per kilobase per million mapped reads) values for annotated genes. No raw data are available.

Lesson learned: the file name having RAW in it doesn't mean the data is raw (un-annotated or non-normalized)

  • If the sample GSM is linked, clicking on it can give you more relevant information about how the data has been treated.

For instance, when looking at the GEO platform, click on a sample like GSM5066812.

  • Good to write in a reproducible way (with embedded R code) without hardcoding; my work can be redone quickly with a new dataset
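
A quick sanity check I could run before trusting a file name with RAW in it, sketched in Python for illustration. It relies on one assumption: raw read counts are non-negative whole numbers, while FPKM-style values usually are not. The function name and example values are made up.

```python
def looks_like_raw_counts(values):
    """Heuristic: True if every value is a non-negative whole number."""
    values = list(values)
    if not values:
        return False
    return all(v >= 0 and float(v).is_integer() for v in values)


# Hypothetical expression columns.
fpkm_like = [0.0, 12.73, 3.5, 158.02]
count_like = [0, 13, 4, 158]

print(looks_like_raw_counts(fpkm_like))   # False
print(looks_like_raw_counts(count_like))  # True
```

This is only a first filter: normalized values that were rounded would pass, so the data processing notes on the sample's GSM page are still worth reading.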

Next step: Time to choose yet another dataset.

Option 5: Analysis of the transcriptome and DNA methylome in response to acute and recurrent low glucose in human primary astrocytes (GSE166847). Supplementary file available. Good sample size? Yes, 4 x 5 = 20 samples.

  • Supplementary_files_format_and_content: csv file of counts

  • Raw data are available in SRA. Processed data are available on the Series record. (Source: GEO)

  • Therefore, the SRAdb package is required to access the raw data. Concerns:

  • Is it different from GEOmetadb for downstream analysis? No, because we read the data into a data frame and matrix later, which is identical no matter the source of the data.

    • But! the SRA file may contain un-aligned, untrimmed, or even FASTQ sequences. (Source)
    • For high-throughput sequencing, GEO brokers the complete set of raw data files, e.g., FASTQ, to the SRA database on your behalf. (Source: GEO)
    • Whether I can use this dataset depends on the type of RAW data that is present in the SRA database. If it's FASTQ files then it isn't feasible to align. If it's counts per million with ENSEMBL gene IDs then we're good.
    • It is, in fact, a FASTQ file (Source: SRA page)
  • Are we supposed to only use GEO for this project?

  • Would the code I have already written be redundant? No, because the GEO page still holds valid information about the project and paper, even though the raw data is being accessed with SRA.

Note: Differential gene expression and ontology analyses were performed using DESeq2 and GOseq respectively. Source: GEO

FDR correction was used (Source: Paper)
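
The sources only say "FDR correction", not which method. My assumption is Benjamini-Hochberg, since that is the default adjustment in DESeq2. A minimal sketch of it in Python (for illustration only; the p-values are made up):

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four genes.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
print(adjusted)
```

Each p-value is multiplied by (number of tests / its rank), and the running minimum keeps a smaller p-value from ending up with a larger adjusted value than a bigger one.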

Useful link for RNAseq data: Guide to creating design matrices for gene expression experiments

More options for datasets:

  1. PD-1 is imprinted on cytomegalovirus-specific CD4+ T cells and attenuates Th1 cytokine production whilst maintaining cytotoxicity, GSE165952, supplementary file counts matrix available, 2 x 12 + 1 = 13 samples

  2. Gene expression analyses of hematopoietic stem and progenitor cells treated with extracellular vesicles isolated from adult and fetal MSCs, GSE165921, 3 x 3 samples, ENSEMBL IDs + counts (normalized?). .tar data

  3. GDF11 rapidly increases lipid accumulation in liver cancer cells through ALK5-dependent signaling, GSE165842

  4. Recurrent human papillomavirus-related head and neck cancer undergoes metabolic re-programming and is driven by oxidative phosphorylation, GSE165883, 10 x 2 samples. Primary tumour vs recurrent tumour. Genes mapped. Normalized?

  5. Dopamine Receptor Antagonists and Radiation Create a Metabolic Vulnerability in Mouse Models of Glioblastoma. 3 samples per condition, many different conditions. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE165624

2. Choosing dataset

  • About the knitr::kable() function: Info

  • Error: I'm only getting breast cancer datasets when doing a general search.

Progress: The R markdown file I have gives the appropriate data on a friend's computer.

  • Error: The script hasn't changed, but fetching the metadata file is proving problematic.

Error message: Error in download.file(url_geo, destfile = localfile, mode = "wb") download from 'http://starbuck1.s3.amazonaws.com/sradb/GEOmetadb.sqlite.gz' failed

Decided to choose a dataset from what I could see, so chose GSE113863 associated with "An assessment of prognostic immunity markers in breast cancer". Approved by Prof. Isserlin on Monday 14th Feb.

1. Getting started

Objective: Choose a dataset fulfilling the following criteria and read the paper/publication associated with the data:

  • with good coverage;
  • not much older than ten years (quality!);
  • with sufficient numbers of replicates;
  • collected under interesting conditions;
  • mapped to unique human gene identifiers.

Time estimated: 1.5 hours. Started 11th Feb, 15:20 Time taken XXX h

Progress

Task 1: Get Bioconductor and all packages running

Error message: Do you want to install from sources the packages which need compilation?

Answer: Probably not (R Studio Community)

Error message: Error in getSQLiteFile() : could not find function "getSQLiteFile"

Answer: Try loading BiocManager and GEOmetadb with library(). library(BiocManager) works well. library(GEOmetadb) gives:

package or namespace load failed for ‘GEOquery’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]): namespace ‘rlang’ 0.4.11 is already loaded, but >= 1.0.0 is required
Error: package ‘GEOquery’ could not be loaded

Restart everything. What R version do I have? It's the first message on the console when you open R. Don't forget to load the BiocManager & GEOmetadb packages, e.g. with library(BiocManager).

Error: dbListTables gives character(0); not sure why. No error message. Answer: Installed RSQLite package.

But now, a new problem appears

Command: `geometadbfile <- getSQLiteFile()`

Error message: trying URL 'http://starbuck1.s3.amazonaws.com/sradb/GEOmetadb.sqlite.gz'

Content type 'binary/octet-stream' length 774275550 bytes (738.4 MB)

downloaded 151.9 MB

Error in download.file(url_geo, destfile = localfile, mode = "wb") : download from 'http://starbuck1.s3.amazonaws.com/sradb/GEOmetadb.sqlite.gz' failed

In addition: Warning messages:

1: In download.file(url_geo, destfile = localfile, mode = "wb") : downloaded length 159324985 != reported length 774275550

2: In download.file(url_geo, destfile = localfile, mode = "wb") : URL 'http://starbuck1.s3.amazonaws.com/sradb/GEOmetadb.sqlite.gz': Timeout of 60 seconds was reached

Update: I can't figure out what's wrong right now so I'm going to do a software update on my Mac (macOS Monterey) and try again in a few hours.

Conclusion & Outlook

The software update ran into issues so I did not pursue it. The code worked the next day, even though I did nothing different. Seems like sometimes waiting it out is the smartest idea.
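
In hindsight, the warning says the 60-second timeout was hit partway through a 738.4 MB download, so "waiting it out" could be automated as retrying with longer and longer pauses. A minimal sketch in Python (for illustration only; the retry helper and the fake download function are hypothetical, and in R the equivalent knob is presumably the timeout option that download.file reads):

```python
import time


def retry(fetch, attempts=4, base_delay=60.0, sleep=time.sleep):
    """Call fetch(); on a network-style error wait and try again.

    The wait doubles after each failure: base_delay, 2x, 4x, ...
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:  # TimeoutError and ConnectionError are subclasses
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Hypothetical stand-in for the real download: fails twice, then succeeds.
calls = []


def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("Timeout of 60 seconds was reached")
    return "GEOmetadb.sqlite.gz"


waits = []  # record the pauses instead of actually sleeping
result = retry(flaky_fetch, sleep=waits.append)
print(result, waits)
```

Passing sleep as a parameter lets the example run instantly; with the default time.sleep the same call would really pause between attempts.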