BCB420 Journal - bcb420-2022/Inika_Prasad GitHub Wiki

8. Data Exploration & Identifier Mapping

  • Apoptosis enhancing drugs overcome innate platinum resistance in CA125 negative tumor initiating populations of high grade serous ovarian cancer
  • The data was mistakenly annotated with the wrong identifiers, and their pretty heat map disappeared
  • Identifier mapping is an important step. Make sure your identifiers are up to date, but note that they may not match common gene names
  • CA125 is MUC16, but people know it by the former name
  • When you load your data, set check.names = FALSE so R does not alter your column names (sketch below)
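
A minimal sketch of the loading step (the file name and separator are placeholders, not my actual data):

counts <- read.table("GSExxxx_counts.txt",      # placeholder file name
                     header = TRUE, sep = "\t",
                     check.names = FALSE,        # keep column names exactly as in the file
                     stringsAsFactors = FALSE)
head(colnames(counts))                           # confirm the sample names were not mangled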

Filter out low counts

  • Filter out data with low counts: they provide little evidence for differential expression. There must be a minimal level of RNA expression for it to be translated into protein (and we are using RNA-seq as a proxy for protein activity). Low counts can also create noise
  • edgeR and DESeq are packages for filtering out low counts. Rule of thumb: make sure the genes you keep are expressed in at least 50% (adjustable) of your samples.
  • There's a filter-by-expression function in edgeR, filterByExpr() (sketch below)
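
A minimal sketch of the filtering step with edgeR's filterByExpr(), assuming counts is now a numeric matrix of raw counts (gene IDs as row names) and group is the factor defined in the next section:

library(edgeR)
keep <- filterByExpr(counts, group = group)               # keep genes with enough counts in enough samples
filtered_counts <- counts[keep, ]
c(before = nrow(counts), after = nrow(filtered_counts))   # how many genes survive the filter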

Define the groups

  • The column headers contain the sample info
  • Pull out the relevant columns and make a data frame
  • Build a DGEList object, edgeR's container for the counts and group information (sketch below)
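
A minimal sketch of defining the groups from the column headers and building the DGEList; the header format (cellline_condition_replicate, e.g. "OVCAR3_treated_1") is an assumption for illustration:

library(edgeR)
# split the (assumed) header format into its parts
samples <- data.frame(do.call(rbind, strsplit(colnames(filtered_counts), "_")))
colnames(samples) <- c("cell_line", "condition", "replicate")
rownames(samples) <- colnames(filtered_counts)
group <- factor(samples$condition)
d <- DGEList(counts = as.matrix(filtered_counts), group = group)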

Normalization

  • We want to know about the relative expression of the data, not absolute counts
  • Read counts are proportional to the length of a transcript
  • Read depth & library size effects
  • calcNormFactors() to normalize. It stores scaling factors in the DGEList that are used to scale the downstream analysis. Puts data in bins and normalizes within them
  • norm.factors less than 1 => a small number of high-count genes predominate in the sample, so the effective library size is reduced to compensate
  • edgeR
  • MDS plot (like principal component analysis); sketch below
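
A minimal sketch of TMM normalization and the MDS plot, assuming the DGEList d from the previous step and two groups:

d <- calcNormFactors(d)   # TMM scaling factors are stored in d$samples$norm.factors
d$samples                 # inspect library sizes and norm.factors (values < 1 shrink the effective library size)
plotMDS(d, col = c("darkgreen", "blue")[d$samples$group])   # samples should separate by condition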

Dispersion

  • Common dispersion: across your entire dataset

  • Tagwise dispersion: gene specific

  • Parameter of the negative binomial model, which EdgeR is based on

  • Count data tends to follow a Poisson distribution.

  • RNA-seq counts are modeled with a negative binomial distribution (a Poisson with extra biological variation).

  • Genes with small counts tend to have a higher degree of variation.

  • The square root of the dispersion is the Biological Coefficient of Variation (BCV), a measure of how much variation is in the data. WHY NOT USUAL VARIATION??

  • Graphing the BCV creates a visual representation of the mean-variance relationship

  • Variation has biological causes (the conditions themselves) and technical causes

  • Use trendline of BCV vs counts per million graph

  • Normalize using BCV and mean

  • Data fits the negative binomial distribution! Very nice.
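
A minimal sketch of dispersion estimation and the BCV plot in edgeR, assuming the normalized DGEList d and the samples data frame from the earlier steps:

design <- model.matrix(~ condition, data = samples)   # simple one-factor design
d <- estimateDisp(d, design)
sqrt(d$common.dispersion)   # the common BCV is the square root of the common dispersion
plotBCV(d)                  # tagwise BCV against average log CPM, with common and trended values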

Identifier Mapping

  • Very important in bioinformatics. Just gets worse lol.

  • Enrichment analysis tools are hugely popular, DAVID especially

  • Using outdated identifiers can severely impact your data

  • If you use data from an old database/publication, make sure your annotations are up-to-date

  • GO (Gene Ontology) is updated monthly, so the identifiers used in the labs are updated monthly too

  • BioMart on the Ensembl website: lets you upload identifiers to the website and convert them. Good for checking, but quick & dirty

  • Why can't data just contain the relevant identifiers?

  • Technology doesn't actually deal with full genes or proteins, it often works with fragments.
  • Databases change & get annotated constantly. New genes and regulatory mechanisms are discovered and may become part of the analysis

Tools for Identifier Mapping

  • Ensembl Biomart: make sure R is using the most up to date version.
  • Bioconductor package biomaRt and R package biomartr
  • BridgeDB: a RESTful service (you can get to it with code). A bit flaky sometimes

Biomart

  • library(biomaRt)

  • listMarts()

  • Ensembl Genes version 105 is the latest

  • listEnsemblArchives() so you know which version of the genome you're working with

  • Connect to the mart you're interested in: ensembl <- useMart("ensembl")

  • BioGRID stuff!

  • Find the human dataset using kable and grepl on the list of datasets, and then specify it

  • IDs start with ENSG, ENSP, ENST (Gene, Protein, Transcript)

  • (The version stuff you don't want .... I think)

  • We want to convert Ensembl gene IDs to HGNC symbols: HUGO Gene Nomenclature Committee. The HGNC symbol is the standard gene name for human genes.

  • Consider stashing your conversions to use later. It may change your whole analysis though. But! It's less time- and computation-consuming. You can specify a REFRESH <- TRUE or FALSE at the beginning of your R notebook

  • Use the merge command to merge the conversion table and the count data (see the sketch after this list)

  • Prof Isserlin has some code to test whether the notebook compiles
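
A minimal sketch of the Ensembl-to-HGNC conversion with biomaRt, including the stash-and-refresh idea; the stash file name and the REFRESH flag are my own placeholders:

library(biomaRt)
conversion_file <- "ensembl_to_hgnc.rds"   # placeholder stash file
REFRESH <- FALSE                           # set TRUE to re-query Ensembl
if (REFRESH || !file.exists(conversion_file)) {
  ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  id_conversion <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                         filters = "ensembl_gene_id",
                         values = rownames(filtered_counts),
                         mart = ensembl)
  saveRDS(id_conversion, conversion_file)
} else {
  id_conversion <- readRDS(conversion_file)
}
# merge the conversion table with the count data by Ensembl gene ID
annotated_counts <- merge(id_conversion, filtered_counts,
                          by.x = "ensembl_gene_id", by.y = "row.names", all.y = TRUE)

If the Ensembl IDs carry a version suffix (the part after the dot), strip it before matching.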

To do

  • Download the GEOquery package for accessing GEO data (Gene Expression Omnibus, which now has microarray & RNA-seq data) (last lecture)
  • getGEOSuppFiles('mydata') (sketch below). Put up your data on the wiki
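
A minimal sketch with GEOquery; the accession is a placeholder, not my actual dataset:

library(GEOquery)
gse_accession <- "GSExxxxx"                       # placeholder accession
gse <- getGEO(gse_accession, GSEMatrix = FALSE)   # series and platform metadata
supp_files <- getGEOSuppFiles(gse_accession)      # downloads supplementary files into ./GSExxxxx/
rownames(supp_files)                              # paths to the downloaded files (often the count matrix)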

7. Types of Expression Data Lecture

Types of Expression data

  • Gene expression data (not necessarily the gene that we're measuring, rather the mRNA)
  • Genomics: lots of genes rather than a handful. But how do we turn this into something meaningful?
  • Normalization & Filtering
  • Filter out artefacts in your dataset (genes with low counts)
  • Read depth/library size: account for differences in sequencing depth and for genes that are shorter or longer.
  • Trimmed Mean of M-values (TMM, a normalization method for RNA-seq data)
  1. Microarray expression data (being replaced by RNA-seq)
  • A chip in a well with probes that match the mRNA
  • Compare expression between disease/not disease, for example
  • You can use multiple dyes, multiple probes for a given gene. Computational analysis involves collapsing the probes.
  • The chip needed to have the probe in order for you to detect the mRNA
  • Basic pipeline: isolate mRNA, enrich mRNA, hybridize, stain, scan.
  2. Bulk RNA-seq expression
  • Sequencing not the whole genome, but "enough" of it (the expressed transcripts)
  • Pipeline: extract mRNA, enrich mRNA, fragment it, (find the unambiguous transcripts that uniquely identify your gene), sequence, generate FASTQ files, run them through an aligner. Count how many reads map to each gene – but the longer the gene, the higher the chance of finding fragments of it, so normalization is important.
  • Many different platforms, but mostly from Illumina
  • Different ways to do reads: short reads, long reads, direct reads
  • Most common: Short reads
  • Parameters to consider:
    • No. of samples (minimum 6: 3 control, 3 disease. Helps reduce the influence of outliers. You may not get a strong signal with few samples.)
    • Sample prep
    • Read depth: target no. of sequence reads for each sample (10-30 million generally)
    • Read length
    • Single vs. paired-end reads: important when you're looking for isoforms, or something very specific. Single-end is cheaper and more common
  • IMPORTANT: Note the reference genome! The aligned, processed files will be available, because alignment is a computationally intensive step.
  • Tools relying on a reference genome: TopHat, STAR, HISAT
  • Quantification: identify the no. of transcripts per gene. Tools: RSEM, Cufflinks, MMSeq, HTSeq (popular).
  • Some methods try to account for the fact that long genes show many reads. Examples: RPKM/FPKM (reads or fragments per kilobase of transcript per million mapped reads), TPM (transcripts per million), or raw counts (more common; do normalization as a follow-up instead of embedding it in the raw data. HTSeq uses raw counts.) IMPORTANT: Make sure you know what your base data is! (See the sketch after this list for how RPKM and TPM differ.)
  3. Single-cell RNA-seq expression data (very hard to analyze on a personal computer)

  4. Protein expression data using mass spectrometry
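
Since the quantification units above keep coming up, here is a minimal sketch of the RPKM and TPM calculations on a made-up count matrix with made-up gene lengths (edgeR also provides cpm() and rpkm() helpers):

# made-up counts (3 genes x 2 samples) and gene lengths in kilobases
toy_counts <- matrix(c(10, 20, 1000, 100, 200, 10000), nrow = 3,
                     dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2")))
gene_length_kb <- c(geneA = 0.5, geneB = 2, geneC = 10)

# RPKM: divide by library size (in millions) first, then by gene length (in kb)
rpkm_manual <- t(t(toy_counts) / colSums(toy_counts)) * 1e6 / gene_length_kb
# TPM: divide by gene length first, then rescale each sample to sum to one million
rate <- toy_counts / gene_length_kb
tpm_manual <- t(t(rate) / colSums(rate)) * 1e6
colSums(tpm_manual)   # TPM columns always sum to 1e6, which makes samples easier to compare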

Next steps

  • Choose an expression dataset
  • What is a good dataset? That's a tricky thing. Good depth and a good experiment, but also of interest to you
  • Look at the data in GEO

6. R Basics

  • .Rprofile: a special R script that is executed automatically on startup. R expects to find it in the user's home directory.
  • You can include setwd(), libraries, functions you've defined, etc.
  • Workspace: info about the objects you create in R is stored here. Saving a Workspace is not a great idea, since you might save rubbish, corrupted objects, etc. INSIGHT
  • Make good scripts so you can recreate the objects you need.
  • If an object was expensive or time-consuming to compute, you can save() it and later load() it explicitly: save(your_object, file = "your_file_name.RData"). See the sketch below.
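
A minimal sketch of that pattern; the object and file names are placeholders:

result_file <- "expensive_result.RData"      # placeholder file name
if (file.exists(result_file)) {
  load(result_file)                          # restores expensive_result into the workspace
} else {
  expensive_result <- Sys.time()             # stand-in for a long-running computation
  save(expensive_result, file = result_file)
}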

5. Setting up Docker: second try

Tried logging in again: same errors as in "1. Setup and Installing Docker". Tried deleting and re-downloading the containers: a new error message appeared: "The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested".

This video by Execute Automation explained it quite well.

  • Apple with M1 chip (my current Macbook) can only use containers with Arm64 architecture, and the image for this course is an amd64 build.
  • RStudio also has problems running with the right architecture, according to Prof. Isserlin on the course discussion forum.

Conclusion: use RStudio on your laptop directly

and just check that your notebook knits intermittently with the command:

docker run --rm -it -v "$(pwd)":/home/rstudio/projects --user rstudio risserlin/bcb420-base-image:winter2022-arm64 /usr/local/bin/R -e "rmarkdown::render('/home/rstudio/projects/name_of_rmd.Rmd',output_file='/home/rstudio/projects/name_of_html.html')" > processing_output_filename

Next steps: Actually try out this command.

4. Bioinfo Basics

Objective: Time estimated: Time taken XXX h

Progress

4.1 Abstractions

Time estimated: 20 min, started 19:22, 24 Jan 2022 Time taken: 20 min, ended 19:45, 24 Jan 2022 Material adapted from Abstractions, A Bioinformatics Course, Boris Steipe

4.1.1 Abstractions Notes

  • What is an abstraction: creating concepts which map to biological entities in a meaningful way, representable on a computer
  • Examples of abstractions:
  • representation of a molecular property (sequence, 3D coordinates)
  • description of a function/role (transcription factor, enzyme)
  • abstract label (gene name, protein name, etc.)
  • relationships (node/edge graphs)
  • Problems with abstractions
  • not rich enough to capture property of interest
  • ambiguous abstraction
  • non-unique abstraction
  • abstraction not stable over time
  • To structure an abstraction: define labels & structure relationships
  • Labels: must be unique to the object they describe BUT everyday language doesn't really do that. Two approaches arise for controlled vocabularies.
  1. Numerically controlled vocabularies
  • number uniquely represents the thing (atomic number for Hydrogen = 1)
  • unique, abstract, and essentially arbitrary labels are called identifiers
  2. Synonym constrained vocabularies

use only one form of the string in the database (define labels, have a system to accept/reject accordingly, and make the abstraction known to users)

Ontology: a set of terms (nucleus, cell, muscle, dendritic cell, etc.) + relationships (has, is, part of, causing, etc.)

4.1.2 Abstractions Task

Time estimated: 10 min, started 19:50, 24 Jan 2022 Time taken: 4 min, ended 19:54

TFs affect other TFs, which affect protein expression; PTM-altering proteins affect TFs. Actors: TFs, other TFs, proteins, PTM-altering proteins. An abstraction: a gene regulatory network with nodes = gene labels and edges = relationships (activating/repressing). Strength of relationship: thickness of the line. This abstraction misses other players that can affect the relationships. Other abstractions: an ontology. If you focus on a particular TF, then ...

4.2. Storing data

Time estimated: 40 min, 19:55 on 24 Jan 2022. Time taken: 35 min for reading + notes, ___ min for task

4.2.1 Notes

  • Formats: text files, excel sheets (complex queries need programming, do not scale well, gene names changed to dates), R

  • Three ways of using R

  1. Read a dataset

read.table(), readLines(), scan(); other packages can parse XML or JSON and import it

  2. Make your own dataset

rbind()   # add rows to a data frame
nrow(mydataframe)
mydataframe[, "attribute I wanna know about"]
sum(mydataframe[, "attribute I wanna know about"] == "wanted_value")
save(mydataframe, file = "mydata.Rda")
rm(mydataframe)
load("mydata.Rda")

  3. Connect to a "real" database
  • Use "drivers" to connect to mySQL, MariaDB, Neo4j, etc. Install extra software on your computer
  • guaranteed integrity, multi-user support (aka concurrency), industry-level performance, easy to scale, ACID transactional guarantee. What's ACID? Atomicity, Consistency, Isolation, Durability
  • Atomicity = all or nothing transactions
  • Consistency = any transaction brings the database from one valid state to another
  • Isolation = concurrent execution results in the same state as if the transactions had been executed serially
  • Durability = a committed transaction remains permanently committed.
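
The notes above mention MySQL/MariaDB/Neo4j via drivers; as a stand-in illustration, here is a minimal sketch using the DBI package with SQLite (no separate server needed). The table and column contents are made up:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "genes.sqlite")   # file-backed database, created if missing
dbWriteTable(con, "genes",
             data.frame(symbol = c("TP53", "MUC16"), chromosome = c("17", "19")),
             overwrite = TRUE)
dbGetQuery(con, "SELECT symbol FROM genes WHERE chromosome = '19'")
dbDisconnect(con)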

Note: Broken link on Task 2: linking to storing data for bioinformatics (https://bcb420-2022.github.io/Bioinfo_Basics/storing-data.html)

Conclusion & Outlook

  • What I learned

Next steps:

Footnotes & References

Material adapted from

3. R Basics

Objective: Use R with Docker and work through any gaps in my knowledge that come up using the tasks in the Modules. Time estimated: 4 hours, started 11.00, 24 Jan 2022

3.1 Introduction

  • Bioconductor Project for molecular biology data
  • Using Docker: what Docker is; containers & why we like them; images; Docker volumes; create a notebook in Docker; add commands to the notebook and your observations (time estimated 20 min)

Hello Hello Hello

2. Basic Course Prep

Objective: make notes for the Bioinfo Prep Bookdown to refer to later, and thus prep for the upcoming quiz. Sections to work through: Course Journal, Insights, Plagiarism & Academic Integrity, Data Backup, Network Etiquette, Technical Questions, Info Sources

Time estimated: 2 h, 2022-01-23 Time taken:

Tasks Remaining:

  • Add category tag to insights page
  • Plagiarism model references + footnote
  • Data Backup Task

Progress

2.1. Course Journal

2.2. Insights

  • Link to a new insights! page as a subpage of user page on the Student Wiki
  • Create page in the correct namespace, not in the main space of the Wiki
  • Add category tag
  • Insights template can include: title, context, insight, date

2.3. Plagiarism & Academic Integrity

Time estimated: 20 minutes, started on 21:10, 2022-01-23 Time elapsed: 40 min (20 minutes extra to read linked sources), ended on 22:40, 2022-01-23

  • Contextualize your attributions: inspired by, based on, according to, following, see also, etc.
  • What's a "text tag" to organize citations?
  • If you're citing code, add the citation in a comment
  • Link to the original source; the URL alone isn't enough
  • Use StackOverflow! But link the post and author
  • Use APA Citation format:
  • Falsifying code output is called "concoction" and is a BAD IDEA
  • Creating a footnote is proving difficult. Are they self-updating? Or are we hard-coding numbers in superscript, creating a footnotes section, and calling them footnotes? I have spent 10 minutes on this. 1
  • Use in-text citation and bibliography full citation
  • Model references:
  • a procedure in the methods section of a journal article, as you would cite it in a technical report;
  • a piece of code you found in a StackOverflow article, as you would put it as a comment into computer code;
  • some contents from a classmate’s journal that you incorporate into your own journal.

2.4. Data Backup Best Practice

  • Time estimated: 20 minutes, started 22:40, 2022-01-23 Time taken: 12 minutes SO FAR. TASK STILL INCOMPLETE
  • ...I've never backed up my computer hard drive
  • macOS uses Time Machine / Time Capsule
  • My reservations about backing up my computer: I have many things on my computer that I have access to on my Drive or Cloud, so it feels unnecessary to back them up in yet another place. Although having multiple backups is probably a good idea.
  • Task: Decide on a backup strategy for your computer, Implement your strategy, Create a test file, Backup your computer, Delete your test file, Recreate the file from your last backup. (12 min remaining)

2.5. Netiquette

Time estimated: 20 min Time taken: 5 min

  • Informative subject lines! Not "XYZ doesn't work"
  • No thread hijacking, new question = new thread.
  • No screenshots
  • No need to address by name unless you're responding to something specific
  • Share the resolution of your issue (what worked/didn't). It's nice and allows for archiving of the thread.

2.6. Technical Questions

Time estimated: 20 min Time taken: 45 min

2.6.1 How To Ask Questions The Smart Way By Eric Steven Raymond Time expected: 15 minutes, 23:18 on Jan 23, 2022 Time taken: 30 minutes, 23:48 on Jan 23, 2022

  • RTFM: Read the F*** Manual
  • STFW: Search the F*** Web

2.6.2 How to Create a Minimal Reproducible Example

Time expected: 10 min, 23:10 on Jan 23,2022 Time taken: 5 minutes

  • Provide minimal code that is understandable, complete, and can reproduce the problem
  • Use spaces, not tabs, to create indentation (since tabs may not be correctly formatted on Stack Overflow)
  • Give wording of error message + which line produces it
  • Eliminate other errors that aren't relevant

2.6.3 How to ask good questions that prompt useful answers

Time expected: 10 min, 22:56 on Jan 23, 2022 Time taken = 10 min, completed 23:07 on Jan 23, 2022

  • Statistics questions: R mailing lists aren't quite the right place, although a well-asked and interesting question can get an answer.
  • Use net groups sci.stat.consult (applied statistics and consulting) and sci.stat.math (mathematical stat and probability).
  • Choosing the right mailing list: depends on content, type of query, and platform (like Mac-related bugs go to R-sig-Mac)
  • help.search("keyword"), apropos("keyword"), RSiteSearch("keyword")
  • An unexpected behaviour? Copy paste output from sessionInfo() and consider Sys.getlocale()
  • Refer to An Introduction to R if you need help

1. Setup and Installing Docker

Objective: Complete task 1 for BCB420 due 2022/01/14.

Estimated duration: 3 hours Taken: 4 hours

Procedure

This includes

  1. Start course journal on your repo wiki. Done
  2. Add links to your wiki and repo to the main Student Wiki page Done
  3. download and install docker Done
  4. create a new image and container from the bcb420 Dockerfile Done, just the notebook has been a problem
  5. Document your progress as a new entry in your journal Done

Process

  • I haven't used GitHub in nearly 1.5 years, getting familiar takes time.

  • Accidentally forked the Student Wiki – delete it when you are certain you won't delete anything else accidentally.

  • Started course journal. URL contains Inika_Prasad so it seems to be in my userspace.

  • Tried editing the Student Wiki, and was unable to. Noticed that the invitation for doing so was separate from the one given for the course. Editing was successful thereafter.

  • Followed Install Docker. It was an intuitive process.

  • Creating the image and container from the bcb420 Dockerfile. Attempted following the instructions on the R Basics Wiki. Error message: zsh: permission denied: /Users/inikaprasad/Desktop/BCB420. Cannot gain access to the file; may have to change system permissions. In the meantime, creating the image using the following code from R Basics:

docker run -e PASSWORD=changeit --rm \
  -v "$(pwd)":/home/rstudio/projects \
  -p 8787:8787 \
  risserlin/bcb420-base-image:winter2022

  • Create first notebook using Docker. Reached the login page, but upon entering the username and password, the following error message appeared: "Could not connect to the R Session on the RStudio Server. Unable to connect to service (1)"

  • Troubleshooting: installing Rosetta improves performance for Docker according to the Setup Guide: softwareupdate --install-rosetta in Terminal

I just tried restarting Docker and there are far more containers there than originally. Possible that it just takes a little longer than I expected. There are 4 containers running, amongst which this seems to be the right one: intelligent roentgen risserlin/bcb420-base-image:winter2022 Port 8787

Comes with the warning "Image may have poor performance, or fail, if run via emulation"

Clicking "Open in Browser" gives the same error message. The port is correct (8787) and the container is running.

Next steps:

Try making a notebook. Perhaps restarting the computer will help with performance.

0. Journal Template

Objective: Time estimated: Time taken XXX h

Progress

Task 1:

Task 2:

Conclusion & Outlook

  • What I learned

Next steps:

Footnotes & References

Material adapted from
