BCB420 Journal - bcb420-2022/Inika_Prasad GitHub Wiki

8. Data Exploration & Identifier Mapping

  • Apoptosis enhancing drugs overcome innate platinum resistance in CA125 negative tumor initiating populations of high grade serous ovarian cancer
  • The data was mistakenly annotated with the wrong identifiers, and their pretty heat map disappeared
  • Identifier mapping is an important step. Make sure your identifiers are up to date, but note that they may not match common gene names
  • CA125 is MUC16, but people know it by the former name
  • When you load your data, set check.names = FALSE so R does not alter your column names (sketch below)
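
A minimal sketch of the loading step (the file name and separator are placeholders, not my actual data):

counts <- read.table("GSExxxx_counts.txt",      # placeholder file name
                     header = TRUE, sep = "\t",
                     check.names = FALSE,        # keep column names exactly as in the file
                     stringsAsFactors = FALSE)
head(colnames(counts))                           # confirm the sample names were not mangled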

Filter out low counts

  • Filter out data with low counts: they provide little evidence for differential expression. There must be a minimal level of RNA expression for it to be translated into protein (and we are using RNA-seq as a proxy for protein activity). Low counts can also create noise
  • edgeR and DESeq are packages for filtering out low counts. Rule of thumb: make sure the genes you keep are expressed in at least 50% (adjustable) of your samples.
  • There's a filter-by-expression function in edgeR, filterByExpr() (sketch below)
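
A minimal sketch of the filtering step with edgeR's filterByExpr(), assuming counts is now a numeric matrix of raw counts (gene IDs as row names) and group is the factor defined in the next section:

library(edgeR)
keep <- filterByExpr(counts, group = group)               # keep genes with enough counts in enough samples
filtered_counts <- counts[keep, ]
c(before = nrow(counts), after = nrow(filtered_counts))   # how many genes survive the filter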

Define the groups

  • The column headers contain the sample info
  • Pull out the relevant columns and make a data frame
  • Build a DGEList object, edgeR's container for the counts and group information (sketch below)
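
A minimal sketch of defining the groups from the column headers and building the DGEList; the header format (cellline_condition_replicate, e.g. "OVCAR3_treated_1") is an assumption for illustration:

library(edgeR)
# split the (assumed) header format into its parts
samples <- data.frame(do.call(rbind, strsplit(colnames(filtered_counts), "_")))
colnames(samples) <- c("cell_line", "condition", "replicate")
rownames(samples) <- colnames(filtered_counts)
group <- factor(samples$condition)
d <- DGEList(counts = as.matrix(filtered_counts), group = group)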

Normalization

  • We want to know about the relative expression of the data, not absolute counts
  • Read counts are proportional to the length of a transcript
  • Read depth & library size effects
  • calcNormFactors() to normalize. It stores scaling factors in the DGEList that are used to scale the downstream analysis. Puts data in bins and normalizes within them
  • norm.factors less than 1 => a small number of high-count genes predominate in the sample, so the effective library size is reduced to compensate
  • edgeR
  • MDS plot (like principal component analysis); sketch below
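
A minimal sketch of TMM normalization and the MDS plot, assuming the DGEList d from the previous step and two groups:

d <- calcNormFactors(d)   # TMM scaling factors are stored in d$samples$norm.factors
d$samples                 # inspect library sizes and norm.factors (values < 1 shrink the effective library size)
plotMDS(d, col = c("darkgreen", "blue")[d$samples$group])   # samples should separate by condition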

Dispersion

  • Common dispersion: across your entire dataset

  • Tagwise dispersion: gene specific

  • Parameter of the negative binomial model, which EdgeR is based on

  • Count data tends to follow a Poisson distribution.

  • RNA-seq counts are modeled with a negative binomial distribution (a Poisson with extra biological variation).

  • Genes with small counts tend to have a higher degree of variation.

  • The square root of the dispersion is the Biological Coefficient of Variation (BCV), a measure of how much variation is in the data. WHY NOT USUAL VARIATION??

  • Graphing the BCV creates a visual representation of the mean-variance relationship

  • Variation has biological causes (the conditions themselves) and technical causes

  • Use trendline of BCV vs counts per million graph

  • Normalize using BCV and mean

  • Data fits the negative binomial distribution! Very nice.
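
A minimal sketch of dispersion estimation and the BCV plot in edgeR, assuming the normalized DGEList d and the samples data frame from the earlier steps:

design <- model.matrix(~ condition, data = samples)   # simple one-factor design
d <- estimateDisp(d, design)
sqrt(d$common.dispersion)   # the common BCV is the square root of the common dispersion
plotBCV(d)                  # tagwise BCV against average log CPM, with common and trended values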

Identifier Mapping

  • Very important in bioinformatics. Just gets worse lol.

  • Enrichment analysis tools are hugely popular, DAVID especially

  • Using outdated identifiers can severely impact your data

  • If you use data from an old database/publication, make sure your annotations are up-to-date

  • GO (Gene Ontology) is updated monthly, so the identifiers used in the labs are updated monthly too

  • BioMart on the Ensembl website: lets you upload identifiers to the website and convert them. Good for checking, but quick & dirty

  • Why can't data just contain the relevant identifiers?

  • Technology doesn't actually deal with full genes or proteins, it often works with fragments.
  • Databases change & get annotated constantly. New genes and regulatory mechanisms are discovered and may become part of the analysis

Tools for Identifier Mapping

  • Ensembl Biomart: make sure R is using the most up to date version.
  • Bioconductor package biomaRt and R package biomartr
  • BridgeDB: a RESTful service (you can get to it with code). A bit flaky sometimes

Biomart

  • library(biomaRt)

  • listMarts()

  • Ensembl Genes version 105 is the latest

  • listEnsemblArchives() so you know which version of the genome you're working with

  • Connect to the mart you're interested in: ensembl <- useMart("ensembl")

  • BioGRID stuff!

  • Find the human dataset using kable and grepl on the list of datasets, and then specify it

  • IDs start with ENSG, ENSP, ENST (Gene, Protein, Transcript)

  • (The version stuff you don't want .... I think)

  • We want to convert Ensembl gene IDs to HGNC symbols: HUGO Gene Nomenclature Committee. The HGNC symbol is the standard gene name for human genes.

  • Consider stashing your conversions to use later. It may change your whole analysis though. But! It's less time- and computation-consuming. You can specify a REFRESH <- TRUE or FALSE at the beginning of your R notebook

  • Use the merge command to merge the conversion table and the count data (see the sketch after this list)

  • Prof Isserlin has some code to test whether the notebook compiles
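
A minimal sketch of the Ensembl-to-HGNC conversion with biomaRt, including the stash-and-refresh idea; the stash file name and the REFRESH flag are my own placeholders:

library(biomaRt)
conversion_file <- "ensembl_to_hgnc.rds"   # placeholder stash file
REFRESH <- FALSE                           # set TRUE to re-query Ensembl
if (REFRESH || !file.exists(conversion_file)) {
  ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  id_conversion <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                         filters = "ensembl_gene_id",
                         values = rownames(filtered_counts),
                         mart = ensembl)
  saveRDS(id_conversion, conversion_file)
} else {
  id_conversion <- readRDS(conversion_file)
}
# merge the conversion table with the count data by Ensembl gene ID
annotated_counts <- merge(id_conversion, filtered_counts,
                          by.x = "ensembl_gene_id", by.y = "row.names", all.y = TRUE)

If the Ensembl IDs carry a version suffix (the part after the dot), strip it before matching.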

To do

  • Download the GEOquery package for accessing GEO data (Gene Expression Omnibus, which now has microarray & RNA-seq data) (last lecture)
  • getGEOSuppFiles('mydata') (sketch below). Put up your data on the wiki
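
A minimal sketch with GEOquery; the accession is a placeholder, not my actual dataset:

library(GEOquery)
gse_accession <- "GSExxxxx"                       # placeholder accession
gse <- getGEO(gse_accession, GSEMatrix = FALSE)   # series and platform metadata
supp_files <- getGEOSuppFiles(gse_accession)      # downloads supplementary files into ./GSExxxxx/
rownames(supp_files)                              # paths to the downloaded files (often the count matrix)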

7. Types of Expression Data Lecture

Types of Expression data

  • Gene expression data (not necessarily the gene that we're measuring, rather the mRNA)
  • Genomics: lots of genes rather than a handful. But how do we turn this into something meaningful?
  • Normalization & Filtering
  • Filter out artefacts in your dataset (genes with low counts)
  • Read depth/library size: account for differences in sequencing depth and for genes that are shorter or longer.
  • Trimmed Mean of M-values (TMM, a normalization method for RNA-seq data)
  1. Microarray expression data (being replaced by RNA-seq)
  • A chip in a well with probes that match the mRNA
  • Compare expression between disease/not disease, for example
  • You can use multiple dyes, multiple probes for a given gene. Computational analysis involves collapsing the probes.
  • The chip needed to have the probe in order for you to detect the mRNA
  • Basic pipeline: isolate mRNA, enrich mRNA, hybridize, stain, scan.
  2. Bulk RNA-seq expression
  • Sequencing not the whole genome, but "enough" of it (the expressed transcripts)
  • Pipeline: extract mRNA, enrich mRNA, fragment it, (find the unambiguous transcripts that uniquely identify your gene), sequence, generate FASTQ files, run them through an aligner. Count how many reads map to each gene – but the longer the gene, the higher the chance of finding fragments of it, so normalization is important.
  • Many different platforms, but mostly from Illumina
  • Different ways to do reads: short reads, long reads, direct reads
  • Most common: Short reads
  • Parameters to consider:
    • No. of samples (minimum 6: 3 control, 3 disease. Helps reduce the influence of outliers. You may not get a strong signal with few samples.)
    • Sample prep
    • Read depth: target no. of sequence reads for each sample (10-30 million generally)
    • Read length
    • Single vs. paired-end reads: important when you're looking for isoforms, or something very specific. Single-end is cheaper and more common
  • IMPORTANT: Note the reference genome! The aligned, processed files will be available, because alignment is a computationally intensive step.
  • Tools relying on a reference genome: TopHat, STAR, HISAT
  • Quantification: identify the no. of transcripts per gene. Tools: RSEM, Cufflinks, MMSeq, HTSeq (popular).
  • Some methods try to account for the fact that long genes show many reads. Examples: RPKM/FPKM (reads or fragments per kilobase of transcript per million mapped reads), TPM (transcripts per million), or raw counts (more common; do normalization as a follow-up instead of embedding it in the raw data. HTSeq uses raw counts.) IMPORTANT: Make sure you know what your base data is! (See the sketch after this list for how RPKM and TPM differ.)
  3. Single-cell RNA-seq expression data (very hard to analyze on a personal computer)

  4. Protein expression data using mass spectrometry
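
Since the quantification units above keep coming up, here is a minimal sketch of the RPKM and TPM calculations on a made-up count matrix with made-up gene lengths (edgeR also provides cpm() and rpkm() helpers):

# made-up counts (3 genes x 2 samples) and gene lengths in kilobases
toy_counts <- matrix(c(10, 20, 1000, 100, 200, 10000), nrow = 3,
                     dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2")))
gene_length_kb <- c(geneA = 0.5, geneB = 2, geneC = 10)

# RPKM: divide by library size (in millions) first, then by gene length (in kb)
rpkm_manual <- t(t(toy_counts) / colSums(toy_counts)) * 1e6 / gene_length_kb
# TPM: divide by gene length first, then rescale each sample to sum to one million
rate <- toy_counts / gene_length_kb
tpm_manual <- t(t(rate) / colSums(rate)) * 1e6
colSums(tpm_manual)   # TPM columns always sum to 1e6, which makes samples easier to compare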

Next steps

  • Choose an expression dataset
  • What is a good dataset? That's a tricky thing. Good depth and a good experiment, but also of interest to you
  • Look at the data in GEO

6. R Basics

  • .Rprofile: a special R script that is executed automatically on startup. R expects to find it in the user's home directory.
  • You can include setwd(), libraries, functions you've defined, etc.
  • Workspace: info about the objects you create in R is stored here. Saving a Workspace is not a great idea, since you might save rubbish, corrupted objects, etc. INSIGHT
  • Make good scripts so you can recreate the objects you need.
  • If an object was expensive or time-consuming to compute, you can save() it and later load() it explicitly: save(your_object, file = "your_file_name.RData"). See the sketch below.
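
A minimal sketch of that pattern; the object and file names are placeholders:

result_file <- "expensive_result.RData"      # placeholder file name
if (file.exists(result_file)) {
  load(result_file)                          # restores expensive_result into the workspace
} else {
  expensive_result <- Sys.time()             # stand-in for a long-running computation
  save(expensive_result, file = result_file)
}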

5. Setting up Docker: second try

Tried logging in again: same errors as in "1. Setup and Installing Docker". Tried deleting and re-downloading the containers: a new error message appeared: "The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested".

This video by Execute Automation explained it quite well.

  • Apple with M1 chip (my current Macbook) can only use containers with Arm64 architecture, and the image for this course is an amd64 build.
  • RStudio also has problems running with the right architecture, according to Prof. Isserlin on the course discussion forum.

Conclusion: use RStudio on your laptop directly

and just check that your notebook knits intermittently with the command:

docker run --rm -it -v "$(pwd)":/home/rstudio/projects --user rstudio risserlin/bcb420-base-image:winter2022-arm64 /usr/local/bin/R -e "rmarkdown::render('/home/rstudio/projects/name_of_rmd.Rmd',output_file='/home/rstudio/projects/name_of_html.html')" > processing_output_filename

Next steps: Actually try out this command.

4. Bioinfo Basics

Objective: Time estimated: Time taken XXX h

Progress

4.1 Abstractions

Time estimated: 20 min, started 19:22, 24 Jan 2022 Time taken: 20 min, ended 19:45, 24 Jan 2022 Material adapted from Abstractions, A Bioinformatics Course, Boris Steipe

4.1.1 Abstractions Notes

  • What is an abstraction: creating concepts which map to biological entities in a meaningful way, representable on a computer
  • Examples of abstractions:
  • representation of a molecular property (sequence, 3D coordinates)
  • description of a function/role (transcription factor, enzyme)
  • abstract label (gene name, protein name, etc.)
  • relationships (node/edge graphs)
  • Problems with abstractions
  • not rich enough to capture property of interest
  • ambiguous abstraction
  • non-unique abstraction
  • abstraction not stable over time
  • To structure an abstraction: define labels & structure relationships
  • Labels: must be unique to the object they describe BUT everyday language doesn't really do that. Two approaches arise for controlled vocabularies.
  1. Numerically controlled vocabularies
  • number uniquely represents the thing (atomic number for Hydrogen = 1)
  • unique, abstract, and essentially arbitrary labels are called identifiers
  2. Synonym constrained vocabularies

use only one form of the string in the database (define labels, have a system to accept/reject accordingly, and make the abstraction known to users)

Ontology: a set of terms (nucleus, cell, muscle, dendritic cell, etc.) + relationships (has, is, part of, causing, etc.)

4.1.2 Abstractions Task

Time estimated: 10 min, started 19:50, 24 Jan 2022 Time taken: 4 min, ended 19:54

TFs affect other TFs, which affect protein expression; PTM-altering proteins affect TFs. Actors: TFs, other TFs, proteins, PTM-altering proteins. An abstraction: a gene regulatory network with nodes = gene labels and edges = relationships (activating/repressing). Strength of relationship: thickness of the line. This abstraction misses other players that can affect the relationships. Other abstractions: an ontology. If you focus on a particular TF, then ...

4.2. Storing data

Time estimated: 40 min, 19:55 on 24 Jan 2022. Time taken: 35 min for reading + notes, ___ min for task

4.2.1 Notes

  • Formats: text files, excel sheets (complex queries need programming, do not scale well, gene names changed to dates), R

  • Three ways of using R

  1. Read a dataset

read.table(), readLines(), scan(); other packages can parse XML or JSON and import it

  2. Make your own dataset

rbind()   # add rows to a data frame
nrow(mydataframe)
mydataframe[, "attribute I wanna know about"]
sum(mydataframe[, "attribute I wanna know about"] == "wanted_value")
save(mydataframe, file = "mydata.Rda")
rm(mydataframe)
load("mydata.Rda")

  3. Connect to a "real" database
  • Use "drivers" to connect to mySQL, MariaDB, Neo4j, etc. Install extra software on your computer
  • guaranteed integrity, multi-user support (aka concurrency), industry-level performance, easy to scale, ACID transactional guarantee. What's ACID? Atomicity, Consistency, Isolation, Durability
  • Atomicity = all or nothing transactions
  • Consistency = any transaction brings the database from one valid state to another
  • Isolation = concurrent execution results in the same state as if the transactions had been executed serially
  • Durability = a committed transaction remains permanently committed.
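
The notes above mention MySQL/MariaDB/Neo4j via drivers; as a stand-in illustration, here is a minimal sketch using the DBI package with SQLite (no separate server needed). The table and column contents are made up:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "genes.sqlite")   # file-backed database, created if missing
dbWriteTable(con, "genes",
             data.frame(symbol = c("TP53", "MUC16"), chromosome = c("17", "19")),
             overwrite = TRUE)
dbGetQuery(con, "SELECT symbol FROM genes WHERE chromosome = '19'")
dbDisconnect(con)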

Note: Broken link on Task 2: linking to storing data for bioinformatics (https://bcb420-2022.github.io/Bioinfo_Basics/storing-data.html)

Conclusion & Outlook

  • What I learned

Next steps:

Footnotes & References

Material adapted from

3. R Basics

Objective: Use R with Docker and work through any gaps in my knowledge that come up using the tasks in the Modules. Time estimated: 4 hours, started 11.00, 24 Jan 2022

3.1 Introduction

  • Bioconductor Project for molecular biology data
  • Using Docker: what Docker is; containers & why we like them; images; Docker volumes; create a notebook in Docker; add commands to the notebook and your observations (time estimated 20 min)

Hello Hello Hello

2. Basic Course Prep

Objective: make notes for the Bioinfo Prep Bookdown to refer to later, and thus prep for the upcoming quiz. Sections to work through: Course Journal, Insights, Plagiarism & Academic Integrity, Data Backup, Network Etiquette, Technical Questions, Info Sources

Time estimated: 2 h, 2022-01-23 Time taken:

Tasks Remaining:

  • Add category tag to insights page
  • Plagiarism model references + footnote
  • Data Backup Task

Progress

2.1. Course Journal

2.2. Insights

  • Link to a new insights! page as a subpage of user page on the Student Wiki
  • Create page in the correct namespace, not in the main space of the Wiki
  • Add category tag
  • Insights template can include: title, context, insight, date

2.3. Plagiarism & Academic Integrity

Time estimated: 20 minutes, started on 21:10, 2022-01-23 Time elapsed: 40 min (20 minutes extra to read linked sources), ended on 22:40, 2022-01-23

  • Contextualize your attributions: inspired by, based on, according to, following, see also, etc.
  • What's a "text tag" to organize citations?
  • If you're citing code, add the citation in a comment
  • Link to the original source; the URL alone isn't enough
  • Use StackOverflow! But link the post and author
  • Use APA Citation format:
  • Falsifying code output is called "concoction" and is a BAD IDEA
  • Creating a footnote is proving difficult. Are they self-updating? Or are we hard-coding numbers in superscript, creating a footnotes section, and calling them footnotes? I have spent 10 minutes on this. 1
  • Use in-text citation and bibliography full citation
  • Model references:
  • a procedure in the methods section of a journal article, as you would cite it in a technical report;
  • a piece of code you found in a StackOverflow article, as you would put it as a comment into computer code;
  • some contents from a classmate’s journal that you incorporate into your own journal.

2.4. Data Backup Best Practice

  • Time estimated: 20 minutes, started 22:40, 2022-01-23 Time taken: 12 minutes SO FAR. TASK STILL INCOMPLETE
  • ...I've never backed up my computer hard drive
  • macOS uses Time Machine / Time Capsule
  • My reservations about backing up my computer: I have many things on my computer that I have access to on my Drive or Cloud, so it feels unnecessary to back them up in yet another place. Although having multiple backups is probably a good idea.
  • Task: Decide on a backup strategy for your computer, Implement your strategy, Create a test file, Backup your computer, Delete your test file, Recreate the file from your last backup. (12 min remaining)

2.5. Netiquette

Time estimated: 20 min Time taken: 5 min

  • Informative subject lines! Not "XYZ doesn't work"
  • No thread hijacking, new question = new thread.
  • No screenshots
  • No need to address by name unless you're responding to something specific
  • Share the resolution of your issue (what worked/didn't). It's nice and allows for archiving of the thread.

2.6. Technical Questions

Time estimated: 20 min Time taken: 45 min

2.6.1 How To Ask Questions The Smart Way By Eric Steven Raymond Time expected: 15 minutes, 23:18 on Jan 23, 2022 Time taken: 30 minutes, 23:48 on Jan 23, 2022

  • RTFM: Read the F*** Manual
  • STFW: Search the F*** Web

2.6.2 How to Create a Minimal Reproducible Example

Time expected: 10 min, 23:10 on Jan 23,2022 Time taken: 5 minutes

  • Provide minimal code that is understandable, complete, and can reproduce the problem
  • Use spaces, not tabs, to create indentation (since tabs may not be correctly formatted on Stack Overflow)
  • Give wording of error message + which line produces it
  • Eliminate other errors that aren't relevant

2.6.3 How to ask good questions that prompt useful answers

Time expected: 10 min, 22:56 on Jan 23, 2022 Time taken = 10 min, completed 23:07 on Jan 23, 2022

  • Statistics questions: R mailing lists aren't quite the right place, although a well-asked and interesting question can get an answer.
  • Use net groups sci.stat.consult (applied statistics and consulting) and sci.stat.math (mathematical stat and probability).
  • Choosing the right mailing list: depends on content, type of query, and platform (like Mac-related bugs go to R-sig-Mac)
  • help.search("keyword"), apropos("keyword"), RSiteSearch("keyword")
  • An unexpected behaviour? Copy paste output from sessionInfo() and consider Sys.getlocale()
  • Refer to An Introduction to R if you need help

1. Setup and Installing Docker

Objective: Complete task 1 for BCB420 due 2022/01/14.

Estimated duration: 3 hours Taken: 4 hours

Procedure

This includes

  1. Start course journal on your repo wiki. Done
  2. Add links to your wiki and repo to the main Student Wiki page Done
  3. download and install docker Done
  4. create a new image and container from the bcb420 Dockerfile Done, just the notebook has been a problem
  5. Document your progress as a new entry in your journal Done

Process

  • I haven't used GitHub in nearly 1.5 years, getting familiar takes time.

  • Accidentally forked the Student Wiki – delete it when you are certain you won't delete anything else accidentally.

  • Started course journal. URL contains Inika_Prasad so it seems to be in my userspace.

  • Tried editing the Student Wiki, and was unable to. Noticed that the invitation for doing so was separate from the one given for the course. Editing was successful thereafter.

  • Followed Install Docker. It was an intuitive process.

  • Creating the image and container from the bcb420 Dockerfile. Attempted following the instructions on the R Basics Wiki. Error message: zsh: permission denied: /Users/inikaprasad/Desktop/BCB420. Cannot gain access to the file; may have to change system permissions. In the meantime, creating the image using the following code from R Basics:

docker run -e PASSWORD=changeit --rm \
  -v "$(pwd)":/home/rstudio/projects \
  -p 8787:8787 \
  risserlin/bcb420-base-image:winter2022

  • Create first notebook using Docker. Reached the login page, but upon entering the username and password, the following error message appeared: "Could not connect to the R Session on the RStudio Server. Unable to connect to service (1)"

  • Troubleshooting: installing Rosetta improves performance for Docker according to the Setup Guide: softwareupdate --install-rosetta in Terminal

I just tried restarting Docker and there are far more containers there than originally. Possible that it just takes a little longer than I expected. There are 4 containers running, amongst which this seems to be the right one: intelligent roentgen risserlin/bcb420-base-image:winter2022 Port 8787

Comes with the warning "Image may have poor performance, or fail, if run via emulation"

Clicking "Open in Browser" gives the same error message. The port is correct (8787) and the container is running.

Next steps:

Try making a notebook. Perhaps restarting the computer will help with performance.

0. Journal Template

Objective: Time estimated: Time taken XXX h

Progress

Task 1:

Task 2:

Conclusion & Outlook

  • What I learned

Next steps:

Footnotes & References

Material adapted from
