Entry 5: Bioinformatics Basics - bcb420-2025/Izumi_Ando GitHub Wiki

Chapter 2 : Abstractions for Bioinformatics

Estimated time for completion: 15 mins
Actual time taken for completion: 30 mins

Windsor knot example : Fink and Mao abstracted tie tying into a "walk" on a triangular lattice, which let them enumerate and compute all possible ways a knot can be tied.
To make abstractions in biology, we must rigorously define 1) representations, 2) semantics, 3) operations, and 4) metrics of biological data.
Example of abstraction in biology -> SEQ as an abstraction of a gene or protein. (more examples below taken from slides)

Note that as computational biologists, we often work with the REPRESENTATION, not the actual BIOLOGICAL ENTITY.
Common problems with abstractions 1) not rich enough, 2) ambiguity, 3) not unique, 4) not stable across time
Ontologies : a set of terms from a controlled vocabulary + set of relationships between those terms.
Controlled Vocabulary : numerically OR synonym controlled, labels must be unique
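
A minimal sketch of the ontology definition above: a controlled vocabulary (numeric IDs mapped to unique labels) plus a set of relationships between the terms. The term IDs, labels, and relationship types here are made up for illustration and are not taken from any real ontology.

```python
# Hypothetical toy ontology; IDs, labels and relations are illustrative only.
# Controlled vocabulary: each numeric ID maps to exactly one label.
terms = {
    "T:0001": "biological process",
    "T:0002": "gene expression",
    "T:0003": "transcription",
}

# Relationships between terms, stored as (child, relation, parent) triples.
relationships = [
    ("T:0002", "is_a", "T:0001"),
    ("T:0003", "part_of", "T:0002"),
]


def labels_are_unique(vocabulary):
    """Check the controlled-vocabulary rule that labels must be unique."""
    labels = list(vocabulary.values())
    return len(labels) == len(set(labels))


def ancestors(term_id, triples):
    """Collect every term reachable by following relationships upward."""
    found = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        for child, _relation, parent in triples:
            if child == current and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


transcription_ancestors = ancestors("T:0003", relationships)
```

Here `transcription_ancestors` contains both parent terms, which is the kind of question ("what broader categories does this term belong to?") that the relationships make answerable.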

Task 1

Propose an abstraction for the functional relationships between TFs with other TFs and other proteins in regulation of gene expression. Note that a common abstraction for this is a "gene-regulatory network".

I would propose a gene-regulatory network in which each gene (node) also carries a binary weight recording whether it is expressed or present in the context being studied. This captures the overall expression state of the cell, not just the relationships between the actors.
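
One possible way to sketch this proposal as a data structure, assuming a directed graph with an "activates"/"represses" label on each edge and a boolean expression flag on each gene. The class name, method names, and gene names (TF_A, TF_B, GENE_X) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RegulatoryNetwork:
    # gene name -> True if the gene is expressed in the context being studied
    expressed: dict = field(default_factory=dict)
    # (regulator, target) -> "activates" or "represses"
    edges: dict = field(default_factory=dict)

    def add_gene(self, name, is_expressed):
        self.expressed[name] = bool(is_expressed)

    def add_regulation(self, regulator, target, effect):
        if effect not in ("activates", "represses"):
            raise ValueError(f"unknown effect: {effect}")
        if regulator not in self.expressed or target not in self.expressed:
            raise KeyError("add both genes before linking them")
        self.edges[(regulator, target)] = effect

    def active_regulations(self):
        """Return only the edges whose regulator is expressed in this context."""
        return {
            pair: effect
            for pair, effect in self.edges.items()
            if self.expressed[pair[0]]
        }


# Hypothetical example: TF_A is expressed, TF_B is not.
network = RegulatoryNetwork()
network.add_gene("TF_A", True)
network.add_gene("TF_B", False)
network.add_gene("GENE_X", True)
network.add_regulation("TF_A", "GENE_X", "activates")
network.add_regulation("TF_B", "GENE_X", "represses")
active = network.active_regulations()
```

The binary flag is what separates this from a plain relationship graph: `active` keeps only the TF_A edge, because TF_B is not expressed in this context.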

Chapter 3 : Storing Data

Estimated time for completion: 30 mins
Actual time taken for completion: 40 mins

Text files : difficult to perform operations
Spreadsheets : data in one place, easy to edit, good for quick analysis BUT complex queries require programming & does not scale well
R : data is kept in data frames (dfs) and lists, very easy to do analysis, can be stored in multiple forms including databases. Requires installation of extra software but otherwise robust.
MySQL and friends : free relational databases based on a client-server model (the server responds to requests from clients), used across various institutions. Reasons to use this over R: scalability, concurrency (multiple people can access the data simultaneously), and ACID compliance.

ERD = Entity Relationship Diagram, similar to OOP. Each entity has a unique primary key, and relationships between entities have a cardinality, e.g. 1 to (0, n).
Think of entities as a df, with each attribute being a column.
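
A small sketch of these ideas using Python's built-in sqlite3 module (chosen here only because it needs no server; it is not the MySQL setup from the course). Each entity becomes a table with a primary key, each attribute a column, and a foreign key expresses a 1 to (0, n) relationship. The table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Entity 1: gene. The primary key uniquely identifies each row.
conn.execute(
    "CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT NOT NULL UNIQUE)"
)
# Entity 2: transcript. The foreign key links each transcript to exactly one
# gene, while a gene may have zero or many transcripts: 1 to (0, n).
conn.execute(
    """CREATE TABLE transcript (
           transcript_id INTEGER PRIMARY KEY,
           gene_id INTEGER NOT NULL REFERENCES gene(gene_id),
           name TEXT NOT NULL
       )"""
)

conn.executemany(
    "INSERT INTO gene (gene_id, symbol) VALUES (?, ?)",
    [(1, "GENE_A"), (2, "GENE_B")],
)
conn.executemany(
    "INSERT INTO transcript (transcript_id, gene_id, name) VALUES (?, ?, ?)",
    [(10, 1, "GENE_A-1"), (11, 1, "GENE_A-2")],
)

# Count transcripts per gene; LEFT JOIN keeps genes that have none.
counts = conn.execute(
    """SELECT g.symbol, COUNT(t.transcript_id)
       FROM gene g LEFT JOIN transcript t ON t.gene_id = g.gene_id
       GROUP BY g.gene_id
       ORDER BY g.gene_id"""
).fetchall()
```

In this sketch GENE_A has two transcripts and GENE_B has none, which are the "n" and "0" ends of the cardinality. Inserting a transcript that points at a nonexistent gene_id is rejected by the foreign key.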

Task 2 - skipped as the link was broken

Task 3 - done on local machine

Chapter 4 : Bioinformatics Databases

Estimated time for completion: 30 mins
Actual time taken for completion: X mins

Introductory Notes

"Database" can refer to the data stored or the layered system to maintain & organize the data.
Each category of users interfaces with the database in a different way: regular users via the web, applications via an API, other information systems (e.g. other databases) via libraries, and the database admin (DBA) via the console.
Query System : brain 🧠 of the database. translates interaction requests into commands for the storage engine.
Storage management system : heart 🫀 of the system. carries out transactions of the data (adding, modifying, retrieving, deleting etc). must make sure transactions between multiple users do not interfere with one another. -> achieved by implementing ACID requirements
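
To make the idea of a transaction concrete, here is a hedged sketch using Python's built-in sqlite3 module (an assumption for illustration, not the storage engine discussed in the course). A move between two rows is written as two updates inside one transaction, so either both apply or neither does. The table, labels, and the `move_samples` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample_count ("
    "label TEXT PRIMARY KEY, n INTEGER NOT NULL CHECK (n >= 0))"
)
conn.executemany(
    "INSERT INTO sample_count VALUES (?, ?)",
    [("freezer_a", 5), ("freezer_b", 0)],
)
conn.commit()


def move_samples(connection, source, target, amount):
    """Move samples between rows as one all-or-nothing transaction."""
    try:
        # The connection as a context manager commits if the block succeeds
        # and rolls back every statement in it if an exception is raised.
        with connection:
            connection.execute(
                "UPDATE sample_count SET n = n + ? WHERE label = ?",
                (amount, target),
            )
            connection.execute(
                "UPDATE sample_count SET n = n - ? WHERE label = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK (n >= 0) constraint failed, so nothing was applied.
        return False


def current_counts(connection):
    return dict(connection.execute("SELECT label, n FROM sample_count"))


failed = move_samples(conn, "freezer_a", "freezer_b", 10)  # more than available
after_failed = current_counts(conn)
succeeded = move_samples(conn, "freezer_a", "freezer_b", 3)
after_success = current_counts(conn)
```

In the failed call the first update (adding to freezer_b) has already run when the second one violates the constraint, yet `after_failed` shows the original counts because the whole transaction is rolled back.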

ACID REQUIREMENTS
ATOMICITY : transactions are NOT divisible AND always either completely succeed or completely fail
CONSISTENCY :