Entry 2.3: Bioinformatics Basics - bcb420-2025/Chloe_Calica GitHub Wiki
Abstraction
- Abstracting biology means creating concepts which map to biological entities in a meaningful way such that they can be represented in a computer
- An abstraction is not the biology itself, but its value lies in our ability to map abstract entities back to biological entities.
- To define an abstraction, need to ask:
- what is fundamental and essential?
- what can be stripped away?
- To make biology computable, need to rigorously define our system of objects, their categories, and relationships i.e.
- representations of biological data
- semantics of data
- operations with data entities
- metrics of operations
- Working with abstractions implies we are no longer manipulating the biological entity, but its representation.
- Common problems with abstractions include that they:
- may not be rich enough to capture the property we are investigating
- may be ambiguous
- may not be unique
- are not stable over time, so cross-references to an old label may no longer be valid
Examples of abstractions
- representation of a molecular property: nucleotide/amino acid sequence, 3D coordinates
- description of a function or role: transcription factor, checkpoint control element
- abstract label: gene name, protein name
Commonly Used Abstractions
Biological Entity | Abstraction | Theoretical Domain | Database |
---|---|---|---|
Polymer | 20-letter AA code | String Processing | GenBank, GenPept, RefSeq |
Molecular Conformation | XYZ coordinates, matrices | Floating point processing, linear algebra | PDB |
Molecular Interactions | Node-Edge Graphs | Networks, Graph Theory | STRING, IntAct |
Function | Ontology, DAG | Networks, Graph Theory | Gene Ontology |
Taxonomy | Hierarchy | Database Methods, Graph Theory | NCBI/EBI/DDBJ, Taxon |
Evolutionary Relationship | Tree | Graph Theory, combinatorics | TreeBase |
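
To make the table concrete, here is a minimal sketch of two of its rows as Python data structures: a polymer as a string over the 20-letter amino acid alphabet, and molecular interactions as a node-edge graph. The function names and the toy data are assumptions for illustration only and are not taken from any of the databases listed.

```python
from collections import defaultdict

# Polymer -> string over the 20 standard one-letter amino acid codes
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid_protein(seq: str) -> bool:
    """Return True if seq is non-empty and uses only the 20 standard codes."""
    return len(seq) > 0 and set(seq.upper()) <= AMINO_ACIDS

# Molecular interactions -> undirected node-edge graph stored as adjacency sets
def build_graph(edges):
    """Build an undirected graph from an iterable of (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

# Hypothetical toy data
toy_seq = "MKTAYIAKQR"
toy_edges = [("A", "B"), ("B", "C")]

print(is_valid_protein(toy_seq))              # True
print(sorted(build_graph(toy_edges)["B"]))    # ['A', 'C']
```

Once the entity is a string or a graph, the operations available are those of the theoretical domain (string processing, graph theory), which is the point of the third column.
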
Structuring Abstractions
To structure an abstraction, need to define labels and structure relationships.
- Labels = controlled vocabularies: unique to the object they describe
- Numerically controlled vocabularies
- number is defined that uniquely represents the item
- Ex. number of protons in an element
- requires unique numbers exist or can be computed (ideal solution)
- Synonym controlled vocabularies
- use only one form of the string in a database
- Ex. if "calcium" then not "Ca", "Calcium" etc.
- Requires that labels are defined and that a mechanism exists to accept/reject such instances
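
A minimal sketch of both kinds of controlled vocabulary, assuming a small hand-written synonym table. The names `SYNONYMS`, `ATOMIC_NUMBER`, and `normalize_label` are hypothetical; the accept/reject mechanism here is simply a dictionary lookup that raises an error for unknown labels.

```python
# Synonym controlled vocabulary: every accepted variant maps to one canonical form.
# Hypothetical table for illustration.
SYNONYMS = {
    "calcium": "calcium",
    "ca": "calcium",
    "ca2+": "calcium",
}

# Numerically controlled vocabulary: the number of protons uniquely identifies an element.
ATOMIC_NUMBER = {"hydrogen": 1, "carbon": 6, "calcium": 20}

def normalize_label(raw: str) -> str:
    """Map a raw label to its canonical form, rejecting anything not in the vocabulary."""
    key = raw.strip().lower()
    if key not in SYNONYMS:
        raise ValueError(f"label not in controlled vocabulary: {raw!r}")
    return SYNONYMS[key]

print(normalize_label("Ca"))                        # calcium
print(ATOMIC_NUMBER[normalize_label("Calcium")])    # 20
```

Only the canonical string is ever stored, so "Ca" and "Calcium" cannot end up as two different entries in the database.
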
Ontologies
- Ontology is a set of terms from a controlled vocabulary and the set of relationships between them.
- represent knowledge bases and define the semantics of the domain
- useful for exchanging data between databases: mapping the meaning when structure can't be mapped.
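
As a rough illustration of "terms plus relationships", an ontology can be sketched as a directed acyclic graph in which each term points to its parents. The terms and the single `is_a` relationship below are a hypothetical toy example, not actual Gene Ontology content.

```python
# Hypothetical toy ontology: term -> set of parent terms (an "is_a" relationship)
IS_A = {
    "transcription factor": {"DNA-binding protein", "regulatory protein"},
    "DNA-binding protein": {"protein"},
    "regulatory protein": {"protein"},
    "protein": set(),
}

def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following is_a links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("transcription factor")))
# ['DNA-binding protein', 'protein', 'regulatory protein']
```

Because a term can have more than one parent, the structure is a DAG and not a tree, which matches the "Ontology, DAG" row in the table of commonly used abstractions.
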
Protein Identifiers
- UniProt
- RefSeq - identifiers specify in the name what they are: a "P" in the prefix (e.g. NP_) indicates a protein
- HUGO Gene ID / HGNC - only for genes
- GenBank
- Ensembl - identifiers also specify in the name what they are: T=transcript, P=protein, G=gene
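
The naming conventions above can be sketched as a small classifier that reads the molecule type from the identifier itself. This is a simplified sketch under stated assumptions: it covers only the Ensembl G/T/P feature letters and a handful of common RefSeq prefixes, the function `classify_identifier` is hypothetical, and the example identifiers are made-up strings in the right format, not real records.

```python
import re

# Ensembl stable IDs: "ENS", optional species letters, a feature letter, digits,
# and an optional ".version" suffix. Only G/T/P are handled here.
ENSEMBL_RE = re.compile(r"^ENS[A-Z]*?([GTP])\d+(?:\.\d+)?$")
ENSEMBL_FEATURES = {"G": "gene", "T": "transcript", "P": "protein"}

# A few common RefSeq accession prefixes (not an exhaustive list)
REFSEQ_PREFIXES = {
    "NP": "protein", "XP": "protein",
    "NM": "transcript", "XM": "transcript",
    "NR": "non-coding RNA", "XR": "non-coding RNA",
    "NC": "genomic", "NG": "genomic",
}

def classify_identifier(identifier: str) -> str:
    """Guess the entity type from an Ensembl or RefSeq style identifier."""
    match = ENSEMBL_RE.match(identifier)
    if match:
        return ENSEMBL_FEATURES[match.group(1)]
    prefix, sep, _ = identifier.partition("_")
    if sep and prefix in REFSEQ_PREFIXES:
        return REFSEQ_PREFIXES[prefix]
    return "unknown"

# Made-up identifiers in the right format
print(classify_identifier("ENSG00000000001"))     # gene
print(classify_identifier("ENSMUSP00000000001"))  # protein
print(classify_identifier("NP_000001.1"))         # protein
```

UniProt accessions and HGNC IDs follow other conventions and fall through to "unknown" in this sketch.
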