Entry 2.3: Bioinformatics Basics - bcb420-2025/Chloe_Calica GitHub Wiki

Abstraction

  • Abstracting biology means creating concepts which map to biological entities in a meaningful way such that they can be represented in a computer
    • An abstraction is not the biology, but its value lies in our ability to map abstract entities back to biological entities.
    • To define an abstraction, need to ask:
      • what is fundamental and essential?
      • what can be stripped away?
  • To make biology computable, we need to rigorously define our system of objects, their categories, and their relationships, i.e.:
    • representations of biological data
    • semantics of data
    • operations with data entities
    • metrics of operations
  • Working with abstractions implies we are no longer manipulating the biological entity, but its representation.
  • Common problems with abstractions include that they:
    • may not be rich enough to capture the property we are investigating
    • may be ambiguous
    • may not be unique
    • are not stable over time, so cross-references to an old label may no longer be valid

Examples of abstractions

  • representation of a molecular property: nucleotide/amino acid sequence, 3D coordinates
  • description of a function or role: transcription factor, checkpoint control element
  • abstract label: gene name, protein name

Commonly Used Abstractions

| Biological Entity | Abstraction | Theoretical Domain | Database |
|---|---|---|---|
| Polymer | 20-letter AA code | String Processing | Genbank, GenPept, Refseq |
| Molecular Conformation | XYZ coordinates, matrices | Floating point processing, linear algebra | PDB |
| Molecular Interactions | Node-Edge Graphs | Networks, Graph Theory | STRING, IntAct |
| Function | Ontology, DAG | Networks, Graph Theory | Gene Ontology |
| Taxonomy | Hierarchy | Database Methods, Graph Theory | NCBI/EBI/DDBJ, Taxon |
| Evolutionary Relationship | Tree | Graph Theory, combinatorics | TreeBase |
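
As a rough illustration of the first and third rows of the table, the sketch below treats a protein as a string over the 20-letter amino acid alphabet and a set of interactions as a node-edge graph. The function names, the peptide, and the protein labels `P1` to `P3` are hypothetical and used only for illustration.

```python
# The 20 standard one-letter amino acid codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid_protein(seq: str) -> bool:
    """Return True if seq is non-empty and uses only the 20 standard codes."""
    return len(seq) > 0 and set(seq.upper()) <= AMINO_ACIDS


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


# Hypothetical toy data, not taken from any database.
peptide = "MKTAYIAKQR"
interactions = [("P1", "P2"), ("P2", "P3")]
graph = build_graph(interactions)
```

Once the data is in these forms, string-processing and graph-theory operations apply to the representation rather than to the biological entity.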

Structuring Abstractions

To structure an abstraction, we need to define labels and structure relationships.

  • Labels = controlled vocabularies: unique to the object they describe
    • Numerically controlled vocabularies
      • number is defined that uniquely represents the item
      • Ex. number of protons in an element
      • requires unique numbers exist or can be computed (ideal solution)
    • Synonym controlled vocabularies
      • use only one form of the string in a database
      • Ex. if "calcium" then not "Ca", "Calcium" etc.
      • requires that labels are defined and that a mechanism exists to accept or reject new variants
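
A minimal sketch of a synonym-controlled vocabulary, assuming a hand-written synonym table: every accepted variant maps to a single canonical label, and anything not in the table is rejected. The table contents and the name `normalize_label` are hypothetical.

```python
# Hypothetical synonym table: each accepted variant maps to one canonical label.
SYNONYMS = {
    "calcium": "calcium",
    "ca": "calcium",
    "ca2+": "calcium",
}


def normalize_label(raw: str) -> str:
    """Map a raw label to its canonical form, or reject it."""
    key = raw.strip().lower()
    if key not in SYNONYMS:
        # The accept/reject mechanism: unknown labels never enter the database.
        raise ValueError(f"label not in controlled vocabulary: {raw!r}")
    return SYNONYMS[key]
```

With this table, "Ca" and "Calcium" are both stored as "calcium", while an unlisted string raises an error instead of creating a new label.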

Ontologies

  • Ontology is a set of terms from a controlled vocabulary and the set of relationships between them.
  • represent knowledge bases and define the semantics of the domain
  • useful for exchanging data between databases: mapping the meaning when structure can't be mapped.
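
To make the "terms plus relationships" definition concrete, here is a small sketch of an ontology stored as a directed acyclic graph of "is_a" edges. The terms, the edges, and the `ancestors` helper are hypothetical and do not reproduce any real ontology such as Gene Ontology.

```python
# Hypothetical toy ontology: each term maps to its direct "is_a" parents.
IS_A = {
    "transcription factor": {"DNA-binding protein"},
    "DNA-binding protein": {"protein"},
    "kinase": {"enzyme"},
    "enzyme": {"protein"},
    "protein": set(),
}


def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following is_a edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Walking the edges is what lets two databases with different structures agree that a "transcription factor" record is also a "protein" record.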

Protein Identifiers

  • UniProt
  • RefSeq - the accession prefix specifies the molecule type, so a "P" in the prefix (e.g. NP_, XP_) indicates a protein
  • HUGO Gene ID / HGNC - only for genes
  • Genbank
  • Ensembl - the ID also specifies the type of entity: T = transcript, P = protein, G = gene (e.g. ENST, ENSP, ENSG)
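
The naming conventions above can be sketched as a small classifier. This is an illustration under assumptions, not a complete validator: it assumes Ensembl stable IDs of the form "ENS", an optional species prefix, a G/T/P letter, and 11 digits, and it covers only a subset of RefSeq prefixes. The function name `classify_identifier` and the lookup tables are hypothetical.

```python
import re

# Assumed Ensembl pattern: ENS + optional species letters + G/T/P + 11 digits,
# with an optional ".version" suffix.
ENSEMBL = re.compile(r"^ENS(?:[A-Z]{3,4})?([GTP])\d{11}(?:\.\d+)?$")
ENSEMBL_TYPES = {"G": "gene", "T": "transcript", "P": "protein"}

# RefSeq pattern: two-letter prefix, underscore, digits, optional ".version".
REFSEQ = re.compile(r"^([A-Z]{2})_\d+(?:\.\d+)?$")
# Subset of RefSeq prefixes, chosen for illustration.
REFSEQ_TYPES = {
    "NP": "protein",
    "XP": "protein",
    "NM": "transcript",
    "XM": "transcript",
}


def classify_identifier(identifier: str):
    """Return (database, entity type) guessed from the identifier's format."""
    match = ENSEMBL.match(identifier)
    if match:
        return ("Ensembl", ENSEMBL_TYPES[match.group(1)])
    match = REFSEQ.match(identifier)
    if match and match.group(1) in REFSEQ_TYPES:
        return ("RefSeq", REFSEQ_TYPES[match.group(1)])
    return ("unknown", "unknown")
```

Identifiers that do not encode their type in the name, such as UniProt accessions, fall through to "unknown" here and would need a database lookup instead.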