Homologies - WormBase/db-prototypes GitHub Wiki

The "homol" mechanism in ACeDB is used to represent both sequence alignments (e.g. EST and protein sequences aligned to the genome) and families of "features" that can map to more than one location on the genome (e.g. motifs).

In the current model, homols can be attached either directly to a parent (typically ?Sequence or ?Protein) or indirectly via a Homol_data object. Although the schema allows both, ?Sequence homologies are all indirectly attached (with a handful of exceptions that I suspect are erroneous), while ?Protein homologies are more variable. As with Feature_data, I propose that we should no longer boxcar homologies together in Homol_data objects. Instead, each one should be a standalone lightweight feature, i.e. a Datomic entity with attributes from the locatable and homology namespaces, but without anything equivalent to an ACeDB object ID. This means that the distinction between directly and indirectly attached homologies will be lost.

Here's my first attempt at a proposed schema:

   (schema homology
    (fields
     ;;
     ;; The target of this homology.  Only one is allowed per homology entity.
     ;;
     [dna :ref
        "Sequence entity representing the target of a DNA homology."]

     [protein :ref
        "Protein entity representing the target of a peptide homology."]

     [motif :ref
        "A motif entity which is mapped to a sequence by this homology."]

     [rnai :ref
        "An RNAi entity which is mapped to a sequence by this homology."]

     [oligo-set :ref
        "An oligo-set which is mapped to a sequence by this homology."]

     [structure :ref
        "Structure-data which is mapped to a sequence by this homology."]

     [expr :ref
        "Expression-pattern which is mapped to a sequence by this homology."]

     [ms-peptide :ref
        "Mass-spec-peptide which is mapped to a sequence by this homology."]

     [sage :ref
        "SAGE-tag which is mapped to a sequence by this homology."]

     ;;
     ;; Parent sequence, parent location, and method are specified using "locatable".
     ;;

     [min :long :indexed
        "Lower bound of a half-open interval defining the extent of this homology 
         in the target's coordinate system."]
     [max :long :indexed
        "Upper bound of a half-open interval defining the extent of this homology 
         in the target's coordinate system."]
     [strand :enum [positive negative]
          "Token designating the strand or orientation of this homology on the 
           target's coordinate system. Should only be used in situations where 
           a negative-to-negative alignment would be meaningful (e.g. tblastx)"]
     [gap :string
          "Gapped alignment.  The locations of matches and gaps are encoded 
           in a CIGAR-like format as defined in 
           http://www.sequenceontology.org/gff3.shtml"]

     ;; 
     ;; Parity with legacy #Homol_info -- are these needed in the long run?
     ;;

     [target-species :ref
         "Link to target species of alignment."]

     [align-id :string
         "Alignment ID to emit in GFF dumps."]))
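
The "only one target per homology" rule is stated only in a schema comment, so it would presumably need to be checked when data is loaded. The following is a minimal sketch in plain Clojure, assuming homologies are handled as entity maps keyed by namespaced attributes such as :homology/motif. The names `target-attrs`, `homology-targets` and `valid-homology?` are hypothetical and not part of the proposal.

```clojure
;; Hypothetical helpers, not part of the proposed schema.

(def target-attrs
  "Target attributes taken from the proposed homology schema."
  #{:homology/dna :homology/protein :homology/motif :homology/rnai
    :homology/oligo-set :homology/structure :homology/expr
    :homology/ms-peptide :homology/sage})

(defn homology-targets
  "Returns the set of target attributes present on entity map m."
  [m]
  (set (filter target-attrs (keys m))))

(defn valid-homology?
  "True when m carries exactly one of the target attributes."
  [m]
  (= 1 (count (homology-targets m))))

;; One target: accepted.
(valid-homology? {:locatable/parent [:sequence/id "2L52"]
                  :homology/motif   [:motif/id "Ce000154"]})

;; Two targets: rejected.
(valid-homology? {:homology/motif [:motif/id "Ce000154"]
                  :homology/dna   [:sequence/id "AI052856"]})
```

A map with no target attribute at all is rejected as well, since the count is then zero.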

So a simple homology would look something like:

     {:locatable/parent   [:sequence/id "2L52"]
      :locatable/min      563
      :locatable/max      602
      :locatable/strand   :locatable.strand/positive
      :locatable/method   [:method/id "RepeatMasker"]
      :locatable/score    246.0
      :homology/motif     [:motif/id "Ce000154"]
      :homology/min       289
      :homology/max       328}
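
Because both pairs of coordinates are half-open intervals, the length of each side is simply max minus min. A small sketch of a sanity check for ungapped homologies follows. It assumes both sides are measured in the same units (so it would not apply directly to a peptide aligned to DNA), and `span` and `ungapped-consistent?` are hypothetical helper names.

```clojure
;; Hypothetical helpers for illustration only.

(defn span
  "Length of the half-open interval [lo, hi)."
  [lo hi]
  (- hi lo))

(defn ungapped-consistent?
  "True when the parent-side and target-side intervals of an ungapped
   homology have the same length. Assumes both use the same units."
  [m]
  (= (span (:locatable/min m) (:locatable/max m))
     (span (:homology/min m) (:homology/max m))))

(def simple-homology
  {:locatable/parent [:sequence/id "2L52"]
   :locatable/min    563
   :locatable/max    602
   :homology/motif   [:motif/id "Ce000154"]
   :homology/min     289
   :homology/max     328})

(span 563 602)                         ;=> 39
(ungapped-consistent? simple-homology) ;=> true
```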

While something with gaps might end up like:

     {:locatable/parent   [:sequence/id "Bmal_v3_scaffold6_5"]
      :locatable/min      123537
      :locatable/max      124043
      :locatable/strand   :locatable.strand/negative
      :locatable/method   [:method/id "BLAT_EST_BEST"]
      :locatable/score    94.0
      :homology/dna       [:sequence/id "AI052856"]
      :homology/min       0
      :homology/max       230
      :homology/gap       "M124 D277 M106"
      :homology/align-id  "AI052856.1"}
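
To make the gap encoding concrete, here is a hedged sketch of how such a string could be decoded. It follows the GFF3 Gap conventions (M advances both sequences, D advances only the reference, I advances only the target) and assumes a nucleotide-to-nucleotide alignment. The frameshift operations F and R are not handled and will throw. `parse-gap` and `gap-spans` are hypothetical names.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helpers for illustration only.

(defn parse-gap
  "Parses a GFF3-style Gap string into a vector of [op length] pairs."
  [gap]
  (mapv (fn [tok]
          [(keyword (subs tok 0 1)) (Long/parseLong (subs tok 1))])
        (str/split (str/trim gap) #"\s+")))

(defn gap-spans
  "Returns how many positions the alignment consumes on each side.
   Handles only M, I and D; any other operation throws."
  [gap]
  (reduce (fn [acc [op n]]
            (case op
              :M (-> acc (update :reference + n) (update :target + n))
              :D (update acc :reference + n)
              :I (update acc :target + n)))
          {:reference 0 :target 0}
          (parse-gap gap)))

(parse-gap "M124 D277 M106") ;=> [[:M 124] [:D 277] [:M 106]]
(gap-spans "M124 D277 M106") ;=> {:reference 507, :target 230}
```

The target total of 230 equals :homology/max minus :homology/min in the example above.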

This schema seems sufficient to represent everything we've got at the moment, but if anyone has tricky cases that they consider to be important, now would be a good time to discuss them.

One inelegance is that this is an asymmetrical model, i.e. every homology has a parent/reference side (:locatable/parent) and a target (one of the various homology attributes). I have seen arguments in favour of symmetrical models, and I can see they're quite pretty, at least for the simpler cases. But a symmetrical model would require more joins in Datomic, whereas the asymmetrical model seems to be a good match to what's currently in ACeDB (while XREFs notionally make the Homol system symmetrical, in practice the reverse XREFs just end up with an object ID, not a full homology record). It's also a very close match to the GFF3 model of doing things.