Ace to Datomic mapping - WormBase/db-prototypes GitHub Wiki

Ace objects are trees. Datomic is, for our purposes, a triple store. In principle, a triple store can model arbitrary graphs (and so could maintain a 100% faithful representation of arbitrary Ace objects). However, triple stores don't distinguish between top-level entities (== Ace objects) and entities representing internal nodes within a tree. This means that the 100%-faithful Ace representation would be a pain to query (since you'd have to explicitly walk up and down every level of the tree), and would probably not perform well either.

We therefore distinguish between "internal tags" in the Ace models (those which exist only to structure the data for display purposes) and "leaf tags", and only retain the leaf tags. In general, this is fairly straightforward: a tag is a leaf tag if it's followed by either a variable (e.g. Text or ?Gene), a hash-model reference (e.g. #Evidence), or a newline. E.g. in:

 ?Gene   Evidence #Evidence
         Identity Version UNIQUE Int
                  Name  CGC_name UNIQUE ?Text
                        Sequence_name UNIQUE ?Text
                        Public_name UNIQUE ?Text
                  DB_info Database ?Database ?Database_field ?Text
                  Species UNIQUE ?Species
         Map_info Well_ordered

Leaf tags are Evidence, Version, CGC_name, Sequence_name, Public_name, Database, Species, and Well_ordered, while internal tags are Identity, Name, DB_info, and Map_info. This rule is sufficient to convert most Ace models into a sensible triple-store model. The one notable exception is "enum-like structures" (see below).
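The leaf-tag rule can be sketched in a few lines of Python. This is only an illustration, not the actual schema generator: the helper name is_leaf_tag, the token-list representation, and the set of primitive type names are assumptions made for this example, and enum-like structures are not handled.

```python
# Hypothetical helper for illustration; not part of the real schema generator.
PRIMITIVE_TYPES = {"Text", "Int", "Float", "DateType"}


def is_leaf_tag(following):
    """Return True if a tag is a leaf tag.

    `following` is the list of model tokens that come directly after the
    tag on its branch; an empty list stands for "followed by a newline".
    """
    # UNIQUE is a modifier rather than a node, so skip it when looking ahead.
    rest = [tok for tok in following if tok != "UNIQUE"]
    if not rest:
        return True  # bare tag, e.g. Well_ordered
    first = rest[0]
    # Variables are primitive types or ?Class references; #Name is a hash model.
    return first in PRIMITIVE_TYPES or first.startswith(("?", "#"))


# Tags taken from the ?Gene fragment above.
print(is_leaf_tag(["UNIQUE", "Int"]))                 # Version
print(is_leaf_tag(["CGC_name", "UNIQUE", "?Text"]))   # Name
print(is_leaf_tag([]))                                # Well_ordered
```

Here Version and Well_ordered are classified as leaf tags, while Name (which is followed by another tag) is classified as internal.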

Because AceDB requires tags to be unique within a class, the leaf tags within a class are guaranteed to have unique names. In most cases, their names are also reasonably self-explanatory. However, there are a few leaf tags with names that only make sense in the context of one or more preceding tags, e.g.:

 ?Gene Disease_info Experimental_model ?DO_term
                    Potential_model ?DO_term

and these will need to be renamed. The Datomic schema generator allows Datomic attribute names to be explicitly specified in such cases.

Making names Datomic-friendly

Datomic attribute identifiers are canonically all-lower-case, and use hyphens as word-separators. Use of underscores, while not prohibited, is generally avoided both for stylistic reasons (Datomic has some LISP heritage) and because underscores are used to indicate reverse-traversal of attributes in the Datomic entity API.
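As a minimal sketch of this convention, assuming the conversion is nothing more than lower-casing and swapping underscores for hyphens (the function name datomic_name is made up for this example, and the real generator may treat some names specially or use explicit overrides):

```python
def datomic_name(ace_name):
    """Convert an Ace class or tag name to a Datomic-friendly name.

    Assumed rule: lower-case everything and use hyphens as word separators.
    """
    return ace_name.lower().replace("_", "-")


print(datomic_name("Expr_pattern"))
print(datomic_name("Version_change"))
```

Under this assumption, Expr_pattern becomes expr-pattern and Version_change becomes version-change, which matches the attribute names used elsewhere on this page.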

Object identity

In Datomic, the primary identifier for an entity is an opaque (and usually large) integer. These aren't supposed to be used outside of the database (in particular, they aren't guaranteed to remain stable across a dump-restore cycle), and certainly aren't user-friendly.

Datomic has no concept directly analogous to the AceDB object names, but it's straightforward to build one. It's possible to define arbitrary "identity attributes" by setting the :db/unique schema attribute to :db.unique/identity. This has several consequences:

  • The attribute is "indexed" (i.e. included in the AVET index).
  • The values of the attribute are enforced as unique (i.e. can't give two different entities the same value for this attribute).
  • The attribute can be used to form lookup refs.
  • Transaction data which specifies a value for the attribute has "upsert" semantics (i.e. if a matching entity already exists, that entity will be updated rather than creating a new entity).

For every AceDB class, we create a corresponding "class-identifier" attribute with a namespace portion equal to a (Datomic-friendly) representation of the class name, a name of "id", and a :db/valueType of :db.type/string. E.g. :gene/id, :expr-pattern/id, etc. Note that each of these defines a separate namespace, e.g., you can't have two entities with the same :gene/id, but you could have a gene and an expr-pattern with the same name -- just like AceDB object names.

Note that identity attributes are considerably more flexible than AceDB object names, in that:

  • They don't have to be strings.
  • You can have multiple identity attributes on a single object (i.e. several independent identity schemes addressing the same underlying collection of objects).

Neither of these is needed in a straight AceDB port, but both could be useful in the future.
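To make the class-identifier attributes concrete, the sketch below builds the schema map for one class as a plain Python dict whose keys mirror Datomic's schema attributes. The function name class_id_attribute is hypothetical, and :db.cardinality/one is an assumption for this example; the text above only fixes the ident, the value type, and the uniqueness setting.

```python
def class_id_attribute(ace_class):
    """Sketch of the schema map for an Ace class's identity attribute."""
    # Datomic-friendly namespace: lower-case, hyphens instead of underscores.
    ns = ace_class.lower().replace("_", "-")
    return {
        ":db/ident": f":{ns}/id",
        ":db/valueType": ":db.type/string",
        # Assumption: each object carries exactly one identifier value.
        ":db/cardinality": ":db.cardinality/one",
        ":db/unique": ":db.unique/identity",
    }


for key, value in class_id_attribute("Expr_pattern").items():
    print(key, value)
```

In EDN the same map would be written with keywords rather than strings; the dict form is used here only to keep the example runnable.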

Simple attributes

A tag followed by a single variable is translated into a simple attribute in Datomic. The type mapping here is mostly straightforward:

     Text      -> :db.type/string
     ?Text     -> :db.type/string with indexing enabled
     Int       -> :db.type/long
     Float     -> :db.type/float
     DateType  -> :db.type/instant

AceDB object references translate to :db.type/ref attributes. There is a little subtlety here, though: in AceDB, bidirectional links between objects are created by specifying a tag at each end and linking them up with XREFs. This results in duplication of data (but mostly managed by AceDB). In Datomic, all entity-to-entity relationships are traversable efficiently in both directions (using the VAET index for reverse traversal), so doubling up the links would be useless. Therefore it's necessary to pick a primary direction for each object-to-object link and only model this in Datomic. This is currently controlled by hints in the augmented models file.

There's also a special case of a bare leaf tag, e.g.

   ?Gene Map_info Well_ordered

this is translated into a :db.type/boolean attribute. AceDB doesn't have a real boolean type, but presence of the tag is translated into true values in Datomic, while absence of the tag means the attribute will not be created in the Datomic import. Boolean attributes are always :db.cardinality/one (Datomic supports cardinality-many boolean attributes for reasons of symmetry, but they're not terribly useful...)
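The rules for simple attributes in this section can be summarised as a small lookup. This is a sketch under stated assumptions: the function name simple_value_type is invented for the example, and a bare leaf tag is represented by passing None.

```python
# Primitive Ace types and their Datomic value types, as listed above.
PRIMITIVE_VALUE_TYPES = {
    "Text": ":db.type/string",
    "Int": ":db.type/long",
    "Float": ":db.type/float",
    "DateType": ":db.type/instant",
}


def simple_value_type(variable=None):
    """Return the Datomic value type for the variable following a leaf tag.

    `variable` is None for a bare leaf tag such as Well_ordered.
    """
    if variable is None:
        return ":db.type/boolean"      # presence of the tag means true
    if variable == "?Text":
        return ":db.type/string"       # stored as a string, with indexing enabled
    if variable.startswith("?"):
        return ":db.type/ref"          # reference to another Ace object
    return PRIMITIVE_VALUE_TYPES[variable]


print(simple_value_type("Int"))
print(simple_value_type("?Species"))
print(simple_value_type())
```

The sketch does not model the choice of primary direction for XREF-linked tags; that decision comes from hints in the augmented models file.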

Component attributes

A leaf tag becomes a component attribute if either of the following holds:

  • It's followed by more than one variable node (e.g. pMap ?Contig Int Int).
  • It's followed by a hash-model (regardless of the number of "positional" variable nodes).

Consider the model:

   ?Gene Version_change Int UNIQUE DateType UNIQUE ?Person #Gene_history_action

This results in:

  • An attribute linking a gene to a Version change component (:gene/version-change) -- hereafter, the "component attribute".
  • A namespace for the positional values of a version-change record (gene.version-change)
  • An attribute within that namespace for each positional value.

By default, the value attributes are named based on their types, but this can be overridden by naming hints in the augmented models file.

Note that there aren't any schema elements directly reflecting the hash-model. The schema generator will have given that model its own namespace (in this case "gene-history-action") and attributes from that namespace can be used directly. But for the benefit of curation tools, etc., namespaces referred to by hash-models are recorded in :pace/use-ns attributes on the component entities.

So, to give a concrete example, the model:

  ?Gene History Version_change Int ^version UNIQUE DateType ^date UNIQUE ?Person ^person #Gene_history_action

(note this now has name hints), yields the following attributes:

  • :gene/version-change (:db.type/ref, :db/isComponent)
  • :gene.version-change/version (:db.type/long)
  • :gene.version-change/date (:db.type/instant)
  • :gene.version-change/person (:db.type/ref)

And some typical data using this schema might look like:

 {:gene/id "WBGene00003020"
  :gene/version-change [
     {:gene.version-change/version   1
      :gene.version-change/date      #inst "2004-04-07T12:29:30"
      :gene.version-change/person    [:person/id "WBPerson1971"]
      :gene-history-action/imported  "Initial conversion from geneace"}
  ]}
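The attribute names in this example can be derived mechanically from the class name, the tag, and the name hints. The sketch below shows one way to do that; component_attributes is a hypothetical helper, and the fallback used when no hint is given (the Datomic-friendly type name) is an assumption, since this page only says that default names are based on types.

```python
def component_attributes(ace_class, tag, positional):
    """List the attribute idents generated for a component tag.

    `positional` is a list of (ace_type, name_hint) pairs; name_hint may be
    None, in which case a name derived from the type is used (an assumption).
    """
    def friendly(name):
        return name.lower().replace("_", "-")

    cls, tag_name = friendly(ace_class), friendly(tag)
    # First the component attribute itself, then one attribute per
    # positional value in the component's own namespace.
    idents = [f":{cls}/{tag_name}"]
    for ace_type, hint in positional:
        name = hint or friendly(ace_type.lstrip("?"))
        idents.append(f":{cls}.{tag_name}/{name}")
    return idents


for ident in component_attributes(
    "Gene",
    "Version_change",
    [("Int", "version"), ("DateType", "date"), ("?Person", "person")],
):
    print(ident)
```

Attributes contributed by the hash model (here the gene-history-action namespace) are not generated by this function, since they belong to the hash model's own namespace.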

Cardinality

For simple attributes, UNIQUE in AceDB means :db.cardinality/one in Datomic.

For component attributes, if all positional values are marked UNIQUE then the component attribute will be cardinality-one, otherwise cardinality-many.

Attributes storing positional values within a component are always cardinality-one.
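These cardinality rules can be expressed as two small functions. This is a sketch only; the function names and the use of a list of booleans (one per positional value, True if it is marked UNIQUE) are assumptions made for the example.

```python
ONE = ":db.cardinality/one"
MANY = ":db.cardinality/many"


def simple_cardinality(is_unique):
    """Cardinality of a simple attribute: UNIQUE means cardinality-one."""
    return ONE if is_unique else MANY


def component_cardinality(positional_unique):
    """Cardinality of a component attribute.

    `positional_unique` holds one boolean per positional value, True when
    that value is marked UNIQUE in the model.
    """
    return ONE if all(positional_unique) else MANY


# Version_change Int UNIQUE DateType UNIQUE ?Person:
# the Int is not UNIQUE, the DateType and ?Person are.
print(component_cardinality([False, True, True]))
```

For the Version_change model this gives cardinality-many, consistent with the example data storing the version changes as a vector of component entities.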

Enum-like structures

XREFing basics

XREFs in hash models

AceDB allows XREFs out of hash models. These are potentially problematic for the ace->datomic mapping, since we want to know the type of object at each end of the XREF. Theoretically, a hash model with an XREF which gets used in multiple classes could lead to an XREF attribute in Datomic which allows objects of more than one class at the outbound end.

Fortunately, this flexibility isn't used in WormBase. Currently, there are outbound XREFs in the following hashes:

  #Multi_counts -- only used in ?Multi_pt_data
  #Mass_spec_data -- only used in ?Mass_spec_peptide
  #Interactor_info -- only used in ?Interaction

...so no issues with multiple classes. However, the tooling wants to know the classes for each end of the XREF, and this is hard to infer from the models file. Currently, the schema generator sets the :pace.xref/obj-ref property to the "class" of the hash model, which is wrong but hard to fix. Instead, we set the correct :pace.xref/obj-ref in the schema-fixups.

Outstanding issues