Moving data onto transactions - WormBase/db-prototypes GitHub Wiki

Datomic transactions are first-class entities and can be annotated with arbitrary attributes. The transaction that asserted any datom can be bound inside a query, so finding out who assigned each gene's public name takes a query like:

   [:find ?gene-id ?name ?curator
    :where [?gene :gene/id ?gene-id]
           [?gene :gene/public-name ?name ?tx]   ;; NB fourth item is transaction ID.
           [?tx :wormbase/curator ?curator]]

This trivially acts as a provenance mechanism roughly equivalent to ACeDB timestamps (and improving on them, since historical data can still be queried).
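
As a rough sketch of the historical case, the same clauses can be run against a history database to list every public name a gene has carried and who set it. This assumes the Datomic peer library (`datomic.api`) and a database value `db` obtained elsewhere; the function name `public-name-history` is made up for illustration.

```clojure
(require '[datomic.api :as d])

(defn public-name-history
  "Every assertion and retraction of :gene/public-name for one gene,
   with the curator and time recorded on the transaction.
   Hypothetical helper, not part of any existing code."
  [db gene-id]
  (d/q '[:find ?name ?curator ?when ?added
         :in $ ?gene-id
         :where [?gene :gene/id ?gene-id]
                ;; fifth item is the operation: true = asserted, false = retracted
                [?gene :gene/public-name ?name ?tx ?added]
                [?tx :wormbase/curator ?curator]
                [?tx :db/txInstant ?when]]
       (d/history db)   ;; history view also contains retracted datoms
       gene-id))
```

Each result row pairs a name with the curator and time of the transaction that added or removed it, which is the part ACeDB timestamps cannot provide once a value has been overwritten.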

But we might be able to simplify the data model by moving more information onto transactions...

Recording imports

Much WormBase data is periodically imported from external sources. Sometimes this can be spotted by the appearance of a script name in the timestamp, although this is a bit ad hoc. In some cases, there's also an explicit record of versions (e.g. on the GO_Term and SO_Term classes).

At the very least we should be able to handle the provenance a bit better in Datomic. As a sketch, transaction metadata could look like:

 {:curation/script          "wb.core.import-so"
  :curation/script-version  "16f1c5e550c9c321b8b92d1448a556ae69cbc222"
  :curation/source-uri      "http://sourceforge.net/p/song/svn/HEAD/tree/trunk/so-xp.obo"
  :curation/source-version  "2.5.2"
  :db/txInstant             #inst "2015-04-30"}
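
To attach a map like this to a transaction, it needs a `:db/id` that resolves to the transaction entity itself. A minimal sketch, assuming the Datomic peer library, an open connection `conn`, and that the `:curation/*` attributes have already been installed in the schema; `transact-import` is a hypothetical helper name:

```clojure
(require '[datomic.api :as d])

(defn transact-import
  "Transacts tx-data together with provenance attributes placed on the
   transaction entity. Hypothetical helper, not part of any library."
  [conn tx-data provenance]
  @(d/transact conn
               (conj (vec tx-data)
                     ;; a tempid in :db.part/tx resolves to the current transaction
                     (assoc provenance :db/id (d/tempid :db.part/tx)))))

(comment
  ;; so-term-data stands for whatever entity maps the import script produces
  (transact-import conn so-term-data
                   {:curation/script         "wb.core.import-so"
                    :curation/source-version "2.5.2"}))
```
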

Note that the version is encoded in its own field, which would be "reachable" from any SO term entity in a query.
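
For instance, the source version behind each SO term could be reached with a query along these lines. The attribute `:so-term/id` is an assumption made for illustration; the `:curation/*` attributes are the ones sketched above.

```clojure
;; :so-term/id is a hypothetical attribute name
[:find ?so-id ?uri ?version
 :where [?term :so-term/id ?so-id ?tx]             ;; ?tx is the transaction that asserted the ID
        [?tx :curation/source-uri ?uri]
        [?tx :curation/source-version ?version]]
```

One thing worth checking before relying on this: Datomic does not write a new datom when a later import re-asserts an unchanged value, so `?tx` here would be the import that first introduced the term rather than the most recent one to include it.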

Replacing "evidenced links"

Other ideas?

  • Smarter handling of data which is mostly bulk-imported but gets hand-edited in a few places? Should be possible to identify the hand-edited bits and merge them into the next import?
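
As a very rough sketch of the "identify the hand-edited bits" step: if every import transaction carried `:curation/script` and hand edits never did (both are assumptions, as is the `:so-term/id` attribute), the hand-edited datoms on imported entities could be listed with Datomic's built-in `missing?` predicate:

```clojure
;; Assumes imports always set :curation/script and hand edits never do.
;; :so-term/id is a hypothetical attribute name.
[:find ?so-id ?attr ?v ?curator
 :where [?term :so-term/id ?so-id]
        [?term ?a ?v ?tx]                       ;; every current datom on the term
        [(missing? $ ?tx :curation/script)]     ;; keep those not written by an import script
        [?tx :wormbase/curator ?curator]
        [?a :db/ident ?attr]]
```

How the resulting datoms would then be merged into the next import is left open here.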