Importer TempIDs - WormBase/db-prototypes GitHub Wiki

(Mostly of interest to people hacking on the importer code -- this stuff isn't needed for general use).

The ace-file importer sometimes needs to assign temporary (but persistent-during-the-import) IDs to Datomic entities. This is principally because an entity can contain values with different acedb timestamps: to preserve the timestamps, the data must be transacted into Datomic in several different transactions, but the second and subsequent transactions must be able to refer back to an entity created by the first transaction.
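
As a rough illustration of the idea (not the importer's actual code), the sketch below groups an entity's timestamped values into one transaction per timestamp, with every transaction carrying the same TempID so that later ones can find the entity created by the first. The function name, the attribute names, the timestamp strings and the dict layout are all hypothetical, and it assumes that :importer/temp can be used to resolve back to an existing entity.

```python
from collections import defaultdict

def split_by_timestamp(temp_id, stamped_values):
    """Group (attribute, value, timestamp) triples into one
    transaction per timestamp, all tagged with the same TempID."""
    by_stamp = defaultdict(list)
    for attr, value, stamp in stamped_values:
        by_stamp[stamp].append((attr, value))

    transactions = []
    for stamp in sorted(by_stamp):
        transactions.append({
            "timestamp": stamp,
            # Every transaction repeats the TempID so it can refer
            # back to the entity created by the first transaction.
            "entity": {":importer/temp": temp_id,
                       "values": by_stamp[stamp]},
        })
    return transactions

# Hypothetical attribute names and timestamps, for illustration only.
txs = split_by_timestamp("tmp-1", [
    (":example/name", "foo", "2014-01-01_curator"),
    (":example/note", "bar", "2015-06-01_curator"),
    (":example/alias", "baz", "2014-01-01_curator"),
])
for tx in txs:
    print(tx["timestamp"], tx["entity"])
```

Three values with two distinct timestamps yield two transactions, both keyed by the same :importer/temp value.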

A secondary reason why they are needed is to allow unification of "extension data" (often, but not always, #Evidence) on the outbound and inbound ends of an XREF. The importer will generally handle the two XREFs completely separately but they need to refer to the same component entity once transacted into Datomic.

TempIDs are stored in an attribute called :importer/temp. This is special in that it is never displayed by TrACeView. It is possible (and recommended) to excise :importer/temp values from a database after the import is finished. New :importer/temp values can be created if the database is patched using further .ace files, so it may be useful to re-run the excision periodically.

Original TempID system

The original importer used SQUUIDs (semi-sequential UUIDs, created by datomic.api/squuid) and stored them in an attribute of valueType :db.type/uuid. This did the job for the most part, but in retrospect it wasn't a great choice -- if you create a lot of SQUUIDs in a short period of time (as during an import), they're not actually very sequential, and they ended up causing substantial amounts of index-thrashing during journal replay.
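
To see why bulk-created SQUUIDs are not very sequential, here is a minimal Python imitation. It assumes the commonly described layout in which the top 32 bits hold the creation time in seconds and the remaining bits are random; that layout is an assumption about datomic.api/squuid, and squuid_like is a hypothetical stand-in rather than the real function.

```python
import time
import uuid

def squuid_like(now_seconds=None):
    """Build a UUID whose top 32 bits are a seconds timestamp and
    whose remaining 96 bits are random (assumed SQUUID layout)."""
    secs = int(time.time()) if now_seconds is None else int(now_seconds)
    low_96 = uuid.uuid4().int & ((1 << 96) - 1)
    return uuid.UUID(int=((secs & 0xFFFFFFFF) << 96) | low_96)

# Two IDs created within the same (fixed, arbitrary) second.
fixed_second = 1441065600
a = squuid_like(fixed_second)
b = squuid_like(fixed_second)
print(a)
print(b)
```

Under that assumption, every ID made within the same second shares the first eight hex digits and is otherwise random, so thousands of IDs created per second during an import land in effectively random order within that second's slice of the index.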

This system also didn't handle XREF unification properly, so it needed to be replaced during the September 2015 XREF revamp.

September 2015 TempID system

:importer/temp is still used, but now has the valueType :db.type/string.

Various types of TempID are now used.

  • TempIDs for component entities. These consist of a lookup-ref for the parent entity, the ident of the attribute used to reach the component, and a string representation of any "positional" values within the component, e.g. "[:gene/id "WBGene00003020"] :gene/id lin-35".

  • TempIDs for top-level entities with unknown names (e.g. "__ALLOCATE__foo" IDs in ace file patches). These consist of the basis-t of the database that's being imported into, followed by a colon, followed by some tag.

  • TempIDs for lightweight features created in the locatable-importer. These are still SQUUIDs, but represented as a string rather than a UUID object.
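
The three string shapes listed above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the separators are inferred from the single component example and the "basis-t, colon, tag" description, so the importer's exact formatting may differ.

```python
import uuid

def component_temp_id(parent_lookup_ref, attr_ident, positional_values):
    """Parent lookup-ref, then the attribute ident, then any positional values."""
    parent_attr, parent_id = parent_lookup_ref
    parts = ['[{} "{}"]'.format(parent_attr, parent_id), attr_ident]
    parts.extend(str(v) for v in positional_values)
    return " ".join(parts)

def allocated_temp_id(basis_t, tag):
    """Basis-t of the target database, a colon, then some tag."""
    return "{}:{}".format(basis_t, tag)

def locatable_temp_id(squuid):
    """A SQUUID stored as its string form rather than as a UUID object."""
    return str(squuid)

# Reproduces the component example given in the list above.
print(component_temp_id((":gene/id", "WBGene00003020"), ":gene/id", ["lin-35"]))
# Hypothetical basis-t and tag values.
print(allocated_temp_id(1000, "foo"))
# uuid4 is used here only as a stand-in for a real SQUUID.
print(locatable_temp_id(uuid.uuid4()))
```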

Future?

There's probably scope for improvement here, e.g. thinking of something saner than SQUUIDs for locatable TempIDs.

Having a big index of TempIDs isn't really playing to Datomic's strengths. Reducing the number of TempIDs used during import would improve performance and reduce the disk-space requirements of the log-replay stage. There might be an argument for using different attributes for different "types" of TempID in order to make the individual indices smaller and improve index locality.

One thing to consider might be to just force all attributes in a lightweight locatable to use the same timestamp, at which point TempIDs wouldn't be needed at all.