Importing extra feature data - WormBase/db-prototypes GitHub Wiki

There is an extra, optional import step for "lightweight" feature data, i.e. Feature and Homol lines in Feature_data, Homol_data, Protein, and Sequence objects.

NB: Don't try this until you've tried and fully understood the basic import procedure.

Overview.

In principle, the locatable importer creates extra "log" data which is simply appended to the primary log segments. The accumulated logs are then sorted together and played back into the database as a single operation.
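The "append, then sort together" idea can be illustrated with a toy example. This is only a sketch: the entry shape (a map with `:ts` and `:tx` keys) and the name `merge-logs` are invented for illustration and are not the real on-disk log format, which is handled by the import tooling.

```clojure
;; Hypothetical log entries: each has a timestamp (:ts) and a payload (:tx).
(defn merge-logs
  "Concatenate any number of log segments and order the result by timestamp."
  [& segments]
  (sort-by :ts (apply concat segments)))

(def primary   [{:ts "2015-02-19" :tx "gene-2"}
                {:ts "2014-06-01" :tx "gene-1"}])
(def locatable [{:ts "2014-11-30" :tx "feature-1"}])

(map :tx (merge-logs primary locatable))
;; => ("gene-1" "feature-1" "gene-2")
```

The point is only that the locatable entries end up interleaved with the primary ones, so a single playback pass sees everything in timestamp order.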

There's one complication, though, and it's a pretty nasty one: ?Feature_data and ?Homol_data objects sometimes reflect just a portion of their parent ?Sequence. This isn't replicated in the Datomic model, so this feature data needs to be remapped into sequence coordinates. To store the mappings, we need an extra ("helper") database.
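To make the remapping concrete, here is a minimal sketch of the coordinate arithmetic, assuming 1-based inclusive coordinates and a segment placed on the forward strand of its parent. The name `remap-to-parent` and the example numbers are hypothetical. The real importer looks up the segment placement in the helper database rather than taking it as an argument.

```clojure
(defn remap-to-parent
  "Translate a 1-based inclusive interval [lo hi], expressed relative to a
   Feature_data/Homol_data segment, into parent Sequence coordinates.
   `segment-start` is the 1-based parent position where the segment begins.
   Assumption: forward strand only."
  [segment-start [lo hi]]
  [(+ segment-start lo -1)
   (+ segment-start hi -1)])

;; A feature at 1..50 of a segment that starts at parent position 1001:
(remap-to-parent 1001 [1 50])
;; => [1001 1050]
```

A segment that starts at parent position 1 leaves coordinates unchanged, which is the case where no remapping is needed.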

Procedure

Decide what you want to import

You may not want to import all the data that the locatable-import system can theoretically handle. E.g. get protein and sequence features, but ignore all the Homol_data (which is huge and mostly consists of EST alignments).

Look at what objects end up in what acedb dump files and make a note of which dump files you'll need.
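One way to see which classes a dump file contains is to tally its object headers. The following is a rough sketch, not part of the import tooling: it assumes objects are separated by blank lines and begin with a header line of the form `Class : "name"`, and the function names are made up for this example.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])
(import '[java.util.zip GZIPInputStream])

(defn object-class
  "Return the class name if `line` looks like an object header such as
   Sequence : \"AH6\"; otherwise nil."
  [line]
  (second (re-matches #"^(\w+) : \".*" line)))

(defn class-counts
  "Tally object classes in a seq of dump lines. Only the first line after a
   blank line (or at the start of input) is treated as a possible header."
  [lines]
  (->> (cons "" lines)
       (partition 2 1)
       (keep (fn [[prev line]]
               (when (str/blank? prev)
                 (object-class line))))
       frequencies))

(defn dump-class-counts
  "Tally object classes in a gzipped dump file at `path`."
  [path]
  (with-open [rdr (-> path io/input-stream GZIPInputStream. io/reader)]
    (class-counts (line-seq rdr))))
```

Running `dump-class-counts` over each dump file gives a quick picture of which files hold Sequence, Protein, Feature_data, or Homol_data objects.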

Convert primary data to "log" format.

Follow the main import instructions up to and including "Converting ace dumps to Datomic-log format".

Don't sort your log data yet! It may be worth taking a backup of it at this point, though. If something goes wrong during the locatables import, you can revert to this backup and try again (or just import the primary data).
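A backup can be as simple as copying the log directory somewhere else. As a minimal sketch, assuming the log directory is flat (no subdirectories that matter), this can be done from the REPL. The function name `backup-logs!` and any paths passed to it are hypothetical.

```clojure
(require '[clojure.java.io :as io])

(defn backup-logs!
  "Copy every regular file in `log-dir` into `backup-dir`, creating
   `backup-dir` if needed. Returns the number of files copied."
  [log-dir backup-dir]
  (let [dest  (io/file backup-dir)
        _     (.mkdirs dest)
        files (filter #(.isFile ^java.io.File %)
                      (.listFiles (io/file log-dir)))]
    (doseq [^java.io.File f files]
      (io/copy f (io/file dest (.getName f))))
    (count files)))
```

To revert, delete the contents of the log directory and copy the backup files back.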

Create a helper DB.

Create an extra Datomic database to contain the mapping between Sequence coordinates and Feature_data/Homol_data coordinates. This database won't be huge, and it might be okay to use in-memory storage (but this hasn't been tested).

Transact the schema as for the main database.

Then locate the "helper.edn.gz" file from your log directory and play (just) this into the helper database:

   (binding [*suppress-timestamps* true]
      (play-logfile helper-connection "/path/to/helper.edn.gz"))

We've bound the special variable *suppress-timestamps* to allow playback of unsorted log data. Make sure you don't play this into your main database!

Run the locatables importer

Running this is almost exactly the same as running the main importer. You specify the same log directory, so that the locatable log data gets appended to the primary log data. Theoretically you can run a whole database dump through this, but there's no reason why it needs to see any objects other than Sequence, Protein, Feature_data, and Homol_data.

    (use 'pseudoace.locatable-import)
    (import '[java.io FileInputStream]
            '[java.util.zip GZIPInputStream])

    ;; Assumes log-dir, ace-reader and ace-seq are already in scope from the main import.
    (def helper-db (db helper-connection))
    ;; Dump-file numbers to import. Note that range excludes its upper bound,
    ;; so this covers files 100-199 and 500-599.
    (def locatable-blocks (concat (range 100 200) (range 500 600)))

    (doseq [fid locatable-blocks
            :let [f (str "/Users/tdown/Projects/wormbase247-dump/dump_2015-02-19_A." fid ".ace.gz")]]
      (println "Doing" f)
      (doseq [blk (->> (FileInputStream. f)
                       (GZIPInputStream.)
                       (ace-reader)
                       (ace-seq)
                       (partition-all 20))]   ;; Larger block size may be faster if
                                              ;; you have plenty of memory.
        (split-locatables-to-dir helper-db blk log-dir)))

Once this is done, you can delete the helper DB if you want.

Finish off the import

Sort all your log files (except "helper.edn.gz") and play them back into the main database, exactly as in the normal import.
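To avoid accidentally including the helper log, it can help to build the list of files explicitly. This is a small sketch under the assumption that every other regular file in the log directory is a log segment. The name `logs-to-sort` is invented for this example.

```clojure
(require '[clojure.java.io :as io])

(defn logs-to-sort
  "Return the regular files in `log-dir`, ordered by name, leaving out
   helper.edn.gz."
  [log-dir]
  (->> (.listFiles (io/file log-dir))
       (filter #(.isFile ^java.io.File %))
       (remove #(= "helper.edn.gz" (.getName ^java.io.File %)))
       (sort-by #(.getName ^java.io.File %))))
```

The resulting files are the ones to feed through the sort and playback steps from the main import instructions.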

This will obviously take a bit longer than the normal import (depending on exactly how much extra data you've imported). More importantly, it can take a lot more disk space during the import. We recommend allowing 1GiB of storage space if you want to import absolutely everything without scrambling around to do mid-import garbage collections. Once you've dumped and restored the Datomic database, it won't be huge (~25GiB for everything); the extra size is garbage generated during import.