Timestamp importer

NB: please use this importer in preference to the original importer. There is little difference in performance, it sidesteps some issues that made the old importer memory-intensive when handling very large objects, and the old importer is no longer maintained.

Preparation

The importer process uses a dump of the target ACeDB -- it never actually connects to ACeDB directly. Make a dump from the xace "admin" menu, being sure to include timestamps. It's fine to include comments as well, but they will be ignored by the importer.

Once the dump is complete, I usually compress it with gzip (ACE files are huge: the compressed dump is around 2.2 GB). Make a note of the highest dump-file number.
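If the dump files are numbered contiguously, something like this finds the highest number (a sketch; adjust the filename pattern to match your dump's prefix):

 ls dump_*.ace | sed 's/.*\.\([0-9][0-9]*\)\.ace$/\1/' | sort -n | tail -1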

On the development server you can run the following command to package up the dump files for the latest release:

 tar -zcvf acedb-WS250-2015-09-04.tar.gz /usr/local/wormbase/tmp/staging/<release>/acedmp

Once the archive is on the machine where the data will be loaded, extract it and then gzip each .ace file individually:

  tar xvfz acedb-WS250-2015-09-04.tar.gz   # extract the directory

  for file in *.ace; do gzip "$file"; done   # gzip each file in the directory individually

Here is a script I used to rename the gzipped files to the naming scheme that the import loop below expects.

  a=1
  for i in *.ace.gz; do
      new=$(printf "dump_2015-02-19_A.%d.ace.gz" "$a")   # use "%04d" instead of "%d" to zero-pad to 4 digits
      mv -- "$i" "$new"
      a=$((a+1))
  done

Then secure-copy the archive onto the machine you are working on.
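For example (the host and destination path are hypothetical; use whatever matches your setup):

 scp acedb-WS250-2015-09-04.tar.gz you@import-host:/datastore/datomic/tmp/acedata/WS250/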

Alternatively, dump out the ACE data using tace:

  % mkdir -p /nfs/panda/ensemblgenomes/wormbase/tmp/WS250_dump
  % tace /nfs/panda/ensemblgenomes/wormbase/DATABASES/current_DB/
  > Dump -T /nfs/panda/ensemblgenomes/wormbase/tmp/WS250_dump    # NB it doesn't work if there is a '/' at the end of the path
  > quit

Environment

Currently, there aren't any command-line driver scripts (although they would be reasonably straightforward to add if there's demand). But for now, the expectation is that you'll run most of this from a Clojure REPL. Either type lein repl at the command line, or use a REPL interface of your choice (e.g. CIDER if you like Emacs, Cursive if you prefer a newfangled GUI).
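For example, from a checkout of the pseudoace project (the path here is hypothetical):

 cd ~/src/pseudoace
 lein repl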

Schema generation

  ;; lein repl has a concept of a project that you do everything in, e.g. 'pseudoace'. 
  ;; You are expected to be in the directory for that project.
  
  ;; clojure.java.io provides Clojure wrappers around the standard Java I/O
  ;; classes; we need it here for 'reader'
  (use 'clojure.java.io)
  ;; package of functions for dealing with acedb models.
  (use 'pseudoace.model)
  ;; the schema generator package
  (use 'pseudoace.model2schema)
  
  ;; read in the annotated ACeDB models file, generated by hand (PAD to create this)
  (def models (parse-models (reader "models/models.wrm.WS248.annot")))
  ;; make the datomic schema from the acedb annotated models
  (def schema (models->schema models))

Creating the DB and loading the schema

It's fine to use an in-memory database (datomic:mem://...) for testing purposes so long as you're not planning to load the whole database.
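For example, a minimal sketch (the database name here is arbitrary):

  ;; an in-memory database for testing; contents vanish when the JVM exits
  (def uri "datomic:mem://ws250-test")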

  ;; loads in the package then defines 'd' as an alias for 'datomic.api'
  (require '[datomic.api :as d])
  ;; define uri of database - 'WS250' is the name of the database you will create
  ;; best to use the release name 'WS250' etc.
  (def uri "datomic:free://localhost:4334/WS250")
  ;; create the database - it won't harm an existing database, 
  ;; but importing into an existing database will hit problems later, 
  ;; so don't try that!
  (d/create-database uri)
  ;; define the alias 'con'
  (def con (d/connect uri))

  ;; define function tx-quiet that runs a transaction, 
  ;; ensures it completes and throws away all of the output so it runs quietly
  (defn tx-quiet
     "Run a transaction but suppress the (potentially-large) report if it succeeds." 
     [con tx]
     @(d/transact con tx)
     nil)

  ;; loads the datomic schema's meta data about the datomic mapping
  ;; this is where the 'pace/*' namespace stuff in the schema is defined
 (use 'pseudoace.metadata-schema)
  ;; TD's new stuff for defining locations in Features etc.
 (use 'pseudoace.locatable-schema)
  ;; TD's odds and ends that can't be handled in the schema generator.
  ;; says which classes can use the 'locatable' attributes and explicitly
  ;; sorts out the few occurrences of inbound xrefs in hash models
 (use 'pseudoace.wormbase-schema-fixups)
 
 ;; Built-in schemas include explicit 1970-01-01 timestamps.
 ;; The transactions below install the definitions loaded by the
 ;; 'use' statements above for metadata and locatables
 (tx-quiet con metaschema)      ; pace namespace, used by importer
 ;;  this is also from the metadata package
 (tx-quiet con basetypes)       ; Datomic equivalents for some ACeDB builtin types
 (tx-quiet con locatable-schema)

 ;; Need to explicitly timestamp the auto-generated schema.
 ;; add an extra attribute to the 'schema' list of schema attributes, 
 ;; saying this transaction occurs on 1st Jan 1970 to fake a first 
 ;; transaction to preserve the acedb timestamps
 (tx-quiet con (conj schema
                  {:db/id          #db/id[:db.part/tx]
                   :db/txInstant   #inst "1970-01-01T00:00:01"}))

 (tx-quiet con locatable-extras) ; pace metadata for locatable schema,
                                 ; needs to be transacted after the main
                                 ; schema.

 ;; transact the fixups loaded above by (use 'pseudoace.wormbase-schema-fixups)
 (tx-quiet con schema-fixups)

Making a datomic-schema view of the schema

Not required, but potentially useful to see what's going on.

 ;; TD's package to convert real schemas back onto 'datomic-schema'
 ;; style nicely formatted and styled schemas like the official 'datomic-schema' syntax
 (use 'pseudoace.schema-datomic)
 ;; just some utility functions
 (use 'pseudoace.utils)
 ;; standard clojure pretty-print function 
 (use 'clojure.pprint)
 
 ;; this file is created purely for you to look at if you so wish,
 ;; it contains a nicely formatted schema for the database.
 ;; Name it anything you like.
 (with-outfile "schema250.edn"
    (doseq [s (schema-from-db (d/db con))]
      (pprint s)
      (println)))

Converting ace dumps to Datomic-log format

This assumes that the acefiles are gzipped, which saves lots of space and certainly won't harm performance. The log files are also written in gzip format (specifically, sequences of short gzip streams -- which are legal).
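If you want to sanity-check a log segment later, gzip tools read concatenated streams transparently (the filename here is just an example):

 gzip -dc some-segment.edn.gz | head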

 ;; up until now we have been reading in the stuff to deal with models, 
 ;; rather than importing.

 ;; some TD code for doing the import - most of the functionality of
 ;; this is not used as it was TD's first attempt at an importer.
 (use 'pseudoace.import)    ;; still needed for some support code
 ;; this is the time-stamp aware importer that is actually run
 (use 'pseudoace.ts-import)
 ;; this is the ACE file parser
 (use 'acetyl.parser)

 ;; Java class for reading a file as a byte stream
 (import java.io.FileInputStream)
 ;; Java class for decompressing gzip streams
 (import java.util.zip.GZIPInputStream)

 ;; define an alias for the importer
 (def imp (importer con))                     ;; Helper object, holds a cache of schema data.
 ;; define the directory that holds the EDN files
 (def log-dir (file "/datastore/datomic/tmp/datomic/import-logs-WS250/"))   ;; Must be an empty directory
  ;; loop over the numbered ACE files ('range' excludes its upper bound)
 ;; 'doseq' is an imperative loop that does what you tell it
 (doseq [fid (range 1 2500)    ;; highest dump-file number + 1.
         ;; specify the name and extension of the ACE file
         :let [f (str "/datastore/datomic/tmp/acedata/WS250/acedbdump_2015-02-19_A." fid ".ace.gz")]]
    ;; print the status
    (println "Doing" f)
    ;; pipeline to read in ACE file, unzip, parse it,
    ;; then pull out objects from the pipeline in chunks of 20 objects
    (doseq [blk (->> (FileInputStream. f)
                     (GZIPInputStream.)
                     (ace-reader)
                     (ace-seq)
                     (partition-all 20))]   ;; Larger block size may be faster if
                                            ;; you have plenty of memory.
      ;; this is the importer for the objects as they come through the pipeline
      (split-logs-to-dir imp blk log-dir)))

Sorting log segments (10h 40 min)

Sort in timestamp order, because Datomic needs transactions applied in time order. This step is currently done from the shell. Some segments are large and take a while to sort -- it may be possible to improve this by throwing RAM at the problem, as sketched after this block.

 mkdir -p sort-temp
 for i in *.edn.gz; do
     bn=$(basename "$i" .edn.gz)
     echo "$bn"
     gzip -dc "$bn.edn.gz" | sort -T sort-temp -k1,1 -s | gzip -c > "$bn.edn.sort.gz"
 done
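A possible variant of the sort line (an untested sketch): give GNU sort a bigger in-memory buffer and more parallelism, so fewer temporary files spill to disk:

 gzip -dc "$bn.edn.gz" | sort -S 4G --parallel=4 -T sort-temp -k1,1 -s | gzip -c > "$bn.edn.sort.gz"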

Playing logs back into Datomic (approximately 36 hours and requires at least 120GB of disk space)

 ;; find the log files with this ending and sort them by name
 (def log-files (->> (.listFiles log-dir)
                     (filter #(.endsWith (.getName %) ".edn.sort.gz"))
                     (sort-by #(.getName %))))

 ;; read in EDN file, uncompress and play into database
 (doseq [f log-files]
    (println "Playing" (.getName f))
    (play-logfile con (GZIPInputStream. (FileInputStream. f))))

Test it works

Do a quick query to test if the database has read in OK.

 (d/q '[:find ?c :in $ :where [?c :gene/id "WBGene00018635"]] (d/db (d/connect uri)))
 gives:
 #{[923589767780191]}

Garbage collection

If you are very low on space (for example, doing a full database import with only 50 GB of disk free), you might have to run garbage collection while playing the files into Datomic, to reclaim disk space.

To do garbage collection, issue the following command from a second REPL session connected to the first. You can attach multiple REPLs to the same session with 'lein repl :connect', which connects to the port of the existing REPL.
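For example (the port number is hypothetical -- the first REPL prints its actual port on startup):

    lein repl :connect 50371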

Any garbage older than the specified time will be collected, so give a recent time:

    (datomic.api/gc-storage con #inst "2015-08-27T16:00")

Running gc-storage occasionally during the log-replay phase helps keep storage requirements down somewhat, but they'll still climb substantially higher than reimporting the finished database into clean storage. Note that gc-storage may take several hours. Watch the logs for :kv-cluster/delete events to see how things are going; these lines simply indicate that garbage collection is proceeding as expected.
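For example, to watch for these events (the log path is an assumption based on the transactor directory used below; adjust for your install):

    tail -f /mnt/data/datomic-free-0.9.5130/log/*.log | grep kv-cluster/delete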

It's definitely worth running gc-storage before excising the scaffolding IDs, and (if you're not doing a full dump and restore) afterwards as well.


When import has finished

Once all the log segments are replayed and you're happy with the test results, you'll probably want to do:

 (d/transact con [{:db/id #db/id[:db.part/user] 
                   :db/excise :importer/temp}])

This clears out all of the import scaffolding IDs. Note that while the transaction itself completes very quickly, the actual excision job runs asynchronously and can take quite a while (an hour or so). You'll still see the :importer/temp attributes until the whole excision has completed.
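To check whether the excision has finished, you could count the remaining scaffolding datoms (a sketch; this returns nil once they are all gone):

 (d/q '[:find (count ?e) . :where [?e :importer/temp]] (d/db con))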

Backup and Restore

Finally, the DB storage will be quite big at this point. You can save much space by doing a datomic backup-db, then a datomic restore-db into clean storage.

Basically: backup-db, kill the transactor, delete the "data" directory, restart the transactor, then run restore-db.

To backup a database (approximately 1.5 hours)

The transactor must be running.

Shut down any unused REPLs and Groovy shells that are using memory, otherwise you may run out of memory.

Make the dump directory and set it to be group writeable, then dump into it:

    % sudo mkdir -p /datastore/datomic/dumps/WS250_dump
    % sudo chmod g+w /datastore/datomic/dumps/WS250_dump
    % bin/datomic -Xmx4g -Xms4g backup-db "datomic:free://localhost:4334/WS250" "file:/datastore/datomic/dumps/WS250_dump"

To restore the database

If you wish to restore to a fresh data file, then remove the old 'data' directory (as pointed to by the transactor config file) and restart the transactor:

Kill the transactor (use 'ps -eF | grep datomic.launcher' to find the PID of the transactor)
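For example (the PID is whatever ps reports):

    % ps -eF | grep datomic.launcher   # find the transactor PID
    % kill <PID>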

Delete or move aside the directory holding the storage files: '/mnt/data/datomic-free-0.9.5130/data'

Start the transactor again:

    # start a screen or tmux session for the transactor to run in
    % screen -S transactor -h 10000
    % cd /mnt/data/datomic-free-0.9.5130
    % export XMX=-Xmx4G
    % sudo bin/transactor -Xmx4G -Xms4G config/transactor.properties &

The transactor must now be running.

    % bin/datomic -Xmx4g -Xms4g restore-db "file:/datastore/datomic/backups/database_name_dump" "datomic:free://localhost:4334/name_of_database"

Restore can be used to rename a database.
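For example, to restore the WS250 backup made above under a new name (the new name is hypothetical; note the caveat below about reusing the same storage):

    % bin/datomic -Xmx4g -Xms4g restore-db "file:/datastore/datomic/dumps/WS250_dump" "datomic:free://localhost:4334/WS250-copy"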

You cannot restore a single database to two different URIs (two names) within the same storage. i.e. you cannot make a copy of a database with a different name in the same storage file.

You must kill and restart peers (repls) and transactors after a restore.

Timing: roughly 5 minutes to restore geneace into an existing storage file.
