Installing Datomic - WormBase/db-prototypes GitHub Wiki
System Requirements
TLDR: plenty of memory.
Datomic likes memory. 4GiB of Java heap is the recommended minimum for a production transactor. For high-intensity writes (e.g. full WormBase imports), I've found 6GiB to be better. This is probably because, with free storage, the storage engine runs in the same JVM; 4GiB is probably enough if you're running with external storage.
Memory is potentially even more important for Datomic peer processes. Treat 2GiB of Java heap as an absolute minimum; more is better.
When using the Free or Dev storage engines, the actual data storage uses H2 (running in-process in the transactor's JVM). This is pretty fast, but seems to be quite sensitive to the access latency of underlying storage. Best performance is on non-virtualized machines with directly-connected (ideally PCIe) SSDs. Because the storage is treated as (nearly) append-only, there shouldn't be a lot of rewriting of data once it has been stored, and therefore consumer-grade SSDs will probably work okay.
Actual storage requirements can be surprisingly small. The core WormBase data set (as of WS249) takes <10GiB in Datomic. If you add the full set of Feature_data and Homol_data, this rises to ~20GiB, which is still substantially smaller than ACeDB. (NB: Datomic is believed to compress all data at rest.)
One special case to consider is when you're writing a lot of data to Datomic. Datomic doesn't reclaim any storage until you do an explicit gc-storage operation, and running gc-storage while under heavy transaction load hurts performance. It's fine to run a few transactions (e.g. normal curation activity) while GCing; just don't try to submit thousands of back-to-back transactions. So for full Ace->Datomic conversions or very large updates (a complete new species, or redoing all the variation consequences) you'll need extra space. The worst case (full import) needs ~100GiB.
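Since a full import can need ~100GiB, it may be worth checking free space before starting one. The following is a minimal sketch and not part of Datomic: the function names free_kib and has_free_gib are made up for illustration, and the directory you pass in is assumed to be wherever your transactor's data-dir lives.

```shell
# Hypothetical helpers (not part of Datomic): check free disk space
# on the filesystem that holds the transactor's data directory.

# Print the available space, in KiB, on the filesystem containing $1.
free_kib() {
    # -P forces the one-line-per-filesystem POSIX format; column 4 is "Available".
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the filesystem containing $1 has at least $2 GiB free.
has_free_gib() {
    need_kib=$(( $2 * 1024 * 1024 ))
    [ "$(free_kib "$1")" -ge "$need_kib" ]
}

# Example (the "data" path is an assumption; use your own data-dir):
#   has_free_gib data 100 || echo "less than 100GiB free, not starting import" >&2
```

The check only looks at space available right now; it does not account for space that a later gc-storage run would reclaim.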
AWS
When using Free or Dev storage, we want SSD storage, not basic EBS.
I suspect we'd do better using directly attached ephemeral storage as much as possible and only relying on EBS as backing store. There are some (not-Datomic-specific) suggestions for how to set this up here, but it needs someone with full admin rights to the AWS instance to test this stuff.
Given that Datomic tends to be more memory intensive than CPU intensive, it's probably worth investigating the EC2 "r3" family of instance types.
Java
Java 7 is the minimum requirement. Java 8 will give better performance. Historically, I've preferred to use Oracle's JVM builds, but OpenJDK seems to be fine now. When working on Linux, I used the Zulu builds of OpenJDK (these are likely to be better-tested than most Linux-distro builds...). Zing would probably perform nicely...
Transactor config
There's a sample config file in config/samples/free-transactor-template.properties. This is generally pretty sensible except that it's set up for small dev instances rather than production (please don't try importing WormBase into a 1GiB "dev laptop" transactor). Copy the template to somewhere more sensible (I use config/transactor.properties) and edit the memory-index-threshold, memory-index-max, and object-cache-max values (there are suggestions for a 4GiB production instance in the config file template).
You can set explicit values for log-dir and data-dir if they're helpful, or just leave them.
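The copy-and-edit step can be scripted if you set up transactors often. This is a rough sketch rather than anything shipped with Datomic: set_prop is a hypothetical helper, and the values in the example comment are assumed to match the template's commented-out suggestions for a 4GiB instance, so check them against your own copy of the template before using them.

```shell
# Hypothetical helper (not part of Datomic): set KEY=VALUE in a
# Java-style .properties file.
#   set_prop FILE KEY VALUE
# KEY is used in a grep pattern, so it should contain only letters,
# digits and hyphens (true of the Datomic keys used here).
set_prop() {
    file=$1; key=$2; value=$3
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    tmp=$(mktemp) || return 1
    # Drop any active "KEY=..." line; commented-out lines are left untouched.
    grep -v "^${key}=" "$file" > "$tmp" || true
    # Append the new setting at the end of the file.
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back in place so the file keeps its original permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example (values are assumptions; prefer the template's own suggestions):
#   cp config/samples/free-transactor-template.properties config/transactor.properties
#   set_prop config/transactor.properties memory-index-threshold 32m
#   set_prop config/transactor.properties memory-index-max 512m
#   set_prop config/transactor.properties object-cache-max 1g
```

Appending the new line instead of rewriting in place keeps the helper simple and avoids relying on sed -i, which behaves differently on GNU and BSD systems.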
Note that, in addition to the config file, the transactor startup script also consults the environment variable XMX for a Java heap size. To configure a 4GiB heap:
export XMX=-Xmx4G
Ports
By default, Datomic Free binds ports 4334 (main Datomic message-bus port), 4335 and 4336 (both relating to the storage). Datomic expects access control to be implemented elsewhere in the stack, so we don't want these ports open to the public.
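One way to keep these ports private on a plain Linux host is a host firewall. The sketch below is only an illustration under stated assumptions: datomic_fw_rules is a made-up helper, it assumes iptables is the firewall in use, and the trusted address range is whatever network your peers live on. It prints the rules rather than running them, so they can be reviewed first. On AWS, the instance's security group is the more usual place for this kind of restriction.

```shell
# Hypothetical helper: print (not run) iptables rules that accept traffic to
# the three default Datomic Free ports only from one trusted source range,
# and drop all other traffic to those ports.
#   datomic_fw_rules TRUSTED_CIDR
datomic_fw_rules() {
    trusted=$1
    for port in 4334 4335 4336; do
        echo "iptables -A INPUT -p tcp --dport $port -s $trusted -j ACCEPT"
        echo "iptables -A INPUT -p tcp --dport $port -j DROP"
    done
}

# Example (10.0.0.0/8 is a placeholder range):
#   datomic_fw_rules 10.0.0.0/8            # review the output
#   datomic_fw_rules 10.0.0.0/8 | sudo sh  # then apply it
```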
Running it
It's straightforward to start the transactor from a shell. I prefer this for experimentation and testing.
bin/transactor config/transactor.properties
For production purposes, you probably want the transactor to run at system startup. There's a daemontools-style init script here which might be helpful (check the environment variables at the top before starting).