3. Deploying KnetMiner with Docker

Introduction

The web server for the KnetMiner application can be deployed quickly and simply via the Docker container platform. At its simplest, deployment only requires installing the Docker environment and then pulling our [KnetMiner image from GitHub Packages][DK.10]. For more advanced customised deployments, our KnetMiner Docker image can be used with host-specific configurations and data.

Additionally, developers can make their own changes to KnetMiner, starting from our image definitions.

Running KnetMiner via Docker

Quick start guide using the sample dataset

It is quick to run an example instance of KnetMiner with our sample dataset, named Aratiny, which is based on a small subset of the Poaceae dataset (details below). After installing Docker, just type the following command:

docker run -p 8080:8080 -it ghcr.io/rothamsted/knetminer

This will pull our KnetMiner image directly from GitHub Packages via your Internet connection. Server initialisation output in your shell window will confirm that everything is starting up. Once the KnetMiner server is up and ready, you will be able to connect to the sample application via http://localhost:8080/client . You can change the host port via the -p flag; in our case the format is always -p <host-port>:8080. You will know the KnetMiner instance is ready when you see the final output line beginning with Server startup in...
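For instance, to expose the instance on host port 8090 instead:

docker run -p 8090:8080 -it ghcr.io/rothamsted/knetminer
# once started, the client is at http://localhost:8090/client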

Running without any additional parameters ensures that the Docker container only uses the configuration and data present within its own isolated file system, which contains the Aratiny sample dataset.

Note: the -it option is needed because of the way the startup command associated with our container works. An explanation is documented here.

Note: we have migrated from the DockerHub registry to GitHub Packages in 2023, so the image coordinates are no longer knetminer/knetminer.

Currently, we rebuild the Docker image once per day (and only upon changes); auto-builds don't occur more often than this, since image building is a resource-intensive task.
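If you have run KnetMiner before, Docker will reuse your locally cached copy of the image; to pick up the latest rebuild, refresh it with the standard pull command:

docker pull ghcr.io/rothamsted/knetminer
# then re-run the container as shown above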

Doing it more simply: the docker-run.sh script

The docker-run.sh script is provided to make the invocation of the KnetMiner container as simple as possible, essentially by building KnetMiner-specific docker run commands. For instance (assuming Docker is already running), the equivalent of the command described above is:

./docker-run.sh

A variant of this, which maps the container to port 8090, is:

./docker-run.sh --host-port 8090

More complex initialisation options are explained further below.

Note: while you don't strictly need to download (i.e., git clone) the whole KnetMiner codebase just to get this and the other scripts described below, such cloning is the recommended method.
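For instance, the standard way to clone it, using the project's GitHub coordinates:

git clone https://github.com/Rothamsted/knetminer.git
# docker-run.sh and the other helper scripts are inside this codebase
# (their exact location may vary across versions, so check the repository layout)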

Instantiating a specific dataset

While the above example runs KnetMiner against the provided default dataset, an application instance can also be run against a different dataset. Let's consider some definitions and explanations:

A dataset is a set of files that serves a KnetMiner instance. Most of the datasets that we provide are focused on a single species, or on a set of related species. Thus, we have datasets named Arabidopsis, Wheat, Rice, Poaceae, etc. This is also why we sometimes use the word species as a synonym for dataset.

A dataset directory contains all the dataset-specific information that a KnetMiner instance needs to run a dataset. Except when using the default dataset mentioned above, you must provide a dataset directory to run a new KnetMiner instance.

The typical way to do so is to create such a directory on your Docker host computer (i.e., the one that runs the Docker server and its containers). An initial configuration for this directory can be built by adapting one of our pre-defined configurations; this can be done using the dataset-init.sh helper script. Once you have built and configured a dataset, you must provide the dataset core data in the Ondex/Knetbuilder format. By default, this is provided in the file <dataset-dir>/data/knowledge-network.oxl.

These dataset *.oxl files are generated by integrating and processing various public and private life science data sources. The KnetMiner team provides dataset OXLs for many different species and topics, some free and some paid. You can also build your own OXL file, using the Ondex/Knetbuilder framework.

More details about creating new datasets and the structure of the dataset directory are given below in this document.

For example, to run the wheat dataset, using our pre-defined settings, you can set things up this way:

  1. Clone the KnetMiner repository (or download the relevant scripts and files)
  2. Create the dataset directory on the Docker host, eg, /home/knetminer/wheat-dataset, using the dataset initialisation script and the ID of a predefined dataset, either the one corresponding to your data, or a dataset close to what you have in your data file.
  3. Possibly, customise the obtained configuration files (see sections below)
  4. Place your OXL data into <dataset-dir>/data/knowledge-network.oxl (changing this default path/name requires corresponding configuration changes and is not recommended).
  5. Run a new KnetMiner Docker instance, by means of docker-run.sh, using --dataset-dir to point to the dataset directory you've just created on the host.

This is an example of what you could run as the last step:

./docker-run.sh \
  --dataset-dir /home/knetminer/wheat-dataset \
  --host-port 8090 \
  --container-name 'docker-wheat' \
  --container-memory 20G \
  --detach

The options given to the above command will do the following:

  • --dataset-dir mounts the KnetMiner wheat dataset directory into the container
  • --host-port maps the instance to the host port 8090
  • --container-name names the corresponding container docker-wheat (a useful reference for all Docker commands)
  • --container-memory grants 20GB of memory to the container
  • --detach detaches the container from the script (ie, the script immediately ends with the container remaining in background).

This command will create some support data (eg, Lucene index files) under /home/knetminer/wheat-dataset/data. Hence, this directory must be writeable by the (host) user running Docker.
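For instance, a sketch of preparing the directory before the first run, assuming the configuration was already initialised with dataset-init.sh (the source OXL file name is illustrative):

mkdir -p /home/knetminer/wheat-dataset/data
# place your OXL at the default path expected by KnetMiner
cp my-wheat.oxl /home/knetminer/wheat-dataset/data/knowledge-network.oxl
# ensure the host user running Docker can write here (Lucene indexes etc. are created under data/)
chmod -R u+w /home/knetminer/wheat-dataset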

Note: in some instances, it may be beneficial to not include the --detach argument, should you wish to track the progress of each build stage, or for debugging purposes.

See docker-run.sh --help for a documented list of all available options.

The bare Docker equivalent of the command above is:

docker run -it --detach -p 8090:8080 --name docker-wheat --memory 20G \
       --volume /home/knetminer/wheat-dataset:/root/knetminer-dataset \
       ghcr.io/rothamsted/knetminer:latest

This underlying command is shown by the docker-run.sh script when it runs.

Note: the /root/knetminer-dataset path is fixed and refers to KnetMiner's dataset directory within the Docker container. Your dataset directory on the Docker host is thus bound to the container by means of a Docker volume, which maps the host directory onto this known container path. This top-level location needs to be taken into account when configuring paths (see sections below).

Following the execution of the above command, the corresponding Docker instance will become available at http://localhost:8090/client. The console where you launched the command will show some output about the KnetMiner initialisation and the answers returned to the client(s). The running container can be stopped via Ctrl-C (when not using --detach, otherwise execute docker stop docker-wheat).

Beware that the startup phase of large datasets may take a long time, so you might need to wait a while (10 minutes to a few hours) before the KnetMiner application starts returning significant results. During this stage, search and similar operations return errors to the user interface.
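While waiting, you can track initialisation and manage the instance with standard Docker commands, eg, for the docker-wheat container created above:

docker logs -f docker-wheat   # follow the container output (useful with --detach)
docker stop docker-wheat      # stop the instance when needed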

Cleaning the working files

KnetMiner creates a number of working files in the configured dataset directory. Sometimes you need to clean these files, eg, when you update the OXL and new Lucene index files need to be re-created. The cleanup-volume.sh script is provided for this and can be executed as follows:

./cleanup-volume.sh /home/knetminer/wheat-dataset

The command will leave files like data/knowledge-network.oxl untouched, since this is what you normally provide externally for a dataset.
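For example, a typical OXL update cycle might look like the following sketch, based on the commands shown earlier (the new OXL file name is illustrative):

docker stop docker-wheat
docker rm docker-wheat        # remove the stopped container, so its name can be reused
cp new-wheat.oxl /home/knetminer/wheat-dataset/data/knowledge-network.oxl
./cleanup-volume.sh /home/knetminer/wheat-dataset
./docker-run.sh --dataset-dir /home/knetminer/wheat-dataset --container-name 'docker-wheat' --detach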

TODO: dataset-init.sh overrides and adds files. At the moment, we don't have any helper to reset a configuration directory to its default and dataset-specific contents only.

Additional Docker options

The docker-run.sh script wraps the execution of the docker run command, and it allows for setting a limited number of the options that this command accepts. For passing further Docker options that are not available from the script, use the environment variable DOCKER_OPTS, e.g.,

export DOCKER_OPTS="-it --volume /opt/local-dir:/opt/container-dir"
./docker-run.sh ...

Note: the -i and -t flags (which show the container's output) are useful in most cases. Our script adds them as defaults when DOCKER_OPTS is empty, else you need to specify them explicitly.

Automatic container restart

A KnetMiner container can be restarted automatically using the corresponding Docker option:

export DOCKER_OPTS="-it --restart unless-stopped"
./docker-run.sh ...

When using this option, the KnetMiner container is restarted whenever the Docker daemon is restarted (in particular, when the host is rebooted).

Note to RRes admins: always use this option for production instances and permanent test instances (test instances managed by the CI already have it).

Alternative configurations

The docker-run script supports the --config-file option, which can be used to switch to an alternative configuration file. For instance, you might set up the Neo4j mode (see the next section) in a separate file, so that you can switch between the default mode (no option used, the default config.yml is picked) and the Neo4j mode (eg, via docker-run.sh --config-file config/config-neo4j.yml).

The path given to this option is always relative to the dataset directory (in the unlikely case that you need a different behaviour, you'll need to pass the corresponding Java property to the container JVM; see the script code for details).
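For example, combining this option with the wheat dataset used earlier:

./docker-run.sh \
  --dataset-dir /home/knetminer/wheat-dataset \
  --config-file config/config-neo4j.yml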

As explained in the next section, a good way to manage this scenario is to have a base/default file and include it in the variants (example).

Enabling Neo4j

KnetMiner has graph traversal components that [relate genes to other bio-molecular entities][90] of interest. One such component is based on exporting KnetMiner data into a Neo4j database (see the link provided).

As explained above, you can configure KnetMiner in Neo4j mode via an alternative YML configuration (example). Configuration details about Neo4j are managed via a Spring beans file; see here for details.

You might want the Docker container to point to a Neo4j database running on the container's host (not recommended for large datasets, since it might require a lot of CPU and memory). In that case, you might need to use Docker's --add-host parameter, as explained here.
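A sketch of that setup, using the DOCKER_OPTS mechanism described earlier (host.docker.internal with host-gateway is the standard Docker idiom for reaching the host; your Neo4j connection settings must then point at that host name):

export DOCKER_OPTS="-it --add-host host.docker.internal:host-gateway"
./docker-run.sh --config-file config/config-neo4j.yml ...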

Configuring KnetMiner

In this section, we describe the details of creating a KnetMiner instance, including configuration and setup.

When the --dataset-dir <dataset-dir> parameter is passed to the docker-run.sh script, by default KnetMiner opens <dataset-dir>/config/config.yml, and all of the other configuration files are also searched for in that directory. Note that the dataset directory refers to the host, not the container, and that relative paths in the config files refer to the dataset directory (not to config/).

Composing an initial dataset config via initialisation script

As mentioned earlier, the quickest way to create a new dataset directory for a new dataset is to choose an existing reference configuration in datasets/ and pass it to the dataset-init.sh script.

This creates a new configuration directory, by putting together (i.e., copying) dataset-specific files stored in the dataset directory (eg, datasets/poaceae-test) and files from the reference/test instance configuration (i.e., aratiny).
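As a sketch, an invocation might look like this; the arguments shown are hypothetical, so check the script's own help or source for its real interface:

# hypothetical arguments, for illustration only
./dataset-init.sh datasets/poaceae-test /home/knetminer/wheat-dataset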

After that, and after having started the KnetMiner instance, the result is something like this:

.
├── config
│   ├── SemanticMotifs.txt
│   ├── config.yml
│   ├── defaults-cfg.yml
│   ├── neo4j
│   │   ├── config-test.xml
│   │   ├── config.xml
│   │   ├── defaults-cfg.xml
│   │   ...
│   │   └── semantic-motif-queries.txt
│   ├── sample-queries.xml
│   ├── seed-genes-example.tsv
│   ├── species
│   │   ├── base-map-3702.xml
│   │   ├── base-map-4565.xml
│   │   └── base-map-4577.xml
│   ├── test-cfg-neo4j.yml
│   └── test-cfg.yml
└── data
    ├── knowledge-network.oxl
    ...
    ├── index
    │   ...
    │   ├── segments_p
    │   └── write.lock
    ...

The main files listed above are:

  • config/config.yml: by default, this is the entry point to read a configuration. Other YML files can be included from there (see below)
  • config/defaults-cfg.yml: defaults from aratiny, which are usually included by config.yml
  • config/sample-queries.xml: example queries for this dataset (shown on the top/right side of the user interface)
  • config/species/base-map-NNNN.xml: these are configuration files to visualise genes on the map/chromosome view. They are per-species files, with the number being the NCBI Taxonomy ID of the respective species (these are default names). See genomaps.js for details.
  • config/SemanticMotifs.txt: semantic motif configuration, see here for details
  • config/neo4j: semantic motif configuration for the Neo4j mode, see here for details
  • config/seed-genes-example.tsv: optional list of genes to be used as starting point for the semantic motif traversal, see here for details.
  • config/config-neo4j.yml (or similar name): to be used in Neo4j mode, in place of the default config.yml (as explained above).
  • data/knowledge-network.oxl: the data, in the form of an Ondex knowledge graph. As explained above, you need to build or acquire this file and place it here (as default path).
  • Other files in data: they are created by the application to support its functionality (eg, index/ for Lucene index files)

The Configuration files

The .yml files are based on the YAML format. The expected KnetMiner-specific schema for them can be inferred from the code.

Typical examples are the defaults and the file used for our test instance.

Rules for the Configuration files

We have added special markers and rules to the YAML format that support advanced features. Most of the details can be inferred from these tests and test files. Here is a summary:

defaults
A typical KnetMiner instance configuration consists of an entry point file (usually config.yml) which overrides or extends a number of default values. These are either defined in a default config file, which the main file includes (see below), or by the application code itself.

Examples of defaults are:

  • default path locations (eg, <datasetDir>/data/knowledge-network.oxl as the OXL location)
  • general dataset metadata (eg, Rothamsted as organisation name)
  • default graph traverser (see here)

inclusions
As mentioned, when parsing these YAML files, we support a special syntax for inclusion. Example:

# This is usually in config.yml. As you see, the expected value is an array, so multiple files
# can be included from a parent level 
"@includes":
- defaults-cfg.yml

extensions, overrides and merges
The inclusion mechanism is enhanced by composition rules. A simple one is extension at the top level. For instance, if you have this:

# This is defaults-cfg.yml
oxl: data/knowledge-network.oxl
dataset:
  title: Default Dataset
  organization: Rothamsted Research


# This is config.yml
"@includes":
- defaults-cfg.yml

oxl: data/my-data.oxl
cypherDebuggerEnabled: true
dataset:
  title: My Cool Dataset

the cypherDebuggerEnabled field is added to the top-level configuration object (when not specified, the application assumes it's false), while the oxl field in config.yml overrides the attribute already defined at the same JSON level. Similarly, the dataset field is a top-level object, and its 2-field default is completely overridden by another object, which contains a new title only. Because YAML eventually yields JSON, the result here is going to be:

{
  "oxl": "data/my-data.oxl",
  "cypherDebuggerEnabled": true,
  "dataset": {
    "title": "My Cool Dataset"
  }
}

Note that, with this simple syntax, the dataset object in the defaults is completely replaced; there is no merge. That said, we support a syntactic trick to specify that we want to merge two objects:

# This is defaults-cfg.yml, as above
...

# This is config.yml
"@includes":
- defaults-cfg.yml

oxl: data/my-data.oxl
cypherDebuggerEnabled: true
"dataset @merge":
  title: My Cool Dataset

With the "@merge" postfix, the result built by the KnetMiner code is:

{
  "oxl": "data/my-data.oxl",
  "cypherDebuggerEnabled": true,
  "dataset": {
    "title": "My Cool Dataset",
    "organization": "Rothamsted Research"
  }
}

As you can see, the final dataset field is an object where new top-level fields were added to the included default, and fields with the same name override the respective defaults.

Inclusion works in top-down direction only
The idea is that base/default definitions are put in a lower-level included file, and the higher-level file then makes more specific overrides or extensions. In this example, defaults-cfg.yml cannot change oxl or cypherDebuggerEnabled in the final result, since the including/top file always wins over the included/lower one.

Config properties and interpolation
As shown in the test files (1, 2), one can define @properties in a file and use them in configuration values, in the same file or in the included files. The scope of these variables spans in the downstream direction, and definitions can be nested, ie, a downstream re-definition overrides values from its own scope downwards, just as a local variable in a programming-language function has a local scope only and shadows its parents with the same name. For instance, in the linked example, appName is visible both in the parent and in the included file, with the top-level value (since it isn't redefined in the included file). In contrast, appVersion has the 2.0 value in the parent file, while it becomes 1.0 in the included one, due to the scope locality rule.
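A sketch of this scoping behaviour, inferred from the linked test files (the description and note field names are illustrative, and the exact syntax should be double-checked against those tests):

# This is the parent file, eg, config.yml
"@properties":
  appName: KnetMiner
  appVersion: "2.0"
"@includes":
- included.yml
description: ${appName} ${appVersion}  # resolves to "KnetMiner 2.0"


# This is included.yml
"@properties":
  appVersion: "1.0"  # local re-definition, visible from this scope downwards only
note: ${appName} ${appVersion}  # resolves to "KnetMiner 1.0"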

system properties
Property names are interpolated by taking values from @properties and/or from two common external places: the Java system properties (ie, the -D option passed to the java command, or defined via some equivalent mechanism) and the operating system environment variables. See an example here. A property/environment value passed from the outside will be overridden by @property definitions, due to the scope locality principles explained above. To obtain a different behaviour, see the next section.

system properties defined in the configuration
KnetMiner offers the additional option to pre-define Java system properties in the configuration files, using the systemProperties field of the top-level object. See the defaults for an example.

In contrast to @properties values, the values defined here are overridden by Java system properties (ie, those passed via java -D). The idea is to offer a further injection mechanism (eg, to be used by docker-run.sh or JAVA_TOOL_OPTIONS), in combination with default definitions.

In particular, this is useful to configure subsystems like Neo4j, which are not based on YAML (but on Spring) and can receive such Java properties for certain values (see #{systemProperties[...] in [this default][55]).
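For example, a sketch of injecting such a property from outside the container, via JAVA_TOOL_OPTIONS passed through DOCKER_OPTS (the property name below is hypothetical; real names are defined in your configuration):

export DOCKER_OPTS='-it -e JAVA_TOOL_OPTIONS=-Dknetminer.neo4j.boltUrl=bolt://example.org:7687'
./docker-run.sh ...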

default properties
KnetMiner defines a number of interpolated variables:

  • ${me} resolves to the path of the current .yml file (the one where you use this name)
  • ${mydir} resolves to the directory path of this file

This can be useful to deal with relative file paths. Remember that usually, i.e., under Docker, their values are based on the constant path /root/knetminer-dataset (eg, ${mydir} is /root/knetminer-dataset/config for the top-level config.yml), since this is where the dataset is located from the point of view of the Docker container.
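For instance, a hedged sketch of using ${mydir} for an explicit path (the oxl field is the one from the examples above):

# This is /root/knetminer-dataset/config/config.yml, inside the container
oxl: ${mydir}/../data/knowledge-network.oxl
# ${mydir} resolves to /root/knetminer-dataset/config, so the OXL path becomes
# /root/knetminer-dataset/data/knowledge-network.oxl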

Google Analytics

TODO. For the time being, see our configurations.

The KnetMiner Docker architecture

We have moved this section here.

Developing KnetMiner with Docker

We moved this section here.

Deploying with AWS analytics enabled

We have moved this section here.

Troubleshooting

We have moved this section here.
