Semantic Motif Searching in Knetminer - Rothamsted/knetminer Wiki

Index

Introduction

Knetminer uses a combination of graph patterns and traditional search ranking techniques to estimate how relevant genes are to search words; this relevance is then used to rank and select the genes shown as search results.

Details are available in our Knetminer paper. We define a semantic motif as a graph path (or a pattern matching a path) from a gene to another entity in a Knetminer knowledge graph. An example (in an informal syntax):

  Gene -  encodes -> Protein - interacts-with (1-2 links) -> Protein <- mentions <- Publication

which links publications mentioning a protein to other interacting proteins and to the genes that encode the latter.

Knetminer can link genes to other entities by means of multiple motifs like the above. Every dataset/species that makes up an instance can be configured with a set of motifs, which are matched against the genes in the dataset to find relevant gene-related entities.

That matching is performed by what we call a graph traverser. Currently, there are two ways to perform semantic motif searches in Knetminer, each with its own language for defining the motifs and its own set of configuration options. Each approach has a dedicated graph traverser, which means you can choose the type of semantic motif search you want to use, and thus the corresponding graph pattern language, by defining the right traverser in a configuration file. Details are given in this document.
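In practice, choosing the traverser means pointing the Knetminer configuration at the class implementing it. The fragment below is only a sketch: the property name and value are placeholders, and the actual names are given in the traverser configuration sections later in this document.

```
# Placeholder names, for illustration only; see the configuration
# sections below for the actual property and class names.
graph.traverser.class = <fully qualified name of the traverser class>
```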

The Data Model for the Knetminer Knowledge Graphs

Both of the graph traversers used in Knetminer (or any other traverser, for that matter) allow graph patterns to be defined by referring to the node type names and node link names used in the underlying Knetminer dataset. This is essentially a knowledge graph, namely a property graph, and those names are based on a predefined schema. The reference for such a schema is a metadata file included in Ondex. Examples of it are given in our paper about the Knetminer backend. The same metadata are automatically translated into our BioKNO ontology, and sample SPARQL queries are presented in our SPARQL endpoint.

All the examples in this document are based on the same metadata.

The State Machine Traverser

Historically, the so-called state machine traverser (SM) was the first one developed, within the Ondex project. It allows semantic motifs to be defined as a graph of transitions between node types (concept classes, in Ondex terms), linked by the relation types that you want to hold between nodes.

For instance, this is what we use for the Arabidopsis dataset:

Here we're saying, for example, that we want to match a gene with any trait that co-occurs (cooc_wi) with the gene (in the sense of text-mining co-occurrence): both the relations Gene - cooc_wi -> Trait and Trait - cooc_wi -> Gene will be matched (non-directional link). As another example, look again at the figure and find the chain Gene - enc - Protein - genetic|physical -> Protein, which includes self-loops on the first protein, mixed directed and undirected links, and multiple relation types that can validly link one protein to the next.

State machines can be defined with a simple flat file format. The file defining the SM in the figure is here. Let's look at an example:

#Finite States *=start state ^=end state
1*	Gene
2^	Publication
3^	MolFunc
...
7^	Protein
8^	Gene
9	Gene
10	Protein
...
10-10	ortho	4
10-10	xref	4
10-10	genetic	6	d
...
1-10	enc
10-7	physical	6	d
10-7	genetic	6	d
...

The format is very simple:
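To make the layout concrete, here is a small Python sketch that parses this flat format. It is not part of Knetminer, and the meanings assumed for the optional transition columns (a path-length constraint and a `d` flag marking directed links) are our reading of the example above, so double-check them against the linked reference file.

```python
def parse_state_machine(text):
    """Parse the SM flat-file format shown above (illustrative sketch).

    Returns (states, transitions). The meanings assumed here for the
    optional transition columns -- a max path length and a 'd' (directed)
    flag -- are our reading of the example; check the reference file.
    """
    states, transitions = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        fields = line.split()
        if '-' in fields[0]:
            # Transition line, e.g. "10-7  genetic  6  d"
            src, dst = fields[0].split('-')
            transitions.append({
                'from': int(src), 'to': int(dst),
                'relation_type': fields[1],
                'max_length': int(fields[2]) if len(fields) > 2 else None,
                'is_directed': len(fields) > 3 and fields[3] == 'd',
            })
        else:
            # State line, e.g. "1*  Gene": '*' = start state, '^' = end state
            states[int(fields[0].rstrip('*^'))] = {
                'concept_class': fields[1],
                'is_start': fields[0].endswith('*'),
                'is_end': fields[0].endswith('^'),
            }
    return states, transitions
```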

How to configure the SM traverser in Knetminer

The SM configuration is part of the configuration required to setup a Knetminer instance, which is
described in our wiki. Our pre-configured datasets are examples of it.

Details are:

Performance tuning and troubleshooting

The SM traverser is usually rather efficient, without much to configure or tune. However, there are a few factors that affect its performance:

The SM renderer

Flat files can be visualised as in the figure above; see the section about the state machine converter below.

The Cypher/Neo4j Traverser

The Cypher/Neo4j traverser (shortened to Cypher traverser) is part of our efforts to publish Knetminer data as machine-readable, standardised data, which can be accessed by third-party applications, including, for instance, your own scripts. Details about this general perspective are in the above-mentioned backend paper.

With this traverser, you define semantic motif paths by means of graph queries written in the Cypher query language. The idea is that a Knetminer dataset (available as an OXL file produced by Ondex) is converted to a Neo4j database, which is then made available to applications like the Cypher traverser.

As a language to define semantic motifs, Cypher is more expressive and offers more advanced constructs.

The queries can initially be tried directly in the Neo4j browser, either on your own Neo4j instance (see below) or using the endpoints we provide for some datasets. KnetMiner Cypher queries can be found here.

Note that the Neo4j database used by the Cypher traverser doesn't replace the OXL file that Knetminer uses for most of its operations. The two have to be kept aligned: the Neo4j database has to be generated from the OXL conversion, as explained below.

The Cypher traverser is implemented in the backend project.

Query format

Before looking at the details of the Cypher traverser configuration, let's talk about the format required for its queries. The traverser supports the Neo4j flavour of the Cypher language (though we could add support for other openCypher databases in the future). However, there are some restrictions that are required for a query to make sense in the context of semantic motifs. Here is an example, based on the pattern discussed in the State Machine section above:

MATCH path = (gene_1:Gene)
  - [enc:enc] - (protein_10:Protein)
  - [rel_10_10:h_s_s|ortho|xref*0..1] - (protein_10b:Protein)
  - [rel_10_7:genetic|physical*1..2] -> (protein_7:Protein)
WHERE gene_1.iri IN $startGeneIris
RETURN path

We can use this example to illustrate several rules:
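As an illustration of the kind of restrictions involved, the following Python sketch checks a query against the conventions visible in the example above: bind the whole match to a `path` variable, start from a Gene node filtered via the `$startGeneIris` parameter, and return the path. These checks are ours, for illustration only; the traverser's actual validation may differ.

```python
import re

def motif_query_problems(cypher):
    """Check a Cypher query against the conventions visible in the
    example above. Illustrative only: the traverser's real validation
    may differ from these checks."""
    problems = []
    if not re.search(r'\bMATCH\s+path\s*=', cypher):
        problems.append("bind the whole match to a 'path' variable")
    if ':Gene' not in cypher:
        problems.append('start the path from a Gene node')
    if '$startGeneIris' not in cypher:
        problems.append('filter start genes via the $startGeneIris parameter')
    if not re.search(r'\bRETURN\s+path\b', cypher):
        problems.append("end the query with 'RETURN path'")
    return problems
```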

How to configure the Cypher Traverser

As with the state machine traverser, the configuration of the Cypher traverser is part of a Knetminer dataset configuration. The basics work like this:

The State Machine Converter and visual renderer

If you want to migrate from semantic motifs defined through a state machine flat file (described above) to the corresponding Cypher queries, we have a state-machine-to-Cypher converter utility. As you can see, for the time being this is a prototype, only available through the Maven Exec plug-in (i.e., it requires Maven and a download of the backend codebase).

As part of its output, the converter produces files in .dot and .svg formats, which encode a representation of the SM. The sample figure presented above, in the section about the SM traverser, was produced by these means.

For the developers, the tool uses this class, which can be invoked programmatically.

How to generate a Cypher database

If you use one of our data dumps, we provide you with both an OXL file and a Neo4j data dump generated from the OXL (plus the RDF dump, which, in addition to being useful in itself, is an intermediate step of the conversion). If you want to work with your own dataset, you'll have to convert your OXL.

This is how it works:

Performance tuning and troubleshooting

Several of the considerations made for the state machine traverser apply to the Cypher traverser as well: the more paths you match, the more time you need, and similarly for the number of seed genes or the number of semantic motifs (i.e., Cypher queries).

Similarly, you should keep the length of paths under control, in particular when defining self-loops.
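For instance, a self-loop is best written with an explicit upper bound on its variable-length part. This variant of the earlier example (illustrative only) caps the protein-protein hops:

```
MATCH path = (gene_1:Gene)
  - [enc:enc] - (protein_10:Protein)
  // Bounded self-loop: at most 2 hops. An unbounded '*' here could make
  // the traverser enumerate a combinatorial number of paths.
  - [rel_10_10:ortho|xref*0..2] - (protein_10b:Protein)
WHERE gene_1.iri IN $startGeneIris
RETURN path
```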

In addition, you have to be careful when using Cypher. As with most expressive query languages, it's easy to write badly performing queries; have a look at literature like this or this. Regarding this point, the OXL converter indexes the iri attributes, plus those Ondex attributes that have the 'index' flag set.

There are a number of other Cypher traverser parameters that can be fine-tuned and which affect performance, by changing aspects like the degree of parallelism, the gene batch and page sizes used for Cypher queries, and a timeout that trades a percentage of failed queries (missed semantic motifs) for speed. Such parameters are described in the configuration file used for Cypher tests. Beware that the defaults we provide are based on extensive tests, and we expect them to be good in most cases where you use the Community Edition of Neo4j. If you have the Enterprise Edition instead, which exploits all the available cores without commercial restrictions, you might want to try bigger values for queryThreadPoolSize.
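As a rough sketch of what such tuning looks like, consider the fragment below. Apart from queryThreadPoolSize, which is mentioned above, the property names are placeholders invented for illustration; the real ones are in the Cypher test configuration file.

```
# Sketch only: except for queryThreadPoolSize, these property names are
# placeholders; see the Cypher test configuration file for the real ones.
queryThreadPoolSize = 8      # parallel query threads (try higher on Enterprise)
geneBatchSize = <...>        # genes per traversal batch (placeholder name)
queryTimeoutMs = <...>       # per-query timeout (placeholder name)
```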

Knetminer logs and performance reporter

By default, the Cypher traverser writes a performance report when it finishes its job. This is a CSV table, listing things like how much time each query took to run, the number of paths it retrieved, the average path length, and how many times a query timed out. This can be very useful for checking that all queries complete correctly and for optimising performance.
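Since the report is plain CSV, it is easy to post-process with standard tools. Here is a small Python sketch; both the sample rows and the column names are made up for illustration, so adapt them to the headers in your actual report.

```python
import csv
import io

def slowest_queries(report_text, top=1):
    """Rank the queries in a traverser performance report by running time.
    The column names used here are invented for illustration; adapt them
    to the headers in your actual report."""
    rows = list(csv.DictReader(io.StringIO(report_text)))
    rows.sort(key=lambda r: int(r['time_ms']), reverse=True)
    return [r['query'] for r in rows[:top]]

# Made-up sample report, for illustration only.
report = """query,time_ms,paths,avg_path_len,timeouts
motif_publication.cypher,1250,3400,4.2,0
motif_trait.cypher,45000,120,6.1,3
"""
```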

timeoutReportPathTemplate is a similar option, which can be used to obtain a detailed report for each query that times out.

The Cypher debugger

In addition to working with configuration and log files, you can debug/troubleshoot the queries passed to the Cypher traverser by means of a debugger tool. This can be enabled by setting the knetminer.backend.cypherDebugger.enabled Maven property in your dataset Maven settings (which, as usual, is injected into data-source.xml). When this is active, point your browser to http://<server-prefix>/client/cydebug/ and you'll see an interface where you can define a new set of queries and re-initialise Knetminer with them, obtaining a performance report at the end.

The Knetminer instance running on the same server will then base its searches on the semantic motifs computed from the queries you passed it via the Cypher Debugger.

WARNING: if not already clear, this tool discards the server-configured queries and re-generates semantic motifs based on the new ones. If the latter are just a bunch of tests, the resulting Knetminer will likely not yield the results you expect. This also means that enabling the Cypher debugger on a production server is a security threat: the corresponding web interface isn't protected at all (it's meant to be used in a trusted intranet), hence anyone could use it to wipe out the semantic motif queries and results that your Knetminer relies on, which of course would disrupt the application badly.