Using Elasticsearch
Elasticsearch is a flexible and powerful open source, distributed real-time search and analytics engine for the cloud. Elasticsearch allows you to start small, but will grow with your business. It is built to scale horizontally out of the box. As you need more capacity, just add more nodes, and let the cluster reorganize itself to take advantage of the extra hardware. Elasticsearch clusters are resilient – they will detect and remove failed nodes, and reorganize themselves to ensure that your data is safe and accessible. — Elasticsearch Homepage
Titan supports Elasticsearch as an embedded or remote index backend. In embedded mode, Elasticsearch runs in the same JVM as Titan and stores data on the local machine. In remote mode, Titan connects to a running Elasticsearch cluster as a client. When not using embedded mode, make sure Elasticsearch is running and accessible.
For single machine deployments, Elasticsearch can run embedded with Titan. In other words, Titan will start Elasticsearch internally and connect to it within the JVM.
To run Elasticsearch embedded, add the following configuration options to the graph configuration file, where /tmp/searchindex/ specifies the directory where Elasticsearch should store the index data:
storage.index.search.backend=elasticsearch
storage.index.search.directory=/tmp/searchindex
storage.index.search.client-only=false
storage.index.search.local-mode=true
Note that Elasticsearch will not be accessible from outside of this particular Titan instance, i.e., remote connections will not be possible. It may also be advisable to run Elasticsearch in a separate JVM even for single machine deployments to achieve more predictable GC behavior.
In the above configuration, the index backend is named search. Replace search with a different name to change the name of the index.
Titan can connect to an external Elasticsearch cluster running remotely on a separate set of machines or locally on the same machine.
To connect Titan to an external Elasticsearch cluster, add the following configuration options to the graph configuration file, where hostname lists the IP addresses of the instances in the Elasticsearch cluster:
storage.index.search.backend=elasticsearch
storage.index.search.hostname=100.100.101.1,100.100.101.2
storage.index.search.client-only=true
Make sure that the Elasticsearch cluster is running prior to starting a Titan instance attempting to connect to it. Also ensure that the machine running Titan can connect to the Elasticsearch instances over the network if the machines are physically separated. This might require setting additional configuration options which are summarized below.
In the above configuration, the index backend is named search. Replace search with a different name to change the name of the index.
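Since Titan can only connect to a cluster that is already running and reachable, a quick pre-flight check of the configured hosts can save a failed startup. The following is a minimal sketch, not part of Titan: the function names are hypothetical, and it assumes the cluster listens on port 9300, Elasticsearch's default transport port. A successful TCP connection only shows that the port is open, not that the cluster is healthy.

```python
import socket

# Assumption: 9300 is Elasticsearch's default transport port; change it if
# your cluster is configured differently.
DEFAULT_TRANSPORT_PORT = 9300


def can_connect(host, port=DEFAULT_TRANSPORT_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def unreachable_hosts(hostname_option, port=DEFAULT_TRANSPORT_PORT, timeout=2.0):
    """Split a comma-separated hostname value and return the hosts that fail."""
    hosts = [h.strip() for h in hostname_option.split(",") if h.strip()]
    return [h for h in hosts if not can_connect(h, port, timeout)]
```

Calling `unreachable_hosts("100.100.101.1,100.100.101.2")` with the same value as the hostname option returns the hosts that could not be reached; an empty list means every host accepted a connection.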
- Full-Text: Supports all Text predicates to search for text properties that match a given word, prefix or regular expression.
- Geo: Supports the Geo.WITHIN condition to search for points that fall within a given circle. Only points are supported for indexing and only circles for querying.
- Numeric Range: Supports all numeric comparisons in Compare.
This is the full list of configuration options for Elasticsearch. Note that each of these options must be prefixed with storage.index.[INDEX-NAME]. (including the trailing dot), where [INDEX-NAME] stands for the name of the index backend. For instance, if the index backend is named search, these configuration options must be prefixed with storage.index.search. in the graph configuration file.
Option | Description | Value | Default | Modifiable |
---|---|---|---|---|
backend | Index backend implementation name | elasticsearch | – | – |
hostname | Comma-separated list of IP addresses or hostnames of the instances in the Elasticsearch cluster | IPs | – | yes |
index-name | Name of the index | string | titan | no |
cluster-name | Name of the Elasticsearch cluster. If none is defined, the name will be ignored. | string | elasticsearch | yes |
local-mode | Whether Titan should run Elasticsearch embedded | boolean | false | no |
directory | Directory to store Elasticsearch data in. Only applicable when running Elasticsearch embedded | string | – | yes |
config-file | Filename of the Elasticsearch yaml file used to configure this instance. Only applicable when running Elasticsearch embedded | string | – | no |
client-only | Whether this node is a client node with no data | boolean | true | no |
max-result-set-size | The default maximum result set size for any query if the query does not explicitly specify a limit | integer | 100000 | yes |
sniff | Whether client transport sniffing is enabled. When encountering connection problems, in particular on AWS, disable this option | boolean | true | yes |
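To illustrate the prefix rule, the sketch below turns a set of option names from the table into fully qualified lines for the graph configuration file. The helper `prefixed_options` is hypothetical and exists only for this example; Titan itself reads the resulting properties lines directly.

```python
def prefixed_options(index_name, options):
    """Return properties-file lines with each option prefixed for one index backend."""
    prefix = "storage.index.{}.".format(index_name)
    lines = []
    for key, value in options.items():
        if isinstance(value, bool):
            # Properties files use lowercase true/false.
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Multi-valued options such as hostname are comma-separated.
            value = ",".join(str(v) for v in value)
        lines.append("{}{}={}".format(prefix, key, value))
    return lines


# Example: a remote configuration with sniffing disabled.
remote = {
    "backend": "elasticsearch",
    "hostname": ["100.100.101.1", "100.100.101.2"],
    "client-only": True,
    "sniff": False,
}
print("\n".join(prefixed_options("search", remote)))
```

Passing a different first argument, for example `prefixed_options("fulltext", remote)`, produces the same options under the storage.index.fulltext. prefix, which is how a differently named index backend would be configured.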
For bulk loading or other write-intensive applications, consider increasing Elasticsearch’s refresh interval. Refer to this discussion on how to increase the refresh interval and its impact on write performance. Note that a higher refresh interval means that it takes longer for graph mutations to become available in the index.
For additional suggestions on how to increase write performance in Elasticsearch with detailed instructions, please read this blog post.
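One way to change the refresh interval on a running remote cluster is through Elasticsearch's index settings endpoint. The sketch below only builds the request and does not send it. It assumes the cluster's HTTP endpoint is reachable at localhost:9200, that the index uses the default index-name titan, and that 30s is a suitable interval for your load; the helper name is hypothetical. It does not apply to an embedded instance that exposes no HTTP endpoint.

```python
import json
import urllib.request


def refresh_interval_request(base_url, index_name, interval):
    """Build (but do not send) a PUT request that updates index.refresh_interval."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url="{}/{}/_settings".format(base_url.rstrip("/"), index_name),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumed values: local HTTP endpoint, default index name, example interval.
request = refresh_interval_request("http://localhost:9200", "titan", "30s")

# To apply the setting against a running cluster, send the request:
# urllib.request.urlopen(request)
```

After the bulk load finishes, the same helper can restore a shorter interval, such as Elasticsearch's default of 1s, so that graph mutations become searchable quickly again.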
- Please refer to the Elasticsearch homepage and available documentation for more information on Elasticsearch and how to set up an Elasticsearch cluster.
- Check out example graph configurations for complete configurations including the storage backend.