Using Elasticsearch

Elasticsearch is a flexible and powerful open source, distributed real-time search and analytics engine for the cloud. Elasticsearch allows you to start small, but will grow with your business. It is built to scale horizontally out of the box. As you need more capacity, just add more nodes, and let the cluster reorganize itself to take advantage of the extra hardware. Elasticsearch clusters are resilient – they will detect and remove failed nodes, and reorganize themselves to ensure that your data is safe and accessible. — Elasticsearch Homepage

Titan supports Elasticsearch as an embedded or remote index backend. In embedded mode, Elasticsearch runs in the same JVM as Titan and stores data on the local machine. In remote mode, Titan connects to a running Elasticsearch cluster as a client. When not using embedded mode, make sure the Elasticsearch cluster is running and reachable from the Titan machine.

Elasticsearch Embedded Configuration

For single-machine deployments, Elasticsearch can run embedded with Titan. In other words, Titan starts Elasticsearch internally and connects to it within the same JVM.

To run Elasticsearch embedded, add the following configuration options to the graph configuration file, where /tmp/searchindex is the directory in which Elasticsearch should store the index data:

storage.index.search.backend=elasticsearch
storage.index.search.directory=/tmp/searchindex
storage.index.search.client-only=false
storage.index.search.local-mode=true

Note that Elasticsearch will not be accessible from outside this particular Titan instance, i.e., remote connections will not be possible. Even for single-machine deployments, it may be advisable to run Elasticsearch in a separate JVM to achieve more predictable GC behavior.

In the above configuration, the index backend is named search. Replace search with a different name to give the index backend a different name.

Elasticsearch Remote Configuration

Titan can connect to an external Elasticsearch cluster running remotely on separate machines or locally on the same machine.

To connect Titan to an external Elasticsearch cluster, add the following configuration options to the graph configuration file, where hostname lists the IP addresses or hostnames of the instances in the Elasticsearch cluster:

storage.index.search.backend=elasticsearch
storage.index.search.hostname=100.100.101.1,100.100.101.2
storage.index.search.client-only=true

Make sure that the Elasticsearch cluster is running before starting a Titan instance that connects to it. If Titan and Elasticsearch run on different machines, also make sure that the machine running Titan can reach the Elasticsearch instances over the network. This might require setting additional configuration options, which are summarized below.

In the above configuration, the index backend is named search. Replace search with a different name to give the index backend a different name.
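Before starting Titan in remote mode, it can help to confirm that each host in the hostname value accepts TCP connections from the Titan machine. The following is a minimal sketch of such a pre-flight check, not part of Titan itself. The class name EsReachability and its methods are hypothetical, and the port 9300 is an assumption (Elasticsearch's default transport port); adjust it if your cluster is configured differently.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight helper: checks TCP reachability of Elasticsearch hosts. */
final class EsReachability {

    // Assumption: 9300 is the default Elasticsearch transport port.
    static final int ASSUMED_TRANSPORT_PORT = 9300;

    /** Splits a comma-separated hostname value such as "100.100.101.1,100.100.101.2". */
    static List<String> parseHosts(String hostnameValue) {
        List<String> hosts = new ArrayList<>();
        for (String part : hostnameValue.split(",")) {
            String host = part.trim();
            if (!host.isEmpty()) {
                hosts.add(host);
            }
        }
        return hosts;
    }

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolvable host names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the same value you use for storage.index.search.hostname.
        String hostnameValue = args.length > 0 ? args[0] : "100.100.101.1,100.100.101.2";
        for (String host : parseHosts(hostnameValue)) {
            boolean ok = isReachable(host, ASSUMED_TRANSPORT_PORT, 2000);
            System.out.println(host + ":" + ASSUMED_TRANSPORT_PORT
                    + (ok ? " reachable" : " NOT reachable"));
        }
    }
}
```

A successful connection only shows that something is listening on that port. It does not confirm that the listener is Elasticsearch or that the cluster is healthy.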

Feature Support

Full-Text: Supports all Text predicates to search for text properties that match a given word, prefix, or regular expression.

Geo: Supports the Geo.WITHIN condition to search for points that fall within a given circle. Only supports points for indexing and circles for querying.

Numeric Range: Supports all numeric comparisons in Compare.

Configuration Options

This is the full list of configuration options for Elasticsearch. Note that each of these options must be prefixed with storage.index.[INDEX-NAME]. where [INDEX-NAME] stands for the name of the index backend. For instance, if the index backend is named search, then these configuration options must be prefixed with storage.index.search.

Option	Description	Value	Default	Modifiable
backend	Index backend implementation name	elasticsearch	–	–
hostname	Comma-separated list of IP addresses or hostnames of the instances in the Elasticsearch cluster	IPs	–	yes
index-name	Name of the index	string	titan	no
cluster-name	Name of the Elasticsearch cluster. If none is defined, the name will be ignored.	string	elasticsearch	yes
local-mode	Whether Titan should run Elasticsearch embedded	boolean	false	no
directory	Directory to store Elasticsearch data in. Only applicable when running Elasticsearch embedded	string	–	yes
config-file	Filename of the Elasticsearch yaml file used to configure this instance. Only applicable when running Elasticsearch embedded	string	–	no
client-only	Whether this node is a client node with no data	boolean	true	no
max-result-set-size	The default maximum result set size for any query if the query does not explicitly specify a limit	integer	100000	yes
sniff	Whether client transport sniffing is enabled. When encountering connection problems, in particular on AWS, disable this option	boolean	true	yes
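The prefixing rule described before the table can be sketched as a small helper that turns bare option names into full configuration keys. This is only an illustration: the class IndexOptionPrefixer and its prefix method are hypothetical and not part of Titan, and the option values are taken from the remote configuration example on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds full Titan keys from bare Elasticsearch option names. */
final class IndexOptionPrefixer {

    /** Returns a copy of options with every key prefixed by "storage.index.<indexName>.". */
    static Map<String, String> prefix(String indexName, Map<String, String> options) {
        String prefix = "storage.index." + indexName + ".";
        // LinkedHashMap keeps the options in the order they were supplied.
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : options.entrySet()) {
            result.put(prefix + entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("backend", "elasticsearch");
        options.put("hostname", "100.100.101.1,100.100.101.2");
        options.put("client-only", "true");
        // Prints one key=value line per option for an index backend named "search".
        prefix("search", options).forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Passing a different index name, for example prefix("fulltext", options), would produce keys under storage.index.fulltext. instead.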

Optimizing Elasticsearch

Write Optimization

For bulk loading or other write-intensive applications, consider increasing Elasticsearch’s refresh interval. Refer to this discussion on how to increase the refresh interval and its impact on write performance. Note that a higher refresh interval means that graph mutations take longer to become visible in the index.
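As a rough sketch of how the refresh interval might be raised on a remote cluster, the example below sends an update to Elasticsearch's index settings endpoint using the index.refresh_interval setting. Several details are assumptions rather than anything stated on this page: the HTTP port 9200 (the Elasticsearch default), the host 100.100.101.1 (borrowed from the remote configuration example), the index name titan (the index-name default from the table above), and the value 30s. The class RefreshIntervalUpdate is hypothetical, and the code requires Java 11 or later for java.net.http.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper: raises index.refresh_interval via the index settings endpoint. */
final class RefreshIntervalUpdate {

    /** Builds the JSON body; interval is assumed to be a simple value such as "30s". */
    static String settingsBody(String interval) {
        return "{\"index\":{\"refresh_interval\":\"" + interval + "\"}}";
    }

    /** Builds a PUT request against http://host:port/indexName/_settings. */
    static HttpRequest buildRequest(String host, int httpPort, String indexName, String interval) {
        URI uri = URI.create("http://" + host + ":" + httpPort + "/" + indexName + "/_settings");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(settingsBody(interval)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed host, port, index name, and interval; replace with your own values.
        HttpRequest request = buildRequest("100.100.101.1", 9200, "titan", "30s");
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

After the bulk load finishes, the same request with a shorter interval (Elasticsearch's default is 1s) restores near-real-time visibility of mutations.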

For additional suggestions on how to increase write performance in Elasticsearch, with detailed instructions, please read this blog post.

Next Steps

Please refer to the Elasticsearch homepage and available documentation for more information on Elasticsearch and how to set up an Elasticsearch cluster.
Check out example graph configurations for complete configurations including the storage backend.