Using Elastic Search

Elasticsearch is a flexible and powerful open source, distributed real-time search and analytics engine for the cloud. Elasticsearch allows you to start small, but will grow with your business. It is built to scale horizontally out of the box. As you need more capacity, just add more nodes, and let the cluster reorganize itself to take advantage of the extra hardware. Elasticsearch clusters are resilient – they will detect and remove failed nodes, and reorganize themselves to ensure that your data is safe and accessible. — Elasticsearch Homepage

Titan supports Elasticsearch as an embedded or remote index backend. In embedded mode, Elasticsearch runs in the same JVM as Titan and stores data on the local machine. In remote mode, Titan connects to a running Elasticsearch cluster as a client. If Titan is not running in embedded mode, make sure the Elasticsearch cluster is running and accessible.

Elasticsearch Embedded Configuration

For single-machine deployments, Elasticsearch can run embedded with Titan. In other words, Titan starts Elasticsearch internally and connects to it within the same JVM.

To run Elasticsearch embedded, add the following options to the graph configuration file, where /tmp/searchindex is the directory in which Elasticsearch stores the index data:

storage.index.search.backend=elasticsearch
storage.index.search.directory=/tmp/searchindex
storage.index.search.client-only=false
storage.index.search.local-mode=true

Note that Elasticsearch will not be accessible from outside this particular Titan instance, i.e., remote connections will not be possible. It may also be advisable to run Elasticsearch in a separate JVM, even for single-machine deployments, to achieve more predictable GC behavior.

In the above configuration, the index backend is named search. Replace search with a different name to change the name of the index backend.
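
As a minimal sketch of such a rename, the embedded configuration above could be written under a different backend name. The name textsearch and the directory /tmp/textsearchindex are hypothetical values chosen only for illustration:

```properties
# Hypothetical backend name "textsearch" takes the place of "search" in every key
storage.index.textsearch.backend=elasticsearch
# Hypothetical data directory for the embedded Elasticsearch instance
storage.index.textsearch.directory=/tmp/textsearchindex
storage.index.textsearch.client-only=false
storage.index.textsearch.local-mode=true
```

The chosen name has to appear in every key that belongs to this backend, since it is part of the option prefix.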

Elasticsearch Remote Configuration

Titan can connect to an external Elasticsearch cluster running remotely on separate machines or locally on the same machine.

To connect Titan to an external Elasticsearch cluster, add the following options to the graph configuration file, where hostname lists the IP addresses of the instances in the Elasticsearch cluster:

storage.index.search.backend=elasticsearch
storage.index.search.hostname=100.100.101.1,100.100.101.2
storage.index.search.client-only=true

Make sure that the Elasticsearch cluster is running before starting a Titan instance that connects to it. If Titan and Elasticsearch run on different machines, also ensure that the machine running Titan can reach the Elasticsearch instances over the network. This might require setting additional configuration options, which are summarized below.

In the above configuration, the index backend is named search. Replace search with a different name to change the name of the index backend.
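
For illustration, a remote configuration that also sets two of the optional settings from the Configuration Options list might look like the following sketch. The cluster name titan-es is a made-up value, and whether sniffing needs to be disabled depends on the network setup:

```properties
storage.index.search.backend=elasticsearch
storage.index.search.hostname=100.100.101.1,100.100.101.2
storage.index.search.client-only=true
# Assumed cluster name; it should match the name the Elasticsearch cluster was started with
storage.index.search.cluster-name=titan-es
# Turn off client transport sniffing when connection problems occur, e.g. on AWS
storage.index.search.sniff=false
```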

Feature Support

  • Full-Text: Supports all Text predicates to search for text properties that match a given word, prefix or regular expression.
  • Geo: Supports the Geo.WITHIN condition to search for points that fall within a given circle. Only supports points for indexing and circles for querying.
  • Numeric Range: Supports all numeric comparisons in Compare.

Configuration Options

This is the full list of configuration options for Elasticsearch. Note that each option must be prefixed with storage.index.[INDEX-NAME]. (including the trailing dot), where [INDEX-NAME] is the name of the index backend. For instance, if the index backend is named search, the hostname option is written as storage.index.search.hostname.

| Option | Description | Value | Default | Modifiable |
|---|---|---|---|---|
| backend | Index backend implementation name | elasticsearch | | |
| hostname | Comma-separated list of IP addresses or hostnames of the instances in the Elasticsearch cluster | IPs | | yes |
| index-name | Name of the index | string | titan | no |
| cluster-name | Name of the Elasticsearch cluster. If none is defined, the name will be ignored. | string | elasticsearch | yes |
| local-mode | Whether Titan should run Elasticsearch embedded | boolean | false | no |
| directory | Directory to store Elasticsearch data in. Only applicable when running Elasticsearch embedded | string | | yes |
| config-file | Filename of the Elasticsearch yaml file used to configure this instance. Only applicable when running Elasticsearch embedded | boolean | false | no |
| client-only | Whether this node is a client node with no data | boolean | true | no |
| max-result-set-size | The default maximum result set size for any query if the query does not explicitly specify a limit | integer | 100000 | yes |
| sniff | Whether client transport sniffing is enabled. When encountering connection problems, in particular on AWS, disable this option | boolean | true | yes |
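
To make the prefixing rule concrete, the sketch below sets two of the listed options for an index backend named search. The value 50000 is an arbitrary example, not a recommendation:

```properties
# index-name: shown here with its documented default, titan
storage.index.search.index-name=titan
# max-result-set-size: applies to queries that do not specify a limit (default 100000)
storage.index.search.max-result-set-size=50000
```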

Optimizing Elasticsearch

Write Optimization

For bulk loading or other write-intensive applications, consider increasing Elasticsearch’s refresh interval. Refer to this discussion on how to increase the refresh interval and its impact on write performance. Note that a higher refresh interval means that graph mutations take longer to become available in the index.

For additional suggestions on how to increase write performance in Elasticsearch, with detailed instructions, please read this blog post.

Next Steps

  • Please refer to the Elasticsearch homepage and available documentation for more information on Elasticsearch and how to set up an Elasticsearch cluster.
  • Check out example graph configurations for complete configurations including the storage backend.