Titan Limitations

There are various limitations and “gotchas” that one should be aware of when using Titan. Some of these limitations are necessary design choices and others are issues that will be rectified as Titan development continues. Finally, the last section provides solutions to common issues.

Design Limitations

Size Limitation

Titan can store up to a quintillion edges (2^60) and half as many vertices. That limitation is imposed by Titan’s id scheme.

DataType Definitions

When declaring the data type of a property key using dataType(Class) Titan will enforce that all properties for that key have the declared type, unless that type is Object.class. This is an equality type check, meaning that sub-classes will not be allowed. For instance, one cannot declare the data type to be Number.class and use Integer or Long. For efficiency reasons, the type needs to match exactly. Hence, use Object.class as the data type for type flexibility. In all other cases, declare the actual data type to benefit from increased performance and type safety.
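
A minimal sketch of the exact-match rule, assuming a Titan 0.4-style makeKey schema builder (older releases use makeType()); the storage directory is a placeholder.

```java
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class DataTypeExample {
    public static void main(String[] args) {
        TitanGraph graph = TitanFactory.open("/tmp/titan"); // placeholder local storage directory

        // Exact type match: only Integer values are accepted for this key.
        graph.makeKey("age").dataType(Integer.class).make();
        // Object.class disables the check and accepts any supported value.
        graph.makeKey("misc").dataType(Object.class).make();

        graph.addVertex(null).setProperty("age", 42);      // OK: Integer
        // graph.addVertex(null).setProperty("age", 42L);  // rejected: Long is not Integer
        graph.addVertex(null).setProperty("misc", "anything goes here");

        graph.commit();
        graph.shutdown();
    }
}
```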

Edge Retrievals are not O(1)

Retrieving an edge by id, e.g. tx.getEdge(edge.getId()), is not a constant time operation. Titan will retrieve an adjacent vertex of the edge to be retrieved and then execute a vertex query to identify the edge. The former is constant time but the latter is potentially linear in the number of edges incident on the vertex with the same edge label.
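
As a sketch (Blueprints-style API; the helper names are hypothetical), prefer a vertex-centric query when the vertex is already at hand rather than repeated lookups by edge id:

```java
import com.thinkaurelius.titan.core.TitanGraph;
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;

public class EdgeLookup {
    // Not O(1): Titan resolves the edge id to an incident vertex and then
    // scans that vertex's edges with the matching label to find the edge.
    static Edge byId(TitanGraph graph, Object edgeId) {
        return graph.getEdge(edgeId);
    }

    // Often cheaper when a vertex is already loaded: a vertex-centric query.
    static Iterable<Edge> outgoing(Vertex v, String label) {
        return v.query().direction(Direction.OUT).labels(label).edges();
    }
}
```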

Temporary Limitations

Key Index Must Be Created Prior to Key Being Used

To index vertices or edges by key, the respective key index must be created before the key is first used in a vertex or edge property. Read more about creating vertex indexes.
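
A minimal sketch using the Blueprints createKeyIndex call; the key name and storage directory are placeholders.

```java
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.tinkerpop.blueprints.Vertex;

public class IndexBeforeUse {
    public static void main(String[] args) {
        TitanGraph graph = TitanFactory.open("/tmp/titan"); // placeholder storage directory

        // Create the index BEFORE the key is used in any vertex property.
        graph.createKeyIndex("name", Vertex.class);

        // Only now start writing properties with that key.
        graph.addVertex(null).setProperty("name", "hercules");

        graph.commit();
        graph.shutdown();
    }
}
```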

Unable to Drop Key Indices

Once an index has been created for a key, it can never be removed.

Types Can Not Be Changed Once Created

This limitation constrains the graph schema: while the schema can be extended, previous declarations cannot be changed.

Batch Loading Speed

Titan provides a batch loading mode that can be enabled through the configuration. However, this batch mode only facilitates faster loading into the storage backend; it does not use storage-backend-specific batch loading techniques that prepare the data in memory for disk storage. As such, batch loading in Titan is currently slower than the batch loading modes provided by single-machine databases. The Bulk Loading documentation lists ways to speed up batch loading in Titan.
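
A hedged configuration sketch enabling batch loading through storage.batch-loading; the backend and hostname values are placeholders.

```java
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.Configuration;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class BatchLoading {
    public static void main(String[] args) {
        Configuration conf = new BaseConfiguration();
        conf.setProperty("storage.backend", "cassandrathrift"); // placeholder backend
        conf.setProperty("storage.hostname", "127.0.0.1");      // placeholder host
        conf.setProperty("storage.batch-loading", "true");      // enable batch loading mode
        TitanGraph graph = TitanFactory.open(conf);

        // ... bulk-insert vertices and edges here ...

        graph.shutdown();
    }
}
```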

Another limitation related to batch loading is the failure to load millions of edges into a single vertex at once or within a short period of time. Such supernode loading can fail for some storage backends. This limitation also applies to dense index entries. For more information, please refer to the corresponding ticket.

Beware

Multiple Titan instances on one machine

Running multiple Titan instances on one machine backed by the same storage backend (distributed or local) requires that each of these instances has a unique configuration for storage.machine-id-appendix. Otherwise, these instances might overwrite each other, leading to data corruption. See Graph Configuration for more information.
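
A sketch of two co-located instances that differ only in storage.machine-id-appendix; the backend, hostname, and appendix values are illustrative.

```java
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.Configuration;

public class CoLocatedInstances {
    // Builds a configuration for one of several Titan instances on the same machine.
    // Each instance must get a distinct appendix value.
    static Configuration configFor(int appendix) {
        Configuration conf = new BaseConfiguration();
        conf.setProperty("storage.backend", "cassandrathrift"); // placeholder backend
        conf.setProperty("storage.hostname", "127.0.0.1");      // placeholder host
        conf.setProperty("storage.machine-id-appendix", appendix);
        return conf;
    }

    public static void main(String[] args) {
        Configuration instanceA = configFor(1);
        Configuration instanceB = configFor(2);
        // TitanFactory.open(instanceA) and TitanFactory.open(instanceB) can now
        // run side by side on this machine without overwriting each other.
    }
}
```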

Accidental type creation

By default, Titan will automatically create property keys and edge labels when a new type is encountered. It is strongly encouraged that users explicitly define types and disable automatic type creation by setting the graph configuration option autotype = none.
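
For illustration, the sketch below disables automatic type creation and declares the types explicitly, assuming a Titan 0.4-style schema API; the backend settings are placeholders.

```java
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.Configuration;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class ExplicitTypes {
    public static void main(String[] args) {
        Configuration conf = new BaseConfiguration();
        conf.setProperty("storage.backend", "berkeleyje");   // placeholder backend
        conf.setProperty("storage.directory", "/tmp/titan"); // placeholder directory
        conf.setProperty("autotype", "none");                // disable automatic type creation
        TitanGraph graph = TitanFactory.open(conf);

        // With autotype=none, every property key and edge label must be declared up front.
        graph.makeKey("name").dataType(String.class).make();
        graph.makeLabel("friend").make();

        graph.commit();
        graph.shutdown();
    }
}
```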

Custom Class Datatype

Titan supports arbitrary objects as attribute values on properties. To use a custom class as data type in Titan, either register a custom serializer or ensure that the class has a no-argument constructor and implements the equals method because Titan will verify that it can successfully de-/serialize objects of that class. Please read Datatype and Attribute Serializer Configuration for more information.
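
A sketch of a custom attribute class (the class name Coordinate is hypothetical) that satisfies the no-argument constructor and equals requirements:

```java
public class Coordinate {
    private double latitude;
    private double longitude;

    // Required: a no-argument constructor so Titan's default serializer can instantiate the class.
    public Coordinate() {
    }

    public Coordinate(double latitude, double longitude) {
        this.latitude = latitude;
        this.longitude = longitude;
    }

    // Required: equals(), which Titan uses to verify that objects survive a de-/serialization round trip.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Coordinate)) return false;
        Coordinate c = (Coordinate) other;
        return latitude == c.latitude && longitude == c.longitude;
    }

    @Override
    public int hashCode() {
        return 31 * Double.valueOf(latitude).hashCode() + Double.valueOf(longitude).hashCode();
    }
}
```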

Transactional Scope for Edges

Edges should not be accessed outside the scope in which they were originally created or retrieved.
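
The sketch below contrasts the unsafe pattern with re-reading inside the new transaction, assuming a Titan 0.4-style commit(); the label and helper names are illustrative.

```java
import com.thinkaurelius.titan.core.TitanGraph;
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;

public class EdgeScope {
    static void demo(TitanGraph graph, Vertex v1, Vertex v2) {
        Edge e = graph.addEdge(null, v1, v2, "friend");
        graph.commit();

        // Unsafe: `e` belongs to the transaction that was just committed.
        // e.getProperty("since");

        // Safer: re-fetch the vertex and navigate to the edge within the new transaction.
        Vertex fresh = graph.getVertex(v1.getId());
        for (Edge friend : fresh.query().direction(Direction.OUT).labels("friend").edges()) {
            // work with `friend` inside this transaction
        }
        graph.commit();
    }
}
```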

Locking Exceptions

When defining unique Titan types with locking enabled (i.e. requesting that Titan ensure uniqueness), locking exceptions of type PermanentLockingException are likely under concurrent modifications to the graph.

Such exceptions are to be expected, since Titan cannot know how to recover from a transactional state in which a previously read value has since been modified by another transaction, as this may invalidate the state of the transaction. In most cases it is sufficient to simply re-run the transaction. If locking exceptions are very frequent, try to analyze and remove the source of congestion.
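
A hedged retry sketch (the helper name and retry policy are illustrative, assuming Titan 0.4-style commit()/rollback()): it simply re-runs the work whenever the commit fails with a TitanException, which is how locking failures such as PermanentLockingException typically surface.

```java
import com.thinkaurelius.titan.core.TitanException;
import com.thinkaurelius.titan.core.TitanGraph;

public class RetryOnLockFailure {
    static void runWithRetry(TitanGraph graph, Runnable work, int attempts) {
        for (int i = 0; i < attempts; i++) {
            try {
                work.run();       // mutate the graph in the current thread-bound transaction
                graph.commit();
                return;
            } catch (TitanException e) {
                graph.rollback(); // discard the failed transaction and try again
            }
        }
        throw new IllegalStateException("Transaction failed after " + attempts + " attempts");
    }
}
```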

Double and Float Data Types

Titan internally represents Double and Float data types as fixed decimal numbers. Doubles are stored with up to 6 decimal digits and floats with up to 3. This representation enables range retrievals in vertex centric queries. However, it significantly limits the precision and range of doubles and floats.
Use FullDouble and FullFloat as data type to get the full precision of floating point numbers. However, note that these data types cannot be used in range-constrained vertex centric queries.
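
A brief sketch of the distinction, assuming the FullDouble attribute class resides in com.thinkaurelius.titan.core.attribute and a Titan 0.4-style makeKey builder:

```java
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.attribute.FullDouble;

public class FloatingPointKeys {
    static void defineKeys(TitanGraph graph) {
        // Fixed-decimal representation: limited precision, but usable in
        // range-constrained vertex centric queries.
        graph.makeKey("score").dataType(Double.class).make();

        // Full IEEE precision, but not usable for range constraints in vertex centric queries.
        graph.makeKey("preciseScore").dataType(FullDouble.class).make();
    }
}
```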

Ghost Vertices

When the same vertex is concurrently removed in one transaction and modified in another, both transactions will successfully commit on eventually consistent storage backends, and the vertex will still exist with only the modified properties or edges. This is referred to as a ghost vertex. It is possible to guard against ghost vertices on eventually consistent backends using key out-uniqueness, but this is prohibitively expensive in most cases. A more scalable approach is to allow ghost vertices temporarily and to clear them out at regular intervals, for instance using Titan tools.

Another option is to detect them at read time using the transaction configuration option checkInternalVertexExistence().
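
A hedged sketch of a read transaction with the existence check enabled, assuming a Titan 0.4-style buildTransaction() builder that exposes checkInternalVertexExistence():

```java
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.TitanTransaction;

public class GhostCheck {
    static void readWithGhostCheck(TitanGraph graph) {
        // Verifies vertex existence on read, filtering out ghost vertices
        // at the cost of additional backend lookups.
        TitanTransaction tx = graph.buildTransaction()
                .checkInternalVertexExistence()
                .start();
        try {
            // ... read-heavy work ...
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        }
    }
}
```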

Snappy 1.4 does not work with Java 1.7

Cassandra 1.2.x makes use of Snappy 1.4. Titan will not be able to connect to Cassandra if the server is running Java 1.7 and Cassandra 1.2.x (with Snappy 1.4). Be sure to remove the Snappy 1.4 jar from the cassandra/lib directory and replace it with a Snappy 1.5 jar (available here).

Debug-level Logging

When the log level is set to debug, Titan produces a lot of logging output, which is useful for understanding how particular queries get compiled, optimized, and executed. However, the output is so large that it will noticeably impact query performance. Hence, use info or above for production systems or benchmarking.

Useful Tips

Titan OutOfMemoryException or excessive Garbage Collection

If you experience memory issues or excessive garbage collection while running Titan it is likely that the caches are configured incorrectly. If the caches are too large, the heap may fill up with cache entries. Try reducing the size of the transaction level cache before tuning the database level cache, in particular if you have many concurrent transactions. Read more about Titan's caching layers.

Removing JAMM Warning Messages

When launching Titan with embedded Cassandra, the following warnings may be displayed:

958 [MutationStage:25] WARN  org.apache.cassandra.db.Memtable  - MemoryMeter uninitialized (jamm not specified as java agent); assuming liveRatio of 10.0.  Usually this means cassandra-env.sh disabled jamm because you are using a buggy JRE; upgrade to the Sun JRE instead

Cassandra uses a Java agent called MemoryMeter which allows it to measure the actual memory use of an object, including JVM overhead. To use JAMM (Java Agent for Memory Measurements), the path to the JAMM jar must be specified in the javaagent parameter when launching the JVM (e.g. -javaagent:path/to/jamm.jar). Rather than modifying titan.sh and adding the javaagent parameter, I prefer to set the JAVA_OPTIONS environment variable with the proper javaagent setting:

export JAVA_OPTIONS=-javaagent:$TITAN_HOME/lib/jamm-0.2.5.jar

Cassandra Connection Problem

By default, Titan uses the Astyanax library to connect to Cassandra clusters. On EC2 and Rackspace, it has been reported that Astyanax was unable to establish a connection to the cluster. In those cases, changing the backend to storage.backend=cassandrathrift solved the problem.
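
A minimal sketch of switching the connector; the cluster address is a placeholder.

```java
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.Configuration;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;

public class ThriftConnection {
    public static void main(String[] args) {
        Configuration conf = new BaseConfiguration();
        conf.setProperty("storage.backend", "cassandrathrift"); // Thrift connector instead of Astyanax
        conf.setProperty("storage.hostname", "10.0.0.1");       // placeholder cluster address
        TitanGraph graph = TitanFactory.open(conf);
        graph.shutdown();
    }
}
```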

ElasticSearch OutOfMemoryException

When numerous clients are connecting to ElasticSearch, an OutOfMemoryException is likely to occur. This is not due to a memory issue, but to the OS not allowing more threads to be spawned by the user running ElasticSearch. To circumvent this issue, increase the number of allowed processes for that user. For example, increase the ulimit -u from the default 1024 to 10024.
