Full Text and String Search - andrew-nguyen/titan GitHub Wiki
When indexing string values, that is, property keys with the String.class data type, one can choose to index them either as text or as character strings.
When the value is indexed as text, the string is tokenized into a bag of words which allows the user to efficiently query for all matches that contain one or multiple words. This is commonly referred to as full-text search.
When the value is indexed as a character string, the string is indexed “as-is” without any further analysis or tokenization. This facilitates queries looking for an exact character sequence match. This is commonly referred to as string search.
By default, strings are indexed as text. To make this indexing option explicit, one can define a mapping when indexing a property key as text.
graph.makeKey("booksummary").dataType(String.class).indexed("search",Vertex.class,Parameter.of(Mapping.MAPPING_PREFIX,Mapping.TEXT)).make()
This is identical to a standard property key index definition, with the sole addition of an extra parameter that specifies the mapping in the index, in this case Mapping.TEXT.
When a string property is indexed as text, the string value is tokenized into a bag of tokens. The exact tokenization depends on the indexing backend and its configuration. Titan’s default tokenization splits the string on non-alphanumeric characters and removes any tokens with fewer than 2 characters. The tokenization used by an indexing backend may differ (e.g. stop words are removed), which can lead to minor differences between how full-text search queries are handled for modifications inside a transaction and how they are handled for committed data in the indexing backend.
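The default tokenization rule described above can be approximated in a few lines of plain Java. This is only an illustrative sketch, not Titan's actual implementation: the class name TokenizerSketch and the method tokenize are hypothetical, and lower-casing the input is an assumption made here to model case-insensitive matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Titan: approximates the default rule
// "split on non-alphanumeric characters, drop tokens shorter than 2 characters".
final class TokenizerSketch {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Assumption: lower-case first so that later matching is case-insensitive.
        String normalized = text.toLowerCase(Locale.ROOT);
        // Split on runs of characters that are neither letters nor digits.
        for (String token : normalized.split("[^\\p{L}\\p{N}]+")) {
            if (token.length() >= 2) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

With this sketch, tokenizing "The Art of C: a primer!" keeps "the", "art", "of" and "primer" and drops the single-character tokens "c" and "a". An indexing backend that also removes stop words would additionally drop "the" and "of", which is the kind of difference the paragraph above warns about.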
When a string property is indexed as text, only full-text search predicates are supported in graph queries by the indexing backend. Full-text search is case-insensitive.
- Text.CONTAINS: is true if (at least) one word inside the text string matches the query string
- Text.CONTAINS_PREFIX: is true if (at least) one word inside the text string begins with the query string
- Text.CONTAINS_REGEX: is true if (at least) one word inside the text string matches the given regular expression
String search predicates (see below) may be used in queries, but those require filtering in memory which can be very costly.
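To make the semantics of the three full-text predicates concrete, the following plain-Java sketch evaluates them against the tokens of a text value. It illustrates the behaviour described in the list above and is not Titan's code: the class FullTextPredicateSketch and its methods are hypothetical, the tokenizer is a simplified stand-in for the default tokenization, and the query is assumed to be a single word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical illustration of full-text predicate semantics; not Titan's API.
final class FullTextPredicateSketch {

    // Simplified stand-in for the default tokenization (assumption).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (t.length() >= 2) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Models Text.CONTAINS: at least one word equals the query word.
    static boolean contains(String text, String word) {
        return tokenize(text).contains(word.toLowerCase(Locale.ROOT));
    }

    // Models Text.CONTAINS_PREFIX: at least one word begins with the query string.
    static boolean containsPrefix(String text, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String t : tokenize(text)) {
            if (t.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    // Models Text.CONTAINS_REGEX: at least one whole word matches the expression.
    static boolean containsRegex(String text, String regex) {
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        for (String t : tokenize(text)) {
            if (p.matcher(t).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

Note that in this sketch every predicate is evaluated per word, so contains("Unicorns live in enchanted forests", "unicorn") is false even though the text contains that character sequence, while containsPrefix with "unicorn" is true.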
To index string properties as character sequences without any analysis or tokenization, specify the mapping as Mapping.STRING:
graph.makeKey("bookname").dataType(String.class).indexed("search",Vertex.class,Parameter.of(Mapping.MAPPING_PREFIX,Mapping.STRING)).make()
When a string mapping is configured, the string value is indexed and can be queried “as-is” – including stop words and non-letter characters. However, in this case the query must match the entire string value. Hence, the string mapping is useful when indexing short character sequences that are considered to be one token.
When a string property is indexed as a string, only the following predicates are supported in graph queries by the indexing backend. String search is case-sensitive.
- Cmp.EQUAL: if the string is identical to the query string
- Cmp.NOT_EQUAL: if the string is different from the query string
- Text.PREFIX: if the string value starts with the given query string
- Text.REGEX: if the string value matches the given regular expression in its entirety
Full-text search predicates may be used in queries, but those require filtering in memory which can be very costly.
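For comparison, the string search predicates can be modelled with ordinary java.lang.String operations, since the value is treated as a single, unanalyzed token. This is a minimal sketch of the semantics listed above, not Titan's implementation; the class StringPredicateSketch and its method names are hypothetical.

```java
// Hypothetical illustration of string search predicate semantics; not Titan's API.
final class StringPredicateSketch {

    // Models Cmp.EQUAL: the whole value is identical to the query (case-sensitive).
    static boolean equal(String value, String query) {
        return value.equals(query);
    }

    // Models Cmp.NOT_EQUAL: the whole value differs from the query.
    static boolean notEqual(String value, String query) {
        return !value.equals(query);
    }

    // Models Text.PREFIX: the value starts with the query string.
    static boolean prefix(String value, String query) {
        return value.startsWith(query);
    }

    // Models Text.REGEX: the expression must match the value in its entirety.
    static boolean regex(String value, String expression) {
        return value.matches(expression);
    }
}
```

In this sketch, equal("Moby-Dick", "moby-dick") is false because the comparison is case-sensitive, and regex("Moby-Dick", "Moby") is false because the expression does not cover the entire value, whereas regex("Moby-Dick", "Moby.*") is true.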