3. Mapping and Analysis - madhusudhankonda/elasticsearch-next-steps GitHub Wiki

Overview

There are over two dozen data types in Elasticsearch. Some of them are straightforward and intuitive to work with - for example, fields with primitive data types such as text, boolean, long, and keyword. But a few other data types, such as object, nested, join, and search_as_you_type, require special attention.

Please refer to Mapping basics in our First Steps training course here: Mapping Basics.

Let's work with the object data type in Elasticsearch.

Object data type

Often we find data organised in a hierarchical manner - a simple email consists of a few fields like subject and to, as well as an attachments field, which in turn has a few more properties such as the attachment's file name, its size, and so on. If we model these attributes in JSON, it looks like this:

// This is the main email object
{
  "subject":"Object type",
  "to":"[email protected]",
  "attachments":{ // This is the inner object
    "filename":"Object type explanations.txt",
    "filesize_kb":200
  }
}

JSON allows us to create such hierarchical objects - an object wrapped inside another object, which is encapsulated in yet another object, and so on.

Elasticsearch has a special data type to represent a hierarchy of objects - aptly named object type.

We already know the data types for the top-level subject and to fields (text and keyword respectively, if you are wondering). As attachments is an object itself (it consists of a few more data fields inside it), its data type is the object type. The two properties filename and filesize_kb in this attachments object are modelled as text and long fields respectively.

We create an index by issuing the PUT emails command as shown below:

PUT emails
{
  "mappings": {
    "properties": {
      "subject":{
        "type": "text"
      },
      "to":{
        "type": "keyword"
      },
      "attachments":{
        "type": "object", // redundant declaration - see explanation below
        "properties": {
          "filename":{
            "type":"text"
          },
          "filesize_kb":{
            "type":"long"
          }
        }
      }
    }
  }
}

The high-level properties (subject and to fields) need no explanation. The third property (attachments) is the one that deserves our attention: its type is declared as object (it represents a JSON object) as it encapsulates the two other fields.

We can check the schema by invoking the GET emails/_mapping command once the request has completed:

//Fetch the mapping for emails index
GET emails/_mapping
{
  "emails" : {
    "mappings" : {
      "properties" : {
        "attachments" : { // Notice the type wasn’t mentioned here - it is object by default
          "properties" : {
            "filename" : {
              "type" : "text"
            },
            "filesize_kb" : {
              "type" : "long"
            }
          }
        },
        "subject" : {
          "type" : "text"
        },
        "to" : {
          "type" : "keyword"
        }
      }
    }
  }
}

Did you notice the type of the attachments field is missing? Setting the type to object for an inner object is not mandatory - it's a redundant declaration. You can omit it altogether as long as the inner object conforms to JSON formatting rules. In fact, the retrieved mapping shows that no type was recorded for the inner attachments object; object is simply the default.
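
For reference, here is the sample email from the top of this section indexed into this emails index (the document ID 1 is just an arbitrary choice):

PUT emails/_doc/1
{
  "subject":"Object type",
  "to":"[email protected]",
  "attachments":{
    "filename":"Object type explanations.txt",
    "filesize_kb":200
  }
}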

Object Data Types Ignore Relationships

In the email above, we declared just one attachment. In reality, emails can have multiple attachments - let's index a document with multiple attachments:

PUT myemails/_doc/1
{
  "subject":"Multi attachments",
  "to":"[email protected]",
  "attachments":[
    { 
      "filename":"attachment1.txt",
      "filesize_kb":400
    },
    { 
      "filename":"attachment2.txt",
      "filesize_kb":200
    }
  ]
}

That's good - we've indexed an email document with two attachments: attachment1.txt at 400kb and attachment2.txt at 200kb. Keep this filename-size pairing in mind, as we will come back to it in a minute.

Let's run a query with the following criteria: given a filename and size (attachment2.txt and 400), find matching documents. This should return no results - attachment2.txt's size is 200kb, not 400kb - yet you'll see one document returned!

GET myemails/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
          "attachments.filename": "attachment2.txt"
        }},
        {
          "match": {
          "attachments.filesize_kb": 400
        }}
      ]
    }
  }
}

This unfortunately returns the existing document. The result should be empty, because the combination of attachment2.txt with a 400kb file size doesn't exist.

This is where the object data type breaks down - it can’t honour the relationships between these inner objects. The inner objects are NOT modeled and stored as individual objects.

The reason for this is the way the objects are treated internally when stored - they are flattened as shown below:

{
    // Objects are flattened internally
    // ...
  "attachments.filename" :["attachment1.txt","attachment2.txt"], // a list of filenames
  "attachments.filesize_kb" :[200, 400] // list of file sizes
}

To solve this type of problem, we have another data type called the nested data type. But before we start working with nested data types, let's pick up an advanced query that searches multiple leaf queries in one go:

Advanced bool query

GET myemails/_search
{
  "query": {
    "bool": {
      "must": [ 
        {"term": { "attachments.filename.keyword": "attachment1.txt"}},
        {"term": { "attachments.filesize_kb.keyword": "400" }}
      ]
    }
  }
}

Now it's time to find out about the nested data type.

Nested Data Type

The nested type is essentially an array of objects, where each object is treated as a separate, searchable unit. Going with the same example of emails and attachments, this time let's define the attachments field as the nested data type rather than letting Elasticsearch derive it as an object type.

First, let’s create the mapping with attachments as the nested data type:

PUT myemails2 
{
  "mappings": {
    "properties": {
      "attachments":{
        "type": "nested"
      }
    }
  }
}

Then index the same document from the earlier section (the one with two attachments) into the myemails2 index.
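
Spelled out, that is the same two-attachment document as before, now sent to myemails2:

PUT myemails2/_doc/1
{
  "subject":"Multi attachments",
  "to":"[email protected]",
  "attachments":[
    {
      "filename":"attachment1.txt",
      "filesize_kb":400
    },
    {
      "filename":"attachment2.txt",
      "filesize_kb":200
    }
  ]
}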

We now run a nested query with the same criteria (attachment2.txt and 400kb) and check whether any results are returned - you'd expect none:

GET myemails2/_search
{
  "query": {
    "nested": {
      "path": "attachments", // This path param is important
      "query": {
        "bool": {
          "must": [
            {"match": { "attachments.filename": "attachments2.txt"}},
            {"match": {"attachments.filesize_kb": 400 }}
          ]
        }
      }
    }
  }
}

The nested query has an important parameter - the path - pointing to the attachments field. The bool query wraps our criteria in a must clause. The results (hits), as expected, are empty:

{
  //...
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Nested data types are pretty good at honouring associations and relationships, so if you ever need an array of objects where each object must be treated (and searched) as an individual object, nested is your friend.
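
To see the relationship being honoured, flip the criteria to a combination that does exist (attachment1.txt and 400kb). Here's a sketch of the same nested query with those values - this time it should return our document:

GET myemails2/_search
{
  "query": {
    "nested": {
      "path": "attachments",
      "query": {
        "bool": {
          "must": [
            {"match": { "attachments.filename": "attachment1.txt"}},
            {"match": { "attachments.filesize_kb": 400 }}
          ]
        }
      }
    }
  }
}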

Flattened data type

The flattened data type holds information in the form of one or more subfields, with each subfield's value indexed as a keyword. None of the values is treated as a text field, and so they do not undergo the text analysis process.

Let's create an index called consultations, representing a doctor's consultation notes:

PUT consultations
{
  "mappings": {
    "properties": {
      "patient_name":{
        "type": "text"
      },
      "doctor_notes":{
        "type": "flattened"
      }
    }
  }
}

Any field (and its subfields) that’s declared as flattened will not get analyzed. Let's index a sample document:

PUT consultations/_doc/1
{
  "patient_name":"Joe Smith",
  "doctor_notes":{
    "temperature":39.9,
    "symptoms":["heacches","fever","bodyaches"],
    "history":"none",
    "medication":["Ibuprofen","Antibiotics","Paracetamol"]
  }
}

Search through the doctor's notes:

GET consultations/_search
{
  "query": {
    "match": {
      "doctor_notes": "Paracetamol"
    }
  }
}

geo_point Data Type

The geo_point data type is a specialized type for capturing the location of a place, represented as latitude and longitude. We can use it to pinpoint an address such as a restaurant, a school, a golf course, and so on.

In the following snippet, let's create an index for restaurants with two properties: one is the name of the restaurant and the other is its address - the address field is declared as a geo_point data type:

PUT restaurants
{
  "mappings": {
    "properties": {
      "name":{
        "type": "text"
      },
      "address":{
        "type": "geo_point"
      }
    }
  }
}

Now, index a restaurant Elasticky Fingers in London:

PUT restaurants/_doc/1
{
  "name":"Elasticky Fingers",
  "address":{
    "lat":51.5,
    "lon":-0.12
  }
}

The latitude and longitude coordinates of the place, represented as lat and lon, were provided as an object in the above request. Now that we have a document with the address information, we can search for the restaurant using a geo_bounding_box filter query:

GET restaurants/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "geo_bounding_box": {
          "address": {
            "top_left": {
              "lat": 55.73,
              "lon": -0.15
            },
            "bottom_right": {
              "lat": 40.717,
              "lon": -0.09
            }
          }
        }}
      ]
    }
  }
}

We can provide the location information in various formats, not just the object of lat and lon you've seen above. You can supply the same location as an array or a string too - but note the ordering: the array format takes [lon, lat] (for example "address":[-0.12, 51.5]), while the string format takes "lat,lon" (for example "address":"51.5,-0.12").
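
As an illustration, here is the same restaurant indexed again using the two alternative formats (document IDs 2 and 3 are arbitrary):

// Array format: [lon, lat]
PUT restaurants/_doc/2
{
  "name":"Elasticky Fingers",
  "address": [-0.12, 51.5]
}

// String format: "lat,lon"
PUT restaurants/_doc/3
{
  "name":"Elasticky Fingers",
  "address": "51.5,-0.12"
}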

Multi Types

Each field in a document is associated with a data type. Elasticsearch is flexible enough to let us define a field with multiple data types too. For example, the subject field of our emails index can be a text, a keyword, or a completion type if you choose to.

See the example query that creates a single field with multiple data types (if you created the emails index earlier, delete it first with DELETE emails, or use a different index name):

PUT emails
{
  "mappings": {
    "properties": {
      "subject":{
        "type": "text",
        "fields": {
          "my_keyword":{
            "type":"keyword"
          },
          "my_completion":{
            "type":"completion"
          }
        }
      }
    }
  }
}

The subject field now has three types associated with it. If you wish to access them, you have to use the format subject.my_keyword for the keyword type field or subject.my_completion for the completion type.
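
As a quick usage sketch (assuming a document whose subject is exactly "Multi attachments" has been indexed into this index), the keyword sub-field can be queried with a term query:

GET emails/_search
{
  "query": {
    "term": {
      "subject.my_keyword": "Multi attachments"
    }
  }
}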

Aliases

Creating an alias:

PUT cars_for_aliases
{
  "aliases": {
    "my_new_cars_alias": {}
  }
}

Or

PUT cars_for_aliases/_alias/my_cars_alias
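
Once an alias exists, it can be used anywhere an index name is expected - for example, searching through the alias created above:

GET my_cars_alias/_search
{
  "query": {
    "match_all": {}
  }
}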

Creating an alias for migrating data:

  • Create an alias called vintage_cars_alias to refer to the current index vintage_cars.
  • Because the new properties are incompatible with the existing index, create a new index, say vintage_cars_new, with the new settings.
  • Copy (i.e., reindex) the data from the old index (vintage_cars) to the new index (vintage_cars_new) - see the sketch after this list.
  • Re-point the existing alias (vintage_cars_alias), which referred to the old index, to the new index. Thus, vintage_cars_alias now points to vintage_cars_new.
  • All the queries that were executed against vintage_cars_alias are now carried out on the new index.
  • Get rid of the old index (vintage_cars) once the reindexing and the alias switch have been proven to work.
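
A minimal sketch of the reindex step, assuming the index names used above:

POST _reindex
{
  "source": { "index": "vintage_cars" },
  "dest": { "index": "vintage_cars_new" }
}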

Performing multiple aliasing operations

POST _aliases
{
  "actions": [
    {
      "remove": {
        "index": "vintage_cars",
        "alias": "vintage_cars_alias"
      }
    },
    {
      "add": { 
        "index": "vintage_cars_new",
        "alias": "vintage_cars_alias"
      }
    }
  ]  
}

Index Templates

Index templates can be classified into two categories (since version 7.8):

  • composable index templates (or simply index templates) and
  • component templates.

Composable index templates are composed of zero or more component templates. An index template can exist on its own too, without being associated with any component template.

Index template

Let's create a solo index template (not composed of any component templates):

POST _index_template/cars_template
{
  "index_patterns": ["*cars*"],
  "priority": 1, 
  "template": {
    "mappings": {
      "properties":{
        "created_at":{
          "type":"date"
        },
        "created_by":{
          "type":"text"
        }
      }
    }
  }
}
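
To see the template in action, create any index whose name matches the *cars* pattern and fetch its mapping - it should have picked up the created_at and created_by fields (silver_cars is just an illustrative name for this check):

PUT silver_cars

GET silver_cars/_mapping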

Component template

POST _component_template/dev_settings_component_template
{
  "template":{
    "settings":{
      "number_of_shards":3,
      "number_of_replicas":3
    }
  }
}

and perhaps a component template for mappings:

POST _component_template/dev_mapping_component_template
{
  "template": {
    "mappings": {
      "properties": {
        "created_by": {
          "type": "text"
        }
      }
    }
  }
}

Creating a composable index template

POST _index_template/composed_cars_template
{
  "index_patterns": ["*cars*"], 
  "priority": 200, 
  "composed_of": ["dev_settings_component_template",
                  "dev_mapping_component_template"]
}
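
A quick way to check what got registered is to fetch the template back (a sketch using the name above):

GET _index_template/composed_cars_template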

Index rollover

Create an index with an appropriate suffix: PUT cars_2021-000001

Now, create an alias:

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "cars_2021-000001",
        "alias": "latest_cars_a",
        "is_write_index": true 
      }
    }
  ]
}

Now, rollover the index manually: POST latest_cars_a/_rollover
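
Rollover can also be driven by conditions, so the alias only rolls over when at least one threshold is met (the thresholds below are just illustrative values):

POST latest_cars_a/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 10000
  }
}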

Index lifecycle management

Refer to my book "Elasticsearch in Action 2nd edition" for this advanced concept.

Text Analysis

Elasticsearch does a lot of hard work behind the scenes on incoming textual data, enabling efficient search and retrieval. This groundwork is carried out under the name of text analysis, by employing so-called analyzers. A handful of analyzers are available out of the box, such as the standard analyzer (the default), the simple analyzer, and a few others including the fingerprint and pattern analyzers. There are also language-specific analyzers - English, German, Spanish, French, Hindi, and so on - and we can develop our own custom ones too.

All unstructured data (text fields) undergoes analysis during the indexing process. Each text field can be associated with a specific analyzer; if we don't associate one, the standard analyzer is used by default.

Just as indexing operations are analyzed, search queries against text fields are analyzed in the same manner.

Analyzers are composed of a few components, such as character filters, tokenizers, and token filters.

Anatomy of an Analyzer Module

Text fields run through an analyzer module during the indexing operation. The analyzer is a module consisting of three low-level building blocks: character filters, tokenizers, and token filters. During the analysis phase, sentences are split into words (called tokens) using a process called tokenization. The tokens then undergo normalization, during which words are reduced to their root form - for example, "fighter" and "fought" may both be reduced to the root word "fight".

Character Filters

A character filter's sole purpose is to remove unwanted characters from a field. For example, say we have a field called message which consists of emojis, smileys, and some exclamation marks, as shown below:

message: “Welcome to cruel world, ha! ha! ha! -> :)”

The character filter’s job is to strip off all such non-textual content, creating simple English sentences - no smileys, no emojis, and no punctuation. The above sentence becomes the following once a Character Filter has been applied:

message: “Welcome to cruel world, ha ha ha”

Did you notice the disappearance of the emojis, smileys, and punctuation? That is the work of the character filter.

There are a handful of character filters provided by Elasticsearch out of the box, such as the mapping filter, the pattern_replace filter, and the html_strip filter.

A character filter is not a mandatory component of the analyzer module.
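
A quick way to see a character filter in isolation is the _analyze API - here's a sketch using the built-in html_strip character filter with the keyword tokenizer (which keeps the text as a single token):

GET _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>Welcome to cruel world</b>"
}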

Tokenizers

The job of a tokenizer is to create “tokens”. Tokens are the individual words of a sentence.

Take the same sentence from the above message: Welcome to cruel world, ha! ha! ha! -> :)

This sentence is broken down into tokens like: [welcome, to, cruel, world, ha, ha, ha]. In this instance, the sentence was split on whitespace (the lowercasing you see here is actually applied later by a token filter), but hyphens and other delimiters can be used too.

The tokenizer is a mandatory component of the Analyzer module.
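
You can try a tokenizer on its own with the _analyze API too - a sketch using the standard tokenizer (note that a tokenizer by itself does not lowercase anything; that's a token filter's job):

GET _analyze
{
  "tokenizer": "standard",
  "text": "Welcome to cruel world, ha! ha! ha!"
}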

Tokenizer Examples

  • Lowercase Tokenizer
  • Whitespace Tokenizer
  • Standard Tokenizer
  • UAX URL Email Tokenizer, and others.

There are also partial-word tokenizers, such as the N-Gram and Edge N-Gram tokenizers, which break words into partial words (n-grams).

Token filters

Token filters modify the tokens as per requirements - for example:

  • Lowercase token filter: converts the tokens to lowercase,
  • Trim token filter: removes whitespace before and after the words,
  • Stop token filter: removes common (stop) words like the, a, it, and so on.

There are at least a couple of dozen such token filters provided by Elasticsearch out of the box.

Token filters are not mandatory when composing an analyzer module.
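
Token filters can also be tried in isolation with _analyze - a sketch chaining the lowercase and stop filters on top of the standard tokenizer:

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "Welcome to the cruel world"
}

The output should be [welcome, cruel, world], with the stop words to and the dropped.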

Types of Analyzers

Elasticsearch provides a set of pre-defined analyzers out of the box:

  • Standard Analyzer - this is the default
  • Simple Analyzer
  • Stop Analyzer
  • Whitespace Analyzer
  • Keyword Analyzer

The Standard Analyzer is the default analyzer used for text fields if none is specified. It consists of the following components (remember, an analyzer consists of zero or more character filters, exactly one tokenizer, and zero or more token filters):

  • No char filter
  • A standard tokenizer
  • Lowercase token filter and Stop filter (however, the stop token filter is disabled by default)

Setting Analyzers

We set the required analyzer for each field during index creation. In the code snippet below, we override the default (standard) analyzer with the simple analyzer:

PUT cars
{
  "mappings": {
    "properties": {
      "make":{
        "type": "text",
        "analyzer": "simple"
      }
    }
  }
}

The make field will go through the simple analyzer before being indexed.

Testing the Analyzers

We can use the _analyze API to test how analyzers are applied to input text. This helps us choose appropriate analyzers for our business requirements.

For example, if we wish to find out how the text Searching with Elaticsearch is simple !!!:) is analyzed using the simple analyzer, the following code comes into action:

POST _analyze
{
  "text": ["Searching with Elaticsearch is simple !!!:)"],
  "analyzer": "simple"
}

This will produce a list of tokens as shown here (the response was massaged for brevity):

{
  "tokens" : [
    {
      "token" : "searching",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {"token" : "with","..."},
    {"token" : "elaticsearch", "..."},
    {"token" : "is", "..."},
    {"token" : "simple","..."}
  ]
}

The words were converted to lowercase and the punctuation was removed too.

Custom Analyzers

If the off-the-shelf analyzers don't cut it for you, you can create your own custom analyzers. These custom analyzers can be a mix-and-match of existing components or components you build yourself.

PUT my_index_with_custom_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyser":{
          "type":"custom", 
          "tokenizer":"standard",
          "filter":["stop"] // token filters go under "filter" as an array
        }
      }
    }
  }
}

Note the type was set as custom.

In the above snippet, we are creating our own custom analyzer my_custom_analyser with the type set to custom. We can then plug in existing tokenizers, character filters, and token filters as per our requirements.

Custom analyser

Using the `path_hierarchy` tokenizer with an uppercase token filter:

GET _analyze
{
  "tokenizer": "path_hierarchy",
  "filter": ["uppercase"],
  "text": "/Volumes/FILES/Dev"
}
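
The path_hierarchy tokenizer emits one token per level of the path; with the uppercase filter applied, the tokens come out roughly as [/VOLUMES, /VOLUMES/FILES, /VOLUMES/FILES/DEV].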

Standard analyser

The standard analyzer consists of a standard tokenizer and two token filters: lowercase and stop filters. The stop filter is disabled by default. There is no character filter defined on the standard analyzer.

GET _analyze
{
  "analyzer": "standard",
  "text": "Hot cup of ☕ and a 🍿is a Weird Combo :(!!"
}

Standard analyser with stop words

PUT my_index_with_stopwords
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_with_stopwords":{ 
          "type":"standard", 
          "stopwords":"_english_"
        }
      }
    }
  }
}

Test the above analyser:

POST my_index_with_stopwords/_analyze
{
  "text": ["Hot cup of ☕ and a 🍿is a Weird Combo :(!!"],
  "analyzer": "standard_with_stopwords"
}
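
With the English stop words enabled, the common words (of, and, a, is) are dropped and the emojis and punctuation are removed by the standard tokenizer, leaving roughly [hot, cup, weird, combo].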

Custom analyzer

PUT index_with_custom_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer":{
          "type":"custom", 
          "char_filter":["html_strip"],
          "tokenizer":"standard",
          "filter":["uppercase"]
        }
      }
    }
  }
}

Test it:

POST index_with_custom_analyzer/_analyze
{
  "text": "<H1>HELLO, WoRLD</H1>",
  "analyzer": "my_custom_analyzer"  
}
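
The html_strip character filter removes the <H1> tags, the standard tokenizer splits the remainder into two tokens, and the uppercase token filter capitalises them, giving roughly [HELLO, WORLD].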

Parsing greek letters using custom analyzer

PUT index_with_parse_greek_letters_custom_analyzer
{
  "settings": {
    "analysis": {
      "analyzer": {
        "greek_letter_custom_analyzer":{ 
          "type":"custom",
          "char_filter":["greek_symbol_mapper"], 
          "tokenizer":"standard", 
          "filter":["lowercase", "greek_keep_words"] 
        }
      },
      "char_filter": { 
        "greek_symbol_mapper":{ 
          "type":"mapping",
          "mappings":[ 
            "α => alpha",
            "β => Beta",
            "γ => Gamma"
          ]
        }
      },
      "filter": {
        "greek_keep_words":{ 
          "type":"keep",
          "keep_words":["alpha", "beta", "gamma"]
        }
      }
    }
  }
}

Test it:

POST index_with_parse_greek_letters_custom_analyzer/_analyze
{
  "text": "α and β are roots of a quadratic equation. γ isn't",
  "analyzer": "greek_letter_custom_analyzer"
}
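
The mapping character filter first replaces the Greek symbols with their names, the lowercase filter normalises them, and the keep filter discards everything else, so the output should be just [alpha, beta, gamma].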