Ch 2: Getting Started - madhusudhankonda/elasticsearch-in-action Wiki

Getting Started

Book JSON Document

Book data represented as a JSON document:

{
  "title":"Effective Java",
  "author":"Joshua Bloch",
  "release_date":"2001-06-01",
  "amazon_rating":4.7,
  "best_seller":true,
  "prices": {
    "usd":9.95,
    "gbp":7.95,
    "eur":8.95
  }
}

Indexing a document using the cURL command:

curl -XPUT "http://localhost:9200/books/_doc/1" -H 'Content-Type: application/json' -d'
{
  "title":"Effective Java",  
  "author":"Joshua Bloch",  
  "release_date":"2001-06-01",  
  "amazon_rating":4.7,  
  "best_seller":true,  
  "prices": {    
    "usd":9.95,    
    "gbp":7.95,    
    "eur":8.95  
  }
}'
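
The same request can also be issued from a script. The following is a minimal sketch using only Python's standard library; it assumes an unsecured Elasticsearch node at http://localhost:9200 (as in the cURL example), and the helper name build_index_request is purely illustrative.

```python
import json
import urllib.request

# Assumption: a local, unsecured Elasticsearch node, as in the cURL example.
ES_URL = "http://localhost:9200"

book = {
    "title": "Effective Java",
    "author": "Joshua Bloch",
    "release_date": "2001-06-01",
    "amazon_rating": 4.7,
    "best_seller": True,
    "prices": {"usd": 9.95, "gbp": 7.95, "eur": 8.95},
}


def build_index_request(index, doc_id, document):
    """Build (but do not send) a PUT <index>/_doc/<id> request."""
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request("books", 1, book)

# Sending needs a running cluster, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["result"])
```

Building the request separately from sending it makes it easy to inspect the URL, method, and body before anything reaches the cluster.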

Indexing the book document using the Kibana Dev Tools console:

PUT books/_doc/1
{
  "title":"Effective Java",
  "author":"Joshua Bloch",
  "release_date":"2001-06-01",
  "amazon_rating":4.7,
  "best_seller":true,
  "prices": {
    "usd":9.95,
    "gbp":7.95,
    "eur":8.95
  }
}

The response to the above request would be:

{
  "_index" : "books",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

Index More Documents

Index a document with ID 2:

PUT books/_doc/2
{
  "title":"Core Java Volume I - Fundamentals",
  "author":"Cay S. Horstmann",
  "release_date":"2018-08-27",
  "amazon_rating":4.8,
  "best_seller":true,
  "prices": {
    "usd":19.95,
    "gbp":17.95,
    "eur":18.95
  }
}

Index another (third) document with ID 3:

PUT books/_doc/3
{
  "title":"Java: A Beginner’s Guide",
  "author":"Herbert Schildt",
  "release_date":"2018-11-20",
  "amazon_rating":4.2,
  "best_seller":true,
  "prices": {
    "usd":19.99,
    "gbp":19.99,
    "eur":19.99
  }
}

Counting all documents

Use the _count API to retrieve the number of documents in the books index:

GET books/_count

This will return the number of books in the books index:

{
  "count" : 3,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

Fetching the document by ID

Given an ID, we can fetch the document by issuing a GET command:

GET books/_doc/1

This should return the document that we indexed earlier (the _version and _seq_no values depend on how many writes the index has seen so far):

{
  "_index" : "books",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,
  "_seq_no" : 3,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "title" : "Effective Java",
    "author" : "Joshua Bloch",
    "release_date" : "2001-06-01",
    "amazon_rating" : 4.7,
    "best_seller" : true,
    "prices" : {
      "usd" : 9.95,
      "gbp" : 7.95,
      "eur" : 8.95
    }
  }
}

To fetch only the source and ignore the metadata, issue the command: GET books/_source/1

Fetching multiple documents

To fetch multiple documents for a given set of IDs, we use an ids query on the _search endpoint:

GET books/_search
{
  "query": {
    "ids": {
      "values": [1,2,3]
    }
  }
}

This will return all three documents if available.

Retrieving all documents

We can fetch all documents in one go from the books index using a generic _search:

GET books/_search

This will return all the documents available in the books index. This is equivalent to a match_all search query.

Search a Book Written By a Specific Author

Develop a match query to fetch book(s) written by Joshua:

GET books/_search
{
  "query": {
    "match": {
      "author": "Joshua"
    }
  }
}

It would return one book written by Joshua:

...
"hits" : [
      {
        "_index" : "books",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0417082,
        "_source" : {
          "title" : "Effective Java",
          "author" : "Joshua Bloch",
          ...
        }
      }
    ]

Search with an Exact Title

GET books/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Effective java",
        "operator": "and"
      }
    }
  }
}

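Because the operator is set to and, both words must appear in the title, so you'd expect a single book, titled "Effective Java", to be returned. Strictly speaking this is not an exact match: any title containing both words would qualify.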

Indexing Multiple Documents using _bulk API

Execute the following script in the Kibana console (the data is also available in code/datasets/books-kibana-dataset.txt):

POST _bulk
{"index":{"_index":"books","_id":"1"}}
{"title": "Core Java Volume I – Fundamentals","author": "Cay S. Horstmann","edition": 11, "synopsis": "Java reference book that offers a detailed explanation of various features of Core Java, including exception handling, interfaces, and lambda expressions. Significant highlights of the book include simple language, conciseness, and detailed examples.","amazon_rating": 4.6,"release_date": "2018-08-27","tags": ["Programming Languages, Java Programming"]}
{"index":{"_index":"books","_id":"2"}}
{"title": "Effective Java","author": "Joshua Bloch", "edition": 3,"synopsis": "A must-have book for every Java programmer and Java aspirant, Effective Java makes up for an excellent complementary read with other Java books or learning material. The book offers 78 best practices to follow for making the code better.", "amazon_rating": 4.7, "release_date": "2017-12-27", "tags": ["Object Oriented Software Design"]}
{"index":{"_index":"books","_id":"3"}}
{"title": "Java: A Beginner’s Guide", "author": "Herbert Schildt","edition": 8,"synopsis": "One of the most comprehensive books for learning Java. The book offers several hands-on exercises as well as a quiz section at the end of every chapter to let the readers self-evaluate their learning.","amazon_rating": 4.2,"release_date": "2018-11-20","tags": ["Software Design & Engineering", "Internet & Web"]}
{"index":{"_index":"books","_id":"4"}}
{"title": "Java - The Complete Reference","author": "Herbert Schildt","edition": 11,"synopsis": "Convenient Java reference book examining essential portions of the Java API library, Java. The book is full of discussions and apt examples to better Java learning.","amazon_rating": 4.4,"release_date": "2019-03-19","tags": ["Software Design & Engineering", "Internet & Web", "Computer Programming Language & Tool"]}
{"index":{"_index":"books","_id":"5"}}
{"title": "Head First Java","author": "Kathy Sierra and Bert Bates","edition":2, "synopsis": "The most important selling points of Head First Java is its simplicity and super-effective real-life analogies that pertain to the Java programming concepts.","amazon_rating": 4.3,"release_date": "2005-02-18","tags": ["IT Certification Exams", "Object-Oriented Software Design","Design Pattern Programming"]}
{"index":{"_index":"books","_id":"6"}}
{"title": "Java Concurrency in Practice","author": "Brian Goetz with Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes, and Doug Lea","edition": 1,"synopsis": "Java Concurrency in Practice is one of the best Java programming books to develop a rich understanding of concurrency and multithreading.","amazon_rating": 4.3,"release_date": "2006-05-09","tags": ["Computer Science Books", "Programming Languages", "Java Programming"]}
{"index":{"_index":"books","_id":"7"}}
{"title": "Test-Driven: TDD and Acceptance TDD for Java Developers","author": "Lasse Koskela","edition": 1,"synopsis": "Test-Driven is an excellent book for learning how to write unique automation testing programs. It is a must-have book for those Java developers that prioritize code quality as well as have a knack for writing unit, integration, and automation tests.","amazon_rating": 4.1,"release_date": "2007-10-22","tags": ["Software Architecture", "Software Design & Engineering", "Java Programming"]}
{"index":{"_index":"books","_id":"8"}}
{"title": "Head First Object-Oriented Analysis Design","author": "Brett D. McLaughlin, Gary Pollice & David West","edition": 1,"synopsis": "Head First is one of the most beautiful finest book series ever written on Java programming language. Another gem in the series is the Head First Object-Oriented Analysis Design.","amazon_rating": 3.9,"release_date": "2014-04-29","tags": ["Introductory & Beginning Programming", "Object-Oriented Software Design", "Java Programming"]}
{"index":{"_index":"books","_id":"9"}}
{"title": "Java Performance: The Definite Guide","author": "Scott Oaks","edition": 1,"synopsis": "Garbage collection, JVM, and performance tuning are some of the most favorable aspects of the Java programming language. It educates readers about maximizing Java threading and synchronization performance features, improve Java-driven database application performance, tackle performance issues","amazon_rating": 4.1,"release_date": "2014-03-04","tags": ["Design Pattern Programming", "Object-Oriented Software Design", "Computer Programming Language & Tool"]}
{"index":{"_index":"books","_id":"10"}}
{"title": "Head First Design Patterns", "author": "Eric Freeman & Elisabeth Robson with Kathy Sierra & Bert Bates","edition": 10,"synopsis": "Head First Design Patterns is one of the leading books to build that particular understanding of the Java programming language." ,"amazon_rating": 4.5,"release_date": "2014-03-04","tags": ["Design Pattern Programming", "Object-Oriented Software Design eTextbooks", "Web Development & Design eTextbooks"]}

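This will index 10 books into Elasticsearch. Because the index action replaces any existing document with the same ID, the three documents indexed earlier (IDs 1 to 3) are overwritten.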
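
Outside Kibana, the body of a _bulk request has to be assembled as newline-delimited JSON: an action line followed by a source line for each document, with a newline after the final line. The sketch below shows one way to build such a body in Python; the function name build_bulk_body and the abbreviated sample documents are illustrative assumptions, not part of the dataset file.

```python
import json


def build_bulk_body(index, documents):
    """Return an NDJSON _bulk body: an action line, then a source line, per document.

    `documents` maps document IDs to source dictionaries.
    """
    lines = []
    for doc_id, source in documents.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        lines.append(json.dumps(source))
    # The _bulk API expects the body to end with a newline character.
    return "\n".join(lines) + "\n"


# Two abbreviated documents (only title and author kept) for illustration.
sample = {
    2: {"title": "Effective Java", "author": "Joshua Bloch"},
    5: {"title": "Head First Java", "author": "Kathy Sierra and Bert Bates"},
}

body = build_bulk_body("books", sample)
print(body)
```

When posted with a plain HTTP client, such a body is typically sent with the Content-Type header set to application/x-ndjson.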

Matching a Word Across Multiple Fields

Execute the query to match "Java" across two fields - "title" and "synopsis"

GET books/_search
{
  "_source": {
    "includes": "title"
  },
  "query": {
    "multi_match": {
      "query": "Java",
      "fields": ["title","synopsis"]
    }
  }
}

The results will be something like this:

{
  ...
  "hits" : [{
    ...
    "_score" : 0.33537668,
    "_source" : {
      "title" : "Effective Java",
      "synopsis" : "A must-have book for every Java…",
      ...
    }
  },{
    ...
    "_score" : 0.30060259,
    "_source" : {
      "title" : "Head First Java",
      "synopsis" : "The most important selling points of Head First Java",
      ...
    }
  },
  ...
  ]
}

Boosting Queries

GET books/_search
{
  "_source": {
    "includes": ["title","synopsis"]
  },
  "query": {
    "multi_match": {
      "query": "Java",
      "fields": ["title^3","synopsis"]
    }
  }
}

The results would be like the following (compare the _score attribute before and after)

{
  ...
  "hits" : [{
    ...
    "_score" : 1.0061301,
    "_source" : {
      "title" : "Effective Java",
      "synopsis" : "A must-have book for every Java…",
      ...
    }
  },{
    ...
    "_score" : 0.90180784,
    "_source" : {
      "title" : "Head First Java",
      "synopsis" : "The most important selling points of Head First Java",
      ...
    }
  },
  ...
  ]
}

Comparing the scores, Effective Java scored 0.33537668 before boosting and 1.0061301 after boosting the title field. That is roughly three times higher, in line with the boost factor of 3 applied through title^3.

Searching for a phrase

Searching for books with an exact phrase

GET books/_search
{
  "query": {
    "match_phrase": {
      "synopsis": "must-have book for every Java programmer"
    }
  }
}

This query will result in:

"hits" : [{
  "_score" : 7.300332,
  "_source" : {
    "title" : "Effective Java",
    "synopsis" : "A must-have book for every Java programmer and Java ..."
  }
}]

Match phrase query with highlights

We can enable highlights in the returned results by adding a highlight object at the same level as the query object. Its fields property lists the fields we want highlighted:

GET books/_search
{
  "query": {
    "match_phrase": {
      "synopsis": "must-have book for every Java programmer"
    }
  },
  "highlight": {
    "fields": {
      "synopsis": {}
    }
  }
}

This query will return:

"hits" : [{
  "_source" : {
    ...
    "title" : "Effective Java",
    "synopsis" : "A must-have book for every Java ..."
  },
  "highlight" : {
    "synopsis" : [
      "A <em>must</em>-<em>have</em> <em>book</em> <em>for</em> <em>every</em> <em>Java</em> <em>programmer</em> and Java aspirant.."
    ]
  }
}]

The matches are wrapped in an HTML markup tag (em), indicating that the words are emphasised.

Phrases with missing words

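At times, we may have a word or two missing in a phrase. We can use a match_phrase query with the slop parameter to handle this; slop sets how many positions the terms may be out of place and still count as a match. The following query omits the word "for" from the phrase, so a slop of 1 is enough: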

GET books/_search
{
  "query": {
    "match_phrase": {
      "synopsis": {
        "query": "must-have book every Java programmer",
        "slop": 1
      }
    }
  }
}

Index ad hoc documents

PUT books/_doc/99
{
  "title":"Java Collections Deep Dive"
}
PUT books/_doc/100
{
  "title":"Java Computing World"
}

Matching phrases with a prefix

Query to fetch all books with a title having “Java co” prefix:

GET books/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "Java co"
    }
  }
}

This query will search for all books that have a title like Java concurrency, Java collections, Java computing and so on.

Fuzzy query

The fuzzy query forgives users' spelling mistakes. The following query returns Java related books in spite of the user incorrectly specifying the search word as 'kava':

GET books/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "kava",
        "fuzziness": 1 
      }
    }
  }
}

You should get hits:

{
  ...
  "hits" : {
    "total" : {
      "value" : 8,
      "relation" : "eq"
    }
    ...
  }
}
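
The fuzziness value is an edit distance: the number of single-character changes needed to turn the search word into an indexed term. As a rough illustration, the sketch below implements plain Levenshtein distance in Python. By default Elasticsearch also counts swapping two adjacent characters as a single edit, which this simplified version does not model; the function name is illustrative.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("kava", "java"))  # 1: within "fuzziness": 1
print(levenshtein("kaba", "java"))  # 2: would need "fuzziness": 2
```

One substitution (k to j) separates "kava" from "java", which is why a fuzziness of 1 is enough for the query above.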

Term level queries

term queries

A term query is used to fetch exact matches for a value provided in the search criteria.

Fetching third edition books

GET books/_search
{
  "_source": ["title","edition"], 
  "query": {
    "term": { 
      "edition": { 
        "value": 3
      }
    }
  }
}

This query returns all third edition books (we only have one book - Effective Java):

"hits" : [{
  ...
  "_score" : 1.0,
  "_source" : {
    "title" : "Effective Java",
    "edition" : 3,
    ...
  }
}]

Range queries

A range query to fetch books that rate between 4.5 and 5 stars

GET books/_search
{
  "query": {
    "range": {
      "amazon_rating": {
        "gte": 4.5,
        "lte": 5
      }
    }
  }
}

The above range query should fetch three books

Compound queries

A bool query

The must clause of a bool query returning all books authored by Joshua

GET books/_search
{
  "query": {
    "bool": {
      "must": [{
        "match": {
          "author": "Joshua Bloch"
        }
      }]
    }
  }
}

A must clause with multiple leaf queries

The must clause can have multiple leaf queries. For example, the following query finds all books written by Joshua whose synopsis contains an exact phrase:

GET books/_search
{
  "query": {
    "bool": {
      "must": [{
        "match": {
          "author": "Joshua Bloch"
        }
      },
      {
        "match_phrase": {
          "synopsis": "best Java programming books"
        }
      }]
    }
  }
}

The must_not clause

A bool query with must and must_not clauses in action. It fetches books authored by Joshua while excluding any rated below 4.7:

GET books/_search
{
  "query": {
    "bool": {
      "must": [{ "match": { "author": "Joshua" } }],
      "must_not": [{ "range": { "amazon_rating": { "lt": 4.7}}}] 
    }
  }
}

The should clause

A should clause increases the relevancy score when a match is found:

GET books/_search
{
  "query": {
    "bool": {
      "must": [{"match": {"author": "Joshua"}}],
      "must_not":[{"range":{"amazon_rating":{"lt":4.7}}}],
      "should": [{"match": {"tags": "Software"}}]
    }
  }
}

The filter clause

A filter clause narrows the results without affecting the relevancy score:

GET books/_search
{
  "query": {
    "bool": {
      "must": [{"match": {"author": "Joshua"}}],
      "must_not":[{"range":{"amazon_rating":{"lt":4.7}}}],
      "should": [{"match": {"tags": "Software"}}],
      "filter":[{"range":{"release_date":{"gte": "2015-01-01"}}}]
    }
  }
}

Filter clause with multiple leaf queries

The bool query with an additional filter on the edition field:

GET books/_search
{
  "query": {
    "bool": {
      "must": [{"match": {"author": "Joshua"}}],
      "must_not":[{"range":{"amazon_rating":{"lt":4.7}}}],
      "should": [{"match": {"tags": "Software"}}],
      "filter":[
        {"range":{"release_date":{"gte": "2015-01-01"}}},
        {"term": {"edition": 3}}
      ]
    }
  }
}
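
To see how the clauses combine, the following plain-Python sketch applies the same conditions to two abbreviated documents from the bulk dataset. It is only an analogy: the word-splitting stands in for Elasticsearch's text analysis, and the should clause is left out because, alongside a must clause, it only influences scoring rather than which documents are returned.

```python
from datetime import date

# Abbreviated copies of two documents from the bulk dataset.
books = [
    {"title": "Effective Java", "author": "Joshua Bloch", "edition": 3,
     "amazon_rating": 4.7, "release_date": date(2017, 12, 27)},
    {"title": "Java Concurrency in Practice",
     "author": "Brian Goetz with Tim Peierls, Joshua Bloch, Joseph Bowbeer, "
               "David Holmes, and Doug Lea",
     "edition": 1, "amazon_rating": 4.3, "release_date": date(2006, 5, 9)},
]


def matches(book):
    """Plain-Python stand-in for the must, must_not, and filter clauses."""
    # must: the author field contains the word "joshua" (crude tokenisation).
    must = "joshua" in book["author"].lower().replace(",", " ").split()
    # must_not: exclude anything rated below 4.7.
    must_not = book["amazon_rating"] < 4.7
    # filter: released on or after 2015-01-01 and a third edition.
    filters = book["release_date"] >= date(2015, 1, 1) and book["edition"] == 3
    return must and not must_not and filters


print([book["title"] for book in books if matches(book)])  # ['Effective Java']
```

Both books pass the must condition, but Java Concurrency in Practice is dropped by must_not (rating 4.3) and would also fail both filters.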

Aggregations

Copy the contents of covid-26march2021.txt from the datasets folder (https://github.com/madhusudhankonda/elasticsearch-in-action/blob/main/datasets/covid-26march2021.txt) into Kibana's Dev Tools console. Once copied, execute them using the _bulk API.

Metric aggregations

Sum metric

Fetching the total number of critical patients

GET covid/_search
{
  "size": 0, 
  "aggs": {
    "critical_patients": {
      "sum": {
        "field": "critical"
      }
    }
  }
}

This should return:

"aggregations" : {
  "critical_patients" : {
    "value" : 88090.0
  }
}

Max metric

The query to fetch the highest number of deaths among the 10 countries we have in our data set:

GET covid/_search
{
  "size": 0, 
  "aggs": {
    "max_deaths": {
      "max": {
        "field": "deaths"
      }
    }
  }
}

The result would be:

"aggregations" : {
  "max_deaths" : {
    "value" : 561142.0
  }
}

Stats metric

We can also find the minimum (min), the average (avg), and other metrics individually. However, one statistical function returns all these basic metrics in one go, the stats metric:

GET covid/_search
{
  "size": 0, 
  "aggs": {
    "all_stats": {
      "stats": {
        "field": "deaths"
      }
    }
  }
}

Here’s the snippet of the response:

"aggregations" : {
  "all_stats" : {
    "count" : 20,
    "min" : 30772.0,
    "max" : 561142.0,
    "avg" : 163689.1,
    "sum" : 3273782.0
  }
}
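
As a rough illustration of what the stats metric reports, the sketch below computes the same five figures in plain Python. The sample values are made up for the example and are not taken from the covid dataset; the function name basic_stats is illustrative.

```python
def basic_stats(values):
    """Compute the five figures the stats metric reports for a numeric field."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


# Hypothetical values, not taken from the covid dataset.
sample = [100.0, 250.0, 400.0]
print(basic_stats(sample))

# The response shown above is internally consistent: sum / count gives the avg.
print(3273782.0 / 20)
```

The last line divides the reported sum by the reported count, which is a quick sanity check against the avg of 163689.1 in the response.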

Extended stats

The extended_stats metric returns further statistics, such as variance and standard deviation:

GET covid/_search
{
  "aggs": {
    "all_extended_stats": {
      "extended_stats": {
        "field": "deaths"
      }
    }
  }
}

Bucketing aggregations

Histogram buckets

Fetching the countries by number of critical patients in buckets of 2500:

GET covid/_search
{
  "size": 0,
  "aggs": {
    "critical_patients_as_histogram": {
      "histogram": {
        "field": "critical",
        "interval": 2500
      }
    }
  }
}

The response should be:

"aggregations" : {
  "critical_patients_as_histogram" : {
    "buckets" : [{
      "key" : 0.0,
      "doc_count" : 8
    },
    {
      "key" : 2500.0,
      "doc_count" : 6
    },
    {
      "key" : 5000.0,
      "doc_count" : 0
    },
    {
      "key" : 7500.0,
      "doc_count" : 6
    }]
  }
}
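
Each bucket key is the lower bound of its interval: a value is rounded down to the nearest multiple of the interval. The Python sketch below mirrors that rule with made-up values (not taken from the covid dataset) and, like the response above, lists empty buckets that fall between the first and last key.

```python
import math
from collections import Counter


def bucket_key(value, interval, offset=0):
    """Lower bound of the histogram bucket that `value` falls into."""
    return math.floor((value - offset) / interval) * interval + offset


# Hypothetical critical-patient counts, not taken from the covid dataset.
critical = [120, 2499, 2500, 3100, 8800]
counts = Counter(bucket_key(value, 2500) for value in critical)

# List every bucket from the first key to the last, including empty ones.
for key in range(min(counts), max(counts) + 1, 2500):
    print(float(key), counts.get(key, 0))
```

With these sample values the 5000.0 bucket is empty but still listed, just as in the response above.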

Range buckets

Casualties by custom ranges using the range bucketing aggregation. The field property names the field the aggregation applies to, and ranges defines the custom ranges:

GET covid/_search
{
  "size": 0, 
  "aggs": {
    "range_countries": {
      "range": {
        "field": "deaths",
        "ranges": [
          {"to": 60000},
          {"from": 60000,"to": 70000},
          {"from": 70000,"to": 80000},
          {"from": 80000,"to": 120000}
        ]
      }
    }
  }
}

This will return

"aggregations" : {
  "range_countries" : {
    "buckets" : [{
      "key" : "*-60000.0",
      "to" : 60000.0,
      "doc_count" : 2
    },{
      "key" : "60000.0-70000.0",
      "from" : 60000.0,
      "to" : 70000.0,
      "doc_count" : 0
    },{
      "key" : "70000.0-80000.0",
      "from" : 70000.0,
      "to" : 80000.0,
      "doc_count" : 4
    },{
      "key" : "80000.0-120000.0",
      "from" : 80000.0,
      "to" : 120000.0,
      "doc_count" : 6
    }]
  }
}
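
In a range aggregation each bucket includes its from value and excludes its to value. The Python sketch below illustrates that boundary rule with the four ranges from the request and some made-up values (not taken from the covid dataset); the helper name in_range is illustrative.

```python
def in_range(value, lower=None, upper=None):
    """True when lower <= value < upper; a missing bound is treated as open-ended."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value >= upper:
        return False
    return True


# The four custom ranges from the request, as (from, to) pairs.
ranges = [(None, 60000), (60000, 70000), (70000, 80000), (80000, 120000)]

# Hypothetical death counts chosen to sit on the bucket boundaries.
deaths = [59999, 60000, 79999, 80000, 120000]

for lower, upper in ranges:
    doc_count = sum(in_range(value, lower, upper) for value in deaths)
    print(lower, upper, doc_count)
```

Note that a value of 120000 or more falls into none of the buckets. That may be why the doc_count values in the response add up to 12 even though the stats response earlier reported a count of 20.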