4. Search - madhusudhankonda/elasticsearch-next-steps GitHub Wiki
We use the Books dataset for these queries and exercises.
There are two variants of search - full-text and structured search.
Structured search queries return only the documents that match the criteria exactly.
On the other hand, full-text (unstructured) queries try to find relevant results. Elasticsearch employs a similarity algorithm to generate a relevance score for full-text queries. The score is a positive floating-point number attached to each result, and the document with the highest score is the most relevant to the query criteria.
The structured or unstructured search is executed in an execution context by Elasticsearch - filter or query context respectively.
Although we have no say in asking Elasticsearch to apply a certain type of context, it is our query that lets Elasticsearch decide on applying the appropriate context.
A structured search will result in a binary yes/no answer, hence Elasticsearch uses a filter context for this. Remember there are no relevance scores expected for these results, so filter context is the appropriate one.
Of course the queries on full-text search fields will be run in a query context as they must have a scoring associated with each of the matched documents.
We will look at some examples to demonstrate these contexts in action, but in the meantime let’s find out how we can access the Elasticsearch search endpoints.
Elasticsearch exposes the Search API via its _search endpoint. There are two ways of accessing the search endpoint:
- URI Search Request: In this method, we pass the search query as URL parameters alongside the endpoint.
- Query DSL: Elasticsearch has implemented a domain-specific language (DSL) for search. The criteria are passed in as a JSON object in the request payload when using Query DSL.
Elasticsearch developed this special-purpose language and syntax, called Query DSL, for querying the data.
The Query DSL is a sophisticated, powerful, and expressive language for creating a multitude of queries, ranging from simple and basic to complex and nested ones. It is a JSON-based query language in which deeply nested queries can be constructed for both search and analytics. The format looks like this:
GET books/_search
{
"query": {
"match": {
"FIELD": "TEXT"
}
}
}
Now that we know the two methods of querying for search results, there is an important distinction you should know - full-text and term-level queries. But before picking up these concepts, we need to understand what relevancy is.
Modern search engines not only return results based on your query's criteria but also analyze and return the most relevant results first. If you are searching for "Java" in the title of a book, a document containing more than one occurrence of the word "Java" in the title is more relevant than the documents where the title has one or no occurrence.
Elasticsearch uses the Okapi BM25 relevancy algorithm by default for scoring the returned results so the client can expect relevant results. At a high level, BM25 builds on TF/IDF (Term Frequency / Inverse Document Frequency).
Term Frequency (TF) is a measure of how frequently the word appears in the field of the document. It is the number of times the search word appears in the search field. The higher the frequency, the higher the score.
Document frequency is the number of documents across the whole index that contain the word, and the Inverse Document Frequency (IDF) is its inverse. The more documents contain the word, the lower the relevance (hence inverse document frequency).
The field-length norm is another factor used in calculating the relevancy. An occurrence of the search word in a short field (say 20 characters in field1) is more relevant than the same occurrence in a long field (say 200 characters in field2).
Relevancy is a positive floating-point number that determines the ranking of the search results. In Elasticsearch, the relevancy is attached as _score to the results.
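To make the three factors above concrete, here is a minimal sketch of a textbook BM25 formula for a single term. It is an illustration under stated assumptions, not the exact code Elasticsearch runs: the function name and parameters are made up for this example, and only the defaults k1=1.2 and b=0.75 mirror the Elasticsearch defaults.

```python
import math


def bm25_term_score(term_freq, doc_freq, doc_count, field_len, avg_field_len,
                    k1=1.2, b=0.75):
    """Score one term in one field with a textbook BM25 formula (illustrative)."""
    # IDF: terms that appear in fewer documents get a larger weight.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Field-length norm: fields longer than average dampen the term frequency.
    length_norm = 1 - b + b * (field_len / avg_field_len)
    # TF with saturation: repeated occurrences help, with diminishing returns.
    tf = term_freq * (k1 + 1) / (term_freq + k1 * length_norm)
    return idf * tf


# Hypothetical index statistics: 50 books, 5 of them mention the term.
once = bm25_term_score(term_freq=1, doc_freq=5, doc_count=50,
                       field_len=10, avg_field_len=10)
twice = bm25_term_score(term_freq=2, doc_freq=5, doc_count=50,
                        field_len=10, avg_field_len=10)
print(once < twice)  # more occurrences in the field score higher
```

Varying one argument at a time (a rarer term, a shorter field) is an easy way to see how each factor moves the score.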
POST books/_search
{
"_source": ["amazon_rating","author"],
"query": {
"term": {
"author": "Joshua"
}
}
}
// Doesn't return results. Why? (Clue: the term query doesn't analyze its input!) Change to `joshua` and try.
GET books/_search
{
"query": {
"match": {
"author": "Joshua"
}
}
}
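The difference between the term and match queries above can be sketched with a toy inverted index. This is a simplified model, not Elasticsearch code: `analyze` is a rough stand-in for the standard analyzer (split into words, lowercase), and the document IDs are hypothetical.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: split into words and lowercase."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Map each analyzed token to the set of document IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_query(index, value):
    """Look the value up exactly as given; no analysis is applied."""
    return index.get(value, set())


def match_query(index, text):
    """Analyze the query text first, then look up each resulting token."""
    result = set()
    for token in analyze(text):
        result |= index.get(token, set())
    return result


# Hypothetical author field values keyed by document ID.
index = build_index({1: "Joshua Bloch", 2: "Herbert Schildt"})
print(term_query(index, "Joshua"))   # no hit: the index only holds "joshua"
print(term_query(index, "joshua"))   # hit
print(match_query(index, "Joshua"))  # hit: the query text is lowercased first
```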
#IDs
GET books/_search
{
"query": {
"ids": {
"values": [1,2]
}
}
}
#terms
GET books/_validate/query
{
"query": {
"terms": {
"author": ["joshua","joseph"]
}
}
}
#Range query
GET books/_search
{
"_source": "amazon_rating",
"query": {
"range": {
"amazon_rating": {
"gte": 4.5,
"lte": 5
}
}
}
}
#Prefix - should return the Concurrency book!
GET books/_search
{
"_source": "title",
"query": {
"prefix": {
"title": {
"value": "con"
}
}
}
}
# Wildcard with highlighting
GET books/_search
{
"_source": false,
"query": {
"wildcard": {
"title": {
"value": "*st"
}
}
},"highlight": {
"fields": {
"title": {}
}
}
}
#Fuzzy
GET books/_search
{
"_source": false,
"query": {
"fuzzy": {
"title": {
"value": "kaava",
"fuzziness": 2
}
}
},"highlight": {
"fields": {
"title": {}
}
}
}
There are a handful of full-text queries that the Elasticsearch search API exposes:
- Match all
- Match query
- Match Phrase
- Match Phrase Prefix
- Multi match
There are others as well. Let's see a few of them in action.
Full-text queries work on fields that are unstructured. The match_all query, as the name suggests, fetches ALL the documents, as the examples below show:
#Matchall books index
GET books/_search
{
"query": {
"match_all": {}
}
}
#Match-all wildcard indices
GET bo*/_search
{
"query": {
"match_all": {}
}
}
#Match-all all indices - remember this brings ALL the documents across ALL the indices
GET _search
{
"query": {
"match_all": {}
}
}
#Match-all multi index query
GET covid,books/_search
{
"query": {
"match_all": {}
}
}
#Match
GET books/_search
{
"explain": true,
"query": {
"match": {
"author": "Joshua"
}
}, "highlight": {
"fields": {
"author": {}
}
}
}
#Match-all wildcard indices, with a boost
GET bo*/_search
{
"query": {
"match_all": {"boost": 2.0}
}
}
Try matching with the author as "Josh" - there wouldn't be any results. Why?
Match queries find documents that satisfy the given search criteria, usually against the body of a text field.
#Match query - matches all books with given tags
GET books/_search
{
"query": {
"match": {
"tags": "Java programming"
}
},
"highlight": {"fields": {"tags": {}}}
}
#Match - run the same for explanation
GET books/_search
{
"explain": true,
"query": {
"match": {
"tags": "Java programming"
}
},
"highlight": {"fields": {"tags": {}}}
}
The above query is translated to Java OR programming, as OR is the default operator (rerun the query and check the highlights to see the operator in action).
We can change the operator by adding an operator to the query clause. Do note there's a slight variation in defining the query clause, as demonstrated below:
GET books/_search
{
"query": {
"match": {
"tags": {
"query": "Computer Elasticsearch",
"operator": "AND"
}
}
}, "highlight": {"fields": {"tags": {}}}
}
// Try NOT as an operator. Does it work? (refer to the Operator spec: https://www.javadoc.io/doc/org.elasticsearch/elasticsearch/5.0.0/org/elasticsearch/index/query/Operator.html)
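The effect of the operator can be sketched in a few lines. This is a simplified model that assumes whitespace and punctuation tokenization with lowercasing; the function name is made up for illustration, and the tag strings are taken from the sample responses on this page.

```python
import re


def match(field_text, query_text, operator="OR"):
    """Return True if the field matches the query under the given operator."""
    field_tokens = set(re.findall(r"\w+", field_text.lower()))
    query_tokens = re.findall(r"\w+", query_text.lower())
    hits = [token in field_tokens for token in query_tokens]
    # AND needs every query word present; OR needs at least one.
    return all(hits) if operator.upper() == "AND" else any(hits)


tags = "Computer Science Books"
print(match(tags, "Computer Elasticsearch"))         # OR: one word is enough
print(match(tags, "Computer Elasticsearch", "AND"))  # AND: both words are required
```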
We can match documents even when the query has spelling mistakes by using fuzziness. Elasticsearch uses the Levenshtein edit distance to apply fuzziness. If the fuzziness is set to 1, one spelling mistake is forgiven, as when we search for Compuuter. Follow the example below:
// With Fuzziness (see the spelling mistake in the query) - Levenshtein Edit Distance
GET books/_search
{
"query": {
"match": {
"tags": {
"query": "Compuuter Elasticsearch",
"operator": "OR",
"fuzziness": 1
}
}
}, "highlight": {"fields": {"tags": {}}}
}
// Exercise: try with two spelling mistakes: Compuutter for eg
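For reference, the edit distance behind fuzziness can be computed with the classic dynamic-programming algorithm. The sketch below is plain Levenshtein distance (insertions, deletions, substitutions) on lowercased words. It is meant to help reason about the exercises, and it does not model everything Elasticsearch does when expanding a fuzzy query.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("compuuter", "computer"))   # 1: within fuzziness 1
print(levenshtein("compuutter", "computer"))  # 2: needs fuzziness 2
print(levenshtein("kaava", "java"))           # 2: the earlier fuzzy example
```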
We may need to match a text against multiple fields of a document. This is where the multi_match query comes in handy:
# Multimatch
GET books/_search
{
"_source": false,
"query": {
"multi_match": {
"query": "Java",
"fields": ["tags","synopsis"]
}
}, "highlight": {"fields": {"tags": {},"synopsis": {}}}
}
// Response
{
"synopsis" : [
"Core <em>Java</em> Volume I – Fundamentals is a <em>Java</em> reference book t..."
],
"tags" : [
"Programming Languages, <em>Java</em> Programming"
]
}
Should we wish to search for a fixed phrase (with the words in the same order), the match_phrase query comes to the rescue:
# Match Phrase - checks out exact phrase "and lambda expressions" in the synopsis field
GET books/_search
{
"_source": false,
"query": {
"match_phrase": {
"synopsis": "and lambda Expressions"
}
},"highlight": {"fields": {"synopsis": {}}}
}
// Try removing the lambda from the phrase and see what happens?
The slop setting allows the match phrase query to be a bit more lenient when searching for a phrase. That is, instead of requiring the exact phrase, slop tells Elasticsearch how many word positions the query terms may be moved (for instance, how many intervening words to tolerate). Let's rerun the same example as above, but this time remove lambda from the query and add a slop of 1:
#Match phrase with slop
GET books/_search
{
"_source": false,
"query": {
"match_phrase": {
"synopsis": {
"query": "and expressions",
"slop": 1
}
}
},"highlight": {"fields": {"synopsis": {}}}
}
#Match phrase with slop setting 2
GET books/_search
{
"_source": false,
"query": {
"match_phrase": {
"synopsis": {
"query": "including interfaces",
"slop": 2
}
}
},"highlight": {"fields": {"synopsis": {}}}
}
// now try swapping the words - interfaces including. Does this work? Order is important when searching using phrase search.
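The idea of slop can be sketched as counting how many extra word positions sit between the query terms. This is a deliberately simplified model: it only handles terms that appear in query order, whereas the real implementation also scores reordered terms at a higher slop cost. The synopsis text and the function name are hypothetical.

```python
def phrase_match(text, phrase, slop=0):
    """True if the phrase terms occur in order with at most `slop` extra positions."""
    tokens = text.lower().split()
    terms = phrase.lower().split()
    # Positions at which each query term occurs in the text.
    positions = [[i for i, tok in enumerate(tokens) if tok == term] for term in terms]

    def search(term_idx, prev_pos, start_pos):
        if term_idx == len(terms):
            # Extra positions beyond a perfectly adjacent phrase.
            gap = (prev_pos - start_pos) - (len(terms) - 1)
            return gap <= slop
        return any(search(term_idx + 1, pos, start_pos)
                   for pos in positions[term_idx] if pos > prev_pos)

    return any(search(1, pos, pos) for pos in positions[0])


synopsis = "covers streams and lambda expressions in depth"  # made-up sample text
print(phrase_match(synopsis, "and lambda expressions"))   # exact phrase matches
print(phrase_match(synopsis, "and expressions"))          # one word missing: no match
print(phrase_match(synopsis, "and expressions", slop=1))  # slop 1 tolerates the gap
```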
The match_phrase_prefix query works like match_phrase, except that it matches the last word in the query as a prefix.
##Match phrase prefix
GET books/_search
{
"query": {
"match_phrase_prefix": {
"tags": "boo"
}
},"highlight": {"fields": {"tags": {}}}
}
// This will return
"highlight" : {
"tags" : [
"Computer Science <em>Books</em>"
]
}
The "boo" matched with "Books"
// Try setting the tags field to "Computer Sci"
// Try setting the tags field to "Compu Sci"
Compound queries combine one or more leaf queries, and can wrap other compound queries as well. They are the most advanced part of the Query DSL and are what one should use for querying the data with complex criteria.
Elasticsearch provides five types of compound queries:
- Boolean (by far the most useful)
- Constant score
- Function score
- Disjunction max
- Boosting
We will cover Boolean and Constant score here
The Boolean query is the most popular and flexible compound query one can use to create a set of complex criteria for searching data. As the name indicates, it is a combination of boolean clauses, with each clause holding individual queries such as the term-level or full-text queries we've seen so far. Each of these clauses has a typed occurrence of must, must_not, should or filter.
- The must clause is an AND query: a document must match all of the query criteria
- The must_not clause is a NOT query: a document must not match any of the query criteria
- The should clause is an OR query: a document should match at least one of the query criteria
- The filter clause is a filter query: a document must match the query criteria (similar to the must clause), except that the filter clause does not contribute to the score
Note: the must and should clauses contribute to the relevance scoring while must_not and filter do not.
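The matching rules for the four clause types can be sketched as a small predicate. This is a simplified model, not Elasticsearch code: clauses are plain Python functions, the sample document is hypothetical, and the should handling assumes the default behaviour where at least one should clause is required only when there are no must or filter clauses.

```python
def bool_matches(doc, must=(), must_not=(), should=(), filters=()):
    """Decide whether a document matches a bool query built from predicates."""
    if not all(clause(doc) for clause in must):
        return False  # every must clause has to match
    if not all(clause(doc) for clause in filters):
        return False  # every filter clause has to match (no scoring)
    if any(clause(doc) for clause in must_not):
        return False  # no must_not clause may match
    if should and not must and not filters:
        # With only should clauses, at least one of them has to match.
        return any(clause(doc) for clause in should)
    return True  # otherwise should clauses are optional


# Hypothetical document and clauses for illustration.
book = {"author": "joshua bloch", "amazon_rating": 4.7, "edition": 3}
by_joshua = lambda d: "joshua" in d["author"].split()
low_rating = lambda d: d["amazon_rating"] < 4.5
first_edition = lambda d: d["edition"] == 1

print(bool_matches(book, must=[by_joshua], must_not=[low_rating]))
print(bool_matches(book, must=[by_joshua], filters=[first_edition]))
```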
Let's see the Boolean Query in action.
The bool query with empty clauses will look like this:
GET books/_search
{
"query": {
"bool": {
"must": [
{}
],
"must_not": [
{}
],
"should": [
{}
],
"filter": [
{}
]
}
}
}
Each of the clauses accepts an array of queries. For example, you can provide multiple term-level and full-text queries inside any of these clauses, as shown below:
GET books/_search
{
"query": {
"bool": {
"must_not": [
{
"match": {
"FIELD": "TEXT"
}
},{
"term": {
"FIELD": {
"value": "VALUE"
}
}
}
],
"should": [
{
"range": {
"FIELD": {
"gte": 10,
"lte": 20
}
}
},{
"terms": {
"FIELD": [
"VALUE1",
"VALUE2"
]
}
}
]
}
}
}
The must clause can be put to work by matching tags for 'computer':
GET books/_search
{
"_source": false,
"query": {
"bool": {
"must": [
{
"match": {
"tags": "computer"
}
}
]
}
},"highlight": {
"fields": {"tags": {}}
}
}
If your query is throwing errors, you can use the _validate API to find out the issues with the query:
GET books/_validate/query?explain
Let's add another query clause - this time our query must match computer in tags as well as the word java in the title (term query):
# Must query
GET books/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"tags": "computer"
}
},
{
"term": {
"title": "java"
}
}
]
}
},"highlight": {
"fields": {"tags": {},"title": {}}
}
}
Try changing java to Java and see if any results are returned. (Clue: term vs match query.)
The must clause will add to the relevance score of the results.
As the name suggests, documents must not match the criteria specified in the must_not clauses. For example, the query below fetches all documents authored by Joshua with a rating of no less than 4.5:
GET books/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"author": "Joshua"
}
}
],
"must_not": [
{
"range": {
"amazon_rating": {
"lt": 4.5
}
}
}
]
}
}
}
The should clause is an OR clause. It adds to the scoring, as any document that matches any of the should clauses gets a higher relevance score.
#Should
GET books/_search
{
"_source": false,
"query": {
"bool": {
"should": [
{
"match": {
"title": "Elasticsearch"
}
},{
"term": {
"author": {
"value": "joshua"
}
}
}
]
}
},"highlight": {"fields": {"title": {},"author": {}}}
}
The above will try to match all the titles with the word "Elasticsearch" in them, and that clause finds no matches. But of course, there's another clause attached to the should query: the term query. This will try to fetch the documents with Joshua as the author. As you can expect, the above query returns results because one of the queries matches the criteria (as opposed to must, where all conditions must be satisfied).
As an exercise, change the match query to look for 'Java' instead of Elasticsearch - do you see any difference in the relevance score? (The relevance score is higher when all the should clauses are matched!)
Similar to the must clause, the filter clause fetches all the documents which match the criteria. The only difference is that the filter clause runs in a filter context and hence, as you'd expect, no relevance scores are added to the document results.
Let's match all the documents with the first edition - this time we use the filter clause:
// Search for all 1st edition books
GET books/_search
{
"query": {
"bool": {
"filter": [
{
"term": {
"edition": 1
}
}
]
}
}
}
Scoring is ignored for filter queries. You can create multiple filters, as demonstrated in the snippet below:
// Fetch the 3rd edition books written by Joshua
GET books/_search
{
"query": {
"bool": {
"filter": [
{
"term": {
"edition": 3
}
},{
"match":{
"author":"Joshua"
}
}
]
}
}
}
We can combine must and filter as shown below:
GET books/_search
{
"_source":false,
"query": {
"bool": {
"must": [
{
"match": {
"author": "Joshua"
}
}
],
"filter": [
{
"term": {
"edition": 1
}
}
]
}
},"highlight": {
"fields": {"author": {}, "edition": {}}
}
}
// Response - do keep a note of the score
"hits" : [
{
"_index" : "books",
"_type" : "_doc",
"_id" : "6",
"_score" : 0.81000566,
"highlight" : {
"author" : [
"Brian Goetz with Tim Peierls, <em>Joshua</em> Bloch, Joseph Bowbeer, David Holmes, and Doug Lea"
]
}
}
]
The response indicates the score for the result is merely 0.81. As we discussed, filter is similar to the must query except that it's not run in a query context. So, why don't we move the filter clause into the must clause as a second must clause:
// We've moved the filter criteria to the must clause
GET books/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"author": "Joshua"
}
},{
"term": {
"edition": {
"value": 1
}
}
}
]
}
}
}
//Response
"hits" : [
{
"_index" : "books",
"_type" : "_doc",
"_id" : "6",
"_score" : 1.8100057,
"highlight" : {
"author" : [
"Brian Goetz with Tim Peierls, <em>Joshua</em> Bloch, Joseph Bowbeer, David Holmes, and Doug Lea"
]
}
}
]
// The score attribute has increased, did you notice?!
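The jump in the score can be reproduced with simple arithmetic, on the understanding that a bool query adds up the scores of its scoring clauses while filter clauses add nothing. In the sketch below the function is made up for illustration, and the 1.0 contributed by the term clause on edition is an assumption inferred from the difference between the two responses above.

```python
import math


def bool_score(scoring_clause_scores, filter_clause_scores=()):
    """Sum the scores of must/should clauses; filter clauses are ignored for scoring."""
    return sum(scoring_clause_scores)


match_author = 0.81000566  # score of the match clause, from the first response
term_edition = 1.0         # assumed constant score of the term clause on edition

as_filter = bool_score([match_author], filter_clause_scores=[term_edition])
as_must = bool_score([match_author, term_edition])

print(as_filter)  # the term clause in filter context adds nothing
print(math.isclose(as_must, 1.8100057, abs_tol=1e-6))  # matches the second response
```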
Now let's combine must, must_not, should and filter together. We will find all books written by Joshua (must), with a rating of no less than 4.5 (must_not), preferably tagged with Java (should), and released after 2015-01-01 (filter).
// Match books written by Joshua, must not have any rating less than 4.5, should have java in tags and filter by release_date
GET books/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"author": "Joshua"
}
}
],
"must_not": [
{
"range": {
"amazon_rating": {
"lt": 4.5
}
}
}
],
"should": [
{
"match": {
"tags": "Java"
}
}
],
"filter": [
{
"range": {
"release_date": {
"gt": "2015-01-01"
}
}
}
]
}
}
}
The boosting query helps to boost the scoring for certain queries while intentionally lowering the scores for other matches. It is composed of two components: positive and negative blocks. We provide appropriate query criteria in these blocks - the positive block is where you write the query whose matches you want scored higher, while the documents matching the negative block query will have their score intentionally lowered by the negative_boost parameter. The negative_boost parameter is a positive floating-point number between 0 and 1.
Let's check out an example: we need to improve the scoring for all the documents which have the word 'Java' in the title, but the books written by Herbert Schildt should have a lower score (for no particular reason other than for demonstration purposes, Herbert!)
## Boosting
// Fetch all the Java titles but lower the score for Herbert's titles by a factor of 0.5
GET books/_search
{
"_source": ["title", "author"],
"query": {
"boosting": {
"positive": {
"match": {
"title": "Java"
}
},
"negative": {
"match": {
"author": "Herbert"
}
},
"negative_boost": 0.5
}
},"highlight": {"fields": {"author": {}, "tags": {}}}
}
You can see the score for Herbert's book has been reduced from 0.3027879 (I ran the query with Cay as the author in the negative block to get the original score) to 0.15139395, which is derived as original_score * negative_boost.
The negative_boost must be between 0 and 1. Setting negative_boost to 0 gives a score of 0 to all documents matched by the negative query, while setting it to 1 does not alter the original scoring.
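The arithmetic described above fits in a few lines. The function below is a hypothetical helper for illustration, not an Elasticsearch API; it simply applies original_score * negative_boost to documents that also match the negative query.

```python
import math


def boosting_score(positive_score, matches_negative, negative_boost):
    """Final score of a document that matched the positive query."""
    if not 0 <= negative_boost <= 1:
        raise ValueError("negative_boost must be between 0 and 1")
    # Only documents that also match the negative query are demoted.
    return positive_score * negative_boost if matches_negative else positive_score


original = 0.3027879  # score reported above before demotion
demoted = boosting_score(original, matches_negative=True, negative_boost=0.5)
print(math.isclose(demoted, 0.15139395))  # True: half the original score
```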