# ElasticSearch query language
ElasticSearch has its own query language; more documentation can be found here.
## Anatomy of a query
In most cases, an ElasticSearch query is composed of:

- a header defining:
  - which method is to be used: `GET`, `POST`, etc.
  - which index (i.e. which entity of the database, written in snake_case) is going to be queried; not necessary if all indexes are to be queried
  - what type of search is to be made (`_search` most of the time)
- a body defining (among other things):
  - the fields that are going to appear in the results (`_source`)
  - the filters that narrow down the results (`query`)
  - different properties of the results (size of the results, index from which to begin, etc.)
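Putting those pieces together, here is a minimal annotated sketch (the `primary_source` index and the `shelfmark` field are borrowed from the examples further down; `size` and `from` are standard result properties):

```
GET primary_source/_search
{
  "_source": ["shelfmark"],   // fields returned for each hit
  "query": {
    "match_all": {}           // filter: here, match everything
  },
  "size": 20,                 // number of hits to return (default: 10)
  "from": 0                   // index of the first hit to return
}
```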
### Tips & tricks

The body of a query needs to be a correctly formatted JSON string; even using single quotes instead of double quotes is considered an error. The *Dev tools* tab of the Kibana interface offers automatic indentation and autocompletion features that can be very handy.
## Get all records

To get all the records of the entire database:

```
GET _search
```

To retrieve all records from an index (`primary_source` in this case):

```
GET primary_source/_search
```
Which is equivalent to:

```
GET primary_source/_search
{
  "query": {
    "match_all": {}
  }
}
```
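Note that by default, ElasticSearch only returns the first 10 hits; the `size` property of the body raises that limit, as in this sketch:

```
GET primary_source/_search
{
  "query": {
    "match_all": {}
  },
  "size": 50
}
```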
## Matching query

### Simple/exact term match

All the works that contain the string "tabule" in their title.

Note that the case of the letters doesn't matter: it will match "Tabule" as well. However, in this configuration, a sub-string will not match the same way as the entire string ("tabu" will not match "tabule"); the string is treated as a complete word.
```
GET work/_search
{
  "query": {
    "match": {
      "title": "tabule"
    }
  }
}
```
All the original items that are associated with a primary source kept in the library whose id is `2`.
```
GET original_text/_search
{
  "query": {
    "match": {
      "primary_source.library.id": "2"
    }
  }
}
```
### Multiple terms match

All original items that have either the word "solis" or the word "lune" in their title.

In this configuration, each string separated by a space is treated individually, and the operator used to connect them is `OR`. In other words, the more terms you add, the more original items you will match.
```
GET original_text/_search
{
  "query": {
    "match": {
      "original_text_title": "solis lune"
    }
  }
}
```
To get a response where each given term independently filters the results (the same kind of behavior as a Google query), you need to specify the operator to be `AND`.
```
GET original_text/_search
{
  "query": {
    "match": {
      "original_text_title": {
        "query": "lune solis",
        "operator": "AND"
      }
    }
  }
}
```
## Adding some margin of error

### Fuzziness

Fuzziness allows a certain amount of inaccuracy to be accepted.

All libraries that approximately have the string "natonale" in their name.
```
GET library/_search
{
  "query": {
    "match": {
      "library_name": {
        "query": "natonale",
        "fuzziness": "auto"
      }
    }
  }
}
```
You can set the fuzziness to `1` or more, but the `auto` setting allows a number of non-matching letters that is proportional to the length of the searched term.
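For instance, a sketch of the same query with an explicit edit distance of `1` (at most one letter may differ):

```
GET library/_search
{
  "query": {
    "match": {
      "library_name": {
        "query": "natonale",
        "fuzziness": 1
      }
    }
  }
}
```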
### Full text search on an index

To allow searching on every field of an entity, the query has to be set to `multi_match`.

All edited texts that have approximately the string "lune" in one of their fields.
```
GET edited_text/_search
{
  "query": {
    "multi_match": {
      "query": "lune",
      "fuzziness": "auto"
    }
  }
}
```
As is, this kind of request is deprecated because no fields are specified: ElasticSearch encourages you to list the fields the query should be executed on. The `fields` property allows you to reduce noise in the results and makes the query faster.

All primary sources that approximately match the strings "vatican" and "latin" in the list of specified fields.
```
GET primary_source/_search
{
  "query": {
    "multi_match": {
      "query": "vatican latin",
      "fuzziness": "auto",
      "operator": "and",
      "fields": [
        "shelfmark",
        "digital_identifier",
        "kibana_name",
        "tpq.keyword",
        "taq.keyword",
        "prim_type",
        "library.kibana_name",
        "original_texts.kibana_name",
        "original_texts.table_type.kibana_name",
        "original_texts.place.kibana_name",
        "original_texts.historical_actor.kibana_name",
        "original_texts.script.script_name",
        "original_texts.language.language_name"
      ]
    }
  }
}
```
Notice that a string query cannot be performed on the fields `tpq` and `taq`, which are typed as integers. In order to query those fields as well, you must add `.keyword` after the field name: it corresponds to the same field, but typed as a string.
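For example, a minimal sketch matching the year 1400 as a string through the `.keyword` sub-field (assuming `tpq` holds such values):

```
GET primary_source/_search
{
  "query": {
    "match": {
      "tpq.keyword": "1400"
    }
  }
}
```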
### Wildcards

More documentation about wildcard queries can be found here.
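As a minimal sketch (reusing the `library` index and `library_name` field from the fuzziness example above): in a `wildcard` query, `*` matches any sequence of characters and `?` matches a single character.

```
GET library/_search
{
  "query": {
    "wildcard": {
      "library_name": "nation*"
    }
  }
}
```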
## Defining the source

If you are not interested in all the metadata (i.e. the content of the fields) associated with the entity you want to query, it is possible to set a list of fields that will appear in the response.

Only the shelfmark and the library name of all primary sources that are manuscripts.
```
GET primary_source/_search
{
  "_source": [
    "shelfmark",
    "library.library_name"
  ],
  "query": {
    "match": {
      "prim_type": "ms"
    }
  }
}
```
Note that if a record in the results does not have some information you asked for (say, the primary source isn't associated with a library and thus doesn't have a `library.library_name`), the result object will not contain the key for that particular field. Instead of looking like this:
"_source" : {
"library" : {
"library_name" : "Vatican Library"
},
"shelfmark" : "Vat. Pal. Lat. 1376"
}
It will look like:

```
"_source" : {
  "shelfmark" : "Vat. Pal. Lat. 1376"
}
```
## Special queries

### Range queries

In the Kibana interface, range queries can be made on fields that contain numbers (even if the field is typed as a string but contains integers), but they only seem to work on integer/float/date typed fields when using ajax.

All the primary sources that have an edition date between 1400 and 1500.
```
GET primary_source/_search
{
  "query": {
    "range": {
      "date": {
        "gte": 1400, // greater than or equal to
        "lte": 1500  // less than or equal to
      }
    }
  }
}
```
If you want to make a range query on a date typed field (fields that end with `_date`), you can use some ElasticSearch tools for date math (those as well seem to cause problems when used with ajax):

All original items that have been created between 1000 years before today and 25 years after 1500.
```
GET original_text/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "tpq_date": {
              "gte": "now-1000y"
            }
          }
        },
        {
          "range": {
            "taq_date": {
              "lte": "1500-01-01||+25y"
            }
          }
        }
      ]
    }
  }
}
```
### Geo-distance queries

The fields named `location` hold information that is treated as a geo point by ElasticSearch: `geo_distance` queries can be executed on them.

All works that have been conceived within 100 km of latitude 48 and longitude 2. NB: in this syntax, longitude comes before latitude.
```
GET work/_search
{
  "query": {
    "geo_distance": {
      "distance": "100km",
      "place.location": [2, 48]
    }
  }
}
```
The same query can be formulated more explicitly with this syntax:

```
GET work/_search
{
  "query": {
    "geo_distance": {
      "distance": "100km",
      "place.location": {
        "lat": 48,
        "lon": 2
      }
    }
  }
}
```
Combining multiple clauses
Every filter you want to combine to build a query can be add with this kind of structure:
Filter 1 and filter 2 must be true at the same time (
AND
)
```
{
  "query": {
    "bool": {
      "must": [
        {
          // filter 1
        },
        {
          // filter 2
        }
      ]
    }
  }
}
```
At least one of the two filters must be true (`OR`):
```
{
  "query": {
    "bool": {
      "should": [
        {
          // filter 1
        },
        {
          // filter 2
        }
      ]
    }
  }
}
```
### Putting it all together

The shelfmarks of all primary sources that contain an original item created near Paris (lat: 48, lon: 2) and associated with the historian whose id is `6`, either directly or through a secondary source.
```
GET original_text/_search
{
  "_source": [
    "primary_source.shelfmark"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "geo_distance": {
            "distance": "100km",
            "place.location": [2, 48]
          }
        },
        {
          "bool": {
            "should": [
              {
                "match": {
                  "historian.id": "6"
                }
              },
              {
                "match": {
                  "secondary_source.historians.id": "6"
                }
              }
            ]
          }
        }
      ]
    }
  }
}
```
## Ajax requests to Elasticsearch

### Cross origin requests

If you are performing ajax calls from your `localhost`, you must allow Elasticsearch to handle cross-origin requests. To do so, the config file (from the `etc` directory, run the command `sudo vim elasticsearch/elasticsearch.yml`) must contain the settings below, and Elasticsearch must be restarted for them to take effect (more on settings here):
```yaml
http.cors.enabled: true
# to specify allowed origins (a value wrapped in "/" is treated as a regular expression)
http.cors.allow-origin: /https?:\/\/localhost(:[0-9]+)?/
http.cors.allow-credentials: true
```
### Simple GET query

The ajax request must look like this (you can use the parameters listed here):
```js
// search the "work" index for the record whose id is 5,
// using the lightweight "q" URI-search parameter
var query = {
  q: 'id:5'
};
var entity = "work";

$.ajax({
  url: `http://localhost:9200/${entity}/_search`,
  xhrFields: {
    withCredentials: true // send credentials on the cross-origin request
  },
  data: query,
  success: function(data) {
    // your code there
  }
});
```
### Query as JSON string

If you want to use a query formatted as JSON (like the ones used in the Kibana interface), you should use this kind of syntax:
```js
var query = '{"query":{"match":{"id":"5"}}}';
var entity = "work";

$.ajax({
  // the JSON body travels in the URL, so it must be URI-encoded
  url: `http://localhost:9200/${entity}/_search?source_content_type=application/json&source=${encodeURIComponent(query)}`,
  xhrFields: {
    withCredentials: true
  },
  success: function(data) {
    // your code there
  }
});
```