Elasticsearch - illyfrancis/scribble GitHub Wiki
Notes
use cases
- searching pieces of pure text
- searching text + structured data (products, user profiles, application logs)
- pure aggregated data (stats, metrics, etc)
- geo search
- distributed JSON document DB (anything) - not really a use case
at a high level
- is a database, like others
- document oriented
- clusters
- built on Lucene
- built on an IR (information retrieval) foundation
- can perform fancy tricks with inverted indexes and automata (???)
basics of ES api
getting data in
storing a document
verb: PUT
index: literature
type: quote
docId: one
document: {json}
curl -XPUT http://localhost:9200/literature/quote/one -d'
{
"person": "Jack Handy",
"said": "The face of a child can say it all, especially the mouth part of the face"
}'
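The pieces listed above (verb, index, type, docId, document) map directly onto the request URL and body. A minimal Python sketch of the same request, assuming a node at localhost:9200 as in the curl call; the helper name `build_index_request` is made up for illustration, and the request is only built here, not sent:

```python
import json
import urllib.request

def build_index_request(index, doc_type, doc_id, document,
                        base_url="http://localhost:9200"):
    """Build (but do not send) the PUT request that stores one document."""
    # URL layout follows the notes: /<index>/<type>/<docId>
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

request = build_index_request(
    "literature", "quote", "one",
    {
        "person": "Jack Handy",
        "said": "The face of a child can say it all, "
                "especially the mouth part of the face",
    },
)
# Sending it needs a running node:
# urllib.request.urlopen(request)
```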
where does the document go
indexes live in the cluster, documents live in indexes
Key points
Documents
- a single arbitrary JSON object
- stored as a text blob + indexes on fields
- every field gets its own inverted index
Types
- defines the schema for documents
- defines indexing rules as well
{
  "human": {
    "properties": {
      "person": { "type": "string" },
      "age": { "type": "integer" }
    }
  }
}
Indexes
- largest building block in ES
- container for documents / types
- composable
Document storage
11:36
Document, routing, Shards
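The notes give no detail here, so this is only a sketch of the general idea: a routing value (by default the document id) is hashed, and the hash modulo the number of primary shards picks the shard. `pick_shard` is a hypothetical helper, and `zlib.crc32` is a stand-in; Elasticsearch uses its own hash function.

```python
import zlib

def pick_shard(routing_value: str, num_primary_shards: int) -> int:
    """Pick a shard as hash(routing value) modulo the primary shard count."""
    # Stand-in hash for illustration only; not the hash Elasticsearch uses.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

# The same routing value always lands on the same shard.
shard = pick_shard("one", 5)
```

Because the result depends on the shard count, changing the number of primary shards would send existing ids to different shards.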
Querying
A simple query
verb: POST
index: literature
type: quote
action: _search
search body: {json}
curl -XPOST http://localhost:9200/literature/quote/_search -d'
{
  "query": {
    "match": {
      "person": "jack"
    }
  }
}'
Natural language search
Everything should run in sublinear time, usually O(log n)
Think of your indexes as Trees
14:40
A SQL LIKE search backed by a B-tree index works well for a prefix pattern such as "The%". It is slow for "%dog%": the B-tree is useless there, so the database does a full table scan.
Indexing every word instead gives an inverted index. The terms themselves sit in a B-tree, so looking up a word is fast again.
Case-insensitive search also performs poorly on a plain index. Fix it by creating an index on the lower-cased column value, then query with lower(col) = lower('search term').
In ES, this kind of preprocessing is called analysis.
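A rough Python illustration of the prefix-versus-substring point, with a sorted list and `bisect` standing in for a B-tree index. The data and function names are made up for the example.

```python
import bisect

titles = sorted(["the cat sat", "the dog barked", "a dog ran", "zebra crossing"])

def prefix_search(sorted_values, prefix):
    """Like LIKE 'prefix%': binary search finds the start in O(log n)."""
    start = bisect.bisect_left(sorted_values, prefix)
    matches = []
    for value in sorted_values[start:]:
        if not value.startswith(prefix):
            break  # sorted order means no later value can match
        matches.append(value)
    return matches

def contains_search(values, fragment):
    """Like LIKE '%fragment%': sort order does not help, every row is checked."""
    return [value for value in values if fragment in value]

prefix_hits = prefix_search(titles, "the")
substring_hits = contains_search(titles, "dog")
```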
Text in, terms out
"Some kind of Text" => ANALYZER => ["text", "of", "kind", "some"]
The analyzer is a function from text to terms. A term is a token.
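Since an analyzer is just a function, a toy one fits in a few lines. This sketch only lower-cases and splits on non-alphanumeric characters; it is not one of the ES analyzers, and it returns terms in source order.

```python
import re

def simple_analyzer(text):
    """Text in, terms out: lower-case the text and split it into word terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

terms = simple_analyzer("Some kind of Text")
# terms == ['some', 'kind', 'of', 'text']
```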
Analysis
"The quick brown fox jumps over the lazy dog" => Snowball Analyzer =>
["quick"2, "brown"3, "fox"4, "jump"5, "over"6, "lazi"8, "dog"9]
Stemming and stopwords
"I jump while she jumps and laughs" =>
["i"1, "jump"2, "while"3, "she"4, "jump"5, "laugh"7]
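The numbers after each term are token positions; a removed stopword leaves a gap (no position 6 above, where "and" was). A toy sketch of that behaviour, assuming a tiny stopword list and a deliberately crude plural-stripping stemmer rather than the real Snowball rules:

```python
import re

STOPWORDS = {"a", "an", "and", "the", "of"}  # tiny illustrative list

def toy_stem(word):
    """Crude suffix stripping; real stemmers such as Snowball use many more rules."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def analyze_with_positions(text):
    """Return (term, position) pairs, keeping original positions after stopword removal."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [
        (toy_stem(token), position)
        for position, token in enumerate(tokens, start=1)
        if token not in STOPWORDS
    ]

result = analyze_with_positions("I jump while she jumps and laughs")
# [('i', 1), ('jump', 2), ('while', 3), ('she', 4), ('jump', 5), ('laugh', 7)]
```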
NGrams
"news" => NGram Analyzer => ["n","e","w","s","ne","ew","ws"]
Where is it useful? User name searches, non-English text, partial matches
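The "news" example corresponds to n-grams of length 1 and 2. A small sketch; the `min_n` and `max_n` parameter names are chosen for this example and are not taken from the ES configuration:

```python
def ngrams(text, min_n=1, max_n=2):
    """Return every substring of length min_n..max_n, shortest lengths first."""
    return [
        text[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(text) - n + 1)
    ]

grams = ngrams("news")
# ['n', 'e', 'w', 's', 'ne', 'ew', 'ws']
```

Indexing these fragments is what makes a partial match such as "ew" a plain term lookup.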
Path hierarchy analyzer
Inverted Index Highlights
- M terms map to N documents
- still uses trees, but by breaking up text, performance is gained
- string broken up into linguistic terms (usually words)
- Postgres users can do this (in a simple form)
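The points above can be sketched with a dictionary from term to document ids. This is a simplification: a plain dict replaces the tree-based term dictionary, and whitespace splitting replaces a real analyzer.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
index = build_inverted_index(docs)
# index["dog"] == [2, 3]; index["quick"] == [1, 3]
```

A search for a word is then a single lookup by term instead of a scan over every document's text.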
List of ES Analysis Tools
24:43
- analyzers - whole bunch
- tokenizers - also whole bunch
Scoring = Relevance
Search methodology
- Find all the docs using a boolean query
- Score all the docs using a similarity algorithm (TF/IDF)
TF/IDF Boosts when ...
- the matched term is "rare" in the corpus
- the term appears frequently in the document
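Both boosts show up in a textbook TF/IDF formula. The sketch below uses tf * log(N / df), which is an assumption for illustration; Lucene's actual similarity formula differs in its details.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Textbook TF/IDF: frequency in the document times rarity in the corpus."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        return 0.0  # term never occurs in the corpus
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [
    ["the", "fox", "saw", "the", "fox"],
    ["the", "dog"],
    ["the", "cat"],
]
fox_score = tf_idf("fox", corpus[0], corpus)  # rare in the corpus, frequent in the doc
the_score = tf_idf("the", corpus[0], corpus)  # appears in every doc, so idf is 0
```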
Query types
- phrase query
- numeric range queries
- more like this queries
- geo
- fast autocompletion
- tons more...
Compose queries with boolean / DisMax queries
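A sketch of what a composed search body might look like, built as a Python dict. It assumes the query DSL's `bool` query with `must` and `should` clauses wrapping `match` and `match_phrase`; the helper `bool_query` and the field values are made up for the example.

```python
import json

def bool_query(must=None, should=None):
    """Wrap sub-queries in a bool query; 'must' is required, 'should' only boosts the score."""
    clauses = {}
    if must:
        clauses["must"] = must
    if should:
        clauses["should"] = should
    return {"query": {"bool": clauses}}

body = bool_query(
    must=[{"match": {"person": "jack"}}],
    should=[{"match_phrase": {"said": "face of a child"}}],
)
# This JSON would be the search body sent to .../_search
print(json.dumps(body, indent=2))
```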
Efficient Aggregate Queries:
like logstash...
An RDBMS vs Elasticsearch
ES is an Information Retrieval (IR) system.
Reasons to consider ES
- speed - traditional databases are often slower for full text search
- relevance
- aggregate stats
- search goodies - fast type-ahead search, did you mean, more like this...
- generic document store - as a second copy
Logstash
uses multi-index queries
Things ES is bad at
- extremely high write environments - not write optimized
- large amounts of document churn - deleting and remerging segments can get expensive
- transactional operations - no!
- primary store - still too new