Elasticsearch

Notes

http://techbus.safaribooksonline.com/video/programming/9781491910795/-databases-and-datastores/141-2014-10-23?query=((elasticsearch))#snippet

use cases

searching pieces of pure text
searching text + structured data (products, user profiles, application logs)
pure aggregated data (stats, metrics, etc)
geo search
distributed JSON document DB (anything) - not really a use case

at high level

is a database, like others
document oriented
clusters
built on Lucene
built on an IR (information retrieval) foundation
can perform fancy tricks with inverted indexes and automata (???)

basics of ES api

getting data in

storing a document

verb: PUT
index: literature
type: quote
docId: one
document: {json}

curl -XPUT http://localhost:9200/literature/quote/one -d'
{
  "person": "Jack Handy",
  "said": "The face of a child can say it all, especially the mouth part of the face"
}'
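
The same request can be sketched in Python with only the standard library. The helper name build_index_request is hypothetical, and the host and port are simply the ones from the curl command above. The request is only built here; nothing is sent unless urlopen is called against a running node.

```python
import json
import urllib.request


def build_index_request(index, doc_type, doc_id, document,
                        host="http://localhost:9200"):
    """Build (but do not send) a PUT request that stores one document."""
    url = f"{host}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        # Newer Elasticsearch versions require an explicit content type.
        headers={"Content-Type": "application/json"},
    )


req = build_index_request(
    "literature", "quote", "one",
    {"person": "Jack Handy",
     "said": "The face of a child can say it all, "
             "especially the mouth part of the face"},
)
# To actually send it to a running node:
#   urllib.request.urlopen(req)
```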

where does the document go

indexes live in the cluster, documents live in indexes

Key points

Documents

a single arbitrary JSON object
stored as a text blob + indexes on fields
every field gets its own inverted index

Types

defines the schema for documents
defines indexing rules as well

{
  "human" : {
    "properties" : {
      "person" : { "type": "string" },
      "age" : { "type" : "integer" } }}}

Indexes

largest building block in ES
container for documents / types
composable

Document storage

11:36

Document, routing, Shards
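
Routing decides which shard holds a document. A minimal sketch, assuming the usual scheme of hashing the routing value (the document id by default) modulo the number of primary shards. zlib.crc32 is only a stand-in for the hash Elasticsearch really uses, and pick_shard is a hypothetical name.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value (by default the doc id) to a shard number."""
    # crc32 is only a stand-in for the real hash function.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards


shard = pick_shard("one", 5)
```

Because the shard count is part of the formula, the same id would land on a different shard if the number of primary shards changed.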

Querying

A simple query

verb: POST
index: literature
type: quote
action: _search
search body: {json}

curl -XPOST http://localhost:9200/literature/quote/_search -d'
{ 
  "query": {
    "match": {
      "person": "jack" }}}'

Natural language search

Everything should run in sublinear time, usually O(log n)

Think of your indexes as Trees

14:40

SQL search with a BTree index - works well for "The%". Slow for "%dog%" - the BTree is useless there, so it becomes a full table scan.

Indexing every word separately instead gives an inverted index, which still builds a btree over the words and works well.
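
The prefix versus infix difference can be sketched with a sorted list standing in for the BTree (an assumption made for brevity; a real database index is more involved). The function names and sample titles are made up for illustration.

```python
import bisect

titles = sorted(["a dog's life", "the cat", "the dog", "the end", "zebra"])


def prefix_search(sorted_keys, prefix):
    """LIKE 'prefix%': two binary searches bound the matching range."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    # "\uffff" sorts after ordinary characters, closing the range.
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]


def infix_search(keys, needle):
    """LIKE '%needle%': the ordering does not help, so scan every key."""
    return [k for k in keys if needle in k]


the_titles = prefix_search(titles, "the")
dog_titles = infix_search(titles, "dog")
```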

Case-insensitive search also performs poorly on a plain index. Fix it by creating an index on the lower-cased column value, then query with lower(col) = lower('search term').
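
A small sketch of the lower-case index trick, using SQLite through Python's sqlite3 module as a stand-in for whichever SQL database is in use (it assumes a SQLite build with expression-index support). The table and index names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quotes (person TEXT)")
conn.executemany("INSERT INTO quotes VALUES (?)",
                 [("Jack Handy",), ("JACK SPARROW",), ("Jill",)])
# Index the lower-cased value so the comparison below can use the index.
conn.execute("CREATE INDEX quotes_person_lower ON quotes (lower(person))")

rows = conn.execute(
    "SELECT person FROM quotes WHERE lower(person) = lower(?)",
    ("JACK HANDY",),
).fetchall()
```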

This kind of preprocessing is called analysis in ES.

Text in, terms out

  "Some kind of Text" => ANALYZER => ["text", "of", "kind", "some"]

The ANALYZER is a function. A term is a token.
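
As a rough illustration of "analyzer as a function", here is a toy analyzer that lower-cases and splits on non-word characters. It is not one of the built-in ES analyzers, and it returns terms in input order.

```python
import re


def simple_analyzer(text):
    """Text in, terms out: lower-case and split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


terms = simple_analyzer("Some kind of Text")
```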

Analysis

  "The quick brown fox jumps over the lazy dog" => Snowball Analyzer => 
    ["quick"2, "brown"3, "fox"4, "jump"5, "over"6, "lazi"8, "dog"9]

Stemming and stopwords

  "I jump while she jumps and laughs" => 
    ["i"1, "jump"2, "while"3, "she"4, "jump"5, "laugh"7]
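
The term/position pairs above can be reproduced with a toy analyzer. This is only a sketch: the stopword list and the strip-a-trailing-s stemmer are deliberately naive assumptions, nowhere near what the Snowball stemmer does.

```python
STOPWORDS = {"the", "and"}  # toy list, chosen for this example only


def toy_stem(word):
    """Naive stemmer: strip a trailing 's' (real stemmers do far more)."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def analyze(text):
    """Return (term, position) pairs; dropped stopwords leave a gap."""
    out = []
    for position, word in enumerate(text.lower().split(), start=1):
        if word in STOPWORDS:
            continue
        out.append((toy_stem(word), position))
    return out


pairs = analyze("I jump while she jumps and laughs")
```

Keeping the original positions, gaps included, is what lets phrase queries still work after stopwords are removed.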

NGrams

  "news" => NGram Analyzer => ["n","e","w","s","ne","ew","ws"]

Where is it useful? User name searches, non-English text, partial matches
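
A minimal sketch of n-gram generation, assuming gram lengths from 1 to 2 as in the "news" example. The function name and parameters are illustrative, not the ES settings themselves.

```python
def ngrams(text, min_n=1, max_n=2):
    """Return all substrings of length min_n..max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


news_grams = ngrams("news")
```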

Path hierarchy analyzer

Inverted Index Highlights

M terms map to N documents
still uses trees, but by breaking up text, performance is gained
string broken up into linguistic terms (usually words)
postgres users can do this (in a simple form)
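
The "M terms map to N documents" idea can be sketched with a dictionary from term to the set of document ids containing it. This is a toy in-memory structure with made-up documents, not how Lucene stores its index.

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick dog",
}


def build_inverted_index(documents):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


index = build_inverted_index(docs)
# Documents containing both terms: intersect the two posting sets.
hits = sorted(index["quick"] & index["dog"])
```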

List of ES Analysis Tools

24:43

analyzers - whole bunch
tokenizers - also whole bunch

Scoring = Relevance

Search methodology

Find all the docs using a boolean query
Score all the docs using a similarity algorithm (TF/IDF)

TF/IDF Boosts when ...

the matched term is "rare" in the corpus
the term appears frequently in the document
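
Both boosts can be seen in a textbook TF/IDF sketch. The formula below (raw term count times log(N / document frequency)) is an assumed simplification, not Lucene's exact scoring, and the corpus is invented.

```python
import math

corpus = {
    "a": ["fox", "fox", "dog"],
    "b": ["dog", "dog", "cat"],
    "c": ["dog", "bird"],
}


def tf_idf(term, doc_id, documents):
    """Score one term in one document: frequent here, rare elsewhere."""
    tf = documents[doc_id].count(term)
    df = sum(1 for terms in documents.values() if term in terms)
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(len(documents) / df)
    return tf * idf


fox_score = tf_idf("fox", "a", corpus)  # rare in the corpus, twice in "a"
dog_score = tf_idf("dog", "a", corpus)  # appears in every document
```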

Query types

phrase query
numeric range queries
more like this queries
geo
fast autocompletion
tons more...

Compose queries with boolean / DisMax queries
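
How the two ways of composing differ in scoring can be sketched as follows. This is a simplified model that ignores normalization factors: boolean "should" clauses add their scores, while DisMax takes the best clause and adds the rest only through a tie breaker. The function names are hypothetical.

```python
def bool_should_score(clause_scores):
    """Boolean 'should' style: scores of matching clauses add up."""
    return sum(clause_scores)


def dis_max_score(clause_scores, tie_breaker=0.0):
    """DisMax style: best clause wins; others count via tie_breaker."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)


scores = [1.0, 0.5]  # e.g. a title match and a body match
combined_bool = bool_should_score(scores)
combined_dismax = dis_max_score(scores, tie_breaker=0.5)
```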

Efficient Aggregate Queries:

like logstash...

An RDBMS vs Elasticsearch

ES is an Information Retrieval (IR) system.

Reasons to consider ES

speed - traditional databases are often slower for full-text search
relevance
aggregate stats
search goodies - fast type-ahead search, did you mean, more like this...
generic document store - as a second copy

Logstash

uses multi index query
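
A sketch of what a multi-index query path might look like, assuming one index per day named logstash-YYYY.MM.DD and comma-separated index names in the URL. The helper name and the dates are made up for the example.

```python
from datetime import date, timedelta


def logstash_indexes(start, end, prefix="logstash-"):
    """List the assumed daily index names covering start..end inclusive."""
    days = (end - start).days
    return [prefix + (start + timedelta(days=i)).strftime("%Y.%m.%d")
            for i in range(days + 1)]


names = logstash_indexes(date(2014, 10, 22), date(2014, 10, 23))
# Several indexes can be searched at once by joining names with commas.
search_path = "/" + ",".join(names) + "/_search"
```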

Things ES is bad at

extremely high write environments - not write optimized
large amounts of document churn - deleting and remerging segments can get expensive
transactional operations - no!
primary store - still too new

http://found.no