Elasticsearch - illyfrancis/scribble GitHub Wiki

Notes

http://techbus.safaribooksonline.com/video/programming/9781491910795/-databases-and-datastores/141-2014-10-23?query=((elasticsearch))#snippet

use cases

  • searching pieces of pure text
  • searching text + structured data (products, user profiles, application logs)
  • pure aggregated data (stats, metrics, etc)
  • geo search
  • distributed JSON document DB (anything) - not really a use case

at high level

  • is a database, like any other
  • document oriented
  • clusters
  • built on Lucene
  • built on an IR foundation (information retrieval)
  • can perform fancy tricks with inverted indexes and automata (???)

basics of ES api

getting data in

storing a document

verb: PUT
index: literature
type: quote
docId: one
document: {json}

curl -XPUT http://localhost:9200/literature/quote/one -d'
{
  "person": "Jack Handy",
  "said": "The face of a child can say it all, especially the mouth part of the face"
}'
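
The same request can be assembled from its parts (verb, index, type, docId, document) in Python. This is a minimal sketch using only the standard library; the helper name build_index_request is hypothetical, and the Content-Type header is an assumption (the curl call above omits it, newer Elasticsearch versions expect it). The request is built but not sent, so no running cluster is needed.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that stores one document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},  # assumption, see lead-in
    )


request = build_index_request(
    "http://localhost:9200",
    "literature",
    "quote",
    "one",
    {
        "person": "Jack Handy",
        "said": "The face of a child can say it all, especially the mouth part of the face",
    },
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```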

where does the document go

indexes live in the cluster, documents live in indexes

Key points

Documents

  • a single arbitrary JSON object
  • stored as a text blob + indexes on fields
  • all fields get an inverted index(es)

Types

  • defines the schema for documents
  • defines indexing rules as well
{
  "human": {
    "properties": {
      "person": { "type": "string" },
      "age": { "type": "integer" }
    }
  }
}

Indexes

  • largest building block in ES
  • container for documents / types
  • composable

Document storage

11:36

Document, routing, Shards
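
A rough sketch of how a document could find its shard, assuming the commonly documented rule shard = hash(routing) % number_of_primary_shards, where the routing value defaults to the document id. The function name pick_shard is hypothetical, and zlib.crc32 is only a stand-in for whatever hash Elasticsearch really uses.

```python
import zlib


def pick_shard(routing_key, number_of_primary_shards):
    """Map a routing key (by default the document id) to a shard number."""
    # crc32 is a stand-in hash chosen because it is stable across runs;
    # it is NOT the hash function Elasticsearch uses internally.
    digest = zlib.crc32(routing_key.encode("utf-8"))
    return digest % number_of_primary_shards


for doc_id in ["one", "two", "three"]:
    print(doc_id, "-> shard", pick_shard(doc_id, 5))
```

The modulo is why the number of primary shards is normally fixed when an index is created: changing it would send existing ids to different shards.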

Querying

A simple query

verb: POST
index: literature
type: quote
action: _search
search body: {json}

curl -XPOST http://localhost:9200/literature/quote/_search -d'
{
  "query": {
    "match": {
      "person": "jack"
    }
  }
}'

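Why does the lower-case query "jack" find a document whose person field is "Jack Handy"? A toy sketch of the idea, under the assumption that both the field and the query text go through the same lower-casing word splitter. The names analyze and match are hypothetical, and a real match query also scores its hits, which this ignores.

```python
import re


def analyze(text):
    """Lower-case the text and split it into word terms."""
    return re.findall(r"\w+", text.lower())


def match(documents, field, query_text):
    """Return documents sharing at least one analyzed term with the query."""
    query_terms = set(analyze(query_text))
    return [
        doc for doc in documents
        if query_terms & set(analyze(doc.get(field, "")))
    ]


documents = [
    {"person": "Jack Handy", "said": "The face of a child can say it all"},
    {"person": "Someone Else", "said": "Nothing about faces"},
]
hits = match(documents, "person", "jack")
print([doc["person"] for doc in hits])  # ['Jack Handy']
```
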
Natural language search

Everything should run in sublinear time, usually O(log n)

Think of your indexes as Trees

14:40

SQL search over a B-tree index works well for a prefix pattern such as "The%". It is slow for "%dog%": the B-tree is useless there, so the query falls back to a full table scan.

In contrast, creating an index entry for every word yields an inverted index. It is still built on a B-tree, but each word lookup now works well.
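
The prefix versus substring difference can be imitated with a sorted list standing in for the B-tree. This is only an analogy sketch; the function names and sample titles are made up for illustration.

```python
import bisect

titles = sorted(["A Dog's Life", "Hot dog stand", "The Cat", "The Dog", "Zebra"])


def prefix_search(sorted_values, prefix):
    """Like 'The%': two binary searches find the matching slice, O(log n)."""
    lo = bisect.bisect_left(sorted_values, prefix)
    hi = bisect.bisect_left(sorted_values, prefix + "\uffff")
    return sorted_values[lo:hi]


def substring_search(values, needle):
    """Like '%dog%': sort order does not help, so every value is inspected."""
    return [value for value in values if needle in value]


print(prefix_search(titles, "The"))     # ['The Cat', 'The Dog']
print(substring_search(titles, "dog"))  # ['Hot dog stand']
```

Note that the substring scan also misses "A Dog's Life" because the comparison is case sensitive.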

Case-insensitive search also performs poorly by default. Fix it by creating an index on the lower-cased column value, then query with SQL like lower(col) = lower('search term').
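
A small sketch of that lower-case trick using SQLite through Python's sqlite3 module. The table, column, and index names are invented for the example, and it assumes an SQLite build that supports indexes on expressions (3.9 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quotes (person TEXT)")
conn.executemany(
    "INSERT INTO quotes VALUES (?)",
    [("Jack Handy",), ("JACK HANDY",), ("Someone Else",)],
)
# Index the normalized (lower-cased) value rather than the raw column.
conn.execute("CREATE INDEX idx_person_lower ON quotes (lower(person))")

rows = conn.execute(
    "SELECT person FROM quotes WHERE lower(person) = lower(?) ORDER BY person",
    ("jack handy",),
).fetchall()
print([row[0] for row in rows])  # ['JACK HANDY', 'Jack Handy']
```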

This kind of preprocessing is called analysis in ES.

Text in, terms out

  "Some kind of Text" => ANALYZER => ["text", "of", "kind", "some"]

ANALYZER is a function. Each term it emits is a token.
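
"Text in, terms out" taken literally: a minimal sketch of an analyzer as a plain function. The name simple_analyzer is hypothetical, and lower-casing plus splitting on word characters is an assumed pipeline, not a specific built-in analyzer.

```python
import re


def simple_analyzer(text):
    """Split on word characters and lower-case every token."""
    return [token.lower() for token in re.findall(r"\w+", text)]


print(simple_analyzer("Some kind of Text"))  # ['some', 'kind', 'of', 'text']
```

The terms are the same as in the example above; only the listing order differs.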

Analysis

  "The quick brown fox jumps over the lazy dog" => Snowball Analyzer => 
    ["quick"2, "brown"3, "fox"4, "jump"5, "over"6, "lazi"8, "dog"9]

Stemming and stopwords

  "I jump while she jumps and laughs" => 
    ["i"1, "jump"2, "while"3, "she"4, "jump"5, "laugh"7]
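
A toy sketch of stopword removal and stemming that keeps token positions. The two-word stopword list and the two suffix rules are assumptions picked to reproduce the example above; this is not the real Snowball stemmer.

```python
import re

STOPWORDS = {"the", "and"}  # deliberately tiny, illustrative list


def toy_stem(word):
    """Very rough stemming: drop a plural 's', turn a trailing 'y' into 'i'."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    if word.endswith("y") and len(word) > 2 and word[-2] not in "aeiou":
        return word[:-1] + "i"
    return word


def toy_analyze(text):
    """Return (term, position) pairs; removed stopwords leave a position gap."""
    terms = []
    for position, token in enumerate(re.findall(r"\w+", text.lower()), start=1):
        if token in STOPWORDS:
            continue
        terms.append((toy_stem(token), position))
    return terms


print(toy_analyze("I jump while she jumps and laughs"))
# [('i', 1), ('jump', 2), ('while', 3), ('she', 4), ('jump', 5), ('laugh', 7)]
```

Position 6 is missing because "and" was dropped; keeping the gap is what lets phrase queries still reason about word distance.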

NGrams

  "news" => NGram Analyzer => ["n","e","w","s","ne","ew","ws"]

Where is it useful? User name searches, non-English text, partial matches.
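
A minimal sketch of n-gram generation. The function name is hypothetical, and the 1 to 2 character range is chosen only because it reproduces the "news" example above.

```python
def ngrams(text, min_gram=1, max_gram=2):
    """Return every substring of length min_gram..max_gram, shortest first."""
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


print(ngrams("news"))  # ['n', 'e', 'w', 's', 'ne', 'ew', 'ws']
```

Because "ew" is indexed as its own term, a partial query like "ew" can hit "news" with an ordinary term lookup instead of a scan.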

Path hierarchy analyzer
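
The notes give no example here, so this is a sketch of what a path hierarchy analysis is generally understood to produce: one token per ancestor path. The function name is hypothetical, and it assumes an absolute path that starts with the delimiter.

```python
def path_hierarchy(path, delimiter="/"):
    """Emit one token for each level of an absolute path."""
    parts = [part for part in path.split(delimiter) if part]
    tokens = []
    for depth in range(1, len(parts) + 1):
        tokens.append(delimiter + delimiter.join(parts[:depth]))
    return tokens


print(path_hierarchy("/usr/local/bin"))
# ['/usr', '/usr/local', '/usr/local/bin']
```

A search for "/usr/local" then matches every document stored anywhere below that directory.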

Inverted Index Highlights

  • M terms map to N documents
  • still uses trees, but by breaking up text, performance is gained
  • string broken up into linguistic terms (usually words)
  • postgres users can do this (in a simple form)
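
A bare-bones sketch of an inverted index as a mapping from term to the set of document ids that contain it. The sample documents and the function name are made up; a Python dict stands in for the tree structure mentioned above.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lower-cased word to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index


docs = {1: "The quick brown fox", 2: "The lazy dog", 3: "Quick dog tricks"}
index = build_inverted_index(docs)

print(sorted(index["quick"]))                 # [1, 3]
print(sorted(index["quick"] & index["dog"]))  # [3]  (quick AND dog)
```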

List of ES Analysis Tools

24:43

  • analyzers - whole bunch
  • tokenizers - also whole bunch

Scoring = Relevance

Search methodology

  • Find all the docs using a boolean query
  • Score all the docs using a similarity algorithm (TF/IDF)

TF/IDF Boosts when ...

  • the matched term is "rare" in the corpus
  • the term appears frequently in the document
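
Both boosts show up in the textbook form of TF/IDF, sketched here. This is the classroom formula tf * log(N / df), not Lucene's exact similarity, and the tiny corpus is invented.

```python
import math


def tf_idf(term, doc_terms, corpus):
    """Textbook score: term frequency times log inverse document frequency."""
    tf = doc_terms.count(term)
    df = sum(1 for other in corpus if term in other)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


corpus = [
    ["the", "dog", "barks"],
    ["the", "cat"],
    ["the", "dog", "and", "the", "dog"],
]

print(tf_idf("the", corpus[0], corpus))  # 0.0, the term is in every document
print(tf_idf("cat", corpus[1], corpus) > tf_idf("dog", corpus[0], corpus))  # True, rarer term
print(tf_idf("dog", corpus[2], corpus) > tf_idf("dog", corpus[0], corpus))  # True, more frequent
```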

Query types

  • phrase query
  • numeric range queries
  • more like this queries
  • geo
  • fast autocompletion
  • tons more...

Compose queries with boolean / DisMax queries
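
How the two ways of composing differ when scoring, as a simplified sketch: a boolean "should" adds up the scores of its matching clauses, while DisMax keeps the best clause and adds only a fraction of the rest. Function names are hypothetical, and refinements such as coordination factors are ignored.

```python
def bool_should_score(clause_scores):
    """Boolean composition (simplified): sum the matching clause scores."""
    return sum(clause_scores)


def dis_max_score(clause_scores, tie_breaker=0.0):
    """DisMax: best clause score plus tie_breaker times the other scores."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)


scores = [0.5, 2.0, 1.0]  # e.g. the same query text against three fields
print(bool_should_score(scores))                # 3.5
print(dis_max_score(scores))                    # 2.0
print(dis_max_score(scores, tie_breaker=0.5))   # 2.75
```

DisMax is the usual choice when the clauses search the same words in different fields, so one strong field match is not outvoted by several weak ones.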

Efficient Aggregate Queries:

like logstash...
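
What an aggregate query computes, as a plain-Python sketch over a handful of made-up log events. The function names echo the terms and stats aggregations in spirit only; this does not reproduce the Elasticsearch request or response format.

```python
from collections import Counter

log_events = [
    {"status": 200, "bytes": 512},
    {"status": 404, "bytes": 128},
    {"status": 200, "bytes": 1024},
    {"status": 500, "bytes": 64},
    {"status": 200, "bytes": 256},
]


def terms_aggregation(docs, field):
    """Count documents per distinct value of a field."""
    return Counter(doc[field] for doc in docs if field in doc)


def stats_aggregation(docs, field):
    """Basic numeric statistics over one field."""
    values = [doc[field] for doc in docs if field in doc]
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "avg": sum(values) / len(values),
    }


print(terms_aggregation(log_events, "status").most_common(1))  # [(200, 3)]
print(stats_aggregation(log_events, "bytes")["sum"])           # 1984
```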

An RDBMS vs Elasticsearch

ES is an Information Retrieval (IR) system.

Reasons to consider ES

  1. speed - traditional databases often are slower for full text search
  2. relevance
  3. aggregate stats
  4. search goodies - fast type-ahead search, did you mean, more like this...
  5. generic document store - as a second copy

Logstash

uses multi index query
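
A sketch of what a multi index query looks like for time-based log data, assuming one index per day named logstash-YYYY.MM.DD and a search URL that accepts a comma-separated index list. The helper names and the dates are illustrative; nothing is sent over the network.

```python
from datetime import date, timedelta


def logstash_indices(start, end, prefix="logstash-"):
    """List the daily index names covering start..end inclusive."""
    day_count = (end - start).days
    return [
        prefix + (start + timedelta(days=offset)).strftime("%Y.%m.%d")
        for offset in range(day_count + 1)
    ]


def multi_index_search_url(base_url, indices):
    """Build a search URL that targets several indexes at once."""
    return f"{base_url}/{','.join(indices)}/_search"


indices = logstash_indices(date(2014, 10, 22), date(2014, 10, 23))
print(multi_index_search_url("http://localhost:9200", indices))
# http://localhost:9200/logstash-2014.10.22,logstash-2014.10.23/_search
```

Querying only the days that matter keeps searches cheap, and old days can be removed by deleting whole indexes.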

Things ES is bad at

  • extremely high write environments - not write optimized
  • large amounts of document churn - deleting and remerging segments can get expensive
  • transactional operations - not supported
  • primary store - still too new

http://found.no