ElasticSearch - raghusumanth/elk-repo GitHub Wiki

https://www.elastic.co/guide/en/elasticsearch/reference/current/elasticsearch-intro.html

What is ElasticSearch?

Elasticsearch is the distributed search and analytics engine at the heart of the Elastic Stack. Logstash and Beats facilitate collecting, aggregating, and enriching your data and storing it in Elasticsearch. Kibana enables you to interactively explore, visualize, and share insights into your data and manage and monitor the stack. Elasticsearch is where the indexing, search, and analysis magic happens.

Elasticsearch provides near real-time search and analytics for all types of data. Whether you have structured or unstructured text, numerical data, or geospatial data, Elasticsearch can efficiently store and index it in a way that supports fast searches. You can go far beyond simple data retrieval and aggregate information to discover trends and patterns in your data. And as your data and query volume grows, the distributed nature of Elasticsearch enables your deployment to grow seamlessly right along with it.

While not every problem is a search problem, Elasticsearch offers speed and flexibility to handle data in a wide variety of use cases:

- Add a search box to an app or website
- Store and analyze logs, metrics, and security event data
- Use machine learning to automatically model the behavior of your data in real time
- Automate business workflows using Elasticsearch as a storage engine
- Manage, integrate, and analyze spatial information using Elasticsearch as a geographic information system (GIS)
- Store and process genetic data using Elasticsearch as a bioinformatics research tool

We’re continually amazed by the novel ways people use search. But whether your use case is similar to one of these, or you’re using Elasticsearch to tackle a new problem, the way you work with your data, documents, and indices in Elasticsearch is the same.

Data in: Documents and Indices:

Elasticsearch is a distributed document store. Instead of storing information as rows of columnar data, Elasticsearch stores complex data structures that have been serialized as JSON documents. When you have multiple Elasticsearch nodes in a cluster, stored documents are distributed across the cluster and can be accessed immediately from any node.

When a document is stored, it is indexed and fully searchable in near real-time--within 1 second. Elasticsearch uses a data structure called an inverted index that supports very fast full-text searches. An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.
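The inverted index described above can be illustrated with a toy version in plain Python. This is only a sketch of the idea, not how Lucene implements it: the `tokenize` and `build_inverted_index` helpers and the sample documents are hypothetical, and real analysis does far more than lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (a stand-in for real analysis)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each unique term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick red fox",
}

index = build_inverted_index(docs)

# Looking up a term is a single dictionary access, not a scan of every document.
print(sorted(index["brown"]))
# Multi-term searches intersect the per-term document sets.
print(sorted(index["quick"] & index["fox"]))
```

Because each term points directly at its documents, lookup cost depends on the number of matching terms rather than on the total number of documents.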

An index can be thought of as an optimized collection of documents and each document is a collection of fields, which are the key-value pairs that contain your data. By default, Elasticsearch indexes all data in every field and each indexed field has a dedicated, optimized data structure. For example, text fields are stored in inverted indices, and numeric and geo fields are stored in BKD trees. The ability to use the per-field data structures to assemble and return search results is what makes Elasticsearch so fast.

Elasticsearch also has the ability to be schema-less, which means that documents can be indexed without explicitly specifying how to handle each of the different fields that might occur in a document. When dynamic mapping is enabled, Elasticsearch automatically detects and adds new fields to the index. This default behavior makes it easy to index and explore your data—just start indexing documents and Elasticsearch will detect and map booleans, floating point and integer values, dates, and strings to the appropriate Elasticsearch data types.
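As a rough illustration of dynamic mapping, the following sketch guesses a field type from a JSON value in the way the paragraph above describes. It is a simplified model under stated assumptions: `guess_field_type` is a hypothetical helper, the date check uses Python's ISO parser as a crude stand-in for Elasticsearch's date detection, and in Elasticsearch itself a non-date string is mapped as a text field with a keyword sub-field.

```python
from datetime import datetime


def guess_field_type(value):
    """Return the field type a JSON value would roughly be mapped to."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # crude stand-in for date detection
            return "date"
        except ValueError:
            return "text"  # Elasticsearch also adds a keyword sub-field
    return "unknown"


# Hypothetical document used only for illustration.
document = {
    "name": "John Doe",
    "age": 40,
    "balance": 1234.5,
    "active": True,
    "hired": "2015-01-01",
}

mapping = {field: guess_field_type(value) for field, value in document.items()}
print(mapping)
```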

Ultimately, however, you know more about your data and how you want to use it than Elasticsearch can. You can define rules to control dynamic mapping and explicitly define mappings to take full control of how fields are stored and indexed.

Defining your own mappings enables you to:

- Distinguish between full-text string fields and exact value string fields
- Perform language-specific text analysis
- Optimize fields for partial matching
- Use custom date formats
- Use data types such as geo_point and geo_shape that cannot be automatically detected

It’s often useful to index the same field in different ways for different purposes. For example, you might want to index a string field as both a text field for full-text search and as a keyword field for sorting or aggregating your data. Or, you might choose to use more than one language analyzer to process the contents of a string field that contains user input.
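A minimal sketch of an explicit mapping that indexes one string field two ways might look like the following. The field names `city` and `location` and the sub-field name `raw` are hypothetical; the body is built as a Python dictionary and printed as JSON so it could be sent as the body of a create-index request.

```python
import json

# Hypothetical fields: "city" is indexed as text for full-text search and,
# through the "raw" sub-field, as a keyword for sorting and aggregations.
index_body = {
    "mappings": {
        "properties": {
            "city": {
                "type": "text",
                "fields": {
                    "raw": {"type": "keyword"},
                },
            },
            # geo_point cannot be detected automatically, so it is declared here.
            "location": {"type": "geo_point"},
        }
    }
}

print(json.dumps(index_body, indent=2))
```

With a mapping like this, full-text queries would target `city`, while sorting and aggregations would target `city.raw`.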

The analysis chain that is applied to a full-text field during indexing is also used at search time. When you query a full-text field, the query text undergoes the same analysis before the terms are looked up in the index.

Information out: Search and Analyze:

While you can use Elasticsearch as a document store and retrieve documents and their metadata, the real power comes from being able to easily access the full suite of search capabilities built on the Apache Lucene search engine library.

Elasticsearch provides a simple, coherent REST API for managing your cluster and indexing and searching your data. For testing purposes, you can easily submit requests directly from the command line or through the Developer Console in Kibana. From your applications, you can use the Elasticsearch client for your language of choice: Java, JavaScript, Go, .NET, PHP, Perl, Python or Ruby.

Searching your data

The Elasticsearch REST APIs support structured queries, full text queries, and complex queries that combine the two. Structured queries are similar to the types of queries you can construct in SQL. For example, you could search the gender and age fields in your employee index and sort the matches by the hire_date field. Full-text queries find all documents that match the query string and return them sorted by relevance, that is, by how good a match they are for your search terms.

In addition to searching for individual terms, you can perform phrase searches, similarity searches, and prefix searches, and get autocomplete suggestions.

Have geospatial or other numerical data that you want to search? Elasticsearch indexes non-textual data in optimized data structures that support high-performance geo and numerical queries.

You can access all of these search capabilities using Elasticsearch’s comprehensive JSON-style query language (Query DSL). You can also construct SQL-style queries to search and aggregate data natively inside Elasticsearch, and JDBC and ODBC drivers enable a broad range of third-party applications to interact with Elasticsearch via SQL.

Analyzing your data

Elasticsearch aggregations enable you to build complex summaries of your data and gain insight into key metrics, patterns, and trends. Instead of just finding the proverbial “needle in a haystack”, aggregations enable you to answer questions like:

- How many needles are in the haystack?
- What is the average length of the needles?
- What is the median length of the needles, broken down by manufacturer?
- How many needles were added to the haystack in each of the last six months?

You can also use aggregations to answer more subtle questions, such as:

- What are your most popular needle manufacturers?
- Are there any unusual or anomalous clumps of needles?

Because aggregations leverage the same data structures used for search, they are also very fast. This enables you to analyze and visualize your data in real time. Your reports and dashboards update as your data changes so you can take action based on the latest information.

What’s more, aggregations operate alongside search requests. You can search documents, filter results, and perform analytics at the same time, on the same data, in a single request. And because aggregations are calculated in the context of a particular search, you’re not just displaying a count of all size 70 needles, you’re displaying a count of the size 70 needles that match your users' search criteria—for example, all size 70 non-stick embroidery needles.

But wait, there’s more

Want to automate the analysis of your time series data? You can use machine learning features to create accurate baselines of normal behavior in your data and identify anomalous patterns. With machine learning, you can detect:

- Anomalies related to temporal deviations in values, counts, or frequencies
- Statistical rarity
- Unusual behaviors for a member of a population

And the best part? You can do this without having to specify algorithms, models, or other data science-related configurations.

Scalability and resilience: Clusters, Nodes and Shards:

Elasticsearch is built to be always available and to scale with your needs. It does this by being distributed by nature. You can add servers (nodes) to a cluster to increase capacity and Elasticsearch automatically distributes your data and query load across all of the available nodes. No need to overhaul your application, Elasticsearch knows how to balance multi-node clusters to provide scale and high availability. The more nodes, the merrier.

How does this work? Under the covers, an Elasticsearch index is really just a logical grouping of one or more physical shards, where each shard is actually a self-contained index. By distributing the documents in an index across multiple shards, and distributing those shards across multiple nodes, Elasticsearch can ensure redundancy, which both protects against hardware failures and increases query capacity as nodes are added to a cluster. As the cluster grows (or shrinks), Elasticsearch automatically migrates shards to rebalance the cluster.

There are two types of shards: primaries and replicas. Each document in an index belongs to one primary shard. A replica shard is a copy of a primary shard. Replicas provide redundant copies of your data to protect against hardware failure and increase capacity to serve read requests like searching or retrieving a document.

The number of primary shards in an index is fixed at the time that an index is created, but the number of replica shards can be changed at any time, without interrupting indexing or query operations.
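One way to see why the primary shard count is fixed at creation: in simplified form, a document is placed on a shard by hashing a routing value (by default its ID) and taking the result modulo the number of primary shards. The sketch below models that idea only. `pick_shard` is a hypothetical helper, and `zlib.crc32` is an assumption standing in for the hash function Elasticsearch actually uses.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Pick a shard number for a routing value (simplified model)."""
    # zlib.crc32 is a stand-in hash chosen for this sketch, not the real one.
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % number_of_primary_shards


# Hypothetical document IDs.
doc_ids = [str(i) for i in range(100)]

placement_5 = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}
placement_6 = {doc_id: pick_shard(doc_id, 6) for doc_id in doc_ids}

# Changing the shard count changes the modulus, so many documents would no
# longer be found on the shard where they were originally stored.
moved = [d for d in doc_ids if placement_5[d] != placement_6[d]]
print(f"{len(moved)} of {len(doc_ids)} documents would map to a different shard")
```

Replicas are copies of whole primary shards, so adding or removing them does not affect this calculation.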

It depends…

There are a number of performance considerations and tradeoffs with respect to shard size and the number of primary shards configured for an index. The more shards, the more overhead there is simply in maintaining those indices. The larger the shard size, the longer it takes to move shards around when Elasticsearch needs to rebalance a cluster.

Querying lots of small shards makes the processing per shard faster, but more queries means more overhead, so querying a smaller number of larger shards might be faster. In short…it depends.

As a starting point:

- Aim to keep the average shard size between a few GB and a few tens of GB. For use cases with time-based data, it is common to see shards in the 20GB to 40GB range.
- Avoid the gazillion shards problem. The number of shards a node can hold is proportional to the available heap space. As a general rule, the number of shards per GB of heap space should be less than 20.

The best way to determine the optimal configuration for your use case is through testing with your own data and queries.
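The two rules of thumb above reduce to simple arithmetic. The helpers below are a hypothetical sketch: the 30 GB target shard size is an assumption picked from within the 20GB to 40GB range, and the results are starting points for testing rather than recommendations.

```python
import math


def shard_ceiling_for_heap(heap_gb, shards_per_gb=20):
    """Upper bound on shards for a node; the guideline is to stay below it."""
    return heap_gb * shards_per_gb


def primary_shards_for(data_gb, target_shard_gb=30):
    """Number of primary shards needed to keep shards near the target size."""
    return max(1, math.ceil(data_gb / target_shard_gb))


# A node with 30 GB of heap should hold fewer than this many shards.
print(shard_ceiling_for_heap(30))
# An index expected to reach 600 GB, split into shards of about 30 GB each.
print(primary_shards_for(600))
```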

In case of disaster

For performance reasons, the nodes within a cluster need to be on the same network. Balancing shards in a cluster across nodes in different data centers simply takes too long. But high-availability architectures demand that you avoid putting all of your eggs in one basket. In the event of a major outage in one location, servers in another location need to be able to take over. Seamlessly. The answer? Cross-cluster replication (CCR).

CCR provides a way to automatically synchronize indices from your primary cluster to a secondary remote cluster that can serve as a hot backup. If the primary cluster fails, the secondary cluster can take over. You can also use CCR to create secondary clusters to serve read requests in geo-proximity to your users.

Cross-cluster replication is active-passive. The index on the primary cluster is the active leader index and handles all write requests. Indices replicated to secondary clusters are read-only followers.

Care and feeding

As with any enterprise system, you need tools to secure, manage, and monitor your Elasticsearch clusters. Security, monitoring, and administrative features that are integrated into Elasticsearch enable you to use Kibana as a control center for managing a cluster. Features like data rollups and index lifecycle management help you intelligently manage your data over time.

EQL: Event Query Language:

EQL (Event Query Language) is a declarative language dedicated to identifying patterns and relationships between events.

Consider using EQL if you:

- Use Elasticsearch for threat hunting or other security use cases
- Search time series data or logs, such as network or system logs
- Want an easy way to explore relationships between events

A good intro on EQL and its purpose is available in this blog post. See the EQL in Elasticsearch documentation for an in-depth explanation, and also the language reference.

This release includes the following features:

- Event queries
- Sequences
- Pipes

Data Streams:

A data stream is a convenient, scalable way to ingest, search, and manage continuously generated time series data. Data streams provide a simpler way to split data across multiple indices and still query it via a single named resource.

Improved speed and memory usage of multi-bucket aggregations:

Many of our more complex aggregations made a simplifying assumption that required them to duplicate many data structures once per bucket that contained them. The most expensive of these weighed in at a couple of kilobytes each.

Getting Started with ElasticSearch:

Start ElasticSearch:

.\elasticsearch.bat -E path.data=data2 -E path.logs=log2

Use the cat health API to verify that your three-node cluster is up and running.

GET /_cat/health?v

Note: The cluster status will remain yellow if you are only running a single instance of Elasticsearch. A single node cluster is fully functional, but data cannot be replicated to another node to provide resiliency. Replica shards must be available for the cluster status to be green. If the cluster status is red, some data is unavailable.
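The meaning of the three status colors can be modeled in a few lines. This is a simplified sketch of the rule stated in the note, not the logic Elasticsearch runs internally; `cluster_status` and the shard dictionaries are hypothetical.

```python
def cluster_status(shards):
    """Derive a status color from shard copies.

    Each shard copy is a dict with two booleans: "primary" and "assigned".
    """
    # A missing primary means some data is unavailable.
    if any(s["primary"] and not s["assigned"] for s in shards):
        return "red"
    # All primaries are available, but at least one replica has no node.
    if any(not s["primary"] and not s["assigned"] for s in shards):
        return "yellow"
    return "green"


# Single-node cluster: the replica cannot be placed on the node that already
# holds its primary, so it stays unassigned.
single_node = [
    {"primary": True, "assigned": True},
    {"primary": False, "assigned": False},
]
print(cluster_status(single_node))
```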

Talking to ES with the curl command:

curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'

<PATH> The API endpoint, which can contain multiple components, such as _cluster/stats or _nodes/stats/jvm.

<QUERY_STRING> Any optional query-string parameters. For example, ?pretty will pretty-print the JSON response to make it easier to read.

<BODY> A JSON-encoded request body (if necessary).
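From an application, the same request URL can be assembled from its parts. The sketch below uses only the Python standard library; `build_url` is a hypothetical helper, and `localhost:9200` is assumed as the node address because it matches the curl examples elsewhere on this page.

```python
from urllib.parse import urlencode


def build_url(protocol, host, port, path, query=None):
    """Join the components of an Elasticsearch request URL."""
    url = f"{protocol}://{host}:{port}/{path.lstrip('/')}"
    if query:
        # urlencode escapes parameter names and values where needed.
        url += "?" + urlencode(query)
    return url


print(build_url("http", "localhost", 9200, "_cluster/stats", {"pretty": "true"}))
# prints http://localhost:9200/_cluster/stats?pretty=true
```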

Ingest some documents: Once you have a cluster up and running, you’re ready to index some data. There are a variety of ingest options for Elasticsearch, but in the end they all do the same thing: put JSON documents into an Elasticsearch index.

You can do this directly with a simple PUT request that specifies the index you want to add the document to, a unique document ID, and one or more "field": "value" pairs in the request body:

PUT /customer/_doc/1
{
  "name": "John Doe"
}

This request automatically creates the customer index if it doesn’t already exist, adds a new document that has an ID of 1, and stores and indexes the name field.

Since this is a new document, the response shows that the result of the operation was that version 1 of the document was created.

The new document is available immediately from any node in the cluster. You can retrieve it with a GET request that specifies its document ID:

GET /customer/_doc/1

If you have a lot of documents to index, you can submit them in batches with the bulk API. Using bulk to batch document operations is significantly faster than submitting requests individually as it minimizes network roundtrips.

The optimal batch size depends on a number of factors: the document size and complexity, the indexing and search load, and the resources available to your cluster. A good place to start is with batches of 1,000 to 5,000 documents and a total payload between 5MB and 15MB. From there, you can experiment to find the sweet spot.

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@accounts.json"

curl "localhost:9200/_cat/indices?v"
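A bulk request body is newline-delimited JSON: an action line followed by the document source, with a newline after the last line. The following sketch shows one way to build such a body and split a large set of documents into batches. The helper names and the sample accounts are hypothetical, and the batch size of 1,000 is just the low end of the starting range suggested above.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_bulk_body(docs):
    """Build a newline-delimited bulk body of index actions.

    `docs` is a list of (document ID, source dictionary) pairs.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))
        lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical sample data.
accounts = [
    (1, {"account_number": 1, "balance": 39225}),
    (2, {"account_number": 2, "balance": 28838}),
]

for batch in chunked(accounts, 1000):
    body = build_bulk_body(batch)
    print(body, end="")
```

Each body produced this way could be posted to the _bulk endpoint in the same way the curl command posts accounts.json.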

Start searching:

Once you have ingested some data into an Elasticsearch index, you can search it by sending requests to the _search endpoint. To access the full suite of search capabilities, you use the Elasticsearch Query DSL to specify the search criteria in the request body. You specify the name of the index you want to search in the request URI.

GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ]
}

By default, the hits section of the response includes the first 10 documents that match the search criteria.

The response also provides the following information about the search request:

- took: how long it took Elasticsearch to run the query, in milliseconds
- timed_out: whether or not the search request timed out
- _shards: how many shards were searched and a breakdown of how many shards succeeded, failed, or were skipped
- max_score: the score of the most relevant document found
- hits.total.value: how many matching documents were found
- hits.sort: the document’s sort position (when not sorting by relevance score)
- hits._score: the document’s relevance score (not applicable when using match_all)

Each search request is self-contained: Elasticsearch does not maintain any state information across requests. To page through the search hits, specify the from and size parameters in your request.

For example, the following request gets hits 10 through 19:

GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ],
  "from": 10,
  "size": 10
}
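Because each request is self-contained, the client has to compute the from offset for every page itself. A minimal sketch, assuming zero-based page numbers and a hypothetical `page_request` helper:

```python
def page_request(page, size=10):
    """Build a search body for a zero-based page number."""
    return {
        "query": {"match_all": {}},
        "sort": [{"account_number": "asc"}],
        # Skip all hits that belong to earlier pages.
        "from": page * size,
        "size": size,
    }


# Page 1 (the second page) covers hits 10 through 19, as in the request above.
second_page = page_request(1)
print(second_page["from"], second_page["size"])
```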

Now that you’ve seen how to submit a basic search request, you can start to construct queries that are a bit more interesting than match_all.

To search for specific terms within a field, you can use a match query. For example, the following request searches the address field to find customers whose addresses contain mill or lane:

GET /bank/_search
{
  "query": {
    "match": { "address": "mill lane" }
  }
}

To perform a phrase search rather than matching individual terms, you use match_phrase instead of match. For example, the following request only matches addresses that contain the phrase mill lane:

GET /bank/_search
{
  "query": {
    "match_phrase": { "address": "mill lane" }
  }
}

To construct more complex queries, you can use a bool query to combine multiple query criteria. You can designate criteria as required (must match), desirable (should match), or undesirable (must not match).

For example, the following request searches the bank index for accounts that belong to customers who are 40 years old, but excludes anyone who lives in Idaho (ID):

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}

Each must, should, and must_not element in a Boolean query is referred to as a query clause. How well a document meets the criteria in each must or should clause contributes to the document’s relevance score. The higher the score, the better the document matches your search criteria. By default, Elasticsearch returns documents ranked by these relevance scores.

The criteria in a must_not clause are treated as a filter. They affect whether or not the document is included in the results, but do not contribute to how documents are scored. You can also explicitly specify arbitrary filters to include or exclude documents based on structured data.

For example, the following request uses a range filter to limit the results to accounts with a balance between $20,000 and $30,000 (inclusive).

GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}
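When queries are assembled in application code, the clause types can be combined programmatically. The helper below is a hypothetical sketch that builds a bool query body as a Python dictionary and omits any clause type that is empty. The example combines the age, state, and balance criteria from the requests above into a single query.

```python
import json


def bool_query(must=None, should=None, must_not=None, filters=None):
    """Build a bool query body, leaving out clause types with no criteria."""
    clauses = {
        "must": must,
        "should": should,
        "must_not": must_not,
        "filter": filters,
    }
    bool_body = {name: items for name, items in clauses.items() if items}
    return {"query": {"bool": bool_body}}


body = bool_query(
    # Scored clause: customers who are 40 years old.
    must=[{"match": {"age": "40"}}],
    # Exclusion: anyone who lives in Idaho.
    must_not=[{"match": {"state": "ID"}}],
    # Unscored filter: balance between 20,000 and 30,000 inclusive.
    filters=[{"range": {"balance": {"gte": 20000, "lte": 30000}}}],
)

print(json.dumps(body, indent=2))
```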

Analyze results with aggregations:

Elasticsearch aggregations enable you to get meta-information about your search results and answer questions like, "How many account holders are in Texas?" or "What’s the average balance of accounts in Tennessee?" You can search documents, filter hits, and use aggregations to analyze the results all in one request.

For example, the following request uses a terms aggregation to group all of the accounts in the bank index by state, and returns the ten states with the most accounts in descending order:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      }
    }
  }
}

The buckets in the response are the values of the state field. The doc_count shows the number of accounts in each state. For example, you can see that there are 27 accounts in ID (Idaho). Because the request set size=0, the response only contains the aggregation results.

You can combine aggregations to build more complex summaries of your data. For example, the following request nests an avg aggregation within the previous group_by_state aggregation to calculate the average account balances for each state.

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}

Instead of sorting the results by count, you could sort using the result of the nested aggregation by specifying the order within the terms aggregation:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": {
          "average_balance": "desc"
        }
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}
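To make the bucket structure concrete, the following sketch reproduces in plain Python what a terms aggregation with a nested avg aggregation computes: one bucket per state, a document count, and an average balance, ordered by that average. It is an illustration of the result shape only, not of how Elasticsearch computes it, and the four sample accounts are made up.

```python
from collections import defaultdict

# Hypothetical sample accounts.
accounts = [
    {"state": "TX", "balance": 30000},
    {"state": "TX", "balance": 10000},
    {"state": "ID", "balance": 25000},
    {"state": "TN", "balance": 5000},
]


def group_by_state(docs):
    """Bucket documents by state and compute the average balance per bucket."""
    balances = defaultdict(list)
    for doc in docs:
        balances[doc["state"]].append(doc["balance"])

    buckets = [
        {
            "key": state,
            "doc_count": len(values),
            "average_balance": {"value": sum(values) / len(values)},
        }
        for state, values in balances.items()
    ]
    # Equivalent of "order": { "average_balance": "desc" }.
    buckets.sort(key=lambda b: b["average_balance"]["value"], reverse=True)
    return buckets


for bucket in group_by_state(accounts):
    print(bucket["key"], bucket["doc_count"], bucket["average_balance"]["value"])
```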

In addition to basic bucketing and metrics aggregations like these, Elasticsearch provides specialized aggregations for operating on multiple fields and analyzing particular types of data such as dates, IP addresses, and geo data. You can also feed the results of individual aggregations into pipeline aggregations for further analysis.

The core analysis capabilities provided by aggregations enable advanced features such as using machine learning to detect anomalies.

.\bin\elasticsearch.exe -E cluster.name=my_cluster -E node.name=node_1 -E path.logs="C:\My Logs\logs"

GET /

Install ElasticSearch with Docker:

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.9.2

docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.9.2

To get a three-node Elasticsearch cluster up and running in Docker, you can use Docker Compose.
