Standard Analyzer - ignacio-alorre/ElasticSearch GitHub Wiki
The standard analyzer is the default analyzer, used when none is specified. It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.
Example Output
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above sentence would produce the following terms:
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
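To make the tokenize-then-lowercase behavior concrete, here is a minimal Python sketch that reproduces the terms above. It is only an approximation: the regular expression and the name `approx_standard_analyze` are assumptions made for illustration, not the Unicode Text Segmentation rules that the real tokenizer follows.

```python
import re

# Rough stand-in for the standard tokenizer: runs of word characters,
# optionally joined by an apostrophe (so "dog's" stays one token).
# ASSUMPTION: this regex only approximates UAX #29 for plain English text.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*")


def approx_standard_analyze(text):
    """Tokenize the text, then lowercase every token."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


terms = approx_standard_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
)
print(terms)
```

Note that the hyphen in Brown-Foxes acts as a word boundary, while the apostrophe in dog's does not.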
Configuration
The standard analyzer accepts the following parameters:
- max_token_length: The maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.
- stopwords: A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.
- stopwords_path: The path to a file containing stop words.
Example configuration
In this example, we configure the standard analyzer to have a max_token_length of 5 (for demonstration purposes), and to use the pre-defined list of English stop words:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The above example produces the following terms:
[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
Note: Neither occurrence of the appears in the final list, because it is considered a stop word. In addition, jumped has length 6; since max_token_length is set to 5, the word is split into jumpe and d.
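The effect of the two parameters can be sketched in plain Python. This is a simplified model, not the Lucene implementation: the tokenizing regex, the helper names, and the `ENGLISH_STOPWORDS` set (a small hand-picked subset of the _english_ list) are all assumptions made for illustration.

```python
import re

# ASSUMPTION: rough approximation of the standard tokenizer for English text.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*")

# ASSUMPTION: a small illustrative subset of the _english_ stop words list.
ENGLISH_STOPWORDS = frozenset({"a", "an", "and", "in", "is", "it", "of", "the", "to"})


def split_token(token, max_token_length):
    """Cut a token into consecutive chunks of at most max_token_length characters."""
    return [
        token[start:start + max_token_length]
        for start in range(0, len(token), max_token_length)
    ]


def approx_analyze(text, max_token_length=255, stopwords=frozenset()):
    """Tokenize, split over-long tokens, lowercase, then drop stop words."""
    terms = []
    for token in TOKEN_RE.findall(text):
        for piece in split_token(token, max_token_length):
            piece = piece.lower()
            if piece not in stopwords:
                terms.append(piece)
    return terms


terms = approx_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    max_token_length=5,
    stopwords=ENGLISH_STOPWORDS,
)
print(terms)
```

In this model dog's is exactly five characters long (the apostrophe counts), so it is not split, while jumped is cut after the fifth character.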
Definition
The standard analyzer consists of:
- Standard Tokenizer
- Standard Token Filter
- Lower Case Token Filter
- Stop Token Filter (disabled by default)
If you need to customize the standard analyzer beyond the configuration parameters, recreate it as a custom analyzer and modify it, usually by adding token filters. The following request recreates the built-in standard analyzer, and you can use it as a starting point:
PUT /standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase"
          ]
        }
      }
    }
  }
}
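Conceptually, a custom analyzer is a tokenizer followed by an ordered chain of token filters, and customizing it means appending to that chain. The Python sketch below models that idea only; every function name in it is hypothetical, and it does not call Elasticsearch.

```python
import re


def standard_tokenizer(text):
    # ASSUMPTION: rough approximation of the standard tokenizer for English text.
    return re.findall(r"\w+(?:'\w+)*", text)


def lowercase_filter(tokens):
    return [token.lower() for token in tokens]


def make_stop_filter(stopwords):
    """Build a filter that drops any token found in stopwords."""
    def stop_filter(tokens):
        return [token for token in tokens if token not in stopwords]
    return stop_filter


def build_analyzer(tokenizer, filters):
    """Compose a tokenizer with filters that are applied in list order."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


# Model of the rebuilt analyzer: tokenizer plus lowercase.
rebuilt_standard = build_analyzer(standard_tokenizer, [lowercase_filter])

# Customized variant: an extra filter is added after lowercase.
customized = build_analyzer(
    standard_tokenizer, [lowercase_filter, make_stop_filter({"the"})]
)

print(rebuilt_standard("The Lazy Dog"))
print(customized("The Lazy Dog"))
```

Filter order matters in this model: if the stop filter ran before the lowercase filter, the capitalized The would not match the lowercase stop word and would survive.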