Analyzers - NatLibFi/Annif GitHub Wiki
Analyzers are used to pre-process, tokenize and normalize text. If you happen to be familiar with analyzers in Lucene, Solr and/or Elasticsearch, the concept is exactly the same although the details may differ a little bit. Analyzers are typically language-specific.
By default, tokenization discards all words that are shorter than three characters, but this can be configured by setting token_min_length in the analyzer parameters. For example, to discard only words of one character (when using the snowball analyzer for English), use snowball(english,token_min_length=2).
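The specification syntax shown above (an analyzer name, optionally followed by comma-separated parameters in parentheses) can be illustrated with a small parser. This is a minimal sketch for illustration only, not Annif's own parsing code; the helper name parse_analyzer_spec is hypothetical, and parameter values are left as strings.

```python
import re

# Hypothetical helper, not part of Annif: splits an analyzer specification
# such as "snowball(english,token_min_length=2)" into its parts.
SPEC_PATTERN = re.compile(r"^(?P<name>\w+)(?:\((?P<params>.*)\))?$")


def parse_analyzer_spec(spec):
    """Return (name, positional_params, keyword_params) for a spec string."""
    match = SPEC_PATTERN.match(spec.strip())
    if match is None:
        raise ValueError(f"Invalid analyzer specification: {spec!r}")
    positional, keyword = [], {}
    params = match.group("params") or ""
    # Skip empty entries so that "name()" yields no parameters at all.
    for param in filter(None, (p.strip() for p in params.split(","))):
        if "=" in param:
            key, value = param.split("=", 1)
            keyword[key.strip()] = value.strip()
        else:
            positional.append(param)
    return match.group("name"), positional, keyword


print(parse_analyzer_spec("snowball(english,token_min_length=2)"))
# ('snowball', ['english'], {'token_min_length': '2'})
```

An analyzer given without parentheses, such as simple, parses to a name with no parameters.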
Annif supports the following analyzers: simple, snowball, simplemma, voikko, spacy and estnltk.
simple analyzer
The simple analyzer only splits text into words and converts them all to lowercase.
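The behavior described above, combined with the token_min_length filter, can be sketched in a few lines. This is an approximation under stated assumptions, not Annif's implementation: the function name simple_analyze is hypothetical, and a word is assumed to be any run of letters, digits or underscores, which may differ from how Annif actually splits text.

```python
import re

# Assumption: a "word" is a run of letters, digits or underscores.
# Annif's actual tokenizer may split text differently.
WORD_PATTERN = re.compile(r"\w+")


def simple_analyze(text, token_min_length=3):
    """Split text into lowercase words, dropping words shorter than the minimum."""
    words = WORD_PATTERN.findall(text.lower())
    return [word for word in words if len(word) >= token_min_length]


print(simple_analyze("An Analyzer splits Text into words"))
# ['analyzer', 'splits', 'text', 'into', 'words']
print(simple_analyze("An Analyzer splits Text into words", token_min_length=2))
# ['an', 'analyzer', 'splits', 'text', 'into', 'words']
```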
snowball analyzer
The snowball analyzer additionally performs stemming. It takes a language name as parameter, e.g. snowball(english) or snowball(finnish). You can use any language supported by the NLTK Snowball stemmer; see the NLTK stemmer documentation for details on supported languages.
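The supported languages as of NLTK 3.4.5 are:
arabic, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, porter, portuguese, romanian, russian, spanish, swedish
Note that porter is not a language: it selects the original Porter stemming algorithm for English.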
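To see what stemming does to individual words, the NLTK Snowball stemmer can be called directly. This is a minimal sketch that assumes the third-party nltk package is installed; the function name snowball_stem is hypothetical, and the code shows only the stemming step, not how Annif wires the stemmer into its analyzer.

```python
def snowball_stem(words, language="english"):
    """Stem each word with NLTK's Snowball stemmer for the given language."""
    # Imported inside the function so that this module still loads
    # where the optional nltk package is not installed.
    from nltk.stem.snowball import SnowballStemmer

    # Raises ValueError if the language is not supported by NLTK.
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(word) for word in words]
```

For example, snowball_stem(["running", "cats"]) returns ['run', 'cat']. The language names accepted by the installed NLTK version are available as SnowballStemmer.languages.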
simplemma analyzer
The simplemma analyzer performs simple rule-based lemmatization for many languages. It takes a language code as parameter, e.g. simplemma(en). Lemmatization gives better results than stemming in many cases, but this depends on the language and classification task.
voikko analyzer
The voikko analyzer performs lemmatization for Finnish. It takes a language code as parameter, e.g. voikko(fi). This analyzer needs to be installed separately. See Optional features and dependencies.
spacy analyzer
The spacy analyzer performs lemmatization for many languages using the spaCy NLP toolkit. See Models & Languages for the current list of supported languages.
The analyzer takes a language model name as parameter, e.g. spacy(en_core_web_sm). Optionally, lemmas can be forced to lowercase using the lowercase option, like this: spacy(en_core_web_sm,lowercase=1).
This analyzer and the language-specific models need to be installed separately. See Optional features and dependencies.
estnltk analyzer
The estnltk analyzer (new in Annif 1.3) performs lemmatization for the Estonian language using the EstNLTK toolkit. It takes no parameters.
This analyzer needs to be installed separately. See Optional features and dependencies.