Backends - NatLibFi/Annif GitHub Wiki
Annif includes multiple backends that implement, or use via external libraries, algorithms for automated subject indexing. Each backend has its own strengths, requirements and ideal use cases. The table below provides a quick comparison; see each backend's page for details and configuration options.
> [!TIP]
> For a quick setup, start with TF-IDF; for best results, try fastText, Omikuji and MLLM combined in ensembles.
Name | Type | Requires extra dependencies | Description | # train documents | Train documents length | Supports hyperopt |
---|---|---|---|---|---|---|
TF-IDF | Associative | No | A baseline algorithm, only for setup testing. | ? | short/long | No |
fastText | Associative | Yes | Uses word-level and character-level n-grams (i.e. words that appear together, and subwords). | 10,000+ | short/long | No |
Omikuji | Associative | Yes | A tree-based algorithm for extreme multilabel classification. | 10,000+ | short/long | No |
SVC | Associative | No | Linear Support Vector Classification for multiclass (not multilabel) classification. | ? | ? | No |
MLLM | Lexical | No | Maui-like lexical matching. | 100-10,000 | long | Yes |
Ensemble | Ensemble | No | Combines results from multiple backends with averaging. | NA | NA | Yes |
NN-ensemble | Ensemble | Yes | Combines results from multiple backends using a neural network. | 1,000-10,000 | long | No |
PAV | Ensemble | No | A trainable dynamic ensemble that intelligently combines results from multiple projects. | ? | ? | No |
YAKE | Lexical | Yes | An unsupervised keyword extraction method applied to find vocabulary terms. Requires no training. | NA | long | No |
STWFSA | Lexical | Yes | Statistical translation weighted finite state automaton. | ? | short/long | No |
HTTP | Special | No | Communicates with a REST API that provides a suggest method, e.g. with another instance of Annif. | NA | NA | No |
Dummy | Special | No | Returns fixed results (used for internal testing). | NA | NA | NA |
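
The averaging done by the Ensemble backend can be pictured with a minimal sketch. This is not Annif's implementation: the function name `combine_scores`, the subject labels and the scores are hypothetical, and the sketch assumes that each source returns a mapping from subject to score and that a subject missing from a source counts as 0.

```python
from collections import defaultdict


def combine_scores(results, weights=None):
    """Weighted average of per-subject scores from several sources.

    results: list of dicts mapping subject label -> score
    weights: optional list of non-negative weights, one per source
             (equal weights are assumed when omitted)
    """
    if weights is None:
        weights = [1.0] * len(results)
    total = sum(weights)
    combined = defaultdict(float)
    for scores, weight in zip(results, weights):
        for subject, score in scores.items():
            # Each source contributes in proportion to its share of the total weight.
            combined[subject] += weight * score / total
    return dict(combined)


# Hypothetical suggestions from two source projects for the same document.
tfidf = {"cats": 0.8, "dogs": 0.4}
mllm = {"cats": 0.6, "pets": 0.2}

averaged = combine_scores([tfidf, mllm])          # plain average
weighted = combine_scores([tfidf, mllm], [1, 3])  # second source counts three times as much
```

Subjects suggested by several sources end up ranked above subjects suggested by only one, which is the main reason an ensemble tends to be more robust than any single backend.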
Column descriptions
- Name: The identifier of the backend.
- Type: The general approach used by the backend:
  - Associative: Learns associations between input text and subject terms from training data.
  - Lexical: Uses lexical matching techniques.
  - Ensemble: Combines results from multiple backends.
  - Special: Provides utility or integration features.
- Requires extra dependencies: Indicates whether additional dependencies must be installed in addition to the base Annif installation. See Optional-features-and-dependencies for details.
- Description: A short explanation of the backend’s functionality and purpose.
- # Train documents: Recommended number of training documents for good performance.
- Train documents length: Indicates whether the backend works best with
  - short texts (e.g., titles),
  - long texts (e.g., abstracts or full documents) or
  - both.
- Supports hyperopt: Whether the backend supports hyperparameter optimization using the `annif hyperopt` command. The optimized parameters are
  - the hyperparameters of the tree algorithm with the MLLM backend
  - the source project weights with the ensemble backend
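
To illustrate what optimizing source project weights means, the sketch below tries a few candidate weights for two sources and keeps the one that ranks a correct subject first most often. It is a toy under stated assumptions, not what `annif hyperopt` does internally: the helper names, the grid of weights, the evaluation metric and the data are all hypothetical.

```python
def weighted_top(scores_a, scores_b, w_a):
    """Return the top-ranked subject when source A gets weight w_a and source B gets 1 - w_a."""
    subjects = set(scores_a) | set(scores_b)
    mixed = {
        s: w_a * scores_a.get(s, 0.0) + (1 - w_a) * scores_b.get(s, 0.0)
        for s in subjects
    }
    # Iterate over sorted labels so that ties are broken deterministically.
    return max(sorted(mixed), key=mixed.get)


def accuracy(docs, w_a):
    """Fraction of documents whose top-ranked subject is among the gold subjects."""
    hits = sum(weighted_top(a, b, w_a) in gold for a, b, gold in docs)
    return hits / len(docs)


# Hypothetical validation set: (scores from source A, scores from source B, gold subjects).
docs = [
    ({"cats": 0.9, "dogs": 0.1}, {"cats": 0.2, "dogs": 0.7}, {"dogs"}),
    ({"birds": 0.3, "fish": 0.6}, {"birds": 0.8, "fish": 0.1}, {"birds"}),
    ({"cats": 0.9, "fish": 0.1}, {"cats": 0.3, "fish": 0.5}, {"cats"}),
]

# Simple grid search over the weight of source A.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
best_weight = max(grid, key=lambda w: accuracy(docs, w))
```

On this toy data neither source alone gets every document right, while a mix that leans on source B does, which is the kind of trade-off that weight optimization is meant to find.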