Backends - NatLibFi/Annif GitHub Wiki

Annif includes multiple backends that implement, or use via external libraries, algorithms for automated subject indexing. Each backend has its own strengths, requirements and ideal use cases. The table below provides a quick comparison; see each backend's page for details and configuration options.

[!TIP] For a quick setup, start with TF-IDF; for best results, try fastText, Omikuji and MLLM combined in ensembles.
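As a rough illustration of the tip above, the sketch below shows what a baseline TF-IDF project and an ensemble over stronger backends might look like in INI-style project definitions, loaded here with Python's standard `configparser` only to show that the layout parses. The project IDs, names, the vocabulary ID `my-vocab` and the weight on `mllm-en` are all placeholders, and `parse_sources` is a hypothetical helper, not part of Annif; check each backend's page for the parameters it actually accepts.

```python
import configparser

# Hypothetical project definitions in the INI style of a projects.cfg file.
# All IDs, names and the vocabulary ID "my-vocab" are placeholders.
PROJECTS_CFG = """
[tfidf-en]
name=TF-IDF English
language=en
backend=tfidf
analyzer=snowball(english)
vocab=my-vocab
limit=100

# The three source projects are assumed to be defined elsewhere in the file.
[ensemble-en]
name=Ensemble English
language=en
backend=ensemble
sources=fasttext-en,omikuji-en,mllm-en:2
vocab=my-vocab
"""


def parse_sources(value):
    """Split 'id[:weight],...' into (project_id, weight) pairs.

    A missing weight is treated as 1.0 in this sketch.
    """
    pairs = []
    for item in value.split(","):
        project_id, _, weight = item.strip().partition(":")
        pairs.append((project_id, float(weight) if weight else 1.0))
    return pairs


config = configparser.ConfigParser()
config.read_string(PROJECTS_CFG)

# Map each project ID to the backend it uses.
backends = {section: config[section]["backend"] for section in config.sections()}

# Source projects and weights of the ensemble project.
sources = parse_sources(config["ensemble-en"]["sources"])
```

Starting with the TF-IDF project confirms that the vocabulary and training data load correctly before the heavier backends and the ensemble are added.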

| Name | Type | Requires extra dependencies | Description | # Train documents | Train documents length | Supports hyperopt |
|---|---|---|---|---|---|---|
| TF-IDF | Associative | No | A baseline algorithm, only for setup testing. | ? | short/long | No |
| fastText | Associative | Yes | Uses word-level and character-level n-grams (i.e. words that appear together and subwords). | 10,000+ | short/long | No |
| Omikuji | Associative | Yes | A tree-based algorithm for extreme multilabel classification. | 10,000+ | short/long | No |
| SVC | Associative | No | Linear Support Vector Classification for multiclass (not multilabel) classification. | ? | ? | No |
| MLLM | Lexical | No | Maui-like lexical matching. | 100-10,000 | long | Yes |
| Ensemble | Ensemble | No | Combines results from multiple backends with averaging. | NA | NA | Yes |
| NN-ensemble | Ensemble | Yes | Combines results from multiple backends using a neural network. | 1,000-10,000 | long | No |
| PAV | Ensemble | No | A trainable dynamic ensemble that combines results from multiple projects. | ? | ? | No |
| YAKE | Lexical | Yes | An unsupervised keyword extraction method applied to find vocabulary terms. Requires no training. | NA | long | No |
| STWFSA | Lexical | Yes | Statistical translation weighted finite state automaton. | ? | short/long | No |
| HTTP | Special | No | Communicates with a REST API that provides a suggest method, e.g. another instance of Annif. | NA | NA | No |
| Dummy | Special | No | Returns fixed results (used for internal testing). | NA | NA | NA |

Column descriptions

Name: The identifier of the backend.
Type: The general approach used by the backend:
- Associative: Learns associations between input text and subject terms from training data.
- Lexical: Uses lexical matching techniques.
- Ensemble: Combines results from multiple backends.
- Special: Provides utility or integration features.
Requires extra dependencies: Indicates whether additional dependencies must be installed in addition to the base Annif installation. See Optional-features-and-dependencies for details.
Description: A short explanation of the backend’s functionality and purpose.
# Train documents: Recommended number of training documents for good performance.
Train documents length: Indicates whether the backend works best with
- short texts (e.g., titles),
- long texts (e.g., abstracts or full documents) or
- both.
Supports hyperopt: Whether the backend supports hyperparameter optimization using the annif hyperopt command. The optimized parameters are:
- the hyperparameters of the tree algorithm, for the MLLM backend;
- the source project weights, for the ensemble backend.
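To make the role of the ensemble's source project weights more concrete, here is a minimal sketch of weighted averaging over per-project suggestion scores. It illustrates the general idea only and is not Annif's actual implementation: the function name `weighted_average`, the dictionary-based data layout, the project IDs and the subject labels are all assumptions made for this example.

```python
def weighted_average(suggestions, weights):
    """Combine per-project {subject: score} dicts into one weighted average.

    suggestions: {project_id: {subject: score}}
    weights:     {project_id: weight}
    A subject that a project did not suggest counts as score 0 for that project.
    """
    total_weight = sum(weights[project] for project in suggestions)
    combined = {}
    for project, scores in suggestions.items():
        share = weights[project] / total_weight
        for subject, score in scores.items():
            combined[subject] = combined.get(subject, 0.0) + share * score
    return combined


# Hypothetical scores from two source projects for one document.
suggestions = {
    "fasttext-en": {"cats": 0.8, "dogs": 0.2},
    "mllm-en": {"cats": 0.4},
}

# Equal weights: a plain average.
equal = weighted_average(suggestions, {"fasttext-en": 1.0, "mllm-en": 1.0})

# Giving the second project three times the weight pulls the
# combined scores towards its output.
skewed = weighted_average(suggestions, {"fasttext-en": 1.0, "mllm-en": 3.0})
```

In this picture, optimizing the ensemble means searching for the weight values that give the best evaluation scores on a held-out set of documents.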
