DRAFT - Backend: EBM - NatLibFi/Annif GitHub Wiki

Note

The ebm backend is not yet available in released Annif versions; its integration is work in progress: https://github.com/NatLibFi/Annif/pull/914

Starting from the idea of lexical matching, where subject suggestions are generated by matching a controlled vocabulary against an input text based on their string representations, embedding-based matching (EBM) generalizes this idea by generating matches based on vector representations: a sentence transformer model is used to generate embeddings for both the input text and the subject terms. These embeddings are numerical vector representations of text in a high-dimensional space and allow vector similarity metrics to be used to find matches between texts and keywords.

In more detail, the idea of EBM is an inverted retrieval logic: the target vocabulary is vectorized with a sentence transformer model, and the resulting embeddings are stored in a vector storage. The vector storage indexes the vocabulary and its embeddings, enabling fast search across the vocabulary, even for extremely large vocabularies with many synonyms.

An input text to be indexed with terms from this vocabulary is embedded with the same sentence transformer model and sent as a query to the vector storage, resulting in subject candidates whose embeddings are close to the query. Longer input texts can be chunked, resulting in multiple queries.

Finally, a ranker model is trained that reranks the subject candidates using numerical features collected during the matching process. For instance, subject terms that are matched frequently across a text may receive a higher relevance score from the ranker model.
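The matching step can be sketched in a few lines of Python (a toy illustration: hand-made three-dimensional vectors stand in for real sentence transformer embeddings, and a brute-force loop replaces the vector storage):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for two vocabulary terms and one input text chunk
vocab_embeddings = {
    "climate change": [0.9, 0.1, 0.0],
    "marine biology": [0.1, 0.8, 0.3],
}
chunk_embedding = [0.85, 0.15, 0.05]

# Rank vocabulary terms by similarity to the chunk
ranked = sorted(vocab_embeddings,
                key=lambda term: cosine(vocab_embeddings[term], chunk_embedding),
                reverse=True)
print(ranked[0])  # → climate change
```

In the real backend, the vocabulary embeddings are stored in an HNSW-indexed vector storage instead of being scanned one by one.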

This design borrows many ideas from lexical matching approaches such as Maui [1] and Kea [2], and particularly from the MLLM backend (Maui-Like Lexical Matching).

[1] Medelyan, O., Frank, E., & Witten, I. H. (2009). Human-competitive tagging using automatic keyphrase extraction. ACL and AFNLP, 6–7. https://doi.org/10.5555/3454287.3454810

[2] Frank, E., Paynter, G. W., Witten, I. H., Gutwin, C., & Nevill-Manning, C. G. (1999). Domain-Specific Keyphrase Extraction. Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI 99), 668–673.

Embedding Based Matching Sketch # TODO: Add image ebm-sketch.svg

When to use EBM?

EBM is a backend suited to subject indexing, not classification. It relies on verbal descriptions of concepts and does not work with, for example, alphanumeric class notations like DDC. EBM has moderate overall performance; however, it performs decently on zero- and few-shot labels. This means that EBM is a good choice whenever you have little training material. As such, EBM is a good counterpart for supervised learning methods like Omikuji in an ensemble. Since EBM performs subject term matching based on sentence transformer embeddings, its suggestions will likely differ considerably from those of a lexical approach like MLLM. Thus, EBM and MLLM may also work well together in an ensemble.

Running mode

Generating the embeddings for all the vocabulary labels and for the text in train and suggest operations can be computationally expensive and may require a GPU to achieve feasible performance.

The EBM backend can be run in two modes, which govern where the embeddings are generated:

  1. In the in-process mode the embeddings are generated locally in the same process that Annif is running in. This typically requires an environment with a GPU. Embedding is performed using the sentence-transformers library.

  2. In the API mode the embeddings are generated by another service via an API, which can be either an OpenAI-style API or a Hugging Face TEI API. The API service can run in the cloud or on your own server.

Note that the in-process mode possibly offers better performance than an API service owing to batch processing. The in-process mode is thus better suited for training and for indexing documents in local runs, while the API mode can be the preferred option for continuously running Annif as a service.

(TODO: I guess it is possible to train an EBM project in-process and then use it in API mode?)

Installation

For using the in-process mode:

pip install annif[ebm-in-process]

Some embedding models (e.g. Jina AI embeddings) may need further dependencies, which you can find on the model's Hugging Face model card. TODO: Also note that CUDA modules may need to be installed for GPU use?

For the API mode:

pip install annif[ebm-api]

Example configuration

A minimal configuration for the in-process mode that relies on default values for the EBM backend could be:

[ebm-bge-m3]
name=ebm-bge-m3
language=en
backend=ebm
analyzer=snowball(english)
vocab=yso
# TODO Parameters necessary for a minimal config for in-process GPU usage?

This is equivalent to setting embedding_model_name=BAAI/bge-m3.

The following sections guide you through further parameters of the individual stages that EBM goes through during training and inference.

Backend-specific parameters

Embedding settings

An essential function of the EBM backend is the generation of embeddings for the vocabulary and the text chunks. Both are processed by the same embedding model.

The selection between in-process and API mode is controlled with the embedding_model_deployment parameter, allowing values:

  • in-process
  • HuggingFaceTEI
  • OpenAI

Either HuggingFaceTEI or OpenAI enables the API mode; see below for the related parameters.

in-process

The most important parameter in the in-process mode is the sentence transformer model. The additional parameter embedding_model_args can be used to pass further parameters to the SentenceTransformer class. See the examples below.

| Parameter | Default | Description |
|---|---|---|
| embedding_dimensions | 1024 | The embedding vector size. Some models, e.g. Jina AI, support truncation of embedding dimensions. Make sure this size is supported by the embedding model. |
| embedding_model_name | BAAI/bge-m3 | Hugging Face identifier of a SentenceTransformer model |
| embedding_model_deployment | in-process | see above # TODO Is this line needed here? |
| embedding_model_args | {"device": "cpu", "trust_remote_code": False} | A dictionary of parameters passed to the SentenceTransformer constructor when loading the embedding model |
| encode_args_vocab | {"batch_size": 32, "show_progress_bar": True} | A dictionary of parameters passed to the SentenceTransformer.encode method when generating embeddings for the vocabulary |
| encode_args_documents | {"batch_size": 32, "show_progress_bar": True} | A dictionary of parameters passed to the SentenceTransformer.encode method when generating embeddings for document chunks |

See the examples below on using embedding_model_args, encode_args_vocab and encode_args_documents for advanced control of the embedding generation process. Usually, for vocabulary embeddings (which are short texts only), one can afford a much higher batch_size than for text chunks.

Tip

Increasing batch_size in the encoding parameters can have a huge impact on processing time and can be set as high as memory constraints allow.
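For instance, one might give the short vocabulary labels a much larger batch size than the document chunks (illustrative values, to be tuned to the available memory):

```
encode_args_vocab={"batch_size": 256, "show_progress_bar": True}
encode_args_documents={"batch_size": 32, "show_progress_bar": True}
```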

Here are a few sentence transformer models that may be worth trying:

These are all multilingual models that are compatible with the SentenceTransformer class. The landscape of embedding models continues to evolve, and most likely there is an even better model for your language and application.

You can also fine-tune custom models and pass these to the EBM backend. For example, KatjaK/gnd_retriever_full is a model fine-tuned for matching German title–keyword pairs.

TODO: Emphasize how to enable the GPU; could its enablement be made simpler or the default?

API-mode: HuggingFaceTEI

Hugging Face Text Embeddings Inference (TEI) is a toolkit for deploying and serving text embedding models.

To use HuggingFaceTEI, embedding_model_args needs to include the configuration for the HuggingFaceTEI API like this:

embedding_model_deployment=HuggingFaceTEI
embedding_model_args={"api_address": "BASE_URL/embed", "headers": {"Content-Type": "application/json"}}

TODO: The setting "headers": {"Content-Type": "application/json"} does not seem necessary for OpenAI API use; is it necessary for HFTEI?

Note that in this case embedding_model_name does not change which model the API actually serves; the choice of served model is configured in the external API service. BASE_URL would be something like http://localhost:8090/ if you deploy the Hugging Face TEI server locally.

TODO: Instruct how to start the HFTEI service locally?
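One possible way to run a TEI service locally is via the official Docker image (a sketch, not tested here; the image tag, port mapping, data volume and model choice are illustrative, and GPU image variants exist as well):

```shell
# serve BAAI/bge-m3 on port 8090 with a CPU image of HuggingFace TEI
docker run -p 8090:80 -v $PWD/tei-data:/data \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id BAAI/bge-m3
```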

API-mode: OpenAI

To use an OpenAI compatible API, you can use a configuration like this:

embedding_model_deployment=OpenAI
embedding_model_args={"api_address": "<API_URL>", "headers": {"Content-Type": "application/json"}}
encode_args_documents={"truncate_prompt_tokens": 8192}
encode_args_vocab={"truncate_prompt_tokens": 8192}

Note that you need to ensure that input truncation is handled correctly, e.g. with the truncate_prompt_tokens settings above. When an API key is required, it needs to be set in an environment variable named OPENAI_API_KEY.

Note

When using Azure OpenAI API the API_URL needs to be of the form https://<YOUR_RESOURCE_NAME>.openai.azure.com/openai/v1/; for more information about using the Azure OpenAI endpoint see this page.

Vocab collection

Before any text is processed, the EBM backend needs to create embeddings of the vocabulary and store them in a vector storage.

| Parameter | Default | Description |
|---|---|---|
| use_altLabels | True | Consider SKOS altLabels in addition to prefLabels when vectorizing subjects. |
| hnsw_index_params | {"M": 32, "ef_construction": 256, "ef_search": 256} | Parameters used in the construction of the HNSW index for vector search. |

See the DuckDB documentation for more explanation of the hnsw_index_params settings. These parameters control the speed and accuracy of the vector search.

Chunking

A text is chunked into several pieces, sentence by sentence, using the analyzer configured in projects.cfg (e.g. snowball). Parameters for chunking are:

| Parameter | Default | Description |
|---|---|---|
| max_chunk_count | 100 | Upper bound on the number of chunks per document to be processed |
| max_sentence_count | 100 | Similar to max_chunk_count, an upper bound on the number of sentences to be processed |
| max_chunk_length | 50 | Sentences shorter than max_chunk_length are concatenated with the next sentence until a chunk reaches max_chunk_length |
| chunking_jobs | 1 | Number of parallel processes used for chunking |

max_chunk_count and max_sentence_count effectively shorten the input text if its sentence or chunk count exceeds the specified limits. If max_chunk_length is set to a higher value, the total number of chunks will differ from the total number of sentences, as multiple sentences are concatenated to form one chunk.
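The chunking behaviour described above can be sketched as follows (an illustrative simplification, not Annif's actual implementation; splitting the text into sentences with the analyzer is assumed to have happened already):

```python
def chunk_sentences(sentences, max_chunk_length=50, max_chunk_count=100):
    """Concatenate consecutive sentences until a chunk reaches max_chunk_length characters."""
    chunks, current = [], ""
    for sentence in sentences:
        current = f"{current} {sentence}".strip()
        if len(current) >= max_chunk_length:  # chunk is long enough, start a new one
            chunks.append(current)
            current = ""
        if len(chunks) >= max_chunk_count:    # upper bound on chunks per document
            return chunks
    if current:                               # emit the trailing, shorter chunk
        chunks.append(current)
    return chunks

sentences = ["Short one.", "Another short sentence.", "A third, somewhat longer sentence."]
print(chunk_sentences(sentences))  # the three short sentences end up in a single chunk
```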

Vector search

For each text chunk, EBM conducts a vector search, which returns up to candidates_per_chunk subject suggestions per chunk. These are aggregated at document level, counting the number of occurrences of each subject suggestion per document (along with other statistics used by the ranker). At document level, candidates_per_doc controls the maximum number of subject suggestions produced per document.

| Parameter | Default | Description |
|---|---|---|
| candidates_per_chunk | 20 | How many subject suggestions should be returned per chunk. Increase for higher recall, decrease for better precision. |
| candidates_per_doc | 100 | How many subject suggestions per document should be used when training the EBM ranker |
| query_jobs | 1 | Vector search is conducted in batches. Increasing query_jobs allows processing multiple batches in parallel, which can speed up performance. |
| duckdb_threads | 1 | Config parameter passed to the underlying DuckDB databases. Increase performance by allowing a maximum number of threads for the internal processes of the database. |
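The document-level aggregation described above can be sketched as follows (an illustrative toy example; the subject labels and similarity values are made up):

```python
from collections import Counter, defaultdict

# Per-chunk vector search results: lists of (subject, cosine similarity) pairs
chunk_candidates = [
    [("climate change", 0.81), ("weather", 0.74)],
    [("climate change", 0.78), ("greenhouse gases", 0.70)],
]

counts = Counter()                    # occurrences of each subject across chunks
best_similarity = defaultdict(float)  # best cosine similarity seen per subject
for candidates in chunk_candidates:
    for subject, sim in candidates:
        counts[subject] += 1
        best_similarity[subject] = max(best_similarity[subject], sim)

candidates_per_doc = 100              # document-level cap on suggestions
doc_candidates = [subject for subject, _ in counts.most_common(candidates_per_doc)]
print(doc_candidates[0])  # → climate change (it occurs in both chunks)
```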

Ranker

Subject suggestions generated by the vector search are ranked using a ranker model. Heuristics gathered during the aggregation of the vector search results are used to rank the subject suggestions. These heuristics include

  • the number of occurrences in a text,
  • the positions of the first and last match in the text,
  • the cosine similarity of the best match,

and other heuristics.
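As a toy sketch, the heuristic features for one subject candidate could be assembled like this (the feature names and values are illustrative, not Annif's actual feature set):

```python
# Matches of one subject candidate: (chunk index, cosine similarity) pairs (made up)
matches = [(0, 0.81), (3, 0.78), (9, 0.69)]
n_chunks = 10

features = {
    "occurrence_count": len(matches),                      # how often the subject matched
    "first_match_position": matches[0][0] / n_chunks,      # relative position of first match
    "last_match_position": matches[-1][0] / n_chunks,      # relative position of last match
    "best_cosine_similarity": max(s for _, s in matches),  # similarity of the best match
}
print(features["occurrence_count"])  # → 3
```

Feature vectors like this, one per subject candidate, are what the ranker model is trained on.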

The ranker model is an ensemble of gradient-boosted decision trees, as implemented by the xgboost library. Fitting of the ranker can be controlled with:

| Parameter | Default | Description |
|---|---|---|
| xgb_shrinkage | 0.03 | see XGB docs |
| xgb_interaction_depth | 5 | see XGB docs |
| xgb_subsample | 0.7 | see XGB docs |
| xgb_rounds | 300 | see XGB docs |
| xgb_jobs | 1 | number of parallel processes used by xgboost |

Details on these parameters are explained in the XGBoost documentation, under the parameters for tree boosters.

Full example configurations

The most important parameter for the EBM backend is the sentence transformer model. It is worth trying out a few different models to get the most out of the EBM backend.

Using Jina-AI Embeddings

This is a project configuration using Jina AI embeddings. Jina AI offers multilingual embedding models. To save memory and gain speed, one can lower the parameter embedding_dimensions to one of the values 32, 64, 128, 256, 512, 768 or 1024, using Jina AI's technique of Matryoshka embeddings.

Note that, as described in the model card, you need to install flash-attention for this particular model.

pip install flash-attn --no-build-isolation

Also, you need to pass "trust_remote_code": True to the SentenceTransformer kwargs, as shown in the example below.

Jina-AI also supports asymmetric embeddings, which is useful for matching sentence-keyword pairs. You can customize the arguments for encoding the vocab and documents.

[ebm-jina-embeddings-v3]
name=EBM with Jina-AI/jina-embeddings-v3
language=de
backend=ebm
vocab=gnd
limit=100
embedding_model_name=jinaai/jina-embeddings-v3
embedding_model_args={
  "device": "cuda",
  "trust_remote_code": True}
encode_args_vocab={
  "device": "cuda",
  "batch_size": 300,
  "task": "retrieval.query",
  "show_progress_bar": True}
encode_args_documents={
  "device": "cuda",
  "batch_size": 100,
  "task": "retrieval.passage",
  "show_progress_bar": True}

Note: Jina-AI jina-embeddings-v3 has no Multi-GPU support.

Parallel Computing with EBM

Corresponding to the multiple stages in EBM's processing pipeline, there are several options to increase processing speed through parallelisation. Particularly critical for performance are the generation of embeddings (see below) and the vector search.

| Parameter | Default | Description |
|---|---|---|
| xgb_jobs | 1 | number of parallel processes used by xgboost |
| query_jobs | 1 | Vector search is conducted in batches. Increase query_jobs to allow processing multiple batches in parallel and speed up performance. |
| chunking_jobs | 1 | number of parallel processes used for chunking |
| duckdb_threads | 1 | Config parameter passed to the underlying DuckDB databases. Increase performance by allowing a maximum number of threads for the internal processes of the database. |

If you invoke annif train with the parameter -j (jobs), each of these parameters will be set to the given value of j. For fine-grained control, use the backend-specific options documented above.

A reasonable setting might be:

duckdb_threads=40 # allow as much as you can
chunking_jobs=4 # not many processes necessary
query_jobs=16 # forking too many processes may get expensive
xgb_jobs=10 # increase if you have a large dataset

In addition, parallel computing can also be used for embedding generation. This is best handled by the SentenceTransformer library and its underlying frameworks (e.g. torch). You can use encode_args_vocab and encode_args_documents to pass detailed instructions. Here is an example of using two GPUs for the embedding generation of the vocabulary and the documents:

# inside the projects.cfg
embedding_model_args={
    "device": "cuda"}
encode_args_vocab={
  "device": ["cuda:0", "cuda:1"],
  "batch_size": 300,
  "show_progress_bar": True}
encode_args_documents={
  "device": ["cuda:0", "cuda:1"],
  "batch_size": 100,
  "show_progress_bar": True}

Alternatively, you may also work on CPUs entirely. Here is an example of using four parallel processes on CPU:

# inside the projects.cfg
encode_args_vocab={
  "device": ["cpu", "cpu", "cpu", "cpu"],
  "batch_size": 300,
  "show_progress_bar": True}

Note, however, that the number of processes multiplies with the setting of the environment variable MKL_NUM_THREADS (Intel Math Kernel Library): the four processes above combined with MKL_NUM_THREADS=3 can use up to 12 threads in total.

# set number of threads used by Intel's Math Kernel Library (MKL)
export MKL_NUM_THREADS=3

Usage

Load a vocabulary:

annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl

Train the model:

annif train ebm-bge-m3 /path/to/Annif-corpora/training/yso-finna-en.tsv.gz

Test the model with a single document:

cat document.txt | annif suggest ebm-bge-m3

Evaluate a directory full of files in fulltext document corpus format:

annif eval ebm-bge-m3 /path/to/documents/