DRAFT: Backend: XTransformer
> [!NOTE]
> The xtransformer backend is not yet available in released Annif versions; its integration is a work in progress: https://github.com/NatLibFi/Annif/pull/798
The x-transformer backend is a wrapper around pecos. X-Transformer fine-tunes a pre-trained encoder model like BERT for extreme multi-label classification. The dense embeddings of the fine-tuned model are used in conjunction with sparse tf-idf vectors in partitioned-label trees for prediction.
Results can vary drastically depending on the hyperparameter configuration, and careful optimization is often necessary to achieve good results. What works best depends on the size of the dataset, the number of labels, and their distribution. For more information, see the section on Hyperparameter Optimization below.
For a reasonable training time of the transformer model, access to a GPU is necessary.
Installation
See Optional features and dependencies
Example configuration
This is an example configuration which uses a pre-trained XLM-RoBERTa model.
```ini
[x-transformer-roberta]
name="X-Transformer Roberta-XML"
language=de
backend=xtransformer
analyzer=spacy(de_core_news_lg)
batch_size=32
vocab=gnd
limit=100
min_df=2
ngram=2
max_leaf_size=400
nr_splits=256
Cn=0.52
Cp=5.33
cost_sensitive_ranker=true
threshold=0.015
max_active_matching_labels=500
post_processor=l3-hinge
negative_sampling=tfn+man
ensemble_method=concat-only
loss_function=weighted-squared-hinge
num_train_epochs=5
warmup_steps=200
logging_steps=500
save_steps=500
model_shortcut=FacebookAI/xlm-roberta-base
```
Backend-specific parameters
The parameters are:
| Parameter | Description |
|---|---|
| limit | Maximum number of results to return |
| min_df | How many documents a word must appear in to be considered. Default: 1 |
| ngram | Maximum length of word n-grams. Default: 1 |
| batch_size | Batch size for transformer training. Default: 8 |
| max_leaf_size | The maximum size of each leaf node of the tree. Default: 100 |
| nr_splits | The out-degree of each internal node of the tree. Default: 16 |
| Cn | Penalty parameter for negative examples. Default: 1.0 |
| Cp | Penalty parameter for positive examples. Default: 1.0 |
| cost_sensitive_ranker | If true, use clustering count aggregation for the ranker's cost-sensitive learning. Default: false |
| threshold | Sparsify the final model by eliminating all entries with absolute value smaller than the threshold. Default: 0.1 |
| max_active_matching_labels | Maximum number of active matching labels; will sub-sample from existing negative samples if necessary. Default: None (no limit) |
| post_processor | The post-processor applied to prediction scores. Default: l3-hinge. Options: noop, sigmoid, log-sigmoid, l3-hinge, log-l3-hinge |
| negative_sampling | Negative sampling scheme. Default: tfn. Options: tfn, tfn+man. For more information about the sampling strategies for negative instances, see here |
| ensemble_method | Ensemble method used to generate predictions. Default: transformer-only. Options: transformer-only, concat-only, rank_average, round_robin, average |
| loss_function | Loss function to use. Default: squared-hinge. Options: hinge, squared-hinge, weighted-hinge, weighted-squared-hinge, cross-entropy |
| num_train_epochs | Total number of training epochs to perform. Default: 5 |
| warmup_steps | Number of learning-rate warmup steps. Default: 0 |
| logging_steps | Log training information every NUM update steps. Default: 50 |
| save_steps | Save a checkpoint every NUM update steps. Default: 100 |
| model_shortcut | Shortcut name of the pre-trained model. Default: bert-base-cased |
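To get an intuition for how max_leaf_size and nr_splits shape the partitioned label tree, a rough back-of-envelope sketch (a simplification, not pecos's actual clustering algorithm): the number of leaf clusters is about num_labels / max_leaf_size, and the tree depth grows with the logarithm, base nr_splits, of that count.

```python
import math

def approx_tree_depth(num_labels: int, max_leaf_size: int, nr_splits: int) -> int:
    """Rough depth of a partitioned label tree (illustrative only)."""
    # Number of leaf clusters needed to hold all labels
    leaves = math.ceil(num_labels / max_leaf_size)
    if leaves <= 1:
        return 0
    # How many levels of nr_splits-way branching reach that many leaves
    return math.ceil(math.log(leaves, nr_splits))

# Defaults (max_leaf_size=100, nr_splits=16) vs the example configuration above
print(approx_tree_depth(100_000, 100, 16))   # 3
print(approx_tree_depth(100_000, 400, 256))  # 1
```

With the example configuration's larger leaves and higher out-degree, the tree stays much shallower, which affects both training behavior and prediction speed.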
The min_df parameter controls the features (words/tokens) used to build the model. With the default setting of 1, all the words in the training set will be used, even ones that appear in only one training document. With a higher value such as 5, only those that appear in at least that many documents are included. Increasing the min_df value will decrease the size and training time of the model.
Setting the ngram parameter to 2 makes the vectorizer use 2-grams (bigrams) as well as 1-grams. This may improve the results of the model, but the model will be much larger. When using ngram>1, it usually makes sense to raise min_df above 1, otherwise there may be a huge number of mostly useless features.
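The effect of min_df and ngram can be sketched with scikit-learn's TfidfVectorizer, used here as a stand-in for the backend's internal vectorizer (parameter names and details differ; this is illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "extreme multi label classification",
    "multi label text classification",
    "transformer models for text",
]

# min_df=1 (default): keep every token that appears in any document
v_all = TfidfVectorizer(ngram_range=(1, 1), min_df=1).fit(docs)

# min_df=2: drop tokens that appear in fewer than two documents
v_df2 = TfidfVectorizer(ngram_range=(1, 1), min_df=2).fit(docs)

# ngram=2: use bigrams in addition to unigrams -> larger vocabulary
v_ngram2 = TfidfVectorizer(ngram_range=(1, 2), min_df=1).fit(docs)

print(sorted(v_df2.vocabulary_))  # ['classification', 'label', 'multi', 'text']
print(len(v_all.vocabulary_), len(v_df2.vocabulary_), len(v_ngram2.vocabulary_))  # 8 4 16
```

Raising min_df shrinks the vocabulary (and thus model size and training time), while enabling bigrams roughly doubles it even on this tiny corpus.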
A pre-trained transformer model from Hugging Face can be selected via the model_shortcut parameter. pecos supports the following encoder model types: BERT, RoBERTa, XLM-RoBERTa, XLNet and DistilBERT; the chosen pre-trained model must be of one of these types.
Suitable models can be found using the filter functions on the Hugging Face Hub; for instance, it is possible to filter for models pre-trained on a particular language.
> [!NOTE]
> Transformer models typically have a maximum input length. This means that any text exceeding the maximum length will be truncated.
Hyperparameter Optimization
The optimal choice of hyperparameters depends on the size of the dataset, the number of available labels and their distribution. A good starting point for experiments is to choose the dataset from the original paper that is most similar in terms of label set size and adapt its hyperparameters from there. Hyperparameter configurations for different datasets can also be found in the pecos repository here.
Based on experience, the hyperparameters that define the structure of the partitioned label tree (max_leaf_size and nr_splits) and those that control training of the linear classifiers (Cp, Cn, and threshold) have a drastic impact on performance. Especially influential are the cost of wrongly classified negative examples (Cn) and the cost of wrongly classified positive examples (Cp). Lower values for threshold will generally lead to better results but increase time and memory requirements.
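As a sketch, a hyperparameter experiment might vary only these influential settings while keeping the rest of the project configuration unchanged. The values below are illustrative placeholders, not tuned recommendations:

```ini
[x-transformer-experiment]
name="X-Transformer experiment"
backend=xtransformer
# Tree structure: larger leaves and a higher out-degree give a shallower tree
max_leaf_size=400
nr_splits=256
# Linear classifier penalties: Cn for negative examples, Cp for positive examples
Cn=0.5
Cp=5.0
# A lower threshold keeps more weights: often better results, but more memory
threshold=0.015
```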
We have also found that the choice of pre-trained encoder model matters, especially for data containing non-English text. A good starting point is google-bert/bert-base-multilingual-cased or FacebookAI/xlm-roberta-base.
Usage
Load a vocabulary:
```
annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl
```
Train the model:
```
annif train x-transformer-roberta /path/to/train-corpus
```
Test the model with a single document:
```
cat document.txt | annif suggest x-transformer-roberta
```
Evaluate a directory full of files in full-text document corpus format:
```
annif eval x-transformer-roberta /path/to/test-corpus
```