DRAFT: Backend: XTransformer
> [!NOTE]
> The xtransformer backend is not yet available in released Annif versions; its integration is a work in progress: https://github.com/NatLibFi/Annif/pull/798
The x-transformer backend is a wrapper around pecos. X-Transformer fine-tunes a pre-trained encoder model like BERT for extreme multi-label classification. The dense embeddings of the fine-tuned model are used in conjunction with sparse tf-idf vectors in partitioned-label trees for prediction.
Results can vary drastically depending on the hyperparameter configuration, and careful optimization is often necessary to achieve good results. What works best depends on the size of the dataset, the number of labels, and their distribution. For more information, see the section on Hyperparameter Optimization below.
For a reasonable training time of the transformer model, access to a GPU is necessary.
Installation
See Optional features and dependencies
Example configuration
This is an example configuration which uses a pre-trained XLM-RoBERTa model.
```ini
[x-transformer-roberta]
name="X-Transformer Roberta-XML"
language=de
backend=xtransformer
analyzer=spacy(de_core_news_lg)
batch_size=32
vocab=gnd
limit=100
min_df=2
ngram=2
max_leaf_size=400
nr_splits=256
Cn=0.52
Cp=5.33
cost_sensitive_ranker=true
threshold=0.015
max_active_matching_labels=500
post_processor=l3-hinge
negative_sampling=tfn+man
ensemble_method=concat-only
loss_function=weighted-squared-hinge
num_train_epochs=5
warmup_steps=200
logging_steps=500
save_steps=500
model_shortcut=FacebookAI/xlm-roberta-base
```
Backend-specific parameters
The parameters are:
| Parameter | Description |
|---|---|
| limit | Maximum number of results to return |
| min_df | How many documents a word must appear in to be considered. Default: 1 |
| ngram | Maximum length of word n-grams. Default: 1 |
| batch_size | Batch size for transformer training. Default: 8 |
| max_leaf_size | The maximum size of each leaf node of the tree. Default: 100 |
| nr_splits | The out-degree of each internal node of the tree. Default: 16 |
| Cn | Penalty parameter for negative examples. Default: 1.0 |
| Cp | Penalty parameter for positive examples. Default: 1.0 |
| cost_sensitive_ranker | If true, use clustering count aggregation for the ranker's cost-sensitive learning. Default: false |
| threshold | Sparsify the final model by eliminating all entries with absolute value smaller than the threshold. Default: 0.1 |
| max_active_matching_labels | Maximum number of active matching labels; will sub-sample from existing negative samples if necessary. Default: None (no limit) |
| post_processor | The post-processor applied to prediction scores. Default: l3-hinge. Options: noop, sigmoid, log-sigmoid, l3-hinge, log-l3-hinge |
| negative_sampling | Negative sampling scheme. Default: tfn. Options: tfn, tfn+man. For more information about the sampling strategies for negative instances, see here |
| ensemble_method | Ensemble method used to generate predictions. Default: transformer-only. Options: transformer-only, concat-only, rank_average, round_robin, average |
| loss_function | Loss function to use. Default: squared-hinge. Options: hinge, squared-hinge, weighted-hinge, weighted-squared-hinge, cross-entropy |
| num_train_epochs | Total number of training epochs to perform. Default: 5 |
| warmup_steps | Number of learning-rate warmup steps. Default: 0 |
| logging_steps | Log training information every NUM update steps. Default: 50 |
| save_steps | Save a checkpoint every NUM update steps. Default: 100 |
| model_shortcut | Shortcut name of the pre-trained model. Default: bert-base-cased |
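To get an intuition for how max_leaf_size and nr_splits shape the partitioned label tree, a rough back-of-envelope sketch (a simplification, not pecos's actual clustering algorithm): the number of leaf clusters is about num_labels / max_leaf_size, and the tree depth grows with the logarithm, base nr_splits, of that count.

```python
import math

def approx_tree_depth(num_labels: int, max_leaf_size: int, nr_splits: int) -> int:
    """Rough depth of a partitioned label tree (illustrative only)."""
    # Number of leaf clusters needed to hold all labels
    leaves = math.ceil(num_labels / max_leaf_size)
    if leaves <= 1:
        return 0
    # How many levels of nr_splits-way branching reach that many leaves
    return math.ceil(math.log(leaves, nr_splits))

# Defaults (max_leaf_size=100, nr_splits=16) vs the example configuration above
print(approx_tree_depth(100_000, 100, 16))   # 3
print(approx_tree_depth(100_000, 400, 256))  # 1
```

With the example configuration's larger leaves and higher out-degree, the tree stays much shallower, which affects both training behavior and prediction speed.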
The min_df parameter controls the features (words/tokens) used to build the model. With the default setting of 1, all the words in the training set will be used, even ones that appear in only one training document. With a higher value such as 5, only those that appear in at least that many documents are included. Increasing the min_df value will decrease the size and training time of the model.
Setting the ngram parameter to 2 makes the vectorizer use 2-grams (bigrams) as well as 1-grams. This may improve the results of the model, but the model will be much larger. When using ngram>1, it usually makes sense to raise min_df above 1, otherwise there may be a huge number of mostly useless features.
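The effect of min_df and ngram can be sketched with scikit-learn's TfidfVectorizer, used here as a stand-in for the backend's internal vectorizer (parameter names and details differ; this is illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "extreme multi label classification",
    "multi label text classification",
    "transformer models for text",
]

# min_df=1 (default): keep every token that appears in any document
v_all = TfidfVectorizer(ngram_range=(1, 1), min_df=1).fit(docs)

# min_df=2: drop tokens that appear in fewer than two documents
v_df2 = TfidfVectorizer(ngram_range=(1, 1), min_df=2).fit(docs)

# ngram=2: use bigrams in addition to unigrams -> larger vocabulary
v_ngram2 = TfidfVectorizer(ngram_range=(1, 2), min_df=1).fit(docs)

print(sorted(v_df2.vocabulary_))  # ['classification', 'label', 'multi', 'text']
print(len(v_all.vocabulary_), len(v_df2.vocabulary_), len(v_ngram2.vocabulary_))  # 8 4 16
```

Raising min_df shrinks the vocabulary (and thus model size and training time), while enabling bigrams roughly doubles it even on this tiny corpus.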
A pre-trained transformer model from Hugging Face can be selected via the model_shortcut parameter. pecos supports the following encoder model types: BERT, RoBERTa, XLM-RoBERTa, XLNet and DistilBERT; the chosen pre-trained model must be of one of these types.
Suitable models can be found using the filter functions on the Hugging Face Hub; for instance, it is possible to filter for models pre-trained on a particular language.
> [!NOTE]
> Transformer models typically have a maximum input length. This means that any text exceeding the maximum length will be truncated.
Hyperparameter Optimization
The optimal choice of hyperparameters depends on the size of the dataset, the number of available labels and their distribution. A good starting point for experiments is to choose the dataset from the original paper that is most similar in terms of label set size and adapt its hyperparameters from there. Hyperparameter configurations for different datasets can also be found in the pecos repository here.
Based on experience, the hyperparameters that define the structure of the partitioned label tree (max_leaf_size and nr_splits) and those that control training of the linear classifiers (Cp, Cn, and threshold) have a drastic impact on performance. Especially influential are the cost of wrongly classified negative examples (Cn) and the cost of wrongly classified positive examples (Cp). Lower values for threshold will generally lead to better results but increase time and memory requirements.
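As a sketch, a hyperparameter experiment might vary only these influential settings while keeping the rest of the project configuration unchanged. The values below are illustrative placeholders, not tuned recommendations:

```ini
[x-transformer-experiment]
name="X-Transformer experiment"
backend=xtransformer
# Tree structure: larger leaves and a higher out-degree give a shallower tree
max_leaf_size=400
nr_splits=256
# Linear classifier penalties: Cn for negative examples, Cp for positive examples
Cn=0.5
Cp=5.0
# A lower threshold keeps more weights: often better results, but more memory
threshold=0.015
```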
We have also found that the choice of pre-trained encoder model matters, especially for data containing non-English text. A good starting point is google-bert/bert-base-multilingual-cased or FacebookAI/xlm-roberta-base.
Usage
Load a vocabulary:
```
annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl
```
Train the model:
```
annif train x-transformer-roberta /path/to/train-corpus
```
Test the model with a single document:
```
cat document.txt | annif suggest x-transformer-roberta
```
Evaluate a directory full of files in full-text document corpus format:
```
annif eval x-transformer-roberta /path/to/test-corpus
```