Setting up ML backend for Label Studio - lmmx/devnotes GitHub Wiki

Table of contents

Background links

Requirements

A prerequisite for setting up an ML backend for Label Studio is Docker Compose, which in turn requires Docker Engine.

See Installing Docker Compose and Installing Docker Engine

(Thankfully the standard way to install Engine also installs Compose)

Deploying an example backend

To set up the example used in the repo's docs, run the following:

git clone https://github.com/heartexlabs/label-studio-ml-backend
cd label-studio-ml-backend/label_studio_ml/examples/simple_text_classifier
docker compose up

If you installed Docker via a different route you may have docker-compose (with a hyphen) as the command instead.

Here's the project structure:

label-studio-ml-backend/label_studio_ml/examples/simple_text_classifier $ tree .

.
├── data
│   ├── redis
│   └── server
│       └── models
├── docker-compose.yml
├── Dockerfile
├── logs
├── README.md
├── requirements.txt
├── simple_text_classifier.py
└── _wsgi.py

5 directories, 6 files

There's a data directory with a subdirectory for each of the services, redis and server (with a subdirectory models), and a logs directory.

  • In fact these are all created when the service starts: notice they're not checked into the repo

Compose specification

The docker-compose.yml Compose specification file specifies the service names and where to mount these directories, and the port number for the server service:

  • The redis service:
    • mounts ./data/redis as /data
    • also names its container and hostname redis
  • The server service:
    • mounts ./data/server as /data
    • mounts ./logs as /tmp
    • sets environment variables including MODEL_DIR as /data/models and an API key
    • defines a network link to the container in the redis service, thereby determining the order of service startup.
    • specifies that it depends_on the redis service, again determining order of service startup and shutdown.
version: "3.8"

services:
  redis:
    image: redis:alpine
    container_name: redis
    hostname: redis
    volumes:
      - "./data/redis:/data"
    expose:
      - 6379
  server:
    container_name: server
    build: .
    environment:
      - MODEL_DIR=/data/models
      - RQ_QUEUE_NAME=default
      - REDIS_HOST=redis
      - REDIS_PORT=6379
      - LABEL_STUDIO_ML_BACKEND_V2=true
      - LABEL_STUDIO_HOSTNAME=http://localhost:8000
      - LABEL_STUDIO_API_KEY=d6f8a2622d39e9d89ff0dfef1a80ad877f4ee9e3
    ports:
      - "9090:9090"
    depends_on:
      - redis
    links:
      - redis
    volumes:
      - "./data/server:/data"
      - "./logs:/tmp"
  • The LABEL_STUDIO_API_KEY is specific to this backend (see grep.app)
  • The YAML specifies 2 services:
    • one named redis (which is spun up from the redis:alpine image and binds the ./data/redis)
    • one named server (which depends on the one named redis)

WSGI and model modules

The obvious question here is how ./data/server/models got made (see the directory tree above). Since /data/models is passed in as the environment variable MODEL_DIR, it seems likely that either _wsgi.py or simple_text_classifier.py creates it when the service starts.

_wsgi.py

A 126-line module that looks in its directory for a config.json (note: not used) and exposes a CLI parser that reads a kwarg config of parameters, as well as port (from the env. var.), host, debug flag, log level, and model dir (defaulting to the file's dir). When run from the command line it initialises the app via label_studio_ml.api.init_app and runs it; when imported instead, it goes straight to initialising the app with the args from the environment but does not run it.

Note: the app is initialised by a module-level singleton LabelStudioMLManager instance, _manager, passing the model_class as the SimpleTextClassifier class imported from the simple_text_classifier.py module. The _server object returned here as app is a Flask app, again a module-level singleton instance. Importantly, the SimpleTextClassifier class gets bound to the model_class of the LabelStudioMLManager and ends up stored inside its _current_model attribute (a dict).
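
As a rough sketch (not the repo's exact code, and assuming init_app accepts the kwargs shown), the module boils down to something like:

import os

from label_studio_ml.api import init_app
from simple_text_classifier import SimpleTextClassifier

# init_app binds SimpleTextClassifier as the manager's model_class and returns
# the module-level Flask app (_server) from label_studio_ml.api
app = init_app(
    model_class=SimpleTextClassifier,
    model_dir=os.environ.get("MODEL_DIR", os.path.dirname(__file__)),
)

if __name__ == "__main__":
    # the real CLI parser also handles host, debug flag, log level, etc.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 9090)))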

simple_text_classifier.py

A 161-line module that ensures an API key is set as an env. var., then defines the SimpleTextClassifier class, which subclasses LabelStudioMLBase and uses some imported sklearn helpers (LogisticRegression, TfidfVectorizer, make_pipeline). The class has only a few methods:

  • __init__:

    • checks self.parsed_label_config is [a dict] of length 1 (this comes from the base class), and sets self.from_name and self.info from its key/value.
      • The base class sets this attribute from parse_config(self.label_config) if self.label_config else {}.
      • The base class's __init__ signature is self, label_config=None, train_output=None, **kwargs. My understanding here is that Label Studio sends a POST request (handled by the Flask app) that includes the project config, among which is the labelling config set up in the UI, and that if this isn't a text classification config then initialising this backend will fail.
    • checks that the config's type (now in self.info) value is "Choices" (i.e. it's for classification).
    • checks that the config's to_name and inputs are length 1 (i.e. the model has just 1 input), and that the type of the input is "Text"
    • sets self.to_name from the config's to_name
    • sets self.value from the config's first and only inputs value
  • reset_model creates the following simple 2-step model:

    self.model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 3), token_pattern=r"(?u)\b\w\w+\b|\w"),
        LogisticRegression(C=10, verbose=True)
    )
  • predict gets the input text from the data of each of the tasks (passed in as an argument), runs self.model.predict_proba() on them, takes the argmax to get the predicted label indices and scores, then pairs each predicted label with its score in a result dict, one per task.

    for idx, score in zip(predicted_label_indices, predicted_scores):
        predicted_label = self.labels[idx]
        # prediction result for the single task
        result = [{
            'from_name': self.from_name,
            'to_name': self.to_name,
            'type': 'choices',
            'value': {'choices': [predicted_label]}
        }]
    
        # expand predictions with their scores for all tasks
        predictions.append({'result': result, 'score': score})
  • _get_annotated_dataset takes a project_id and is used to support webhook-based training workflows (in fit). It is marked "just for demo purposes", and uses the API key to authenticate a request to localhost/api/projects/{project_id}/export to retrieve data annotations for the project.

    • N.B. this is why the API key is needed.
  • fit takes annotations (renamed internally to tasks) and a workdir (not used if the MODEL_DIR env. var. is set), builds a list of input_texts by accessing ["data"].get(self.value) on each of the tasks, calls reset_model and then self.model.fit with the input_texts and output_labels_idx, and returns a dict of labels (the sorted set of output labels coerced to a list) and model_file (the pickled model). A condensed sketch follows after this list.

    • The optional kwarg data can be used to override annotations, as tasks = _get_annotated_dataset(data["project"]["id"]).
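
Here's a condensed, hedged sketch of that fit flow (the exact nesting of the annotation dicts and the model_file path handling are simplified relative to the real module):

import os
import pickle

def fit(self, annotations, workdir=None, **kwargs):
    # webhook-triggered training: re-fetch the project's annotations via the API
    if kwargs.get("data"):
        tasks = self._get_annotated_dataset(kwargs["data"]["project"]["id"])
    else:
        tasks = annotations

    input_texts, output_labels = [], []
    for task in tasks:
        input_texts.append(task["data"].get(self.value))
        # label extraction path simplified; see the BERT comparison further down
        output_labels.append(task["annotations"][0]["result"][0]["value"]["choices"][0])

    labels = sorted(set(output_labels))
    output_labels_idx = [labels.index(label) for label in output_labels]

    self.reset_model()
    self.model.fit(input_texts, output_labels_idx)

    model_file = os.path.join(os.getenv("MODEL_DIR", workdir or "."), "model.pkl")
    with open(model_file, "wb") as f:
        pickle.dump(self.model, f)
    return {"labels": labels, "model_file": model_file}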

Testing the service

The README for this backend says to run curl http://localhost:9090/health to check the service is running OK, and indeed it returns some JSON:

{"model_dir":"/data/models","status":"UP","v2":"true"}

If we search through the repo for the word health we find that this is a Flask route defined in label_studio_ml/api.py

@_server.route('/health', methods=['GET'])
@_server.route('/', methods=['GET'])
@exception_handler
def health():
    return jsonify({
        'status': 'UP',
        'model_dir': _manager.model_dir,
        'v2': os.getenv('LABEL_STUDIO_ML_BACKEND_V2', default=LABEL_STUDIO_ML_BACKEND_V2_DEFAULT)
    })

Reading this also tells us that we can curl http://localhost:9090/ (the / route) and get the same output. These are bound to our deployed app because when we initialised it we got the _server module-level singleton object defined in this api.py module.

Another thing to notice here about this route funcdef is that it takes no arguments (pretty standard for a GET request), and only uses environment variables and the _manager (module-level global variable).

Compare this to the POST request route for _predict:

@_server.route('/predict', methods=['POST'])
@exception_handler
def _predict():
    data = request.json
    tasks = data.get('tasks')
    project = data.get('project')
    label_config = data.get('label_config')
    force_reload = data.get('force_reload', False)
    try_fetch = data.get('try_fetch', True)
    params = data.get('params') or {}
    predictions, model = _manager.predict(
        tasks, project, label_config, force_reload, try_fetch, **params
    )
    response = {
        'results': predictions,
        'model_version': model.model_version
    }
    return jsonify(response)

Here there is the implicit request object, which Flask provides to the route when it is called.

It's known as a 'context' in Flask's docs, implemented in werkzeug as a context local. I don't really know much about the implementation details here, other than that you access the .json attribute and then you're just working with a regular dict (similar to locals()).

But how does this all work together? How can we test the /predict route? We can't just send a plain POST request:

curl --header "Content-Type: application/json" --request POST --data '{}' http://localhost:9090/predict

We hit an exception in the LabelStudioMLManager.predict class method, which receives the empty data, and we're told the model is not loaded:

    @classmethod
    def predict(
        cls, tasks, project=None, label_config=None, force_reload=False, try_fetch=True, **kwargs
    ):
        if not os.getenv('LABEL_STUDIO_ML_BACKEND_V2', default=LABEL_STUDIO_ML_BACKEND_V2_DEFAULT):
            if try_fetch:
                m = cls.fetch(project, label_config, force_reload)
            else:
                m = cls.get(project)
                if not m:
                    raise FileNotFoundError('No model loaded. Specify "try_fetch=True" option.')
            predictions = m.model.predict(tasks, **kwargs)
            return predictions, m

        if not cls._current_model:
            raise ValueError(f'Model is not loaded for {cls.__class__.__name__}: run setup() before using predict()')

        predictions = cls._current_model.model.predict(tasks, **kwargs)
        return predictions, cls._current_model

In other words, you should [let the UI] set this backend up before trying to decipher the inner workings any deeper.
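
For reference, once a project is connected, a populated /predict request would presumably look something like the following (a hedged sketch: the top-level keys come from the route above, but the task and config contents shown are purely illustrative):

import requests

payload = {
    # keys read by the _predict route shown earlier
    "tasks": [{"data": {"text": "An example text to classify"}}],  # illustrative task
    "project": "1",                       # illustrative project identifier
    "label_config": "<View>...</View>",   # the project's labelling config XML
    "force_reload": False,
    "try_fetch": True,
    "params": {},
}
response = requests.post("http://localhost:9090/predict", json=payload)
print(response.json())  # expect {"results": [...], "model_version": ...}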

Backend specification

I don't want a text classifier like this though, I want a bounding box predictor (object detection model). This one doesn't tick all the boxes for my needs, which are:

  • A HuggingFace model (the repo does have HuggingFace examples, e.g. using transformers.AutoTokenizer and transformers.AutoModelForCausalLM). These will be useful for figuring out what to put in the predict method of the model class in my ML backend.
  • An image model (not too complicated: similar to a language model, it'll use a tokenizer/processor and a model). This will be useful for figuring out how to pass the image data into the model (which has some requirements that get validated in the model's __init__ method).

It's clear that of these two, the first priority should be to find another object detection Label Studio backend, so I'd be able to look at the assertions made in its equivalent of the simple_text_classifier's SimpleTextClassifier.__init__() method.

Detectron2 example

I searched for the name of the base class LabelStudioMLBase on the code search site grep.app (here) and indeed I landed on an image model, Detectron2, well known for semantic segmentation (which is closer to my bbox object detection goal, though I expected it to output pixel-level masks).

Edit: in fact it is giving bboxes: in the code excerpt below, result type is "rectanglelabels".

Edit 2: it turned out I overlooked one right under my nose: mmdetection contains an object detection example in this repo!

After some further digging, this turned out to be one of the LayoutParser developers' personal copy of code that would go on to become part of the official LayoutParser annotation service.

This is much closer to what I'm aiming for; in fact it's the exact same task. However, I want to use the LayoutLMv3 model on HuggingFace, whereas this example (obviously) is using a layoutparser model (specifically lp.Detectron2LayoutModel).

From this example however, it's clear what function signatures we should aim for in an object detection API:

class ObjectDetectionAPI(LabelStudioMLBase):
    def __init__(self, freeze_extractor=False, **kwargs):
        ...

    def predict(self, tasks, **kwargs):
        image_urls = [task["data"][self.value] for task in tasks]
        images = [load_image_from_url(url) for url in image_urls]
        layouts = [self.model.detect(image) for image in images]
        predictions = []
        for image, layout in zip(images, layouts):
            height, width = image.shape[:2]
            result = [
                {
                    "from_name": self.from_name,
                    "to_name": self.to_name,
                    "original_height": height,
                    "original_width": width,
                    "source": "$image",
                    "type": "rectanglelabels",
                    "value": convert_block_to_value(block, height, width),
                }
                for block in layout
            ]
            predictions.append({"result": result})
        return predictions

    def fit(self, completions, workdir=None, batch_size=32, num_epochs=10, **kwargs):
        image_urls, image_classes = [], []
        print("Collecting completions...")
        # for completion in completions:
        #     if is_skipped(completion):
        #         continue
        #     image_urls.append(completion['data'][self.value])
        #     image_classes.append(get_choice(completion))
        print("Creating dataset...")
        # dataset = ImageClassifierDataset(image_urls, image_classes)
        # dataloader = DataLoader(dataset, shuffle=True, batch_size=batch_size)
        print("Train model...")
        # self.reset_model()
        # self.model.train(dataloader, num_epochs=num_epochs)
        print("Save model...")
        # model_path = os.path.join(workdir, 'model.pt')
        # self.model.save(model_path)
        return {"model_path": None, "classes": None}

It's pretty clear here that this is a work in progress (i.e. all the commented-out code). After getting to grips with how Label Studio backends work, I'm fairly certain that the training API isn't operational; the prediction service looks like it could be, though.

The commented out line dataset = ImageClassifierDataset(image_urls, image_classes) caught my attention, as it suggests that this was building on prior work. Indeed, searching on grep.app shows that this name comes from the label-studio-ml-backend repo:

One thing I like about the code at this link is that it has nicely contained methods. (To be covered below: the HuggingFace example has quite a messy fit method).

It's a pretty simple PyTorch dataset, but I'm not personally going to use URLs for my data, so it's not quite aligned to my needs.

I expect I'm going to use something more like this example by Niels Rogge (MLE at HuggingFace):

from torch.utils.data import Dataset
from PIL import Image

class CustomDataset(Dataset):
     def __init__(self, root, df, processor):
          self.root = root
          self.df = df
          self.processor = processor

     def __getitem__(self, idx):
          # get document image + corresponding words and boxes
          item = self.df.iloc[idx]
          image = Image.open(self.root + ...).convert('RGB')
          words = item.words
          boxes = item.boxes

          # use processor to prepare everything for the model
          encoding = self.processor(image, words, boxes=boxes)

          return encoding

This is just a draft, assuming you have a root folder with all your document images, and a Pandas dataframe that contains the words + boxes for each document image.

You can then instantiate the dataset as follows:

from transformers import LayoutLMv3Processor

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")

dataset = CustomDataset(root="path_to_your_root", df="your_dataframe", processor=processor)

Note that in the tutorial notebook it's clarified what the processor is:

Next, we prepare the dataset for the model. This can be done very easily using LayoutLMv3Processor, which internally wraps a LayoutLMv3FeatureExtractor (for the image modality) and a LayoutLMv3Tokenizer (for the text modality) into one.

Back to the code at hand though! (There's not much to say)

The result here would be a good use case for a typing.TypedDict as the keys will always be the same.
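
For example, here's a sketch of what that TypedDict could look like (keys taken from the result dicts above and from convert_block_to_value shown below):

from typing import List, TypedDict

class RectangleValue(TypedDict):
    # keys returned by convert_block_to_value (shown below)
    x: float
    y: float
    width: float
    height: float
    rotation: int
    rectanglelabels: List[str]
    score: float

class RectangleResult(TypedDict):
    # keys built per block in ObjectDetectionAPI.predict
    from_name: str
    to_name: str
    original_height: int
    original_width: int
    source: str
    type: str
    value: RectangleValue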

Note here that convert_block_to_value(block, image_height, image_width) returns:

{
    "height": block.height / image_height * 100,
    "rectanglelabels": [str(block.type)],
    "rotation": 0,
    "width": block.width / image_width * 100,
    "x": block.coordinates[0] / image_width * 100,
    "y": block.coordinates[1] / image_height * 100,
    "score": block.score,
}

...and block is a single object from layouts, which is a list returned from self.model.detect(image) where as already stated, the model is Detectron2LayoutModel from layoutparser (source here).

If we keep digging, we see the detect method returns what it gets from running the gather_output method on the output of calling self.model. Setting the model aside, the "gathering" involves creating Layout objects (from the lp.elements.layout module) and putting TextBlock objects in them, each populated with a block argument made of a Rectangle, both from the lp.elements.layout_elements module.

The Rectangle is a "manual dataclass" made of x_1, y_1, x_2, y_2 (or 'Lord of the Rings Bilbo' notation as I remember it: LT,RB).

So that's what a block is as it gets iterated over in the predict method of the ObjectDetectionAPI class (which subclasses LabelStudioMLBase), and therefore we can interpret the values: "x" and "y" are the bbox's left and top coordinates as percentages of the image width and height, while the bbox's "width" and "height" are likewise percentages of the image's width and height.
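
To make the conversion concrete, here's a small helper of my own (a sketch, assuming block.width and block.height are the pixel extents of the Rectangle) going from absolute pixel coordinates to Label Studio's percentage-based value dict:

def pixel_bbox_to_value(x1, y1, x2, y2, image_width, image_height, label, score):
    """Convert an absolute-pixel (left, top, right, bottom) bbox to the
    percentage-based dict expected under "value" for rectanglelabels."""
    return {
        "x": x1 / image_width * 100,
        "y": y1 / image_height * 100,
        "width": (x2 - x1) / image_width * 100,
        "height": (y2 - y1) / image_height * 100,
        "rotation": 0,
        "rectanglelabels": [str(label)],
        "score": score,
    }

# e.g. a 200x100 px box at (50, 40) in a 1000x800 px image:
# pixel_bbox_to_value(50, 40, 250, 140, 1000, 800, "Car", 0.9)
# -> x=5.0, y=5.0, width=20.0, height=12.5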

The block "score" comes from the model, it's difficult to look at the model directly in the code due to the weird metaprogramming approach used (it comes from another package fvcore, fvcore.common.registry, via detectron2's registry), which instantiates a module-wide 'registry' of architectures which get recorded through the @META_ARCH_REGISTRY.register() decorator (see search results here).

The block "type" is passed as the label of predicted classes list (pred_classes.tolist()).

MMDetection example

I missed the OpenMMLab MMDetection toolbox example backend at first, perhaps because it has such a simple class structure: it only has __init__ and predict methods (as well as a _get_image_url helper method). It's not trainable through the Label Studio interface; you just load trained checkpoints from file.

This one's a bit unusual: it's the only one I've seen here that asks to specify device in the class __init__ signature (defaulting to "cpu").

It also loads labels from a file (the other backends populate their labels attribute from the labels value in the info attribute that comes from the parsed_label_config dict's value).

Instead, the parsed_label_config dict's first value is assigned to schema which looks like it's also a dict, with yet more dicts nested inside... (Type annotations would be valuable here!)

(
    self.from_name,
    self.to_name,
    self.value,
    self.labels_in_config,
) = get_single_tag_keys(self.parsed_label_config, "RectangleLabels", "Image")
schema = list(self.parsed_label_config.values())[0]
self.labels_in_config = set(self.labels_in_config)

# Collect label maps from `predicted_values="airplane,car"` attribute
# in <Label> tag
self.labels_attrs = schema.get("labels_attrs")
if self.labels_attrs:
    for label_name, label_attrs in self.labels_attrs.items():
        for predicted_value in label_attrs.get("predicted_values", "").split(
            ","
        ):
            self.label_map[predicted_value] = label_name

That get_single_tag_keys function is from the label_studio_ml.utils module:

def get_single_tag_keys(parsed_label_config, control_type, object_type):
    """
    Gets parsed label config, and returns data keys related to the single
    control tag and the single object tag schema
    (e.g. one "Choices" with one "Text")

    :param parsed_label_config: parsed label config returned by
                                "label_studio.misc.parse_config" function
    :param control_type: control tag str as it written in label config
                         (e.g. 'Choices')
    :param object_type: object tag str as it written in label config
                        (e.g. 'Text')
    :return: 3 string keys and 1 array of string labels:
             (from_name, to_name, value, labels)
    """
    assert len(parsed_label_config) == 1
    from_name, info = list(parsed_label_config.items())[0]
    assert info["type"] == control_type, (
        'Label config has control tag "<'
        + info["type"]
        + '>" but "<'
        + control_type
        + '>" is expected for this model.'
    )  # noqa

    assert len(info["to_name"]) == 1
    assert len(info["inputs"]) == 1
    assert info["inputs"][0]["type"] == object_type
    to_name = info["to_name"][0]
    value = info["inputs"][0]["value"]
    return from_name, to_name, value, info["labels"]

As well as a 'getter', this is a helper validating the parsed_label_config!

The control_type appears to be more like the "type of task label" (a classification task has the "Choices" type, for example), which is "RectangleLabels" here because the labels are bboxes; the object_type is more like the "modality of the task" (a text classifier has the "Text" type, for example), which is "Image" here because the objects are detected in images.
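
To make that concrete, a parsed_label_config that passes those checks would presumably look something like this (illustrative tag names; the real dict comes from parsing the project's labelling config XML):

parsed_label_config = {
    "label": {                      # the single control tag -> from_name
        "type": "RectangleLabels",  # control_type checked by the helper
        "to_name": ["image"],       # exactly one object tag -> to_name
        "inputs": [
            {"type": "Image", "value": "image"},  # object_type and value
        ],
        "labels": ["Title", "Text", "Figure", "Table"],  # illustrative labels
    }
}

from_name, to_name, value, labels = get_single_tag_keys(
    parsed_label_config, "RectangleLabels", "Image"
)
# from_name == "label", to_name == "image", value == "image"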

HuggingFace Transformers backends

I don't just want an image inputs/rectangular labels ML backend though, I specifically want to use HuggingFace's Transformers library to load my model and then make predictions with it.

If we hop into the examples directory and search for the transformers import statement we can see what's been demo'd:

grep -r --include \*.py transformers

huggingface/gpt.py:from transformers import AutoTokenizer, AutoModelForCausalLM
bert/bert_classifier.py:from transformers import BertTokenizer, BertForSequenceClassification
bert/bert_classifier.py:from transformers import AdamW, get_linear_schedule_with_warmup
ner/ner.py:from transformers import (
ner/ner.py:from transformers import AdamW, get_linear_schedule_with_warmup
electra/electra.py:from transformers import ElectraTokenizerFast, ElectraForSequenceClassification
electra/electra.py:from transformers import Trainer
electra/electra.py:from transformers import TrainingArguments

So that's GPT, BERT, and Electra, all (text) language models. The ner directory likewise obviously contains language models (for named entity recognition: BERT, Roberta, DistilBert, CamemBert).

Note from the imported names that bert is the first in this list that does classification like the simple_text_classifier backend we saw above (BertForSequenceClassification). This seems like a good example to compare to (it should otherwise be similar to simple_text_classifier).

BERT backend

The only difference between the two Docker Compose specs (bert and simple_text_classifier) is that the BERT example does not specify any of the Label Studio-related env. vars:

diff simple_text_classifier/docker-compose.yml bert/docker-compose.yml

20,22d19
<       - LABEL_STUDIO_ML_BACKEND_V2=true
<       - LABEL_STUDIO_HOSTNAME=http://localhost:8000
<       - LABEL_STUDIO_API_KEY=d6f8a2622d39e9d89ff0dfef1a80ad877f4ee9e3

The hostname and API key are used to GET data annotations from the Label Studio API [locally], and it turned out these aren't set up for this backend, hence the env. vars not being set.

Let's now look at the bert_classifier.py module. The class BertClassifier defines nearly all of the same methods as SimpleTextClassifier. It doesn't define _get_annotated_dataset, which I interpret as meaning this model does not support webhook-based training workflows (unconfirmed).

The rest:

  • __init__ is identical but now has a few extra lines after the base class gets initialised (assigning attributes that were added to the previously blank self, **kwargs signature, all of which have defaults):
            self.pretrained_model = pretrained_model
            self.maxlen = maxlen
            self.batch_size = batch_size
            self.num_epochs = num_epochs
            self.logging_steps = logging_steps
            self.train_logs = train_logs
    • Not quite identical at the end either, where the pickle loading is replaced by from_pretrained loading of the model saved with save_pretrained in HuggingFace.
  • reset_model is used to set up an initial model, but rather than the sklearn pipeline in SimpleTextClassifier, it's the model loaded from_pretrained again.
  • predict does a few things differently:
    • First off, it won't return anything if the tokenizer attribute wasn't set by running the load() method in the __init__ method (by passing the truthiness check on self.train_output, which gets set in the base class when the train_output kwarg is passed).
    • Rather than just iterating over the tasks and sticking task["data"].get(self.value) into a list of input_texts, a proper dataloader is used (it's cooked up in the utils.prepare_texts function), and iterating over it gives input IDs and attention masks which are moved to the appropriate device upon being dataloaded.
    • After dataloading, model inference runs in a torch.no_grad block (which disables the gradient calculation), and then after this block the resulting logits are detached from the graph, and put back on the CPU.
    • The scores and labels are assigned more neatly than in the sklearn model. The predicted label is listed directly rather than waiting to zip the argmax index against the score and look up the label just before building the result dict.
  • fit takes completions rather than annotations as its argument; the annotations are nested inside each completion. Compare the label extraction in simple_text_classifier vs. bert_classifier; they're clearly equivalent:
    output_label = annotation['result'][0]['value']['choices'][0]
    output_label = completion['annotations'][0]['result'][0]['value']['choices'][0]
    After that, there's a ton more that goes on (whereas the sklearn backend's fit method abstracted it all away into the sklearn model.fit call). This is not particularly worth going through: neural net training loop, with logs, tqdm'd dataloader, model.train() call, loss backprop., early stopping.

...it also defines

  • a load method, which loads the pretrained model, and overwrites some of the attributes from the restored model over attributes defined at __init__ (namely: batch_size, labels, maxlen).
  • a not_trained property, which relies on checking if the self.tokenizer attribute has been set (it gets set in the load method).
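
Something along these lines (a paraphrased sketch, not the exact source):

    @property
    def not_trained(self) -> bool:
        # the backend counts as untrained until load() has set the tokenizer
        return not getattr(self, "tokenizer", None)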

Electra backend

The first thing notable in the electra backend is that its _wsgi.py is near identical to the bert directory's: the BertClassifier [subclass of LabelStudioMLBase] is just replaced with ElectraTextClassifier.

The electra.py module is simpler, however: 145 lines to the BERT module's 221. Right from the start you can see it has fewer imports.

On closer inspection this is because the Electra model is trained with the HuggingFace Trainer class, so other than transformers, the only libraries loaded in the module are requests and json!

  • See its source code for details on what is abstracted away into this class or click through to particular sections from the docs
diff <(grep import examples/bert/bert_classifier.py) <(grep import examples/electra/electra.py)

2c2,3
< import numpy as np
---
> import requests
> import json
4,10c5,7
< from torch.utils.data import SequentialSampler
< from tqdm import tqdm, trange
< from collections import deque
< from tensorboardX import SummaryWriter
< from transformers import BertTokenizer, BertForSequenceClassification
< from transformers import AdamW, get_linear_schedule_with_warmup
< from torch.utils.data import TensorDataset, DataLoader, RandomSampler
---
> from transformers import ElectraTokenizerFast, ElectraForSequenceClassification
> from transformers import Trainer
> from transformers import TrainingArguments
12c9
< from utils import prepare_texts, calc_slope
---
> from label_studio_tools.core.label_config import parse_config
  • The __init__ method is conspicuously lacking any assert statements (the other 2 examples had checks for the config's inputs value being length 1, i.e. for single labels in annotations). It seems to just rely on it implicitly however, and behaves the same.

    self.value = self.info["inputs"][0]["value"]
  • The fit method is shrunk back down closer to the simple_text_classifier backend, after being crammed full of training loop logic in the BERT backend.

  • There's no load method (which in the BERT backend is what set self.tokenizer); here self.tokenizer gets set in the __init__ method.

  • There is a load_config method, but this is used to initialise the parsed_label_config if fit is called before that's set. It's set when the base class initialises but can be {} if no config is passed (i.e. if the ElectraTextClassifier class isn't passed a label_config kwarg on init).

  • The predict method has the nice neat HuggingFace style of predictions (seen in the BERT example) but keeps the argmax label index lookup seen in the simple_text_classifier's sklearn code. This is the best of both worlds (a sketch of this style follows below).

  • The _get_annotated_dataset method is back, and handles the 'webhook' events with the API key (though the API key is hardcoded here, rather than set as an env. var. in the Docker Compose spec as done in the simple_text_classifier).

  • There is also a new _get_text_from_s3 method which I don't need.

It also includes a CustomDataset class similar to the draft above.
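
As promised above, here's a rough sketch of that 'best of both worlds' predict style (paraphrased; the tokenizer kwargs are illustrative rather than the module's exact ones):

import torch

def predict(self, tasks, **kwargs):
    input_texts = [task["data"].get(self.value) for task in tasks]
    encoding = self.tokenizer(
        input_texts, padding=True, truncation=True, return_tensors="pt"
    )
    with torch.no_grad():
        logits = self.model(**encoding).logits
    scores, label_idx = torch.softmax(logits, dim=-1).max(dim=-1)

    predictions = []
    for idx, score in zip(label_idx.tolist(), scores.tolist()):
        result = [{
            "from_name": self.from_name,
            "to_name": self.to_name,
            "type": "choices",
            "value": {"choices": [self.labels[idx]]},
        }]
        predictions.append({"result": result, "score": float(score)})
    return predictions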

Model loading in all backends

To step aside and review how each of the models is loaded and what that entails for the resulting backend capabilities (we can put a checkbox beside them to indicate if they are compatible with local training):

grep -r --include \*.py "self.model ="

label_studio_ml/examples/flair/ner_ml_backend.py:            self.model = self.load(self.train_output["base_path"])
label_studio_ml/examples/huggingface/gpt.py:        self.model = AutoModelForCausalLM.from_pretrained(self.model_name)
label_studio_ml/examples/mmdetection/mmdetection.py:        self.model = init_detector(config_file, checkpoint_file, device=device)
label_studio_ml/examples/bert/bert_classifier.py:        self.model = BertForSequenceClassification.from_pretrained(pretrained_model)
label_studio_ml/examples/nemo/asr.py:        self.model = nemo_asr.models.EncDecCTCModel.from_pretrained(
label_studio_ml/examples/tensorflow/mobilenet_finetune.py:        self.model = tf.keras.Sequential(
label_studio_ml/examples/simple_text_classifier/simple_text_classifier.py:                self.model = pickle.load(f)
label_studio_ml/examples/simple_text_classifier/simple_text_classifier.py:        self.model = make_pipeline(
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py:        self.model = models.resnet18(pretrained=True)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py:        self.model = self.model.to(device)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py:            self.model = ImageClassifier(len(self.classes), freeze_extractor)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py:            self.model = ImageClassifier(len(self.classes), freeze_extractor)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py:        self.model = ImageClassifier(len(self.classes), self.freeze_extractor)
label_studio_ml/examples/electra/electra.py:            self.model = ElectraForSequenceClassification.from_pretrained("my_model")
label_studio_ml/examples/electra/electra.py:            self.model = ElectraForSequenceClassification.from_pretrained(
  • The flair backend assigns self.model in the __init__ method in a conditional block checking self.train_output (which gets set in the base class __init__) and if the check fails it just doesn't load a model (doesn't even assign None to the attribute! Dicey).

    • The model is always loaded from a local path with filename best_model.pt
  • The gpt backend doesn't do any check, and the model is loaded once from self.model_name, so it can't be trained (so it isn't usable in active learning retraining workflows).

  • The mmdetection backend sets it once, as for gpt.

  • The bert backend sets it from pretrained_model in a load method. Unless I'm mistaken, there's a bug where reset_model is a no-op. The model it returns isn't assigned to self.model (as the simple_text_classifier sklearn backend did). In fact the only other example with a reset_model method is the pytorch_transfer_learning backend, and indeed it binds self.model too.

    • That said, if it were fixed (so the call to reset_model in the __init__ method assigned to self.model) it would be training-compatible.

  • There is also the ner backend which sets self._model from the self.train_output attribute's model_path value (which with HuggingFace can of course be a HuggingFace Hub-hosted model path rather than a local file path).

  • The nemo ASR backend sets it once, as for gpt.

  • The tensorflow backend sets it once but loads weights afterwards if self.train_output is set (truthily).

  • The simple_text_classifier backend either sets it from reset_model and immediately fits to initialise if train_output is falsey, or sets it from the pickle if a local model_file is passed in train_output.

  • The pytorch_transfer_learning backend loads its classes if train_output is passed, and loads weights into it after assigning too, otherwise it just initialises it. This is done quite neatly (probably helped by defining the model class in the module itself, not relying on importing an external one).

  • The electra backend checks whether a hardcoded path (MODEL_FILE) exists, but then loads from a separate hardcoded string rather than reusing that variable, though it's clearly supposed to be the same path. Again, yes, you can train the model and use it here.

So to summarise the training-friendly backend examples and whether they're good templates to build from:

  • bert pulls in the labels from the label config info before resetting the model if training output is not available; if available it loads that and gets the labels and [should get] the model from there.
    • It uses the fact that only load (not reset_model) sets the tokenizer to distinguish whether it's not_trained (so as to refuse to predict until being trained)
  • ner uses the model_path from the training output if provided, otherwise just sets labels
    • I.e. it does the same as the BERT backend, and can't predict until trained.
  • tensorflow is the odd one out, starting with the same model regardless of self.train_output but then loading weights into it if available. This uses Keras not HuggingFace, so not applicable.
  • simple_text_classifier calls fit directly on self.model after resetting the model if not trained, otherwise unpickles the model.
  • pytorch_transfer_learning calls load if training output is available, otherwise instantiates the model directly (not put in a reset_model method but same idea). Really it should call reset_model in both blocks of that condition.
  • electra instantiates the model directly, just changes the path based on whether the model file exists. I'm not a fan of the hardcoded value, but I do like that the attributes are consistent regardless of whether the model was trained already (self.tokenizer gets set either way too).

I'd like:

  • The attribute assignment simplicity of electra (and subsequent ability to predict regardless of whether trained or not)
  • The model path handling of ner
  • The proper reset_model/load method handling of bert (when fixed as above)
  • The proper assertion checks on __init__ of bert

Custom LayoutLMv3 object detection backend

So, having homed in on these 5 (SimpleTextClassifier, BERT, Electra, the NER tagger, and MMDetection), as well as the partial Detectron2 example, it's clear that we actually want to mix and match aspects of code from various sources.

  • API key handling is only done properly (in Docker Compose spec) by the SimpleTextClassifier. Electra uses it too in _get_annotated_dataset but it's a hardcoded module string literal.

  • Training is done most neatly (i.e. more simply, abstracting the details away) in Electra, and matches the use of the Trainer API in the LayoutLMv3 tutorial by Niels Rogge (via the Transformer-Tutorials repo).

  • Prediction is done most neatly in Electra, but I'd still prefer a TypedDict for the results to make it even cleaner.

  • Bounding box handling is done in Detectron2 and MMDetection. The result type will be changed from choices to rectanglelabels.

  • Config assertions are done in BERT's __init__ method, and these may be useful to write (they're not in Electra).

  • GPU handling is only done explicitly in BERT, but I expect the Trainer class handles that in Electra. This is handled through the place_model_on_device property of the TrainingArguments class,

    • ...which is True if transformers.utils.import_utils.py's is_sagemaker_mp_enabled() evaluates to False (i.e. if not using model parallelism, which is set via SM_HP_MP_PARAMETERS env. var. else defaults to False).
  • The processor is going to go where the tokenizer goes in Electra (in the __init__ method) for use in predict and fit. Even though it's said to be 'pretrained', it doesn't get retrained so we don't need to load it, so it doesn't need to be conditional on there being train_output (see discussion).

  • The model is going to go where it goes in Electra (in the __init__ method), but rather than instantiating it there from a hardcoded MODEL_FILE module-level global variable, it will be loaded via the path given by the load method (as in BERT) if train_output is available, otherwise from reset_model. This conditional will look more like BERT's, but without moving devices (unsure?). Like the SimpleTextClassifier, the labels attribute is set from train_output when loading, else from info when using reset_model.

    • reset_model should not be passing hardcoded defaults through (as in BERT); they should be method defaults (as in SimpleTextClassifier). The method should take no arguments.

At the risk of overemphasising, let's turn that inside out so it's in terms of what 'features' we want from each source:

  • BERT: reset_model/load pattern; config assertions in __init__
  • Electra: simple attribute assignment [in particular of the tokenizer, i.e. processor in my case] in __init__ (permitting use of predict even if no train_output); prediction with Trainer API (with automatic GPU device handling); _get_annotated_dataset
  • Simple Text Classifier: API key handling in Docker Compose spec and _get_annotated_dataset; method-level defaults in reset_model (not hardcoded in __init__'s call to that method)
  • Detectron2 and MMDetection: bbox handling
  • Niels Rogge's Transformer-Tutorials LayoutLMv3 notebook: Trainer API usage; custom Dataset (from issue #123)
  • NER tagger: model_path handling

Backend rewriting recipe

Since this is quite an ambitious rewrite (with at least 4 different sources in the examples here, plus likely reusing some of Niels Rogge's code for datasets and training with the Trainer API), I'll want to take a principled approach, and record what I do, with version control so I can roll back (or at least review) any mistakes.

  • The first step is to begin adapting the most relevant template (Electra), which will achieve GPU handling (which we get 'for free' with the Trainer API) and immediately check one of our features off the to-do list.
  • Then we should start modifying the model itself. The first 'easy win' is API key usage.
  • Next, we should move onto the model class __init__ method, and tackle some real wins:
    • Adapting the signature to be relevant to the LayoutLMv3 model's args/kwargs.
    • The config assertions (just guess if unsure, we can fix if they fail)
    • The processor instantiation (another easy win)
    • The model instantiation within a train_output conditional block.
  • That just leaves:
    • Prediction (of which bbox handling is a component), which will give us preannotation
    • Training, which will give us a retrainable (fine-tuneable) model which will learn from the annotation labels we provide in Label Studio

Our recipe is therefore:

  1. GPU handling
  2. API key handling
  3. Config assertions
  4. Processor
  5. Model
  6. Prediction
  7. Bbox handling
  8. Training

Choosing a template

The first question is obviously: where to start? I.e. which template to begin adapting from?

Well, out of the sources above, Electra has the longest list of 'features' I want.

Looked at another way, the most complexity-reducing thing we have here is the Trainer API, and that's only in Electra (Niels Rogge's Trainer API example is not a Label Studio backend example).

At the risk of bikeshedding I'm going to just go with that impulse...

Adapting the WSGI server module

  1. Copy the directory and rename to layoutlmv3
  2. Rename the model module to layoutlmv3.py
  3. Overwrite the import line for the model module with the new model module (layoutlmv3) and class names (LayoutLMv3Classifier)
  4. Overwrite the model class name with the new one
cp -r electra layoutlmv3
cd layoutlmv3
mv electra.py layoutlmv3.py
sed -i 's/from electra import ElectraTextClassifier/from layoutlmv3 import LayoutLMv3Classifier/' _wsgi.py
sed -i 's/ElectraTextClassifier/LayoutLMv3Classifier/g' _wsgi.py

Before we start adapting the model class, we should really ensure that class is renamed too (so far it's just renamed in the server module).

sed -i 's/ElectraTextClassifier/LayoutLMv3Classifier/g' layoutlmv3.py

Modifying the Compose spec and API key usage

A major feature that is missing from Electra is that it doesn't take the API key from an environment variable set in the Docker Compose spec; it uses a hard-coded string. We can get this easily enough by copying the Docker Compose spec over from simple_text_classifier and then using it in layoutlmv3.py the same way the simple_text_classifier.py module does.

Just copy the variables in the environment section of the YAML (I just did this in a text editor)

--- a/label_studio_ml/examples/layoutlmv3/docker-compose.yml
+++ b/label_studio_ml/examples/layoutlmv3/docker-compose.yml
@@ -18,6 +18,9 @@ services:
       - REDIS_HOST=redis
       - REDIS_PORT=6379
       - USE_REDIS=true
+      - LABEL_STUDIO_ML_BACKEND_V2=true
+      - LABEL_STUDIO_HOSTNAME=http://localhost:8000
+      - LABEL_STUDIO_API_KEY=d6f8a2622d39e9d89ff0dfef1a80ad877f4ee9e3

Since it's all local, I imagine you could change that key to be whatever you wanted instead? (TBC)

The model module layoutlmv3.py now needs to use those environment variables like simple_text_classifier.py does:

--- a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
+++ b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
@@ -12,9 +12,15 @@
+from label_studio_ml.utils import DATA_UNDEFINED_NAME, get_env
+
+HOSTNAME = get_env("HOSTNAME", "http://localhost:8080")
+API_KEY = get_env("API_KEY")
+
+print("=> LABEL STUDIO HOSTNAME = ", HOSTNAME)
+if not API_KEY:
+    print("=> WARNING! API_KEY is not set")
 
-HOSTNAME = "https://app.heartex.com/"
-API_KEY = ""

Finally we also need to modify the _get_annotated_dataset method (which Electra had) to use the same 'best practice' approach as simple_text_classifier (Electra's version was missing the exception handling):

--- a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
+++ b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
@@ -142,6 +142,11 @@ class LayoutLMv3Classifier(LabelStudioMLBase):
         response = requests.get(
             download_url, headers={"Authorization": f"Token {API_KEY}"}
         )
+        if response.status_code != 200:
+            raise Exception(
+                f"Can't load task data using {download_url}, "
+                f"response status_code = {response.status_code}"
+            )
         return json.loads(response.content)

and with that we should have enabled webhook-triggered training with the Docker Compose-specified API key.

The only reason you might not want to do this is if the error would crash your annotation session, but I'd expect it to fail early, before you'd done any annotation, so you wouldn't lose any work.

Config checks

BERT had some confident assertions that demonstrate data validation on the input config, so that we can't accidentally use this backend with the wrong task type (or something like that).

The obvious question here is: what are we going to check? What are we expecting? Well, we can't just reuse the BERT code as we are not expecting to classify choices but rather to have labelled bounding boxes or rectanglelabels as they're known.

Here are the checks the BERT classifier does:

        # then collect all keys from config which will be used to extract data from task and to form prediction
        # Parsed label config contains only one output of <Choices> type
        assert len(self.parsed_label_config) == 1
        self.from_name, self.info = list(self.parsed_label_config.items())[0]
        assert self.info["type"] == "Choices"

        # the model has only one textual input
        assert len(self.info["to_name"]) == 1
        assert len(self.info["inputs"]) == 1
        assert self.info["inputs"][0]["type"] == "Text"
        self.to_name = self.info["to_name"][0]
        self.value = self.info["inputs"][0]["value"]

We aren't using outputs of Choices type, but RectangleLabels (recall this is known as the control_type).

If you review the code above from the get_single_tag_keys helper function in label_studio_ml.utils, it is in fact the exact same check. So we can just call that and have a much more concise (thus more maintainable) model class __init__.

So in fact, we really want to copy the mmdetection backend's routine here.

diff --git a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
index 0e7e5bb..2ddbfdb 100644
--- a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
+++ b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
@@ -25,13 +25,17 @@ MODEL_FILE = "my_model"
 
 
 class LayoutLMv3Classifier(LabelStudioMLBase):
+    control_type: str = "RectangleLabels"
+    object_type: str = "Image"
+
     def __init__(self, **kwargs):
         super(LayoutLMv3Classifier, self).__init__(**kwargs)
         try:
-            self.from_name, self.info = list(self.parsed_label_config.items())[0]
-            self.to_name = self.info["to_name"][0]
-            self.value = self.info["inputs"][0]["value"]
-            self.labels = sorted(self.info["labels"])
+            self.from_name, self.to_name, self.value, self.labels = get_single_tag_keys(
+                self.parsed_label_config,
+                control_type=self.control_type,
+                object_type=self.object_type,
+            )
         except BaseException:
             print("Couldn't load label config")

While we're at it, we may as well set some class attributes and add type annotations to make it clearer.

These print statements are annoyingly amateur though: I then swapped them all for logger.error calls.

Next I removed some code repetition and made load_config only take the self argument.

With that, the config step was all done, and tucked away neatly into a load_config method.
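
For reference, the resulting helper looks roughly like this (a reconstruction of what's described above, not a verbatim copy):

    def load_config(self) -> None:
        # parse the labelling config lazily if it wasn't available at init time
        if not self.parsed_label_config and self.label_config:
            self.parsed_label_config = parse_config(self.label_config)
        try:
            (
                self.from_name,
                self.to_name,
                self.value,
                self.labels,
            ) = get_single_tag_keys(
                self.parsed_label_config,
                control_type=self.control_type,
                object_type=self.object_type,
            )
        except (AssertionError, KeyError, IndexError) as exc:
            logger.error(f"Couldn't load label config: {exc!r}")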

Processor instantiation

See the LayoutLMv3 processor source code and its tests

We create the processor just once, as Electra did for its tokenizer (so we just need to adapt this tokenizer to be a processor).

To make it neater, I made the processor name a class attribute, and the processor class another.

I swapped the Electra tokenizer import for LayoutLMv3Processor (while at it also swapping the ElectraForSequenceClassification with LayoutLMv3ForTokenClassification) and was now halfway done migrating it from Electra to LayoutLMv3:

class LayoutLMv3Classifier(LabelStudioMLBase):
    control_type: str = "RectangleLabels"
    object_type: str = "Image"
    hf_hub_name: str = "microsoft/layoutlmv3-base"
    hf_model_cls: Type = LayoutLMv3ForTokenClassification
    hf_processor_cls: Type = LayoutLMv3Processor
    
    def __init__(self, **kwargs):
        super(LayoutLMv3Classifier, self).__init__(**kwargs)
        self.load_config()
        self.processor = self.hf_processor_cls.from_pretrained(self.hf_hub_name)
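
For context, the processor will eventually be called much the way Niels Rogge's draft dataset above does (the extra kwargs here are illustrative):

# e.g. inside predict/fit, once we have an image plus its words and boxes:
encoding = self.processor(
    image, words, boxes=boxes, truncation=True, return_tensors="pt"
)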

Model initialisation

See the LayoutLMv3 model source code and its tests

There are two options for the model class: LayoutLMv3ForSequenceClassification and LayoutLMv3ForTokenClassification. The 'sequence' is a document (e.g. if you wanted to distinguish different types of document), and the 'token' is a part of a document (I want to annotate and classify parts of documents so I chose this).

We instantiate our model in two different ways: either with reset_model or with load (if we have train_output).

What I did initially was to simplify the Electra model initialisation into two lines:

        model_to_load = MODEL_FILE if Path(MODEL_FILE).exists() else self.hf_hub_name
        self.model = self.hf_model_cls.from_pretrained(model_to_load)

At this point I committed my changes in case I messed the next step up

However as already established, this conditional block should actually be as in simple_text_classifier and bert.

This part of the code had hardcoded device="cpu", so I replaced that with a module-level global

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

This left me with the outline of the new model, but still with the BERT kwargs. I added more type annotations, made the reset_model take no arguments and return nothing, and annotated load as returning nothing too.

        if not self.train_output:
            self.labels = self.info["labels"]
            self.reset_model()
            load_repr = "Initialised with"
        else:
            self.load(self.train_output)
            load_repr = f"Loaded from train output with"
        logger.info(f"{load_repr} {self.from_name=}, {self.to_name=}, {self.labels=!s}")
        
    def reset_model(self) -> None:
        # THESE KWARGS HAVE NOT BEEN CHANGED FROM BERT ! TODO
        model_kwargs = dict(
            num_labels=len(self.labels),
            output_attentions=False,
            output_hidden_states=False,
            cache_dir=None,
        )   
        model = self.hf_model_cls.from_pretrained(
            self.hf_hub_name,
            **model_kwargs
        )   
        model.to(DEVICE)
        self.model = model
        return
        
    def load(self, train_output) -> None:
        pretrained_model = train_output["model_path"]
        self.model = self.hf_model_cls.from_pretrained(pretrained_model)
        self.model.to(DEVICE)
        self.model.eval()
        self.batch_size = train_output["batch_size"]
        self.labels = train_output["labels"]
        self.maxlen = train_output["maxlen"]

Now getting the arguments to the model class looks tricky: if we review the BERT signature which we are adapting, BertForSequenceClassification:

input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
    Indices of input sequence tokens in the vocabulary.

    Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.encode`] and
    [`PreTrainedTokenizer.__call__`] for details.

    [What are input IDs?](../glossary#input-ids)
attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
    Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

    - 1 for tokens that are **not masked**,
    - 0 for tokens that are **masked**.

    [What are attention masks?](../glossary#attention-mask)
token_type_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
    Segment token indices to indicate first and second portions of the inputs. Indices are selected in
    `[0, 1]`:

    - 0 corresponds to a *sentence A* token,
    - 1 corresponds to a *sentence B* token.

    [What are token type IDs?](../glossary#token-type-ids)
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
    Indices of positions of each input sequence tokens in the position embeddings. Selected in the range
    `[0, config.max_position_embeddings - 1]`.

    [What are position IDs?](../glossary#position-ids)
head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
    Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:

    - 1 indicates the head is **not masked**,
    - 0 indicates the head is **masked**.

inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
    Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
    This is useful if you want more control over how to convert `input_ids` indices into associated vectors
    than the model's internal embedding lookup matrix.
output_attentions (`bool`, *optional*):
    Whether or not to return the attentions tensors of all attention layers. See `attentions` under
    returned tensors for more detail.
output_hidden_states (`bool`, *optional*):
    Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
    for more detail.
return_dict (`bool`, *optional*):
    Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.

labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
    Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
    config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss),
    If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

So num_labels seems to pass through as a kwarg and bind into the config, whereas output_attentions and output_hidden_states are kwargs to the model itself.

The other kwarg, cache_dir, is actually in the signature of from_pretrained. Since it's only being passed as None it's a bit futile to pass at all. Likewise output_attentions and output_hidden_states default to False, so including them is just for demonstration.

num_labels is listed as one of the arguments used for fine-tuning (which is what we're trying to do), so that'd be all that is worth keeping (and I expect even that could be set later).

Importantly: this is why you need to set self.labels before calling the reset_model method!

Now that the first part is done (setting up an initial model), we're left with the job of adapting the load method to suit the LayoutLMv3 signature too.

It seems there was a mistake in the BERT example, as the num_labels was not passed when loading a trained model, and couldn't be because self.labels was set after instantiating the model.

maxlen and batch_size are used in the predict method, so I left a note to review them later.

With that, the model initialisation handling is done too:

--- a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
+++ b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
@@ -1,13 +1,14 @@
 import json
 import logging
 from pathlib import Path
+from typing import Type
 
 import requests
 import torch
 from label_studio_tools.core.label_config import parse_config
 from transformers import (
     LayoutLMv3Processor,
-    ElectraForSequenceClassification,
+    LayoutLMv3ForTokenClassification,
     Trainer,
     TrainingArguments,
 )
@@ -17,6 +18,7 @@ from label_studio_ml.utils import DATA_UNDEFINED_NAME, get_env
 
 HOSTNAME = get_env("HOSTNAME", "http://localhost:8080")
 API_KEY = get_env("API_KEY")
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
 
 logger.info("=> LABEL STUDIO HOSTNAME = ", HOSTNAME)
 if not API_KEY:
@@ -27,20 +29,45 @@ MODEL_FILE = "my_model"
 logger = logging.getLogger(__name__)
 logger.setLevel(logging.INFO)
 
+
 class LayoutLMv3Classifier(LabelStudioMLBase):
     control_type: str = "RectangleLabels"
     object_type: str = "Image"
-    hf_model_name: str = "google/electra-small-discriminator"
-    hf_processor_name: str = "microsoft/layoutlmv3-base"
-    hf_model_class: type = ElectraForSequenceClassification
-    hf_processor_class: type = LayoutLMv3Processor
+    hf_hub_name: str = "microsoft/layoutlmv3-base"
+    hf_model_cls: Type = LayoutLMv3ForTokenClassification
+    hf_processor_cls: Type = LayoutLMv3Processor
 
     def __init__(self, **kwargs):
         super(LayoutLMv3Classifier, self).__init__(**kwargs)
         self.load_config()
-        self.processor = self.tokenizer_class.from_pretrained(self.hf_processor_name)
-        model_name = MODEL_FILE if Path(MODEL_FILE).exists() else self.hf_model_name
-        self.model = self.hf_model_class.from_pretrained(model_name)
+        self.processor = self.tokenizer_cls.from_pretrained(self.hf_hub_name)
+        if not self.train_output:
+            self.labels = self.info["labels"]
+            self.reset_model()
+            load_repr = "Initialised with"
+        else:
+            self.load(self.train_output)
+            load_repr = f"Loaded from train output with"
+        logger.info(f"{load_repr} {self.from_name=}, {self.to_name=}, {self.labels=!s}")
+
+    def _load_model(self, name_or_path: str) -> None:
+        assert hasattr(self, "labels"), "Loading model requires labels to be set first"
+        self.model = self.hf_model_cls.from_pretrained(
+            name_or_path,
+            num_labels=len(self.labels),
+        )
+        self.model.to(DEVICE)
+        return
+
+    def reset_model(self) -> None:
+        return self._load_model(name_or_path=self.hf_hub_name)
+
+    def load(self, train_output) -> None:
+        self.labels = train_output["labels"]
+        self._load_model(name_or_path=train_output["model_path"])
+        self.model.eval()
+        self.batch_size = train_output["batch_size"]  # TODO: review use in `predict`
+        self.maxlen = train_output["maxlen"]  # TODO: ditto (source: BERT backend)

Preannotation prediction and bbox handling

This one's really simple: we just call the model with the inputs. The inputs must be in a specific format though, and we need to handle the bounding box rectangles properly.

If we review the predict method of the Detectron2 example above:

    def predict(self, tasks, **kwargs):
        image_urls = [task["data"][self.value] for task in tasks]
        images = [load_image_from_url(url) for url in image_urls]
        layouts = [self.model.detect(image) for image in images]
        predictions = []
        for image, layout in zip(images, layouts):
            height, width = image.shape[:2]
            result = [
                {
                    "from_name": self.from_name,
                    "to_name": self.to_name,
                    "original_height": height,
                    "original_width": width,
                    "source": "$image",
                    "type": "rectanglelabels",
                    "value": convert_block_to_value(block, height, width),
                }
                for block in layout
            ]
            predictions.append({"result": result})
        return predictions

It's clear that where the text classifiers access their input_texts from each task on task["data"][self.value], we now get image URLs (recall images are the 'objects' referenced in object_type).

The Detectron2 method uses list comprehensions for the images and the layouts (i.e. including the inference step), then only iterates over the results of that model inference, and it builds the result objects more concisely.

The Detectron2 code style is neater, but we don't call our model with model.detect(), so we want elements from the Electra example too for its HuggingFace style:

    def predict(self, tasks, **kwargs):
        # get data for prediction from tasks
        final_results = []
        for task in tasks:
            input_texts = ""
            input_text = task["data"].get(self.value)
            if input_text.startswith("http://"):
                input_text = self._get_text_from_s3(input_text)
            input_texts += input_text

            labels = torch.tensor([1], dtype=torch.long)
            # tokenize data
            input_ids = torch.tensor(
                self.tokenizer.encode(input_texts, add_special_tokens=True)
            ).unsqueeze(0)
            # predict label
            predictions = self.model(input_ids, labels=labels).logits
            predictions = torch.softmax(predictions.flatten(), 0)
            label_count = torch.argmax(predictions).item()
            final_results.append(
                {
                    "result": [
                        {
                            "from_name": self.from_name,
                            "to_name": self.to_name,
                            "type": "choices",
                            "value": {"choices": [self.labels[label_count]]},
                        }
                    ],
                    "task": task["id"],
                    "score": predictions.flatten().tolist()[label_count],
                }
            )
        return final_results

Note how the final_results.append call spans 14 lines.

Also note how the returned results have a result value which is only one item, whereas in the Detectron2 results the result value is a list of many blocks in a layout.

I want to make it even neater, with a TypedDict:

class DetectionResult(TypedDict):
    from_name: str
    to_name: str
    original_height: int
    original_width: int
    source: str
    type: str
    value: BlockValue

Each of these is nested in a wrapper dict, which I'll again formalise as a TypedDict:

class PredictionResult(TypedDict):
    result: list[DetectionResult]
    task: int
    score: float

and then I can annotate the value key's value as a BlockValue, corresponding to the return type of convert_block_to_value (already covered above).

class BlockValue(TypedDict):
    height: int
    rectanglelabels: list[str]
    rotation: int
    width: int
    x: float
    y: float
    score: float

Note that I'm not going to rotate my bboxes, so rotation stays at its initial value of 0.

We can then annotate the return type of our predict method specifically and concisely.
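To make the nesting concrete, here's a hypothetical value matching these TypedDicts (all the numbers and names are made up):

example: PredictionResult = {
    "result": [
        {
            "from_name": "label",  # hypothetical control tag name
            "to_name": "image",  # hypothetical object tag name
            "original_height": 1000,
            "original_width": 800,
            "source": "$image",
            "type": "rectanglelabels",
            "value": {
                "height": 40,
                "rectanglelabels": ["page_num"],
                "rotation": 0,
                "width": 60,
                "x": 12.5,
                "y": 80.0,
                "score": 0.97,
            },
        }
    ],
    "task": 1,
    "score": 0.97,
}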

This gets us most of the way there, with some ambiguous parts left as TODOs and old code commented out where it was replaced with reasonable guesses, e.g.:

  • encoding = self.processor(image)
  • self.model(**encoding, labels=labels).logits

class LayoutLMv3Classifier(LabelStudioMLBase):
    control_type: str = "RectangleLabels"
    object_type: str = "Image"
    hf_hub_name: str = "microsoft/layoutlmv3-base"
    hf_model_cls: Type = LayoutLMv3ForTokenClassification
    hf_processor_cls: Type = LayoutLMv3Processor
    detection_source: str = f"${object_type}".lower()
    detection_type: str = control_type.lower()

    ...

    def detect_images(self, images: list):
        # TODO: change this to HuggingFace style
        return [self.model.detect(image) for image in images]

    def load_images_from_urls(self, image_urls: list):
        # TODO: change this to paths / check how load_image_from_url works in Detectron2
        return [load_image_from_url(url) for url in image_urls]

    def predict(self, tasks, **kwargs) -> list[PredictionResult]:
        # get data for prediction from tasks
        image_urls = [task["data"][self.value] for task in tasks]
        images = self.load_images_from_urls(image_urls)
        layouts = self.detect_images(images)
        predictions = []
        for task, image, layout in zip(tasks, images, layouts):
            height, width = image.shape[:2]
            labels = torch.tensor([1], dtype=torch.long)
            encoding = self.processor(image)
            # input_ids = torch.tensor(
            #     self.tokenizer.encode(input_texts, add_special_tokens=True)
            # ).unsqueeze(0)
            # predict label
            logits = self.model(**encoding, labels=labels).logits
            probs = torch.softmax(logits.flatten(), 0)
            label_idx = torch.argmax(probs).item()
            pred_score = probs.tolist()[label_idx]
            # TODO: see `convert_block_to_value` to get bbox h,w,x,y
            detection_results = [
                DetectionResult(
                    {
                        "from_name": self.from_name,
                        "to_name": self.to_name,
                        "original_height": height,
                        "original_width": width,
                        "source": self.detection_source,
                        "type": self.detection_type,
                        "value": BlockValue(
                            {
                                "height": ...,
                                "rectanglelabels": [self.labels[label_idx]],
                                "rotation": 0,
                                "width": ...,
                                "x": ...,
                                "y": ...,
                                "score": pred_score,
                            }
                        ),
                    }
                )
                for block in layout
            ]
            pred = PredictionResult(
                {
                    "result": detection_results,
                    "task": task["id"],
                    "score": pred_score,
                }
            )
            predictions.append(pred)
        return predictions

The remaining problems to solve are:

  • How to handle the bboxes (which Detectron2's code handled via convert_block_to_value)
  • How to detect the layouts in images (the placeholder method detect_images)
  • How to handle the images from local files (the placeholder method load_images_from_urls)

We've crossed a line here, from adapting examples that shared the behaviour we want to creating something entirely new: the object detection examples, as helpful as they are, don't quite match, and their usefulness as templates is nearly exhausted.

Predicting bboxes from only the image

At this point, midway through the rewrite, I got stuck. It dawned on me that the prediction (or "model inference") code I had been referring to, in Niels Rogge's Transformers Tutorials notebook on fine-tuning LayoutLMv3 with the FUNSD dataset, was not actually how I would be doing it in my backend.

Specifically this was addressed at the end of that tutorial:

Note: inference when you don't have labels

The code above used the labels to determine which tokens were at the start of a particular word or not. Of course, at inference time, you don't have access to any labels. In that case, you can leverage the offset_mapping returned by the tokenizer. I do have a notebook for that (for LayoutLMv2, but it's equivalent for LayoutLMv3) here.

I then looked at how the LayoutLMv2 code did it, and made a HuggingFace Space to confirm this worked in a similar way. The relevant part of the source is:

def process_image(image):
    width, height = image.size
    encoding = processor(
        image, truncation=True, return_offsets_mapping=True, return_tensors="pt"
    )
    offset_mapping = encoding.pop("offset_mapping")
    outputs = model(**encoding)
    predictions = outputs.logits.argmax(-1).squeeze().tolist()
    token_boxes = encoding.bbox.squeeze().tolist()
    # only keep non-subword predictions
    is_subword = np.array(offset_mapping.squeeze().tolist())[:, 0] != 0
    true_predictions = [
        id2label[pred] for idx, pred in enumerate(predictions) if not is_subword[idx]
    ]
    true_boxes = [
        unnormalize_box(box, width, height)
        for idx, box in enumerate(token_boxes)
        if not is_subword[idx]
    ]
    for prediction, box in zip(true_predictions, true_boxes):
        predicted_label = iob_to_label(prediction).lower()
        ...

  • Only lightly trimmed here (the result-drawing code is omitted)
  • Note that this function is not self-contained: it assumes we already have processor, model, np, id2label, unnormalize_box, and iob_to_label
  • id2label is just dict(enumerate(labels)), and notice how it's only used to temporarily 'store' the predictions as integer indices before they're restored in the for loop at the end (a minimal sketch follows)
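A minimal sketch of that relationship, with an illustrative stand-in label list:

labels = ["page_num", "section_header", "referent_name"]  # illustrative subset
id2label = dict(enumerate(labels))  # {0: "page_num", 1: "section_header", ...}
label2id = {v: k for k, v in id2label.items()}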

This is how we should handle the images (PIL.Image, RGB mode) to produce label predictions with bboxes.

The code at this point was getting too nested for my liking. Rather than instantiate the DetectionResult within the predict method, and the BlockValue within that, I whisked this complexity away into from_backend classmethods on those classes, which take the backend itself (i.e. the LayoutLMv3Classifier subclassing LabelStudioMLBase) and only additionally pass in whatever doesn't come from self-reference, greatly simplifying the call signatures and decluttering the predict method.

Simultaneously, I became "adrift" in the code: I couldn't see it all on one screen. I prefer to see everything in one eyeful, so I split out the new classes into separate modules.

The trick to doing this is to copy everything into each new module, delete the parts that don't belong in each one, then run autoflake8 to strip the imports left unused.

When this wasn't enough, I broke out individual methods into separate modules (choosing those with the fewest self references, since any self attributes would otherwise need to be passed as kwargs to a static function). _get_annotated_dataset turned out to be completely static, and got moved out entirely.

After this, I had 4 new modules: url_utils.py, detection.py, components.py, ls_api.py.

This greatly eased the writing of specific parts, and before I knew it I was done. The key part is the unnormalize_box function, which takes a width and height and scales the bbox back up to full size. After that, it's just passing the info around.
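For reference, the unnormalize_box helper in Niels's notebooks looks roughly like this (LayoutLM-family boxes are normalised to a 0-1000 range):

def unnormalize_box(bbox, width, height):
    # bbox is (x1, y1, x2, y2) on a 0-1000 scale; scale back to pixel coordinates
    return [
        width * (bbox[0] / 1000),
        height * (bbox[1] / 1000),
        width * (bbox[2] / 1000),
        height * (bbox[3] / 1000),
    ]

With that in hand, the rest is the conversion helper and the TypedDicts it feeds: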

from __future__ import annotations

from typing import TypedDict

from label_studio_ml.model import LabelStudioMLBase


def convert_box_to_value(box: tuple[float, float, float, float]):
    x1, y1, x2, y2 = box
    w = x2 - x1
    h = y2 - y1
    return x1, y1, w, h


class LayoutBlock(TypedDict):
    box: tuple[float, float, float, float]
    label: str
    score: float


class BlockValue(TypedDict):
    height: int
    rectanglelabels: list[str]
    rotation: float
    width: int
    x: float
    y: float
    score: float

    @classmethod
    def from_backend(
        cls,
        block: LayoutBlock,
        backend: LabelStudioMLBase,
    ) -> BlockValue:
        box, label, score = block.values()
        x, y, w, h = convert_box_to_value(box)
        return cls(
            height=h,
            rectanglelabels=[label],
            rotation=0,
            width=w,
            x=x,
            y=y,
            score=score,
        )

I realised I'd somehow ended up passing labels and scores at both the image and the block level, but obviously we want to label the block level because we're doing token classification not sequence classification. Oops. Moving on.

So a bbox with info attached is a "block" (and we skip subword tokens), and multiple blocks make a "layout".
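In other words, something like this, matching the LayoutBlock TypedDict above (values invented):

block: LayoutBlock = {"box": (120.0, 80.0, 260.0, 110.0), "label": "page_num", "score": 0.91}
layout: list[LayoutBlock] = [block]  # a layout is just a list of blocks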

TODO: annotate return values, noting bboxes after being unnormalized have float w,h

Handling images from local files and as URLs

I wasn't 100% sure how this was going to work (web apps tend to work from URLs, but I was working with local files; it could even use local URLs like file://). To start with, I just wrote the simplest possible solution, which was this method on the LayoutLMv3Classifier class (using pathlib.Path):

    def load_images_from_urls(self, image_paths_or_urls: list[Path | str]):
        return list(map(load_image_from_path_or_url, image_paths_or_urls))

which calls this function from the url_utils module:

from __future__ import annotations

from pathlib import Path

import requests
from PIL import Image

__all__ = ["load_image_from_path_or_url"]


def load_image_from_path_or_url(path_or_url: str | Path) -> Image.Image:
    if isinstance(path_or_url, str) and path_or_url.startswith("http"):
        im_ref = requests.get(path_or_url, stream=True).raw
    else:
        im_ref = path_or_url
    image = Image.open(im_ref).convert("RGB")
    return image
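Usage is then the same for either kind of reference (the paths and URL here are hypothetical):

images = [
    load_image_from_path_or_url("data/pages/page_001.png"),
    load_image_from_path_or_url(Path("data/pages/page_002.png")),
    load_image_from_path_or_url("http://localhost:8080/data/upload/page_003.png"),
]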

Training for in-the-loop fine-tuning

All of the above gets us the ability to make predictions from a pretrained model, but not the ability to train (or fine-tune) one.

The training details are simplified to just calling the Trainer.train() method.

Other features that got copied from the other examples:

  • The tasks are called 'completions'
  • The training supports a web hook (to do with the Label Studio API)
  • The training is still written for text input (not images)
  • The train_dataset is not completed (currently passed as placeholder class Custom_Dataset)

    def fit(self, completions, workdir=None, **kwargs):
        # check if training is from web hook
        if kwargs.get("data"):
            project_id = kwargs["data"]["project"]["id"]
            tasks = get_annotated_dataset(project_id)
            if not self.parsed_label_config:
                self.parsed_label_config = parse_config(
                    kwargs["data"]["project"]["label_config"]
                )
                self.load_config()
        # ML training without web hook
        else:
            tasks = completions
        # Create training params with batch size = 1 as text are different size
        training_args = TrainingArguments(
            "test_trainer", per_device_train_batch_size=1, per_device_eval_batch_size=1
        )
        # Prepare training data
        input_texts = []
        input_labels = []
        for task in tasks:
            if not task.get("annotations"):
                continue
            input_text = task["data"].get(self.value)
            input_texts.append(
                torch.flatten(self.tokenizer.encode(input_text, return_tensors="pt"))
            )
            annotation = task["annotations"][0]
            output_label = annotation["result"][0]["value"]["choices"][0]
            output_label_idx = self.labels.index(output_label)
            output_label_idx = torch.tensor([[output_label_idx]], dtype=torch.int)
            input_labels.append(output_label_idx)
        print(f"Train dataset length: {len(tasks)}")
        my_dataset = Custom_Dataset((input_texts, input_labels))
        trainer = Trainer(
            model=self.model,
            args=training_args,
            train_dataset=my_dataset,
            # eval_dataset=small_eval_dataset
        )
        trainer.train()
        self.model.save_pretrained(MODEL_FILE)
        train_output = {"labels": self.labels, "model_file": MODEL_FILE}
        return train_output

A good place to start is the custom class for the training data, Custom_Dataset.

This was taken from the Electra example, where it is defined as:

class Custom_Dataset(torch.utils.data.dataset.Dataset):
    def __init__(self, _dataset):
        self.dataset = _dataset

    def __getitem__(self, index):
        example, target = self.dataset[0][index], self.dataset[1][index]
        return {"input_ids": example, "label": target}

    def __len__(self):
        return len(self.dataset)

Now compare this to the example from Niels Rogge given earlier (I've adapted it slightly to actually run, rather than having a placeholder image path):

from pathlib import Path

import pandas as pd
from torch.utils.data import Dataset
from PIL import Image


class CustomDataset(Dataset):
    def __init__(self, root: Path, df: pd.DataFrame, processor):
        self.root = root
        self.df = df
        self.processor = processor

    def __getitem__(self, idx):
        # get document image + corresponding words and boxes
        item = self.df.iloc[idx]
        filename = item.filename
        image_path = self.root / filename
        image = Image.open(str(image_path)).convert("RGB")
        words = item.words
        boxes = item.boxes
        # use processor to prepare everything for the model
        encoding = self.processor(image, words, boxes=boxes)
        return encoding

The image loading will be refactored into

from layoutlmv3.url_utils import load_image_from_path_or_url
...
image = load_image_from_path_or_url(image_path)

The difference is that in Niels's code, the dataset is stored in a dataframe.
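Presumably that dataframe has one row per document image, something like this (column names taken from the __getitem__ above; the values are invented):

import pandas as pd

df = pd.DataFrame(
    {
        "filename": ["page_001.png", "page_002.png"],
        "words": [["Referent", "p.", "12"], ["Section", "3.1"]],
        "boxes": [
            [[10, 10, 80, 30], [90, 10, 110, 30], [115, 10, 140, 30]],
            [[10, 10, 70, 30], [75, 10, 100, 30]],
        ],
    }
)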

This is difficult to figure out completely out of context, so at this point I began trying to deploy and attach the backend even though it wasn't ready.

Deploying and debugging the custom backend

Similar to the beginning, we now deploy the custom backend from its directory:

cd label-studio-ml-backend/label_studio_ml/examples/layoutlmv3
docker compose up

This gave me an error: the redis container couldn't launch because it clashed with the one that was previously set up:

[+] Running 1/1
 ⠿ Network layoutlmv3_default  Created                                     0.3s
 ⠋ Container redis             Creating                                    0.0s
Error response from daemon: Conflict. The container name "/redis" is already in
use by container "...". You have to remove (or rename) that container to be
able to reuse that name.

To do so, run docker rmi simple_text_classifier_server:latest (the image name will tab-autocomplete). If that fails, you need docker ps -a to list the containers and then docker rm on the ID of the one that's clashing; then rmi will work. After that you can docker rm the container holding the /redis name, and finally you can run the compose command.

This time the server just started, a little too quietly: there was no message to tell me that it had succeeded and the address to load, just:

server  | 2022-07-17 19:40:58,704 INFO supervisord started with pid 1
server  | 2022-07-17 19:40:59,708 INFO spawned: 'rq_00' with pid 9
server  | 2022-07-17 19:40:59,711 INFO spawned: 'wsgi' with pid 10
server  | 2022-07-17 19:41:00,797 INFO success: rq_00 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
server  | 2022-07-17 19:41:00,797 INFO success: wsgi entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

I wanted to see the entire build happen again, so I deleted the container and the images and re-ran. This gave the full build output and the same success log from the server container; however, there was no local server for me to connect to!

Running curl http://localhost:9090/ gave "Internal Server Error", with no debug output.

A cursory review of the config showed that the logs were going to the logs directory in the current working directory; in particular, uwsgi.log contained a syntax error from my code:

Traceback (most recent call last):
  File "./_wsgi.py", line 7, in <module>
    from layoutlmv3 import LayoutLMv3Classifier
  File "<fstring>", line 1
    (self.from_name=)
                   ^
SyntaxError: invalid syntax

This was in fact from trying to use Python 3.8+ syntax (the f"{x=}" debugging specifier) in a Python 3.7 container. Apparently you can currently upgrade the image to 3.8 but not 3.9. I did this by changing the Dockerfile, removing the old containers and images, and recreating with the compose command again:

docker ps -a
docker rm #id_of_redis_container #id_of_layoutlmv3_container
docker images
docker rmi redis:alpine layoutlmv3_server:latest

followed by docker compose up, which rebuilt the image with Python 3.8.

This log file was also littered with warnings about running uWSGI as root ("use the --uid flag").

After rebuilding, and reviewing the uwsgi.log again, I came across some further bugs in my code. This time:

    from label_studio_ml.examples.layoutlmv3.components import (
ModuleNotFoundError: No module named 'label_studio_ml.examples'

The 'package' here should be the backend, not the repo of multiple backends, d'oh!

This source code is 'baked in' to the Docker image, so once again I needed to delete containers and images. This time I automated it using a filter, and after noticing that removing the image only requires its server container to be gone, reduced it to:

docker rm $(docker ps -a -q -f name=server) && docker rmi layoutlmv3_server:latest
docker compose up

I now got the logged traceback:

  File "/app/./_wsgi.py", line 7, in <module>
    from layoutlmv3 import LayoutLMv3Classifier
  File "/app/./layoutlmv3.py", line 18, in <module>
    from layoutlmv3.components import (
ModuleNotFoundError: No module named 'layoutlmv3.components'; 'layoutlmv3' is not a package

I couldn't get a package structure by sprinkling __init__.py around as I usually would, so I dumped the __package__ and __file__ variables to the log:

PACKAGE IS  FILE IS /app/./layoutlmv3.py

So there is no package to perform relative imports with respect to; the file is simply run directly. I'm not sure I know enough to debug further. Direct imports work, however, so I just renamed all my modules to have a layoutlmv3_ prefix (for code readability, despite the ugly effect on the directory).

With that, I was rid of the ModuleNotFoundError tracebacks, and got some fresh ones to debug (each time stopping the docker compose process, running my docker rm command, then bringing compose up again to recreate the server with the newly edited source).

I needed from __future__ import annotations in all the modules (Python 3.8 only tolerates the newer annotation syntax, like list[...] and X | Y, when annotations aren't evaluated at runtime).
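A minimal example of why (the function here is just illustrative):

from __future__ import annotations  # on 3.8, keeps the annotations below unevaluated

from pathlib import Path


def load_image_refs(paths: list[Path | str]) -> list[str]:
    # list[...] and the X | Y union syntax would otherwise fail at import time on 3.8
    return [str(p) for p in paths]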

...et voila, the backend was running!

curl http://localhost:9090/

{"model_dir":"/data/models","status":"UP","v2":"true"}

Additionally, if I run sudo tail -f logs/uwsgi.log in one shell and run the curl command in another (or curl the equivalent /health endpoint), I can see the log being appended to in real time:

GET /health => generated 55 bytes in 0 msecs (HTTP/1.1 200) 2 headers in 71 bytes (1 switches on core 0)

Attaching the custom backend

So now we have a running custom backend (in some potentially minimally viable state), and we can run Label Studio (label-studio) and log in to access a saved project. In the project's Settings page, click "Machine Learning", then "Add Model", and set the URL to http://localhost:9090/.

This gave the following message:

Successfully connected to http://localhost:9090/ but it doesn't look like a valid ML backend. Reason: 500 Server Error: INTERNAL SERVER ERROR for url: http://localhost:9090/setup.

Check the ML backend server console logs to check the status. There might be something wrong with your model or it might be incompatible with the current labeling configuration.

Switching back to my log tail shell, I could see the Python error that had caused the internal server error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/label_studio_ml/exceptions.py", line 39, in
exception_f
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/label_studio_ml/api.py", line 50, in _setup
    model = _manager.fetch(project, schema, force_reload, hostname=hostname,
access_token=access_token)
  File "/usr/local/lib/python3.8/site-packages/label_studio_ml/model.py", line 502, in fetch
    model = cls.model_class(label_config=label_config, **kwargs)
  File "/app/./layoutlmv3.py", line 48, in __init__
    self.processor = self.tokenizer_cls.from_pretrained(self.hf_hub_name)
AttributeError: 'LayoutLMv3Classifier' object has no attribute 'tokenizer_cls'

[Mon Jul 18 17:56:17 2022] POST /setup => generated 888 bytes in 3 msecs (HTTP/1.1 500) 2 headers in 91 bytes (1 switches on core 0)

I double-checked that it was definitely caused by the /setup endpoint being called when clicking "Validate and Save" on the Label Studio "Add Model" modal, and indeed the error message appeared again.

At this point I:

  • Stopped the docker compose process
  • Cancelled the tail process
  • Left label-studio running
  • Edited the model module to fix the bug (and prayed that was the only one)
  • Re-ran the docker rm command and restarted docker compose as above.

This time I got a little pause before the error:

Successfully connected to http://localhost:9090/ but it doesn't look like a valid ML backend. Reason: HTTPConnectionPool(host='localhost', port=9090): Read timed out. (read timeout=3.0).

Check the ML backend server console logs to check the status. There might be something wrong with your model or it might be incompatible with the current labeling configuration.

This time it'd gotten a bit further:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/label_studio_ml/exceptions.py", line 39, in
exception_f
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/label_studio_ml/api.py", line 50, in _setup
    model = _manager.fetch(project, schema, force_reload, hostname=hostname,
access_token=access_token)
  File "/usr/local/lib/python3.8/site-packages/label_studio_ml/model.py", line 502, in fetch
    model = cls.model_class(label_config=label_config, **kwargs)
  File "/app/./layoutlmv3.py", line 50, in __init__
    self.labels = self.info["labels"]
AttributeError: 'LayoutLMv3Classifier' object has no attribute 'info'

This came from:

class LayoutLMv3Classifier(LabelStudioMLBase):
    control_type: str = "RectangleLabels"
    object_type: str = "Image"
    hf_hub_name: str = "microsoft/layoutlmv3-base"
    hf_model_cls: Type = LayoutLMv3ForTokenClassification
    hf_processor_cls: Type = LayoutLMv3Processor
    detection_source: str = f"${object_type}".lower()
    detection_type: str = control_type.lower()

    def __init__(self, **kwargs):
        super(LayoutLMv3Classifier, self).__init__(**kwargs)
        self.load_config()
        self.processor = self.hf_processor_cls.from_pretrained(self.hf_hub_name)
        if not self.train_output:
            self.labels = self.info["labels"]
            self.reset_model()
            load_repr = "Initialised with"
        else:
            self.load(self.train_output)
            load_repr = "Loaded from train output with"
        logger.info(f"{load_repr} {self.from_name=}, {self.to_name=}, {self.labels=!s}")

I guess I'm supposed to hardcode the labels into the model. The attribute self.info is not set anywhere else in my code; the line came from matching code in the BERT and SimpleTextClassifier examples.

I created the labels when I did some initial labelling, and from parsing the export JSON I can get them back:

>>> {x["rectanglelabels"][0] for x in d[0]["label"]}
{'intra_ref_redirect_command', 'intra_ref_redirect_referent_name', 'partial_referent_name',
'ref_page_num', 'page_num', 'section_break', 'referent_synonym_or_specifier', 'section_header',
'referent_name', 'ref_section_header'}

I copied the order in which I input them into the interface, made an enum, and wrote the same id2label and label2id mappings that Niels did.

from enum import Enum

class Labels(Enum):
    referent_name = 1
    ref_page_num = 2
    page_num = 3
    section_header = 4
    ref_section_header = 5
    intra_ref_redirect_command = 6
    intra_ref_redirect_referent_name = 7
    partial_referent_name = 8
    referent_synonym_or_specifier = 9
    section_break = 0

def label2id() -> dict:
    return {k.name: k.value for k in Labels}

def id2label() -> dict:
    return {k.value: k.name for k in Labels}
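For example:

>>> label2id()["page_num"]
3
>>> id2label()[3]
'page_num'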

There's a bit of a catch-22 at play here: to label effectively, I need my interface set up; to set up my labelling interface, I need a fine-tuned model (or at least its labels) to load from.

One workaround I can see is to fine-tune a model with just one category (e.g. page_num: the page number), and then 'bootstrap' from that once the more efficient training loop is set up.

In other words: don't try and create a custom model and a backend for it in one go. Instead:

  1. Create a custom dataset
  2. Somehow (however you can) get a few annotations for (at least) a single class into the dataset.
  3. Fine-tune LayoutLMv3 on this minimal dataset (you don't need examples for every class; the model will just learn never to assign the missing ones).
  4. Hook the fine-tuned model backend into Label Studio, to thereby incrementally complete your dataset annotation.
  5. Iteratively repeat the fine-tuning (from scratch), 'bootstrapping' the annotation process.

While thinking along these lines, I also realised I had subtly given myself a particularly hard task: not just setting up an ML model backend, but doing so without an intermediate working model. A LayoutLMv3 model fine-tuned on FUNSD would be easier to set up as an example, and once that was done I could take it a step further by swapping in my own model.

Therefore I duplicated the backend and renamed that example directory layoutlmv3_funsd.
