Setting up ML backend for Label Studio - lmmx/devnotes GitHub Wiki
- Background links
- Requirements
- Deploying an example backend
- Backend specification
- Detectron2 example
- MMDetection example
- HuggingFace Transformers backends
- Custom LayoutLMv3 object detection backend
- Backend rewriting recipe
- Deploying and debugging the custom backend
- Attaching the custom backend
A prerequisite for setting up an ML backend for Label Studio is Docker Compose, which in turn requires Docker Engine.
- Specifically, the Quickstart with an example ML backend guides you to set up Compose.
See Installing Docker Compose and Installing Docker Engine.
(Thankfully the standard way to install Engine also installs Compose.)
To set up the example used in the repo's docs, run the following:

```bash
git clone https://github.com/heartexlabs/label-studio-ml-backend
cd label-studio-ml-backend/label_studio_ml/examples/simple_text_classifier
docker compose up
```

If you installed via a different route you may have `docker-compose` as the command instead.
Here's the project structure:
```
label-studio-ml-backend/label_studio_ml/examples/simple_text_classifier $ tree .
⇣
.
├── data
│   ├── redis
│   └── server
│       └── models
├── docker-compose.yml
├── Dockerfile
├── logs
├── README.md
├── requirements.txt
├── simple_text_classifier.py
└── _wsgi.py

5 directories, 6 files
```
There's a `data` directory with a subdirectory for each of the services, `redis` and `server` (the latter with a `models` subdirectory), and a `logs` directory.

- In fact these are all created when the service starts: notice they're not checked into the repo.
The `docker-compose.yml` Compose specification file specifies the service names, where to mount these directories,
and the port number for the `server` service:

- The `redis` service:
  - mounts `./data/redis` as `/data`
  - also names its container and hostname `redis`
- The `server` service:
  - mounts `./data/server` as `/data`
  - mounts `./logs` as `/tmp`
  - sets environment variables including `MODEL_DIR` as `/data/models` and an API key
  - defines a network link to the container in the `redis` service, thereby determining the order of service startup
  - specifies that it `depends_on` the `redis` service, again determining the order of service startup and shutdown
```yaml
version: "3.8"

services:
  redis:
    image: redis:alpine
    container_name: redis
    hostname: redis
    volumes:
      - "./data/redis:/data"
    expose:
      - 6379
  server:
    container_name: server
    build: .
    environment:
      - MODEL_DIR=/data/models
      - RQ_QUEUE_NAME=default
      - REDIS_HOST=redis
      - REDIS_PORT=6379
      - LABEL_STUDIO_ML_BACKEND_V2=true
      - LABEL_STUDIO_HOSTNAME=http://localhost:8000
      - LABEL_STUDIO_API_KEY=d6f8a2622d39e9d89ff0dfef1a80ad877f4ee9e3
    ports:
      - "9090:9090"
    depends_on:
      - redis
    links:
      - redis
    volumes:
      - "./data/server:/data"
      - "./logs:/tmp"
```
- The `LABEL_STUDIO_API_KEY` is specific to this backend (see grep.app)
- The YAML specifies 2 services:
  - one named `redis` (which is spun up from the `redis:alpine` image and bind-mounts `./data/redis`)
  - one named `server` (which depends on the one named `redis`)
The obvious question here is how `./data/server/models` got made (see the directory tree above).
Since `/data/models` is passed in as the environment variable `MODEL_DIR`, it seems obvious that
either `_wsgi.py` or `simple_text_classifier.py` boots up when the service starts and `touch`es it.
`_wsgi.py` is a 126-line module that looks in its directory for a `config.json` (note: not used),
exposes a CLI parser that reads a kwarg config of parameters as well as the port (from the env. var.),
host, debug flag, log level, and model dir (defaulting to the file's dir), and runs the app
that it initialises via `label_studio_ml.api.init_app`;
or, if not being run from the command line, it goes straight to initialising
the app with the args from the environment but does not run it.
Note: the app is initialised by a module-level singleton `LabelStudioMLManager` instance, `_manager`, passing the `model_class` as the `SimpleTextClassifier` class imported from the `simple_text_classifier.py` module, and the `_server` object returned here as `app` is a Flask app, again a module-level singleton instance. Importantly, the `SimpleTextClassifier` class gets bound to the `model_class` of the `LabelStudioMLManager` and ends up getting bound inside the `_current_model` attribute (a dict).
`simple_text_classifier.py` is a 161-line module that ensures an API key is set as an env. var.,
then defines the `SimpleTextClassifier` class, which subclasses `LabelStudioMLBase` and uses some imported sklearn
helpers (`LogisticRegression`, `TfidfVectorizer`, `make_pipeline`). The class has only a few methods:
- `__init__`:
  - checks `self.parsed_label_config` is [a dict] of length 1 (this comes from the base class), and sets `self.from_name` and `self.info` from its key/value.
    - The base class sets this attribute from `parse_config(self.label_config) if self.label_config else {}`.
    - The base class's `__init__` signature is `self, label_config=None, train_output=None, **kwargs`. My understanding here is that the Flask app sends a POST request that includes the config of the project, among which is the labelling config set up in the UI, and that if this isn't text classification then initialising this backend will fail.
  - checks that the config's `type` (now in `self.info`) value is "Choices" (i.e. it's for classification).
  - checks that the config's `to_name` and `inputs` are length 1 (i.e. the model has just 1 input), and that the type of the input is "Text".
  - sets `self.to_name` from the config's `to_name`
  - sets `self.value` from the config's first and only `inputs` value
- `reset_model` creates the following simple 2-step model:

  ```python
  self.model = make_pipeline(
      TfidfVectorizer(ngram_range=(1, 3), token_pattern=r"(?u)\b\w\w+\b|\w"),
      LogisticRegression(C=10, verbose=True),
  )
  ```
- `predict` gets the input text from the `data` of each of the `tasks` (passed in as an argument), runs `self.predict_proba()` on them, gets `argmax` indices of the predicted labels and scores, then zips the labels against the scores in a dict that gets listed for all the tasks.

  ```python
  for idx, score in zip(predicted_label_indices, predicted_scores):
      predicted_label = self.labels[idx]
      # prediction result for the single task
      result = [{
          'from_name': self.from_name,
          'to_name': self.to_name,
          'type': 'choices',
          'value': {'choices': [predicted_label]}
      }]
      # expand predictions with their scores for all tasks
      predictions.append({'result': result, 'score': score})
  ```
- `_get_annotated_dataset` takes a `project_id` and is used to support webhook-based training workflows (in `fit`). It is marked "just for demo purposes", and uses the API key to authenticate a request to `localhost/api/projects/{project_id}/export` to retrieve data annotations for the project (a condensed sketch of this request and of `fit` follows this list).
  - N.B. this is why the API key is needed.
- `fit` takes `annotations` (which gets renamed to `tasks`) and a `workdir` (not used if the `MODEL_DIR` env. var. is set), builds a list of `input_texts` by accessing `["data"].get(self.value)` on each of the `tasks`, calls `reset_model` then `self.model.fit` with the `input_texts` and `output_labels_idx`, and returns a dict of `labels` (the sorted set of output labels coerced to list) and `model_file` (the pickled model).
  - The optional kwarg `data` can be used to override `annotations` as `tasks = _get_annotated_dataset(data["project"]["id"])`.
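Here's a condensed, hedged sketch of the flow those two bullets describe (my paraphrase, not the verbatim source; the annotation access path follows the `completion['annotations'][0]...` line quoted in the BERT comparison further down, and the pickle path is an assumption):

```python
import json
import os
import pickle

import requests


def _get_annotated_dataset(project_id):
    """Fetch all annotated tasks for a project from the local Label Studio API."""
    download_url = f"{HOSTNAME}/api/projects/{project_id}/export"
    response = requests.get(download_url, headers={"Authorization": f"Token {API_KEY}"})
    return json.loads(response.content)


def fit(self, annotations, workdir=None, **kwargs):
    # webhook path: override the passed-in annotations with a fresh export
    if kwargs.get("data"):
        annotations = _get_annotated_dataset(kwargs["data"]["project"]["id"])
    tasks = annotations
    input_texts = [task["data"].get(self.value) for task in tasks]
    output_labels = [
        task["annotations"][0]["result"][0]["value"]["choices"][0] for task in tasks
    ]
    self.labels = sorted(set(output_labels))
    output_labels_idx = [self.labels.index(label) for label in output_labels]
    self.reset_model()
    self.model.fit(input_texts, output_labels_idx)
    model_file = os.path.join(os.getenv("MODEL_DIR", workdir or "."), "model.pkl")
    with open(model_file, "wb") as f:
        pickle.dump(self.model, f)
    return {"labels": self.labels, "model_file": model_file}
```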
The README for this backend says to run `curl http://localhost:9090/health` to check the service is running OK,
and indeed it returns some JSON:

```json
{"model_dir":"/data/models","status":"UP","v2":"true"}
```
If we search through the repo for the word "health" we find that this is a Flask route defined in
`label_studio_ml/api.py`:

```python
@_server.route('/health', methods=['GET'])
@_server.route('/', methods=['GET'])
@exception_handler
def health():
    return jsonify({
        'status': 'UP',
        'model_dir': _manager.model_dir,
        'v2': os.getenv('LABEL_STUDIO_ML_BACKEND_V2', default=LABEL_STUDIO_ML_BACKEND_V2_DEFAULT)
    })
```
Reading this also tells us that we can `curl http://localhost:9090/` (the `/` route) and get the same output.
These are bound to our deployed app because when we initialised it we got the `_server`
module-level singleton object defined in this `api.py` module.
Another thing to notice about this route funcdef is that it takes no arguments (pretty standard
for a GET request), and only uses environment variables and the `_manager` (module-level global variable).
Compare this to the POST request route for `_predict`:

```python
@_server.route('/predict', methods=['POST'])
@exception_handler
def _predict():
    data = request.json
    tasks = data.get('tasks')
    project = data.get('project')
    label_config = data.get('label_config')
    force_reload = data.get('force_reload', False)
    try_fetch = data.get('try_fetch', True)
    params = data.get('params') or {}
    predictions, model = _manager.predict(
        tasks, project, label_config, force_reload, try_fetch, **params
    )
    response = {
        'results': predictions,
        'model_version': model.model_version
    }
    return jsonify(response)
```
Here there is the implicit `request` object, which Flask provides to a route when it is called.
It's known as a 'context' in Flask's docs, implemented in werkzeug as a context local. I don't really know much about the implementation details here, other than that you access the `.json` attribute and then you're just working with a regular dict (similar to `locals()`).
But how does this all work together? How can we test the `/predict` route? We can't just send a
plain POST request:

```bash
curl --header "Content-Type: application/json" --request POST --data '{}' http://localhost:9090/predict
```

We hit an exception in the `LabelStudioMLManager.predict` class method, which receives the empty `data`, and get told the model is not loaded:

```python
@classmethod
def predict(
    cls, tasks, project=None, label_config=None, force_reload=False, try_fetch=True, **kwargs
):
    if not os.getenv('LABEL_STUDIO_ML_BACKEND_V2', default=LABEL_STUDIO_ML_BACKEND_V2_DEFAULT):
        if try_fetch:
            m = cls.fetch(project, label_config, force_reload)
        else:
            m = cls.get(project)
            if not m:
                raise FileNotFoundError('No model loaded. Specify "try_fetch=True" option.')
        predictions = m.model.predict(tasks, **kwargs)
        return predictions, m

    if not cls._current_model:
        raise ValueError(f'Model is not loaded for {cls.__class__.__name__}: run setup() before using predict()')
    predictions = cls._current_model.model.predict(tasks, **kwargs)
    return predictions, cls._current_model
```
In other words, you should [let the UI] set this backend up before trying to decipher the inner workings any deeper.
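(For reference, once a model has actually been set up, a manual request to the route would need at least the keys `_predict` reads from `request.json`; a hedged sketch with hypothetical values:)

```python
import requests

payload = {
    "tasks": [{"data": {"text": "An example input to classify"}}],  # field name depends on your labelling config
    "label_config": "<View>...</View>",  # the project's labelling config XML
    "project": "1",
}
response = requests.post("http://localhost:9090/predict", json=payload)
print(response.json())  # {"results": [...], "model_version": ...}
```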
I don't want a text classifier like this though, I want a bounding box predictor (object detection model). This one doesn't tick all the boxes for my needs, which are:

- A HuggingFace model (the text classifier indeed uses HuggingFace `transformers.AutoTokenizer` and `transformers.AutoModelForCausalLM`). These will be useful for figuring out what to put in the `predict` method of the model class in my ML backend.
- An image model (not too complicated: similar to a language model, it'll use a `tokenizer`/`processor` and a `model`). This will be useful for figuring out how to pass the image data into the model (which has some requirements that get validated in the model's `__init__` method).
It's clear that of these two, the first priority should be to find another object detection Label Studio backend,
so I'd be able to look at the assertions made in its equivalent of the `simple_text_classifier`'s
`SimpleTextClassifier.__init__()` method.
I searched for the name of the base class `LabelStudioMLBase` on the code search site grep.app
(here) and indeed I landed on an image model,
Detectron2, a well-known semantic segmentation model (which is close to object detection with
bboxes, but I expected it would be outputting pixel-level masks).

Edit: in fact it is giving bboxes: in the code excerpt below, the result type is "rectanglelabels".

Edit 2: it turned out I overlooked one right under my nose: `mmdetection` contains an object
detection example in this repo!
After some further digging, this turned out to be one of the LayoutParser developers' personal copy of code that would go on to become part of the official LayoutParser annotation service.
This is much closer to what I am aiming for, in fact it's the exact same task even;
however I want to use the LayoutLMv3 model on HuggingFace, whereas this example
(obviously) is using a `layoutparser` model (specifically `lp.Detectron2LayoutModel`).
From this example however, it's clear what function signatures we should aim for in an object detection API:
```python
class ObjectDetectionAPI(LabelStudioMLBase):
    def __init__(self, freeze_extractor=False, **kwargs):
        ...

    def predict(self, tasks, **kwargs):
        image_urls = [task["data"][self.value] for task in tasks]
        images = [load_image_from_url(url) for url in image_urls]
        layouts = [self.model.detect(image) for image in images]
        predictions = []
        for image, layout in zip(images, layouts):
            height, width = image.shape[:2]
            result = [
                {
                    "from_name": self.from_name,
                    "to_name": self.to_name,
                    "original_height": height,
                    "original_width": width,
                    "source": "$image",
                    "type": "rectanglelabels",
                    "value": convert_block_to_value(block, height, width),
                }
                for block in layout
            ]
            predictions.append({"result": result})
        return predictions

    def fit(self, completions, workdir=None, batch_size=32, num_epochs=10, **kwargs):
        image_urls, image_classes = [], []
        print("Collecting completions...")
        # for completion in completions:
        #     if is_skipped(completion):
        #         continue
        #     image_urls.append(completion['data'][self.value])
        #     image_classes.append(get_choice(completion))
        print("Creating dataset...")
        # dataset = ImageClassifierDataset(image_urls, image_classes)
        # dataloader = DataLoader(dataset, shuffle=True, batch_size=batch_size)
        print("Train model...")
        # self.reset_model()
        # self.model.train(dataloader, num_epochs=num_epochs)
        print("Save model...")
        # model_path = os.path.join(workdir, 'model.pt')
        # self.model.save(model_path)
        return {"model_path": None, "classes": None}
```
It's pretty clear here that this is a work in progress (i.e. all the commented out code). After getting to grips with how Label Studio backends work, I'm fairly certain that the training API isn't operational; the prediction service looks like it could be, though.

The commented out line `dataset = ImageClassifierDataset(image_urls, image_classes)` caught my
attention, as it suggests that this was building on prior work. Indeed, searching on
grep.app shows that this name comes from the `label-studio-ml-backend` repo:

- `label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py`
- `docs/source/tutorials/pytorch-image-transfer-learning.md`

One thing I like about the code at this link is that it has nicely contained methods.
(To be covered below: the HuggingFace example has quite a messy `fit` method.)
It's a pretty simple PyTorch dataset, but I'm not personally going to use URLs for my data, so it's not quite aligned to my needs.
I expect I'm going to use something more like this example by Niels Rogge (MLE at HuggingFace):
```python
from torch.utils.data import Dataset
from PIL import Image

class CustomDataset(Dataset):
    def __init__(self, root, df, processor):
        self.root = root
        self.df = df
        self.processor = processor

    def __getitem__(self, idx):
        # get document image + corresponding words and boxes
        item = self.df.iloc[idx]
        image = Image.open(self.root + ...).convert('RGB')
        words = item.words
        boxes = item.boxes
        # use processor to prepare everything for the model
        encoding = self.processor(image, words, boxes=boxes)
        return encoding
```
This is just a draft, assuming you have a root folder with all your document images, and a Pandas dataframe that contains the words + boxes for each document image.
You can then instantiate the dataset as follows:
```python
from transformers import LayoutLMv3Processor

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
dataset = CustomDataset(root="path_to_your_root", df="your_dataframe", processor=processor)
```
Note that in the tutorial notebook it's clarified what the processor is:

> Next, we prepare the dataset for the model. This can be done very easily using `LayoutLMv3Processor`, which internally wraps a `LayoutLMv3FeatureExtractor` (for the image modality) and a `LayoutLMv3Tokenizer` (for the text modality) into one.
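To make that concrete, here's a hedged sketch of calling the processor; note the assumption that it was created with `apply_ocr=False`, since we supply the words and boxes ourselves (the file name and values are hypothetical):

```python
from PIL import Image
from transformers import LayoutLMv3Processor

# apply_ocr=False because the words/boxes come from our own data, not from Tesseract
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

image = Image.open("example_page.png").convert("RGB")  # hypothetical document image
words = ["Invoice", "Total", "42.00"]
boxes = [[10, 10, 120, 40], [10, 60, 80, 90], [90, 60, 160, 90]]  # 0-1000 normalised coords

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
print(encoding.keys())  # input_ids, attention_mask, bbox, pixel_values
```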
Back to the code at hand though! (There's not much to say.)
The `result` here would be a good use case for a `typing.TypedDict`, as the keys will always be the same.
Note here that `convert_block_to_value(block, image_height, image_width)` returns:
```python
{
    "height": block.height / image_height * 100,
    "rectanglelabels": [str(block.type)],
    "rotation": 0,
    "width": block.width / image_width * 100,
    "x": block.coordinates[0] / image_width * 100,
    "y": block.coordinates[1] / image_height * 100,
    "score": block.score,
}
```
...and `block` is a single object from `layouts`, which is a list returned from
`self.model.detect(image)`, where (as already stated) the model is `Detectron2LayoutModel`
from `layoutparser` (source here).
If we keep digging,
we see the `detect` method returns what it gets from running the `gather_output`
method on the output of calling `self.model`. Setting the model itself aside, the "gathering" involves
creating `Layout` objects (from the `lp.elements.layout` module)
and putting `TextBlock` objects in them, each populated with a `block` argument made of a
`Rectangle`, both from the `lp.elements.layout_elements` module.
The `Rectangle` is a "manual dataclass" made of `x_1`, `y_1`, `x_2`, `y_2`
(or 'Lord of the Rings Bilbo' notation as I remember it: LT,RB).
So that's the `block` being iterated over in the `predict` method of
the `ObjectDetectionAPI` class (which subclasses `LabelStudioMLBase`),
and therefore we can interpret the values: the "x" and "y" are the
bbox's left and top coordinates as percentages of the image width and height,
while the "height" and "width" of the bbox are likewise relative to the image's
height and width (again as percentages).
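A tiny worked example of that conversion, with hypothetical numbers, following the formulas in `convert_block_to_value` above:

```python
# hypothetical 800x600 image with a block at pixel coordinates (x1, y1, x2, y2)
image_width, image_height = 800, 600
x1, y1, x2, y2 = 60, 40, 360, 240

value = {
    "x": x1 / image_width * 100,               # 7.5   (% of width from the left edge)
    "y": y1 / image_height * 100,              # ~6.67 (% of height from the top edge)
    "width": (x2 - x1) / image_width * 100,    # 37.5
    "height": (y2 - y1) / image_height * 100,  # ~33.33
    "rotation": 0,
    "rectanglelabels": ["Text"],               # hypothetical label
}
```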
The block "score" comes from the model, it's difficult to look at the model directly in the code
due to the weird metaprogramming approach used (it comes from another package fvcore
,
fvcore.common.registry
,
via detectron2
's registry
),
which instantiates a module-wide 'registry' of architectures which get recorded through the
@META_ARCH_REGISTRY.register()
decorator
(see search results here).
The block "type" is passed as the label of predicted classes list (pred_classes.tolist()
).
I missed the OpenMMLab MMDetection toolbox example
backend at first, perhaps because it has such a simple class structure: it only has `__init__` and
`predict` methods (as well as a `_get_image_url` helper method). It's not trainable through the
Label Studio interface; you just load trained checkpoints from file.
This one's a bit unusual: it's the only one I've seen here that asks you to specify the device in the class
`__init__` signature (defaulting to `"cpu"`).
It also loads labels from a file (the other backends populate their `labels` attribute
from the `labels` value in the `info` attribute that comes from the `parsed_label_config` dict's value).
Instead, the `parsed_label_config` dict's first value is assigned to `schema`, which looks like it's
also a dict, with yet more dicts nested inside... (Type annotations would be valuable here!)
```python
(
    self.from_name,
    self.to_name,
    self.value,
    self.labels_in_config,
) = get_single_tag_keys(self.parsed_label_config, "RectangleLabels", "Image")
schema = list(self.parsed_label_config.values())[0]
self.labels_in_config = set(self.labels_in_config)

# Collect label maps from `predicted_values="airplane,car"` attribute
# in <Label> tag
self.labels_attrs = schema.get("labels_attrs")
if self.labels_attrs:
    for label_name, label_attrs in self.labels_attrs.items():
        for predicted_value in label_attrs.get("predicted_values", "").split(","):
            self.label_map[predicted_value] = label_name
```
That `get_single_tag_keys` function is from the `label_studio_ml.utils` module:
```python
def get_single_tag_keys(parsed_label_config, control_type, object_type):
    """
    Gets parsed label config, and returns data keys related to the single
    control tag and the single object tag schema
    (e.g. one "Choices" with one "Text")
    :param parsed_label_config: parsed label config returned by
        "label_studio.misc.parse_config" function
    :param control_type: control tag str as it written in label config
        (e.g. 'Choices')
    :param object_type: object tag str as it written in label config
        (e.g. 'Text')
    :return: 3 string keys and 1 array of string labels:
        (from_name, to_name, value, labels)
    """
    assert len(parsed_label_config) == 1
    from_name, info = list(parsed_label_config.items())[0]
    assert info["type"] == control_type, (
        'Label config has control tag "<'
        + info["type"]
        + '>" but "<'
        + control_type
        + '>" is expected for this model.'
    )  # noqa
    assert len(info["to_name"]) == 1
    assert len(info["inputs"]) == 1
    assert info["inputs"][0]["type"] == object_type
    to_name = info["to_name"][0]
    value = info["inputs"][0]["value"]
    return from_name, to_name, value, info["labels"]
```
As well as a 'getter', this is a helper validating the `parsed_label_config`!
The `control_type` appears to be more like the "type of task label"
(so a classification task has, for example, the "Choices" type of task label),
which is "RectangleLabels" here because the labels are bboxes,
and the `object_type` is more like the "modality type of the task"
(so a text classifier has, for example, the "Text" modality),
which is "Image" here because the objects are detected in images.
I don't just want an image inputs/rectangular labels ML backend though, I specifically want to use HuggingFace's Transformers library to load my model and then make predictions with it.
If we hop into the examples directory and search for the `transformers` import statement we can see
what's been demo'd:
```
grep -r --include \*.py transformers
⇣
huggingface/gpt.py:from transformers import AutoTokenizer, AutoModelForCausalLM
bert/bert_classifier.py:from transformers import BertTokenizer, BertForSequenceClassification
bert/bert_classifier.py:from transformers import AdamW, get_linear_schedule_with_warmup
ner/ner.py:from transformers import (
ner/ner.py:from transformers import AdamW, get_linear_schedule_with_warmup
electra/electra.py:from transformers import ElectraTokenizerFast, ElectraForSequenceClassification
electra/electra.py:from transformers import Trainer
electra/electra.py:from transformers import TrainingArguments
```
So that's GPT, BERT, and Electra, all (text) language models. The `ner` directory likewise obviously
contains language models (for named entity recognition: BERT, Roberta, DistilBert, CamemBert).
Note from the imported names that `bert` is the first in this list that does classification
like the `simple_text_classifier` backend we saw above (`BertForSequenceClassification`).
This seems like a good example to compare to (it should otherwise be similar to `simple_text_classifier`).
The only difference between the two Docker Compose specs (`bert` and `simple_text_classifier`) is
that the BERT example does not specify any of the Label Studio-related env. vars:
```
diff simple_text_classifier/docker-compose.yml bert/docker-compose.yml
⇣
20,22d19
<       - LABEL_STUDIO_ML_BACKEND_V2=true
<       - LABEL_STUDIO_HOSTNAME=http://localhost:8000
<       - LABEL_STUDIO_API_KEY=d6f8a2622d39e9d89ff0dfef1a80ad877f4ee9e3
```
The hostname and API key are used to GET data annotations from the Label Studio API [locally], and it turned out these aren't set up for this backend, hence the env. var's not being set.
Let's now look at the `bert_classifier.py` module. The class `BertClassifier` defines nearly all of the
same methods as `SimpleTextClassifier`. It doesn't define `_get_annotated_dataset`,
which I interpret as meaning this model does not support webhook-based training workflows (unconfirmed).
The rest:
- `__init__` is identical, but now with a few extra lines after the base class gets initialised (assigning attributes that were added to the previously blank `self, **kwargs` signature, all of which have defaults):

  ```python
  self.pretrained_model = pretrained_model
  self.maxlen = maxlen
  self.batch_size = batch_size
  self.num_epochs = num_epochs
  self.logging_steps = logging_steps
  self.train_logs = train_logs
  ```

  - It's not quite identical at the end either: the pickle loading is replaced with loading `from_pretrained` the model saved with `save_pretrained` in HuggingFace.
- `reset_model` is used to set up an initial model, but rather than the sklearn pipeline in `SimpleTextClassifier`, it's the model loaded `from_pretrained` again.
- `predict` does a few things differently:
  - First off, it won't return anything if the `tokenizer` attribute wasn't set by running the `load()` method in the `__init__` method [by passing the truthiness check on `self.train_output`, which gets set in the base class when the `train_output` kwarg is passed].
  - Rather than just iterating over the tasks and sticking `task["data"].get(self.value)` into a list of `input_texts`, a proper dataloader is used (it's cooked up in the `utils.prepare_texts` function), and iterating over it gives input IDs and attention masks, which are moved to the appropriate device upon being dataloaded.
  - After dataloading, model inference runs in a `torch.no_grad` block (which disables the gradient calculation), and then after this block the resulting logits are `detach`ed from the graph and put back on the CPU.
  - The scores and labels are assigned more neatly than in the sklearn model. The predicted label is listed directly rather than waiting to zip the argmax index against the score and look up the label just before building the result dict.
- `fit` has no argument `annotations` but instead `completions`, which has the annotations nested inside it. Compare the `simple_text_classifier` vs. `bert_classifier`; they're clearly the same:

  ```python
  output_label = annotation['result'][0]['value']['choices'][0]
  ```

  ```python
  output_label = completion['annotations'][0]['result'][0]['value']['choices'][0]
  ```

  After that, there's a ton more that goes on (whereas the sklearn backend's `fit` method abstracted it all away into the sklearn `model.fit` call). This is not particularly worth going through: a neural net training loop, with logs, a tqdm'd dataloader, a `model.train()` call, loss backprop., and early stopping.
...it also defines:

- a `load` method, which loads the pretrained model and overwrites some of the attributes defined at `__init__` with attributes from the restored model (namely: `batch_size`, `labels`, `maxlen`).
- a `not_trained` property, which relies on checking if the `self.tokenizer` attribute has been set (it gets set in the `load` method).
The first thing notable in the `electra` backend is that its `_wsgi.py` is near identical to the
`bert` directory's: the `BertClassifier` [subclass of `LabelStudioMLBase`]
is just replaced with `ElectraTextClassifier`.
The `electra.py` module is simpler however, totalling 145 lines to the BERT module's 221.
Right from the start you can see it has fewer imports.
On closer inspection this is because the Electra model is trained with
the HuggingFace `Trainer` class,
so other than `transformers`, the only libraries loaded in the module are `requests` and `json`!
- See its source code for details on what is abstracted away into this class or click through to particular sections from the docs
```
diff <(grep import examples/bert/bert_classifier.py) <(grep import examples/electra/electra.py)
⇣
2c2,3
< import numpy as np
---
> import requests
> import json
4,10c5,7
< from torch.utils.data import SequentialSampler
< from tqdm import tqdm, trange
< from collections import deque
< from tensorboardX import SummaryWriter
< from transformers import BertTokenizer, BertForSequenceClassification
< from transformers import AdamW, get_linear_schedule_with_warmup
< from torch.utils.data import TensorDataset, DataLoader, RandomSampler
---
> from transformers import ElectraTokenizerFast, ElectraForSequenceClassification
> from transformers import Trainer
> from transformers import TrainingArguments
12c9
< from utils import prepare_texts, calc_slope
---
> from label_studio_tools.core.label_config import parse_config
```
- The `__init__` method is conspicuously lacking any `assert` statements (the other 2 examples had checks for the config's inputs value being length 1, i.e. for single labels in annotations). It seems to just rely on this implicitly, however, and behaves the same.

  ```python
  self.value = self.info["inputs"][0]["value"]
  ```

- The `fit` method is shrunk back down closer to the `simple_text_classifier` backend, after being crammed full of training loop logic in the BERT backend.
- There's no `load` method (which in the BERT model was checking if `self.tokenizer` was set). Here `self.tokenizer` gets set in the `__init__` method.
- There is a `load_config` method, but this is used to initialise the `parsed_label_config` if `fit` is called before that's set. It's set when the base class initialises, but can be `{}` if no config is passed (i.e. if the `ElectraTextClassifier` class isn't passed a `label_config` kwarg on init).
- The `predict` method has the nice neat HuggingFace style of predictions (seen in the BERT example) but keeps the label index from argmax as seen in the `simple_text_classifier`'s sklearn code. This is the best of both worlds.
- The `_get_annotated_dataset` method is back, and handles the 'webhook' events with the API key (though the API key is hardcoded here, rather than set as an env. var. in the Docker Compose spec as done in the `simple_text_classifier`).
- There is also a new `_get_text_from_s3` method, which I don't need.
It also includes a `CustomDataset` class similar to the draft above.
To step aside and review how each of the models is loaded and what it entails for the resulting backend capabilities (we can put a checkbox beside them to indicate if they are compatible with local training):
grep -r --include \*.py "self.model ="
⇣
label_studio_ml/examples/flair/ner_ml_backend.py: self.model = self.load(self.train_output["base_path"])
label_studio_ml/examples/huggingface/gpt.py: self.model = AutoModelForCausalLM.from_pretrained(self.model_name)
label_studio_ml/examples/mmdetection/mmdetection.py: self.model = init_detector(config_file, checkpoint_file, device=device)
label_studio_ml/examples/bert/bert_classifier.py: self.model = BertForSequenceClassification.from_pretrained(pretrained_model)
label_studio_ml/examples/nemo/asr.py: self.model = nemo_asr.models.EncDecCTCModel.from_pretrained(
label_studio_ml/examples/tensorflow/mobilenet_finetune.py: self.model = tf.keras.Sequential(
label_studio_ml/examples/simple_text_classifier/simple_text_classifier.py: self.model = pickle.load(f)
label_studio_ml/examples/simple_text_classifier/simple_text_classifier.py: self.model = make_pipeline(
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py: self.model = models.resnet18(pretrained=True)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py: self.model = self.model.to(device)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py: self.model = ImageClassifier(len(self.classes), freeze_extractor)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py: self.model = ImageClassifier(len(self.classes), freeze_extractor)
label_studio_ml/examples/pytorch_transfer_learning/pytorch_transfer_learning.py: self.model = ImageClassifier(len(self.classes), self.freeze_extractor)
label_studio_ml/examples/electra/electra.py: self.model = ElectraForSequenceClassification.from_pretrained("my_model")
label_studio_ml/examples/electra/electra.py: self.model = ElectraForSequenceClassification.from_pretrained(
- The `flair` backend assigns `self.model` in the `__init__` method in a conditional block checking `self.train_output` (which gets set in the base class `__init__`), and if the check fails it just doesn't load a model (it doesn't even assign `None` to the attribute! Dicey).
  - The model is always loaded from a local path with filename `best_model.pt`.
- The `gpt` backend doesn't do any check, and the model gets assigned from `self.model_name`, so it can't be trained (so it isn't usable in active learning retraining workflows).
- The `mmdetection` backend sets it once, as for `gpt`.
- The `bert` backend sets it from `pretrained_model` in a `load` method. Unless I'm mistaken, there's a bug where `reset_model` is a no-op: the model it returns isn't assigned to `self.model` (as the `simple_text_classifier` sklearn backend did). In fact the only other example with a `reset_model` method is the `pytorch_transfer_learning` backend, and indeed it binds `self.model` too.
  - That said, if it were fixed (so the call to `reset_model` in the `__init__` method assigned to `self.model`) it would be training-compatible.
- There is also the `ner` backend, which sets `self._model` from the `self.train_output` attribute's `model_path` value (which with HuggingFace can of course be a HuggingFace Hub-hosted model path rather than a local file path).
- The `nemo` ASR backend sets it once, as for `gpt`.
- The `tensorflow` backend sets it once but loads weights afterwards if `self.train_output` is set (truthily).
- The `simple_text_classifier` backend either sets it from `reset_model` and immediately `fit`s to initialise if `train_output` is falsey, or sets it from the pickle if a local `model_file` is passed in `train_output`.
- The `pytorch_transfer_learning` backend loads its classes if `train_output` is passed, and loads weights into the model after assigning it too; otherwise it just initialises it. This is done quite neatly (probably helped by defining the model class in the module itself, not relying on importing an external one).
- The `electra` backend checks whether a hardcoded path exists and then doesn't use that hardcoded path, though it clearly is supposed to. Again, yes, you can train the model and use it here.
So to summarise the training-friendly backend examples and whether they're good templates to build from:

- `bert` pulls in the labels from the label config `info` before resetting the model if training output is not available; if available, it `load`s that and gets the labels and [should get] the model from there.
  - It uses the fact that only `load` (not `reset_model`) sets the `tokenizer` to distinguish whether it's `not_trained` (so as to refuse to `predict` until being trained).
- `ner` uses the `model_path` from the training output if provided, otherwise just sets labels.
  - I.e. it does the same as the BERT backend, and can't `predict` until trained.
- `tensorflow` is the odd one out, starting with the same model regardless of `self.train_output` but then loading weights into it if available. This uses Keras, not HuggingFace, so it's not applicable.
- `simple_text_classifier` calls `fit` directly on `self.model` after resetting the model if not trained; otherwise it unpickles the model.
- `pytorch_transfer_learning` calls `load` if training output is available, otherwise instantiates the model directly (not put in a `reset_model` method, but same idea). Really it should call `reset_model` in both blocks of that condition.
- `electra` instantiates the model directly, just changing the path based on whether the model file exists. I'm not a fan of the hardcoded value, but I do like that the attributes are consistent regardless of whether the model was trained already (`self.tokenizer` gets set either way too).
I'd like:

- The attribute assignment simplicity of `electra` (and the consequent ability to `predict` regardless of whether trained or not)
- The model path handling of `ner`
- The proper `reset_model`/`load` method handling of `bert` (when fixed as above)
- The proper assertion checks on `__init__` of `bert`
So having homed in on these 5 (SimpleTextClassifier, BERT, Electra, the NER tagger, and MMDetection), as well as the partial Detectron2 example, it's clear that we actually want to mix and match aspects of code from various sources.
- API key handling is only done properly (in the Docker Compose spec) by the SimpleTextClassifier. Electra uses it too in `_get_annotated_dataset`, but as a hardcoded module string literal.
- Training is done most neatly (i.e. most simply, abstracting the details away) in Electra, and matches the use of the `Trainer` API in the LayoutLMv3 tutorial by Niels Rogge (via the Transformer-Tutorials repo).
- Prediction is done most neatly in Electra, but I'd still prefer a `TypedDict` for the results to make it even cleaner.
- Bounding box handling is done in Detectron2 and MMDetection. The result type will be changed from `choices` to `rectanglelabels`.
- Config assertions are done in BERT's `__init__` method, and these may be useful to write (they're not in Electra).
- GPU handling is only done explicitly in BERT, but I expect the Trainer class handles that in Electra. This is handled through the `place_model_on_device` property of the `TrainingArguments` class,
  - ...which is True if `transformers.utils.import_utils.py`'s `is_sagemaker_mp_enabled()` evaluates to False (i.e. if not using model parallelism, which is set via the `SM_HP_MP_PARAMETERS` env. var., else defaults to False).
- The processor is going to go where the tokenizer goes in Electra (in the `__init__` method) for use in `predict` and `fit`. Even though it's said to be 'pretrained', it doesn't get retrained, so we don't need to reload it, and it doesn't need to be conditional on there being `train_output` (see discussion).
- The model is going to go where it goes in Electra (in the `__init__` method), but rather than instantiating it here from a hardcoded `MODEL_FILE` module-level global variable, it's going to be loaded via the path given by the `load` method (like in BERT) if `train_output` is available, otherwise from `reset_model`. This condition will look more like BERT but without moving devices (unsure?). Like the SimpleTextClassifier, the `labels` attribute is set from `train_output` if `load`ing, else from `info` if using `reset_model`.
  - `reset_model` should not be passing hardcoded defaults through (as in BERT); they should be method defaults (as in SimpleTextClassifier). The method should take no arguments.
At the risk of overemphasising, let's turn that inside out so it's in terms of what 'features' we want from each source:

- BERT: `reset_model`/`load` pattern; config assertions in `__init__`
- Electra: simple attribute assignment [in particular of the `tokenizer`, i.e. `processor` in my case] in `__init__` (permitting use of `predict` even if no `train_output`); prediction with the `Trainer` API (with automatic GPU device handling); `_get_annotated_dataset`
- Simple Text Classifier: API key handling in the Docker Compose spec and `_get_annotated_dataset`; method-level defaults in `reset_model` (not hardcoded in `__init__`'s call to that method)
- Detectron2 and MMDetection: bbox handling
- Niels Rogge's Transformer-Tutorials LayoutLMv3 notebook: `Trainer` API usage; custom `Dataset` (from issue #123)
- NER tagger: `model_path` handling
Since this is quite an ambitious rewrite (with at least 4 different sources in the
examples here, plus likely reusing some of Niels Rogge's code for
datasets and training with the `Trainer` API),
I'll want to take a principled approach, and record what I do, with version control so I can roll
back (or at least review) any mistakes.
- The first step is to begin adapting the most relevant template (Electra), which will achieve GPU handling (which we get 'for free' with the `Trainer` API) and immediately check one of our features off the to-do list.
- Then we should start modifying the model itself. The first 'easy win' is API key usage.
- Next, we should move onto the model class `__init__` method, and tackle some real wins:
  - Adapting the signature to be relevant to the LayoutLMv3 model's args/kwargs.
  - The config assertions (just guess if unsure; we can fix them if they fail)
  - The processor instantiation (another easy win)
  - The model instantiation within a `train_output` conditional block.
- That just leaves:
  - Prediction (of which bbox handling is a component), which will give us preannotation
  - Training, which will give us a retrainable (fine-tuneable) model which will learn from the annotation labels we provide in Label Studio
Our recipe is therefore:
- GPU handling
- API key handling
- Config assertions
- Processor
- Model
- Prediction
- Bbox handling
- Training
The first question is obviously: where to start? I.e. which template to begin adapting from?
Well, out of the sources above, Electra has the longest list of 'features' I want.
Looked at another way, the most complexity-reducing thing we have here is the `Trainer` API,
and that's only in Electra (Niels Rogge's `Trainer` API example is not a Label Studio backend example).
At the risk of bikeshedding I'm going to just go with that impulse...
- Copy the directory and rename it to `layoutlmv3`
- Rename the model module to `layoutlmv3.py`
- Overwrite the import line for the model module with the new model module (`layoutlmv3`) and class name (`LayoutLMv3Classifier`)
- Overwrite the model class name with the new one

```bash
cp -r electra layoutlmv3
cd layoutlmv3
mv electra.py layoutlmv3.py
sed -i 's/from electra import ElectraTextClassifier/from layoutlmv3 import LayoutLMv3Classifier/' _wsgi.py
sed -i 's/ElectraTextClassifier/LayoutLMv3Classifier/g' _wsgi.py
```
Before we start adapting the model class, we should really ensure that class is renamed too (so far it's just renamed in the server module).

```bash
sed -i 's/ElectraTextClassifier/LayoutLMv3Classifier/g' layoutlmv3.py
```
A major feature that is missing from Electra is that it doesn't handle the API key from an
environment variable set in the Docker Compose spec; it handles it from a hard-coded string.
We can get this easily enough by copying the Docker Compose spec over from `simple_text_classifier`
and then using it in `layoutlmv3.py` the same way the `simple_text_classifier.py` module uses it.
Just copy the variables in the `environment` section of the YAML (I just did this in a text editor):
```diff
--- a/label_studio_ml/examples/layoutlmv3/docker-compose.yml
+++ b/label_studio_ml/examples/layoutlmv3/docker-compose.yml
@@ -18,6 +18,9 @@ services:
       - REDIS_HOST=redis
       - REDIS_PORT=6379
       - USE_REDIS=true
+      - LABEL_STUDIO_ML_BACKEND_V2=true
+      - LABEL_STUDIO_HOSTNAME=http://localhost:8000
+      - LABEL_STUDIO_API_KEY=d6f8a2622d39e9d89ff0dfef1a80ad877f4ee9e3
```
Since it's all local, I imagine you could change that key to be whatever you wanted instead? (TBC)
The model module `layoutlmv3.py` now needs to use those environment variables like
`simple_text_classifier.py` does:
```diff
--- a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
+++ b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
@@ -12,9 +12,15 @@
+from label_studio_ml.utils import DATA_UNDEFINED_NAME, get_env
+
+HOSTNAME = get_env("HOSTNAME", "http://localhost:8080")
+API_KEY = get_env("API_KEY")
+
+print("=> LABEL STUDIO HOSTNAME = ", HOSTNAME)
+if not API_KEY:
+    print("=> WARNING! API_KEY is not set")
-HOSTNAME = "https://app.heartex.com/"
-API_KEY = ""
```
Finally, we also need to modify the `_get_annotated_dataset` method (which Electra had)
to use the same 'best practice' as `simple_text_classifier` (Electra's was missing the exception handling):
```diff
--- a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
+++ b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
@@ -142,6 +142,11 @@ class LayoutLMv3Classifier(LabelStudioMLBase):
         response = requests.get(
             download_url, headers={"Authorization": f"Token {API_KEY}"}
         )
+        if response.status_code != 200:
+            raise Exception(
+                f"Can't load task data using {download_url}, "
+                f"response status_code = {response.status_code}"
+            )
         return json.loads(response.content)
```
And with that, we should have enabled webhook-triggered training with the Docker Compose-specified API key.
The only reason you might not want to do this is if the error would crash your annotation session, but I'd expect it to fail early, before you'd done any annotation, so no work would be lost.
BERT had some confident assertions that demonstrate data validation on the input config, so that we can't accidentally use this backend with the wrong task type (or something like that).
The obvious question here is: what are we going to check? What are we expecting?
Well, we can't just reuse the BERT code, as we are not expecting to classify `choices` but rather
to have labelled bounding boxes, or `rectanglelabels` as they're known.
Here are the checks the BERT classifier does:
```python
# then collect all keys from config which will be used to extract data from task and to form prediction
# Parsed label config contains only one output of <Choices> type
assert len(self.parsed_label_config) == 1
self.from_name, self.info = list(self.parsed_label_config.items())[0]
assert self.info["type"] == "Choices"

# the model has only one textual input
assert len(self.info["to_name"]) == 1
assert len(self.info["inputs"]) == 1
assert self.info["inputs"][0]["type"] == "Text"
self.to_name = self.info["to_name"][0]
self.value = self.info["inputs"][0]["value"]
```
We aren't using outputs of `Choices` type, but `RectangleLabels` (recall this is known as the `control_type`).
If you review the code above from the `get_single_tag_keys` helper function in
`label_studio_ml.utils`, it is in fact the exact same check. So we can just call that and have a
much more concise (thus more maintainable) model class `__init__`.
So in fact, we really want to copy the `mmdetection` backend's routine here.
```diff
diff --git a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
index 0e7e5bb..2ddbfdb 100644
--- a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
+++ b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
@@ -25,13 +25,17 @@ MODEL_FILE = "my_model"
 class LayoutLMv3Classifier(LabelStudioMLBase):
+    control_type: str = "RectangleLabels"
+    object_type: str = "Image"
+
     def __init__(self, **kwargs):
         super(LayoutLMv3Classifier, self).__init__(**kwargs)
         try:
-            self.from_name, self.info = list(self.parsed_label_config.items())[0]
-            self.to_name = self.info["to_name"][0]
-            self.value = self.info["inputs"][0]["value"]
-            self.labels = sorted(self.info["labels"])
+            self.from_name, self.to_name, self.value, self.labels = get_single_tag_keys(
+                self.parsed_label_config,
+                control_type=self.control_type,
+                object_type=self.object_type,
+            )
         except BaseException:
             print("Couldn't load label config")
```
While we're at it, we may as well set some class attributes and type annotate them to make it clearer.
These print statements are annoyingly amateur though: I then swapped them all for `logger.error` calls.
Next I removed some code repetition and made `load_config` only take the `self` argument.
With that, the config step was all done, and tucked away neatly into a `load_config` method.
We create the `processor` just once, as Electra did for its `tokenizer` (so we just need to adapt
this tokenizer to be a processor).
To make it neater, I made the processor name a class attribute, and the processor class another.
I swapped the Electra tokenizer import for `LayoutLMv3Processor`
(while at it also swapping the `ElectraForSequenceClassification` with `LayoutLMv3ForTokenClassification`)
and was now halfway done migrating it from Electra to LayoutLMv3:
```python
class LayoutLMv3Classifier(LabelStudioMLBase):
    control_type: str = "RectangleLabels"
    object_type: str = "Image"
    hf_hub_name: str = "microsoft/layoutlmv3-base"
    hf_model_cls: Type = LayoutLMv3ForTokenClassification
    hf_processor_cls: Type = LayoutLMv3Processor

    def __init__(self, **kwargs):
        super(LayoutLMv3Classifier, self).__init__(**kwargs)
        self.load_config()
        self.processor = self.hf_processor_cls.from_pretrained(self.hf_hub_name)
```
There are two options for the model class: `LayoutLMv3ForSequenceClassification` and `LayoutLMv3ForTokenClassification`.
The 'sequence' is a document (e.g. if you wanted to distinguish different types of document),
and the 'token' is a part of a document (I want to annotate and classify parts of documents, so I chose this).
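A quick sketch of the difference (the logits shapes are the key point; `num_labels=5` is an arbitrary example):

```python
from transformers import (
    LayoutLMv3ForSequenceClassification,  # one label per document
    LayoutLMv3ForTokenClassification,     # one label per token: what we want here
)

# both load the same base checkpoint; only the classification head differs
seq_model = LayoutLMv3ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=5
)
tok_model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=5
)

# seq_model(**encoding).logits has shape (batch_size, num_labels)
# tok_model(**encoding).logits has shape (batch_size, sequence_length, num_labels)
```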
We instantiate our model in two different ways: either with `reset_model` or with `load` (if we have
`train_output`).
What I did initially was to simplify the Electra model initialisation into two lines:

```python
model_to_load = MODEL_FILE if Path(MODEL_FILE).exists() else self.hf_hub_name
self.model = self.hf_model_cls.from_pretrained(model_to_load)
```
At this point I committed my changes in case I messed the next step up.
However, as already established, this conditional block should actually be as in
`simple_text_classifier` and `bert`.
This part of the code had a hardcoded `device="cpu"`, so I replaced that with a module-level global:

```python
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
```
This left me with the outline of the new model, but still with the BERT kwargs.
I added more type annotations, made `reset_model` take no arguments and return nothing,
and annotated `load` as returning nothing too.
```python
        if not self.train_output:
            self.labels = self.info["labels"]
            self.reset_model()
            load_repr = "Initialised with"
        else:
            self.load(self.train_output)
            load_repr = f"Loaded from train output with"
        logger.info(f"{load_repr} {self.from_name=}, {self.to_name=}, {self.labels=!s}")

    def reset_model(self) -> None:
        # THESE KWARGS HAVE NOT BEEN CHANGED FROM BERT ! TODO
        model_kwargs = dict(
            num_labels=len(self.labels),
            output_attentions=False,
            output_hidden_states=False,
            cache_dir=None,
        )
        model = self.hf_model_cls.from_pretrained(
            self.hf_hub_name,
            **model_kwargs,
        )
        model.to(DEVICE)
        self.model = model
        return

    def load(self, train_output) -> None:
        pretrained_model = train_output["model_path"]
        self.model = self.hf_model_cls.from_pretrained(pretrained_model)
        self.model.to(DEVICE)
        self.model.eval()
        self.batch_size = train_output["batch_size"]
        self.labels = train_output["labels"]
        self.maxlen = train_output["maxlen"]
```
Now getting the arguments to the model class looks tricky: if we review the BERT signature which we
are adapting, `BertForSequenceClassification`:
```
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary.
Indices can be obtained using [`BertTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
token_type_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in
`[0, 1]`:
- 0 corresponds to a *sentence A* token,
- 1 corresponds to a *sentence B* token.
[What are token type IDs?](../glossary#token-type-ids)
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range
`[0, config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors
than the model's internal embedding lookup matrix.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
for more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss),
If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
```
So `num_labels` seems to pass through as a kwarg and bind into the `config`, whereas
`output_attentions` and `output_hidden_states` are kwargs to the model itself.
The other kwarg, `cache_dir`, is actually in
the signature of `from_pretrained`. Since it's only being passed as `None`, it's a bit futile to pass it at all.
Likewise `output_attentions` and `output_hidden_states` default to False,
so including them is just for demonstration.
`num_labels` is listed as one of the arguments used for fine-tuning (which is what we're trying to
do), so that'd be all that is worth keeping (and I expect even that could be set later).

Importantly: this is why you need to set `self.labels` before calling the `reset_model` method!
Now that the first part is done (setting up an initial model), we're left with the job of adapting
the `load` method to suit the LayoutLMv3 signature too.
It seems there was a mistake in the BERT example, as the `num_labels` was not passed when loading a
trained model, and couldn't be, because `self.labels` was set after instantiating the model.
`maxlen` and `batch_size` are used in the `predict` method, so I left a note to review them later.
With that, the model initialisation handling is done too:
```diff
--- a/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
+++ b/label_studio_ml/examples/layoutlmv3/layoutlmv3.py
@@ -1,13 +1,14 @@
 import json
 import logging
 from pathlib import Path
+from typing import Type
 import requests
 import torch
 from label_studio_tools.core.label_config import parse_config
 from transformers import (
     LayoutLMv3Processor,
-    ElectraForSequenceClassification,
+    LayoutLMv3ForTokenClassification,
     Trainer,
     TrainingArguments,
 )
@@ -17,6 +18,7 @@ from label_studio_ml.utils import DATA_UNDEFINED_NAME, get_env
 HOSTNAME = get_env("HOSTNAME", "http://localhost:8080")
 API_KEY = get_env("API_KEY")
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
 logger.info("=> LABEL STUDIO HOSTNAME = ", HOSTNAME)
 if not API_KEY:
@@ -27,20 +29,45 @@ MODEL_FILE = "my_model"
 logger = logging.getLogger(__name__)
 logger.setLevel(logging.INFO)
+
 class LayoutLMv3Classifier(LabelStudioMLBase):
     control_type: str = "RectangleLabels"
     object_type: str = "Image"
-    hf_model_name: str = "google/electra-small-discriminator"
-    hf_processor_name: str = "microsoft/layoutlmv3-base"
-    hf_model_class: type = ElectraForSequenceClassification
-    hf_processor_class: type = LayoutLMv3Processor
+    hf_hub_name: str = "microsoft/layoutlmv3-base"
+    hf_model_cls: Type = LayoutLMv3ForTokenClassification
+    hf_processor_cls: Type = LayoutLMv3Processor
     def __init__(self, **kwargs):
         super(LayoutLMv3Classifier, self).__init__(**kwargs)
         self.load_config()
-        self.processor = self.tokenizer_class.from_pretrained(self.hf_processor_name)
-        model_name = MODEL_FILE if Path(MODEL_FILE).exists() else self.hf_model_name
-        self.model = self.hf_model_class.from_pretrained(model_name)
+        self.processor = self.tokenizer_cls.from_pretrained(self.hf_hub_name)
+        if not self.train_output:
+            self.labels = self.info["labels"]
+            self.reset_model()
+            load_repr = "Initialised with"
+        else:
+            self.load(self.train_output)
+            load_repr = f"Loaded from train output with"
+        logger.info(f"{load_repr} {self.from_name=}, {self.to_name=}, {self.labels=!s}")
+
+    def _load_model(self, name_or_path: str) -> None:
+        assert hasattr(self, "labels"), "Loading model requires labels to be set first"
+        self.model = self.hf_model_cls.from_pretrained(
+            name_or_path,
+            num_labels=len(self.labels),
+        )
+        self.model.to(DEVICE)
+        return
+
+    def reset_model(self) -> None:
+        return self._load_model(name_or_path=self.hf_hub_name)
+
+    def load(self, train_output) -> None:
+        self.labels = train_output["labels"]
+        self._load_model(name_or_path=train_output["model_path"])
+        self.model.eval()
+        self.batch_size = train_output["batch_size"]  # TODO: review use in `predict`
+        self.maxlen = train_output["maxlen"]  # TODO: ditto (source: BERT backend)
```
This one's really simple: we just call the model with the inputs. The inputs must be in a specific format though, and we need to handle the bounding box rectangles properly.
If we review the `predict` method of the Detectron2 example above:
```python
def predict(self, tasks, **kwargs):
    image_urls = [task["data"][self.value] for task in tasks]
    images = [load_image_from_url(url) for url in image_urls]
    layouts = [self.model.detect(image) for image in images]
    predictions = []
    for image, layout in zip(images, layouts):
        height, width = image.shape[:2]
        result = [
            {
                "from_name": self.from_name,
                "to_name": self.to_name,
                "original_height": height,
                "original_width": width,
                "source": "$image",
                "type": "rectanglelabels",
                "value": convert_block_to_value(block, height, width),
            }
            for block in layout
        ]
        predictions.append({"result": result})
    return predictions
```
It's clear that where the text classifiers access their `input_texts` from each `task` via `task["data"][self.value]`, we now get image URLs (recall that images are the 'objects' referenced in `object_type`).
The Detectron2 method uses listcomps for images and layouts (i.e. including inference), only iterates over the results of that model inference, and creates the `result` object more concisely.
The Detectron2 code style is neater, but we don't call our model with `model.detect()`, so we also want elements from the Electra example for its HuggingFace style:
def predict(self, tasks, **kwargs):
# get data for prediction from tasks
final_results = []
for task in tasks:
input_texts = ""
input_text = task["data"].get(self.value)
if input_text.startswith("http://"):
input_text = self._get_text_from_s3(input_text)
input_texts += input_text
labels = torch.tensor([1], dtype=torch.long)
# tokenize data
input_ids = torch.tensor(
self.tokenizer.encode(input_texts, add_special_tokens=True)
).unsqueeze(0)
# predict label
predictions = self.model(input_ids, labels=labels).logits
predictions = torch.softmax(predictions.flatten(), 0)
label_count = torch.argmax(predictions).item()
final_results.append(
{
"result": [
{
"from_name": self.from_name,
"to_name": self.to_name,
"type": "choices",
"value": {"choices": [self.labels[label_count]]},
}
],
"task": task["id"],
"score": predictions.flatten().tolist()[label_count],
}
)
return final_results
- Note how the `final_results.append` call spans 14 lines.
- Also note how the returned results have a `result` value containing only one item, whereas in the Detectron2 results the `result` value is a list of many blocks in a layout.
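To make that difference concrete, the two payload shapes look roughly like this (a sketch with illustrative names and values, condensed from the two examples above):

# text classifier: "result" holds a single "choices" item per task
choices_prediction = {
    "result": [
        {
            "from_name": "sentiment",  # illustrative control/object names
            "to_name": "text",
            "type": "choices",
            "value": {"choices": ["Positive"]},
        }
    ],
    "task": 42,
    "score": 0.93,
}

# object detection: "result" holds one "rectanglelabels" item per detected block
detection_prediction = {
    "result": [
        {
            "from_name": "label",
            "to_name": "image",
            "original_height": 1200,
            "original_width": 800,
            "source": "$image",
            "type": "rectanglelabels",
            "value": {
                "x": 5.0, "y": 10.0, "width": 30.0, "height": 8.0,
                "rotation": 0, "rectanglelabels": ["section_header"],
            },
        },
        # ...one entry per block in the layout
    ],
}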
I want to make it even neater, with a `TypedDict`:
class DetectionResult(TypedDict):
from_name: str
to_name: str
original_height: int
original_width: int
source: str
type: str
value: BlockValue
This is nested in a singleton dict I'll again formalise in a `TypedDict`:
class PredictionResult(TypedDict):
result: list[DetectionResult]
task: int
score: float
and I can then annotate the `value` key's value as a `BlockValue`, corresponding to the type returned from `convert_block_to_value` (already covered above).
class BlockValue(TypedDict):
height: int
rectanglelabels: list[str]
rotation: int
width: int
x: float
y: float
score: float
Note that I'm not going to rotate my bboxes, so `rotation` is going to stay as initialised, at `0`.
We can then annotate the return type of our `predict` method specifically and concisely.
This gets us most of the way there, with some ambiguous parts left as `TODO`s and old code commented out where replaced with reasonable guesses, e.g.:
encoding = self.processor(images)
self.model(**encoding, labels=labels).logits
class LayoutLMv3Classifier(LabelStudioMLBase):
    control_type: str = "RectangleLabels"
    object_type: str = "Image"
    hf_hub_name: str = "microsoft/layoutlmv3-base"
    hf_model_cls: Type = LayoutLMv3ForTokenClassification
    hf_processor_cls: Type = LayoutLMv3Processor
    detection_source: str = f"${object_type}".lower()
    detection_type: str = control_type.lower()

    ...

    def detect_images(self, images: list):
        # TODO: change this to HuggingFace style
        return [self.model.detect(image) for image in images]

    def load_images_from_urls(self, image_urls: list):
        # TODO: change this to paths / check how load_image_from_url works in Detectron2
        return [load_image_from_url(url) for url in image_urls]

    def predict(self, tasks, **kwargs) -> list[PredictionResult]:
        # get data for prediction from tasks
        image_urls = [task["data"][self.value] for task in tasks]
        images = self.load_images_from_urls(image_urls)
        layouts = self.detect_images(images)
        predictions = []
        for task, image, layout in zip(tasks, images, layouts):
            height, width = image.shape[:2]
            labels = torch.tensor([1], dtype=torch.long)
            encoding = self.processor(image)
            # input_ids = torch.tensor(
            #     self.tokenizer.encode(input_texts, add_special_tokens=True)
            # ).unsqueeze(0)
            # predict label
            logits = self.model(**encoding, labels=labels).logits
            probs = torch.softmax(logits.flatten(), 0)
            label_idx = torch.argmax(probs).item()
            pred_score = probs[label_idx].item()
            # TODO: see `convert_block_to_value` to get bbox h,w,x,y
            detection_results = [
                DetectionResult(
                    {
                        "from_name": self.from_name,
                        "to_name": self.to_name,
                        "original_height": height,
                        "original_width": width,
                        "source": self.detection_source,
                        "type": self.detection_type,
                        "value": BlockValue(
                            {
                                "height": ...,
                                "rectanglelabels": [self.labels[label_idx]],
                                "rotation": 0,
                                "width": ...,
                                "x": ...,
                                "y": ...,
                                "score": pred_score,
                            }
                        ),
                    }
                )
                for block in layout
            ]
            pred = PredictionResult(
                {
                    "result": detection_results,
                    "task": task["id"],
                    "score": pred_score,
                }
            )
            predictions.append(pred)
        return predictions
The remaining problems to solve are:
- How to handle the bboxes (which Detectron2's code handled via `convert_block_to_value`)
- How to detect the layouts in images (the placeholder method `detect_images`)
- How to handle the images from local files (the placeholder method `load_images_from_urls`)
We've crossed a line here from adapting examples that shared the behaviour we want, to creating something entirely new (the object detection examples, as helpful as they are, don't quite match and their usefulness as templates is almost up).
At this point, midway through the rewrite, I got stuck, and it dawned on me that the code I was referring to for prediction (or "model inference") in Niels Rogge's Transformers Tutorials notebook on fine-tuning LayoutLMv3 with the FUNSD dataset was actually not the way I would be doing it in my backend.
Specifically this was addressed at the end of that tutorial:
The code above used the `labels` to determine which tokens were at the start of a particular word or not. Of course, at inference time, you don't have access to any labels. In that case, you can leverage the `offset_mapping` returned by the tokenizer. I do have a notebook for that (for LayoutLMv2, but it's equivalent for LayoutLMv3) here.
I then looked at how the LayoutLMv2 code did it, and made a HuggingFace Space to confirm this worked in a similar way. The relevant part of the source is:
def process_image(image):
width, height = image.size
encoding = processor(
image, truncation=True, return_offsets_mapping=True, return_tensors="pt"
)
offset_mapping = encoding.pop("offset_mapping")
outputs = model(**encoding)
predictions = outputs.logits.argmax(-1).squeeze().tolist()
token_boxes = encoding.bbox.squeeze().tolist()
# only keep non-subword predictions
is_subword = np.array(offset_mapping.squeeze().tolist())[:, 0] != 0
true_predictions = [
id2label[pred] for idx, pred in enumerate(predictions) if not is_subword[idx]
]
true_boxes = [
unnormalize_box(box, width, height)
for idx, box in enumerate(token_boxes)
if not is_subword[idx]
]
for prediction, box in zip(true_predictions, true_boxes):
predicted_label = iob_to_label(prediction).lower()
...
- Only lightly trimmed here (result drawing code omitted)
- Note that this function is not self-contained: it assumes we already have `processor`, `model`, `np`, `id2label`, `unnormalize_box`, and `iob_to_label`
- `id2label` is just a `dict(enumerate(labels))`; notice how it's just used to temporarily 'store' the predictions as integer indices (before restoring them in the for loop at the end)
This is how we should handle the images (`PIL.Image`, RGB mode) to produce label predictions with bboxes.
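As a rough sketch of how that Space logic could slot into the backend's placeholder `detect_images` (this is my own adaptation rather than the final code: the method names are provisional, it assumes the `unnormalize_box` helper shown further below, and it assumes the model's label ids line up with `self.labels`):

import numpy as np
import torch

# methods to drop into LayoutLMv3Classifier
def detect_image(self, image) -> list[dict]:
    """Token-classify one RGB PIL image; return one dict per kept
    (non-subword) token with its pixel bbox, label name and score."""
    width, height = image.size
    encoding = self.processor(
        image, truncation=True, return_offsets_mapping=True, return_tensors="pt"
    )
    offset_mapping = encoding.pop("offset_mapping")
    with torch.no_grad():
        outputs = self.model(**{k: v.to(DEVICE) for k, v in encoding.items()})
    logits = outputs.logits.squeeze()
    pred_ids = logits.argmax(-1).tolist()
    scores = torch.softmax(logits, dim=-1).max(-1).values.tolist()
    token_boxes = encoding.bbox.squeeze().tolist()
    # only keep non-subword predictions, exactly as in the Space code
    is_subword = np.array(offset_mapping.squeeze().tolist())[:, 0] != 0
    return [
        {
            "box": tuple(unnormalize_box(box, width, height)),
            "label": self.labels[pred],
            "score": scores[idx],
        }
        for idx, (pred, box) in enumerate(zip(pred_ids, token_boxes))
        if not is_subword[idx]
    ]

def detect_images(self, images: list) -> list[list[dict]]:
    return [self.detect_image(image) for image in images]

The per-token dicts here are what gets formalised as `LayoutBlock` below.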
The code at this point was getting too nested for my liking, so rather than instantiate the `DetectionResult` within the `predict` method, and within that instantiate the `BlockValue`, I whisked away this complexity into those classes themselves as `from_backend` classmethods that take the `self` (i.e. the `LayoutLMv3Classifier` subclassing `LabelStudioMLBase`) and then only get passed things that don't come from self-reference, greatly simplifying the call signature and effectively decluttering the `predict` method.
Simultaneously, I became "adrift" in the code: I couldn't see it all on one screen. I prefer to see everything in one eyeful, so I split out the new classes into separate modules.
The trick to doing this is to copy everything, delete things in one copy, then run `autoflake8` to strip the now-unused imports.
When this wasn't enough, I broke out individual methods into separate modules (choosing those with fewest `self` references, as these would need to be passed as kwargs to a static method). `_get_annotated_dataset` turned out to be a completely static method, and got moved entirely.
After this, I had 4 new modules: `url_utils.py`, `detection.py`, `components.py`, `ls_api.py`.
This greatly eased the writing of specific parts, and before I knew it I was done. The key part is the `unnormalize_box` function, which takes a width and height and scales the bbox up to full size. After that, it's just passing the info around.
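For reference, a minimal version of that helper (as used in Niels Rogge's notebooks; LayoutLM-family models normalise bboxes to a 0-1000 grid):

def unnormalize_box(box, width: int, height: int) -> list[float]:
    # scale a 0-1000-normalised (x1, y1, x2, y2) box back to pixel coordinates
    x1, y1, x2, y2 = box
    return [
        width * (x1 / 1000),
        height * (y1 / 1000),
        width * (x2 / 1000),
        height * (y2 / 1000),
    ]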
def convert_box_to_value(box: tuple[float, float, float, float]):
x1, y1, x2, y2 = box
w = x2 - x1
h = y2 - y1
return x1, y1, w, h
class LayoutBlock(TypedDict):
box: tuple[float, float, float, float]
label: str
score: float
class BlockValue(TypedDict):
height: int
rectanglelabels: list[str]
rotation: float
width: int
x: float
y: float
score: float
@classmethod
def from_backend(
cls,
block: LayoutBlock,
backend: LabelStudioMLBase,
) -> BlockValue:
box, label, score = block.values()
x, y, w, h = convert_box_to_value(box)
return cls(
height=h,
rectanglelabels=[label],
rotation=0,
width=w,
x=x,
y=y,
score=score,
)
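The matching classmethod on `DetectionResult` follows the same pattern. The sketch below is mine rather than the exact code that ended up in the split-out modules; in particular the image-dimension parameters are an assumption about the signature:

class DetectionResult(TypedDict):
    from_name: str
    to_name: str
    original_height: int
    original_width: int
    source: str
    type: str
    value: BlockValue

    @classmethod
    def from_backend(
        cls,
        block: LayoutBlock,
        backend: LabelStudioMLBase,
        original_height: int,
        original_width: int,
    ) -> "DetectionResult":
        # only the block and the image dimensions get passed in;
        # everything self-referential is read off the backend instance
        return cls(
            from_name=backend.from_name,
            to_name=backend.to_name,
            original_height=original_height,
            original_width=original_width,
            source=backend.detection_source,
            type=backend.detection_type,
            value=BlockValue.from_backend(block=block, backend=backend),
        )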
I realised I'd somehow ended up passing labels and scores at both the image and the block level, but obviously we want to label at the block level because we're doing token classification, not sequence classification. Oops. Moving on.
So a bbox with info attached is a "block" (and we skip subword tokens, as in the Space code above), and multiple blocks make a "layout".
TODO: annotate return values, noting bboxes after being unnormalized have float w,h
I wasn't 100% sure how this was going to work (web apps tend to work from URLs, but I was working with local files... it could even use local URLs like `file://`), but to start with I just wrote the simplest possible solution, which was this method on the `LayoutLMv3Classifier` class (using `pathlib.Path`):
def load_images_from_urls(self, image_paths_or_urls: list[Path | str]):
    return list(map(load_image_from_path_or_url, image_paths_or_urls))
calling this function from the `url_utils` module:
from __future__ import annotations

from pathlib import Path

import requests
from PIL import Image

__all__ = ["load_image_from_path_or_url"]


def load_image_from_path_or_url(path_or_url: str | Path) -> Image.Image:
    if isinstance(path_or_url, str) and path_or_url.startswith("http"):
        im_ref = requests.get(path_or_url, stream=True).raw
    else:
        im_ref = path_or_url
    image = Image.open(im_ref).convert("RGB")
    return image
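Either kind of reference should then yield an RGB `PIL.Image` (the paths below are just illustrative):

page = load_image_from_path_or_url("data/pages/page_001.png")
remote = load_image_from_path_or_url("http://localhost:8080/data/upload/1/page_001.png")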
All of the above gets us the ability to make predictions from a pretrained model, but not the ability to train (or fine-tune) one.
The training details are simplified to just calling the `Trainer.train()` method.
Other features that got copied from the other examples:
- The tasks are called 'completions'
- The training supports a web hook (to do with the Label Studio API)
- The training is for text
- The `train_dataset` is not completed (currently passed as the placeholder class `Custom_Dataset`)
def fit(self, completions, workdir=None, **kwargs):
# check if training is from web hook
if kwargs.get("data"):
project_id = kwargs["data"]["project"]["id"]
tasks = get_annotated_dataset(project_id)
if not self.parsed_label_config:
self.parsed_label_config = parse_config(
kwargs["data"]["project"]["label_config"]
)
self.load_config()
# ML training without web hook
else:
tasks = completions
# Create training params with batch size = 1 as text are different size
training_args = TrainingArguments(
"test_trainer", per_device_train_batch_size=1, per_device_eval_batch_size=1
)
# Prepare training data
input_texts = []
input_labels = []
for task in tasks:
if not task.get("annotations"):
continue
input_text = task["data"].get(self.value)
input_texts.append(
torch.flatten(self.tokenizer.encode(input_text, return_tensors="pt"))
)
annotation = task["annotations"][0]
output_label = annotation["result"][0]["value"]["choices"][0]
output_label_idx = self.labels.index(output_label)
output_label_idx = torch.tensor([[output_label_idx]], dtype=torch.int)
input_labels.append(output_label_idx)
print(f"Train dataset length: {len(tasks)}")
my_dataset = Custom_Dataset((input_texts, input_labels))
trainer = Trainer(
model=self.model,
args=training_args,
train_dataset=my_dataset,
# eval_dataset=small_eval_dataset
)
trainer.train()
self.model.save_pretrained(MODEL_FILE)
train_output = {"labels": self.labels, "model_file": MODEL_FILE}
return train_output
A good place to start is the custom class for the training data, `Custom_Dataset`. This was taken from the code for Electra, where it is defined as:
class Custom_Dataset(torch.utils.data.dataset.Dataset):
def __init__(self, _dataset):
self.dataset = _dataset
def __getitem__(self, index):
example, target = self.dataset[0][index], self.dataset[1][index]
return {"input_ids": example, "label": target}
def __len__(self):
return len(self.dataset)
Now compare this to the example from Niels Rogge given earlier (I've adapted it slightly to actually run, rather than having a placeholder image path):
from pathlib import Path

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset
class CustomDataset(Dataset):
def __init__(self, root: Path, df: pd.DataFrame, processor):
self.root = root
self.df = df
self.processor = processor
def __getitem__(self, idx):
# get document image + corresponding words and boxes
item = self.df.iloc[idx]
filename = item.filename
image_path = self.root / filename
image = Image.open(str(image_path)).convert("RGB")
words = item.words
boxes = item.boxes
# use processor to prepare everything for the model
encoding = self.processor(image, words, boxes=boxes)
return encoding
The image loading will be refactored into
from layoutlmv3.url_utils import load_image_from_path_or_url
...
image = load_image_from_path_or_url(image_path)
The difference is that in Niels's code, the dataset is stored in a dataframe.
This is difficult to figure out completely out of context, so at this point I began trying to deploy and attach the backend even though it wasn't ready.
Similar to the beginning, we now deploy the custom backend from its directory:
cd label-studio-ml-backend/label_studio_ml/examples/layoutlmv3
docker compose up
This gave me an error: the `redis` container couldn't launch because it clashed with the one that was previously set up:
[+] Running 1/1
⠿ Network layoutlmv3_default Created 0.3s
⠋ Container redis Creating 0.0s
Error response from daemon: Conflict. The container name "/redis" is already in
use by container "...". You have to remove (or rename) that container to be
able to reuse that name.
To resolve this, run `docker rmi simple_text_classifier_server:latest` (the image name will tab-autocomplete). If that fails, you need to `docker ps -a` to list and then `docker rm` the ID of the container that's clashing. Then `rmi` will work, after which you can `docker rm` the container that has the `/redis` name attached, and finally you can run the compose command.
This time the server just started, a little too quietly: there was no message to tell me that it had succeeded and the address to load, just:
server | 2022-07-17 19:40:58,704 INFO supervisord started with pid 1
server | 2022-07-17 19:40:59,708 INFO spawned: 'rq_00' with pid 9
server | 2022-07-17 19:40:59,711 INFO spawned: 'wsgi' with pid 10
server | 2022-07-17 19:41:00,797 INFO success: rq_00 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
server | 2022-07-17 19:41:00,797 INFO success: wsgi entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
I wanted to see the entire build happen again, so I deleted the container and the images and re-ran. This gave the full build output and the same success log from the `server` container, but there was no local server for me to connect to!
Running `curl http://localhost:9090/` gave "Internal Server Error", with no debug output.
A cursory review of the config showed that the logs were going to the `logs` directory in the current working directory, and in particular `uwsgi.log` contained a syntax bug in my code:
Traceback (most recent call last):
File "./_wsgi.py", line 7, in <module>
from layoutlmv3 import LayoutLMv3Classifier
File "<fstring>", line 1
(self.from_name=)
^
SyntaxError: invalid syntax
This was in fact from trying to use Python 3.8+ syntax (the `{x=}` f-string form) in a Python 3.7 container. Apparently you can currently upgrade the base image to 3.8 but not 3.9. (I did this by changing the Dockerfile and recreating with the compose command again.)
docker ps -a
docker rm #id_of_redis_container #id_of_layoutlmv3_container
docker images
docker rmi redis:alpine layoutlmv3_server:latest
This rebuilt the image for Python 3.8.
The log file was also littered with warnings about running uWSGI as root ("use the `--uid` flag").
After rebuilding, and reviewing the `uwsgi.log` again, I came across some further bugs in my code. This time:
from label_studio_ml.examples.layoutlmv3.components import (
ModuleNotFoundError: No module named 'label_studio_ml.examples'
The 'package' here should be the backend, not the repo of multiple backends, d'oh!
This source code is 'baked in' to the Docker image, so once again I needed to delete containers and images. This time I automated it using a filter, and after noticing that the rebuild only requires the old container to be gone, reduced it to:
docker rm $(docker ps -a -q -f name=server) && docker rmi layoutlmv3_server:latest
docker compose up
I now got the following traceback in the log:
File "/app/./_wsgi.py", line 7, in <module>
from layoutlmv3 import LayoutLMv3Classifier
File "/app/./layoutlmv3.py", line 18, in <module>
from layoutlmv3.components import (
ModuleNotFoundError: No module named 'layoutlmv3.components'; 'layoutlmv3' is not a package
I couldn't get a package structure by sprinkling `__init__.py` files around as usual, so I dumped the `__package__` and `__file__` variables to the log:
PACKAGE IS FILE IS /app/./layoutlmv3.py
So `__package__` is empty: there is no package to perform relative imports against, and the file is simply run as a top-level module.
I'm not sure I know enough to debug this further. Direct imports work, however, so I just renamed all my modules to have a `layoutlmv3_` prefix (for code readability, despite the ugly effect on the directory).
With that, I was rid of the `ModuleNotFoundError` tracebacks, and got some fresh ones to debug (each time calling my `docker rm` command after stopping the `docker compose` process and then calling it again to recreate the server with the newly edited source).
I needed `from __future__ import annotations` in all the modules, so that the newer annotation syntax (such as `list[Path | str]`) is deferred rather than evaluated at runtime on Python 3.8.
...et voila, the backend was running!
curl http://localhost:9090/
⇣
{"model_dir":"/data/models","status":"UP","v2":"true"}
Additionally, if I call `sudo tail -f logs/uwsgi.log` in one shell and run the curl command in another (or curl the equivalent `/health` endpoint) I can see the log being added to in real time:
GET /health => generated 55 bytes in 0 msecs (HTTP/1.1 200) 2 headers in 71 bytes (1 switches on core 0)
So now we have a running custom backend (in some potentially minimally viable state), and we can run Label Studio (`label-studio`) and log in to access a saved project. In the project, on the Settings page, click "Machine Learning", then "Add Model", and set the URL as `http://localhost:9090/`.
This gave the following message:
Successfully connected to http://localhost:9090/ but it doesn't look like a valid ML backend. Reason: 500 Server Error: INTERNAL SERVER ERROR for url: http://localhost:9090/setup. Check the ML backend server console logs to check the status. There might be something wrong with your model or it might be incompatible with the current labeling configuration.
Switching back to my log `tail` shell, I could see the Python error that had caused the internal server error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/label_studio_ml/exceptions.py", line 39, in
exception_f
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/label_studio_ml/api.py", line 50, in _setup
model = _manager.fetch(project, schema, force_reload, hostname=hostname,
access_token=access_token)
File "/usr/local/lib/python3.8/site-packages/label_studio_ml/model.py", line 502, in fetch
model = cls.model_class(label_config=label_config, **kwargs)
File "/app/./layoutlmv3.py", line 48, in __init__
self.processor = self.tokenizer_cls.from_pretrained(self.hf_hub_name)
AttributeError: 'LayoutLMv3Classifier' object has no attribute 'tokenizer_cls'
[Mon Jul 18 17:56:17 2022] POST /setup => generated 888 bytes in 3 msecs (HTTP/1.1 500) 2 headers in 91 bytes (1 switches on core 0)
I double-checked it was definitely caused by the `/setup` endpoint being called when clicking "Validate and Save" on the Label Studio "Add model" modal, and indeed this error message got duplicated.
At this point I:
- Stopped the `docker compose` process
- Cancelled the `tail` process
- Left `label-studio` running
- Edited the model module to fix the bug (and prayed that was the only one)
- Re-ran the `docker rm` command and restarted `docker compose` as above.
This time I got a little pause before the error:
Successfully connected to http://localhost:9090/ but it doesn't look like a valid ML backend. Reason: HTTPConnectionPool(host='localhost', port=9090): Read timed out. (read timeout=3.0).
Check the ML backend server console logs to check the status. There might be something wrong with your model or it might be incompatible with the current labeling configuration.
This time it'd gotten a bit further:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/label_studio_ml/exceptions.py", line 39, in
exception_f
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/label_studio_ml/api.py", line 50, in _setup
model = _manager.fetch(project, schema, force_reload, hostname=hostname,
access_token=access_token)
File "/usr/local/lib/python3.8/site-packages/label_studio_ml/model.py", line 502, in fetch
model = cls.model_class(label_config=label_config, **kwargs)
File "/app/./layoutlmv3.py", line 50, in __init__
self.labels = self.info["labels"]
AttributeError: 'LayoutLMv3Classifier' object has no attribute 'info'
This came from
class LayoutLMv3Classifier(LabelStudioMLBase):
control_type: str = "RectangleLabels"
object_type: str = "Image"
hf_hub_name: str = "microsoft/layoutlmv3-base"
hf_model_cls: Type = LayoutLMv3ForTokenClassification
hf_processor_cls: Type = LayoutLMv3Processor
detection_source: str = f"${object_type}".lower()
detection_type: str = control_type.lower()
def __init__(self, **kwargs):
super(LayoutLMv3Classifier, self).__init__(**kwargs)
self.load_config()
self.processor = self.hf_processor_cls.from_pretrained(self.hf_hub_name)
if not self.train_output:
self.labels = self.info["labels"]
self.reset_model()
load_repr = "Initialised with"
else:
self.load(self.train_output)
load_repr = "Loaded from train output with"
logger.info(f"{load_repr} {self.from_name=}, {self.to_name=}, {self.labels=!s}")
I guess I'm supposed to hardcode the labels into the model. The attribute `self.info` is used nowhere else in the code: it matches code in the BERT and SimpleTextClassifier examples.
I created the labels when I did some initial labelling, and from parsing the export JSON I can get them back:
>>> {x["rectanglelabels"][0] for x in d[0]["label"]}
{'intra_ref_redirect_command', 'intra_ref_redirect_referent_name', 'partial_referent_name',
'ref_page_num', 'page_num', 'section_break', 'referent_synonym_or_specifier', 'section_header',
'referent_name', 'ref_section_header'}
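For completeness, `d` here came from loading the project's exported annotations (this looks like the JSON-MIN export; the filename is illustrative):

import json
from pathlib import Path

export_path = Path("project-1-export.json")  # assumed local export file
d = json.loads(export_path.read_text())

labels = {x["rectanglelabels"][0] for x in d[0]["label"]}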
I've copied the order I input them into the interface, made an enum, and written the same `id2label` and `label2id` mappings (as functions) that Niels used.
from enum import Enum
class Labels(Enum):
referent_name = 1
ref_page_num = 2
page_num = 3
section_header = 4
ref_section_header = 5
intra_ref_redirect_command = 6
intra_ref_redirect_referent_name = 7
partial_referent_name = 8
referent_synonym_or_specifier = 9
section_break = 0
def label2id() -> dict:
return {k.name: k.value for k in Labels}
def id2label() -> dict:
return {k.value: k.name for k in Labels}
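If these were wired into the model load (an option, not what `_load_model` currently does), the readable names would also travel with the HuggingFace config:

self.model = self.hf_model_cls.from_pretrained(
    self.hf_hub_name,
    num_labels=len(Labels),
    id2label=id2label(),  # int -> label name
    label2id=label2id(),  # label name -> int
)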
There's a bit of a catch-22 at play here: to label effectively I need to have my interface set up, but to set up my labelling interface I need a fine-tuned model (or at the least, its labels) to load from.
One workaround I can see is to fine-tune a model with just one category (e.g. `page_num`: the page number), and then 'bootstrap' from that, getting the more efficient training loop set up along the way.
In other words: don't try and create a custom model and a backend for it in one go. Instead:
- Create a custom dataset
- Somehow (however you can) get a few annotations for (at least) a single class into the dataset.
- Fine-tune LayoutLMv3 on this minimal dataset (you don't need examples for every class; the model will simply learn never to assign the missing ones).
- Hook the fine-tuned model backend into Label Studio, to thereby incrementally complete your dataset annotation.
- Iteratively repeat the fine-tuning (from scratch), 'bootstrapping' the annotation process.
While thinking along these lines, I also realised I had subtly given myself a particularly hard task: not just setting up an ML model backend, but doing so without an intermediate working model. The example of a LayoutLMv3 model fine-tuned on FUNSD would be easier to set up, and once that was done I could take it a step further by applying the same approach to my own model.
Therefore I duplicated the backend and renamed that example directory `layoutlmv3_funsd`.