How to train better language detection with Bling Fire and FastText

In this small tutorial we illustrate the importance of good tokenization for multi-lingual text classification. First we will train a baseline model based on regular tokens, then we will tokenize the same text with a multilingual tokenization model, train the same model again, and compare the accuracy.

Language Detection Baseline

First things first: let's follow the steps here to install fastText, download the training data, create valid.txt and train.txt, and train our baseline model.
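
If you are building train.txt and valid.txt yourself, the split can be as simple as shuffling the labeled sentences and holding out 10,000 lines for validation. A minimal sketch in Python (the input file name labeled_sentences.txt is an assumption; the __label__xxx prefix format matches the examples shown later on this page):

import random

# each input line is expected to look like: "__label__eng some sentence ..."
# (hypothetical file name; point it at wherever your labeled corpus lives)
with open("labeled_sentences.txt", "r", encoding="utf-8") as f:
    lines = [l.rstrip("\n") for l in f if l.strip()]

random.seed(0)
random.shuffle(lines)

n_valid = 10000  # same validation size as used on this page

with open("valid.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[:n_valid]) + "\n")

with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines[n_valid:]) + "\n")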

Let's first train a model based only on tokens:

~/fastext$ wc -l train.txt valid.txt 
  8379309 train.txt
    10000 valid.txt
  8389309 total

~/fastext$ fasttext supervised -input train.txt -output langdetect -dim 16
Read 68M words
Number of words:  3336283
Number of labels: 365
Progress: 100.0% words/sec/thread:   41674 lr:  0.000000 avg.loss:  0.124791 ETA:   0h 0m 0s

~/fastext$ fasttext test langdetect.bin valid.txt
N	10000
P@1	0.96
R@1	0.96

The training data contains 68M words, of which 3.3M are unique. We got 96% Precision@1 (the top predicted class is correct 96% of the time) using a purely word-based model.
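
Since every example here carries exactly one gold label, Precision@1 and Recall@1 coincide: both measure how often the single top prediction matches the gold label. The same evaluation can also be run from Python; a small sketch, assuming the fasttext Python bindings are installed:

import fasttext

# load the model trained above; test() returns (N, precision@1, recall@1)
model = fasttext.load_model("langdetect.bin")
n, p_at_1, r_at_1 = model.test("valid.txt")
print(n, p_at_1, r_at_1)

# predict the language of a single sentence (labels and probabilities)
labels, probs = model.predict("the quick brown fox jumps over the lazy dog")
print(labels[0], probs[0])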

As suggested in the tutorial, let's enable character n-grams and train again.

~/fastext$ fasttext supervised -input train.txt -output langdetect24cgr -dim 16 -minn 2 -maxn 4

~/fastext$ fasttext test langdetect24cgr.bin valid.txt 
N	10000
P@1	0.985
R@1	0.985

We got 98.5% Precision@1. I also tried adding word bigrams, but the results were identical; perhaps fastText does not combine character and word n-grams.
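
If you want to check that combination yourself, the experiment is easy to script with the fasttext Python bindings; a hedged sketch (the parameter names mirror the CLI flags used above, and the run itself tells you whether the two feature types combine):

import fasttext

# request character n-grams (minn/maxn) and word bigrams (wordNgrams) together,
# then compare P@1 against the character-n-gram-only model above
model = fasttext.train_supervised(
    input="train.txt", dim=16, minn=2, maxn=4, wordNgrams=2)
print(model.test("valid.txt"))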

The two baseline model sizes look like this:

~/fastext$ ls -lh langdetect.bin langdetect24cgr.bin 
-rw-rw-r-- 1 sergei sergei 403M Jul  7 18:20 langdetect24cgr.bin
-rw-rw-r-- 1 sergei sergei 281M Jul  7 17:51 langdetect.bin

Language Detection with Bling Fire

Now let's add Bling Fire into the mix. Instead of training models on raw tokens, we will tokenize the text with the laser100k model and use only the token IDs from Bling Fire for fastText training. laser100k.bin is a Unigram LM tokenization model with 100K tokens, learned from a plain-text corpus balanced by language.
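
To get a feel for what the model does, it helps to compare plain tokens with the laser100k pieces on a single sentence. A small sketch (the example sentence is arbitrary, and it assumes your blingfire build exposes text_to_words_with_model alongside text_to_ids):

from blingfire import load_model, free_model, text_to_words, text_to_words_with_model, text_to_ids

h = load_model("./laser100k.bin")

s = "Erfolgreiche Tokenisierung ist die halbe Miete."  # arbitrary example sentence

# plain whitespace/punctuation tokens vs. laser100k subword pieces
print(text_to_words(s))
print(text_to_words_with_model(h, s))

# the same pieces as ids, which is what we will feed to fastText below
print(text_to_ids(h, s, 128, 0, True))

free_model(h)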

Let's use this Python script (convert_td2id.py):

import sys
import argparse
from blingfire import *

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--model", default="./laser100k.bin", help="bin file with compiled tokenization model")
args = parser.parse_args()

# load the compiled tokenization model once
h = load_model(args.model)

for line in sys.stdin:

    line = line.strip()
    idx = line.find(" ")

    # skip lines that do not contain both a label and a text part
    if idx == -1:
        continue

    # the fastText label (e.g. __label__eng) is everything before the first space
    label = line[0 : idx]

    # tokenize the rest of the line into laser100k ids
    text = line[idx + 1 :]
    ids = text_to_ids(h, text, 1024, 0, True)
    ids_str = ' '.join([str(i) for i in ids])

    print(label + " " + ids_str)

free_model(h)

and convert train.txt and valid.txt to id format:

~/fastext$ python convert_td2id.py -m laser100k.bin < valid.txt > valid.100k.txt
~/fastext$ python convert_td2id.py -m laser100k.bin < train.txt > train.100k.txt

~/fastext$ head --lines=10 valid.100k.txt
__label__mkd 36944 165 2177 91909 104 37457 3
__label__fra 5558 13 5 16656 2191 3
__label__eng 3264 17 119 344 3623 258 3357 14277 5957 6 152 67534 4 31382 1145
__label__deu 58875 5562 4225 97 1019 119 7586 6297 4 22048 9 5562 3387 6470 492 3399 3 3434 12985 8405 5562 4225 29163 114 3
__label__eng 1872 7934 5 23953 5 100 1171 67 39141 6 11018 36843 390 5728 183 111 4424 3
__label__rus 564 46327 1473 39919 63 24278 3067 1462 2068 85487 137 3
__label__hun 3518 264 3199 2232 7 2740 1145
__label__eng 9162 17 6 2218 30 67 1458 310 7162 258 3357 2599 65258 3
__label__epo 727 3987 531 8102 1668 2134 3
__label__spa 5 57900 16155 23 44 722 495 93 1145
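
As a quick sanity check, the ids can be mapped back to text. A short sketch, assuming your blingfire version exposes ids_to_text (the example sentence is arbitrary):

from blingfire import load_model, free_model, text_to_ids, ids_to_text

h = load_model("./laser100k.bin")

# encode an arbitrary sentence into laser100k ids and decode it back;
# ids_to_text is assumed to be available in your blingfire version
s = "Das ist nur ein kleiner Test."
ids = text_to_ids(h, s, 128, 0, True)
print(ids)
print(ids_to_text(h, ids))

free_model(h)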

Let's do the same as before and first train a model based only on tokens:

~/fastext$ wc -l train.100k.txt valid.100k.txt 
  8379309 train.100k.txt
    10000 valid.100k.txt
  8389309 total

~/fastext$ fasttext supervised -input train.100k.txt -output langdetect100k -dim 16
Read 118M words
Number of words:  88683
Number of labels: 365
Progress: 100.0% words/sec/thread:   51246 lr:  0.000000 avg.loss:  0.106825 ETA:   0h 0m 0s

~/fastext$ fasttext test langdetect100k.bin valid.100k.txt
N	10000
P@1	0.975
R@1	0.975

~/fastext$ ls -lh langdetect100k.bin
-rw-rw-r-- 1 sergei sergei 6.8M Jun  4 09:56 langdetect100k.bin

The same training data now contains 118M words, of which only 88K are unique: almost 40x fewer distinct words, with 1.7 times more tokens. With this tokenization we get Precision@1 of 97.5%, a +1.5 point gain over the word-only baseline. The model is also dramatically smaller: only 6.8MB (vs. 281MB for the word-only baseline).

It is also interesting to note that the laser100k model was not trained on the language detection training data (it was trained on WikiMatrix data), yet 88K of the 100K available ids are used. This suggests that the tokenization model learned common language properties that transfer between tasks as long as the language stays the same.
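
That id coverage is easy to verify directly from the converted training file; a quick sketch that counts the distinct ids appearing in train.100k.txt:

# count distinct ids in the converted training data (the first field is the label)
used = set()
with open("train.100k.txt", "r", encoding="utf-8") as f:
    for line in f:
        used.update(line.split()[1:])

print(len(used), "distinct ids out of 100K are used")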

For training a better model it does not make sense to enable character n-gram features, since we now have ids, but it does make sense to use token bigrams. If we use token bigrams, it also makes sense to add more dimensions and increase the number of epochs from the default 5 to 10.

~/fastext$ fasttext supervised -dim 32 -wordNgrams 2 -epoch 10 -input train.100k.txt -output langdetect100k2g_32
Read 118M words
Number of words:  88683
Number of labels: 365
Progress: 100.0% words/sec/thread:   55107 lr:  0.000000 avg.loss:  0.042347 ETA:   0h 0m 0s

~/fastext$ fasttext test langdetect100k2g_32.bin valid.100k.txt
N	10000
P@1	0.989
R@1	0.989

~/fastext$ ls -lh langdetect100k2g_32.bin
-rw-rw-r-- 1 sergei sergei 257M Jul  8 14:50 langdetect100k2g_32.bin

As you can see, we got Precision@1 of 98.9%, improving on the best baseline number by +0.4 points, with a model that is noticeably smaller (257MB vs. 403MB).

OK, let's get crazy and try the hyperparameter search feature in fastText:

~/fastext$ fasttext supervised -input train.100k.txt -output langdetect100kauto -autotune-validation valid.100k.txt -autotune-duration 4200 

~/fastext$ fasttext test langdetect100kauto.bin valid.100k.txt
N	10000
P@1	0.99
R@1	0.99

Wow! The auto-tuned model reached Precision@1 of 99.0% (+0.5 points over the best baseline); however, it is way too big to be practical.
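
If size is the concern, fastText's autotune can also optimize under a model size budget, quantizing the model to fit. A sketch with the Python bindings; the 10M target is an arbitrary choice for illustration, not a recommendation:

import fasttext

# autotune hyperparameters on the id-based data, but constrain the final
# (quantized) model to roughly 10 MB
model = fasttext.train_supervised(
    input="train.100k.txt",
    autotuneValidationFile="valid.100k.txt",
    autotuneDuration=4200,
    autotuneModelSize="10M")
print(model.test("valid.100k.txt"))
model.save_model("langdetect100kauto_small.ftz")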

Conclusion

We have shown the benefits of using multilingual tokenization models like laser100k for multilingual text classification. The main advantages of this tokenization are a small number of distinct tokens and the ability to preserve important units of each language, which translate into better accuracy and a much smaller model.
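
For completeness, here is a sketch of how the pieces fit together at inference time, assuming the laser100k tokenizer and one of the id-based fastText models trained above:

import fasttext
from blingfire import load_model, free_model, text_to_ids

# load the tokenization model and the id-based language detection model
tok = load_model("./laser100k.bin")
clf = fasttext.load_model("langdetect100k2g_32.bin")

def detect_language(text):
    # convert raw text to laser100k ids, exactly as in convert_td2id.py,
    # then let fastText classify the id sequence
    ids = text_to_ids(tok, text, 1024, 0, True)
    labels, probs = clf.predict(" ".join(str(i) for i in ids))
    return labels[0], float(probs[0])

print(detect_language("Ceci est une phrase en français."))

free_model(tok)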