Benchmark Guide - microsoft/BlingFire GitHub Wiki

This page describes how to benchmark Bling Fire.

Benchmark results

Performed in December 2018.

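Total running time in seconds (each measurement repeated 100 times)

| Passages | Bling Fire Min | Bling Fire Max | Bling Fire Avg | SpaCy Min | SpaCy Max | SpaCy Avg | NLTK Min | NLTK Max | NLTK Avg |
|----------|----------------|----------------|----------------|-----------|-----------|-----------|----------|----------|----------|
| 1K       | 0.074          | 0.125          | 0.079          | 1.470     | 1.581     | 1.529     | 1.681    | 1.793    | 1.733    |
| 10K      | 0.805          | 0.865          | 0.823          | 8.500     | 9.370     | 8.653     | 17.739   | 18.213   | 17.821   |
| 100K     | 7.941          | 8.161          | 8.018          | 86.577    | 93.095    | 87.700    | 181.032  | 185.407  | 182.079  |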

Benchmark Setup

  • OS: Linux Ubuntu
  • Machine: Azure VM, 6 vCPUs (Intel Xeon CPU E5-2690 v3 @ 2.60GHz), 56 GB memory
  • Python version: 3.5.6
  • SpaCy version: 2.0.17
  • NLTK version: 3.4
  • Corpus: English Gigawords
  • Warm-up time is subtracted: the first 10% of passages are used as warm-up and excluded from the benchmark calculation.
  • Results are collected over 100 repeated runs for each data size.
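
The warm-up exclusion and the min/max/avg aggregation described above can be sketched as follows. This is a minimal illustration, not the code in benchmark.py: the function name `benchmark`, its parameters, and the choice to rerun the warm-up before every repeat are assumptions, and `str.split` stands in for the tokenizer under test.

```python
import time
from statistics import mean


def benchmark(tokenize, passages, warmup_fraction=0.10, repeats=100):
    """Return (min, max, avg) total running time in seconds over `repeats` runs.

    The first `warmup_fraction` of passages is processed untimed, so only
    the remaining passages contribute to each run's measurement.
    (Hypothetical helper; the real benchmark.py may be organized differently.)
    """
    n_warmup = int(len(passages) * warmup_fraction)
    warmup, timed = passages[:n_warmup], passages[n_warmup:]

    totals = []
    for _ in range(repeats):
        for text in warmup:          # warm-up pass, excluded from timing
            tokenize(text)
        start = time.perf_counter()
        for text in timed:           # measured pass
            tokenize(text)
        totals.append(time.perf_counter() - start)
    return min(totals), max(totals), mean(totals)


# Stand-in tokenizer and data; swap in the tokenizer being measured.
passages = ["Bling Fire is a fast tokenizer ."] * 1000
lo, hi, avg = benchmark(str.split, passages, repeats=5)
print(f"min={lo:.6f}s max={hi:.6f}s avg={avg:.6f}s")
```

Passing the tokenizer as a callable keeps the timing loop identical for every library, so differences in the reported numbers come from the tokenizers rather than from the harness.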

Run benchmark

Getting data

The script currently supports three corpus types: marco, plaintext, and englishgigawords.

Run the script

Go to the **/scripts** folder, where you should see benchmark.py. Running it with the desired parameters produces the benchmark result.

| Args | Comment | Example |
|------|---------|---------|
| -d   | Specify the data set. | `python3 benchmark.py -d englishgigawords.txt` |
| -n   | Number of passages. | `python3 benchmark.py -n 1000` |
| -o   | Output the result. No output is produced if this argument is not specified. | `python3 benchmark.py -o` |
| -s   | Specify the type of data set. The default is plain text. Options: marco, plaintext, englishgigawords. | `python3 benchmark.py -s englishgigawords` |
| -w   | Warm up until: sets the size of the warm-up set. The default is 100. Use this together with -n. For example, with `-n 1000 -w 100` the reported result is the processing time of 900 passages. | `python3 benchmark.py -n 1100 -w 100` |
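
To make the relationship between -n and -w concrete, here is a hypothetical reconstruction of the documented flags using `argparse`. The flag letters and the default warm-up size of 100 come from the table above; the `dest` names, the treatment of -o as a boolean switch, and the `build_parser` helper are assumptions and may not match the real benchmark.py.

```python
import argparse


def build_parser():
    # Hypothetical reconstruction of the documented flags; the real
    # benchmark.py may define them differently.
    parser = argparse.ArgumentParser(prog="benchmark.py")
    parser.add_argument("-d", dest="data", help="path to the data set")
    parser.add_argument("-n", dest="num_passages", type=int,
                        help="number of passages")
    parser.add_argument("-o", dest="output", action="store_true",
                        help="output the result (assumed to be a switch)")
    parser.add_argument("-s", dest="source_type", default="plaintext",
                        choices=["marco", "plaintext", "englishgigawords"],
                        help="type of data set")
    parser.add_argument("-w", dest="warm_up_until", type=int, default=100,
                        help="size of the warm-up set")
    return parser


# Equivalent to: benchmark.py -n 1000 -w 100
args = build_parser().parse_args(["-n", "1000", "-w", "100"])

# Passages that contribute to the reported time: total minus warm-up.
timed_passages = args.num_passages - args.warm_up_until
print(timed_passages)  # 900
```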

Summary

Compared with other popular NLP libraries on the tokenization task, Bling Fire is about 10x faster than SpaCy and about 20x faster than NLTK.

| System     | Avg Run Time (Seconds per 10,000 Passages) |
|------------|--------------------------------------------|
| Bling Fire | 0.823                                      |
| SpaCy      | 8.653                                      |
| NLTK       | 17.821                                     |
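
The speedup factors follow directly from the averages above. A short calculation, using only the numbers in the summary table (the variable names are illustrative):

```python
# Average seconds per 10,000 passages, copied from the summary table.
avg_seconds = {"Bling Fire": 0.823, "SpaCy": 8.653, "NLTK": 17.821}

# Speedup of Bling Fire relative to each system: other time / Bling Fire time.
baseline = avg_seconds["Bling Fire"]
speedup = {name: t / baseline for name, t in avg_seconds.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x")
# Bling Fire: 1.0x
# SpaCy: 10.5x
# NLTK: 21.7x
```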