Benchmark Guide - microsoft/BlingFire GitHub Wiki

This wiki page describes how to benchmark Bling Fire and reports the results of one benchmark run.

Benchmark result

Performed in December 2018.

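Total running time in seconds, over 100 repeated runs per corpus size:

| Passages | Bling Fire Min | Bling Fire Max | Bling Fire Avg | SpaCy Min | SpaCy Max | SpaCy Avg | NLTK Min | NLTK Max | NLTK Avg |
|---|---|---|---|---|---|---|---|---|---|
| 1K | 0.074 | 0.125 | 0.079 | 1.470 | 1.581 | 1.529 | 1.681 | 1.793 | 1.733 |
| 10K | 0.805 | 0.865 | 0.823 | 8.500 | 9.370 | 8.653 | 17.739 | 18.213 | 17.821 |
| 100K | 7.941 | 8.161 | 8.018 | 86.577 | 93.095 | 87.700 | 181.032 | 185.407 | 182.079 |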

- OS: Linux Ubuntu
- Machine: Azure VM, 6 vCPUs (Intel Xeon CPU E5-2690 v3 @ 2.60GHz), 56GB memory
- Python version: 3.5.6
- SpaCy version: 2.0.17
- NLTK version: 3.4
- Corpus: English Gigawords
- Warm-up time is subtracted: the first 10% of passages are used as warm-up and excluded from the benchmark calculation.
- Each corpus size was run 100 times; the table reports the minimum, maximum, and average over those runs.
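
The measurement procedure described above (warm-up passages excluded, then min/max/avg over repeated runs) can be sketched roughly as follows. This is not the code of `benchmark.py`; the function names are hypothetical, `str.split` stands in for a real tokenizer, and the sketch assumes that "subtraction of warm-up time" means simply not timing the warm-up prefix.

```python
import time

def timed_run(tokenize, passages, warmup_fraction=0.1):
    """Tokenize every passage; return the seconds spent after the warm-up prefix."""
    n_warmup = int(len(passages) * warmup_fraction)
    for text in passages[:n_warmup]:
        tokenize(text)              # warm-up: executed but not timed
    start = time.perf_counter()
    for text in passages[n_warmup:]:
        tokenize(text)              # timed portion
    return time.perf_counter() - start

def summarize(run_times):
    """Return (min, max, avg) of a list of run times."""
    return min(run_times), max(run_times), sum(run_times) / len(run_times)

def benchmark(tokenize, passages, repeats=100, warmup_fraction=0.1):
    """Repeat the timed run and summarize the results."""
    runs = [timed_run(tokenize, passages, warmup_fraction) for _ in range(repeats)]
    return summarize(runs)

# Stand-in tokenizer and toy corpus (assumptions for illustration only).
toy_passages = ["Bling Fire is a tokenizer ."] * 1000
lo, hi, avg = benchmark(str.split, toy_passages, repeats=10)
print("min=%.6f max=%.6f avg=%.6f" % (lo, hi, avg))
```

Swapping `str.split` for each library's tokenizer function would give comparable per-library numbers, provided every library sees the same passages.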

The benchmark script currently supports three corpus types: marco, plaintext, and englishgigawords.

Go to the **/scripts** folder, where you should see `benchmark.py`. Running it with the desired parameters produces the benchmark result.

| Args | Comment | Example |
|---|---|---|
| -d | Specify the data set. | `python3 benchmark.py -d englishgigawords.txt` |
| -n | Number of passages. | `python3 benchmark.py -n 1000` |
| -o | Output the result. No output is produced if this argument is not specified. | `python3 benchmark.py -o` |
| -s | Specify the type of data set. The default is plain text. Options: `marco`, `plaintext`, `englishgigawords`. | `python3 benchmark.py -s englishgigawords` |
| -w | Warm up until: sets the size of the warm-up set. The default is 100. Use this together with `-n`. For example, with `-n 1000 -w 100` the reported result is the processing time of 900 passages. | `python3 benchmark.py -n 1100 -w 100` |
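
To make the relationship between `-n` and `-w` concrete, here is a minimal `argparse` sketch that mirrors the flags in the table. It is a hypothetical stand-in, not the actual parser in `benchmark.py`: the `dest` names and helper functions are invented for illustration, and `-o` is left out because the table does not say what form its value takes.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented flags (not benchmark.py itself)."""
    parser = argparse.ArgumentParser(description="Sketch of the benchmark options")
    parser.add_argument("-d", dest="dataset", help="path to the data set")
    parser.add_argument("-n", dest="num_passages", type=int, help="number of passages")
    parser.add_argument("-s", dest="source_type", default="plaintext",
                        choices=["marco", "plaintext", "englishgigawords"],
                        help="type of data set")
    parser.add_argument("-w", dest="warmup", type=int, default=100,
                        help="number of leading passages used as warm-up")
    return parser

def measured_passages(args):
    """Passages that count toward the reported time: total minus warm-up."""
    return args.num_passages - args.warmup

args = build_parser().parse_args(["-n", "1000", "-w", "100"])
print(measured_passages(args))  # 900
```

With `-n 1100 -w 100`, the same arithmetic yields 1000 measured passages.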

Compared with other popular NLP libraries, Bling Fire is about 10x faster than SpaCy and about 20x faster than NLTK at tokenization, based on the average run time per 10,000 passages:

| System | Avg Run Time (Seconds Per 10,000 Passages) |
|---|---|
| Bling Fire | 0.823 |
| SpaCy | 8.653 |
| NLTK | 17.821 |
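
The speed-up factors follow directly from the averages in the table. The short sketch below recomputes them; the dictionary and function names are illustrative only.

```python
# Average run time in seconds per 10,000 passages, copied from the table above.
avg_seconds = {"Bling Fire": 0.823, "SpaCy": 8.653, "NLTK": 17.821}

def slowdown_factors(times, baseline="Bling Fire"):
    """Return how many times slower each system is than the baseline."""
    base = times[baseline]
    return {name: t / base for name, t in times.items() if name != baseline}

for name, ratio in sorted(slowdown_factors(avg_seconds).items()):
    print("%s: %.1fx the run time of Bling Fire" % (name, ratio))
# NLTK: 21.7x the run time of Bling Fire
# SpaCy: 10.5x the run time of Bling Fire
```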