UNLV Testing of Tesseract - kana112233/tesseract GitHub Wiki

# How to run the UNLV tests on Tesseract

Introduction

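Tesseract 2.0+ provides scripts to run some of the UNLV tests, as published in the Fourth Annual Test of OCR Accuracy, AT-1995.pdf (originally available at http://www.isri.unlv.edu/). The main purpose of these test scripts is to let Tesseract users verify that their installation works correctly and has no architecture-specific problems that degrade recognition accuracy. They also serve as a benchmark for showing the accuracy improvement of each release. Developers working on Tesseract may find the benchmarking tools useful for measuring experimental new modules.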

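Note that some architecture-specific variation is bound to occur. Most of it should be caused by differences in how compilers handle and optimize floating-point arithmetic. It is of course also possible that there are memory initialization bugs that show up as differences between architectures, but we claim to have found most of those during the unicodeization process.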

Caveats

The UNLV images are G4-compressed, so you must build Tesseract with TIFF support. All recent versions include libtiff by default.

Windows users must also have some Unix shell scripting capability, for example through Cygwin or an equivalent.

Images

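The current scripts cover only the 3B test sets (i.e. 300 dpi bitonal). The adaptive thresholding in open-source Tesseract differs from the one used in the original tests, because the original adaptive thresholding was not included in the open-source release. The 8-bit grey-scale image tests therefore cannot be compared correctly, and the other resolutions, while interesting, do not really serve a useful regression-testing purpose.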

# Running the tests

For up-to-date instructions on how to run the tests with Tesseract 4, see the README file for the UNLV tests.

# Example results

Here are some results from the 1995 test, taken from AT-1995.pdf and reformatted to match the output of the Tesseract test harness:

Testid  Testset Character               Word                    Non-stopword
                Errors  Acc     Change  Errors  Acc     Change  Errors  Acc     Change
1995    bus.3B  5959    98.14%  0.00%   1631    96.83%  0.00%   1293    95.73%  0.00%
1995    doe3.3B 36349   97.52%  0.00%   7826    96.34%  0.00%   7042    94.87%  0.00%
1995    mag.3B  15043   97.74%  0.00%   4566    96.01%  0.00%   3379    94.99%  0.00%
1995    news.3B 6432    98.69%  0.00%   1946    97.68%  0.00%   1502    96.94%  0.00%

(The Change columns are for the more recent tests and measure the change relative to these 1995 results.)

The results for Tesseract 2.00 compiled with gcc 4.0.3-1ubuntu5 are:

Testid  Testset Character               Word                    Non-stopword
                Errors  Acc     Change  Errors  Acc     Change  Errors  Acc     Change
gcc4.0  bus.3B  6259    98.04%  5.03%   1691    96.71%  3.68%   1313    95.66%  1.55%
gcc4.0  doe3.3B 28850   98.03%  -20.63% 7863    96.32%  0.47%   6688    95.13%  -5.03%
gcc4.0  mag.3B  14815   97.78%  -1.52%  4396    96.16%  -3.72%  3124    95.37%  -7.55%
gcc4.0  news.3B 7533    98.47%  17.12%  1758    97.91%  -9.66%  1220    97.51%  -18.77%
gcc4.0  Total   57457   -       -9.92%  15708   -       -1.63%  12345   -       -6.59%

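The Change columns show how the results differ from those of 1995: the doe3.3B test set has 20% fewer character errors, whereas the news.3B test set has 17% more. Since the engine has been completely retrained since the 1995 test, and it now runs on a different processor with a different compiler, it is difficult to pin down the causes of these wild changes. (They may also be partly due to not having the aspirin wrapper.)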
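The figures in the Change columns appear to be the relative change in error count against the 1995 baseline. The following minimal sketch reproduces the Character column of the gcc 4.0 table under that assumption. The function names `percent_change` and `change_column` are illustrative and are not part of the Tesseract test harness.

```python
# Character error counts copied from the 1995 and gcc 4.0 tables.
ERRORS_1995 = {"bus.3B": 5959, "doe3.3B": 36349, "mag.3B": 15043, "news.3B": 6432}
ERRORS_GCC40 = {"bus.3B": 6259, "doe3.3B": 28850, "mag.3B": 14815, "news.3B": 7533}


def percent_change(new_errors: int, baseline_errors: int) -> float:
    """Relative change in error count versus the baseline, in percent.

    Assumption: this is how the Change column is derived; negative
    values mean fewer errors than in 1995.
    """
    return 100.0 * (new_errors - baseline_errors) / baseline_errors


def change_column(new: dict, baseline: dict) -> dict:
    """Per-test-set change, plus a 'Total' row computed from summed errors."""
    rows = {name: percent_change(new[name], baseline[name]) for name in baseline}
    rows["Total"] = percent_change(sum(new.values()), sum(baseline.values()))
    return rows


if __name__ == "__main__":
    for name, change in change_column(ERRORS_GCC40, ERRORS_1995).items():
        print(f"{name:8s}{change:7.2f}%")
```

Note that the Total row is computed from the summed error counts rather than by averaging the per-set percentages, which is why a large test set such as doe3.3B dominates it.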

To illustrate the difference a compiler makes, here are the results for the same code compiled with gcc 4.1.1:

Testid  Testset Character               Word                    Non-stopword
                Errors  Acc     Change  Errors  Acc     Change  Errors  Acc     Change
gcc4.1  bus.3B  6258    98.04%  5.02%   1690    96.72%  3.62%   1312    95.67%  1.47%
gcc4.1  doe3.3B 28589   98.05%  -21.35% 7864    96.32%  0.49%   6692    95.12%  -4.97%
gcc4.1  mag.3B  14800   97.78%  -1.62%  4394    96.16%  -3.77%  3123    95.37%  -7.58%
gcc4.1  news.3B 7524    98.47%  16.98%  1759    97.91%  -9.61%  1220    97.51%  -18.77%
gcc4.1  Total   57171   -       -10.37% 15707   -       -1.64%  12347   -       -6.58%

The error rates are not that different, but there are subtle differences. By comparison, the same code built with Visual C++ Express gives:

Testid  Testset Character               Word                    Non-stopword
                Errors  Acc     Change  Errors  Acc     Change  Errors  Acc     Change
vc++exp bus.3B  6270    98.04%  5.22%   1695    96.71%  3.92%   1315    95.66%  1.70%
vc++exp doe3.3B 29098   98.01%  -19.95% 8246    96.14%  5.37%   7038    94.87%  -0.06%
vc++exp mag.3B  14981   97.75%  -0.41%  4435    96.12%  -2.87%  3157    95.32%  -6.57%
vc++exp news.3B 7548    98.47%  17.35%  1763    97.90%  -9.40%  1224    97.51%  -18.51%
vc++exp Total   57897   -       -9.23%  16139   -       1.06%   12734   -       -3.65%

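This shows a considerable increase in the error rate, and that is after some uses of floating-point arithmetic were eliminated from the code. Even more different, however, is Visual C++ 6, which measures slightly higher word accuracy but worse character accuracy: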

Testid  Testset Character               Word                    Non-stopword
                Errors  Acc     Change  Errors  Acc     Change  Errors  Acc     Change
vc6     bus.3B  6298    98.03%  5.69%   1696    96.70%  3.99%   1317    95.65%  1.86%
vc6     doe3.3B 29745   97.97%  -18.17% 8105    96.20%  3.57%   6894    94.98%  -2.10%
vc6     mag.3B  15036   97.74%  -0.05%  4448    96.11%  -2.58%  3165    95.31%  -6.33%
vc6     news.3B 7531    98.47%  17.09%  1745    97.92%  -10.33% 1210    97.53%  -19.44%
vc6     Total   58610   -       -8.11%  15994   -       0.16%   12586   -       -4.77%

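Future work may be directed at making these differences smaller, if not eliminating them entirely, on the grounds that where there is variation, there is room for improvement.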

### Results from more recent versions of Tesseract [by Tom Morris](https://groups.google.com/forum/#!searchin/tesseract-dev/bus.3B%7Csort:date/tesseract-dev/LErriuT-SCK/B5PR0QaCGwAJ)

All were compiled with the Apple C compiler, Apple LLVM version 7.0.2 (clang-700.1.81), targeting x86_64-apple-darwin14.3.0.

Testid  Testset Character               Word                    Non-stopword
                Errors  Acc     Change  Errors  Acc     Change  Errors  Acc     Change
3.04.01 bus.3B  8816    97.24%  47.94%  2221    95.68%  36.17%  1629    94.62%  25.99%
3.04.01 doe3.3B 48306   96.70%  32.89%  9903    95.36%  26.54%  9020    93.43%  28.09%
3.04.01 mag.3B  30860   95.37%  105.15% 7034    93.85%  54.05%  5228    92.25%  54.72%
3.04.01 news.3B 19073   96.12%  196.53% 3432    95.92%  76.36%  2685    94.53%  78.76%
3.04.01 Total   107055  -       67.84%  22590   -       41.46%  18562   -       40.45%

Testid  Testset Character               Word                    Non-stopword
                Errors  Acc     Change  Errors  Acc     Change  Errors  Acc     Change
3.03rc1 bus.3B  8816    97.24%  47.94%  2221    95.68%  36.17%  1629    94.62%  25.99%
3.03rc1 doe3.3B 48306   96.70%  32.89%  9903    95.36%  26.54%  9020    93.43%  28.09%
3.03rc1 mag.3B  30860   95.37%  105.15% 7034    93.85%  54.05%  5228    92.25%  54.72%
3.03rc1 news.3B 19073   96.12%  196.53% 3432    95.92%  76.36%  2685    94.53%  78.76%
3.03rc1 Total   107055  -       67.84%  22590   -       41.46%  18562   -       40.45%

Testid  Testset Character               Word                    Non-stopword
                Errors  Acc     Change  Errors  Acc     Change  Errors  Acc     Change
3.02.02 bus.3B  6039    98.11%  1.34%   1541    97.01%  -5.52%  1240    95.90%  -4.10%
3.02.02 doe3.3B 35988   97.54%  -0.99%  8482    96.03%  8.38%   7640    94.43%  8.49%
3.02.02 mag.3B  14367   97.84%  -4.49%  3891    96.60%  -14.78% 3024    95.52%  -10.51%
3.02.02 news.3B 7148    98.55%  11.13%  1484    98.23%  -23.74% 1152    97.65%  -23.30%
3.02.02 Total   63542   -       -0.38%  15398   -       -3.58%  13056   -       -1.21%

Testid  Testset Character               Word                    Non-stopword
                Errors  Acc     Change  Errors  Acc     Change  Errors  Acc     Change
3.01    bus.3B  22384   93.00%  275.63% 2253    95.62%  38.14%  1863    93.85%  44.08%
3.01    doe3.3B 301312  79.41%  728.94% 13924   93.48%  77.92%  11665   91.50%  65.65%
3.01    mag.3B  160024  75.98%  963.78% 10698   90.65%  134.30% 7261    89.24%  114.89%
3.01    news.3B 43454   91.17%  575.59% 3469    95.87%  78.26%  2380    95.15%  58.46%
3.01    Total   527174  -       726.51% 30344   -       90.02%  23169   -       75.31%

Testid  Testset Character               Word                    Non-stopword
                Errors  Acc     Change  Errors  Acc     Change  Errors  Acc     Change
2.04    bus.3B  6422    97.99%  7.77%   1750    96.60%  7.30%   1361    95.51%  5.26%
2.04    doe3.3B 29514   97.98%  -18.80% 7963    96.27%  1.75%   6762    95.07%  -3.98%
2.04    mag.3B  14568   97.81%  -3.16%  4289    96.25%  -6.07%  3053    95.47%  -9.65%
2.04    news.3B 7655    98.44%  19.01%  1730    97.94%  -11.10% 1208    97.54%  -19.57%
2.04    Total   58159   -       -8.82%  15732   -       -1.48%  12384   -       -6.30%

# How to use Nick White's OCR evaluation tools

Introduction

Nick White has shared a repository of the ISRI OCR evaluation tools, adapted to work easily with UTF-8 and including some helper scripts.

git clone https://ancientgreekocr.org/ocr-evaluation-tools.git
Tools to test OCR accuracy.

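Of particular relevance here is the 'tessaccsummary' script. Given a directory of images with corresponding ground-truth text and a .traineddata file, it runs OCR on each page and prints the accuracy, followed by an average summary at the end.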
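To make the idea of per-page accuracy followed by an average summary concrete, here is a rough sketch. It is not the ISRI tools' algorithm and not the 'tessaccsummary' implementation. It assumes, for illustration only, that character errors are counted as Levenshtein edit distance against the ground truth and that the summary is a plain mean over pages. All function names are hypothetical.

```python
def edit_distance(truth: str, ocr: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(ocr) + 1))
    for i, t_char in enumerate(truth, start=1):
        current = [i]
        for j, o_char in enumerate(ocr, start=1):
            cost = 0 if t_char == o_char else 1
            current.append(min(previous[j] + 1,          # drop a truth character
                               current[j - 1] + 1,       # extra OCR character
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def character_accuracy(truth: str, ocr: str) -> float:
    """Percentage of ground-truth characters not counted as errors (assumed metric)."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return 100.0 * (len(truth) - edit_distance(truth, ocr)) / len(truth)


def summarize(pages):
    """Return per-page accuracies and their mean for (truth, ocr) text pairs."""
    per_page = [character_accuracy(truth, ocr) for truth, ocr in pages]
    return per_page, sum(per_page) / len(per_page)


if __name__ == "__main__":
    pages = [("tesseract", "tesseract"), ("abcd", "abxd")]
    per_page, mean = summarize(pages)
    for accuracy in per_page:
        print(f"{accuracy:.2f}%")
    print(f"mean: {mean:.2f}%")
```

A plain mean weights short and long pages equally. Dividing total errors by total characters, as the Total rows of the tables above seem to do for error counts, would weight pages by length instead.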