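# How to run UNLV tests on Tesseract

## Introduction

Tesseract 2.0+ provides scripts that can run some of the UNLV tests published in the Fourth Annual Test of OCR Accuracy. See AT-1995.pdf (originally available at http://www.isri.unlv.edu/). The main purpose of these test scripts is to let Tesseract users verify that their installation is correct and that no architecture-specific problems are degrading recognition accuracy. They also serve as a benchmark that shows how accuracy improves from release to release. Developers working on Tesseract may find the benchmark test harness useful for measuring experimental new modules.

Note that some architecture-specific variation is bound to occur. Most of it should be caused by differences in how compilers handle and optimize floating-point arithmetic. It is of course still possible that memory initialization bugs show up as differences between architectures, but we claim to have found most of them during the unicodeization process.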
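
## Caveats

The UNLV images are G4 compressed, so you must build Tesseract with TIFF support. All recent versions include libtiff by default.

Windows users must also have some unix shell scripting capability, probably through cygwin or an equivalent.

## Images

The current scripts only cover the 3B test sets (i.e. 300 dpi bitonal). The adaptive thresholding in open source Tesseract differs from the original, because the original adaptive thresholding was not included in the open source release, so the 8-bit greyscale image tests cannot be compared properly. The other resolutions, while interesting, don't really serve a useful regression testing purpose.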
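
# Running the tests

For up-to-date instructions on running the tests for Tesseract 4, see the README for the UNLV tests.

# Example results

Here are some results from the 1995 test, taken from AT-1995.pdf and reformatted to match the output of the Tesseract test harness: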
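
| Testid | Testset | Char Errors | Char Acc | Char Change | Word Errors | Word Acc | Word Change | Non-stopword Errors | Non-stopword Acc | Non-stopword Change |
|---|---|---|---|---|---|---|---|---|---|---|
| 1995 | bus.3B | 5959 | 98.14% | 0.00% | 1631 | 96.83% | 0.00% | 1293 | 95.73% | 0.00% |
| 1995 | doe3.3B | 36349 | 97.52% | 0.00% | 7826 | 96.34% | 0.00% | 7042 | 94.87% | 0.00% |
| 1995 | mag.3B | 15043 | 97.74% | 0.00% | 4566 | 96.01% | 0.00% | 3379 | 94.99% | 0.00% |
| 1995 | news.3B | 6432 | 98.69% | 0.00% | 1946 | 97.68% | 0.00% | 1502 | 96.94% | 0.00% |
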
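(The Change columns are for use with more recent tests, where they measure the change against these 1995 results. They are 0.00% here because this table is the baseline.)

The results from Tesseract 2.00 compiled with gcc 4.0.3-1ubuntu5 are: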
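
| Testid | Testset | Char Errors | Char Acc | Char Change | Word Errors | Word Acc | Word Change | Non-stopword Errors | Non-stopword Acc | Non-stopword Change |
|---|---|---|---|---|---|---|---|---|---|---|
| gcc4.0 | bus.3B | 6259 | 98.04% | 5.03% | 1691 | 96.71% | 3.68% | 1313 | 95.66% | 1.55% |
| gcc4.0 | doe3.3B | 28850 | 98.03% | -20.63% | 7863 | 96.32% | 0.47% | 6688 | 95.13% | -5.03% |
| gcc4.0 | mag.3B | 14815 | 97.78% | -1.52% | 4396 | 96.16% | -3.72% | 3124 | 95.37% | -7.55% |
| gcc4.0 | news.3B | 7533 | 98.47% | 17.12% | 1758 | 97.91% | -9.66% | 1220 | 97.51% | -18.77% |
| gcc4.0 | Total | 57457 | - | -9.92% | 15708 | - | -1.63% | 12345 | - | -6.59% |
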
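The Change columns show the relative change in error count from the 1995 results: the doe3.3B test set has about 20% fewer character errors, while the news.3B test set has about 17% more. Because the engine has been completely retrained since the 1995 test, and it now runs on a different processor with a different compiler, it is hard to pin down the reason for this wild variation. (It may also be partly attributable to the absence of the Aspirin package.)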
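
The Change values in the tables are consistent with a relative change in error count against the 1995 baseline. That formula is inferred from the published numbers, not taken from the test harness source, so treat the following as a minimal sketch for checking the arithmetic. The function and variable names are illustrative only.

```python
# Character error counts copied from the 1995 and gcc 4.0 tables.
BASELINE_1995 = {"bus.3B": 5959, "doe3.3B": 36349, "mag.3B": 15043, "news.3B": 6432}
GCC40 = {"bus.3B": 6259, "doe3.3B": 28850, "mag.3B": 14815, "news.3B": 7533}


def percent_change(errors, baseline_errors):
    """Relative change in error count versus the baseline, in percent."""
    return 100.0 * (errors - baseline_errors) / baseline_errors


def change_column(results, baseline):
    """Per-test-set change, plus a Total row computed from the summed errors."""
    changes = {name: percent_change(results[name], baseline[name]) for name in baseline}
    changes["Total"] = percent_change(sum(results.values()), sum(baseline.values()))
    return changes


if __name__ == "__main__":
    for name, change in change_column(GCC40, BASELINE_1995).items():
        print(f"{name:8s} {change:7.2f}%")
```

A negative value means fewer errors than in 1995, so negative is better. Note that the Total row is computed from summed error counts, so it is dominated by the largest test set (doe3.3B) rather than being an average of the per-set percentages.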
To illustrate the difference the compiler makes, here are the results from the same code compiled with gcc 4.1.1:
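
| Testid | Testset | Char Errors | Char Acc | Char Change | Word Errors | Word Acc | Word Change | Non-stopword Errors | Non-stopword Acc | Non-stopword Change |
|---|---|---|---|---|---|---|---|---|---|---|
| gcc4.1 | bus.3B | 6258 | 98.04% | 5.02% | 1690 | 96.72% | 3.62% | 1312 | 95.67% | 1.47% |
| gcc4.1 | doe3.3B | 28589 | 98.05% | -21.35% | 7864 | 96.32% | 0.49% | 6692 | 95.12% | -4.97% |
| gcc4.1 | mag.3B | 14800 | 97.78% | -1.62% | 4394 | 96.16% | -3.77% | 3123 | 95.37% | -7.58% |
| gcc4.1 | news.3B | 7524 | 98.47% | 16.98% | 1759 | 97.91% | -9.61% | 1220 | 97.51% | -18.77% |
| gcc4.1 | Total | 57171 | - | -10.37% | 15707 | - | -1.64% | 12347 | - | -6.58% |
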
The error rates are not very different, but there are subtle differences. In contrast, the same code built with Visual C++ Express gives:
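
| Testid | Testset | Char Errors | Char Acc | Char Change | Word Errors | Word Acc | Word Change | Non-stopword Errors | Non-stopword Acc | Non-stopword Change |
|---|---|---|---|---|---|---|---|---|---|---|
| vc++exp | bus.3B | 6270 | 98.04% | 5.22% | 1695 | 96.71% | 3.92% | 1315 | 95.66% | 1.70% |
| vc++exp | doe3.3B | 29098 | 98.01% | -19.95% | 8246 | 96.14% | 5.37% | 7038 | 94.87% | -0.06% |
| vc++exp | mag.3B | 14981 | 97.75% | -0.41% | 4435 | 96.12% | -2.87% | 3157 | 95.32% | -6.57% |
| vc++exp | news.3B | 7548 | 98.47% | 17.35% | 1763 | 97.90% | -9.40% | 1224 | 97.51% | -18.51% |
| vc++exp | Total | 57897 | - | -9.23% | 16139 | - | 1.06% | 12734 | - | -3.65% |
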
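This shows a sizable increase in the error rate, and that is after some uses of floating-point arithmetic were eliminated from the code. Even more drastically different is Visual C++ 6, which measures slightly better than Visual C++ Express on word accuracy, but worse on character accuracy: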
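
| Testid | Testset | Char Errors | Char Acc | Char Change | Word Errors | Word Acc | Word Change | Non-stopword Errors | Non-stopword Acc | Non-stopword Change |
|---|---|---|---|---|---|---|---|---|---|---|
| vc6 | bus.3B | 6298 | 98.03% | 5.69% | 1696 | 96.70% | 3.99% | 1317 | 95.65% | 1.86% |
| vc6 | doe3.3B | 29745 | 97.97% | -18.17% | 8105 | 96.20% | 3.57% | 6894 | 94.98% | -2.10% |
| vc6 | mag.3B | 15036 | 97.74% | -0.05% | 4448 | 96.11% | -2.58% | 3165 | 95.31% | -6.33% |
| vc6 | news.3B | 7531 | 98.47% | 17.09% | 1745 | 97.92% | -10.33% | 1210 | 97.53% | -19.44% |
| vc6 | Total | 58610 | - | -8.11% | 15994 | - | 0.16% | 12586 | - | -4.77% |
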
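Future work might be directed at making these differences smaller, if not eliminating them altogether, on the grounds that where there is variation, there is room for improvement.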
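
To put a number on the compiler-to-compiler variation, the per-set character error counts from the four Tesseract 2.00 tables can be summed and compared. The sketch below is only an illustration of that arithmetic: the helper names and the "spread relative to the best build" measure are choices made here, not something the test harness reports.

```python
# Character error counts per test set, copied from the four compiler tables.
CHAR_ERRORS = {
    "gcc4.0": {"bus.3B": 6259, "doe3.3B": 28850, "mag.3B": 14815, "news.3B": 7533},
    "gcc4.1": {"bus.3B": 6258, "doe3.3B": 28589, "mag.3B": 14800, "news.3B": 7524},
    "vc++exp": {"bus.3B": 6270, "doe3.3B": 29098, "mag.3B": 14981, "news.3B": 7548},
    "vc6": {"bus.3B": 6298, "doe3.3B": 29745, "mag.3B": 15036, "news.3B": 7531},
}


def totals(results):
    """Sum the per-test-set error counts for each build."""
    return {build: sum(per_set.values()) for build, per_set in results.items()}


def spread_percent(values):
    """Gap between the worst and best value, relative to the best, in percent."""
    values = list(values)
    best, worst = min(values), max(values)
    return 100.0 * (worst - best) / best


if __name__ == "__main__":
    build_totals = totals(CHAR_ERRORS)
    # List builds from fewest to most character errors.
    for build, total in sorted(build_totals.items(), key=lambda item: item[1]):
        print(f"{build:8s} {total}")
    print(f"spread: {spread_percent(build_totals.values()):.2f}%")
```

The summed values reproduce the Total rows of the tables, which is a quick sanity check that the per-set figures were transcribed correctly.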

### Results for more recent versions of Tesseract [by Tom Morris](https://groups.google.com/forum/#!searchin/tesseract-dev/bus.3B%7Csort:date/tesseract-dev/LErriuT-SCK/B5PR0QaCGwAJ)

All versions were compiled with the Apple C compiler, Apple LLVM version 7.0.2 (clang-700.1.81), targeting x86_64-apple-darwin14.3.0.
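
| Testid | Testset | Char Errors | Char Acc | Char Change | Word Errors | Word Acc | Word Change | Non-stopword Errors | Non-stopword Acc | Non-stopword Change |
|---|---|---|---|---|---|---|---|---|---|---|
| 3.04.01 | bus.3B | 8816 | 97.24% | 47.94% | 2221 | 95.68% | 36.17% | 1629 | 94.62% | 25.99% |
| 3.04.01 | doe3.3B | 48306 | 96.70% | 32.89% | 9903 | 95.36% | 26.54% | 9020 | 93.43% | 28.09% |
| 3.04.01 | mag.3B | 30860 | 95.37% | 105.15% | 7034 | 93.85% | 54.05% | 5228 | 92.25% | 54.72% |
| 3.04.01 | news.3B | 19073 | 96.12% | 196.53% | 3432 | 95.92% | 76.36% | 2685 | 94.53% | 78.76% |
| 3.04.01 | Total | 107055 | - | 67.84% | 22590 | - | 41.46% | 18562 | - | 40.45% |

| Testid | Testset | Char Errors | Char Acc | Char Change | Word Errors | Word Acc | Word Change | Non-stopword Errors | Non-stopword Acc | Non-stopword Change |
|---|---|---|---|---|---|---|---|---|---|---|
| 3.03rc1 | bus.3B | 8816 | 97.24% | 47.94% | 2221 | 95.68% | 36.17% | 1629 | 94.62% | 25.99% |
| 3.03rc1 | doe3.3B | 48306 | 96.70% | 32.89% | 9903 | 95.36% | 26.54% | 9020 | 93.43% | 28.09% |
| 3.03rc1 | mag.3B | 30860 | 95.37% | 105.15% | 7034 | 93.85% | 54.05% | 5228 | 92.25% | 54.72% |
| 3.03rc1 | news.3B | 19073 | 96.12% | 196.53% | 3432 | 95.92% | 76.36% | 2685 | 94.53% | 78.76% |
| 3.03rc1 | Total | 107055 | - | 67.84% | 22590 | - | 41.46% | 18562 | - | 40.45% |

| Testid | Testset | Char Errors | Char Acc | Char Change | Word Errors | Word Acc | Word Change | Non-stopword Errors | Non-stopword Acc | Non-stopword Change |
|---|---|---|---|---|---|---|---|---|---|---|
| 3.02.02 | bus.3B | 6039 | 98.11% | 1.34% | 1541 | 97.01% | -5.52% | 1240 | 95.90% | -4.10% |
| 3.02.02 | doe3.3B | 35988 | 97.54% | -0.99% | 8482 | 96.03% | 8.38% | 7640 | 94.43% | 8.49% |
| 3.02.02 | mag.3B | 14367 | 97.84% | -4.49% | 3891 | 96.60% | -14.78% | 3024 | 95.52% | -10.51% |
| 3.02.02 | news.3B | 7148 | 98.55% | 11.13% | 1484 | 98.23% | -23.74% | 1152 | 97.65% | -23.30% |
| 3.02.02 | Total | 63542 | - | -0.38% | 15398 | - | -3.58% | 13056 | - | -1.21% |

| Testid | Testset | Char Errors | Char Acc | Char Change | Word Errors | Word Acc | Word Change | Non-stopword Errors | Non-stopword Acc | Non-stopword Change |
|---|---|---|---|---|---|---|---|---|---|---|
| 3.01 | bus.3B | 22384 | 93.00% | 275.63% | 2253 | 95.62% | 38.14% | 1863 | 93.85% | 44.08% |
| 3.01 | doe3.3B | 301312 | 79.41% | 728.94% | 13924 | 93.48% | 77.92% | 11665 | 91.50% | 65.65% |
| 3.01 | mag.3B | 160024 | 75.98% | 963.78% | 10698 | 90.65% | 134.30% | 7261 | 89.24% | 114.89% |
| 3.01 | news.3B | 43454 | 91.17% | 575.59% | 3469 | 95.87% | 78.26% | 2380 | 95.15% | 58.46% |
| 3.01 | Total | 527174 | - | 726.51% | 30344 | - | 90.02% | 23169 | - | 75.31% |

| Testid | Testset | Char Errors | Char Acc | Char Change | Word Errors | Word Acc | Word Change | Non-stopword Errors | Non-stopword Acc | Non-stopword Change |
|---|---|---|---|---|---|---|---|---|---|---|
| 2.04 | bus.3B | 6422 | 97.99% | 7.77% | 1750 | 96.60% | 7.30% | 1361 | 95.51% | 5.26% |
| 2.04 | doe3.3B | 29514 | 97.98% | -18.80% | 7963 | 96.27% | 1.75% | 6762 | 95.07% | -3.98% |
| 2.04 | mag.3B | 14568 | 97.81% | -3.16% | 4289 | 96.25% | -6.07% | 3053 | 95.47% | -9.65% |
| 2.04 | news.3B | 7655 | 98.44% | 19.01% | 1730 | 97.94% | -11.10% | 1208 | 97.54% | -19.57% |
| 2.04 | Total | 58159 | - | -8.82% | 15732 | - | -1.48% | 12384 | - | -6.30% |
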
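# How to use Nick White's OCR Evaluation Tools

## Introduction

Nick White has shared a repository of the ISRI OCR Evaluation Tools, tweaked to work nicely with UTF-8 and bundled with some helper scripts.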
git clone https://ancientgreekocr.org/ocr-evaluation-tools.git
Tools to test OCR accuracy.
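Of particular relevance here is the 'tessaccsummary' script. Given a directory of images with corresponding ground truth text, and a .traineddata file, it will OCR each page and print its accuracy, then give an average summary at the end.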
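
To make the idea of per-page accuracy plus a final summary concrete, here is a minimal Python sketch. It is not the tessaccsummary script or the ISRI tools, and it does not run OCR. It assumes character accuracy is defined as (ground-truth characters minus errors) divided by ground-truth characters, with errors counted as edit distance, and it assumes the summary pools characters across pages. How the real script averages is not stated here, and all function names are hypothetical.

```python
def edit_distance(truth, ocr):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(ocr) + 1))
    for i, t_char in enumerate(truth, start=1):
        current = [i]
        for j, o_char in enumerate(ocr, start=1):
            cost = 0 if t_char == o_char else 1
            current.append(min(
                previous[j] + 1,         # ground-truth character missing from OCR
                current[j - 1] + 1,      # extra character in OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(truth, ocr):
    """Accuracy in percent; negative if errors exceed the ground-truth length."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return 100.0 * (len(truth) - edit_distance(truth, ocr)) / len(truth)


def summarize(pages):
    """Pooled accuracy over (ground_truth, ocr_output) pairs, in percent."""
    pages = list(pages)
    total_chars = sum(len(truth) for truth, _ in pages)
    total_errors = sum(edit_distance(truth, ocr) for truth, ocr in pages)
    return 100.0 * (total_chars - total_errors) / total_chars


if __name__ == "__main__":
    # Hypothetical page texts; real input would be ground truth files and OCR output.
    pages = [("Tesseract", "Tesseract"), ("accuracy", "accurocy")]
    for truth, ocr in pages:
        print(f"{character_accuracy(truth, ocr):6.2f}%  {ocr}")
    print(f"{summarize(pages):6.2f}%  (pooled)")
```

Pooling characters weights long pages more heavily than a plain mean of per-page percentages would, which matches how the Total rows in the UNLV tables are built from summed error counts.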