Building a Chinese training corpus - wanghaisheng/awesome-ocr GitHub Wiki

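Training data for printed text

Books and documents

Restaurant menus

Restaurant receipts

Medical bills, reports, lab test sheets, prescriptions

Invoices

Web page interfaces

Training data for natural scenes

ID cards, bank cards, passports, medical insurance cards

CAPTCHAs

http://caca.zoy.org/wiki/PWNtcha

License plate numbers

Door plates and shop signs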

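1. Books and documents

Steps for generating text-line data

Preparation:

  • 1. Tools: we use the linegen tool that ships with ocropus. Starting a container from a Docker image is enough.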
docker run -it --name generate-trainning-data -v /home/wanghs/ocr/ocr-training-material/trainning_data/:/work edwin/ocropus /bin/bash

python pic/ocropus-linegen -t ../txt-fonts/eachline_1_ocr_input_with_special_char_total_8099.txt -f ../txt-fonts/simsun.ttc -o ../pic_simsun_eachline_1_ocr_input_with_special_char_total_8099 -m 15000

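  • 2. Materials
    • UTF-8 encoded text lines (the Chinese and English characters the model should recognize, or characters of other languages such as Sanskrit)
    • Font files in ttf format (the fonts the trained model is expected to recognize)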
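Text saved on Windows is often GBK rather than UTF-8, so it can help to check the text file before feeding it to linegen. The following is a minimal sketch; `load_text_lines` is a hypothetical helper introduced here for illustration and is not part of ocropus.

```python
from pathlib import Path


def load_text_lines(path):
    """Return the non-empty lines of a UTF-8 text file.

    Raises UnicodeDecodeError if the file is not valid UTF-8,
    for example when it was saved as GBK.
    """
    raw = Path(path).read_bytes()
    text = raw.decode("utf-8-sig")  # also accepts a leading byte-order mark
    return [line.strip() for line in text.splitlines() if line.strip()]
```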

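Text lines can be taken from the GB2312 character set, from character and word lists on GitHub, or from common medical dictionaries. Chinese is used as the example here.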
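As one way to obtain the GB2312 characters without downloading a list, they can be enumerated with Python's built-in `gb2312` codec. This is a sketch, not the author's method: in the EUC-CN encoding of GB2312, level-1 characters use lead bytes 0xB0 to 0xD7 and level-2 characters use 0xD8 to 0xF7. The function names, the output file name and the one-character-per-line layout are assumptions made for illustration.

```python
def gb2312_hanzi(first_lead, last_lead):
    """Return the characters whose GB2312 (EUC-CN) lead byte lies in the range."""
    chars = []
    for lead in range(first_lead, last_lead + 1):
        for trail in range(0xA1, 0xFF):  # trail bytes 0xA1..0xFE
            try:
                chars.append(bytes([lead, trail]).decode("gb2312"))
            except UnicodeDecodeError:
                continue  # unassigned code point, skip it
    return chars


def write_text_lines(chars, path):
    """Write one character per line as UTF-8 (this layout is an assumption)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(chars) + "\n")


if __name__ == "__main__":
    level1 = gb2312_hanzi(0xB0, 0xD7)  # level-1 (most common) characters
    level2 = gb2312_hanzi(0xD8, 0xF7)  # level-2 characters
    # Hypothetical output file name.
    write_text_lines(level1 + level2, "gb2312_hanzi.txt")
```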

- [Commonly used Chinese characters (simplified and traditional)](https://github.com/howiehu/commonly-used-chinese-characters)
- [A list of 3754 Chinese simplified common characters](https://github.com/liaohuqiu/chinese-simplified-common-characters)
- [Four-corner codes for 6355 commonly used Japanese kanji](https://github.com/Mikumikunisiteageru/Kanji_4_Corner_Index)
- [Chinese-to-pinyin code table covering about 7000 level-1 and level-2 characters](https://github.com/lanyuechen/pinyin)
- [Complete list of HTML special characters #138](https://github.com/9958/rinblog/issues/138)
- [Complete list of special characters](http://xh.5156edu.com/page/18466.html)
- [Special characters](http://baike.baidu.com/link?url=8YYTP7eMr_EV2ADb-0ZLXw4Gt71Ea-u7rbVKNZd-RioODH_gOmdkQwnk7QBTDIhBxj2trf7Ut_zXEptCdgRuh_)

Chinese fonts

Commonly used font files can be copied directly from a Windows system.
Free Chinese fonts
https://github.com/zenozeng/Free-Chinese-Fonts
http://pan.baidu.com/s/1eQ2KnsA 经典简宋
http://pan.baidu.com/s/1hqTQP4g 华康翩跹.ttf
https://github.com/aui/font-spider
An automated compression tool for Chinese WebFonts: http://font-spider.org

- [Source Han Sans (思源黑体): Simplified Chinese ttf version](https://github.com/aui/free-fonts/archive/KaiGenGothic-1.001-SimplifiedChinese.zip)
- [Source Han Sans (思源黑体): Traditional Chinese ttf version](https://github.com/aui/free-fonts/archive/KaiGenGothic-1.001-TraditionalChinese.zip)
- [Source Han Sans (思源黑体): Chinese, Japanese, Korean ttf version](https://mega.nz/#!PZxFSYQI!ICvNugaFX_y4Mh003-S3fao1zU0uNpeSyprdmvHDnwc)
- [Open-source icon font: Font Awesome](http://fontawesome.io)

| Short name | Typeface | Font file |
| --- | --- | --- |
| song | Song (宋) | simsun.ttf |
| hei | Hei (黑) | simhei.ttf |
| kai | Kai (楷) | simkai.ttf |
| fs | FangSong (仿宋) | simfang.ttf |
| li | Li (隶) | simli.ttf |

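First, we found the following ready-made lists of commonly used Chinese characters on GitHub:
Commonly used Chinese characters (simplified and traditional) https://github.com/howiehu/commonly-used-chinese-characters
A list of 3754 Chinese simplified common characters. https://github.com/liaohuqiu/chinese-simplified-common-characters
Four-corner codes for 6355 commonly used Japanese kanji https://github.com/Mikumikunisiteageru/Kanji_4_Corner_Index
Chinese-to-pinyin code table covering about 7000 level-1 and level-2 characters, written in PHP (other languages are similar) https://github.com/lanyuechen/pinyin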

Preprocess each of these lists, gather statistics, and remove duplicates.
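The preprocessing, counting and deduplication could look roughly like the sketch below. It assumes each downloaded list has already been read into a Python string; `merge_character_lists` is a hypothetical helper written for illustration, not something the repositories above provide.

```python
from collections import Counter


def merge_character_lists(texts):
    """Merge several character lists, dropping whitespace and duplicates.

    Returns (unique_chars, counts): the unique characters in first-seen
    order, and a Counter of how often each one occurred across all inputs.
    """
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if not ch.isspace())
    # Counter keeps insertion order (Python 3.7+), i.e. first-seen order.
    return list(counts), counts


if __name__ == "__main__":
    unique, counts = merge_character_lists(["的一是了", "一是不了我"])
    print(len(unique), "unique characters")
    print("seen in more than one list:", [c for c, n in counts.items() if n > 1])
```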
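Alternatively, skip the preprocessing step and use the roughly 7000-character table directly as the text-line input, with the Song typeface (SimSun) as the font.
The -m parameter sets the number of training files to generate. Note that if your input text has 200 lines, setting -m to 200 does not guarantee that each of the 200 lines appears at least once; -m should be at least 1.5 times the number of input lines.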
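To get a feel for how large -m needs to be, the sketch below estimates line coverage under the assumption that linegen draws input lines uniformly at random with replacement. That sampling model is an assumption made here, not something confirmed from the ocropus source. Under it, m draws from n lines cover an expected fraction 1 - (1 - 1/n)^m of the lines: about 63% when m equals n and about 78% when m is 1.5 times n, so a noticeably larger m is needed if nearly every line must appear.

```python
def expected_coverage(num_lines, m):
    """Expected fraction of distinct lines seen after m uniform draws
    with replacement from num_lines input lines."""
    return 1.0 - (1.0 - 1.0 / num_lines) ** m


def draws_for_coverage(num_lines, target=0.99):
    """Smallest m (starting from num_lines) whose expected coverage
    reaches the target fraction."""
    m = num_lines
    while expected_coverage(num_lines, m) < target:
        m += 1
    return m


if __name__ == "__main__":
    n = 200
    for factor in (1.0, 1.5, 3.0, 5.0):
        m = int(n * factor)
        print(f"m = {m}: expected coverage {expected_coverage(n, m):.3f}")
    print("m for 99% expected coverage:", draws_for_coverage(n))
```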

root@4b2648975d03:/ocropy# python ocropus-linegen  -t pic/hanyu_yiji_pinyin_duizhaobiao.txt  -f pic/simsun.ttc  -o trainning_data -m 8000
root@4b2648975d03:/ocropy#  ls -l trainning_data/pic_simsun  |grep "^-"|wc -l
8894
root@4b2648975d03:/ocropy# python pic/ocropus-linegen  -t pic/1-line.txt  -f pic/simsun.ttc -m 1 -o 1-line
fonts ['pic/simsun.ttc']
pic/1-line.txt
# reading pic/1-line.txt
got 1 lines
got 1 unique lines
base 1-line
=== 1-line/pic_simsun pic/simsun.ttc
 0.50  0.50  50 顷3974

root@4b2648975d03:/ocropy# ls -al 1-line/
total 16
drwxr-xr-x  3 root root 4096 May  9 07:41 .
drwxr-xr-x 14 root root 4096 May  9 07:41 ..
drwxr-xr-x  2 root root 4096 May  9 07:41 pic_simsun
-rw-r--r--  1 root root   15 May  9 07:41 pic_simsun.info
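The `ls -l | grep "^-" | wc -l` pipeline above counts every regular file in the output directory. A Python equivalent that also groups the files by suffix can help check that images and transcripts come in pairs. This is a sketch: the file names in the comment (`.bin.png` images, `.gt.txt` transcripts) are an assumption about what linegen writes, and `count_generated_files` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def count_generated_files(directory):
    """Count the regular files in a directory, grouped by full suffix."""
    counts = Counter()
    for entry in Path(directory).iterdir():
        if entry.is_file():
            # e.g. "010001.bin.png" -> ".bin.png", "010001.gt.txt" -> ".gt.txt"
            counts["".join(entry.suffixes)] += 1
    return counts


if __name__ == "__main__":
    # Directory name taken from the session above.
    for suffix, n in sorted(count_generated_files("trainning_data/pic_simsun").items()):
        print(suffix, n)
```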

Problems encountered

  1. https://github.com/wanghaisheng/awesome-ocr/issues/37
  2. https://github.com/wanghaisheng/awesome-ocr/issues/38

References

  1. Documentation for the ocropus tool
    >P46 section 4.3 The use of artificial data is getting popular in computer vision domain for object recognition purpose. A similar path is taken in this thesis to address the issue of limited GT data. Baird [Bai92] proposed several degradation models to generate artificial data from the text (ASCII) form. There are many parameters that can be altered to make the artificially generated text-line images resemble closely to those obtained from a scanning process. Some of the significant parameters are: Blur: It is the pixel-wise spread in the output image, and is modeled as a circular Gaussian filter. Threshold: It is used to distort the image by randomly removing the text pixels. If a pixel value is greater than this threshold, then it is a background pixel. Size: It is the height and width of individual characters in the image. It is modeled by image scaling operations. Skew: It is the rotation angle of the output symbol. The resulting angle is skewed to right or left by specifying the ‘skew’ parameter.

In this thesis, a utility based on these degradation models from OCRopus [OCR15] (open-source OCR framework) is used to generate the artificial data. The aforesaid OCRopus utility requires utf-8-encoded text-lines to generate the corresponding text-line images along with the ttf-type font files. The process of line image generation is shown in Figure 4.3. The user can specify the parameter values or use the default values.

synthetic text-line image generation process
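To make the blur and threshold parameters described in the quote more concrete, here is a small pure-Python sketch. It is not the OCRopus implementation: it blurs a tiny grey image with a separable Gaussian and then binarises it with a fixed threshold, and it leaves out size, skew and the random pixel removal. All function names and default values are illustrative assumptions.

```python
import math


def gaussian_kernel(sigma):
    """1-D Gaussian kernel, truncated at 3 sigma and normalised to sum to 1."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [[sum(k * row[min(max(x + i - radius, 0), width - 1)]
                 for i, k in enumerate(kernel))
             for x in range(width)]
            for row in image]


def transpose(image):
    """Swap rows and columns of a list-of-lists image."""
    return [list(column) for column in zip(*image)]


def degrade(image, sigma=1.0, threshold=0.5):
    """Blur a grey image (0.0 = ink, 1.0 = background), then binarise it.

    A circular Gaussian is separable, so it is applied as two 1-D passes.
    Pixels above the threshold become background (1), the rest ink (0).
    """
    kernel = gaussian_kernel(sigma)
    blurred = transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))
    return [[1 if value > threshold else 0 for value in row] for row in blurred]


if __name__ == "__main__":
    # A 7x7 background with a 3x3 ink square in the middle.
    page = [[0.0 if 2 <= x <= 4 and 2 <= y <= 4 else 1.0 for x in range(7)]
            for y in range(7)]
    for row in degrade(page, sigma=1.0, threshold=0.5):
        print("".join("#" if pixel == 0 else "." for pixel in row))
```

A larger sigma spreads ink further before thresholding, and moving the threshold changes how much of the blurred ink survives.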

You can also generate training data using ocropus-linegen:

ocropus-linegen -t tests/tomsawyer.txt -f tests/DejaVuSans.ttf

This will create a directory "linegen/..." containing training data suitable for training OCRopus with synthetic data.

  2. Fonts: https://github.com/adobe-fonts
  3. Generation tool 1