Building a Chinese Training Corpus - wanghaisheng/awesome-ocr GitHub Wiki
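Training data for printed text
Books and documents
Restaurant menus
Restaurant receipts
Medical bills, reports, lab test sheets, prescriptions
Invoices
Web page interfaces
Training data for natural scenes
ID cards, bank cards, passports, medical insurance cards
CAPTCHAs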
http://caca.zoy.org/wiki/PWNtcha
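License plates
House numbers and shop signs
1. Books and documents
Steps for generating text-line data
Preparation:
- 1. Tools: here we use the linegen tool that ships with ocropus. Starting a Docker image is all that is needed.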
docker run -it --name generate-trainning-data -v /home/wanghs/ocr/ocr-training-material/trainning_data/:/work edwin/ocropus /bin/bash
python pic/ocropus-linegen -t ../txt-fonts/eachline_1_ocr_input_with_special_char_total_8099.txt -f ../txt-fonts/simsun.ttc -o ../pic_simsun_eachline_1_ocr_input_with_special_char_total_8099 -m 15000
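- 2. Materials
- UTF-8 encoded text lines (the Chinese and English characters the model should recognize, or characters of other languages such as Sanskrit)
- Font files in ttf format (the fonts the trained model is expected to recognize)
Text lines can be drawn from the GB2312 character set, character and word lists on GitHub, and common medical dictionaries. Chinese is used as the example here.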
[Commonly used Chinese characters (simplified and traditional)](https://github.com/howiehu/commonly-used-chinese-characters)
[A list of 3754 Chinese simplified common characters.](https://github.com/liaohuqiu/chinese-simplified-common-characters)
[Four-corner codes for 6,355 commonly used Japanese kanji](https://github.com/Mikumikunisiteageru/Kanji_4_Corner_Index)
[Chinese-to-pinyin location code (区位码) table covering about 7,000 level-1 and level-2 characters](https://github.com/lanyuechen/pinyin)
[Complete list of HTML special characters #138](https://github.com/9958/rinblog/issues/138)
[Complete list of special characters](http://xh.5156edu.com/page/18466.html)
[Special characters](http://baike.baidu.com/link?url=8YYTP7eMr_EV2ADb-0ZLXw4Gt71Ea-u7rbVKNZd-RioODH_gOmdkQwnk7QBTDIhBxj2trf7Ut_zXEptCdgRuh_)
Chinese fonts
Commonly used font files can be copied directly from a Windows system.
Free Chinese fonts
https://github.com/zenozeng/Free-Chinese-Fonts
http://pan.baidu.com/s/1eQ2KnsA 经典简宋
http://pan.baidu.com/s/1hqTQP4g 华康翩跹.ttf
https://github.com/aui/font-spider
Automated compression tool for Chinese WebFonts: http://font-spider.org
- [Source Han Sans (思源黑体): Simplified Chinese ttf version](https://github.com/aui/free-fonts/archive/KaiGenGothic-1.001-SimplifiedChinese.zip)
- [Source Han Sans (思源黑体): Traditional Chinese ttf version](https://github.com/aui/free-fonts/archive/KaiGenGothic-1.001-TraditionalChinese.zip)
- [Source Han Sans (思源黑体): Chinese, Japanese, and Korean ttf version](https://mega.nz/#!PZxFSYQI!ICvNugaFX_y4Mh003-S3fao1zU0uNpeSyprdmvHDnwc)
- [Open-source icon font: Font Awesome](http://fontawesome.io)
| Name | Typeface | Font file |
| --- | --- | --- |
| song | 宋 (Song) | simsun.ttf |
| hei | 黑 (Hei) | simhei.ttf |
| kai | 楷 (Kai) | simkai.ttf |
| fs | 仿宋 (FangSong) | simfang.ttf |
| li | 隶 (Li) | simli.ttf |
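- Actual commands
First, we find the following ready-made lists of commonly used characters on GitHub:
Commonly used Chinese characters (simplified and traditional) https://github.com/howiehu/commonly-used-chinese-characters
A list of 3754 Chinese simplified common characters. https://github.com/liaohuqiu/chinese-simplified-common-characters
Four-corner codes for 6,355 commonly used Japanese kanji https://github.com/Mikumikunisiteageru/Kanji_4_Corner_Index
Chinese-to-pinyin location code (区位码) table covering about 7,000 level-1 and level-2 characters, written in PHP (other languages would be similar)
https://github.com/lanyuechen/pinyin
Preprocess, count, and deduplicate each of these lists.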
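
The preprocess, count, and deduplicate step is not spelled out in these notes, so the following is only a minimal sketch of one way it could be done. The function names (`merge_char_lists`, `to_lines`), the sample strings, and the choice to keep only characters in the CJK Unified Ideographs block are assumptions made for illustration. If special characters should be kept, as in the `with_special_char` file used earlier, a different filter would be passed in.

```python
from collections import Counter

def is_cjk(ch):
    """True for characters in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"

def merge_char_lists(texts, keep=is_cjk):
    """Merge several character lists.

    Returns the unique characters in first-seen order, plus a Counter
    recording how often each kept character occurred across all lists.
    """
    counts = Counter()
    ordered = []
    for text in texts:
        for ch in text:
            if not keep(ch):
                continue  # preprocessing: drop whitespace, punctuation, Latin text
            if counts[ch] == 0:
                ordered.append(ch)  # first occurrence: this is the deduplication
            counts[ch] += 1
    return ordered, counts

def to_lines(chars, per_line=1):
    """Group characters into text lines, one string per output line."""
    return ["".join(chars[i:i + per_line]) for i in range(0, len(chars), per_line)]

# Hypothetical stand-ins for the contents of two downloaded character lists.
list_a = "的一是了\n我不人在"
list_b = "在他有这 的一"

chars, counts = merge_char_lists([list_a, list_b])
lines = to_lines(chars, per_line=4)
print(len(chars), "unique characters")
print("\n".join(lines))
```

Writing `"\n".join(lines)` to a file with `encoding="utf-8"` would give a text-line file in the form that the `-t` option expects.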
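Skipping the step above, we use the 7,000-character list directly as the text-line input and choose the Song typeface (SimSun).
The -m option sets the number of training files to generate. Note that if the input has 200 text lines and -m is set to 200, there is no guarantee that each of the 200 lines appears at least once. -m should be at least 1.5 times the number of input lines.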
root@4b2648975d03:/ocropy# python ocropus-linegen -t pic/hanyu_yiji_pinyin_duizhaobiao.txt -f pic/simsun.ttc -o trainning_data -m 8000
root@4b2648975d03:/ocropy# ls -l trainning_data/pic_simsun |grep "^-"|wc -l
8894
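
The note on -m can be made concrete with a small calculation. The sketch below assumes that linegen picks each output line uniformly at random, with replacement, from the input lines. That sampling model is an assumption here, not something these notes confirm. Under it, the expected fraction of distinct input lines seen after m draws from n lines is 1 - (1 - 1/n)^m, and the expected number of draws needed to see every line at least once is n times the n-th harmonic number (the coupon collector result).

```python
def expected_coverage(n_lines, m_draws):
    """Expected fraction of distinct lines seen after m uniform draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n_lines) ** m_draws

def expected_draws_for_full_coverage(n_lines):
    """Coupon collector expectation: n * H_n draws to see every line at least once."""
    return n_lines * sum(1.0 / k for k in range(1, n_lines + 1))

n = 200  # number of input text lines, as in the example in the note
for factor in (1.0, 1.5, 3.0, 6.0):
    m = int(n * factor)
    print(f"-m {m:5d}: expected coverage {expected_coverage(n, m):.1%}")

print("expected draws for full coverage:", round(expected_draws_for_full_coverage(n)))
```

If this sampling model holds, even 1.5 times the line count still leaves a noticeable share of lines unseen, so checking the generated ground-truth files for missing characters is a safer guide than any fixed multiplier.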
root@4b2648975d03:/ocropy# python pic/ocropus-linegen -t pic/1-line.txt -f pic/simsun.ttc -m 1 -o 1-line
fonts ['pic/simsun.ttc']
pic/1-line.txt
# reading pic/1-line.txt
got 1 lines
got 1 unique lines
base 1-line
=== 1-line/pic_simsun pic/simsun.ttc
0.50 0.50 50 顷3974
root@4b2648975d03:/ocropy# ls -al 1-line/
total 16
drwxr-xr-x 3 root root 4096 May 9 07:41 .
drwxr-xr-x 14 root root 4096 May 9 07:41 ..
drwxr-xr-x 2 root root 4096 May 9 07:41 pic_simsun
-rw-r--r-- 1 root root 15 May 9 07:41 pic_simsun.info
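
The `ls | grep | wc` count above can also be done in Python, together with a check that every line image has a matching ground-truth file. This is a sketch under an assumption: that each generated sample is stored as a `<name>.bin.png` image plus a `<name>.gt.txt` transcription. Those suffixes and the sample names `000000` and `000001` are not shown in these notes and should be verified against an actual output directory. The function name `check_linegen_dir` is hypothetical.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIX = ".bin.png"  # assumed suffix of generated line images
TEXT_SUFFIX = ".gt.txt"    # assumed suffix of ground-truth transcriptions

def check_linegen_dir(font_dir):
    """Return (number of regular files, number of complete pairs, unpaired names)."""
    files = [p for p in Path(font_dir).iterdir() if p.is_file()]
    images = {p.name[:-len(IMAGE_SUFFIX)] for p in files if p.name.endswith(IMAGE_SUFFIX)}
    texts = {p.name[:-len(TEXT_SUFFIX)] for p in files if p.name.endswith(TEXT_SUFFIX)}
    unpaired = sorted(images ^ texts)  # present as image or text, but not both
    return len(files), len(images & texts), unpaired

# Demonstration on a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    font_dir = Path(tmp) / "pic_simsun"
    font_dir.mkdir()
    (font_dir / "000000.bin.png").write_bytes(b"")
    (font_dir / "000000.gt.txt").write_text("顷3974", encoding="utf-8")
    (font_dir / "000001.bin.png").write_bytes(b"")  # deliberately missing its text file
    result = check_linegen_dir(font_dir)

print(result)
```

On a real run, `check_linegen_dir("1-line/pic_simsun")` would be the call corresponding to the listing above.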
Problems encountered
- https://github.com/wanghaisheng/awesome-ocr/issues/37
- https://github.com/wanghaisheng/awesome-ocr/issues/38
References
- Documentation for the ocropus tools
>P46 section 4.3 The use of artificial data is getting popular in computer vision domain for object recognition purpose. A similar path is taken in this thesis to address the issue of limited GT data. Baird [Bai92] proposed several degradation models to generate artificial data from the text (ASCII) form. There are many parameters that can be altered to make the artificially generated text-line images resemble closely to those obtained from a scanning process. Some of the significant parameters are: Blur: It is the pixel-wise spread in the output image, and is modeled as circular Gaussian filter. Threshold: It is used to distort the image by randomly removing the text pixels. If a pixel value is greater than this threshold, then it is a background pixel. Size: It is the height and width of individual characters in the image. It is modeled by image scaling operations. Skew: It is the rotation angle of the output symbol. The resulting angle is skewed to right or left by specifying the ‘skew’ parameter.
In this thesis, a utility based on these degradation models from OCRopus [OCR15] (open-source OCR framework) is used to generate the artificial data. The aforesaid OCRopus utility requires utf-8-encoded text-lines to generate the corresponding text-line images along with the ttf-type font files. The process of line image generation is shown in Figure 4.3. The user can specify the parameter values or use the default values.
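
To make the blur and threshold parameters from the excerpt concrete, here is a small pure-Python sketch. It is not the OCRopus implementation. The function names, the grayscale convention (0.0 is ink, 1.0 is background), and the border handling are all assumptions chosen for illustration. Size and skew are left out.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian weights covering about three standard deviations."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    blurred = []
    for row in image:
        last = len(row) - 1
        blurred.append([
            sum(w * row[min(max(x + k - radius, 0), last)] for k, w in enumerate(kernel))
            for x in range(len(row))
        ])
    return blurred

def transpose(image):
    return [list(column) for column in zip(*image)]

def degrade(image, sigma=1.0, threshold=0.5):
    """Blur with a circular Gaussian (two separable passes), then binarize.

    A pixel whose blurred value is greater than the threshold becomes
    background (1); every other pixel becomes ink (0).
    """
    kernel = gaussian_kernel(sigma)
    blurred = transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))
    return [[1 if value > threshold else 0 for value in row] for row in blurred]

# A 5x7 test image: white background with a one-pixel-wide vertical stroke.
image = [[1.0] * 7 for _ in range(5)]
for row in image:
    row[3] = 0.0

for threshold in (0.5, 0.7, 0.8):
    result = degrade(image, sigma=1.0, threshold=threshold)
    print(threshold, "".join("#" if pixel == 0 else "." for pixel in result[0]))
```

With this convention, a low threshold erases the thin stroke after blurring, while a higher one keeps it or thickens it. That is the kind of variation such degradation models introduce to imitate scanned text.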
You can also generate training data using ocropus-linegen:
ocropus-linegen -t tests/tomsawyer.txt -f tests/DejaVuSans.ttf
This will create a directory "linegen/..." containing training data suitable for training OCRopus with synthetic data.