Data Files - kana112233/tesseract GitHub Wiki
## Special Data Files
| Lang Code | Description | 4.0/3.0x traineddata |
|---|---|---|
| osd | Orientation and script detection | osd.traineddata |
| equ | Math / equation detection | equ.traineddata |
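Note: These two data files are compatible with older versions of Tesseract.
`osd` is compatible with version 3.01 and later, and `equ` is compatible with version 3.02 and later.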
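
The compatibility note above can be expressed as a small version check. This is a minimal sketch: the helper names `parse_version` and `is_compatible` are made up for illustration, and only the two minimum versions come from the note.

```python
from typing import Tuple

# Minimum Tesseract versions stated in the note above.
MIN_VERSION = {"osd": (3, 1), "equ": (3, 2)}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a version string such as "3.02" into a comparable tuple (3, 2)."""
    return tuple(int(part) for part in text.split("."))


def is_compatible(lang_code: str, tesseract_version: str) -> bool:
    """Return True if the special data file works with the given version."""
    return parse_version(tesseract_version) >= MIN_VERSION[lang_code]


if __name__ == "__main__":
    print(is_compatible("equ", "3.01"))
    print(is_compatible("osd", "3.01"))
```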
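## Updated Data Files for Version 4.00 (September 15, 2017)
We have three sets of .traineddata files on GitHub, in three separate repositories.
Most users will want **tessdata_fast**, which is what will be shipped as part of Linux distributions.
**tessdata_best** is for people willing to trade speed for higher accuracy.
It is also the only set of files that can be used for certain retraining scenarios for advanced users.
The third set, in **tessdata**, is the only one that supports the legacy recognizer.
The 4.00 files from November 2016 contain both legacy models and older LSTM models.
The current set of files in **tessdata** contains the legacy models and newer LSTM models (integer versions of the 4.00.00 alpha models in tessdata_best).
Note: When using the new models in the **tessdata_best** and **tessdata_fast** repositories, only the new LSTM-based OCR engine is supported.
The legacy engine is not supported with these files, so Tesseract's oem modes '0' and '2' will not work with them.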
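
The relationship between the three repositories and the `--oem` modes can be summarized in a lookup table. The sketch below is illustrative only: the function name `oem_supported` is hypothetical, and the mode meanings (0 = legacy only, 1 = LSTM only, 2 = legacy plus LSTM, 3 = default based on what is available) are those of the Tesseract 4 command line.

```python
# Engines contained in the files of each repository, per the text above.
REPO_ENGINES = {
    "tessdata": {"legacy", "lstm"},
    "tessdata_fast": {"lstm"},
    "tessdata_best": {"lstm"},
}

# Engines each --oem mode needs. Mode 3 (default) uses whatever is
# available, so it has no fixed requirement.
OEM_REQUIRES = {
    0: {"legacy"},
    1: {"lstm"},
    2: {"legacy", "lstm"},
    3: set(),
}


def oem_supported(repo: str, oem: int) -> bool:
    """Return True if files from `repo` can be used with the given --oem mode."""
    return OEM_REQUIRES[oem] <= REPO_ENGINES[repo]


if __name__ == "__main__":
    for repo in REPO_ENGINES:
        modes = [oem for oem in OEM_REQUIRES if oem_supported(repo, oem)]
        print(repo, modes)
```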
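## Data Files for Version 4.00 (November 29, 2016)
This set of trained data files supports the legacy recognizer with `--oem 0` and the LSTM models with `--oem 1`.
Note: The kur data file was not updated from 3.04.
For Fraktur, see the Fraktur data files section, or use the newer data files from the tessdata_fast or tessdata_best repositories.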
| Lang Code | Language | 4.0 traineddata |
|---|---|---|
| afr | Afrikaans | afr.traineddata |
| amh | Amharic | amh.traineddata |
| ara | Arabic | ara.traineddata |
| asm | Assamese | asm.traineddata |
| aze | Azerbaijani | aze.traineddata |
| aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata |
| bel | Belarusian | bel.traineddata |
| ben | Bengali | ben.traineddata |
| bod | Tibetan | bod.traineddata |
| bos | Bosnian | bos.traineddata |
| bul | Bulgarian | bul.traineddata |
| cat | Catalan; Valencian | cat.traineddata |
| ceb | Cebuano | ceb.traineddata |
| ces | Czech | ces.traineddata |
| chi_sim | Chinese - Simplified | chi_sim.traineddata |
| chi_tra | Chinese - Traditional | chi_tra.traineddata |
| chr | Cherokee | chr.traineddata |
| cym | Welsh | cym.traineddata |
| dan | Danish | dan.traineddata |
| deu | German | deu.traineddata |
| dzo | Dzongkha | dzo.traineddata |
| ell | Greek, Modern (1453-) | ell.traineddata |
| eng | English | eng.traineddata |
| enm | English, Middle (1100-1500) | enm.traineddata |
| epo | Esperanto | epo.traineddata |
| est | Estonian | est.traineddata |
| eus | Basque | eus.traineddata |
| fas | Persian | fas.traineddata |
| fin | Finnish | fin.traineddata |
| fra | French | fra.traineddata |
| frk | Frankish | frk.traineddata |
| frm | French, Middle (ca. 1400-1600) | frm.traineddata |
| gle | Irish | gle.traineddata |
| glg | Galician | glg.traineddata |
| grc | Greek, Ancient (-1453) | grc.traineddata |
| guj | Gujarati | guj.traineddata |
| hat | Haitian; Haitian Creole | hat.traineddata |
| heb | Hebrew | heb.traineddata |
| hin | Hindi | hin.traineddata |
| hrv | Croatian | hrv.traineddata |
| hun | Hungarian | hun.traineddata |
| iku | Inuktitut | iku.traineddata |
| ind | Indonesian | ind.traineddata |
| isl | Icelandic | isl.traineddata |
| ita | Italian | ita.traineddata |
| ita_old | Italian - Old | ita_old.traineddata |
| jav | Javanese | jav.traineddata |
| jpn | Japanese | jpn.traineddata |
| kan | Kannada | kan.traineddata |
| kat | Georgian | kat.traineddata |
| kat_old | Georgian - Old | kat_old.traineddata |
| kaz | Kazakh | kaz.traineddata |
| khm | Central Khmer | khm.traineddata |
| kir | Kirghiz; Kyrgyz | kir.traineddata |
| kor | Korean | kor.traineddata |
| kur | Kurdish | kur.traineddata |
| lao | Lao | lao.traineddata |
| lat | Latin | lat.traineddata |
| lav | Latvian | lav.traineddata |
| lit | Lithuanian | lit.traineddata |
| mal | Malayalam | mal.traineddata |
| mar | Marathi | mar.traineddata |
| mkd | Macedonian | mkd.traineddata |
| mlt | Maltese | mlt.traineddata |
| msa | Malay | msa.traineddata |
| mya | Burmese | mya.traineddata |
| nep | Nepali | nep.traineddata |
| nld | Dutch; Flemish | nld.traineddata |
| nor | Norwegian | nor.traineddata |
| ori | Oriya | ori.traineddata |
| pan | Panjabi; Punjabi | pan.traineddata |
| pol | Polish | pol.traineddata |
| por | Portuguese | por.traineddata |
| pus | Pushto; Pashto | pus.traineddata |
| ron | Romanian; Moldavian; Moldovan | ron.traineddata |
| rus | Russian | rus.traineddata |
| san | Sanskrit | san.traineddata |
| sin | Sinhala; Sinhalese | sin.traineddata |
| slk | Slovak | slk.traineddata |
| slv | Slovenian | slv.traineddata |
| spa | Spanish; Castilian | spa.traineddata |
| spa_old | Spanish; Castilian - Old | spa_old.traineddata |
| sqi | Albanian | sqi.traineddata |
| srp | Serbian | srp.traineddata |
| srp_latn | Serbian - Latin | srp_latn.traineddata |
| swa | Swahili | swa.traineddata |
| swe | Swedish | swe.traineddata |
| syr | Syriac | syr.traineddata |
| tam | Tamil | tam.traineddata |
| tel | Telugu | tel.traineddata |
| tgk | Tajik | tgk.traineddata |
| tgl | Tagalog | tgl.traineddata |
| tha | Thai | tha.traineddata |
| tir | Tigrinya | tir.traineddata |
| tur | Turkish | tur.traineddata |
| uig | Uighur; Uyghur | uig.traineddata |
| ukr | Ukrainian | ukr.traineddata |
| urd | Urdu | urd.traineddata |
| uzb | Uzbek | uzb.traineddata |
| uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata |
| vie | Vietnamese | vie.traineddata |
| yid | Yiddish | yid.traineddata |
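
For the files in this table, the engine is selected on the command line with `--oem`. The following sketch only assembles the argument list for a `tesseract` call. The helper name `tesseract_command` and the file names `page.png` and `out` are placeholders, not part of the original page. The `-l`, `--oem`, and `--tessdata-dir` options are standard Tesseract 4 command-line options.

```python
from typing import List, Optional


def tesseract_command(
    image: str,
    output_base: str,
    lang: str = "eng",
    oem: int = 1,
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a tesseract run with an explicit engine mode."""
    cmd = ["tesseract", image, output_base, "-l", lang, "--oem", str(oem)]
    if tessdata_dir is not None:
        # Point Tesseract at the folder that holds the .traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


if __name__ == "__main__":
    # Legacy recognizer (--oem 0) versus LSTM models (--oem 1).
    print(" ".join(tesseract_command("page.png", "out", lang="deu", oem=0)))
    print(" ".join(tesseract_command("page.png", "out", lang="deu", oem=1)))
```

The list can be passed to `subprocess.run` on a machine where Tesseract is installed.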
## Data Files for Version 3.04/3.05
Note: For Arabic and Hindi, you need both the traineddata file and the Cube data files.
| Lang Code | Language | 3.04 traineddata |
|---|---|---|
| afr | Afrikaans | afr.traineddata |
| amh | Amharic | amh.traineddata |
| ara | Arabic | ara.traineddata |
| asm | Assamese | asm.traineddata |
| aze | Azerbaijani | aze.traineddata |
| aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata |
| bel | Belarusian | bel.traineddata |
| ben | Bengali | ben.traineddata |
| bod | Tibetan | bod.traineddata |
| bos | Bosnian | bos.traineddata |
| bul | Bulgarian | bul.traineddata |
| cat | Catalan; Valencian | cat.traineddata |
| ceb | Cebuano | ceb.traineddata |
| ces | Czech | ces.traineddata |
| chi_sim | Chinese - Simplified | chi_sim.traineddata |
| chi_tra | Chinese - Traditional | chi_tra.traineddata |
| chr | Cherokee | chr.traineddata |
| cym | Welsh | cym.traineddata |
| dan | Danish | dan.traineddata |
| deu | German | deu.traineddata |
| dzo | Dzongkha | dzo.traineddata |
| ell | Greek, Modern (1453-) | ell.traineddata |
| eng | English | eng.traineddata |
| enm | English, Middle (1100-1500) | enm.traineddata |
| epo | Esperanto | epo.traineddata |
| est | Estonian | est.traineddata |
| eus | Basque | eus.traineddata |
| fas | Persian | fas.traineddata |
| fin | Finnish | fin.traineddata |
| fra | French | fra.traineddata |
| frk | Frankish | frk.traineddata |
| frm | French, Middle (ca. 1400-1600) | frm.traineddata |
| gle | Irish | gle.traineddata |
| glg | Galician | glg.traineddata |
| grc | Greek, Ancient (-1453) | grc.traineddata |
| guj | Gujarati | guj.traineddata |
| hat | Haitian; Haitian Creole | hat.traineddata |
| heb | Hebrew | heb.traineddata |
| hin | Hindi | hin.traineddata |
| hrv | Croatian | hrv.traineddata |
| hun | Hungarian | hun.traineddata |
| iku | Inuktitut | iku.traineddata |
| ind | Indonesian | ind.traineddata |
| isl | Icelandic | isl.traineddata |
| ita | Italian | ita.traineddata |
| ita_old | Italian - Old | ita_old.traineddata |
| jav | Javanese | jav.traineddata |
| jpn | Japanese | jpn.traineddata |
| kan | Kannada | kan.traineddata |
| kat | Georgian | kat.traineddata |
| kat_old | Georgian - Old | kat_old.traineddata |
| kaz | Kazakh | kaz.traineddata |
| khm | Central Khmer | khm.traineddata |
| kir | Kirghiz; Kyrgyz | kir.traineddata |
| kor | Korean | kor.traineddata |
| kur | Kurdish | kur.traineddata |
| lao | Lao | lao.traineddata |
| lat | Latin | lat.traineddata |
| lav | Latvian | lav.traineddata |
| lit | Lithuanian | lit.traineddata |
| mal | Malayalam | mal.traineddata |
| mar | Marathi | mar.traineddata |
| mkd | Macedonian | mkd.traineddata |
| mlt | Maltese | mlt.traineddata |
| msa | Malay | msa.traineddata |
| mya | Burmese | mya.traineddata |
| nep | Nepali | nep.traineddata |
| nld | Dutch; Flemish | nld.traineddata |
| nor | Norwegian | nor.traineddata |
| ori | Oriya | ori.traineddata |
| pan | Panjabi; Punjabi | pan.traineddata |
| pol | Polish | pol.traineddata |
| por | Portuguese | por.traineddata |
| pus | Pushto; Pashto | pus.traineddata |
| ron | Romanian; Moldavian; Moldovan | ron.traineddata |
| rus | Russian | rus.traineddata |
| san | Sanskrit | san.traineddata |
| sin | Sinhala; Sinhalese | sin.traineddata |
| slk | Slovak | slk.traineddata |
| slv | Slovenian | slv.traineddata |
| spa | Spanish; Castilian | spa.traineddata |
| spa_old | Spanish; Castilian - Old | spa_old.traineddata |
| sqi | Albanian | sqi.traineddata |
| srp | Serbian | srp.traineddata |
| srp_latn | Serbian - Latin | srp_latn.traineddata |
| swa | Swahili | swa.traineddata |
| swe | Swedish | swe.traineddata |
| syr | Syriac | syr.traineddata |
| tam | Tamil | tam.traineddata |
| tel | Telugu | tel.traineddata |
| tgk | Tajik | tgk.traineddata |
| tgl | Tagalog | tgl.traineddata |
| tha | Thai | tha.traineddata |
| tir | Tigrinya | tir.traineddata |
| tur | Turkish | tur.traineddata |
| uig | Uighur; Uyghur | uig.traineddata |
| ukr | Ukrainian | ukr.traineddata |
| urd | Urdu | urd.traineddata |
| uzb | Uzbek | uzb.traineddata |
| uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata |
| vie | Vietnamese | vie.traineddata |
| yid | Yiddish | yid.traineddata |
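## Cube Data Files for Version 3.04/3.05
In Tesseract 3.0x, Arabic and Hindi use the Cube OCR engine. You need to download the Cube files and move them into the same folder as the `<ara/hin>.traineddata` file.
In Tesseract 4.0, the Cube OCR engine was removed from the codebase, so you do not need these files if you are using version 4.0 or newer.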
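
Moving the downloaded Cube files next to the traineddata file can be scripted. This is a minimal sketch: the function name `move_next_to_traineddata` is hypothetical, and the caller supplies the list of downloaded Cube files, since the exact file names are not given here.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def move_next_to_traineddata(cube_files: Iterable[str], traineddata: str) -> List[Path]:
    """Move each downloaded Cube file into the folder that holds `traineddata`."""
    target_dir = Path(traineddata).resolve().parent
    moved = []
    for name in cube_files:
        src = Path(name)
        dest = target_dir / src.name
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```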
Hindi:
Arabic:
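## Fraktur Data Files
These data files were prepared by @paalberti for some older versions of Tesseract.
dan_frak, deu_frak, and swe_frak were prepared for version 3.00, and slk_frak for version 3.01.
For updates to these files, see paalberti/tesseract-dan-fraktur.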
| Lang Code | Language | 3.0x traineddata |
|---|---|---|
| dan_frak | Danish - Fraktur | dan_frak.traineddata |
| deu_frak | German - Fraktur | deu_frak.traineddata |
| slk_frak | Slovak - Fraktur | slk_frak.traineddata |
| swe_frak | Swedish - Fraktur | swe-frak.traineddata |
## Data Files for Version 3.02
## Data Files for Version 2.0x
| Lang Code | Language | 2.0x traineddata |
|---|---|---|
| deu | German | tesseract-2.00.deu.tar.gz |
| deu-f | German - Fraktur | tesseract-2.01.deu-f.tar.gz |
| eng | English | tesseract-2.00.eng.tar.gz |
| eus | Basque | tesseract-2.04-eus.tar.gz |
| fra | French | tesseract-2.00.fra.tar.gz |
| ita | Italian | tesseract-2.00.ita.tar.gz |
| nld | Dutch; Flemish | tesseract-2.00.nld.tar.gz |
| por | Portuguese | tesseract-2.01.por.tar.gz |
| spa | Spanish; Castilian | tesseract-2.00.spa.tar.gz |
| vie | Vietnamese | tesseract-2.01.vie.tar.gz |
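## Format of traineddata Files
The traineddata file for each language is an archive file in a Tesseract-specific format.
It contains several uncompressed component files that the Tesseract OCR process needs.
The program `combine_tessdata` is used to create a traineddata file from the component files, and it can also extract them again, as in the following examples:
### 4.0.0 format from November 2016 (with both LSTM and legacy models)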
```
combine_tessdata -u eng.traineddata eng.
Extracting tessdata components from eng.traineddata
Wrote eng.unicharset
Wrote eng.unicharambigs
Wrote eng.inttemp
Wrote eng.pffmtable
Wrote eng.normproto
Wrote eng.punc-dawg
Wrote eng.word-dawg
Wrote eng.number-dawg
Wrote eng.freq-dawg
Wrote eng.cube-unicharset
Wrote eng.cube-word-dawg
Wrote eng.shapetable
Wrote eng.bigram-dawg
Wrote eng.lstm
Wrote eng.lstm-punc-dawg
Wrote eng.lstm-word-dawg
Wrote eng.lstm-number-dawg
Wrote eng.version
Version string:Pre-4.0.0
1:unicharset:size=7477, offset=192
2:unicharambigs:size=1047, offset=7669
3:inttemp:size=976552, offset=8716
4:pffmtable:size=844, offset=985268
5:normproto:size=13408, offset=986112
6:punc-dawg:size=4322, offset=999520
7:word-dawg:size=1082890, offset=1003842
8:number-dawg:size=6426, offset=2086732
9:freq-dawg:size=1410, offset=2093158
11:cube-unicharset:size=1511, offset=2094568
12:cube-word-dawg:size=1062106, offset=2096079
13:shapetable:size=63346, offset=3158185
14:bigram-dawg:size=16109842, offset=3221531
17:lstm:size=5390718, offset=19331373
18:lstm-punc-dawg:size=4322, offset=24722091
19:lstm-word-dawg:size=7143578, offset=24726413
20:lstm-number-dawg:size=3530, offset=31869991
23:version:size=9, offset=31873521
```
### 4.00.00alpha LSTM-only format
```
combine_tessdata -u eng.traineddata eng.
Extracting tessdata components from eng.traineddata
Wrote eng.lstm
Wrote eng.lstm-punc-dawg
Wrote eng.lstm-word-dawg
Wrote eng.lstm-number-dawg
Wrote eng.lstm-unicharset
Wrote eng.lstm-recoder
Wrote eng.version
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
```
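
The `index:name:size=..., offset=...` lines in the listings above show that the components are stored back to back: each offset equals the previous offset plus the previous size. The sketch below parses such lines and checks that property. The names `Entry`, `parse_directory`, and `is_contiguous` are illustrative, and the sample text is copied from the LSTM-only listing.

```python
import re
from typing import List, NamedTuple


class Entry(NamedTuple):
    index: int
    name: str
    size: int
    offset: int


# Matches lines such as "17:lstm:size=11689099, offset=192".
_LINE = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_directory(output: str) -> List[Entry]:
    """Collect the component entries from combine_tessdata -u output."""
    entries = []
    for line in output.splitlines():
        match = _LINE.match(line.strip())
        if match:
            entries.append(
                Entry(int(match[1]), match[2], int(match[3]), int(match[4]))
            )
    return entries


def is_contiguous(entries: List[Entry]) -> bool:
    """True if every component starts where the previous one ends."""
    return all(a.offset + a.size == b.offset for a, b in zip(entries, entries[1:]))


listing = """\
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
"""

entries = parse_directory(listing)

if __name__ == "__main__":
    for entry in entries:
        print(entry)
    print("contiguous:", is_contiguous(entries))
```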
### Proposal for compressed traineddata files
There are some proposals to replace the Tesseract archive format with a standard archive format that could also support compression. A discussion on the tesseract-dev forum proposed the ZIP format as early as 2014. In 2017, an experimental implementation was provided as a pull request.
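
To illustrate the general idea only, the sketch below packs component files into a compressed ZIP archive in memory and reads them back, using Python's standard `zipfile` module. It is not the experimental implementation from the pull request. The function names and the placeholder component contents are assumptions made for this example.

```python
import io
import zipfile
from typing import Dict


def pack_components(components: Dict[str, bytes]) -> bytes:
    """Store named component files in a deflate-compressed ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in components.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def unpack_components(archive_bytes: bytes) -> Dict[str, bytes]:
    """Read every member of the ZIP archive back into memory."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


if __name__ == "__main__":
    # Placeholder contents; real components would be read from disk.
    sample = {"eng.lstm": b"\x00" * 1000, "eng.version": b"placeholder"}
    packed = pack_components(sample)
    print(len(packed), sorted(unpack_components(packed)))
```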