Data Files - kana112233/tesseract GitHub Wiki
##特殊数据文件
Lang Code | Description | 4.0/3.0x traineddata |
---|---|---|
osd | Orientation and script detection | osd.traineddata |
equ | Math / equation detection | equ.traineddata |
注意:这两个数据文件与旧版本的Tesseract兼容.
osd
与3.01及更高版本兼容,equ
与3.02及更高版本兼容.
##版本4.00的更新数据文件(2017年9月15日)
我们在三个独立的存储库中的GitHub上有三组.traineddata文件.
大多数用户都希望**tessdata_fast
**,这将是Linux发行版的一部分.
**tessdata_best
**适用于愿意以更高的准确度进行交易的人们.
也是
唯一可用于高级用户的某些再培训方案的文件集.
**tessdata
**中的第三个集是唯一支持遗留识别器的集合.
2016年11月的4.00文件包含旧版LSTM和旧版LSTM.
**tessdata
**中的当前文件集具有传统模型和较新的LSTM模型(tessdata_best中的4.00.00 alpha模型的整数版本).
注意:在** tessdata_best **和**
tessdata_fast` **存储库中使用新模型时,仅支持新的基于LSTM的OCR引擎.
这些文件不支持旧版引擎,因此Tesseract的oem模式“0”和“2”将无法使用它们.
##版本4.00的数据文件(2016年11月29日)
这组训练的数据文件支持带有--oem 0的传统识别器和带有--oem 1的LSTM模型.
注意:kur
数据文件未从3.04更新.
对于Fraktur,请参阅Fraktur数据文件部分,或使用tessdata_fast或tessdata_best存储库中的较新数据文件.
Lang Code | Language | 4.0 traineddata |
---|---|---|
afr | Afrikaans | afr.traineddata |
amh | Amharic | amh.traineddata |
ara | Arabic | ara.traineddata |
asm | Assamese | asm.traineddata |
aze | Azerbaijani | aze.traineddata |
aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata |
bel | Belarusian | bel.traineddata |
ben | Bengali | ben.traineddata |
bod | Tibetan | bod.traineddata |
bos | Bosnian | bos.traineddata |
bul | Bulgarian | bul.traineddata |
cat | Catalan; Valencian | cat.traineddata |
ceb | Cebuano | ceb.traineddata |
ces | Czech | ces.traineddata |
chi_sim | Chinese - Simplified | chi_sim.traineddata |
chi_tra | Chinese - Traditional | chi_tra.traineddata |
chr | Cherokee | chr.traineddata |
cym | Welsh | cym.traineddata |
dan | Danish | dan.traineddata |
deu | German | deu.traineddata |
dzo | Dzongkha | dzo.traineddata |
ell | Greek, Modern (1453-) | ell.traineddata |
eng | English | eng.traineddata |
enm | English, Middle (1100-1500) | enm.traineddata |
epo | Esperanto | epo.traineddata |
est | Estonian | est.traineddata |
eus | Basque | eus.traineddata |
fas | Persian | fas.traineddata |
fin | Finnish | fin.traineddata |
fra | French | fra.traineddata |
frk | Frankish | frk.traineddata |
frm | French, Middle (ca. 1400-1600) | frm.traineddata |
gle | Irish | gle.traineddata |
glg | Galician | glg.traineddata |
grc | Greek, Ancient (-1453) | grc.traineddata |
guj | Gujarati | guj.traineddata |
hat | Haitian; Haitian Creole | hat.traineddata |
heb | Hebrew | heb.traineddata |
hin | Hindi | hin.traineddata |
hrv | Croatian | hrv.traineddata |
hun | Hungarian | hun.traineddata |
iku | Inuktitut | iku.traineddata |
ind | Indonesian | ind.traineddata |
isl | Icelandic | isl.traineddata |
ita | Italian | ita.traineddata |
ita_old | Italian - Old | ita_old.traineddata |
jav | Javanese | jav.traineddata |
jpn | Japanese | jpn.traineddata |
kan | Kannada | kan.traineddata |
kat | Georgian | kat.traineddata |
kat_old | Georgian - Old | kat_old.traineddata |
kaz | Kazakh | kaz.traineddata |
khm | Central Khmer | khm.traineddata |
kir | Kirghiz; Kyrgyz | kir.traineddata |
kor | Korean | kor.traineddata |
kur | Kurdish | kur.traineddata |
lao | Lao | lao.traineddata |
lat | Latin | lat.traineddata |
lav | Latvian | lav.traineddata |
lit | Lithuanian | lit.traineddata |
mal | Malayalam | mal.traineddata |
mar | Marathi | mar.traineddata |
mkd | Macedonian | mkd.traineddata |
mlt | Maltese | mlt.traineddata |
msa | Malay | msa.traineddata |
mya | Burmese | mya.traineddata |
nep | Nepali | nep.traineddata |
nld | Dutch; Flemish | nld.traineddata |
nor | Norwegian | nor.traineddata |
ori | Oriya | ori.traineddata |
pan | Panjabi; Punjabi | pan.traineddata |
pol | Polish | pol.traineddata |
por | Portuguese | por.traineddata |
pus | Pushto; Pashto | pus.traineddata |
ron | Romanian; Moldavian; Moldovan | ron.traineddata |
rus | Russian | rus.traineddata |
san | Sanskrit | san.traineddata |
sin | Sinhala; Sinhalese | sin.traineddata |
slk | Slovak | slk.traineddata |
slv | Slovenian | slv.traineddata |
spa | Spanish; Castilian | spa.traineddata |
spa_old | Spanish; Castilian - Old | spa_old.traineddata |
sqi | Albanian | sqi.traineddata |
srp | Serbian | srp.traineddata |
srp_latn | Serbian - Latin | srp_latn.traineddata |
swa | Swahili | swa.traineddata |
swe | Swedish | swe.traineddata |
syr | Syriac | syr.traineddata |
tam | Tamil | tam.traineddata |
tel | Telugu | tel.traineddata |
tgk | Tajik | tgk.traineddata |
tgl | Tagalog | tgl.traineddata |
tha | Thai | tha.traineddata |
tir | Tigrinya | tir.traineddata |
tur | Turkish | tur.traineddata |
uig | Uighur; Uyghur | uig.traineddata |
ukr | Ukrainian | ukr.traineddata |
urd | Urdu | urd.traineddata |
uzb | Uzbek | uzb.traineddata |
uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata |
vie | Vietnamese | vie.traineddata |
yid | Yiddish | yid.traineddata |
版本3.04/3.05的##数据文件
注意:对于阿拉伯语和印地语,您需要训练的数据文件和立方体数据文件.
Lang Code | Language | 3.04 traineddata |
---|---|---|
afr | Afrikaans | afr.traineddata |
amh | Amharic | amh.traineddata |
ara | Arabic | ara.traineddata |
asm | Assamese | asm.traineddata |
aze | Azerbaijani | aze.traineddata |
aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata |
bel | Belarusian | bel.traineddata |
ben | Bengali | ben.traineddata |
bod | Tibetan | bod.traineddata |
bos | Bosnian | bos.traineddata |
bul | Bulgarian | bul.traineddata |
cat | Catalan; Valencian | cat.traineddata |
ceb | Cebuano | ceb.traineddata |
ces | Czech | ces.traineddata |
chi_sim | Chinese - Simplified | chi_sim.traineddata |
chi_tra | Chinese - Traditional | chi_tra.traineddata |
chr | Cherokee | chr.traineddata |
cym | Welsh | cym.traineddata |
dan | Danish | dan.traineddata |
deu | German | deu.traineddata |
dzo | Dzongkha | dzo.traineddata |
ell | Greek, Modern (1453-) | ell.traineddata |
eng | English | eng.traineddata |
enm | English, Middle (1100-1500) | enm.traineddata |
epo | Esperanto | epo.traineddata |
est | Estonian | est.traineddata |
eus | Basque | eus.traineddata |
fas | Persian | fas.traineddata |
fin | Finnish | fin.traineddata |
fra | French | fra.traineddata |
frk | Frankish | frk.traineddata |
frm | French, Middle (ca. 1400-1600) | frm.traineddata |
gle | Irish | gle.traineddata |
glg | Galician | glg.traineddata |
grc | Greek, Ancient (-1453) | grc.traineddata |
guj | Gujarati | guj.traineddata |
hat | Haitian; Haitian Creole | hat.traineddata |
heb | Hebrew | heb.traineddata |
hin | Hindi | hin.traineddata |
hrv | Croatian | hrv.traineddata |
hun | Hungarian | hun.traineddata |
iku | Inuktitut | iku.traineddata |
ind | Indonesian | ind.traineddata |
isl | Icelandic | isl.traineddata |
ita | Italian | ita.traineddata |
ita_old | Italian - Old | ita_old.traineddata |
jav | Javanese | jav.traineddata |
jpn | Japanese | jpn.traineddata |
kan | Kannada | kan.traineddata |
kat | Georgian | kat.traineddata |
kat_old | Georgian - Old | kat_old.traineddata |
kaz | Kazakh | kaz.traineddata |
khm | Central Khmer | khm.traineddata |
kir | Kirghiz; Kyrgyz | kir.traineddata |
kor | Korean | kor.traineddata |
kur | Kurdish | kur.traineddata |
lao | Lao | lao.traineddata |
lat | Latin | lat.traineddata |
lav | Latvian | lav.traineddata |
lit | Lithuanian | lit.traineddata |
mal | Malayalam | mal.traineddata |
mar | Marathi | mar.traineddata |
mkd | Macedonian | mkd.traineddata |
mlt | Maltese | mlt.traineddata |
msa | Malay | msa.traineddata |
mya | Burmese | mya.traineddata |
nep | Nepali | nep.traineddata |
nld | Dutch; Flemish | nld.traineddata |
nor | Norwegian | nor.traineddata |
ori | Oriya | ori.traineddata |
pan | Panjabi; Punjabi | pan.traineddata |
pol | Polish | pol.traineddata |
por | Portuguese | por.traineddata |
pus | Pushto; Pashto | pus.traineddata |
ron | Romanian; Moldavian; Moldovan | ron.traineddata |
rus | Russian | rus.traineddata |
san | Sanskrit | san.traineddata |
sin | Sinhala; Sinhalese | sin.traineddata |
slk | Slovak | slk.traineddata |
slv | Slovenian | slv.traineddata |
spa | Spanish; Castilian | spa.traineddata |
spa_old | Spanish; Castilian - Old | spa_old.traineddata |
sqi | Albanian | sqi.traineddata |
srp | Serbian | srp.traineddata |
srp_latn | Serbian - Latin | srp_latn.traineddata |
swa | Swahili | swa.traineddata |
swe | Swedish | swe.traineddata |
syr | Syriac | syr.traineddata |
tam | Tamil | tam.traineddata |
tel | Telugu | tel.traineddata |
tgk | Tajik | tgk.traineddata |
tgl | Tagalog | tgl.traineddata |
tha | Thai | tha.traineddata |
tir | Tigrinya | tir.traineddata |
tur | Turkish | tur.traineddata |
uig | Uighur; Uyghur | uig.traineddata |
ukr | Ukrainian | ukr.traineddata |
urd | Urdu | urd.traineddata |
uzb | Uzbek | uzb.traineddata |
uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata |
vie | Vietnamese | vie.traineddata |
yid | Yiddish | yid.traineddata |
版本3.04/3.05的##多维数据集数据文件
在Tesseract 3.0x中,阿拉伯语和印地语使用Cube OCR引擎. 您需要下载多维数据集文件并将其移动到<ara/hin> .traineddata文件所在的同一文件夹中.
在Tesseract 4.0中,Cube OCR引擎已从代码库中删除,因此如果您使用的是4.0或更新版本,则不需要这些文件.
印地语:
阿拉伯:
这些数据文件由@paalberti为一些旧版本的Tesseract准备.
dan_frak
,deu_frak
和swe_frak
是为版本3.00准备的,slk_frak
是为3.01准备的.
有关这些文件的更新,请访问paalberti/tesseract-dan-fraktur.
Lang Code | Language | 3.0x traineddata |
---|---|---|
dan_frak | Danish - Fraktur | dan_frak.traineddata |
deu_frak | German - Fraktur | deu_frak.traineddata |
slk_frak | Slovak - Fraktur | slk_frak.traineddata |
swe_frak | Swedish - Fraktur | swe-frak.traineddata |
##版本3.02的数据文件
##版本2.0x的数据文件
Lang Code | Language | 2.0x traineddata |
---|---|---|
deu | German | tesseract-2.00.deu.tar.gz |
deu-f | German - Fraktur | tesseract-2.01.deu-f.tar.gz |
eng | English | tesseract-2.00.eng.tar.gz |
eus | Basque | tesseract-2.04-eus.tar.gz |
fra | French | tesseract-2.00.fra.tar.gz |
ita | Italian | tesseract-2.00.ita.tar.gz |
nld | Dutch; Flemish | tesseract-2.00.nld.tar.gz |
por | Portuguese | tesseract-2.01.por.tar.gz |
spa | Spanish; Castilian | tesseract-2.00.spa.tar.gz |
vie | Vietnamese | tesseract-2.01.vie.tar.gz |
##训练的数据文件的格式
每种语言的traineddata
文件是Tesseract特定格式的存档文件.
它包含Tesseract OCR进程所需的几个未压缩组件文件.
程序combine_tessdata
用于从组件文件创建tessdata
文件,也可以像下面的示例一样再次提取它们:
###从2016年11月开始的4.0.0格式(包括LSTM和Legacy型号)
combine_tessdata -u eng.traineddata eng.
Extracting tessdata components from eng.traineddata
Wrote eng.unicharset
Wrote eng.unicharambigs
Wrote eng.inttemp
Wrote eng.pffmtable
Wrote eng.normproto
Wrote eng.punc-dawg
Wrote eng.word-dawg
Wrote eng.number-dawg
Wrote eng.freq-dawg
Wrote eng.cube-unicharset
Wrote eng.cube-word-dawg
Wrote eng.shapetable
Wrote eng.bigram-dawg
Wrote eng.lstm
Wrote eng.lstm-punc-dawg
Wrote eng.lstm-word-dawg
Wrote eng.lstm-number-dawg
Wrote eng.version
Version string:Pre-4.0.0
1:unicharset:size=7477, offset=192
2:unicharambigs:size=1047, offset=7669
3:inttemp:size=976552, offset=8716
4:pffmtable:size=844, offset=985268
5:normproto:size=13408, offset=986112
6:punc-dawg:size=4322, offset=999520
7:word-dawg:size=1082890, offset=1003842
8:number-dawg:size=6426, offset=2086732
9:freq-dawg:size=1410, offset=2093158
11:cube-unicharset:size=1511, offset=2094568
12:cube-word-dawg:size=1062106, offset=2096079
13:shapetable:size=63346, offset=3158185
14:bigram-dawg:size=16109842, offset=3221531
17:lstm:size=5390718, offset=19331373
18:lstm-punc-dawg:size=4322, offset=24722091
19:lstm-word-dawg:size=7143578, offset=24726413
20:lstm-number-dawg:size=3530, offset=31869991
23:version:size=9, offset=31873521
combine_tessdata -u eng.traineddata eng.
Extracting tessdata components from eng.traineddata
Wrote eng.lstm
Wrote eng.lstm-punc-dawg
Wrote eng.lstm-word-dawg
Wrote eng.lstm-number-dawg
Wrote eng.lstm-unicharset
Wrote eng.lstm-recoder
Wrote eng.version
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
###关于压缩的训练数据文件的提议
There are some proposals to replace the Tesseract archive format by a standard archive format which could also support compression. A discussion on the tesseract-dev forum proposed the ZIP format already in 2014. In 2017 an experimental implementation was provided as a pull request.