Data Files - kana112233/tesseract GitHub Wiki

## Special Data Files

| Lang Code | Description | 4.0/3.0x traineddata |
|---|---|---|
| osd | Orientation and script detection | osd.traineddata |
| equ | Math / equation detection | equ.traineddata |

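Note: These two data files are compatible with older versions of Tesseract. osd is compatible with version 3.01 and later, and equ is compatible with version 3.02 and later.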
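
The compatibility note above can be expressed as a small lookup. This is only an illustrative sketch: the names `MIN_VERSION`, `parse_version`, and `is_usable` are hypothetical helpers, not part of Tesseract, and the parser assumes plain numeric version strings such as `3.02` (it does not handle suffixes like `alpha`).

```python
# Minimum Tesseract version for each special data file, as stated in the note.
MIN_VERSION = {"osd": (3, 1), "equ": (3, 2)}


def parse_version(text):
    """Turn a numeric version string such as '3.02' or '4.0.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_usable(data_file, tesseract_version):
    """Return True if the special data file works with the given Tesseract version."""
    return parse_version(tesseract_version) >= MIN_VERSION[data_file]


if __name__ == "__main__":
    for name in ("osd", "equ"):
        for version in ("3.00", "3.01", "3.02", "4.0.0"):
            print(name, version, is_usable(name, version))
```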

## Updated Data Files for Version 4.00 (September 15, 2017)

We have three sets of .traineddata files on GitHub, in three separate repositories.

Most users will want **tessdata_fast**, which is the set that will ship as part of Linux distributions.

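**tessdata_best** is for people willing to trade speed for higher accuracy. It is also the only set of files that can be used for certain retraining scenarios for advanced users.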

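The third set, in **tessdata**, is the only one that supports the legacy recognizer. The 4.00 files from November 2016 contain both legacy models and older LSTM models. The current set of files in **tessdata** has the legacy models and newer LSTM models (integer versions of the 4.00.00 alpha models in tessdata_best).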

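Note: When using the new models in the **tessdata_best** and **tessdata_fast** repositories, only the new LSTM-based OCR engine is supported. These files do not support the legacy engine, so Tesseract's OEM modes 0 and 2 will not work with them.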
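
The rule in this note can be sketched as a lookup from repository to engine support. The table and the function `oem_works` are hypothetical illustrations, not Tesseract APIs. The sketch assumes the usual meaning of the OEM values (0 = legacy only, 1 = LSTM only, 2 = legacy plus LSTM, 3 = default) and assumes that modes 1 and 3 work with all three repositories, which the text implies but does not state outright.

```python
# OEM modes that need the legacy engine: 0 (legacy only) and 2 (legacy + LSTM).
LEGACY_OEM_MODES = {0, 2}

# Whether each repository's files include legacy models, per the text above.
HAS_LEGACY_MODELS = {
    "tessdata": True,
    "tessdata_fast": False,
    "tessdata_best": False,
}


def oem_works(repository, oem):
    """Return True if the given OEM mode is expected to work with the repository."""
    if oem not in (0, 1, 2, 3):
        raise ValueError(f"unknown OEM mode: {oem}")
    if oem in LEGACY_OEM_MODES:
        return HAS_LEGACY_MODELS[repository]
    # Assumption: modes 1 and 3 only need LSTM models, which all three sets have.
    return True


if __name__ == "__main__":
    for repo in HAS_LEGACY_MODELS:
        print(repo, [mode for mode in range(4) if oem_works(repo, mode)])
```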

## Data Files for Version 4.00 (November 29, 2016)

This set of trained data files supports the legacy recognizer with `--oem 0` and the LSTM models with `--oem 1`.
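
As a rough illustration of how the `--oem` switch is passed, the following sketch builds a `tesseract` command line as an argument list without running it. The helper `tesseract_command` and the file names `page.png` and `out` are hypothetical. `-l`, `--oem`, and `--tessdata-dir` are existing command-line options of the `tesseract` executable in 4.0.

```python
def tesseract_command(image, output_base, lang="eng", oem=1, tessdata_dir=None):
    """Build an argument list for the tesseract CLI (the command is not executed)."""
    if oem not in (0, 1, 2, 3):
        raise ValueError(f"unknown OEM mode: {oem}")
    command = ["tesseract", image, output_base, "-l", lang, "--oem", str(oem)]
    if tessdata_dir is not None:
        # Point Tesseract at the folder that holds the .traineddata files.
        command += ["--tessdata-dir", tessdata_dir]
    return command


if __name__ == "__main__":
    # Legacy recognizer, then the LSTM recognizer, on a hypothetical image.
    print(" ".join(tesseract_command("page.png", "out", oem=0)))
    print(" ".join(tesseract_command("page.png", "out", oem=1)))
```

The resulting list can be handed to `subprocess.run` on a machine where Tesseract and the matching data files are installed.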

Note: The kur data file has not been updated since 3.04. For Fraktur, see the Fraktur Data Files section, or use the newer data files from the tessdata_fast or tessdata_best repositories.

| Lang Code | Language | 4.0 traineddata |
|---|---|---|
| afr | Afrikaans | afr.traineddata |
| amh | Amharic | amh.traineddata |
| ara | Arabic | ara.traineddata |
| asm | Assamese | asm.traineddata |
| aze | Azerbaijani | aze.traineddata |
| aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata |
| bel | Belarusian | bel.traineddata |
| ben | Bengali | ben.traineddata |
| bod | Tibetan | bod.traineddata |
| bos | Bosnian | bos.traineddata |
| bul | Bulgarian | bul.traineddata |
| cat | Catalan; Valencian | cat.traineddata |
| ceb | Cebuano | ceb.traineddata |
| ces | Czech | ces.traineddata |
| chi_sim | Chinese - Simplified | chi_sim.traineddata |
| chi_tra | Chinese - Traditional | chi_tra.traineddata |
| chr | Cherokee | chr.traineddata |
| cym | Welsh | cym.traineddata |
| dan | Danish | dan.traineddata |
| deu | German | deu.traineddata |
| dzo | Dzongkha | dzo.traineddata |
| ell | Greek, Modern (1453-) | ell.traineddata |
| eng | English | eng.traineddata |
| enm | English, Middle (1100-1500) | enm.traineddata |
| epo | Esperanto | epo.traineddata |
| est | Estonian | est.traineddata |
| eus | Basque | eus.traineddata |
| fas | Persian | fas.traineddata |
| fin | Finnish | fin.traineddata |
| fra | French | fra.traineddata |
| frk | Frankish | frk.traineddata |
| frm | French, Middle (ca. 1400-1600) | frm.traineddata |
| gle | Irish | gle.traineddata |
| glg | Galician | glg.traineddata |
| grc | Greek, Ancient (-1453) | grc.traineddata |
| guj | Gujarati | guj.traineddata |
| hat | Haitian; Haitian Creole | hat.traineddata |
| heb | Hebrew | heb.traineddata |
| hin | Hindi | hin.traineddata |
| hrv | Croatian | hrv.traineddata |
| hun | Hungarian | hun.traineddata |
| iku | Inuktitut | iku.traineddata |
| ind | Indonesian | ind.traineddata |
| isl | Icelandic | isl.traineddata |
| ita | Italian | ita.traineddata |
| ita_old | Italian - Old | ita_old.traineddata |
| jav | Javanese | jav.traineddata |
| jpn | Japanese | jpn.traineddata |
| kan | Kannada | kan.traineddata |
| kat | Georgian | kat.traineddata |
| kat_old | Georgian - Old | kat_old.traineddata |
| kaz | Kazakh | kaz.traineddata |
| khm | Central Khmer | khm.traineddata |
| kir | Kirghiz; Kyrgyz | kir.traineddata |
| kor | Korean | kor.traineddata |
| kur | Kurdish | kur.traineddata |
| lao | Lao | lao.traineddata |
| lat | Latin | lat.traineddata |
| lav | Latvian | lav.traineddata |
| lit | Lithuanian | lit.traineddata |
| mal | Malayalam | mal.traineddata |
| mar | Marathi | mar.traineddata |
| mkd | Macedonian | mkd.traineddata |
| mlt | Maltese | mlt.traineddata |
| msa | Malay | msa.traineddata |
| mya | Burmese | mya.traineddata |
| nep | Nepali | nep.traineddata |
| nld | Dutch; Flemish | nld.traineddata |
| nor | Norwegian | nor.traineddata |
| ori | Oriya | ori.traineddata |
| pan | Panjabi; Punjabi | pan.traineddata |
| pol | Polish | pol.traineddata |
| por | Portuguese | por.traineddata |
| pus | Pushto; Pashto | pus.traineddata |
| ron | Romanian; Moldavian; Moldovan | ron.traineddata |
| rus | Russian | rus.traineddata |
| san | Sanskrit | san.traineddata |
| sin | Sinhala; Sinhalese | sin.traineddata |
| slk | Slovak | slk.traineddata |
| slv | Slovenian | slv.traineddata |
| spa | Spanish; Castilian | spa.traineddata |
| spa_old | Spanish; Castilian - Old | spa_old.traineddata |
| sqi | Albanian | sqi.traineddata |
| srp | Serbian | srp.traineddata |
| srp_latn | Serbian - Latin | srp_latn.traineddata |
| swa | Swahili | swa.traineddata |
| swe | Swedish | swe.traineddata |
| syr | Syriac | syr.traineddata |
| tam | Tamil | tam.traineddata |
| tel | Telugu | tel.traineddata |
| tgk | Tajik | tgk.traineddata |
| tgl | Tagalog | tgl.traineddata |
| tha | Thai | tha.traineddata |
| tir | Tigrinya | tir.traineddata |
| tur | Turkish | tur.traineddata |
| uig | Uighur; Uyghur | uig.traineddata |
| ukr | Ukrainian | ukr.traineddata |
| urd | Urdu | urd.traineddata |
| uzb | Uzbek | uzb.traineddata |
| uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata |
| vie | Vietnamese | vie.traineddata |
| yid | Yiddish | yid.traineddata |

## Data Files for Version 3.04/3.05

Note: For Arabic and Hindi, you need both the traineddata file and the cube data files.

| Lang Code | Language | 3.04 traineddata |
|---|---|---|
| afr | Afrikaans | afr.traineddata |
| amh | Amharic | amh.traineddata |
| ara | Arabic | ara.traineddata |
| asm | Assamese | asm.traineddata |
| aze | Azerbaijani | aze.traineddata |
| aze_cyrl | Azerbaijani - Cyrillic | aze_cyrl.traineddata |
| bel | Belarusian | bel.traineddata |
| ben | Bengali | ben.traineddata |
| bod | Tibetan | bod.traineddata |
| bos | Bosnian | bos.traineddata |
| bul | Bulgarian | bul.traineddata |
| cat | Catalan; Valencian | cat.traineddata |
| ceb | Cebuano | ceb.traineddata |
| ces | Czech | ces.traineddata |
| chi_sim | Chinese - Simplified | chi_sim.traineddata |
| chi_tra | Chinese - Traditional | chi_tra.traineddata |
| chr | Cherokee | chr.traineddata |
| cym | Welsh | cym.traineddata |
| dan | Danish | dan.traineddata |
| deu | German | deu.traineddata |
| dzo | Dzongkha | dzo.traineddata |
| ell | Greek, Modern (1453-) | ell.traineddata |
| eng | English | eng.traineddata |
| enm | English, Middle (1100-1500) | enm.traineddata |
| epo | Esperanto | epo.traineddata |
| est | Estonian | est.traineddata |
| eus | Basque | eus.traineddata |
| fas | Persian | fas.traineddata |
| fin | Finnish | fin.traineddata |
| fra | French | fra.traineddata |
| frk | Frankish | frk.traineddata |
| frm | French, Middle (ca. 1400-1600) | frm.traineddata |
| gle | Irish | gle.traineddata |
| glg | Galician | glg.traineddata |
| grc | Greek, Ancient (-1453) | grc.traineddata |
| guj | Gujarati | guj.traineddata |
| hat | Haitian; Haitian Creole | hat.traineddata |
| heb | Hebrew | heb.traineddata |
| hin | Hindi | hin.traineddata |
| hrv | Croatian | hrv.traineddata |
| hun | Hungarian | hun.traineddata |
| iku | Inuktitut | iku.traineddata |
| ind | Indonesian | ind.traineddata |
| isl | Icelandic | isl.traineddata |
| ita | Italian | ita.traineddata |
| ita_old | Italian - Old | ita_old.traineddata |
| jav | Javanese | jav.traineddata |
| jpn | Japanese | jpn.traineddata |
| kan | Kannada | kan.traineddata |
| kat | Georgian | kat.traineddata |
| kat_old | Georgian - Old | kat_old.traineddata |
| kaz | Kazakh | kaz.traineddata |
| khm | Central Khmer | khm.traineddata |
| kir | Kirghiz; Kyrgyz | kir.traineddata |
| kor | Korean | kor.traineddata |
| kur | Kurdish | kur.traineddata |
| lao | Lao | lao.traineddata |
| lat | Latin | lat.traineddata |
| lav | Latvian | lav.traineddata |
| lit | Lithuanian | lit.traineddata |
| mal | Malayalam | mal.traineddata |
| mar | Marathi | mar.traineddata |
| mkd | Macedonian | mkd.traineddata |
| mlt | Maltese | mlt.traineddata |
| msa | Malay | msa.traineddata |
| mya | Burmese | mya.traineddata |
| nep | Nepali | nep.traineddata |
| nld | Dutch; Flemish | nld.traineddata |
| nor | Norwegian | nor.traineddata |
| ori | Oriya | ori.traineddata |
| pan | Panjabi; Punjabi | pan.traineddata |
| pol | Polish | pol.traineddata |
| por | Portuguese | por.traineddata |
| pus | Pushto; Pashto | pus.traineddata |
| ron | Romanian; Moldavian; Moldovan | ron.traineddata |
| rus | Russian | rus.traineddata |
| san | Sanskrit | san.traineddata |
| sin | Sinhala; Sinhalese | sin.traineddata |
| slk | Slovak | slk.traineddata |
| slv | Slovenian | slv.traineddata |
| spa | Spanish; Castilian | spa.traineddata |
| spa_old | Spanish; Castilian - Old | spa_old.traineddata |
| sqi | Albanian | sqi.traineddata |
| srp | Serbian | srp.traineddata |
| srp_latn | Serbian - Latin | srp_latn.traineddata |
| swa | Swahili | swa.traineddata |
| swe | Swedish | swe.traineddata |
| syr | Syriac | syr.traineddata |
| tam | Tamil | tam.traineddata |
| tel | Telugu | tel.traineddata |
| tgk | Tajik | tgk.traineddata |
| tgl | Tagalog | tgl.traineddata |
| tha | Thai | tha.traineddata |
| tir | Tigrinya | tir.traineddata |
| tur | Turkish | tur.traineddata |
| uig | Uighur; Uyghur | uig.traineddata |
| ukr | Ukrainian | ukr.traineddata |
| urd | Urdu | urd.traineddata |
| uzb | Uzbek | uzb.traineddata |
| uzb_cyrl | Uzbek - Cyrillic | uzb_cyrl.traineddata |
| vie | Vietnamese | vie.traineddata |
| yid | Yiddish | yid.traineddata |

## Cube Data Files for Version 3.04/3.05

In Tesseract 3.0x, Arabic and Hindi use the Cube OCR engine. You need to download the cube files and move them into the same folder as the <ara/hin>.traineddata file.

In Tesseract 4.0, the Cube OCR engine was removed from the codebase, so you do not need these files if you are using version 4.0 or newer.

Hindi: hin.cube.bigrams, hin.cube.fold, hin.cube.lm, hin.cube.nn, hin.cube.params, hin.cube.word-freq, hin.tesseract_cube.nn

Arabic: ara.cube.bigrams, ara.cube.fold, ara.cube.lm, ara.cube.nn, ara.cube.params, ara.cube.word-freq, ara.cube.size, ara.tesseract_cube.nn
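
The requirement that the cube files sit next to the traineddata file can be checked with a few lines of Python. This is a sketch under stated assumptions: `missing_cube_files` is a hypothetical helper, the file names come from the lists above (with `word-freq` as the name of the word-frequency file), and the check only looks at file presence, not content.

```python
from pathlib import Path

# Cube component suffixes per language, taken from the lists above.
CUBE_SUFFIXES = {
    "hin": [
        "cube.bigrams", "cube.fold", "cube.lm", "cube.nn",
        "cube.params", "cube.word-freq", "tesseract_cube.nn",
    ],
    "ara": [
        "cube.bigrams", "cube.fold", "cube.lm", "cube.nn",
        "cube.params", "cube.word-freq", "cube.size", "tesseract_cube.nn",
    ],
}


def missing_cube_files(tessdata_dir, lang):
    """List the files Tesseract 3.0x would need for `lang` that are absent from the folder."""
    folder = Path(tessdata_dir)
    expected = [f"{lang}.traineddata"]
    expected += [f"{lang}.{suffix}" for suffix in CUBE_SUFFIXES[lang]]
    return [name for name in expected if not (folder / name).is_file()]


if __name__ == "__main__":
    # The folder name is a placeholder; point it at a real tessdata directory.
    print(missing_cube_files("tessdata", "ara"))
```

An empty list means every expected file was found in the folder.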

## Fraktur Data Files

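These data files were prepared by @paalberti for some older versions of Tesseract. dan_frak, deu_frak, and swe_frak were prepared for version 3.00, and slk_frak was prepared for 3.01. For updates to these files, see paalberti/tesseract-dan-fraktur.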

| Lang Code | Language | 3.0x traineddata |
|---|---|---|
| dan_frak | Danish - Fraktur | dan_frak.traineddata |
| deu_frak | German - Fraktur | deu_frak.traineddata |
| slk_frak | Slovak - Fraktur | slk_frak.traineddata |
| swe_frak | Swedish - Fraktur | swe-frak.traineddata |

## Data Files for Version 3.02

| Lang Code | Language | 3.02 traineddata |
|---|---|---|
| afr | Afrikaans | tesseract-ocr-3.02.afr.tar.gz |
| ara | Arabic | tesseract-ocr-3.02.ara.tar.gz |
| aze | Azerbaijani | tesseract-ocr-3.02.aze.tar.gz |
| bel | Belarusian | tesseract-ocr-3.02.bel.tar.gz |
| ben | Bengali | tesseract-ocr-3.02.ben.tar.gz |
| bul | Bulgarian | tesseract-ocr-3.02.bul.tar.gz |
| cat | Catalan; Valencian | tesseract-ocr-3.02.cat.tar.gz |
| ces | Czech | tesseract-ocr-3.02.ces.tar.gz |
| chi_sim | Chinese - Simplified | tesseract-ocr-3.02.chi_sim.tar.gz |
| chi_tra | Chinese - Traditional | tesseract-ocr-3.02.chi_tra.tar.gz |
| chr | Cherokee | tesseract-ocr-3.02.chr.tar.gz |
| dan | Danish | tesseract-ocr-3.02.dan.tar.gz |
| deu | German | tesseract-ocr-3.02.deu.tar.gz |
| ell | Greek, Modern (1453-) | tesseract-ocr-3.02.ell.tar.gz |
| eng | English | tesseract-ocr-3.02.eng.tar.gz |
| enm | English, Middle (1100-1500) | tesseract-ocr-3.02.enm.tar.gz |
| epo | Esperanto | tesseract-ocr-3.02.epo.tar.gz |
| est | Estonian | tesseract-ocr-3.02.est.tar.gz |
| eus | Basque | tesseract-ocr-3.02.eus.tar.gz |
| fin | Finnish | tesseract-ocr-3.02.fin.tar.gz |
| fra | French | tesseract-ocr-3.02.fra.tar.gz |
| frk | Frankish | tesseract-ocr-3.02.frk.tar.gz |
| frm | French, Middle (ca. 1400-1600) | tesseract-ocr-3.02.frm.tar.gz |
| glg | Galician | tesseract-ocr-3.02.glg.tar.gz |
| grc | Greek, Ancient (-1453) | tesseract-ocr-3.02.grc.tar.gz |
| heb | Hebrew | tesseract-ocr-3.02.heb.tar.gz |
| hin | Hindi | tesseract-ocr-3.02.hin.tar.gz |
| hrv | Croatian | tesseract-ocr-3.02.hrv.tar.gz |
| hun | Hungarian | tesseract-ocr-3.02.hun.tar.gz |
| ind | Indonesian | tesseract-ocr-3.02.ind.tar.gz |
| isl | Icelandic | tesseract-ocr-3.02.isl.tar.gz |
| ita | Italian | tesseract-ocr-3.02.ita.tar.gz |
| ita_old | Italian - Old | tesseract-ocr-3.02.ita_old.tar.gz |
| jpn | Japanese | tesseract-ocr-3.02.jpn.tar.gz |
| kan | Kannada | tesseract-ocr-3.02.kan.tar.gz |
| kor | Korean | tesseract-ocr-3.02.kor.tar.gz |
| lav | Latvian | tesseract-ocr-3.02.lav.tar.gz |
| lit | Lithuanian | tesseract-ocr-3.02.lit.tar.gz |
| mal | Malayalam | tesseract-ocr-3.02.mal.tar.gz |
| mkd | Macedonian | tesseract-ocr-3.02.mkd.tar.gz |
| mlt | Maltese | tesseract-ocr-3.02.mlt.tar.gz |
| msa | Malay | tesseract-ocr-3.02.msa.tar.gz |
| nld | Dutch; Flemish | tesseract-ocr-3.02.nld.tar.gz |
| nor | Norwegian | tesseract-ocr-3.02.nor.tar.gz |
| pol | Polish | tesseract-ocr-3.02.pol.tar.gz |
| por | Portuguese | tesseract-ocr-3.02.por.tar.gz |
| ron | Romanian; Moldavian; Moldovan | tesseract-ocr-3.02.ron.tar.gz |
| rus | Russian | tesseract-ocr-3.02.rus.tar.gz |
| slk | Slovak | tesseract-ocr-3.02.slk.tar.gz |
| slv | Slovenian | tesseract-ocr-3.02.slv.tar.gz |
| spa | Spanish; Castilian | tesseract-ocr-3.02.spa.tar.gz |
| spa_old | Spanish; Castilian - Old | tesseract-ocr-3.02.spa_old.tar.gz |
| sqi | Albanian | tesseract-ocr-3.02.sqi.tar.gz |
| srp | Serbian | tesseract-ocr-3.02.srp.tar.gz |
| swa | Swahili | tesseract-ocr-3.02.swa.tar.gz |
| swe | Swedish | tesseract-ocr-3.02.swe.tar.gz |
| tam | Tamil | tesseract-ocr-3.02.tam.tar.gz |
| tel | Telugu | tesseract-ocr-3.02.tel.tar.gz |
| tgl | Tagalog | tesseract-ocr-3.02.tgl.tar.gz |
| tha | Thai | tesseract-ocr-3.02.tha.tar.gz |
| tur | Turkish | tesseract-ocr-3.02.tur.tar.gz |
| ukr | Ukrainian | tesseract-ocr-3.02.ukr.tar.gz |
| vie | Vietnamese | tesseract-ocr-3.02.vie.tar.gz |

## Data Files for Version 2.0x

| Lang Code | Language | 2.0x traineddata |
|---|---|---|
| deu | German | tesseract-2.00.deu.tar.gz |
| deu-f | German - Fraktur | tesseract-2.01.deu-f.tar.gz |
| eng | English | tesseract-2.00.eng.tar.gz |
| eus | Basque | tesseract-2.04-eus.tar.gz |
| fra | French | tesseract-2.00.fra.tar.gz |
| ita | Italian | tesseract-2.00.ita.tar.gz |
| nld | Dutch; Flemish | tesseract-2.00.nld.tar.gz |
| por | Portuguese | tesseract-2.01.por.tar.gz |
| spa | Spanish; Castilian | tesseract-2.00.spa.tar.gz |
| vie | Vietnamese | tesseract-2.01.vie.tar.gz |

## Format of traineddata Files

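The traineddata file for each language is an archive file in a Tesseract-specific format. It contains several uncompressed component files that the Tesseract OCR process needs. The program combine_tessdata is used to create a traineddata file from the component files, and it can also extract them again, as in the following examples: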

### 4.0.0 format from November 2016 (containing both LSTM and Legacy models)

    combine_tessdata -u eng.traineddata eng.
    Extracting tessdata components from eng.traineddata
    Wrote eng.unicharset
    Wrote eng.unicharambigs
    Wrote eng.inttemp
    Wrote eng.pffmtable
    Wrote eng.normproto
    Wrote eng.punc-dawg
    Wrote eng.word-dawg
    Wrote eng.number-dawg
    Wrote eng.freq-dawg
    Wrote eng.cube-unicharset
    Wrote eng.cube-word-dawg
    Wrote eng.shapetable
    Wrote eng.bigram-dawg
    Wrote eng.lstm
    Wrote eng.lstm-punc-dawg
    Wrote eng.lstm-word-dawg
    Wrote eng.lstm-number-dawg
    Wrote eng.version
    Version string:Pre-4.0.0
    1:unicharset:size=7477, offset=192
    2:unicharambigs:size=1047, offset=7669
    3:inttemp:size=976552, offset=8716
    4:pffmtable:size=844, offset=985268
    5:normproto:size=13408, offset=986112
    6:punc-dawg:size=4322, offset=999520
    7:word-dawg:size=1082890, offset=1003842
    8:number-dawg:size=6426, offset=2086732
    9:freq-dawg:size=1410, offset=2093158
    11:cube-unicharset:size=1511, offset=2094568
    12:cube-word-dawg:size=1062106, offset=2096079
    13:shapetable:size=63346, offset=3158185
    14:bigram-dawg:size=16109842, offset=3221531
    17:lstm:size=5390718, offset=19331373
    18:lstm-punc-dawg:size=4322, offset=24722091
    19:lstm-word-dawg:size=7143578, offset=24726413
    20:lstm-number-dawg:size=3530, offset=31869991
    23:version:size=9, offset=31873521
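
The `Wrote ...` lines in the output above name the components stored in the file, which gives a quick way to see which recognizers a traineddata file can serve. The sketch below uses hypothetical helpers (`component_names`, `engines`) and relies on an assumption that goes beyond the text: that the presence of `inttemp` marks a legacy model and the presence of `lstm` marks an LSTM model.

```python
def component_names(log_text, prefix="eng."):
    """Collect component names from the 'Wrote <prefix><name>' lines of the output."""
    marker = "Wrote " + prefix
    names = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith(marker):
            names.append(line[len(marker):])
    return names


def engines(names):
    """Guess which recognizers are covered (assumption: inttemp = legacy, lstm = LSTM)."""
    found = set()
    if "inttemp" in names:
        found.add("legacy")
    if "lstm" in names:
        found.add("lstm")
    return found


if __name__ == "__main__":
    # A shortened excerpt of the output shown above.
    sample = """Extracting tessdata components from eng.traineddata
Wrote eng.unicharset
Wrote eng.inttemp
Wrote eng.lstm
Wrote eng.version"""
    print(sorted(engines(component_names(sample))))
```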

### 4.00.00alpha LSTM-only format

    combine_tessdata -u eng.traineddata eng.
    Extracting tessdata components from eng.traineddata
    Wrote eng.lstm
    Wrote eng.lstm-punc-dawg
    Wrote eng.lstm-word-dawg
    Wrote eng.lstm-number-dawg
    Wrote eng.lstm-unicharset
    Wrote eng.lstm-recoder
    Wrote eng.version
    Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
    17:lstm:size=11689099, offset=192
    18:lstm-punc-dawg:size=4322, offset=11689291
    19:lstm-word-dawg:size=3694794, offset=11693613
    20:lstm-number-dawg:size=4738, offset=15388407
    21:lstm-unicharset:size=6360, offset=15393145
    22:lstm-recoder:size=1012, offset=15399505
    23:version:size=80, offset=15400517
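
The `index:name:size=..., offset=...` lines in both listings suggest that the components are stored back to back: each offset equals the previous offset plus the previous size, and the first component starts at offset 192. The following sketch parses such a listing and checks that property. The names `parse_listing` and `is_contiguous` are hypothetical, and treating the end of the last component as the end of the file is an assumption.

```python
import re

# Matches lines such as "17:lstm:size=11689099, offset=192".
ENTRY = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_listing(text):
    """Return (index, name, size, offset) tuples for every component line."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            index, name, size, offset = match.groups()
            entries.append((int(index), name, int(size), int(offset)))
    return entries


def is_contiguous(entries):
    """True if every component starts exactly where the previous one ends."""
    return all(
        prev[3] + prev[2] == nxt[3] for prev, nxt in zip(entries, entries[1:])
    )


# Component lines copied from the LSTM-only listing above.
LISTING = """17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517"""

entries = parse_listing(LISTING)
print(is_contiguous(entries))           # True
print(entries[-1][3] + entries[-1][2])  # 15400597, the end of the last component
```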

### Proposal for compressed traineddata files

There are some proposals to replace the Tesseract archive format with a standard archive format that could also support compression. A discussion on the tesseract-dev forum proposed the ZIP format as early as 2014. In 2017, an experimental implementation was provided as a pull request.
