Training with Tesseract - UB-Mannheim/AustrianNewspapers GitHub Wiki
Training with AustrianNewspapers 1.x
See https://github.com/tesseract-ocr/tesstrain/wiki/AustrianNewspapers.
Training with AustrianNewspapers 2.0
Create ground truth (GT) line pairs, that is one line image plus its transcription, from the page images and PAGE XML files with PAGETools or with page2img.py from format-converters.
Then use tesstrain.
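tesstrain pairs each line image with a transcription of the same base name ending in .gt.txt. Once the preparation steps on this page have been run, a quick check for images without a transcription can save a failed training run. The following is a minimal sketch, not part of the original workflow: the function name missing_transcriptions is hypothetical, and the PNG image extension is an assumption that may need to be changed to match what page2img.py actually writes.

```shell
# List line images that have no matching transcription.
# Assumptions (not from the original page): images are named
# <base>.png unless another extension is passed as second argument,
# and transcriptions are named <base>.gt.txt.
missing_transcriptions() {
    local dir="$1" ext="${2:-png}" img
    # -L follows symbolic links, since the GT directories are linked in.
    find -L "$dir" -type f -name "*.$ext" | sort | while IFS= read -r img; do
        [ -e "${img%.$ext}.gt.txt" ] || printf '%s\n' "$img"
    done
}
```

Called as `missing_transcriptions data/ONB-ground-truth` from the tesstrain directory, it prints one path per unpaired image and prints nothing if every image has a transcription.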
# Prepare GT lines (run from the AustrianNewspapers checkout).
cd data
for set in TrainingSet_ONB_Newseye_GT_M1+ ValidationSet_ONB_Newseye_GT_M1+; do
  (
    cd "$set"
    mkdir -p gt
    cd GT-PAGE
    page2img.py -p 2013-07-15 -o ../gt -t -v *.xml
    # Optionally decompose the line texts.
    cd ../gt
    find . -name "*.txt" -print0 | xargs -0 decompose.py
  )
done
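# Set up the tesstrain data directories and link the GT lines.
cd "$HOME/src/github/tesseract-ocr/tesstrain"
mkdir -p data/ONB
mkdir -p data/ONB-ground-truth
cd data/ONB-ground-truth
ln -s ~/src/github/UB-Mannheim/AustrianNewspapers/data/TrainingSet_ONB_Newseye_GT_M1+/gt t-gt
ln -s ~/src/github/UB-Mannheim/AustrianNewspapers/data/ValidationSet_ONB_Newseye_GT_M1+/gt v-gt
# tesstrain expects transcriptions named *.gt.txt; skip files that were already renamed.
find */ -name "*.txt" ! -name "*.gt.txt" | while IFS= read -r txt; do mv -v "$txt" "${txt%.txt}.gt.txt"; done
# Return to the tesstrain root, where the Makefile lives.
cd ../..
make MODEL_NAME=ONB unicharset
make MODEL_NAME=ONB lists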
# Train for 40 epochs, one epoch at a time, and keep a copy of the checkpoint after each epoch.
for i in $(seq 1 40); do
  nohup time make MODEL_NAME=ONB NET_SPEC="[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c1]" EPOCHS=$i training | tee -a data/ONB/training.log
  cp -av data/ONB/checkpoints/ONB_checkpoint data/ONB/checkpoints/ONB-$(printf '%03d' $i).checkpoint
done
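The loop above appends all training output to data/ONB/training.log. To follow how the error rate develops, the progress lines can be reduced to two columns. This is a minimal sketch that assumes the log line format "At iteration a/b/c, ... BCER train=x%, ..." seen in the training output; the function names bcer_progress and best_bcer are hypothetical, and the second counter is taken here to be the training iteration.

```shell
# Print "<second iteration counter> <BCER in percent>" for every
# progress line that starts with "At iteration".
# Lines with a prefix such as "UpdateSubtrainer:" are skipped on purpose.
bcer_progress() {
    sed -n 's|^At iteration [0-9]*/\([0-9]*\)/[0-9]*,.*BCER train=\([0-9.]*\)%.*|\1 \2|p' "$1"
}

# Print the progress line with the lowest BCER.
# LC_ALL=C makes sort treat "." as the decimal separator.
best_bcer() {
    bcer_progress "$1" | LC_ALL=C sort -k2,2n | head -n 1
}
```

For example, `bcer_progress data/ONB/training.log | tail` shows the most recent values, and `best_bcer data/ONB/training.log` shows the lowest one logged so far. Main trainer and sub-trainer lines are not distinguished, so the output can mix both.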
[...]
At iteration 306729/2192243/2192363, mean rms=0.158%, delta=0.219%, BCER train=0.770%, BWER train=2.315%, skip ratio=0.000%, sub_trainer=0.673 margin=14.413
At iteration 302860/2192284/2192404, mean rms=0.141%, delta=0.258%, BCER train=0.695%, BWER train=1.998%, skip ratio=0.000%,
wrote checkpoint.
UpdateSubtrainer:At iteration 302864/2192384/2192504, mean rms=0.142%, delta=0.254%, BCER train=0.781%, BWER train=2.120%, skip ratio=0.000%,
At iteration 306733/2192343/2192463, mean rms=0.157%, delta=0.230%, BCER train=0.774%, BWER train=2.326%, skip ratio=0.000%, sub_trainer=0.695 margin=11.367
At iteration 302864/2192384/2192504, mean rms=0.142%, delta=0.254%, BCER train=0.781%, BWER train=2.120%, skip ratio=0.000%,
wrote checkpoint.
At iteration 306742/2192443/2192563, mean rms=0.160%, delta=0.246%, BCER train=0.792%, BWER train=2.395%, skip ratio=0.000%, wrote checkpoint.
At iteration 306751/2192543/2192663, mean rms=0.161%, delta=0.245%, BCER train=0.783%, BWER train=2.304%, skip ratio=0.000%, wrote checkpoint.
At iteration 306756/2192643/2192763, mean rms=0.159%, delta=0.234%, BCER train=0.709%, BWER train=2.206%, skip ratio=0.000%, wrote checkpoint.
At iteration 306763/2192743/2192863, mean rms=0.161%, delta=0.244%, BCER train=0.792%, BWER train=2.341%, skip ratio=0.000%, wrote checkpoint.
At iteration 306770/2192843/2192963, mean rms=0.162%, delta=0.249%, BCER train=0.799%, BWER train=2.517%, skip ratio=0.000%, wrote checkpoint.
At iteration 306776/2192943/2193063, mean rms=0.153%, delta=0.212%, BCER train=0.707%, BWER train=2.452%, skip ratio=0.000%, wrote checkpoint.
At iteration 306781/2193043/2193163, mean rms=0.149%, delta=0.199%, BCER train=0.646%, BWER train=2.258%, skip ratio=0.000%, wrote checkpoint.
At iteration 306787/2193143/2193263, mean rms=0.148%, delta=0.202%, BCER train=0.661%, BWER train=2.276%, skip ratio=0.000%, wrote checkpoint.
At iteration 306792/2193243/2193363, mean rms=0.149%, delta=0.196%, BCER train=0.661%, BWER train=2.213%, skip ratio=0.000%, wrote checkpoint.
At iteration 306793/2193343/2193463, mean rms=0.145%, delta=0.172%, BCER train=0.618%, BWER train=2.030%, skip ratio=0.000%, wrote checkpoint.
At iteration 306804/2193443/2193563, mean rms=0.146%, delta=0.194%, BCER train=0.608%, BWER train=2.032%, skip ratio=0.000%, wrote checkpoint.
At iteration 306807/2193480/2193600, mean rms=0.147%, delta=0.197%, BCER train=0.615%, BWER train=2.049%, skip ratio=0.000%, wrote checkpoint.
Finished! Selected model with minimal training error rate (BCER) = 0.23
lstmtraining \
--stop_training \
--continue_from data/ONB/checkpoints/ONB_checkpoint \
--traineddata data/ONB/ONB.traineddata \
--model_output data/ONB.traineddata
Loaded file data/ONB/checkpoints/ONB_checkpoint, unpacking...
7691.44user 21.05system 2:08:32elapsed 99%CPU (0avgtext+0avgdata 1043728maxresident)k
0inputs+52274720outputs (0major+2664860minor)pagefaults 0swaps
'data/ONB/checkpoints/ONB_checkpoint' -> 'data/ONB/checkpoints/ONB-040.checkpoint'
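The make run above converts only the latest checkpoint into a traineddata file. Because the loop keeps a numbered copy of the checkpoint for every epoch, the same lstmtraining call shown in the log can be repeated for each copy, for instance to compare epochs on the validation set. This is a minimal sketch: the function name convert_checkpoints, the LSTMTRAINING override, and the output names data/MODEL/MODEL-NNN.traineddata are assumptions, not part of the original workflow.

```shell
# Convert every numbered checkpoint copy of a model into a traineddata file.
# Run from the tesstrain root directory.
# Assumption: output files are written as data/<model>/<model>-NNN.traineddata.
# LSTMTRAINING can be set to another binary path (hypothetical override).
convert_checkpoints() {
    local model="$1" ckpt out
    for ckpt in data/"$model"/checkpoints/"$model"-[0-9][0-9][0-9].checkpoint; do
        # Skip the unexpanded pattern if no numbered checkpoint exists.
        [ -e "$ckpt" ] || continue
        out="data/$model/$(basename "$ckpt" .checkpoint).traineddata"
        "${LSTMTRAINING:-lstmtraining}" \
            --stop_training \
            --continue_from "$ckpt" \
            --traineddata "data/$model/$model.traineddata" \
            --model_output "$out"
    done
}
```

For the model trained here, the call would be `convert_checkpoints ONB`.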