Training with Calamari - UB-Mannheim/AustrianNewspapers GitHub Wiki
All training runs are started from the AustrianNewspapers working directory.
Trained models and training results are available from https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/calamari/ for both the first training with AustrianNewspapers 1.x and the latest training with AustrianNewspapers 2.0.
Training with AustrianNewspapers 1.x
Training on Linux
Training with CPU (AMD EPYC 7502 32-Core Processor, 64 GiB RAM)
The training process was interrupted manually after about 100 steps of the first epoch.
time calamari-train --train PageXML --train.images "TrainingSet_ONB_Newseye_GT_M1+/*.tif" --val PageXML --val.images "ValidationSet_ONB_Newseye_GT_M1+/*.tif"
[...]
INFO 2022-05-23 18:34:57,423 tfaip.trainer.callbacks.logger: Start of epoch 1
100/2665 [>.............................] - ETA: 19:28 - loss: 151.4249 - ctc-loss: 151.4249 - loss/mean_epoch: 151.4249 - CER: 1.0249
[...]
real 4m3.230s
user 29m56.892s
sys 3m35.661s
Training with GPU (NVIDIA RTX A5000, 24 GiB RAM)
The training process was interrupted manually after about 100 steps of the first epoch.
time calamari-train --device.gpus 0 --train PageXML --train.images "TrainingSet_ONB_Newseye_GT_M1+/*.tif" --val PageXML --val.images "ValidationSet_ONB_Newseye_GT_M1+/*.tif"
(--best_models_dir data/frak2021)
[...]
INFO 2022-05-23 18:41:38,742 tfaip.trainer.callbacks.logger: Start of epoch 1
2022-05-23 18:41:42.194654: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8204
2022-05-23 18:41:44.164328: I tensorflow/stream_executor/cuda/cuda_blas.cc:1760] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
106/2665 [>.............................] - ETA: 2:19 - loss: 154.5983 - ctc-loss: 154.5983 - loss/mean_epoch: 154.5983 - CER: 1.0163
[...]
real 3m25.564s
user 10m2.160s
sys 1m32.504s
Training with GPU (NVIDIA RTX A5000, 24 GiB RAM) – 2022-07-24
This training was run on a different server (ocr-02). Unlike the interrupted runs above, it continued until early stopping ended it after epoch 17, and it generated models.
time calamari-train --device.gpus 0 --train PageXML --train.images TrainingSet_ONB_Newseye_GT_M1+/*.tif --val PageXML --val.images ValidationSet_ONB_Newseye_GT_M1+/*.tif --trainer.output_dir calamari-model --trainer.epochs 999
INFO 2022-07-24 18:55:36,997 tfaip.util.logging: Logging to 'calamari-model/train.log'
INFO 2022-07-24 18:55:37,001 calamari_ocr.scripts.train: trainer_params={
"epochs": 999,
"current_epoch": 0,
[...]
"best_model_prefix": "best",
"network": null,
"__cls__": "calamari_ocr.ocr.training.params:TrainerParams"
}
INFO 2022-07-24 18:55:37,002 tfaip.device.device_config: Setting up device config DeviceConfigParams(gpus=[0], gpu_auto_tune=False, gpu_memory=None, soft_device_placement=True, dist_strategy=<DistributionStrategy.DEFAULT: 'default'>)
INFO 2022-07-24 18:55:37,919 tfaip.data.pipeline.datapipeli: Preloading: Converting training to raw pipeline.
INFO 2022-07-24 18:57:38,813 tfaip.data.pipeline.datapipeli: Preloading: Converting evaluation to raw pipeline.
INFO 2022-07-24 18:57:51,663 tfaip.data.pipeline.datapipeli: Preloading: Converting targets to raw pipeline.
INFO 2022-07-24 18:57:51,752 tfaip.data.pipeline.datapipeli: Preloading: Converting targets to raw pipeline.
INFO 2022-07-24 18:57:51,759 calamari_ocr.ocr.training.trai: CODEC: ['', ' ', '!', '#', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '|', '~', '§', '°', '±', '²', '³', '·', '¹', '¼', '½', '¾', 'Ä', 'È', 'É', 'Ô', 'Ö', 'Ü', 'ß', 'à', 'á', 'â', 'ä', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ï', 'ñ', 'ò', 'ó', 'ô', 'ö', 'û', 'ü', 'ō', 'ő', 'ř', 'Š', 'Ž', 'ſ', 'ɔ', 'ʃ', 'ʞ', '˙', '–', '—', '‘', '’', '‚', '“', '”', '„', '†', '•', '⁰', '⁴', '⁵', '⁶', '⁷', '⁸', '⁹', '₄', '₆', '₈', '⅐', '⅓', '⅔', '⅕', '⅖', '⅙', '⅚', '⅛', '⅜', '⅝', '⅞', '≅', '▲', '△', '◯', '◻', '◼', '☚', '☛', '✕', '✤', '⬤', '⸗', '⸫', 'ꝛ', '+', '-', '=']
INFO 2022-07-24 18:57:51,759 tfaip.data.pipeline.datapipeli: Preloading: Converting training to raw pipeline.
WARNING 2022-07-24 18:57:52,196 calamari_ocr.ocr.dataset.image: Invalid line (longer outputs than inputs) (id=TrainingSet_ONB_Newseye_GT_M1+/ONB_nfp_19110701_021/line_1548751721484_210)
WARNING 2022-07-24 18:57:52,584 calamari_ocr.ocr.dataset.image: Invalid line (longer outputs than inputs) (id=TrainingSet_ONB_Newseye_GT_M1+/ONB_ibn_19110701_022/line_1545920018915_126)
WARNING 2022-07-24 18:57:53,761 calamari_ocr.ocr.dataset.image: Invalid line (longer outputs than inputs) (id=TrainingSet_ONB_Newseye_GT_M1+/ONB_nfp_19110701_022/line_1548816280516_119)
WARNING 2022-07-24 18:57:53,928 calamari_ocr.ocr.dataset.image: Invalid line (longer outputs than inputs) (id=TrainingSet_ONB_Newseye_GT_M1+/ONB_nfp_18950706_012/line_1548398603794_77)
WARNING 2022-07-24 18:57:54,644 calamari_ocr.ocr.dataset.image: Invalid line (longer outputs than inputs) (id=TrainingSet_ONB_Newseye_GT_M1+/ONB_ibn_19110701_022/line_1545915105649_97)
[...]
INFO 2022-07-24 18:58:16,630 tfaip.scenario.scenariobase: Total params: 1,566,638
INFO 2022-07-24 18:58:16,631 tfaip.scenario.scenariobase: Trainable params: 1,566,637
INFO 2022-07-24 18:58:16,631 tfaip.scenario.scenariobase: Non-trainable params: 1
INFO 2022-07-24 18:58:16,631 tfaip.scenario.scenariobase: ______________________________________________________________________________________________________________________________________________________
INFO 2022-07-24 18:58:17,128 tfaip.trainer.callbacks.logger: Start of epoch 1
INFO 2022-07-24 19:00:24,074 tfaip.model.print_evaluate_lay: Printing evaluation results of 10 instances
INFO 2022-07-24 19:00:24,090 tfaip.model.print_evaluate_lay:
CER: 0.013333333333333334
PRED: 'bei Einſenkung des Sargs geſprochen werden. Jch will ruhen in dem ſchon vor'
TRUE: 'bei Einſenkung des Sargs geſprochen werden. Ich will ruhen in dem ſchon vor'
INFO 2022-07-24 19:00:24,103 tfaip.model.print_evaluate_lay:
CER: 0.0
PRED: 'Jahren gebauten Grab neben meiner verewigten Gemahlin Katharina, wie ich es'
TRUE: 'Jahren gebauten Grab neben meiner verewigten Gemahlin Katharina, wie ich es'
INFO 2022-07-24 19:00:24,117 tfaip.model.print_evaluate_lay:
CER: 0.024691358024691357
PRED: 'ihr verſprochen hatte. b. Die Eandestrauer wünſche ich auf 3 Monate beſchränkt zu'
TRUE: 'ihr verſprochen hatte. 6. Die Landestrauer wünſche ich auf 3 Monate beſchränkt zu'
INFO 2022-07-24 19:00:24,130 tfaip.model.print_evaluate_lay:
CER: 0.0
PRED: 'ſehen, und nur 10 Tage nach meinem Begräbniß ſoll mit den Glocken geläutet'
TRUE: 'ſehen, und nur 10 Tage nach meinem Begräbniß ſoll mit den Glocken geläutet'
INFO 2022-07-24 19:00:24,144 tfaip.model.print_evaluate_lay:
CER: 0.01282051282051282
PRED: 'werden, meine Perſonalien ebenſo einfach in den Kirchen geleſen werden. 7. Jch'
TRUE: 'werden, meine Perſonalien ebenſo einfach in den Kirchen geleſen werden. 7. Ich'
INFO 2022-07-24 19:00:24,156 tfaip.model.print_evaluate_lay:
CER: 0.012345679012345678
PRED: 'fterbe als wahrer Chriſt, verzeihe allen meinen Feinden, danke meiner Familie für'
TRUE: 'ſterbe als wahrer Chriſt, verzeihe allen meinen Feinden, danke meiner Familie für'
INFO 2022-07-24 19:00:24,162 tfaip.model.print_evaluate_lay:
CER: 0.04054054054054054
PRED: 'ire innige Eiebe, meinen Oienern vom Civil wie vom Militär für ihre treue'
TRUE: 'ihre innige Liebe, meinen Dienern vom Civil wie vom Militär für ihre treue'
INFO 2022-07-24 19:00:24,168 tfaip.model.print_evaluate_lay:
CER: 0.012048192771084338
PRED: 'Anhänglichkeit und Eifer in Erfüllung ihrer Pflchten, allen meinen Unterthanen für'
TRUE: 'Anhänglichkeit und Eifer in Erfüllung ihrer Pflichten, allen meinen Unterthanen für'
INFO 2022-07-24 19:00:24,174 tfaip.model.print_evaluate_lay:
CER: 0.036585365853658534
PRED: 'ihre Dreue und Gehorſam gegen die Geſete. Jch habe für die Einigkeit, Selbſtſtän⸗'
TRUE: 'ihre Treue und Gehorſam gegen die Geſetze. Ich habe für die Einigkeit, Selbſtſtän⸗'
INFO 2022-07-24 19:00:24,180 tfaip.model.print_evaluate_lay:
CER: 0.012658227848101266
PRED: 'digkeit, Ruhe von Deutſchland gelebt, mein Württemberg über Aüles geliebt. Heil'
TRUE: 'digkeit, Ruhe von Deutſchland gelebt, mein Württemberg über Alles geliebt. Heil'
INFO 2022-07-24 19:00:28,837 tfaip.trainer.callbacks.benchm: Benchmark results:
+-----------------+----------------------+---------------------+--------------------+----------------------+---------------------+-----------------------+
| Benchmark | Train Total | Train Batch | Train Sample | Test Total | Test Batch | Test Sample |
+-----------------+----------------------+---------------------+--------------------+----------------------+---------------------+-----------------------+
| AVG Count | 1 | 2665 | 3601 | 1 | 2665 | 3601 |
| AVG Time Per | 131.73861646652222 | 0.04716916299000466 | 0.0349085863283428 | 131.7386302947998 | 0.04716916299000466 | 0.0349085863283428 |
| AVG Per Second | 0.007590788690680707 | 21.200291389777345 | 28.646247390089385 | 0.007590787893894427 | 21.200291389777345 | 28.646247390089385 |
| Last Count | 1 | 2665 | 3601 | 1 | 226 | 3601 |
| Last Time Per | 131.70751333236694 | 0.04716916299000466 | 0.0349085863283428 | 5.72977089881897 | 0.02481046714613923 | 0.0015571134615460888 |
| Last Per Second | 0.007592581278764841 | 21.200291389777345 | 28.646247390089385 | 0.17452704788006826 | 40.305569182142975 | 642.2139585172426 |
+-----------------+----------------------+---------------------+--------------------+----------------------+---------------------+-----------------------+
INFO 2022-07-24 19:00:28,899 tfaip.trainer.callbacks.earlys: Better value of val_CER found. Old = None, Best = 0.06386524438858032
INFO 2022-07-24 19:00:28,922 tfaip.trainer.callbacks.logger: Results of epoch 1 CER: 0.2832 - ctc-loss: 41.7597 - loss: 41.7597 - loss/mean_epoch: 41.7597 - val_CER: 0.0639 - val_ctc-loss: 8.5366 - val_loss: 8.5366 - val_loss/mean_epoch: 8.5366
INFO 2022-07-24 19:00:28,937 tfaip.trainer.callbacks.logger: Start of epoch 2
[...]
INFO 2022-07-24 19:34:16,150 tfaip.trainer.callbacks.benchm: Benchmark results:
+-----------------+-----------------------+----------------------+----------------------+-----------------------+----------------------+-----------------------+
| Benchmark | Train Total | Train Batch | Train Sample | Test Total | Test Batch | Test Sample |
+-----------------+-----------------------+----------------------+----------------------+-----------------------+----------------------+-----------------------+
| AVG Count | 1 | 45305 | 61217 | 1 | 45305 | 61217 |
| AVG Time Per | 2159.0522861480713 | 0.045726970289074784 | 0.033841259600217805 | 2159.0522968769073 | 0.045726970289074784 | 0.033841259600217805 |
| AVG Per Second | 0.0004631661800947318 | 21.868931916508863 | 29.549727516453434 | 0.0004631661777931507 | 21.868931916508863 | 29.549727516453434 |
| Last Count | 1 | 2665 | 3601 | 1 | 226 | 3601 |
| Last Time Per | 126.70879006385803 | 0.04568098866246207 | 0.033807229876551353 | 4.817528009414673 | 0.02080420582695345 | 0.0013056791216027436 |
| Last Per Second | 0.007892112295413959 | 21.890944773306554 | 29.579471718077635 | 0.2075753369872985 | 48.06720373360386 | 765.8849586049004 |
+-----------------+-----------------------+----------------------+----------------------+-----------------------+----------------------+-----------------------+
INFO 2022-07-24 19:34:16,214 tfaip.trainer.callbacks.earlys: Early stopping progressed. (remaining iteration without improvement: 1)
INFO 2022-07-24 19:34:16,214 tfaip.trainer.callbacks.earlys: No better value of val_CER = 0.026566628366708755 found. Keeping best = 0.026074493303894997
INFO 2022-07-24 19:34:16,214 tfaip.trainer.callbacks.earlys: Early stopping. Reached number of maximum iterations without improvement (5 = 5
INFO 2022-07-24 19:34:16,220 tfaip.trainer.callbacks.logger: Results of epoch 17 CER: 0.0269 - ctc-loss: 3.9759 - loss: 3.9759 - loss/mean_epoch: 3.9759 - val_CER: 0.0266 - val_ctc-loss: 3.6376 - val_loss: 3.6376 - val_loss/mean_epoch: 3.6376
INFO 2022-07-24 19:34:16,221 tfaip.trainer.callbacks.benchm: Benchmark results:
+-----------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| Benchmark | Train Total | Train Batch | Train Sample | Test Total | Test Batch | Test Sample |
+-----------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
| AVG Count | 1 | 45305 | 61217 | 1 | 45305 | 61217 |
| AVG Time Per | 2159.123648405075 | 1.009313989384721e-06 | 5.528082003400657e-07 | 2159.123648405075 | 1.009313989384721e-06 | 5.528082003400657e-07 |
| AVG Per Second | 0.0004631508717616478 | 990771.960477434 | 1808945.6693747297 | 0.0004631508717616478 | 990771.960477434 | 1808945.6693747297 |
| Last Count | 1 | 2665 | 3601 | 1 | 226 | 3601 |
| Last Time Per | 126.70879006385803 | 0.04568098866246207 | 0.033807229876551353 | 4.817528009414673 | 0.02080420582695345 | 0.0013056791216027436 |
| Last Per Second | 0.007892112295413959 | 21.890944773306554 | 29.579471718077635 | 0.2075753369872985 | 48.06720373360386 | 765.8849586049004 |
+-----------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
INFO 2022-07-24 19:34:16,221 calamari_ocr.ocr.training.trai: Training finished
2022-07-24 19:34:16.383483: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]
real 38m41.824s
user 110m48.328s
sys 7m26.589s
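The CER values printed with each PRED/TRUE pair in the log above are consistent with the character edit distance between the two lines divided by the length of the TRUE line. The following minimal sketch reproduces that calculation for the first pair. It is an illustration under that assumption, not Calamari's own code, and the function names are hypothetical; Calamari may apply additional text normalization.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(pred, true):
    """Character error rate, assumed here to be edit distance / len(true)."""
    if not true:
        raise ValueError("ground truth line must not be empty")
    return edit_distance(pred, true) / len(true)


# First PRED/TRUE pair from the evaluation output above ('Jch' vs. 'Ich').
PRED = 'bei Einſenkung des Sargs geſprochen werden. Jch will ruhen in dem ſchon vor'
TRUE = 'bei Einſenkung des Sargs geſprochen werden. Ich will ruhen in dem ſchon vor'

print(edit_distance(PRED, TRUE), len(TRUE), cer(PRED, TRUE))
```

For this pair the sketch finds one substitution in a line of 75 characters, which matches the logged CER of about 0.0133.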
Training on macOS
Installation
Create the pip user configuration file ~/.config/pip/pip.conf with this content:
[global]
extra-index-url = https://digi.bib.uni-mannheim.de/pypi/
    https://download.pytorch.org/whl/nightly/cpu
Then install the required Python packages:
pip install calamari-ocr
pip install tensorflow-metal
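After installation it can be useful to check whether TensorFlow actually sees a GPU device before starting a run with --device.gpus 0. The following is a small sketch using TensorFlow's tf.config.list_physical_devices; the helper name visible_gpus is hypothetical, and the check is a suggestion, not part of the original setup.

```python
def visible_gpus():
    """Return the names of GPU devices visible to TensorFlow.

    Returns an empty list if TensorFlow is not installed.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"{len(gpus)} GPU device(s) visible to TensorFlow: {gpus}")
```

An empty list suggests that training would run on the CPU only.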
Training with CPU (MacBook Pro M1, 16 GiB RAM)
The training process was interrupted manually after about 100 steps of the first epoch.
time calamari-train --train PageXML --train.images "TrainingSet_ONB_Newseye_GT_M1+/*.tif" --val PageXML --val.images "ValidationSet_ONB_Newseye_GT_M1+/*.tif"
[...]
102/2665 [>.............................] - ETA: 22:57 - loss: 151.8406 - ctc-loss: 151.8406 - loss/mean_epoch: 151.8406 - CER: 1.0193
[...]
532,89s user 86,63s system 431% cpu 2:23,73 total
Training with GPU (MacBook Pro M1, 16 GiB RAM)
The training process was interrupted manually after about 100 steps of the first epoch.
time calamari-train --device.gpus 0 --train PageXML --train.images "TrainingSet_ONB_Newseye_GT_M1+/*.tif" --val PageXML --val.images "ValidationSet_ONB_Newseye_GT_M1+/*.tif"
(--best_models_dir data/frak2021)
[...]
INFO 2022-05-23 18:05:34,961 tfaip.trainer.callbacks.logger: Start of epoch 1
[...]
103/2665 [>.............................] - ETA: 48:22 - loss: 153.9839 - ctc-loss: 153.9839 - loss/mean_epoch: 153.9839 - CER: 1.0141
441,12s user 82,42s system 249% cpu 3:30,17 total
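Because the four runs above were interrupted early, their total wall-clock times are dominated by startup and data loading. A rough way to compare them is to divide the ETA in each progress line by the number of remaining steps. The sketch below does this for the progress lines copied from above (with the progress bar and metrics shortened). It assumes the ETA has the form minutes:seconds, and the result is only as good as the early estimate printed by the progress bar.

```python
import re

# Progress lines from the four interrupted runs above (shortened).
PROGRESS = {
    "Linux CPU (AMD EPYC 7502)": "100/2665 [>....] - ETA: 19:28 - loss: 151.4249",
    "Linux GPU (RTX A5000)": "106/2665 [>....] - ETA: 2:19 - loss: 154.5983",
    "macOS CPU (M1)": "102/2665 [>....] - ETA: 22:57 - loss: 151.8406",
    "macOS GPU (M1)": "103/2665 [>....] - ETA: 48:22 - loss: 153.9839",
}

PATTERN = re.compile(r"(\d+)/(\d+) .*? ETA: (\d+):(\d+)")


def seconds_per_step(line):
    """Estimate seconds per training step as ETA / remaining steps."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a progress line: {line!r}")
    done, total, minutes, seconds = map(int, match.groups())
    return (minutes * 60 + seconds) / (total - done)


if __name__ == "__main__":
    for name, line in PROGRESS.items():
        print(f"{name}: {seconds_per_step(line):.3f} s/step")
```

By this estimate the Linux GPU run needs roughly 0.05 s per step, compared with about 0.46 s on the Linux CPU, 0.54 s on the M1 CPU and 1.13 s on the M1 GPU.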
Training with AustrianNewspapers 2.0
Training on Linux
Training with GPU (NVIDIA RTX A5000, 24 GiB RAM) – 2023-04-26, ocr-02
GPU load: 41 %, GPU memory usage: 3.5 GiB, time per epoch: 2:15 min
cd data && nohup time calamari-train --device.gpus 0 --train PageXML --train.images TrainingSet_ONB_Newseye_GT_M1+/GT-PAGE/*.tif --val PageXML --val.images ValidationSet_ONB_Newseye_GT_M1+/GT-PAGE/*.tif --trainer.output_dir calamari-model --trainer.epochs 999 | tee calamari-training.log
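To follow a long run like this one, the validation CER of each epoch can be extracted from the training log. The sketch below assumes that the log contains "Results of epoch" lines in the same format as in the AustrianNewspapers 1.x run shown earlier. The function name val_cer_by_epoch is hypothetical, and the two sample lines are shortened copies from that earlier log.

```python
import re

EPOCH_RE = re.compile(r"Results of epoch (\d+)\b")
VAL_CER_RE = re.compile(r"\bval_CER: (\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def val_cer_by_epoch(lines):
    """Map epoch number to val_CER for every 'Results of epoch' line."""
    results = {}
    for line in lines:
        epoch = EPOCH_RE.search(line)
        val_cer = VAL_CER_RE.search(line)
        if epoch and val_cer:
            results[int(epoch.group(1))] = float(val_cer.group(1))
    return results


# Shortened sample lines taken from the 1.x training log above.
SAMPLE = [
    "INFO 2022-07-24 19:00:28,922 tfaip.trainer.callbacks.logger: Results of epoch 1 "
    "CER: 0.2832 - ctc-loss: 41.7597 - val_CER: 0.0639 - val_ctc-loss: 8.5366",
    "INFO 2022-07-24 19:34:16,220 tfaip.trainer.callbacks.logger: Results of epoch 17 "
    "CER: 0.0269 - ctc-loss: 3.9759 - val_CER: 0.0266 - val_ctc-loss: 3.6376",
]

if __name__ == "__main__":
    for epoch, value in sorted(val_cer_by_epoch(SAMPLE).items()):
        print(f"epoch {epoch}: val_CER {value}")
```

For a real run, the lines of the log file can be passed in place of SAMPLE, for example from the train.log file inside the directory given with --trainer.output_dir, assuming it is written as in the 1.x run.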