Training with Calamari - UB-Mannheim/AustrianNewspapers GitHub Wiki

All training runs are started from the working directory of the AustrianNewspapers repository.

Trained models are available from https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/calamari/. This URL provides the results and models for the first training with AustrianNewspapers 1.x and the latest training with AustrianNewspapers 2.0.

Training with AustrianNewspapers 1.x

Training on Linux

Training with CPU (AMD EPYC 7502 32-Core Processor, 64 GiB RAM)

The training process was interrupted manually after 100 steps.

time calamari-train --train PageXML --train.images "TrainingSet_ONB_Newseye_GT_M1+/*.tif" --val PageXML --val.images "ValidationSet_ONB_Newseye_GT_M1+/*.tif"
[...]
INFO     2022-05-23 18:34:57,423 tfaip.trainer.callbacks.logger: Start of epoch    1
 100/2665 [>.............................] - ETA: 19:28 - loss: 151.4249 - ctc-loss: 151.4249 - loss/mean_epoch: 151.4249 - CER: 1.0249
[...]
real	4m3.230s
user	29m56.892s
sys	3m35.661s
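Before starting a long run, it can help to confirm that the glob passed to --train.images and --val.images matches the expected files. The following is a minimal sketch and not part of Calamari. It assumes that the PageXML file for each image has the same base name with the extension .xml and sits in the same directory, which should be checked against the actual data layout. The function name check_pairs is illustrative.

```python
from pathlib import Path


def check_pairs(image_glob: str, root: str = ".") -> tuple[int, list[str]]:
    """Count images matching image_glob and list those without a sibling .xml.

    Assumption: the PageXML for image `x.tif` is stored as `x.xml` in the
    same directory. Adjust this if the ground truth is laid out differently.
    """
    images = sorted(Path(root).glob(image_glob))
    missing = [str(p) for p in images if not p.with_suffix(".xml").exists()]
    return len(images), missing


if __name__ == "__main__":
    # Same patterns as in the calamari-train call above.
    for pattern in ("TrainingSet_ONB_Newseye_GT_M1+/*.tif",
                    "ValidationSet_ONB_Newseye_GT_M1+/*.tif"):
        count, missing = check_pairs(pattern)
        print(f"{pattern}: {count} images, {len(missing)} without PageXML")
```

Run from the working directory, this prints one line per data set. A count of 0 usually means the script was started from the wrong directory.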

Training with GPU (NVIDIA RTX A5000, 24 GiB RAM)

The training process was interrupted manually after about 100 steps.

time calamari-train  --device.gpus 0 --train PageXML --train.images "TrainingSet_ONB_Newseye_GT_M1+/*.tif" --val PageXML --val.images "ValidationSet_ONB_Newseye_GT_M1+/*.tif"
(--best_models_dir data/frak2021)
[...]
INFO     2022-05-23 18:41:38,742 tfaip.trainer.callbacks.logger: Start of epoch    1
2022-05-23 18:41:42.194654: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8204
2022-05-23 18:41:44.164328: I tensorflow/stream_executor/cuda/cuda_blas.cc:1760] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
 106/2665 [>.............................] - ETA: 2:19 - loss: 154.5983 - ctc-loss: 154.5983 - loss/mean_epoch: 154.5983 - CER: 1.0163
[...]
real	3m25.564s
user	10m2.160s
sys	1m32.504s
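The CPU and GPU runs are easiest to compare through the per-epoch ETA printed by the progress bar, because the wall-clock times reported by time appear to include start-up work such as data preloading. The sketch below converts the two ETA strings from the logs above to seconds and computes their ratio. The helper name eta_seconds is illustrative, and since both ETAs are estimates taken after roughly 100 steps, the ratio is only a rough indication.

```python
def eta_seconds(eta: str) -> int:
    """Convert a progress-bar ETA such as '19:28' or '1:02:03' to seconds."""
    seconds = 0
    for part in eta.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


cpu_eta = eta_seconds("19:28")  # CPU run, at step 100 of 2665
gpu_eta = eta_seconds("2:19")   # GPU run, at step 106 of 2665

print(f"CPU: {cpu_eta} s, GPU: {gpu_eta} s, ratio: {cpu_eta / gpu_eta:.1f}")
```

With these values the estimated epoch time on the GPU is about 8.4 times shorter than on the CPU.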

Training with GPU (NVIDIA RTX A5000, 24 GiB RAM) – 2022-07-24

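This training ran on a different server (ocr-02). Unlike the interrupted runs above, it continued until early stopping ended it after epoch 17, and it wrote models to calamari-model.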

time calamari-train --device.gpus 0 --train PageXML --train.images TrainingSet_ONB_Newseye_GT_M1+/*.tif --val PageXML --val.images ValidationSet_ONB_Newseye_GT_M1+/*.tif --trainer.output_dir calamari-model --trainer.epochs 999
INFO     2022-07-24 18:55:36,997             tfaip.util.logging: Logging to 'calamari-model/train.log'
INFO     2022-07-24 18:55:37,001     calamari_ocr.scripts.train: trainer_params={
  "epochs": 999,
  "current_epoch": 0,
[...]
  "best_model_prefix": "best",
  "network": null,
  "__cls__": "calamari_ocr.ocr.training.params:TrainerParams"
}
INFO     2022-07-24 18:55:37,002     tfaip.device.device_config: Setting up device config DeviceConfigParams(gpus=[0], gpu_auto_tune=False, gpu_memory=None, soft_device_placement=True, dist_strategy=<DistributionStrategy.DEFAULT: 'default'>)
INFO     2022-07-24 18:55:37,919 tfaip.data.pipeline.datapipeli: Preloading: Converting training to raw pipeline.
INFO     2022-07-24 18:57:38,813 tfaip.data.pipeline.datapipeli: Preloading: Converting evaluation to raw pipeline.
INFO     2022-07-24 18:57:51,663 tfaip.data.pipeline.datapipeli: Preloading: Converting targets to raw pipeline.
INFO     2022-07-24 18:57:51,752 tfaip.data.pipeline.datapipeli: Preloading: Converting targets to raw pipeline.
INFO     2022-07-24 18:57:51,759 calamari_ocr.ocr.training.trai: CODEC: ['', ' ', '!', '#', '%', '&', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '=', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '|', '~', '§', '°', '±', '²', '³', '·', '¹', '¼', '½', '¾', 'Ä', 'È', 'É', 'Ô', 'Ö', 'Ü', 'ß', 'à', 'á', 'â', 'ä', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ï', 'ñ', 'ò', 'ó', 'ô', 'ö', 'û', 'ü', 'ō', 'ő', 'ř', 'Š', 'Ž', 'ſ', 'ɔ', 'ʃ', 'ʞ', '˙', '–', '—', '‘', '’', '‚', '“', '”', '„', '†', '•', '⁰', '⁴', '⁵', '⁶', '⁷', '⁸', '⁹', '₄', '₆', '₈', '⅐', '⅓', '⅔', '⅕', '⅖', '⅙', '⅚', '⅛', '⅜', '⅝', '⅞', '≅', '▲', '△', '◯', '◻', '◼', '☚', '☛', '✕', '✤', '⬤', '⸗', '⸫', 'ꝛ', '＋', '－', '＝']
INFO     2022-07-24 18:57:51,759 tfaip.data.pipeline.datapipeli: Preloading: Converting training to raw pipeline.
WARNING  2022-07-24 18:57:52,196 calamari_ocr.ocr.dataset.image: Invalid line (longer outputs than inputs) (id=TrainingSet_ONB_Newseye_GT_M1+/ONB_nfp_19110701_021/line_1548751721484_210)
WARNING  2022-07-24 18:57:52,584 calamari_ocr.ocr.dataset.image: Invalid line (longer outputs than inputs) (id=TrainingSet_ONB_Newseye_GT_M1+/ONB_ibn_19110701_022/line_1545920018915_126)
WARNING  2022-07-24 18:57:53,761 calamari_ocr.ocr.dataset.image: Invalid line (longer outputs than inputs) (id=TrainingSet_ONB_Newseye_GT_M1+/ONB_nfp_19110701_022/line_1548816280516_119)
WARNING  2022-07-24 18:57:53,928 calamari_ocr.ocr.dataset.image: Invalid line (longer outputs than inputs) (id=TrainingSet_ONB_Newseye_GT_M1+/ONB_nfp_18950706_012/line_1548398603794_77)
WARNING  2022-07-24 18:57:54,644 calamari_ocr.ocr.dataset.image: Invalid line (longer outputs than inputs) (id=TrainingSet_ONB_Newseye_GT_M1+/ONB_ibn_19110701_022/line_1545915105649_97)
[...]
INFO     2022-07-24 18:58:16,630    tfaip.scenario.scenariobase: Total params: 1,566,638
INFO     2022-07-24 18:58:16,631    tfaip.scenario.scenariobase: Trainable params: 1,566,637
INFO     2022-07-24 18:58:16,631    tfaip.scenario.scenariobase: Non-trainable params: 1
INFO     2022-07-24 18:58:16,631    tfaip.scenario.scenariobase: ______________________________________________________________________________________________________________________________________________________
INFO     2022-07-24 18:58:17,128 tfaip.trainer.callbacks.logger: Start of epoch    1
INFO     2022-07-24 19:00:24,074 tfaip.model.print_evaluate_lay: Printing evaluation results of 10 instances
INFO     2022-07-24 19:00:24,090 tfaip.model.print_evaluate_lay: 
  CER:  0.013333333333333334
  PRED: '‪bei Einſenkung des Sargs geſprochen werden. Jch will ruhen in dem ſchon vor‬'
  TRUE: '‪bei Einſenkung des Sargs geſprochen werden. Ich will ruhen in dem ſchon vor‬'
INFO     2022-07-24 19:00:24,103 tfaip.model.print_evaluate_lay: 
  CER:  0.0
  PRED: '‪Jahren gebauten Grab neben meiner verewigten Gemahlin Katharina, wie ich es‬'
  TRUE: '‪Jahren gebauten Grab neben meiner verewigten Gemahlin Katharina, wie ich es‬'
INFO     2022-07-24 19:00:24,117 tfaip.model.print_evaluate_lay: 
  CER:  0.024691358024691357
  PRED: '‪ihr verſprochen hatte. b. Die Eandestrauer wünſche ich auf 3 Monate beſchränkt zu‬'
  TRUE: '‪ihr verſprochen hatte. 6. Die Landestrauer wünſche ich auf 3 Monate beſchränkt zu‬'
INFO     2022-07-24 19:00:24,130 tfaip.model.print_evaluate_lay: 
  CER:  0.0
  PRED: '‪ſehen, und nur 10 Tage nach meinem Begräbniß ſoll mit den Glocken geläutet‬'
  TRUE: '‪ſehen, und nur 10 Tage nach meinem Begräbniß ſoll mit den Glocken geläutet‬'
INFO     2022-07-24 19:00:24,144 tfaip.model.print_evaluate_lay: 
  CER:  0.01282051282051282
  PRED: '‪werden, meine Perſonalien ebenſo einfach in den Kirchen geleſen werden. 7. Jch‬'
  TRUE: '‪werden, meine Perſonalien ebenſo einfach in den Kirchen geleſen werden. 7. Ich‬'
INFO     2022-07-24 19:00:24,156 tfaip.model.print_evaluate_lay: 
  CER:  0.012345679012345678
  PRED: '‪fterbe als wahrer Chriſt, verzeihe allen meinen Feinden, danke meiner Familie für‬'
  TRUE: '‪ſterbe als wahrer Chriſt, verzeihe allen meinen Feinden, danke meiner Familie für‬'
INFO     2022-07-24 19:00:24,162 tfaip.model.print_evaluate_lay: 
  CER:  0.04054054054054054
  PRED: '‪ire innige Eiebe, meinen Oienern vom Civil wie vom Militär für ihre treue‬'
  TRUE: '‪ihre innige Liebe, meinen Dienern vom Civil wie vom Militär für ihre treue‬'
INFO     2022-07-24 19:00:24,168 tfaip.model.print_evaluate_lay: 
  CER:  0.012048192771084338
  PRED: '‪Anhänglichkeit und Eifer in Erfüllung ihrer Pflchten, allen meinen Unterthanen für‬'
  TRUE: '‪Anhänglichkeit und Eifer in Erfüllung ihrer Pflichten, allen meinen Unterthanen für‬'
INFO     2022-07-24 19:00:24,174 tfaip.model.print_evaluate_lay: 
  CER:  0.036585365853658534
  PRED: '‪ihre Dreue und Gehorſam gegen die Geſete. Jch habe für die Einigkeit, Selbſtſtän⸗‬'
  TRUE: '‪ihre Treue und Gehorſam gegen die Geſetze. Ich habe für die Einigkeit, Selbſtſtän⸗‬'
INFO     2022-07-24 19:00:24,180 tfaip.model.print_evaluate_lay: 
  CER:  0.012658227848101266
  PRED: '‪digkeit, Ruhe von Deutſchland gelebt, mein Württemberg über Aüles geliebt. Heil‬'
  TRUE: '‪digkeit, Ruhe von Deutſchland gelebt, mein Württemberg über Alles geliebt. Heil‬'
INFO     2022-07-24 19:00:28,837 tfaip.trainer.callbacks.benchm: Benchmark results:
+-----------------+----------------------+---------------------+--------------------+----------------------+---------------------+-----------------------+
|    Benchmark    |     Train Total      |     Train Batch     |    Train Sample    |      Test Total      |      Test Batch     |      Test Sample      |
+-----------------+----------------------+---------------------+--------------------+----------------------+---------------------+-----------------------+
|    AVG Count    |          1           |         2665        |        3601        |          1           |         2665        |          3601         |
|   AVG Time Per  |  131.73861646652222  | 0.04716916299000466 | 0.0349085863283428 |  131.7386302947998   | 0.04716916299000466 |   0.0349085863283428  |
|  AVG Per Second | 0.007590788690680707 |  21.200291389777345 | 28.646247390089385 | 0.007590787893894427 |  21.200291389777345 |   28.646247390089385  |
|    Last Count   |          1           |         2665        |        3601        |          1           |         226         |          3601         |
|  Last Time Per  |  131.70751333236694  | 0.04716916299000466 | 0.0349085863283428 |   5.72977089881897   | 0.02481046714613923 | 0.0015571134615460888 |
| Last Per Second | 0.007592581278764841 |  21.200291389777345 | 28.646247390089385 | 0.17452704788006826  |  40.305569182142975 |   642.2139585172426   |
+-----------------+----------------------+---------------------+--------------------+----------------------+---------------------+-----------------------+
INFO     2022-07-24 19:00:28,899 tfaip.trainer.callbacks.earlys: Better value of val_CER found. Old = None, Best = 0.06386524438858032
INFO     2022-07-24 19:00:28,922 tfaip.trainer.callbacks.logger: Results of epoch    1 CER: 0.2832 - ctc-loss: 41.7597 - loss: 41.7597 - loss/mean_epoch: 41.7597 - val_CER: 0.0639 - val_ctc-loss: 8.5366 - val_loss: 8.5366 - val_loss/mean_epoch: 8.5366
INFO     2022-07-24 19:00:28,937 tfaip.trainer.callbacks.logger: Start of epoch    2
[...]
INFO     2022-07-24 19:34:16,150 tfaip.trainer.callbacks.benchm: Benchmark results:
+-----------------+-----------------------+----------------------+----------------------+-----------------------+----------------------+-----------------------+
|    Benchmark    |      Train Total      |     Train Batch      |     Train Sample     |       Test Total      |      Test Batch      |      Test Sample      |
+-----------------+-----------------------+----------------------+----------------------+-----------------------+----------------------+-----------------------+
|    AVG Count    |           1           |        45305         |        61217         |           1           |        45305         |         61217         |
|   AVG Time Per  |   2159.0522861480713  | 0.045726970289074784 | 0.033841259600217805 |   2159.0522968769073  | 0.045726970289074784 |  0.033841259600217805 |
|  AVG Per Second | 0.0004631661800947318 |  21.868931916508863  |  29.549727516453434  | 0.0004631661777931507 |  21.868931916508863  |   29.549727516453434  |
|    Last Count   |           1           |         2665         |         3601         |           1           |         226          |          3601         |
|  Last Time Per  |   126.70879006385803  | 0.04568098866246207  | 0.033807229876551353 |   4.817528009414673   | 0.02080420582695345  | 0.0013056791216027436 |
| Last Per Second |  0.007892112295413959 |  21.890944773306554  |  29.579471718077635  |   0.2075753369872985  |  48.06720373360386   |   765.8849586049004   |
+-----------------+-----------------------+----------------------+----------------------+-----------------------+----------------------+-----------------------+
INFO     2022-07-24 19:34:16,214 tfaip.trainer.callbacks.earlys: Early stopping progressed. (remaining iteration without improvement: 1)
INFO     2022-07-24 19:34:16,214 tfaip.trainer.callbacks.earlys: No better value of val_CER = 0.026566628366708755 found. Keeping best = 0.026074493303894997
INFO     2022-07-24 19:34:16,214 tfaip.trainer.callbacks.earlys: Early stopping. Reached number of maximum iterations without improvement (5 = 5
INFO     2022-07-24 19:34:16,220 tfaip.trainer.callbacks.logger: Results of epoch   17 CER: 0.0269 - ctc-loss: 3.9759 - loss: 3.9759 - loss/mean_epoch: 3.9759 - val_CER: 0.0266 - val_ctc-loss: 3.6376 - val_loss: 3.6376 - val_loss/mean_epoch: 3.6376
INFO     2022-07-24 19:34:16,221 tfaip.trainer.callbacks.benchm: Benchmark results:
+-----------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
|    Benchmark    |      Train Total      |      Train Batch      |      Train Sample     |       Test Total      |       Test Batch      |      Test Sample      |
+-----------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
|    AVG Count    |           1           |         45305         |         61217         |           1           |         45305         |         61217         |
|   AVG Time Per  |   2159.123648405075   | 1.009313989384721e-06 | 5.528082003400657e-07 |   2159.123648405075   | 1.009313989384721e-06 | 5.528082003400657e-07 |
|  AVG Per Second | 0.0004631508717616478 |    990771.960477434   |   1808945.6693747297  | 0.0004631508717616478 |    990771.960477434   |   1808945.6693747297  |
|    Last Count   |           1           |          2665         |          3601         |           1           |          226          |          3601         |
|  Last Time Per  |   126.70879006385803  |  0.04568098866246207  |  0.033807229876551353 |   4.817528009414673   |  0.02080420582695345  | 0.0013056791216027436 |
| Last Per Second |  0.007892112295413959 |   21.890944773306554  |   29.579471718077635  |   0.2075753369872985  |   48.06720373360386   |   765.8849586049004   |
+-----------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+-----------------------+
INFO     2022-07-24 19:34:16,221 calamari_ocr.ocr.training.trai: Training finished
2022-07-24 19:34:16.383483: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
         [[{{node PyFunc}}]]

real    38m41.824s
user    110m48.328s
sys     7m26.589s
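The CER values in the evaluation printout above are consistent with the edit distance between prediction and ground truth, divided by the length of the ground truth. The sketch below is a plain Levenshtein implementation and not Calamari's own code. It reproduces the value for the first printed line, where a single substitution (J for I) in 75 characters gives 1/75. The function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def cer(pred: str, true: str) -> float:
    """Character error rate relative to the ground truth length."""
    if not true:
        raise ValueError("ground truth must not be empty")
    return levenshtein(pred, true) / len(true)


pred = "bei Einſenkung des Sargs geſprochen werden. Jch will ruhen in dem ſchon vor"
true = "bei Einſenkung des Sargs geſprochen werden. Ich will ruhen in dem ſchon vor"
print(cer(pred, true))
```

The strings are copied from the log without the invisible directional marks that surround them there.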

Training on macOS

Installation

Create a pip user configuration file ~/.config/pip/pip.conf with this content:

[global]
extra-index-url = https://digi.bib.uni-mannheim.de/pypi/
                  https://download.pytorch.org/whl/nightly/cpu

Then install the required Python packages:

pip install calamari-ocr
pip install tensorflow-metal
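After installation, a quick check whether TensorFlow sees a GPU device can avoid a failed run with --device.gpus 0. The sketch below uses tf.config.list_physical_devices from TensorFlow's public API. Whether tensorflow-metal actually registers the Apple GPU depends on the installed versions, so the result is only a hint. The function name visible_gpus is illustrative, and the function returns an empty list if TensorFlow is not installed.

```python
def visible_gpus() -> list[str]:
    """Return the names of the GPU devices that TensorFlow reports."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"{len(gpus)} GPU device(s) visible: {gpus}")
```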

Training with CPU (MacBook Pro M1, 16 GiB RAM)

The training process was interrupted manually after about 100 steps.

time calamari-train --train PageXML --train.images "TrainingSet_ONB_Newseye_GT_M1+/*.tif" --val PageXML --val.images "ValidationSet_ONB_Newseye_GT_M1+/*.tif"
[...]
 102/2665 [>.............................] - ETA: 22:57 - loss: 151.8406 - ctc-loss: 151.8406 - loss/mean_epoch: 151.8406 - CER: 1.0193
[...]
532,89s user 86,63s system 431% cpu 2:23,73 total

Training with GPU (MacBook Pro M1, 16 GiB RAM)

The training process was interrupted manually after about 100 steps.

time calamari-train  --device.gpus 0 --train PageXML --train.images "TrainingSet_ONB_Newseye_GT_M1+/*.tif" --val PageXML --val.images "ValidationSet_ONB_Newseye_GT_M1+/*.tif"
(--best_models_dir data/frak2021)
[...]
INFO     2022-05-23 18:05:34,961 tfaip.trainer.callbacks.logger: Start of epoch    1
[...]
 103/2665 [>.............................] - ETA: 48:22 - loss: 153.9839 - ctc-loss: 153.9839 - loss/mean_epoch: 153.9839 - CER: 1.0141
441,12s user 82,42s system 249% cpu 3:30,17 total
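The time lines of the two macOS runs use the zsh format with a decimal comma. The sketch below parses such a line and recomputes the CPU percentage as (user + system) / total, which matches the printed 431 % and 249 %. The function name parse_zsh_time is illustrative, and the regular expression only covers the format seen in these two lines.

```python
import re


def parse_zsh_time(line: str) -> dict[str, float]:
    """Parse a zsh `time` summary line that uses a decimal comma."""
    m = re.search(
        r"([\d,]+)s user ([\d,]+)s system (\d+)% cpu ([\d:,]+) total", line)
    if m is None:
        raise ValueError(f"unrecognised time line: {line!r}")

    def to_float(text: str) -> float:
        return float(text.replace(",", "."))

    # The total is printed as [h:]m:s, so fold it into seconds.
    total = 0.0
    for part in m.group(4).split(":"):
        total = total * 60 + to_float(part)
    return {"user": to_float(m.group(1)), "system": to_float(m.group(2)),
            "cpu_percent": float(m.group(3)), "total": total}


runs = {
    "CPU": "532,89s user 86,63s system 431% cpu 2:23,73 total",
    "GPU": "441,12s user 82,42s system 249% cpu 3:30,17 total",
}
for name, line in runs.items():
    t = parse_zsh_time(line)
    busy_cores = (t["user"] + t["system"]) / t["total"]
    print(f"{name}: {t['total']:.2f} s wall clock, {busy_cores:.2f} cores busy")
```

The GPU run kept fewer CPU cores busy (about 2.5 instead of 4.3), but its progress bar reported a longer ETA (48:22 instead of 22:57), so in these first 100 steps the GPU was not faster than the CPU on this machine.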

Training with AustrianNewspapers 2.0

Training on Linux

Training with GPU (NVIDIA RTX A5000, 24 GiB RAM) – 2023-04-26, ocr-02

GPU load: 41 %, GPU memory: 3.5 GiB, time per epoch: 2:15 min.

cd data && nohup time calamari-train --device.gpus 0 --train PageXML --train.images TrainingSet_ONB_Newseye_GT_M1+/GT-PAGE/*.tif --val PageXML --val.images ValidationSet_ONB_Newseye_GT_M1+/GT-PAGE/*.tif --trainer.output_dir calamari-model --trainer.epochs 999 | tee calamari-training.log
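With --trainer.epochs 999, the run is presumably meant to end through early stopping rather than by reaching the epoch limit, as happened in the 2022-07-24 log, where training stopped after epoch 17 once 5 evaluations had passed without a better val_CER. The sketch below illustrates a generic patience-based rule. It is not the tfaip implementation, and the val_CER series is made up for illustration; only the patience of 5 is taken from the log.

```python
def stopping_epoch(val_cers, patience=5):
    """Return the 1-based epoch at which patience-based early stopping triggers.

    Returns None if the series ends before the patience is used up.
    """
    best = None
    without_improvement = 0
    for epoch, value in enumerate(val_cers, start=1):
        if best is None or value < best:
            best = value
            without_improvement = 0
        else:
            without_improvement += 1
            if without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation CERs, not taken from the training log.
history = [0.064, 0.040, 0.031, 0.029, 0.030, 0.028,
           0.0285, 0.029, 0.0281, 0.0283, 0.029]
print(stopping_epoch(history))
```

In this made-up series the best value occurs in epoch 6, and the rule stops five epochs later.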