Tesseract - AudiovisualMetadataPlatform/amp_documentation GitHub Wiki
- About
- Tesseract is an optical character recognition (OCR) engine for various operating systems. It is free software released under the Apache License, Version 2.0, and its development has been sponsored by Google since 2006.
- It takes images as input and outputs the text found in them.
- It has been added as a tool in AMP's Galaxy, where it performs video OCR on input videos.
- This is achieved by running FFmpeg before Tesseract: the video is first passed through FFmpeg, which extracts image frames at 0.5-second intervals throughout the video, and these frames are then passed as input to Tesseract.
- The output of this composite video OCR tool is a JSON file containing the text and the corresponding bounding box information for each frame of the input.
- If the "dedupe" option is checked, the tool also generates an AMP OCRR JSON with duplicate frames removed, i.e. consecutive frames with the same text within the specified period.
- Source Code
- galaxy/tools/tesseract.xml: This is the configuration file that describes the tool's usage, inputs, outputs, version, and other details.
- galaxy/tools/run-tesseract.py: This is a Python wrapper that runs FFmpeg on the input video to create frames. The frames are then passed to Tesseract, which runs OCR and produces a JSON output containing all text predictions with their bounding box coordinates for every frame.
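The flow described above (FFmpeg extracts frames, Tesseract reads each frame) can be sketched roughly as follows. This is a minimal illustration, not the contents of run-tesseract.py: the function names, the frame file pattern, and the output field names (`start`, `objects`, `box`, `xmin`, and so on) are assumptions made for the example and may not match the actual AMP schema. Frame timestamps are approximated from the frame index.

```python
import subprocess
from pathlib import Path

FRAME_INTERVAL_S = 0.5  # sampling interval stated in the description above


def build_ffmpeg_cmd(video, frames_dir, interval_s=FRAME_INTERVAL_S):
    """Return an ffmpeg command that writes one frame every interval_s seconds."""
    fps = 1.0 / interval_s  # 0.5 s interval -> 2 frames per second
    pattern = str(Path(frames_dir) / "frame_%06d.jpg")  # hypothetical file pattern
    return ["ffmpeg", "-i", str(video), "-vf", f"fps={fps:g}", pattern]


def words_from_data(data, min_conf=0.0):
    """Convert a pytesseract image_to_data dict into text + bounding box records."""
    words = []
    for i, text in enumerate(data["text"]):
        # Tesseract reports conf -1 and empty text for non-word layout entries.
        if not text.strip() or float(data["conf"][i]) < min_conf:
            continue
        left, top = data["left"][i], data["top"][i]
        words.append({
            "text": text,
            "box": {  # hypothetical field names
                "xmin": left,
                "ymin": top,
                "xmax": left + data["width"][i],
                "ymax": top + data["height"][i],
            },
        })
    return words


def ocr_video(video, frames_dir):
    """Extract frames with FFmpeg, then run Tesseract on each frame."""
    # Imported here so the helpers above work without these packages installed.
    import pytesseract
    from PIL import Image

    Path(frames_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_cmd(video, frames_dir), check=True)

    frames = []
    for n, path in enumerate(sorted(Path(frames_dir).glob("frame_*.jpg"))):
        data = pytesseract.image_to_data(
            Image.open(path), output_type=pytesseract.Output.DICT
        )
        # Approximate timestamp: frame n is taken about n * interval seconds in.
        frames.append({"start": n * FRAME_INTERVAL_S, "objects": words_from_data(data)})
    return frames
```

Calling `ocr_video("input.mp4", "frames")` would require FFmpeg, Tesseract, pytesseract, and Pillow to be installed; the two helper functions can be exercised on their own.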
- Dependencies
- FFmpeg
- pytesseract
- tesseract-ocr
- libtesseract-dev
- Installations
- $ sudo apt-get install ffmpeg
- $ pip install pytesseract
- $ sudo apt install tesseract-ocr
- $ sudo apt install libtesseract-dev
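After installing, a quick check such as the following can confirm that the command-line tools are on the PATH and that pytesseract is importable. This is only a convenience sketch; the function name `check_dependencies` is made up for illustration, and it does not verify libtesseract-dev, which provides development headers rather than an executable.

```python
import importlib.util
import shutil


def check_dependencies():
    """Report which of the required tools appear to be available."""
    return {
        # Executables looked up on the PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "tesseract": shutil.which("tesseract") is not None,
        # Python package looked up without importing it.
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
    }


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'found' if found else 'missing'}")
```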
- Running
- The tool can be invoked from the Galaxy UI like any other tool. The user needs to supply the input data as a video file.
- Parameters
- input_video: The video file to run OCR on.
- dedupe: Whether to remove consecutive frames with the same text. Default: true.
- period: The time span, in seconds, within which consecutive frames with the same text are treated as duplicates. Default: 5 seconds.
- Output
- amp_vocr: The OCR output, containing all recognized text in each frame with its bounding boxes, along with other information such as frame rate and resolution.
- amp_vocr_dedupe: The AMP OCRR JSON with duplicate frames removed.
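One plausible reading of the dedupe and period parameters is sketched below: a frame is dropped when its text matches the most recently kept frame and it starts no more than `period` seconds after that frame. This interpretation, the function name `dedupe_frames`, and the frame layout (`start`, `texts`) are assumptions for illustration; the actual tool may define duplicates differently.

```python
def dedupe_frames(frames, period=5.0):
    """Drop consecutive frames whose text repeats within `period` seconds.

    `frames` is a list of dicts with hypothetical keys:
    "start" (seconds into the video) and "texts" (list of recognized strings).
    """
    kept = []
    for frame in frames:
        if kept:
            last = kept[-1]
            same_text = frame["texts"] == last["texts"]
            within_period = frame["start"] - last["start"] <= period
            if same_text and within_period:
                continue  # duplicate of the last kept frame, skip it
        kept.append(frame)
    return kept


if __name__ == "__main__":
    # Frames every 0.5 s showing the same caption for 10 seconds.
    frames = [{"start": i * 0.5, "texts": ["Opening title"]} for i in range(21)]
    print([f["start"] for f in dedupe_frames(frames, period=5.0)])
```

Under this reading, a caption that stays on screen longer than the period is kept once per period rather than once overall.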
More information about Tesseract is here.
Document generated by Confluence on Feb 25, 2025 10:39