The library behind the elaboration of documents: PySIC - documenti-aperti/documenti_aperti GitHub Wiki

Python simple image cropper (PySIC)

We built this library and installed it on the Documenti Aperti site to process the documents hosted there. Images belonging to a document go in the ./data folder. The files produced by processing those images are written to the ./output folder, which contains two subfolders, one per file type: ./output/out_hocr for .hocr files and ./output/out_pdf for .pdf files.

Requirements

A Python 3 environment (pip3, setuptools, ...)

sudo apt-get install libpng-dev libjpeg-dev libtiff-dev zlib1g-dev
sudo apt-get install gcc g++
sudo apt-get install autoconf automake libtool checkinstall

wget http://www.leptonica.org/source/leptonica-1.76.0.tar.gz
tar -zxvf leptonica-1.76.0.tar.gz
cd leptonica-1.76.0
./configure
make
sudo checkinstall
sudo ldconfig

Tesseract (> 3.05)

sudo apt-get install git
git clone https://github.com/tesseract-ocr/tesseract
cd tesseract
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
git clone https://github.com/tesseract-ocr/tessdata.git ~/tessdata
sudo mv ~/tessdata/* /usr/local/share/tessdata/

Alternatively, on Ubuntu you can install Tesseract 4 from a PPA:

sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt-get install tesseract-ocr

If you want Tesseract to handle every supported language:

sudo apt-get install tesseract-ocr-all

Or, if you want to install only specific languages such as English and Italian:

sudo apt-get install tesseract-ocr-eng tesseract-ocr-ita

How to install pySIC on Ubuntu 16.04

For the Python dependencies, see the requirements.txt file, or run the following on the command line:

git clone https://github.com/edoaxyz/image_to_hOCR.git
cd image_to_hOCR
sudo apt-get install libxml2-dev libxslt1-dev
sudo python3 setup.py install

Put your images or scans in the ./data directory and test it:

import pySIC
pySIC.elaborate("merge")
pySIC.elaborate("merge_ocr",ocr=True,lang='ita')

Aims

The objective was to create a script that helps users digitize a book. Users only need to scan the pages, even with a portable scanner, on a high-contrast background. With a simple method, the script finds where the color changes along the axes, from the top and from the bottom, then combines the data to build a rectangle that should contain the page. It then crops the page with a jump (to improve the output).

Colors & Algorithm

Cropping

To define a rectangle you only need two points: in this case I've chosen the top-left one and the bottom-right one. So I need to find four coordinates (two abscissae and two ordinates). One way to solve this problem is to scan the image four times, pixel by pixel, looking for the first substantial color change: once from top to bottom, once from left to right, once from bottom to top, and once from right to left. The colors are stored in BGR format because the image is loaded with the OpenCV library.

Old

A color change is measured with the Euclidean distance between two pixels: distance = \sqrt{(R_0 - R_1)^2 + (B_0 - B_1)^2 + (G_0 - G_1)^2}. If the distance is greater than a given threshold P, I can confidently say that there was a change. This version also added the rotation algorithm, which recognizes whether an image is skewed.
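The distance test can be sketched in a few lines of Python. This is only an illustration: the function names and the default threshold value are assumptions, not taken from the pySIC source.

```python
import math

def color_distance(c0, c1):
    """Euclidean distance between two pixels given as (B, G, R) sequences."""
    # int() avoids wrap-around when the channels are numpy uint8 values.
    return math.sqrt(sum((int(a) - int(b)) ** 2 for a, b in zip(c0, c1)))

def is_color_change(c0, c1, threshold=60):
    # `threshold` plays the role of P; 60 is an arbitrary illustrative value.
    return color_distance(c0, c1) > threshold
```

Since the distance is symmetric in the three channels, the BGR ordering used by OpenCV does not affect the result.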

New

The script now does the same thing on a grayscale image, so the distance becomes a simple difference between two intensity values.
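The four scans described under Cropping might look roughly like the sketch below in the grayscale case. It is a simplification under stated assumptions: it compares row and column averages instead of individual pixels, and the names `first_change`, `find_page_box` and the threshold value are hypothetical rather than part of pySIC.

```python
import numpy as np

def first_change(profile, threshold):
    """Index of the first sample that differs from the first one by more than threshold."""
    start = float(profile[0])
    for i, value in enumerate(profile):
        if abs(float(value) - start) > threshold:
            return i
    return None  # no substantial change found

def find_page_box(gray, threshold=40):
    """Return (left, top, right, bottom) of the page in a 2-D grayscale array.

    right and bottom are inclusive; returns None if any scan finds no change.
    """
    rows = gray.mean(axis=1)  # one average per row, used for the vertical scans
    cols = gray.mean(axis=0)  # one average per column, used for the horizontal scans
    top = first_change(rows, threshold)            # top to bottom
    left = first_change(cols, threshold)           # left to right
    from_bottom = first_change(rows[::-1], threshold)  # bottom to top
    from_right = first_change(cols[::-1], threshold)   # right to left
    if None in (top, left, from_bottom, from_right):
        return None
    bottom = len(rows) - 1 - from_bottom
    right = len(cols) - 1 - from_right
    return left, top, right, bottom
```

The top-left point is then `(left, top)` and the bottom-right point is `(right, bottom)`, which matches the two-point definition of the rectangle.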

Rotating

The script also finds the image rotation with the "projection profile method". It tries several candidate angles and keeps the best one, which lets it detect the skew and correct it. It loads the image as an inverted (negative) grayscale image and uses scipy and numpy sums to decide which angle is the correct one.
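A minimal sketch of the projection profile idea follows. It assumes the input is an inverted grayscale array (text bright, background dark); the scoring function, the angle range and the step are illustrative choices, not the values used by pySIC.

```python
import numpy as np
from scipy import ndimage

def profile_score(binary, angle):
    """Score an angle: horizontal text lines give a row-sum profile with sharp jumps."""
    rotated = ndimage.rotate(binary, angle, reshape=False, order=0)
    profile = rotated.sum(axis=1)  # one sum per row
    return float(np.sum((profile[1:] - profile[:-1]) ** 2))

def estimate_skew(inverted_gray, limit=5.0, step=0.5):
    """Return the rotation (in degrees) that best straightens the image."""
    # Binarize so that text pixels count as 1 and background as 0.
    binary = (inverted_gray > 127).astype(np.float64)
    angles = np.arange(-limit, limit + step, step)
    scores = [profile_score(binary, a) for a in angles]
    return float(angles[int(np.argmax(scores))])
```

The returned value is the correction to apply, so passing it to `ndimage.rotate` on the original image should straighten the page.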

Reading

The reading script processes the images cropped by the cropper script and sends them to Tesseract to get an hOCR file, which contains the text boxes together with their coordinates.
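One way to obtain an hOCR file is to call the tesseract command-line tool with its `hocr` config. The sketch below shows that approach; it is an assumption about how the step could be done, not a description of how pySIC calls Tesseract, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path

def hocr_command(image_path, out_dir="./output/out_hocr", lang="ita"):
    """Build the tesseract command that writes <out_dir>/<image name>.hocr."""
    out_base = Path(out_dir) / Path(image_path).stem  # tesseract appends ".hocr"
    return ["tesseract", str(image_path), str(out_base), "-l", lang, "hocr"]

def run_hocr(image_path, out_dir="./output/out_hocr", lang="ita"):
    """Run tesseract on one cropped image; raises if tesseract fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(hocr_command(image_path, out_dir, lang), check=True)
```

Keeping the command construction separate from the call makes it easy to inspect the command before running it.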

Debugging mode

I added a debug mode where you can see the algorithm results.

Accepted extensions

Practically all the extensions supported by OpenCV: ".jpeg", ".jpg", ".png", ".tif", ".tiff", ".bmp", ".dib", ".jpe", ".jp2", ".webp", ".pbm", ".pgm", ".ppm", ".sr", ".ras"
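As an illustration, the files in ./data could be filtered against this list as follows. The helper names are hypothetical, and matching the extension case-insensitively is an assumption.

```python
from pathlib import Path

ACCEPTED = {
    ".jpeg", ".jpg", ".png", ".tif", ".tiff", ".bmp", ".dib", ".jpe",
    ".jp2", ".webp", ".pbm", ".pgm", ".ppm", ".sr", ".ras",
}

def is_accepted(name):
    """True if the file name ends with one of the accepted extensions."""
    return Path(name).suffix.lower() in ACCEPTED

def list_images(folder="./data"):
    """Sorted list of the accepted image files directly inside `folder`."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and is_accepted(p.name))
```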

Optimization

To make the algorithm faster and cheaper to run, the script shrinks the image before analyzing it (with a coefficient K = DIMENSIONS / 500). Once it has found the two points, it crops the original image using their coordinates multiplied by K. The jump is also based on the coefficient K.
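The scaling step can be sketched as below. Two assumptions are made here: DIMENSIONS is taken to be the larger of the image height and width, and the downscale uses plain nearest-neighbour sampling with numpy instead of an OpenCV resize. The function names are hypothetical.

```python
import numpy as np

TARGET = 500  # from the text: K = DIMENSIONS / 500

def scale_factor(shape):
    # Assumption: DIMENSIONS is the larger side; never upscale small images.
    return max(1.0, max(shape[0], shape[1]) / TARGET)

def shrink(image, k):
    """Nearest-neighbour downscale of the first two axes by factor k."""
    h, w = image.shape[:2]
    ys = (np.arange(int(h / k)) * k).astype(int)
    xs = (np.arange(int(w / k)) * k).astype(int)
    return image[np.ix_(ys, xs)]

def to_original(point, k):
    """Map an (x, y) point found on the small image back to the original image."""
    x, y = point
    return int(round(x * k)), int(round(y * k))
```

The analysis then runs on `shrink(image, k)`, and the two corner points it finds are passed through `to_original` before cropping the full-size image.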

CHANGELOG

New v1.1

Fixed some bugs in the cropper script and adapted it to work on all platforms.
It now loads the images in grayscale, so the distance is a simple difference.
Added the rotation algorithm.
Added text output to a ".txt" file.
Fixed some Unicode problems.

New v1.2

Fixed bugs.
The reader now produces an hOCR file instead of a txt file, to make manual correction easier.
Updated Main.py and Reader.py.
Added Build.py to make the script easier to run.