OCR testing - stagezero-42/document-processing-api GitHub Wiki
Once the basic OCR functionality is working for simple, clear images, it's time to test its robustness and the effectiveness of the preprocessing and OCR parameter options.
Here's a list of suggested tests, categorized for clarity. You'll need a variety of test images for these.
I. Image Quality & Characteristics Tests:
- Low-Resolution Images:
  - Test: Scan a document at a low DPI (e.g., 75, 100, 150 DPI) or digitally downscale a clear image.
  - Purpose: To see how Tesseract and your (optional) DPI scaling attempts handle it. See if text is still legible.
  - Parameters to Vary: `ocr_apply_preprocessing` (to see if your scaling/enhancements help).
- Skewed Images:
  - Test: Images rotated by small angles (e.g., 1, 2, 5 degrees) and larger angles (e.g., 10, 15, 45 degrees).
  - Purpose: To rigorously test the `deskew` function in `image_processor.py`.
  - Parameters to Vary: `ocr_deskew=true` (default) vs. `ocr_deskew=false`. Compare results.
- Noisy Images:
  - Test: Images with "salt and pepper" noise, scanner dust, or background patterns. You can digitally add noise to existing images for testing.
  - Purpose: To check whether any noise removal you have implemented beyond binarization, together with Tesseract's internal capabilities, can handle it.
  - Parameters to Vary: `ocr_apply_preprocessing`. If you implement more specific noise removal options, test those.
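Adding noise digitally can be done reproducibly with a few lines of code. The following is a minimal standard-library sketch, assuming a grayscale image held as a list of rows of 0..255 integers; the function name `add_salt_and_pepper` and its parameters are hypothetical and not part of the project. In practice you would load and save real images with Pillow or OpenCV and apply the same idea to their pixel arrays.

```python
import random


def add_salt_and_pepper(pixels, amount=0.05, seed=0):
    """Return a copy of a grayscale image with a fraction of pixels set to 0 or 255.

    `pixels` is a list of rows, each a list of ints in 0..255 (an assumed,
    library-free representation used only for this sketch).
    """
    rng = random.Random(seed)  # seeded so the same test image can be regenerated
    height, width = len(pixels), len(pixels[0])
    noisy = [row[:] for row in pixels]
    count = round(amount * height * width)
    # Sample distinct positions so exactly `count` pixels are overwritten.
    for index in rng.sample(range(height * width), count):
        y, x = divmod(index, width)
        noisy[y][x] = rng.choice((0, 255))  # pepper (black) or salt (white)
    return noisy


clean = [[128] * 40 for _ in range(30)]  # flat mid-gray 40x30 image
noisy = add_salt_and_pepper(clean, amount=0.10)
changed = sum(a != b for r1, r2 in zip(clean, noisy) for a, b in zip(r1, r2))
```

Generating several noise levels from the same clean source (for example 1%, 5%, 10%) makes it easier to see where recognition starts to degrade.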
- Poor Contrast/Brightness:
  - Test: Images that are too dark, too light, or have low contrast between text and background.
  - Purpose: To see how well the binarization (especially Otsu's method) adapts.
  - Parameters to Vary: `ocr_apply_preprocessing`.
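To build intuition for why contrast matters to binarization, here is a pure-Python sketch of the idea behind Otsu's method: pick the threshold that maximises the between-class variance of the two pixel groups. This is an illustration only, not the project's code (which presumably calls a library routine); the function name `otsu_threshold` and the sample pixel values are assumptions.

```python
def otsu_threshold(values):
    """Return the threshold t (0..255) that maximises between-class variance.

    Pixels <= t form one class and pixels > t the other.
    """
    hist = [0] * 256
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count in the "<= t" class
    sum0 = 0  # intensity sum of the "<= t" class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels in the lower class yet
        w1 = total - w0
        if w1 == 0:
            break     # every pixel is already in the lower class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


# Dark text (30% of pixels) on a light background, at two contrast levels.
high_contrast = [20] * 300 + [220] * 700
low_contrast = [110] * 300 + [130] * 700
t_high = otsu_threshold(high_contrast)
t_low = otsu_threshold(low_contrast)
```

In both cases the threshold falls between the two intensity groups, but in the low-contrast image the groups are only 20 levels apart, so modest noise or uneven lighting can push pixels across the threshold. That is the failure mode the poor-contrast test images are meant to expose.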
- Images with Different Fonts & Sizes:
  - Test: Images containing various common fonts (serif, sans-serif, monospace), different text sizes (very small, normal, large), bold/italic text.
  - Purpose: To assess Tesseract's versatility with different typography. Small text is often a challenge.
- Images with Complex Layouts:
  - Test:
    - Multi-column text (like a newspaper).
    - Text mixed with diagrams or pictures.
    - Forms or tables (note: table structure extraction from images is not yet implemented, but OCR should still get the text within cells).
  - Purpose: To test the effectiveness of different `ocr_page_segmentation_mode` (PSM) values.
  - Parameters to Vary: `ocr_page_segmentation_mode` (e.g., try `1`, `3`, `4`, `6`, `11`, `12`).
- Images with Non-Standard Text Orientation:
  - Test: Text written vertically (if this is a use case).
  - Purpose: Tesseract has some capability for this, often requiring specific PSM modes or orientation detection (OSD, Orientation and Script Detection, which is part of some PSM modes like `1` or `12`).
  - Parameters to Vary: `ocr_page_segmentation_mode`.
II. OCR Parameter Tests:
- Language Tests:
  - Test: Images containing text in languages other than English (e.g., French, German, Spanish), assuming you have the corresponding Tesseract language packs installed (`tesseract-ocr-fra`, `tesseract-ocr-deu`, etc.). Also, test with images containing mixed languages.
  - Purpose: To verify the `ocr_language` parameter works correctly.
  - Parameters to Vary: `ocr_language` (e.g., "fra", "deu", "eng+fra").
- Page Segmentation Mode (PSM) Tests:
  - Test: Use a variety of images (single block, multi-column, sparse text) and iterate through different `ocr_page_segmentation_mode` values (0-13).
  - Purpose: To understand which PSM works best for different layouts and to ensure the API parameter is effective. Document common useful PSM values for your users.
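Iterating through every PSM value by hand is tedious, so it can help to generate the test matrix up front. The sketch below builds one set of request parameters per combination of image, PSM value, and preprocessing setting. The image file names and the `build_test_matrix` helper are hypothetical; only the parameter names come from the API described on this page.

```python
from itertools import product

PSM_VALUES = range(0, 14)        # Tesseract defines PSM 0 through 13
PREPROCESSING = (True, False)


def build_test_matrix(image_names):
    """Return one dict of request parameters per (image, PSM, preprocessing) combination."""
    return [
        {
            "image": name,
            "ocr_page_segmentation_mode": psm,
            "ocr_apply_preprocessing": preprocessing,
        }
        for name, psm, preprocessing in product(image_names, PSM_VALUES, PREPROCESSING)
    ]


# Placeholder file names; substitute your own test images.
matrix = build_test_matrix(["single_block.png", "two_columns.png", "sparse.png"])
```

Each entry can then be sent to the API and the response stored next to its parameters for comparison. Note that PSM `0` performs orientation and script detection only, so an empty text result for that mode is expected rather than a failure.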
- OCR Engine Mode (OEM) Tests:
  - Test: While `3` (the default, which selects the engine based on what is available, normally LSTM) is usually best, you can try other modes on particularly challenging images to see if they offer different results (though they rely on the legacy engine, which may not be available in every Tesseract build or language data set).
  - Purpose: To confirm the parameter works, though you'll likely stick to the default.
  - Parameters to Vary: `ocr_engine_mode`.
- Character Whitelist/Blacklist:
  - Test:
    - An image containing only numbers, with `ocr_char_whitelist="0123456789"`.
    - An image with mixed alphanumeric characters, whitelisting only alphabetic characters.
  - Purpose: To check if whitelisting improves accuracy for specific content types or prevents unwanted characters.
  - Parameters to Vary: `ocr_char_whitelist`.
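For reference, the PSM, OEM, and whitelist options correspond to Tesseract's `--psm`, `--oem`, and `-c tessedit_char_whitelist=...` command-line settings, which `pytesseract` accepts as a single `config` string. How this project actually assembles that string is not shown here, so the sketch below is an assumption: `build_tesseract_config` and `unexpected_chars` are hypothetical helpers, the second being a simple way to check that a whitelist test really kept unwanted characters out.

```python
def build_tesseract_config(psm=3, oem=3, char_whitelist=None):
    """Assemble a Tesseract config string from the API-style parameters (hypothetical helper)."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if char_whitelist:
        parts.append(f"-c tessedit_char_whitelist={char_whitelist}")
    return " ".join(parts)


def unexpected_chars(text, char_whitelist):
    """Return the non-whitespace characters in `text` that are outside the whitelist."""
    allowed = set(char_whitelist)
    return {ch for ch in text if not ch.isspace() and ch not in allowed}


config = build_tesseract_config(psm=7, char_whitelist="0123456789")
# Example OCR output where the letter "O" slipped in instead of a zero.
leaked = unexpected_chars("12 34\n5O", "0123456789")
```

An empty set from `unexpected_chars` means the whitelist was respected; a non-empty set shows which characters slipped through.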
- Preprocessing Toggle:
  - Test: Use a moderately challenging image (e.g., slightly skewed and noisy) and process it once with `ocr_apply_preprocessing=true` and once with `ocr_apply_preprocessing=false`.
  - Purpose: To quantify the benefit of your preprocessing pipeline.
III. API & System Level Tests:
- Unsupported Image Formats:
  - Test: Try uploading an image format not in your `SUPPORTED_IMAGE_EXTENSIONS` list (e.g., a `.gif` if it's not listed, or a completely unrelated file type like `.txt` renamed to `.png`).
  - Purpose: To ensure your API correctly rejects it with a 400 error and an informative message.
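The renamed `.txt` case is worth a closer look, because an extension check alone will accept it. One possible approach, sketched here under assumptions, is to also compare the first bytes of the upload against the well-known file signature for the claimed format. The contents of `SUPPORTED_IMAGE_EXTENSIONS` below are a guess at what the project's list might contain, and `check_upload` is a hypothetical helper.

```python
from pathlib import Path

# Assumed contents; the project's real list may differ.
SUPPORTED_IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}

# Leading "magic" bytes for each supported format.
SIGNATURES = {
    ".png": (b"\x89PNG\r\n\x1a\n",),
    ".jpg": (b"\xff\xd8\xff",),
    ".jpeg": (b"\xff\xd8\xff",),
    ".tif": (b"II*\x00", b"MM\x00*"),
    ".tiff": (b"II*\x00", b"MM\x00*"),
    ".bmp": (b"BM",),
}


def check_upload(filename, head):
    """Return (ok, message) for a file name and the first bytes of its content."""
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_IMAGE_EXTENSIONS:
        return False, f"Unsupported file extension: {ext or '(none)'}"
    if not head.startswith(SIGNATURES[ext]):
        return False, f"File content does not look like {ext}"
    return True, "ok"


gif_result = check_upload("animation.gif", b"GIF89a")
renamed_result = check_upload("notes.png", b"hello, this is plain text")
png_result = check_upload("scan.PNG", b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)
```

In the API, a `False` result would translate into the 400 response, with the message passed through to the client.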
- Large Image Files:
  - Test: Process large images (both in dimensions and file size).
  - Purpose: To check for performance issues, memory limits, and timeouts. Uvicorn and FastAPI do not enforce a request body size limit by default, so effective limits usually come from a reverse proxy or platform in front of the app (for example, nginx's `client_max_body_size`) or from checks you add yourself.
  - Note: This might require adjusting server configurations, not just code.
- Corrupted Image Files:
  - Test: Try uploading an image file that is intentionally corrupted.
  - Purpose: To ensure your image loading and processing (Pillow, OpenCV) handle these errors gracefully and don't crash the server, ideally returning a 422 or 500 error with a clear message.
- Tesseract Language Pack Not Installed:
  - Test: Request a language with `ocr_language` for which you know the Tesseract language pack is not installed on the server.
  - Purpose: To see how `pytesseract` and your error handling deal with this. It should ideally return an error to the API user indicating the language is unavailable or processing failed.
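One way to produce a clear "language unavailable" error is to validate the request before Tesseract is invoked. The sketch below splits an `ocr_language` value such as "eng+fra" and reports the codes that are not installed. The helper name `missing_languages` is hypothetical, and the installed set is passed in as a plain argument here; on a real server it could come from `pytesseract.get_languages()` or `tesseract --list-langs`.

```python
def missing_languages(ocr_language, installed):
    """Return the requested language codes that are not in the installed set."""
    requested = [code for code in ocr_language.split("+") if code]
    return [code for code in requested if code not in installed]


installed = {"eng", "osd"}  # assumed server state for this example
missing = missing_languages("eng+fra", installed)
if missing:
    message = "Language pack(s) not installed: " + ", ".join(missing)
else:
    message = "ok"
```

Checking up front lets the API name the missing language in its response instead of surfacing a generic processing failure.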
- Concurrent Requests (Stress Test - More Advanced):
  - Test: If possible, simulate multiple users uploading images simultaneously.
  - Purpose: To check for race conditions (e.g., in temporary file naming if not using UUIDs) and overall system stability under load. Tesseract itself is CPU-intensive.
  - Tools: `locust`, `k6`, Apache Benchmark (`ab`).
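The temporary file naming risk can be checked locally before running a full load test. The following sketch simulates 50 concurrent uploads that all use the same original file name and confirms that UUID-based naming keeps them separate. `save_upload` is a hypothetical stand-in for whatever the API does when it writes an upload to disk.

```python
import tempfile
import uuid
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def save_upload(directory, original_name, data):
    """Write an upload under a UUID-based name so concurrent requests cannot collide."""
    suffix = Path(original_name).suffix
    target = Path(directory) / f"{uuid.uuid4().hex}{suffix}"
    target.write_bytes(data)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # 50 simulated requests, each uploading a different one-byte "scan.png".
    with ThreadPoolExecutor(max_workers=10) as pool:
        paths = list(pool.map(lambda i: save_upload(tmp, "scan.png", bytes([i])), range(50)))
    distinct_files = len({p.name for p in paths})
    contents_intact = all(p.read_bytes() == bytes([i]) for i, p in enumerate(paths))
```

If the file were instead saved under its original name, all 50 writes would target the same path and most of the content checks would fail, which is the kind of race the stress test is meant to surface.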
How to Evaluate Test Results:
- Accuracy of `extracted_text`: Compare it to the ground truth text in the image.
- Completeness: Is all text extracted? Are any parts missed or hallucinated?
- `word_level_details`:
  - Are bounding boxes reasonably accurate?
  - Are confidence scores reflective of the recognition quality?
- `ocr_settings_used` in response: Verify they match what you requested.
- Performance: How long does processing take for different images and settings?
- Error Handling: Does the API return appropriate HTTP status codes and error messages for invalid inputs or processing failures?
- Server Logs: Check for any warnings or errors printed by Uvicorn/FastAPI or your `image_processor.py` `print` statements.
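Comparing extracted text to ground truth by eye does not scale, so it can help to compute a number. A common choice is the character error rate (CER) and word error rate (WER), both based on edit distance. The sketch below is one straightforward way to compute them with the standard library; the function names are illustrative and not part of the API.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, extracted):
    """Edit distance over characters, divided by the ground-truth length."""
    return edit_distance(ground_truth, extracted) / max(len(ground_truth), 1)


def word_error_rate(ground_truth, extracted):
    """Edit distance over whitespace-separated words, divided by the ground-truth word count."""
    truth_words = ground_truth.split()
    return edit_distance(truth_words, extracted.split()) / max(len(truth_words), 1)


# A single misread character ("0" read as "O") in a 12-character string.
cer = character_error_rate("Invoice 1023", "Invoice 1O23")
wer = word_error_rate("Invoice 1023", "Invoice 1O23")
```

Recording CER and WER for each image and parameter combination gives a direct way to compare settings, for example preprocessing on versus off. Lower is better, and a value of 0 means an exact match.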
Start with the image quality and OCR parameter tests, as these will directly impact the core functionality for your users. Good luck!