OCR testing - stagezero-42/document-processing-api GitHub Wiki

Once the basic OCR functionality is working for simple, clear images, it's time to test its robustness and the effectiveness of the preprocessing and OCR parameter options.

Here's a list of suggested tests, categorized for clarity. You'll need a variety of test images for these.

I. Image Quality & Characteristics Tests:

Low-Resolution Images:
- Test: Scan a document at a low DPI (e.g., 75, 100, 150 DPI) or digitally downscale a clear image.
- Purpose: To see how Tesseract and your (optional) DPI scaling handle it, and whether the text is still recognized correctly.
- Parameters to Vary: ocr_apply_preprocessing (to see if your scaling/enhancements help).
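When preparing these low-DPI cases, it helps to know what size an upscaling step should produce. The sketch below is only the arithmetic; the function name, the 300 DPI target (a commonly recommended minimum for Tesseract input), and the example dimensions are assumptions, not taken from image_processor.py.

```python
from typing import Tuple

TARGET_DPI = 300  # assumed target, not a value read from the project


def upscaled_size(size: Tuple[int, int], source_dpi: int,
                  target_dpi: int = TARGET_DPI) -> Tuple[int, int]:
    """Return the pixel dimensions that bring an image from source_dpi to target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    # Never downscale: images already at or above the target keep their size.
    factor = max(1.0, target_dpi / source_dpi)
    width, height = size
    return round(width * factor), round(height * factor)


# Hypothetical scan dimensions, checked at the DPI values suggested above.
for dpi in (75, 100, 150):
    print(dpi, upscaled_size((750, 1000), dpi))
```

Comparing OCR output for the original and the upscaled size shows whether scaling alone recovers the text or whether detail was already lost at scan time.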
Skewed Images:
- Test: Images rotated by small angles (e.g., 1, 2, 5 degrees) and larger angles (e.g., 10, 15, 45 degrees).
- Purpose: To rigorously test the deskew function in image_processor.py.
- Parameters to Vary: ocr_deskew=true (default) vs. ocr_deskew=false. Compare results.
Noisy Images:
- Test: Images with "salt and pepper" noise, scanner dust, or background patterns. You can digitally add noise to existing images for testing.
- Purpose: To check whether your noise removal (if you have implemented any beyond binarization) and Tesseract's own internal cleanup can handle it.
- Parameters to Vary: ocr_apply_preprocessing. If you implement more specific noise removal options, test those.
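Adding noise digitally can be sketched as follows. To stay dependency-free, this version works on a grayscale image held as nested lists; in practice you would more likely do the same thing on a NumPy array or with OpenCV. The function name and parameters are illustrative.

```python
import random
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values


def add_salt_and_pepper(pixels: Gray, amount: float, seed: int = 0) -> Gray:
    """Return a copy in which roughly `amount` of the pixels are forced to 0 or 255."""
    if not 0.0 <= amount <= 1.0:
        raise ValueError("amount must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the same test image can be regenerated
    noisy = []
    for row in pixels:
        new_row = []
        for value in row:
            if rng.random() < amount:
                new_row.append(rng.choice((0, 255)))  # pepper or salt
            else:
                new_row.append(value)
        noisy.append(new_row)
    return noisy


page = [[200] * 8 for _ in range(8)]  # tiny uniform gray "page" for demonstration
noisy_page = add_salt_and_pepper(page, amount=0.05)
```

Generating the same image at several noise levels (for example 1%, 5%, 10%) gives a rough picture of where recognition starts to degrade.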
Poor Contrast/Brightness:
- Test: Images that are too dark, too light, or have low contrast between text and background.
- Purpose: To see how well the binarization (especially Otsu's method) adapts.
- Parameters to Vary: ocr_apply_preprocessing.
Images with Different Fonts & Sizes:
- Test: Images containing various common fonts (serif, sans-serif, monospace), different text sizes (very small, normal, large), bold/italic text.
- Purpose: To assess Tesseract's versatility with different typography. Small text is often a challenge.
Images with Complex Layouts:
- Test:
  - Multi-column text (like a newspaper).
  - Text mixed with diagrams or pictures.
  - Forms or tables (note: table structure extraction from images is not yet implemented, but OCR should still get the text within cells).
- Purpose: To test the effectiveness of different ocr_page_segmentation_mode (PSM) values.
- Parameters to Vary: ocr_page_segmentation_mode (e.g., try 1, 3, 4, 6, 11, 12).
Images with Non-Standard Text Orientation:
- Test: Text written vertically (if this is a use case).
- Purpose: To check how vertical or rotated text is handled. Tesseract has some capability for this, often requiring specific PSM values or Orientation and Script Detection (OSD), which is included in PSM 0, 1, and 12.
- Parameters to Vary: ocr_page_segmentation_mode.

II. OCR Parameter Tests:

Language Tests:
- Test: Images containing text in languages other than English (e.g., French, German, Spanish), assuming you have the corresponding Tesseract language packs installed (tesseract-ocr-fra, tesseract-ocr-deu, etc.). Also, test with images containing mixed languages.
- Purpose: To verify the ocr_language parameter works correctly.
- Parameters to Vary: ocr_language (e.g., "fra", "deu", "eng+fra").
Page Segmentation Mode (PSM) Tests:
- Test: Use a variety of images (single block, multi-column, sparse text) and iterate through different ocr_page_segmentation_mode values (0-13).
- Purpose: To understand which PSM works best for different layouts and to ensure the API parameter is effective. Document common useful PSM values for your users.
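As a starting point for documenting PSM values, here is a sketch that lists the modes as described by `tesseract --help-psm` and builds the configuration string that pytesseract accepts through its `config` argument. The helper name and the way the API parameters map onto it are assumptions; the project's actual wiring may differ.

```python
from typing import Optional

# Page segmentation modes, as described by `tesseract --help-psm`.
PSM_DESCRIPTIONS = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, but no OSD or OCR (not implemented)",
    3: "Fully automatic page segmentation, but no OSD (default)",
    4: "Single column of text of variable sizes",
    5: "Single uniform block of vertically aligned text",
    6: "Single uniform block of text",
    7: "Single text line",
    8: "Single word",
    9: "Single word in a circle",
    10: "Single character",
    11: "Sparse text, in no particular order",
    12: "Sparse text with OSD",
    13: "Raw line, bypassing Tesseract-specific hacks",
}


def build_tesseract_config(psm: int = 3, oem: int = 3,
                           whitelist: Optional[str] = None) -> str:
    """Build a Tesseract config string (hypothetical helper, not project code)."""
    if psm not in PSM_DESCRIPTIONS:
        raise ValueError(f"psm must be 0-13, got {psm}")
    if oem not in range(4):
        raise ValueError(f"oem must be 0-3, got {oem}")
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


print(build_tesseract_config(psm=6, whitelist="0123456789"))
```

Validating the value before it reaches Tesseract lets the API reject an out-of-range mode with a clear message instead of surfacing a Tesseract error.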
OCR Engine Mode (OEM) Tests:
- Test: While 3 (the default, which picks the engine based on what is available, normally LSTM) is usually best, you can try other modes on particularly challenging images to see if they give different results. Note that 0 (legacy only) and 2 (legacy + LSTM) require traineddata files that include the legacy models.
- Purpose: To confirm the parameter works, though you'll likely stick to the default.
- Parameters to Vary: ocr_engine_mode.
Character Whitelist/Blacklist:
- Test:
  - An image containing only numbers, and set ocr_char_whitelist="0123456789".
  - An image with mixed alphanumeric characters, whitelisting only letters.
- Purpose: To check if whitelisting improves accuracy for specific content types or prevents unwanted characters.
- Parameters to Vary: ocr_char_whitelist.
Preprocessing Toggle:
- Test: Use a moderately challenging image (e.g., slightly skewed and noisy) and process it once with ocr_apply_preprocessing=true and once with ocr_apply_preprocessing=false.
- Purpose: To quantify the benefit of your preprocessing pipeline.
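The parameter tests in this section multiply quickly, so it can help to generate the combinations rather than list them by hand. A minimal sketch, using the parameter names from this page with an illustrative subset of values; how each combination is sent to the API (query string, form fields) is left open because it depends on your endpoint.

```python
from itertools import product
from typing import Dict, Iterator, List

# Illustrative subset of values; extend as needed.
PARAM_VALUES: Dict[str, List] = {
    "ocr_apply_preprocessing": [True, False],
    "ocr_deskew": [True, False],
    "ocr_page_segmentation_mode": [3, 6, 11],
}


def parameter_grid(values: Dict[str, List]) -> Iterator[Dict[str, object]]:
    """Yield one dict per combination of the given parameter values."""
    names = list(values)
    for combo in product(*(values[name] for name in names)):
        yield dict(zip(names, combo))


runs = list(parameter_grid(PARAM_VALUES))
print(len(runs), "combinations to run per test image")
```

Recording the result of each combination next to its parameter dict makes it straightforward to see which setting is responsible for a change in output.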

III. API & System Level Tests:

Unsupported Image Formats:
- Test: Try uploading an image format not in your SUPPORTED_IMAGE_EXTENSIONS list (e.g., a .gif if it's not listed, or a completely unrelated file type like .txt renamed to .png).
- Purpose: To ensure your API correctly rejects it with a 400 error and an informative message.
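The renamed-file case is only caught if the check looks at file contents rather than the extension. One way to do that is to compare the leading bytes against well-known file signatures, sketched below. The function name and the set of formats are assumptions; the real SUPPORTED_IMAGE_EXTENSIONS list may differ.

```python
from typing import Optional

# Leading "magic bytes" of common image formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
    (b"BM", "bmp"),
]


def sniff_image_format(data: bytes) -> Optional[str]:
    """Return a format name if the data starts with a known signature, else None."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None


# A text file renamed to .png is not recognized as an image.
print(sniff_image_format(b"just some text pretending to be a PNG"))
```

This is a coarse first filter (the two-byte BMP signature in particular can match ordinary text), so a full decode attempt is still needed afterwards.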
Large Image Files:
- Test: Process large images (both in dimensions and file size).
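- Purpose: To check for performance issues, memory limits, and timeouts. FastAPI and Uvicorn do not enforce a request body size limit on their own, so any cap usually comes from a reverse proxy in front of the app (e.g., nginx's client_max_body_size) or has to be added in your code. Pillow also warns about, or refuses, images far above its Image.MAX_IMAGE_PIXELS threshold.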
- Note: This might require adjusting server configurations, not just code.
Corrupted Image Files:
- Test: Try uploading an image file that is intentionally corrupted.
- Purpose: To ensure your image loading and processing (Pillow, OpenCV) handle these errors gracefully and don't crash the server, ideally returning a client error such as 400 or 422 (rather than an unhandled 500) with a clear message.
Tesseract Language Pack Not Installed:
- Test: Request a language with ocr_language for which you know the Tesseract language pack is not installed on the server.
- Purpose: To see how pytesseract and your error handling deal with this. It should ideally return an error to the API user indicating the language is unavailable or processing failed.
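One way to produce a clear error here is to validate the requested languages before calling Tesseract. The sketch below assumes the installed languages are already known (for example from the output of `tesseract --list-langs`); the function name is hypothetical.

```python
from typing import Iterable, List


def missing_languages(requested: str, installed: Iterable[str]) -> List[str]:
    """Return the codes in a '+'-joined language string that are not installed."""
    available = set(installed)
    return [code for code in requested.split("+") if code and code not in available]


# Example: a server with only English and OSD data installed.
installed_on_server = ["eng", "osd"]
print(missing_languages("eng+fra", installed_on_server))
```

A non-empty result can be turned into a client error that names the unavailable language, which is easier for an API user to act on than a generic processing failure.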
Concurrent Requests (Stress Test - More Advanced):
- Test: If possible, simulate multiple users uploading images simultaneously.
- Purpose: To check for race conditions (e.g., in temporary file naming if not using UUIDs) and overall system stability under load. Tesseract itself is CPU-intensive.
- Tools: locust, k6, Apache Benchmark (ab).
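The temporary file naming concern can be checked in isolation before running a full load test. The sketch below generates UUID-based names from many threads and confirms none collide; the helper name and the tmp_uploads directory are assumptions, and no files are actually written.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def temp_upload_path(original_name: str, temp_dir: str = "tmp_uploads") -> Path:
    """Build a collision-resistant temp path that keeps the original extension."""
    suffix = Path(original_name).suffix.lower()
    return Path(temp_dir) / f"{uuid.uuid4().hex}{suffix}"


# Simulate many simultaneous uploads that all share the same client filename.
with ThreadPoolExecutor(max_workers=16) as pool:
    paths = list(pool.map(temp_upload_path, ["scan.png"] * 500))

print(len(set(paths)) == len(paths))
```

Naming derived from the client filename or a timestamp would fail this check under load, which is the race condition the stress test is meant to expose.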

How to Evaluate Test Results:

Accuracy of extracted_text: Compare it to the ground truth text in the image.
Completeness: Is all text extracted? Are any parts missed or hallucinated?
word_level_details:
- Are bounding boxes reasonably accurate?
- Are confidence scores reflective of the recognition quality?
ocr_settings_used in response: Verify they match what you requested.
Performance: How long does processing take for different images and settings?
Error Handling: Does the API return appropriate HTTP status codes and error messages for invalid inputs or processing failures?
Server Logs: Check for any warnings or errors printed by Uvicorn/FastAPI or your image_processor.py print statements.
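Comparing extracted_text to ground truth is easier to track over time with a number. A common choice is the character error rate (CER) and word error rate (WER), both based on edit distance. This is a minimal, dependency-free sketch; the function names are illustrative, and you may want to normalize whitespace and case before comparing.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(truth: str, ocr: str) -> float:
    """Edit distance over characters, divided by the ground-truth length."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr) / len(truth)


def word_error_rate(truth: str, ocr: str) -> float:
    """Edit distance over whitespace-separated words, divided by the truth word count."""
    truth_words = truth.split()
    if not truth_words:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth_words, ocr.split()) / len(truth_words)


print(character_error_rate("kitten", "sitting"))
print(word_error_rate("the quick brown fox", "the quick brwn fox"))
```

Both rates can exceed 1.0 when the OCR output contains much more text than the ground truth, which is one way hallucinated text shows up.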

Start with the image quality and OCR parameter tests, as these will directly impact the core functionality for your users. Good luck!