Document Extractor Prototype - GSA-TTS/document-extractor-poc GitHub Wiki

Problem Statement

How might we get data out of a document into a machine-readable format to reduce the manual burden of data entry for administrators?

  • Solution: An automated data extraction tool that converts PDFs and image files into machine-readable formats to reduce the manual burden of data entry for benefits administrators.

  • Intended User: Benefits administrator, likely processing paperwork as part of an application for a public benefit program like unemployment insurance (UI), SNAP, TANF, Medicaid, or the Low Income Home Energy Assistance Program (LIHEAP).

  • User Goal: When documents are uploaded by a member of the public, benefits administrators need a way to quickly and accurately extract data from a static file to import into case management systems. Having machine-readable data accessible to staff facilitates eligibility determinations and calculations.

Demonstrative Proof-of-Concept

Key Proof-of-Concept Tasks

  1. The tool ingests PDFs or image files (e.g., .JPEG, .PNG).
  2. The data are displayed in editable fields next to the original image, so that the user can check for accuracy and make any corrections.
  3. The extracted data are output as a .CSV or .JSON file, available for the user to download.
  4. The user uploads the extracted data into their existing system to support downstream calculations and decision-making.
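As a rough sketch of what steps 3 and 4 might look like in code (this is illustrative, not the PoC's actual implementation; the field names and `export_fields` helper are hypothetical), reviewed fields could be serialized to both download formats like this:

```python
import csv
import io
import json

def export_fields(fields: dict) -> tuple[str, str]:
    """Serialize extracted key-value fields to CSV and JSON strings.

    `fields` maps a form label (e.g. "Employer Name") to the value the
    user confirmed in the review step.
    """
    # JSON: a direct dump of the reviewed fields.
    json_out = json.dumps(fields, indent=2)

    # CSV: one header row of labels, one row of values, so the file can
    # be appended to a spreadsheet of processed documents.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(fields.keys())
    writer.writerow(fields.values())
    return buffer.getvalue(), json_out

# Hypothetical fields from a pay stub after user review:
csv_text, json_text = export_fields(
    {"Employee Name": "Jane Doe", "Gross Pay": "1543.21"}
)
```

Returning plain strings keeps the helper agnostic about delivery: the same output can be offered as a browser download or written to disk for import into a case management system.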

Document types tested: income documents (W-2s, pay stubs, 1099s, employer letters) and the Veteran’s discharge form (DD214).

How it works

To begin our technical discovery, we focused a two-week sprint on developing a document extractor proof of concept (PoC) to explore which technologies could help our users achieve their goal.

The PoC streamlines the processing of documents by leveraging automation to extract key information efficiently. In its current state, the PoC allows users to upload preexisting documents, where the system extracts relevant data and metadata. It uses optical character recognition (OCR) and natural language processing (NLP) techniques to parse text from structured and semi-structured documents. The extracted data are then formatted for validation and further processing, ensuring compatibility with downstream workflows.

We are currently testing this functionality to gain a deeper understanding of how Amazon Web Services (AWS) Textract processes key attributes, such as confidence scores for extracted text, form fields, and tables. This testing will help us assess its accuracy, identify potential gaps in extraction, and determine whether additional post-processing or validation steps are needed to ensure reliable data handling. We will also explore other tools for comparison.
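To illustrate the kind of confidence-score check described above, here is a minimal sketch of a post-processing step that flags low-confidence fields for human review. In production the records would come from `boto3.client("textract").analyze_document(..., FeatureTypes=["FORMS"])`, whose KEY/VALUE blocks must be paired by walking block relationships; the simplified record shape, the `flag_for_review` helper, and the 90% threshold below are all illustrative assumptions:

```python
# Sketch of post-processing Textract form output: flag any field whose
# confidence score falls below a review threshold so staff verify it by hand.
# The KEY/VALUE pairing is simplified here; a real Textract response requires
# resolving Block relationships before fields look like these records.

REVIEW_THRESHOLD = 90.0  # percent; an assumed cutoff, to be tuned on real docs

def flag_for_review(fields: list[dict]) -> list[dict]:
    """Given (key, value, confidence) records, mark low-confidence ones."""
    return [
        {**f, "needs_review": f["confidence"] < REVIEW_THRESHOLD}
        for f in fields
    ]

# Illustrative records shaped like paired Textract form fields:
sample = [
    {"key": "Employer Name", "value": "Acme Corp", "confidence": 99.1},
    {"key": "Wages (Box 1)", "value": "42,300.00", "confidence": 71.4},
]
flagged = flag_for_review(sample)
```

Flagging rather than discarding low-confidence fields fits the editable-review step above: the user still sees every extracted value, with uncertain ones highlighted for correction.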

As a future phase, the PoC should be refined to enhance extraction accuracy, integrate user feedback mechanisms, optimize the document-handling pipeline for scalability, and address security compliance to support a robust, production-ready implementation.

https://github.com/user-attachments/assets/c55480dc-e78a-4a6a-ba7d-dfff5fd7f6dc