Proof of Concept: Sprint Outcomes

Sprint Outcomes

As we continue to refine our document extractor proof-of-concept, we will log summaries of our bi-weekly sprint outcomes below.

Sprint 5 Outcomes (April 14-April 25)

Sprint goals: Improve accuracy of extraction by training the tool.

  • Amazon Textract training: Train Textract using a diverse data set. We created 5 training samples and 5 test samples for each form type (W-2, 1099, DD214) with a wide variety of data and anomalies, such as names with three last names, non-English characters, etc. Hypothesis 1: Value Capture and Value Accuracy should increase from Round 1 to Round 2, and from Round 2 to Round 3 (original testing samples only). Hypothesis 2: Value Capture and Value Accuracy should increase from Round 1 to Round 2, and from Round 2 to Round 3 (all testing samples, including the 15 new test samples with more diverse data). A sketch of how these metrics can be scored appears after this list.

  • Accuracy improvements: Overall accuracy improved by an average of 22% (13 points) from Round 1 to Round 3 across form types (W-2, 1099, and DD214).
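
As a rough illustration of how Value Capture and Value Accuracy can be scored per test sample, the sketch below uses a simple working interpretation of the two metrics; the exact definitions, field names, and values in our testing spreadsheets may differ.

```python
# Rough scoring sketch (working interpretation, not the official test protocol):
#   Value Capture  = share of expected fields for which the tool returned any value
#   Value Accuracy = share of captured fields whose value matches the ground truth


def score_sample(expected: dict[str, str], extracted: dict[str, str]) -> tuple[float, float]:
    captured = {k: v for k, v in extracted.items() if k in expected and v}
    value_capture = len(captured) / len(expected)
    correct = sum(
        1 for k, v in captured.items() if v.strip().lower() == expected[k].strip().lower()
    )
    value_accuracy = correct / len(captured) if captured else 0.0
    return value_capture, value_accuracy


# Hypothetical W-2 test sample: one field captured correctly, one missed.
expected_fields = {"wages": "48500.00", "employee_ssn": "123-45-6789"}
extracted_fields = {"wages": "48500.00", "employee_ssn": ""}
print(score_sample(expected_fields, extracted_fields))  # (0.5, 1.0)
```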

Insights

  1. We discovered that by training the model, we inadvertently trained it to ignore fields that it was previously correctly extracting data from. We were able to retrain and correct this for W-2 and 1099 but haven't had time to correct it for DD214.

Sprint 4 Outcomes (Mar 31-Apr 11)

Unit Testing & Refactoring

  • Added initial unit tests using ApplicationContext, enabling easier mocking and modular code structure.
  • Refactored logic from Lambda handlers into usecases for better testability (see the sketch after this list).
  • Integrated unit tests into CI pipeline.
  • Added unit tests for document fetching and writing to the database.
  • Wrote and reviewed ADRs (Architecture Decision Records) related to pytest and ApplicationContext.
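
As a rough sketch of the pattern behind these changes (the names are hypothetical, not the repo's actual modules), pulling logic out of the Lambda handler into a usecase that receives its dependencies through an ApplicationContext lets pytest exercise the logic with fakes instead of live AWS resources:

```python
# Hypothetical names, not the repo's actual modules: a usecase receives its
# dependencies via an ApplicationContext, so pytest can swap in fakes.
from dataclasses import dataclass


@dataclass
class ApplicationContext:
    """Bundles external dependencies (database, storage, OCR client, ...)."""
    database: object
    storage: object = None


def fetch_document_usecase(context: ApplicationContext, document_id: str) -> dict:
    """Business logic lives in the usecase, not in the Lambda handler."""
    return context.database.get_document(document_id)


def lambda_handler(event, _lambda_context, context_factory=None):
    # The handler stays thin: build (or receive) a context and delegate.
    context = context_factory() if context_factory else _production_context()
    return fetch_document_usecase(context, event["pathParameters"]["id"])


def _production_context() -> ApplicationContext:
    # In the deployed service this would wire real boto3 clients and the database layer.
    raise NotImplementedError


# --- pytest-style unit test: no AWS access required ---
class FakeDatabase:
    def get_document(self, document_id):
        return {"id": document_id, "fields": {"wages": "1000.00"}}


def test_fetch_document_usecase_returns_record():
    context = ApplicationContext(database=FakeDatabase())
    result = fetch_document_usecase(context, "doc-123")
    assert result["fields"]["wages"] == "1000.00"
```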

Codebase Organization & Cleanup

  • Created aws and documents modules for better code organization.
  • Removed unused document_classifier.py file.
  • Updated static image imports for Parcel compatibility.

Authentication Implementation

  • Created user stories for login page and basic single-user authentication.
  • Combined front-end and back-end authentication tasks into a single ticket (#118).
  • Decided on simple SSO (no Login.gov integration needed).
  • Authentication development initiated and ready to be pulled into the sprint.

Document Processing & Training

  • W2 custom query training initiated and additional queries identified.
  • Created and trained secondary adapter to handle >30 queries for W2 documents.
  • DD214 adapter training completed.
  • 1099-NEC training samples prepared and shared.

Team Coordination & Planning

  • Active collaboration and code review among team members.
  • Discussions around optimizing model behavior (e.g., dollar sign handling in form fields).
  • Coordinated approach for document extraction accuracy testing and query adapter integration.
  • Clarified next steps for AWS cloud environment migration (staying on AWS, no need for cloud.gov).

Sprint & Story Management

  • Created and assigned user stories for authentication, data consistency, and UI enhancements.
  • Planning to explore Forms context alongside core development (e.g., accuracy testing and authentication).

Sprint 3 Outcomes (March 17-28, 2025)

Goals of the sprint:

  • Document classification: Automatically classify uploaded documents (W-2, 1099, DD214) based on text keys found in the form and extract relevant fields accordingly. This will help set up the tool for training in the future.
  • Test accuracy: As we develop classification logic, we’ll conduct a second round of accuracy testing to assess if document classification and extraction features improve accuracy compared to Round 1.
  • Improve and test processing speed: Review processing components for lag time and identify areas to improve wait times. Conduct testing and measure any speed differences.

Document Classification

What we did: We built logic to automatically detect form types (W-2, 1099, DD214) using AWS Textract and keyword-based rules. To ensure accuracy, we checked key positions to avoid false positives. We used Textract queries to extract key data fields from each form type and implemented logic to ignore irrelevant text, such as instructions or URLs.
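
A simplified sketch of this keyword-plus-position approach is shown below; the keyword lists and the position threshold are illustrative placeholders rather than the rules actually used in the repo.

```python
# Simplified sketch of keyword-based classification over Textract output.
# Keyword lists and the position threshold are illustrative, not the project's actual rules.

FORM_KEYWORDS = {
    "W-2": ["wage and tax statement"],
    "1099-NEC": ["nonemployee compensation"],
    "DD214": ["certificate of release or discharge from active duty"],
}


def classify_document(blocks: list[dict]) -> str | None:
    """Return the detected form type, or None if no keyword matches.

    Only LINE blocks near the top of the page are considered, so keywords that
    also appear in instructions or fine print don't trigger a false positive.
    """
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue
        # Position check: BoundingBox.Top is 0 at the top of the page, 1 at the bottom.
        if block["Geometry"]["BoundingBox"]["Top"] > 0.5:
            continue
        text = block.get("Text", "").lower()
        for form_type, keywords in FORM_KEYWORDS.items():
            if any(keyword in text for keyword in keywords):
                return form_type
    return None
```

A caller would pass the Blocks list from a Textract response and then select the query set for whichever form type is returned.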

Insights

  1. Classification should be clear to users. In all testing runs, document classification was successful: the rates for both key capture and key accuracy were 100%. However, the classification result (e.g., W-2, 1099, or DD214) was not displayed for users. Adding a 'Form' field may help users confirm that classification occurred.
  2. Query adjustments improve detection but may clutter the user experience. To help the system detect the correct keys, additional text was added to the queries when the same keys appeared multiple times in a form (e.g., “street address” could appear for both an employer and a recipient). While this improves backend capture and accuracy, repetitive text can feel distracting to users. UI adjustments may help preserve backend accuracy while presenting outputs directly aligned with what users see on the actual form; a sketch of this query disambiguation appears below.

Sample markup of fields to include or exclude when creating queries for extraction
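
To make the query-disambiguation insight concrete, below is a minimal sketch of spelling out whose value is wanted in the query text while keeping a shorter alias for display; the query wording, aliases, and file name are hypothetical rather than the project's actual query set.

```python
# Hypothetical query wording and aliases: the query text spells out whose
# address is wanted, while the alias stays short enough to show in the UI.
import boto3

textract = boto3.client("textract")

queries = [
    {"Text": "What is the employer's street address?", "Alias": "EMPLOYER_ADDRESS"},
    {"Text": "What is the employee's street address?", "Alias": "EMPLOYEE_ADDRESS"},
]

with open("w2_sample.jpg", "rb") as f:  # illustrative file name
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["QUERIES"],
        QueriesConfig={"Queries": queries},
    )
```

The front end can then map each alias to a label closer to what appears on the printed form, keeping the more verbose query text on the backend only.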

Accuracy Testing

What we did: Re-tested Sprint 1 sample documents with version 2 of the tool. For data analysis, we adjusted Round 1 data to exclude fields not accounted for by the new logic introduced in Round 2 testing.

Insights

  1. Accuracy improved, but confidence slightly declined. Compared to Round 1, key accuracy improved by 23.49% and value accuracy by 16.04% in Round 2. Despite these gains, average confidence unexpectedly dropped by 0.36%. Further investigation is needed to understand why confidence decreased even as accuracy improved.
  2. Multi-line value capture remains inconsistent. The system sometimes identifies line breaks and sometimes does not, and multi-line responses from DD214 were cut off. The inconsistency appears tied to how the system handles line breaks, suggesting that improving line break recognition should be explored.
  3. Representing data from tables can be further improved. Values extracted from tables included table headings. Exploring table-specific parsing approaches may help streamline outputs and improve data usability; one possible approach is sketched below.
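
One table-specific approach worth exploring is sketched below: with Textract's TABLES feature, CELL blocks carry row and column indexes and can be tagged with entity types such as COLUMN_HEADER, so headings can be filtered out of extracted values. This is illustrative only and would need testing against our actual forms.

```python
# Illustrative table parsing over Textract TABLES output: keep (row, column) values
# and skip cells tagged as column headers so headings don't leak into the data.

def cell_text(cell: dict, blocks_by_id: dict) -> str:
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] == "CHILD":
            words += [
                blocks_by_id[i]["Text"]
                for i in rel["Ids"]
                if blocks_by_id[i]["BlockType"] == "WORD"
            ]
    return " ".join(words)


def parse_table(blocks: list[dict]) -> dict:
    """Map (row, column) to cell text, excluding header cells."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    cells = {}
    for block in blocks:
        if block["BlockType"] != "CELL":
            continue
        if "COLUMN_HEADER" in block.get("EntityTypes", []):
            continue  # drop headings from extracted values
        cells[(block["RowIndex"], block["ColumnIndex"])] = cell_text(block, blocks_by_id)
    return cells
```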

Processing Speed Testing

What we did: We reviewed processing components for lag time, identified areas to reduce SQS wait times, and reduced actual wait time by minimizing cold starts and optimizing queue behavior. Additionally, we added a UI component (i.e., a loading circle) to indicate active processing and improve perceived speed.
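
One common pre-warming pattern is sketched below: a scheduled ping (e.g., from an EventBridge rule) that the handler short-circuits so the execution environment stays warm. The payload shape and names are illustrative, and the project may rely on a different mechanism (such as provisioned concurrency).

```python
# Rough sketch of a scheduled warm-up ping; payload shape and names are illustrative.


def lambda_handler(event, _context):
    # Warm-up invocation: return immediately so the execution environment stays warm.
    if event.get("source") == "warmer":
        return {"warmed": True}

    # Normal path: process extraction jobs delivered by the SQS trigger.
    results = []
    for record in event.get("Records", []):
        results.append(process_document(record["body"]))
    return {"processed": len(results)}


def process_document(message_body: str) -> str:
    # Placeholder for the real extraction pipeline (Textract call, DB write, etc.).
    return message_body
```

On the queue side, the SQS event source mapping's batch size and maximum batching window control how long messages can sit before a function picks them up, so keeping the batching window small is one lever for reducing wait time.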

Insights

  1. Lambda pre-warming and SQS optimization improve processing speed. Round 2 saw a 33.2% (9.5 second) reduction in average processing time compared to Round 1. This improvement was driven by a combination of strategies: pre-warming Lambdas to reduce cold start delays and optimizing how jobs were managed in SQS.

Sprint 2 Outcomes (March 3-14, 2025)

Goals of the sprint:

  • Accuracy Testing: Test the accuracy of AWS Textract with three common forms in various conditions to understand how the technology works and what kind of patterns we need to improve.
  • OCR Exploration: Review landscape of open source OCR technologies and weigh pros and cons of moving forward with Textract versus an open source option.

Accuracy Testing

What we did: Ran a series of forms through the PoC and validated the extracted outputs against the fields and values in the original document.

Document types tested:

  • W2 Wage and Tax Statement Copy B (2025)
  • 1099 Non-employee Compensation (2024 Rev-1)
  • DD Form 214 (2009 August)

Conditions tested:

  • PDF (digital download)
  • Printed hard copy (JPG from mobile photo)
  • Printed, crumpled hard copy (JPG from mobile photo)

Sample documents (hard copies) used for POC testing

Insights

  1. The extractor performs well overall: Baseline accuracy is high (average 85% for keys and 81% for values), providing a strong foundation for further refinement. Factors like PDF vs. image format or document crumpling had no significant impact on results.
  2. Relevant data should be prioritized: The tool extracts all available data from the form, but not all of it may be relevant to users (such as website links or agency branding). Workshopping with users will help determine which fields to keep, omit, or retain as metadata.
  3. Complex layouts need additional support: The tool performs well at recognizing form boundaries but requires additional training to handle unique layouts, such as tables, checkboxes, multi-value fields, and shaded areas.
  4. Key detection can be improved: One of the most common errors is an inability to distinguish where a key ends and a value begins. This leads to data being pulled from incorrect or multiple areas when it should remain a single value. Conversely, data from a single value can be fragmented into multiple fields.

Examples of key detection errors from POC testing
  5. Confidence patterns should be analyzed: While AWS does not disclose how confidence scores are calculated, testing suggests that smaller font sizes and complex structures, such as tables, may lower confidence levels. Understanding these patterns can help refine extraction strategies.

OCR Exploration

What we did: Scanned the landscape and compared three open-source OCR tools (EasyOCR, PaddleOCR, and Tesseract). We’ve documented our learnings in a downloadable spreadsheet. We implemented two of these OCR tools in addition to AWS Textract to explore their functionalities.

Insights

  1. Existing OCR solutions have varying out-of-the-box extraction capabilities: Textract supports many different types of extraction (standard text, forms, tables, queries, etc.), can be trained for better results, and can guess the keys of the fields in a form. In a trial test, Textract significantly outperformed Tesseract at extracting structured data. Given Tesseract's limitations, we decided not to test other open-source tools, as Textract was already delivering strong results. A sketch of this difference appears after this list.
  2. Open-source offerings come with varying learning models: Some tools have deep learning models, which can improve OCR accuracy for complex documents and handwriting. Others may allow us to bring our own models, but either option would require extensive up-front work to match Textract's capabilities.
  3. Tradeoffs exist between usage costs and the cost of developer time: Textract charges per use, which can become costly at high volumes, whereas open-source options have no per-use price. On the other hand, open-source options offer fewer out-of-the-box capabilities, resulting in more developer time spent up front building those capabilities and a higher cost of maintenance, for example through increased memory and CPU costs.
  4. We will use AWS Textract for the near term: Given our focus on demonstrating what an AI-powered OCR with self-learning capabilities can do for document processing staff, we have elected, in the short term, to keep using AWS Textract. Open-source OCR may be more cost effective in the long run, but it requires a significant initial investment that we’d like to avoid given the limited scope of our proof-of-concept. We will continue to weigh trade-offs as we gain a clearer view of the runway toward a fully operational product.
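
As a minimal sketch of the out-of-the-box difference noted in insight 1 (the file name is a placeholder), Tesseract returns raw text that still needs custom parsing, while Textract's FORMS/TABLES analysis returns structured blocks linking keys to values:

```python
# Illustrative comparison; "w2_sample.jpg" is a placeholder file name.
import boto3
import pytesseract
from PIL import Image

# Tesseract (open source): plain text only; form structure must be rebuilt by hand.
raw_text = pytesseract.image_to_string(Image.open("w2_sample.jpg"))

# Textract: FORMS/TABLES analysis returns structured blocks, including
# KEY_VALUE_SET blocks that link detected keys to their values.
textract = boto3.client("textract")
with open("w2_sample.jpg", "rb") as f:
    response = textract.analyze_document(
        Document={"Bytes": f.read()},
        FeatureTypes=["FORMS", "TABLES"],
    )

key_value_blocks = [b for b in response["Blocks"] if b["BlockType"] == "KEY_VALUE_SET"]
print(f"Tesseract characters: {len(raw_text)}, Textract key/value blocks: {len(key_value_blocks)}")
```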