OCR (discussion)
Accuracy
CR:
The general problems with our current OCR tech as I understand them; we face both:
- Inaccurate recognition.
Source characters are mapped to incorrect characters in the OCR process. Training the OCR is the known path forward to improvement.
- Unstable recognition.
The same image is converted to different OCR strings, with the result depending on noise generated in the video feed. We've attempted an averaging technique without improvement, and we currently have no high-confidence path forward.
If training the OCR is already necessary, I wonder if the end result will also solve the noise issue. Phone cam OCRs translate images under much less stable conditions than our use cases. Granted, I've no idea what kind of image processing happens in those pipelines.
Do we prioritize training for accurate recognition first and revisit stabilizing the source after getting results? Or is there value in working on stabilizing the image alongside training?
SB: I am not confident that training the OCR will resolve the unstable recognition issue. The instability of the video feed appears to trigger the change event for the OCR, and even though the moving average filter was intended to solve this, it happens way too frequently. I think we could look into other approaches for addressing the digital hardware noise instead of averaging the last X samples, such as a stricter thresholding approach. For example, we take the grayscale and pick 0 or 255 for each pixel's value based on the last several frames of data. I could write up a new Python class using OpenCV to try this out at some point.
I believe that we could iterate on this approach in parallel to training efforts.
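A minimal sketch of the strict thresholding idea above, for reference. It assumes OpenCV and NumPy; the class name `TemporalThreshold`, the window size, and the fixed threshold value are illustrative placeholders rather than anything already in the codebase.

```python
import collections

import cv2
import numpy as np


class TemporalThreshold:
    """Binarize each grayscale frame to 0/255, then stabilize every pixel
    by majority vote over the last `window` binarized frames."""

    def __init__(self, window=5, thresh=128):
        self.thresh = thresh
        self.history = collections.deque(maxlen=window)

    def push(self, frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(gray, self.thresh, 255, cv2.THRESH_BINARY)
        self.history.append(binary)
        # A pixel is white in the output only if it was white in more than
        # half of the buffered frames, which suppresses single-frame noise.
        stack = np.stack(list(self.history), axis=0)
        votes = (stack == 255).sum(axis=0)
        return np.where(votes * 2 > len(self.history), 255, 0).astype(np.uint8)
```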
CR:
> appears to trigger the change event for the OCR
But there is no change event as such. In a loop, mort records a video frame at an interval (what mort calls its speed) and OCRs it. The number of OCR events is the same whether no pixels in the image changed or all of them did. The UI updates when a result comes back that's different from the previous one.
In my testing with the short loop you sent me, I probably saw at least 10 different OCR results if I let it run for a while, so we're not even close to stable right now. I'm not against doing more with image preprocessing, just adding this info for the technical record. The target is jumping all over the place at the moment.
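For the record, a rough sketch of that loop behavior as described above; this is not mort's actual code. pytesseract and Pillow's ImageGrab stand in for the real OCR and capture path, and the interval, capture region, and `lang="jpn"` setting are assumptions.

```python
import time

import pytesseract
from PIL import ImageGrab  # stand-in for the real frame source


def poll_loop(interval_s=0.5, region=(0, 0, 640, 480)):
    """OCR runs on every tick regardless of whether any pixels changed;
    only the displayed result is gated on differing from the previous one."""
    previous = None
    while True:
        frame = ImageGrab.grab(bbox=region)               # capture the current frame
        text = pytesseract.image_to_string(frame, lang="jpn")
        if text != previous:                              # "change" exists only at the result level
            print("UI update:", text)
            previous = text
        time.sleep(interval_s)
```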
Testing procedure
CR:
We talked previously about gathering enough sample data to train the OCR well enough to be accurate for any particular game. As I remember it, we've only talked about doing this with live gameplay footage: playing through enough of the game to get enough data.
Did we discuss other ideas I don't remember?
I had another idea and want to discuss feasibility.
Assumptions:
- Text color, background color, and size are constant.
- Absolute character position on display and character positions relative to each other don't affect OCR result.
If those assumptions hold true, it seems to follow that rendering a game's entire character set to a screen, or however many screens it takes to fit the entire set, yields a complete test case for that game. Implementation would require writing code to do that and then getting it onto the console, probably with a flash cart. Then getting the data for a full test case is a matter of capturing a long enough video sample for all the screens.
This seems like it would potentially be faster and more accurate than, e.g., playing through the entire game to capture all dialog used.
Do we know if the assumptions above actually hold true, though?
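If the whole-charset screens can be captured, scoring a run could be as simple as the sketch below. difflib is standard library; `charset_accuracy` and `expected` are made-up names, and this assumes we know the exact character sequence that was rendered on screen.

```python
import difflib


def charset_accuracy(expected: str, ocr_output: str) -> float:
    """Character-level match ratio between the character sequence we know
    was rendered on screen and what the OCR returned for those captures."""
    matcher = difflib.SequenceMatcher(None, expected, ocr_output)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(expected), 1)
```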
SB: Playing through the game definitely could be a form of acquiring dialog in situ, and I'm open to this process. Originally I was going to manually grab examples that failed basic tests, and also feed the trainer lines of the entire font spritesheet for the game. No matter what, some degree of manual character labeling is going to be necessary for the training process. My plan is to start with one game and see how much of a difference it makes as we expand outward to games that both of us and our community want to try to play. If we "get big enough", it would be nice to have a pipeline where members of the public can contribute training data to help out.
I really like your idea of being able to render the entire font to a "playable game", e.g. a flash-cart-runnable *.sfc/etc. file. However, I lack the low-level experience to know how to go about doing that. It makes sense, though, and if we had that talent it would speed up certain aspects of font gathering for different games, assuming there was some in-game API/SDK for rendering the font to a screen. In the past, I have been loading *.SFC files into a spritesheet reader and arranging the font into a single file.
For the case of IoG, the font colors will not necessarily be consistent unless we "threshold" them to grayscale and (0, 255). Different characters in the game can have different identifying font colors (e.g. Kara is pink, Will/Tim is yellow, etc.), and in the Japanese version even certain words are emphasized in red. I've seen other games use inconsistent background colors and change font sizes as well (not IoG, though). I don't believe that absolute position SHOULD affect anything. We should also be running tesseract without vertical character mode, so that will help us a bit here too with consistency.
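As a reference point for the thresholding plus horizontal-only tesseract setup described above, a sketch is below. It assumes OpenCV, pytesseract, and an installed `jpn` traineddata; the Otsu threshold and the `--psm 6` page-segmentation choice are assumptions, not settled project settings.

```python
import cv2
import pytesseract


def ocr_dialog(frame_bgr):
    """Collapse the varying font colors (pink, yellow, red emphasis, ...)
    into plain black/white, then OCR as horizontal Japanese text."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # lang="jpn" is the horizontal model; the vertical one would be "jpn_vert".
    # --psm 6 treats the region as a single uniform block of text.
    return pytesseract.image_to_string(binary, lang="jpn", config="--psm 6")
```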