
Image Analyzer (Advanced users)

In order to better understand how my solution works, you will first need to read the work of these two smart people, who opened the way to image recognition with the One Piece Treasure Cruise game. My own solution is based on their work and couldn't have existed without them.

So I encourage you to check their projects first, https://github.com/Guillem96/optc-box-exporter and https://github.com/CMarah/optc-box-exporter, to understand how my solution is just an increment on them. I've tried my best to improve their algorithm in order to achieve the following goals:

  • make this detection work for free from a browser (no server should be required)
  • reduce the time spent on character recognition
  • make a first attempt at video recognition, to incentivize the community to find a better way than mine to handle this use case

So, to achieve these goals, I've spent a lot of time understanding the documentation of the tool used by the two previous projects, which is OpenCV JS. This tool supports a lot of platforms and development languages, but to fulfill my first requirement, the JS port (which uses WASM technologies) was the way to go. However, its documentation is a lot rougher to understand, and the same goes for the concepts of image recognition. Having the previous projects as examples has been really helpful: even if Guillem96's one uses the Python version, which has a different syntax, its documentation helped me a lot to understand the concepts and procedures.

Improving the character recognition performance

For the second point of improvement, I've analyzed how both previous projects work. They both use the same technique, which consists of comparing a detected square with all the characters available in the database, comparing images one by one, computing a score and keeping the highest.

This approach has major drawbacks:

  • we spend a lot of time (down)loading all the individual images (currently we have almost 4000 of them)
  • the OpenCV JS algorithm used to compute the matching score, matchTemplate, is heavy and can easily take anywhere from milliseconds to whole seconds to process

In this type of situation, my preferred approach is to take the problem from the opposite direction. What happens if, instead of comparing images one by one, we try to find our square in one bigger picture composed of all the DB images?

After re-reading the documentation of the matchTemplate function, we find that:

Template Matching is a method for searching and finding the location of a template image in a larger image. OpenCV comes with a function cv.matchTemplate() for this purpose. ... Once you got the result, you can use cv.minMaxLoc() function to find where is the maximum/minimum value

It seems that's the way those functions were meant to be used in the first place, so I've given it a try.
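To make the idea concrete, here is a minimal sketch of that usage, assuming `cv` is the loaded OpenCV JS module and that both images have already been loaded into img or canvas elements (the function and variable names are mine, not the application's actual code):

```js
function findCharacterInMatrix(matrixElement, squareElement) {
  const haystack = cv.imread(matrixElement) // big picture with every DB character
  const needle = cv.imread(squareElement)   // square extracted from the screenshot
  const scores = new cv.Mat()

  // Slide the template over the matrix and score every position
  cv.matchTemplate(haystack, needle, scores, cv.TM_CCOEFF_NORMED)

  // The best match is simply the global maximum of the score map
  const { maxVal, maxLoc } = cv.minMaxLoc(scores)

  haystack.delete(); needle.delete(); scores.delete()
  return { x: maxLoc.x, y: maxLoc.y, score: maxVal }
}
```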

The first step was to generate the big picture with all the database characters. To make it, I've written a script that combines all these pictures into one, and that lets me modulate some parameters like the character width and height, or the zone of each picture to extract (in order to remove redundancy and to reduce the overall size). You can check it here: https://github.com/Nagarian/optc-box-manager/blob/main/src/scripts/cropImage.mjs.

NB: This script generates two images: one for the Global version and one for the Japanese version. Those who use OPTC-DB regularly know that it doesn't respect the real Global in-game character ids: Global-first units available on the Japan version have Japan ids, and Global-only characters (which are not available on the Japan version) have a weird id. However, my script generates a picture which respects the real in-game character ids, and that's why we generate two versions of the picture.

The picture is generated with a fixed width of 100 characters per row. This way, I can do some basic math with a bunch of predefined parameters and, from some coordinates, compute the corresponding character id. The following picture is similar to the first version I've generated.

Heavy version of the character matrix
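That coordinates-to-id math boils down to a few lines. Here is an illustrative sketch; the constant names, and the assumption that ids start at 1 and run left to right, top to bottom, are mine:

```js
const CHARACTERS_PER_ROW = 100 // fixed width of the generated picture
const CELL_WIDTH = 10          // px, must match the generation parameters
const CELL_HEIGHT = 10         // px, must match the generation parameters

function coordinatesToCharacterId(x, y) {
  const column = Math.floor(x / CELL_WIDTH)
  const row = Math.floor(y / CELL_HEIGHT)
  return row * CHARACTERS_PER_ROW + column + 1 // ids assumed to start at 1
}
```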

With this first version, my algorithm took around 3 to 4 seconds to process and find a matching character. It was already a big improvement over the previous algorithms I had tried, but for video analysis or a great user experience it was still too slow for me. In addition, the produced image is quite heavy (> 26 MB), which is fine on localhost, but for mobile users on the metro it would be a pain to download, especially because they need to re-download it on each database update (when new characters are added).

So I've tried to reduce the quality of the generated picture to reduce its size and weight. And this is where the magic happened.

Reducing the character size on the picture cuts loading and processing time by a lot without reducing the detection rate. After some manual testing, I've found that with a character width of 10px, we get a processing time of around 300ms per square, and the picture size is around 1 MB! This is what they look like:

  • Character matrix - Global version
  • Character matrix - Japan version

Improving the square detection algorithm

Now that my solution was working great, I started to integrate it into my application, OPTC Box Manager, and while doing so, I asked myself which user experience would be the most useful. For me, the most needed use cases were handling the tavern after-pull screen and the friend point after-pull screen. For the people who want to use my application, the needed use case is replicating their in-game box. Those 3 use cases are really different and each has its own specificities: the user box is the hardest, due to the complexity implied by all the noise produced by the additional numbers, text, colors and icons drawn above the original images. The two others are easier, but they can be annoying due to the Sugo-Fest wording and the new banner, which prevent the square detection algorithm from finding the squares.

In-Game Screenshot

That's why I've decided to let the user choose the algorithm they want to apply to their screenshot. So we have the following ones:

Image Analyzer algorithm

The Generic Square algorithm is my own mix of the original work of Guillem96 and CMarah, and tries to detect squares on the picture above a certain threshold. It works almost great for the tavern and FPP screens, but for box detection it fails completely: there is too much noise to detect rows and columns.

The Box video algorithm will be discussed in the last section; the other 3 share the same mechanism, which seems to be the best compromise to me. I will now explain its behavior...

After many tries with different techniques, I've concluded that it's not possible to handle all of these use cases with one master algorithm that detects squares on the screen; they are too different for that. However, there is an element in common in all of the screenshots mentioned above, an element that has been here for so many years that it's a shame Bandai hasn't changed it, but for me it was the best opportunity to simplify everything: the game has a fixed screen size!

Since the release of the game, Bandai hasn't changed its shape. With the arrival of devices with taller screens, they chose to add an ugly background in order to keep the real game screen size the same. So, retrieving those constants from my own phone screen and scaling them to the user's screenshot does the trick. This way, we can generate a generic matrix and extract sub-pictures from the screenshot. And that's what I do.
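As an illustration of that scaling idea (the constant names and values below are placeholders, not the application's actual measurements):

```js
// Constants measured once on a reference device
const REFERENCE = {
  gameWidth: 1080,  // game screen width on the reference phone, in px
  squareSize: 176,  // character square size
  firstColumnX: 32, // x offset of the first column
  firstRowY: 410,   // y offset of the first row
}

// Scale every constant by the ratio between the game screen width
// detected in the user's screenshot and the reference width
function scaleConstants(detectedGameWidth) {
  const ratio = detectedGameWidth / REFERENCE.gameWidth
  return {
    squareSize: Math.round(REFERENCE.squareSize * ratio),
    firstColumnX: Math.round(REFERENCE.firstColumnX * ratio),
    firstRowY: Math.round(REFERENCE.firstRowY * ratio),
  }
}
```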

But there is one thing to determine before proceeding: how do we detect where the game screen starts? If we take a closer look at the borders of the 3 screens above, we can see that the first two have a strong yellow horizontal line inherent to the display of money, gems and user name, and the last screen has a clear color difference between the almost white background of the tavern and the black of the ugly off-screen background. So I took the same algorithm as the others: I apply a Canny detection and a HoughLinesP to detect horizontal lines, and I keep the one that best matches the one I need.
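A minimal sketch of that detection step, assuming `cv` is the loaded OpenCV JS module and `src` is a cv.Mat of the screenshot (the thresholds are illustrative, not the ones used by the application):

```js
function detectHorizontalLines(src) {
  const gray = new cv.Mat()
  const edges = new cv.Mat()
  const lines = new cv.Mat()

  cv.cvtColor(src, gray, cv.COLOR_RGBA2GRAY)
  cv.Canny(gray, edges, 50, 150)
  // rho = 1px, theta = 1 degree, keep only long, mostly unbroken segments
  cv.HoughLinesP(edges, lines, 1, Math.PI / 180, 80, src.cols * 0.5, 10)

  // Each detected line is stored as 4 int32 values: x1, y1, x2, y2
  const horizontals = []
  for (let i = 0; i < lines.rows; i++) {
    const x1 = lines.data32S[i * 4]
    const y1 = lines.data32S[i * 4 + 1]
    const x2 = lines.data32S[i * 4 + 2]
    const y2 = lines.data32S[i * 4 + 3]
    if (Math.abs(y1 - y2) < 3) horizontals.push({ x1, y1, x2, y2 })
  }

  gray.delete(); edges.delete(); lines.delete()
  return horizontals
}
```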

We can see on the screen above that the only job done is detecting the top of the screen; then I apply the matrix and try to find matches, but it doesn't work, since it's not the right matrix.

Wrong algorithm used

For the User Box algorithm, there is an additional computation made before applying the square matrix. Indeed, the user can scroll through their box, so the characters are not always at the same position, and we need to find the position of the first row. That's why I reuse the algorithm found by Guillem96 (translated into JS), but applied to a limited zone of the screen (the zone I've marked in blue in the following picture); I then extract the most promising squares from this zone. A short sketch of that zone restriction follows the picture.

User Box analyzed
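Here is an illustrative sketch of that zone restriction, using cv.Mat.roi to get a view on the interesting band of the screenshot; the bounds are placeholders, not the application's real values:

```js
// `src` is a cv.Mat of the screenshot, `constants` the scaled values
// computed earlier; the zone bounds below are illustrative only
function extractFirstRowZone(src, constants) {
  const zone = new cv.Rect(
    0,                        // full screenshot width
    constants.firstRowY,      // top of the blue zone
    src.cols,
    constants.squareSize * 2, // tall enough to always contain one full row
  )
  return src.roi(zone) // a view on `src`, no pixel copy
}
```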

Video Analysis

For this last improvement, I've studied the OpenCV JS documentation to understand how it works, since nobody had tried to implement a solution before. So here is the result of my study.

Video analysis is basically the same as image recognition, because a video is just a bunch of screenshots animated at a defined framerate. So we only need to take a frame, analyze it, and process the next one, until the video ends.

But there is a crucial point to understand: the algorithm should be fast, really fast. A video needs to refresh the display at a rate of 30 frames per second to be considered acceptable, which leaves about 33ms per frame. But as I mentioned above, my character recognition algorithm takes around 300ms to process, which leaves me 2 options to construct an acceptable experience:

  • we pause the video to process the character recognition, then resume it, and repeat until the end of the video
  • we let the video play, take screenshots at defined timings, and run the character recognition in a second step

I've tried both solutions; however, I faced some technical limitations that drove me to choose the second option:

  • I've chosen to use OpenCV only in a JS background worker, which is the recommended approach to avoid freezing the user interface while OpenCV is processing
  • the HTML video element (HTMLMediaElement) has a limited interface which doesn't allow processing a video frame by frame; we are forced to take a screenshot of the video instead of accessing the raw frames
  • synchronisation between the two previous points is almost impossible because they are independent components (when we send a pause signal from the worker, it takes a few milliseconds to be processed before the video is really paused, which is already too late)

All these reasons drove me to the second option. So here is how it works.

I create a videoElement on the main JS thread which plays the video; then, as sketched in the code after this list:

  • a processFrame function takes a screenshot and sends it to the background worker
  • the background worker tries to detect the game screen size (once it has it, it's saved as a constant to avoid recomputing it on each frame)
  • once those constants are defined, we try to find the horizontal lines on the box (see the previous explanation with the blue zone above)
  • if the first detected line is below the previously registered one, the user has probably scrolled towards the bottom, but if it's above, we consider that the first character row has been completely scrolled past
  • when five character rows have been scrolled past, it means we should process this frame, because we have scrolled far enough
  • a signal is then sent to the main thread, which keeps the analyzed frame as a valid analysis to be processed when the video analysis reaches its end
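Here is a hedged sketch of that main-thread loop; the worker file name, the message shapes and the analyzeKeptFrames helper are illustrative, not the application's actual code:

```js
const video = document.querySelector('video')
const canvas = document.createElement('canvas')
const ctx = canvas.getContext('2d')
const worker = new Worker('analyzer.worker.js') // hypothetical worker file
const keptFrames = []

worker.onmessage = ({ data }) => {
  // the worker signals that five fresh character rows have scrolled past
  if (data.keepFrame) keptFrames.push(data.frame)
}

function processFrame() {
  if (video.ended) {
    analyzeKeptFrames(keptFrames) // hypothetical second-step recognition
    return
  }
  canvas.width = video.videoWidth
  canvas.height = video.videoHeight
  ctx.drawImage(video, 0, 0) // "screenshot" of the current video frame
  const frame = ctx.getImageData(0, 0, canvas.width, canvas.height)
  worker.postMessage(
    { buffer: frame.data.buffer, width: frame.width, height: frame.height },
    [frame.data.buffer], // transfer ownership instead of copying
  )
  requestAnimationFrame(processFrame)
}

video.addEventListener('play', () => requestAnimationFrame(processFrame))
```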