2. Upscaling and refining from 1K to 24K with Krita AI Hires - minsky91/krita-ai-diffusion-hires GitHub Wiki
A benchmark test series comparing Hires against the standard plugin version
In this study, I compare output quality and performance of the Hires version with those of the standard one, based on the results of a benchmarking test series.
For benchmarking, I used a single test image, a fantasy landscape titled “The mirage of Gaia”, at 7 different resolutions, starting from a base image of 1K (1024x576) size and produced by progressive upscaling and refinement, using the same prompt and the same set of parameters throughout. At each step, the image is refined in Krita AI at 40% strength (equivalent to 0.4 denoise in other SD WebUI tools), using an Unblur control layer (equivalent to the Tile Resample controlnet) attached to the background image, with a custom strength of 0.54 (equivalent to Image Guidance 54% in the plugin’s Upscale / Refine mode). The resulting output image, selected from a batch or series of generated images, is then 2x or 1.5x upscaled using the standard (non-AI) Lanczos method with subsequent unsharp masking (done to avoid any additional AI model influencing the result) and used as input for the next step. For each of the 3 tested components, an individual sequence of images was generated this way, up to and including 24K (24576x13824) resolution.
Note that this mechanical upscaling & refining scheme was chosen primarily for comparative performance benchmarking and artefact detection. It rests on the expectation that any systematic glitch introduced at one generation step will be amplified by the next step, performed at a higher resolution, and thereby easily spotted. This kind of routine, while sharing elements with a regular upscaling & refinement process, is not intended to achieve the best possible quality.
Here’s a screenshot of the Krita AI UI taken during processing of the final 24K image, illustrating the set of parameters applied at each refinement step using Hires:
Benchmark tests were carried out on a Windows 11 server PC equipped with a 2.50 GHz Intel Core Ultra 9 185H CPU, 32 GB RAM and SSD storage. The GPU was an NVIDIA GeForce RTX 4070 Ti SUPER with 16 GB VRAM, driven by the Comfy server v0.3.19. No TeaCache or any other accelerator extension was used. To measure the impact of a wifi connection, part of the tests was performed on a Windows 11 laptop with Krita connected over LAN to the server PC. Of the models and important generation parameters used, I will mention only the dreamshaperXL_lightningDPMSDE checkpoint (chosen for the fine detail it can generate), coupled with the Hyper sampler Euler a at 10 steps. The Unblur (Tile Resample) model used was xinsir tile-sdxl-1.0.
The 3 tested components were the standard Krita AI Refine, the standard Upscale / Refine and Krita AI Hires Refine. Clicking the link for each component will open a respective Google Drive folder with the full benchmarking test data: ComfyUI workflows, metadata, input and output images, and Krita .kra documents. The PDF with the full benchmark spreadsheet is available here; the excerpt below shows the quoted data, with the Hires columns in pink:
I will first compare the results of the standard Krita AI Refine component (non-tiled) against those of Krita AI Hires Refine, powered by Tiled Diffusion. Speed-wise, from the 4K resolution on, testing showed much shorter generation times for the Hires version compared to the standard one: 38s vs 63s (including 1s vs 10s of output image download time, when performed over wifi), on average for a series of 5 images, where each generation was requested only after the previous one had finished. When generating in batches, the time difference was substantially smaller, since in that case the slow python websocket-based transfer used by the standard version to download the output was compensated by the server processing the next image while the download was still in progress.
At the next resolution, 8K, the test was performed only once for the standard Refine, which took 2713 seconds (over 45 minutes) to complete - a time I couldn’t afford to spend on a single run again. In contrast, Hires completed in only 128s (including 1s vs 36s standard download time on the server PC, with no wifi involved) - an overall speedup of more than 21x.
Output quality was comparable between the standard version’s results and those of Hires. Among the noticeable differences: in the standard version’s rendering of the Gaia planet shape, some distortions can be seen along the edges (see the comparison image below, on the left). The renderings produced by Hires, on the other hand, show a characteristically softer texture overall, as well as some faint tile seams in the image generated with a tile overlap of 32 pixels (an overlap of 64 pixels resulted in no visible artefacts).
From 8K on, therefore, a meaningful comparison could only be made between the results of the Refine option of the plugin’s standard Upscale workspace (the 2nd tested component) and those of Hires Refine. Here the speed competition was more even, since the former also employs a (proprietary) tile-based method for high-resolution image processing, and a blazingly fast one at that. The speed difference was particularly small on the server side (125s with Hires vs 117s with the standard Upscale / Refine for the 8K image). On the client side, however, the slow python websocket-based image download transfers of the standard version made its total processing times considerably longer, which cannot be explained by a slow wifi connection, since similarly slow transfer rates were registered in tests done on the PC hosting both the server and the Krita AI client. The registered processing times (Hires vs standard) were: 324s vs 370s (incl. 3s vs 86s download), on average over a series of 3 refined images at 12K, and 588s vs 637s (incl. 6s vs 158s download), on average over a series of 2 images at 16K.
In the 24K test, the standard tiled version failed, displaying the error message “Failed to validate prompt for output 27:* ImageScale 18: - Value 24576 bigger than max of 16384: width) Output will be ignored”. The Hires version finished the 24K test every time I tried it (perhaps over 10 runs by now), except for an OOM registered when a LoRA was used. The best timing at 24K was 1491s, or 24m 51s (incl. a download time of 4s for a 53 MB JPEG image; JPEG support is a feature introduced in Hires specifically for saving and transferring at ultra-high resolutions, since PNG becomes impractical beyond certain sizes).
Benchmarking of the 40% refine test also revealed the significant overhead incurred by the Base64-based scheme the standard version uses to embed images in the workflow, resulting in bloated workflow .json files sent to the server with each and every generation. In contrast, the Hires version uses new, optimized methods for both uploading and downloading images, the first of which ensures that an input image is uploaded only once per session and without any overhead. For added speed, uploads of multiple images are done in parallel (not used in this particular benchmark series). Another drawback revealed in the standard version is its tile-generating logic, which produces unwieldy node counts that grow with the input image resolution. With the Hires version, the node count is not proportional to the input or output resolution, and the generated workflow stays essentially the same for the same set of parameters across input images of varying resolutions.
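The Base64 cost is easy to quantify: the encoding alone inflates binary data by about a third before any JSON framing or repeated re-sending is counted. A minimal illustration (mock data, not the plugin’s actual code):

```python
import base64
import json

image_bytes = bytes(range(256)) * 4096          # ~1 MB of mock image data
encoded = base64.b64encode(image_bytes).decode("ascii")
workflow_json = json.dumps({"image": encoded})  # image embedded in the workflow JSON
overhead = len(workflow_json) / len(image_bytes) - 1
print(f"Base64 embedding overhead: {overhead:.0%}")  # ~33% from the encoding alone
```

Re-sending such a payload with every generation multiplies this cost, which is consistent with the workflow file sizes quoted below.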
The numbers below compare, per resolution, the standard version’s workflow file (sent with each generation) against the Hires version’s input image upload (done only once per series) and its workflow, whose size (0.005 MB) and node count (18) stayed the same for all resolutions tested:

| Resolution | Standard workflow size | Overhead vs input images | Standard node count | Hires input upload (once per series) | Hires node count |
|---|---|---|---|---|---|
| 8K | 67 MB | 46% (vs 44 MB) | 611 | 44 MB | 18 |
| 12K | 184 MB | 51% (vs 122 MB) | 1451 (!) | 122 MB | 18 |
| 16K | 339 MB | 55% (vs 228 MB) | 1186 | 228 MB | 18 |
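The growth of the standard version’s node count follows from tiling geometry: if a workflow emits a group of nodes per tile, the tile count grows roughly quadratically with resolution. A hypothetical sketch (the tile size and overlap values are assumptions for illustration, not the plugin’s actual settings):

```python
import math

def tile_count(width: int, height: int, tile: int = 1024, overlap: int = 64) -> int:
    """Number of tiles needed to cover an image, with adjacent tiles
    overlapping by `overlap` pixels (illustrative parameters)."""
    stride = tile - overlap
    cols = math.ceil((width - overlap) / stride)
    rows = math.ceil((height - overlap) / stride)
    return cols * rows

for w, h in [(8192, 4608), (12288, 6912), (16384, 9216)]:
    print(f"{w}x{h}: {tile_count(w, h)} tiles")
```

A per-tile node scheme thus multiplies quickly at high resolutions, whereas a workflow that delegates tiling to a single node (as Tiled Diffusion does) keeps a constant node count regardless of resolution.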
Speaking of output quality, while the two versions were in general on a par with each other, some issues were registered with both when generating at ultra-high resolutions. The common issue was a fuzzy graininess, noticeable at 12K and up, that accumulated along the upscale steps and could not be helped, given the automatic routine. The Hires version was prone to banding in light areas (the sky, for instance), while the standard version tended to suffer more from tile seam artefacts (see the 12K fragment comparison below, on the left), with equal tile overlap sizes. From 12K up, the seam artefacts were most pronounced in the output from the standard version and quite difficult to get rid of. The image detail produced by the standard Upscale / Refine, on the other hand, was consistently sharper than that rendered by the Hires version, due to the ‘softish’ Mixture of Diffusers method used by the latter’s Tiled Diffusion.
When Tiled Diffusion is mentioned by Stable Diffusion artists, it is often because of its famed ‘creativity’ in generating images: the tendency to produce more varied, bordering-on-hallucination content at high denoise values while not betraying its tiled nature much, compared to other upscale & refine methods such as Ultimate SD Upscale. Concluding this benchmarking test, I couldn’t resist running a special 24K generation at 100% strength (1.0 denoise), combined with Unblur at a slightly liberal 0.54 strength, to see the magic of Tiled Diffusion in action. The result didn’t disappoint: it was one of the most appealing renderings I have ever seen at such an immense image size. Here are 2 fragments from that generation, each with a pixel area covering only (roughly) 2-3% of the entire output image:
It bears repeating, however, that the images generated in this test series were meant only to supplement the benchmark comparison, which was all about performance and detecting possible artefacts, as mentioned above. Given the rigid, mechanical routine of their generation, they cannot possibly be regarded as examples of quality upscaling or artworks of any sort - nor could they be viewed properly from a Google Drive even if they were. To see the actual work titled “The mirage of Gaia”, in its full artist-curated, partially inpainted 24K glory, please visit the version displayed on the EasyZoom site. Viewing the image on the largest screen you can get hold of is highly recommended.