4. Usage examples: regional prompting - minsky91/krita-ai-diffusion-hires GitHub Wiki

Regional prompting is a powerful image-generating feature, supported in Krita AI since v1.18. In this case study, I compare output quality and performance of the standard plugin version with those of the Hires one, based on the results of a benchmarking test series.

For benchmarking, I used a single test image, titled “An old forester”, at three resolutions: 4K x 3K (4096x3072), 6K x 4K (6144x4608) and 8K x 6K (8192x6144). All tests used the same base prompt and the same set of regions, represented as layers in a Krita document with their region prompts attached. Here’s a screenshot of the Krita AI UI with the regions as layers, their prompts and a selection denoting their combined extent (the selection is shown only for illustration and wasn’t active during the actual generation; the plugin uses the layers’ opacity as masks, and those can overlap to some degree):

An old forester - Krita AI screenshot with background and region selection

The examples were produced and the benchmark tests were carried out on a Windows 11 PC equipped with a 2.50 GHz Intel Core 9 185H chip, 32 GB RAM and SSD storage. The GPU was an NVIDIA GeForce RTX 4070 Ti SUPER with 16 GB VRAM, driven by the Comfy server v0.3.19. Neither TeaCache nor any other accelerator extension was used. To measure the impact of a wifi connection, part of the tests was performed on a Windows 11 laptop with Krita connected over LAN to the server PC. Of the models and important generation parameters used, I will mention only the dynavisionXLAllInOneStylized_releaseV0610Bakedvae checkpoint (renowned for the fine stylized illustrations it can generate), coupled with the Hyper sampler Euler a at 10 steps, via the Hyper-SDXL-12steps-CFG-lora LoRA at 100%. The Unblur (Tile Resample) model used was xinsirtile-sdxl-1.0.

The following two modes of the plugin were tested: generate at 100% strength (as in the screenshot) and refine at 50% (0.5 denoise) using an Unblur control layer attached to the background image, with a custom strength of 0.36 (equivalent to Image Guidance 36% in Upscale / Refine). The tested components were standard Krita AI Refine, standard Upscale / Refine and Krita AI Hires Refine. The link for each component opens the corresponding Google Drive folder with the full benchmarking test data, ComfyUI workflows, metadata, input and output images and Krita .kra documents. The PDF with the full benchmark spreadsheet is available here; the excerpt below shows the quoted data, with the Hires columns in pink:

Benchmark data Standard vs Hires fragment

In the generate at 100% mode, testing showed much faster generation times for the Hires version than for the standard one: 46s vs 142s for the 4K x 3K image (including 1s vs 11s of output image download time, when performed over wifi), for a batch of 5 generated images, and 81s vs 181s for the 6K x 4K one (including 2s vs 25s download time), for a batch of 4 images. I decided to skip testing the standard version with 8K images, having already observed its excessively long processing times at resolutions this high. For the 8K x 6K test image, the Hires version took 166s on average per image within a batch of 3 (including a download time of 3s for an output image of 51 MB).

Output quality was comparable between the standard version’s results and those of the Hires version, except in the rendering of the forester, where the standard version produced a badly distorted figure in the 6K x 4K image (see below, on the left), while the Hires result was generally okay. Also, the standard version hallucinated some mountains in the sky area, although its rendering of the hut was somewhat better than that of the Hires version.

forester figure comparison - standard vs Hires - fragment from 6Kx4K

In the refine at 50% mode, where the Refine option of the plugin’s Upscale workspace was tested against Refine of the Hires version, the speed competition was more even, since the former also employs a (proprietary) tile-based method for high resolution image processing, and a blazingly fast one at that. Its speed advantage was particularly substantial on the server side (192s with the standard Upscale / Refine vs 275s with Hires for the 8K x 6K image). On the client side, however, the slow Python websocket-based image downloads of the standard version made its total processing times considerably longer. This could not be explained by a slow wifi connection, since similarly slow transfer rates were registered in tests done on the PC running both the server and the Krita AI client. The registered processing times (Hires vs standard) were: 66s against 59s (incl. 1s vs 17s download) in a series of five 4K x 3K refined images, 124s against 129s (incl. 2s vs 28s download) in a series of four 6K x 4K images, and 286s against 266s (incl. 3s vs 74s download) in a series of three 8K x 6K images.
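To make the role of the download step easier to see, here is a minimal sketch that splits each quoted total into its download part and the remainder. It uses only the figures quoted above; the function and variable names are mine and purely illustrative.

```python
# Totals and download times in seconds, as quoted above: (total, download).
timings = {
    "4K x 3K": {"hires": (66, 1), "standard": (59, 17)},
    "6K x 4K": {"hires": (124, 2), "standard": (129, 28)},
    "8K x 6K": {"hires": (286, 3), "standard": (266, 74)},
}


def breakdown(total_s, download_s):
    """Return (time excluding download, download share of the total in %)."""
    return total_s - download_s, 100.0 * download_s / total_s


for resolution, versions in timings.items():
    for version, (total_s, download_s) in versions.items():
        rest_s, share = breakdown(total_s, download_s)
        print(f"{resolution} {version:8s}: {rest_s:3d}s excluding download, "
              f"download = {share:4.1f}% of total")
```

For the standard version at 8K x 6K, the remainder comes out at 192s, which matches the server-side figure quoted above. For Hires the remainder (283s) is slightly above the quoted 275s of server time; presumably the difference covers uploads and client-side processing, though that is my assumption rather than something the benchmark data states.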

Benchmarking in the refine at 50% strength mode also revealed the significant overhead incurred by the standard version’s scheme of embedding Base64-encoded images in the workflow, which results in bloated workflow .json files being sent to the server with each generation, over and over. In contrast, the Hires version uses new, optimized methods for both uploading and downloading images, the first of which ensures that an input image is uploaded only once per session and without any overhead. For added speed, multiple images are uploaded in parallel. Another drawback revealed in the standard version is its tile-generating logic, which produces very large node counts that grow with the input image resolution. With the Hires version, the node count does not depend on the input or output resolution, and the generated workflow stays basically the same for the same set of parameters and input images of varying resolutions.
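As a rough illustration of what Base64 embedding costs on its own: the encoding maps every 3 input bytes to 4 output characters, so it inflates binary data by about a third before any duplication comes into play. The sketch below measures this with Python's standard `base64` module; the random payload is only a stand-in for compressed image data, and the helper name is hypothetical.

```python
import base64
import os


def base64_overhead_percent(raw: bytes) -> float:
    """Size increase, in percent, caused by Base64-encoding `raw`."""
    encoded = base64.b64encode(raw)
    return 100.0 * (len(encoded) - len(raw)) / len(raw)


# Random bytes stand in for an already-compressed image file.
payload = os.urandom(3_000_000)
print(f"Base64 overhead: {base64_overhead_percent(payload):.1f}%")
```

Any overhead well beyond this one-third baseline therefore has to come from something other than the encoding itself, such as the same image being embedded more than once.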

For the 4K x 3K image, the workflow file generated by the standard version was 41 MB in size (164% overhead as computed against 15.4 MB of total input image size, which also suggests that duplicates of some images were embedded), and it contained 361 nodes in total. This compares with just 15.4 MB of input images uploaded by the Hires version, only once per series (0% overhead), and a workflow of 0.01 MB with 36 nodes; the workflow size and node count stayed the same for images of all 3 resolutions tested. For the 6K x 4K test image, the standard workflow file was 90 MB in size (164% overhead against 34 MB of total input image size), and it contained 813 nodes in total, as compared to 34 MB of input images uploaded by the Hires version and 36 nodes. For the 8K x 6K test image, the standard workflow was 180 MB in size (165% overhead against 68 MB of total input image size), and it contained 1319 (!) nodes in total, as compared to 68 MB of input images uploaded by the Hires version and 36 nodes.
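The overhead percentages and the growth of the node count can be rechecked with a few lines of arithmetic. This is a sketch based only on the rounded figures quoted above, so the recomputed overheads (about 165-166%) differ slightly from the quoted 164-165%, which were presumably derived from unrounded file sizes. The helper names are illustrative.

```python
# Figures quoted above; sizes in MB are rounded, so results are approximate.
cases = [
    # (label, width, height, workflow_mb, input_mb, standard_nodes)
    ("4K x 3K", 4096, 3072, 41.0, 15.4, 361),
    ("6K x 4K", 6144, 4608, 90.0, 34.0, 813),
    ("8K x 6K", 8192, 6144, 180.0, 68.0, 1319),
]
HIRES_NODES = 36  # same for all three resolutions, as stated above


def overhead_percent(workflow_mb, input_mb):
    """Extra workflow size relative to the raw input images, in percent."""
    return 100.0 * (workflow_mb - input_mb) / input_mb


def nodes_per_megapixel(nodes, width, height):
    """Workflow node count divided by the image size in megapixels."""
    return nodes / (width * height / 1e6)


for label, w, h, workflow_mb, input_mb, nodes in cases:
    print(f"{label}: overhead {overhead_percent(workflow_mb, input_mb):.0f}%, "
          f"{nodes_per_megapixel(nodes, w, h):.1f} nodes/MP (standard), "
          f"{nodes / HIRES_NODES:.1f}x the Hires node count")
```

The standard version's node count works out to roughly 26 to 29 nodes per megapixel at all three resolutions, which is consistent with a workflow that adds a fixed group of nodes per tile.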

As for output quality, while the two versions were in general on a par with each other, some issues were registered with both of them when generating at ultra-high resolutions. The Hires version tended to hallucinate more, generating miniature versions of the forester in various spots across the image, which suggests that Tiled Diffusion is probably not that well tuned to the plugin’s regional prompting logic (*). Some hallucinations were to be expected though, since I deliberately set a relatively high refine strength of 50% (equivalent to 0.5 denoise), not compensated much by the Unblur control layer, which was deliberately set low at 0.36. Lowering the refine strength or raising the Unblur strength usually eliminates such hallucinations.

The standard version was also prone to hallucinating tiny versions of the forester and the hut at the higher resolution, although to a lesser degree. More troubling are the tile seam artefacts produced by the standard version in the Upscale / Refine mode (see the fragment comparison below, on the left). They start to show up already at the 4K x 3K resolution, and at 8K x 6K the tile artefacts are rife all over the output image, despite the higher-than-default tile overlap of 64 pixels that I set in the hope of preventing them.

tile artefact comparison - standard vs Hires - fragment from 8Kx6K

As far as I could see, the Hires version, powered by Tiled Diffusion, did not produce any visible tile artefacts when using the same tile overlap of 64 pixels.
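To give a feel for how the number of tiles, and with it the number of seams, grows with resolution, here is a generic sliding-window tiling sketch. It is not the algorithm used by either plugin version or by Tiled Diffusion: the 1024-pixel tile size is an assumed value chosen for illustration, and only the 64-pixel overlap comes from the tests described above.

```python
import math


def tiles_along(size_px: int, tile_px: int, overlap_px: int) -> int:
    """Number of overlapping tiles needed to cover one image dimension."""
    if size_px <= tile_px:
        return 1
    stride = tile_px - overlap_px  # distance between consecutive tile origins
    return math.ceil((size_px - overlap_px) / stride)


def tile_grid(width: int, height: int, tile_px: int = 1024, overlap_px: int = 64):
    """Return (columns, rows) of a grid that covers the whole image.

    tile_px=1024 is a hypothetical tile size, not a plugin setting.
    """
    return (tiles_along(width, tile_px, overlap_px),
            tiles_along(height, tile_px, overlap_px))


for width, height in [(4096, 3072), (6144, 4608), (8192, 6144)]:
    cols, rows = tile_grid(width, height)
    print(f"{width}x{height}: {cols} x {rows} = {cols * rows} tiles")
```

Under these assumptions the tile count roughly triples between the 4K x 3K and 8K x 6K images, so any approach that blends tiles only within the overlap band has correspondingly more seams to hide.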

(*) This idea was later confirmed by additional testing with the regional prompting logic turned off in the Hires version. That resulted in output free of any artefacts or hallucinations whatsoever, for images of all 3 resolutions; overall, I couldn’t see anything superior about the region-enabled output, whether Hires or standard. Processing without regions was also 10% faster per image. The images generated in this subtest can be found here in the benchmark data folder, identified by the “NO regions” marker in their filenames.