# Setup
Requires Python 3.10 or later.
By default, prebuilt packages for CUDA 12.4 are installed. If you need a different CUDA version, change the index URL in `requirements-pytorch.txt` and `requirements-llamacpp.txt` before running the setup script.
- Git clone or download this repository.
- Run `setup.sh` on Linux, `setup.bat` on Windows.
  - This will create a virtual environment that needs 7-9 GB.
If the setup scripts didn't work for you but you managed to get it running manually, please raise an issue and share your solution.
- Linux: `run.sh`
- Windows: `run.bat` or `run-console.bat`
You can open files or folders directly in qapyq by associating the file types with the respective run script in your OS.
For shortcuts, icons are available in the `qapyq/res` folder.
If git was used to clone the repository, simply run `git pull` to update.
If the repository was downloaded as a zip archive, download it again and replace the installed files.
New dependencies may be added. If the program crashes or fails to start, run the setup script again to install the missing packages.
During installation, packages are extracted into your temporary folder, and some packages may require a few GB of temporary space. If the setup fails with an out-of-space error even though your disk has plenty of space left, you might be running out of space on the RAM disk backing the `/tmp` folder.
Try setting a different temporary folder when starting `setup.sh`, like this:
```sh
cd qapyq
mkdir tmp
TMPDIR=./tmp ./setup.sh
rmdir tmp
```
Most configuration can be done in the GUI. The settings are stored in `qapyq_config.json` inside the application's folder.
Some settings that may be worth changing manually in the config file are listed below.
The `path_export` setting in the config file is used as the default path for file choosers. Relative paths in path templates for exporting files are also resolved against this setting.
The default is the application's installation folder, which is probably impractical. I suggest changing it to the top-level path of your datasets folder.
The user interface should be scaled automatically to your display's DPI and OS settings. If you wish to scale it further for better readability or space utilization, you can change the `gui_scale` setting in the config file.
The appearance should automatically match your OS settings. If this detection fails, you can change the color scheme manually by setting `color_scheme` to either `dark` or `light`.
For better readability, qapyq uses a monospace font throughout the interface. A default font is included (DejaVu).
To change the font, edit the `font_monospace` path in the config file and point it to another TTF font file.
When the path is empty, it will use the default monospace font as defined by your OS.
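These keys can of course be edited directly in a text editor. If you prefer scripting, here is a minimal sketch that assumes the key names mentioned above and guesses at plausible value types — check your own `qapyq_config.json` for the actual format:

```python
import json
from pathlib import Path

# Location of the config inside the application's folder; adjust to your install path.
config_path = Path("qapyq/qapyq_config.json")
config = json.loads(config_path.read_text())

# Key names as mentioned above; the value types shown here are assumptions.
config["path_export"]    = "/data/datasets"  # default path for file choosers and relative export paths
config["gui_scale"]      = 1.25              # additional UI scaling factor
config["color_scheme"]   = "dark"            # "dark" or "light"
config["font_monospace"] = ""                # empty = use the OS default monospace font

config_path.write_text(json.dumps(config, indent=4))
```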
Images may contain information about rotation or flipping. This transformation is ignored by default, because:
- Some training tools seem to ignore EXIF information, so it's probably better to see the images in their native state and rotate them manually.
- Applying transformation might invalidate existing masks.
- It can slow down thumbnail generation a bit.
To enable EXIF transformation globally, change the `exif_transform` setting from `false` to `true`, and also change it on remote hosts if needed. The image viewer, thumbnails in the Gallery, variables in path templates, image size stats, and inference backends will then use the transformed versions of the images.
Masks are not loaded as images. They are ignored when loading files and folders, and they don't appear in the Gallery.
The Gallery will check if a mask exists and display an icon.
The identification of masks is based on the `mask_suffix` setting in the config file. The default is `-masklabel`.
Currently, it simply checks whether the filename ends with this suffix. For showing icons, the Gallery only considers masks with the `.png` extension, while the file loader ignores masks with any extension. This works well with the format that OneTrainer expects, but might not work properly with other tools.
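For illustration, a rough sketch of this matching logic (not qapyq's actual code; the OneTrainer-style `image-masklabel.png` naming is an assumption):

```python
from pathlib import Path

MASK_SUFFIX = "-masklabel"  # the default mask_suffix setting

def is_mask(path: Path) -> bool:
    """File loader: treat a file as a mask (and skip it), regardless of its extension."""
    return path.stem.endswith(MASK_SUFFIX)

def gallery_mask_exists(image_path: Path) -> bool:
    """Gallery icon: only a .png mask next to the image counts."""
    mask_path = image_path.with_name(image_path.stem + MASK_SUFFIX + ".png")
    return mask_path.exists()

# Example: gallery_mask_exists(Path("dataset/cat01.jpg")) looks for "dataset/cat01-masklabel.png".
```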
The Mask Tool allows defining a more flexible path template for loading and saving masks.
If you have multiple GPUs, you can set the device number in the `infer_devices` setting. This setting takes a list of values, but only the first one is used currently.
To use multiple GPUs at once for data-parallel processing, you can set up multiple hosts for the local machine and spawn multiple inference subprocesses.
See Host Setup for Remote Inference.
To use automated captioning and masking, you first have to create presets for the AI models in the Model Settings, accessible via the burger menu in the top left corner of the Main Window.
qapyq makes no internet connections (except during the setup script) and does not automatically download models.
Models need to be downloaded manually from https://huggingface.co. Some links to the model download pages are listed in the Readme.
An exception is the `transparent-background` package for the "InSPyReNet RemBg" backend. When running this backend, it will attempt to download the model if it couldn't find it locally. It will also compile the model and create a new file next to it.
- GGUF vision models (for llama.cpp) consist of a single `.gguf` file + a multi-modal projector model (`mmproj...gguf`).
- The WD tagging models need the `.onnx` file + the `selected_tags.csv`.
  - The CSV file is the same for all WD variants.
- Ultralytics/YOLO detection models usually come as single files in `.onnx` or `.pt` format.
  - Models made for Adetailer should work too, since Adetailer also uses YOLO.
- Embedding models for the ONNX backend need the folder with the configuration files + a text model and vision model in `.onnx` format.
  - The linked pages provide ONNX models in different sizes. I suggest using the largest text model and the fp16 vision model.
  - If there's a `.onnx` and a `.onnx_data` file, you'll need both. Choose the `.onnx` file in qapyq's Model Settings.
- All other models (transformers, torch) need the whole folder with all the files.
  - I haven't fully tested which files are necessary for each model, but generally, you don't have to download screenshots or examples.
- Some model weights are available in multiple formats. In that case you generally only need one of the files. If available, take the `.safetensors` file.
- Avoid using `.` (dot) in folder names for transformers models. Apparently there are issues with loading, and the resulting error message is misleading.
These should work, but the required packages are not installed by default. Installing them requires downgrading the torch and CUDA versions, and the packages may have to be built from source.
Visual models consist of an LLM and a visual part, which translates the image into something the LLM can understand. These models may require a large amount of video memory (VRAM). To reduce VRAM usage, you can enable quantization and/or load layers into your normal RAM on the CPU side. The more layers are loaded to the GPU, the faster the model will work. Support for quantization and layer offloading depends on the model and its version.
The available quantization options can compress the model's weights to 4 or 8 bits. This will reduce the quality of the output, though quantized large models might still perform better than smaller models. The inference speed with quantization depends on hardware support.
The Model Settings allow specifying the number of layers that should be loaded into the GPU's VRAM. This accepts relative values from `0%` (only CPU) to `100%` (fully loaded to GPU). While loading the model, the actual number of layers is logged to the console. The values may differ for each model.
Most time is usually spent generating the text; encoding the image is faster. So when VRAM is short, it makes sense to keep the LLM on the GPU and offload some visual layers to the CPU first by reducing the `%` value in the Model Settings.
Use your OS' system monitor or your GPU driver's control panel to see the VRAM usage and adjust the layer settings accordingly.
These models output all the known tags together with a confidence score. The `threshold` setting defines how to filter the tags: only tags with a confidence score above the threshold are kept.
If your results contain lots of inaccurate tags, try raising the threshold a bit. If you want to include more uncertain tags, lower the threshold.
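In other words, it is a simple cutoff. A minimal sketch (tag names and scores invented for illustration):

```python
# Tags with confidence scores as returned by a tagging model (values invented).
scores = {"1girl": 0.95, "outdoors": 0.62, "umbrella": 0.34}
threshold = 0.35

kept = [tag for tag, score in scores.items() if score > threshold]
print(kept)  # ['1girl', 'outdoors']
```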
The WD models can also find character tags and an SFW rating for the image. These tags are prepended to the output.
When "Only Best Match" is enabled for the character tags, it will only output the character tag with the best score (if it lies above the threshold).
MCUT is a simple yet efficient algorithm used for classification problems that finds the biggest gap in scores. For example, when tags with the scores `0.9, 0.8, 0.5, 0.4` are returned, MCUT will define a dynamic threshold in the middle between 0.8 and 0.5.
However, the threshold found this way is often very high and would exclude many valid tags. And since it ignores and overrides the user-defined threshold, it's not tunable.
qapyq implements two experimental threshold modes that are based on MCUT. They repeatedly apply MCUT to the remaining scores to create clusters of tags, separated by the largest gaps in scores.
For example: `0.9, 0.8 | 0.5, 0.4, 0.3 | 0.1` (3 clusters separated by `|`)
Which clusters are included in the output depends on the selected threshold and mode:
- "Adaptive Strict" excludes the whole cluster when some of its tags have a score below the threshold.
- In the example above, a threshold of 0.35 would exclude the second cluster, because it ends with 0.3.
- "Adaptive Lax" will include this cluster, even when some scores are below the threshold.
- In the example above, a threshold of 0.35 would include the second cluster, because it starts with 0.5.
All tags of the best cluster are always included when an adaptive mode is selected. A threshold of 1.0 for both modes is equivalent to the original MCUT.
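A rough sketch of how this clustering and filtering could look (an illustration of the idea, not qapyq's actual implementation):

```python
def split_clusters(scores: list[float]) -> list[list[float]]:
    """Repeatedly split a descending list of scores at the largest remaining gap (MCUT)."""
    scores = sorted(scores, reverse=True)
    clusters = []
    while len(scores) > 1:
        gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
        cut = gaps.index(max(gaps)) + 1   # split after the largest gap
        clusters.append(scores[:cut])
        scores = scores[cut:]
    if scores:
        clusters.append(scores)
    return clusters

def adaptive_filter(scores: list[float], threshold: float, strict: bool) -> list[float]:
    clusters = split_clusters(scores)
    kept = list(clusters[0])              # the best cluster is always included
    for cluster in clusters[1:]:
        limit = min(cluster) if strict else max(cluster)
        if limit >= threshold:
            kept.extend(cluster)
    return kept

scores = [0.9, 0.8, 0.5, 0.4, 0.3, 0.1]
print(split_clusters(scores))                       # [[0.9, 0.8], [0.5, 0.4, 0.3], [0.1]]
print(adaptive_filter(scores, 0.35, strict=True))   # [0.9, 0.8]
print(adaptive_filter(scores, 0.35, strict=False))  # [0.9, 0.8, 0.5, 0.4, 0.3]
```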
Some models, like YOLO detection models, can detect various "classes" (types of objects, like person, face, hand, car, etc.). The list of supported classes should be part of the model's documentation. If it doesn't mention the classes, it might only support one class.
The "Detect" operation in the Mask Tool has a "Retrieve" button which can load the classes directly from a model.
Often, we're only interested in a few of the classes. Put these classes into the text box and separate them with commas. You can create presets using the same model, but for different classes.
If the text box is empty, all detections are added to the mask.
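A minimal sketch of this class filtering (class names and detections are invented for illustration):

```python
class_filter = "face, hand"   # contents of the text box; an empty string keeps all detections
wanted = {c.strip() for c in class_filter.split(",") if c.strip()}

# Detections as (class_name, bounding_box) pairs; values invented.
detections = [("face", (10, 10, 50, 50)), ("person", (0, 0, 200, 300)), ("hand", (60, 80, 90, 110))]

kept = [det for det in detections if not wanted or det[0] in wanted]
print([name for name, _ in kept])  # ['face', 'hand']
```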
Embedding models encode images and text prompts into feature vectors, each around 3-6 KB in size. The image features are cached to disk with hashed filenames inside the `qapyq/.cache/embedding` folder. Each `.npy` file corresponds to a folder. Simply delete the `.cache` folder to clear the cache.
I recommend SigLIP2 (giant-opt-384) with the ONNX backend: It loads and processes faster, it only loads the required model (text/vision), and the models are available with different quantizations.
To better extract the meaning, text prompts are augmented with templates. This means that when we enter, for example, "a cat" to sort the images by their "catness", the average of multiple text embeddings is used:

```
a photo of a cat
a drawing of a cat
a cat shown in an illustration
...
```
You can see and change the templates inside the `qapyq/user/embedding-prompt-templates/` folder. Each line must contain a `{}` placeholder, which is replaced with the prompt. Empty lines and comments starting with `#` are ignored.
Creating new `.txt` files inside that folder and its subfolders will make them available for selection in the Model Settings.
`default.txt` contains a list of templates. `prompt-only.txt` disables the augmentation.
These two files will be reset when updating. Create new files if you want to add your own templates.
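To make the augmentation concrete, here is a sketch of how such templating could work; `embed_text` is a hypothetical stand-in for the selected model's text encoder, and the file format follows the description above:

```python
import numpy as np

def load_templates(path: str) -> list[str]:
    """Read a template file: one line per template with a '{}' placeholder,
    skipping blank lines and '#' comments."""
    templates = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                templates.append(line)
    return templates

def embed_prompt(prompt: str, template_path: str) -> np.ndarray:
    templates = load_templates(template_path)
    # embed_text() is a placeholder for the actual text encoder.
    embeddings = np.stack([embed_text(t.format(prompt)) for t in templates])
    mean = embeddings.mean(axis=0)
    return mean / np.linalg.norm(mean)   # average, then re-normalize

# Example: embed_prompt("a cat", "qapyq/user/embedding-prompt-templates/default.txt")
```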
The original models were trained on center-cropped square images, and they expect square images as the input during inference. However, center-cropping could miss some important parts of the images that we're looking for.
qapyq implements different processing strategies (ONNX backend only) to handle non-square images:
- Center Crop
  - Like the original. Might cut away important parts of an image.
- Squish to Square
  - Simply resizes the image and distorts it by changing the aspect ratio.
  - Fast and works quite well, despite the distortion.
- Multi Patch
  - Creates multiple crops along the longer axis and combines the embeddings using the selected aggregate function (a sketch follows below the Patch Aggregate notes).
  - The number of patches depends on the aspect ratio, while an overlap of at least 1/8 is ensured.
  - If the aspect ratio is close to square, the image is resized and squished instead.
  - This is slower, and the speed depends on the aspect ratios in your dataset.
Multi Patch has different modes:
- Force centering in landscape orientation
  - For images that are wider than tall, the number of patches is rounded up to the next odd number. This ensures that the middle patch is still a "center crop", like the model expects.
- Force centering in portrait orientation
  - Likewise, for images that are taller than wide.
- Always force centering
  - Always rounds the patch count up to the next odd number.
Since "force centering" generates more patches that need to be encoded, it will be slower. I settled on "Force centering in landscape orientation". Images in portrait orientation seem to be captured well enough even without centering, but this depends on the dataset.
The Patch Aggregate setting only affects Multi Patch processing.
- `Mean` works well but could dilute some details that only appear in one patch.
- `Max` preserves strong signals from each patch, but might be noisy.
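As referenced above, a rough sketch of the Multi Patch idea and the aggregate step (the patch layout is simplified and not qapyq's exact logic; `embed_image` is a hypothetical stand-in for the vision encoder):

```python
import math
import numpy as np
from PIL import Image

def square_patches(img: Image.Image, overlap: float = 1/8, force_odd: bool = False) -> list[Image.Image]:
    """Cut square crops along the longer axis, overlapping by at least `overlap` of a patch."""
    w, h = img.size
    size = min(w, h)                                   # patch edge = shorter side
    length = max(w, h)
    count = max(1, math.ceil((length - size * overlap) / (size * (1 - overlap))))
    if force_odd and count % 2 == 0:                   # "force centering": odd count keeps a true center crop
        count += 1
    patches = []
    for i in range(count):
        offset = 0 if count == 1 else round(i * (length - size) / (count - 1))
        box = (offset, 0, offset + size, size) if w >= h else (0, offset, size, offset + size)
        patches.append(img.crop(box))
    return patches

def embed_multi_patch(img: Image.Image, aggregate: str = "mean") -> np.ndarray:
    # embed_image() is a placeholder for the selected vision encoder.
    embeddings = np.stack([embed_image(p) for p in square_patches(img)])
    combined = embeddings.mean(axis=0) if aggregate == "mean" else embeddings.max(axis=0)
    return combined / np.linalg.norm(combined)
```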
When changing the processing strategy, new image embeddings are created and cached to a different folder.
- For Ovis-1.6 to run multiple times, I had to make these changes to the model's code: GitHub Issue
- InternVL2 doesn't follow system prompts well.
- InternVL2-26B couldn't see the images and only wrote about its hallucinations.
- Qwen2VL raises an out-of-memory error when visual layers are offloaded to the CPU, or when describing very large images.
- The onnx version of RMBG-2.0 issued warnings and was very slow. The transformers version works well.
| Model | Quantization | LLM GPU Layers | Visual GPU Layers |
|---|---|---|---|
| InternVL2 1B-8B | None | 100% | 100% |
| InternVL2 40B | NF4 | 100% | 0% |
| MiniCPM-V-2.6 32K Context | Q8 / f16 | 100% | 100% |
| Molmo-7B-D | None | 100% | 100% |
| Ovis1.6-Gemma2-9B | None | 100% | 0% |
| Qwen2-VL 2B/7B | None | 100% | 100% |
I haven't tried InternVL-2-76B or Qwen2-VL-72B.
qapyq supports running inference on remote machines over SSH. For this, you need to:
- Set up the SSH server on the remote machine.
- Set up the SSH client on your local machine.
  - On Windows, PuTTY can be used as the SSH client.
- Install qapyq on the remote machine using the setup script.
- Upload your models to the remote machine.
  - Models need to be placed in the same folder structure as locally.
  - The paths are translated using the "Model Base Path" setting in the Hosts Window.
- Add the remote machine as a host in qapyq's Hosts Window (opens with Ctrl+H).
  - See below for example commands.
Communication between the qapyq GUI and the remote host works exclusively through pipes over the secure SSH connection.
When starting a task, images are uploaded, and the remote host caches them only in memory until they are processed. The number of images that are uploaded in advance and kept in the cache can be configured with the "Queue Size" option in the Hosts Window. A size of 2-3 seems best for Gigabit LAN connections.
Multiple hosts can be used in the same task, which scales well and speeds up the task considerably. Images are queued to all hosts that are selected in the Hosts Window.
Alternatively, only one host can be activated before starting a task, which keeps other hosts available for other tasks.
Speedup can also be achieved by increasing the "Process Count" setting and spawning multiple processes on one machine for the same GPU, if the VRAM allows for it.
Below are a few example commands which you can enter in the Hosts Window.
Replace the path to the `run-host.sh` or `run-host.bat` script with the actual path on the remote host.
For Linux clients:
- `ssh hostname "/srv/qapyq/run-host.sh"`
  - Use this if you made a profile for `hostname` in your `~/.ssh/config`.
- `ssh -i "/path/to/private.key" -l [remote-username] -p [port] [ip-address] "/srv/qapyq/run-host.sh"`
  - For Linux remote hosts.
- `ssh -i "/path/to/private.key" -l [remote-username] -p [port] [ip-address] "C:\qapyq\run-host.bat"`
  - For Windows remote hosts.

For Windows clients using PuTTY, create a profile in PuTTY and then use `plink.exe`:

- `"C:\Program Files\PuTTY\plink.exe" profile "/srv/qapyq/run-host.sh"`
  - For Linux remote hosts.
- `"C:\Program Files\PuTTY\plink.exe" profile "C:\qapyq\run-host.bat"`
  - For Windows remote hosts.
qapyq won't ask for passwords for logins or keys. The command must run without expecting input; otherwise, the connection will fail.
Use `ssh-add "/path/to/private.key"` or PuTTY's `pageant` to decrypt and store the key beforehand. You could also add the password to the command, but it would be saved as plaintext in `qapyq_config.json`.
The `run-host` scripts take the device number as an optional argument. For a machine with multiple GPUs, you can add multiple hosts with different device numbers. The first GPU is usually 0.
Local processes for a different GPU can be started with the following commands (replace "1" with the actual device number):

- `sh -c "/path/to/qapyq/run-host.sh 1"`
  - On Linux
- `cmd.exe /c "C:\path\to\qapyq\run-host.bat 1"`
  - On Windows
If your local or remote hosts have different amounts of VRAM, you may have to use different settings per host for loading the models (number of layers, quantization).
This works for caption and LLM presets: To create an overriding preset for a host, open the Model Settings Window, duplicate the original preset, and add `[hostname]` at the end of the preset name (replace hostname with the exact name from the Hosts Window, case-sensitive). Spaces between the preset name and the `[` are optional.
When starting a task, select the original preset without hostnames. The overriding preset with `[hostname]` is then used for the denoted host; if no per-host preset exists, the selected original preset is used.
The console will show "Using preset ..." when it finds the overriding preset.
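A small sketch of that selection rule (illustration only; preset names invented and storage simplified):

```python
def resolve_preset(selected: str, hostname: str, presets: dict[str, str]) -> str:
    """Prefer a per-host override like 'MyPreset [gpu-server]'; fall back to the selected preset."""
    for name in (f"{selected} [{hostname}]", f"{selected}[{hostname}]", selected):
        if name in presets:
            return presets[name]
    raise KeyError(selected)

# Preset names invented for illustration; the values stand in for the actual settings.
presets = {
    "Caption Large": "original settings",
    "Caption Large [gpu-server]": "override with fewer GPU layers",
}
print(resolve_preset("Caption Large", "gpu-server", presets))  # override with fewer GPU layers
print(resolve_preset("Caption Large", "laptop", presets))      # original settings
```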
The sample settings are the same across all hosts, and model paths are still translated, which allows testing the presets locally.
It might also be possible to set a different model for a host, for example one with more or fewer parameters, but I haven't tested this. It would of course produce different captions for the images that were processed on that host.