# Setup
Requires Python 3.10 or later.
By default, prebuilt packages for CUDA 12.4 are installed. If you need a different CUDA version, change the index URL in `requirements-pytorch.txt` and `requirements-llamacpp.txt` before running the setup script.
- Git clone or download this repository.
- Run `setup.sh` on Linux, `setup.bat` on Windows.
  - This will create a virtual environment that needs 7-9 GB.
If the setup scripts didn't work for you but you managed to get it running manually, please raise an issue and share your solution.
- Linux: `run.sh`
- Windows: `run.bat` or `run-console.bat`
You can open files or folders directly in qapyq by associating the file types with the respective run script in your OS.
For shortcuts, icons are available in the `qapyq/res` folder.
If git was used to clone the repository, simply run `git pull` to update.
If the repository was downloaded as a zip archive, download it again and replace the installed files.
New dependencies may be added. If the program crashes or fails to start, run the setup script again to install the missing packages.
During installation, packages are extracted into your temporary folder, and some packages may require a few GB of temporary space. If the setup fails with an out-of-space error even though your disk has plenty of space left, you might be running out of space on the RAM disk backing the `/tmp` folder.
Try setting a different temporary folder when starting `setup.sh`, like this:
```sh
cd qapyq
mkdir tmp
TMPDIR=./tmp ./setup.sh
rmdir tmp
```
Most configuration can be done in the GUI. The settings are stored in `qapyq_config.json` inside the application's folder.
Some settings that may be worth changing manually in the config file are listed below.
The `path_export` setting in the config file is used as the default path for file choosers. Relative paths in path templates for exporting files are also resolved against this setting.
The default is the application's installation folder, which is probably impractical. I suggest changing it to the top-level path of your datasets folder.
The user interface should be scaled automatically to your display's DPI and OS settings. If you wish to scale it further for better readability or space utilization, you can change the `gui_scale` setting in the config file.
The appearance should automatically match your OS settings. If this detection fails, you can change the color scheme manually by setting `color_scheme` to either `dark` or `light`.
For better readability, qapyq uses a monospace font throughout the interface. A default font is included (DejaVu).
To change the font, edit the `font_monospace` path in the config file and point it to another TTF font file.
When the path is empty, it will use the default monospace font as defined by your OS.
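These keys can of course be edited directly in a text editor. If you prefer scripting, here is a minimal sketch that assumes the key names mentioned above and guesses at plausible value types — check your own `qapyq_config.json` for the actual format:

```python
import json
from pathlib import Path

# Location of the config inside the application's folder; adjust to your install path.
config_path = Path("qapyq/qapyq_config.json")
config = json.loads(config_path.read_text())

# Key names as mentioned above; the value types shown here are assumptions.
config["path_export"]    = "/data/datasets"  # default path for file choosers and relative export paths
config["gui_scale"]      = 1.25              # additional UI scaling factor
config["color_scheme"]   = "dark"            # "dark" or "light"
config["font_monospace"] = ""                # empty = use the OS default monospace font

config_path.write_text(json.dumps(config, indent=4))
```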
Images may contain information about rotation or flipping. This transformation is ignored by default, because:
- Some training tools seem to ignore EXIF information, so it's probably better to see the images in their native state and rotate them manually.
- Applying transformation might invalidate existing masks.
- It can slow down thumbnail generation a bit.
To enable EXIF transformation globally, change the `exif_transform` setting from `false` to `true`, and also change it on remote hosts if needed. The image viewer, thumbnails in the Gallery, variables in path templates, image size stats, and inference backends will then use the transformed versions of the images.
Masks are not loaded as images. They are ignored when loading files and folders, and they don't appear in the Gallery.
The Gallery will check if a mask exists and display an icon.
The identification of masks is based on the `mask_suffix` setting in the config file. The default is `-masklabel`.
Currently, it simply checks whether the filename ends with this suffix. For showing icons, the Gallery only considers masks with the `.png` extension, while the file loader ignores masks with any extension. This works well with the format that OneTrainer expects, but might not work properly with other tools.
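For illustration, a rough sketch of this matching logic (not qapyq's actual code; the OneTrainer-style `image-masklabel.png` naming is an assumption):

```python
from pathlib import Path

MASK_SUFFIX = "-masklabel"  # the default mask_suffix setting

def is_mask(path: Path) -> bool:
    """File loader: treat a file as a mask (and skip it), regardless of its extension."""
    return path.stem.endswith(MASK_SUFFIX)

def gallery_mask_exists(image_path: Path) -> bool:
    """Gallery icon: only a .png mask next to the image counts."""
    mask_path = image_path.with_name(image_path.stem + MASK_SUFFIX + ".png")
    return mask_path.exists()

# Example: gallery_mask_exists(Path("dataset/cat01.jpg")) looks for "dataset/cat01-masklabel.png".
```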
The Mask Tool allows defining a more flexible path template for loading and saving masks.
If you have multiple GPUs, you can set the device number in the `infer_devices` setting. This setting takes a list of values, but only the first one is used currently.
To use multiple GPUs at once for data-parallel processing, you can set up multiple hosts for the local machine and spawn multiple inference subprocesses.
See Host Setup for Remote Inference.
To use automated captioning and masking, you first have to create presets for the AI models in the Model Settings, accessible via the burger menu in the top left corner of the Main Window.
qapyq makes no internet connections (except during the setup script) and does not automatically download models.
Models need to be downloaded manually from https://huggingface.co. Some links to the model download pages are listed in the Readme.
An exception is the `transparent-background` package for the "InSPyReNet RemBg" backend. When running this backend, it will attempt to download the model if it couldn't find it locally. It will also compile the model and create a new file next to it.
- GGUF vision models (for llama.cpp) consist of a single `.gguf` file + a multi-modal projector model (`mmproj...gguf`).
- The WD tagging models need the `.onnx` file + the `selected_tags.csv`.
  - The CSV file is the same for all WD variants.
- Ultralytics/YOLO detection models usually come as single files in `.onnx` or `.pt` format.
  - Models made for Adetailer should work too, since Adetailer also uses YOLO.
- Embedding models for the ONNX backend need the folder with the configuration files + a text model and vision model in `.onnx` format.
  - The linked pages provide ONNX models in different sizes. I suggest using the largest text model and the fp16 vision model.
  - If there's a `.onnx` and a `.onnx_data` file, you'll need both. Choose the `.onnx` file in qapyq's Model Settings.
- All other models (transformers, torch) need the whole folder with all the files.
  - I haven't fully tested which files are necessary for each model, but generally, you don't have to download screenshots or examples.
- Some model weights are available in multiple formats. In that case you generally only need one of the files. If available, take the `.safetensors` file.
- Avoid using `.` (dot) in folder names for transformers models. Apparently there are issues with loading, and the resulting error message is misleading.
These should work, but the required packages are not installed by default. Installing them requires downgrading the torch and CUDA versions, and the packages may have to be built from source.
Visual models consist of an LLM and a visual part, which translates the image into something the LLM can understand. These models may require a large amount of video memory (VRAM). To reduce VRAM usage, you can enable quantization and/or load layers into your normal RAM on the CPU side. The more layers are loaded to the GPU, the faster the model will work. Support for quantization and layer offloading depends on the model and its version.
The available quantization options can compress the model's weights to 4 or 8 bits. This will reduce the quality of the output, though quantized large models might still perform better than smaller models. The inference speed with quantization depends on hardware support.
The Model Settings allow specifying the number of layers that should be loaded into the GPU's VRAM. This accepts relative values from `0%` (only CPU) to `100%` (fully loaded to GPU). While loading the model, the actual number of layers is logged to the console. The values may differ for each model.
Most time is usually spent generating the text; encoding the image is faster. So when VRAM is short, it makes sense to keep the LLM on the GPU and offload some visual layers to the CPU first by reducing the `%` value in the Model Settings.
Use your OS' system monitor or your GPU driver's control panel to see the VRAM usage and adjust the layer settings accordingly.
These models output all the known tags together with a confidence score. The `threshold` setting defines how to filter the tags: only tags with a confidence score above the threshold are kept.
If your results contain lots of inaccurate tags, try raising the threshold a bit. If you want to include more uncertain tags, lower the threshold.
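In other words, it is a simple cutoff. A minimal sketch (tag names and scores invented for illustration):

```python
# Tags with confidence scores as returned by a tagging model (values invented).
scores = {"1girl": 0.95, "outdoors": 0.62, "umbrella": 0.34}
threshold = 0.35

kept = [tag for tag, score in scores.items() if score > threshold]
print(kept)  # ['1girl', 'outdoors']
```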
The WD models can also find character tags and an SFW rating for the image. These tags are prepended to the output.
When "Only Best Match" is enabled for the character tags, it will only output the character tag with the best score (if it lies above the threshold).
MCUT is a simple yet efficient algorithm used for classification problems that finds the biggest gap in scores. For example, when tags with the scores `0.9, 0.8, 0.5, 0.4` are returned, MCUT will define a dynamic threshold in the middle between 0.8 and 0.5.
However, the threshold found this way is often very high and would exclude many valid tags. And since it ignores and overrides the user-defined threshold, it's not tunable.
qapyq implements two experimental threshold modes that are based on MCUT. They repeatedly apply MCUT to the remaining scores to create clusters of tags, separated by the largest gaps in scores.
For example: `0.9, 0.8 | 0.5, 0.4, 0.3 | 0.1` (3 clusters separated by `|`)
Which clusters are included in the output depends on the selected threshold and mode:
- "Adaptive Strict" excludes the whole cluster when some of its tags have a score below the threshold.
- In the example above, a threshold of 0.35 would exclude the second cluster, because it ends with 0.3.
- "Adaptive Lax" will include this cluster, even when some scores are below the threshold.
- In the example above, a threshold of 0.35 would include the second cluster, because it starts with 0.5.
All tags of the best cluster are always included when an adaptive mode is selected. A threshold of 1.0 for both modes is equivalent to the original MCUT.
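A rough sketch of how this clustering and filtering could look (an illustration of the idea, not qapyq's actual implementation):

```python
def split_clusters(scores: list[float]) -> list[list[float]]:
    """Repeatedly split a descending list of scores at the largest remaining gap (MCUT)."""
    scores = sorted(scores, reverse=True)
    clusters = []
    while len(scores) > 1:
        gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
        cut = gaps.index(max(gaps)) + 1   # split after the largest gap
        clusters.append(scores[:cut])
        scores = scores[cut:]
    if scores:
        clusters.append(scores)
    return clusters

def adaptive_filter(scores: list[float], threshold: float, strict: bool) -> list[float]:
    clusters = split_clusters(scores)
    kept = list(clusters[0])              # the best cluster is always included
    for cluster in clusters[1:]:
        limit = min(cluster) if strict else max(cluster)
        if limit >= threshold:
            kept.extend(cluster)
    return kept

scores = [0.9, 0.8, 0.5, 0.4, 0.3, 0.1]
print(split_clusters(scores))                       # [[0.9, 0.8], [0.5, 0.4, 0.3], [0.1]]
print(adaptive_filter(scores, 0.35, strict=True))   # [0.9, 0.8]
print(adaptive_filter(scores, 0.35, strict=False))  # [0.9, 0.8, 0.5, 0.4, 0.3]
```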
Some models, like YOLO detection models, can detect various "classes" (types of objects, like person, face, hand, car, etc.). The list of supported classes should be part of the model's documentation. If it doesn't mention the classes, it might only support one class.
The "Detect" operation in the Mask Tool has a "Retrieve" button which can load the classes directly from a model.
Often, we're only interested in a few of the classes. Put these classes into the text box and separate them with commas. You can create presets using the same model, but for different classes.
If the text box is empty, all detections are added to the mask.
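A minimal sketch of this class filtering (class names and detections are invented for illustration):

```python
class_filter = "face, hand"   # contents of the text box; an empty string keeps all detections
wanted = {c.strip() for c in class_filter.split(",") if c.strip()}

# Detections as (class_name, bounding_box) pairs; values invented.
detections = [("face", (10, 10, 50, 50)), ("person", (0, 0, 200, 300)), ("hand", (60, 80, 90, 110))]

kept = [det for det in detections if not wanted or det[0] in wanted]
print([name for name, _ in kept])  # ['face', 'hand']
```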
Embedding models encode images and text prompts into feature vectors, each around 3-6 KB in size. The image features are cached to disk with hashed filenames inside the `qapyq/.cache/embedding` folder. Each `.npy` file corresponds to a folder. Simply delete the `.cache` folder to clear the cache.
I recommend SigLIP2 (giant-opt-384) with the ONNX backend: It loads and processes faster, it only loads the required model (text/vision), and the models are available with different quantizations.
To better extract the meaning, text prompts are augmented with templates. This means that when we enter, for example, "a cat" to sort the images by their "catness", the average of multiple text embeddings is used:

```
a photo of a cat
a drawing of a cat
a cat shown in an illustration
...
```
You can see and change the templates inside the `qapyq/user/embedding-prompt-templates/` folder. Each line must contain a `{}` placeholder, which is replaced with the prompt. Empty lines and comments starting with `#` are ignored.
Creating new `.txt` files inside that folder and its subfolders will make them available for selection in the Model Settings.
`default.txt` contains a list of templates. `prompt-only.txt` disables the augmentation.
These two files will be reset when updating. Create new files if you want to add your own templates.
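To make the augmentation concrete, here is a sketch of how such templating could work; `embed_text` is a hypothetical stand-in for the selected model's text encoder, and the file format follows the description above:

```python
import numpy as np

def load_templates(path: str) -> list[str]:
    """Read a template file: one line per template with a '{}' placeholder,
    skipping blank lines and '#' comments."""
    templates = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                templates.append(line)
    return templates

def embed_prompt(prompt: str, template_path: str) -> np.ndarray:
    templates = load_templates(template_path)
    # embed_text() is a placeholder for the actual text encoder.
    embeddings = np.stack([embed_text(t.format(prompt)) for t in templates])
    mean = embeddings.mean(axis=0)
    return mean / np.linalg.norm(mean)   # average, then re-normalize

# Example: embed_prompt("a cat", "qapyq/user/embedding-prompt-templates/default.txt")
```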
The original models were trained on center-cropped square images, and they expect square images as the input during inference. However, center-cropping could miss some important parts of the images that we're looking for.
qapyq implements different processing strategies (ONNX backend only) to handle non-square images:
- Center Crop
  - Like the original. Might cut away important parts of an image.
- Squish to Square
  - Simply resizes the image and distorts it by changing the aspect ratio.
  - Fast and works quite well, despite the distortion.
- Multi Patch
  - Creates multiple crops along the longer axis and combines the embeddings using the selected aggregate function (a sketch follows below the Patch Aggregate notes).
  - The number of patches depends on the aspect ratio, while an overlap of at least 1/8 is ensured.
  - If the aspect ratio is close to square, the image is resized and squished instead.
  - This is slower, and the speed depends on the aspect ratios in your dataset.
Multi Patch has different modes:
- Force centering in landscape orientation
  - For images that are wider than tall, the number of patches is rounded up to the next odd number. This ensures that the middle patch is still a "center crop", like the model expects.
- Force centering in portrait orientation
  - Likewise, for images that are taller than wide.
- Always force centering
  - Always rounds the patch count up to the next odd number.
Since "force centering" generates more patches that need to be encoded, it will be slower. I settled on "Force centering in landscape orientation". Images in portrait orientation seem to be captured well enough even without centering, but this depends on the dataset.
The Patch Aggregate setting only affects Multi Patch processing.
- `Mean` works well but could dilute some details that only appear in one patch.
- `Max` preserves strong signals from each patch, but might be noisy.
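As referenced above, a rough sketch of the Multi Patch idea and the aggregate step (the patch layout is simplified and not qapyq's exact logic; `embed_image` is a hypothetical stand-in for the vision encoder):

```python
import math
import numpy as np
from PIL import Image

def square_patches(img: Image.Image, overlap: float = 1/8, force_odd: bool = False) -> list[Image.Image]:
    """Cut square crops along the longer axis, overlapping by at least `overlap` of a patch."""
    w, h = img.size
    size = min(w, h)                                   # patch edge = shorter side
    length = max(w, h)
    count = max(1, math.ceil((length - size * overlap) / (size * (1 - overlap))))
    if force_odd and count % 2 == 0:                   # "force centering": odd count keeps a true center crop
        count += 1
    patches = []
    for i in range(count):
        offset = 0 if count == 1 else round(i * (length - size) / (count - 1))
        box = (offset, 0, offset + size, size) if w >= h else (0, offset, size, offset + size)
        patches.append(img.crop(box))
    return patches

def embed_multi_patch(img: Image.Image, aggregate: str = "mean") -> np.ndarray:
    # embed_image() is a placeholder for the selected vision encoder.
    embeddings = np.stack([embed_image(p) for p in square_patches(img)])
    combined = embeddings.mean(axis=0) if aggregate == "mean" else embeddings.max(axis=0)
    return combined / np.linalg.norm(combined)
```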
When changing the processing strategy, new image embeddings are created and cached to a different folder.
- For Ovis-1.6 to run multiple times, I had to make these changes to the model's code: GitHub Issue
- InternVL2 doesn't follow system prompts well.
- InternVL2-26B couldn't see the images and only wrote about its hallucinations.
- Qwen2VL raises an out-of-memory error when visual layers are offloaded to the CPU, or when describing very large images.
- The onnx version of RMBG-2.0 issued warnings and was very slow. The transformers version works well.
| Model | Quantization | LLM GPU Layers | Visual GPU Layers |
|---|---|---|---|
| InternVL2 1B-8B | None | 100% | 100% |
| InternVL2 40B | NF4 | 100% | 0% |
| MiniCPM-V-2.6 32K Context | Q8 / f16 | 100% | 100% |
| Molmo-7B-D | None | 100% | 100% |
| Ovis1.6-Gemma2-9B | None | 100% | 0% |
| Qwen2-VL 2B/7B | None | 100% | 100% |
I haven't tried InternVL-2-76B or Qwen2-VL-72B.
qapyq supports running inference on remote machines over SSH. For this, you need to:
- Set up the SSH server on the remote machine.
- Set up the SSH client on your local machine.
  - On Windows, PuTTY can be used as the SSH client.
- Install qapyq on the remote machine using the setup script.
- Upload your models to the remote machine.
  - Models need to be placed in the same folder structure as locally.
  - The paths are translated using the "Model Base Path" setting in the Hosts Window.
- Add the remote machine as a host in qapyq's Hosts Window (opens with Ctrl+H).
  - See below for example commands.
Communication between the qapyq GUI and the remote host works exclusively through pipes over the secure SSH connection.
When starting a task, images are uploaded, and the remote host caches them only in memory until they are processed. The number of images that are uploaded in advance and kept in the cache can be configured with the "Queue Size" option in the Hosts Window. A size of 2-3 seems best for Gigabit LAN connections.
Multiple hosts can be used in the same task, which scales well and speeds up the task considerably. Images are queued to all hosts that are selected in the Hosts Window.
Alternatively, only one host can be activated before starting a task, which keeps other hosts available for other tasks.
Speedup can also be achieved by increasing the "Process Count" setting and spawning multiple processes on one machine for the same GPU, if the VRAM allows for it.
Below are a few example commands which you can enter in the Hosts Window.
Replace the path to the `run-host.sh` or `run-host.bat` script with the actual path on the remote host.
For Linux clients:
- `ssh hostname "/srv/qapyq/run-host.sh"`
  - Use this if you made a profile for `hostname` in your `~/.ssh/config`.
- `ssh -i "/path/to/private.key" -l [remote-username] -p [port] [ip-address] "/srv/qapyq/run-host.sh"`
  - For Linux remote hosts.
- `ssh -i "/path/to/private.key" -l [remote-username] -p [port] [ip-address] "C:\qapyq\run-host.bat"`
  - For Windows remote hosts.

For Windows clients using PuTTY, create a profile in PuTTY and then use `plink.exe`:

- `"C:\Program Files\PuTTY\plink.exe" profile "/srv/qapyq/run-host.sh"`
  - For Linux remote hosts.
- `"C:\Program Files\PuTTY\plink.exe" profile "C:\qapyq\run-host.bat"`
  - For Windows remote hosts.
qapyq won't ask for passwords for logins or keys. The command must run without expecting input; otherwise, the connection will fail.
Use `ssh-add "/path/to/private.key"` or PuTTY's `pageant` to decrypt and store the key beforehand. You could also add the password to the command, but it would be saved as plaintext in `qapyq_config.json`.
The `run-host` scripts take the device number as an optional argument. For a machine with multiple GPUs, you can add multiple hosts with different device numbers. The first GPU is usually 0.
Local processes for a different GPU can be started with the following commands (replace "1" with the actual device number):

- `sh -c "/path/to/qapyq/run-host.sh 1"`
  - On Linux
- `cmd.exe /c "C:\path\to\qapyq\run-host.bat 1"`
  - On Windows
If your local or remote hosts have different amounts of VRAM, you may have to use different settings per host for loading the models (number of layers, quantization).
This works for caption and LLM presets: To create an overriding preset for a host, open the Model Settings Window, duplicate the original preset, and add `[hostname]` at the end of the preset name (replace hostname with the exact name from the Hosts Window, case-sensitive). Spaces between the preset name and the `[` are optional.
When starting a task, select the original preset without hostnames. The overriding preset with `[hostname]` is then used for the denoted host; if no per-host preset exists, the selected original preset is used.
The console will show "Using preset ..." when it finds the overriding preset.
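A small sketch of that selection rule (illustration only; preset names invented and storage simplified):

```python
def resolve_preset(selected: str, hostname: str, presets: dict[str, str]) -> str:
    """Prefer a per-host override like 'MyPreset [gpu-server]'; fall back to the selected preset."""
    for name in (f"{selected} [{hostname}]", f"{selected}[{hostname}]", selected):
        if name in presets:
            return presets[name]
    raise KeyError(selected)

# Preset names invented for illustration; the values stand in for the actual settings.
presets = {
    "Caption Large": "original settings",
    "Caption Large [gpu-server]": "override with fewer GPU layers",
}
print(resolve_preset("Caption Large", "gpu-server", presets))  # override with fewer GPU layers
print(resolve_preset("Caption Large", "laptop", presets))      # original settings
```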
The sample settings are the same across all hosts, and model paths are still translated, which allows testing the presets locally.
It might also be possible to set a different model for a host, for example one with more or fewer parameters, but I haven't tested this. It would of course produce different captions for the images that were processed on that host.