# User Guide ‐ Tips and Workflows
The other user guide describes how the separate tools work. Here, I'll describe my current workflow for using the tools together. This is subjective and specific to a dataset of around 10k images.
Use existing tags as grounding by including them in prompts to a captioning model. This can improve the model's accuracy if the tags themselves are accurate. Editing tags is much faster than editing prose, and with grounding you can steer the VLM to include certain aspects in the caption.
Grounding can be any text, not only tags. You can include any aspects which you think are difficult for the VLM to understand.
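For example, a hypothetical grounding line (my wording, not from qapyq) could sit right next to the tags in the prompt; the `<tags>` wrapper and `{{tags.refined}}` variable appear again in the full prompt below:

```
The person on the left is holding a violin, not a guitar.
<tags>{{tags.refined}}</tags>
```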
Auto-generated tags should be cleaned up manually and then further refined using rules. See the next section on how to rapidly correct tags.
(I mask out watermarks after cropping, so I tell the VLM to ignore them.)
System prompt:
```
You are an assistant that describes scenes in detailed, vivid but concise and dense English prose. Write simple, clear sentences and avoid ambiguity.
No relative clauses. Don't use pronouns to refer to elements: Repeat the object names if needed.
No formatting, no bullet points. No fill words, no embellishment, no juxtaposition. Focus on the observable, no interpretations, no speculation.
Describe all elements while maintaining a logical flow from main subject to background: Start with the primary focal point, describe its details, then move to secondary elements in order of prominence.
Ignore text: No OCR. Ignore watermarks and logos.
The description is intended as caption for AI training.
```
Prompt:
```
Describe the image in detail. Begin your answer directly with the main subject without writing "The main subject", without "The image", etc.
Use the following list of Booru-tags as approximate reference:
<tags>{{tags.refined}}</tags>
If you disagree with the tags, do not mention any absence. Remember to write very concise and precise prose in full sentences. Avoid mistakes and repetitions.
```
Instead of going image-by-image, it can be much faster to go vertically, aspect-by-aspect.
- Prepare rules and groups for colored highlighting. It just helps the eyes.
- Open the Caption Window, place windows side by side.
- Enable the refined preview. The sorting, combining and banning make the caption easier to read.
- Use the Slideshow Tool (with shuffling) to navigate through the images.
Eventually, you'll notice wrong tags. Keep a list of them: the tagging model often makes the same mistakes across images.
Use the Stats Window, enter the tag in the filter box, select the tag from the table.
This will list all images with that tag.
Some tags (for example color, length, perspective) are sometimes mutually exclusive, meaning if one of them exists, the others should be removed.
You can use Batch Rules with a group that has "Mutually Exclusive" set to "Keep First" to filter these tags automatically: "Keep First" keeps the tag with the highest confidence score, as the sketch below illustrates.
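Conceptually, "Keep First" behaves like this minimal Python sketch (the tag names, scores, and data layout are illustrative assumptions, not qapyq's actual internals):

```python
# Illustrative "Keep First" for one mutually exclusive group.
# Assumed layout: tags mapped to the model's confidence scores.
tags = {"long hair": 0.91, "short hair": 0.34, "blue eyes": 0.88}
exclusive_group = ["long hair", "medium hair", "short hair"]

# Keep only the group member with the highest confidence.
present = [t for t in exclusive_group if t in tags]
if present:
    keep = max(present, key=lambda t: tags[t])
    for t in present:
        if t != keep:
            del tags[t]

print(sorted(tags))  # ['blue eyes', 'long hair']
```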
Often, however, when multiple conflicting tags appear, it means the model had difficulty distinguishing them, and I'd rather correct them manually if feasible.
Use the Stats Window and select all these tags (hold CTRL). Select "List files with: Multiple" and it will display all images with more than one of the conflicting tags.
- Use the Stats Window to filter images with wrong tags.
- Ensure the Caption Window is open.
- Place it side by side with the Main Window.
- In the Stats Window, use "With Files..." -> "Focus in New Tab".
- This will open the listed files only and prepare the Focus tab in the Caption Window with your selected tags highlighted.
- In the Focus tab:
  - Optionally enable "Mutually Exclusive" if only one of the tags should be kept.
  - Optionally enable "Auto Save".
  - Optionally enable auto skipping on save in the bottom right corner of the Caption Window.
- Memorize the keyboard shortcuts for your focus tags.
  - Change the order if necessary.
- Click "Enable Keyboard Shortcuts".
- Keep your eyes on the image.
  - In this mode we're focusing on a specific aspect of the image.
- Pressing the shortcut key will now:
  - Add the tag (if it doesn't exist).
    - When "Mutually Exclusive" is enabled, all other focus tags are removed.
    - Pressing `0` will remove all existing focus tags.
  - Auto save the tags.
  - Auto skip to the next image in that tab.
    - Skipping won't loop and will stop at the last image. When it doesn't go further, you're done.
If you have sorted your images into folders, you can use Batch Apply to add the same tag to multiple files.
When you only have a small number of files, you can also select the respective images in the Gallery and use the Caption Window's Multi-Edit mode to add a tag to multiple files.
If, however, the folders don't clearly reflect the distinction and you want to choose a separate class/activation tag for each individual image:
- Load the images and select the first.
- Open the Caption Window.
- Choose a new key for storing the tag, for example `tags.class` (the sketch after this list shows how such keys might map into the `.json` file).
- Enter that key in the "Load From" selector and press reload.
- The "Save To" selector will also change to that key.
- Open the Focus tab.
- Enter all the different activation tags in the "Focus on" text field, separated by commas.
- Enable "Auto Save".
- Also enable the Auto Skip toggle button in the bottom right corner of the Caption Window.
- Enable the keyboard shortcuts.
- Go through your images and add a tag to each one.
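For orientation, keys like `tags.class` and `captions.caption` address nested values in the `.json` caption files. A hypothetical file after this step might look roughly like this (the exact layout is an assumption, not verified against qapyq's format):

```json
{
  "captions": { "caption": "a tall building stands on a hill..." },
  "tags": {
    "refined": "tower, hill, red roof",
    "class": "sks"
  }
}
```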
If you want to integrate this tag in a text generated by a VLM, you can use template functions in Batch Apply.
For this to work, you have to identify a word that is generated in all captions. Let's say you're building a dataset for towers. Every caption will likely contain "tower" or "building".
- Open the Batch Window and the Apply tab.
- Write a template to load the caption and replace the word with your activation tag:
```
{{captions.caption#lower#replace:building:tower#replace:tower:CLASS tower:1#replacevar:CLASS:tags.class}}
```
The parts in detail:

- `captions.caption`: Load this caption from the `.json` file.
- `#lower`: Convert the caption to lowercase.
- `#replace:building:tower`: Replace all occurrences of `building` with `tower`.
- `#replace:tower:CLASS tower:1`: Replace the first occurrence of `tower` with `CLASS tower`.
- `#replacevar:CLASS:tags.class`: Replace `CLASS` with the value stored in `tags.class`.
The result is a caption where `building` is replaced by `tower`. The first occurrence of `tower` is prefixed with your activation tag: `[activation tag] tower`.

Write the template to a `.json` key or the final `.txt` file.
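If the chained functions are hard to follow, this small Python sketch emulates what the template does (my reading of the template functions, not qapyq's implementation; the caption and activation tag are made up, with `sks` standing in for the value of `tags.class`):

```python
# Hypothetical input caption and activation tag (value of tags.class).
caption = "A tall Building stands on a hill. The building has a red roof."
class_tag = "sks"

text = caption.lower()                          # #lower
text = text.replace("building", "tower")        # #replace:building:tower
text = text.replace("tower", "CLASS tower", 1)  # #replace:tower:CLASS tower:1
text = text.replace("CLASS", class_tag)         # #replacevar:CLASS:tags.class

print(text)
# a tall sks tower stands on a hill. the tower has a red roof.
```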
For training on only a filtered selection of images:
- Filter your images using the Stats Window.
- Open the filtered files in a new tab.
- Use the "File" tab in the Batch Window.
- Choose a target folder.
- Mode: Create Symlinks
  - On Windows, you'd have to enable the permission to create symlinks first, or use the Copy mode instead.
- Verify the base path.
- Include images, .txt captions and masks.
This will recreate the original dataset's folder hierarchy inside the target folder, but include only the filtered files as symlinks. The symlinks don't take extra space, and modifications to the captions or masks in your original dataset are reflected in this subset (and vice versa!).
Load this folder in your training software.
A similar approach can be used to split your dataset into a training set and validation set.
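If you'd rather script this step outside qapyq, here's a rough Python sketch of the same idea: mirror a filtered selection as symlinks (preserving the folder hierarchy) and split it into training and validation sets. All paths and the 90/10 ratio are placeholder assumptions:

```python
import random
from pathlib import Path

base = Path("/data/dataset")            # base path of the original dataset
filtered = sorted(base.rglob("*.jpg"))  # stand-in for your filtered selection
random.seed(42)
random.shuffle(filtered)

split = int(len(filtered) * 0.9)        # 90% train, 10% validation
subsets = {"train": filtered[:split], "val": filtered[split:]}

for name, files in subsets.items():
    for img in files:
        # Mirror the image plus its .txt caption, keeping the hierarchy.
        for src in (img, img.with_suffix(".txt")):
            if not src.exists():
                continue
            dst = Path("/data/subsets") / name / src.relative_to(base)
            dst.parent.mkdir(parents=True, exist_ok=True)
            if not dst.exists():
                dst.symlink_to(src)     # needs extra permissions on Windows
```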