Tools
Dataset Tools
A tool that helps generate captions with BLIP, BLIP2, or WD14, and masks for masked training using ClipSeg or Rembg.
Caption Generation
If you set an initial caption, generation starts from that text instead of an empty string (BLIP and BLIP2 only). You can also add a caption prefix and postfix, which is especially useful with WD14. Note that no spaces are added automatically, so you need to include them (or a "," or ".") yourself.
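As an illustration of that behavior, here is a minimal sketch (the function name and arguments are hypothetical, not OneTrainer's internal code) of how a prefix and postfix are joined onto a caption with no automatic separators:

```python
# Hypothetical sketch: prefix and postfix are concatenated directly,
# with no separators inserted automatically.
def apply_prefix_postfix(caption: str, prefix: str = "", postfix: str = "") -> str:
    return prefix + caption + postfix

# Without explicit separators, the tags run together:
apply_prefix_postfix("1girl, smile", prefix="my_style,", postfix="best quality")
# -> "my_style,1girl, smilebest quality"

# Include the spaces and commas yourself:
apply_prefix_postfix("1girl, smile", prefix="my_style, ", postfix=", best quality")
# -> "my_style, 1girl, smile, best quality"
```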
Mask Generation
OneTrainer includes a few different tools under the dataset tools button to automatically generate a mask, depending on your image type. With the batch generate mask tool, you can use ClipSeg, Rembg, Rembg-human, or Hex Color.
With ClipSeg, you can use prompts such as "a woman", "face of a woman", or "face and hair of a woman" to have the model create a mask outside of the areas you specify.
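For reference, the general ClipSeg technique looks roughly like the sketch below, using the CIDAS/clipseg-rd64-refined model from Hugging Face transformers. The threshold value and the -masklabel.png output name are illustrative assumptions, not OneTrainer's exact code:

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("input.jpg").convert("RGB")
inputs = processor(text=["face of a woman"], images=[image], return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # low-resolution heatmap (352x352)

# Threshold the heatmap (0.3 is an arbitrary example value) and
# upscale the binary mask back to the original image size.
probs = torch.sigmoid(logits).squeeze()
mask = (probs > 0.3).numpy().astype("uint8") * 255
Image.fromarray(mask).resize(image.size).save("input-masklabel.png")
```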
With the manual paint features, you can mask off an area and then use the fill option to fill in the remainder instead of trying to cover everything with the brush.
Do not forget to press Enter after you are done manually editing your mask! Changes will not be saved unless you press Enter.
Video Tools
This provides some basic tools for getting screenshots from videos, splitting long videos into short clips, and downloading videos from URLs. It may be necessary to install ffmpeg for some video formats.
Multiple videos are processed in parallel, so if you have one very long (e.g. movie-length) video, it will run faster if you manually split it into a few shorter chunks first.
Extract Clips - specify either a single video file or a folder with multiple videos (including in subfolders) and a path to an output folder. If the "Output to Subdirectories" option is enabled, the outputs will be saved to separate folders under the output folder for each video processed, otherwise they will all be saved to the top level of the output folder. Videos are saved as .avi.
If the "Split at Cuts" option is enabled, it will use PySceneDetect to identify cuts in the video and split the input video at those locations. Otherwise splits may happen at any location, and the output clips may include cuts.
The "Max Length" setting determines how long the outputted clips will be, specified in seconds. Scenes detected with the "Split at Cuts" option may be split into shorter sections if they are longer. The resulting clips may be shorter than specified if the scenes are short, for example a 4-second scene with a max length of 3 seconds would be split into 2-second clips. Any sections <0.25 seconds are discarded.
Extract Images - same as clips: specify an input video/folder and an output folder, with the option to output to subdirectories. Images are saved as JPEGs.
Specify the number of images to capture per second; the default is 0.5, which is one image every 2 seconds. Images are selected at a random spread around the "center" frame of each interval, so running the tool multiple times will select different frames.
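That sampling can be sketched like this (the uniform jitter is an assumption; the real implementation may distribute frames differently):

```python
import random

def pick_frame_times(duration: float, images_per_second: float = 0.5) -> list[float]:
    """Pick one jittered timestamp per capture interval (default: every 2 s)."""
    interval = 1.0 / images_per_second
    times = []
    start = 0.0
    while start + interval <= duration:
        center = start + interval / 2
        offset = random.uniform(-interval / 2, interval / 2)
        times.append(center + offset)
        start += interval
    return times

pick_frame_times(10.0)  # five timestamps, one per 2-second window, different each run
```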
"Blur Removal" specifies a portion of the captured images to discard as "too blurry", which may help avoid capturing frames which are mid-transition or have motion blur. Uses the "variance of the Laplacian" method to quantify the sharpness of the image and discards the lowest in the set of selected frames, only saving the remaining.
Download - provide either a single link or a list of links (a .txt file with one link per line) to download to a specified output directory. This uses yt-dlp, which supports a wide variety of sites. Additional arguments can be provided; see the yt-dlp GitHub page for a list of applicable arguments and supported websites. The default arguments just reduce the spam in the terminal window while still displaying download progress.
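For context, downloading through yt-dlp's Python API looks roughly like this (the URL is a placeholder and the options shown are illustrative, not OneTrainer's actual defaults):

```python
from yt_dlp import YoutubeDL

urls = ["https://example.com/some-video"]  # placeholder URL

opts = {
    "paths": {"home": "downloads"},  # output directory
    "quiet": True,                   # reduce terminal output
}
with YoutubeDL(opts) as ydl:
    ydl.download(urls)
```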
Convert Model Tools
Convert between different model formats.
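As a rough illustration of what such a conversion involves (a generic sketch using the torch and safetensors libraries, not OneTrainer's implementation), a single-file checkpoint can be moved between the .safetensors and .ckpt containers like this:

```python
import torch
from safetensors.torch import load_file, save_file

# .safetensors -> .ckpt (filenames are placeholders)
state_dict = load_file("model.safetensors")
torch.save({"state_dict": state_dict}, "model.ckpt")

# .ckpt -> .safetensors (tensors must be contiguous for safetensors)
ckpt = torch.load("model.ckpt", map_location="cpu")
sd = ckpt.get("state_dict", ckpt)
save_file({k: v.contiguous() for k, v in sd.items()}, "model.safetensors")
```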