Anime Training - Anime4000/sd_dreambooth_extension GitHub Wiki
This wiki shows the basic steps for training your own anime waifu or harem.
This is still a WIP; many things still need to be updated, tested, and experimented with.
- Prerequisite
- Dataset
- Style
- Tagging
- Training
- Training in Stages
- Generate CKPT
- Multi Concepts: create your own harem world
- Troubleshoot
  - Over Fitting
  - Over Training
  - Under Training
  - VRAM OOM (tl;dr: buy a good GPU in the first place)
- Windows 10/11
- Stable Diffusion WebUI (AUTOMATIC1111)
- Dreambooth Extension
- Dataset Tag Editor
- Decent multi-core CPU (High GHz 4-core minimum)
- 16GB RAM
- Modern Nvidia GPU
  - 10GB VRAM (Windows)
  - 8GB VRAM must use LoRA!
Dreambooth training requires a lot of memory. Linux does not support the Nvidia technology (TurboCache) that lets the driver use system RAM as an overflow buffer for graphics; on the other hand, CUDA on Linux is able to use ~99% of the VRAM.
- Training on 10GB VRAM only works on Windows 10+
- Linux users must use LoRA, or modify Kernel Mode Setting (KMS) to offload some VRAM to system RAM like Windows does
- Try killing the DE to save some VRAM and run via SSH
The workflow is similar to ELI5 Training, with different tweaks.
It's important to have a dataset that is large enough for Dreambooth to learn from, but not so large that it leads to over-fitting or over-training. A good rule of thumb is to limit your dataset to a maximum of around 30 images. Additionally, it's important to balance the number of images for each concept in your dataset. If you have a high count dataset for one concept and a low count dataset for another, the high count dataset may overpower or crush the low count dataset.
The length of your dataset for each concept can also greatly affect the number of training epochs needed. If your dataset contains less than 15 images for a particular concept, you may need to train for over 100 epochs to achieve good results. Conversely, if you have more images for a particular concept, you may be able to train for fewer epochs and still achieve good results.
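If you want to sanity-check this balance before training, here is a minimal Python sketch (not part of the extension) that counts images per concept folder and flags sets that are too small, too large, or badly unbalanced. The folder layout, path, and thresholds are assumptions taken from the guidance above.

```python
# Sketch: count images per concept folder and warn about imbalance.
from pathlib import Path

DATASET_ROOT = Path(r"X:\path\to\your\dataset")   # hypothetical path, one sub-folder per concept
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

counts = {
    folder.name: sum(1 for f in folder.iterdir() if f.suffix.lower() in IMAGE_EXTS)
    for folder in DATASET_ROOT.iterdir() if folder.is_dir()
}

for concept, n in sorted(counts.items(), key=lambda kv: kv[1]):
    note = ""
    if n < 15:
        note = "(small set: expect 100+ epochs)"
    elif n > 30:
        note = "(large set: risk of over-fitting, consider trimming)"
    print(f"{concept}: {n} images {note}")

if counts and max(counts.values()) > 2 * min(counts.values()):
    print("warning: concept counts are unbalanced; the biggest set may overpower the smallest")
```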
More info and settings can be found here: TEnc + UNET
- **Sharp and High Resolution**: pictures will be down-scaled to 512x512 pixels
- **Clear and Clean**: the waifu must be alone, without other characters in the frame
- **Diversity**: all waifu activities, locations, different backgrounds, face expressions, from above, below, the side, behind...
- **Uncomplicated**: avoid rare expressions, masks, glitched or blurry frames, ...
- **Less Close-up**: too many close-ups will make txt2img lean towards close-ups, losing its diversity
- **Text**: text on t-shirts, dialog, signboards; try to minimise this
- **Low Light**: avoid low-light scenes (dungeon, underworld)
- **Too Bright**: avoid overly bright scenes (lens flare, god rays over the character's face)
- **Wrong Aspect Ratio**: your waifu gets squeezed
- **Indistinguishable**: repeated frames, same background, same angle
- **Multiple Characters**: another character too close to your waifu that can't be cropped away
- **Background Waifu**: your waifu is not the main focus, blurred, or behind another character
- **Subtitle**: sourced from burned-in subtitles, bad screenshots
Think of Dreambooth as an employee that is learning from your dataset. Don't give it too complex of tasks or it can result in incomplete training (under-trained), errors and glitches in the training (over-trained), or the model becoming too specific to your dataset and not generalizing well to new data (over-fitting).
Check every picture manually to make sure it falls into the Good Source category, with only a few Acceptable ones!
If your source is a screenshot or a low-resolution JPEG, you need to upscale it first to reduce compression artifacts: use the built-in upscaler in the Extras tab of Stable Diffusion WebUI and choose R-ESRGAN 4x+ Anime6B at 1x to 2x.
⚠️ Upscaling 4x can lead to thick art lines, and downscaling back will be an issue!
⚠️ With the latest Dreambooth Extension you can skip the downscale step; Dreambooth's Image Bucket will downscale for you automatically and beautifully. If you still don't feel safe, proceed with a manual downscale for these reasons:
- Hide bad upscaling artifacts
- Reduce art-line thickness
- Eliminate compression artifacts
- Make it look sharp
Use XnConvert to properly downscale at the highest quality (a Pillow sketch of the same idea follows this list). Do not mix wide and portrait images in the input files; process wide or portrait first...
- Add action > Image > Resize
  - Enlarge/Reduce: Always
  - Resample: Lanczos2 (like 8x anti-aliasing)
  - Mode: Height or Mode: Width (depending on whether the batch is wide or portrait)
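For batch processing without XnConvert, a rough Pillow equivalent of the same idea is shown below: a Lanczos downscale that puts the short side at 512 px. The folder names and the 512 target are assumptions, and XnConvert's Lanczos2 filter is not byte-identical to Pillow's LANCZOS.

```python
# Sketch: high-quality Lanczos downscale so the short side lands on 512 px.
from pathlib import Path
from PIL import Image

SRC = Path("dataset_raw")        # assumed input folder
DST = Path("dataset_512")        # assumed output folder
DST.mkdir(exist_ok=True)
TARGET_SHORT_SIDE = 512

for path in SRC.glob("*.png"):
    img = Image.open(path)
    w, h = img.size
    scale = TARGET_SHORT_SIDE / min(w, h)          # constrain the short side to 512
    new_size = (round(w * scale), round(h * scale))
    img = img.resize(new_size, Image.LANCZOS)      # Lanczos keeps edges anti-aliased
    img.save(DST / path.name)
```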
With Image Bucket you can skip this step and let it pick, resize, and crop automatically. If you are not confident with Image Bucket, you can still crop manually yourself.
To accelerate training and improve training quality, it is recommended to tightly crop your dataset subject. By cropping out extraneous information from your images, your model can focus on learning the important features of the subject and reduce the amount of noise in the data. This can result in faster and more accurate training, as well as more robust models that are better able to generalize to new data.
Ratio | 512 | 1080p |
---|---|---|
1:1 | 512x512 | 1080x1080 |
7:8 | 448x512 | 945x1080 |
3:4 | 384x512 | 810x1080 |
5:8 | 320x512 | 675x1080 |
1:2 | 256x512 | 540x1080 |
These crop ratios are for vertical/portrait images and are optimized for the common 1080p resolution used in screencaps.
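As a worked example of the table, the following hedged Pillow sketch center-crops a 1080p frame to 3:4 (810x1080) and then downscales it to 384x512. File names are placeholders, and in practice you would crop around the subject rather than the exact center.

```python
# Sketch: crop a frame to one of the ratios above, then downscale to its 512-class size.
from PIL import Image

def crop_to_ratio(img: Image.Image, ratio_w: int, ratio_h: int) -> Image.Image:
    """Center-crop to ratio_w:ratio_h without stretching."""
    w, h = img.size
    target_w = min(w, h * ratio_w // ratio_h)
    target_h = min(h, w * ratio_h // ratio_w)
    left = (w - target_w) // 2
    top = (h - target_h) // 2
    return img.crop((left, top, left + target_w, top + target_h))

img = Image.open("screencap.png")                  # e.g. a 1920x1080 frame
portrait = crop_to_ratio(img, 3, 4)                # 810x1080 from a 1080-tall source
small = portrait.resize((384, 512), Image.LANCZOS) # matching 512-class size from the table
small.save("dataset_001.png")
```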
BIRME downscales using the Nearest Neighbor algorithm, which leaves your pictures without anti-aliasing; always downscale with XnConvert instead:
Waifu is in focus with a blurred background
Tightly cropped to 810x1080 (384x512)
Other characters take up < 5% of the crop area; keep such images rare in the dataset. It's better to crop tightly to reduce noise and unwanted data.
Character holding an object that covers the face
Other characters too visible in the crop area
A character cropped too close will make your model lose variation!
Make sure there is no bad data in the set; even one bad image can make your final model produce bad results.
⚠️ Always remove bad drawings from the dataset!
Versus:

| RAW | R-ESRGAN 4x+ Anime6B (1x) |
|---|---|
| *(comparison image)* | *(comparison image)* |
⚠️ Always preprocess your dataset, especially screencaps
Try to get as many full-body shots as possible. If the source image is a screenshot, stitch related frames together like this:
Avoid a visible seam between frames; instead try to merge them with a gradient so they blend:
Most raw screencaps are not ideal for training purposes, so it's recommended to manually check each image and apply Auto Level, Auto Contrast, or both. This will help improve the clarity and distinction between the subject and the background, making the images more suitable for training.
ℹ️ You can mix images that have been processed with Auto Level and/or Auto Contrast with raw images in your dataset. This can help Dreambooth learn how to reproduce colours accurately during inference.
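If you prefer to script this pass, here is a small, non-authoritative Pillow sketch using ImageOps.autocontrast as a stand-in for the Auto Level / Auto Contrast step. The folder name is an assumption, and results should still be checked by eye.

```python
# Sketch: automatic contrast/level pass over already-resized dataset images.
from pathlib import Path
from PIL import Image, ImageOps

SRC = Path("dataset_512")                          # assumed folder of resized images
for path in SRC.glob("*.png"):
    img = Image.open(path).convert("RGB")
    fixed = ImageOps.autocontrast(img, cutoff=1)   # clip 1% darkest/brightest, stretch the rest
    fixed.save(path.with_name(path.stem + "_ac.png"))
```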
This is a very important step: you need to describe what each picture shows. Manual tagging is preferred; you can use automatic DeepDanbooru or Waifu Tagger, but automatic tagging can lead to false positives.
⚠️ Keep tags as short as possible.
⚠️ Avoid repeated tags: skirt, pleated skirt should be just pleated skirt.
⚠️ The incremental tag method will be used.
⚠️ Keep common tags to the left!
⚠️ Using Danbooru tags is preferred.
To systematically organize subject names for training files, it's important to arrange them in a consistent manner, such as starting with the family name first. This will make it easier for the Stable Diffusion algorithm to search for specific tokens inside UNET neural networks, and also ensure that the subjects are properly identified.
Look around and find the most common outfit your waifu wears, and use it as the default.
Arrange your tags accordingly: the most important tag (character name) comes first, followed by clothing, expression, and so on.
Name | Clothing | Face Expression | Action | Body Direction | Camera |
---|---|---|---|---|---|
gotou hitori | black shirt | frown | standing | facing away | from above |
kita ikuyo | blue dress | smile | walking | facing to the side | from below |
shiina mahiru | school uniform, blazer | blush | lying | facing viewer | from behind |
kubo nagisa | school uniform, cardigan | blush, smile | sitting | facing back | looking at viewer |
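A tiny Python sketch of the same ordering rule: it joins the tag groups left to right exactly as in the table above and writes them to the .txt sidecar that `[filewords]` reads. The file name and tag values are examples, not required names.

```python
# Sketch: write a caption sidecar in the column order name > clothing > expression > action > direction > camera.
from pathlib import Path

def write_caption(image_path: str, *tag_groups: str) -> None:
    tags = ", ".join(t for t in tag_groups if t)     # skip empty groups, keep left-to-right order
    Path(image_path).with_suffix(".txt").write_text(tags, encoding="utf-8")

write_caption(
    "dataset_001.png",
    "gotou hitori",          # character name always first
    "black shirt",           # clothing (only if different from the default outfit)
    "frown",                 # face expression
    "standing",              # action
    "facing away",           # body direction
    "from above",            # camera
)
# dataset_001.txt now contains:
# gotou hitori, black shirt, frown, standing, facing away, from above
```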
When preparing a dataset for Dreambooth training, you can choose to omit certain information about the characters such as their hair color, eye color, and default clothing. Instead, you only need to include the anime name and background in the dataset.
During inference (txt2img), you simply provide the anime name in the prompt and the model will generate the correct eye color, hair color, etc. based on what it learned during training. This makes the process of generating images more efficient and streamlined, as you don't need to specify every detail about the characters in the prompt.
Avoid using a generic tag (e.g. `1girl`) for your waifu; it may lead to over-fitting, because other waifus would effectively become part of the trained dataset.
Anime Stable Diffusion models don't understand Left and Right; keep that in mind, since Danbooru tags don't have them!
It's possible to introduce your own prompts for left and right, but that training would be a big project and many hours of troubleshooting.
| | Camera | Body Direction | Face/Head |
|---|---|---|---|
| ⬆️ | from above | facing up | looking up |
| ⬇️ | from below | facing down | looking down |
| behind | from behind | | |
| side | from side | facing to the side | looking to the side |
| back | | facing back | looking back |
| another | | facing another | looking at another |
| camera | | facing viewer | looking at viewer |
You can combine these, e.g. `standing, facing away, looking at viewer`, to make the anime character's body face away while the head looks at you.
More tag can be found at Danbooru
The best `[filewords]` for prompting later on are character name, artist name, and eye style, so the user can mix any combination and any style.
`[filewords]`:

| default | | | | | |
|---|---|---|---|---|---|
| saitou yoshiko | | hanekoto | | bekkankou | |
| kubo nagisa | shiina mahiru | inoue takina | yamano mitsuha | sabine | sendou erika |

*(example images)*
Eyes drawn with the top eyelid slanted outwards, to the point where the outer corner of the eye is much lower than the inner corner. This usually produces a weak, gentle look and is generally given to characters with soft personalities (naturally, exceptions exist).
| tareme | | | | | |
|---|---|---|---|---|---|
| arawi keiichi | | nekotofu | | hamazi aki | |
| naganohara mio | aioi yuuko | oyama mihari | oyama mahiro | gotou hitori | ijichi nijika |

*(example images)*
When the top of the eye is drawn with a flat line. Used to effect listlessness, apathy, or a bored, expressionless, scornful, or smug face.
| jitome | | | |
|---|---|---|---|
| arawi keiichi | hamazi aki | nekotofu | |
| minakami mai | ijichi seika | oyama mahiro | nishikigi chisato |

*(example images)*
Eyes drawn with the top eyelid slanting inwards. This usually produces a strong, piercing look and is generally given to characters with forceful personalities (naturally, exceptions exist).
| tsurime | | |
|---|---|---|
| | yoshimizu kagami | tashiro tetsuya |
| sabine | hiiragi kagami | akame |

*(example images)*
Ensure that the artist's name and the art style remain the same throughout your dataset. This will help your model learn to recognize and reproduce the specific characteristics of that style.
To avoid bias and improve the model's ability to generalize, it's important to ensure that characters in your dataset are distinct from one another. If there are multiple images of the same character, try interleaving them with images of other characters to provide a more diverse set of examples for your model to learn from.
| example | artist name | hair length | hair colour | eye colour |
|---|---|---|---|---|
| *(image)* | bekkankou | long hair | purple hair | purple eyes |
| *(image)* | bekkankou | long hair | brown hair | green eyes |
| *(image)* | bekkankou | long hair | blonde hair | blue eyes |
With the trained model, to apply an art style simply invoke the artist name like this in txt2img:
masterpiece, best quality, highres, game cg, <artist name>, 1girl, <char name>, cherry blossoms, petals, flying petals, wind, upper body
masterpiece, best quality, highres, game cg, bekkankou, 1girl, sabine, cherry blossoms, petals, flying petals, wind, upper body
| bekkankou | sabine |
|---|---|
| *(example images)* | *(example images)* |
This visualization helps illustrate how Dreambooth is used for training and Stable Diffusion is used for inference. Together, these tools help the model understand the prompt and generate outputs that match it.
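If you prefer to run the prompt above from a script, a hedged sketch against the AUTOMATIC1111 txt2img API follows. The WebUI must be launched with `--api`, and the endpoint and payload fields reflect a recent WebUI build, so check your own `/docs` page if they differ; the image size and step count are assumptions.

```python
# Sketch: send the bekkankou/sabine prompt to the local WebUI txt2img API and save the result.
import base64
import requests

payload = {
    "prompt": "masterpiece, best quality, highres, game cg, bekkankou, 1girl, sabine, "
              "cherry blossoms, petals, flying petals, wind, upper body",
    "negative_prompt": "lowres, bad anatomy, bad hands, text, error, missing fingers, "
                       "extra digit, fewer digits, cropped, worst quality, low quality",
    "steps": 28,
    "cfg_scale": 12,        # same CFG the guide trains against
    "width": 512,
    "height": 768,
    "seed": 420420,
}

r = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=300)
r.raise_for_status()
image_b64 = r.json()["images"][0]           # base64-encoded PNG
with open("sabine_preview.png", "wb") as f:
    f.write(base64.b64decode(image_b64))
```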
Use the Dataset Tag Editor to speed up the tagging process; the first step is to add the character name, e.g. `gotou hitori`.
To start, load the dataset and go to:
- Batch Edit Caption
- Replace Text
- Entire Caption
- Search and Replace
All pictures now have the tag that you set!
Now, manually identify your waifu via Edit Caption of Selected Image. If the waifu is wearing something other than the default dress, tell it! Face expression, activity, etc...
After you are done, click Save all changes. Each picture will have a text file alongside it:
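If you would rather script the batch step, here is a minimal sketch that does the same thing as the Search and Replace pass: it prepends the character tag to every caption file that does not already start with it. The path and tag are examples.

```python
# Sketch: prepend a character tag to every .txt caption in the dataset folder.
from pathlib import Path

DATASET = Path(r"X:\path\to\your\dataset")   # hypothetical path
CHARACTER_TAG = "gotou hitori"

for txt in DATASET.glob("*.txt"):
    caption = txt.read_text(encoding="utf-8").strip()
    if not caption.startswith(CHARACTER_TAG):
        caption = f"{CHARACTER_TAG}, {caption}" if caption else CHARACTER_TAG
        txt.write_text(caption, encoding="utf-8")
```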
Once you are satisfied with your tags, you can begin training your favorite waifu!
- Set a Name: your model/project name, for example `UltimateWaifuMk1` (Ultimate Waifu Mark 1)
- Choose a Source Checkpoint, for example `Anything-V3.0-pruned-fp32.ckpt`
- Extract EMA Weights (optional): some ckpt files contain both EMA and non-EMA weights
- Unfreeze Model (optional): helps your training not get overwritten by global tags (e.g. `1girl`) and improves model training, reducing over-training at the cost of some VRAM
- Click the button to create the model
Avoid using a model that has been through Checkpoint Merger or is a "Merge Frankenstein"; use an original model for better training results:
- WaifuDiffusion
- AnythingV3
- NovelAI
If you have 8GB of VRAM:
- Use LoRA
**Epochs** Value: `80`
When training a Dreambooth model for anime characters, it is typically best to train for less than 100 epochs. An ideal number of epochs is around 80, particularly when using a CFG of 12 or higher. This means that the model will repeat the learning process 80 times to refine its understanding of the data and improve its ability to generate new images of anime characters. By training for fewer epochs, you can help ensure that the model converges to a good solution without overfitting to the training data.
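For a rough feel of what 80 epochs means in optimisation steps, assuming roughly 30 images and batch size 1 (both assumptions from the dataset guidance above):

```python
# Sketch: one epoch is one pass over the dataset, so total steps = images / batch_size * epochs.
images, epochs, batch_size = 30, 80, 1
steps_per_epoch = images // batch_size
total_steps = steps_per_epoch * epochs
print(total_steps)   # 2400
```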
**Save Model Frequency (Epochs)** Value: `0`
By having Dreambooth compile the model every "x" epochs, you can monitor the training process and detect when over-training is occurring. Over-training is when the model becomes too specific to the training data and starts to perform poorly on new, unseen data. By saving the model every few epochs, you can check its performance and make adjustments if necessary. However, saving the model too frequently can be harmful to the lifetime of the SSD (Solid State Drive) on which the model is stored. This is because writing data to an SSD too often can shorten its lifespan, so it's important to find a balance between monitoring the model's progress and preserving the SSD's health.
**Save Preview Frequency (Epochs)** Value: `0`
Generating a preview at the current "x" epoch allows you to view the results of the training process at that point. Ideally, this should be set to the same value as the "Save Model Frequency (Epochs)" to allow you to easily compare the model's progress. However, it is worth noting that the quality of the preview generated by Dreambooth may not be as good as the results obtained using native Inference Euler A or DDIM. Nevertheless, the general idea or concept of the preview should be enough to give you a rough idea of the model's progress and performance.
If you have a graphics card with more than 10GB of VRAM (Video Random Access Memory), you can speed up your training process by increasing the values of the relevant parameters. VRAM is a type of memory that is used by a graphics card to store and process visual data, and having more VRAM can allow your training to run faster by giving the model more memory to work with. By increasing the values of relevant parameters, you can take advantage of the additional VRAM and make the training process more efficient.
Value: 2
Value: 2
**Learning Rate** Value: `0.000001`
In Dreambooth, anime characters are treated as objects. When training a model to generate images of anime characters, it's important to choose an appropriate learning rate, which controls the speed at which the model updates its parameters during training. A good starting point for the learning rate in this scenario is 1e-6 or 0.000001. This low value helps ensure that the model updates its parameters slowly and steadily, reducing the risk of overshooting a good solution or becoming stuck in a suboptimal one. The precise value of the learning rate will depend on the specifics of your training data and model, but starting with a value of 1e-6 or 0.000001 is a good place to begin.
- LoRA UNET Learning Rate: `0.0001`
- LoRA Text Encoder Learning Rate: `0.00005`
⚠️ LoRA is a new training method that replaces full UNET fine-tuning; more testing against anime is needed!
Value: masterpiece, best quality, 1girl
- Use 8bit Adam
- Mixed Precision: `fp16`
- Memory Attention: `xformers`
- Step Ratio of Text Encoder Training: `0`
- Optional: AdamW Weight Decay: `0.01` to `0.005`
By default, you can use up to 4 concepts at a time; if you plan to use more than 4 concepts, read the Multi Concepts section.
- Dataset Directory: `X:\path\to\your\dataset`
- Instance Prompt: `[filewords]`
- Class Prompt: `[filewords]` (optional)
- Sample Image Prompt: `masterpiece, best quality, 1girl, [filewords]`
- Sample & Classification Image Negative Prompt: `lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name`
- Class Images Per Instance Image: `0` (a value of `2` to `10` is a good balance to counter over-training and over-fitting)
- Classification CFG Scale: `12` (most anime models use CFG 12, so why not)
- Sample Seed: `420420` (can be any number; useful to track sample results)
- Sample CFG Scale: `12` (same reason)
Add more concepts with the same values.
- Save in `.safetensors` format if you plan to share it on the internet
- Half Model: the model comes out at about 2GB, which also helps reduce SSD write wear
Click the Save Settings button first, then click Train!
Once training is complete, it will produce a report and a sanity-check sample, and you can start trying the model.
If you plan to train your model in stages, it's best to start with the smaller parts of the body such as legs, fingers, thighs, and so on. Training on these smaller parts first will significantly improve the overall performance of your model without destroying your waifu.
For 10GB VRAM users (RTX 3080), you can train the Text Encoder first, then resume training without the Text Encoder:
| | Stage 1 | Stage 2 |
|---|---|---|
| Epoch | 35 | 65 |
| Learning Rate | 0.000002 | 0.000001 |
| Optimizer | 8bit AdamW | 8bit AdamW |
| Mixed Precision | fp16 | fp16 |
| Memory Attention | xformers | xformers |
| Train UNET | ✅ | ✅ |
| Step Ratio of Text Encoder Training | 1 | 0 |
| Freeze CLIP Normalization Layers | ✅ | ✅ |
| Strict Tokens | ✅ | ✅ |
| **Testing Tab** | | |
| Deterministic | ✅ | ✅ |
| Use EMA for prediction | ✅ | ✅ |
Now that issue #916 has been solved, you can use the Concepts List to train multiple concepts; this gives generally better control, and each concept is trained in its own partition.
No need to do dataset interleaving anymore!
A well-organized folder structure can make it easier to manage and keep track of your dataset. By having a clear and logical arrangement of your files and folders, you can quickly find what you need and ensure that everything is in its proper place. This guide can help you establish a good folder structure, but you can also choose to disregard it if you have a different approach that works better for you.
[Project Name]
│
├── [Anime]
│   ├── [nishikigi chisato]
│   │   ├── dataset_001.png
│   │   └── dataset_001.txt
│   └── [inoue takina]
│       ├── dataset_001.png
│       └── dataset_001.txt
│
├── [Artwork]
│   ├── [artist style 1]
│   │   ├── dataset_001.png
│   │   └── dataset_001.txt
│   └── [artist style 2]
│       ├── dataset_001.png
│       └── dataset_001.txt
│
├── [CG]
│   ├── [game 1]
│   │   ├── dataset_001.png
│   │   └── dataset_001.txt
│   └── [game 2]
│       ├── dataset_001.png
│       └── dataset_001.txt
│
├── [Parts]
│   ├── [hands]
│   └── [feet]
│
└── project.json
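If you want to bootstrap this skeleton from a script, the following sketch simply creates the folders shown above; the project name and the group/concept names are placeholders.

```python
# Sketch: create the example dataset folder skeleton.
from pathlib import Path

ROOT = Path("UltimateWaifuMk1")           # [Project Name], placeholder
LAYOUT = {
    "Anime":   ["nishikigi chisato", "inoue takina"],
    "Artwork": ["artist style 1", "artist style 2"],
    "CG":      ["game 1", "game 2"],
    "Parts":   ["hands", "feet"],
}

for group, concepts in LAYOUT.items():
    for concept in concepts:
        (ROOT / group / concept).mkdir(parents=True, exist_ok=True)

(ROOT / "project.json").touch()           # empty concepts file to fill in later
```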
The following is an example of a JSON file. By using this file format, you can add as many data items as you need to train your Dreambooth Concept. The idea is to get a basic understanding of how to create and format the JSON file so that you can make use of it in your training process.
[
{
"class_data_dir": "",
"class_guidance_scale": 12,
"class_infer_steps": 40,
"class_negative_prompt": "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name",
"class_prompt": "1girl, bob cut, blonde hair, red eyes, red blazer",
"class_token": "",
"instance_data_dir": "E:\\dataset\\_FallenAngel\\Anime\\Nishikigi Chisato",
"instance_prompt": "[filewords]",
"instance_token": "",
"is_valid": true,
"n_save_sample": 1,
"num_class_images_per": 0,
"sample_seed": 420420,
"save_guidance_scale": 12,
"save_infer_steps": 40,
"save_sample_negative_prompt": "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name",
"save_sample_prompt": "masterpiece, best quality, 1girl, [filewords]",
"save_sample_template": ""
},
{
"class_data_dir": "",
"class_guidance_scale": 12,
"class_infer_steps": 40,
"class_negative_prompt": "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name",
"class_prompt": "1girl, long hair, black hair, purple eyes, open jacket",
"class_token": "",
"instance_data_dir": "E:\\dataset\\_FallenAngel\\Anime\\Inoue Takina",
"instance_prompt": "[filewords]",
"instance_token": "",
"is_valid": true,
"n_save_sample": 1,
"num_class_images_per": 0,
"sample_seed": 420420,
"save_guidance_scale": 12,
"save_infer_steps": 40,
"save_sample_negative_prompt": "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name",
"save_sample_prompt": "masterpiece, best quality, 1girl, [filewords]",
"save_sample_template": ""
}
]
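To avoid hand-editing the JSON for every concept, here is a minimal sketch that generates a concepts list in the same shape as the example above from the dataset folders. The dataset path is taken from the example, and each concept's `class_prompt` is left empty for you to fill in by hand.

```python
# Sketch: build a concepts-list JSON from one folder per concept, sharing the same sampling settings.
import json
from pathlib import Path

NEGATIVE = ("lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, "
            "fewer digits, cropped, worst quality, low quality, normal quality, "
            "jpeg artifacts, signature, watermark, username, blurry, artist name")

def concept(instance_dir: Path, class_prompt: str = "") -> dict:
    return {
        "class_data_dir": "",
        "class_guidance_scale": 12,
        "class_infer_steps": 40,
        "class_negative_prompt": NEGATIVE,
        "class_prompt": class_prompt,          # fill in per character, e.g. "1girl, bob cut, ..."
        "class_token": "",
        "instance_data_dir": str(instance_dir),
        "instance_prompt": "[filewords]",
        "instance_token": "",
        "is_valid": True,
        "n_save_sample": 1,
        "num_class_images_per": 0,
        "sample_seed": 420420,
        "save_guidance_scale": 12,
        "save_infer_steps": 40,
        "save_sample_negative_prompt": NEGATIVE,
        "save_sample_prompt": "masterpiece, best quality, 1girl, [filewords]",
        "save_sample_template": "",
    }

anime_root = Path(r"E:\dataset\_FallenAngel\Anime")     # path from the example above
concepts = [concept(d) for d in sorted(anime_root.iterdir()) if d.is_dir()]
Path("project.json").write_text(json.dumps(concepts, indent=2), encoding="utf-8")
```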
Adjust your total Epochs, Learning Rate, and Text Encoder Scale.
- Prior Loss Weight: `0.25` to `0.35` (`0.3`); the higher the number, the more the training leans towards the class images instead of your dataset
- Class Images Per Instance Image: `1` to `10`
Overfitting occurs when a model becomes too specific to the training data and does not generalize well to new data.
One way to prevent overfitting is to use a limited number of class images per instance image, around 2 to 10; this keeps the model simpler and more general, and thus less prone to overfitting.
- Try to reduce the number of tags (the `[filewords]` token); manual tagging is preferred, and keep it under 8 tags at most.
- Use Class Images and set Prior Loss Weight to around `.1` to `.5`, or the default `.75`.
Overtraining occurs when a model is trained for too long, or with too much data, and it starts to perform poorly on new, unseen data. The model has "memorized" the training data and is no longer able to generalize to new situations.
To troubleshoot overtraining, one solution is to use fewer number of epochs during training. An epoch is one complete pass through the entire training dataset. If you use too many epochs, the model may start to memorize the training data rather than learn the underlying patterns.
Another solution is to increase the learning rate. The learning rate controls how quickly the model updates its parameters during training. If the learning rate is too low, the model may take too long to converge and overtrain.
You can also reduce the text encoder scale. The text encoder is the part of the model that turns your caption tags into the conditioning the image model follows; if it is trained too heavily, it may overtrain and produce errors and glitches.
In summary, overtraining occurs when a model is trained for too long, or with too much data, and it starts to perform poorly on new, unseen data. To troubleshoot overtraining, you can use fewer number of epochs during training, increase the learning rate, or reduce the text encoder scale.
- Try reducing the number of epochs to less than `100` (the default value).
- Increase the Learning Rate.
Undertraining occurs when a model is not trained for long enough or with enough data, and it is not able to capture the patterns in the data. This can result in poor performance on the task at hand, such as producing nothing of your dataset or incomplete or wrong picture.
To troubleshoot undertraining, one solution is to resume training for the same number of epochs. An epoch is one complete pass through the entire training dataset. By resuming training for the same number of epochs, you are giving the model more opportunities to learn the patterns in the data.
Another solution is to reduce the learning rate. The learning rate controls how quickly the model updates its parameters during training. If the learning rate is too high, the model may not have enough time to converge and undertrain.
You can also increase the text encoder scale. The text encoder turns your caption tags into the conditioning the image model follows; if it is trained too little, the model may undertrain and produce nothing of your dataset, or incomplete or wrong pictures.
In summary, Under-training happens when a model is not trained for long enough or with enough data, and it is not able to capture the patterns in the data. To troubleshoot under-training, you can resume training for the same number of epochs, reduce the learning rate, or increase the text encoder scale.
- Resume training for a lower number of epochs (`20`).
- Increase the number of epochs to more than 50.
- Decrease the Learning Rate.
When training deep learning models, it is important to have enough memory (VRAM) on your GPU. This is because the model needs to store all the weights and intermediate computations during the training process.
Having a GPU with 12GB or more of VRAM, such as the RTX 3060 12GB, RTX 3080 12GB, RTX 4080 16GB, can help prevent issues such as out of memory (OOM) errors, under-training, over-training, and over-fitting.
- Use LoRA
- Train in Stages if you are a 10GB VRAM user
- Buy a new Nvidia GPU with more than 12GB of VRAM!