Anime Training - Anime4000/sd_dreambooth_extension GitHub Wiki

Preamble

In this wiki, we show the basic steps for training your own anime waifu or harem.

Still a WIP; many things still need to be updated, tested, and experimented with.

Table of Contents

  1. Prerequisite
  2. Dataset
  3. Style
  4. Tagging
  5. Training
  6. Training in Stages
  7. Generate CKPT
  8. Multi Concepts: create your own harem world
  9. Troubleshoot

Prerequisite

Software

  1. Windows 10/11
  2. Stable Diffusion WebUI (AUTOMATIC1111)
  3. Dreambooth Extension
  4. Dataset Tag Editor

Hardware

  1. Decent multi-core CPU (High GHz 4-core minimum)
  2. 16GB RAM
  3. Modern Nvidia GPU
  4. 10GB VRAM (Windows)
    • 8GB VRAM must use LoRA!

Limitation

Nvidia TurboCache

Dreambooth training requires a lot of memory. Linux does not support Nvidia TurboCache, a technology that allows system RAM to be used as an overflow buffer for graphics memory. On the other hand, CUDA on Linux is able to use ~99% of the VRAM.

10GB VRAM

  • Training on 10GB VRAM only works on Windows 10+
  • Linux users must use LoRA or modify Kernel Mode Setting (KMS) to offload some VRAM to system RAM just like Windows does.
    • Try killing the desktop environment to save some VRAM and running via SSH

8GB VRAM

⚠️ Must use LoRA to train!

Dataset

Similar to ELI5 Training, with a few different tweaks

Length

It's important to have a dataset that is large enough for Dreambooth to learn from, but not so large that it leads to over-fitting or over-training. A good rule of thumb is to limit your dataset to a maximum of around 30 images. Additionally, it's important to balance the number of images for each concept in your dataset. If you have a high count dataset for one concept and a low count dataset for another, the high count dataset may overpower or crush the low count dataset.

The length of your dataset for each concept can also greatly affect the number of training epochs needed. If your dataset contains less than 15 images for a particular concept, you may need to train for over 100 epochs to achieve good results. Conversely, if you have more images for a particular concept, you may be able to train for fewer epochs and still achieve good results.

More info and settings can be found here: TEnc + UNET

Source Image

Good Source

  • Sharp and High Resolution
    Pictures will be down-scaled to 512x512 pixels
  • Clear and Clean
    Waifu must be alone, without other characters in the frame
  • Diversity
    All waifu activities, locations, different backgrounds, facial expressions, from above, below, the side, behind ...
  • Uncomplicated
    Avoid rare expressions, masks covering the face, glitched or blurry frames, ...
  • Less Close-up
    Too many close-ups will make txt2img output more close-ups, losing diversity

Acceptable

  • Text
    Text on a t-shirt, dialogue, signboards; try to minimise this
  • Low Light
    Avoid low-light scenes (dungeon, underworld)
  • Too Bright
    Avoid overly bright scenes (lens flare, god rays over the character's face)

Bad Source

  • Wrong Aspect Ratio
    Your waifu gets squeezed
  • Indistinguishable
    Repeated frames, same background, same angle
  • Multiple Characters
    Another character close to your waifu that can't be cropped away
  • Background Waifu
    Your waifu is not the main focus: blurred, behind another character
  • Subtitles
    Sources with burned-in subtitles, bad screenshots

Think of Dreambooth as an employee that is learning from your dataset. Don't give it tasks that are too complex, or you can end up with incomplete training (under-training), errors and glitches in the training (over-training), or a model that becomes too specific to your dataset and does not generalize well to new data (over-fitting).

Preprocess

Check every picture manually; make sure each picture falls into the Good Source category, with only a few from Acceptable!

Upscale

If your source is a screenshot or a low-resolution JPEG, you need to upscale it first to reduce compression artifacts. Use the built-in upscaler in the Extras tab of Stable Diffusion WebUI and choose R-ESRGAN 4x+ Anime6B at 1x to 2x.
SDWebUI_ExtraTab.PNG

⚠ Upscaling 4 times can lead to thick art lines, which will be an issue when downscaling!
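
If you have many screencaps, this step can optionally be scripted instead of done image-by-image in the Extras tab. The sketch below is only a rough illustration: it assumes the WebUI is running locally with the --api flag, that the /sdapi/v1/extra-single-image endpoint and its field names match your WebUI build, and the folder paths are placeholders.

import base64
from pathlib import Path

import requests

WEBUI = "http://127.0.0.1:7860"                 # assumes the WebUI was launched with --api
SRC = Path("X:/path/to/your/dataset_raw")       # raw screencaps (placeholder path)
DST = Path("X:/path/to/your/dataset_upscaled")  # output folder (placeholder path)
DST.mkdir(exist_ok=True)

for img_path in sorted(SRC.glob("*.png")):
    payload = {
        "image": base64.b64encode(img_path.read_bytes()).decode(),
        "upscaling_resize": 2,                  # 1x-2x, as recommended above
        "upscaler_1": "R-ESRGAN 4x+ Anime6B",
    }
    r = requests.post(f"{WEBUI}/sdapi/v1/extra-single-image", json=payload, timeout=600)
    r.raise_for_status()
    (DST / img_path.name).write_bytes(base64.b64decode(r.json()["image"]))
    print("upscaled", img_path.name)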

Downscale

⚠ With the latest Dreambooth Extension, you can skip this downscale step; the Dreambooth ImageBucket will downscale for you automatically and beautifully. If you still prefer to do it yourself for these reasons, proceed:

  1. Hide bad upscaling
  2. Reduce art line thickness
  3. Eliminate compression artifacts
  4. Make it look sharp

Use XnConvert to downscale properly at the highest quality. Do not mix Wide and Portrait in the input files; process Wide and Portrait separately (a scripted alternative is sketched below)...

  1. Add action > Image > Resize
  2. Enlarge/Reduce: Always
  3. Resample: Lanczos2 (like 8x anti-aliasing)

Wide Screen

  1. Mode: Height

Portrait

  1. Mode: Width
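
If you would rather script the downscale than use XnConvert's GUI, here is a minimal sketch using Pillow's Lanczos filter in place of XnConvert's Lanczos2 (an approximation, not the same resampler). The paths are placeholders, and it follows the two modes above: height fixed at 512 for wide images, width fixed at 512 for portrait images, leaving the longer side to be handled in the Crop step.

from pathlib import Path

from PIL import Image

SRC = Path("X:/path/to/your/dataset_upscaled")   # placeholder path
DST = Path("X:/path/to/your/dataset_512")        # placeholder path
DST.mkdir(exist_ok=True)

for img_path in sorted(SRC.glob("*.png")):
    img = Image.open(img_path)
    w, h = img.size
    if w >= h:
        new_size = (round(w * 512 / h), 512)     # wide: constrain height to 512
    else:
        new_size = (512, round(h * 512 / w))     # portrait: constrain width to 512
    img.resize(new_size, Image.LANCZOS).save(DST / img_path.name)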

Crop

With Image Bucket, you can skip this step and let ImageBucket pick, resize, and crop automatically. If you are not confident with Image Bucket, you can still crop manually yourself.

Valid Resolution

To accelerate training and improve training quality, it is recommended to tightly crop your dataset subject. By cropping out extraneous information from your images, your model can focus on learning the important features of the subject and reduce the amount of noise in the data. This can result in faster and more accurate training, as well as more robust models that are better able to generalize to new data.

Ratio 512 1080p
1:1 512x512 1080x1080
7:8 448x512 945x1080
3:4 384x512 810x1080
5:8 320x512 675x1080
1:2 256x512 540x1080

These crop ratios are for vertical/portrait images and are optimized for the common 1080p resolution used in screencaps.
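
As a rough illustration of what a bucket-style choice looks like, the sketch below picks the closest valid portrait resolution from the table above for a given crop size. It is purely illustrative; the actual ImageBucket logic inside the extension may differ.

# Valid portrait resolutions from the table above, as (width, height)
VALID_PORTRAIT = [(512, 512), (448, 512), (384, 512), (320, 512), (256, 512)]

def nearest_bucket(width, height):
    # Return the valid resolution whose aspect ratio is closest to the input image
    aspect = width / height
    return min(VALID_PORTRAIT, key=lambda wh: abs(wh[0] / wh[1] - aspect))

print(nearest_bucket(810, 1080))   # -> (384, 512), i.e. the 3:4 crop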

NOTE

BIRME downscales using the Nearest Neighbor algorithm, which leaves your picture without anti-aliasing;
always downscale with XnConvert: XnConvert vs BIRME

⚠️ If you prefer, you can skip the downscaling step and use the original resolution of your images. The ImageBucket Latent algorithm can downscale the images for you while maintaining high quality.

Good Cropping

Waifu is in focus with a blurred background (other characters blurred behind her)
Tightly cropped to 810x1080 (384x512)

Acceptable

Other characters take up < 5% of the crop area; try to keep such images few in the dataset. It is better to crop tightly to reduce noise and unwanted data.

NO!

Character holding an object that covers the face
Other characters too visible in the crop area
Character too close (close-up) will cause your model to lose variation!

Bad Dataset

Make sure there are no bad images in the dataset; having even one will cause your final model to produce bad results

Finger


Too Small

⚠ Always remove bad drawings from the dataset!

Comparison RAW vs Processed

RAW vs. R-ESRGAN 4x+ Anime6B (1x)

⚠ Always preprocess your dataset, especially screencaps

Stitches

Try to get as many full-body images as possible. If your source images are screenshots, stitch related frames together like this: stitches_nagisa

Avoid leaving a visible split between frames; instead, try to merge them with a gradient so they blend: stitches_mahiru

⚠️ This way, Dreambooth is able to understand the whole character: its uniform, clothes, dress, skirt, etc. Merging frames into one image is highly recommended!

Colour Correction

Most raw screencaps are not ideal for training purposes, so it's recommended to manually check each image and apply Auto Level, Auto Contrast, or both. This will help improve the clarity and distinction between the subject and the background, making the images more suitable for training.

Raw

Raw Screencaps

Auto Level & Contrast

Processed Screencaps

Auto Contrast

Processed Screencaps

ℹ️ You can mix images that have been processed with Auto Level and/or Auto Contrast with raw images in your dataset. This can help Dreambooth learn how to reproduce colours accurately during inference.

Tagging

This is a very important step: you need to describe what is in each picture. Manual tagging is preferred; you can use the automatic DeepDanbooru or Waifu Tagger, but automatic tagging can lead to false positives.

⚠ Also keep tags as short as possible.
⚠ Avoid repeated tags: for skirt, pleated skirt, keep just pleated skirt
⚠ The incremental tag method will be used.
⚠ Keep common tags to the left!
⚠ Using Danbooru tags is preferred (a small tag-cleanup sketch follows).
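
A small sketch of what this cleanup can look like in practice: it drops a generic tag when a more specific tag already contains it (the skirt / pleated skirt case above) and trims each caption to 8 tags, in line with the over-fitting advice later in this guide. The dataset path is a placeholder and the script overwrites the caption files in place, so keep a backup.

from pathlib import Path

DATASET = Path("X:/path/to/your/dataset")   # placeholder path
MAX_TAGS = 8

def clean_tags(caption):
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    # Drop a tag if another tag already contains it as a whole word,
    # e.g. "skirt" is dropped when "pleated skirt" is present.
    kept = [
        t for t in tags
        if not any(t != other and t in other.split() for other in tags)
    ]
    return ", ".join(kept[:MAX_TAGS])

for txt in sorted(DATASET.glob("*.txt")):
    txt.write_text(clean_tags(txt.read_text(encoding="utf-8")), encoding="utf-8")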

Naming

To systematically organize subject names for training files, it's important to arrange them in a consistent manner, such as starting with the family name first. This makes it easier for Stable Diffusion to look up specific tokens inside the UNET, and also ensures that the subjects are properly identified.

Identify Your Waifu

Look through your dataset, find the most common dress that your waifu is wearing, and use it as the default

Example

tag

Tag Syntax

Arrange your tags accordingly, with the most important (character name) as the first tag, followed by clothing, expression...

Example

Name Clothing Face Expression Action Body Direction Camera
gotou hitori black shirt frown standing facing away from above
kita ikuyo blue dress smile walking facing to the side from below
shiina mahiru school uniform, blazer blush lying facing viewer from behind
kubo nagisa school uniform, cardigan blush, smile sitting facing back looking at viewer
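
Each row above becomes one comma-separated caption stored in the image's .txt file, which is what [filewords] picks up later. For example, a hypothetical dataset_001.txt for the first row could contain:

gotou hitori, black shirt, frown, standing, facing away, from above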

⚠ Character Recognition

When preparing a dataset for Dreambooth training, you can choose to omit certain information about the characters such as their hair color, eye color, and default clothing. Instead, you only need to include the anime name and background in the dataset.

During inference (txt2img), you simply provide the anime name in the prompt and the model will generate the correct eye color, hair color, etc. based on what it learned during training. This makes the process of generating images more efficient and streamlined, as you don't need to specify every detail about the characters in the prompt.

⚠ Generic Tag

Avoid using a generic tag (eg: 1girl) for your waifu; it may lead to over-fitting, causing other waifus to take on the trained dataset

⚠ Direction

Anime Stable Diffusion models don't understand Left and Right; keep that in mind, since Danbooru tags don't have them!
It's possible to introduce your own prompts for left and right, but that training would be a big project with many hours of troubleshooting

Common Tag

         Camera        Body Direction       Face/Head
⬆️       from above    facing up            looking up
⬇️       from below    facing down          looking down
behind   from behind
side     from side     facing to the side   looking to the side
back                   facing back          looking back
another                facing another       looking at another
camera                 facing viewer        looking at viewer

You can combine these, e.g. standing, facing away, looking at viewer, to make the anime character's body face away while the head looks at you

More tag can be found at Danbooru

Style

Eye Style

The best [filewords] for prompting later on are character name, artist name, and eye style, so that the user can use any combination of character and style.

Default

⚠️ The default eye style does not need to be mentioned in [filewords]

default
saitou yoshiko hanekoto bekkankou
kubo nagisa shiina mahiru inoue takina yamano mitsuha sabine sendou erika

tareme

Eyes drawn with the top eyelid slanted outwards, to the point where the outer corner of the eye is much lower than the inner corner. This usually produces a weak, gentle look and is generally given to characters with soft personalities (naturally, exceptions exist).

tareme
arawi keiichi nekotofu hamazi aki
naganohara mio aioi yuuko oyama mihari oyama mahiro gotou hitori ijichi nijika

jitome

When the top of the eye is drawn with a flat line. Used to effect listlessness, apathy, or a bored, expressionless, scornful, or smug face.

jitome
arawi keiichi hamazi aki nekotofu  
minakami mai ijichi seika oyama mahiro nishikigi chisato

tsurime

Eyes drawn with the top eyelid slanting inwards. This usually produces a strong, piercing look and is generally given to characters with forceful personalities (naturally, exceptions exist).

tsurime
yoshimizu kagami tashiro tetsuya
sabine hiiragi kagami akame

Art Style

Ensure that the artist's name and the art style remain the same throughout your dataset. This will help your model learn to recognize and reproduce the specific characteristics of that style.

To avoid bias and improve the model's ability to generalize, it's important to ensure that characters in your dataset are distinct from one another. If there are multiple images of the same character, try interleaving them with images of other characters to provide a more diverse set of examples for your model to learn from.

example artist name hair length hair colour eye colour
bekkankou1 bekkankou long hair purple hair purple eyes
bekkankou2 bekkankou long hair brown hair green eyes
bekkankou3 bekkankou long hair blonde hair blue eyes

Invoking Art Style

With a trained model, to apply an art style simply invoke the artist name like this in txt2img

Syntax

masterpiece, best quality, highres, game cg, <artist name>, 1girl, <char name>, cherry blossoms, petals, flying petals, wind, upper body

Example

masterpiece, best quality, highres, game cg, bekkankou, 1girl, sabine, cherry blossoms, petals, flying petals, wind, upper body

Results

bekkankou sabine
txt2img results

This visualization helps illustrate how Dreambooth is used for training and Stable Diffusion is used for inference. Together, these tools help the model understand the prompt and generate outputs that match it.

Dataset Tag Editor

Use Dataset Tag Editor to speed up the tagging process; the first step is to add the character name, e.g. gotou hitori
DSTE

To start, load the dataset and go to:

  1. Batch edit caption
  2. Replace Text
  3. Entire Caption
  4. Search and Replace

All pictures now have the tag that you set!

Now, manually identify your waifu via Edit Caption of Selected Image. If the waifu is wearing something other than the default dress, tell it! Facial expression, activity, etc...

Save Tagging

After you are done, click Save all changes. Each picture will then have a text file alongside it:
FileExplorer

Training

Once you are satisfied with your tags, you can begin training your favorite waifu!

Create Model

DB Create Model

  1. Set a Name
    your model/project name, for example: UltimateWaifuMk1 (Ultimate Waifu Mark 1)
  2. Choose Source Checkpoint
    Example: Anything-V3.0-pruned-fp32.ckpt
      • Extract EMA Weights Optional
        some ckpt files contain both EMA and non-EMA weights
      • Unfreeze Model Optional
        helps your training not get overwritten by global tags (eg: 1girl), improves model training, and reduces over-training at the cost of some VRAM
  3. Click the button

⚠ Checkpoint Merger Model

Avoid using a model that has been through the Checkpoint Merger ("Merge Frankenstein"); please use an original model for better training results.

List of known models for anime

  • WaifuDiffusion
  • AnythingV3
  • NovelAI

Settings

Basic

DB Basic

General

If you have 8GB of VRAM:

  • Use LoRA

Intervals

Training Steps Per Image (Epochs)

Value: 80

When training a Dreambooth model for anime characters, it is typically best to train for less than 100 epochs. An ideal number of epochs is around 80, particularly when using a CFG of 12 or higher. This means that the model will repeat the learning process 80 times to refine its understanding of the data and improve its ability to generate new images of anime characters. By training for fewer epochs, you can help ensure that the model converges to a good solution without overfitting to the training data.

Save Model Frequency (Epochs)

Value: 0

By having Dreambooth compile the model every "x" epochs, you can monitor the training process and detect when over-training is occurring. Over-training is when the model becomes too specific to the training data and starts to perform poorly on new, unseen data. By saving the model every few epochs, you can check its performance and make adjustments if necessary. However, saving the model too frequently can be harmful to the lifetime of the SSD (Solid State Drive) on which the model is stored. This is because writing data to an SSD too often can shorten its lifespan, so it's important to find a balance between monitoring the model's progress and preserving the SSD's health.

Save Preview(s) Frequency (Epochs)

Value: 0

Generating a preview at the current "x" epoch allows you to view the results of the training process at that point. Ideally, this should be set to the same value as the "Save Model Frequency (Epochs)" to allow you to easily compare the model's progress. However, it is worth noting that the quality of the preview generated by Dreambooth may not be as good as the results obtained using native Inference Euler A or DDIM. Nevertheless, the general idea or concept of the preview should be enough to give you a rough idea of the model's progress and performance.

Batching

If you have a graphics card with more than 10GB of VRAM (Video Random Access Memory), you can speed up your training process by increasing the values of the relevant parameters. VRAM is a type of memory that is used by a graphics card to store and process visual data, and having more VRAM can allow your training to run faster by giving the model more memory to work with. By increasing the values of relevant parameters, you can take advantage of the additional VRAM and make the training process more efficient.

Batch Size

Value: 2

Gradient Accumulation Steps

Value: 2
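
As a back-of-envelope sketch (not the extension's exact bookkeeping), the effective batch size is Batch Size × Gradient Accumulation Steps, and the optimizer step count follows from that and the epoch count. The 30-image dataset is just an assumed example.

images = 30                        # assumed dataset size (see the Dataset section)
epochs = 80                        # "Training Steps Per Image (Epochs)" above
batch_size = 2
gradient_accumulation_steps = 2

effective_batch = batch_size * gradient_accumulation_steps    # 4 images per optimizer step
steps_per_epoch = images / effective_batch                     # 7.5 -> ~8 steps
total_steps = epochs * steps_per_epoch                         # ~600 optimizer steps
print(effective_batch, round(steps_per_epoch), round(total_steps))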

Learning Rate

DB LR

Learning Rate

Value: 0.000001

In Dreambooth, anime characters are treated as objects. When training a model to generate images of anime characters, it's important to choose an appropriate learning rate, which controls the speed at which the model updates its parameters during training. A good starting point for the learning rate in this scenario is 1e-6 or 0.000001. This low value helps ensure that the model updates its parameters slowly and steadily, reducing the risk of overshooting a good solution or becoming stuck in a suboptimal one. The precise value of the learning rate will depend on the specifics of your training data and model, but starting with a value of 1e-6 or 0.000001 is a good place to begin.

LoRA (For 8GB GPU)

  1. LoRA UNET Learning Rate: 0.0001
  2. LoRA Text Encoder Learning Rate: 0.00005

⚠ LoRA is a new training method that trains lightweight adapter weights instead of the full UNET; more testing against anime is needed!

Image Processing

DB Image Processing

Sanity Sample Prompt

Value: masterpiece, best quality, 1girl

Advanced

DB Advance

Tuning

    • Use 8bit Adam
  1. Mixed Precision: fp16
  2. Memory Attention: xformers
  3. Step Ratio of Text Encoder Training: 0
  4. Optional: AdamW Weight Decay: 0.01 - 0.005

Concepts

By default, you can use 4 concepts at a time; if you plan to use more than 4 concepts, read here: Multi Concepts

Directories

  1. Dataset Directory: X:\path\to\your\dataset

Prompts

  1. Instance Prompt: [filewords]
  2. Class Prompt: [filewords] Optional
  3. Sample Image Prompt: masterpiece, best quality, 1girl, [filewords]
  4. Sample & Classification Image Negative Prompt:
lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name
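
To see how [filewords] is filled in, recall that the extension substitutes each image's caption text for the token. With the hypothetical caption from the Tagging section, the prompts above expand roughly like this:

dataset_001.txt:       gotou hitori, black shirt, frown, standing, facing away, from above
Instance Prompt:       gotou hitori, black shirt, frown, standing, facing away, from above
Sample Image Prompt:   masterpiece, best quality, 1girl, gotou hitori, black shirt, frown, standing, facing away, from above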

Image Generation

  1. Class Images Per Instance Image: 0
    A value of 2 - 10 is a good balance to counter over-training and over-fitting
  2. Classification CFG Scale: 12
    Most anime models use CFG 12, so why not.
  3. Sample Seed: 420420
    Can be any number; useful for tracking sample results
  4. Sample CFG Scale: 12
    Same reason

Add more concepts with the same values.

Savings (optional)

    • Save in .safetensors format
      if you plan to share it on the internet
    • Half Model
      Saves the model at 2GB size; this might help reduce wear on your SSD's write lifetime

Start Training!

Click the Save Settings button first, then click Train

Once training is complete, it will produce a report and a sanity check, and you can start trying it out. DB Train Done

Training in Stages

Body Parts

If you plan to train your model in stages, it's best to start with the smaller parts of the body such as legs, fingers, thighs, .... Training on these smaller parts first will significantly improve the overall performance of your model without destroying your waifu.

TEnc + UNET

For 10GB VRAM users (RTX 3080), you can train the Text Encoder first, then resume training without the Text Encoder

                                       Stage 1      Stage 2
Epoch                                  35           65
Learning Rate                          0.000002     0.000001
Optimizer                              8bit AdamW
Mixed Precision                        fp16
Memory Attention                       xformers
Train UNET
Step Ratio of Text Encoder Training    1            0
Freeze CLIP Normalization Layers
Strict Tokens
Testing Tab
Deterministic
Use EMA for prediction
⚠️ With these settings, your dataset length must be around ~12 to ~20 images for art style and ~20 to ~30 images for character.

Multi Concepts

Multi Concepts

As issue #916 has been solved, you can use the Concepts List to train multiple concepts; this gives generally better control, and it will train each concept in its own partition.

No need to do Interleaving Dataset 😊

Folder Structure

A well-organized folder structure can make it easier to manage and keep track of your dataset. By having a clear and logical arrangement of your files and folders, you can quickly find what you need and ensure that everything is in its proper place. This guide can help you establish a good folder structure, but you can also choose to disregard it if you have a different approach that works better for you.

[Project Name]
   │
   ├── [Anime]
   │     │
   │     ├── [nishikigi chisato]
   │     │     │
   │     │     ├── dataset_001.png
   │     │     ├── dataset_001.txt
   │     │     ⁞
   │     │
   │     ├── [inoue takina]
   │     ⁞     │
   │           ├── dataset_001.png
   │           ├── dataset_001.txt
   │           ⁞
   │
   ├── [Artwork]
   │     │
   │     ├── [artist style 1]
   │     │     │
   │     │     ├── dataset_001.png
   │     │     ├── dataset_001.txt
   │     │     ⁞
   │     │
   │     ├── [artist style 2]
   │     ⁞     │
   │           ├── dataset_001.png
   │           ├── dataset_001.txt
   │           ⁞
   │
   ├── [CG]
   │     │
   │     ├── [game 1]
   │     │     │
   │     │     ├── dataset_001.png
   │     │     ├── dataset_001.txt
   │     │     ⁞
   │     │
   │     ├── [game 2]
   │     ⁞     │
   │           ├── dataset_001.png
   │           ├── dataset_001.txt
   │           ⁞
   │
   ├── [Parts]
   │     │
   │     ├── [hands]
   │     ├── [feet]
   │
   ⁞
   └── project.json

JSON

The following is an example of a JSON file. By using this file format, you can add as many data items as you need to train your Dreambooth Concept. The idea is to get a basic understanding of how to create and format the JSON file so that you can make use of it in your training process.

⚠️ Please validate your JSON file before using it

[
	{
		"class_data_dir": "",
		"class_guidance_scale": 12,
		"class_infer_steps": 40,
		"class_negative_prompt": "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name",
		"class_prompt": "1girl, bob cut, blonde hair, red eyes, red blazer",
		"class_token": "",
		"instance_data_dir": "E:\\dataset\\_FallenAngel\\Anime\\Nishikigi Chisato",
		"instance_prompt": "[filewords]",
		"instance_token": "",
		"is_valid": true,
		"n_save_sample": 1,
		"num_class_images_per": 0,
		"sample_seed": 420420,
		"save_guidance_scale": 12,
		"save_infer_steps": 40,
		"save_sample_negative_prompt": "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name",
		"save_sample_prompt": "masterpiece, best quality, 1girl, [filewords]",
		"save_sample_template": ""
	},
	{
		"class_data_dir": "",
		"class_guidance_scale": 12,
		"class_infer_steps": 40,
		"class_negative_prompt": "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name",
		"class_prompt": "1girl, long hair, black hair, purple eyes, open jacket",
		"class_token": "",
		"instance_data_dir": "E:\\dataset\\_FallenAngel\\Anime\\Inoue Takina",
		"instance_prompt": "[filewords]",
		"instance_token": "",
		"is_valid": true,
		"n_save_sample": 1,
		"num_class_images_per": 0,
		"sample_seed": 420420,
		"save_guidance_scale": 12,
		"save_infer_steps": 40,
		"save_sample_negative_prompt": "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name",
		"save_sample_prompt": "masterpiece, best quality, 1girl, [filewords]",
		"save_sample_template": ""
	}
]
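
To go with the validation warning above, here is a minimal Python sketch that checks the concepts file parses, that each entry carries the keys used in this guide, and that every instance_data_dir actually exists. The project.json file name matches the folder structure above; adjust it to your own layout.

import json
from pathlib import Path

# Keys this guide relies on for each concept entry
REQUIRED = {"instance_data_dir", "instance_prompt", "class_prompt", "num_class_images_per"}

concepts = json.loads(Path("project.json").read_text(encoding="utf-8"))

for i, concept in enumerate(concepts):
    missing = REQUIRED - concept.keys()
    if missing:
        print(f"concept {i}: missing keys {sorted(missing)}")
    data_dir = Path(concept.get("instance_data_dir", ""))
    if data_dir.is_dir():
        images = len(list(data_dir.glob("*.png")))
        print(f"concept {i}: {images} images in {data_dir}")
    else:
        print(f"concept {i}: instance_data_dir not found: {data_dir}")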

Troubleshoot

Adjust your total Epochs, Learning Rate and Text Encoder Scale

  • Prior Loss Weight: 0.25 - 0.35 (0.3); the higher the number, the more training leans towards class images instead of your dataset
  • Class Images Per Instance Image: 1 - 10

Over-fitting

Example of Over Fitting

Overfitting occurs when a model becomes too specific to the training data and does not generalize well to new data.

One way to prevent overfitting is to use a limited number of class images in the training dataset, around 2 to 10 per instance image; this can make the model simpler and more general, and thus less prone to overfitting.

Solution

  1. Try reducing the number of tags ([filewords] tokens); manual tagging is preferred. Keep it under 8 tags at most.
  2. Use Class Images and set Prior Loss Weight to around 0.1 to 0.5, or the default 0.75

Over-training

Example of Over Training

Overtraining occurs when a model is trained for too long, or with too much data, and it starts to perform poorly on new, unseen data. The model has "memorized" the training data and is no longer able to generalize to new situations.

To troubleshoot overtraining, one solution is to use fewer number of epochs during training. An epoch is one complete pass through the entire training dataset. If you use too many epochs, the model may start to memorize the training data rather than learn the underlying patterns.

Another solution is to increase the learning rate. The learning rate controls how quickly the model updates its parameters during training. If the learning rate is too low, the model may take too long to converge and overtrain.

You can also reduce the Step Ratio of Text Encoder Training. The text encoder maps your captions into the conditioning the UNET trains against; if it is trained too much, it can overtrain and produce errors and glitches.

In summary, overtraining occurs when a model is trained for too long, or with too much data, and it starts to perform poorly on new, unseen data. To troubleshoot overtraining, you can use fewer number of epochs during training, increase the learning rate, or reduce the text encoder scale.

Solution

  1. Try reduce number of Epochs, less than 100 (default value)
  2. Increase Learning Rate

Under-training

Undertraining occurs when a model is not trained for long enough or with enough data, and it is not able to capture the patterns in the data. This can result in poor performance on the task at hand, such as producing nothing that resembles your dataset, or incomplete or wrong pictures.

To troubleshoot undertraining, one solution is to resume training for the same number of epochs. An epoch is one complete pass through the entire training dataset. By resuming training for the same number of epochs, you are giving the model more opportunities to learn the patterns in the data.

Another solution is to reduce the learning rate. The learning rate controls how quickly the model updates its parameters during training. If the learning rate is too high, the model may not have enough time to converge and undertrain.

You can also increase the Step Ratio of Text Encoder Training. The text encoder maps your captions into the conditioning the model trains against; if it is trained too little, the model may undertrain and produce nothing of your dataset, or incomplete or wrong pictures.

In summary, Under-training happens when a model is not trained for long enough or with enough data, and it is not able to capture the patterns in the data. To troubleshoot under-training, you can resume training for the same number of epochs, reduce the learning rate, or increase the text encoder scale.

Solution

  1. Resume training at lower epochs (20).
  2. Increase number of epochs, > 50.
  3. Decrease Learning Rate

VRAM OOM

When training deep learning models, it is important to have enough memory (VRAM) on your GPU. This is because the model needs to store all the weights and intermediate computations during the training process.

Having a GPU with 12GB or more of VRAM, such as the RTX 3060 12GB, RTX 3080 12GB, RTX 4080 16GB, can help prevent issues such as out of memory (OOM) errors, under-training, over-training, and over-fitting.

Solution

  1. Use LoRA
  2. Train in Stages for 10GB VRAM user
  3. Buy a new Nvidia GPU that has more than 12GB of VRAM!