Model Training ‐ Comparison ‐ Brief

So, now that you have a rough understanding of how I made the comparisons, we can proceed directly to the comparison results and an overview of the main parameters. We will go through them in the order they are presented in Kohya's GUI.

Dataset

However, we'll start with the dataset requirements. I've already told you about quality, diversity, and the required number of images. But there's one more pretty important dataset parameter: aspect ratio. It's better to crop all the images to a 1:1 aspect ratio to avoid issues that can make the results unstable.
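If you want to automate the cropping, here's a minimal sketch using Pillow; the folder names are just placeholders:

```python
from pathlib import Path
from PIL import Image

def center_crop_square(path: Path, out_dir: Path) -> None:
    """Center-crop an image to a 1:1 aspect ratio and save it."""
    img = Image.open(path)
    side = min(img.size)  # the shorter edge defines the square
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img.crop((left, top, left + side, top + side)).save(out_dir / path.name)

out_dir = Path("dataset/cropped")  # placeholder paths
out_dir.mkdir(parents=True, exist_ok=True)
for p in Path("dataset/raw").glob("*.jpg"):
    center_crop_square(p, out_dir)
```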

Model Quick Pick | Regularisation

So, the Model Quick Pick is a parameter from the Source Model tab, which is essentially the path to the checkpoint used for the model training. While its purpose is fairly clear, regularization needs more explanation. In the Folders tab, you need to specify the Regularisation Folder, which is the path to the folder containing subfolders with regularization images. I have prepared them for you. You should specify the path to the root folder, either man or woman, depending on whose model you are training.

Still, what is it, and how does it work? Honestly, I couldn't find a reasonably clear and technical explanation. But the essence is that along with our dataset images, the model also learns from images of the class it's studying. In other words, we train the model on photos of our woman, Billie Eilish, while simultaneously feeding it photos of random women. As a result, it better understands what woman means in general, and therefore generates Billie Eilish better.

Previously, it was believed that it's better to use regularization images generated by the same checkpoint that is used for the model training. These images are often terribly distorted and unattractive. Just take a look at this ugliness. However, recently there has been a growing idea that it's better to use real photos, called Ground Truth images in this context. As an experimenter, I gathered photos with a free license from Unsplash and used them as regularization images alongside images generated by various checkpoints, including SDXL. In fact, the model trained with real photos from Unsplash subjectively won the comparison. I recommend it!

As for the checkpoint... You might think that it's best to train the model and generate images on the same checkpoint. Unfortunately, this is almost never the case. You might also think that using the base checkpoint - SD1.5 - for training is the best option. While this works quite well for SDXL, the results with SD1.5 are too ambiguous. Even though you get the juiciest colors and the highest compatibility with all other checkpoints, the resemblance to the character is only superficial. The result looks like we made a bad twin. I'll even show you!

It seems like there's a superficial resemblance, with the hair and all, but in reality, it turns out to be a different person.

The best results are actually obtained with other checkpoints. One of them is Realistic Vision 2.0. Specifically, 2.0, not some later version. The image on the left is created using this model. While I can at least speculate on why it yields such good results, I have almost no guesses about the second checkpoint, 2.8D STABLE BEST VERSION. In fact, it's some unknown checkpoint that I added to the comparison almost by chance, yet it outperformed almost every other one. Once again, I'll add a few examples! It's always on the far right.

It didn't yield perfect results everywhere, but in the comparison, it looks almost flawless. Initially, the comparison consisted of almost 20 checkpoints, so I had to put in some effort.

So, make sure to use regularization images! And make sure to use real photos as regularization images! For peace of mind, it's better to crop them to a 1:1 aspect ratio too. Although this doubles the model training time, you still save time in the end: instead of dealing with 95% garbage, you'll only have to filter out 75%. The number of high-quality results increases significantly!

As for the checkpoint, I recommend training models on each of them. In this way, with these two models, you can generate images on almost all existing checkpoints. Models trained on Realistic Vision 2.0 are more compatible with checkpoints that also generate realism, while models trained on 2.8D STABLE BEST VERSION are better suited for graphic checkpoints. However, some are compatible with both types, and some are compatible in the opposite way. In general, there's no clear logic here, so train on each of them!

#UPD: during additional experiments with different datasets, I repeatedly had situations where a model on 2.8D STABLE BEST VERSION was incompatible with almost all checkpoints, while a model on Realistic Vision 2.0 was compatible with almost all checkpoints. However, there were still exceptions. In general, I advise you to start with the model on Realistic Vision 2.0 and then look at the situation from there.

LoRA Network Weights

If your model ended up undertrained or if training was interrupted for some reason, you can specify a model that will serve as a starting point for the current training. By the way, training a model for 10 epochs and training for 5 epochs and then continuing for another 5 epochs is almost never the same thing.

Batch Size

This parameter determines the number of images processed simultaneously, which reduces the total number of training steps by a factor of the Batch Size and shortens the overall training time. It is desirable that it divides the number of dataset images or Repeats evenly, so each epoch splits into whole batches.
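To make the arithmetic concrete, here's a tiny sketch (the numbers are just an example) of how Batch Size groups steps and why an even division avoids a leftover partial batch:

```python
import math

image_count, repeats, batch_size = 15, 22, 3

epoch_images = image_count * repeats                    # 330 images shown per epoch
steps_per_epoch = math.ceil(epoch_images / batch_size)  # 110 optimizer steps

# if batch_size divides epoch_images evenly, there is no smaller leftover batch
print(epoch_images % batch_size == 0)  # True
```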

Here's how changing it affects training time and VRAM consumption:

  • 1 - 26 min, 8.2 Gb;

  • 3 - 17 min, 9.2 Gb;

  • 11 - 15 min, 13.3 Gb.

Increasing it by 1 typically increases VRAM consumption by approximately 0.5 Gb. However, there's no point in increasing it to the maximum unless you reach the VRAM limit. There comes a point where training time no longer decreases, but VRAM consumption keeps increasing. If you have a lot of VRAM, this point may be reached relatively early.

There are opinions that using a value greater than 1 can lead to somewhat worse results. In certain cases, this may be true, but in our case, the smart optimizer adjusts the DLR (dynamic learning rate) in such a way that the model consistently learns successfully, so the difference in result quality is minimal.

In summary, it's advisable to start with a value of 1 and gradually increase it, finding the point at which the training speed gains vanish. Most likely, this value will fall within the range of 2 to 5. I am using 3.

Save Every N Epochs | Save Every N Steps

By default, only the fully trained model is saved. However, with these parameters, you can save the model every N epochs or steps. This is convenient because you can then choose the best model out of several. If the model overtrains, you can revert to an earlier one. I recommend saving the model at least every 10% of training.

Epoch | Total Steps | Repeats

The number of epochs, total number of steps, and the number of repeats are all considered together.

Let me remind you of the formulas.

$Epoch\ Steps = Image\ Count * Repeats$

$Total\ Steps = Epoch\ Count * Epoch\ Steps$

Also, the formula formally includes Batch Size, but it doesn't change the total number of steps; it simply groups multiple steps into one. So it's not necessary to consider it in further calculations.

However, you need to take into account the presence of regularization images, which, as I mentioned earlier, should definitely be used! The required number of regularization images is equal to Epoch Steps. When using regularization images, each dataset image is effectively studied Repeats times during one epoch (let's call these training steps). Along with each training step, one regularization image is studied (let's call these regularization steps). This means that the number of steps in an epoch and, consequently, the total number of steps, double, which will affect further calculations. So, you need a significant number of training steps to ensure an adequate use of regularization images. The more, the better, within reasonable limits.

I have collected 330 of these images, so this is our imaginary limit for training steps. In total, we get 660 steps in an epoch, which is objectively a lot, but with the use of regularization, it's still a compromise. In other words, we have determined one unknown: choose Repeats so that one epoch covers as many of the 330 regularization images as possible, i.e., Repeats = math.floor(330 / Image Count).

So, essentially, we have one variable - the number of epochs. Determining the ideal number of epochs is quite problematic, so I always recommend setting it with a margin and more than seems enough.

Here are some rough examples (a sketch reproducing these calculations follows the list):

  • 10 images in the dataset, 33 repeats, 6 epochs - 3960 steps;
  • 15 images in the dataset, 22 repeats, 8 epochs - 5280 steps;
  • 30 images in the dataset, 11 repeats, 10 epochs - 6600 steps;
  • 50 images in the dataset, 6 repeats, 14 epochs - 8400 steps.
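Here's a small sketch that reproduces these numbers from the rules above: 330 regularization images, Repeats = floor(330 / Image Count), and the step count doubled by regularization.

```python
import math

REG_IMAGES = 330  # regularization photos I collected

def plan(image_count: int, epochs: int) -> tuple[int, int]:
    """Pick Repeats to cover as many regularization images per epoch as possible."""
    repeats = math.floor(REG_IMAGES / image_count)
    training_steps = image_count * repeats     # training steps per epoch
    total_steps = epochs * training_steps * 2  # x2: one regularization step per training step
    return repeats, total_steps

for images, epochs in [(10, 6), (15, 8), (30, 10), (50, 14)]:
    repeats, total = plan(images, epochs)
    print(f"{images} images, {repeats} repeats, {epochs} epochs -> {total} steps")
```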

In general, the idea is that as the dataset grows, you need to increase the total number of steps. Notice that the number of training steps per epoch is effectively constant: by changing the number of dataset images, we change the Repeats for each of them. As a result, with a large enough dataset, each image may be repeated too few times, which can either help prevent overtraining or hurt by leaving the model undertrained. With a small dataset, each image may be repeated too many times, which can lead to the model learning quickly or overtraining, where these images become baked into the model, causing it to reproduce only them.

There is always a significant risk that the model will be undertrained or overtrained, and this effect can manifest quite randomly. In any case, neither an undertrained model nor an overtrained model is a disaster. An undertrained model can be trained further, and for an overtrained model, you can choose an earlier epoch. I would even say that for any model, you should select the best epochs, and I will explain how to do this in the image generation section.

Caption Extension

For each image in the dataset, you can prepare a file with keywords. In theory, it should help the model understand what to learn and what to ignore. In practice, it makes sense only if you want to train the model for a specific style or train several different characters simultaneously. In the case of a single character, the model generally performs well without it.

By default, the file extension for these keyword files is .caption. If you don't have such files next to the images in the dataset, nothing will be used.

Mixed Precision | Save Precision

These two parameters are responsible for the floating-point number format being used. You can choose different formats for model training and for model saving.

Floating-point numbers consist of three parts: sign, exponent, and mantissa. Depending on the number format, a different number of bits is allocated for the exponent and mantissa:

  • fp32 - 8E, 23M;

  • fp16 - 5E, 10M;

  • bf16 - 8E, 7M.
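You can see the practical consequence of this exponent/mantissa split with a couple of lines of PyTorch: bf16 keeps fp32's range but gives up precision, while fp16 does the opposite.

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "eps:", info.eps)

print(torch.tensor(1e10, dtype=torch.float16))   # inf: overflows the 5-bit exponent
print(torch.tensor(1e10, dtype=torch.bfloat16))  # ~1e10: the 8-bit exponent copes
```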

When selecting the bf16 or fp16 format, the model is trained with a mixture of 32-bit and 16-bit data. However, you can train the model exclusively with 16-bit data by enabling the Full fp16 Training or Full bf16 Training setting, depending on the chosen format.

Based on the results, there's absolutely no sense in using fp32: it significantly slows down training, increases VRAM consumption, and doesn't seem to improve results in any noticeable way. Regarding the remaining two options, bf16 is not supported on all graphics cards. If your graphics card supports it and there are no errors during training, I recommend using it. If not, use fp16, as the difference is almost negligible. In the case of bf16, you can enable Full bf16 Training to slightly reduce VRAM consumption, but it's safer to skip this if you have enough VRAM. Interestingly, the similar setting for fp16 doesn't seem to work, completely breaking model training.

Seed

The seed lets you pin down certain random processes during training, but not all of them. If no training settings have changed and the seed remains the same, retraining will give you a copy of the previously trained model. However, in some cases, it may not. If you don't specify the seed or specify a different one than before, the model will definitely be different. In general, it's essential to use the same seed if you are going to compare multiple models. Otherwise, specifying it is optional.

Cache Latents to Disk

Before training begins, all the training images, including regularization images, go through a preparation stage, which takes some time. To save time in subsequent training runs that use the same images, you can cache the prepared data on your disk. I recommend doing this.

LR Scheduler

The Scheduler defines the function by which the LR changes during the training process. Here's how it can look.

Here you can observe the functions cosine, constant, and polynomial. I didn't include any more sophisticated ones in the comparison. In general, cosine and polynomial turned out to be almost identical, although the latter can be configured to be non-linear.
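For reference, here's a rough sketch of the three multiplier curves (my approximations of the standard schedules, not kohya's exact code):

```python
import math

def lr_multiplier(progress: float, kind: str, power: float = 1.0) -> float:
    """LR multiplier as a function of training progress in [0, 1]."""
    if kind == "constant":
        return 1.0
    if kind == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    if kind == "polynomial":  # linear when power == 1.0
        return (1.0 - progress) ** power
    raise ValueError(kind)

for kind in ("constant", "cosine", "polynomial"):
    print(kind, [round(lr_multiplier(p / 4, kind), 2) for p in range(5)])
```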

If you look at the graph of DLR(step), you can notice that DLR gradually increases at first and then follows the function defined by the Scheduler.

And right here, we can see a seemingly important advantage of constant: the model learns with the same strength throughout almost the entire training, whereas with the other two functions, training gradually fades towards the end. One might think this would produce a better outcome, but in practice, it turns out differently. Besides, we make our optimizer less smart, essentially turning it from adaptive into non-adaptive. In general, cosine provides more stable, higher-quality results.

Optimizer

Although we have already discussed this choice, I still made a small comparison of the optimizer DAdaptAdam, which was initially chosen, with the optimizer Prodigy, which is considered its successor. I didn't see significant differences, so I still believe it's better to stick with the initial choice.

Optimizer Extra Arguments

Here we can specify additional optimizer parameters. You can check possible parameters, for example, in the optimizer documentation.

Firstly, there's the parameter decouple=True. Without it, the model simply doesn't train.

Secondly, there's the parameter use_bias_correction=True. Without it, the model can train but may produce artifacts like this.

Why they are named differently in the documentation is completely unclear ¯\\_(ツ)_/¯.

We can also specify the parameter weight_decay, which theoretically should help to avoid model overtraining but has a relatively weak impact in practice. Nonetheless, a value in the range of 0.01 to 0.60 doesn't hurt, and I'll use an intermediate value of 0.20.

Thus, we end up with the value decouple=True use_bias_correction=True weight_decay=0.20.
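As far as I can tell, Kohya forwards these as keyword arguments to the optimizer constructor, so the whole setup is roughly equivalent to the following sketch (not the actual training code; the tiny Linear layer just stands in for the LoRA network):

```python
import torch
import dadaptation  # pip install dadaptation

network = torch.nn.Linear(4, 4)  # stand-in for the LoRA network

optimizer = dadaptation.DAdaptAdam(
    network.parameters(),
    lr=1.0,                    # the adaptive optimizer expects LR = 1
    decouple=True,             # decoupled (AdamW-style) weight decay
    use_bias_correction=True,  # prevents the artifacts mentioned above
    weight_decay=0.20,
)
```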

Learning Rate | Text Encoder Learning Rate | UNet Learning Rate

We've already discussed what LR is. The documentation for the selected optimizer strongly recommends using a value of 1 until we encounter any instability. We didn't encounter any instability - we use 1. The same LR should be specified for both the Text Encoder and UNet, which are two blocks in the architecture of Stable Diffusion, the technical details of which we won't delve into.

LR Warmup

Although the documentation says this parameter only affects the constant_with_warmup scheduler, that's not the case. If you set it to N%, the Learning Rate will grow from 0 to 1 during the first N% of the training. With 10%, it looks like this.

We have a smart optimizer that handles everything related to LR on its own, so there's no need to interfere. We set it to 0.

Max Resolution

This is the maximum resolution of the images the model will be trained on. It is highly desirable that all the dataset images have a resolution higher than specified here.

Here are the resolutions I compared and how they affect training time and VRAM consumption:

  • 512x512 - 11 min, 8.6 Gb;

  • 768x768 - 17 min, 9.7 Gb;

  • 1024x1024 - 25 min, 10.9 Gb;

  • 1280x1280 - 39 min, 13.1 Gb.

Initially, SD1.5 was trained on 512x512 images. Many checkpoints based on it were trained on 768x768 images. So, these two resolutions provide the most stable results when training models. However, 768x768 significantly improves the quality and resemblance to the character compared to 512x512, so I recommend lowering the resolution only as a last resort. Resolutions higher than 768x768 do not provide a stable, corresponding increase in quality. While 1024x1024 might be a backup option when you want to experiment, it occasionally causes artifacts. At 1280x1280, the model loses resemblance to the character and generates artifacts on top of that. Also, increasing the resolution during training doesn't necessarily mean you can generate images at resolutions higher than 768x768. It depends on your luck.

In general, I do not recommend going significantly beyond 768x768. Reserve higher resolutions for SDXL.

Network Rank

This parameter determines how much information the model can memorize. Changing it does not significantly affect training time, but it has a significant impact on VRAM consumption and the model file size:

  • 32 - 7.7 Gb, 37 Mb;

  • 64 - 8.0 Gb, 74 Mb;

  • 128 - 8.6 Gb, 148 Mb;

  • 192 - 9.5 Gb, 221 Mb;

  • 256 - 10.0 Gb, 295 Mb;

  • 512 - 12.5 Gb, 590 Mb.

Increasing NR by 32 leads to an increase in VRAM by approximately 300 Mb and an increase in the model file size by 37 Mb.

Since this parameter often leads to debates, I will provide graphs for reference.

As you can see, the logic here is generally simple: the more the model can memorize, the harder it learns. The optimizer comes to the rescue again.

This leads to practically identical similarity graphs.

However, in practice, with NR equal to 32 and 64, the model, in my opinion, loses similarity to the character, but in the other cases, I didn't see a significant difference. Values in the range from 128 to 256 seem to be the most justified.

Network Alpha

This parameter determines how easily the model memorizes information. However, this value only makes sense in relation to NR: the strength of the model's memorization scales by a factor of NA/NR.
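This is easiest to see in the standard LoRA formulation, where the low-rank update is scaled by exactly this factor (a generic sketch of the technique, not kohya's code; the dimensions are arbitrary):

```python
import torch

def lora_forward(x, W, A, B, alpha: float, rank: int):
    """Frozen weight W plus a trainable low-rank update scaled by alpha/rank."""
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)

rank, alpha = 128, 1.0
W = torch.randn(768, 768)          # frozen base weight
A = torch.randn(rank, 768) * 0.01  # trainable down-projection
B = torch.zeros(768, rank)         # trainable up-projection, starts at zero
y = lora_forward(torch.randn(1, 768), W, A, B, alpha, rank)
```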

However, in practice, this change is effectively neutralized by our smart optimizer, adjusting DLR. As a result, it turns out that the worse the model memorizes, the harder it learns.

Therefore, the difference between possible values turns out to be minimal, both in theory and in practice. Due to this, it is simpler to use the standard value of 1.

Clip Skip

This parameter determines from which layer of the CLIP model the vectors will be sent to the U-Net, counting from the end. You can set it both during model training and image generation. There are a total of 12 such layers, and by default, vectors are sent from the last layer, which corresponds to the standard value of 1. However, in a once leaked Novel AI checkpoint, they were sent from the second-to-last layer. Due to the merging of this checkpoint with many others, the non-standard value of 2 became widely used.

This is what guides, documentation, and Google say. However, even if we try to change this parameter on the base SD1.5 checkpoint, it still affects the generation results, even though it shouldn't.

In a direct comparison, it was found that the difference between its combinations is not so significant. However, personally, I feel more comfortable using 2 both during model training and image generation.
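For reference, here's roughly what Clip Skip = 2 amounts to with the transformers library (a sketch of the common implementation, not the exact generation code; the model name is just an example):

```python
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("a photo of a woman", return_tensors="pt")
out = text_encoder(**tokens, output_hidden_states=True)

clip_skip = 2
hidden = out.hidden_states[-clip_skip]  # second-to-last layer instead of the last
hidden = text_encoder.text_model.final_layer_norm(hidden)  # implementations usually re-apply the final norm
```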

Gradient Checkpointing

By default, all intermediate activations are kept in VRAM during training. With this parameter turned on, only some of them are stored, and the rest are recomputed on the fly during the backward pass. It actually saves A LOT OF VRAM. And it doesn't affect result quality. I don't understand why it is disabled by default.
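If you're curious, this is the same mechanism as torch.utils.checkpoint; a toy sketch:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())

    def forward(self, x, use_checkpointing: bool = False):
        if use_checkpointing:
            # activations inside self.net are recomputed during backward, not stored
            return checkpoint(self.net, x, use_reentrant=False)
        return self.net(x)
```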

CrossAttention

Here we can set optimization algorithms.

xFormers is a library developed by the Meta AI team. It speeds up training and reduces memory usage by implementing memory-efficient attention and Flash Attention. SDPA stands for Scaled Dot Product Attention; it's a native PyTorch implementation of the same algorithms.
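In PyTorch 2.x, SDPA is literally one function call; it dispatches to a Flash Attention or memory-efficient kernel when one is available (the tensor shapes below are just an example):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 77, 64)  # (batch, heads, tokens, head_dim)
k = torch.randn(1, 8, 77, 64)
v = torch.randn(1, 8, 77, 64)

out = F.scaled_dot_product_attention(q, k, v)  # picks the best available kernel
```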

Here's how these optimizations impact training on my RTX 4080 PC with the 7 Gb model 2.8D STABLE BEST VERSION:

  • xFormers + Gradient Checkpointing - 15 min, 8.3 Gb;

  • SDPA + Gradient Checkpointing - 23 min, 8.7 Gb;

  • xFormers - 33 min, 15.6 Gb;

  • Gradient Checkpointing - Out of Memory Error.

The winner is pretty obvious.

Some people say that these cross-attention optimizations have a negative impact on result quality. As you've seen, we got some pretty good results using xFormers. And in the case of large models, you don't really have an alternative.

Min SNR Gamma

This is a training technique designed to make the training process more stable.

It was difficult to choose the best value visually, so I suggest relying on Kohya's GUI documentation and using the recommended value of 5. The graphs for this value also looked the best.
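My understanding of the weighting (from the Min-SNR paper, for epsilon prediction) is that the per-timestep loss is scaled by min(SNR, gamma) / SNR, which damps the easy high-SNR timesteps; a sketch:

```python
import torch

def min_snr_weight(snr: torch.Tensor, gamma: float = 5.0) -> torch.Tensor:
    """Per-timestep loss weight: min(SNR, gamma) / SNR."""
    return torch.minimum(snr, torch.full_like(snr, gamma)) / snr

snr = torch.tensor([0.1, 1.0, 5.0, 50.0])
print(min_snr_weight(snr))  # tensor([1.0, 1.0, 1.0, 0.1])
```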

Don't Upscale Bucket Resolution

A bucket is a group of images with similar resolution. If you are training a model at a resolution of 512x512 and Bucket Resolution Steps is set to 64, the buckets will be 512, 448, 384, and so on pixels. Buckets are separate for vertical and horizontal dimensions. An image with a side of 500 pixels will be placed in the 448 bucket, with the extra 52 pixels being cropped. If you turn off this setting, the image will instead be upscaled to 512 pixels. If you have to choose the lesser of two evils, it's better to upscale than to crop. But it's even better to collect images for your dataset at a resolution higher than what you intend to use for training.
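A simplified sketch of the rounding described above (per axis; this is my approximation, not kohya's actual bucketing code):

```python
def bucket_side(side: int, max_side: int = 512, step: int = 64,
                upscale: bool = True) -> int:
    """Pick the bucket one image side falls into."""
    if side >= max_side:
        return max_side                 # large images are scaled down to fit
    if upscale:
        return -(-side // step) * step  # round up: 500 -> 512, image is upscaled
    return (side // step) * step        # round down: 500 -> 448, 52 px are cropped

print(bucket_side(500, upscale=False))  # 448
print(bucket_side(500, upscale=True))   # 512
```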

Working with these buckets can be quite challenging. That's why it's generally better to crop all images to a 1:1 aspect ratio, choose a Batch Size that divides the number of dataset images or Repeats evenly, and gather high-resolution images for your dataset.

Noise Offset

This parameter allows you to add additional noise to the images during training, which theoretically can result in brighter and more vibrant colors in the generated outputs.
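The commonly used implementation adds a small per-channel constant on top of the usual Gaussian noise; a sketch of that technique:

```python
import torch

def offset_noise(latents: torch.Tensor, noise_offset: float = 0.0357) -> torch.Tensor:
    """Base Gaussian noise plus a per-(sample, channel) constant shift."""
    noise = torch.randn_like(latents)
    if noise_offset > 0:
        # one random value per (sample, channel), broadcast over height and width
        noise += noise_offset * torch.randn(
            latents.shape[0], latents.shape[1], 1, 1, device=latents.device
        )
    return noise
```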

The more noise we add, the harder the model learns.

However, even this may not save you from issues at high noise levels.

Here's how these issues look directly in the results.

In general, adding a bit of noise does make the colors slightly brighter and more interesting. But it seems like it's not the most critical problem during training. SDXL has been trained with a value of 0.0357, which seems like a reasonable compromise, so I suggest using it, even though it wasn't directly compared.

Samples

During training, you can create sample images using the trained model and the checkpoint it's currently training on. This can help you monitor the training process and detect any issues (such as model overtraining), allowing you to stop training if necessary. Generating samples slightly slows down training because it takes some time to generate images, but overall, it's not a significant slowdown.

You can generate samples, as well as save models, every N epochs or steps. I recommend generating samples at least every 10% of training.

For generating samples, your prompt should include a context that is not present in the dataset. For example, here's a prompt I use: masterpiece, best quality, (<instance prompt> <class prompt>:1.2), green t-shirt, beach, ocean, upper body, looking at viewer --n low quality, worst quality, bad anatomy, bad composition, poor, low effort --w 768 --h 768 --l 3.5 --s 50 --d 1. Don't forget to replace the placeholders.


Next - Model Training ‐ Comparison - Final
