Diffusion Models Overview - Nerogar/OneTrainer GitHub Wiki
Model-Specific Technical Information for Diffusion Models in OneTrainer
This wiki page provides detailed technical information about the diffusion models supported by OneTrainer, focusing on SD1.5, SDXL, and Flux. The information is aimed at advanced users.
SD1.5
Base Architecture
SD1.5 utilizes a UNet architecture with an encoder-decoder structure, based on a hierarchy of denoising autoencoders.
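For orientation, here is a minimal sketch of how the encoder (down), bottleneck (mid), and decoder (up) parts of that UNet can be inspected with diffusers; the repo id is an illustrative assumption, not something defined by OneTrainer.

```python
# Sketch: inspect the encoder/bottleneck/decoder split of the SD1.5 UNet.
# The repo id is illustrative; any SD1.5 checkpoint in diffusers layout works.
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
print(len(unet.down_blocks), "down blocks (encoder path)")
print(type(unet.mid_block).__name__, "(bottleneck)")
print(len(unet.up_blocks), "up blocks (decoder path)")
```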
Training Resolution
The final training resolution was effectively 512x512.
Tokenization and Max Tokens
- Uses the CLIP tokenizer
- Max tokens per caption in OneTrainer: 75 (see the sketch below)
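A rough sketch of where the 75-token cap comes from, assuming the CLIP tokenizer shipped with an SD1.5 checkpoint in diffusers layout (the repo id is illustrative): the tokenizer's 77 slots consist of the start token, up to 75 caption tokens, and the end token.

```python
# Sketch: the CLIP tokenizer reserves 77 slots = BOS + up to 75 caption tokens + EOS.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="tokenizer"  # illustrative repo id
)
caption = "a detailed oil painting of a lighthouse at sunset"
ids = tokenizer(
    caption, padding="max_length", max_length=77, truncation=True
).input_ids
print(len(ids))  # 77
```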
LoRA Full Set of Blocks / Layer Keys
A working example of a custom layer set for SD1.5 LoRA training is:
down_blocks.1.attentions.0,down_blocks.1.attentions.1,down_blocks.2.attentions.0,down_blocks.2.attentions.1,mid_block.attentions.0
The complete set of blocks for SD1.5 can be referenced here or here.
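As a sketch (not OneTrainer's internal code), the comma-separated preset above can be treated as a set of module-name prefixes and matched against the UNet; the repo id is an illustrative assumption.

```python
# Sketch: treat the comma-separated preset as module-name prefixes and list
# the UNet modules they select. Not OneTrainer's internal implementation.
from diffusers import UNet2DConditionModel

preset = (
    "down_blocks.1.attentions.0,down_blocks.1.attentions.1,"
    "down_blocks.2.attentions.0,down_blocks.2.attentions.1,mid_block.attentions.0"
)
prefixes = preset.split(",")

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"  # illustrative repo id
)
matched = [
    name
    for name, _ in unet.named_modules()
    if any(name == p or name.startswith(p + ".") for p in prefixes)
]
print(f"{len(matched)} modules fall under the selected blocks")
```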
VAE Compression
- Compression factor: 8x8 (8 times per dimension)
- VAE trained on 256px x 256px resolution
- Number of channels: 4
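A minimal sketch of the 8x spatial compression, assuming the diffusers AutoencoderKL from an SD1.5 checkpoint (repo id illustrative): a 512x512 RGB image becomes a 4-channel 64x64 latent.

```python
# Sketch: 512x512 RGB -> 4 x 64x64 latent (512 / 8 = 64).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"  # illustrative repo id
)
image = torch.randn(1, 3, 512, 512)  # stand-in for an RGB image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
print(latents.shape)  # torch.Size([1, 4, 64, 64])
```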
Paper: https://arxiv.org/pdf/2112.10752
Stable Diffusion XL (SDXL)
Base Architecture
SDXL uses an enhanced UNet architecture that is significantly larger than SD1.5's.
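One rough way to quantify "significantly larger" is to count UNet parameters with diffusers; the repo ids are illustrative assumptions, and the counts come out to roughly 0.86B for SD1.5 versus roughly 2.6B for SDXL.

```python
# Sketch: compare UNet parameter counts between SD1.5 and SDXL.
from diffusers import UNet2DConditionModel

def unet_params(repo_id: str) -> float:
    unet = UNet2DConditionModel.from_pretrained(repo_id, subfolder="unet")
    return sum(p.numel() for p in unet.parameters()) / 1e9

print(unet_params("runwayml/stable-diffusion-v1-5"))            # ~0.86 B parameters
print(unet_params("stabilityai/stable-diffusion-xl-base-1.0"))  # ~2.6 B parameters
```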
Training Resolution
SDXL is trained at higher resolutions, effectively 1024x1024.
Tokenization and Max Tokens
- Uses two CLIP text encoders (CLIP ViT-L & OpenCLIP ViT-bigG)
- Max tokens per caption in OneTrainer: 75, same as SD1.5 (see the sketch below)
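A sketch of the dual tokenization, assuming the diffusers layout of the SDXL base checkpoint (the repo id is an assumption): every caption is tokenized once per text encoder, each with the same 77-slot window (75 caption tokens plus start/end tokens).

```python
# Sketch: SDXL tokenizes every caption twice, once per text encoder.
from transformers import CLIPTokenizer

repo = "stabilityai/stable-diffusion-xl-base-1.0"  # illustrative repo id
tok_l = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")    # CLIP ViT-L
tok_g = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer_2")  # OpenCLIP ViT-bigG

prompt = "a photo of an astronaut riding a horse"
ids_l = tok_l(prompt, padding="max_length", max_length=77, truncation=True).input_ids
ids_g = tok_g(prompt, padding="max_length", max_length=77, truncation=True).input_ids
print(len(ids_l), len(ids_g))  # 77, 77
```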
VAE Compression
- Compression factor: 8x8 (8 times per dimension)
- VAE trained on 256px x 256px resolution
- Uses the same VAE architecture as SD1.5, but trained with a larger batch size and EMA enabled
Paper: https://arxiv.org/pdf/2307.01952
FLUX
Placeholder. Little public information is available, and no paper has been published.
Training Resolution
Unknown, but at least the same as or higher than SDXL.
Tokenization and Max Tokens
- Same as SDXL: a maximum of 75 tokens in OneTrainer; anything longer is truncated.
- Uses CLIP ViT-L/14 and a T5 v1.1 text encoder (see the sketch below)
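A hedged sketch of the dual tokenization, assuming a Flux checkpoint in diffusers layout (the repo id is an assumption and the official repositories are gated):

```python
# Sketch: Flux pairs a CLIP tokenizer with a T5 tokenizer.
from transformers import CLIPTokenizer, T5TokenizerFast

repo = "black-forest-labs/FLUX.1-dev"  # illustrative, gated repo id
tok_clip = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
tok_t5 = T5TokenizerFast.from_pretrained(repo, subfolder="tokenizer_2")

prompt = "a watercolor sketch of a red fox in the snow"
clip_ids = tok_clip(prompt, max_length=77, truncation=True).input_ids
t5_ids = tok_t5(prompt, max_length=77, truncation=True).input_ids  # truncated to mirror the cap above
print(len(clip_ids), len(t5_ids))
```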
LoRA Full Set of Blocks / Layer Keys
Flux uses the following LoRA layers:
[
"down_blocks.0.attentions.0.transformer_blocks.0.attn1",
"down_blocks.0.attentions.0.transformer_blocks.0.attn2",
"down_blocks.0.attentions.1.transformer_blocks.0.attn1",
"down_blocks.0.attentions.1.transformer_blocks.0.attn2",
"down_blocks.1.attentions.0.transformer_blocks.0.attn1",
"down_blocks.1.attentions.0.transformer_blocks.0.attn2",
"down_blocks.1.attentions.1.transformer_blocks.0.attn1",
"down_blocks.1.attentions.1.transformer_blocks.0.attn2",
"down_blocks.2.attentions.0.transformer_blocks.0.attn1",
"down_blocks.2.attentions.0.transformer_blocks.0.attn2",
"down_blocks.2.attentions.1.transformer_blocks.0.attn1",
"down_blocks.2.attentions.1.transformer_blocks.0.attn2",
"up_blocks.1.attentions.0.transformer_blocks.0.attn1",
"up_blocks.1.attentions.0.transformer_blocks.0.attn2",
"up_blocks.1.attentions.1.transformer_blocks.0.attn1",
"up_blocks.1.attentions.1.transformer_blocks.0.attn2",
"up_blocks.1.attentions.2.transformer_blocks.0.attn1",
"up_blocks.1.attentions.2.transformer_blocks.0.attn2",
"up_blocks.2.attentions.0.transformer_blocks.0.attn1",
"up_blocks.2.attentions.0.transformer_blocks.0.attn2",
"up_blocks.2.attentions.1.transformer_blocks.0.attn1",
"up_blocks.2.attentions.1.transformer_blocks.0.attn2",
"up_blocks.2.attentions.2.transformer_blocks.0.attn1",
"up_blocks.2.attentions.2.transformer_blocks.0.attn2",
"up_blocks.3.attentions.0.transformer_blocks.0.attn1",
"up_blocks.3.attentions.0.transformer_blocks.0.attn2",
"up_blocks.3.attentions.1.transformer_blocks.0.attn1",
"up_blocks.3.attentions.1.transformer_blocks.0.attn2",
"up_blocks.3.attentions.2.transformer_blocks.0.attn1",
"up_blocks.3.attentions.2.transformer_blocks.0.attn2",
"mid_block.attentions.0.transformer_blocks.0.attn1",
"mid_block.attentions.0.transformer_blocks.0.attn2"
]
The full set of layers can be seen here.
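As an illustrative sketch (not OneTrainer's internal code), a layer-key list like the one above can be treated as a set of prefixes under which the linear sub-layers, the places where LoRA adapters actually attach, are collected:

```python
# Sketch: collect the nn.Linear sub-layers underneath a set of layer keys.
# Generic helper for illustration; not OneTrainer's internal implementation.
import torch.nn as nn

def lora_target_linears(model: nn.Module, layer_keys: list[str]) -> list[str]:
    targets = []
    for name, module in model.named_modules():
        under_key = any(name == k or name.startswith(k + ".") for k in layer_keys)
        if under_key and isinstance(module, nn.Linear):
            targets.append(name)
    return targets

# Usage: lora_target_linears(model, ["mid_block.attentions.0.transformer_blocks.0.attn1", ...])
```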