Diffusion Models Overview - Nerogar/OneTrainer GitHub Wiki
Model-Specific Technical Information for Diffusion Models in OneTrainer
This wiki page provides detailed technical information about the diffusion models supported by OneTrainer, focusing on SD1.5, SDXL, and Flux. The information is aimed at advanced users.
SD1.5
Base Architecture
SD1.5 utilizes a UNet architecture with an encoder-decoder structure, based on a hierarchy of denoising autoencoders.
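For orientation, here is a minimal sketch of how the encoder (down), bottleneck (mid), and decoder (up) parts of that UNet can be inspected with diffusers; the repo id is an illustrative assumption, not something defined by OneTrainer.

```python
# Sketch: inspect the encoder/bottleneck/decoder split of the SD1.5 UNet.
# The repo id is illustrative; any SD1.5 checkpoint in diffusers layout works.
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
print(len(unet.down_blocks), "down blocks (encoder path)")
print(type(unet.mid_block).__name__, "(bottleneck)")
print(len(unet.up_blocks), "up blocks (decoder path)")
```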
Training Resolution
The final training resolution was effectively 512x512.
Tokenization and Max Tokens
- Uses the CLIP tokenizer
- Max tokens per caption in OneTrainer: 75 (see the sketch below)
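A rough sketch of where the 75-token cap comes from, assuming the CLIP tokenizer shipped with an SD1.5 checkpoint in diffusers layout (the repo id is illustrative): the tokenizer's 77 slots consist of the start token, up to 75 caption tokens, and the end token.

```python
# Sketch: the CLIP tokenizer reserves 77 slots = BOS + up to 75 caption tokens + EOS.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="tokenizer"  # illustrative repo id
)
caption = "a detailed oil painting of a lighthouse at sunset"
ids = tokenizer(
    caption, padding="max_length", max_length=77, truncation=True
).input_ids
print(len(ids))  # 77
```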
LoRA Full Set of Blocks / Layer Keys
A working example of a custom layer set for SD1.5 LoRA training is:
down_blocks.1.attentions.0,down_blocks.1.attentions.1,down_blocks.2.attentions.0,down_blocks.2.attentions.1,mid_block.attentions.0
The complete set of blocks for SD1.5 can be referenced here or here.
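As a sketch (not OneTrainer's internal code), the comma-separated preset above can be treated as a set of module-name prefixes and matched against the UNet; the repo id is an illustrative assumption.

```python
# Sketch: treat the comma-separated preset as module-name prefixes and list
# the UNet modules they select. Not OneTrainer's internal implementation.
from diffusers import UNet2DConditionModel

preset = (
    "down_blocks.1.attentions.0,down_blocks.1.attentions.1,"
    "down_blocks.2.attentions.0,down_blocks.2.attentions.1,mid_block.attentions.0"
)
prefixes = preset.split(",")

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"  # illustrative repo id
)
matched = [
    name
    for name, _ in unet.named_modules()
    if any(name == p or name.startswith(p + ".") for p in prefixes)
]
print(f"{len(matched)} modules fall under the selected blocks")
```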
VAE Compression
- Compression factor: 8x8 (8 times per dimension)
- VAE trained on 256px x 256px resolution
- Number of channels: 4
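A minimal sketch of the 8x spatial compression, assuming the diffusers AutoencoderKL from an SD1.5 checkpoint (repo id illustrative): a 512x512 RGB image becomes a 4-channel 64x64 latent.

```python
# Sketch: 512x512 RGB -> 4 x 64x64 latent (512 / 8 = 64).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"  # illustrative repo id
)
image = torch.randn(1, 3, 512, 512)  # stand-in for an RGB image scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample()
print(latents.shape)  # torch.Size([1, 4, 64, 64])
```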
Paper: https://arxiv.org/pdf/2112.10752
Stable Diffusion XL (SDXL)
Base Architecture
SDXL uses an enhanced UNet architecture that is significantly larger than SD1.5's.
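One rough way to quantify "significantly larger" is to count UNet parameters with diffusers; the repo ids are illustrative assumptions, and the counts come out to roughly 0.86B for SD1.5 versus roughly 2.6B for SDXL.

```python
# Sketch: compare UNet parameter counts between SD1.5 and SDXL.
from diffusers import UNet2DConditionModel

def unet_params(repo_id: str) -> float:
    unet = UNet2DConditionModel.from_pretrained(repo_id, subfolder="unet")
    return sum(p.numel() for p in unet.parameters()) / 1e9

print(unet_params("runwayml/stable-diffusion-v1-5"))            # ~0.86 B parameters
print(unet_params("stabilityai/stable-diffusion-xl-base-1.0"))  # ~2.6 B parameters
```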
Training Resolution
SDXL is trained at higher resolutions, effectively 1024x1024.
Tokenization and Max Tokens
- Uses two CLIP text encoders (CLIP ViT-L & OpenCLIP ViT-bigG)
- Max tokens per caption in OneTrainer: 75, same as SD1.5 (see the sketch below)
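A sketch of the dual tokenization, assuming the diffusers layout of the SDXL base checkpoint (the repo id is an assumption): every caption is tokenized once per text encoder, each with the same 77-slot window (75 caption tokens plus start/end tokens).

```python
# Sketch: SDXL tokenizes every caption twice, once per text encoder.
from transformers import CLIPTokenizer

repo = "stabilityai/stable-diffusion-xl-base-1.0"  # illustrative repo id
tok_l = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")    # CLIP ViT-L
tok_g = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer_2")  # OpenCLIP ViT-bigG

prompt = "a photo of an astronaut riding a horse"
ids_l = tok_l(prompt, padding="max_length", max_length=77, truncation=True).input_ids
ids_g = tok_g(prompt, padding="max_length", max_length=77, truncation=True).input_ids
print(len(ids_l), len(ids_g))  # 77, 77
```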
VAE Compression
- Compression factor: 8x8 (8 times per dimension)
- VAE trained on 256px x 256px resolution
- Uses the same VAE architecture as SD1.5, but trained with a larger batch size and EMA enabled
Paper: https://arxiv.org/pdf/2307.01952
FLUX
Placeholder. Little public information is available, and no paper has been published.
Training Resolution
Unknown, but at least the same as or higher than SDXL.
Tokenization and Max Tokens
- Same as SDXL: a maximum of 75 tokens in OneTrainer; anything longer is truncated.
- Uses CLIP ViT-L/14 and a T5 v1.1 text encoder (see the sketch below)
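A hedged sketch of the dual tokenization, assuming a Flux checkpoint in diffusers layout (the repo id is an assumption and the official repositories are gated):

```python
# Sketch: Flux pairs a CLIP tokenizer with a T5 tokenizer.
from transformers import CLIPTokenizer, T5TokenizerFast

repo = "black-forest-labs/FLUX.1-dev"  # illustrative, gated repo id
tok_clip = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
tok_t5 = T5TokenizerFast.from_pretrained(repo, subfolder="tokenizer_2")

prompt = "a watercolor sketch of a red fox in the snow"
clip_ids = tok_clip(prompt, max_length=77, truncation=True).input_ids
t5_ids = tok_t5(prompt, max_length=77, truncation=True).input_ids  # truncated to mirror the cap above
print(len(clip_ids), len(t5_ids))
```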
LoRA Full Set of Blocks / Layer Keys
Flux uses the following LoRA layers:
[
"down_blocks.0.attentions.0.transformer_blocks.0.attn1",
"down_blocks.0.attentions.0.transformer_blocks.0.attn2",
"down_blocks.0.attentions.1.transformer_blocks.0.attn1",
"down_blocks.0.attentions.1.transformer_blocks.0.attn2",
"down_blocks.1.attentions.0.transformer_blocks.0.attn1",
"down_blocks.1.attentions.0.transformer_blocks.0.attn2",
"down_blocks.1.attentions.1.transformer_blocks.0.attn1",
"down_blocks.1.attentions.1.transformer_blocks.0.attn2",
"down_blocks.2.attentions.0.transformer_blocks.0.attn1",
"down_blocks.2.attentions.0.transformer_blocks.0.attn2",
"down_blocks.2.attentions.1.transformer_blocks.0.attn1",
"down_blocks.2.attentions.1.transformer_blocks.0.attn2",
"up_blocks.1.attentions.0.transformer_blocks.0.attn1",
"up_blocks.1.attentions.0.transformer_blocks.0.attn2",
"up_blocks.1.attentions.1.transformer_blocks.0.attn1",
"up_blocks.1.attentions.1.transformer_blocks.0.attn2",
"up_blocks.1.attentions.2.transformer_blocks.0.attn1",
"up_blocks.1.attentions.2.transformer_blocks.0.attn2",
"up_blocks.2.attentions.0.transformer_blocks.0.attn1",
"up_blocks.2.attentions.0.transformer_blocks.0.attn2",
"up_blocks.2.attentions.1.transformer_blocks.0.attn1",
"up_blocks.2.attentions.1.transformer_blocks.0.attn2",
"up_blocks.2.attentions.2.transformer_blocks.0.attn1",
"up_blocks.2.attentions.2.transformer_blocks.0.attn2",
"up_blocks.3.attentions.0.transformer_blocks.0.attn1",
"up_blocks.3.attentions.0.transformer_blocks.0.attn2",
"up_blocks.3.attentions.1.transformer_blocks.0.attn1",
"up_blocks.3.attentions.1.transformer_blocks.0.attn2",
"up_blocks.3.attentions.2.transformer_blocks.0.attn1",
"up_blocks.3.attentions.2.transformer_blocks.0.attn2",
"mid_block.attentions.0.transformer_blocks.0.attn1",
"mid_block.attentions.0.transformer_blocks.0.attn2"
]
The full set of layers can be seen here.
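As an illustrative sketch (not OneTrainer's internal code), a layer-key list like the one above can be treated as a set of prefixes under which the linear sub-layers, the places where LoRA adapters actually attach, are collected:

```python
# Sketch: collect the nn.Linear sub-layers underneath a set of layer keys.
# Generic helper for illustration; not OneTrainer's internal implementation.
import torch.nn as nn

def lora_target_linears(model: nn.Module, layer_keys: list[str]) -> list[str]:
    targets = []
    for name, module in model.named_modules():
        under_key = any(name == k or name.startswith(k + ".") for k in layer_keys)
        if under_key and isinstance(module, nn.Linear):
            targets.append(name)
    return targets

# Usage: lora_target_linears(model, ["mid_block.attentions.0.transformer_blocks.0.attn1", ...])
```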