Diffusion Models 2: Fine-Tuning and Guidance - SunXiaoXiang/Diffusers GitHub Wiki

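Fine-Tuning: retrain an existing model on a new dataset to change the kind of output it produces.

Guidance: steer the generation process of an existing model at inference time to gain additional control.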

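Create a sampling loop and use a scheduler to generate samples faster
Fine-tune an existing diffusion model on a new dataset, which includes:
- Using gradient accumulation to work around the problems caused by training batches that are too small
- Logging samples to Weights and Biases during training to monitor progress (via the accompanying example script)
- Saving the resulting pipeline and uploading it to the Hub
Guide the sampling process with additional loss functions to exert control over an existing model, which includes:
- Exploring different guidance approaches with a simple color-based loss
- Using CLIP to guide generation with a text prompt
- Sharing your custom sampling loop with Gradio and 🤗 Spaces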

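In the 🤗 Diffusers library, these sampling methods are handled by a scheduler, and each update is performed by its step() function.

To generate images, we start from random noise x.
At each timestep, we feed the noisy input x to the model and pass the model's prediction to step().
The output it returns is named prev_sample. It is "previous" because we are moving backwards in time, from high noise to low noise (the reverse of the forward diffusion process).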
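To make the roles of prev_sample and pred_original_sample more concrete, here is a minimal sketch of what one update computes. It assumes a deterministic DDIM-style rule (eta = 0); these notes do not say which scheduler is in use, and the function names `predict_x0` and `ddim_step` are illustrative, not part of the Diffusers API. `alpha_bar_t` stands for the cumulative product of alphas at timestep t.

```python
def predict_x0(x_t, noise_pred, alpha_bar_t):
    """Estimate the clean sample x0 from the noisy sample and the predicted noise."""
    return (x_t - (1 - alpha_bar_t) ** 0.5 * noise_pred) / alpha_bar_t ** 0.5


def ddim_step(x_t, noise_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic (eta = 0) DDIM-style update from timestep t to the previous one."""
    x0 = predict_x0(x_t, noise_pred, alpha_bar_t)
    # Re-noise the x0 estimate to the (lower) noise level of the previous timestep
    prev_sample = alpha_bar_prev ** 0.5 * x0 + (1 - alpha_bar_prev) ** 0.5 * noise_pred
    return prev_sample, x0


# Toy check on a single "pixel": build x_t from a known x0 and noise, then step back
true_x0, noise = 0.3, -1.2
alpha_bar_t, alpha_bar_prev = 0.5, 0.8  # Illustrative values, less noise at the previous step
x_t = alpha_bar_t ** 0.5 * true_x0 + (1 - alpha_bar_t) ** 0.5 * noise

prev_sample, pred_x0 = ddim_step(x_t, noise, alpha_bar_t, alpha_bar_prev)
print(pred_x0)      # Close to 0.3, because the "predicted" noise is exact here
print(prev_sample)  # The same x0 estimate, re-noised to the lower noise level
```

The functions only use arithmetic operators, so they work on plain floats as above and should work equally on tensors. A real model's noise prediction is imperfect, which is why the predicted x0 sharpens gradually over the loop.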

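import numpy as np
import torch
import torchvision
from matplotlib import pyplot as plt
from PIL import Image
from tqdm.auto import tqdm

# Assumes `device`, `image_pipe` (a loaded diffusion pipeline) and `scheduler`
# (with its timesteps already set) were defined earlier

# The random starting point
x = torch.randn(4, 3, 256, 256).to(device)  # Batch of 4, 3-channel 256 x 256 px images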

# Loop through the sampling timesteps
for i, t in tqdm(enumerate(scheduler.timesteps)):

    # Prepare model input
    model_input = scheduler.scale_model_input(x, t)

    # Get the prediction
    with torch.no_grad():
        noise_pred = image_pipe.unet(model_input, t)["sample"]

    # Calculate what the updated sample should look like with the scheduler
    scheduler_output = scheduler.step(noise_pred, t, x)

    # Update x
    x = scheduler_output.prev_sample

    # Occasionally display both x and the predicted denoised images
    if i % 10 == 0 or i == len(scheduler.timesteps) - 1:
        fig, axs = plt.subplots(1, 2, figsize=(12, 5))

        grid = torchvision.utils.make_grid(x, nrow=4).permute(1, 2, 0)
        axs[0].imshow(grid.cpu().clip(-1, 1) * 0.5 + 0.5)
        axs[0].set_title(f"Current x (step {i})")

        pred_x0 = (
            scheduler_output.pred_original_sample
        )  # Not available for all schedulers
        grid = torchvision.utils.make_grid(pred_x0, nrow=4).permute(1, 2, 0)
        axs[1].imshow(grid.cpu().clip(-1, 1) * 0.5 + 0.5)
        axs[1].set_title(f"Predicted denoised images (step {i})")
        plt.show()

from datasets import load_dataset
from torchvision import transforms

dataset_name = "huggan/anime-faces"
dataset = load_dataset(dataset_name, split="train")

image_size = 256  # @param
batch_size = 4  # @param

# Resize, augment with random horizontal flips, and map pixel values to (-1, 1)
preprocess = transforms.Compose(
    [
        transforms.Resize((image_size, image_size)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)


def transform(examples):
    # Applied on the fly each time examples are loaded from the dataset
    images = [preprocess(image.convert("RGB")) for image in examples["image"]]
    return {"images": images}


dataset.set_transform(transform)
train_dataloader = torch.utils.data.DataLoader(
    dataset, batch_size=batch_size, shuffle=True
)

print("Previewing batch:")
batch = next(iter(train_dataloader))
grid = torchvision.utils.make_grid(batch["images"], nrow=4)
plt.imshow(grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5);
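
With a batch size this small, gradient accumulation (mentioned in the overview) is one way to get a larger effective batch. The sketch below shows only the accumulation pattern. The tiny linear model, the random batches, and the name `grad_accumulation_steps` are hypothetical stand-ins for the UNet, the image batches, and whatever the real training script uses. Dividing the loss by the number of accumulation steps is a design choice made here so that the accumulated gradient is an average.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny linear model and random batches instead of the UNet and images
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batches = [torch.randn(4, 8) for _ in range(6)]

grad_accumulation_steps = 2  # Effective batch size = 4 * 2 = 8
optimizer_updates = 0

for step, clean in enumerate(batches):
    noise = torch.randn_like(clean)
    # Toy "denoising" objective: predict the noise that was added to the clean batch
    noise_pred = model(clean + noise)
    loss = F.mse_loss(noise_pred, noise)

    # Scale the loss so the accumulated gradient is an average over the micro-batches
    (loss / grad_accumulation_steps).backward()

    # Only update the weights once every grad_accumulation_steps batches
    if (step + 1) % grad_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_updates += 1

print("optimizer updates:", optimizer_updates)  # 3 updates for 6 batches
```

Gradients from successive backward() calls add up in the parameters' .grad fields until zero_grad() clears them, which is what makes this pattern work.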

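To add guidance, we modify the sampling loop so that at every step we:

- Create a new version of x with requires_grad = True
- Compute the denoised version (x0)
- Pass the predicted x0 to our loss function
- Find the gradient of this loss with respect to x
- Use this gradient to modify x before stepping with the scheduler, nudging x in a direction that should lower the loss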
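
The guided loops in this section call a `color_loss` function that is not defined in these notes. A minimal sketch of such a loss follows, assuming images are tensors of shape (batch, 3, height, width) with values in (-1, 1). The default target color is an arbitrary illustrative choice.

```python
import torch


def color_loss(images, target_color=(0.1, 0.9, 0.5)):
    """Mean absolute difference between image pixels and a target RGB color.

    images: tensor of shape (batch, 3, height, width) with values in (-1, 1).
    target_color: RGB values in (0, 1); the default is an illustrative light teal.
    """
    # Map the target color from (0, 1) to the (-1, 1) range used by the images
    target = torch.tensor(target_color, device=images.device) * 2 - 1
    # Reshape to (1, 3, 1, 1) so it broadcasts against (batch, channels, height, width)
    target = target[None, :, None, None]
    return torch.abs(images - target).mean()


# Quick check: an image that is exactly the target color has zero loss
teal = (torch.tensor([0.1, 0.9, 0.5]) * 2 - 1)[None, :, None, None].expand(1, 3, 4, 4)
gray = torch.zeros(1, 3, 4, 4)
print(color_loss(teal).item())      # 0.0
print(color_loss(gray).item() > 0)  # True
```

Because the loss is built from differentiable tensor operations, torch.autograd.grad can compute its gradient with respect to x.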

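There are two ways to implement this. In the first, we set requires_grad on x only after getting the noise prediction from the UNet. This is more memory-efficient (we don't have to trace gradients back through the diffusion model), but the resulting gradient is less accurate.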


# Variant 1: shortcut method

# The guidance scale determines the strength of the effect
guidance_loss_scale = 40  # Explore changing this to 5, or 100

x = torch.randn(8, 3, 256, 256).to(device)

for i, t in tqdm(enumerate(scheduler.timesteps)):

    # Prepare the model input
    model_input = scheduler.scale_model_input(x, t)

    # predict the noise residual
    with torch.no_grad():
        noise_pred = image_pipe.unet(model_input, t)["sample"]

    # Set x.requires_grad to True
    x = x.detach().requires_grad_()

    # Get the predicted x0
    x0 = scheduler.step(noise_pred, t, x).pred_original_sample

    # Calculate loss
    loss = color_loss(x0) * guidance_loss_scale
    if i % 10 == 0:
        print(i, "loss:", loss.item())

    # Get gradient
    cond_grad = -torch.autograd.grad(loss, x)[0]

    # Modify x based on this gradient
    x = x.detach() + cond_grad

    # Now step with scheduler
    x = scheduler.step(noise_pred, t, x).prev_sample

# View the output
grid = torchvision.utils.make_grid(x, nrow=4)
im = grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5
Image.fromarray(np.array(im * 255).astype(np.uint8))

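In the second approach, we set requires_grad on x first, then pass it through the UNet and compute the predicted x0. The gradient now flows back through the model, which makes it more accurate at the cost of more memory.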

# Variant 2: setting x.requires_grad before calculating the model predictions

guidance_loss_scale = 40
x = torch.randn(4, 3, 256, 256).to(device)

for i, t in tqdm(enumerate(scheduler.timesteps)):

    # Set requires_grad before the model forward pass
    x = x.detach().requires_grad_()
    model_input = scheduler.scale_model_input(x, t)

    # predict (with grad this time)
    noise_pred = image_pipe.unet(model_input, t)["sample"]

    # Get the predicted x0:
    x0 = scheduler.step(noise_pred, t, x).pred_original_sample

    # Calculate loss
    loss = color_loss(x0) * guidance_loss_scale
    if i % 10 == 0:
        print(i, "loss:", loss.item())

    # Get gradient
    cond_grad = -torch.autograd.grad(loss, x)[0]

    # Modify x based on this gradient
    x = x.detach() + cond_grad

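    # Now step with scheduler (detach the prediction so the autograd graph
    # is not carried into the next step or into the final image conversion)
    x = scheduler.step(noise_pred.detach(), t, x).prev_sample

grid = torchvision.utils.make_grid(x, nrow=4)
im = grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5
Image.fromarray(np.array(im * 255).astype(np.uint8))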
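
To see why the two variants produce different gradients, here is a toy comparison. Everything in it is a hypothetical stand-in: `toy_model` replaces the UNet, `toy_loss` replaces the guidance loss, and the x0 estimate is simplified to x minus the predicted noise (the scheduler's scaling factors are left out).

```python
import torch


def toy_model(x):
    # Hypothetical stand-in for the UNet: the "predicted noise" is half the input
    return 0.5 * x


def toy_loss(x0):
    # Hypothetical guidance loss: pushes the denoised estimate towards zero
    return (x0 ** 2).sum()


x_start = torch.tensor([1.0, -2.0])

# Variant 1: the prediction is computed without gradients and treated as a constant
x = x_start.clone().requires_grad_()
with torch.no_grad():
    noise_pred = toy_model(x)
x0 = x - noise_pred
grad_shortcut = torch.autograd.grad(toy_loss(x0), x)[0]

# Variant 2: the gradient also flows back through the model
x = x_start.clone().requires_grad_()
x0 = x - toy_model(x)
grad_full = torch.autograd.grad(toy_loss(x0), x)[0]

print(grad_shortcut.tolist())  # [1.0, -2.0]
print(grad_full.tolist())      # [0.5, -1.0]
```

Both variants see the same x0 and the same loss value, but only the second accounts for how the model's prediction itself changes with x. With a real UNet that extra term requires keeping the model's activations in memory for the backward pass.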

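CLIP guidance: CLIP is a model developed by OpenAI that lets us compare images with text captions. This is very powerful, because it lets us quantify how well an image matches a prompt. And since the process is differentiable, we can use it as a loss function to guide our diffusion model.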

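The basic approach is:

Embed the text prompt to get a 512-dimensional CLIP embedding
For every step of the diffusion model's generation process:
- Make several variants of the predicted denoised image (multiple variants give a cleaner loss signal)
- For each one, embed the image with CLIP and compare this embedding with the text embedding (using a measure called Great Circle Distance Squared)
Compute the gradient of this loss with respect to the current noisy x, and use it to modify x before updating it with the scheduler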
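
The comparison step can be sketched as follows. This is one possible formulation of a squared great-circle distance between normalized embeddings, not necessarily the exact code behind these notes. The function name is illustrative, and random 512-dimensional vectors stand in for real CLIP image and text embeddings, so no CLIP model is loaded here.

```python
import math

import torch
import torch.nn.functional as F


def great_circle_distance_squared_loss(image_features, text_features):
    """Squared great-circle distance between every image and text embedding, averaged.

    image_features: (n_images, dim), text_features: (n_texts, dim).
    """
    image_normed = F.normalize(image_features.unsqueeze(1), dim=2)  # (n_images, 1, dim)
    text_normed = F.normalize(text_features.unsqueeze(0), dim=2)    # (1, n_texts, dim)
    # Chord length between unit vectors -> half the angle via arcsin -> 2 * (angle / 2)^2
    chord = (image_normed - text_normed).norm(dim=2)
    dists = chord.div(2).arcsin().pow(2).mul(2)
    return dists.mean()


# Random 512-dimensional vectors stand in for real CLIP embeddings (an assumption)
torch.manual_seed(0)
text_features = torch.randn(1, 512)
image_features = torch.randn(4, 512, requires_grad=True)

loss = great_circle_distance_squared_loss(image_features, text_features)
loss.backward()  # The loss is differentiable, so it can provide a guidance gradient
print(loss.item(), image_features.grad.shape)
```

In an actual guidance loop, image_features would come from encoding the augmented variants of the predicted x0 with CLIP, and the gradient would be taken with respect to the noisy x rather than the embeddings.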