[25.06.26] Denoising Diffusion Probabilistic Models - Paper-Reading-Study/2025 GitHub Wiki

Paper Reading Study Notes

General Information

  • Paper Title: Denoising Diffusion Probabilistic Models
  • Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel
  • Published In: NeurIPS (Conference on Neural Information Processing Systems)
  • Year: 2020
  • Link: https://arxiv.org/abs/2006.11239
  • Date of Discussion: 2025-06-26

Summary

  • Research Problem: To demonstrate that diffusion probabilistic models, a class of latent variable models, can be trained to produce high-quality image samples that are competitive with or superior to those from other generative models like GANs.
  • Key Contributions:
    1. Achieved state-of-the-art image synthesis results (at the time) on datasets like CIFAR10, surpassing many GAN-based models in FID score.
    2. Established a novel connection between diffusion models and denoising score matching with Langevin dynamics.
    3. Proposed a simplified training objective based on a weighted variational bound, which proved crucial for achieving the best sample quality.
    4. Showcased that the model's sampling process can be interpreted as a progressive lossy decompression scheme.
  • Methodology/Approach: The model consists of two processes:
    1. Forward Process (Diffusion): A fixed Markov chain that gradually adds Gaussian noise to an image over T timesteps until it becomes pure noise.
    2. Reverse Process (Denoising): A learned Markov chain, parameterized by a neural network (a U-Net), that reverses the diffusion process. It starts from noise and progressively denoises it step by step to generate a clean image. The core training objective is to make the model predict the noise (ε) that was added at each timestep, which simplifies the variational bound and improves sample quality (a minimal training sketch follows this summary).
  • Results: The model achieved an Inception score of 9.46 and a then state-of-the-art FID score of 3.17 on unconditional CIFAR10. It also produced high-quality 256x256 images on CelebA-HQ and LSUN, with quality comparable to ProgressiveGAN.
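
A minimal sketch of the simplified training objective discussed above (L_simple, Eq. 14 in the paper). The β schedule (linear from 1e-4 to 0.02 over T = 1000 steps) follows the paper; the noise-prediction network eps_model(x_t, t) is a hypothetical placeholder for the U-Net, and device handling is kept minimal.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # linear beta schedule from the paper
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)    # cumulative product: \bar{alpha}_t

def simple_loss(eps_model, x0):
    """L_simple: predict the noise added to x0 at a uniformly sampled timestep."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)              # t ~ Uniform{1, ..., T} (0-indexed here)
    eps = torch.randn_like(x0)                                    # the noise the network must predict
    ab = alpha_bar.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps                # q(x_t | x_0) in closed form
    return torch.nn.functional.mse_loss(eps_model(x_t, t), eps)   # ||eps - eps_theta(x_t, t)||^2
```

The closed-form expression for q(x_t | x_0) is what makes this practical: any timestep can be trained on directly, without simulating the noising chain step by step.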

Discussion Points

  • Strengths:
    • The mathematical framework, rooted in variational inference and KL divergence, was found to be clear and elegant.
    • The model's ability to generate very high-quality and diverse samples is a significant strength.
    • The connection to information theory concepts like rate-distortion and progressive coding provides a compelling interpretation of the model's behavior.
  • Weaknesses:
    • The sampling process is very slow and computationally expensive due to the large number of sequential steps required (e.g., T=1000).
  • Key Questions:
    • Initial Confusion: There was initial difficulty in intuitively understanding the variational bound (ELBO) and the role of the q distribution. This was resolved by viewing it as minimizing the "distance" (KL divergence) between the learned reverse process p and the true posterior q over all possible noising trajectories (the per-step decomposition is written out after this section).
    • Discrete Data Handling: The formula for handling discrete pixel data (Eq. 13, reproduced after this section) was initially confusing, particularly the product over the data dimension D. The group concluded it is the joint likelihood of all D pixels, treated as independent given x_1 for that single decoding step.
    • Future Work: A natural question that arose is how the slow sampling process could be accelerated, which is a major focus of subsequent research in this area.
  • Applications:
    • High-fidelity, unconditional image generation.
    • Potential for advanced data compression, given its properties as a lossy compressor.
  • Connections:
    • VAE: The model was heavily compared to VAEs. Both use a variational framework and Gaussian assumptions. However, the key difference is that VAEs learn a static latent space via a learned encoder, while diffusion models keep the "encoder" (the forward process) fixed and instead learn a dynamic, step-by-step denoising process.
    • Score Matching: The paper explicitly connects its simplified objective to denoising score matching, providing a bridge between variational and score-based generative models.
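
For reference, the two expressions that caused the most confusion above, reproduced from the paper (up to minor notation). The first is the variational bound rewritten as per-step KL divergences between the true posterior q and the learned reverse process p_θ; the second is the discrete decoder (Eq. 13), whose product over the data dimension D is the independent per-pixel likelihood the group discussed.

```latex
% Variational bound as a sum of per-step KL terms (Eq. 5)
\mathbb{E}_q\Big[
  \underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T}
  + \sum_{t>1} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}}
  \underbrace{-\,\log p_\theta(x_0 \mid x_1)}_{L_0}
\Big]

% Discrete decoder (Eq. 13): product of independent per-pixel Gaussian masses
p_\theta(x_0 \mid x_1) = \prod_{i=1}^{D} \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)}
  \mathcal{N}\big(x;\ \mu_\theta^i(x_1, 1),\ \sigma_1^2\big)\, dx
```

Here δ_±(x) widen the integration limits to ±∞ at the boundary values x = ±1 and otherwise extend x by ±1/255, so each factor is simply the Gaussian probability mass assigned to that pixel's discrete bin.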

Notes and Reflections

  • Interesting Insights:
    • The decision to have the model predict the added noise (ε) rather than directly predicting the reverse-step mean (μ) was a crucial and insightful simplification that led to better results.
    • Viewing the generation process as a gradual refinement from pure noise (high entropy) to a structured image (low entropy) was a helpful mental model.
  • Lessons Learned:
    • A solid understanding of concepts like KL divergence and entropy is highly beneficial for grasping the mechanics of modern generative models.
    • The discussion clarified the fundamental difference between VAEs and diffusion models: modeling a latent space versus modeling a stochastic process.
  • Future Directions:
    • Exploring methods to reduce the number of sampling steps (T) to make the model faster and more practical (the sampling-loop sketch at the end of these notes shows why cost grows linearly with T).
    • Applying the diffusion framework to other data modalities beyond images, such as video or audio.
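
As a companion to the points about slow sampling and ε-versus-μ prediction, here is a sketch of the ancestral sampling loop (Algorithm 2 in the paper), reusing the schedule and the hypothetical eps_model from the training sketch above; device handling is omitted for brevity. Every one of the T steps needs a full U-Net forward pass, which is exactly why generation is expensive.

```python
@torch.no_grad()
def sample(eps_model, shape):
    """Ancestral sampling: start from pure noise and denoise for T sequential steps."""
    x = torch.randn(shape)                                # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)                       # one full U-Net pass per step
        a_t, ab_t = alphas[t], alpha_bar[t]
        # Turn the predicted noise into the predicted reverse-process mean (Eq. 11):
        mean = (x - (1.0 - a_t) / (1.0 - ab_t).sqrt() * eps) / a_t.sqrt()
        if t > 0:
            # sigma_t^2 = beta_t, one of the two fixed variance choices in the paper
            x = mean + betas[t].sqrt() * torch.randn_like(x)
        else:
            x = mean                                      # no noise is added at the final step
    return x
```

With T = 1000 this loop dominates inference time, and it is exactly the bottleneck that later work on reduced-step sampling targets.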