[25.03.10] Denoising Diffusion Probabilistic Models

Paper Reading Study Notes

General Information

  • Paper Title: Denoising Diffusion Probabilistic Models
  • Authors: Jonathan Ho, Ajay Jain, Pieter Abbeel
  • Published In: 34th Conference on Neural Information Processing Systems (NeurIPS 2020)
  • Year: 2020
  • Link: arXiv (https://arxiv.org/abs/2006.11239)
  • Date of Discussion: 2025.03.10

Summary

  • Research Problem: The paper addresses the problem of generating high-quality images using a novel approach based on diffusion probabilistic models, inspired by non-equilibrium thermodynamics. It aims to improve upon existing generative models like GANs, VAEs, and autoregressive models.
  • Key Contributions:
    • Demonstrates that diffusion probabilistic models can generate high-quality image samples, sometimes surpassing the quality of other generative models.
    • Establishes a novel connection between diffusion models and denoising score matching with Langevin dynamics. This connection leads to a simplified and weighted variational bound objective.
    • Shows that the sampling procedure of diffusion models can be interpreted as a form of progressive lossy decompression, generalizing autoregressive decoding.
    • Achieves state-of-the-art FID scores on CIFAR10 and competitive sample quality on LSUN.
  • Methodology/Approach:
    • The core idea is to train a parameterized Markov chain (the "reverse process") to reverse a diffusion process. The diffusion process gradually adds Gaussian noise to an image until it becomes pure noise.
    • The reverse process learns to denoise the image, step by step, from pure noise back to a realistic image.
    • A key aspect is the parameterization of the reverse process: the network predicts the noise ε added at each step, which makes training equivalent to denoising score matching over multiple noise levels.
    • A U-Net architecture with group normalization and self-attention is used to represent the reverse process.
  • Results:
    • Achieves an Inception score of 9.46 and a state-of-the-art FID score of 3.17 on unconditional CIFAR10.
    • Obtains sample quality on 256x256 LSUN comparable to ProgressiveGAN.
    • Demonstrates high-quality image generation on CelebA-HQ.
    • Shows that the simplified training objective (Lsimple) leads to better sample quality than the full variational bound, although the latter yields better log-likelihoods.
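To make the methodology above concrete, here is a minimal PyTorch-style sketch of one training step under the simplified objective (Algorithm 1 in the paper). This is an illustrative sketch, not the authors' code: `eps_model` (the noise-predicting U-Net), the cumulative schedule `alpha_bar`, and the batch `x0` are assumed placeholders.

```python
import torch

def ddpm_training_loss(eps_model, x0, alpha_bar, T):
    """One L_simple step: corrupt clean images with the forward process,
    then regress the network's prediction onto the noise that was added.

    eps_model : assumed network eps_theta(x_t, t) predicting the added noise
    x0        : batch of clean images, shape (B, C, H, W), scaled to [-1, 1]
    alpha_bar : cumulative products of (1 - beta_t) over the noise schedule, shape (T,)
    """
    B = x0.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)      # t ~ Uniform{1, ..., T}
    eps = torch.randn_like(x0)                            # eps ~ N(0, I)
    a_bar = alpha_bar[t].view(B, 1, 1, 1)
    # Closed-form forward sample: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    # L_simple drops the per-step weights of the variational bound and keeps a plain MSE
    return ((eps - eps_model(x_t, t)) ** 2).mean()
```

The entire training procedure is repeated gradient steps on this loss: sample a timestep, add the corresponding amount of noise, and ask the network to recover it.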

Discussion Points

  • Strengths:

    • The novel connection between diffusion models and denoising score matching is a significant theoretical contribution.
    • The simplified training objective and the resulting high sample quality are compelling.
    • The interpretation of the sampling process as progressive decoding is insightful.
    • The model achieves excellent quantitative results (FID, Inception score) on benchmark datasets.
    • The ability to generate high-quality images without adversarial training is a notable advantage.
    • The discussion highlighted an intuitive way to understand the model: at each step it learns the "direction" of the noise to remove.
  • Weaknesses:

    • The mathematical derivations are complex and difficult to follow, with many steps skipped or glossed over. The discussion participants struggled with the equations, particularly the transitions between them (e.g., Eq. 8 to Eq. 12); a sketch of the key reparameterization step is included at the end of this Discussion Points section.
    • The concept of "codelength" and its relation to the model's performance are not clearly explained.
    • The paper acknowledges that the model's log-likelihoods are not competitive with other likelihood-based models, despite the high sample quality.
    • The discussion participants found the notation and terminology confusing, particularly regarding which parts of the model were parameterized and what constituted the "labels" during training.
  • Key Questions:

    • How exactly are the mathematical derivations performed, particularly the transitions between key equations?
    • What is the precise meaning of "codelength" in this context, and how does it relate to the model's compression capabilities?
    • Why does the simplified objective lead to better sample quality, despite being a looser bound?
    • How does the discrete decoder (Equation 13) relate to VAE decoders, and what is the significance of the discretization?
    • How can the model's architecture be further improved, and what are the limitations of the current U-Net based approach?
  • Applications:

    • High-quality image generation.
    • Lossy image compression.
    • Potential applications in other data modalities beyond images.
    • Could be used as components in other generative models or machine learning systems.
  • Connections:

    • Relates to other generative models like GANs, VAEs, autoregressive models, and flows.
    • Connects to energy-based models and score matching through the link to Langevin dynamics.
    • The progressive decoding aspect relates to convolutional DRAW and other autoregressive models.
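Since the jump from the variational-bound term to the noise-prediction objective (roughly Eq. 8 to Eq. 12 in the paper) was the main sticking point in the discussion, here is a compressed sketch of the reparameterization that links them, following the paper's notation (alpha_t = 1 - beta_t, and alpha-bar_t the product of alpha_s for s <= t):

```latex
% Each L_{t-1} term compares the forward-process posterior mean with the model mean:
L_{t-1} = \mathbb{E}_q\!\left[ \frac{1}{2\sigma_t^2}
          \left\| \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \right\|^2 \right] + C

% Reparameterize the forward process with \epsilon \sim \mathcal{N}(0, I):
x_t(x_0, \epsilon) = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
\;\;\Longrightarrow\;\;
x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \bigl( x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon \bigr)

% Substituting x_0 into the posterior mean \tilde{\mu}_t collapses it to
\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \Bigl( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon \Bigr)

% Choosing \mu_\theta of the same form, with a learned \epsilon_\theta(x_t, t), turns L_{t-1} into
L_{t-1} - C = \mathbb{E}_{x_0, \epsilon}\!\left[
    \frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}
    \bigl\| \epsilon - \epsilon_\theta\bigl( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t \bigr) \bigr\|^2 \right]
```

Dropping the time-dependent weight in front of the squared error is exactly what produces Lsimple, which is why the simplified objective is a reweighted (rather than strictly tighter) version of the variational bound.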

Notes and Reflections

  • Interesting Insights:

    • The interpretation of the diffusion process as adding noise in a way that's easier to reverse than simply masking pixels (as in some autoregressive models).
    • The idea that the model learns the "direction" of the noise at each step of the diffusion process (made concrete in the sampling sketch at the end of these notes).
    • The observation that a significant portion of the codelength is used to describe imperceptible image details.
  • Lessons Learned:

    • Diffusion models are a powerful and promising approach to generative modeling.
    • The connection to denoising score matching provides a valuable theoretical framework.
    • Careful parameterization and objective design are crucial for achieving good results.
    • The mathematical details of diffusion models can be quite complex.
  • Future Directions:

    • Further exploration of the mathematical derivations to gain a deeper understanding.
    • Investigation of different model architectures and training strategies.
    • Application of diffusion models to other data modalities.
    • Exploration of the model's potential for lossy compression.
    • Research into improving the model's log-likelihoods while maintaining high sample quality.
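As a companion to the training sketch in the Summary section, here is a minimal sketch of the reverse (sampling) process, Algorithm 2 in the paper, which makes the "follow the predicted noise direction at each step" intuition concrete. As before, `eps_model` and the precomputed schedules `alpha`, `alpha_bar`, and `sigma` are assumed placeholders, not the authors' implementation.

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, alpha, alpha_bar, sigma, T, device="cpu"):
    """Ancestral sampling: start from pure Gaussian noise and denoise for T steps.

    alpha, alpha_bar, sigma : 1-D tensors of length T derived from the beta schedule
    """
    x = torch.randn(shape, device=device)                 # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = eps_model(x, t_batch)                        # predicted noise "direction"
        # Posterior mean: subtract the scaled noise estimate, then rescale
        x = (x - (1.0 - alpha[t]) / (1.0 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt()
        if t > 0:                                          # re-inject a little noise except at the last step
            x = x + sigma[t] * torch.randn_like(x)
    return x
```

Each iteration removes part of the estimated noise and re-injects a smaller amount, which is also why the intermediate x_t can be read as a progressively less lossy version of the final image.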