Source: Denoising Diffusion Probabilistic Models
Summary
This paper demonstrated that Diffusion Models — a class of generative models that learn to reverse a gradual noising process — can produce high-quality images competitive with GANs. By framing image generation as iterative denoising, the authors showed that a conceptually simple training objective (predict the noise added to an image) yields state-of-the-art sample quality. This work launched the diffusion model paradigm that now powers image generators like DALL-E, Stable Diffusion, and Midjourney.
Key Claims
- Diffusion models match or exceed GANs on image quality. On unconditional CIFAR-10, the model achieved an Inception Score of 9.46 and FID of 3.17 — state-of-the-art at the time. On 256x256 LSUN, sample quality was comparable to ProgressiveGAN.
- The training objective is denoising. The forward process gradually adds Gaussian noise to an image over T steps until it is indistinguishable from pure noise; crucially, the noisy image at any step can be sampled from the original in closed form, so training does not require simulating intermediate steps. The model learns to reverse the process: given a noisy image and its timestep, predict the noise that was added. Training reduces to a simple mean-squared error between the true and predicted noise.
- A connection to score matching exists. The authors showed that a specific parameterization of the model is equivalent to denoising score matching with Langevin dynamics, connecting diffusion models to an established theoretical framework.
- Sampling is progressive decompression. The generation process starts from pure Gaussian noise and iteratively denoises, producing progressively more detailed images. This can be interpreted as a generalization of autoregressive decoding along a bit ordering.
- Log-likelihoods are not competitive. Despite excellent sample quality, the model’s log-likelihoods were worse than other likelihood-based models. The authors found that most of the model’s capacity was spent on imperceptible image details (lossless coding of fine structure).
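The forward noising, the simplified noise-prediction loss, and ancestral sampling described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `model` stands in for the learned noise-prediction network (epsilon_theta in the paper), and the linear beta schedule uses the values reported in the paper (1e-4 to 0.02 over T = 1000 steps).

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear variance schedule from the paper: beta_1 = 1e-4 to beta_T = 0.02.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product, shrinks toward 0

def q_sample(x0, t, eps):
    """Forward process: sample x_t ~ q(x_t | x_0) in closed form."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def training_loss(model, x0, t):
    """Simplified objective: MSE between the true and predicted noise."""
    eps = rng.standard_normal(x0.shape)   # noise actually added
    x_t = q_sample(x0, t, eps)            # noised image at step t
    eps_pred = model(x_t, t)              # network's guess of that noise
    return np.mean((eps - eps_pred) ** 2)

def p_sample_loop(model, shape):
    """Ancestral sampling: start from pure noise and denoise step by step."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_pred = model(x, t)
        # Posterior mean of x_{t-1} given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0  # no noise at the last step
        x = mean + np.sqrt(betas[t]) * noise
    return x
```

In practice `model` is a U-Net trained by minimizing `training_loss` over random images and random timesteps; here any callable of the same shape (e.g. `lambda x, t: np.zeros_like(x)`) can be plugged in to trace the computation.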
Relevance and Implications
DDPM established diffusion models as a viable alternative to GANs for high-quality image generation, with several advantages: stable training (no mode collapse), principled mathematical framework, and flexibility. The approach was rapidly extended to text-to-image generation (DALL-E 2, Stable Diffusion), video, audio, 3D content, and even molecular design. Diffusion models represent the second major generative paradigm alongside autoregressive language models — while LLMs generate text token by token, diffusion models generate images by iterative refinement from noise.