Diffusion Models

Diffusion models are a class of generative AI models that learn to create data (typically images) by reversing a gradual noising process. Introduced by Sohl-Dickstein et al. (2015) and popularized by Ho et al. (2020) as Denoising Diffusion Probabilistic Models (DDPMs), they train a neural network to predict and remove noise from progressively corrupted data. Generation works by starting from pure random noise and iteratively denoising it into a coherent output. This approach powers modern image generators including DALL-E 2, Stable Diffusion, and Midjourney.

How They Work

The Forward Process (Adding Noise)

Starting from a clean image, gradually add Gaussian noise over many timesteps (typically 1,000). At each step, a small amount of noise is added, so the image degrades slowly from clean to pure noise. This process is fixed (not learned) and follows a predefined noise schedule. A useful property is that the noisy image at any timestep $t$ can be sampled in closed form directly from the clean image, without simulating every intermediate step.
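As a minimal NumPy sketch of the forward process (the linear schedule values and the toy 8×8 "image" are illustrative assumptions, not from the source):

```python
import numpy as np

# Illustrative linear noise schedule over T timesteps (values are a common
# DDPM-style choice, not prescribed by the text above).
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # per-step noise amounts
alphas = 1.0 - betas                      # per-step signal retention
alpha_bars = np.cumprod(alphas)           # cumulative signal retention

rng = np.random.default_rng(0)

def forward_diffuse(x0, t):
    """Sample x_t directly from x_0 using the closed-form forward process:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

x0 = np.ones((8, 8))                 # toy "clean image"
x_mid = forward_diffuse(x0, 500)     # partially corrupted
x_last = forward_diffuse(x0, T - 1)  # nearly pure Gaussian noise
```

Note how `alpha_bars` decays toward zero over the schedule, so by the final timestep almost no signal from the original image remains.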

The Reverse Process (Removing Noise)

A neural network learns to reverse each step: given a noisy image at timestep $t$, predict what the image looked like at timestep $t-1$. In practice, the network predicts the noise that was added (rather than the clean image directly), and the training objective reduces to a simple mean-squared error between the predicted noise and the actual noise.
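The training objective above can be sketched as follows. Here `predict_noise` is a trivial stand-in for the learned network $\epsilon_\theta$ (in practice a U-Net or Transformer), and the schedule is the same illustrative linear one as before:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def predict_noise(x_t, t):
    # Placeholder for the learned denoiser eps_theta(x_t, t); a real model
    # would be a neural network trained by gradient descent.
    return np.zeros_like(x_t)

def training_loss(x0):
    """Simplified DDPM objective: MSE between actual and predicted noise
    at a randomly chosen timestep."""
    t = rng.integers(0, T)                        # sample a random timestep
    eps = rng.standard_normal(x0.shape)           # the actual noise added
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - predict_noise(x_t, t)) ** 2)

loss = training_loss(np.zeros((8, 8)))
```

This is the whole loss: no discriminator, no adversarial game, just mean-squared error on the noise.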

Generation

To generate a new image, start from pure Gaussian noise and apply the learned denoising network repeatedly — each step removes a bit of noise, progressively revealing an image. The generation quality depends on the number of denoising steps, with more steps producing better results at the cost of speed.
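Assuming the same illustrative schedule, DDPM-style ancestral sampling looks roughly like this. The zero-output `predict_noise` is again a placeholder for a trained network, so the result is not a real image, only the shape of the loop:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x_t, t):
    return np.zeros_like(x_t)   # placeholder for the trained eps_theta

def sample(shape=(8, 8)):
    """Start from pure Gaussian noise and denoise step by step, t = T-1 .. 0."""
    x = rng.standard_normal(shape)               # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_hat = predict_noise(x, t)
        # Mean of the reverse step, computed from the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) \
            / np.sqrt(alphas[t])
        if t > 0:
            # All steps except the last inject fresh sampling noise.
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

img = sample()
```

The loop makes the speed tradeoff concrete: every one of the $T$ iterations is a full forward pass through the network, which is why fast samplers that use far fewer steps are an active research area.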

Why Diffusion Models Succeeded

Prior generative approaches each had significant limitations:

  • GANs (Generative Adversarial Networks) produce sharp images but suffer from training instability, mode collapse (generating limited variety), and difficulty evaluating sample diversity.
  • VAEs (Variational Autoencoders) are stable to train but produce blurry images.
  • Autoregressive models produce high-quality samples but are slow (generating one pixel/token at a time).

Diffusion models avoided all three problems: training is stable (just predict noise), there’s no mode collapse (the model must denoise from any random starting point), and quality matches or exceeds GANs. The main tradeoff is speed — generation requires many sequential denoising steps.

Key Properties

  • Training is simple. The loss function is just MSE between predicted and actual noise. No adversarial training, no complex loss balancing.
  • Score matching connection. A specific parameterization of diffusion models is equivalent to denoising score matching with Langevin dynamics, connecting them to a rigorous theoretical framework from statistical physics.
  • Conditional generation. By conditioning the denoising process on text prompts (via cross-attention with text embeddings), diffusion models enable text-to-image generation. This is how DALL-E 2 and Stable Diffusion work.
  • Progressive refinement. The iterative denoising process naturally produces coarse structure first and fine details last, analogous to how an artist might sketch before adding detail.
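In standard DDPM notation (where $\bar\alpha_t$ is the cumulative product of the per-step signal-retention factors and $\epsilon_\theta$ is the noise-prediction network), the simplified training objective from above is:

$$
\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\,\left\| \epsilon - \epsilon_\theta\!\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\big)\right\|^2\,\right]
$$

The score-matching connection follows because the predicted noise is, up to scale, an estimate of the score (gradient of the log-density) of the noisy data:

$$
\epsilon_\theta(x_t, t) \approx -\sqrt{1-\bar\alpha_t}\; \nabla_{x_t} \log p(x_t)
$$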

The Two Generative Paradigms

Modern AI has two dominant generative paradigms:

  1. Autoregressive models (LLMs) — generate tokens one at a time, each conditioned on all previous tokens. Natural for text, where order matters.
  2. Diffusion models — generate all elements simultaneously through iterative refinement. Natural for images, where there’s no inherent left-to-right order.

These paradigms are increasingly converging: diffusion models use Transformer backbones (DiT), and some text generation approaches use diffusion-like processes. Multimodal models must bridge both paradigms.
