Latent Diffusion Models: The Architecture Behind Stable Diffusion

Quick answer

Latent Diffusion Models (LDMs) move the diffusion process out of pixel space and into the compressed latent space of a pretrained autoencoder, so the denoising network operates on a much smaller tensor. Training the strongest pixel-space diffusion models often consumed hundreds of GPU-days; LDMs reach a “near-optimal point between complexity reduction and detail preservation” with far less compute, set a new state of the art on image inpainting, and stay competitive on unconditional generation, semantic scene synthesis, and super-resolution. This is the architecture Stable Diffusion is built on.

Why pixel-space diffusion is so expensive

A diffusion model generates an image by starting from noise and applying a denoising network many times in sequence. The catch in the original formulation: every one of those steps runs directly on the full-resolution pixel grid. For high-resolution synthesis that means a large network repeatedly evaluating a large tensor, often hundreds of times per sample. The paper is blunt about the cost: training the strongest pixel-space DMs “often consumes hundreds of GPU days,” and inference is expensive too, because each step depends on the output of the one before it, so the steps cannot be run in parallel. That cost is the real barrier the paper attacks; the quality of pixel diffusion was already good.
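To make the sequential cost concrete, here is a minimal sketch of a sampling loop. It assumes a DDPM-style ancestral update and a linear noise schedule, neither of which is specified in the text above, and `toy_denoiser` is a hypothetical stand-in for a trained network. The point is only that each step consumes the previous step's output, so a sample costs one full network evaluation per step.

```python
import numpy as np

def sample(denoiser, shape, num_steps=50, seed=0):
    """DDPM-style ancestral sampling (assumed update rule).

    Returns the final sample and the number of denoiser evaluations.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)  # linear schedule (assumption)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)              # start from pure noise
    calls = 0
    for t in reversed(range(num_steps)):        # step t needs the x produced by step t + 1
        eps = denoiser(x, t)                    # one full network evaluation
        calls += 1
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0  # no noise on the last step
        x = mean + np.sqrt(betas[t]) * noise
    return x, calls

def toy_denoiser(x, t):
    # Hypothetical placeholder for a trained U-Net: predicts zero noise.
    return np.zeros_like(x)

image, calls = sample(toy_denoiser, shape=(3, 64, 64), num_steps=50)
```

With `num_steps=50` the loop calls the denoiser 50 times, and each of those calls processes a tensor of the full `shape`. Shrinking that tensor reduces the cost of every step at once.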

Diffusing in latent space

The fix is a two-stage design. First, train an autoencoder that compresses an image into a lower-dimensional latent (with a modest downsampling factor) and reconstructs it with high fidelity. Then train the diffusion model entirely inside that latent space. Because the latent drops perceptually insignificant high-frequency detail, and the decoder is responsible for producing fine texture on the way back to pixels, the diffusion model no longer wastes capacity modeling pixels it does not need to. The authors frame the autoencoder stage as “perceptual compression” and the diffusion stage as the “semantic” generative model, a separation that is the conceptual core of the paper.
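A rough sense of the saving comes from simple shape arithmetic. The sketch below assumes a 512×512 RGB image, a downsampling factor of 8, and 4 latent channels. These values are illustrative (they match the commonly cited Stable Diffusion v1 configuration, but the text above does not fix them), and `latent_shape` and `num_elements` are hypothetical helpers.

```python
def latent_shape(height, width, f, latent_channels):
    """Shape of the latent when each spatial side is downsampled by a factor f."""
    if height % f or width % f:
        raise ValueError("height and width must be divisible by f")
    return (latent_channels, height // f, width // f)

def num_elements(shape):
    """Total number of values in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n

pixel = (3, 512, 512)                                     # RGB image (assumed size)
latent = latent_shape(512, 512, f=8, latent_channels=4)   # assumed f and channel count
ratio = num_elements(pixel) / num_elements(latent)
print(latent, ratio)  # prints (4, 64, 64) 48.0
```

Under these assumptions the denoising network sees 48 times fewer values per step. The actual compute saving depends on the network architecture, so treat the ratio as an indication of scale rather than a measured speedup.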

The judgment worth making: the headline contribution is economic, not a new generative principle. LDMs do not denoise differently from prior DMs; they denoise somewhere cheaper. That reframing is what made everything downstream possible.

Cross-attention conditioning

The second contribution is how LDMs accept conditioning input. The paper adds cross-attention layers to the denoising U-Net so the model can be steered by general conditioning signals (text prompts, semantic maps, bounding boxes): a domain-specific encoder turns the condition into a sequence of embeddings, and the U-Net's intermediate features attend to that sequence. This turns what would otherwise be an unconditional or class-conditional generator into a flexible, promptable one, and it is the mechanism that text-to-image systems inherited directly. High-resolution synthesis also becomes possible “in a convolutional manner,” letting the model generalize beyond the training resolution.
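The conditioning mechanism can be sketched in a few lines of NumPy. This is a single-head scaled dot-product cross-attention under illustrative assumptions: all dimensions, the random projection matrices, and the function names are hypothetical, and a real U-Net block would add multiple heads, normalization, and residual connections. Queries come from the flattened latent feature map; keys and values come from the encoded condition.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    """latent_tokens: (N, d_latent), cond_tokens: (M, d_cond).

    Returns the attended features (N, d_attn) and the attention weights (N, M).
    """
    q = latent_tokens @ w_q                    # queries from the U-Net features
    k = cond_tokens @ w_k                      # keys from the condition embeddings
    v = cond_tokens @ w_v                      # values from the condition embeddings
    scores = q @ k.T / np.sqrt(q.shape[-1])    # similarity of each latent token to each condition token
    weights = softmax(scores, axis=-1)         # each latent token distributes attention over the condition
    return weights @ v, weights

rng = np.random.default_rng(0)
N, M, d_latent, d_cond, d_attn = 64, 8, 32, 16, 24   # illustrative sizes
latent_tokens = rng.standard_normal((N, d_latent))   # e.g. an 8x8 feature map, flattened
cond_tokens = rng.standard_normal((M, d_cond))       # e.g. 8 encoded prompt tokens
w_q = rng.standard_normal((d_latent, d_attn))
w_k = rng.standard_normal((d_cond, d_attn))
w_v = rng.standard_normal((d_cond, d_attn))

out, weights = cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v)
```

Because only the keys and values depend on the condition, the same layer works for any modality that can be encoded as a sequence of vectors, which is why one architecture can cover text, semantic maps, and layouts.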

Key results

New state of the art on image inpainting at the time of publication.
Highly competitive on unconditional image generation, class-conditional generation, semantic scene synthesis, and super-resolution, evaluated across standard datasets.
Significantly reduced compute versus pixel-based diffusion models for both training and sampling, while retaining quality and the guidance flexibility of diffusion.
A single architecture spans multiple conditioning tasks via cross-attention, rather than a bespoke model per task.

The durable result is not any single benchmark number but the cost curve: LDMs made high-resolution diffusion trainable and runnable on hardware that academic and indie teams actually have.

Why this became the foundation of Stable Diffusion

When LDMs are scaled up and conditioned on text with a frozen text encoder, you get Stable Diffusion — the open text-to-image model that put generative imaging in the hands of millions and seeded a large fine-tuning, LoRA, and tooling ecosystem. The reason that ecosystem could exist is the latent-space efficiency this paper introduced: generation cheap enough to run on a single consumer GPU. If you want to understand modern open image generation, this is the paper to read first.

Limits and open questions

Compression is not free, and the paper is honest that the latent bottleneck is a tradeoff. The autoencoder sets a ceiling on reconstructable detail: too aggressive a downsampling factor loses fidelity, while too small a factor gives back the pixel-space cost. Sampling is still sequential and slower than a single forward pass, so latency remains a real constraint (later work on faster samplers and distillation targets exactly this). Conditioning quality is bounded by the text-image alignment of whatever encoder you bolt on; the LDM itself does not fix a weak prompt encoder. And the two-stage design means autoencoder artifacts can propagate into every generation regardless of how good the diffusion model is. Subsequent systems improved fidelity, control, and safety, but the central latent-space tradeoff remains.

FAQ

What are Latent Diffusion Models in one sentence?

Latent Diffusion Models run a diffusion model inside the compressed latent space of a pretrained autoencoder instead of on raw pixels, achieving comparable or better image quality at a fraction of the compute.

How is Stable Diffusion related to Latent Diffusion Models?

Stable Diffusion is a scaled, text-conditioned Latent Diffusion Model. The LDM paper (arXiv 2112.10752) is the underlying architecture; Stable Diffusion is the well-known open implementation built on it.

Why is latent diffusion cheaper than pixel-space diffusion?

Because the denoising network operates on a small latent tensor rather than the full-resolution pixel grid, every one of the many sequential diffusion steps processes far less data, cutting both training and inference cost.

What does cross-attention do in Latent Diffusion Models?

Cross-attention layers inject conditioning signals — text, semantic maps, or bounding boxes — into the denoising U-Net, turning the model into a flexible, promptable generator instead of a fixed unconditional one.

What are Latent Diffusion Models best at?

At publication they set a new state of the art on image inpainting and were highly competitive on unconditional generation, semantic synthesis, super-resolution, and text-to-image, all while using far less compute than pixel-based diffusion.

If pixel diffusion proved the idea worked, latent diffusion proved it could be afforded, and that is why it spread so widely. Read the original at https://arxiv.org/abs/2112.10752.