VideoMLA: A Low-Rank Latent KV Cache for Minute-Scale Video Diffusion

VideoMLA ports Multi-Head Latent Attention into causal video diffusion, cutting per-token KV memory by 92.7% (224 vs 3,072 scalars), posting the best overall VBench score among evaluated methods at 60s, and raising throughput 1.23x on a single B200.

Quick answer

VideoMLA is the first study of Multi-Head Latent Attention (MLA) inside causal video diffusion. It swaps the standard per-head key-value cache for a shared low-rank content latent plus a shared decoupled 3D-RoPE positional key, cutting per-token KV memory by 92.7% at every cached layer — 224 scalars per token instead of 3,072, a 13.7x reduction. On VBench it matches short-horizon streaming baselines, posts the best overall score among evaluated methods at long horizons (0.859 at 60s), and improves throughput 1.23x on a single B200. The surprising part: it works even though pretrained video attention is not low-rank, so the usual spectral justification for MLA does not apply.

The streaming-memory problem this attacks

Long-rollout causal video diffusion has standardized on a fixed-size sliding-window KV cache. Recent work innovates inside that window — which tokens occupy it, how their positions are encoded — but leaves the per-head KV layout itself untouched. That layout is a dominant contributor to streaming memory and latency: every cached layer stores full keys and values for every head, and that footprint is what forces small batches and caps how long a video you can roll out before running out of memory.

VideoMLA targets the layout directly. The bet is that the per-head cache is redundant and can be replaced by a much smaller shared latent without losing the quality that the pretrained attention learned. If that holds, you shrink the streaming footprint enough to roll out minute-scale video and to pack far larger batches on the same hardware.

How VideoMLA compresses the cache

VideoMLA borrows MLA from the language-model world and adapts it to video. Instead of caching one key and one value per head, it caches two tensors shared across heads: a low-rank content latent that all heads project from, and a single decoupled positional key carrying the 3D rotary position embedding. At inference the cache holds only this compact pair, and per-head keys and values are reconstructed from it on the fly. Keeping the 3D-RoPE key decoupled separates positional information from content, so the rotary embedding does not get tangled into the low-rank bottleneck.
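A minimal sketch of that cache layout may help make the mechanics concrete. It is not the paper's implementation: the class name `LatentKVCache`, the up-projection matrices, and the toy sizes are all hypothetical, and the positional key is treated as an opaque vector that already carries its rotary embedding.

```python
from collections import deque


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LatentKVCache:
    """Sliding-window cache: one shared latent and one shared positional key per token."""

    def __init__(self, window, key_up, value_up):
        # deque(maxlen=...) evicts the oldest token once the window is full.
        self.entries = deque(maxlen=window)
        # Hypothetical per-head up-projections, indexed [head][head_dim][content_dim].
        self.key_up = key_up
        self.value_up = value_up

    def append(self, content_latent, rope_key):
        # rope_key is assumed to already carry the rotary position embedding.
        self.entries.append((list(content_latent), list(rope_key)))

    def scalars_per_token(self):
        content_latent, rope_key = self.entries[-1]
        return len(content_latent) + len(rope_key)

    def keys_and_values(self, head):
        """Rebuild one head's keys and values from the cached latents."""
        keys, values = [], []
        for content_latent, rope_key in self.entries:
            # Per-head key = up-projected content part + shared positional key.
            keys.append(matvec(self.key_up[head], content_latent) + rope_key)
            values.append(matvec(self.value_up[head], content_latent))
        return keys, values


# Toy sizes (assumed): 2 heads, head_dim 3, content latent 2, positional key 1, window 2.
key_up = [[[1, 0], [0, 1], [1, 1]], [[2, 0], [0, 2], [1, -1]]]
value_up = [[[1, 1], [1, 0], [0, 1]], [[0, 1], [1, 0], [1, 1]]]
cache = LatentKVCache(window=2, key_up=key_up, value_up=value_up)
for latent, rope_key in [([1.0, 2.0], [0.5]), ([3.0, 4.0], [0.25]), ([5.0, 6.0], [0.125])]:
    cache.append(latent, rope_key)

keys, values = cache.keys_and_values(head=0)
```

In this toy setup the cache stores 3 scalars per token, while the reconstructed keys and values for the two heads would take 14 scalars per token if cached directly. The trade is extra matrix-vector work at read time in exchange for the smaller stored footprint.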

The hard question is why this should work at all. In language models MLA is usually motivated by a spectral argument: attention is approximately low-rank, so a small latent loses little. VideoMLA shows that argument does not hold here — pretrained video attention has a 99%-energy effective rank far above any practical latent dimension, so direct spectral approximation would predict large reconstruction error. Yet VideoMLA retains quality at those compression ratios anyway.
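To make the "99%-energy effective rank" measure concrete, here is a small sketch of one common way to compute it from a list of singular values. Two things are assumptions here rather than details from the paper: that "energy" means the sum of squared singular values, and the function name `effective_rank`. The two example spectra are synthetic.

```python
def effective_rank(singular_values, energy=0.99):
    """Smallest k whose top-k squared singular values hold `energy` of the total."""
    # Assumption: energy is measured as the sum of squared singular values.
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0:
        return 0
    running = 0.0
    for k, value in enumerate(squared, start=1):
        running += value
        if running >= energy * total:
            return k
    return len(squared)


# Synthetic spectra for illustration only.
flat_spectrum = [1.0] * 50                      # every direction matters equally
fast_decay = [2.0 ** -i for i in range(50)]     # a few directions dominate

flat_rank = effective_rank(flat_spectrum)
decay_rank = effective_rank(fast_decay)
```

A flat spectrum needs every direction to reach 99% of the energy, while a fast-decaying one needs only a handful. The claim in the text is that pretrained video attention sits much closer to the first case than the low-rank story assumes.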

Why it works without low-rank attention

The paper’s central finding is that the MLA bottleneck — not the pretrained spectrum — sets the effective rank. Whether you initialize the latent from the top spectral directions or from random noise, the projection occupies nearly the full rank budget from the very first step, and training preserves that budget while adapting within it. In other words, the model does not try to approximate the original high-rank attention map and fail; it learns a fresh attention that lives within the budget the bottleneck allows. This reframes MLA for diffusion: the latent dimension is a capacity knob you train into, not a spectrum you compress toward, which is why random and spectral initialization land in the same place.
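The upper-bound half of that statement is plain linear algebra: any map factored through a latent of width r has rank at most r, whatever the weights are. The sketch below checks this on toy integer matrices with exact arithmetic. The matrices and sizes are hypothetical, and the code says nothing about the paper's empirical finding that training actually fills the budget.

```python
from fractions import Fraction


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rank(matrix):
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    found = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(found, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(found + 1, len(rows)):
            factor = rows[r][col] / rows[found][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[found])]
        found += 1
    return found


# Hypothetical toy sizes: model width 6, latent width 2.
down = [[1, 0, 1, 0, 1, 0],
        [0, 1, 0, 1, 0, 1]]                                  # 2 x 6: token -> latent
up = [[1, 0], [0, 1], [1, 1], [1, 2], [2, 1], [3, 1]]        # 6 x 2: latent -> key
bottlenecked = matmul(up, down)                              # 6 x 6 effective map

# A full-rank 6 x 6 map for contrast (upper-triangular ones).
unconstrained = [[1 if i <= j else 0 for j in range(6)] for i in range(6)]

bottleneck_rank = rank(bottlenecked)
unconstrained_rank = rank(unconstrained)
```

Here the factored map uses its full budget of 2 but cannot exceed it, while the unconstrained map reaches rank 6. This is the sense in which the latent width acts as a capacity knob.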

Key results

  • KV memory: 92.7% reduction in per-token KV memory at every cached layer; 224 scalars per token versus 3,072 for the per-head baseline, a 13.7x cut.
  • Quality at length: matches short-horizon streaming video diffusion baselines on VBench, and reaches the best overall score among evaluated methods at long horizons — 0.859 at 60s.
  • Motion preserved: dynamic-degree stays high deep into the rollout — 0.981 at 30s and 0.958 at 60s.
  • Throughput: 1.23x improvement on a single B200; reported 23.96 FPS at batch size 1 with 3.38s latency.
  • Batch scaling: up to 8.0x larger non-OOM batches at latent content dimension 192, the direct payoff of the smaller streaming footprint.
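The memory figures in the list above can be checked with a few lines of arithmetic. Only the totals (3,072 and 224) and the content dimension of 192 come from the text. The split into 12 heads of width 128 and a 32-wide positional key is an assumption chosen to match those totals, not something the source states.

```python
def per_head_scalars(n_heads, head_dim):
    """Scalars per token per layer when every head caches a key and a value."""
    return 2 * n_heads * head_dim


def latent_scalars(content_dim, rope_dim):
    """Scalars per token per layer for a shared content latent plus a shared positional key."""
    return content_dim + rope_dim


# Assumed decomposition of the reported 3,072: 12 heads x 128 dims x (key + value).
baseline = per_head_scalars(n_heads=12, head_dim=128)
# Content dimension 192 is reported; the 32-wide positional key is inferred as 224 - 192.
compressed = latent_scalars(content_dim=192, rope_dim=32)

reduction_pct = round(100 * (1 - compressed / baseline), 1)
ratio = round(baseline / compressed, 1)
```

Under these assumptions the numbers reproduce the reported 92.7% reduction and 13.7x ratio, which confirms that the two headline figures describe the same change.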

The honest read: this is an efficiency and systems contribution, not a generation-quality leap. The headline is that you keep baseline-level quality while spending a fraction of the streaming memory and gaining headroom for longer rollouts and bigger batches — plus a clean mechanistic explanation for why MLA survives in a regime where the standard low-rank story breaks.

Limits and open questions

  • Single-model, single-hardware evidence. The throughput and batch-scaling numbers are reported on one B200 setup; gains on other accelerators or memory-bandwidth-bound regimes are not characterized here.
  • VBench-centric quality. Wins are measured on VBench at fixed horizons; VBench is an automated proxy, so whether the long-horizon edge survives human preference at minute scale is open.
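  • No free lunch on reconstruction. The paper itself shows that direct spectral approximation would fail at these ratios, so the latent has to be trained rather than obtained by truncating the pretrained weights; the on-the-fly reconstruction of per-head keys and values also adds compute, which eats into the gains from the smaller cache.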
  • Latent dimension as a knob. Because the bottleneck sets the rank, picking the latent content dimension trades quality against memory, and the paper does not give a general rule for choosing it across models.

FAQ

What is Multi-Head Latent Attention (MLA) in VideoMLA?

MLA replaces per-head cached keys and values with a single shared low-rank content latent plus a shared decoupled 3D-RoPE positional key. VideoMLA caches the compact latent and reconstructs per-head keys and values at inference, cutting per-token KV memory 92.7%.

Why does VideoMLA work if video attention is not low-rank?

Pretrained video attention has a 99%-energy effective rank far above any practical latent size, so it is not low-rank. VideoMLA shows the MLA bottleneck, not the pretrained spectrum, sets the effective rank: the projection fills the rank budget from initialization and training adapts within it, so spectral and random initialization converge.

How much faster and lighter is VideoMLA?

Per-token KV memory drops 92.7% (224 vs 3,072 scalars, 13.7x), throughput rises 1.23x on a single B200, and the largest non-OOM batch size grows up to 8.0x at content dimension 192. VBench quality matches baselines at short horizons and leads at 60s.