SANA-Streaming: Real-time Video Editing at 24 FPS on One RTX 5090

Quick answer

SANA-Streaming is a system-and-algorithm co-designed framework from NVIDIA and MIT that edits streaming video in real time: 1280x704 resolution at 24 end-to-end frames per second on a single RTX 5090, with the diffusion transformer (DiT) core alone running at 58 FPS. It targets video-to-video editing for live use cases — broadcasting, gaming — where the model must keep up with an incoming frame stream while staying temporally consistent. The three moves that make this work are a hybrid DiT that adds softmax attention to an otherwise linear-attention backbone, a Cycle-Reverse Regularization training trick, and Blackwell-specific kernels plus mixed-precision quantization.

The problem: real-time V2V is brutally constrained

Streaming video-to-video editing has two demands that fight each other. First, throughput: to feel live, the system has to process frames faster than they arrive, so high-resolution editing needs to clear a real-time bar on a consumer GPU rather than a datacenter rack. Second, temporal consistency: edits applied frame by frame tend to flicker and drift, because nothing forces frame N to agree with frame N-1. Most prior editing methods optimize quality on offline clips and either run far below real time or fall apart on long streams. SANA-Streaming’s framing is that you cannot solve this with the model alone or the system alone — the resolution and frame-rate target only close when architecture, training, and kernels are designed together.

How SANA-Streaming works

The architecture is a Hybrid Diffusion Transformer. The SANA line is built on linear attention, which is cheap but weaker at local detail. SANA-Streaming reintroduces softmax attention in a subset of the blocks to recover local modeling capability, while keeping the linear layers everywhere else so the overall compute stays efficient. It is a deliberate partial trade: pay for full attention only where it buys the most quality.

The training contribution is Cycle-Reverse Regularization. The hard part of streaming edits is enforcing semantic consistency without a dataset of paired long edited videos, which essentially does not exist. Instead, the method predicts the source frames back from the generated content via flow matching — a cycle that pushes the edit to stay faithful to the original over time. Because the supervision comes from reconstructing the input rather than from matching a long ground-truth edited clip, it sidesteps the missing-data problem entirely.

The third piece is Efficient System Co-design tuned for NVIDIA Blackwell (the RTX 5090). It combines fused GDN kernels with Mixed-Precision Quantization (MPQ). The MPQ assignment is driven by profiling real-world throughput so it maximizes Tensor Core utilization while holding generation quality — quantization placed where the hardware actually benefits, not uniformly.

Why now

Real-time generative video is the current frontier, and the bottleneck has been the gap between offline quality and live throughput. Two things changed. Blackwell-class consumer GPUs (RTX 5090) put serious Tensor Core throughput in reach without a datacenter, and the SANA family’s linear-attention backbone gives an efficient starting point that softmax-heavy video DiTs lack. SANA-Streaming is the argument that, with co-design, real-time high-resolution editing is now a consumer-GPU problem rather than a cloud problem — which matters for latency-sensitive live applications where round-tripping to a server is a non-starter.

Key results

24 end-to-end FPS at 1280x704 resolution on a single RTX 5090 — the headline real-time target, measured end to end rather than for the model core alone.
58 FPS for the DiT core in isolation, showing the model itself has substantial headroom and that the end-to-end figure is gated by the surrounding pipeline.
Outperforms existing SOTA methods on both temporal coherence and system throughput, per the paper’s experiments — the two axes the framework was explicitly designed around.
Achieved without paired long edited videos for training, thanks to Cycle-Reverse Regularization predicting source frames from generated content.

A candid note on the numbers: the abstract publishes the throughput figures (24 and 58 FPS, 1280x704, RTX 5090) and states the SOTA comparison qualitatively, but the available material does not list a single headline temporal-consistency metric value or the specific baselines beaten. Treat the FPS numbers as hard and the SOTA claim as directional pending the full tables.

Limits and open questions

The strongest gap is quantified quality. “Outperforms SOTA on temporal coherence” appears without a published metric or named baseline in the available material, so the consistency claim is directional. The throughput numbers are tied to one specific GPU — the RTX 5090 — and the MPQ and fused-kernel gains are explicitly Blackwell-tuned, so it is unclear how much of the speed survives on older or non-NVIDIA hardware. The 24-versus-58 FPS gap shows the end-to-end pipeline, not the DiT, is the bottleneck, and the paper’s framing does not (in the abstract) break down where those frames go. Finally, the editing scope — what kinds of edits, how strong, how they hold over very long streams — is not characterized in the material here, and Cycle-Reverse Regularization reconstructing source frames could in principle bias edits toward under-changing the input.

FAQ

What is SANA-Streaming in one sentence?

SANA-Streaming is a co-designed framework from NVIDIA and MIT that performs real-time streaming video-to-video editing at 1280x704 and 24 end-to-end FPS on a single RTX 5090, using a hybrid linear-plus-softmax diffusion transformer.

How fast is SANA-Streaming?

24 end-to-end frames per second at 1280x704 resolution on one RTX 5090 GPU, with the diffusion transformer core running at 58 FPS on its own.

What is the Hybrid Diffusion Transformer in SANA-Streaming?

It is a DiT that keeps the SANA line’s efficient linear-attention layers in most blocks but adds softmax attention in a subset of them, recovering local modeling quality without paying full-attention cost everywhere.

What is Cycle-Reverse Regularization?

A training strategy that enforces semantic consistency by predicting the source frames back from the generated content via flow matching, which improves temporal consistency without needing paired long edited videos for supervision.

Does SANA-Streaming need special hardware?

Its throughput and the Mixed-Precision Quantization plus fused GDN kernels are optimized for NVIDIA Blackwell (RTX 5090), so the reported real-time numbers are tied to that architecture; portability to other GPUs is not established in the available material.

One line: co-design the model, the training, and the kernels together, and real-time high-resolution video editing fits on a single consumer GPU. Read the original paper on arXiv.