Diffusion Models

Generative models that synthesize data through iterative denoising.

Diffusion models changed image generation by turning synthesis into iterative denoising. Instead of generating pixels in one step, the model learns how to reverse a corruption process, which gives strong control over fidelity, diversity, conditioning, and later editing workflows.
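The corruption process mentioned above can be made concrete with a small sketch. It assumes the Gaussian forward process and linear noise schedule commonly associated with DDPM (1,000 steps, variances rising from 1e-4 to 0.02); the function names and the toy four-value "image" are illustrative, not taken from any particular library. A trained network would supply the noise estimate; here the true noise stands in for it, to show that a good estimate is enough to recover the clean signal.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (assumed DDPM-style schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta_s) for s <= t: share of signal variance left."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in one shot; also return the noise that was used."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    xt = [signal * x + sigma * e for x, e in zip(x0, eps)]
    return xt, eps

def recover_x0(xt, eps_hat, t, alpha_bars):
    """Invert the noising equation given a noise estimate eps_hat."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [(x - sigma * e) / signal for x, e in zip(xt, eps_hat)]

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

x0 = [0.5, -0.25, 1.0, 0.0]                      # toy stand-in for image pixels
xt, eps = add_noise(x0, 500, alpha_bars, rng)     # corrupted sample at step 500
x0_hat = recover_x0(xt, eps, 500, alpha_bars)     # true noise plays the model's role
```

In training, the network sees `xt` and the step index and is asked to predict `eps`; sampling then applies that prediction repeatedly, starting from pure noise, which is the iterative part of iterative denoising.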

The key distinction is that diffusion is not only a text-to-image trick. Latent Diffusion made high-resolution generation practical by moving denoising into a compressed latent space. Imagen showed that text understanding is a major driver of prompt alignment. DALL·E 2 connected language-image representations with generation. Together these papers explain why modern creative AI is built around both denoising and strong conditioning.
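To give a rough sense of why a compressed latent space matters, the arithmetic below compares how many values the denoiser has to process per image. The numbers are assumptions for illustration: a 512x512 RGB image, an autoencoder that downsamples each side by 8, and 4 latent channels, which is the configuration commonly cited for Stable Diffusion. The helper names are hypothetical.

```python
def element_count(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels

def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the latent tensor under the assumed downsampling factor and channel count."""
    return height // downsample, width // downsample, latent_channels

pixel_values = element_count(512, 512, 3)
latent_values = element_count(*latent_shape(512, 512))
compression = pixel_values / latent_values

print(pixel_values, latent_values, compression)  # 786432 16384 48.0
```

Under these assumptions every denoising step touches about 48 times fewer values than it would in pixel space, and that saving is paid at every step of the iterative sampler.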

Start here

Text-to-Image · Google Research

Imagen: Why a Frozen Text Encoder Beats a Bigger Image Model

Google's Imagen set a new state-of-the-art COCO FID of 7.27 without training on COCO, and showed that scaling a frozen T5-XXL text encoder lifts fidelity and alignment more than scaling the diffusion model.

Text-to-Image · OpenAI

DALL·E 2 (unCLIP): Text-to-Image via CLIP Image Latents

DALL·E 2, called unCLIP in the paper, generates a CLIP image embedding from text with a prior, then renders it with a diffusion decoder — buying more diversity at almost no cost to photorealism or caption match.

Diffusion Models · CompVis

Latent Diffusion Models: The Architecture Behind Stable Diffusion

Latent diffusion runs denoising inside a pretrained autoencoder's compressed latent space instead of raw pixels, cutting training and inference cost while adding cross-attention conditioning for text and layout.

Foundational papers

Diffusion Models · UC Berkeley

DDPM: The Paper That Made Diffusion Models Actually Work

Denoising Diffusion Probabilistic Models (DDPM) trains a network to undo gradual Gaussian noise step by step, hitting FID 3.17 on CIFAR-10 and laying the groundwork that Stable Diffusion and DALL·E 2 later built on.

Diffusion Models · CompVis

Latent Diffusion Models: The Architecture Behind Stable Diffusion

Text-to-Image · OpenAI

DALL·E 2 (unCLIP): Text-to-Image via CLIP Image Latents

Text-to-Image · Google Research

Imagen: Why a Frozen Text Encoder Beats a Bigger Image Model

Recent papers

Brain Decoding · Independent Researcher

Brain-Diffuser: Natural Scene Reconstruction from fMRI

Brain-Diffuser turns natural scene reconstruction from fMRI signals into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Text-to-Image · Independent Researcher

DIRECT: 3D-Aware Object Insertion with Visual Proxies

DIRECT: 3D-Aware Object Insertion with Visual Proxies turns 3D-aware object insertion into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Brain Decoding · Independent Researcher

DreamDiffusion: EEG-to-Image Generation with Diffusion

DreamDiffusion turns EEG-to-image generation into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Biomolecular Modeling · Independent Researcher

Feynman-Kac Steering for Controllable Protein Design

Feynman-Kac steering turns controllable protein design with guided diffusion into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Diffusion Models · The Hong Kong Polytechnic University

GGT-100K: Generative Ground Truth for Image Restoration

GGT-100K: Generative Ground Truth for Image Restoration turns real-world image restoration data into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Brain Decoding · Independent Researcher

MinD-Vis: fMRI Vision Decoding with Latent Diffusion

MinD-Vis turns fMRI-to-image reconstruction with latent diffusion into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Speech Synthesis · Independent Researcher

MMAE: A Massive Benchmark for Audio Editing Models

MMAE: A Massive Benchmark for Audio Editing Models turns audio editing evaluation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Speech Synthesis · Zhejiang University

SwanSphere: Streaming Spatial Audio Generation From Video and Text

SwanSphere streams first-order ambisonic audio synced to video or text, emitting its first chunk in 0.21s while cutting Frechet Distance to 120.28 vs OmniAudio's 157.67. Quality without waiting for the whole clip.

AI for Science · Microsoft Research

MatterGen Explained: Diffusion for Inverse Materials Design

MatterGen is a diffusion model that generates inorganic crystals matching a target property — and the one example it actually synthesized, TaCr2O6, came within 20% of its 200 GPa stiffness goal.

Brain Decoding · Princeton University

MindEye: fMRI Image Reconstruction with Diffusion Priors

MindEye maps fMRI activity into CLIP-like spaces for retrieval and diffusion reconstruction, showing state-of-the-art retrieval and image reconstruction on NSD.

Speech Synthesis · Microsoft Research

NaturalSpeech 2: Diffusion TTS Beyond Codec LMs

NaturalSpeech 2 uses latent diffusion over neural-audio-codec vectors and scales to 44K hours of speech and singing, aiming for stronger zero-shot prosody than token LMs.

Video Generation · NVIDIA

SANA-Streaming: Real-time Video Editing at 24 FPS on One RTX 5090

SANA-Streaming edits 1280x704 video in real time at 24 end-to-end FPS on a single RTX 5090, with the diffusion transformer core hitting 58 FPS via a hybrid DiT and Cycle-Reverse Regularization.

Diffusion Models · NVIDIA

AnyFlow: Any-Step Video Diffusion via Flow Map Distillation

AnyFlow distills a video diffusion model that keeps improving as you add sampling steps, fixing the quality drop consistency-distilled models suffer at higher step counts. Tested on Wan2.1 from 1.3B to 14B.

Diffusion Models · Tsinghua University

Causal Forcing++: Few-Step Autoregressive Diffusion for Real-Time Video

Causal Forcing++ distills bidirectional video diffusion into a 1-2 step frame-wise autoregressive generator at 14.1 FPS, halves first-frame latency, and cuts few-step training cost ~4x (11,600 to 2,900 A800 GPU hours).

Multimodal Models · NVIDIA

Cosmos 3 Explained: NVIDIA's Omnimodal World Model for Physical AI

Cosmos 3 packs language, image, video, audio, and robot actions into one mixture-of-transformers model; NVIDIA reports it ranks first among open models on text-to-image, image-to-video, and RoboArena policy.

Diffusion Models · Stanford University

ControlNet: Adding Spatial Control to Diffusion Models

ControlNet bolts a trainable copy onto a frozen Stable Diffusion via zero-initialized convolutions, so an edge map, depth, pose, or segmentation steers the image — and it trains on under 50k examples.

Diffusion Models · UC Berkeley

DDPM: The Paper That Made Diffusion Models Actually Work

Diffusion Models · Alibaba Qwen Team

MIGA: Train-Free Infinite-Frame Generation for Consistent Long Videos

MIGA turns a fixed-length video diffusion model into a 1000+-frame generator with no training and constant memory, hitting 97.82 overall on VBench with VideoCrafter2 — about 2.8 points over FIFO-Diffusion.

Text-to-Image · University of Science and Technology of China

Flow-OPD: On-Policy Distillation Fixes Reward Conflict in Text-to-Image RL

Flow-OPD trains one specialist teacher per reward, then distills them on-policy into one SD 3.5 student — lifting GenEval 0.63 to 0.92 and OCR 0.59 to 0.94 without the aesthetic collapse of multi-reward GRPO.

World Models · NVIDIA

Gamma-World: A Multi-Agent World Model That Scales Past Two Players

Gamma-World is NVIDIA's video world model for multiplayer simulation that runs at 24 FPS and generalizes from two to four players with no retraining, cutting Solaris's FVD roughly in half.

Text-to-Image · Google Research

Imagen: Why a Frozen Text Encoder Beats a Bigger Image Model

World Models · Microsoft Research

Mirage: Latent Spatial Memory Makes Video World Models 10x Faster

Mirage stores a video world model's 3D memory inside diffusion latent space instead of an RGB point cloud, hitting state-of-the-art WorldScore (70.36) while running 10.57x faster and using 55x less GPU memory.

Text-to-Image · Microsoft Research

Lens: A 3.8B Text-to-Image Model Trained on ~19% of Z-Image's Compute

Microsoft's Lens is a 3.8B-parameter text-to-image diffusion model that matches 6B+ rivals while using about 19.3% of Z-Image's training compute, mostly by feeding it longer, denser captions.

Diffusion Models · NVIDIA

LongLive-2.0: NVFP4 4-bit Training and Inference for Long Video

LongLive-2.0 runs a 5B long-video model end to end in NVFP4 4-bit, hitting 45.7 FPS at 720p, 2.1x faster training and 1.84x faster inference, while VBench total drops only ~0.5 points from BF16.

Diffusion Models · Independent Researcher

Mean Mode Screaming: Stabilizing 1000-Layer Diffusion Transformers

Very deep DiTs collapse into a mean-dominated state the author calls Mean Mode Screaming. Splitting the residual into mean and centered paths fixes it, training a stable 1000-layer DiT to FID 2.77.

Text-to-Image · Alibaba Qwen Team

Qwen-Image-2.0: One Model for High-Fidelity Generation and Editing

Qwen-Image-2.0 from Alibaba unifies text-to-image generation and editing in one diffusion transformer, renders up to 1K-token instructions for slides and posters, and adds native 2K photorealism via a 16x VAE.

Multimodal Models · ByteDance

Representation Forcing: Unified Multimodal Models Without a VAE

Representation Forcing drops the frozen VAE from unified multimodal models. RF-Pixel predicts visual representation tokens before pixels, hits 0.84 GenEval, and lifts MMMU by 4.3 points over its VAE variant.

Diffusion Models · Alibaba Qwen Team

Rethinking Cross-Layer Information Routing in Diffusion Transformers

DAR replaces the residual add in diffusion transformers with timestep-adaptive aggregation of past sublayer outputs, cutting SiT-XL/2's ImageNet FID from 9.67 to 7.56 with 8.75x fewer iterations.

Diffusion Models · University of Science and Technology of China

Stream-R1: Reliability-Perplexity Aware Reward Distillation Explained

Stream-R1 reweights DMD losses by video reward scores and per-region perplexity instead of treating signals equally. Its 1.3B streaming model hits 84.40 VBench at 23.1 FPS, beating its 14B teacher's 84.26 for free.

Text-to-Image · Stability AI

Stable Diffusion 3: Rectified Flow and the MM-DiT Architecture

Stable Diffusion 3 trades U-Net diffusion for a rectified-flow transformer (MM-DiT) with separate image and text weights, fixing spelled-out text and prompt following while scaling predictably from 800M to 8B parameters.

Diffusion Models · University of Science and Technology of China

Stream-T1: Test-Time Scaling for Streaming Video Generation

Stream-T1 adds test-time search to streaming video generation without retraining, lifting VideoAlign motion quality from 0.350 to 0.629 at 5s and cutting the drift that wrecks 30-second clips.

Speech Synthesis · ByteDance

SwanVoice: Zero-Shot Speech Synthesis for Long Monologue and Dialogue

SwanVoice is a zero-shot TTS system that generates an entire 1-4 speaker conversation in one pass, keeping voice, mood, and prosody consistent across turns where turn-by-turn synthesis drifts — but content accuracy lags.

Text-to-Image · OpenAI

DALL·E 2 (unCLIP): Text-to-Image via CLIP Image Latents
