Diffusion Models
Generative models that synthesize data through iterative denoising.
Brain Decoding · Independent Researcher
Brain-Diffuser reconstructs natural scenes from fMRI signals. This entry covers its supporting evidence, method tradeoffs, and limits for practical use.
Text-to-Image · Independent Researcher
DIRECT: 3D-Aware Object Insertion with Visual Proxies makes 3D-aware object insertion testable. This entry covers concrete failure signals, benchmark limits, and builder takeaways.
Brain Decoding · Independent Researcher
DreamDiffusion generates images from EEG signals. This entry covers its supporting evidence, method tradeoffs, and limits for practical use.
Biomolecular Modeling · Independent Researcher
Feynman-Kac steering guides diffusion models toward controllable protein design. This entry covers its supporting evidence, method tradeoffs, and limits for practical use.
Diffusion Models · The Hong Kong Polytechnic University
GGT-100K: Generative Ground Truth for Image Restoration makes real-world image restoration data testable. This entry covers concrete failure signals, benchmark limits, and builder takeaways.
Brain Decoding · Independent Researcher
MinD-Vis reconstructs images from fMRI with latent diffusion. This entry covers its supporting evidence, method tradeoffs, and limits for practical use.
Speech Synthesis · Independent Researcher
MMAE: A Massive Benchmark for Audio Editing Models makes audio editing evaluation testable. This entry covers concrete failure signals, benchmark limits, and builder takeaways.
Speech Synthesis · Zhejiang University
SwanSphere streams first-order ambisonic audio synced to video or text, emitting its first chunk in 0.21 s while cutting Fréchet Distance to 120.28 vs OmniAudio's 157.67. Quality without waiting for the whole clip.
AI for Science · Microsoft Research
MatterGen is a diffusion model that generates inorganic crystals matching a target property. The one example it actually synthesized, TaCr2O6, came within 20% of its 200 GPa bulk modulus target.
Brain Decoding · Princeton University
MindEye maps fMRI activity into CLIP-like spaces for retrieval and diffusion reconstruction, showing state-of-the-art retrieval and image reconstruction on NSD.
Speech Synthesis · Microsoft Research
NaturalSpeech 2 uses latent diffusion over neural-audio-codec vectors and scales to 44K hours of speech and singing, aiming for stronger zero-shot prosody than token LMs.
Video Generation · NVIDIA
SANA-Streaming edits 1280x704 video in real time at 24 end-to-end FPS on a single RTX 5090, with the diffusion transformer core hitting 58 FPS via a hybrid DiT and Cycle-Reverse Regularization.
Diffusion Models · NVIDIA
AnyFlow distills a video diffusion model that keeps improving as you add sampling steps, fixing the quality drop consistency-distilled models suffer at higher step counts. Tested on Wan2.1 from 1.3B to 14B.
Diffusion Models · Tsinghua University
Causal Forcing++ distills bidirectional video diffusion into a 1-2 step frame-wise autoregressive generator at 14.1 FPS, halves first-frame latency, and cuts few-step training cost ~4x (11,600 to 2,900 A800 GPU hours).
Multimodal Models · NVIDIA
Cosmos 3 packs language, image, video, audio, and robot actions into one mixture-of-transformers model; NVIDIA reports it ranks first among open models on text-to-image, image-to-video, and RoboArena policy.
Diffusion Models · Stanford University
ControlNet bolts a trainable copy onto a frozen Stable Diffusion via zero-initialized convolutions, so an edge map, depth, pose, or segmentation steers the image — and it trains on under 50k examples.
Diffusion Models · UC Berkeley
Denoising Diffusion Probabilistic Models trains a network to undo gradual Gaussian noise step by step, hitting FID 3.17 on CIFAR-10 — and laying the groundwork that Stable Diffusion and DALL-E 2 later built on.
Diffusion Models · Alibaba Qwen Team
MIGA turns a fixed-length video diffusion model into a 1000+-frame generator with no training and constant memory, hitting 97.82 overall on VBench with VideoCrafter2 — about 2.8 points over FIFO-Diffusion.
Text-to-Image · University of Science and Technology of China
Flow-OPD trains one specialist teacher per reward, then distills them on-policy into one SD 3.5 student — lifting GenEval 0.63 to 0.92 and OCR 0.59 to 0.94 without the aesthetic collapse of multi-reward GRPO.
World Models · NVIDIA
Gamma-World is NVIDIA's video world model for multiplayer simulation that runs at 24 FPS and generalizes from two to four players with no retraining, cutting Solaris's FVD roughly in half.
Text-to-Image · Google Research
Google's Imagen set a then state-of-the-art zero-shot COCO FID of 7.27 without training on COCO, and showed that scaling a frozen T5-XXL text encoder lifts fidelity and alignment more than scaling the diffusion model.
World Models · Microsoft Research
Mirage stores a video world model's 3D memory inside diffusion latent space instead of an RGB point cloud, hitting state-of-the-art WorldScore (70.36) while running 10.57x faster and using 55x less GPU memory.
Text-to-Image · Microsoft Research
Microsoft's Lens is a 3.8B-parameter text-to-image diffusion model that matches 6B+ rivals while using about 19.3% of Z-Image's training compute, mostly by feeding it longer, denser captions.
Diffusion Models · NVIDIA
LongLive-2.0 runs a 5B long-video model end to end in NVFP4 4-bit, hitting 45.7 FPS at 720p, 2.1x faster training and 1.84x faster inference, while VBench total drops only ~0.5 points from BF16.
Diffusion Models · Independent Researcher
Very deep DiTs collapse into a mean-dominated state the author calls Mean Mode Screaming. Splitting the residual into mean and centered paths fixes it, training a stable 1000-layer DiT to FID 2.77.
Text-to-Image · Alibaba Qwen Team
Qwen-Image-2.0 from Alibaba unifies text-to-image generation and editing in one diffusion transformer, renders up to 1K-token instructions for slides and posters, and adds native 2K photorealism via a 16x VAE.
Multimodal Models · ByteDance
Representation Forcing drops the frozen VAE from unified multimodal models. RF-Pixel predicts visual representation tokens before pixels, hits 0.84 GenEval, and lifts MMMU by 4.3 points over its VAE variant.
Diffusion Models · Alibaba Qwen Team
DAR replaces the residual add in diffusion transformers with timestep-adaptive aggregation of past sublayer outputs, cutting SiT-XL/2's ImageNet FID from 9.67 to 7.56 with 8.75x fewer iterations.
Diffusion Models · University of Science and Technology of China
Stream-R1 reweights DMD losses by video reward scores and per-region perplexity instead of treating signals equally. Its 1.3B streaming model hits 84.40 VBench at 23.1 FPS, beating its 14B teacher's 84.26 for free.
Text-to-Image · Stability AI
Stable Diffusion 3 trades U-Net diffusion for a rectified-flow transformer (MM-DiT) with separate image and text weights, improving text rendering and prompt following while scaling predictably from 800M to 8B parameters.
Diffusion Models · University of Science and Technology of China
Stream-T1 adds test-time search to streaming video generation without retraining, lifting VideoAlign motion quality from 0.350 to 0.629 at 5s and cutting the drift that wrecks 30-second clips.
Speech Synthesis · ByteDance
SwanVoice is a zero-shot TTS system that generates an entire 1-4 speaker conversation in one pass, keeping voice, mood, and prosody consistent across turns where turn-by-turn synthesis drifts — but content accuracy lags.
Text-to-Image · OpenAI
DALL·E 2, called unCLIP in the paper, generates a CLIP image embedding from text with a prior, then renders it with a diffusion decoder — buying more diversity at almost no cost to photorealism or caption match.
Diffusion Models · CompVis
Latent diffusion runs denoising inside a pretrained autoencoder's compressed latent space instead of raw pixels, cutting training and inference cost while adding cross-attention conditioning for text and layout.
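The DDPM and latent diffusion entries above both rest on the same gradual Gaussian noising process, which a network is then trained to undo. A minimal sketch of that forward process in plain Python follows, assuming a linear noise schedule. The function names and the schedule endpoints (1e-4 to 0.02 over 1000 steps) are illustrative assumptions, not details taken from any entry on this page, and no trained network is involved.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (endpoints are assumed values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] is the product of (1 - beta[s]) for all s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t directly from x_0: x_t = sqrt(a)*x_0 + sqrt(1 - a)*eps."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal * x + sigma * e for x, e in zip(x0, eps)]
    return xt, eps


def recover_x0(xt, eps, t, alpha_bars):
    """Invert add_noise given the noise; a trained model would estimate eps."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [(x - sigma * e) / signal for x, e in zip(xt, eps)]


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    x0 = [0.5, -1.0, 2.0]  # stand-in for pixel or latent values
    for t in (0, 250, 999):
        xt, _ = add_noise(x0, t, alpha_bars, rng)
        print(t, round(alpha_bars[t], 6), [round(v, 3) for v in xt])
```

Under these assumptions almost no signal survives by the last step, so sampling can start from pure Gaussian noise. Latent diffusion applies the same recipe to an autoencoder's compressed latents rather than to raw pixels.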