Text-to-Image

Models that generate or edit images from natural-language prompts.

Text-to-Image · The Chinese University of Hong Kong

InterleaveThinker: Planner-Critic Agents for Interleaved Image Generation

InterleaveThinker adds planner and critic agents around frozen image generators, reaching 66.3 to 67.2 average on UEval and lifting FLUX.2-klein WISE from 0.47 to 0.73.

Text-to-Image · Independent Researcher

DIRECT: 3D-Aware Object Insertion with Visual Proxies

DIRECT turns 3D-aware object insertion into a checkable test, with concrete failure signals, benchmark limits, and takeaways for builders.

Brain Decoding · Independent Researcher

DreamDiffusion: EEG-to-Image Generation with Diffusion

DreamDiffusion generates images from EEG signals with a diffusion model. This summary covers the evidence behind it, the method's tradeoffs, and its limits for practical use.

Brain Decoding · Independent Researcher

MinD-Vis: fMRI Vision Decoding with Latent Diffusion

MinD-Vis reconstructs images from fMRI recordings with a latent diffusion model. This summary covers the evidence behind it, the method's tradeoffs, and its limits for practical use.

Text-to-Image · Alibaba Qwen Team

Qwen-Image-Flash: Beyond Objective Design in Few-Step Distillation

Qwen-Image-Flash distills Qwen-Image-2.0 to 4 sampling steps for both text-to-image and editing. The Alibaba Qwen team shows the training recipe — data, teachers, task mix — matters as much as the distillation objective.

Brain Decoding · MIT

BrainCause: Finding Causal Visual Representations in the Brain

BrainCause uses text-to-image generation plus an fMRI encoder to causally test what brain regions represent, cutting false-positive localizations from 73.4% to 23% across 260 visual concepts.

Diffusion Models · Stanford University

ControlNet: Adding Spatial Control to Diffusion Models

ControlNet attaches a trainable copy of Stable Diffusion's encoder blocks to the frozen model through zero-initialized convolutions, so an edge map, depth map, pose, or segmentation mask steers the image, and it can train on fewer than 50k examples.
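
Why zero initialization makes the attachment safe can be sketched in a few lines. This is a toy illustration under stated assumptions, not ControlNet's actual code: each "block" is a plain Python function on a list of floats, a per-channel linear layer stands in for a 1x1 convolution, and the names `make_zero_linear`, `frozen_block`, `trainable_copy`, and `controlnet_block` are hypothetical. Because the zero layers output zeros before any training, the combined block starts out reproducing the frozen model exactly.

```python
def make_zero_linear(dim):
    """Linear layer (stand-in for a 1x1 conv) with weights and bias set to zero."""
    return {"weight": [[0.0] * dim for _ in range(dim)], "bias": [0.0] * dim}


def linear(layer, x):
    """Apply y = W x + b to a list of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(layer["weight"], layer["bias"])]


def frozen_block(x):
    """Toy stand-in for a locked Stable Diffusion block (assumption)."""
    return [2.0 * xi + 1.0 for xi in x]


def trainable_copy(x):
    """Starts as an exact copy of the frozen block; only this copy would be trained."""
    return [2.0 * xi + 1.0 for xi in x]


def controlnet_block(x, condition, zero_in, zero_out):
    """y = F(x) + Z_out(F_copy(x + Z_in(condition)))."""
    injected = [xi + ci for xi, ci in zip(x, linear(zero_in, condition))]
    residual = linear(zero_out, trainable_copy(injected))
    return [f + r for f, r in zip(frozen_block(x), residual)]


x = [0.5, -1.0, 3.0]
edge_map = [1.0, 0.0, 1.0]          # hypothetical conditioning signal
zero_in = make_zero_linear(3)
zero_out = make_zero_linear(3)
out = controlnet_block(x, edge_map, zero_in, zero_out)
print(out == frozen_block(x))       # the condition has no effect before training
```

Once training moves the zero layers away from zero, the conditioning signal begins to contribute through the residual path while the frozen weights stay untouched.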

Multimodal Models · University of Illinois Urbana-Champaign

Crafter: A Multi-Agent Harness for Editable Scientific Figures

Crafter wraps an image model in five cooperating agents and scores 50.34 on PaperBanana-Bench vs 11.13 for the raw backbone — then CraftEditor turns the raster output into editable SVG you can actually fix.

Text-to-Image · University of Science and Technology of China

Flow-OPD: On-Policy Distillation Fixes Reward Conflict in Text-to-Image RL

Flow-OPD trains one specialist teacher per reward, then distills them on-policy into one SD 3.5 student — lifting GenEval 0.63 to 0.92 and OCR 0.59 to 0.94 without the aesthetic collapse of multi-reward GRPO.

Text-to-Image · Google Research

Imagen: Why a Frozen Text Encoder Beats a Bigger Image Model

Google's Imagen set a then state-of-the-art COCO FID of 7.27 without training on COCO, and showed that scaling the frozen text encoder (up to T5-XXL) lifts fidelity and alignment more than scaling the diffusion model.

Text-to-Image · Microsoft Research

Lens: A 3.8B Text-to-Image Model Trained on ~19% of Z-Image's Compute

Microsoft's Lens is a 3.8B-parameter text-to-image diffusion model that matches 6B+ rivals while using about 19.3% of Z-Image's training compute, mostly by training on longer, denser captions.

Diffusion Models · Independent Researcher

Mean Mode Screaming: Stabilizing 1000-Layer Diffusion Transformers

Very deep DiTs collapse into a mean-dominated state the author calls Mean Mode Screaming. Splitting the residual into mean and centered paths fixes it, training a stable 1000-layer DiT to FID 2.77.

Text-to-Image · Alibaba Qwen Team

Qwen-Image-2.0: One Model for High-Fidelity Generation and Editing

Qwen-Image-2.0 from Alibaba unifies text-to-image generation and editing in one diffusion transformer, renders up to 1K-token instructions for slides and posters, and adds native 2K photorealism via a 16x VAE.

Multimodal Models · ByteDance

Representation Forcing: Unified Multimodal Models Without a VAE

Representation Forcing drops the frozen VAE from unified multimodal models. RF-Pixel predicts visual representation tokens before pixels, hits 0.84 GenEval, and lifts MMMU by 4.3 points over its VAE variant.

Diffusion Models · Alibaba Qwen Team

Rethinking Cross-Layer Information Routing in Diffusion Transformers

DAR replaces the residual add in diffusion transformers with timestep-adaptive aggregation of past sublayer outputs, cutting SiT-XL/2's ImageNet FID from 9.67 to 7.56 with 8.75x fewer iterations.

Multimodal Models · SenseTime

SenseNova-U1: One Model for Multimodal Understanding and Generation

SenseNova-U1 puts image understanding and image generation in one network with shared attention. Its A3B variant hits 80.55 on MMMU and 0.91 on GenEval — a single model that reads and draws.

Text-to-Image · Stability AI

Stable Diffusion 3: Rectified Flow and the MM-DiT Architecture

Stable Diffusion 3 replaces U-Net diffusion with a rectified-flow transformer (MM-DiT) that keeps separate weights for image and text tokens, improving text rendering and prompt following while scaling predictably from 800M to 8B parameters.
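
The rectified-flow idea can be sketched with plain lists, assuming the straight-line interpolation z_t = (1 - t) * x0 + t * noise and a velocity target of noise - x0. The function names below are hypothetical, and the "oracle" velocity stands in for the trained network, which is an assumption made only so the example runs without a model.

```python
import random


def interpolate(x0, noise, t):
    """Point on the straight path between data (t=0) and noise (t=1)."""
    return [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]


def velocity_target(x0, noise):
    """Constant velocity along the straight path: d z_t / d t = noise - x0."""
    return [n - a for a, n in zip(x0, noise)]


def euler_sample(noise, velocity_fn, steps):
    """Integrate from t=1 (noise) down to t=0 (data) with fixed Euler steps."""
    z = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(z, t)
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z


rng = random.Random(0)
x0 = [0.2, -0.7, 1.5]                        # toy "image"
noise = [rng.gauss(0.0, 1.0) for _ in x0]

# Oracle velocity: a real model would predict this from (z, t) and the prompt.
oracle = lambda z, t: velocity_target(x0, noise)

sample = euler_sample(noise, oracle, steps=4)
max_error = max(abs(s - a) for s, a in zip(sample, x0))
print(max_error < 1e-9)
```

Because the path is a straight line, a handful of Euler steps is enough when the velocity is exact. With a learned velocity the paths are only approximately straight, so step count still matters in practice.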

Text-to-Image · OpenAI

DALL·E 2 (unCLIP): Text-to-Image via CLIP Image Latents

DALL·E 2, called unCLIP in the paper, generates a CLIP image embedding from text with a prior, then renders it with a diffusion decoder — buying more diversity at almost no cost to photorealism or caption match.

Diffusion Models · CompVis

Latent Diffusion Models: The Architecture Behind Stable Diffusion

Latent diffusion runs denoising inside a pretrained autoencoder's compressed latent space instead of raw pixels, cutting training and inference cost while adding cross-attention conditioning for text and layout.
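
A rough sense of the saving comes from counting values per sample. The numbers below are assumptions for illustration (a 512x512 RGB image, a downsampling factor of 8, and 4 latent channels, a commonly cited Stable Diffusion configuration), and the helper names are hypothetical.

```python
def tensor_size(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels


def latent_shape(height, width, factor, latent_channels):
    """Spatial dims shrink by the autoencoder's downsampling factor."""
    return (height // factor, width // factor, latent_channels)


pixel_values = tensor_size(512, 512, 3)
h, w, c = latent_shape(512, 512, factor=8, latent_channels=4)
latent_values = tensor_size(h, w, c)
ratio = pixel_values / latent_values

print((h, w, c), pixel_values, latent_values, ratio)
```

Under these assumptions the denoiser works on 48 times fewer values per image than it would in pixel space. That is only a count of tensor entries, not a measured speedup.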