Text-to-Image
Models that generate or edit images from natural-language prompts.
Text-to-Image · The Chinese University of Hong Kong
InterleaveThinker adds planner and critic agents around frozen image generators, reaching 66.3 to 67.2 average on UEval and lifting FLUX.2-klein WISE from 0.47 to 0.73.
Text-to-Image · Independent Researcher
DIRECT (3D-Aware Object Insertion with Visual Proxies) frames its task as a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Brain Decoding · Independent Researcher
DreamDiffusion turns EEG-to-image generation into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
Brain Decoding · Independent Researcher
MinD-Vis turns fMRI-to-image reconstruction with latent diffusion into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
Text-to-Image · Alibaba Qwen Team
Qwen-Image-Flash distills Qwen-Image-2.0 to 4 sampling steps for both text-to-image and editing. The Alibaba Qwen team shows the training recipe — data, teachers, task mix — matters as much as the distillation objective.
Brain Decoding · MIT
BrainCause uses text-to-image generation plus an fMRI encoder to causally test what brain regions represent, cutting false-positive localizations from 73.4% to 23% across 260 visual concepts.
Diffusion Models · Stanford University
ControlNet attaches a trainable copy of Stable Diffusion's encoder blocks to the frozen model through zero-initialized convolutions, so an edge map, depth map, pose, or segmentation map can steer the image; training remains robust with fewer than 50k examples.
Multimodal Models · University of Illinois Urbana-Champaign
Crafter wraps an image model in five cooperating agents and scores 50.34 on PaperBanana-Bench vs 11.13 for the raw backbone — then CraftEditor turns the raster output into editable SVG you can actually fix.
Text-to-Image · University of Science and Technology of China
Flow-OPD trains one specialist teacher per reward, then distills them on-policy into one SD 3.5 student — lifting GenEval 0.63 to 0.92 and OCR 0.59 to 0.94 without the aesthetic collapse of multi-reward GRPO.
Text-to-Image · Google Research
Google's Imagen set a state-of-the-art zero-shot COCO FID of 7.27 without ever training on COCO, and showed that scaling the frozen T5 text encoder (up to T5-XXL) lifts fidelity and alignment more than scaling the diffusion model.
Text-to-Image · Microsoft Research
Microsoft's Lens is a 3.8B-parameter text-to-image diffusion model that matches 6B+ rivals while using about 19.3% of Z-Image's training compute, largely by training on longer, denser captions.
Diffusion Models · Independent Researcher
Very deep DiTs collapse into a mean-dominated state the author calls Mean Mode Screaming. Splitting the residual into mean and centered paths fixes it, training a stable 1000-layer DiT to FID 2.77.
Text-to-Image · Alibaba Qwen Team
Qwen-Image-2.0 from Alibaba unifies text-to-image generation and editing in one diffusion transformer, renders up to 1K-token instructions for slides and posters, and adds native 2K photorealism via a 16x VAE.
Multimodal Models · ByteDance
Representation Forcing drops the frozen VAE from unified multimodal models. RF-Pixel predicts visual representation tokens before pixels, hits 0.84 GenEval, and lifts MMMU by 4.3 points over its VAE variant.
Diffusion Models · Alibaba Qwen Team
DAR replaces the residual add in diffusion transformers with timestep-adaptive aggregation of past sublayer outputs, cutting SiT-XL/2's ImageNet FID from 9.67 to 7.56 with 8.75x fewer iterations.
Multimodal Models · SenseTime
SenseNova-U1 puts image understanding and image generation in one network with shared attention. Its A3B variant hits 80.55 on MMMU and 0.91 on GenEval — a single model that reads and draws.
Text-to-Image · Stability AI
Stable Diffusion 3 trades U-Net diffusion for a rectified-flow transformer (MM-DiT) with separate image and text weights, improving text rendering and prompt following while scaling predictably from 800M to 8B parameters.
Text-to-Image · OpenAI
DALL·E 2, called unCLIP in the paper, generates a CLIP image embedding from text with a prior, then renders it with a diffusion decoder — buying more diversity at almost no cost to photorealism or caption match.
Diffusion Models · CompVis
Latent diffusion runs denoising inside a pretrained autoencoder's compressed latent space instead of raw pixels, cutting training and inference cost while adding cross-attention conditioning for text and layout.