Multimodal Models
Foundation models that combine language with images, audio, video, or other signals.
Text-to-Image · The Chinese University of Hong Kong
InterleaveThinker adds planner and critic agents around frozen image generators, reaching 66.3 to 67.2 average on UEval and lifting FLUX.2-klein WISE from 0.47 to 0.73.
Multimodal Models · Kuaishou Technology
Kwai Keye-VL-2.0 is a 30B-A3B open MoE multimodal model with 256K context, strong long-video scores, and 62.0 on SWE-bench Verified.
AI Agents · NVIDIA
SpatialClaw replaces rigid tool calls with a persistent Python kernel and reaches 59.9% average accuracy across 20 spatial reasoning benchmarks, +11.2 points over the recent spatial-agent baseline.
Video Generation · Nanjing University
"CoVEBench: Can Video Editors Follow Complex Instructions?" turns complex instruction following for video editing into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
World Models · Independent Researcher
Function2Scene: 3D Indoor Layout from Functional Specs turns functional 3D scene layout into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Multimodal Models · Peking University
A survey that reframes long-video MLLMs as three abilities (watch, remember, reason), comparing against 11 prior surveys and organizing 100+ methods plus 5 application domains.
Speech Synthesis · Independent Researcher
A Broad Benchmark for Long-Form Speech Generation turns long-form speech generation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Speech Synthesis · Independent Researcher
MMAE: A Massive Benchmark for Audio Editing Models turns audio editing evaluation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Multimodal Models · Shanghai AI Laboratory
OVO-S-Bench: Streaming Spatial Intelligence in MLLMs turns streaming spatial intelligence into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
AI Agents · Independent Researcher
SpatialWorld: Interactive Spatial Reasoning for Agents turns interactive spatial reasoning into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Speech Synthesis · Zhejiang University
SwanSphere streams first-order ambisonic audio synced to video or text, emitting its first chunk in 0.21s while cutting Frechet Distance to 120.28 vs OmniAudio's 157.67. Quality without waiting for the whole clip.
Agent Memory · ByteDance
TaskMem trains a multimodal agent to write its own memory with RL, lifting streaming-video QA accuracy to 67.9% on VideoMME and 45.4% on EgoLife, gains of 6.3 and 7.0 points over the Qwen3-VL-30B baseline.
Multimodal Models · Independent Researcher
VideoKR: Knowledge-Intensive Video Understanding turns knowledge and reasoning in video understanding into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Video Generation · Kuaishou Technology
Instead of asking a video model to reason directly, a VLM grades its in-progress frames and fine-tunes a per-instance LoRA. The trick lifts RULER-Bench from 46.4 to 68.2.
World Models · University of Macau
PF-OPSD teaches a Qwen3.5-9B MLLM to decide when to simulate the future with a video world model, verify the rollout, and fold it into its answer, lifting accuracy by 10.6 and 10.9 points on two new QA benchmarks.
Multimodal Models · The Chinese University of Hong Kong
X-Stream is the first benchmark for watching several live video streams at once. The best model, Gemini 3 Pro, hits 49.6% versus a 91.84% human baseline, and proactive ability collapses below 21%.
Multimodal Models · Meta AI
VLM3 shows a standard 4B vision-language model matches expert 3D models — 0.904 depth accuracy, 94.0% camera-pose AUC, 91.35% object-3D accuracy — with no 3D-specific architecture, only focal unification and scaling.
Multimodal Models · Skywork AI
The Audio Interaction Model runs a perceive-decide-respond loop so an audio LLM listens, decides if and when to reply, and answers on the fly — trained on StreamAudio-2M and competitive across 8 benchmarks.
Multimodal Models · Shanghai AI Laboratory
CiteVQA makes document QA models return bounding-box citations with every answer. The top model scores 76.0 Strict Attributed Accuracy; the best open model just 22.5 — most answer right but cite the wrong region.
Multimodal Models · NVIDIA
Cosmos 3 packs language, image, video, audio, and robot actions into one mixture-of-transformers model; NVIDIA reports it ranks first among open models on text-to-image, image-to-video, and RoboArena policy.
Multimodal Models · University of Illinois Urbana-Champaign
Crafter wraps an image model in five cooperating agents and scores 50.34 on PaperBanana-Bench vs 11.13 for the raw backbone — then CraftEditor turns the raster output into editable SVG you can actually fix.
Multimodal Models · Google DeepMind
Flamingo bolts trainable cross-attention onto a frozen vision encoder and a frozen language model, then learns new image and video tasks from a handful of in-context examples — no fine-tuning.
Text-to-Image · Google Research
Google's Imagen set a then state-of-the-art COCO FID of 7.27 without training on COCO, and showed that scaling the frozen T5 text encoder (up to T5-XXL) lifts fidelity and alignment more than scaling the diffusion model.
Multimodal Models · University of Washington
Imaginative Perception Tokens (IPT) make a VLM render a new viewpoint instead of reasoning in text — lifting multiview counting 3.4%, rivaling closed models on path tracing, while text chain-of-thought sometimes hurts.
Multimodal Models · Microsoft Research
LLaVA bolts a CLIP vision encoder onto a Vicuna LLM with one linear projection, then trains on GPT-4-generated image instructions, reaching an 85.1% relative score against GPT-4 on a synthetic instruction-following set and 92.53% on ScienceQA when combined with GPT-4.
Multimodal Models · NVIDIA
LocateAnything emits each bounding box in a single decoding step instead of digit-by-digit, hitting 12.7 boxes/sec in hybrid mode — about 2.5x faster than Rex-Omni-3B — while leading on COCO and LVIS at the same 3B size.
AI Agents · Shanghai Jiao Tong University
MMSkills packages textual procedures, runtime state cards, and keyframes into reusable skills for visual agents, lifting Qwen3-VL-235B from 21.34% to 39.17% on OSWorld and a small 8B model from 10.78% to 25.40%.
Multimodal Models · NVIDIA
MulTaBench is a 40-dataset benchmark (20 image-tabular, 20 text-tabular) where each task needs both the table and the image or text. Its finding: tuning embeddings to the target beats frozen embeddings on every learner.
Multimodal Models · Sea AI Lab
OpenSearch-VL open-sources data, code, and weights for vision-language search agents that call real search, OCR, and image tools — its 30B-A3B model lifts seven benchmarks by 13.8 points on average over Qwen3-VL.
Multimodal Models · The University of Tokyo
MM-OCEAN tests whether multimodal LLMs justify Big Five personality ratings with real video evidence. Across 27 models, 51.3% of correct ratings rest on wrong cues, and the best grounds only 33.5% fully.
Vision-Language-Action · Shanghai AI Laboratory
PhysBrain 1.0 compiles human egocentric video into physics QA to pretrain a VLM, then adapts it to robot control — lifting Franka grasping from 47.1% to 63.3% over 50 trials versus a pi0.5 baseline.
Text-to-Image · Alibaba Qwen Team
Qwen-Image-2.0 from Alibaba unifies text-to-image generation and editing in one diffusion transformer, renders up to 1K-token instructions for slides and posters, and adds native 2K photorealism via a 16x VAE.
Vision-Language-Action · Alibaba Qwen Team
Qwen-VLA extends Qwen's vision-language stack with a DiT action decoder and embodiment-aware prompts to run manipulation, navigation, and trajectory prediction in one model — 97.9% on LIBERO and 69.0% OSR on R2R.
Multimodal Models · ByteDance
Representation Forcing drops the frozen VAE from unified multimodal models. RF-Pixel predicts visual representation tokens before pixels, hits 0.84 GenEval, and lifts MMMU by 4.3 points over its VAE variant.
Vision-Language-Action · RLWRLD
RLDX-1, from RLWRLD and KAIST, adds motion, memory, and tactile streams to a Qwen3-VL backbone. It catches fast-moving objects 87.5% of the time vs 29.2% for pi0.5, and beats GR00T N1.6 on LIBERO-Plus 86.7% to 72.6%.
Multimodal Models · SenseTime
SenseNova-U1 puts image understanding and image generation in one network with shared attention. Its A3B variant hits 80.55 on MMMU and 0.91 on GenEval — a single model that reads and draws.
AI Agents · Peking University
Video2GUI turns 500M unlabeled tutorial videos into WildGUI, a set of 12M grounded GUI interaction trajectories across 1,500+ apps and sites; pretraining Qwen2.5-VL and MiMo-VL on it lifts GUI benchmarks by 5-20%.
Multimodal Models · University of California, Davis
Top video models look like they hear audio but really guess it from the picture. This paper's THUD probes catch the cheat, and a 10K-sample fix lifts audio grounding by 28 points.
World Models · Fudan University
WBench scores interactive video world models on five axes — quality, setting, interaction, consistency, physics — across 289 cases and 1,058 turns, and finds no single model wins on all five.
Speech Recognition · OpenAI
OpenAI's Whisper trains a single sequence-to-sequence model on 680,000 hours of web audio. Zero-shot, with no fine-tuning, it is often competitive with fully supervised systems, and it also handles translation and language ID.
Multimodal Models · OpenAI
CLIP trains paired image and text encoders on 400 million internet image-text pairs, then matches the original ResNet-50's ImageNet accuracy zero-shot — without using any of its 1.28M labeled examples.
Text-to-Image · OpenAI
DALL·E 2, called unCLIP in the paper, generates a CLIP image embedding from text with a prior, then renders it with a diffusion decoder — buying more diversity at almost no cost to photorealism or caption match.
Long Context · Google DeepMind
Gemini 1.5 Pro and Flash keep >99% retrieval recall up to at least 10M tokens of text, video, and audio — and Pro matches Gemini 1.0 Ultra with far less compute.
Multimodal Models · OpenAI
OpenAI's GPT-4 report is a measurement document, not a recipe. It hits human-level scores on professional and academic exams — bar exam ~top 10% — yet discloses no architecture, data, or compute.