Multimodal Models

Foundation models that combine language with images, audio, video, or other signals.

Text-to-Image · The Chinese University of Hong Kong

InterleaveThinker: Planner-Critic Agents for Interleaved Image Generation

InterleaveThinker adds planner and critic agents around frozen image generators, reaching average scores of 66.3 to 67.2 on UEval and lifting FLUX.2-klein's WISE score from 0.47 to 0.73.

Multimodal Models · Kuaishou Technology

Kwai Keye-VL-2.0: Open Long-Video Multimodal Agent Model

Kwai Keye-VL-2.0 is a 30B-A3B open MoE multimodal model with 256K context, strong long-video scores, and 62.0 on SWE-bench Verified.

AI Agents · NVIDIA

SpatialClaw: Why VLM Spatial Agents Need a Python Workspace

SpatialClaw replaces rigid tool calls with a persistent Python kernel and reaches 59.9% average accuracy across 20 spatial reasoning benchmarks, +11.2 points over the recent spatial-agent baseline.

Video Generation · Nanjing University

CoVEBench: Can Video Editors Follow Complex Instructions?

CoVEBench turns complex instruction following for video editing into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

World Models · Independent Researcher

Function2Scene: 3D Indoor Layout from Functional Specs

Function2Scene turns functional 3D scene layout into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Multimodal Models · Peking University

Watch, Remember, Reason: A Human-View Map of Video MLLMs

A survey that reframes long-video MLLMs as three abilities (watch, remember, reason), comparing against 11 prior surveys and organizing 100+ methods plus 5 application domains.

Speech Synthesis · Independent Researcher

A Broad Benchmark for Long-Form Speech Generation

This benchmark turns long-form speech generation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Speech Synthesis · Independent Researcher

MMAE: A Massive Benchmark for Audio Editing Models

MMAE turns audio editing evaluation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Multimodal Models · Shanghai AI Laboratory

OVO-S-Bench: Streaming Spatial Intelligence in MLLMs

OVO-S-Bench turns streaming spatial intelligence into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

SpatialWorld: Interactive Spatial Reasoning for Agents

SpatialWorld turns interactive spatial reasoning into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Speech Synthesis · Zhejiang University

SwanSphere: Streaming Spatial Audio Generation From Video and Text

SwanSphere streams first-order ambisonic audio synced to video or text, emitting its first chunk in 0.21 s while cutting Fréchet Distance to 120.28 versus OmniAudio's 157.67. Quality arrives without waiting for the whole clip.

Agent Memory · ByteDance

TaskMem: Teaching a Video Agent What Is Worth Remembering

TaskMem trains a multimodal agent to write its own memory with RL, lifting streaming-video QA accuracy to 67.9% on VideoMME and 45.4% on EgoLife, gains of 6.3 and 7.0 points over the Qwen3-VL-30B baseline.

Multimodal Models · Independent Researcher

VideoKR: Knowledge-Intensive Video Understanding

VideoKR turns knowledge and reasoning in video understanding into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Video Generation · Kuaishou Technology

VLM Teachers Score Video-Model Reasoning at Test Time

Instead of asking a video model to reason directly, a VLM grades its in-progress frames and fine-tunes a per-instance LoRA. The trick lifts RULER-Bench from 46.4 to 68.2.

World Models · University of Macau

PF-OPSD: When Should an MLLM Trust a World Model's Video?

PF-OPSD teaches a Qwen3.5-9B MLLM to decide when to simulate the future with a video world model, verify the rollout, and fold it into its answer, lifting accuracy +10.6 and +10.9 points on two new QA benchmarks.

Multimodal Models · The Chinese University of Hong Kong

X-Stream: Why MLLMs Score ~50% on Multi-Stream Video

X-Stream is the first benchmark for watching several live video streams at once. The best model, Gemini 3 Pro, hits 49.6% versus a 91.84% human baseline, and proactive ability collapses below 21%.

Multimodal Models · Meta AI

VLM3: Vision Language Models Are Native 3D Learners

VLM3 shows a standard 4B vision-language model matches expert 3D models — 0.904 depth accuracy, 94.0% camera-pose AUC, 91.35% object-3D accuracy — with no 3D-specific architecture, only focal unification and scaling.

Multimodal Models · Skywork AI

Audio Interaction Model: A Streaming Audio LLM That Decides When to Speak

The Audio Interaction Model runs a perceive-decide-respond loop so an audio LLM listens, decides if and when to reply, and answers on the fly — trained on StreamAudio-2M and competitive across 8 benchmarks.

Multimodal Models · Shanghai AI Laboratory

CiteVQA: A Benchmark That Catches Document AI Citing the Wrong Evidence

CiteVQA makes document QA models return bounding-box citations with every answer. The top model scores 76.0 Strict Attributed Accuracy; the best open model just 22.5 — most answer right but cite the wrong region.

Multimodal Models · NVIDIA

Cosmos 3 Explained: NVIDIA's Omnimodal World Model for Physical AI

Cosmos 3 packs language, image, video, audio, and robot actions into one mixture-of-transformers model; NVIDIA reports it ranks first among open models on text-to-image, image-to-video, and RoboArena policy.

Multimodal Models · University of Illinois Urbana-Champaign

Crafter: A Multi-Agent Harness for Editable Scientific Figures

Crafter wraps an image model in five cooperating agents and scores 50.34 on PaperBanana-Bench vs 11.13 for the raw backbone — then CraftEditor turns the raster output into editable SVG you can actually fix.

Multimodal Models · Google DeepMind

Flamingo: a Visual Language Model for Few-Shot Learning

Flamingo bolts trainable cross-attention onto a frozen vision encoder and a frozen language model, then learns new image and video tasks from a handful of in-context examples — no fine-tuning.

Text-to-Image · Google Research

Imagen: Why a Frozen Text Encoder Beats a Bigger Image Model

Google's Imagen hit a new state-of-the-art COCO FID of 7.27 without training on COCO, and showed that scaling a frozen T5-XXL text encoder lifts fidelity and alignment more than scaling the diffusion model.

Multimodal Models · University of Washington

Imaginative Perception Tokens: Letting VLMs Picture Space, Not Describe It

Imaginative Perception Tokens (IPT) make a VLM render a new viewpoint instead of reasoning in text, lifting multiview counting by 3.4% and rivaling closed models on path tracing, while text chain-of-thought sometimes hurts.

Multimodal Models · Microsoft Research

LLaVA Explained: Visual Instruction Tuning for a Vision-Language Chat Model

LLaVA bolts a CLIP vision encoder onto a Vicuna LLM with one linear projection, then trains on GPT-4-generated image instructions, reaching an 85.1% relative score compared with GPT-4 on instruction following and 92.53% accuracy on ScienceQA.

Multimodal Models · NVIDIA

LocateAnything: Parallel Box Decoding for Faster Vision Grounding

LocateAnything emits each bounding box in a single decoding step instead of digit-by-digit, hitting 12.7 boxes/sec in hybrid mode — about 2.5x faster than Rex-Omni-3B — while leading on COCO and LVIS at the same 3B size.

AI Agents · Shanghai Jiao Tong University

MMSkills: Multimodal Skill Packages for General Visual Agents

MMSkills packages textual procedures, runtime state cards, and keyframes into reusable skills for visual agents, lifting Qwen3-VL-235B from 21.34% to 39.17% on OSWorld and a small 8B model from 10.78% to 25.40%.

Multimodal Models · NVIDIA

MulTaBench: A 40-Dataset Benchmark for Multimodal Tabular Learning

MulTaBench is a 40-dataset benchmark (20 image-tabular, 20 text-tabular) where each task needs both the table and the image or text. Its finding: tuning embeddings to the target beats frozen embeddings on every learner.

Multimodal Models · Sea AI Lab

OpenSearch-VL: An Open Recipe for Multimodal Search Agents

OpenSearch-VL open-sources data, code, and weights for vision-language search agents that call real search, OCR, and image tools — its 30B-A3B model lifts seven benchmarks by 13.8 points on average over Qwen3-VL.

Multimodal Models · The University of Tokyo

Perception or Prejudice: Can MLLMs Ground Personality in Real Evidence?

MM-OCEAN tests whether multimodal LLMs justify Big Five personality ratings with real video evidence. Across 27 models, 51.3% of correct ratings rest on wrong cues, and the best grounds only 33.5% fully.

Vision-Language-Action · Shanghai AI Laboratory

PhysBrain 1.0: Turning Human Video into Physical Priors for Robots

PhysBrain 1.0 compiles human egocentric video into physics QA to pretrain a VLM, then adapts it to robot control — lifting Franka grasping from 47.1% to 63.3% over 50 trials versus a pi0.5 baseline.

Text-to-Image · Alibaba Qwen Team

Qwen-Image-2.0: One Model for High-Fidelity Generation and Editing

Qwen-Image-2.0 from Alibaba unifies text-to-image generation and editing in one diffusion transformer, renders up to 1K-token instructions for slides and posters, and adds native 2K photorealism via a 16x VAE.

Vision-Language-Action · Alibaba Qwen Team

Qwen-VLA: One Model for Manipulation, Navigation, and Trajectories

Qwen-VLA extends Qwen's vision-language stack with a DiT action decoder and embodiment-aware prompts to run manipulation, navigation, and trajectory prediction in one model — 97.9% on LIBERO and 69.0% OSR on R2R.

Multimodal Models · ByteDance

Representation Forcing: Unified Multimodal Models Without a VAE

Representation Forcing drops the frozen VAE from unified multimodal models. RF-Pixel predicts visual representation tokens before pixels, hits 0.84 GenEval, and lifts MMMU by 4.3 points over its VAE variant.

Vision-Language-Action · RLWRLD

RLDX-1: A Multi-Stream Vision-Language-Action Model for Dexterous Robots

RLDX-1, from RLWRLD and KAIST, adds motion, memory and tactile streams to a Qwen3-VL backbone. It catches fast-moving objects 87.5% of the time vs 29.2% for pi0.5, and beats GR00T N1.6 on LIBERO-Plus 86.7% to 72.6%.

Multimodal Models · SenseTime

SenseNova-U1: One Model for Multimodal Understanding and Generation

SenseNova-U1 puts image understanding and image generation in one network with shared attention. Its A3B variant hits 80.55 on MMMU and 0.91 on GenEval — a single model that reads and draws.

AI Agents · Peking University

Video2GUI: Mining 12M GUI Agent Trajectories From Internet Videos

Video2GUI turns 500M unlabeled tutorial videos into WildGUI — 12M grounded GUI interaction trajectories across 1,500+ apps and sites — and pretraining Qwen2.5-VL and Mimo-VL on it lifts GUI benchmarks by 5-20%.

Multimodal Models · University of California, Davis

When Vision Speaks for Sound: The Audio-Visual Clever Hans Effect

Top video models look like they hear audio but really guess it from the picture. This paper's THUD probes catch the cheat, and a 10K-sample fix lifts audio grounding by 28 points.

World Models · Fudan University

WBench: A Multi-turn Benchmark for Interactive Video World Models

WBench scores interactive video world models on five axes — quality, setting, interaction, consistency, physics — across 289 cases and 1,058 turns, and finds no single model wins on all five.

Speech Recognition · OpenAI

Whisper: 680,000 Hours of Weak Supervision for Robust ASR

OpenAI's Whisper trains a single sequence-to-sequence model on 680,000 hours of web audio. Zero-shot, with no fine-tuning, it is often competitive with fully supervised systems, and it also handles translation and language ID.

Multimodal Models · OpenAI

CLIP: Learning Visual Models From Natural Language Supervision

CLIP trains paired image and text encoders on 400 million internet image-text pairs, then matches the original ResNet-50's ImageNet accuracy zero-shot, without using any of the 1.28M labeled ImageNet examples that ResNet-50 was trained on.

Text-to-Image · OpenAI

DALL·E 2 (unCLIP): Text-to-Image via CLIP Image Latents

DALL·E 2, called unCLIP in the paper, generates a CLIP image embedding from text with a prior, then renders it with a diffusion decoder — buying more diversity at almost no cost to photorealism or caption match.

Long Context · Google DeepMind

Gemini 1.5: Near-Perfect Recall Across Millions of Tokens

Gemini 1.5 Pro and Flash keep >99% retrieval recall up to at least 10M tokens of text, video, and audio — and Pro matches Gemini 1.0 Ultra with far less compute.

Multimodal Models · OpenAI

GPT-4 Technical Report Explained: Benchmarks, Not Blueprints

OpenAI's GPT-4 report is a measurement document, not a recipe. It hits human-level scores on professional and academic exams — bar exam ~top 10% — yet discloses no architecture, data, or compute.