Latest

Latest paper explainers

The latest AI research paper explainers across language models, embodied intelligence, and frontier science.

Agent Memory · National University of Singapore

EvoArena: Why Agent Memory Must Track Environment Changes

EvoArena turns static agent tasks into evolving chains and finds current agents average only 39.6% accuracy; EvoMem adds patch memory and improves chain-level accuracy by 3.7 points.

Text-to-Image · The Chinese University of Hong Kong

InterleaveThinker: Planner-Critic Agents for Interleaved Image Generation

InterleaveThinker adds planner and critic agents around frozen image generators, reaching 66.3 to 67.2 average on UEval and lifting FLUX.2-klein WISE from 0.47 to 0.73.

Vision-Language-Action · Zhejiang University

LabVLA: A VLA Model for Scientific Lab Robots

LabVLA trains a Qwen3-VL-4B backbone plus DiT action expert on laboratory workflows and reports 71.1% ID and 70.0% OOD success on LabUtopia.

Theorem Proving · MiniMax AI

MaxProof: How MiniMax M3 Reaches Gold-Level Proof Scores

MaxProof turns MiniMax-M3 into a generator, verifier, fixer, and ranker; with population-level test-time scaling it reports 35/42 on IMO 2025 and 36/42 on USAMO 2026.

Long Context · MiniMax AI

MiniMax Sparse Attention: 1M Context Without Dense Attention

MiniMax Sparse Attention keeps only 2,048 selected KV tokens per query group and reports 28.4x lower attention FLOPs plus 14.2x prefill speedup at 1M context.

AI Agents · NVIDIA

SpatialClaw: Why VLM Spatial Agents Need a Python Workspace

SpatialClaw replaces rigid tool calls with a persistent Python kernel and reaches 59.9% average accuracy across 20 spatial reasoning benchmarks, +11.2 points over the recent spatial-agent baseline.

AI Agents · Google DeepMind

From AGI to ASI: DeepMind's Map of Superintelligence Pathways

Google DeepMind's report lays out four non-exclusive paths from AGI to ASI and treats each bottleneck, from data walls to regulation, as an open research question.

AI Agents · Renmin University of China

Arbor: Autonomous Research With Hypothesis Trees

Arbor stores research attempts in a persistent hypothesis tree, then admits changes only through held-out evaluation. It reports best held-out results on six AO tasks and 86.36% Any Medal on MLE-Bench Lite.

AI Agents · TokenRhythm Technologies

Claw-SWE-Bench: Why Coding Agent Harnesses Matter

Claw-SWE-Bench evaluates OpenClaw-style coding-agent harnesses on 350 GitHub issue tasks. OpenClaw jumps from 19.1% to 73.4% Pass@1 with a full adapter.

Mixture of Experts · Renmin University of China

Manifold Power Iteration: A Better Router for MoE Models

MPI redesigns MoE routers by aligning router rows with expert weight directions. On 11B MoE, average benchmark accuracy rises from 40.92 to 42.76 with only 0.2% training slowdown.

Multimodal Models · Kuaishou Technology

Kwai Keye-VL-2.0: Open Long-Video Multimodal Agent Model

Kwai Keye-VL-2.0 is a 30B-A3B open MoE multimodal model with 256K context, strong long-video scores, and 62.0 on SWE-bench Verified.

World Models · Alibaba Qwen Team

ABot-Earth 0.5: Generating 3D Cities From Satellite Images

ABot-Earth 0.5 uses satellite imagery to generate 3D Gaussian Splatting city scenes, reporting under 10 minutes per square kilometer and FID 16.1.

World Models · JD.com (Joy Future Academy)

Echo-Memory: Which Memory Lets a World Model Remember a Room?

When a camera revisits an old spot, block-wise state-space recurrence scored 69.0 open-domain VLM consistency vs 12.25 for the no-memory baseline; aggressive compression and spatial summaries mostly collapsed.

AI Agents · Independent Researcher

SpatialWorld: Interactive Spatial Reasoning for Agents

SpatialWorld: Interactive Spatial Reasoning for Agents turns interactive spatial reasoning into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Long Context · Tencent

FlashMemory-DeepSeek-V4: Cutting KV Cache to 13.5% for 500K Context

FlashMemory-DeepSeek-V4 keeps only the KV chunks a neural indexer predicts you will need, shrinking physical KV cache to 13.5% of full-context decoding while accuracy stays flat or edges up ~0.6%.

World Models · Microsoft Research

Mirage: Latent Spatial Memory Makes Video World Models 10x Faster

Mirage stores a video world model's 3D memory inside diffusion latent space instead of an RGB point cloud, hitting state-of-the-art WorldScore (70.36) while running 10.57x faster and using 55x less GPU memory.

Video Generation · Nanjing University

CoVEBench: Can Video Editors Follow Complex Instructions?

CoVEBench: Can Video Editors Follow Complex Instructions? turns complex instruction following for video editing into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

World Models · Independent Researcher

AnchorWorld: Egocentric World Simulation for Embodied AI

AnchorWorld: Egocentric World Simulation for Embodied AI turns egocentric world simulation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Multimodal Models · Peking University

Watch, Remember, Reason: A Human-View Map of Video MLLMs

A survey that reframes long-video MLLMs as three abilities (watch, remember, reason), comparing against 11 prior surveys and organizing 100+ methods plus 5 application domains.

Speech Synthesis · Independent Researcher

MMAE: A Massive Benchmark for Audio Editing Models

MMAE: A Massive Benchmark for Audio Editing Models turns audio editing evaluation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Shanghai Jiao Tong University

SWE-Explore: Can Coding Agents Find the Right Code?

SWE-Explore isolates the repo-exploration stage of coding agents over 848 issues. Agentic explorers crush BM25 (HitFile 0.65 vs 0.08), but line-level recall stalls at 0.15-0.20, and that gap is what limits repairs.

Fine-Tuning & Adaptation · HKUST

On the Geometry of On-Policy Distillation: A Distinct Update Regime

On-policy distillation does not sit between SFT and RLVR — it carves its own geometry. Its updates touch fewer weights, avoid principal directions, and lock into a narrow low-dimensional subspace early in training.

Text Embeddings · Renmin University of China

EmbFilter: Turning an LLM's UnEmbedding Matrix Into a Feature Lens

EmbFilter reads the LLM unembedding matrix as a lens, strips the subspace that ties text embeddings to high-frequency junk tokens, and lifts zero-shot retrieval while shrinking dimensions.

AI Agents · Independent Researcher

AdaPlanBench: Testing Adaptive Planning in LLM Agents

AdaPlanBench: Testing Adaptive Planning in LLM Agents turns adaptive planning under constraints into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

ArcANE: Measuring When Role-Playing Agents Break Character

ArcANE: Measuring When Role-Playing Agents Break Character turns role-playing language agent reliability into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Text-to-Image · Independent Researcher

DIRECT: 3D-Aware Object Insertion with Visual Proxies

DIRECT: 3D-Aware Object Insertion with Visual Proxies turns 3D-aware object insertion into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Shanghai Jiao Tong University

LatentSkill: Bake Agent Skills Into LoRA Weights, Not the Prompt

A hypernetwork compiles a textual skill into a LoRA adapter in one forward pass. On ALFWorld, LatentSkill lifts success by 21.4 points (seen) with 64.1% fewer prefill tokens.

AI Agents · Lehigh University

OpenSkill: Self-Evolving LLM Agents With No Task Supervision

OpenSkill lets agents build skills and their own verifiers from the open web, hitting 43.6% on SkillsBench (+8.9 over the best baseline) with zero target-task answers.

AI Agents · Independent Researcher

SoCRATES: Evaluating Proactive LLM Mediation

SoCRATES: Evaluating Proactive LLM Mediation turns proactive mediation agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

ToolMaze: When LLM Agents Must Replan After Tool Failures

ToolMaze: When LLM Agents Must Replan After Tool Failures turns dynamic replanning after tool failures into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Code Generation · University of Waterloo

Code2LoRA: Hypernetworks That Generate Repo-Specific LoRA Adapters

Code2LoRA trains a hypernetwork to emit a repo-specific LoRA adapter for a code model with no inference-time token cost — 66.2% in-repo and 63.8% cross-repo exact match, plus an Evo variant that tracks diffs with a GRU.

Vision-Language-Action · ETH Zurich

Robots Need More than VLA and World Models: Four Missing Interfaces

A position paper from ETH Zurich, Stanford and TU Darmstadt argues scaling VLA and world models is not enough — robots need four interfaces to turn unstructured human and video behaviour into grounded supervision.

AI Agents · UC Berkeley

Agents' Last Exam: Why AI Agents Still Fail at Work

Agents' Last Exam tests AI agents on 1,490 expert-built professional tasks across 55 digital industries; the hardest tier averages only 2.6% full pass.

Reinforcement Learning · Tsinghua University

CHERRL: A Controlled Sandbox for Reward Hacking in Rubric RL

CHERRL injects four known judge biases to reliably reproduce reward hacking in rubric RL; an agent reading only training logs pinned the onset with 11-step total interval error and missed none of six runs.

AI Agents · HKUST

StreamMA: Streaming Beats Waiting in Multi-Agent Reasoning

StreamMA pipes each reasoning step to the next agent the moment it is written, not after the full chain. Across 8 benchmarks it gains +7.3 pp on average (max +22.4 pp on HMMT 2026) and runs up to 26.9x faster.

AI Agents · Independent Researcher

TIDE: Proactive Multi-Problem Discovery with Templates

TIDE: Proactive Multi-Problem Discovery with Templates turns proactive problem discovery into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Multimodal Models · Independent Researcher

VideoKR: Knowledge-Intensive Video Understanding

VideoKR: Knowledge-Intensive Video Understanding turns knowledge and reasoning in video understanding into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Video Generation · HKUST

Echo-Infinity: Learnable Evolving Memory for 24-Hour Real-Time Video

Echo-Infinity is an autoregressive video model with a learnable evolving memory that compresses any-length history at constant cost, hitting 24-hour rollouts (over 1.3M frames) in real time at 18.5 FPS on an H100.

Multimodal Models · Skywork AI

Audio Interaction Model: A Streaming Audio LLM That Decides When to Speak

The Audio Interaction Model runs a perceive-decide-respond loop so an audio LLM listens, decides if and when to reply, and answers on the fly — trained on StreamAudio-2M and competitive across 8 benchmarks.

Biomolecular Modeling · AIRI

GENEB: Why Genomic Foundation Models Are So Hard to Compare

GENEB probes frozen representations from 40 genomic foundation models across 100 tasks in 13 functional categories, and finds rankings flip across categories while extra parameters buy only modest, inconsistent gains.

Multimodal Models · Shanghai AI Laboratory

OVO-S-Bench: Streaming Spatial Intelligence in MLLMs

OVO-S-Bench: Streaming Spatial Intelligence in MLLMs turns streaming spatial intelligence into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

LLM Reasoning · Shanghai AI Laboratory

ThoughtFold: Cutting 56% of Reasoning Tokens Without Losing Accuracy

ThoughtFold trims the redundant reasoning of DeepSeek-R1-Distill-Qwen-7B by about 56% of tokens while keeping accuracy on AIME, MATH-500, and GPQA-Diamond intact, using a masked preference objective.

World Models · University of Macau

PF-OPSD: When Should an MLLM Trust a World Model's Video?

PF-OPSD teaches a Qwen3.5-9B MLLM to decide when to simulate the future with a video world model, verify the rollout, and fold it into its answer, lifting accuracy +10.6 and +10.9 points on two new QA benchmarks.

Robotics · Tsinghua University

Humanoid-GPT: A GPT-Style Transformer for Humanoid Motion Tracking

Humanoid-GPT treats humanoid control like language modeling: a causal Transformer distilled from ~384 PPO experts on a 2-billion-frame corpus, 200x prior data. It hits 92.58 percent sim success, under 1.5ms.

Language Models · Google Research

Language Models Need Sleep: A Consolidate-and-Dream Recipe

Google Research argues LLMs need an offline sleep phase to turn short-term context into stable weights. With sleep, Qwen3-8B hits 79.2% on AIME-24 and a Transformer reaches 80% on ARC few-shot, beating SEAL.

Text-to-Image · Alibaba Qwen Team

Qwen-Image-Flash: Beyond Objective Design in Few-Step Distillation

Qwen-Image-Flash distills Qwen-Image-2.0 to 4 sampling steps for both text-to-image and editing. The Alibaba Qwen team shows the training recipe — data, teachers, task mix — matters as much as the distillation objective.

Multimodal Models · University of Washington

Imaginative Perception Tokens: Letting VLMs Picture Space, Not Describe It

Imaginative Perception Tokens (IPT) make a VLM render a new viewpoint instead of reasoning in text — lifting multiview counting 3.4%, rivaling closed models on path tracing, while text chain-of-thought sometimes hurts.

Efficient AI · Huawei

KVarN: 2-Bit KV-Cache Quantization Without Calibration

KVarN compresses the KV-cache to 2 bits with no calibration data, using a Hadamard rotation plus dual-axis variance normalization to stop quantization errors from snowballing across long reasoning chains.

AI Agents · Nanjing University

DRIFT: Pinpointing Where Deep-Research Agents Go Wrong

TELBench asks models to find the span that broke a 12-step research trajectory. DRIFT audits claims against evidence, lifting macro-F1 to 54.91% with Claude-Sonnet-4.6, up to 30 points over raw inspection.

AI Agents · University of Illinois Urbana-Champaign

Harness-1: Move Search-Agent Bookkeeping Out of the Policy

Harness-1 is a 20B RL search agent that hands working memory to the environment, hitting 0.730 average curated recall and beating the next open subagent by +11.4 points.

AI Agents · Independent Researcher

K-BrowseComp: Korean Web-Browsing Agent Benchmark

K-BrowseComp: Korean Web-Browsing Agent Benchmark turns Korean-context web browsing agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Reinforcement Learning · Tianjin University

Why Multi-Domain RL Forgets, and How a Math Refresh Heals It

When you RL-tune an LLM across math, code, QA, and writing in sequence, math drops from 66.49 to 57.66 even though gradients look orthogonal. A short math refresh pulls it back to 66.04 without wrecking the other three.

Video Generation · Kuaishou Technology

VLM Teachers Score Video-Model Reasoning at Test Time

Instead of asking a video model to reason directly, a VLM grades its in-progress frames and fine-tunes a per-instance LoRA. The trick lifts RULER-Bench from 46.4 to 68.2.

Multimodal Models · The Chinese University of Hong Kong

X-Stream: Why MLLMs Score ~50% on Multi-Stream Video

X-Stream is the first benchmark for watching several live video streams at once. The best model, Gemini 3 Pro, hits 49.6% versus a 91.84% human baseline, and proactive ability collapses below 21%.

Multimodal Models · NVIDIA

Cosmos 3 Explained: NVIDIA's Omnimodal World Model for Physical AI

Cosmos 3 packs language, image, video, audio, and robot actions into one mixture-of-transformers model; NVIDIA reports it ranks first among open models on text-to-image, image-to-video, and RoboArena policy.

Fine-Tuning & Adaptation · Mind Lab

Scaling PEFT: Toward a Million Personal Models on One Base

A position paper reframing LoRA adapters as persistent personal state, not a cheap full-finetune substitute, across three axes: scale up the base, scale down the adapter, scale out to millions, plus a serving stack MinT.

AI Agents · Ant Group

SkillAdaptor: How LLM Agents Rewrite Their Own Skills

SkillAdaptor edits an agent's skill library from failed trajectories without touching model weights, lifting WebShop score +2.3 and PinchBench +1.5 over the frozen backbone.

Robotics · Independent Researcher

TVRBench: Can Models Move to a Target Viewpoint?

TVRBench: Can Models Move to a Target Viewpoint? turns active 3D viewpoint reproduction into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

LLM Reasoning · Samsung Research

TrOPD: Trust-Region On-Policy Distillation for Small LLMs

TrOPD masks on-policy distillation to the tokens where the teacher is actually trustworthy, adding +3.06 to +3.52 average points over standard OPD on math, code, and STEM benchmarks with 1.5B-1.7B students.

Retrieval-Augmented Generation · AIRI

OCC-RAG: Small Models Built Only to Read Context Faithfully

OCC-RAG is a pair of 0.6B and 1.7B reasoning models trained to answer strictly from the given context and refuse when the answer isn't there — matching or beating general models 2-6x their size on multi-hop QA.

World Models · Independent Researcher

Function2Scene: 3D Indoor Layout from Functional Specs

Function2Scene: 3D Indoor Layout from Functional Specs turns functional 3D scene layout into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Diffusion Models · The Hong Kong Polytechnic University

GGT-100K: Generative Ground Truth for Image Restoration

GGT-100K: Generative Ground Truth for Image Restoration turns real-world image restoration data into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Long Context · Tsinghua University

LongTraceRL: Harder Distractors and Rubric Rewards for Long-Context RL

Tsinghua's LongTraceRL mines distractors from real search-agent trajectories and adds entity-level rubric rewards, lifting a Qwen3-4B reasoner from 53.3 to 59.0 average across five long-context benchmarks (+5.7).

AI Agents · Independent Researcher

When Masking Stale Observations Helps Search Agents

When Masking Stale Observations Helps Search Agents turns context management for search agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Reinforcement Learning · University of Edinburgh

SCOPE: Self-Play RL That Trains LLMs on Open-Ended Tasks

SCOPE co-evolves a task-writing Challenger and a retrieval Solver, judged by a frozen copy of the base model, lifting eight open-ended benchmarks by up to +10.4 points with zero curated prompts.

Speech Synthesis · Zhejiang University

SwanSphere: Streaming Spatial Audio Generation From Video and Text

SwanSphere streams first-order ambisonic audio synced to video or text, emitting its first chunk in 0.21s while cutting Frechet Distance to 120.28 vs OmniAudio's 157.67. Quality without waiting for the whole clip.

Agent Memory · ByteDance

TaskMem: Teaching a Video Agent What Is Worth Remembering

TaskMem trains a multimodal agent to write its own memory with RL, lifting streaming-video QA accuracy to 67.9% on VideoMME and 45.4% on EgoLife, gains of 6.3 and 7.0 points over the Qwen3-VL-30B baseline.

Mixture of Experts · National University of Singapore

dMoE: Block-Level Expert Routing for Diffusion LLMs

dMoE aligns token-level MoE routing with block-parallel decoding in diffusion LLMs. On LLaDA2.0-mini it cuts unique experts per block from 69.5 to 14.6, keeps 99.11% accuracy, and frees 76-80% of expert memory.

AI Agents · Shanghai AI Laboratory

COLLEAGUE.SKILL: Turning One Person's Expertise Into a Portable AI Skill

COLLEAGUE.SKILL distills one person's work traces into a versioned skill package with two tracks — capability and bounded behavior — that any agent can install, correct, and roll back. The open repo reports ~18.5k stars.

Code Generation · JetBrains

Mellum 2: A 12B MoE Code Model Running at 2.5B Compute

Mellum 2 is JetBrains' open-weight 12B Mixture-of-Experts code model that activates only 2.5B parameters per token, matching dense 4B-14B baselines on software tasks at a fraction of the per-token compute.

Multimodal Models · ByteDance

Representation Forcing: Unified Multimodal Models Without a VAE

Representation Forcing drops the frozen VAE from unified multimodal models. RF-Pixel predicts visual representation tokens before pixels, hits 0.84 GenEval, and lifts MMMU by 4.3 points over its VAE variant.

Speech Synthesis · ByteDance

SwanVoice: Zero-Shot Speech Synthesis for Long Monologue and Dialogue

SwanVoice is a zero-shot TTS system that generates an entire 1-4 speaker conversation in one pass, keeping voice, mood, and prosody consistent across turns where turn-by-turn synthesis drifts — but content accuracy lags.

Fine-Tuning & Adaptation · T-Tech

Trust-Region Behavior Blending: A Warmup Fix for On-Policy Distillation

On-policy distillation wastes teacher supervision on a student's weak early rollouts. TRB blends teacher-like behavior inside a KL trust region during warmup, then anneals it to zero — best average on two math settings.

Language Models · Independent Researcher

WASH: Averaging 3 LLMs Erases Text Watermarks

Averaging the output distributions of 3 independent LLMs collapses watermark detection z-scores from 5-300 down below 2, and the WASH paper proves why it works with an O(1/sqrt(N)) error bound.

AI Agents · Shanghai AI Laboratory

ResearchClawBench: Testing Autonomous Research Agents

ResearchClawBench: Testing Autonomous Research Agents turns end-to-end scientific research agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Xiamen University

SAAS: Teaching Search Agents When to Stop Searching

SAAS uses self-aware RL to cut a Qwen2.5-7B search agent's average queries from 2.19 to 0.97 per question, while keeping accuracy near the best baseline (48.7% vs 49.8%).

Efficient AI · Shanghai AI Laboratory

Draft-OPD: On-Policy Distillation Pushes Speculative Decoding Past 5x

Draft-OPD trains speculative draft models on states their own drafting induces, not just target transcripts. On Qwen3 thinking models it hits 4.86x to 4.89x, beating EAGLE-3 by 23 percent and DFlash by 13 percent.

Video Generation · NVIDIA

SANA-Streaming: Real-time Video Editing at 24 FPS on One RTX 5090

SANA-Streaming edits 1280x704 video in real time at 24 end-to-end FPS on a single RTX 5090, with the diffusion transformer core hitting 58 FPS via a hybrid DiT and Cycle-Reverse Regularization.

Video Generation · Virginia Tech

VideoMLA: A Low-Rank Latent KV Cache for Minute-Scale Video Diffusion

VideoMLA ports Multi-Head Latent Attention into causal video diffusion, cutting per-token KV memory 92.7% (224 vs 3,072 scalars), winning VBench at 60s, and lifting B200 throughput 1.23x.

Multimodal Models · Meta AI

VLM3: Vision Language Models Are Native 3D Learners

VLM3 shows a standard 4B vision-language model matches expert 3D models — 0.904 depth accuracy, 94.0% camera-pose AUC, 91.35% object-3D accuracy — with no 3D-specific architecture, only focal unification and scaling.

AI Agents · Shanghai AI Laboratory

AgentDoG 1.5: A Lightweight Guardrail for AI Agent Safety

AgentDoG 1.5 trains 0.8B-8B agent-safety guard models on only ~1k samples, hits 92.2% accuracy on R-Judge with the 4B variant, rivals GPT-5.4, and cuts agentic-RL deployment overhead by two orders of magnitude.

Multimodal Models · University of Illinois Urbana-Champaign

Crafter: A Multi-Agent Harness for Editable Scientific Figures

Crafter wraps an image model in five cooperating agents and scores 50.34 on PaperBanana-Bench vs 11.13 for the raw backbone — then CraftEditor turns the raster output into editable SVG you can actually fix.

Efficient AI · Shanghai Jiao Tong University

Domino: Splitting the Draft and the Causal Fix in Speculative Decoding

Domino lets a parallel drafter propose a whole block at once, then a lightweight head adds back the token-to-token dependencies — reaching up to 5.49x speedup on Transformers and 5.8x throughput on SGLang.

Retrieval-Augmented Generation · University of Massachusetts Amherst

GrepSeek: Search Agents That grep the Corpus Instead of a Vector Index

GrepSeek trains an LLM to answer questions by issuing shell commands like grep against the raw corpus — no embedding index — and posts the best F1 and Exact Match across seven open-domain QA benchmarks.

Vision-Language-Action · Alibaba Qwen Team

Qwen-VLA: One Model for Manipulation, Navigation, and Trajectories

Qwen-VLA extends Qwen's vision-language stack with a DiT action decoder and embodiment-aware prompts to run manipulation, navigation, and trajectory prediction in one model — 97.9% on LIBERO and 69.0% OSR on R2R.

Speech Synthesis · Independent Researcher

A Broad Benchmark for Long-Form Speech Generation

A Broad Benchmark for Long-Form Speech Generation turns long-form speech generation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

TASTE: Harder Agent Benchmarks from Tool Sequences

TASTE: Harder Agent Benchmarks from Tool Sequences turns tool-use benchmark generation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

World Models · NVIDIA

Gamma-World: A Multi-Agent World Model That Scales Past Two Players

Gamma-World is NVIDIA's video world model for multiplayer simulation that runs at 24 FPS and generalizes from two to four players with no retraining, cutting Solaris's FVD roughly in half.

Fine-Tuning & Adaptation · The Hong Kong Polytechnic University

Token Teachability: Distilling LLMs on Just 5% of Tokens

Teachability-Aware OPD supervises only ~5% of tokens, those where the teacher's correction lands inside the student's top-K support, matching or beating full-token distillation (44.89 vs 42.37 on Qwen3-4B to 1.7B).

Multimodal Models · NVIDIA

LocateAnything: Parallel Box Decoding for Faster Vision Grounding

LocateAnything emits each bounding box in a single decoding step instead of digit-by-digit, hitting 12.7 boxes/sec in hybrid mode — about 2.5x faster than Rex-Omni-3B — while leading on COCO and LVIS at the same 3B size.

LLM Reasoning · Alibaba Qwen Team

DVAO: Variance-Adaptive Advantage Weighting for Multi-Reward RL

DVAO weights each reward by its in-group variance instead of fixed coefficients, lifting Qwen3-4B-Base from 38.99% to 42.19% average accuracy and length compliance to 99.91% in math-plus-tool-use RL.

World Models · Fudan University

WBench: A Multi-turn Benchmark for Interactive Video World Models

WBench scores interactive video world models on five axes — quality, setting, interaction, consistency, physics — across 289 cases and 1,058 turns, and finds no single model wins on all five.

Language Models · Xiaohongshu

NITP: Predict the Next Token's Meaning, Not Just Its ID

NITP adds a dense target to next-token prediction: forecast a shallow-layer embedding of the next token. On a 9B MoE it lifts MMLU-Pro by 5.71 points for about 2 percent extra training FLOPs and zero inference cost.

Brain Decoding · MIT

BrainCause: Finding Causal Visual Representations in the Brain

BrainCause uses text-to-image generation plus an fMRI encoder to causally test what brain regions represent, cutting false-positive localizations from 73.4% to 23% across 260 visual concepts.

AI Agents · Microsoft Research

SkillOpt: Training a Frozen Agent's Skill Text Like a Model

SkillOpt trains a single skill document for a frozen LLM agent with bounded add/delete/replace edits and a held-out gate, lifting GPT-5.5 by +23.5 points in direct chat across six benchmarks.

Theorem Proving · Google DeepMind

AI Formal Proof Search for Open Math Problems

This work evaluates AI-aided formal proof search on open math problems: the strongest agent resolves 9 of 353 Erdos problems and proves 44 of 492 OEIS conjectures.

Multimodal Models · The University of Tokyo

Perception or Prejudice: Can MLLMs Ground Personality in Real Evidence?

MM-OCEAN tests whether multimodal LLMs justify Big Five personality ratings with real video evidence. Across 27 models, 51.3% of correct ratings rest on wrong cues, and the best grounds only 33.5% fully.

Language Models · Alibaba Qwen Team

TransitLM: A Map-Free Transit Routing Dataset and Benchmark

TransitLM is a 13M-record corpus from four Chinese cities (120,845 stations) that trains a language model to plan transit routes with no map engine — a 4B model hits 97.0% connectivity and 71.0% exact match.

LLM Reasoning · Renmin University of China

DelTA: Discriminative Token Credit Assignment for RLVR Reasoning

DelTA reweights RLVR updates so credit lands on tokens that actually separate right answers from wrong ones, lifting Qwen3-8B-Base by 3.26 and Qwen3-14B-Base by 2.62 average points over the strongest baselines.

Efficient AI · Sapient Intelligence

HRM-Text: A 1B Model Trained From Scratch for $1,500

HRM-Text trains a 1B language model from scratch on 40B tokens for about $1,500, scoring 60.7% MMLU, 84.5% GSM8K and 56.2% MATH by swapping Transformers for a hierarchical recurrent model.