AI Agents

LLM-driven systems that plan, act, use tools, and carry skills across tasks.

AI agents wrap a language model in a loop of planning, tool use, memory, and action, turning a one-shot responder into a system that can pursue a goal over many steps. The research that matters is less about any single model and more about how agents reason, call tools, recover from errors, and carry reusable skills between tasks.
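The loop described above can be sketched in a few lines. This is a minimal illustration, not any particular framework's API: `scripted_policy` is a hypothetical stand-in for the language model, and the `TOOLS` registry and its two tools are assumptions made up for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str, str]  # (action, argument, observation)

# Hypothetical tool registry: names and behaviours are illustrative only.
TOOLS: Dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
    "upper": lambda arg: arg.upper(),
}


def scripted_policy(goal: str, memory: List[Step]) -> Tuple[str, str]:
    """Stand-in for a language model: returns (action, argument)."""
    if not memory:
        return ("add", goal)  # first step: call a tool on the goal
    return ("finish", memory[-1][2])  # then stop and report the last observation


def run_agent(
    goal: str,
    policy: Callable[[str, List[Step]], Tuple[str, str]] = scripted_policy,
    max_steps: int = 5,
) -> Tuple[Optional[str], List[Step]]:
    """Plan -> act -> observe -> remember, until the policy finishes."""
    memory: List[Step] = []
    for _ in range(max_steps):
        action, arg = policy(goal, memory)
        if action == "finish":
            return arg, memory
        tool = TOOLS.get(action)
        if tool is None:
            observation = f"error: unknown tool {action!r}"
        else:
            try:
                observation = tool(arg)
            except Exception as exc:  # errors become observations, not crashes
                observation = f"error: {exc}"
        memory.append((action, arg, observation))
    return None, memory  # step budget exhausted without an answer
```

Calling `run_agent("2+3")` returns `"5"` along with a one-step memory trace. Feeding tool errors back as observations is what gives a policy the chance to recover instead of aborting.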

This topic tracks the shift from clever prompting to durable infrastructure: ReAct interleaved reasoning with actions, Toolformer taught models to call APIs, and skill-packaging systems like COLLEAGUE.SKILL turn expertise into portable, correctable artifacts. The hard open questions are reliability, evaluation, safety bounds, and how to author and maintain skills at scale.

Foundational papers

Agent Memory · UC Berkeley

MemGPT: Treating the LLM Context Window Like an Operating System

MemGPT borrows OS virtual memory — it lets the LLM page data in and out of its own context with function calls, lifting deep memory retrieval to 93.4% with GPT-4 vs 35.3% for recursive summarization.
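The paging idea can be illustrated with a toy context manager. This is a sketch under stated assumptions, not MemGPT's actual interface: the `PagedContext` class, its `max_items` budget, and the substring search in `page_in` are all hypothetical simplifications.

```python
from collections import deque
from typing import Deque, List


class PagedContext:
    """Toy bounded context with an external archive (names are hypothetical)."""

    def __init__(self, max_items: int = 3) -> None:
        self.max_items = max_items
        self.context: Deque[str] = deque()  # what the model would "see"
        self.archive: List[str] = []        # evicted messages, stored outside

    def append(self, message: str) -> None:
        """Add a message, paging the oldest ones out when over budget."""
        self.context.append(message)
        while len(self.context) > self.max_items:
            self.archive.append(self.context.popleft())

    def page_in(self, query: str) -> List[str]:
        """Function the model could call to pull matching messages back in."""
        hits = [m for m in self.archive if query.lower() in m.lower()]
        for hit in hits:
            self.archive.remove(hit)
            self.append(hit)  # may evict something else to make room
        return hits
```

With `max_items=2`, appending three messages pushes the first into the archive, and a later `page_in` call swaps it back while evicting the oldest in-context message.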

Long Context · University of Illinois Urbana-Champaign

From Context to Skills: Ctx2Skill Self-Evolves Context Learning

Ctx2Skill is a self-play framework that discovers natural-language skills from a long context with no human labels or external rewards, lifting GPT-4.1 from 11.1% to 16.5% and GPT-5.1 from 21.2% to 25.8% on CL-bench.

AI Agents · University of Illinois Urbana-Champaign

Eywa: Letting LLM Agents Call Scientific Foundation Models

Eywa lets an LLM agent invoke domain models like Chronos and TabPFN through a learned interface instead of serializing data into text. On EywaBench it lifts utility from 0.6154 to 0.6558 while cutting ~30% tokens.

AI Agents · University of Waterloo

Direct Corpus Interaction: Letting Agents grep Instead of a Retriever

Direct Corpus Interaction (DCI) lets a search agent grep the raw corpus instead of calling a retriever. On BrowseComp-Plus it lifts accuracy from 69.0% to 80.0% while cutting cost 29.4%.
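As a rough picture of what a grep-style tool looks like to an agent, the sketch below scans raw text files with a regular expression and returns matching lines. The function name `grep_corpus`, the `*.txt` layout, and the `max_hits` cap are assumptions for illustration; the paper's actual tool interface may differ.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Tuple

Hit = Tuple[str, int, str]  # (file name, line number, line text)


def grep_corpus(root: Path, pattern: str, max_hits: int = 20) -> List[Hit]:
    """Scan raw .txt files under root and return lines matching pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits: List[Hit] = []
    for path in sorted(root.rglob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits.append((path.name, line_no, line.strip()))
                    if len(hits) >= max_hits:
                        return hits  # cap output so it fits in a context window
    return hits


def demo() -> List[Hit]:
    """Build a two-file throwaway corpus and search it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.txt").write_text(
            "Agents plan.\nAgents grep the corpus.\n", encoding="utf-8"
        )
        (root / "b.txt").write_text("Retrievers rank passages.\n", encoding="utf-8")
        return grep_corpus(root, r"grep")
```

Here `demo()` returns `[("a.txt", 2, "Agents grep the corpus.")]`. There is no index to build or keep in sync; the agent chooses the pattern and reads exact lines.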

Recent papers

Agent Memory · National University of Singapore

EvoArena: Why Agent Memory Must Track Environment Changes

EvoArena turns static agent tasks into evolving chains and finds current agents average only 39.6% accuracy; EvoMem adds patch memory and improves chain-level accuracy by 3.7 points.

AI Agents · Google DeepMind

From AGI to ASI: DeepMind's Map of Superintelligence Pathways

Google DeepMind's report lays out four non-exclusive paths from AGI to ASI and treats each bottleneck, from data walls to regulation, as an open research question.

AI Agents · NVIDIA

SpatialClaw: Why VLM Spatial Agents Need a Python Workspace

SpatialClaw replaces rigid tool calls with a persistent Python kernel and reaches 59.9% average accuracy across 20 spatial reasoning benchmarks, +11.2 points over the recent spatial-agent baseline.

AI Agents · Renmin University of China

Arbor: Autonomous Research With Hypothesis Trees

Arbor stores research attempts in a persistent hypothesis tree, then admits changes only through held-out evaluation. It reports best held-out results on six AO tasks and 86.36% Any Medal on MLE-Bench Lite.

AI Agents · TokenRhythm Technologies

Claw-SWE-Bench: Why Coding Agent Harnesses Matter

Claw-SWE-Bench evaluates OpenClaw-style coding-agent harnesses on 350 GitHub issue tasks. OpenClaw jumps from 19.1% to 73.4% Pass@1 with a full adapter.

AI Agents · Independent Researcher

AdaPlanBench: Testing Adaptive Planning in LLM Agents

AdaPlanBench turns adaptive planning under constraints into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · UC Berkeley

Agents' Last Exam: Why AI Agents Still Fail at Work

Agents' Last Exam tests AI agents on 1,490 expert-built professional tasks across 55 digital industries; the hardest tier averages only 2.6% full pass.

AI Agents · Independent Researcher

ArcANE: Measuring When Role-Playing Agents Break Character

ArcANE turns the reliability of role-playing language agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Video Generation · Nanjing University

CoVEBench: Can Video Editors Follow Complex Instructions?

CoVEBench turns complex instruction following for video editing into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Nanjing University

DRIFT: Pinpointing Where Deep-Research Agents Go Wrong

TELBench asks models to find the span that broke a 12-step research trajectory. DRIFT audits claims against evidence, lifting macro-F1 to 54.91% with Claude-Sonnet-4.6, up to 30 points over raw inspection.

AI Agents · University of Illinois Urbana-Champaign

Harness-1: Move Search-Agent Bookkeeping Out of the Policy

Harness-1 is a 20B RL search agent that hands working memory to the environment, hitting 0.730 average curated recall and beating the next open subagent by +11.4 points.

AI Agents · Independent Researcher

K-BrowseComp: Korean Web-Browsing Agent Benchmark

K-BrowseComp turns Korean-context web browsing into a checkable test for agents, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Shanghai Jiao Tong University

LatentSkill: Bake Agent Skills Into LoRA Weights, Not the Prompt

A hypernetwork compiles a textual skill into a LoRA adapter in one forward pass. On ALFWorld, LatentSkill lifts success by 21.4 points (seen) with 64.1% fewer prefill tokens.

AI Agents · Independent Researcher

When Masking Stale Observations Helps Search Agents

This study turns context management for search agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Lehigh University

OpenSkill: Self-Evolving LLM Agents With No Task Supervision

OpenSkill lets agents build skills and their own verifiers from the open web, hitting 43.6% on SkillsBench (+8.9 over the best baseline) with zero target-task answers.

AI Agents · Shanghai AI Laboratory

ResearchClawBench: Testing Autonomous Research Agents

ResearchClawBench turns the evaluation of end-to-end scientific research agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Reinforcement Learning · Tsinghua University

CHERRL: A Controlled Sandbox for Reward Hacking in Rubric RL

CHERRL injects four known judge biases to reliably reproduce reward hacking in rubric RL; an agent reading only training logs pinned the onset with 11-step total interval error and missed none of six runs.

AI Agents · Xiamen University

SAAS: Teaching Search Agents When to Stop Searching

SAAS uses self-aware RL to cut a Qwen2.5-7B search agent's average queries from 2.19 to 0.97 per question, while keeping accuracy near the best baseline (48.7% vs 49.8%).

Reinforcement Learning · University of Edinburgh

SCOPE: Self-Play RL That Trains LLMs on Open-Ended Tasks

SCOPE co-evolves a task-writing Challenger and a retrieval Solver, judged by a frozen copy of the base model, lifting eight open-ended benchmarks by up to +10.4 points with zero curated prompts.

AI Agents · Ant Group

SkillAdaptor: How LLM Agents Rewrite Their Own Skills

SkillAdaptor edits an agent's skill library from failed trajectories without touching model weights, lifting WebShop score +2.3 and PinchBench +1.5 over the frozen backbone.

AI Agents · Independent Researcher

SoCRATES: Evaluating Proactive LLM Mediation

SoCRATES turns proactive mediation by LLM agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

SpatialWorld: Interactive Spatial Reasoning for Agents

SpatialWorld turns interactive spatial reasoning into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · HKUST

StreamMA: Streaming Beats Waiting in Multi-Agent Reasoning

StreamMA pipes each reasoning step to the next agent the moment it is written, not after the full chain. Across 8 benchmarks it gains +7.3 pp on average (max +22.4 pp on HMMT 2026) and runs up to 26.9x faster.

AI Agents · Shanghai Jiao Tong University

SWE-Explore: Can Coding Agents Find the Right Code?

SWE-Explore isolates the repo-exploration stage of coding agents over 848 issues. Agentic explorers crush BM25 (HitFile 0.65 vs 0.08), but line-level recall stalls at 0.15-0.20, and that gap is what limits repairs.

AI Agents · Independent Researcher

TASTE: Harder Agent Benchmarks from Tool Sequences

TASTE turns tool-use benchmark generation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

TIDE: Proactive Multi-Problem Discovery with Templates

TIDE turns proactive problem discovery into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · Independent Researcher

ToolMaze: When LLM Agents Must Replan After Tool Failures

ToolMaze turns dynamic replanning after tool failures into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Multimodal Models · The Chinese University of Hong Kong

X-Stream: Why MLLMs Score ~50% on Multi-Stream Video

X-Stream is the first benchmark for watching several live video streams at once. The best model, Gemini 3 Pro, hits 49.6% versus a 91.84% human baseline, and proactive ability collapses below 21%.

AI Agents · Shanghai AI Laboratory

AgentDoG 1.5: A Lightweight Guardrail for AI Agent Safety

AgentDoG 1.5 trains 0.8B-8B agent-safety guard models on only ~1k samples, hits 92.2% accuracy on R-Judge with the 4B variant, rivals GPT-5.4, and cuts agentic-RL deployment overhead by two orders of magnitude.

AI Agents · Shanghai Jiao Tong University

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

ARIS is an open-source autonomous-research harness pairing a Claude-family executor with a GPT-family reviewer to attack the failure it calls 'plausible unsupported success', with 65+ skills and a three-stage audit.

AI Agents · UNC-Chapel Hill

AutoResearchClaw: An AI Research Agent That Beats AI Scientist v2

AutoResearchClaw is a 23-stage multi-agent system for autonomous ML research. It scores 0.648 vs AI Scientist v2's 0.419 on its 25-topic ARC-Bench, and rises to 7.27/10 quality with a human in the loop.

AI Agents · Shanghai AI Laboratory

Pi-Bench: Can AI Assistants Anticipate What You Did Not Say?

Pi-Bench scores agents on proactivity, not just task completion, across 100 long-horizon tasks. The best model, GPT-5.4, hits only 67.0% proactivity, and removing prior sessions drops it 9.5 points.

AI Agents · University of Illinois Urbana-Champaign

Code as Agent Harness: Reframing Code as the Runtime of AI Agents

This survey reframes code not as a thing agents generate but as the executable substrate they run on, mapping 40-plus systems across three layers — interface, mechanisms, multi-agent scaling — plus seven open problems.

AI Agents · Shanghai AI Laboratory

COLLEAGUE.SKILL: Turning One Person's Expertise Into a Portable AI Skill

COLLEAGUE.SKILL distills one person's work traces into a versioned skill package with two tracks — capability and bounded behavior — that any agent can install, correct, and roll back. The open repo reports ~18.5k stars.

Multimodal Models · University of Illinois Urbana-Champaign

Crafter: A Multi-Agent Harness for Editable Scientific Figures

Crafter wraps an image model in five cooperating agents and scores 50.34 on PaperBanana-Bench vs 11.13 for the raw backbone — then CraftEditor turns the raster output into editable SVG you can actually fix.

World Models · NVIDIA

Gamma-World: A Multi-Agent World Model That Scales Past Two Players

Gamma-World is NVIDIA's video world model for multiplayer simulation that runs at 24 FPS and generalizes from two to four players with no retraining, cutting Solaris's FVD roughly in half.

AI Agents · MemTensor

MemPrivacy: Private Edge-Cloud Agent Memory via Reversible Placeholders

MemPrivacy swaps sensitive spans for type-aware placeholders on-device, processes memory in the cloud over them, then restores them locally — utility loss stays within 1.6% and 0.6B-4B models beat GPT-5.2 at detection.
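The mask-then-restore flow can be sketched as follows. Note the substitution: the blurb describes small trained models doing the detection, whereas this example swaps in two simple regular expressions purely for illustration. The `PATTERNS` table, the `<KIND_n>` placeholder format, and the function names are all assumptions.

```python
import re
from typing import Dict, Tuple

# Illustrative detectors only; a regex is a crude stand-in for a trained model.
PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def mask(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace sensitive spans with typed placeholders; keep the map local."""
    mapping: Dict[str, str] = {}
    for kind, regex in PATTERNS.items():
        prefix = f"<{kind}_"

        def substitute(match: "re.Match[str]", prefix: str = prefix) -> str:
            value = match.group(0)
            for placeholder, original in mapping.items():
                if original == value:
                    return placeholder  # same value -> same placeholder
            index = sum(key.startswith(prefix) for key in mapping) + 1
            placeholder = f"{prefix}{index}>"
            mapping[placeholder] = value
            return placeholder

        text = regex.sub(substitute, text)
    return text, mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Put the original values back on the device that holds the mapping."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

For the input `"Mail ana@example.com or call +1 555-010-9999 today."`, `mask` yields `"Mail <EMAIL_1> or call <PHONE_1> today."`, and `restore` with the returned mapping recovers the original string. Only the masked text would leave the device; the mapping never does.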

AI Agents · Shanghai Jiao Tong University

MMSkills: Multimodal Skill Packages for General Visual Agents

MMSkills packages textual procedures, runtime state cards, and keyframes into reusable skills for visual agents, lifting Qwen3-VL-235B from 21.34% to 39.17% on OSWorld and a small 8B model from 10.78% to 25.40%.

Multimodal Models · Sea AI Lab

OpenSearch-VL: An Open Recipe for Multimodal Search Agents

OpenSearch-VL open-sources data, code, and weights for vision-language search agents that call real search, OCR, and image tools — its 30B-A3B model lifts seven benchmarks by 13.8 points on average over Qwen3-VL.

AI Agents · Zhejiang University

Self-Distilled Agentic RL: A Privileged Teacher Steering GRPO Per Token

SDAR adds a gated, token-level self-distillation signal from a skill-augmented teacher on top of GRPO, lifting multi-turn agents by up to +10.2 points on WebShop and +9.4 on ALFWorld for small Qwen models.

AI Agents · University of Science and Technology of China

Skill1: One RL Policy That Selects, Uses, and Distills Agent Skills

Skill1 trains a single Qwen2.5-7B policy to retrieve, apply, and create reusable skills under one task-outcome reward — reaching 97.5% on ALFWorld, 6.5 points over the strongest RL-only baseline.

AI Agents · Microsoft Research

SkillOpt: Training a Frozen Agent's Skill Text Like a Model

SkillOpt trains a single skill document for a frozen LLM agent with bounded add/delete/replace edits and a held-out gate, lifting GPT-5.5 by +23.5 points in direct chat across six benchmarks.
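One way to picture bounded edits behind a held-out gate is the sketch below. It is an interpretation, not the paper's algorithm: the skill is modeled as a list of lines, "bounded" is read as a cap on edits per round (`max_edits`), and `heldout_score` is a hypothetical callable standing in for evaluation on held-out tasks.

```python
from typing import Callable, List, Sequence, Tuple

Edit = Tuple[str, int, str]  # (op, line index, text); op is add/delete/replace


def apply_edit(skill: Sequence[str], edit: Edit) -> List[str]:
    """Return a copy of the skill with one edit applied."""
    op, index, text = edit
    updated = list(skill)
    if op == "add":
        updated.insert(index, text)
    elif op == "delete":
        del updated[index]
    elif op == "replace":
        updated[index] = text
    else:
        raise ValueError(f"unknown op: {op}")
    return updated


def gated_update(
    skill: Sequence[str],
    edits: Sequence[Edit],
    heldout_score: Callable[[Sequence[str]], float],
    max_edits: int = 2,
) -> Tuple[List[str], float]:
    """Try at most max_edits edits; keep one only if the held-out score rises."""
    best, best_score = list(skill), heldout_score(skill)
    for edit in edits[:max_edits]:
        candidate = apply_edit(best, edit)
        score = heldout_score(candidate)
        if score > best_score:  # the gate: reject ties and regressions
            best, best_score = candidate, score
    return best, best_score


def toy_score(skill: Sequence[str]) -> float:
    """Hypothetical scorer: rewards 'verify' lines, penalizes 'guess' lines."""
    return float(
        sum("verify" in line for line in skill)
        - sum("guess" in line for line in skill)
    )
```

Starting from `["guess the answer"]` with proposed edits `("replace", 0, "verify the answer")` and `("add", 1, "guess again")`, the first edit passes the gate and the second is rejected, leaving `["verify the answer"]` with a score of `1.0`.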

AI Agents · MemTensor

SkillsVote: Governing the Lifecycle of Reusable Agent Skills

SkillsVote treats agent skills as a governed library — profiling a million-scale corpus, recommending skills before a run, and gating updates after. Offline evolution lifts GPT-5.2 on Terminal-Bench 2.0 by up to 7.9 pp.

AI Agents · Peking University

Video2GUI: Mining 12M GUI Agent Trajectories From Internet Videos

Video2GUI turns 500M unlabeled tutorial videos into WildGUI — 12M grounded GUI interaction trajectories across 1,500+ apps and sites — and pretraining Qwen2.5-VL and Mimo-VL on it lifts GUI benchmarks by 5-20%.
