LLM Reasoning

Eliciting and improving step-by-step reasoning in large language models.

From AGI to ASI: DeepMind's Map of Superintelligence Pathways

Google DeepMind's report lays out four non-exclusive paths from AGI to ASI and treats each bottleneck, from data walls to regulation, as an open research question.

Theorem Proving · MiniMax AI

MaxProof: How MiniMax-M3 Reaches Gold-Level Proof Scores

MaxProof turns MiniMax-M3 into a generator, verifier, fixer, and ranker; with population-level test-time scaling it reports 35/42 on IMO 2025 and 36/42 on USAMO 2026.

AI Agents · Renmin University of China

Arbor: Autonomous Research With Hypothesis Trees

Arbor stores research attempts in a persistent hypothesis tree, then admits changes only through held-out evaluation. It reports best held-out results on six AO tasks and 86.36% Any Medal on MLE-Bench Lite.

AI Agents · Independent Researcher

AdaPlanBench: Testing Adaptive Planning in LLM Agents

AdaPlanBench turns adaptive planning under constraints into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · UC Berkeley

Agents' Last Exam: Why AI Agents Still Fail at Work

Agents' Last Exam tests AI agents on 1,490 expert-built professional tasks across 55 digital industries; on the hardest tier, the average full-pass rate is only 2.6%.

AI Agents · Nanjing University

DRIFT: Pinpointing Where Deep-Research Agents Go Wrong

TELBench asks models to find the span that broke a 12-step research trajectory. DRIFT audits claims against evidence, lifting macro-F1 to 54.91% with Claude-Sonnet-4.6, up to 30 points over raw inspection.

Multimodal Models · Peking University

Watch, Remember, Reason: A Human-View Map of Video MLLMs

A survey that reframes long-video MLLMs as three abilities (watch, remember, reason), comparing against 11 prior surveys and organizing 100+ methods plus 5 application domains.

AI Agents · Independent Researcher

K-BrowseComp: Korean Web-Browsing Agent Benchmark

K-BrowseComp turns Korean-context web browsing into a checkable test for agents, with concrete failure signals, benchmark limits, and builder takeaways.

Theorem Proving · Princeton University

LeanDojo: Retrieval-Augmented Language Models for Theorem Proving

LeanDojo turns retrieval-augmented theorem proving in Lean into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

Long Context · Tsinghua University

LongTraceRL: Harder Distractors and Rubric Rewards for Long-Context RL

Tsinghua's LongTraceRL mines distractors from real search-agent trajectories and adds entity-level rubric rewards, lifting a Qwen3-4B reasoner from 53.3 to 59.0 average across five long-context benchmarks (+5.7).

Theorem Proving · Independent Researcher

MiniF2F: Formal Olympiad Mathematics Benchmark

MiniF2F turns formal Olympiad-level mathematics benchmarking into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.

AI Agents · Shanghai AI Laboratory

ResearchClawBench: Testing Autonomous Research Agents

ResearchClawBench turns end-to-end scientific research by agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Reinforcement Learning · University of Edinburgh

SCOPE: Self-Play RL That Trains LLMs on Open-Ended Tasks

SCOPE co-evolves a task-writing Challenger and a retrieval Solver, judged by a frozen copy of the base model, lifting eight open-ended benchmarks by up to +10.4 points with zero curated prompts.

AI Agents · Independent Researcher

SoCRATES: Evaluating Proactive LLM Mediation

SoCRATES turns proactive LLM mediation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

AI Agents · HKUST

StreamMA: Streaming Beats Waiting in Multi-Agent Reasoning

StreamMA pipes each reasoning step to the next agent the moment it is written, not after the full chain. Across 8 benchmarks it gains +7.3 pp on average (max +22.4 pp on HMMT 2026) and runs up to 26.9x faster.

AI Agents · Independent Researcher

TASTE: Harder Agent Benchmarks from Tool Sequences

TASTE turns tool-use benchmark generation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

LLM Reasoning · Shanghai AI Laboratory

ThoughtFold: Cutting 56% of Reasoning Tokens Without Losing Accuracy

ThoughtFold trims the redundant reasoning of DeepSeek-R1-Distill-Qwen-7B by about 56% of tokens while keeping accuracy on AIME, MATH-500, and GPQA-Diamond intact, using a masked preference objective.

AI Agents · Independent Researcher

TIDE: Proactive Multi-Problem Discovery with Templates

TIDE turns proactive problem discovery into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Fine-Tuning & Adaptation · The Hong Kong Polytechnic University

Token Teachability: Distilling LLMs on Just 5% of Tokens

Teachability-Aware OPD supervises only ~5% of tokens, namely those where the teacher's correction lands inside the student's top-K support, and matches or beats full-token distillation (44.89 vs 42.37 when distilling Qwen3-4B into a 1.7B student).

AI Agents · Independent Researcher

ToolMaze: When LLM Agents Must Replan After Tool Failures

ToolMaze turns dynamic replanning after tool failures into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Multimodal Models · Independent Researcher

VideoKR: Knowledge-Intensive Video Understanding

VideoKR turns knowledge and reasoning in video understanding into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.

Video Generation · Kuaishou Technology

VLM Teachers Score Video-Model Reasoning at Test Time

Instead of asking a video model to reason directly, a VLM grades its in-progress frames and fine-tunes a per-instance LoRA. The trick lifts RULER-Bench from 46.4 to 68.2.

World Models · University of Macau

PF-OPSD: When Should an MLLM Trust a World Model's Video?

PF-OPSD teaches a Qwen3.5-9B MLLM to decide when to simulate the future with a video world model, verify the rollout, and fold it into its answer, lifting accuracy by 10.6 and 10.9 points on two new QA benchmarks.

Theorem Proving · Google DeepMind

AI Formal Proof Search for Open Math Problems

This work evaluates AI-aided formal proof search on open math problems: the strongest agent resolves 9 of 353 Erdős problems and proves 44 of 492 OEIS conjectures.

Theorem Proving · DeepSeek

DeepSeek-Prover-V1.5: Lean Proofs with RL and Search

DeepSeek-Prover-V1.5 combines Lean feedback, reinforcement learning, and RMaxTS search, reaching 63.5% on miniF2F and 25.3% on ProofNet.

LLM Reasoning · Samsung Research

TrOPD: Trust-Region On-Policy Distillation for Small LLMs

TrOPD restricts on-policy distillation to the tokens where the teacher is actually trustworthy, adding 3.06 to 3.52 average points over standard OPD on math, code, and STEM benchmarks with 1.5B-1.7B students.

LLM Reasoning · Shanghai AI Laboratory

SU-01: Gold-Medal Olympiad Reasoning from a 30B Open Model

SU-01, a 30B-A3B open model from Shanghai AI Lab, hits 35 points on IMO 2025 and clears gold lines at IPhO 2024/2025 using only ~338K short SFT trajectories plus a 200-step two-stage RL pipeline.

LLM Reasoning · Xiaohongshu

Anti-Self-Distillation (AntiSD): A PMI Reward That Speeds Up Reasoning RL

AntiSD inverts self-distillation — it rewards tokens where a privileged context disagrees with the base model — reaching GRPO's accuracy in 2 to 10x fewer steps and ending up to 11.5 points higher on 4B-30B models.

AI Agents · Shanghai Jiao Tong University

ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration

ARIS is an open-source autonomous-research harness pairing a Claude-family executor with a GPT-family reviewer to attack the failure it calls 'plausible unsupported success', with 65+ skills and a three-stage audit.

AI Agents · UNC-Chapel Hill

AutoResearchClaw: An AI Research Agent That Beats AI Scientist v2

AutoResearchClaw is a 23-stage multi-agent system for autonomous ML research. It scores 0.648 vs AI Scientist v2's 0.419 on its 25-topic ARC-Bench, and rises to 7.27/10 quality with a human in the loop.

LLM Reasoning · Google Research

Chain-of-Thought Prompting: How Showing the Steps Unlocks LLM Reasoning

Showing a few worked examples with intermediate reasoning steps lets big models solve multi-step problems — a 540B model with 8 chain-of-thought exemplars hits 57% on GSM8K, beating fine-tuned GPT-3 with a verifier.

AI Agents · University of Illinois Urbana-Champaign

Code as Agent Harness: Reframing Code as the Runtime of AI Agents

This survey reframes code not as a thing agents generate but as the executable substrate they run on, mapping 40-plus systems across three layers — interface, mechanisms, multi-agent scaling — plus seven open problems.

LLM Reasoning · Renmin University of China

DelTA: Discriminative Token Credit Assignment for RLVR Reasoning

DelTA reweights RLVR updates so credit lands on tokens that actually separate right answers from wrong ones, lifting Qwen3-8B-Base by 3.26 and Qwen3-14B-Base by 2.62 average points over the strongest baselines.

LLM Reasoning · Alibaba Qwen Team

DVAO: Variance-Adaptive Advantage Weighting for Multi-Reward RL

DVAO weights each reward by its in-group variance instead of fixed coefficients, lifting Qwen3-4B-Base from 38.99% to 42.19% average accuracy and length compliance to 99.91% in math-plus-tool-use RL.

Fine-Tuning & Adaptation · HKUST

On the Geometry of On-Policy Distillation: A Distinct Update Regime

On-policy distillation does not sit between SFT and RLVR — it carves its own geometry. Its updates touch fewer weights, avoid principal directions, and lock into a narrow low-dimensional subspace early in training.

Retrieval-Augmented Generation · University of Massachusetts Amherst

GrepSeek: Search Agents That grep the Corpus Instead of a Vector Index

GrepSeek trains an LLM to answer questions by issuing shell commands like grep against the raw corpus — no embedding index — and posts the best F1 and Exact Match across seven open-domain QA benchmarks.

Multimodal Models · University of Washington

Imaginative Perception Tokens: Letting VLMs Picture Space, Not Describe It

Imaginative Perception Tokens (IPT) make a VLM render a new viewpoint instead of reasoning in text, lifting multiview counting by 3.4% and rivaling closed models on path tracing, whereas text chain-of-thought sometimes hurts.

Efficient AI · Huawei

KVarN: 2-Bit KV-Cache Quantization Without Calibration

KVarN compresses the KV-cache to 2 bits with no calibration data, using a Hadamard rotation plus dual-axis variance normalization to stop quantization errors from snowballing across long reasoning chains.

Retrieval-Augmented Generation · AIRI

OCC-RAG: Small Models Built Only to Read Context Faithfully

OCC-RAG is a pair of 0.6B and 1.7B reasoning models trained to answer strictly from the given context and refuse when the answer isn't there — matching or beating general models 2-6x their size on multi-hop QA.

Alignment · OpenAI

PPO Explained: The Clipped Objective Behind RLHF

PPO keeps policy-gradient RL stable with a clipped surrogate objective — almost as well-behaved as TRPO but far simpler — which made it the default RL engine behind RLHF for ChatGPT and InstructGPT.

LLM Reasoning · Princeton University

ReAct: How Interleaving Reasoning and Acting Built the LLM Agent

ReAct interleaves a model's reasoning traces with task actions like search and API calls, cutting chain-of-thought hallucination and beating RL agents on ALFWorld by 34% absolute with one or two examples.

AI Agents · Zhejiang University

Self-Distilled Agentic RL: A Privileged Teacher Steering GRPO Per Token

SDAR adds a gated, token-level self-distillation signal from a skill-augmented teacher on top of GRPO, lifting multi-turn agents by up to +10.2 points on WebShop and +9.4 on ALFWorld for small Qwen models.

AI Agents · University of Science and Technology of China

Skill1: One RL Policy That Selects, Uses, and Distills Agent Skills

Skill1 trains a single Qwen2.5-7B policy to retrieve, apply, and create reusable skills under one task-outcome reward — reaching 97.5% on ALFWorld, 6.5 points over the strongest RL-only baseline.

LLM Reasoning · Meta AI

Toolformer: How a Language Model Teaches Itself to Use Tools

Toolformer trains a model to decide which API to call — calculator, QA, search, translation, calendar — purely by keeping the sampled calls that lower next-token loss, with only a handful of demos per tool.

Language Models · Alibaba Qwen Team

TransitLM: A Map-Free Transit Routing Dataset and Benchmark

TransitLM is a 13M-record corpus from four Chinese cities (120,845 stations) that trains a language model to plan transit routes with no map engine — a 4B model hits 97.0% connectivity and 71.0% exact match.

Fine-Tuning & Adaptation · T-Tech

Trust-Region Behavior Blending: A Warmup Fix for On-Policy Distillation

On-policy distillation wastes teacher supervision on a student's weak early rollouts. TRB blends teacher-like behavior inside a KL trust region during warmup, then anneals it to zero — best average on two math settings.

Theorem Proving · Google DeepMind

AlphaGeometry: Olympiad Geometry Without Human Proofs

AlphaGeometry pairs a language model with a symbolic engine and trains on 100M synthetic theorems, solving 25 of 30 olympiad geometry problems versus 10 for the prior best.

Alignment · Stanford University

DPO Explained: Aligning LLMs Without the RLHF Reward Model

Direct Preference Optimization solves the RLHF problem with a single classification-style loss on preference pairs — no separate reward model, no RL loop, no sampling during training.

Multimodal Models · OpenAI

GPT-4 Technical Report Explained: Benchmarks, Not Blueprints

OpenAI's GPT-4 report is a measurement document, not a recipe. It hits human-level scores on professional and academic exams — bar exam ~top 10% — yet discloses no architecture, data, or compute.

Open Models · Meta AI

Llama 3: A 405B Dense Open Model That Matches GPT-4

Meta released Llama 3 as a herd of language models led by a dense 405B-parameter flagship with a 128K context window, trained on 15T+ tokens and published with open weights.

LLM Reasoning · DeepSeek

DeepSeek-R1: How Pure Reinforcement Learning Taught an LLM to Reason

DeepSeek-R1 learns to reason through reinforcement learning on whether its answer is correct, with no human reasoning examples. It matches OpenAI o1 on AIME and MATH-500 and ships open MIT-licensed weights.