From AGI to ASI: DeepMind's Map of Superintelligence Pathways
Google DeepMind's report lays out four non-exclusive paths from AGI to ASI and treats each bottleneck, from data walls to regulation, as an open research question.
Topics
Eliciting and improving step-by-step reasoning in large language models.
Google DeepMind's report lays out four non-exclusive paths from AGI to ASI and treats each bottleneck, from data walls to regulation, as an open research question.
MaxProof turns MiniMax-M3 into a generator, verifier, fixer, and ranker; with population-level test-time scaling it reports 35/42 on IMO 2025 and 36/42 on USAMO 2026.
AI Agents · Renmin University of China
Arbor stores research attempts in a persistent hypothesis tree, then admits changes only through held-out evaluation. It reports best held-out results on six AO tasks and 86.36% Any Medal on MLE-Bench Lite.
AI Agents · Independent Researcher
AdaPlanBench: Testing Adaptive Planning in LLM Agents turns adaptive planning under constraints into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Agents' Last Exam tests AI agents on 1,490 expert-built professional tasks across 55 digital industries; the hardest tier averages only 2.6% full pass.
AI Agents · Nanjing University
TELBench asks models to find the span that broke a 12-step research trajectory. DRIFT audits claims against evidence, lifting macro-F1 to 54.91% with Claude-Sonnet-4.6, up to 30 points over raw inspection.
Multimodal Models · Peking University
A survey that reframes long-video MLLMs as three abilities (watch, remember, reason), comparing against 11 prior surveys and organizing 100+ methods plus 5 application domains.
AI Agents · Independent Researcher
K-BrowseComp: Korean Web-Browsing Agent Benchmark turns Korean-context web browsing agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Theorem Proving · Princeton University
LeanDojo turns retrieval-augmented theorem proving in Lean into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
Long Context · Tsinghua University
Tsinghua's LongTraceRL mines distractors from real search-agent trajectories and adds entity-level rubric rewards, lifting a Qwen3-4B reasoner from 53.3 to 59.0 average across five long-context benchmarks (+5.7).
Theorem Proving · Independent Researcher
MiniF2F turns formal Olympiad-level mathematics benchmarking into a concrete research object, with evidence anchors, method tradeoffs, and limits for practical use.
AI Agents · Shanghai AI Laboratory
ResearchClawBench: Testing Autonomous Research Agents turns end-to-end scientific research agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Reinforcement Learning · University of Edinburgh
SCOPE co-evolves a task-writing Challenger and a retrieval Solver, judged by a frozen copy of the base model, lifting eight open-ended benchmarks by up to +10.4 points with zero curated prompts.
AI Agents · Independent Researcher
SoCRATES: Evaluating Proactive LLM Mediation turns proactive mediation agents into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
StreamMA pipes each reasoning step to the next agent the moment it is written, not after the full chain. Across 8 benchmarks it gains +7.3 pp on average (max +22.4 pp on HMMT 2026) and runs up to 26.9x faster.
AI Agents · Independent Researcher
TASTE: Harder Agent Benchmarks from Tool Sequences turns tool-use benchmark generation into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
LLM Reasoning · Shanghai AI Laboratory
ThoughtFold trims the redundant reasoning of DeepSeek-R1-Distill-Qwen-7B by about 56% of tokens while keeping accuracy on AIME, MATH-500, and GPQA-Diamond intact, using a masked preference objective.
AI Agents · Independent Researcher
TIDE: Proactive Multi-Problem Discovery with Templates turns proactive problem discovery into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Fine-Tuning & Adaptation · The Hong Kong Polytechnic University
Teachability-Aware OPD supervises only ~5% of tokens, those where the teacher's correction lands inside the student's top-K support, matching or beating full-token distillation (44.89 vs 42.37 on Qwen3-4B to 1.7B).
AI Agents · Independent Researcher
ToolMaze: When LLM Agents Must Replan After Tool Failures turns dynamic replanning after tool failures into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Multimodal Models · Independent Researcher
VideoKR: Knowledge-Intensive Video Understanding turns knowledge and reasoning in video understanding into a checkable test, with concrete failure signals, benchmark limits, and builder takeaways.
Video Generation · Kuaishou Technology
Instead of asking a video model to reason directly, a VLM grades its in-progress frames and fine-tunes a per-instance LoRA. The trick lifts RULER-Bench from 46.4 to 68.2.
World Models · University of Macau
PF-OPSD teaches a Qwen3.5-9B MLLM to decide when to simulate the future with a video world model, verify the rollout, and fold it into its answer, lifting accuracy +10.6 and +10.9 points on two new QA benchmarks.
Theorem Proving · Google DeepMind
This work evaluates AI-aided formal proof search on open math problems: the strongest agent resolves 9 of 353 Erdos problems and proves 44 of 492 OEIS conjectures.
DeepSeek-Prover-V1.5 combines Lean feedback, reinforcement learning, and RMaxTS search, reaching 63.5% on miniF2F and 25.3% on ProofNet.
LLM Reasoning · Samsung Research
TrOPD masks on-policy distillation to the tokens where the teacher is actually trustworthy, adding +3.06 to +3.52 average points over standard OPD on math, code, and STEM benchmarks with 1.5B-1.7B students.
LLM Reasoning · Shanghai AI Laboratory
SU-01, a 30B-A3B open model from Shanghai AI Lab, hits 35 points on IMO 2025 and clears gold lines at IPhO 2024/2025 using only ~338K short SFT trajectories plus a 200-step two-stage RL pipeline.
AntiSD inverts self-distillation — it rewards tokens where a privileged context disagrees with the base model — reaching GRPO's accuracy in 2 to 10x fewer steps and ending up to 11.5 points higher on 4B-30B models.
AI Agents · Shanghai Jiao Tong University
ARIS is an open-source autonomous-research harness pairing a Claude-family executor with a GPT-family reviewer to attack the failure it calls 'plausible unsupported success', with 65+ skills and a three-stage audit.
AutoResearchClaw is a 23-stage multi-agent system for autonomous ML research. It scores 0.648 vs AI Scientist v2's 0.419 on its 25-topic ARC-Bench, and rises to 7.27/10 quality with a human in the loop.
LLM Reasoning · Google Research
Showing a few worked examples with intermediate reasoning steps lets big models solve multi-step problems — a 540B model with 8 chain-of-thought exemplars hits 57% on GSM8K, beating fine-tuned GPT-3 with a verifier.
AI Agents · University of Illinois Urbana-Champaign
This survey reframes code not as a thing agents generate but as the executable substrate they run on, mapping 40-plus systems across three layers — interface, mechanisms, multi-agent scaling — plus seven open problems.
LLM Reasoning · Renmin University of China
DelTA reweights RLVR updates so credit lands on tokens that actually separate right answers from wrong ones, lifting Qwen3-8B-Base by 3.26 and Qwen3-14B-Base by 2.62 average points over the strongest baselines.
LLM Reasoning · Alibaba Qwen Team
DVAO weights each reward by its in-group variance instead of fixed coefficients, lifting Qwen3-4B-Base from 38.99% to 42.19% average accuracy and length compliance to 99.91% in math-plus-tool-use RL.
Fine-Tuning & Adaptation · HKUST
On-policy distillation does not sit between SFT and RLVR — it carves its own geometry. Its updates touch fewer weights, avoid principal directions, and lock into a narrow low-dimensional subspace early in training.
Retrieval-Augmented Generation · University of Massachusetts Amherst
GrepSeek trains an LLM to answer questions by issuing shell commands like grep against the raw corpus — no embedding index — and posts the best F1 and Exact Match across seven open-domain QA benchmarks.
Multimodal Models · University of Washington
Imaginative Perception Tokens (IPT) make a VLM render a new viewpoint instead of reasoning in text — lifting multiview counting 3.4%, rivaling closed models on path tracing, while text chain-of-thought sometimes hurts.
KVarN compresses the KV-cache to 2 bits with no calibration data, using a Hadamard rotation plus dual-axis variance normalization to stop quantization errors from snowballing across long reasoning chains.
Retrieval-Augmented Generation · AIRI
OCC-RAG is a pair of 0.6B and 1.7B reasoning models trained to answer strictly from the given context and refuse when the answer isn't there — matching or beating general models 2-6x their size on multi-hop QA.
PPO keeps policy-gradient RL stable with a clipped surrogate objective — almost as well-behaved as TRPO but far simpler — which made it the default RL engine behind RLHF for ChatGPT and InstructGPT.
LLM Reasoning · Princeton University
ReAct interleaves a model's reasoning traces with task actions like search and API calls, cutting chain-of-thought hallucination and beating RL agents on ALFWorld by 34% absolute with one or two examples.
AI Agents · Zhejiang University
SDAR adds a gated, token-level self-distillation signal from a skill-augmented teacher on top of GRPO, lifting multi-turn agents by up to +10.2 points on WebShop and +9.4 on ALFWorld for small Qwen models.
AI Agents · University of Science and Technology of China
Skill1 trains a single Qwen2.5-7B policy to retrieve, apply, and create reusable skills under one task-outcome reward — reaching 97.5% on ALFWorld, 6.5 points over the strongest RL-only baseline.
Toolformer trains a model to decide which API to call — calculator, QA, search, translation, calendar — purely by keeping the sampled calls that lower next-token loss, with only a handful of demos per tool.
Language Models · Alibaba Qwen Team
TransitLM is a 13M-record corpus from four Chinese cities (120,845 stations) that trains a language model to plan transit routes with no map engine — a 4B model hits 97.0% connectivity and 71.0% exact match.
Fine-Tuning & Adaptation · T-Tech
On-policy distillation wastes teacher supervision on a student's weak early rollouts. TRB blends teacher-like behavior inside a KL trust region during warmup, then anneals it to zero — best average on two math settings.
Theorem Proving · Google DeepMind
AlphaGeometry pairs a language model with a symbolic engine and trains on 100M synthetic theorems, solving 25 of 30 olympiad geometry problems versus 10 for the prior best.
Alignment · Stanford University
Direct Preference Optimization solves the RLHF problem with a single classification-style loss on preference pairs — no separate reward model, no RL loop, no sampling during training.
OpenAI's GPT-4 report is a measurement document, not a recipe. It hits human-level scores on professional and academic exams — bar exam ~top 10% — yet discloses no architecture, data, or compute.
Meta released Llama 3 as a herd of language models led by a dense 405B-parameter flagship with a 128K context window, trained on 15T+ tokens and openly published with weights.
DeepSeek-R1 learns to reason from reinforcement learning on whether its answer is correct — with no human reasoning examples — matches OpenAI o1 on AIME and MATH-500, and ships open MIT-licensed weights.