Agents’ Last Exam: Why AI Agents Still Fail at Work

Quick answer

Agents’ Last Exam (ALE) is a UC Berkeley benchmark for computer-use agents that tries to measure real professional work, not short toy tasks. It contains 1,490 task instances from 960 expert-authored workflows across 55 digital industries, with tasks split into near-term, full-spectrum, and last-exam difficulty tiers. The number to remember is harsh: the paper reports an average full pass rate of only 2.6% on the hardest tier, even though top systems can look strong on narrower agent benchmarks.

What ALE is really testing

ALE is built around a simple claim: current agent benchmarks are too small or too narrow to tell whether an agent can do economically valuable work. SWE-bench is useful for software issues. Terminal-Bench tests terminal tasks. WebArena and OSWorld stress browser or desktop control. ALE tries to cover professional digital work across a much wider map: engineering design, scientific computing, finance, media, operations, legal-style document work, and other knowledge-work clusters.

That breadth is the point. The paper is not asking whether a model knows facts about these fields. It asks whether an agent can open tools, inspect files, run domain software, produce an artifact, and satisfy a verifiable task specification. A good ALE task is supposed to feel like a scoped professional job: not a one-line quiz, not a vague chat request, and not a task where the grader simply asks another model to decide.

How the benchmark is built

The authors describe three design principles: representativeness, complexity, and verifiability. Tasks are collected with domain experts, grouped into 13 industry clusters and 55 subdomains, then converted into executable workflows. The paper's summary statistics put the total at 960 expert-authored task workflows and 1,490 task instances, so a single workflow can yield more than one instance.

The most important design choice is scoring. ALE avoids LLM-as-judge wherever a deterministic alternative exists. Many tasks check files, numbers, output artifacts, simulations, screenshots, or constraint satisfaction. If a task can only be graded by asking a model whether an answer seems good, it is rejected or re-engineered. A minority of cases use narrow evidence-anchored yes/no probes, but the benchmark’s main pitch is verifiable outcomes rather than vibes-based grading.
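To make "deterministic alternative" concrete, here is a minimal sketch of an outcome check in Python. It is not ALE's grader: the function name check_artifact, the JSON artifact layout, and the field names are all hypothetical, chosen only to show what checking files and numbers without a model in the loop can look like.

```python
import json
import math
from pathlib import Path


def check_artifact(path, expected, rel_tol=1e-6):
    """Hypothetical deterministic grader: pass only if the JSON artifact
    exists, parses to an object, and matches every expected field.
    Numbers are compared within rel_tol; other values must be equal."""
    artifact = Path(path)
    if not artifact.is_file():
        return False  # the agent never produced the file
    try:
        produced = json.loads(artifact.read_text())
    except json.JSONDecodeError:
        return False  # the file exists but is not valid JSON
    if not isinstance(produced, dict):
        return False
    for key, want in expected.items():
        if key not in produced:
            return False
        got = produced[key]
        want_is_number = isinstance(want, (int, float)) and not isinstance(want, bool)
        if want_is_number:
            got_is_number = isinstance(got, (int, float)) and not isinstance(got, bool)
            if not got_is_number or not math.isclose(got, want, rel_tol=rel_tol):
                return False
        elif got != want:
            return False
    return True
```

A check like this returns the same verdict every time for the same artifact, which is the property the benchmark is after. The tolerance is a design choice: too tight and floating-point noise fails correct work, too loose and wrong answers slip through.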

ALE also keeps most tasks private. The paper releases about 150 of 1,490 task instances publicly, roughly 10%, and keeps the rest behind rolling evaluation. So the answer to “is ALE public?” is: partially. There is a public slice for inspection and development, but the score-bearing pool is intentionally guarded against contamination.
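The "roughly 10%" figure is easy to verify. A quick check in Python, treating the article's "about 150" as exactly 150 for the arithmetic:

```python
TOTAL_INSTANCES = 1490
PUBLIC_INSTANCES = 150  # the article says "about 150"; exact count assumed here

public_share = PUBLIC_INSTANCES / TOTAL_INSTANCES
private_instances = TOTAL_INSTANCES - PUBLIC_INSTANCES

print(f"public share: {public_share:.1%}")         # public share: 10.1%
print(f"private instances: {private_instances}")   # private instances: 1340
```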

Key results

Scale: 960 expert-authored workflows, 1,490 task instances, 55 digital industries, and 13 broader industry clusters.
Difficulty: the hardest tier averages only 2.6% full pass, which is the paper’s core evidence that professional agent work is still mostly unsolved.
Benchmark contrast: Codex with GPT-5.5 reaches 82% on Terminal-Bench in the paper’s comparison, but scores far lower on ALE’s broader professional tasks.
ALE-CLI result: the paper reports Codex GPT-5.5 at 25.2% overall pass on ALE-CLI, with 41.5% on near-term tasks, 20.0% on full-spectrum tasks, and 4.5% on last-exam tasks.
Cost and time: one run costs roughly 3 to 10 dollars on average and often takes tens of minutes to hours, with runs capped at five hours.
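A back-of-envelope reading of the figures above, in Python. The input numbers come from the list. The sweep framing is an assumption: it treats "one run" as one agent attempt on one task instance and multiplies across all 1,490 instances, which the article does not spell out.

```python
# Figures quoted above; the full-sweep framing is an assumption.
TASK_INSTANCES = 1490
COST_PER_RUN_USD = (3.0, 10.0)  # rough per-run average range

sweep_low = TASK_INSTANCES * COST_PER_RUN_USD[0]
sweep_high = TASK_INSTANCES * COST_PER_RUN_USD[1]

# Codex with GPT-5.5 on ALE-CLI, pass rate in percent per tier.
ale_cli_pass = {"near-term": 41.5, "full-spectrum": 20.0, "last-exam": 4.5}
tier_ratio = ale_cli_pass["near-term"] / ale_cli_pass["last-exam"]

# Same system on Terminal-Bench versus ALE-CLI overall.
terminal_bench_pass = 82.0
ale_cli_overall = 25.2
gap_points = terminal_bench_pass - ale_cli_overall

print(f"one pass over every instance: ${sweep_low:,.0f} to ${sweep_high:,.0f}")
print(f"near-term vs last-exam pass rate: {tier_ratio:.1f}x")
print(f"Terminal-Bench minus ALE-CLI overall: {gap_points:.1f} points")
```

Under that assumption, a single pass over the whole pool lands between $4,470 and $14,900 for one agent configuration. The near-term tier is passed about 9.2 times as often as the last-exam tier, and the same system sits 56.8 points lower on ALE-CLI than on Terminal-Bench.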

The result is not “agents are useless.” It is more specific: agents that solve many terminal or coding benchmark tasks still collapse when the task requires domain context, tool selection, multi-step checking, and artifact-level correctness.

Why current agents fail

The paper’s failure analysis is more interesting than the leaderboard. For Claude Code with Opus 4.7, the authors say understanding and approach failures account for roughly three quarters of failed cases. That means the agent often misunderstands the task, chooses the wrong plan, or fails to recognize what domain-specific tool or workflow is required.

This diagnosis cuts against a common assumption that agents mainly need more tool calls or longer budgets. ALE suggests that more resources do not reliably produce better performance. The stronger signal is the foundation model: among well-engineered systems, model choice explains a much larger spread than harness choice. The authors also observe that agents often default to ad hoc scripts instead of using the intended professional software, which is exactly the behavior a narrow coding benchmark may reward but a real workflow may punish.

Limits and open questions

ALE is ambitious, but it is not a final exam for all work. Expert-authored tasks improve realism, yet they also make the benchmark expensive to maintain. The private-pool design is necessary for contamination control, but it means outside researchers cannot fully audit every task behind the leaderboard. The benchmark also evaluates digital artifacts, not field work, embodied robotics, wet-lab science, or long organizational projects where success depends on people and institutions.

The biggest interpretive limit is category balance. A 55-industry map sounds broad, but readers should still ask which industries dominate the score and whether their own domain is represented in enough depth. ALE is best read as a strong stress test for computer-use professional agents, not a universal measure of economic automation.

FAQ

What is Agents’ Last Exam (ALE)?

Agents’ Last Exam is a benchmark for AI agents doing long-horizon, professional computer-use tasks. It contains 1,490 task instances from expert-built workflows across 55 digital industries and grades agents by verifiable outcomes.

How hard is Agents’ Last Exam for current AI agents?

The hardest ALE tier averages only 2.6% full pass. In ALE-CLI, the paper reports Codex GPT-5.5 at 25.2% overall pass, dropping to 4.5% on the last-exam tier.

How is Agents’ Last Exam different from Terminal-Bench or SWE-bench?

Terminal-Bench and SWE-bench focus on narrower terminal or software-engineering tasks. ALE covers a broader set of professional workflows across 55 digital industries, and many tasks require domain tools, artifact checks, and multi-step professional reasoning.

Is Agents’ Last Exam public or open source?

ALE is partially public. The paper says about 150 of 1,490 task instances are released publicly, while the remaining score-bearing tasks stay private for rolling evaluation and contamination control.

What are the main limitations of Agents’ Last Exam?

ALE’s private task pool is hard for outsiders to fully audit, and its scope is digital professional work rather than every kind of real-world labor. It is a strong stress test for computer-use agents, not a complete measure of all automation.

One line: ALE is useful because it names the gap precisely. Current agents can look competent on narrow benchmarks, then fail when professional work requires domain understanding, tool choice, and verifiable artifacts. The original paper is available on arXiv.