SkillsVote: Governing the Lifecycle of Reusable Agent Skills

Quick answer

SkillsVote is a framework for governing the full lifecycle of agent skills — reusable scripts plus written procedure — instead of letting an agent dump raw trajectories into memory. Its headline result: governed offline evolution improves GPT-5.2 on Terminal-Bench 2.0 by up to 7.9 percentage points, and online evolution improves SWE-Bench Pro by up to 2.6 percentage points, with no change to the underlying model. The whole gain comes from controlling what skills the agent sees, how credit is assigned, and which discoveries are allowed to persist.

Why raw agent memory rots

Long-horizon agents generate trajectories that look like reusable experience, but raw traces are noisy. Open skill ecosystems make it worse: they fill up with redundant, uneven, and environment-sensitive artifacts. The real failure mode SkillsVote targets is pollution — if you let an agent write back whatever it just did, future context fills with low-quality or misleading skills, and the agent gets worse over time. So the paper reframes the problem: a skill is not a log entry, it is an experience schema that couples an executable script with non-executable guidance, and like any shared library it needs admission control.
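
To make the "experience schema" idea concrete, here is a minimal sketch of what such a record might look like. The class name `Skill`, its field names, and the `is_well_formed` check are all hypothetical; this summary does not describe the paper's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """Hypothetical skill record: an executable part plus written guidance."""
    name: str
    script: str                              # executable part, e.g. a shell snippet
    guidance: str                            # non-executable procedure: when and how to use it
    env_requirements: tuple[str, ...] = ()   # assumed field: tools the script needs


def is_well_formed(skill: Skill) -> bool:
    """A skill couples both halves; a bare command with no guidance is just a log entry."""
    return bool(skill.script.strip()) and bool(skill.guidance.strip())


# A raw trace fragment: a command with no reusable procedure attached.
log_entry = Skill(name="raw-trace", script="git status", guidance="")

# A skill in the sense described above: script and guidance together.
real_skill = Skill(
    name="bisect-regression",
    script="git bisect start && git bisect bad && git bisect good v1.0",
    guidance="Use when a test passed at a known tag and fails on HEAD.",
    env_requirements=("git",),
)
```

Under these assumptions, `is_well_formed(log_entry)` is false and `is_well_formed(real_skill)` is true, which is the distinction between a log entry and a skill that the paragraph draws.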

Collection. SkillsVote profiles a million-scale open-source corpus, scoring each candidate for environment requirements, quality, and verifiability, then synthesizes tasks specifically for the skills it can verify. The point is to start from a vetted library rather than from one agent’s own messy history.
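
As a rough illustration of the collection filter, the sketch below keeps only candidates that pass all three profiled dimensions. The `Profile` fields, the `admit_candidates` function, and the 0.7 quality threshold are assumptions made for this example, not details taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical profile for one candidate skill from the corpus."""
    name: str
    env_requirements: set[str]   # tools the skill needs in order to run
    quality: float               # assumed score in [0, 1]
    verifiable: bool             # can a task be synthesized to check it?


def admit_candidates(profiles, available_tools, min_quality=0.7):
    """Keep candidates whose environment is satisfiable, whose quality clears
    the (illustrative) threshold, and whose behaviour can be verified."""
    return [
        p.name
        for p in profiles
        if p.env_requirements <= available_tools   # subset test on tool sets
        and p.quality >= min_quality
        and p.verifiable
    ]


corpus = [
    Profile("tar-extract", {"tar"}, 0.90, True),
    Profile("k8s-rollout", {"kubectl"}, 0.95, True),   # tool not available
    Profile("misc-notes", set(), 0.30, False),         # low quality
    Profile("jq-filter", {"jq"}, 0.80, False),         # cannot be verified
]

admitted = admit_candidates(corpus, available_tools={"tar", "jq", "git"})
print(admitted)  # ['tar-extract']
```

Requiring all three conditions at once is a design choice of the sketch: a high-quality skill that cannot be verified is still rejected.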

Recommendation. Before execution, the system runs agentic library search over a structured skill library to surface the right instructional context for the task at hand. This is exposure control: deciding which skills the agent even gets to see.
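
The exposure-control idea can be sketched with a much simpler stand-in. The example below ranks skills by plain keyword overlap between the task and each skill description; this is not the agentic library search the paper describes, only a toy retrieval step, and the `recommend` function and the library contents are hypothetical.

```python
from __future__ import annotations


def recommend(task: str, library: dict[str, str], top_k: int = 2) -> list[str]:
    """Return up to top_k skill names whose descriptions share words with the task.

    Keyword overlap is a deliberately naive substitute for agentic search.
    """
    task_words = set(task.lower().split())
    scored = []
    for name, description in library.items():
        overlap = len(task_words & set(description.lower().split()))
        if overlap > 0:                       # skills with no overlap are never exposed
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


library = {
    "bisect-regression": "find the commit that introduced a failing test with git bisect",
    "tar-extract": "extract a tar archive into a directory",
    "venv-setup": "create a python virtual environment and install requirements",
}

print(recommend("find which commit broke the failing test", library))
# ['bisect-regression']
```

The agent only ever sees what this function returns, so a skill that is irrelevant to the task never enters its context.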

Evolution. After execution, SkillsVote decomposes a trajectory into skill-linked subtasks and runs credit attribution: it separates what the skill contributed to the outcome from the effects of agent exploration, the environment, and raw result signals. Only successful, genuinely reusable discoveries pass an evidence-gated update. This “vote” before admitting a skill is what stops the library from degrading.
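
A minimal sketch of an evidence gate in this spirit follows. The `SubtaskEvidence` fields and the specific admission rule (task succeeded, subtask succeeded, something new was discovered, and the result did not hinge on the environment) are assumptions chosen for illustration; this summary does not specify the paper's actual attribution procedure.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class SubtaskEvidence:
    """Hypothetical evidence record for one skill-linked subtask."""
    label: str
    skill: Optional[str]    # linked skill, or None if the agent improvised
    succeeded: bool         # raw result signal for this subtask
    novel: bool             # agent exploration found something not in the skill
    env_dependent: bool     # outcome hinged on this particular environment


def admitted_updates(task_succeeded: bool, subtasks: list[SubtaskEvidence]) -> list[str]:
    """Return labels of subtasks whose discoveries pass the (assumed) gate."""
    if not task_succeeded:
        return []  # assumption: nothing persists from a failed task
    return [
        s.label
        for s in subtasks
        if s.succeeded and s.novel and not s.env_dependent
    ]


trajectory = [
    SubtaskEvidence("configure-build", "cmake-presets", True, True, False),
    SubtaskEvidence("fix-path", None, True, True, True),          # environment-specific
    SubtaskEvidence("run-tests", "pytest-run", True, False, False),  # nothing new learned
    SubtaskEvidence("retry-download", None, False, True, False),  # subtask failed
]

print(admitted_updates(True, trajectory))   # ['configure-build']
print(admitted_updates(False, trajectory))  # []
```

Three of the four subtasks are rejected here, which is the intended behaviour: the gate is conservative, so most of a trajectory never reaches the library.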

Key results

Terminal-Bench 2.0 (offline evolution): improves GPT-5.2 by up to 7.9 percentage points — the strongest reported gain, from a frozen model with no fine-tuning.
SWE-Bench Pro (online evolution): improves performance by up to 2.6 percentage points, showing the governance loop also works when skills evolve during live use.
Corpus scale: the collection stage profiles a million-scale open-source corpus for environment requirements, quality, and verifiability before any task synthesis.
No model updates: every gain comes from the external skill library — the agent weights are frozen. The thesis is that you can improve a frozen agent by controlling exposure, credit, and preservation.

The 7.9 pp figure is the one to remember, but note it is the upper bound (“up to”), measured on a single coding-agent benchmark with one model. The 2.6 pp online gain is the more conservative and arguably more honest number, because online evolution is the harder regime.
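
Percentage points are absolute differences, so the same gain means different things at different baselines. The sketch below converts a gain in percentage points into a relative improvement. The baselines of 40% and 60% are purely hypothetical, since this summary does not state the baseline scores, and `relative_gain` is a helper introduced only for this example.

```python
def relative_gain(baseline_pct: float, gain_pp: float) -> float:
    """Relative improvement, in percent, implied by an absolute gain in percentage points."""
    if baseline_pct <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * gain_pp / baseline_pct


# Hypothetical baselines only; the real baseline scores are not given in this summary.
for baseline in (40.0, 60.0):
    print(f"baseline {baseline:.0f}%: +7.9 pp is a {relative_gain(baseline, 7.9):.1f}% relative gain")
```

The lower the baseline, the larger the relative improvement that a fixed 7.9 pp represents, which is one more reason to read the "up to" figures alongside the actual baseline numbers in the paper.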

Why this matters now

Skill libraries are becoming a standard layer in agent stacks, and most current designs are naive write-back: the agent records what it did and reuses it. SkillsVote is one of the first frameworks to take the governance problem seriously — treating the skill store like a package registry that needs profiling, search, and admission gates, not a scratchpad. As agents run longer and share skills across tasks, the bottleneck shifts from “can the model reason” to “can the system stop the shared memory from poisoning itself,” and that is the gap this paper addresses.

Limits and open questions

The evaluation is narrow. Both benchmarks, Terminal-Bench 2.0 and SWE-Bench Pro, evaluate coding and terminal agents, so it is unproven whether the same governance helps research, browsing, or open-ended tool use. The gains are reported as upper bounds (“up to 7.9 pp”), so the typical-case improvement is unclear from the abstract alone. The system is also heavy: profiling a million-scale corpus, synthesizing verification tasks, decomposing trajectories, and running credit attribution add real compute and engineering overhead that the headline accuracy numbers do not price in. Finally, the credit-attribution step, which must separate a skill’s contribution from luck and environment, is the hardest claim to trust, and the abstract does not show how reliable that attribution is in adversarial cases.

FAQ

What does SkillsVote actually do?

SkillsVote governs the lifecycle of agent skills across three stages: it collects and profiles a million-scale skill corpus for quality and verifiability, recommends the right skills before a task via agentic library search, and gates skill updates after a task using evidence-based credit attribution — admitting only successful, reusable discoveries.

How much does SkillsVote improve agent performance?

SkillsVote improves GPT-5.2 on Terminal-Bench 2.0 by up to 7.9 percentage points with offline evolution, and improves SWE-Bench Pro by up to 2.6 percentage points with online evolution, all without updating the model weights.

Does SkillsVote require fine-tuning the model?

No. Every reported gain comes from the external, governed skill library — the agent stays frozen. The argument is that controlling exposure, credit, and preservation of skills can improve a frozen agent without any model update.

Why not just let the agent save its own trajectories?

Because raw trajectories are noisy and open skill ecosystems fill with redundant, uneven, environment-sensitive artifacts. Indiscriminate write-back pollutes future context and can make the agent worse, which is why SkillsVote uses evidence-gated admission instead.

One line: govern the skill library — profile, recommend, and gate — and a frozen agent gets meaningfully better. Read the original paper on arXiv.