From Chatbot to Digital Colleague: A Survey of Persistent Autonomous AI

Quick answer

This is a survey from Tencent YouTu Lab, Tsinghua, Sun Yat-sen, Central South University, and UIC. It organizes recent LLM progress on two axes that it treats as coupled. The first axis is the cognitive core, moving from Chatbot-era fast generation to Thinking LLMs that spend inference-time compute on reasoning. The second axis is task execution, moving from ad-hoc tool-calling Agents to persistent Workspace systems. Its central thesis is that the combination it calls Workspace plus Skill is the qualitative leap, not better models or more tools. It is a position paper with no new model, dataset, or benchmark.

The two-dimension taxonomy

The survey’s spine is a 2x2 reading of LLM history, split into four eras.

On the cognitive axis, the Chatbot Era covers scaling autoregressive Transformers plus instruction alignment, which the authors frame as System-1 fast generation over compressed parametric knowledge. The Thinking LLM Era (their examples are OpenAI o1 and DeepSeek R1) covers long chain-of-thought, inference-time scaling, process supervision, and RL-driven self-correction.

On the execution axis, the Agent Era covers tool invocation and environment-action-feedback loops, with the limitation that each tool call is an isolated transaction. The next stage is what the paper labels the OpenClaw Era: agents that live inside a persistent workstation with files, terminals, browsers, logs, permissions, and reusable skills. OpenClaw here is a stand-in name for a harness-plus-workspace architecture, and the authors are explicit that they use it as a representative pattern, not as the origin of the idea. They cite OpenHands and SWE-agent as the same pattern from the software-engineering side, where SWE-agent’s claim is that the agent-computer interface is itself a design object.

Key results: the strongest organizing claim

The sharpest argument is in Part III: a workspace plus a skill library beats either one alone. A workspace without skills forces the agent to improvise the same procedure every time. A skill without a workspace stays a static instruction template with nowhere to act. Put them together and you get task closure: load a reusable procedure, operate over persistent artifacts, verify results, repair failures, and leave an inspectable final state.

The framing that follows is the survey’s best move. It reframes the agent question from “can the model pick the right next action” to “after many actions, is the environment still coherent, inspectable, and recoverable.” That shifts the unit of analysis from a single decision to the durability of state across a long trajectory, which is where real agent deployments actually break.

A skill in this scheme is a directory-level asset, often built around a SKILL.md file plus scripts, examples, dependencies, and validation routines. The authors are careful to say a skill is more than a prompt (it encodes preconditions and failure modes) and more than a tool (it describes a repeatable way of working, not one callable function). They call it procedural memory that loads only when relevant.

The data and evaluation shift

The survey’s most concrete contribution is tracking two background shifts that match the four eras.

Data moves from instruction-response pairs (SFT, InstructGPT-style) to chain-of-thought and process-reward data (PRM) to State-Action-Observation trajectories. In the agent stage a sample is no longer “what answer should the model write” but “which action in this state, and how does feedback steer the next step.” These trajectories carry tool-call traces, DOM states, screenshots, terminal output, file diffs, and error messages.

Evaluation moves through four stages. Stage I scores final-output accuracy (MMLU, GSM8K, MATH, BLEU/ROUGE). Stage II adds process verification and LLM-as-judge for reasoning traces (ProcessBench, PRMBench). Stage III introduces task closure rate, asking whether the system turned an initial environment state into the intended final state (SWE-bench, WebArena, OSWorld, WorkArena). Stage IV adds workspace capability and safety under reproducible state, with benchmarks the paper names ClawBench, ClawsBench, and a trajectory-safety benchmark. The task closure rate definition is the useful import here, because it forces success to mean a changed environment, not a plausible plan.

What the taxonomy hides

A clean 2x2 hides the messy parts, and this one is no exception.

It treats the cognitive and execution axes as separate and then “tightly coupled,” but never shows where coupling fails. A strong Thinking LLM inside a weak workspace, or a strong workspace driving a weak base model, are different failure profiles the framework collapses into one story.

The Workspace plus Skill thesis is hard to falsify as stated. The survey itself lists the failure modes that undercut it: skill brittleness when a UI redesign, a dependency bump, or a schema change silently invalidates a saved procedure; skill overfitting and negative transfer when a procedure built for one environment misleads in a near-neighbor. If skills need versioning, regression tests, compatibility checks, and deprecation to stay valid, then “reusable procedure” is doing less work than the headline implies, and most of the reliability burden moves to lifecycle management the paper does not measure.

The OpenClaw framing is presentation, not evidence. Because the case study is a representative pattern rather than a benchmarked system, none of the central claims (persistent state raises the ceiling, Workspace plus Skill is the leap) come with numbers in this paper. The empirical content lives in the systems it cites, not in the survey.

What evidence would settle it

The thesis is testable, and the survey points at the experiments without running them. A clean test would hold the base model and tool set fixed, then add only persistent workspace state, and measure task closure rate on Stage III and IV benchmarks. A second test would measure skill reuse net of maintenance: does a skill library raise closure rate once you subtract the cost of repairing skills after environment drift. If persistent state and reusable skills do not move closure rate beyond a strong stateless agent baseline, the “key leap” claim falls.

Builder judgment

Read this if you are designing an agent harness and want a clean vocabulary for state, skills, task closure, and the four evaluation stages. The Stage III/IV evaluation framing and the skill-as-directory-asset model are directly useful for deciding what to log and what to verify. Skip it if you want implementation detail, ablations, or a system to clone; there is none, and the OpenClaw case study stays conceptual. The honest read is that this is a map of a landscape its authors believe in, useful for orientation, light on proof.

Limits and open questions

The survey ships no experiments, so every causal claim rests on cited work, not on its own measurement. Its naming (OpenClaw, ClawBench, ClawsBench) reads as anonymized or project-internal labels, which makes some citations hard to trace to public artifacts. The two-axis split is an editorial choice that imposes a tidy four-era progression on a messier history, and the future-directions section (harness engineering, self-evolving governed loops) is a research agenda, not a result. Treat it as a structured reading list with a strong opinion, and verify the opinion against the primary systems.

FAQ

What is the Digital Colleague framework in this survey?

It is a two-axis taxonomy of LLM evolution. One axis is the cognitive core (Chatbot then Thinking LLM); the other is task execution (Agent then persistent Workspace). The endpoint, a Digital Colleague, is an agent that keeps project memory, follows local conventions, and delivers verifiable work across sessions rather than answering one prompt at a time.

Why does the survey treat OpenClaw as representative rather than original?

Because the authors use OpenClaw as a stand-in for a general harness-plus-workspace pattern, not as the system that invented it. They explicitly pair it with OpenHands and SWE-agent to show the same pattern from the software-engineering side, where the agent-computer interface is the design object.

What are the limitations the Digital Colleague Workspace plus Skill taxonomy hides?

It hides that reusable skills are brittle. The paper’s own limitations section admits skills break on UI redesigns, dependency bumps, and schema changes, and can overfit and transfer negatively. So most of the reliability work moves into versioning, regression tests, and deprecation, which the survey lists but does not quantify.

How does the Digital Colleague survey’s four-stage evaluation method define task closure rate?

Task closure rate appears at Stage III. It asks whether the system transformed an initial environment state into the intended final state, scored on benchmarks like SWE-bench and WebArena, instead of scoring a final answer (Stage I) or a reasoning trace (Stage II). Stage IV then adds workspace capability and safety under reproducible state.

One line: a Tencent YouTu Lab survey that maps the chatbot-to-agent shift onto two axes and argues persistent Workspace plus Skill is the real leap, strong as a vocabulary, unproven as a claim. Read the original paper on arXiv.