Direct Corpus Interaction: Letting Agents grep Instead of a Retriever

Quick answer

Direct Corpus Interaction (DCI) drops the retriever entirely and lets the search agent read the raw corpus with grep, find, file reads, and shell pipes — no embedding model, no vector index, no top-k API. Swapping a Qwen3-Embedding-8B retrieval tool for DCI under the same Claude Sonnet 4.6 backbone raises BrowseComp-Plus accuracy from 69.0% to 80.0% (+11.0 points) while cutting evaluation cost from $1,440 to $1,016 (−29.4%). The paper’s thesis: for capable agents, the retrieval bottleneck is the interface, not the retriever.

The bottleneck DCI attacks

Every conventional retriever, whether BM25, a dense embedder, or a reranker, reduces each document to a single similarity score and hands the agent a top-k slice before reasoning begins. That works for one-shot RAG, but it discards the signals agentic search depends on: exact lexical constraints, conjunctions of weak clues, and local-context checks. Worse, evidence filtered out at the top-k step cannot be recovered later, no matter how strong the downstream model is. The authors frame this as a resolution problem: a top-k list can only address whole documents or passages, never the precise span an agent actually wants to verify.
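A toy sketch of that failure mode, under loud assumptions: the corpus, the document ids, and the term-overlap score are all invented here as a stand-in for a real similarity function, and none of it comes from the paper. It only illustrates how a document matching two weak clues can fall outside the top-k cut while a direct scan over the raw text still finds it.

```python
import re

# Hypothetical toy corpus; ids and texts are invented for illustration.
CORPUS = {
    "d1": "annual report on solar energy markets and solar policy",
    "d2": "solar energy report for investors covering energy prices",
    "d3": "energy report: solar subsidies and energy storage",
    "d4": "minutes of the 2024 board meeting in Lisbon, chaired by Okafor",
}


def tokens(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, text):
    """One scalar per document: count of document tokens that occur in the query.

    This is a deliberately crude stand-in, not BM25 or a dense embedder.
    """
    query_terms = set(tokens(query))
    return sum(1 for tok in tokens(text) if tok in query_terms)


def top_k(query, k):
    """Rank by the single score and keep only the first k document ids."""
    ranked = sorted(
        CORPUS, key=lambda doc_id: overlap_score(query, CORPUS[doc_id]), reverse=True
    )
    return ranked[:k]


def direct_scan(*patterns):
    """Conjunction of exact clues over the whole corpus, like chained grep calls."""
    return [
        doc_id
        for doc_id, text in CORPUS.items()
        if all(re.search(p, text, re.IGNORECASE) for p in patterns)
    ]


query = "solar energy report 2024 Lisbon"
print(top_k(query, 3))                # the frequent terms dominate the ranking
print(direct_scan("2024", "Lisbon"))  # the two weak clues, enforced together
```

With k set to 3, the only document that satisfies both weak clues never reaches the agent, and nothing downstream of the cut can bring it back.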

How DCI works

DCI moves semantic understanding down into the LLM and gives it high-resolution access to the bytes. Instead of querying a retriever, the agent issues terminal calls: grep and rg for exact or regex matches, find and glob for structural navigation, and targeted reads (head, tail, sed) to inspect context around a hit. Crucially, these compose into pipelines: grep 'foo' file | grep 'bar' enforces a conjunction, find . | grep 'report' | grep '2024' combines weak clues, and grep -n 'keyword' file | head verifies a hypothesis against local context. No offline indexing is required, so DCI adapts to a corpus that changes under it.
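The three pipelines above can be mirrored in a few lines of Python to make their semantics explicit. This is a minimal sketch, not the agent's actual tooling: the in-memory FILES mapping, the file names and contents, and the helper names grep, grep_n, and context are all hypothetical, and each helper only approximates the shell command named in its docstring.

```python
import re

# Hypothetical in-memory workspace: path -> file contents (invented for illustration).
FILES = {
    "notes/report_2024.txt": "intro\nfoo appears here with bar\nonly foo here\nclosing",
    "notes/report_2023.txt": "foo and bar together\nnothing else",
    "data/summary_2024.txt": "bar alone\nkeyword: verified in section 2",
}


def grep(pattern, lines):
    """Keep the lines that match a regex, roughly `grep PATTERN`."""
    return [ln for ln in lines if re.search(pattern, ln)]


def grep_n(pattern, text):
    """Return (line_number, line) pairs, roughly `grep -n PATTERN file`."""
    return [
        (i, ln)
        for i, ln in enumerate(text.splitlines(), start=1)
        if re.search(pattern, ln)
    ]


def context(text, line_no, radius=1):
    """Lines around a 1-based hit, roughly a targeted `sed -n 'A,Bp'` read."""
    lines = text.splitlines()
    start = max(0, line_no - 1 - radius)
    return lines[start:line_no + radius]


# grep 'foo' file | grep 'bar'  -> conjunction of two exact constraints
conj = grep("bar", grep("foo", FILES["notes/report_2024.txt"].splitlines()))

# find . | grep 'report' | grep '2024'  -> weak clues combined over paths
paths = grep("2024", grep("report", sorted(FILES)))

# grep -n 'keyword' file | head -n 1  -> first hit, with its line number
hits = grep_n("keyword", FILES["data/summary_2024.txt"])[:1]

print(conj, paths, hits)
```

Each stage only narrows what the previous one produced, which is why the conjunction in the first pipeline needs no scoring function at all.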

The paper instantiates DCI two ways: a strong scaffold (Claude Code on Sonnet 4.6) and, to prove generality, a minimal harness built on a different stack (Pi with GPT-5.4 nano) that uses only bash and read plus simple truncation/compaction for context. Both beat the retrieval baselines.
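What "simple truncation/compaction" might look like is sketched below. The summary does not describe the minimal harness's actual rules, so everything here is an assumption made for illustration: the function names truncate_output and compact_history, the character-based limits, and the stub text are hypothetical.

```python
def truncate_output(text, max_chars=200):
    """Keep the head and tail of an over-long tool output, with a marker in between."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    omitted = len(text) - 2 * half
    return f"{text[:half]}\n[... {omitted} chars omitted ...]\n{text[-half:]}"


def compact_history(messages, budget_chars):
    """Stub out the oldest tool outputs until the history fits the budget.

    `messages` is a list of (role, content) tuples; only "tool" entries are
    replaced, so user and assistant turns survive compaction.
    """
    compacted = list(messages)
    for i, (role, _content) in enumerate(compacted):
        if sum(len(c) for _, c in compacted) <= budget_chars:
            break
        if role == "tool":
            compacted[i] = (role, "[output dropped]")
    return compacted


history = [("user", "q" * 10), ("tool", "x" * 100), ("tool", "y" * 100)]
print(truncate_output("a" * 10 + "b" * 10 + "c" * 10, max_chars=20))
print(compact_history(history, budget_chars=150))
```

Dropping the oldest tool outputs first is one plausible policy among several; a real harness might instead summarize them or keep the lines that matched.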

Key results

BrowseComp-Plus: accuracy rises from 69.0% to 80.0% (+11.0 points) versus the Qwen3-Embedding-8B retrieval tool under the same Sonnet 4.6 backbone, while cost falls from $1,440 to $1,016 (−29.4%) over the eval set.
Multi-hop QA: DCI with Claude Code reaches 83.0 average accuracy, beating the strongest retrieval-agent baseline by 30.7 points.
IR ranking: the same setup hits 68.5 average NDCG@10 across four BRIGHT and two BEIR datasets, +21.5 points over the best retrieval baseline.
Source of the gain: ablations show DCI usually does not win by surfacing more gold documents — it often prevails even when retrieval agents already surfaced all gold evidence. The advantage comes from converting that evidence into finer-grained local search and verification.

That last point is the most honest and most interesting result: the claim is not “grep finds more” but “grep lets the agent look more closely at what it already found.”
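The headline deltas in the results list can be re-derived from the numbers quoted there. The sketch below is plain arithmetic; the helper names are hypothetical, and the two "implied baseline" values are obtained here by subtraction, not quoted from the paper.

```python
def point_gain(before, after):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(after - before, 1)


def relative_change_pct(before, after):
    """Relative change as a percentage of the starting value."""
    return round((after - before) / before * 100, 1)


# BrowseComp-Plus, same Sonnet 4.6 backbone: retrieval tool vs DCI.
accuracy_gain = point_gain(69.0, 80.0)
cost_change = relative_change_pct(1440, 1016)

# Implied baselines, derived by subtraction (an assumption of this sketch,
# not figures reported in the text above).
implied_multihop_baseline = round(83.0 - 30.7, 1)
implied_ndcg_baseline = round(68.5 - 21.5, 1)

print(accuracy_gain, cost_change)
print(implied_multihop_baseline, implied_ndcg_baseline)
```

Note the two different units: the accuracy gain is in percentage points, while the cost figure is a relative change against the $1,440 baseline.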

Limits and open questions

DCI has a sharp operating envelope, and the paper is candid about it. As the BrowseComp-Plus corpus grows from 100K to 200K documents, tool calls per question jump from 38.5 to 86.9, latency and cost both more than double, and accuracy drops 13.6 points. At 400K documents accuracy falls to 37.5%, average tool usage hits 122.4 calls, and 20 examples run out of tool budget entirely. DCI scales well in search depth but poorly in search breadth: locating the first useful anchor in a huge candidate space is where cost explodes. The authors acknowledge that dense and sparse retrieval remain the right tool for large, static corpora; DCI is for local, evolving, agent-controlled workspaces. The wins are also reported on a curated set of BRIGHT/BEIR datasets and lean on frontier models (Sonnet 4.6, GPT-5.4 nano), so the result is partly a story about how strong the agent is, not just the interface.
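To put the breadth problem in numbers, the tool-call averages quoted above can be turned into growth factors per corpus doubling. This is a small arithmetic sketch; the TOOL_CALLS table just restates the figures in the paragraph, and the growth helper is a hypothetical name.

```python
# Average tool calls per question at each corpus size, as quoted above.
TOOL_CALLS = {100_000: 38.5, 200_000: 86.9, 400_000: 122.4}


def growth(small, large):
    """Ratio of average tool calls between two corpus sizes, rounded to 2 decimals."""
    return round(TOOL_CALLS[large] / TOOL_CALLS[small], 2)


print(growth(100_000, 200_000))  # first doubling of the corpus
print(growth(200_000, 400_000))  # second doubling
print(growth(100_000, 400_000))  # 4x corpus overall
```

The first doubling more than doubles the tool calls. The smaller factor for the second doubling should be read with care: it may partly reflect the tool budget cap, since 20 examples exhausted it at 400K documents.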

FAQ

What is Direct Corpus Interaction (DCI)?

DCI is a retrieval paradigm where a search agent reads the raw corpus directly with general-purpose terminal tools — grep, find, file reads, shell pipes — instead of calling an embedding-based retriever or vector index. The whole corpus stays available and the agent itself decides what to inspect.

How much does DCI improve agentic search accuracy?

On BrowseComp-Plus, replacing a Qwen3-Embedding-8B retriever with DCI raised accuracy from 69.0% to 80.0% (+11.0 points) under the same Claude Sonnet 4.6 backbone, while cutting evaluation cost by 29.4%. On multi-hop QA it reached 83.0 average accuracy, 30.7 points over the strongest retrieval-agent baseline.

Does DCI beat dense and sparse retrievers on IR benchmarks?

Yes, on the tested sets: 68.5 average NDCG@10 across four BRIGHT and two BEIR datasets, +21.5 points over the best retrieval baseline, without using any conventional semantic retriever.

When does DCI stop working?

DCI degrades as the corpus grows. Expanding BrowseComp-Plus to 400K documents dropped accuracy to 37.5% and pushed average tool usage to 122.4 calls per question. It is suited to local, evolving corpora, not to massive static corpora where dense or sparse retrieval is cheaper.

Why does grep beat a trained retriever for an agent?

Because the gain is about interface resolution, not recall. A top-k retriever can only return whole documents or passages; DCI lets the agent run exact-match, regex, and local-context checks on the bytes it already surfaced, turning coarse evidence into fine-grained verification.

One line: when the agent is strong enough to search like a researcher, the compressed similarity index becomes the bottleneck. Read the original paper on arXiv.