Agent Search Is a Runtime Problem

From the guides: Claude Code & Codex CLI

A May 14 arXiv paper tested grep and vector retrieval across Chronos, Claude Code, Codex, and Gemini CLI on 116 LongMemEval questions. Inline grep beat inline vector on every harness-model pair in the paper’s first experiment, but the bigger finding was stranger: the runtime changed the result almost as much as the retriever did.1

Agent search quality does not live inside “grep versus vector” alone. It lives in the full runtime: prompt, tool surface, shell ergonomics, result formatting, context pressure, delivery path, retry behavior, and the model’s ability to close the tool loop.

TL;DR

Sen, Kasturi, Lumer, Gulati, and Subbiah compared lexical search and vector search across a custom harness named Chronos plus three provider-native CLI harnesses: Claude Code, Codex, and Gemini CLI.1 The study used a 116-question LongMemEval-S subset and tested both inline tool results and file-based tool results.1 Inline grep outperformed inline vector for every harness-model pair in Experiment 1, including Codex CLI with GPT-5.4 at 93.1% for inline grep versus 75.9% for inline vector.1 The paper does not prove that grep beats vector search in general; the authors explicitly limit the conclusion to their long-memory conversational QA setting, where answers often depend on literal spans.1 The useful takeaway for agent builders is sharper: retrieval method, agent runtime, and result delivery form one system. Benchmark them together.

Key Takeaways

  • For agent builders: keep grep as a serious baseline. The paper’s results make “vector by default” look underspecified for long-memory QA over chat history, especially when literal names, dates, and user facts matter.1
  • For Codex and Claude Code users: do not treat a provider CLI as a neutral wrapper around a search primitive. The paper reports large harness-level shifts with the same underlying conversation data.1
  • For RAG teams: report the delivery path, not only the retriever. Inline results and file-based results produced different behavior because file delivery adds another tool-use task.1
  • For migration work: preserve the runtime behaviors that make search reliable. A Claude Code to Codex migration should test retrieval, transcript shape, and verification loops before it declares parity.
  • For citation-heavy systems: final citations are not the whole evidence story. A separate Agentic GraphRAG paper argues that provenance can depend on visited-but-uncited graph context, not only cited nodes.4

What Did the Grep Paper Actually Test?

The paper asks a practical question: when an LLM agent must answer questions over long conversation history, how much of retrieval quality depends on the search method, and how much on the agent system wrapped around it?1

The authors compared two retrieval families:

| Retrieval family | What it favors | Failure mode |
| --- | --- | --- |
| Grep / lexical search | exact names, dates, phrases, and distinctive strings | misses paraphrases or terms the agent never guesses |
| Vector / semantic search | paraphrases, related concepts, and indirect mentions | admits near-topic distractors and noisy neighbors |
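
To make the contrast concrete, here is a minimal sketch of both families over a chat-history corpus. Everything in it is hypothetical: the corpus, the regex query, and the `embed` callable, which a real system would replace with an embedding model and an index.

```python
import re
from math import sqrt

# Hypothetical chat-history corpus: one string per prior message.
corpus = [
    "Booked the flight to Lisbon for March 14th under the name A. Okafor.",
    "I usually go for aisle seats on long flights.",
    "Moved the hotel reservation to the one near the waterfront.",
]

def grep_search(pattern: str, docs: list[str]) -> list[str]:
    """Lexical retrieval: return messages containing a literal/regex pattern."""
    return [d for d in docs if re.search(pattern, d, re.IGNORECASE)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def vector_search(query: str, docs: list[str], embed, k: int = 2) -> list[str]:
    """Semantic retrieval: rank messages by embedding similarity.

    `embed` is a hypothetical callable mapping text to a vector; a real system
    would use an embedding model and an index rather than this brute-force scan.
    """
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

# Grep recovers the literal span directly when the agent guesses a distinctive term,
# but a question like "what seating does the user prefer?" shares no tokens with
# "aisle seats" -- that paraphrase gap is the case vector search is built for.
print(grep_search(r"March \d+", corpus))
```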

They tested those retrievers across two classes of runtime:

| Runtime class | Systems in the paper | Why it matters |
| --- | --- | --- |
| Custom harness | Chronos | The developer controls prompts, tools, context construction, result formatting, and stop criteria |
| Provider-native CLI harnesses | Claude Code, Codex CLI, Gemini CLI | The model works through shell-style tools, provider-specific transcript formatting, sandboxing, and CLI ergonomics |

They also varied how results reached the model. Inline delivery inserts search hits directly into the conversation. Programmatic delivery writes results to files, then requires the model to locate, open, and integrate them.1 That sounds like an implementation detail. The data says it is part of the task.
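
A rough sketch of the two delivery paths, assuming a generic `messages` transcript and a hypothetical `run_search` backend; the key difference is that the file-based path leaves the model one more tool call away from the evidence.

```python
import json
import tempfile

def run_search(query: str) -> list[str]:
    """Hypothetical search backend; stands in for grep or a vector index."""
    return [f"hit for {query!r}"]  # placeholder results

def deliver_inline(messages: list[dict], query: str) -> None:
    # Inline delivery: results land directly in the transcript the model reads next.
    hits = run_search(query)
    messages.append({"role": "tool", "content": "\n".join(hits)})

def deliver_to_file(messages: list[dict], query: str) -> str:
    # Programmatic delivery: results go to disk; the transcript only carries a path.
    # The model must now notice the path, issue a read, and integrate what it finds.
    hits = run_search(query)
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(hits, f)
        path = f.name
    messages.append({"role": "tool", "content": f"results written to {path}"})
    return path
```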

Why Did Grep Win Here?

The measured task favors literal recovery. LongMemEval asks questions over long multi-session conversations. Many answers depend on names, time expressions, personal facts, or exact prior statements. In that setting, a high-precision lexical tool can beat a semantic retriever because the answer often sits behind a distinctive string.1

The paper’s Table 1 shows the pattern clearly:

| Harness-model pair | Inline grep | Inline vector |
| --- | --- | --- |
| Chronos + Claude Opus 4.6 | 93.1% | 83.6% |
| Claude Code + Claude Opus 4.6 | 76.7% | 75.0% |
| Chronos + GPT-5.4 | 89.7% | 81.9% |
| Codex CLI + GPT-5.4 | 93.1% | 75.9% |
| Gemini CLI + Gemini 3.1 Pro | 81.9% | 75.0% |

That table does not say “delete your vector database.” The paper itself warns against that reading. The authors say their conclusion is tied to long-memory conversational QA and that dense or hybrid retrieval may behave differently in scientific synthesis, visual documents, or code semantics.1

The better reading: exact search deserves a first-class seat in any serious agent runtime. If your agent can search the filesystem, read logs, inspect prior transcripts, or recover a literal user fact, lexical search may be the cheapest high-signal tool in the box.
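
What "first-class seat" can look like in practice: a shell-grep tool wrapper with hard caps on match count and line length so the output stays readable inside a context window. The flags are standard GNU grep; the limits and the function shape are my own illustrative choices, not anything from the paper.

```python
import subprocess

MAX_MATCHES = 50       # keep stdout small enough for the model to actually read
MAX_LINE_CHARS = 200   # truncate pathological lines (minified files, long log lines)

def grep_tool(pattern: str, root: str = ".") -> str:
    """Exact search over a directory tree, returning a bounded, readable result."""
    proc = subprocess.run(
        ["grep", "-rniI", pattern, root],   # recursive, line numbers, case-insensitive, skip binaries
        capture_output=True, text=True,
    )
    lines = proc.stdout.splitlines()
    clipped = [line[:MAX_LINE_CHARS] for line in lines[:MAX_MATCHES]]
    if not clipped:
        return f"no matches for {pattern!r}"
    footer = f"\n... {len(lines) - MAX_MATCHES} more matches omitted" if len(lines) > MAX_MATCHES else ""
    return "\n".join(clipped) + footer
```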

The Runtime Changed the Result

The most useful line in the paper is not “grep won.” It is that changing the harness can shift the ceiling by roughly the same scale as changing the retriever.1

One example: Claude Opus 4.6 with inline grep hit 93.1% under Chronos and 76.7% under Claude Code.1 Same model family, same benchmark subset, different runtime. Another example: Codex CLI with GPT-5.4 hit 93.1% with inline grep, but dropped to 55.2% when grep results moved through the programmatic file-delivery path; programmatic vector landed at 67.2%.1

That is not a retrieval result alone. It is a runtime result.

The model had to do more than find evidence. It had to understand the tool contract, choose search terms, interpret stdout, decide when to retry, read files when results were not inline, and integrate evidence into an answer. Every one of those steps belongs to the agent runtime. If any step gets brittle, a strong retriever can still produce a weak answer.
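
Here is a sketch of what closing that loop means in code. `call_model` and `execute_tool` are hypothetical stand-ins for a provider API and a tool dispatcher, and the step budget is illustrative; the point is that errors and partial results flow back into the transcript so the model can retry rather than guess.

```python
def call_model(messages: list[dict]) -> dict:
    """Hypothetical provider call; returns {'content': ..., 'tool_call': ... or None}."""
    raise NotImplementedError

def execute_tool(name: str, args: dict) -> str:
    """Hypothetical dispatcher for grep, read_file, vector_search, and friends."""
    raise NotImplementedError

def agent_loop(task: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if reply.get("tool_call") is None:
            return reply["content"]            # the model decided it has enough evidence
        name, args = reply["tool_call"]["name"], reply["tool_call"]["args"]
        try:
            result = execute_tool(name, args)  # search, open a result file, re-read a span
        except Exception as exc:
            result = f"tool error: {exc}"      # errors return to the transcript so the
                                               # model can retry with better terms
        messages.append({"role": "tool", "content": result})
    return "step budget exhausted without a grounded answer"
```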

Why File-Based Delivery Is a Tool-Use Test

File-based delivery has an obvious appeal. It can reduce context pressure by keeping large search results outside the immediate transcript until the model asks to read them. That should help when inline vector dumps crowd the window.

The paper shows the tradeoff. Programmatic vector beat programmatic grep on several rows, which supports the context-pressure argument.1 But the Codex/GPT-5.4 row shows the other side: file delivery can turn cheap retrieval into a multi-step workflow. The agent must find the artifact, open it, extract useful spans, and retry when the first read was not enough.1

That means programmatic delivery trades context bandwidth for tool-loop competence. The trade pays only when the runtime closes the loop reliably.

This matters for real work. A local agent does not fail search only because the index was wrong. It fails search because stdout got chunked poorly, because the result file path was easy to miss, because a command returned too much noise, because the prompt framed the task badly, or because the model stopped one read too early.

What This Means for Codex Migration

My own Claude Code to Codex migration has focused on moving operating contracts rather than copying a file tree. This paper reinforces that choice.

If search quality depends on the runtime, then migration quality depends on more than “does Codex have a search tool?” A migration has to preserve the behaviors that make search useful:

  • the agent knows when to use exact search before semantic search;
  • command output stays small enough to read;
  • evidence paths survive into the final answer;
  • file-based artifacts are easy to locate and inspect;
  • failed searches trigger better queries instead of premature answers;
  • public writing uses source verification, not plausible retrieval.

That list is intentionally public and generic. It does not disclose private hooks, private prompts, or local workflow internals. The point is the operating contract: make the agent prove what it found, not merely sound confident about the search it performed.
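
One way to make that contract testable during a migration is to encode each behavior as a small check over recorded tool transcripts. The transcript schema and assertions below are hypothetical; the point is that parity gets asserted on behavior, not on feature lists.

```python
# Hypothetical recorded transcript: an ordered list of tool events from one task run.
transcript = [
    {"tool": "grep", "args": {"pattern": "Okafor"}, "output_chars": 412},
    {"tool": "read_file", "args": {"path": "results/run1.json"}, "output_chars": 2048},
]

def used_exact_search_first(events: list[dict]) -> bool:
    searches = [e for e in events if e["tool"] in ("grep", "vector_search")]
    return bool(searches) and searches[0]["tool"] == "grep"

def output_stayed_readable(events: list[dict], limit: int = 10_000) -> bool:
    return all(e["output_chars"] <= limit for e in events)

def opened_evidence_it_cited(events: list[dict], cited_paths: list[str]) -> bool:
    read = {e["args"]["path"] for e in events if e["tool"] == "read_file"}
    return set(cited_paths) <= read

assert used_exact_search_first(transcript)
assert output_stayed_readable(transcript)
```

Run the same checks against transcripts from both runtimes before declaring the migration done.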

The paper also explains why a migration can feel worse even when every obvious feature exists. Claude Code and Codex may both expose shell tools. Both may read files. Both may search. But if transcript formatting, file-result handling, stop behavior, or retry patterns differ, the same search primitive can produce different work.

The Other Three Signals Point the Same Way

Three other May 14 papers from the same scan point at the same broader pattern: agent quality is moving out of isolated model calls and into runtime architecture.

APWA treats highly parallel agent work as a distributed execution problem. The authors decompose workflows into non-interfering subproblems that independent resources can process without cross-communication, then evaluate scaling on larger tasks where prior systems fail.2 That is a runtime claim, not a prompt trick.

MeMo treats memory as a separate model component. It keeps the executive LLM fixed, encodes new knowledge into a dedicated memory model, and reports resistance to retrieval noise plus plug-and-play compatibility with open and closed-source LLMs.3 That is a memory architecture claim, not a longer-context claim.

The Agentic GraphRAG provenance paper argues that final citations can be necessary but insufficient. Accurate answers can depend on uncited traversal context, graph structure, and visited-but-uncited entities.4 That is a provenance claim, not a citation-format claim.

Put those beside the grep paper and a shape emerges:

| Problem | Weak framing | Stronger framing |
| --- | --- | --- |
| Search | pick grep or vector | test retrieval plus runtime plus delivery path |
| Parallel work | spawn more agents | decompose into non-interfering execution units |
| Memory | stuff more context | design a memory layer with update and retrieval behavior |
| Citations | cite final sources | preserve provenance across the retrieval trajectory |

The common theme: the wrapper is the product. The runtime decides whether the model’s capability becomes useful work.

What I Would Change in an Agent Stack

Start with a boring baseline. Give the agent exact search over the files, logs, transcripts, or notes that matter. Measure that before adding semantic retrieval.

Then test four combinations, not two:

| Retriever | Delivery path |
| --- | --- |
| grep | inline |
| grep | file-based |
| vector | inline |
| vector | file-based |

Record the tool transcript for each run. The final answer is not enough. You need to know whether the agent searched the right terms, opened the right file, noticed the right span, retried after a miss, and cited the evidence that actually supported the answer.
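
A minimal harness for that 2x2 matrix, assuming a hypothetical `run_agent(question, retriever, delivery)` entry point that returns both the answer and the tool transcript; the substring-match scorer is a placeholder you would replace with a real judge.

```python
import itertools
import json

RETRIEVERS = ["grep", "vector"]
DELIVERY = ["inline", "file"]

def run_agent(question: str, retriever: str, delivery: str) -> dict:
    """Hypothetical entry point: returns {'answer': str, 'transcript': list[dict]}."""
    raise NotImplementedError

def score(answer: str, gold: str) -> bool:
    return gold.lower() in answer.lower()   # placeholder; use a proper judge in practice

def benchmark(questions: list[dict]) -> None:
    for retriever, delivery in itertools.product(RETRIEVERS, DELIVERY):
        correct, runs = 0, []
        for q in questions:
            result = run_agent(q["question"], retriever, delivery)
            correct += score(result["answer"], q["gold"])
            runs.append({"question": q["question"], **result})   # keep the full transcript
        with open(f"runs_{retriever}_{delivery}.json", "w") as f:
            json.dump(runs, f, indent=2)
        print(f"{retriever:>6} / {delivery:<10} accuracy: {correct / len(questions):.1%}")
```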

Add vector search when the domain needs paraphrase recovery, conceptual synthesis, or non-literal evidence. Keep exact search when the domain contains names, IDs, filenames, dates, log lines, command output, user preferences, or previous instructions. Use hybrid routing when the task mixes both.
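
A rough routing heuristic for that split, based on surface features of the query; the patterns and thresholds below are illustrative and would need tuning against real traffic.

```python
import re

# Signals that usually mean a literal span is recoverable by exact search:
# proper names, dates, filenames, quoted phrases, IDs.
LITERAL_SIGNALS = [
    r"\b[A-Z][a-z]+ [A-Z][a-z]+\b",       # proper-name-like word pairs
    r"\b\d{4}-\d{2}-\d{2}\b",             # ISO dates
    r"[\w./-]+\.(py|log|json|md|txt)\b",  # filenames
    r'"[^"]+"',                           # quoted phrases
    r"\b[A-Z0-9]{6,}\b",                  # IDs, ticket numbers, hashes
]

def route_query(query: str) -> str:
    """Return 'exact', 'semantic', or 'hybrid' for a retrieval query."""
    literal_hits = sum(bool(re.search(p, query)) for p in LITERAL_SIGNALS)
    if literal_hits >= 2:
        return "exact"
    if literal_hits == 0:
        return "semantic"
    return "hybrid"

print(route_query('Find the message where I mentioned "A. Okafor" and the 2026-03-14 flight'))  # exact
print(route_query("What kind of seating does the user generally prefer?"))                      # semantic
```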

For public writing, make the retrieval path stricter. A cited article should carry source URLs, claim-source alignment, and a record of what remains unverified. If the system used a graph, a memory layer, or an intermediate retrieval path, final citations should not be the only trace. The provenance paper makes that point for Agentic GraphRAG, but the product lesson applies more broadly: evidence should explain the path, not only the destination.4
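
One way to keep that trajectory-level evidence is a plain record per published claim. The field names below are my own hypothetical schema, not anything defined in the provenance paper.

```python
from dataclasses import dataclass, field

@dataclass
class ClaimProvenance:
    claim: str                           # the sentence as published
    cited_sources: list[str]             # URLs or node IDs that appear in the article
    visited_context: list[str] = field(default_factory=list)  # retrieved but uncited material
    retrieval_path: list[str] = field(default_factory=list)   # queries and hops the agent took
    verified: bool = False               # did a human or checker confirm claim-source alignment?

record = ClaimProvenance(
    claim="Inline grep outperformed inline vector for every harness-model pair.",
    cited_sources=["arXiv:2605.15184"],
    retrieval_path=["grep 'Table 1'", "read paper section on Experiment 1"],
    verified=True,
)
```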

The Better Benchmark Question

The weak benchmark question is:

Which retriever is better?

The stronger question is:

Under this runtime, with this model, this corpus, this delivery path, and this retry policy, which search behavior produces verified answers?

That question is slower to answer. It also tells you something you can use.

Agent work keeps tempting people toward component claims: better model, better retriever, better prompt, better memory, better parallelism. The operational reality keeps pushing the other way. The component only matters after the runtime turns it into a dependable path from task to evidence to action.

That is the part worth migrating.


FAQ

Does the paper prove grep beats vector search in general?

No. The authors explicitly limit the result to the studied long-memory conversational QA setting. They state that dense retrieval and hybrid routing may behave differently in domains where evidence is rarely literal, including scientific synthesis, visual-heavy documents, and code semantics.1

Why did grep perform so well in the experiment?

LongMemEval questions often depend on literal spans from past conversations: names, dates, personal facts, and exact statements. Grep rewards high-precision patterns when the agent can guess a distinctive term.1

Why did the harness matter?

The runtime controls prompt shape, tool descriptions, transcript formatting, shell behavior, context construction, result delivery, and stop criteria. The paper reports large accuracy shifts across Chronos, Claude Code, Codex CLI, and Gemini CLI even when the underlying conversation data stayed the same.1

What should Codex users do with this?

Keep exact search as a baseline, inspect tool transcripts, and test inline versus file-based delivery before assuming a retrieval method is better. The paper's Codex row is useful, but it reflects one benchmark setting and one corpus type, and the paper's context-scaling rows do not give a complete vendor-wide picture.1

How does this relate to RAG citations?

The Agentic GraphRAG provenance paper argues that final citations can support an answer while still omitting retrieval context that influenced the answer. For agent systems, citation quality should include provenance over the path, not only the final cited source list.4

What should a migration from Claude Code to Codex preserve?

Preserve the operating behavior: when the agent searches, how it limits output, how it opens evidence, how it retries, how it records source paths, and how it refuses unsupported claims. Do not assume parity because both environments expose a shell and a search command.


References


  1. Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, “Is Grep All You Need? How Agent Harnesses Reshape Agentic Search,” arXiv:2605.15184v1, submitted 14 May 2026. Primary source for the LongMemEval-S setup, Chronos / Claude Code / Codex CLI / Gemini CLI comparison, inline versus programmatic delivery distinction, Table 1 accuracy values, Experiment 2 context-scaling discussion, and the paper’s stated limitation that the conclusion does not prove grep beats vector search in general. 

  2. Evan Rose, Tushin Mallick, Matthew D. Laws, Cristina Nita-Rotaru, Alina Oprea, “APWA: A Distributed Architecture for Parallelizable Agentic Workflows,” arXiv:2605.15132v1, submitted 14 May 2026. Source for APWA’s decomposition of parallelizable workflows into non-interfering subproblems, independent resources without cross-communication, and evaluation claim that APWA scales on larger tasks where prior systems fail. 

  3. Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong, Arun Verma, Alok Prakash, Nancy F. Chen, Bryan Kian Hsiang Low, Daniela Rus, Armando Solar-Lezama, “MeMo: Memory as a Model,” arXiv:2605.15156v1, submitted 14 May 2026. Source for the dedicated memory-model architecture, fixed executive LLM, resistance to retrieval noise, avoidance of catastrophic forgetting in the executive model, closed-source LLM compatibility, and BrowseComp-Plus / NarrativeQA / MuSiQue evaluation. 

  4. Riccardo Terrenzi, Maximilian von Zastrow, Serkan Ayvaz, “Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG,” arXiv:2605.15109v1, submitted 14 May 2026. Source for the claim that citation faithfulness in Agentic GraphRAG should be treated as a trajectory-level provenance problem involving graph traversal, structure, cited evidence, and visited-but-uncited entities. 
