Agent Code Search Has a Token Budget

Semble crossed 900 GitHub stars on May 17, 2026, by making a blunt claim: coding agents waste most of their context budget when they grep, open whole files, and read far more code than the task needs.[1]

The claim lands because it reframes code search as a budget problem. A human can skim a noisy rg result and ignore the junk. An agent pays for every irrelevant line in context, attention, and tool-loop time.

TL;DR

Semble is a code search library for agents. It offers an MCP server, shell integration through AGENTS.md or CLAUDE.md, a CLI, and a Python API.[1] Under the hood, Semble chunks code, searches with BM25 plus static Model2Vec code embeddings, fuses the ranked lists with Reciprocal Rank Fusion, then reranks with code-aware signals such as symbol weighting, definition boosts, identifier stems, file coherence, and noise penalties.[1] Its benchmark reports NDCG@10 of 0.854 across roughly 1,250 queries over 63 repositories in 19 languages, close to the CodeRankEmbed Hybrid score of 0.862, and the same benchmark table shows Semble indexing far faster.[2] The important product lesson is not “replace grep.” The lesson is sharper: an agent search tool should return the smallest evidence packet that lets the model act correctly.

Key Takeaways

  • For coding-agent users: keep rg for exact strings, but use snippet-ranked search when the task asks for behavior rather than a literal token.
  • For tool builders: optimize retrieved context, not only retrieval accuracy. The useful unit is evidence per token.
  • For Codex and Claude Code users: prefer a shell-accessible path for subagents, because top-level MCP tools may not reach delegated agents in the same way.[1]
  • For benchmark readers: separate vendor benchmark claims from local runtime behavior. My cold uvx run took much longer than Semble’s benchmark table because package/model/index startup dominated.
  • For public writing: retrieval tools do not remove citation work. They only make the evidence path cheaper to inspect.

Why Grep Is Still Good, And Still Not Enough

rg remains the right first tool for exact strings. If I need visible_label_residue, a credential variable name, or a function name, lexical search should win on speed and certainty. In my local test, a literal rg query for translation residue terms returned in about a tenth of a second.[5]

The problem starts when the agent does not know the exact string.

Agents often search by intent: “where does the blog i18n gate check visible label residue” or “how does translation release verification work?” Literal search can still find useful lines, but the agent has to choose words, inspect dozens of hits, read files, reformulate the query, and decide which line carries the answer. Every step consumes context and creates a chance to stop too early.

Semble attacks that specific failure mode. It lets the agent query in natural language, then returns ranked code snippets instead of whole files.[1] That does not make rg obsolete. It changes the default interaction from “show me every line matching this term” to “give me the smallest useful slice of code.”

That distinction matters because agents do not read like humans. Humans can glance over 80 lines of search output and hold only the interesting three lines in mind. Models receive the full output as tokens. A noisy search result becomes part of the task environment.

What Semble Actually Does

Semble’s public README describes four integration paths: MCP server, Bash / AGENTS.md, CLI, and Python API.[1] The Codex setup is a local MCP server entry in ~/.codex/config.toml, and the shell path adds a code-search section to AGENTS.md or CLAUDE.md.[1]

The shell path matters more than it first appears. The README states that Claude Code and Codex CLI subagents should use the Bash integration instead of, or alongside, MCP, because subagents cannot call MCP tools directly in that setup.[1] That is a practical agent-interface point: the search tool needs to exist where the work happens, not only where the top-level session starts.

The retrieval stack also looks like the direction agent search is going:

| Layer | Role |
| --- | --- |
| Code-aware chunking | Search returns snippets instead of whole files |
| BM25 | Catches identifiers, API names, exact terms, and lexical clues |
| Static Model2Vec embeddings | Catches semantic intent without a transformer forward pass at query time |
| Reciprocal Rank Fusion | Combines lexical and semantic rankings without score calibration |
| Code-aware reranking | Boosts definitions, symbol matches, file-level coherence, and likely canonical implementations |

That design matches what I have seen in local retrieval systems: pure vector search misses identifiers, pure keyword search misses intent, and hybrid ranking gives the agent a better first read.[4]
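The fusion layer is the easiest part of that stack to show in code. What follows is a minimal sketch of Reciprocal Rank Fusion, not Semble's implementation: the constant k=60 is the value commonly used in the RRF literature and is an assumption here, and the chunk IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk IDs into one list, best first.

    Each list contributes 1 / (k + rank) per item, so only rank positions
    matter and the lexical and semantic scores never need calibration.
    k=60 is an assumed default, not a value taken from Semble.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score descending, then by ID so ties are deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical chunk IDs in "path:start-end" form.
bm25_hits = ["gate.py:40-62", "README.md:10-30", "tests/test_gate.py:5-20"]
embedding_hits = ["gate.py:40-62", "verify.py:1-25", "README.md:10-30"]
fused = reciprocal_rank_fusion([bm25_hits, embedding_hits])
```

A chunk that both retrievers rank highly rises to the top, while a chunk found by only one retriever still survives further down the fused list.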

The Benchmark Claim Is About Context, Not Magic

Semble’s benchmark README reports two different classes of result.

The first class measures retrieval quality and speed. The table reports Semble at 0.854 NDCG@10, CodeRankEmbed Hybrid at 0.862, BM25 at 0.673, and ripgrep at 0.126. The benchmark covers about 1,250 queries over 63 repositories in 19 languages, with CPU-only runs.[2]
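For readers who have not worked with the metric, here is a minimal sketch of how NDCG@10 is typically computed. It assumes linear gain and a log2 position discount; the benchmark's exact gain formula and relevance labels are not stated in this article, so treat the function names and parameters as illustrative.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances=None, k=10):
    """Normalize DCG by the best achievable ordering.

    ranked_relevances: relevance labels in the order the tool returned them.
    all_relevances: labels for every relevant item in the corpus, when known;
    falls back to the returned labels, which is a simplifying assumption.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# A relevant chunk at rank 1 scores higher than the same chunk at rank 2.
first_place = ndcg_at_k([1, 0, 0])
second_place = ndcg_at_k([0, 1, 0])
```

The position discount is why the metric suits agent search: a correct chunk buried at rank 8 costs the agent more reading than the same chunk at rank 1.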

The second class measures token efficiency. The benchmark models a common coding-agent workflow: split a query into keywords, run rg --fixed-strings --ignore-case, rank files by distinct keyword matches, then read matched files in full. Against that baseline, the benchmark reports an average of 45,692 tokens for ripgrep plus file reads versus 566 tokens for Semble, a 98% reduction.[2]
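The reduction figure can be checked directly from the two reported averages. This is plain arithmetic on the published numbers; whether the benchmark derives its headline percentage from these averages or from per-query reductions is not stated here, so the small gap from the rounded 98% is left unexplained.

```python
# Averages reported in the benchmark documentation.
baseline_tokens = 45_692   # ripgrep plus full-file reads
semble_tokens = 566        # Semble snippets

reduction = 1 - semble_tokens / baseline_tokens
ratio = baseline_tokens / semble_tokens

print(f"{reduction:.1%} fewer tokens, about {ratio:.0f}x smaller")
# prints "98.8% fewer tokens, about 81x smaller"
```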

That is the interesting claim. Not “semantic search beats grep” in every setting. Not “agents should stop using exact search.” The claim is that grep-plus-read sends too much irrelevant code into the model when the task only needs a few chunks.

The benchmark’s methodology also explains where the claim should and should not apply. Semble compares against reading matched files in full.[2] If your workflow already uses rg -n, sed, and surgical line ranges, your baseline may be tighter than the benchmark’s grep-plus-read model. If your agent routinely opens entire files after a broad search, the benchmark is closer to your real failure mode.
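To see how much the baseline matters, compare the cost of a full-file read with a narrow line window around a hit. This is a rough sketch: the four-characters-per-token estimate is a crude assumption, not a real tokenizer, and the 500-line file is synthetic.

```python
def estimate_tokens(text):
    # Crude heuristic (assumption): roughly 4 characters per token.
    return len(text) // 4


def window(lines, hit_line, context=10):
    """Return the lines around a 1-indexed hit, like a narrow sed range."""
    start = max(0, hit_line - 1 - context)
    end = min(len(lines), hit_line + context)
    return lines[start:end]


# Synthetic 500-line file standing in for a real source file.
file_lines = [f"line {n}: some code here" for n in range(1, 501)]

full_cost = estimate_tokens("\n".join(file_lines))
window_cost = estimate_tokens("\n".join(window(file_lines, hit_line=250)))
savings = 1 - window_cost / full_cost
```

With a 21-line window on a 500-line file, the disciplined baseline is already most of the way to snippet-sized output, which is why the headline reduction shrinks for workflows that read narrow ranges.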

My Local Test

I ran Semble in the site repo through uvx --from semble semble, then compared it with literal rg searches.

I started with a release-process query:

    semble search "blog translation quality gate release verifier D1" . --top-k 5 --include-text-files

Semble returned five snippets. The top result summarized the blog translation release loop in a migration article table. Another result pointed directly at scripts/i18n-automation/README.md, which contained the quality-gate, release-verifier, native-review, commit, push, Railway, Cloudflare, and live-smoke steps.[5]

The comparable rg command returned fast, but it returned a large stream of literal matches for credential variables, blog_release_verify, and related names across scripts, tests, and docs.[5] A human can filter that. An agent has to spend context to do the same.

I then asked for the gate implementation:

    semble search "where does the blog i18n gate check visible label residue" . --top-k 5 --include-text-files

Semble’s top result pointed at the exact local gate block where visible_label_residue gets assigned, converted into an error, and affects the finding status. The output included the relevant function body lines rather than a whole file.[5]

The comparable rg query again finished faster, but returned many hits across tests, translation prompts, repair scripts, the README, and the gate implementation.[5]

That test does not prove Semble’s benchmark. My invocation used uvx, downloaded packages and model assets, indexed a large mixed repo, included Markdown and JSON files, and ran from a cold start. The first Semble query took about 54 seconds; the second took about 31 seconds.[5] Those numbers do not match the project’s benchmark table, and I would not cite them as Semble performance data.

The test does prove the product shape. Semble returned smaller, more answer-shaped evidence packets. After two searches, semble savings --verbose reported about 38,100 estimated tokens saved at 94%, using its own file-versus-snippet savings method.[5] Treat that as a tool-reported estimate, not an independent measurement, but the direction matched the visible output.

The Right Mental Model: Evidence Packets

Agent search should produce evidence packets.

An evidence packet has four properties:

| Property | Why it matters |
| --- | --- |
| Small | The model spends attention on relevant code, not file bulk |
| Located | The result carries file path and line range |
| Sufficient | The snippet contains enough context for the next step |
| Escalatable | The agent can open the full file when the snippet is not enough |
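As a data structure, an evidence packet can be very small. The sketch below is illustrative only: EvidencePacket, its field names, the 400-token budget, and the four-characters-per-token estimate are all assumptions, not part of Semble's output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidencePacket:
    """Hypothetical container for one search result handed to the model."""
    path: str         # Located: where the evidence lives
    start_line: int   # Located: 1-indexed, inclusive
    end_line: int
    snippet: str      # Sufficient: judged by the agent, not by this class

    def citation(self):
        """File and line range in a form the final answer can quote."""
        return f"{self.path}:{self.start_line}-{self.end_line}"

    def estimated_tokens(self):
        # Crude heuristic (assumption): roughly 4 characters per token.
        return len(self.snippet) // 4

    def is_small(self, max_tokens=400):
        """Small: the snippet fits an assumed per-result token budget."""
        return self.estimated_tokens() <= max_tokens


# Hypothetical result; the path and snippet are made up for illustration.
packet = EvidencePacket(
    path="scripts/example_gate.py",
    start_line=40,
    end_line=42,
    snippet="residue = count_residue(page)\nif residue:\n    errors.append(residue)",
)
```

Keeping the path and line range on the packet is what makes it escalatable: the agent can open the full file later without searching again.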

Raw rg gives location and speed. Full-file reads give context but too much of it. Vector search gives intent but can miss exact names. A good agent search workflow combines them:

  1. Use exact search when the task names a symbol, error, config key, file, or literal string.
  2. Use snippet-ranked semantic or hybrid search when the task names behavior.
  3. Open the full file only after a snippet proves relevance.
  4. Cite the file and line range in the final answer.
  5. Retry with exact search when the snippet suggests a concrete identifier.
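The routing in steps 1, 2, and 5 can be sketched as a small heuristic. Everything here is an assumption for illustration: choose_search and identifiers_in are hypothetical helpers, and a real agent would route with more judgment than a regular expression.

```python
import re

# A single token made of identifier, path, or filename characters.
SINGLE_TOKEN = re.compile(r"^[\w.~/\-]+$")
# snake_case names such as visible_label_residue.
SNAKE_CASE = re.compile(r"\b[a-z][a-z0-9]*(?:_[a-z0-9]+)+\b")


def choose_search(query):
    """Return "exact" for literal-looking queries, "snippet" for behavior."""
    q = query.strip()
    if len(q) >= 2 and q.startswith('"') and q.endswith('"'):
        return "exact"      # quoted error message or literal string
    if SINGLE_TOKEN.match(q):
        return "exact"      # symbol, config key, file name, or path
    return "snippet"        # multi-word description of behavior


def identifiers_in(snippet):
    """Step 5: collect concrete identifiers worth an exact follow-up search."""
    return sorted(set(SNAKE_CASE.findall(snippet)))


route = choose_search("where does the blog i18n gate check visible label residue")
follow_ups = identifiers_in("finding.visible_label_residue = count")
```

The follow-up list closes the loop: a behavior-shaped query finds the snippet, and the identifiers inside it become exact queries that confirm every other place the name appears.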

Semble encodes much of that workflow as a tool. The agent still needs judgment, and the evidence gate still needs a trace it can inspect.

How Semble Changes Codex And Claude Code Workflows

The practical question is not whether to install every new search tool. The question is where code search belongs in the agent’s operating contract.

For top-level sessions, MCP can work well because the agent sees the tool schema and calls the server directly. Semble’s README includes MCP setup examples for Claude Code, Codex, OpenCode, Cursor, and other MCP-compatible clients.[1]

For delegated work, shell access may matter more. Semble’s README explicitly calls out the Bash integration for Claude Code and Codex CLI subagents.[1] A subagent that cannot reach the top-level MCP tool can still run a shell command if the workflow teaches it when and how.

That means the best integration may look boring:

    ## Code Search

    Use `semble search` when looking for behavior or related implementation.
    Use `rg` when looking for an exact string, symbol, file name, or config key.
    Open full files only after the search result proves relevance.
    Report file path and line range when citing evidence.

That kind of instruction beats a vague “use semantic search” rule because it names the routing decision. The agent learns which tool fits which question.
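For a harness that shells out on the agent's behalf, the same routing rule can be expressed as a command builder. This is a sketch under stated assumptions: build_search_command is a hypothetical helper, the semble arguments mirror the commands used in the local test in this article, and the rg flags are standard ripgrep options.

```python
def build_search_command(query, mode, root=".", top_k=5, include_text_files=False):
    """Return an argv list for the chosen search path without running it."""
    if mode == "exact":
        # Literal match with line numbers; "--" guards queries starting with "-".
        return ["rg", "--line-number", "--fixed-strings", "--", query, root]
    if mode == "snippet":
        cmd = ["semble", "search", query, root, "--top-k", str(top_k)]
        if include_text_files:
            cmd.append("--include-text-files")
        return cmd
    raise ValueError(f"unknown search mode: {mode!r}")


exact_cmd = build_search_command("visible_label_residue", "exact")
snippet_cmd = build_search_command(
    "where does the blog i18n gate check visible label residue",
    "snippet",
    include_text_files=True,
)
# To execute, pass the list to subprocess.run(cmd, capture_output=True, text=True).
```

Building an argv list instead of a shell string keeps quoting out of the picture, which matters when the query is free-form natural language.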

What I Would Not Do

I would not replace rg.

The local test made that clear. rg answered literal queries in about a tenth of a second. Semble returned better packets for behavior-shaped queries, but my cold shell invocation had real startup and indexing cost.[5]

I would not treat Semble’s 98% token claim as universal. The benchmark compares against grep plus full-file reads. The claim is fair when that baseline resembles the agent’s behavior. The claim overstates the gain when a disciplined workflow already reads narrow line ranges.

I would not hide the routing choice inside a black box. Agents need to know when they are doing exact lookup, semantic discovery, related-code exploration, or evidence confirmation. Tool use without routing rules becomes another source of plausible failure, the same interface problem behind chat-driven agent work.

Why Semble Belongs Beside The Grep Paper

The recent “Is Grep All You Need?” paper tested grep and vector retrieval across Chronos, Claude Code, Codex CLI, and Gemini CLI on long-memory conversational QA. Inline grep beat inline vector in that setting, but the paper’s deeper lesson matters more: runtime changed the result as much as retrieval method did.[3]

Semble points at the same operational layer from the code side. Search quality does not live in the retriever alone. It lives in:

  • how the query gets formed;
  • whether exact and semantic paths both exist;
  • how much context the tool returns;
  • whether snippets carry file and line evidence;
  • whether the agent opens full files only when needed;
  • whether delegated agents can reach the tool;
  • whether the final answer cites what the search actually found.

The wrapper remains the product. Search becomes useful only when the runtime turns retrieval into evidence, which is why the agentic design control surface matters as much as the retrieval algorithm.

The Standard I Want

An agent search tool should report more than matches.

It should report:

  • the query it ran;
  • the retrieval path it used;
  • the file and line range;
  • the snippet;
  • an estimate of returned tokens;
  • whether the result came from exact, semantic, or hybrid retrieval;
  • when the agent escalated from snippet to full-file read.
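One way to picture that report is a single structured record per search, written as a JSON line that a reviewer can read later. The sketch below is an assumption about shape only: SearchTrace and its field names are hypothetical, and no current tool is claimed to emit this format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SearchTrace:
    """Hypothetical audit record for one search call."""
    query: str
    tool: str                 # retrieval path, e.g. "rg" or "semble search"
    retrieval: str            # "exact", "semantic", or "hybrid"
    path: str
    start_line: int
    end_line: int
    snippet: str
    returned_tokens: int      # estimate, however the runtime chooses to count
    escalated_to_full_file: bool = False


def to_json_line(trace):
    """Serialize one trace as a single line suitable for an append-only log."""
    return json.dumps(asdict(trace), sort_keys=True)


# Illustrative values only; the path, lines, and snippet are made up.
trace = SearchTrace(
    query="where does the blog i18n gate check visible label residue",
    tool="semble search",
    retrieval="hybrid",
    path="scripts/example_gate.py",
    start_line=40,
    end_line=62,
    snippet="residue = count_residue(page)",
    returned_tokens=7,
)
line = to_json_line(trace)
```

One line per search keeps the log easy to filter afterward, for example to list every search where the agent escalated to a full-file read.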

That output would make code search auditable. A reviewer could see whether the agent found the right code, read enough context, and avoided drowning itself in irrelevant files. The same principle drives agent execution traces: the proof lives in the path, not only the answer.

Semble already moves in that direction by treating snippet size and token savings as first-class product concerns. The next step for agent runtimes is to make that evidence path visible in review packets and final answers.

The goal is not prettier search. The goal is fewer unsupported claims per token.

FAQ

Does Semble replace grep?

No. Use rg for exact strings, symbols, config keys, filenames, and fast confirmation. Use Semble-style snippet retrieval when the task describes behavior or related implementation and the agent does not know the exact identifier.

Did your local test confirm Semble’s speed claims?

No. My local uvx invocation took about 54 seconds for the first query and 31 seconds for the second, mostly because package/model startup and indexing dominated the ad hoc run. Semble’s benchmark table reports much faster controlled measurements, but my local run should be treated as workflow evidence, not a performance benchmark.[2][5]

Did your local test confirm the token-savings direction?

Yes, at the workflow level. Semble returned much smaller snippets than the broad literal rg output, and its savings command reported about 38,100 estimated tokens saved after two searches. The savings number comes from Semble’s own accounting method, so treat it as tool telemetry rather than independent proof.[5]

Why does agent code search matter for Codex and Claude Code?

Agents lose quality when search dumps too much context or hides too much evidence. A good workflow teaches the agent when to use exact search, when to use snippet-ranked retrieval, when to open full files, and how to cite the result.

Should teams add Semble to AGENTS.md?

Only after testing it on their codebase. Start with one instruction: use snippet-ranked search for behavior-shaped questions and rg for exact strings. Measure whether agents find the right files faster and read fewer irrelevant lines.


References


  1. MinishLab, “Semble README,” GitHub repository documentation. Source for Semble’s purpose, integration paths, MCP and AGENTS.md setup, subagent Bash note, search/savings commands, retrieval architecture, code-aware ranking signals, and headline feature claims. Current-session verification on May 17, 2026 found PyPI version 0.1.7, latest GitHub release v0.1.7, MIT license, and repository description “Fast and Accurate Code Search for Agents. Uses ~98% fewer tokens than grep+read.” 

  2. MinishLab, “Semble benchmarks,” GitHub benchmark documentation. Source for the 63-repository, 19-language, roughly 1,250-query methodology; NDCG@10 and latency table; CPU-only benchmark note; token-efficiency methodology; and reported 45,692 average tokens for ripgrep plus full-file reads versus 566 for Semble. 

  3. Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, “Is Grep All You Need? How Agent Harnesses Reshape Agentic Search,” arXiv:2605.15184v1, submitted May 14, 2026. Source for the long-memory QA search comparison across Chronos, Claude Code, Codex CLI, and Gemini CLI, and for the conclusion that retrieval behavior depends on runtime and delivery path. 

  4. Author’s prior production retrieval write-up, “Hybrid Retriever for Obsidian,” blakecrosley.com. Source for the local BM25 plus vector retrieval pattern, RRF fusion framing, and exact-versus-semantic failure modes in a personal knowledge base. 

  5. Author’s current-session local verification on May 17, 2026. Commands included uvx --from semble semble --help, uvx --from semble semble search "blog translation quality gate release verifier D1" . --top-k 5 --include-text-files, uvx --from semble semble search "where does the blog i18n gate check visible label residue" . --top-k 5 --include-text-files, comparable rg searches, and uvx --from semble semble savings --verbose. Observed results: Semble exposed search, find-related, init, and savings; the first query returned targeted release-loop snippets; the second query returned the visible_label_residue gate block; the comparable rg searches completed faster but returned broader literal-match streams; Semble reported two search calls and about 38,100 estimated tokens saved at 94%. 
