
The CLI Thesis

From the guide: Claude Code Comprehensive Guide

Three Hacker News threads in one week. 1,638 combined points. One conclusion: the IDE is for humans. The CLI is for agents.1 2 3

Boris Tane’s planning/execution split (716 points) runs entirely in the terminal.1 The “Making MCP Cheaper via CLI” analysis (304 points) measured a 94% token reduction by replacing MCP tool calls with claude --print.3 Vercel’s just-bash project (87 points) demonstrated that removing 80% of an agent’s tools made everything faster, cheaper, and more reliable.4 A separate benchmark found CLI approaches using 35x fewer tokens than MCP for equivalent tasks.10

The pattern keeps showing up because the architecture forces it. Agents need composability, scriptability, and minimal overhead. CLIs provide all three. IDEs provide none.

TL;DR

CLI-first agent architecture costs 94% less in token overhead than MCP-based approaches, runs 3.5x faster, and composes with standard Unix tools. The planning/execution split works because CLI artifacts are portable markdown files, not IDE state. Remote Control makes CLI agents remotely supervisable without sacrificing the composability advantage. Every serious agent system converges on the terminal because the terminal is where automation already lives.


The Planning/Execution Split

Boris Tane documented a three-phase workflow that 454 HN commenters validated with their own variations: research, plan, implement.1 The critical observation: never let the agent write code until you have reviewed and approved a written plan.

The workflow produces markdown artifacts at each stage. research.md captures codebase analysis. plan.md captures the implementation strategy with code snippets. The developer reviews in their editor, adds inline annotations, and sends the annotated plan back with an explicit instruction: “don’t implement yet.” The annotation cycle repeats one to six times before a single line of code gets written.

The entire loop runs through claude in the terminal. No IDE plugin. No visual diff tool. Markdown files in, markdown files out. The artifacts survive context window compaction because they exist as files, not as conversation state.

My autonomous loop implements the same split at machine scale. A PRD file defines stories with acceptance criteria. A fresh agent spawns for each story, receiving a briefing of current git state and what previous agents accomplished. The agent implements, an independent verifier runs the tests (never trusting the agent’s self-report), and three code reviewers evaluate the diff in parallel. The entire orchestration runs through claude --print calls in bash scripts. No framework, no runtime, no server.

The planning/execution split works in the terminal because the terminal makes the split structural. Planning produces files. Execution consumes files. The boundary between the two phases is a file on disk, visible and auditable, not a state transition buried inside an IDE plugin.


Why CLI Beats IDE for Agents

The argument rests on three pillars: cost, composability, and context efficiency.

Cost: 94% Token Reduction

Kan Yilmaz measured the token overhead of MCP versus CLI across four scenarios.3 The numbers tell the full story:

Scenario                     | MCP Tokens | CLI Tokens | Savings
Session start (0 tools used) | ~15,540    | ~300       | 98%
Single tool usage            | ~15,570    | ~910       | 94%
10 tools used                | ~15,840    | ~964       | 94%
100 tools used               | ~18,540    | ~1,504     | 92%

MCP injects tool schemas into every conversation. With 84 tools, the schema overhead alone consumes 15,540 tokens before the agent does any work. CLI calls carry no schema overhead because models already understand standard command-line interfaces.10 One user documented MCP_DOCKER consuming 125,964 tokens across 135 tools.14

Jannik Reinhard ran a parallel benchmark on an Intune compliance task: 145,000 tokens via MCP versus 4,150 tokens via CLI for equivalent output.10 The CLI agent had 95% of its context window available for reasoning. The MCP agent spent the majority of its budget on tool definitions.

Composability: Unix Pipes Already Work

Vercel’s v0 team removed 80% of their agent’s tools and replaced them with a single bash environment.9 The results:

Metric         | Before (15+ tools) | After (bash only) | Change
Execution time | 274.8s             | 77.4s             | 3.5x faster
Success rate   | 80%                | 100%              | +20%
Token usage    | ~102k              | ~61k              | -37%
Steps required | ~12                | ~7                | -42%

Andrew Qu explained the reasoning: “We were solving problems the model could handle on its own.” File systems, grep, pipes, and redirects already compose. Models understand them. Building custom MCP tools for operations that cat | grep | jq handles natively adds overhead without adding capability.13
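The composition Qu describes needs no schema injection at all. A minimal sketch, with inline JSON lines standing in for a real log stream (the log format is illustrative):

```shell
# Standard tools instead of a bespoke search tool: select error messages
# from a JSON log stream with jq alone. The log lines are inline stand-ins.
printf '%s\n' \
  '{"level":"error","msg":"db timeout"}' \
  '{"level":"info","msg":"ok"}' \
  '{"level":"error","msg":"oom"}' \
  | jq -r 'select(.level == "error") | .msg'
```

This prints the two error messages and nothing else. The model already knows this interface from training data; a custom tool doing the same filtering would cost a schema in every conversation.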

Anthropic’s own documentation confirms the pattern. Claude Code follows the Unix philosophy: pipe logs into it, run it in CI, chain it with other tools.5 The headless mode (claude -p) supports structured JSON output, schema enforcement, and session resumption via captured session IDs. Anthropic positions headless mode as the primary integration path for CI/CD and scripted workflows.5

Simon Willison frames the implication: writing code is cheap now.6 The corollary nobody wants to hear is that verification is now the expensive part. CLI agents compose with existing verification infrastructure. Test runners, linters, type checkers, security scanners, deployment pipelines: all command-line tools. An IDE agent needs a plugin for each one. A CLI agent pipes to them.

Context Efficiency: Reasoning Over Signal, Not Noise

The context window is finite. Every token spent on tool schemas, conversation history, and MCP overhead is a token unavailable for reasoning. CLI architecture keeps the context budget lean by design.

A fresh claude --print call receives a focused prompt (~2K tokens) instead of inheriting a full conversation context (~100K+ tokens). Each operation starts clean. No accumulated state. No stale tool definitions. No conversation drift.

My infrastructure runs 84 hooks across 17 lifecycle events, all orchestrated through CLI calls. Each agent spawn gets a briefing: current git state, a summary of what previous agents accomplished, and the acceptance criteria for its one task. Briefing instead of memory. Models execute a clear briefing better than they navigate 30 steps of accumulated context.

An academic analysis of Unix philosophy in agentic systems formalized the principle: collapse diverse interfaces into a uniform abstraction, accepting some loss of specialization in exchange for composability and tractability.11 File-like abstractions and code-based specifications reduce the cognitive and engineering burden. CLI agents inherit 50 years of that design work.


Remote Control Changes the Equation

The obvious objection to CLI-first agents: you lose the visual feedback of an IDE. Anthropic shipped an answer on February 25, 2026. Remote Control connects to a local Claude Code session from any browser or the Claude mobile app.2 The feature earned 531 points and 313 comments on Hacker News.

Remote Control does not move anything to the cloud. The agent keeps running locally. The terminal session registers with Anthropic’s API over TLS and polls for work. All traffic goes through outbound HTTPS. No inbound ports open.2

The feature resolves the supervision gap. Before Remote Control, CLI agents had two modes: supervised (sit at the terminal) or unsupervised (walk away and hope). Remote Control creates a third: asynchronous governance. Approval prompts route to your phone. You approve, reject, or redirect from anywhere.

My hook system classifies operations by blast radius. Local operations (file writes, test runs) auto-approve. Shared operations (git commits) warn. External operations (pushes, deployments) defer to human review. Remote Control turns that “defer” path from a blocking wait into an async notification. The agent keeps working on the next story while I review the previous one from my phone.
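The classification itself is a few lines of shell. A minimal sketch, with illustrative command patterns (a real hook would match on the structured tool input, not a command string):

```shell
# Classify an operation by blast radius: local auto-approves, shared warns,
# external defers to human review. Patterns here are illustrative.
classify() {
  case "$1" in
    git\ push*|*deploy*) echo "defer"   ;;  # external: route to human review
    git\ commit*)        echo "warn"    ;;  # shared: proceed, but flag it
    *)                   echo "approve" ;;  # local: auto-approve
  esac
}

classify "git push origin main"   # defer
classify "git commit -m 'wip'"    # warn
classify "pytest -q"              # approve
```

With Remote Control in the loop, the "defer" branch becomes a push notification instead of a blocked terminal.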

The IDE becomes a display layer, not the execution environment. You monitor progress through Remote Control. You intervene when quality gates flag an issue. The CLI agent does the work. The IDE shows you the result.


The Bash Agent Pattern

Vercel’s just-bash is a simulated bash environment with an in-memory virtual filesystem, built specifically for AI agents.4 The design encodes three convictions about agent architecture:

Isolation over sandboxing. Each call runs in isolation. Environment variables, functions, and working directory reset between calls. The filesystem persists. Lazy file loading means files load on first read and get cached; a file the agent writes but never reads is never loaded. The agent cannot contaminate its own environment across operations.
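Plain subshells reproduce the same model. A sketch showing environment state dying between calls while the filesystem persists (variable and file names are illustrative):

```shell
# Each "call" runs in its own subshell: variables reset, files persist.
workdir=$(mktemp -d)

# Call 1: sets a variable and writes a file.
( cd "$workdir"; AGENT_SCRATCH=secret; echo "persisted" > note.txt )

# Call 2: the file is still there, the variable is gone.
result=$(cd "$workdir"; cat note.txt; echo "AGENT_SCRATCH=${AGENT_SCRATCH:-unset}")
printf '%s\n' "$result"
```

The sketch prints persisted followed by AGENT_SCRATCH=unset: state from the first call is gone, the file it wrote is not.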

Existing tools over custom tools. Grep is 50 years old and handles exactly what custom search MCP tools replicate. jq parses JSON. curl fetches URLs. The model knows these interfaces from training data. Custom tools require schema injection and documentation. Standard tools require neither.

Minimal architecture over framework architecture. Andrew Qu captured the principle: “Models are getting smarter and context windows are getting larger, so maybe the best agent architecture is almost no architecture at all.”13

My harness validates the pattern at production scale. Approximately 15,000 lines of bash orchestrating Claude Code. 84 hooks across 17 event types. Dispatchers, quality gates, semantic search integration, autonomous loops. No Python runtime, no framework dependency. The rough edges of bash (no native JSON, no async, no proper data structures) are real but solvable. jq handles JSON. Sequential processing is actually a feature: gates should run in order, not race.

The pattern works because agent orchestration is fundamentally about reading stdin, making a decision, and writing stdout. That description matches bash’s design purpose exactly. Anything more complex indicates the task decomposition was wrong, not the tooling.


Cost as Architecture Decision

Cost decisions compound. Choosing CLI over MCP for stateless operations saves 94% per invocation.3 At 100 operations per day, the savings reach $228/month on tool definition overhead alone.3 Those savings free budget for more operations, which generates more savings. The architecture pays for itself.
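The figure checks out from numbers already in the post: roughly 15,240 tokens saved per call (the ~15,540-token MCP session overhead minus ~300 for CLI), 100 calls a day, priced at the Opus input rate quoted later. A quick awk check of the arithmetic:

```shell
# Reproduce the ~$228/month figure from the post's own numbers.
awk 'BEGIN {
  saved_per_call  = 15540 - 300   # tokens: MCP session overhead minus CLI
  calls_per_month = 100 * 30      # 100 operations per day
  price_per_mtok  = 5.00          # USD per million input tokens (Opus 4.6)
  printf "$%.2f/month\n", saved_per_call * calls_per_month / 1e6 * price_per_mtok
}'
# prints $228.60/month
```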

Three cost layers compound independently:

Token layer. System prompt compression. I run ~3,500 tokens of system prompt across a CLAUDE.md file and 8 rules files. Constraints outperform explanations. “Reject tool calls matching sensitive paths” does the same work as a 15-line explanation of why credentials should remain protected. Anthropic’s best practices documentation emphasizes the same point: context window performance degrades as it fills.7 Every wasted token costs twice: once in direct API charges, again in degraded reasoning quality.
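The constraint itself fits in a few lines. A hypothetical PreToolUse-style check, written as a function for illustration; it assumes the hook receives the tool call as JSON on stdin and that a non-zero exit blocks the call, with field names and path patterns as assumptions:

```shell
# "Reject tool calls matching sensitive paths" as code. Field names and
# patterns are illustrative; exit code 2 is assumed to block the call.
check_path() {
  local path
  path=$(printf '%s' "$1" | jq -r '.tool_input.file_path // empty')
  case "$path" in
    *.env|*credentials*|*.pem)
      echo "Blocked: sensitive path: $path" >&2
      return 2 ;;
  esac
  return 0
}
# In a real hook script: check_path "$(cat)"; exit $?
```

Twelve lines of constraint do the work of a paragraph of explanation, and cost a fraction of the tokens.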

Agent layer. Fresh spawns over long conversations. Each story in an autonomous run gets a new agent with a clean context window. Geoffrey Huntley documented a similar pattern called “The Ralph Loop,” running autonomous development at $10.42/hour on Sonnet.12 The context never balloons because each agent starts fresh. Cache hits on the system prompt cost 90% less ($0.50 vs $5.00 per million tokens on Opus 4.6), so the repeated system prompt across fresh spawns carries minimal overhead.8

Architecture layer. CLI for stateless operations, MCP for stateful ones. A claude --print call for a one-shot evaluation adds no connection overhead. MCP makes sense when the tool needs persistent state or streaming. Most agent operations are one-shot evaluations, classifications, or code generation tasks. CLI handles all of them at lower cost with simpler debugging.

A concrete example from my autonomous loop last week: five PRD stories processed overnight.15 Each story spawned a fresh agent (~2K tokens briefing), ran implementation (~15K tokens average), then spawned three review agents (~2K tokens each). Total per story: ~23K tokens. The same workflow in a long-running MCP conversation would carry ~100K+ tokens of accumulated context per story by story three. Five stories via CLI: ~115K tokens total. Five stories via MCP conversation: ~500K+ tokens total. The cost ratio compounds with every additional story.


When MCP Still Wins

The CLI thesis does not argue against MCP universally. MCP wins in specific scenarios where CLI falls short.

Stateful tool servers. A database browser that maintains a connection pool across calls benefits from MCP’s persistent server model. Reconnecting on every CLI call adds latency and authentication overhead. If the tool needs state between invocations, MCP is the correct choice.

Structured validation. MCP tool schemas enforce input/output contracts. A CLI call accepts arbitrary text. When the agent must provide structured input matching a precise schema (API key format, date ranges, enumerated options), the MCP schema catches malformed input before the tool processes it. CLI validation requires the tool itself or a wrapper script to enforce constraints.
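A wrapper of that kind is small. A sketch validating a date argument before the underlying tool ever runs (the pattern and messages are illustrative):

```shell
# CLI-side input validation: enforce a YYYY-MM-DD contract in a wrapper
# rather than in a tool schema. The pattern is a coarse illustration.
validate_date() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) echo "invalid date: $1 (expected YYYY-MM-DD)" >&2; return 1 ;;
  esac
}

validate_date "2026-02-25" && echo "ok"
```

The trade-off stands: the MCP schema catches this for free at the protocol level, while the CLI wrapper is code you own and maintain.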

Multi-tenant access control. MCP servers can enforce per-user permissions centrally. CLI tools inherit the operating system user’s permissions. In team environments where different agents need different access levels, MCP provides finer-grained authorization.

Streaming responses. Long-running operations that produce incremental output (log tailing, build progress, database exports) work better through MCP’s streaming protocol than through a CLI call that blocks until completion.

The decision rule: if the operation is stateless and one-shot, use CLI. If the operation requires persistent state, structured contracts, or streaming, use MCP. In my harness, approximately 90% of operations are stateless. The 10% that need MCP genuinely benefit from it. Optimizing the 90% yields the largest return.


What You Can Build Today

Three patterns, each buildable in an afternoon, each compounding with the others.

Pattern 1: Planning/Execution Split

# Plan phase: research and plan, no implementation
claude -p "Research the codebase and write research.md" \
  --allowedTools "Read,Glob,Grep,Write"

# Review: read annotations in research.md, write plan.md
claude -p "Read my annotations in research.md and write plan.md" \
  --allowedTools "Read,Write"

# Implement: follow the approved plan
claude -p "Implement the plan in plan.md" \
  --allowedTools "Read,Write,Edit,Bash"

Each phase gets scoped tool permissions. The planning agent cannot edit code. The implementation agent cannot browse the web. File boundaries enforce the split. The --allowedTools flag does the enforcement at the CLI level. No configuration file. No plugin settings. One flag per invocation, scoped to exactly the permissions that phase requires.

The annotation cycle is the key differentiator from “just prompting better.” You review the plan in your editor. You cross out the parts you disagree with. You add notes in the margins. The agent reads your annotated file and revises. The plan gets better with each pass because two different intelligences (human domain knowledge, model code generation) converge on the same document.

Pattern 2: Fresh Spawn Per Task

for story in $(jq -r '.stories[].id' prd.json); do
  # Each story gets fresh context with a focused briefing
  criteria=$(jq -r --arg id "$story" '.stories[] | select(.id == $id)' prd.json)
  state=$(git diff --stat HEAD~1)
  briefing="Git state: $state --- Story: $criteria"

  claude -p "Implement: $briefing" \
    --output-format json \
    --allowedTools "Read,Write,Edit,Bash,Glob,Grep" \
    | jq -r '.result'

  # Independent verification: never trust self-report
  python -m pytest -v
done

No accumulated context. No conversation drift. Each agent gets a clean window with a focused briefing. The --output-format json flag captures structured output including session IDs, enabling deterministic conversation resumption if a story needs follow-up work.

The independent verification step matters more than the implementation step. Agents exhibit what I call Phantom Verification: claiming tests pass without running them. Running pytest outside the agent’s context window eliminates the failure mode entirely. The agent cannot misrepresent results it never produced.

Pattern 3: Parallel Review Pipeline

diff=$(git diff HEAD~1)

# Three reviewers with independent context
claude -p "Review for bugs: $diff" --output-format json > /tmp/correctness.json &
claude -p "Review for vulnerabilities: $diff" --output-format json > /tmp/security.json &
claude -p "Review for style issues: $diff" --output-format json > /tmp/conventions.json &
wait

# Merge findings from all three
jq -s 'map(.result)' /tmp/correctness.json /tmp/security.json /tmp/conventions.json

Three agents. Three perspectives. Zero shared state. Disagreements between reviewers surface exactly the issues a single reviewer misses. The & operator and wait builtin handle parallelism. No async runtime. No thread pool. No orchestration framework. Bash job control does the work.

The power of the pattern: each reviewer gets the full context window for its one concern. A single reviewer splitting attention across correctness, security, and style performs worse than three specialists with dedicated context. CLI makes the split trivial because each call is an independent process with independent memory.


Key Takeaways

For developers building agent systems:

- Start with claude -p for all agent spawning. Add MCP only when persistent state is required.
- Scope tool permissions per phase. Planning agents read. Implementation agents write. Review agents read diffs.
- Budget 94% less for token overhead by choosing CLI over MCP for stateless operations.3

For teams scaling autonomous workflows:

- Fresh agent spawns prevent context drift and cap token costs per operation.
- Remote Control turns “unsupervised” into “asynchronously supervised” without changing the CLI architecture.2
- The Vercel data proves the counterintuitive result: fewer tools means higher success rates, not lower.4

For architects choosing agent infrastructure:

- CLI agents compose with existing CI/CD, testing, and deployment tooling. No integration work needed.
- The Unix philosophy (compose small tools via pipes) predates and outperforms every agent-specific framework.11
- The 10% productivity wall breaks when you stop treating the agent as a chat tool and start treating it as infrastructure.


Part of the AI Engineering series. See also: Claude Code as Infrastructure, Anatomy of a Claw, The Autonomous Loop, and The 10% Wall.


  1. Boris Tane, “How I Use Claude Code: Separation of Planning and Execution.” Blog post. HN discussion (716 points, 454 comments). 

  2. Claude Code Remote Control. Anthropic documentation. HN discussion (531 points, 313 comments). 

  3. Kan Yilmaz, “Making MCP Cheaper via CLI.” Blog post. HN discussion (304 points, 115 comments). 

  4. Vercel, just-bash: Bash for Agents. GitHub repository. HN discussion (87 points, 48 comments). 

  5. Claude Code Headless Mode. Anthropic documentation.

  6. Simon Willison, “Writing Code is Cheap Now.” Agentic Engineering Patterns.

  7. Claude Code Best Practices. Anthropic documentation.

  8. Anthropic Model Pricing. Pricing page. Opus 4.6: $5/MTok input, $0.50/MTok cache hit.

  9. Andrew Qu, “We Removed 80% of Our Agent’s Tools.” Vercel blog.

  10. Jannik Reinhard, “Why CLI Tools Are Beating MCP for AI Agents.” Blog post. 35x token reduction, 33% TES advantage. 

  11. Deepak Babu Piskala, “From ‘Everything is a File’ to ‘Files Are All You Need’: How Unix Philosophy Informs the Design of Agentic AI Systems.” arXiv:2601.11672, January 2026. 

  12. Geoffrey Huntley, “The Ralph Loop.” ghuntley.com/loop. Autonomous development at $10.42/hour on Sonnet. 

  13. “The Key to Agentic Success? BASH Is All You Need.” The New Stack, February 2026. 

  14. MCP token overhead analysis. Context pollution guide. One user hit 144,802 tokens from MCP tools alone. 

  15. Author’s analysis based on autonomous loop sessions processing multi-story PRDs via Claude Code CLI. 
