Claude Code as Infrastructure
Andrej Karpathy coined a term for what grows around an LLM agent: claws, the hooks, scripts, and orchestration that let the agent grip the world outside its context window.1 Most people treat Claude Code like a chat box with file access. They type a prompt, watch it edit a file, and move on. That framing misses what the tool actually is.
Claude Code is not an IDE feature. It is infrastructure. And the gap between treating it as one versus the other determines whether AI-assisted development stays at 10% productivity gains or breaks through to something fundamentally different.
TL;DR
Claude Code exposes 17 lifecycle events, each hookable with shell scripts that fire before, during, or after every tool call.2 Stack hooks into dispatchers, dispatchers into skills, skills into agents, agents into workflows, and you get a programmable layer between you and the model that enforces constraints the model cannot skip. I built 84 hooks, 48 skills, 19 agents, and ~15,000 lines of orchestration over two months. Zero frameworks. Zero external dependencies. All bash and JSON. The result is an autonomous development system that writes, reviews, and ships code while I sleep. This post explains the architecture, why the IDE framing holds people back, and what changes now that Remote Control makes this infrastructure accessible from anywhere.
The IDE Framing Is Wrong
The default mental model: Claude Code is a smarter autocomplete. You sit at a terminal, give it tasks, and supervise the output. That model caps your productivity at whatever you can personally oversee.
The infrastructure mental model: Claude Code is a programmable runtime with an LLM kernel. Every action the model takes passes through hooks you control. You define policies, not prompts. The model operates within your infrastructure the same way a web server operates within nginx rules. You do not sit at nginx and type requests. You configure it, deploy it, and monitor it.
The distinction matters because infrastructure compounds. A hook that blocks credentials in bash commands protects every session, every agent, every autonomous run. A skill that encodes your blog evaluation rubric applies consistently whether you invoke it or an agent does. An agent that reviews code for security runs the same checks whether you are watching or not.
Simon Willison frames the current moment around a single observation: writing code is cheap now.3 Correct. But the corollary nobody wants to hear is that verification is now the expensive part. Cheap code without verification infrastructure produces bugs at scale. The investment that pays off is not a better prompt. It is the system around the model that catches what the model misses.
The Infrastructure Layer
Claude Code’s hook system fires shell commands at 17 lifecycle events.2 PreToolUse fires before a tool executes and can block it. PostToolUse fires after and can provide feedback. UserPromptSubmit fires when you type and can inject context. Stop fires when the model tries to finish and can force it to continue. Each event receives JSON on stdin with full context: session ID, tool name, tool input, current working directory.
The hook system is not a plugin system. It is an event-driven architecture. The difference: plugins extend a tool’s features. Events let you intercept, modify, and control every action the tool takes. You become the middleware.
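Registering a hook against one of those events takes a few lines of settings. A minimal sketch, assuming the documented `.claude/settings.json` shape (the script path is illustrative):

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/block-dangerous.sh"
          }
        ]
      }
    ]
  }
}
```

The `matcher` scopes the hook to a tool name; the command receives the event's JSON payload on stdin every time that tool is about to run.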
Hooks: The Deterministic Layer
Hooks are shell scripts. They cannot be hallucinated, sweet-talked, or prompt-injected around. The model wants to run rm -rf /? A 10-line bash script checks the command against a blocklist and rejects it before the shell ever sees it. The model tries to read .env? A regex on the file path intercepts the Read tool call. None of this requires the model to cooperate. The hook fires whether the model wants it to or not.
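A minimal sketch of such a guard, with the check factored into a function so the pattern is easy to test in isolation (the blocklist patterns are illustrative, not an exhaustive policy):

```bash
#!/usr/bin/env bash
# PreToolUse:Bash guard sketch. The real hook receives PreToolUse JSON on
# stdin; exit/return code 2 denies the tool call and feeds stderr back to
# the model as an explanation.
BLOCKLIST='rm -rf /( |$)|mkfs\.|dd if=.* of=/dev/|git push --force'

check_command() {
  if printf '%s' "$1" | grep -Eq "$BLOCKLIST"; then
    echo "Blocked: command matches destructive-pattern blocklist" >&2
    return 2                       # 2 = deny the tool call
  fi
  return 0                         # 0 = allow
}

# Hook entry point (requires jq to parse the tool input from stdin).
if [ "${1:-}" = "--hook" ]; then
  check_command "$(jq -r '.tool_input.command // ""')"
  exit $?
fi
```

The model never sees the guard negotiate; the command is rejected before the shell runs it.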
I run 84 hooks across 17 event types. The split tells a story: 35 enforce judgment (gates, guards, validators) and 49 handle automation (injectors, loggers, trackers). That ratio started at 1:6. Two months of things breaking in autonomous runs pushed it to 4:5. Every judgment hook exists because something failed without it. An agent committed code with TODO comments. An agent ran a destructive git command. An agent leaked a credential path into a log file. Each failure got a gate.
The biggest lesson: dispatchers over independent hooks. I had seven hooks all firing on UserPromptSubmit, each reading stdin independently, two writing to the same JSON state file. Concurrent writes truncated the JSON. Every downstream hook that parsed that file broke. One dispatcher per event running hooks sequentially from cached stdin fixed it. Invisible overhead, 200ms per prompt.
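The dispatcher pattern is small. A sketch, assuming a per-event hook directory (the layout is illustrative): read stdin once, then feed the cached payload to each hook in order.

```bash
# One dispatcher per event: cache stdin ONCE, then run every hook for that
# event sequentially with the same payload. Sequencing prevents the
# concurrent state-file writes that truncated JSON when hooks ran
# independently off the same stdin.
dispatch() {
  local hook_dir="$1" payload hook rc
  payload=$(cat)                          # cache stdin for all hooks
  for hook in "$hook_dir"/*.sh; do
    [ -x "$hook" ] || continue
    printf '%s\n' "$payload" | "$hook" && rc=0 || rc=$?
    [ "$rc" -eq 2 ] && return 2           # a blocking hook short-circuits
  done
  return 0
}
```

Hooks run in filename order, so a `01-credentials.sh` gate can be guaranteed to fire before a `50-logger.sh` automation hook.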
Skills: The Knowledge Layer
Skills are markdown instruction sets that activate on demand or via hooks.4 Each one encodes domain expertise the model draws on when invoked. My blog-evaluator skill defines a 6-category weighted rubric with specific scoring criteria, category minimums, and interdependencies. My jiro skill encodes a 7-step quality loop with an evidence gate that requires specific proof for each criterion.
Skills compose with hooks. A skill can define its own hooks in frontmatter that activate only while the skill runs. Philosophy skills auto-activate via SessionStart hooks, injecting quality constraints into every session without explicit invocation.
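The shape of a skill file, as a sketch: `name` and `description` follow the documented frontmatter, while the `hooks` block here is illustrative of the skill-scoped hooks described above, not a field copied from the official schema.

```markdown
---
name: blog-evaluator
description: Scores a blog draft against a 6-category weighted rubric.
# Skill-scoped hooks (field shape illustrative): active only while the
# skill runs.
hooks:
  PostToolUse:
    - matcher: Write
      command: .claude/hooks/lint-draft.sh
---

# Blog Evaluator

Score the draft on each category below. A category under its minimum
fails the draft regardless of the weighted total.
```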
48 skills covering: code quality (jiro, testing-philosophy, debugging-philosophy), content (blog-writer-core, blog-evaluator, citation-verifier), architecture (fastapi, swiftui, database, htmx-alpine), operations (deploy, cache, analytics, security), and meta-orchestration (deliberation, scan-intel, ralph). Research into Claude Code’s own preferences found it gravitates toward certain frameworks and patterns.9 Skills let you override those defaults with your own.
Agents: The Delegation Layer
Agents are specialized subagents with isolated context windows.5 Each one gets a focused task and fresh context. My code review system spawns three agents in parallel: correctness, security, and conventions. Each reviews independently. Disagreements between reviewers surface exactly the issues a single reviewer would miss.
The critical constraint: a recursion guard. A shell script fires before every Task tool call, checks a spawn depth counter in a shared state file, and blocks the call if depth exceeds a threshold. Without it, agents delegate to agents that delegate to agents, each one losing context and burning tokens. Default limit is 3 levels. In practice, useful work happens at depth 1 (main agent plus one subagent). Anything deeper than 2 usually means the task decomposition was wrong.
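The guard itself is a few lines. A sketch, with the state handling simplified (no locking, illustrative paths):

```bash
# PreToolUse:Task recursion guard sketch. A shared state file tracks the
# current spawn depth; past the threshold the guard denies the Task call.
MAX_DEPTH=3

check_spawn_depth() {
  local state_file="$1" depth
  depth=$(cat "$state_file" 2>/dev/null || echo 0)
  if [ "$depth" -ge "$MAX_DEPTH" ]; then
    echo "Blocked: spawn depth $depth reached limit $MAX_DEPTH" >&2
    return 2                             # deny the Task tool call
  fi
  echo $((depth + 1)) > "$state_file"    # record the new depth before spawning
  return 0
}
```

A matching hook on subagent completion would decrement the counter; omitted here for brevity.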
19 agents spanning: development (ios-developer, backend-architect), review (code-reviewer, security-reviewer, conventions-reviewer, yagni-reviewer), exploration (project-scout, code-explorer, code-architect), and validation (test-runner, correctness-reviewer).
Remote Control Changes the Equation
On February 25, 2026, Anthropic shipped Remote Control: the ability to connect to a local Claude Code session from any browser or the Claude mobile app.6 The feature got 531 points and 313 comments on Hacker News, most of them complaints about bugs. The complaints are valid. The feature is still transformative.
Here is why. Before Remote Control, the infrastructure I described had two modes: supervised (I watch the terminal) or unsupervised (I walk away and hope). Neither is ideal. Supervised caps throughput at my attention span. Unsupervised risks the model making bad decisions nobody catches.
Remote Control creates a third mode: asynchronous governance. I run autonomous loops that process multi-story PRDs overnight. The approval prompts for external actions (git push, API calls, anything that leaves the machine) route to my phone. I approve, reject, or redirect from anywhere. The governance layer stays the same. The latency between “agent needs approval” and “human provides it” drops from “whenever I check my laptop” to “10 seconds from my phone.”
The approval flow compounds with the blast radius classification from my hooks. Local operations (file writes, test runs) auto-approve. Shared operations (git commits) warn. External operations (pushes, API calls, deployments) defer to human review. Remote Control turns that “defer” path from a blocking wait into an async notification. The agent keeps working on the next story while I review the previous one.
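The classification reduces to a lookup from tool call to tier. A sketch with illustrative category rules, not my exact policy:

```bash
# Blast-radius classifier sketch: map a tool call to local / shared /
# external. Local auto-approves, shared warns, external defers to a human.
classify_blast_radius() {
  local tool="$1" detail="$2"
  case "$tool" in
    Write|Edit|Read) echo local ;;                     # stays on this machine
    Bash)
      case "$detail" in
        "git push"*|curl*|ssh*|*deploy*) echo external ;;  # leaves the machine
        "git commit"*|"git merge"*)      echo shared ;;    # shared history
        *)                               echo local ;;
      esac ;;
    *) echo external ;;    # unknown tools default to the strictest tier
  esac
}
```

Defaulting unknown tools to `external` is the important design choice: anything the classifier has never seen waits for a human instead of slipping through.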
Tools like Agent Multiplexer already manage Claude Code sessions via tmux.10 Open-source alternatives like Emdash provide full agentic development environments.11 The people suggesting SSH plus tmux as an alternative are right that it works for terminal access. None of these give you the approval routing. That routing is what makes unattended operation safe, not just possible.
Cost as Architecture
The “Making MCP Cheaper via CLI” post (304 HN points) documented a pattern: wrapping MCP tool calls in CLI invocations to avoid the overhead of maintaining an MCP server connection.7 The broader insight is that cost is an architectural decision, not an operational afterthought.
My infrastructure handles cost at three levels:
Token level. System prompt compression. I run ~3,500 tokens of system prompt across a CLAUDE.md file and 8 rules files. The high-return cuts: removing tutorial code examples (the model knows the APIs), collapsing duplicate rules across files, and replacing explanations with constraints. “Reject tool calls matching sensitive paths” does the same work as a 15-line explanation of why credentials shouldn’t be read. Semantic density over raw compression.8
Agent level. Fresh spawns over long conversations. Each story in an autonomous run gets a new agent with a clean context window. At spawn time, the agent receives a briefing: current git state, what previous agents accomplished, what it needs to do. Briefing instead of memory. Models execute a clear briefing better than they navigate 30 steps of accumulated context. The context never balloons because each agent starts fresh. Geoffrey Huntley documented a similar pattern in “The Ralph Loop,” running autonomous development at $10.42/hour on Sonnet.13 Multi-agent orchestrators like OpenSwarm formalize the worker-reviewer pipeline with model escalation.14
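Assembling a briefing is mechanical. A sketch with illustrative section names and inputs:

```bash
# Briefing-over-memory sketch: everything a fresh agent receives at spawn
# time, so it executes a clear brief instead of navigating 30 steps of
# accumulated context.
build_briefing() {
  local story="$1" done_log="$2"
  echo "## Current git state"
  git status --short 2>/dev/null || echo "(not a git repo)"
  echo "## Completed by previous agents"
  cat "$done_log" 2>/dev/null || echo "(none yet)"
  echo "## Your task"
  echo "$story"
}
```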
Architecture level. CLI-first over MCP when the operation is stateless. A claude --print call for a one-shot evaluation costs less and adds no connection overhead. An MCP server makes sense when the tool needs persistent state or streaming. Context Mode demonstrated the inverse: compressing 315 KB of MCP output to 5.4 KB using FTS5 indexing with BM25 ranking.12 Both approaches reduce token spend, from different directions. Most of my skill invocations are one-shot. My prompt caching analysis found that the Claude Code CLI caches system prompts by default above 4,096 tokens. Zero configuration needed.
Case Study: What 84 Hooks Look Like in Practice
A concrete session trace from an autonomous run last week, processing a PRD with 5 stories:
1. `SessionStart` fires. Dispatcher injects: current date, project detection, philosophy constraints, system performance check, cost tracking initialization. Five hooks, 180ms total.
2. Agent reads the PRD and plans the first story. `UserPromptSubmit` fires on the internal prompt. Dispatcher injects: active project context, session drift baseline (a Model2Vec embedding of the first prompt for later similarity checks). 120ms.
3. Agent calls `Bash` to run tests. `PreToolUse:Bash` fires. Dispatcher runs: credentials check (no `.env` paths in the command), sandbox validation (command not on the blocklist), project detection. 90ms. The test runs. `PostToolUse:Bash` fires: activity heartbeat logged, drift check against the baseline (cosine similarity 0.63, well above the 0.30 threshold).
4. Agent calls `Write` to create a file. `PreToolUse:Write` fires: file scope check (is this path within the project directory?). `PostToolUse:Write` fires: lint check on the written file, commit tracking, activity heartbeat.
5. Agent finishes the story. `Stop` fires. The quality gate hook checks: did the agent cite evidence for each criterion? Did it use hedging language ("should", "probably")? Are there TODO comments in the diff? If any check fails, the hook returns `exit 2` and the agent continues working.
6. Independent verification: a fresh agent runs the test suite without trusting the previous agent's self-report.
7. Three code review agents spawn in parallel. Each reviews the diff independently. Findings merge. If any reviewer flags a CRITICAL issue, the story goes back into the queue.
8. Story passes. The next story loads. The cycle repeats for all 5 stories.
Total hooks fired across 5 stories: ~340. Total time in hooks: ~12 seconds. Invisible overhead that prevented three credential leaks, one destructive command, and two incomplete implementations in a single overnight run.
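A minimal sketch of the Stop gate from that trace, with illustrative patterns rather than my actual checks:

```bash
# Stop-gate sketch: scan the diff for TODO markers and the agent's summary
# for hedging language before allowing it to stop. Return code 2 mirrors
# the hook's exit 2, which sends the agent back to work.
quality_gate() {
  local diff="$1" summary="$2"
  if printf '%s' "$diff" | grep -q 'TODO'; then
    echo "Gate: TODO left in diff -- finish the implementation" >&2
    return 2
  fi
  if printf '%s' "$summary" | grep -Eiq '(should|probably|might)'; then
    echo "Gate: hedged claim -- cite evidence instead" >&2
    return 2
  fi
  return 0
}
```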
Key Takeaways
Claude Code is a runtime, not a tool. The 17 lifecycle events make it programmable. Hooks, skills, and agents are the instruction set. The model is the execution engine. You are the systems architect.
Governance scales with automation. Every hook that adds a constraint reduces the risk of unattended operation. The ratio of judgment hooks to automation hooks is your safety margin. Mine is 4:5 and climbing.
Infrastructure compounds, prompts don’t. A good prompt improves one interaction. A good hook improves every interaction. A good skill improves every agent that invokes it. A good agent improves every workflow that delegates to it. Invest in the layer that multiplies.
Remote Control makes the infrastructure portable. The approval routing turns “unsupervised” into “asynchronously supervised.” That distinction is the difference between hoping the model makes good decisions and verifying it does.
Cost is architecture, not optimization. Fresh agent spawns, CLI-first invocations, system prompt compression, and prompt caching are structural decisions that compound. Optimizing after the fact costs more than designing for it.
Zero frameworks required. 84 hooks, 48 skills, 19 agents, ~15,000 lines of orchestration. Bash scripts in a directory. JSON state files. No runtime dependencies. You can adopt one hook or the entire stack. The infrastructure grows organically from solving real problems, not from implementing someone else’s framework.
This is part of the AI Engineering series. Previously: Why My AI Agent Has a Quality Philosophy. See also: Thinking With Ten Brains and The Blind Judge.
1. Andrej Karpathy on "claws" as a new layer on top of LLM agents. HN discussion (406 points, 917 comments).
2. Claude Code Hooks Reference. Anthropic documentation. 17 lifecycle events with JSON input/output, matcher patterns, and three hook types (command, prompt, agent).
3. Simon Willison, "Writing code is cheap now." Agentic Engineering Patterns. HN discussion.
4. Claude Code Skills Reference. Anthropic documentation. Markdown instruction sets with frontmatter metadata, allowed tools, and hook definitions.
5. Claude Code Sub-agents Reference. Anthropic documentation. Specialized subagents with isolated context, worktree support, and model selection.
6. Claude Code Remote Control. Anthropic documentation. Continue local sessions from any device. HN discussion (531 points, 313 comments).
7. "Making MCP Cheaper via CLI." Blog post by thellimist. HN discussion (304 points, 115 comments).
8. "Compress Your Claude.md: Cut 60-70% of System Prompt Bloat." Blog post by jchilcher. HN discussion (24 points, 9 comments).
9. "What Claude Code Chooses." Research by amplifying.ai. Analysis of Claude Code's tool and framework preferences. HN discussion (39 points, 19 comments).
10. Agent Multiplexer (amux). GitHub. Manage Claude Code sessions via tmux. HN discussion (13 points).
11. Emdash: open-source agentic development environment. GitHub. HN discussion (201 points, 71 comments).
12. Context Mode: 315 KB of MCP output becomes 5.4 KB. GitHub. FTS5 indexing with BM25 ranking. HN discussion (77 points, 23 comments).
13. Geoffrey Huntley, "The Ralph Loop." ghuntley.com/loop. Autonomous development at $10.42/hour running Sonnet.
14. OpenSwarm: Multi-Agent Claude CLI Orchestrator. GitHub. Worker-reviewer pipelines with model escalation. HN discussion (34 points, 18 comments).