
Anatomy of a Claw: 84 Hooks as an Orchestration Layer

The first hook took four minutes to write. It blocked the model from suggesting OpenAI products in an Anthropic-only workflow. Two months later, that single hook had grown into 84, wired to 43 skills, 19 specialized agents, and 30 library modules. At some point the collection stopped being a set of scripts and became an orchestration layer.

I did not design it that way. Nobody sits down and says “I will build 15,000 lines of agent infrastructure.” You solve one problem. Then another. Then you solve the problem of the problems interacting. By the time you notice the architecture, it already exists.

Andrej Karpathy noticed too. In February 2026, he described “Claws” as a new computational layer: orchestration, scheduling, context management, and tool routing built on top of LLM agents, the same way agents are built on top of LLMs.[1] The framing crystallized something practitioners had been building without naming. This post is the anatomy of one such system: what it contains, how it grew, where it works, and where it fails.

TL;DR

Karpathy’s “Claws” layer describes orchestration systems built on top of agent CLIs. I built one organically over two months on Claude Code: 84 hooks across 15 event types, 43 skills, 19 agents, and 30+ library modules. The system maps cleanly to five Claws functions (orchestration, scheduling, context management, tool routing, quality enforcement) with one notable gap (declarative workflow definitions). Key finding: planning-execution separation emerged as a natural property of hook-based orchestration, not as a design goal. Lattner’s observation that “judgment and abstraction remain core while AI automates implementation” maps directly to the hook architecture: governance hooks exercise judgment, automation hooks execute implementation.


The Claws Taxonomy

Karpathy’s description identifies five functions that a Claws layer performs. Each function has a direct analog in the hook system I built on Claude Code over the past two months.[1]

| Claws Function | Description | Implementation |
| --- | --- | --- |
| Orchestration | Coordinate multiple agents toward a goal | Ralph autonomous loop, deliberation system |
| Scheduling | Determine when tasks execute | Cron hooks, activity-heartbeat.sh, overnight security scanning |
| Context management | Maintain relevant information across turns | Prompt dispatcher, philosophy injectors, memory capsules |
| Tool routing | Direct tool calls through appropriate handlers | 84 hooks across PreToolUse, PostToolUse, UserPromptSubmit events |
| Quality enforcement | Verify outputs meet standards | Quality gates, evidence requirements, 7 review agents |

The taxonomy is useful because it separates concerns that practitioners tend to build in tangled ways. My early hooks mixed context management with quality enforcement. The cost-tracking hook both injected budget context (context management) and blocked expensive operations (quality enforcement). Separating these into distinct hooks improved reliability because each hook could fail independently without breaking the other function.


The Full System

The numbers as of February 2026:

| Component | Count | Purpose |
| --- | --- | --- |
| Hooks | 84 | Event-driven functions across 15 hook event types |
| Skills | 43 | Reusable capability modules invoked by name |
| Agents | 19 | Specialized subagents for review, exploration, development |
| Library modules | 30+ | Shared Python and Bash utilities |
| Lines of code | ~15,000 | Across hooks, skills, agents, libraries, configs |

The hook distribution across event types reveals where orchestration complexity concentrates:

| Event Type | Hook Count | Example |
| --- | --- | --- |
| UserPromptSubmit | 9 (via dispatcher) | Context injection, cost tracking, usage analytics |
| PreToolUse:Bash | 12 | Security scanning, credential checking, sensitive command blocking |
| PostToolUse:Bash | 6 | Output scanning, deployment verification |
| PreToolUse:Write | 4 | Credential detection, path validation |
| PreToolUse:Edit | 3 | Pattern enforcement |
| PreToolUse:Task | 3 | Recursion guarding, spawn budgeting |
| PreCompact | 1 | Memory capsule, death spiral detection |
| SessionStart | 1 | Environment initialization |
| WorktreeCreate | 1 | Environment setup for isolated branches |
| WorktreeRemove | 1 | Safety checks before cleanup |
| Other event types | ~43 | Distributed across PreToolUse:Read, PostToolUse:Write, PreToolUse:WebFetch, NotebookEdit, and 8 additional event types |

UserPromptSubmit carries the most weight because it fires on every user message. The dispatcher (prompt-dispatcher.sh) runs nine hooks sequentially on every prompt: security filtering, analytics, usage tracking, system monitoring, objective injection, time-estimate blocking, context injection, memory topic injection, and context pressure monitoring.[2]

Each hook adds latency. Nine sequential hooks add a measured 200ms total per prompt. The dispatcher runs them sequentially (not parallel) because concurrent hook writes to shared JSON state files caused data corruption in early testing. Two hooks writing to jiro.state.json simultaneously produced truncated JSON that broke every downstream hook. Sequential execution is slower but safe. The 200ms overhead is invisible to users because human typing speed is the bottleneck, not hook latency.
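Sequential execution was the fix the system actually shipped, but the corruption itself has a standard remedy worth noting if you ever do need concurrent writers: write the complete document to a temporary file, then rename it over the state file. On POSIX filesystems the rename is atomic, so a reader never observes a torn file. A minimal sketch (the file name, payload, and `update_state` helper are illustrative, not the system's real code):

```shell
#!/bin/bash
# atomic-state.sh (sketch): replace a JSON state file atomically.
STATE_FILE="${STATE_FILE:-/tmp/jiro.state.json}"

update_state() {
    local tmp
    tmp=$(mktemp "${STATE_FILE}.XXXXXX")   # temp file on the same filesystem
    printf '%s\n' "$1" > "$tmp"            # write the complete document first
    mv "$tmp" "$STATE_FILE"                # rename atomically replaces the old file
}

update_state '{"session":"demo","hooks_run":9}'
```

This prevents torn reads, though two concurrent writers can still overwrite each other's updates; for read-modify-write cycles you would also need a lock (e.g. `flock`), which is more machinery than a sequential dispatcher.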


How It Grew

The growth was not linear. It followed a pattern of problem-solution-integration cycles.

Phase 1: Single-purpose hooks (Week 1-2). Each hook solved one problem. enforce-opus-model.sh blocked non-Opus model requests. no-time-estimates.sh removed effort estimates from responses. filter-sensitive.sh caught credentials in tool calls. These hooks operated independently. No hook knew about any other hook.

Phase 2: Coordination problems (Week 3-4). Hooks started interfering with each other. The credential filter blocked legitimate API calls. The model enforcer conflicted with subagent spawning. The solution: dispatchers. A single entry point (prompt-dispatcher.sh) replaced seven individual UserPromptSubmit hooks, controlling execution order and sharing state through a cached stdin pipe.

Phase 3: Compound capabilities (Week 5-8). Individual hooks composed into systems. The quality loop connected pre-tool hooks (catching problems before they happen) with post-tool hooks (verifying results after they happen) through a shared state file (jiro.state.json). The deliberation system used recursion guards, spawn budgets, and consensus protocols to coordinate multiple agents without infinite loops. Ralph (the autonomous development loop) connected PRD files to Claude spawning to test verification to code review in a single orchestrated pipeline.

Phase 4: Self-awareness (Week 9+). The system became large enough to need tools for understanding itself. Semantic search across the hook system (/find skill) let agents discover hooks by purpose rather than filename. Performance monitoring (/perf skill) tracked whether the system’s own overhead was degrading the machine. A context-pressure monitor warned when the orchestration layer’s injected context was consuming too much of the model’s context window.

The progression from single-purpose hooks to self-monitoring infrastructure mirrors a pattern that Chris Lattner identified in his review of the Claude C Compiler project: “Good software depends on judgment, communication, and clear abstraction. AI has amplified this.”[3] The hook system’s architecture reveals the same truth. The valuable hooks are not the ones that automate tasks. The valuable hooks are the ones that encode judgment about when and how tasks should be automated.


Judgment Hooks vs. Automation Hooks

Lattner’s review of the Claude C Compiler distinguished between what AI automates well (implementation) and what remains fundamentally human (judgment and abstraction).[3] This distinction maps directly onto the hook system.

Judgment hooks decide whether something should happen. They encode policy, not procedure.

| Hook | Judgment |
| --- | --- |
| quality-gate.sh | “Is this work complete enough to report?” |
| filter-sensitive.sh | “Does this command risk exposing credentials?” |
| recursion-guard.sh | “Has the agent spawned too many sub-agents?” |
| context-pressure.sh | “Is the context window too full to continue effectively?” |
| cost-gate.sh | “Has this session exceeded its budget threshold?” |
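A judgment hook is usually a few lines of policy wrapped around a counter or a pattern match. A sketch in the spirit of recursion-guard.sh (the state file, the threshold, and the return-code convention are all illustrative, not the system's actual implementation):

```shell
#!/bin/bash
# recursion-guard.sh (sketch): refuse sub-agent spawns past a budget.
# COUNT_FILE and MAX_SPAWNS are hypothetical names for illustration.

guard_spawn() {
    local count
    count=$(cat "$COUNT_FILE" 2>/dev/null)
    count=${count:-0}
    if [ "$count" -ge "$MAX_SPAWNS" ]; then
        echo "spawn budget exhausted ($count/$MAX_SPAWNS); refusing sub-agent" >&2
        return 2   # a real PreToolUse hook would exit non-zero to block the call
    fi
    echo $((count + 1)) > "$COUNT_FILE"
    return 0
}
```

A real hook would read the event payload from stdin and exit rather than return; the function form just makes the policy testable in isolation. The point is that the entire value is in the `if`: the threshold encodes a judgment about how much recursion is too much.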

Automation hooks execute predetermined actions. They encode procedure, not policy.

| Hook | Automation |
| --- | --- |
| inject-context.sh | Inject date, time, working directory, branch into every prompt |
| track-usage.sh | Record token counts and session metrics |
| sysmon-snapshot.sh | Capture CPU, memory, disk state |
| memory-capsule-inject.sh | Restore context after compaction |
| activity-heartbeat.sh | Update session liveness indicator |

The judgment hooks are harder to write, harder to test, and more valuable. quality-gate.sh required seven named failure modes, six evidence criteria, and a hedging-language detector. inject-context.sh required five lines of bash. But both are necessary. Automation hooks provide the data that judgment hooks evaluate. sysmon-snapshot.sh (automation) feeds data to the performance monitor that decides whether to recommend throttling agent count (judgment).
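For contrast, an automation hook in the inject-context.sh mold really can be a handful of lines; it only has to print additional context to stdout. A sketch (the exact fields are my guesses, not the real hook's output):

```shell
#!/bin/bash
# inject-context.sh (sketch): emit ambient facts for the dispatcher to prepend.
inject_context() {
    echo "Current date: $(date +%Y-%m-%d)"
    echo "Working directory: $PWD"
    echo "Git branch: $(git branch --show-current 2>/dev/null || echo 'none')"
}
inject_context
```

There is no decision anywhere in it, which is exactly what makes it cheap to write and safe to run on every prompt.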

The ratio matters. In a healthy orchestration layer, judgment hooks should outnumber automation hooks. If most hooks just inject data or record metrics, the system automates well but governs poorly. A verified count of the current system: 35 judgment hooks, 44 automation hooks, roughly 4:5. Automation still leads. The ratio started at approximately 1:6 (almost all injection and logging hooks) and shifted toward judgment over two months as governance constraints were added after encountering failures that pure automation could not prevent. The ratio has not reached parity yet, which is itself a useful signal: this system still governs less than it automates.


Planning-Execution Separation

Boris Tane’s “How I use Claude Code” post attracted 936 points on Hacker News by describing a workflow pattern: separate planning from execution.[4] Plan with one Claude session (researching, outlining, designing), then execute with a fresh session that receives the plan as structured input. The pattern resonated because it solves a real problem: planning and execution compete for context window space.

The hook system arrived at the same separation through a different path. The deliberation system spawns specialized agents to research and debate approaches. The output is a structured PRD (Product Requirements Document) with stories, acceptance criteria, and verification types. The Ralph loop reads the PRD and spawns fresh Claude instances to implement each story. Planning agents never implement. Implementation agents never plan.

This separation was not a design goal. It emerged from two independent constraints:

  1. Context window pressure. Planning requires reading many files and exploring options. Implementation requires focused context on the current task. Putting both in the same context window means neither gets enough space. Separate sessions give each phase full context.

  2. Quality verification independence. If the same agent plans and implements, it cannot objectively verify its own implementation against the plan. A fresh agent with only the plan and the code provides independent verification. The Ralph loop enforces this: implementation agents run tests, but three separate review agents (correctness, security, conventions) verify the results.

The convergence between Tane’s manual workflow and the automated hook system suggests that planning-execution separation is a natural property of agentic systems, not just a practitioner preference. Any system that manages context windows and verifies outputs will eventually separate planning from execution because the alternative (doing both in one context) produces worse results in both phases.


Where the Hook System Fails

The architecture has three significant weaknesses that a purpose-built orchestration framework would address.

No declarative workflow definitions. Every workflow is encoded imperatively in bash scripts. The Ralph loop is 1,320 lines of bash that encode a specific sequence: read PRD, select story, gather context, spawn Claude, run tests, run reviews, handle failures, update state. Changing the workflow means editing bash. A declarative system would define workflows as data (YAML, JSON) that an interpreter executes. Declarative workflows are easier to modify, compose, and visualize. Imperative scripts are easier to write initially but harder to maintain as they grow.
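For concreteness, a declarative version of the same pipeline might look something like this as data. The schema, step names, and scripts are entirely hypothetical; nothing in the current system parses anything like it:

```yaml
# ralph-workflow.yaml (hypothetical): the Ralph pipeline as data
workflow: ralph
steps:
  - id: select_story
    run: select-story.sh
    input: prd.json
  - id: implement
    run: spawn-claude.sh
    needs: [select_story]
  - id: test
    run: run-tests.sh
    needs: [implement]
  - id: review
    run: review-agents.sh
    needs: [test]
    agents: [correctness, security, conventions]
on_failure: retry_story
```

The appeal is that an interpreter could validate, visualize, and reorder this structure, none of which is practical with 1,320 lines of bash.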

Hook ordering is fragile. The prompt dispatcher runs hooks in a hardcoded sequence. Moving memory-capsule-inject.sh before inject-context.sh would break the capsule injection because it depends on the session ID that inject-context.sh resolves. These dependencies are implicit (encoded in the dispatcher’s ordering) rather than explicit (declared as dependencies between hooks). A purpose-built system would express hook dependencies as a DAG and topologically sort execution order.
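Making the dependencies explicit does not require a framework; coreutils already ships a topological sort. A sketch where each edge declares “prerequisite dependent” and `tsort` derives a valid execution order (the dependency pairs are illustrative, matching the example above):

```shell
#!/bin/bash
# hook-order.sh (sketch): derive dispatcher order from declared dependencies.
# Each input line is "prerequisite dependent"; tsort prints one valid
# topological order and fails loudly if the declarations contain a cycle.
order=$(tsort <<'EOF'
inject-context.sh memory-capsule-inject.sh
inject-context.sh track-usage.sh
track-usage.sh context-pressure.sh
EOF
)
echo "$order"
```

Feeding the result into the dispatcher's handler array would turn the implicit ordering into a checked property: reordering a declaration cannot silently break a dependency, only change it visibly.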

No workflow visualization. With 84 hooks, understanding the full execution path of any user action requires reading dispatcher code and tracing hook chains manually. There is no tool that shows “when the user types a message, these 9 hooks fire in this order, and hook 3 calls library function X which writes to state file Y.” The system is observable through logs but not through structure. A purpose-built orchestration framework would provide a visual graph of hook dependencies, data flows, and execution paths.

These weaknesses share a common cause: the system grew organically from solving individual problems rather than being designed as a coherent orchestration layer. Organic growth produces systems that work (all 84 hooks function correctly in production) but are hard to reason about as a whole. The trade-off is real: designing the orchestration layer up front would have produced better structure but worse capabilities, because many capabilities (memory capsules, output whitelists, spawn budgets) were invented in response to failures that could not have been predicted before they occurred.


What Practitioners Should Take Away

If you are building an orchestration layer on top of an agent CLI, three patterns from this system transfer directly.

Start with dispatchers, not individual hooks. The biggest architectural improvement was replacing seven individual UserPromptSubmit hooks with a single dispatcher that runs them sequentially. If you anticipate more than three hooks on any event type, build the dispatcher first. The 30 minutes spent writing a dispatcher saves hours of debugging hook interaction bugs later. The minimal pattern:

```shell
#!/bin/bash
# dispatcher.sh — sequential hook execution with shared stdin
HANDLERS=("inject-context.sh" "track-usage.sh" "quality-gate.sh")
HOOK_DIR="$(dirname "$0")/handlers"
INPUT=$(cat)  # Cache stdin once (each handler gets the same input)

for handler in "${HANDLERS[@]}"; do
    # printf avoids echo's backslash-mangling in some shells
    [ -x "$HOOK_DIR/$handler" ] && printf '%s\n' "$INPUT" | "$HOOK_DIR/$handler"
done
```

Register this single dispatcher as your hook entry point. Add handlers to the array as you build them. Each handler reads the same cached stdin (the hook event payload) and writes to stdout independently.
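For reference, registering the dispatcher in Claude Code looks roughly like this in .claude/settings.json. The schema has changed across releases, so treat this as a sketch and check the current hooks documentation rather than copying it verbatim:

```json
{
  "hooks": {
    "UserPromptSubmit": [
      {
        "hooks": [
          { "type": "command", "command": "$CLAUDE_PROJECT_DIR/.claude/hooks/dispatcher.sh" }
        ]
      }
    ]
  }
}
```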

Separate judgment from automation early. When writing a new hook, ask: “Does this hook decide whether something should happen, or does it execute a predetermined action?” Judgment hooks need more testing, more edge case handling, and more iteration. Automation hooks need reliability and performance. Treating them the same leads to under-tested judgment hooks and over-engineered automation hooks.

Let planning-execution separation emerge. Do not force the separation on day one. Build the simplest thing that works. When you notice that your agent’s context window is too full for both planning and implementation, split them. When you notice that your agent cannot objectively verify its own work, add independent review agents. The separation will feel obvious when the constraints demand it.

The hook-based approach has one advantage over purpose-built orchestration frameworks: zero commitment. Every hook is independent. You can adopt one hook, ten hooks, or eighty-four hooks. You can delete any hook without breaking others (assuming you maintain the dispatcher). There is no framework to learn, no dependency to manage, no runtime to operate. The orchestration layer is just files.

Karpathy called it a new computational layer. The implementation is older than the name. Practitioners have been building Claws since the first time they wrote a shell script to wrap an agent CLI call. The difference between a shell script and an orchestration layer is not a difference in kind. It is a difference in how many problems you have solved, and how many of those solutions had to solve each other.


Sources


  1. Andrej Karpathy, “Claws” discussion, February 2026, x.com/karpathy/status/2024987174077432126. Relayed via Simon Willison, simonwillison.net/2026/Feb/21/claws/

  2. Context injection architecture detailed in “Context Is Architecture.” 

  3. Chris Lattner, “The Claude C Compiler: What It Reveals About the Future of Software,” Modular blog, February 2026. Relayed via Simon Willison, simonwillison.net/2026/Feb/22/ccc/

  4. Boris Tane, “How I use Claude Code,” boristane.com, February 2026. 936 points, 569 comments on Hacker News. 
