
Anatomy of a Claw: 84 Hooks as an Orchestration Layer

From the guide: Claude Code Comprehensive Guide

The first hook took four minutes to write. It blocked the model from suggesting OpenAI products in an Anthropic-only workflow. Two months later, that single hook became 84. The 84 hooks connected to 43 skills, 19 specialized agents, and 30+ library modules. At some point the collection stopped being a set of scripts and became an orchestration layer.

I did not design it that way. Nobody sits down and says “I will build 15,000 lines of agent infrastructure.” You solve one problem. Then another. Then you solve the problem of the problems interacting. By the time you notice the architecture, it already exists.

Andrej Karpathy noticed too. In February 2026, he described “Claws” as a new computational layer: orchestration, scheduling, context management, and tool routing built on top of LLM agents, the same way agents are built on top of LLMs.1 The framing crystallized something practitioners had been building without naming. This post is the anatomy of one such system: what it contains, how it grew, where it works, and where it fails.

TL;DR

Karpathy’s “Claws” layer describes orchestration systems built on top of agent CLIs. I built one organically over two months on Claude Code: 84 hooks across 15 event types, 43 skills, 19 agents, and 30+ library modules. The system maps cleanly to five Claws functions (orchestration, scheduling, context management, tool routing, quality enforcement) with one notable gap (declarative workflow definitions). Key finding: planning-execution separation emerged as a natural property of hook-based orchestration, not as a design goal. Lattner’s observation that “judgment and abstraction remain core while AI automates implementation” maps directly to the hook architecture: governance hooks exercise judgment, automation hooks execute implementation.


The Claws Taxonomy

Karpathy’s description identifies five functions that a Claws layer performs. Each function has a direct analog in the hook system I built on Claude Code over the past two months.1

| Claws Function | Description | Implementation |
| --- | --- | --- |
| Orchestration | Coordinate multiple agents toward a goal | Ralph autonomous loop, deliberation system |
| Scheduling | Determine when tasks execute | Cron hooks, activity-heartbeat.sh, overnight security scanning |
| Context management | Maintain relevant information across turns | Prompt dispatcher, philosophy injectors, memory capsules |
| Tool routing | Direct tool calls through appropriate handlers | 84 hooks across PreToolUse, PostToolUse, UserPromptSubmit events (hook event reference) |
| Quality enforcement | Verify outputs meet standards | Quality gates, evidence requirements, 7 review agents |

The taxonomy is useful because it separates concerns that practitioners tend to build in tangled ways. My early hooks mixed context management with quality enforcement. The cost-tracking hook both injected budget context (context management) and blocked expensive operations (quality enforcement). Separating these into distinct hooks improved reliability because each hook could fail independently without breaking the other function.
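What that split looks like in practice: two minimal sketches rather than the actual scripts. The hook names and state-file path are illustrative; the mechanics are Claude Code's documented ones (stdout from a UserPromptSubmit hook is added to context, and exit code 2 from a PreToolUse hook blocks the tool call).

#!/bin/bash
# inject-budget.sh (UserPromptSubmit): context management only.
# Stdout here is injected into the model's context; this hook never blocks.
SPENT=$(jq -r '.total_cost_usd // 0' "$HOME/.claude/cost-state.json" 2>/dev/null)
echo "Budget context: \$${SPENT:-0} spent this session"

#!/bin/bash
# cost-gate.sh (PreToolUse): quality enforcement only.
# Exit code 2 blocks the tool call; stderr is fed back to the model.
SPENT=$(jq -r '.total_cost_usd // 0' "$HOME/.claude/cost-state.json" 2>/dev/null)
if (( $(echo "${SPENT:-0} > 5.00" | bc -l) )); then
    echo "Blocked: session spend \$$SPENT exceeds the \$5.00 budget" >&2
    exit 2
fi
exit 0

Either script can now fail or be disabled without taking the other function down with it.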


The Full System

The numbers as of February 2026:

| Component | Count | Purpose |
| --- | --- | --- |
| Hooks | 84 | Event-driven functions across 15 hook event types |
| Skills | 43 | Reusable capability modules invoked by name |
| Agents | 19 | Specialized subagents for review, exploration, development |
| Library modules | 30+ | Shared Python and Bash utilities |
| Lines of code | ~15,000 | Across hooks, skills, agents, libraries, configs |

The hook distribution across event types reveals where orchestration complexity concentrates:

| Event Type | Hook Count | Example |
| --- | --- | --- |
| UserPromptSubmit | 9 (via dispatcher) | Context injection, cost tracking, usage analytics |
| PreToolUse:Bash | 12 | Security scanning, credential checking, sensitive command blocking |
| PostToolUse:Bash | 6 | Output scanning, deployment verification |
| PreToolUse:Write | 4 | Credential detection, path validation |
| PreToolUse:Edit | 3 | Pattern enforcement |
| PreToolUse:Task | 3 | Recursion guarding, spawn budgeting |
| PreCompact | 1 | Memory capsule, death spiral detection |
| SessionStart | 1 | Environment initialization |
| WorktreeCreate | 1 | Environment setup for isolated branches |
| WorktreeRemove | 1 | Safety checks before cleanup |
| Other event types | ~43 | Distributed across PreToolUse:Read, PostToolUse:Write, PreToolUse:WebFetch, NotebookEdit, and 8 additional event types |

UserPromptSubmit carries the most weight because it fires on every user message. The dispatcher (prompt-dispatcher.sh) runs nine hooks sequentially on every prompt: security filtering, analytics, usage tracking, system monitoring, objective injection, time-estimate blocking, context injection, memory topic injection, and context pressure monitoring.2

Each hook adds latency. Nine sequential hooks add a measured 200ms total per prompt. The dispatcher runs them sequentially (not parallel) because concurrent hook writes to shared JSON state files caused data corruption in early testing. Two hooks writing to jiro.state.json simultaneously produced truncated JSON that broke every downstream hook. Sequential execution is slower but safe. The 200ms overhead is invisible to users because human typing speed is the bottleneck, not hook latency.
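Sequential execution is not the only possible fix; the corruption itself is preventable. If parallel hooks ever became necessary, a lock plus an atomic rename would keep readers from observing half-written state. A minimal sketch (the helper name is hypothetical):

#!/bin/bash
# state-write.sh: set one key in jiro.state.json safely.
# flock serializes writers; writing to a temp file and renaming makes the
# update atomic, so no reader ever sees truncated JSON.
STATE="$HOME/.claude/jiro.state.json"
(
    flock -x 200
    jq --arg k "$1" --arg v "$2" '.[$k] = $v' "$STATE" > "$STATE.tmp" \
        && mv "$STATE.tmp" "$STATE"
) 200>"$STATE.lock"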


How It Grew

The growth was not linear. It followed a pattern of problem-solution-integration cycles.

Phase 1: Single-purpose hooks (Week 1-2). Each hook solved one problem. enforce-opus-model.sh blocked non-Opus model requests. no-time-estimates.sh removed effort estimates from responses. filter-sensitive.sh caught credentials in tool calls. These hooks operated independently. No hook knew about any other hook.

Phase 2: Coordination problems (Week 3-4). Hooks started interfering with each other. The credential filter blocked legitimate API calls. The model enforcer conflicted with subagent spawning. The solution: dispatchers. A single entry point (prompt-dispatcher.sh) replaced seven individual UserPromptSubmit hooks, controlling execution order and sharing state through a cached stdin pipe.

Phase 3: Compound capabilities (Week 5-8). Individual hooks composed into systems. The quality loop connected pre-tool hooks (catching problems before they happen) with post-tool hooks (verifying results after they happen) through a shared state file (jiro.state.json). The deliberation system used recursion guards, spawn budgets, and consensus protocols to coordinate multiple agents without infinite loops. Ralph (the autonomous development loop) connected PRD files to Claude spawning to test verification to code review in a single orchestrated pipeline.

Phase 4: Self-awareness (Week 9+). The system became large enough to need tools for understanding itself. Semantic search across the hook system (/find skill) let agents discover hooks by purpose rather than filename. Performance monitoring (/perf skill) tracked whether the system’s own overhead was degrading the machine. A context-pressure monitor warned when the orchestration layer’s injected context was consuming too much of the model’s context window.

The progression from single-purpose hooks to self-monitoring infrastructure mirrors a pattern that Chris Lattner identified in his review of the Claude C Compiler project: “Good software depends on judgment, communication, and clear abstraction. AI has amplified this.”3 The hook system’s architecture reveals the same truth. The valuable hooks are not the ones that automate tasks. The valuable hooks are the ones that encode judgment about when and how tasks should be automated.


Judgment Hooks vs. Automation Hooks

Lattner’s review of the Claude C Compiler distinguished between what AI automates well (implementation) and what remains fundamentally human (judgment and abstraction).3 This distinction maps directly onto the hook system.

Judgment hooks decide whether something should happen. They encode policy, not procedure.

| Hook | Judgment |
| --- | --- |
| quality-gate.sh | “Is this work complete enough to report?” |
| filter-sensitive.sh | “Does this command risk exposing credentials?” |
| recursion-guard.sh | “Has the agent spawned too many sub-agents?” |
| context-pressure.sh | “Is the context window too full to continue effectively?” |
| cost-gate.sh | “Has this session exceeded its budget threshold?” |

Automation hooks execute predetermined actions. They encode procedure, not policy.

| Hook | Automation |
| --- | --- |
| inject-context.sh | Inject date, time, working directory, branch into every prompt |
| track-usage.sh | Record token counts and session metrics |
| sysmon-snapshot.sh | Capture CPU, memory, disk state |
| memory-capsule-inject.sh | Restore context after compaction |
| activity-heartbeat.sh | Update session liveness indicator |

The judgment hooks are harder to write, harder to test, and more valuable. quality-gate.sh required seven named failure modes, six evidence criteria, and a hedging-language detector. inject-context.sh required five lines of bash. But both are necessary. Automation hooks provide the data that judgment hooks evaluate. sysmon-snapshot.sh (automation) feeds data to the performance monitor that decides whether to recommend throttling agent count (judgment).
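For scale, here is roughly what those five lines look like: a sketch in the shape of inject-context.sh, not the exact script.

#!/bin/bash
# inject-context.sh: pure automation. Stdout from a UserPromptSubmit
# hook is appended to the model's context on every prompt.
echo "Date: $(date '+%Y-%m-%d %H:%M %Z')"
echo "Working directory: $PWD"
echo "Branch: $(git branch --show-current 2>/dev/null || echo 'none')"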

The ratio matters. In a healthy orchestration layer, judgment hooks should outnumber automation hooks. If most hooks just inject data or record metrics, the system automates well but governs poorly. A verified count of the current system: 35 judgment hooks, 44 automation hooks, roughly 4:5. Automation still leads. The ratio started at approximately 1:6 (almost all injection and logging hooks) and shifted toward judgment over two months as governance constraints were added after encountering failures that pure automation could not prevent. The ratio has not reached parity yet, which is itself a useful signal: this system still governs less than it automates.


Planning-Execution Separation

Boris Tane’s “How I use Claude Code” post attracted 936 points on Hacker News by describing a workflow pattern: separate planning from execution.4 Plan with one Claude session (researching, outlining, designing), then execute with a fresh session that receives the plan as structured input. The pattern resonated because it solves a real problem: planning and execution compete for context window space.

The hook system arrived at the same separation through a different path. The deliberation system spawns specialized agents to research and debate approaches. The output is a structured PRD (Product Requirements Document) with stories, acceptance criteria, and verification types. The Ralph loop reads the PRD and spawns fresh Claude instances to implement each story. Planning agents never implement. Implementation agents never plan.
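The execution half of that split reduces to a small loop. A stripped-down sketch, with a hypothetical file layout (claude -p runs Claude Code non-interactively on a single prompt):

#!/bin/bash
# ralph-core.sh: each story gets a fresh Claude instance, so the plan
# arrives as text in the prompt rather than as accumulated session context.
for story in prd/stories/*.md; do
    claude -p "Implement exactly this story, then run the tests:
$(cat "$story")" || echo "FAILED: $story" >&2
done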

The separation was not a design goal. It emerged from two independent constraints:

  1. Context window pressure. Planning requires reading many files and exploring options. Implementation requires focused context on the current task. Putting both in the same context window means neither gets enough space. Separate sessions give each phase full context.

  2. Quality verification independence. If the same agent plans and implements, it cannot objectively verify its own implementation against the plan. A fresh agent with only the plan and the code provides independent verification. The Ralph loop enforces this: implementation agents run tests, but three separate review agents (correctness, security, conventions) verify the results.

The convergence between Tane’s manual workflow and the automated hook system suggests that planning-execution separation is a natural property of agentic systems, not just a practitioner preference. Any system that manages context windows and verifies outputs will eventually separate planning from execution because the alternative (doing both in one context) produces worse results in both phases.


Where the Hook System Fails

The architecture has three significant weaknesses that a purpose-built orchestration framework would address.

No declarative workflow definitions. Every workflow is encoded imperatively in bash scripts. The Ralph loop is 1,320 lines of bash that encode a specific sequence: read PRD, select story, gather context, spawn Claude, run tests, run reviews, handle failures, update state. Changing the workflow means editing bash. A declarative system would define workflows as data (YAML, JSON) that an interpreter executes. Declarative workflows are easier to modify, compose, and visualize. Imperative scripts are easier to write initially but harder to maintain as they grow.
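For contrast, a hypothetical declarative rendering of the same sequence as data an interpreter would execute:

# ralph.workflow.yaml (hypothetical): the Ralph pipeline as data
workflow: ralph
steps:
  - id: select_story
    run: select-story.sh
  - id: implement
    run: spawn-claude.sh
    needs: [select_story]
  - id: test
    run: run-tests.sh
    needs: [implement]
    on_failure: implement   # retry implementation instead of aborting
  - id: review
    run: run-reviews.sh     # correctness, security, conventions
    needs: [test]

Changing the workflow becomes an edit to the data; the interpreter carries the control flow once.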

Hook ordering is fragile. The prompt dispatcher runs hooks in a hardcoded sequence. Moving memory-capsule-inject.sh before inject-context.sh would break the capsule injection because it depends on the session ID that inject-context.sh resolves. These dependencies are implicit (encoded in the dispatcher’s ordering) rather than explicit (declared as dependencies between hooks). A purpose-built system would express hook dependencies as a DAG and topologically sort execution order.
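Even without a framework, the sort itself is a solved problem: coreutils ships tsort. Declare the edges and derive the order (the edges shown are illustrative, built from the dependency described above):

# Each line reads "prerequisite dependent". tsort prints a valid
# execution order and exits nonzero if the edges ever form a cycle.
tsort <<'EOF'
inject-context.sh memory-capsule-inject.sh
inject-context.sh track-usage.sh
filter-sensitive.sh inject-context.sh
EOF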

No workflow visualization. With 84 hooks, understanding the full execution path of any user action requires reading dispatcher code and tracing hook chains manually. There is no tool that shows “when the user types a message, these 9 hooks fire in this order, and hook 3 calls library function X which writes to state file Y.” The system is observable through logs but not through structure. A purpose-built orchestration framework would provide a visual graph of hook dependencies, data flows, and execution paths.

These weaknesses share a common cause: the system grew organically from solving individual problems rather than being designed as a coherent orchestration layer. Organic growth produces systems that work (all 84 hooks function correctly in production) but are hard to reason about as a whole. The trade-off is real: designing the orchestration layer up front would have produced better structure but worse capabilities, because many capabilities (memory capsules, output whitelists, spawn budgets) were invented in response to failures that could not have been predicted before they occurred.


The Harness Goes Mainstream

Three weeks after Karpathy named the layer, the concept has a second name and a growing community.

Geoffrey Huntley proposed a formal definition: “Agent Harness — the orchestration layer around a language model that transforms it from a tool into a teammate.”5 The framing is precise. The harness is not the model. It is not the tools the model calls. It is the system that decides which tools to call, when to call them, and how to evaluate whether the call succeeded. Every production agent system builds this layer. Most build it implicitly, inside application code that mixes orchestration logic with business logic. Naming it makes the architecture visible.

The community signals confirm the pattern is spreading. Pieter Levels reported permanently switching to running Claude Code on a server, treating it as infrastructure rather than a local tool.6 Anthropic shipped Remote Control, letting users kick off tasks from the terminal and pick them up from Claude.ai.7 Ben Cherny announced /simplify and /batch as first-party skills.8 Each of these is a harness feature: persistent execution, remote orchestration, and built-in capability modules. The CLI is growing into the harness.

Meanwhile, practitioners are building their own harness components. One developer published 22 custom Obsidian + Claude Code commands for a personal operating system.9 Another created a “Visual Explainer” agent skill with complementary slash commands.10 The patterns match: dispatchers, skills, shared state, event-driven hooks. Nobody reads a framework guide before building these systems. They solve one problem, then another, then the problems of the problems interacting.

Two recent projects reveal how sophisticated community-built harness components have become. nah is a context-aware permission guard that registers as a PreToolUse hook.14 It classifies 20 distinct action types (file writes, network requests, process spawning, etc.) and applies per-type policies. The tool detects pipe decomposition attacks where an agent chains innocuous commands to achieve a blocked operation. The architecture mirrors filter-sensitive.sh and recursion-guard.sh from this post, arrived at independently by a different practitioner solving the same governance problems.

Rudel provides session analytics by ingesting Claude Code session data into ClickHouse.15 Analysis of 1,573 sessions revealed that only 4% of users invoke skills and 26% of sessions are abandoned within 60 seconds. The numbers confirm what the harness architecture implies: most users interact with agent CLIs at the surface level. The orchestration layer described in this post represents the deep end of a usage distribution where the vast majority never leave the shallow end. The gap between what the tool can do and what most users ask it to do is the space that harness infrastructure fills.


Autoresearch: The Harness as Research Loop

Karpathy’s own autoresearch project demonstrates the harness pattern in a different domain.11 The system points a language model at a training script (train.py), runs a five-minute experiment, evaluates the result against a fixed metric (validation bits per byte), and keeps improvements or discards regressions. Over two days, the system ran approximately 700 experiments and found approximately 20 genuine improvements, reducing GPT-2 training time by 11%.

The architecture is identical to the hook system described above. A fixed evaluation harness (prepare.py) is the equivalent of judgment hooks: it decides whether an experiment succeeded. The training script (train.py) is the equivalent of automation hooks: it executes the agent’s modifications. The git branch management (keep on improvement, reset on regression) is the equivalent of the Ralph loop’s state management. The results.tsv log is the equivalent of session telemetry.
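The keep-or-reset core is compact enough to sketch. Assume an evaluate command that prints a single metric where lower is better; every name here is hypothetical:

#!/bin/bash
# autoloop.sh: keep changes that improve the metric, discard the rest.
BEST=$(evaluate)    # e.g. prints validation bits per byte
while true; do
    claude -p "Make one change to train.py that reduces validation bits per byte."
    SCORE=$(evaluate)
    if (( $(echo "$SCORE < $BEST" | bc -l) )); then
        git commit -am "keep: $BEST -> $SCORE"
        BEST=$SCORE
    else
        git checkout -- .   # reset on regression
    fi
done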

The pattern transfers because the harness solves a problem that is independent of domain. Whether the agent writes code, optimizes a training loop, or manages a content pipeline, it needs: a way to evaluate results against criteria, a way to keep or discard changes, a way to maintain state across iterations, and a way to run autonomously without human intervention. These four requirements produce the same architecture regardless of what the agent is actually doing.

Shopify CEO Tobi Lütke adapted autoresearch internally. His agent-optimized smaller model outperformed a manually-configured larger model, validating the claim that autonomous harness-driven iteration discovers configurations that humans do not think to try.12


The Security Gap in the Harness

The harness solves orchestration. It does not automatically solve safety.

A study of iterative LLM-driven code refinement found that 43.7% of iteration chains contained more vulnerabilities after ten rounds of agent modifications than the baseline code they started from.13 The root cause was specification drift: as the agent optimized for functional correctness, it progressively removed defensive logic and weakened exception handling. Worse, adding static analysis security tools (SAST gates) to the iteration loop actually increased latent degradation from 12.5% to 20.8%. The scanners created a false sense of security that caused the agent to be less cautious, not more.

The degradation finding is directly relevant to harness design. The judgment hooks described in this post (quality-gate.sh, filter-sensitive.sh, recursion-guard.sh) address the quality and safety dimensions that automation alone degrades. The SCAFFOLD-CEGIS framework that addressed the degradation problem used four layers of gated verification, achieving a 2.1% latent degradation rate with 100% safety monotonicity.13 The architecture parallels the hook system: separate evaluation layers, each checking a different property, with explicit gates between phases.

A separate effort corroborates the threat model from the production side. Perplexity’s NIST response on AI agent security mapped the attack surface of agentic systems operating at scale.16 The primary vectors: indirect prompt injection through data channels (web pages, emails, tool outputs), CIA triad violations specific to agents (data exfiltration via tool calls, action manipulation through context poisoning, resource exhaustion through recursive spawning), and real CVEs from agent-adjacent systems. Their recommended defense architecture (input-level filtering, model-level alignment, and deterministic enforcement through sandboxing and allowlists) mirrors the three-layer pattern that emerged organically in the hook system described here. Automation hooks filter inputs. The model exercises judgment. Governance hooks enforce deterministic constraints that the model cannot override.

The lesson for practitioners: if your harness only automates and orchestrates without governing, iterative agent execution will introduce security regressions that standard tooling cannot detect. The judgment hooks are not overhead. They are the reason the system does not degrade.


What Practitioners Should Take Away

If you are building an orchestration layer on top of an agent CLI — starting from the Claude Code guide or from scratch — three patterns from this system transfer directly.

Start with dispatchers, not individual hooks. The biggest architectural improvement was replacing seven individual UserPromptSubmit hooks with a single dispatcher that runs them sequentially. If you anticipate more than three hooks on any event type, build the dispatcher first. The 30 minutes spent writing a dispatcher saves hours of debugging hook interaction bugs later. The minimal pattern:

#!/bin/bash
# dispatcher.sh — sequential hook execution with shared stdin
HANDLERS=("inject-context.sh" "track-usage.sh" "quality-gate.sh")
HOOK_DIR="$(dirname "$0")/handlers"
INPUT=$(cat)  # Cache stdin once (each handler gets the same input)

for handler in "${HANDLERS[@]}"; do
    [ -x "$HOOK_DIR/$handler" ] || continue
    printf '%s' "$INPUT" | "$HOOK_DIR/$handler"  # printf avoids echo's escape handling
    status=$?
    # Exit code 2 is Claude Code's blocking signal; propagate it so a
    # judgment hook like quality-gate.sh can actually stop the action.
    [ "$status" -eq 2 ] && exit 2
done
exit 0

Register the single dispatcher as your hook entry point. Add handlers to the array as you build them. Each handler reads the same cached stdin (the hook event payload) and writes to stdout independently; a blocking exit code from any handler stops the chain.

Separate judgment from automation early. When writing a new hook, ask: “Does the hook decide whether something should happen, or does it execute a predetermined action?” Judgment hooks need more testing, more edge case handling, and more iteration. Automation hooks need reliability and performance. Treating them the same leads to under-tested judgment hooks and over-engineered automation hooks.

Let planning-execution separation emerge. Do not force the separation on day one. Build the simplest thing that works. When you notice that your agent’s context window is too full for both planning and implementation, split them. When you notice that your agent cannot objectively verify its own work, add independent review agents. The separation will feel obvious when the constraints demand it.

The harness is going mainstream because the pattern is inevitable. Whether you call the layer Claws, Agent Harness, or just “my hooks folder,” any system that coordinates agents toward goals will converge on the same architecture: dispatchers for ordering, judgment hooks for governance, automation hooks for execution, and state files for continuity. The Claude Code source leak confirmed that Anthropic’s own internal architecture follows these same patterns: coordinator mode is implemented entirely as system prompt instructions, not code-level orchestration.

The hook-based approach has one advantage over purpose-built orchestration frameworks: zero commitment. Every hook is independent. You can adopt one hook, ten hooks, or eighty-four hooks. You can delete any hook without breaking others (assuming you maintain the dispatcher). There is no framework to learn, no dependency to manage, no runtime to operate. The orchestration layer is just files.


Sources


  1. Andrej Karpathy, “Claws” discussion, February 2026, x.com/karpathy/status/2024987174077432126. 351 points, 795 comments on Hacker News. Relayed via Simon Willison, simonwillison.net/2026/Feb/21/claws/

  2. Context injection architecture detailed in “Context Is Architecture.” 

  3. Chris Lattner, “The Claude C Compiler: What It Reveals About the Future of Software,” Modular blog, February 2026. Relayed via Simon Willison, simonwillison.net/2026/Feb/22/ccc/

  4. Boris Tane, “How I use Claude Code,” boristane.com, February 2026. 936 points, 569 comments on Hacker News. 

  5. Geoffrey Huntley, “Agent Harness” definition, March 2026, x.com/GeoffreyHuntley/status/2028008682676723943

  6. Pieter Levels, switching to running Claude Code on servers permanently, March 2026, x.com/levelsio/status/2027566773814403448

  7. Anthropic, “New in Claude Code: Remote Control,” March 2026, x.com/claudeai/status/2026418433911603668

  8. Ben Cherny, Claude Code /simplify and /batch skills announcement, March 2026, x.com/bcherny/status/2027534984534544489

  9. Internet Vin, “22 commands I use with Obsidian and Claude Code,” March 2026, x.com/internetvin/status/2026461256677245131

  10. Nicopreme, “Visual Explainer” agent skill with slash commands, x.com/nicopreme/status/2023495040258261460

  11. Andrej Karpathy, autoresearch: AI agents running autonomous ML research, March 2026, github.com/karpathy/autoresearch. 196 points, 55 comments on Hacker News. 630-line Python script running ~700 experiments over two days, finding ~20 genuine improvements. 

  12. Tobi Lütke, Shopify CEO, adapted autoresearch internally; agent-optimized smaller model outperformed manually-configured larger model, March 2026. Reported via VentureBeat

  13. Yi Chen et al., “SCAFFOLD-CEGIS: Preventing Latent Security Degradation in LLM-Driven Iterative Code Refinement,” arXiv:2603.08520, March 2026, arxiv.org/abs/2603.08520v1. 43.7% of iteration chains introduced more vulnerabilities than baseline after 10 rounds; SAST gates increased latent degradation from 12.5% to 20.8%; SCAFFOLD-CEGIS framework achieved 2.1% latent degradation with 100% safety monotonicity. 

  14. Manuel Schipper, “nah: A context-aware permission guard for Claude Code,” github.com/manuelschipper/nah. PreToolUse hook with 20 action types, per-type policies, and pipe decomposition detection. 124 points, 89 comments on Hacker News. 

  15. keks0r, “Rudel: Claude Code Session Analytics,” github.com/obsessiondb/rudel. ClickHouse-backed analytics across 1,573 sessions. Found 4% skill usage rate, 26% abandonment within 60 seconds. 137 points, 75 comments on Hacker News. 

  16. Ninghui Li, Kaiyuan Zhang, Kyle Polley, Jerry Ma, “Security Considerations for Artificial Intelligence Agents,” arXiv:2603.12230, March 2026, arxiv.org/abs/2603.12230v1. Perplexity’s NIST/CAISI response mapping agent attack surfaces, CIA triad violations, and defense-in-depth architecture from production agentic systems serving millions of users. 
