The Ralph Loop: How I Run Autonomous AI Agents Overnight
I built an autonomous agent system that uses stop hooks to intercept exit attempts, filesystem memory to persist state across context windows, and spawn budgets to prevent runaway recursion. The system shipped my 9-PRD deliberation infrastructure (3,455 lines of Python, 141 tests) across multiple overnight sessions.[1]
TL;DR
The Ralph architecture enables long-running autonomous AI development by solving three problems simultaneously: context window exhaustion (solved by fresh context per iteration), state persistence (solved by filesystem as memory), and task continuity (solved by stop-hook loops that prevent the agent from terminating before completion). I implemented this pattern in my Claude Code hook system and used it to build my multi-agent deliberation infrastructure. The system works, but it taught me hard lessons about spawn budgets, criteria quality, and filesystem pollution.
The Context Window Problem
Every AI conversation operates within a context window: a fixed-size buffer holding conversation history, system prompts, tool outputs, and working memory. Claude’s context window holds approximately 200,000 tokens. A complex development session can exhaust this within 30-60 minutes of intensive work.[2]
I measured this directly during the deliberation system build. Sessions that started with precise multi-file edits across 8 Python modules degraded into single-file tunnel vision by the 90-minute mark. The agent stopped referencing architecture it had read earlier because that context had been compressed away.[3]
The Ralph pattern sidesteps this by giving every iteration a full, fresh context window:
Iteration 1: [200K tokens] → writes code, creates files
Iteration 2: [200K tokens] → reads files from disk, continues
Iteration 3: [200K tokens] → reads updated files, continues
...
Iteration N: [200K tokens] → reads final state, verifies criteria
Compare with a single long session, where the degradation follows a predictable curve:
Minute 0: [200K tokens available] → productive
Minute 30: [150K tokens available] → somewhat productive
Minute 60: [100K tokens available] → degraded
Minute 90: [50K tokens available] → significantly degraded
Minute 120: [compressed, lossy] → errors accumulate
The fresh-context-per-iteration agent outperforms the continuous agent because each iteration allocates full cognitive resources to the current state rather than carrying the weight of previous reasoning.
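In its simplest form, the outer loop is nothing more than a script that re-invokes the agent with a fresh context until a criteria check passes. The sketch below is illustrative rather than my exact driver (my setup routes completion through the stop hook described in the next section); task.md and check-criteria.sh are placeholder names, and the headless claude -p invocation is just one way to launch an iteration.
#!/bin/bash
# Minimal sketch of the outer Ralph loop. Assumes task.md holds the task spec
# and check-criteria.sh exits 0 only when every completion criterion passes.
set -uo pipefail

MAX_ITERATIONS=10
TASK_FILE="task.md"

for i in $(seq 1 "$MAX_ITERATIONS"); do
    echo "=== Ralph iteration $i (fresh context) ==="

    # Each invocation is a brand-new session: no conversation history,
    # only the task file and whatever earlier iterations left on disk.
    claude -p "$(cat "$TASK_FILE")" || echo "iteration $i exited non-zero; state persists on disk"

    # Completion is decided by machine-verifiable checks, never by the agent itself.
    if ./check-criteria.sh; then
        echo "Completion criteria met after $i iteration(s)."
        exit 0
    fi
done

echo "Iteration budget ($MAX_ITERATIONS) exhausted without meeting criteria." >&2
exit 1
Each iteration spends its full window on the current state of the repository, which is exactly the behavior the comparison above describes.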
My Implementation
The Stop Hook
My recursion-guard system intercepts agent stop attempts and checks completion criteria:
#!/bin/bash
# From recursion-guard.sh - simplified

CONFIG_FILE="${HOME}/.claude/configs/recursion-limits.json"
STATE_FILE="${HOME}/.claude/state/recursion-depth.json"

# Safe defaults with config override
MAX_DEPTH=2
MAX_CHILDREN=5
DELIB_SPAWN_BUDGET=2
DELIB_MAX_AGENTS=12

# Load config with validation
load_config() {
    if [[ -f "$CONFIG_FILE" ]] && command -v jq &>/dev/null; then
        config_depth=$(jq -r '.max_depth // 2' "$CONFIG_FILE")
        if [[ "$config_depth" =~ ^[0-9]+$ ]] && [[ "$config_depth" -gt 0 ]]; then
            MAX_DEPTH="$config_depth"
        fi
    fi
}
The hook reads a criteria file that defines success conditions. The conditions must be machine-verifiable: test pass/fail, linter output, HTTP status codes, file existence checks.[4]
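As a concrete illustration, here is a minimal sketch of such a criteria script. It is not my production hook, and the specific commands (pytest, mypy, a linter) are stand-ins for whatever the task requires; the point is that every criterion reduces to a command with an unambiguous exit code.
#!/bin/bash
# check-criteria.sh - illustrative sketch of machine-verifiable completion criteria.
# Each check is a command whose exit code decides pass/fail.
set -uo pipefail

fail=0

check() {
    local description="$1"; shift
    if "$@" > /dev/null 2>&1; then
        echo "PASS: $description"
    else
        echo "FAIL: $description"
        fail=1
    fi
}

check "unit tests pass"             python -m pytest -q
check "no Python type errors"       python -m mypy lib/
check "linter reports no errors"    python -m ruff check .
check "recursion state file exists" test -f "${HOME}/.claude/state/recursion-depth.json"

exit "$fail"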
Filesystem as Persistent Memory
The critical insight: files persist across context windows. My .claude/ directory serves as the agent’s persistent memory — a distributed context architecture that grew from this initial filesystem pattern into a 7-layer, 650-file system:
| Directory | Contents | Role in Ralph Loop |
|---|---|---|
| state/ | recursion-depth.json, agent-lineage.json | Tracks iteration count, parent-child relationships |
| configs/ | 14 JSON files | Encode thresholds, budgets, rules (not hardcoded) |
| handoffs/ | 49 context documents | Preserve multi-session architectural decisions |
| hooks/ | 95 lifecycle handlers | Enforce quality gates between iterations |
Each new iteration reads the current state from disk and continues from where the previous iteration left off. The session-start hook initializes clean state:
# From session-start.sh - recursion state initialization
RECURSION_STATE_FILE="$RECURSION_STATE_DIR/recursion-depth.json"

# Initialize with safe defaults
cat > "$RECURSION_STATE_FILE" <<'EOF'
{
  "depth": 0,
  "agent_id": "root",
  "parent_id": null,
  "initialized_by": "session-start"
}
EOF
If state gets corrupted (which happened twice during development), the recovery pattern recreates from safe defaults rather than crashing:
if ! jq -e '.depth' "$RECURSION_STATE_FILE" &>/dev/null; then
    # Corrupted state file: recreate with the same safe defaults as above
    echo '{"depth": 0, "agent_id": "root", "parent_id": null, "initialized_by": "session-start"}' \
        > "$RECURSION_STATE_FILE"
    echo "- Recursion state recovered (was corrupted)"
fi
Task Specification Format
Effective Ralph tasks include three elements: objective, completion criteria, and context pointers:
OBJECTIVE: Implement multi-agent deliberation with consensus validation.
COMPLETION CRITERIA:
- All tests in tests/test_deliberation_lib.py pass (81 tests)
- post-deliberation.sh validates consensus above 70% threshold
- recursion-guard.sh enforces spawn budget (max 12 agents)
- deliberation-pride-check.sh passes 5 quality checks
- No Python type errors (mypy clean)
CONTEXT:
- Follow patterns in lib/deliberation/state_machine.py
- Consensus thresholds in configs/deliberation-config.json
- Spawn budget model: agents inherit budget, not increment depth
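To close the loop, here is a simplified sketch of the stop-hook side, not the full recursion-guard logic: if a criteria script like the one sketched earlier still fails, the hook refuses to let the agent stop and hands the failing checks back as the next instructions. In Claude Code, exit code 2 from a Stop hook blocks termination and feeds stderr back to the agent.
#!/bin/bash
# Sketch of a criteria-enforcing Stop hook. CRITERIA_SCRIPT is a placeholder
# for a machine-verifiable check like the one sketched earlier.
CRITERIA_SCRIPT="${HOME}/.claude/hooks/check-criteria.sh"
REPORT="$(mktemp)"

if [[ -x "$CRITERIA_SCRIPT" ]] && ! "$CRITERIA_SCRIPT" > "$REPORT" 2>&1; then
    # Block the stop: the failing criteria become the prompt for the next round of work.
    cat "$REPORT" >&2
    echo "Completion criteria not met; keep working on the FAIL items above." >&2
    exit 2
fi

exit 0   # all criteria pass: allow the agent to stop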
What I Built With This Pattern
The Deliberation Infrastructure (9 PRDs)
The largest Ralph loop project: 9 Product Requirement Documents implemented across multiple sessions.
| PRD | Deliverable | Commit |
|---|---|---|
| PRDs 1-4 | Hooks, config, recursion-guard extensions | Committed as 3cad08c |
| PRD-5 | 48 bash integration tests (7 suites) | Committed as 10df724 |
| PRDs 7-8 | Hook wiring + 81 Python unit tests | Committed as fbf1a0d |
| PRD-9 | 12 E2E pipeline simulation tests | Committed as 32bd711 |
Total output: 3,455 lines of Python across 8 modules, 141 tests, 4 commits. Each session picked up from the previous one’s filesystem state. The fresh context meant each PRD got the agent’s full attention without carrying conversation history from earlier PRDs.[5]
Blog Quality System (12 Modules)
The blog linter started as a 3-module script and grew to 12 modules through iterative Ralph loops. Each iteration added a module, ran the full test suite, and verified zero regressions. The completion criteria evolved:
- Iteration 1: “All 77 tests pass”
- Iteration 5: “All 77 tests pass AND linter reports 0 errors on all 33 posts”
- Iteration 8: “All tests pass AND 0 errors AND 0 warnings AND depth score ≥2 on all posts”
The Failures and Lessons
Failure 1: The Spawn Budget Disaster
Early in the deliberation build, I ran a session without spawn budget limits. The agent spawned 3 exploration subagents. Each subagent spawned its own subagents. Within minutes, the recursion-guard hook was intercepting dozens of spawn attempts. The session consumed API tokens at 10x the normal rate before I killed it manually.[6]
The fix: I added a spawn budget model to recursion-limits.json. Agents inherit a budget from their parent rather than incrementing depth. A root agent with budget=12 can spawn up to 12 agents total across all levels of recursion. This was a key architectural insight: budget inheritance prevents exponential growth while still allowing deep agent chains. The spawn budget maps directly to the separation rule in Reynolds’ boids algorithm — the constraint that prevents agents from crowding onto the same sub-problem.
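A hedged sketch of what that check looks like follows; the real recursion-guard.sh is more involved, and the spawned_total field name is an assumption for illustration. The key property: every spawn attempt, at any depth, debits the same shared pool recorded in the state file.
#!/bin/bash
# Budget-inheritance sketch: the total number of agents is bounded by one shared
# counter, no matter how deep the parent-child chain goes.
STATE_FILE="${HOME}/.claude/state/recursion-depth.json"
MAX_AGENTS=12   # e.g. DELIB_MAX_AGENTS from recursion-limits.json

spawned=$(jq -r '.spawned_total // 0' "$STATE_FILE")   # assumed field name

if (( spawned >= MAX_AGENTS )); then
    echo "Spawn denied: shared budget of $MAX_AGENTS agents is exhausted." >&2
    exit 2   # blocking: the spawn attempt is rejected
fi

# Record the new agent against the shared pool before allowing the spawn.
jq '.spawned_total = ((.spawned_total // 0) + 1)' "$STATE_FILE" \
    > "${STATE_FILE}.tmp" && mv "${STATE_FILE}.tmp" "$STATE_FILE"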
Failure 2: Criteria That Pass Trivially
An early task asked the agent to “write tests that pass.” The agent wrote minimal tests: assert True, assert 1 == 1. Technically, the criteria were met. The output was worthless.
The fix: Criteria must specify both quantity and quality:
| Criteria Quality | Example | Outcome |
|---|---|---|
| Vague | “Tests pass” | Agent writes trivial tests |
| Measurable but incomplete | “Tests pass AND coverage >80%” | Agent writes tests that cover lines but test nothing meaningful |
| Comprehensive | “All tests pass AND coverage >80% AND no type errors AND linter clean AND each test class tests a distinct module” | Production-quality output |
Failure 3: Filesystem Pollution
Iterations that explored dead-end approaches left artifacts: partially implemented features, deprecated files, conflicting configurations. Iteration 5 might build on a half-finished approach from iteration 3 that was abandoned in iteration 4.
The fix: I added a cleanup step to the stop hook criteria: “No files exist that aren’t referenced by imports or tests.” This forces the agent to clean up dead ends before the iteration can complete.
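A sketch of what that criterion can look like as a check (not my actual hook): any Python module under lib/ that no other module or test imports fails the gate.
#!/bin/bash
# Orphan-file check sketch: flag modules that nothing imports.
set -u

status=0

while IFS= read -r f; do
    module=$(basename "$f" .py)

    # Look for an import of this module in any file other than the module itself.
    if ! grep -rlE --include='*.py' "^(from|import)[[:space:]].*\b${module}\b" lib/ tests/ \
            | grep -qv "$f"; then
        echo "ORPHAN: $f is not imported by any module or test" >&2
        status=1
    fi
done < <(find lib -name '*.py' -not -name '__init__.py')

exit "$status"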
Failure 4: The ((VAR++)) Incident
During bash integration testing, the recursion-guard hook crashed silently on the first iteration. The bug: ((VAR++)) returns exit code 1 when VAR is 0, because the post-increment expression evaluates to the old value, 0, and bash treats a zero arithmetic result as failure. With set -e enabled, this killed the script.
The fix was VAR=$((VAR + 1)) instead of ((VAR++)). This bash gotcha, documented in my MEMORY.md, has prevented the same bug in 6 subsequent hook scripts.[7]
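A minimal reproduction, for anyone who wants to watch it fail:
#!/bin/bash
set -e

VAR=0
# ((VAR++)) evaluates to the OLD value of VAR, which is 0; an arithmetic command
# whose result is 0 returns exit status 1, so set -e aborts the script right here.
# ((VAR++))            # uncommenting this line kills the script silently

VAR=$((VAR + 1))        # safe: the assignment itself always returns status 0
echo "VAR is now $VAR"  # prints: VAR is now 1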
When Ralph Works and When It Doesn’t
Strong Fit
- Greenfield implementation with clear specifications (new API, new module, new test suite)
- Automated verification exists (tests, type checkers, linters, compilation)
- Bounded scope that can be described in a single task file
Weak Fit
- Subjective quality (“make the UI look good”) has no machine-verifiable criteria
- Exploratory work where the direction depends on intermediate discoveries
- Large-scale refactoring that requires understanding global code relationships across dozens of files
Key Takeaways
For developers building autonomous agent systems:
- Invest in machine-verifiable completion criteria before starting autonomous loops; my deliberation system succeeded because each PRD had testable success criteria (141 tests total)
- Implement spawn budgets from day one; the budget inheritance model (not depth incrementing) prevents exponential agent spawning while allowing deep chains
- Add filesystem cleanup to completion criteria; dead-end artifacts from abandoned iterations will pollute subsequent iterations

For engineering teams evaluating autonomous AI development:
- The Ralph architecture trades human implementation time for human specification time; the ROI depends on whether your bottleneck is implementation capacity or specification clarity
- Audit autonomous output with the same rigor applied to code from external contractors; my 141 tests exist because I learned that meeting completion criteria does not guarantee production readiness
References
1. Author’s implementation of the Ralph Loop pattern using Claude Code hooks. Deliberation infrastructure: 3,455 lines of Python, 8 modules, 141 tests across 4 commits (2025-2026).
2. Anthropic, “Claude Models,” 2025.
3. Liu, Nelson F., et al., “Lost in the Middle: How Language Models Use Long Contexts,” TACL, 2024.
4. Anthropic, “Claude Code Documentation,” 2025. Hook lifecycle events.
5. Author’s git log. Commits 3cad08c, 10df724, fbf1a0d, 32bd711, spanning the full deliberation infrastructure build.
6. Author’s experience with unbudgeted agent spawning. Documented in ~/.claude/projects/*/memory/MEMORY.md error entries.
7. Author’s bash debugging. ((VAR++)) exit code behavior with set -e documented as a cross-session learning in MEMORY.md.