The Ralph Loop: How I Run Autonomous AI Agents Overnight
I built an autonomous agent system that uses stop hooks to intercept exit attempts, filesystem memory to persist state across context windows, and spawn budgets to prevent runaway recursion. The system shipped my 9-PRD deliberation infrastructure (3,455 lines of Python, 141 tests) across multiple overnight sessions.[1]
TL;DR
The Ralph architecture enables long-running autonomous AI development by solving three problems simultaneously: context window exhaustion (solved by fresh context per iteration), state persistence (solved by filesystem as memory), and task continuity (solved by stop-hook loops that prevent the agent from terminating before completion). I implemented this pattern in my Claude Code hook system and used it to build my multi-agent deliberation infrastructure. The system works, but it taught me hard lessons about spawn budgets, criteria quality, and filesystem pollution.
The Context Window Problem
Every AI conversation operates within a context window: a fixed-size buffer holding conversation history, system prompts, tool outputs, and working memory. Claude’s context window holds approximately 200,000 tokens. A complex development session can exhaust this within 30-60 minutes of intensive work.[2]
I measured this directly during the deliberation system build. Sessions that started with precise multi-file edits across 8 Python modules degraded into single-file tunnel vision by the 90-minute mark. The agent stopped referencing architecture it had read earlier because that context had been compressed away.[3]
The Ralph pattern sidesteps this by giving every iteration a full, fresh context window:
Iteration 1: [200K tokens] → writes code, creates files
Iteration 2: [200K tokens] → reads files from disk, continues
Iteration 3: [200K tokens] → reads updated files, continues
...
Iteration N: [200K tokens] → reads final state, verifies criteria
Compare with a single long session, where the degradation follows a predictable curve:
Minute 0: [200K tokens available] → productive
Minute 30: [150K tokens available] → somewhat productive
Minute 60: [100K tokens available] → degraded
Minute 90: [50K tokens available] → significantly degraded
Minute 120: [compressed, lossy] → errors accumulate
The fresh-context-per-iteration agent outperforms the continuous agent because each iteration allocates full cognitive resources to the current state rather than carrying the weight of previous reasoning.
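In its simplest form, the outer loop is nothing more than a script that re-invokes the agent with a fresh context until a criteria check passes. The sketch below is illustrative rather than my exact driver (my setup routes completion through the stop hook described in the next section); task.md and check-criteria.sh are placeholder names, and the headless claude -p invocation is just one way to launch an iteration.
#!/bin/bash
# Minimal sketch of the outer Ralph loop. Assumes task.md holds the task spec
# and check-criteria.sh exits 0 only when every completion criterion passes.
set -uo pipefail

MAX_ITERATIONS=10
TASK_FILE="task.md"

for i in $(seq 1 "$MAX_ITERATIONS"); do
    echo "=== Ralph iteration $i (fresh context) ==="

    # Each invocation is a brand-new session: no conversation history,
    # only the task file and whatever earlier iterations left on disk.
    claude -p "$(cat "$TASK_FILE")" || echo "iteration $i exited non-zero; state persists on disk"

    # Completion is decided by machine-verifiable checks, never by the agent itself.
    if ./check-criteria.sh; then
        echo "Completion criteria met after $i iteration(s)."
        exit 0
    fi
done

echo "Iteration budget ($MAX_ITERATIONS) exhausted without meeting criteria." >&2
exit 1
Each iteration spends its full window on the current state of the repository, which is exactly the behavior the comparison above describes.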
My Implementation
The Stop Hook
My recursion-guard system intercepts agent stop attempts and checks completion criteria:
#!/bin/bash
# From recursion-guard.sh - simplified

CONFIG_FILE="${HOME}/.claude/configs/recursion-limits.json"
STATE_FILE="${HOME}/.claude/state/recursion-depth.json"

# Safe defaults with config override
MAX_DEPTH=2
MAX_CHILDREN=5
DELIB_SPAWN_BUDGET=2
DELIB_MAX_AGENTS=12

# Load config with validation
load_config() {
    if [[ -f "$CONFIG_FILE" ]] && command -v jq &>/dev/null; then
        config_depth=$(jq -r '.max_depth // 2' "$CONFIG_FILE")
        if [[ "$config_depth" =~ ^[0-9]+$ ]] && [[ "$config_depth" -gt 0 ]]; then
            MAX_DEPTH="$config_depth"
        fi
    fi
}
The hook reads a criteria file that defines success conditions. The conditions must be machine-verifiable: test pass/fail, linter output, HTTP status codes, file existence checks.[4]
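As a concrete illustration, here is a minimal sketch of such a criteria script. It is not my production hook, and the specific commands (pytest, mypy, a linter) are stand-ins for whatever the task requires; the point is that every criterion reduces to a command with an unambiguous exit code.
#!/bin/bash
# check-criteria.sh - illustrative sketch of machine-verifiable completion criteria.
# Each check is a command whose exit code decides pass/fail.
set -uo pipefail

fail=0

check() {
    local description="$1"; shift
    if "$@" > /dev/null 2>&1; then
        echo "PASS: $description"
    else
        echo "FAIL: $description"
        fail=1
    fi
}

check "unit tests pass"             python -m pytest -q
check "no Python type errors"       python -m mypy lib/
check "linter reports no errors"    python -m ruff check .
check "recursion state file exists" test -f "${HOME}/.claude/state/recursion-depth.json"

exit "$fail"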
Filesystem as Persistent Memory
The critical insight: files persist across context windows. My .claude/ directory serves as the agent’s persistent memory — a distributed context architecture that grew from this initial filesystem pattern into a 7-layer, 650-file system:
| Directory | Contents | Role in Ralph Loop |
|---|---|---|
| state/ | recursion-depth.json, agent-lineage.json | Tracks iteration count, parent-child relationships |
| configs/ | 14 JSON files | Encode thresholds, budgets, rules (not hardcoded) |
| handoffs/ | 49 context documents | Preserve multi-session architectural decisions |
| hooks/ | 95 lifecycle handlers | Enforce quality gates between iterations |
Each new iteration reads the current state from disk and continues from where the previous iteration left off. The session-start hook initializes clean state:
# From session-start.sh - recursion state initialization
RECURSION_STATE_FILE="$RECURSION_STATE_DIR/recursion-depth.json"

# Initialize with safe defaults
cat > "$RECURSION_STATE_FILE" <<'EOF'
{
  "depth": 0,
  "agent_id": "root",
  "parent_id": null,
  "initialized_by": "session-start"
}
EOF
If state gets corrupted (which happened twice during development), the recovery pattern recreates from safe defaults rather than crashing:
if ! jq -e '.depth' "$RECURSION_STATE_FILE" &>/dev/null; then
    # Corrupted state file: recreate with the same safe defaults as above
    echo '{"depth": 0, "agent_id": "root", "parent_id": null, "initialized_by": "session-start"}' \
        > "$RECURSION_STATE_FILE"
    echo "- Recursion state recovered (was corrupted)"
fi
Task Specification Format
Effective Ralph tasks include three elements: objective, completion criteria, and context pointers:
OBJECTIVE: Implement multi-agent deliberation with consensus validation.
COMPLETION CRITERIA:
- All tests in tests/test_deliberation_lib.py pass (81 tests)
- post-deliberation.sh validates consensus above 70% threshold
- recursion-guard.sh enforces spawn budget (max 12 agents)
- deliberation-pride-check.sh passes 5 quality checks
- No Python type errors (mypy clean)
CONTEXT:
- Follow patterns in lib/deliberation/state_machine.py
- Consensus thresholds in configs/deliberation-config.json
- Spawn budget model: agents inherit budget, not increment depth
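To close the loop, here is a simplified sketch of the stop-hook side, not the full recursion-guard logic: if a criteria script like the one sketched earlier still fails, the hook refuses to let the agent stop and hands the failing checks back as the next instructions. In Claude Code, exit code 2 from a Stop hook blocks termination and feeds stderr back to the agent.
#!/bin/bash
# Sketch of a criteria-enforcing Stop hook. CRITERIA_SCRIPT is a placeholder
# for a machine-verifiable check like the one sketched earlier.
CRITERIA_SCRIPT="${HOME}/.claude/hooks/check-criteria.sh"
REPORT="$(mktemp)"

if [[ -x "$CRITERIA_SCRIPT" ]] && ! "$CRITERIA_SCRIPT" > "$REPORT" 2>&1; then
    # Block the stop: the failing criteria become the prompt for the next round of work.
    cat "$REPORT" >&2
    echo "Completion criteria not met; keep working on the FAIL items above." >&2
    exit 2
fi

exit 0   # all criteria pass: allow the agent to stop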
What I Built With This Pattern
The Deliberation Infrastructure (9 PRDs)
The largest Ralph loop project: 9 Product Requirement Documents implemented across multiple sessions.
| PRD | Deliverable | Commit |
|---|---|---|
| PRDs 1-4 | Hooks, config, recursion-guard extensions | Committed as 3cad08c |
| PRD-5 | 48 bash integration tests (7 suites) | Committed as 10df724 |
| PRDs 7-8 | Hook wiring + 81 Python unit tests | Committed as fbf1a0d |
| PRD-9 | 12 E2E pipeline simulation tests | Committed as 32bd711 |
Total output: 3,455 lines of Python across 8 modules, 141 tests, 4 commits. Each session picked up from the previous one’s filesystem state. The fresh context meant each PRD got the agent’s full attention without carrying conversation history from earlier PRDs.[5]
Blog Quality System (12 Modules)
The blog linter started as a 3-module script and grew to 12 modules through iterative Ralph loops. Each iteration added a module, ran the full test suite, and verified zero regressions. The completion criteria evolved:
- Iteration 1: “All 77 tests pass”
- Iteration 5: “All 77 tests pass AND linter reports 0 errors on all 33 posts”
- Iteration 8: “All tests pass AND 0 errors AND 0 warnings AND depth score ≥2 on all posts”
The Failures and Lessons
Failure 1: The Spawn Budget Disaster
Early in the deliberation build, I ran a session without spawn budget limits. The agent spawned 3 exploration subagents. Each subagent spawned its own subagents. Within minutes, the recursion-guard hook was intercepting dozens of spawn attempts. The session consumed API tokens at 10x the normal rate before I killed it manually.[6]
The fix: I added a spawn budget model to recursion-limits.json. Agents inherit a budget from their parent rather than incrementing depth. A root agent with budget=12 can spawn up to 12 agents total across all levels of recursion. This was a key architectural insight: budget inheritance prevents exponential growth while still allowing deep agent chains. The spawn budget maps directly to the separation rule in Reynolds’ boids algorithm — the constraint that prevents agents from crowding onto the same sub-problem.
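A hedged sketch of what that check looks like follows; the real recursion-guard.sh is more involved, and the spawned_total field name is an assumption for illustration. The key property: every spawn attempt, at any depth, debits the same shared pool recorded in the state file.
#!/bin/bash
# Budget-inheritance sketch: the total number of agents is bounded by one shared
# counter, no matter how deep the parent-child chain goes.
STATE_FILE="${HOME}/.claude/state/recursion-depth.json"
MAX_AGENTS=12   # e.g. DELIB_MAX_AGENTS from recursion-limits.json

spawned=$(jq -r '.spawned_total // 0' "$STATE_FILE")   # assumed field name

if (( spawned >= MAX_AGENTS )); then
    echo "Spawn denied: shared budget of $MAX_AGENTS agents is exhausted." >&2
    exit 2   # blocking: the spawn attempt is rejected
fi

# Record the new agent against the shared pool before allowing the spawn.
jq '.spawned_total = ((.spawned_total // 0) + 1)' "$STATE_FILE" \
    > "${STATE_FILE}.tmp" && mv "${STATE_FILE}.tmp" "$STATE_FILE"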
Failure 2: Criteria That Pass Trivially
An early task asked the agent to “write tests that pass.” The agent wrote minimal tests: assert True, assert 1 == 1. Technically, the criteria were met. The output was worthless.
The fix: Criteria must specify both quantity and quality:
| Criteria Quality | Example | Outcome |
|---|---|---|
| Vague | “Tests pass” | Agent writes trivial tests |
| Measurable but incomplete | “Tests pass AND coverage >80%” | Agent writes tests that cover lines but test nothing meaningful |
| Comprehensive | “All tests pass AND coverage >80% AND no type errors AND linter clean AND each test class tests a distinct module” | Production-quality output |
Failure 3: Filesystem Pollution
Iterations that explored dead-end approaches left artifacts: partially implemented features, deprecated files, conflicting configurations. Iteration 5 might build on a half-finished approach from iteration 3 that was abandoned in iteration 4.
The fix: I added a cleanup step to the stop hook criteria: “No files exist that aren’t referenced by imports or tests.” This forces the agent to clean up dead ends before the iteration can complete.
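A sketch of what that criterion can look like as a check (not my actual hook): any Python module under lib/ that no other module or test imports fails the gate.
#!/bin/bash
# Orphan-file check sketch: flag modules that nothing imports.
set -u

status=0

while IFS= read -r f; do
    module=$(basename "$f" .py)

    # Look for an import of this module in any file other than the module itself.
    if ! grep -rlE --include='*.py' "^(from|import)[[:space:]].*\b${module}\b" lib/ tests/ \
            | grep -qv "$f"; then
        echo "ORPHAN: $f is not imported by any module or test" >&2
        status=1
    fi
done < <(find lib -name '*.py' -not -name '__init__.py')

exit "$status"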
Failure 4: The ((VAR++)) Incident
During bash integration testing, the recursion-guard hook crashed silently on the first iteration. The bug: ((VAR++)) returns exit code 1 when VAR is 0, because the post-increment expression evaluates to the old value, 0, and bash treats a zero arithmetic result as failure. With set -e enabled, this killed the script.
The fix was VAR=$((VAR + 1)) instead of ((VAR++)). This bash gotcha, documented in my MEMORY.md, has prevented the same bug in 6 subsequent hook scripts.[7]
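A minimal reproduction, for anyone who wants to watch it fail:
#!/bin/bash
set -e

VAR=0
# ((VAR++)) evaluates to the OLD value of VAR, which is 0; an arithmetic command
# whose result is 0 returns exit status 1, so set -e aborts the script right here.
# ((VAR++))            # uncommenting this line kills the script silently

VAR=$((VAR + 1))        # safe: the assignment itself always returns status 0
echo "VAR is now $VAR"  # prints: VAR is now 1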
When Ralph Works and When It Doesn’t
Strong Fit
- Greenfield implementation with clear specifications (new API, new module, new test suite)
- Automated verification exists (tests, type checkers, linters, compilation)
- Bounded scope that can be described in a single task file
Weak Fit
- Subjective quality (“make the UI look good”) has no machine-verifiable criteria
- Exploratory work where the direction depends on intermediate discoveries
- Large-scale refactoring that requires understanding global code relationships across dozens of files
Key Takeaways
For developers building autonomous agent systems:
- Invest in machine-verifiable completion criteria before starting autonomous loops; my deliberation system succeeded because each PRD had testable success criteria (141 tests total)
- Implement spawn budgets from day one; the budget inheritance model (not depth incrementing) prevents exponential agent spawning while allowing deep chains
- Add filesystem cleanup to completion criteria; dead-end artifacts from abandoned iterations will pollute subsequent iterations

For engineering teams evaluating autonomous AI development:
- The Ralph architecture trades human implementation time for human specification time; the ROI depends on whether your bottleneck is implementation capacity or specification clarity
- Audit autonomous output with the same rigor applied to code from external contractors; my 141 tests exist because I learned that meeting completion criteria does not guarantee production readiness
References
1. Author’s implementation of the Ralph Loop pattern using Claude Code hooks. Deliberation infrastructure: 3,455 lines of Python, 8 modules, 141 tests across 4 commits (2025-2026).
2. Anthropic, “Claude Models,” 2025.
3. Liu, Nelson F., et al., “Lost in the Middle: How Language Models Use Long Contexts,” TACL, 2024.
4. Anthropic, “Claude Code Documentation,” 2025. Hook lifecycle events.
5. Author’s git log. Commits 3cad08c, 10df724, fbf1a0d, 32bd711, spanning the full deliberation infrastructure build.
6. Author’s experience with unbudgeted agent spawning. Documented in ~/.claude/projects/*/memory/MEMORY.md error entries.
7. Author’s bash debugging. ((VAR++)) exit code behavior with set -e documented as a cross-session learning in MEMORY.md.