The Agent Operator's Handbook: Supervising What You Can't See
Operating autonomous AI agents is a new discipline: not engineering, not management, not operations, but a hybrid requiring all three. The operator role emerges when agents run long enough that supervision becomes the bottleneck, not code generation. Five responsibilities define the role. A supervision stack implements them. An intervention framework decides when to act.
Nobody trained for the job. No university department teaches it. No job posting describes it accurately. One month you write Python. The next month you manage an autonomous system that writes Python, calls APIs, modifies your filesystem, and makes architectural decisions while you sleep. The Ralph loop created the role in my infrastructure: a shell script that restarts Claude Code with fresh context, reads filesystem state, and continues work across overnight sessions. Every team running agents autonomously has discovered the same role independently, because the same problems surface whenever an agent operates longer than a single interactive session.
The role has no established name. Some teams call it “AI ops.” Others fold it into platform engineering. A few assign it to engineering managers who have never written a hook. The ambiguity matters because misidentifying the role leads to misallocating the work. An engineering manager without systems knowledge cannot debug a corrupted agent state. A platform engineer without product judgment cannot evaluate whether an agent’s output meets the spec’s intent. The operator role requires both: specification decisions (what the agent should build, what constraints to impose) and operational execution (monitoring sessions, recovering from failures, maintaining infrastructure).
Five Operator Responsibilities
1. Delegation
Delegation means writing specifications that constrain agent behavior before execution begins. The quality of delegation determines the quality of autonomous output more than any other factor.
A CLAUDE.md file is a delegation artifact. It encodes project conventions, forbidden patterns, required behaviors, and quality standards into a document the agent reads at session start.1 A PRD is a delegation artifact. It specifies acceptance criteria that the agent verifies against before reporting completion. A task description is a delegation artifact. The specificity of the task description bounds the agent’s decision space.
Poor delegation produces the Shortcut Spiral: the agent skips steps because the specification did not enumerate them as mandatory. Good delegation makes the required steps explicit. My PRDs include numbered acceptance criteria, and each criterion maps to an observable artifact (a file path, a test result, a specific behavior). The agent cannot mark a criterion as complete without producing the artifact. Delegation that specifies observable outcomes eliminates an entire class of phantom completions.
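The criterion-to-artifact mapping can be checked mechanically. A minimal sketch, assuming criteria are recorded as description-to-artifact-path pairs; the criteria and file paths below are hypothetical placeholders, not a real project's:

```python
from pathlib import Path

# Hypothetical acceptance criteria: each maps to an observable artifact.
CRITERIA = {
    "CLI accepts --dry-run flag": "tests/results/test_dry_run.passed",
    "Config schema documented": "docs/config-schema.md",
}

def unverified_criteria(criteria, root="."):
    """Return criteria whose artifact does not exist on disk.

    A criterion without its artifact cannot be marked complete,
    regardless of what the completion report claims.
    """
    return [desc for desc, artifact in criteria.items()
            if not (Path(root) / artifact).is_file()]
```

A check like this runs after the session, independent of the agent's own report: the artifact either exists or it does not.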
The skill is knowing what to specify and what to leave open. Over-specification produces brittle agents that cannot adapt when they encounter unexpected code. Under-specification produces agents that make architectural decisions you did not authorize. The boundary moves with trust: a well-tested agent with strong hooks earns wider latitude than a new configuration running its first overnight session.
2. Supervision
Supervision means monitoring active sessions, reviewing diffs, and catching drift before it compounds.
Drift is the central risk. An agent starts aligned with the spec, makes a reasonable micro-decision that deviates slightly, then makes subsequent decisions that build on the deviation. By iteration eight, the agent is solving a different problem than the one you delegated. Each individual decision looked reasonable in isolation. The cumulative trajectory missed the target.
I catch drift through two mechanisms. First, hooks enforce hard boundaries: blocked commands, required patterns, forbidden file modifications. Hooks catch violations in real time, before the agent proceeds. Second, periodic log review catches soft drift that no hook can detect: the agent choosing an unnecessarily complex approach, or building a feature the spec did not request, or optimizing a code path that was not the bottleneck. Soft drift requires human judgment because no automated check can determine whether the agent’s trajectory matches the operator’s intent.
Supervision scales poorly with agent count. One agent producing one session per night is reviewable over morning coffee. Five agents producing eight iterations each generate forty context windows of work. Prioritization becomes mandatory: review failures first, then sessions that touched critical paths, then clean completions on low-risk tasks.
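That prioritization is simple enough to encode. A sketch, assuming each session record carries a failure flag and a critical-path flag (the session schema here is invented for illustration):

```python
# Hypothetical session records from an overnight run.
SESSIONS = [
    {"id": "a1", "failed": False, "touched_critical_path": False},
    {"id": "b2", "failed": True,  "touched_critical_path": False},
    {"id": "c3", "failed": False, "touched_critical_path": True},
]

def review_order(sessions):
    """Order sessions for morning review: failures first, then
    critical-path work, then clean completions on low-risk tasks."""
    def priority(s):
        if s["failed"]:
            return 0
        if s["touched_critical_path"]:
            return 1
        return 2
    return [s["id"] for s in sorted(sessions, key=priority)]
```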
3. Intervention
Intervention means knowing when to stop, redirect, or restart an agent mid-task.
Four patterns demand intervention:
The agent is stuck in a loop. The same error appears across consecutive iterations. The agent tries the same fix with minor variations. Each iteration consumes a full context window and produces no forward progress. Intervention: stop the session, diagnose the root cause manually, update the handoff document with the diagnosis, restart.
The agent produced incorrect output that passes tests. The code compiles, the tests are green, but the behavior does not match the spec’s intent. The evidence gate catches some instances, but an agent can produce a plausible justification for wrong behavior. Intervention: write a failing test that captures the correct behavior, then restart.
The agent is about to touch production or external systems. Any operation with irreversible consequences (deploying to production, sending emails, modifying a database, calling a paid API) requires a gate. My hooks block destructive bash commands and external network calls. The operator decides which gates to open and when.2
The agent is making progress but in the wrong direction. The work is competent but misaligned. Intervention: stop, clarify the spec in the handoff document, restart. Do not attempt to redirect mid-session through conversation. The agent has already built mental models around the wrong interpretation, and mid-course correction in the same context window produces inconsistent output.
The pattern you do not intervene on: the agent making progress slowly toward the correct goal. Let it run.
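The first pattern, the repeated-error loop, is the most mechanically detectable. A minimal sketch, assuming each iteration's dominant error has already been reduced to a signature string (that normalization is the hard part and is elided here):

```python
def is_stuck(error_signatures, window=3):
    """Flag a session as stuck when the same error signature appears in
    each of the last `window` iterations. One repeat may be noise;
    three consecutive repeats mean the agent lacks the information
    to resolve the error, and more iterations will not help."""
    if len(error_signatures) < window:
        return False
    recent = error_signatures[-window:]
    return len(set(recent)) == 1
```

A check like this can run between iterations and halt the loop automatically, turning the "same error across 3+ iterations" rule into infrastructure rather than an operator's memory.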
4. Recovery
Recovery means handling failures after they occur: corrupted state, wrong branches, broken builds, and data loss.
Agent failures leave artifacts. A crashed session may have written partial files, committed to the wrong branch, left temporary files in the working directory, or modified configuration that subsequent sessions inherit. Recovery requires reversing these artifacts before restarting.
My recovery protocol: inventory the damage (git status, git log, git diff), preserve the session log as diagnostic data, revert to the last verified-good commit, update the handoff document with what failed and why, then restart with corrected constraints. Do not attempt to salvage partial work from a failed session unless the partial work is clearly correct and isolatable. The handoff carries failure context across session boundaries so the next agent does not repeat the same mistakes.
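The protocol can be expressed as a dry-run plan the operator inspects before executing anything destructive. A sketch; the commit hash, log path, and `postmortem/` directory are placeholders:

```python
def recovery_plan(last_good_commit, log_path):
    """Return the ordered recovery commands as a dry-run list, so the
    operator reviews the plan before anything irreversible runs.
    Steps follow the protocol: inventory the damage, preserve the
    session log, revert to the verified-good commit, clean up."""
    return [
        "git status",
        "git log --oneline -20",
        "git diff HEAD",
        f"cp {log_path} ./postmortem/",          # preserve diagnostics
        f"git reset --hard {last_good_commit}",  # revert to verified-good
        "git clean -fd",                          # drop stray temp files
    ]
```

Keeping the plan as data rather than executing it inline mirrors the governance principle above: irreversible operations get a human gate.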
The most dangerous recovery scenario is a failure that looks like success. The agent reports completion, the tests pass, the build is green, but the implementation is subtly wrong. The Confidence Mirage failure mode produces exactly this situation. Recovery requires reading the code, not just the completion report.
5. Governance
Governance means setting policies, budgets, permissions, and audit requirements that apply across all agent sessions.
Policies define what agents may and may not do. My governance layer includes: a spawn budget (maximum iterations per overnight run), a cost ceiling (maximum API spend per session), an allowlist of permitted bash commands, a blocklist of forbidden file patterns, and a set of required completion criteria.3 Each policy traces to a specific failure: the spawn budget exists because an early session ran 47 iterations without converging; the cost ceiling exists because a debugging session burned $200 in API calls chasing a red herring. Every policy is a scar from a lesson learned the expensive way.
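A governance layer like this fits in one small structure that every session check consults. A sketch with illustrative numbers, not recommendations; the real budgets, ceiling, and allowlist are whatever your own failures have taught you:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GovernancePolicy:
    """One place for the budgets and lists that apply to every session."""
    spawn_budget: int = 12            # max iterations per overnight run
    cost_ceiling_usd: float = 50.0    # max API spend per session
    allowed_commands: frozenset = frozenset({"git", "pytest", "ls", "cat"})

    def permits(self, iteration, spend_usd, command):
        """True only if the session is inside every budget and the
        command's executable is on the allowlist."""
        return (iteration <= self.spawn_budget
                and spend_usd <= self.cost_ceiling_usd
                and command.split()[0] in self.allowed_commands)
```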
Permissions follow the principle of least privilege. An agent that generates blog content does not need filesystem write access outside the content directory. An agent that runs tests does not need network access. My hooks enforce these boundaries at the tool-call level, blocking operations that exceed the session’s permission scope.2
Audit requirements complete the governance layer. Every session produces a structured log: commands executed, files modified, tests run, completion criteria evaluated. The seven failure modes taxonomy emerged from reviewing six months of these logs and categorizing every failure that required human intervention.
The Supervision Stack
Five infrastructure components implement the five responsibilities.
Hooks implement automated supervision. Claude Code’s lifecycle events (PreToolUse, PostToolUse, Notification) trigger shell scripts that enforce policies in real time.2 A hook that blocks rm -rf is a governance policy encoded as a PreToolUse check. A hook that requires test execution before completion is a delegation constraint encoded as a PostToolUse check. The 95 hooks in my system encode 95 decisions about what agents may and may not do, each tracing to a specific failure that the hook now prevents.
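A sketch of such a PreToolUse check, assuming the documented hook contract in which the hook reads the tool call as a JSON event on stdin and a nonzero exit (exit code 2, with the message on stderr) blocks the call; the forbidden patterns are illustrative:

```python
import re

# Patterns no agent session may execute, regardless of task. Illustrative.
FORBIDDEN = [r"\brm\s+-rf\b", r"\bcurl\b", r"\bgit\s+push\s+--force\b"]

def check_bash(tool_input):
    """Return a block reason if the command matches a forbidden pattern."""
    command = tool_input.get("command", "")
    for pattern in FORBIDDEN:
        if re.search(pattern, command):
            return f"blocked: command matches forbidden pattern {pattern}"
    return None

def handle_event(event):
    """Map a PreToolUse event to (exit_code, stderr_message).
    In the actual hook script, the caller would do:
        event = json.load(sys.stdin)
        code, msg = handle_event(event)
        print(msg, file=sys.stderr); sys.exit(code)
    so that exit code 2 blocks the call and the message reaches the agent."""
    if event.get("tool_name") == "Bash":
        reason = check_bash(event.get("tool_input", {}))
        if reason:
            return 2, reason
    return 0, ""
```

Keeping the decision logic in a pure function makes each hook testable on its own, which matters when dozens of them gate every tool call.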
The evidence gate implements structured verification. Six criteria (follows patterns, simplest solution, edge cases handled, tests pass, no regressions, solves the problem) must produce specific artifacts before the agent marks work complete.4 The gate translates supervision from “did the agent do a good job?” (subjective, unverifiable) to “did the agent produce evidence for all six criteria?” (objective, auditable). Every hedging word in a completion report triggers re-verification.
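The gate's logic reduces to a small check over the six criteria. A sketch, assuming the agent's evidence arrives as criterion-to-artifact-description pairs; the hedge list is illustrative:

```python
HEDGES = ("should work", "i believe", "probably", "likely", "seems to")

GATE = ["follows patterns", "simplest solution", "edge cases handled",
        "tests pass", "no regressions", "solves the problem"]

def needs_reverification(evidence):
    """Return criteria that fail the gate: evidence missing entirely,
    or phrased as hedging rather than a concrete artifact."""
    failures = []
    for criterion in GATE:
        artifact = evidence.get(criterion, "").strip().lower()
        if not artifact or any(h in artifact for h in HEDGES):
            failures.append(criterion)
    return failures
```

An empty return list means every criterion has non-hedged evidence; anything else goes back to the agent for re-verification.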
The quality loop implements iterative refinement. Seven steps (implement, review, evaluate, refine, zoom out, repeat, report) force the agent through multiple passes over its own work.5 The loop compensates for a structural limitation of single-pass generation: models produce plausible first drafts that contain errors visible only on re-reading. The quality loop mandates the re-reading.
Session logs implement post-hoc audit. The system captures every tool call, file modification, and completion report in structured form. Six months of session logs produced the failure taxonomy. Without the logs, each failure would have remained an isolated anecdote.
Cost gates implement budget enforcement. Spawn budgets cap iteration count. API cost ceilings cap token spend. An agent that has not converged within the spawn budget is probably stuck, and more iterations will not help. The budget forces the operator to diagnose and intervene rather than hoping the next iteration will fix the problem.
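Both gates fit naturally in the driver loop itself. A minimal sketch, assuming each iteration can report whether it converged and what it cost; the budget numbers are placeholders:

```python
def run_with_budget(step, spawn_budget=10, cost_ceiling_usd=50.0):
    """Drive iterations until convergence or budget exhaustion.
    `step` is a callable taking the iteration number and returning
    (done, cost_usd). Exhausting a budget is a signal to diagnose
    and intervene, not to raise the cap and hope."""
    spent = 0.0
    for iteration in range(1, spawn_budget + 1):
        done, cost = step(iteration)
        spent += cost
        if done:
            return ("converged", iteration, spent)
        if spent >= cost_ceiling_usd:
            return ("cost_ceiling", iteration, spent)
    return ("spawn_budget", spawn_budget, spent)
```

The return value tells the operator which gate fired, which is exactly the diagnostic the morning review starts from.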
When to Intervene vs. When to Let It Run
The intervention decision is the operator’s most consequential judgment call. Intervening too early wastes agent work. Intervening too late allows drift to compound. A framework helps.
| Signal | Action | Reasoning |
|---|---|---|
| Same error across 3+ iterations | Intervene | The agent lacks information to resolve the error. More iterations will not help. |
| Slow but measurable progress toward the correct goal | Let it run | Speed is not the variable. Correctness is. |
| Output passes tests but does not match spec intent | Intervene | The hardest case. Write a test that captures the correct behavior, then restart. |
| Agent is about to call an external API or modify production | Gate | Irreversible operations require explicit approval regardless of confidence. |
| Agent requests a permission it should not need | Intervene | Permission requests outside the expected scope indicate the agent has drifted from the task. |
| Completion report uses hedging language | Re-verify | “Should work” and “I believe” are not evidence. Demand artifacts. |
| Agent is building infrastructure not in the spec | Evaluate | Sometimes necessary preparation. Often Tunnel Vision. Check whether the infrastructure serves the goal or delays it. |
The meta-principle: intervene on information asymmetry, not on speed. When you know something the agent does not (the correct code path, the real requirement, the failure mode of a previous session), intervention transfers that knowledge. When the agent knows everything you know and is simply working through the problem, let it work.
The Operator’s Checklist
Before Starting
- [ ] Spec reviewed: acceptance criteria are specific, observable, and complete
- [ ] Hooks active: policy hooks are enabled and tested for the task type
- [ ] Budget set: spawn limit and cost ceiling are configured
- [ ] Sandbox confirmed: the agent cannot reach production, send external requests, or modify files outside scope
- [ ] Handoff current: if continuing previous work, the handoff document reflects the latest corrections
- [ ] Branch clean: working directory is on the correct branch with no uncommitted changes
During
- [ ] Check logs at defined intervals (every 2-3 iterations for overnight runs)
- [ ] Verify trajectory matches spec intent, not just spec letter
- [ ] Monitor resource usage: token spend, iteration count, file system changes
- [ ] Watch for permission escalation: requests for access the task should not require
- [ ] Note any soft drift for the post-session review
After
- [ ] Review all file changes, not just the completion report
- [ ] Run the full test suite independently (do not trust the agent’s reported test results)
- [ ] Check for regressions in adjacent code the agent did not explicitly modify
- [ ] Verify the evidence gate: every criterion has a specific artifact, not a general assurance
- [ ] Update the handoff document with session outcomes and any corrections
- [ ] Log the session: failure modes encountered, hooks that fired, intervention decisions made
- [ ] Update governance: if a new failure pattern emerged, write a hook or policy to prevent recurrence
The Operator as Craftsperson
The agent operator role exists at the intersection of engineering skill and product judgment. Writing hooks requires systems knowledge. Writing specs requires product understanding. Reviewing agent output requires both: the ability to evaluate whether code is correct and whether correct code solves the right problem.
Chat is the wrong interface for the operational half of the role. Scrolling through conversation transcripts to supervise autonomous work does not scale past a single agent running a single session. The supervision stack described above (hooks, evidence gates, quality loops, session logs, cost gates) compensates for the interface gap by encoding supervision into infrastructure. The infrastructure does not replace the operator. The infrastructure multiplies the operator’s reach.
“Taste is a technical system” describes the judgment half. Knowing what to delegate, what to verify, and what to reject requires pattern recognition built from experience. Each session teaches the operator something about agent behavior. The operator’s skill compounds through deliberate practice, reflection, and infrastructure that encodes lessons permanently.
The dark factory represents the theoretical endpoint, Level 5, where no human reads the code. Current practice sits at Level 3 or 4 for most teams: the agent does the work, the operator supervises and intervenes. The gap between Level 4 and Level 5 is the verification layer. The gap between Level 2 and Level 4 is the operator.
Every team running autonomous agents will develop operators. The question is whether they develop the role deliberately (with defined responsibilities, infrastructure support, and explicit training) or accidentally, by assigning the work to whoever happens to be awake when the overnight session fails. The craft develops from there.
1. Anthropic, “Claude Code Configuration,” published February 2026. https://docs.anthropic.com/en/docs/claude-code/settings
2. Anthropic, “Claude Code Hooks,” published February 2026. https://docs.anthropic.com/en/docs/claude-code/hooks
3. Blake Crosley, “The Ralph Loop: How I Run Autonomous AI Agents Overnight,” published February 2026. https://blakecrosley.com/blog/ralph-agent-architecture
4. Blake Crosley, “The Evidence Gate: Proof Over Plausibility in AI Output,” published March 2026. https://blakecrosley.com/blog/the-evidence-gate
5. Blake Crosley, “What Actually Breaks When You Run AI Agents Unsupervised,” published February 2026. https://blakecrosley.com/blog/what-actually-breaks-unsupervised