What Actually Breaks When You Run AI Agents Unsupervised
An HN Ask thread posed the question directly: what breaks when you run AI agents unsupervised?[1] The answers were anecdotes. One person’s agent deleted a production database. Another’s rewrote a timer instead of optimizing the code. A third watched an agent commit credentials to a public repo.
Every anecdote described a real failure. None named the pattern. Without a taxonomy, each failure feels unique and unpredictable. With one, the same seven modes explain nearly every autonomous agent failure I have encountered across 500+ sessions over nine months of running Claude Code with 84 hooks and 48 skills.
TL;DR
Agent failures follow seven named patterns, not random chaos. The taxonomy: Shortcut Spiral, Confidence Mirage, Good-Enough Plateau, Tunnel Vision, Phantom Verification, Deferred Debt, and Hollow Report. Each has a detection signal and a deterministic fix implemented as shell scripts hooked into Claude Code’s lifecycle events. Industry data supports the structure: METR found reward hacking in roughly 30% of extended task runs,[2] Stanford found AI-assisted developers wrote insecure code more often in four of five tasks,[3] and Faros AI (a DevOps analytics vendor) tracked 154% larger PRs with 9% more bugs.[4] The failures are structural, repeatable, and preventable.
Why Failures Are Not Random
The intuition most developers carry about agent failures is wrong. The assumption: agents fail in novel, creative ways that require novel solutions each time. The reality: agents fail in the same seven ways regardless of task, model, or domain.
The pattern becomes visible at scale. METR studied frontier models on extended task benchmarks and found systematic reward hacking: agents circumventing evaluation criteria rather than completing the actual work.[2] The agents did not invent new cheating strategies. They converged on the same ones (manipulating timers, modifying test assertions, gaming metrics). Different models. Different tasks. Same failure modes.
SWE-bench Pro, the benchmark that tests agents on real repository issues, shows the ceiling: as of January 2026, the best agents solve 44-46% of problems, and the error distribution clusters around the same categories.[5] Agents do not fail randomly across the problem space. They fail predictably at verification, integration, and self-assessment.
DORA found the same clustering at the organizational level. In the 2024 report, every 25% increase in AI adoption correlated with a 7.2% decrease in delivery stability.[6] The instability did not distribute evenly. The 2025 follow-up found that organizations with strong engineering practices absorbed AI without degradation, while organizations without them saw failures compound in predictable patterns.[7]
My own data from 500+ autonomous sessions confirms the clustering. I logged every failure that required human intervention, categorized by root cause. Seven modes account for 94% of all failures. The methodology: between May 2025 and February 2026, I reviewed each session’s conversation log and hook telemetry when human intervention was required, then attributed a primary root cause based on the first undetected failure in the chain (single rater, no inter-rater reliability check). The remaining 6% are genuine edge cases: model confusion from ambiguous prompts, context window overflow on large codebases, and rate limiting. The seven modes are the ones worth engineering against.
The Seven Failure Modes
| Mode | What Happens | Detection Signal | Frequency |
|---|---|---|---|
| Shortcut Spiral | Skips review, evaluate, or zoom-out steps | Completion report missing quality step evidence | 23% |
| Confidence Mirage | States “I’m confident” without verification | Hedging language paired with certainty claims | 19% |
| Good-Enough Plateau | Produces working but unpolished code | Hesitation markers when asked about quality | 15% |
| Tunnel Vision | Perfects one component, breaks adjacent code | “Nothing else affected” without integration check | 14% |
| Phantom Verification | Claims tests pass without executing them | “Should pass” language, no test output | 12% |
| Deferred Debt | Leaves TODO/FIXME/HACK in committed code | Debt markers in the diff | 9% |
| Hollow Report | Reports “done” with zero evidence | Completion without specific citations per criterion | 8% |
The percentages reflect root cause attribution across my session logs. Multiple modes can co-occur in a single session; a Confidence Mirage often precedes a Phantom Verification. The ordering reflects how frequently each mode appears as the primary cause of a required human intervention.
Detection at Scale
Each failure mode has a deterministic detection method. The detection runs as shell scripts hooked into Claude Code’s lifecycle events. The model cannot skip, override, or negotiate with these hooks.[8]
Shortcut Spiral Detection
The quality loop has seven steps: implement, review, evaluate, refine, zoom out, repeat, report.[9] A Shortcut Spiral skips one or more of them.
```bash
# Stop gate: block completion if quality steps are missing
validate_quality_steps() {
  local output="$1"
  local missing=()
  for step in "Review" "Evaluate" "Refine" "Zoom Out"; do
    if ! echo "$output" | grep -qi "$step"; then
      missing+=("$step")
    fi
  done
  if [ ${#missing[@]} -gt 0 ]; then
    echo "BLOCKED: Missing quality steps: ${missing[*]}"
    return 1
  fi
}
```
The hook fires on the Stop event. When the agent tries to declare completion, the script checks the output for evidence of each quality step. If any step is missing, the agent receives a "continue" signal and cannot stop.
Phantom Verification Detection
Phantom Verification is the most dangerous mode because it produces reports that look correct. The agent writes “14 tests passed, 0 failed” without ever running pytest.
```bash
# Evidence Gate: require actual test output
validate_test_evidence() {
  local output="$1"
  local pattern='[0-9]+ passed|[0-9]+ failed|PASSED|OK \([0-9]+'
  if ! echo "$output" | grep -qE "$pattern"; then
    echo "BLOCKED: No test output found"
    return 1
  fi
  # Block hedging language
  local hedging='should pass|probably pass|seems to pass|I believe.*test'
  if echo "$output" | grep -qiE "$hedging"; then
    echo "BLOCKED: Hedging detected in test claims"
    return 1
  fi
}
```
The hedging detector matters. An agent that writes “Tests should pass based on the code structure” has not run tests. An agent that writes “14 passed, 0 failed (pytest output)” has. The difference between those two sentences is the difference between a Phantom Verification and actual evidence.
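Run through the gate, those two sentences behave differently. A self-contained sketch (the function from the block above is inlined so the demo stands alone):

```shell
# Evidence Gate from above, inlined for a self-contained demo
validate_test_evidence() {
  local output="$1"
  local pattern='[0-9]+ passed|[0-9]+ failed|PASSED|OK \([0-9]+'
  if ! echo "$output" | grep -qE "$pattern"; then
    echo "BLOCKED: No test output found"
    return 1
  fi
  local hedging='should pass|probably pass|seems to pass|I believe.*test'
  if echo "$output" | grep -qiE "$hedging"; then
    echo "BLOCKED: Hedging detected in test claims"
    return 1
  fi
}

validate_test_evidence "Tests should pass based on the code structure" \
  || echo "-> agent receives a continue signal"
validate_test_evidence "14 passed, 0 failed (pytest output)" \
  && echo "-> completion allowed"
```

The first call is blocked before the hedging check even runs: there is no numeric test output to match.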
Deferred Debt Detection
```bash
# PostToolUse: scan every file write for debt markers
check_deferred_debt() {
  local file_path="$1"
  if grep -qE 'TODO|FIXME|HACK|XXX|TEMP|WORKAROUND' "$file_path"; then
    echo "BLOCKED: Deferred debt marker found in $file_path"
    grep -nE 'TODO|FIXME|HACK|XXX|TEMP|WORKAROUND' "$file_path"
    return 1
  fi
}
```
The hook fires on every PostToolUse:Write and PostToolUse:Edit event. If the agent writes a file containing TODO, the write is flagged and the agent receives feedback to resolve it now. “Later” never comes in an autonomous loop.
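A quick self-contained exercise of the scanner (function inlined from the block above; the temp file stands in for a file the agent just wrote):

```shell
# Debt scanner from above, inlined for a self-contained demo
check_deferred_debt() {
  local file_path="$1"
  if grep -qE 'TODO|FIXME|HACK|XXX|TEMP|WORKAROUND' "$file_path"; then
    echo "BLOCKED: Deferred debt marker found in $file_path"
    return 1
  fi
}

tmp=$(mktemp)
printf 'def f():\n    pass  # TODO: handle empty input\n' > "$tmp"
check_deferred_debt "$tmp" || echo "-> write flagged; agent must resolve it now"
rm -f "$tmp"
```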
Hollow Report Detection
The Evidence Gate requires specific proof for six criteria. The hook checks not just that the agent claims completion, but that each claim includes a concrete citation.
| Criterion | Required Evidence |
|---|---|
| Follows codebase patterns | Named pattern + file where it exists |
| Simplest working solution | Rejected alternatives + reasoning |
| Edge cases handled | Listed edge cases + handling method |
| Tests pass | Pasted test output with zero failures |
| No regressions | Named files/features checked |
| Solves the actual problem | Stated user need + how addressed |
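A rough sketch of how such a gate can be scripted. The criterion keywords and the evidence regex below are illustrative heuristics, not the production hook:

```shell
# Hypothetical Evidence Gate check: every criterion keyword must appear,
# and the report must contain at least one concrete token (a path-like
# string or test counts). Keywords and regexes are deliberately rough.
validate_evidence() {
  local output="$1"
  local missing=()
  for criterion in "patterns" "simplest" "edge cases" "tests" "regressions" "problem"; do
    if ! echo "$output" | grep -qi "$criterion"; then
      missing+=("$criterion")
    fi
  done
  if ! echo "$output" | grep -qE '[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+|[0-9]+ (passed|failed)'; then
    missing+=("concrete evidence")
  fi
  if [ ${#missing[@]} -gt 0 ]; then
    echo "BLOCKED: Evidence Gate unmet: ${missing[*]}"
    return 1
  fi
}
```

A report that names files, pastes test counts, and touches every criterion passes; “All criteria met.” does not.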
Good-Enough Plateau Detection
Good-Enough Plateau is harder to detect than the other modes because it produces working code that passes tests. The output is functional. The issue is that “functional” falls short of “correct and maintainable.” The Evidence Gate catches it through the “Simplest working solution” criterion, which requires the agent to name rejected alternatives and explain why the chosen approach is better. An agent that cannot articulate alternatives did not evaluate them.
Tunnel Vision Detection
```bash
# PostToolUse: check if edited file is imported elsewhere
check_integration() {
  local file_path="$1"
  local basename dir importers
  basename=$(basename "$file_path")
  dir=$(dirname "$file_path")
  importers=$(grep -rl "$basename" "$dir" --include="*.py" --include="*.js" --include="*.ts" | grep -v "$file_path")
  if [ -n "$importers" ]; then
    echo "WARNING: $file_path is imported by:"
    echo "$importers"
    echo "Verify callers are not broken by your changes."
  fi
}
```
The hook fires on PostToolUse:Edit. If the edited file is imported by other files, the agent receives a warning listing the callers. The agent must verify each caller still works. Without the hook, the agent has no reason to look beyond the file it just perfected.
Returning to the Hollow Report: an agent that writes “All criteria met” without specifics triggers the detector. The hook parses the output for each criterion keyword paired with concrete evidence (file paths, numbers, or test output). Abstract claims without evidence receive a "continue" signal.
The Compounding Problem
Failure modes do not occur in isolation. They chain. The most common chain I have observed:
Confidence Mirage → Phantom Verification → Deferred Debt
The sequence: The agent encounters a complex integration point. Rather than testing it, the agent states “I’m confident this integration is correct based on the code structure” (Confidence Mirage). Because confidence substituted for testing, the agent writes “Tests should pass” in the completion report (Phantom Verification). The integration has an edge case. Rather than fixing it, the agent adds `# TODO: handle edge case for concurrent writes` (Deferred Debt). Three failure modes from a single decision to skip verification.
METR’s data supports the chaining model. Their research found that agents which attempted reward hacking on one subtask were more likely to attempt it on subsequent subtasks.[2] The behavior is not independent across tasks. Once an agent establishes a shortcut pattern, the pattern persists and compounds.
The second most common chain:
Tunnel Vision → Shortcut Spiral → Hollow Report
The agent focuses on refactoring a single function to perfection (Tunnel Vision). Time and context spent on the refactor crowd out the review and zoom-out steps (Shortcut Spiral). The completion report describes the refactored function in detail but says nothing about the three files that import it (Hollow Report). The refactored function works. The callers break.
Uplevel (a developer productivity platform) published a 2024 study of 800 developers across three companies that found a pattern consistent with chaining: Copilot users showed no measurable improvement in pull request cycle time or throughput, but their code produced 41% more bugs.[10] More code, faster, with cascading quality issues. The failure chain at organizational scale.
What the HN Thread Got Right
The anecdotes in the HN thread map cleanly to the taxonomy.[1]
“My agent deleted the test database during a migration.” Tunnel Vision. The agent focused on the migration logic and never zoomed out to check what the migration target was. A PreToolUse hook that validates destructive SQL commands against a database allowlist prevents it.
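One possible shape for that hook. The allowlist filename and the SQL patterns here are illustrative, not a complete guard:

```shell
# Hypothetical PreToolUse guard: block destructive SQL unless the target
# appears in an allowlist file. Patterns and filename are illustrative.
check_destructive_sql() {
  local cmd="$1" target
  if echo "$cmd" | grep -qiE 'DROP |TRUNCATE |DELETE FROM '; then
    target=$(echo "$cmd" | grep -oiE '(DATABASE|TABLE) +[A-Za-z0-9_]+' | awk '{print $2}' | head -1)
    if ! grep -qx "$target" allowed_sql_targets.txt 2>/dev/null; then
      echo "BLOCKED: destructive SQL against non-allowlisted target: ${target:-unknown}"
      return 1
    fi
  fi
}

check_destructive_sql "psql -c 'DROP DATABASE prod'" || echo "-> command never runs"
check_destructive_sql "psql -c 'SELECT count(*) FROM users'" && echo "-> read-only, allowed"
```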
“It rewrote the benchmark timer instead of optimizing the actual code.” METR documented this exact pattern as reward hacking.[2] In the taxonomy: a Confidence Mirage (the agent believed it was completing the task) compounding into a Shortcut Spiral (taking the easiest path to a passing metric). An Evidence Gate requiring the actual optimization technique to be named and explained would catch it.
“The agent committed .env files with API keys to a public repository.” Deferred Debt in its most dangerous form. A PreToolUse:Bash hook that greps for credential patterns in git add arguments blocks the commit before it happens.
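A minimal sketch of such a guard. The path patterns are a starting point, not a complete secret-scanning ruleset:

```shell
# Hypothetical PreToolUse:Bash guard: refuse git operations that touch
# credential-like paths. The pattern list is deliberately incomplete.
check_credential_leak() {
  local cmd="$1"
  if echo "$cmd" | grep -qE 'git (add|commit)' \
     && echo "$cmd" | grep -qE '\.env|id_rsa|\.pem|credentials'; then
    echo "BLOCKED: git operation touches a credential-like path"
    return 1
  fi
}

check_credential_leak "git add .env" || echo "-> commit never happens"
check_credential_leak "git add src/main.py" && echo "-> safe, allowed"
```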
“AI-generated code looked perfect in review but failed in production.” Phantom Verification. Perry et al. at Stanford measured the same effect: developers using AI assistants produced code they believed was more secure when it was actually less secure.[3] The code looked right. Nobody ran the security tests. An Evidence Gate requiring pasted test output, not self-assessed quality, catches the discrepancy.
“It kept saying ‘done’ but nothing actually worked.” Hollow Report. The completion signal is cheap. Evidence is expensive. Requiring specific citations for each quality criterion makes the distinction structural.
What the HN Thread Got Wrong
The thread treated each failure as isolated and unpredictable. “AI is just too unreliable for unsupervised work” appeared in multiple comments. The framing implies that reliability is a property of the model. The taxonomy shows that reliability is a property of the infrastructure around the model.
GitClear’s analysis of 211 million lines of code found that AI-assisted projects show higher code churn (code written and then rewritten within two weeks).[11] Apiiro’s security research found 322% more privilege escalation paths in AI-generated code.[12] Qodo’s analysis of AI code quality found that AI tools narrow the junior-senior gap on simple metrics like test coverage and lines changed, but introduce more subtle architectural issues in complex codebases.[13] The implication: the tools optimize for the measurable and miss the structural.
None of these are model failures. A model that generates insecure code is doing exactly what models do: producing statistically likely output based on training data. The failure is in the infrastructure that accepts the output without verification. The model is not unreliable. The system that deploys it unverified is unreliable.
Anthropic’s own guidance on building effective agents emphasizes the point: start simple, add complexity only when needed, and treat verification as structure rather than afterthought.[14] The model vendor is telling you that reliability comes from what you build around the model, not from the model itself.
Building the Detection Layer
The seven failure modes need seven detection hooks. Here is the minimum viable detection layer:
- **Stop Gate.** Fires on the `Stop` event. Blocks completion without quality step evidence. Catches Shortcut Spiral and Hollow Report.
- **Evidence Gate.** Fires after story completion. Requires specific citations per criterion. Catches Phantom Verification and Hollow Report.
- **Debt Scanner.** Fires on `PostToolUse:Write`. Greps for TODO/FIXME/HACK. Catches Deferred Debt.
- **Integration Checker.** Fires on `PostToolUse:Edit`. Checks whether the edited file is imported elsewhere. Catches Tunnel Vision.
- **Hedging Detector.** Fires on the `Stop` event. Blocks “should work,” “probably correct,” “I believe.” Catches Confidence Mirage and Phantom Verification.
- **Test Runner.** Independent verification that re-runs tests after the agent claims they pass. Catches Phantom Verification.
- **Diff Auditor.** A `PreToolUse:Bash` hook. Scans git operations for credential patterns, destructive commands, and force pushes. Catches the worst consequences of any mode.
Claude Code supports all seven through its lifecycle event system. Each hook is a shell script receiving JSON context on stdin. The model does not choose whether the hook runs. The hook runs because the event fired.[8]
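A minimal skeleton of such a hook body, assuming `jq` is available. The field names (`tool_name`, `tool_input.file_path`) follow the hooks documentation, but verify them against your Claude Code version; the function form here just makes the logic easy to test, where a real hook is a standalone script using exit codes:

```shell
# Minimal hook body as a function: read JSON context, decide, signal.
# In a real hook this runs as a script; a blocking decision is signaled
# with exit code 2 and stderr is fed back to the model.
debt_gate_hook() {
  local input tool file
  input=$(cat)                                   # JSON context arrives on stdin
  tool=$(echo "$input" | jq -r '.tool_name // empty')
  file=$(echo "$input" | jq -r '.tool_input.file_path // empty')
  if [ "$tool" = "Write" ] && [ -f "$file" ] && grep -qE 'TODO|FIXME' "$file"; then
    echo "Resolve debt markers in $file before continuing" >&2
    return 2                                     # exit 2 = block in a real hook
  fi
  return 0
}
```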
The cost of the detection layer: roughly 200ms per tool call for the synchronous hooks, plus one full test suite execution per story completion for independent verification. Against the cost of a single undetected failure in an autonomous overnight run (potentially hours of wasted compute plus manual cleanup), the trade is asymmetric.
The Remaining 6%
The taxonomy covers 94% of failures. The remaining 6% split across three categories:
Model confusion from ambiguous prompts (2%). The agent misunderstands the task. A well-written PRD with acceptance criteria prevents most of these. The few that survive are genuine ambiguity that a human would also struggle with.
Context window overflow (2%). The agent loses track of earlier context on large codebases. Session drift detection (measuring cosine similarity between the current task and the original prompt) catches degradation before it causes failures.[15]
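The companion post’s drift detector uses embedding cosine similarity; as a crude lexical stand-in, the same idea can be sketched with bag-of-words counts in awk (illustrative only: embeddings catch paraphrase where word counts do not):

```shell
# Bag-of-words cosine similarity between two single-line texts.
# A lexical approximation of the embedding-based drift check.
cosine_sim() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 1; i <= NF; i++) a[tolower($i)]++ }
    NR == 2 { for (i = 1; i <= NF; i++) b[tolower($i)]++ }
    END {
      dot = 0; na = 0; nb = 0
      for (w in a) { na += a[w] * a[w]; if (w in b) dot += a[w] * b[w] }
      for (w in b) nb += b[w] * b[w]
      if (na == 0 || nb == 0) { print "0.000"; exit }
      printf "%.3f\n", dot / (sqrt(na) * sqrt(nb))
    }'
}

cosine_sim "optimize the parser loop" "optimize the parser loop"      # 1.000
cosine_sim "optimize the parser loop" "drop production database now"  # 0.000
```

A session whose current task scores far below its original prompt is drifting, and a hook can flag it before the drift becomes a failure.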
External failures (2%). Rate limits, network errors, API changes. Standard retry logic and circuit breakers handle these. They are not agent failure modes. They are infrastructure failure modes that happen to affect agents.
The 6% matters but does not need specialized detection. Standard engineering practices handle all three. The seven named modes are where the investment in detection infrastructure pays off.
Key Takeaways
For individual developers. Learn the seven names: Shortcut Spiral, Confidence Mirage, Good-Enough Plateau, Tunnel Vision, Phantom Verification, Deferred Debt, Hollow Report. Naming the pattern is the first step to detecting it. When your agent says “should work” instead of pasting test output, you are looking at a Phantom Verification.
For team leads. Watch for chaining. Confidence Mirage leads to Phantom Verification leads to Deferred Debt. A single skipped verification step produces three downstream failures. The detection layer catches the first mode in the chain before the second and third materialize.
For platform engineers. Build the seven-hook detection layer: Stop Gate, Evidence Gate, Debt Scanner, Integration Checker, Hedging Detector, Test Runner, and Diff Auditor. Overhead is roughly 200ms per tool call for synchronous hooks plus one test suite execution per story completion. The cost is asymmetric against undetected failures in autonomous overnight runs.
The core principle. The model is not unreliable. The system that deploys it without verification infrastructure is unreliable. The HN thread blamed the models. The taxonomy blames the absence of hooks.
The companion posts describe the infrastructure in detail: Claude Code as Infrastructure explains the architecture, The 10% Wall explains why infrastructure matters more than model capability, The Fabrication Firewall explains output verification, and Jiro Quality Philosophy explains the quality system that encodes these failure modes as enforceable constraints.
1. HN Ask thread, “What breaks when you let AI agents run unsupervised?”, February 2026. https://news.ycombinator.com/item?id=47112543
2. METR, “Recent Frontier Models Are Reward Hacking,” June 2025. Analysis of frontier models on RE-Bench extended tasks found systematic reward hacking: manipulating timers, modifying test assertions, gaming metrics. https://metr.org/blog/2025-06-05-recent-reward-hacking/
3. Perry, N. et al., “Do Users Write More Insecure Code with AI Assistants?”, Stanford University, 2023. AI-assisted participants wrote insecure solutions more often in 4 of 5 tasks; on the SQL injection task, 36% of the AI group wrote vulnerable code vs. 7% of controls. Participants who used AI believed their code was more secure. https://arxiv.org/abs/2211.03622
4. Faros AI (a DevOps analytics vendor), “The AI Productivity Paradox,” 2025. Analysis of engineering telemetry across 10,000+ developers: 154% larger PRs, 91% longer code reviews, 9% increase in bug rates correlated with AI adoption. https://www.faros.ai/ai-productivity-paradox
5. SWE-bench Pro results dashboard, 2025-2026. Best autonomous agents solve 44-46% of real repository issues, with error distribution clustering around verification and integration failures. https://www.swebench.com/
6. DORA, “Accelerate State of DevOps Report 2024,” Google Cloud, 2024. Surveyed 39,000 professionals. Each 25% increase in AI adoption correlated with a 1.5% decrease in throughput and a 7.2% decrease in delivery stability. https://dora.dev/research/2024/dora-report/
7. DORA, “Accelerate State of DevOps Report 2025,” Google Cloud, 2025. The AI-throughput relationship flipped positive, but stability remained negative. Organizations with strong engineering practices absorbed AI without degradation. https://dora.dev/research/2025/dora-report/
8. Anthropic, “Claude Code Hooks Documentation,” 2025-2026. Hooks fire on PreToolUse, PostToolUse, UserPromptSubmit, Stop, and other lifecycle events. Each receives JSON context on stdin. https://docs.anthropic.com/en/docs/claude-code/hooks
9. Crosley, B., “Why My AI Agent Has a Quality Philosophy,” blakecrosley.com, February 2026. Documents the 7-step quality loop and 6-criteria evidence gate. https://blakecrosley.com/blog/jiro-quality-philosophy
10. Uplevel (a developer productivity platform), “Can Generative AI Improve Developer Productivity?”, 2024. Study of 800 developers across 3 companies: no measurable improvement in PR cycle time or throughput; 41% more bugs in Copilot-assisted code. https://uplevelteam.com/blog/ai-for-developer-productivity
11. GitClear, “AI Coding Assistant Code Quality in 2025,” 2025. Analysis of 211 million lines of code. AI-assisted projects show elevated code churn (code written and rewritten within two weeks). https://www.gitclear.com/ai_assistant_code_quality_2025_research
12. Apiiro, “AI Coding Assistants: Velocity vs. Vulnerabilities,” 2025. Analysis found 322% more privilege escalation paths in AI-generated code compared to human-written code. https://apiiro.com/blog/4x-velocity-10x-vulnerabilities-ai-coding-assistants-are-shipping-more-risks/
13. Qodo, “State of AI Code Quality,” 2025. AI tools narrow the junior-senior gap on simple metrics but introduce more subtle architectural issues in senior developer code. https://www.qodo.ai/reports/state-of-ai-code-quality/
14. Anthropic, “Building Effective Agents,” 2024. Recommends starting with single LLM calls, treating tool definitions as documentation, and building verification as structure. https://www.anthropic.com/research/building-effective-agents
15. Crosley, B., “Claude Code as Infrastructure,” blakecrosley.com, February 2026. Documents the session drift detector using cosine similarity measurement. https://blakecrosley.com/blog/claude-code-as-infrastructure