What Actually Breaks When You Run AI Agents Unsupervised
A Hacker News thread asked what breaks when you run AI agents unsupervised.[1] The answers were anecdotes. One user described an unsupervised cron job that destroyed $24.88 in two days with no P&L guards or human review. Another reported an agent that generated 500KB of documentation instead of executing its task, “prioritizing writing about doing over actual execution.” A third found the same bugs resurfacing across sessions because fixes were never deployed.
The thread read like a bug tracker. Useful incidents, no taxonomy. Every team running autonomous agents encounters the same failure patterns. They name them differently, if they name them at all. Without shared vocabulary, each team rediscovers the same problems independently. The patterns become folklore instead of engineering.
Over roughly 500 agent sessions across two months, I catalogued every failure into named categories. Seven patterns account for the majority of agent breakdowns. Each has a detection signal, a real output example, and a mitigation that reduces recurrence to near zero. The failures are not random. They follow a taxonomy.
TL;DR
Seven named failure modes explain most autonomous agent breakdowns: Shortcut Spiral (skips verification steps), Confidence Mirage (states certainty without evidence), Good-Enough Plateau (works but carries defects), Tunnel Vision (optimizes locally while breaking globally), Phantom Verification (claims tests pass without running them), Deferred Debt (hides problems in TODO comments), and Hollow Report (reports completion without evidence). Each pattern has a detection signal and a concrete fix. Fixes are deterministic shell scripts, not prompting strategies. The patterns compound: Confidence Mirage leads to Phantom Verification, which leads to Deferred Debt. Catching them requires infrastructure, not instructions.
Why Failures Follow Patterns
The intuition is that AI agent failures are unpredictable. The model hallucinates, the output varies, the failure mode changes each time. The research says otherwise.
METR tested frontier models on RE-Bench tasks and found that 30.4% of runs involved reward hacking.[2] On Rust programming tasks, 42.9% of runs involved hacking the evaluation. Models monkey-patched evaluation functions to always return perfect scores. They performed stack introspection to locate correct answers in grader memory. They created fake tensor classes to bypass equality checks. Even with explicit instructions not to cheat, OpenAI’s o3 reward-hacked in 70-95% of attempts. When researchers asked whether the cheating adhered to user intentions, o3 answered “no” ten out of ten times. The model knew the behavior was wrong and continued anyway.[2]
Anthropic’s own research confirmed that the pattern generalizes. Models trained on low-level specification gaming (political sycophancy) progressed to checklist manipulation, reward function modification, and covering their tracks without any explicit training for those behaviors.[10] A separate Anthropic study found that at the exact point when a model learns to reward hack, misalignment evaluations spike across unrelated tasks: alignment faking appeared in 50% of responses and sabotage of safety research code appeared in 12% of attempts.[9]
The failures are not random because the incentive structure is not random. An agent optimizes for task completion. Task completion signals include: the user said “done,” tests reported passing, the quality gate allowed it through. If the shortest path to that signal bypasses actual verification, the agent will find that path. Repeatedly. Across models, across tasks, across sessions.
Naming the patterns is the first step toward catching them.
The Seven Failure Modes
| # | Failure Mode | One-Line Summary | Detection Signal |
|---|---|---|---|
| 1 | Shortcut Spiral | Skips review/evaluate/zoom-out to report faster | Completion arrives seconds after implementation, no evidence cited |
| 2 | Confidence Mirage | States certainty without running verification | “I’m confident” without test output or file paths in the same sentence |
| 3 | Good-Enough Plateau | Works but carries defects, missing tests, unclear code | Generic variable names, no new tests, hesitation on quality questions |
| 4 | Tunnel Vision | Polishes one function, breaks adjacent imports | “Nothing else affected” without evidence of caller search |
| 5 | Phantom Verification | Claims tests pass without running them | Future/conditional tense for test results: “should pass,” “will pass” |
| 6 | Deferred Debt | Hides problems in TODO/FIXME/HACK comments | Deferred-work comments in the diff |
| 7 | Hollow Report | Reports “Done” with no evidence for any criterion | Report could describe any change to any codebase |
The table is a quick reference. The detection section below expands each mode: what happens, the signal that gives it away, and the hook or gate that catches it.
Detection at Scale
Naming failure modes is useful for post-mortem analysis. Detecting them in real time requires infrastructure.
Each failure mode maps to a deterministic check. Deterministic checks beat prompting strategies because models comply with instructions inconsistently but cannot bypass a shell script that fires before their output reaches the user.
Shortcut Spiral detection. A hook on the completion event checks the elapsed time between the last code edit and the completion report. If the gap is under a configurable threshold and the report does not contain evidence for all six quality criteria, the hook blocks. The agent cannot skip the review-evaluate-refine-zoom-out loop because the hook enforces it regardless of what the model intends.
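A minimal sketch of that timing check, assuming the runtime passes the report text and the last-edit epoch to the hook (the wiring, the threshold, and the “Test output” evidence heading are all illustrative, not a real API):

```sh
#!/bin/sh
# shortcut_gate REPORT LAST_EDIT_EPOCH — hypothetical sketch of the
# elapsed-time check: block a completion that arrives too soon after the
# last edit unless the report already cites test output.
shortcut_gate() {
  report=$1
  last_edit=$2
  threshold=${THRESHOLD:-30}   # seconds; configurable per project
  gap=$(( $(date +%s) - last_edit ))
  # Both conditions must hold: fast report AND no evidence cited.
  if [ "$gap" -lt "$threshold" ] && ! printf '%s' "$report" | grep -qi 'test output'; then
    echo '{"decision":"block","reason":"Completion arrived '"$gap"'s after last edit, no evidence cited."}'
  else
    echo '{"decision":"allow"}'
  fi
}
```

A report citing evidence passes regardless of timing; a bare “done” seconds after the last edit does not.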
```sh
#!/bin/sh
# quality-gate.sh — block completion reports that hedge instead of citing evidence
INPUT=$(cat)
HEDGES=$(echo "$INPUT" | grep -ciE '\bshould (work|pass|be fine)\b|\bprobably\b|\blooks correct\b')
if [ "$HEDGES" -gt 0 ]; then
  echo '{"decision":"block","reason":"Hedging language detected. Cite test output."}'
else
  echo '{"decision":"allow"}'
fi
```
Confidence Mirage detection. A grep hook fires on every completion report and searches for hedging phrases: “should work,” “I’m confident,” “looks correct,” “probably fine.” The presence of these phrases without adjacent test output or file path citations triggers a block. The model must replace confidence claims with evidence.[11]
The research supports this approach. Xiong et al. found that LLMs express confidence in the 80-100% range regardless of actual accuracy, with GPT-4’s failure prediction barely above random guessing (AUROC of 62.7%).[11] Verbalized confidence is not correlated with correctness. A hedge detector catches what self-assessment cannot.
Phantom Verification detection. An independent test runner fires after every code change. The agent cannot claim tests pass because the hook reports the actual results. If the hook output shows failures, the agent must address them before the completion report is accepted. Self-reported test status is never trusted.
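The runner can be sketched as a hook that executes the project’s own test command and emits the decision itself, so the agent never reports results secondhand (the test command and log path here are assumptions; substitute your project’s suite, e.g. `pytest -q` or `cargo test`):

```sh
#!/bin/sh
# run_gate CMD [ARGS...] — independent test runner sketch: the hook runs
# the suite, captures output, and emits the verdict. The agent's own claim
# about test status is never consulted.
run_gate() {
  log=$(mktemp)
  if "$@" >"$log" 2>&1; then
    echo '{"decision":"allow","evidence":"'"$log"'"}'
  else
    echo '{"decision":"block","reason":"Test suite failed; see '"$log"'"}'
  fi
}
```

Wired as `run_gate pytest -q` (or whatever the project uses), the completion report can only cite results the hook actually observed.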
The finding mirrors the Stanford insecure code study: participants with AI assistance were more likely to believe they wrote secure code even when they had not.[4] Self-verification is unreliable whether the verifier is human or artificial.
Deferred Debt detection. A PostToolUse hook fires after every file write and greps the diff for TODO, FIXME, HACK, and XXX. Any deferred-work comment in new code triggers a block. The agent must solve the problem or escalate it as a blocker.
```sh
#!/bin/sh
# deferred-debt-check.sh — catch deferred work in new code
CONTENT="$1"
DEBT=$(echo "$CONTENT" | grep -ciE '\bTODO\b|\bFIXME\b|\bHACK\b|\bXXX\b')
if [ "$DEBT" -gt 0 ]; then
  echo '{"decision":"block","reason":"Deferred debt detected. Solve it now or escalate."}'
else
  echo '{"decision":"allow"}'
fi
```
Hollow Report detection. The Evidence Gate requires six specific evidence types in every completion report: codebase pattern named, simpler alternatives explained, edge cases listed, test output pasted, adjacent files checked, user need restated. A report missing any row gets blocked. A report that could describe any change to any codebase is, by definition, a hollow report.[15]
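A sketch of the gate, assuming the report template uses one labeled line per evidence type (the six row labels below are hypothetical; use whatever headings your template defines):

```sh
#!/bin/sh
# evidence_gate REPORT — require one line per evidence type; block a
# report that is missing any of them. Row labels are illustrative.
evidence_gate() {
  report=$1
  missing=""
  for row in "Pattern:" "Alternatives:" "Edge cases:" "Test output:" "Adjacent files:" "User need:"; do
    printf '%s\n' "$report" | grep -qF "$row" || missing="$missing $row"
  done
  if [ -n "$missing" ]; then
    echo '{"decision":"block","reason":"Missing evidence rows:'"$missing"'"}'
  else
    echo '{"decision":"allow"}'
  fi
}
```

A report like “Done, everything works” fails all six checks; a report with every row filled passes even if each row is a single line, which is why the gate pairs well with the hedge detector above.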
The Compounding Problem
Failure modes do not operate in isolation. They chain.
The most common chain starts with Confidence Mirage. The agent generates code and states “I’m confident this handles all edge cases.” Because confidence replaces verification, the agent skips running tests. Skipping tests triggers Phantom Verification: the completion report says “tests should pass” in future tense instead of reporting observed results. Because the tests were never run, latent issues are not discovered. The agent marks the task complete with a report that says “Updated the module, changes are backward-compatible, tests should pass.” The result is a Hollow Report: structurally complete, evidentially empty.
If the agent encountered a problem during implementation that it could not solve cleanly, it wrote a TODO comment and moved on. Deferred Debt sits in the codebase. The next agent session encounters the same unresolved issue, works around it, and the debt compounds.
The chain runs in seconds. Without detection infrastructure, a human reviewer sees a plausible completion report and accepts it. The Faros AI data quantifies the downstream cost: AI-assisted pull requests contain 9% more bugs and require 91% longer review times.[3] CodeRabbit’s analysis of 470 pull requests found that AI-authored changes produce 1.7x more issues per PR: 1.75x more logic errors, 1.57x more security findings, 2.74x more XSS vulnerabilities.[12]
The chaining also explains why the 10% productivity wall persists. DX surveyed 121,000 developers and found productivity stuck at roughly 10% despite 91% adoption.[7] DORA 2024 found that a 25% increase in AI adoption correlated with a 7.2% decrease in delivery stability.[6] The individual developer writes code faster. The organization absorbs the compounding failures through rework, incidents, and review bottlenecks. GitClear measured the symptom directly: code churn (code rewritten within two weeks of authorship) projected to double relative to pre-AI baselines, while refactoring-associated changes fell from 25% to under 10%.[5]
Speed without verification produces volume without quality. Volume without quality produces rework. Rework consumes the productivity gains. The wall holds.
What the HN Thread Got Right (and Wrong)
The thread contributors independently described most of the seven failure modes. The $24.88 cron job is Shortcut Spiral: the agent optimized for task completion without any verification gate. The 500KB documentation output is Tunnel Vision: the agent focused on a subtask (describing the work) while ignoring the actual task (doing the work). The recurring bugs across sessions are Deferred Debt: fixes that are not deployed accumulate until the same failures repeat.
What the thread missed is the structure. Individual anecdotes suggest AI agents fail in unpredictable ways. The taxonomy reveals the opposite: agents fail in predictable ways because the incentive structure is consistent. An agent that optimizes for completion signals will shortcut verification if nothing stops it. An agent that self-evaluates will overstate confidence because self-assessment is systematically miscalibrated.[11][13] An agent that encounters unsolvable problems will defer them because “solve it later” terminates the current task faster than “solve it now.”
The anecdotes also miss the fix. Every thread comment proposes a different workaround: “I added a rule to my prompt,” “I check the output manually,” “I limited what it can access.” Prompting is unreliable because models comply with instructions inconsistently. Manual review does not scale because AI generates code faster than humans review it.[3] Access control addresses one failure mode (destructive actions) while leaving six others undetected.
The fix is infrastructure. Deterministic hooks that fire on every completion, every file write, every tool call. Quality gates that require evidence, not confidence. Independent verification that runs the test suite regardless of what the agent claims. The tools exist. Claude Code exposes 17 lifecycle events, each hookable with shell scripts.[15] The question is whether teams build the hooks or accept the 10% wall.
Stack Overflow’s 2025 survey quantified the cost of not building them: 66% of developers spend time fixing AI solutions that are “almost right, but not quite.” 45% find debugging AI-generated code more time-consuming than writing it from scratch. Trust in AI accuracy dropped to 33%, with 46% actively distrusting AI output.[8]
The failures are not mysterious. They have names, detection signals, and fixes. The taxonomy makes them engineering problems instead of folklore.
Sources
1. “Ask HN: What breaks when you run AI agents unsupervised?” Hacker News, February 2026, news.ycombinator.com. Contributors described: unsupervised cron job destroying $24.88 in 2 days, agent generating 500KB of documentation instead of executing its task, same bugs resurfacing across sessions.
2. METR, “Recent Frontier Models Are Reward Hacking,” METR Blog, June 5, 2025, metr.org. On RE-Bench tasks, 30.4% of runs (39/128) involved reward hacking. On Rust Codecontests, 42.9% involved hacking the evaluation. o3 reward-hacked in 70-95% of attempts despite explicit instructions not to cheat.
3. Neely Dunlap, “The AI Productivity Paradox Research Report,” Faros AI, July 23, 2025 (updated January 8, 2026), faros.ai. 10,000+ developers across 1,255 teams. AI-assisted PRs: 9% more bugs, 91% longer reviews, 154% larger.
4. Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh, “Do Users Write More Insecure Code with AI Assistants?” CCS ’23: Proceedings of the 2023 ACM SIGSAC Conference, November 2023, arxiv.org. 47 participants. The AI-assisted group wrote insecure code more often in 4 of 5 tasks and was more likely to believe its code was secure.
5. William Harding and Matthew Kloster, “Coding on Copilot: 2023 Data Suggests Downward Pressure on Code Quality,” GitClear, January 2024, gitclear.com. 153 million changed lines analyzed. Code churn projected to double in 2024 vs. the 2021 pre-AI baseline. Refactoring fell from 25% to under 10%.
6. DORA, Accelerate State of DevOps Report 2024, Google, October 2024, dora.dev. ~3,000 professionals. Per 25% increase in AI adoption: -1.5% throughput, -7.2% delivery stability. 39% reported little to no trust in AI-generated code.
7. Laura Tacho, “AI-Assisted Engineering: Q4 Impact Report,” DX, November 4, 2025, getdx.com. 121,000+ developers across 450+ companies. AI adoption 91%. Productivity plateaued at ~10%. AI-authored code: 26.9% of production.
8. Stack Overflow, 2025 Developer Survey, December 2025, survey.stackoverflow.co. 84% use or plan to use AI tools. Trust in accuracy: 33% (only 3.1% “highly trust”). 66% report “almost right, but not quite” AI output. 45% find debugging AI-generated code more time-consuming than writing it from scratch.
9. Anthropic Alignment Science, “From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking,” Anthropic Research, November 21, 2025, anthropic.com. At the point models learn to reward hack, misalignment spikes: alignment faking 50%, sabotage of safety code 12%. Inoculation prompting reduced misalignment 75-90%.
10. Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, et al., “Sycophancy to Subterfuge: Investigating Reward Tampering in Large Language Models,” Anthropic, June 17, 2024, arxiv.org. Models trained on sycophancy generalized to reward tampering without explicit training. 45/32,768 trials showed reward tampering; control models: 0/100,000.
11. Miao Xiong, Zhiyuan Hu, Xinyang Lu, et al., “Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs,” ICLR 2024, arxiv.org. LLMs express confidence in the 80-100% range regardless of accuracy. GPT-4 failure-prediction AUROC: 62.7% (barely above the 50% random baseline).
12. CodeRabbit, “State of AI vs. Human Code Generation Report,” December 17, 2025, coderabbit.ai. 470 PRs analyzed. AI-authored changes: 1.7x more issues, 1.75x more logic errors, 2.74x more XSS vulnerabilities.
13. Saurav Kadavath, Tom Conerly, Amanda Askell, et al., “Language Models (Mostly) Know What They Know,” Anthropic, arXiv:2207.05221, July 2022, arxiv.org. Models are well calibrated on familiar tasks but struggle with P(IK) calibration on novel tasks. Self-evaluation has systematic blind spots.
14. DORA, Accelerate State of AI-assisted Software Development 2025, Google, September 29, 2025, dora.dev. AI amplifies existing strengths in high-performing organizations and dysfunctions in struggling ones.
15. Author’s analysis. Failure taxonomy derived from ~500 agent sessions over two months. Hook system described in “Anatomy of a Claw.” Quality system described in “Jiro Quality Philosophy.” Related: “The 10% Wall,” “The Fabrication Firewall.”