Your Agent Writes Faster Than You Can Read
Last Tuesday, my autonomous coding agent completed a 47-file refactor in 9 minutes. Tests passed. Linting passed. The quality gates found zero violations. I merged the PR, deployed, and moved on. Three days later a teammate asked why the retry logic in the payment service had changed from exponential backoff to fixed-interval polling. I didn’t know it had. The agent’s commit message said “refactor: standardize retry patterns across services.” The change was technically correct. I never read line 847 of file 31.
That gap between what shipped and what I understood is cognitive debt.
TL;DR
Five independent research teams published work on the same structural problem within a single week: coding agents produce output faster than developers can verify, understand, and maintain it. Margaret-Anne Storey named the pattern “cognitive debt.” Researchers at Microsoft, ETH Zurich, and multiple universities are building systems to detect agent misbehaviors, make tool calls transactional, and benchmark how agents learn through interaction. The convergence matters because it signals that the research community is catching up to a problem practitioners have been solving with ad-hoc quality gates. The agent reliability problem now has a name, a taxonomy, and five competing approaches. Below: the research, how to detect cognitive debt in your own workflow, and a minimum viable intervention you can implement today.
Five Papers, One Week, One Problem
Between February 15 and February 21, 2026, five independent groups published work addressing the same structural failure in AI coding agents. None cited each other. Each approached the problem from a different angle. All converged on the same conclusion: the bottleneck in agent-assisted development is no longer code quality. The bottleneck is human comprehension.
Margaret-Anne Storey articulated the concept of “cognitive debt” to describe what accumulates when agents produce code that could be clean, tested, and well-structured while developers lose track of what the code actually does.1 Technical debt lives in the codebase. Cognitive debt lives in the developer’s head. Storey’s framing shifts the agent reliability question from “does the code work?” to “does the developer understand the code?”
Nanda et al. at Microsoft published Wink, a system for automatically detecting and recovering from coding agent misbehaviors.2 Their taxonomy identifies three failure modes: instruction deviation (the agent does something different from what you asked), repetitive loops (the agent tries the same failing approach repeatedly), and tool misuse (the agent calls the wrong tool or passes wrong arguments). Wink monitors agent behavior in real time and intervenes before the misbehavior compounds.
Mohammadi et al. at ETH Zurich introduced Atomix, a runtime that wraps agent tool calls in transactions.3 When an agent’s multi-step plan fails partway through, Atomix rolls back the side effects. The insight: agents act on external systems (databases, APIs, file systems), and those actions have consequences the agent cannot undo without explicit rollback infrastructure.
Hallinan et al. created OpaqueToolsBench, a benchmark measuring how agents learn tool behavior through interaction rather than documentation.4 Real-world tools are poorly documented. The benchmark tests whether agents can discover failure modes, best practices, and edge cases by trial and error. The finding: agents that explore tool behavior independently produce better results than agents given perfect documentation they never verify.
Deng et al. evaluated 28 LLM-based penetration testing systems and identified two distinct failure categories.5 Type A failures stem from missing capabilities (wrong tools, bad prompts) that engineering readily fixes. Type B failures persist regardless of tooling because the agent lacks the judgment to evaluate its own findings. Type B is the cognitive debt problem expressed as a security risk: the agent finds six of seven vulnerabilities but confidently reports the system is secure.
The Convergence Matters More Than Any Single Paper
One paper about agent reliability is interesting. Five papers in a week from unrelated teams is a signal. The research community is independently arriving at the same conclusion that practitioners have been discovering through production failures.
I built the Jiro quality system starting in May 2025. The system enforces a 7-step quality loop, a 6-criteria evidence gate, and 7 named failure modes that map directly to the patterns these papers describe:
| Research Finding | Jiro Equivalent | Detection Method |
|---|---|---|
| Wink: instruction deviation | Tunnel Vision | Zoom Out step verifies integration points |
| Wink: repetitive loops | Circuit breaker | Kills retry after 3 identical failures |
| Wink: tool misuse | Confidence Mirage | Evidence gate rejects “I’m confident” without proof |
| Atomix: unrecoverable side effects | Deliberation gates | Multi-agent consensus before irreversible actions |
| Deng: Type B judgment failures | Hollow Report | Requires specific evidence for every claim |
The timeline matters. Nine months of trial-and-error debugging in production, building quality gates one failure at a time, resulted in an architecture that five research papers are now formalizing independently. The structural problems are real. The ad-hoc solutions work. The research is catching up with frameworks, taxonomies, and benchmarks that make the solutions reproducible.
The Three Laws of Cognitive Debt
Storey’s framing crystallizes what I’ve observed across 11 months of autonomous agent development. Three patterns hold regardless of model, tooling, or domain:
1. Cognitive debt compounds with velocity. My agent averages 140-200 lines of meaningful code changes per minute across a refactoring session (measured from git diffs, excluding whitespace). A focused human developer produces roughly 20-40 lines per minute during active coding.8 The Ralph loop that runs Claude at $10/hour doesn’t produce 5x the cognitive debt of a human developer. It produces far more, because the human developer’s typing speed is coupled to their thinking speed. The agent’s output speed has no coupling to your comprehension speed. Output doubles; comprehension stays constant; debt compounds. (A minimal git-based sketch of this measurement appears after this list.)
2. Passing tests does not retire cognitive debt. Every paper in this week’s cluster treats test passage as a necessary but insufficient signal. Deng et al.’s Type B failures pass all automated checks. Wink’s misbehavior taxonomy includes agents that produce working code that doesn’t match the intent. My evidence gate requires six criteria beyond “tests pass,” and the hardest criterion to verify is still “does the developer understand what changed?”6
Here’s a concrete example. My agent refactored a database query to use a CTE (Common Table Expression) instead of a subquery. Both approaches returned identical results. Tests passed. The CTE version ran 3x slower on our dataset because the query planner couldn’t push predicates into the CTE. I caught it during a routine EXPLAIN ANALYZE check two weeks later. The agent’s tests verified correctness. Nothing in the test suite verified performance characteristics. The cognitive debt wasn’t “bad code.” The cognitive debt was “I didn’t know the execution plan had changed.”
3. Cognitive debt is invisible until it isn’t. Technical debt announces itself through slow builds, flaky tests, and merge conflicts. Cognitive debt is silent until someone asks “why does the payment service use fixed-interval polling?” and no one knows. Storey’s contribution is giving the invisible problem a name.
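Returning to the first law: the measurement itself is easy to approximate. The sketch below totals non-whitespace lines changed in the last commit, roughly following the method in footnote 8; it does not filter imports or boilerplate, and git records no timing, so divide by your own session length.

```bash
# Rough proxy for "meaningful lines changed" in the last commit.
# -w ignores whitespace-only changes; imports and boilerplate are NOT
# filtered out here, unlike the footnote-8 measurement.
git diff HEAD~1 -w --numstat \
  | awk -F'\t' '{ changed += $1 + $2 } END { print changed + 0, "meaningful lines changed" }'
```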
Five Warning Signs You’re Accumulating Cognitive Debt
Before you can fix the problem, you need to see it. These five signals show up before production incidents do:
1. You can’t explain the last agent PR without re-reading it. Open the most recent PR your agent created. Without looking at the diff, describe what changed and why. If you can’t, you merged code you don’t understand. I track this by adding a one-line “summary check” to my review process: before approving, I write a one-sentence explanation in the PR comment. If I can’t write the sentence, I haven’t reviewed enough.
2. Your git log --stat shows sessions with 20+ files touched. Run this right now:
git log --stat --since="1 week ago" --author="$(git config user.name)" | \
awk '/files? changed/ {files+=$1} END {print files, "files changed this week"}'
Compare the number to how many of those files you could describe from memory. The gap is your cognitive debt backlog.
3. You review diffs by scrolling, not reading. Scrolling is pattern-matching: “that looks right.” Reading is comprehending: “this changes the retry interval from exponential to fixed, which means the downstream service will see a different traffic pattern.” If your review takes less than one minute per 100 lines of diff, you’re scrolling (a rough calculator for this floor appears after this list).
4. Your commit messages describe WHAT, not WHY. “Refactor: standardize retry patterns” describes what the agent did. “Fix: exponential backoff caused thundering herd after service restart” describes why. If your agent’s commit messages read like the first example and you don’t rewrite them, nobody (including future you) will know the reasoning behind the change.
5. You feel productive but can’t list what changed. At the end of a day using an agent, write down the three most significant code changes from memory. If you struggle, the agent was productive. You weren’t. The debt accumulated while you felt efficient.
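For signal 3, the floor is simple arithmetic. A minimal sketch, assuming the last commit is the agent's work: count added and removed lines and apply the one-minute-per-100-lines heuristic from above.

```bash
# Review-time floor for the last commit, per the heuristic in signal 3.
# Counts added/removed content lines, excluding the +++/--- file headers.
lines=$(git diff HEAD~1 | grep -c '^[+-][^+-]')
echo "Diff lines: $lines"
echo "Review floor at 1 min per 100 lines: $(( (lines + 99) / 100 )) minute(s)"
```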
Start Here: The Three-File Protocol
You don’t need 95 hooks, 7 named failure modes, or a multi-agent deliberation system to start managing cognitive debt. Start with one rule and build from there.
The rule: After every agent session, fully read three files. Not skim. Not scroll. Read every line of the three files with the largest diffs.
Why three? Because three files is achievable (you’ll actually do it) and diagnostic (you’ll discover whether the agent’s changes match your mental model). If they match, your debt is manageable. If they don’t, you have a leading indicator that the rest of the session’s changes also diverge from your understanding.
Implementation
After your agent finishes, run:
# Show the 3 files with the largest diffs from the last commit
git diff HEAD~1 --stat | sort -t'|' -k2 -rn | head -3
Then read those three files. Not the diff. The full file. Context matters: the diff shows what changed, but the file shows what the change means in context.
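If you want to go straight from the numbers to the files, a variant of the same idea can open them for you. A minimal sketch, assuming GNU xargs and an editor (or less) that accepts multiple files; --numstat is used instead of --stat because --stat can truncate long paths.

```bash
# Open the full contents of the three most-changed files from the last commit.
git diff HEAD~1 --numstat \
  | awk -F'\t' '{ print ($1 + $2) "\t" $3 }' \
  | sort -rn \
  | head -3 \
  | cut -f2 \
  | xargs -d '\n' -r "${EDITOR:-less}"
```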
Upgrade Path
Once the three-file protocol is habit (roughly one week), add one layer at a time:
| Week | Addition | What It Catches |
|---|---|---|
| 1 | Three-file read | Comprehension gaps |
| 2 | One-sentence PR summary (written before approval) | Intent misalignment |
| 3 | EXPLAIN ANALYZE on any modified query | Performance regressions |
| 4 | Commit message rewrite (change WHAT to WHY) | Lost reasoning |
| 5+ | Named failure modes for your team’s recurring patterns | Structural blindness |
Each layer retires a specific category of cognitive debt. The three-file read catches comprehension gaps. The PR summary catches intent misalignment. The query check catches the CTE incident I described above. The commit rewrite preserves reasoning that would otherwise evaporate. Named failure modes prevent repeat mistakes.
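A sketch of that week-3 query check, assuming Postgres, a staging database reachable through a hypothetical $STAGING_DB_URL, and that you paste in the query the agent actually touched (the payments query below is purely illustrative). EXPLAIN ANALYZE executes the statement, hence the transaction that gets rolled back.

```bash
# Compare the plan of the agent-modified query against your expectations.
# ANALYZE runs the query for real, so keep it inside a rolled-back transaction.
psql "$STAGING_DB_URL" <<'SQL'
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
  SELECT id, status FROM payments WHERE status = 'retrying';
ROLLBACK;
SQL
```

The CTE incident from law 2 would have shown up here as a plan change that no test flagged.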
What the Research Proposes (And What Actually Works)
The five papers point toward four structural interventions. All four exist in some form in my Claude Code toolchain, built before the papers were published, validated by the same patterns the papers describe.
Independent verification. Wink monitors agent behavior against the stated intent. My quality loop requires re-reading every line written, explicitly prohibiting the Phantom Verification failure mode (claiming tests pass without running them in the current session).7 The fix is structural: verification must be performed by a different process than the one that produced the output.
In practice, I enforce this with a post-session hook that runs the test suite independently rather than trusting the agent’s report:
# Post-session verification hook (simplified)
# Agent says "tests pass" — verify independently
cd "$PROJECT_DIR"
test_output=$(python -m pytest --tb=short -q 2>&1)
exit_code=$?
if [ $exit_code -ne 0 ]; then
echo "AGENT CLAIMED TESTS PASS. INDEPENDENT RUN FAILED:"
echo "$test_output"
exit 1
fi
The agent reported “all tests pass” and meant it. The independent run catches environment differences, missing fixtures, and tests that pass through side effects rather than correctness. In nine months of running this hook, it has caught 23 false positives from agent self-reports.9
Transactional boundaries. Atomix wraps tool calls in transactions with rollback. My deliberation system gates irreversible actions behind consensus from multiple independent agents. Both approaches add friction to agent execution at the points where mistakes are most costly. The practical version for most teams: require a manual approval step before any agent-initiated database migration, deployment, or external API call.
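A sketch of that manual approval step as a shell wrapper; the guarded_run name and the alembic example are illustrative, not part of any existing toolchain, and the same idea can live in a CI approval step instead.

```bash
# Require an interactive human confirmation before any irreversible command.
guarded_run() {
  echo "Agent requested an irreversible action:"
  printf '  %q' "$@"; echo
  read -r -p "Type APPROVE to run it: " answer
  if [ "$answer" != "APPROVE" ]; then
    echo "Rejected; nothing was executed." >&2
    return 1
  fi
  "$@"
}

# Example (illustrative): gate a database migration behind approval
# guarded_run alembic upgrade head
```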
Behavioral taxonomies. Wink’s three failure modes (deviation, loops, tool misuse) and my seven named failure modes (Shortcut Spiral, Confidence Mirage, Good-Enough Plateau, Tunnel Vision, Phantom Verification, Deferred Debt, Hollow Report) serve the same purpose: they make invisible failures visible by giving them names.7 A developer who can say “the agent is exhibiting Tunnel Vision” can intervene before the debt compounds. Start with three names for your team’s three most common agent mistakes. The names matter more than the taxonomy.
Selective engagement. Deng et al.’s Type A/Type B distinction and the confidence module in my deliberation system both encode the same insight: not every agent output deserves the same level of scrutiny. A useful heuristic:
| Agent Output | Review Level | Why |
|---|---|---|
| Test file additions | Skim | Low blast radius, easy to verify by running |
| Config/dependency changes | Full read | Silent production impact |
| Database schema or queries | Full read + EXPLAIN | Performance is invisible in tests |
| Authentication/authorization | Full read + security review | Deng’s Type B failures cluster here |
| Refactoring across 10+ files | Three-file protocol + spot checks | Comprehension impossible at full scale |
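The table translates readily into a first-pass triage script. A sketch, with path patterns that are assumptions about repository layout rather than anything universal:

```bash
# Bucket the last commit's files into the review levels from the table above.
git diff HEAD~1 --name-only | while IFS= read -r f; do
  case "$f" in
    tests/*|*_test.*|*test*)                level="skim" ;;
    *.lock|*.toml|*.yaml|*.yml|Dockerfile*) level="full read (config/deps)" ;;
    *migrations/*|*.sql)                    level="full read + EXPLAIN" ;;
    *auth*|*security*|*permission*)         level="full read + security review" ;;
    *)                                      level="three-file protocol + spot check" ;;
  esac
  printf '%-50s %s\n' "$f" "$level"
done
```

The point is not the patterns; it is that the triage decision happens before you open the diff, not while you are scrolling it.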
The Question Nobody Answered Yet
All five papers describe the problem. Wink, Atomix, and OpaqueToolsBench propose partial solutions. None of them answer the question that matters most: how do you measure cognitive debt?
Technical debt has proxies: cyclomatic complexity, test coverage, dependency age. Cognitive debt has no equivalent metric. My evidence gate asks “does the developer understand what changed?” but enforces the answer through self-report, which is exactly the verification method the Confidence Mirage failure mode exploits.
A useful metric would track the delta between what the agent changed and what the developer can explain. File count is a crude proxy. Diff complexity (not line count, but semantic change density) is better. The ideal metric would correlate with the probability of a production incident caused by a misunderstanding of agent-generated code. Nobody has built that metric yet. A crude ratio of files changed to files reviewed approximates the debt, but a ratio isn’t a threshold. We don’t yet know where the line between “manageable debt” and “incident waiting to happen” falls.
Until someone builds that metric, the practical answer is the same one that predates AI agents: read the code. The agent’s velocity makes reading every line impractical. The three-file protocol, behavioral taxonomies, and transactional boundaries reduce the volume of code that requires human attention. The cognitive debt that remains after those filters is the debt that matters.
Key Takeaways
- Five independent research groups converged on the same problem in one week. When unrelated teams arrive at the same conclusion simultaneously, the underlying problem is structural, not theoretical.
- Cognitive debt is the bottleneck, not code quality. Agents produce correct code faster than developers can understand it. Tests, linters, and quality gates reduce the problem but cannot eliminate it.
- Start with the three-file protocol. After every agent session, fully read the three files with the largest diffs. Build additional layers (PR summaries, query checks, commit rewrites, named failure modes) one per week.
- Name the failure modes. Wink’s taxonomy and Jiro’s named failure modes serve the same purpose: making invisible problems visible. If your agent system doesn’t have names for its failure patterns, you can’t detect them.
- Add friction at irreversible boundaries. Transactional tool calls (Atomix) and multi-agent consensus (deliberation) both add cost at the points where mistakes are most expensive. The cost is worth it.
FAQ
What is cognitive debt in software development?
Cognitive debt is the gap between what code does and what developers understand about the code. Margaret-Anne Storey articulated the concept to distinguish it from technical debt, which lives in the codebase. Cognitive debt lives in the developer's head. AI coding agents accelerate cognitive debt because they produce working code faster than developers can read, review, and internalize it.
How do you detect cognitive debt accumulating?
Five practical signals: you can't explain the last agent PR without re-reading it, git log shows 20+ files touched per session, you review diffs by scrolling rather than reading, commit messages describe what changed but not why, and you feel productive but can't list what changed. The ratio of files modified to files reviewed is the simplest quantitative proxy.
Should developers review every line an AI agent writes?
Reviewing every line is impractical at agent output speeds. The three-file protocol provides a practical alternative: after every agent session, fully read the three files with the largest diffs. Selective review guided by risk fills the remaining gap. Routine changes with high test coverage need less scrutiny. Architecture changes, security modifications, database queries, and irreversible actions need full review. Deng et al.'s Type A/Type B failure classification provides the framework: Type A failures (missing tools, bad prompts) are caught by automated checks. Type B failures (judgment gaps) require human review.
What is the minimum viable intervention for cognitive debt?
Start with the three-file protocol: after every agent session, run git diff HEAD~1 --stat | sort -t'|' -k2 -rn | head -3 to find the three largest changed files, then read each file completely (not the diff, the full file in context). Add one layer per week: PR summary sentences, EXPLAIN ANALYZE on modified queries, commit message rewrites from "what" to "why," and named failure modes for recurring patterns.
References
1. Storey, Margaret-Anne, “How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt.” Referenced via Simon Willison, February 15, 2026. simonwillison.net.
2. Nanda, Rahul, et al., “Wink: Recovering from Misbehaviors in Coding Agents,” arXiv:2602.17037, February 2026. arxiv.org.
3. Mohammadi, Bardia, et al., “Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows,” arXiv:2602.14849, February 2026. arxiv.org.
4. Hallinan, Skyler, et al., “OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction,” arXiv:2602.15197, February 2026. arxiv.org.
5. Deng, Gelei, et al., “What Makes a Good LLM Agent for Real-world Penetration Testing?” arXiv:2602.17622, February 2026. arxiv.org.
6. Author’s Jiro quality system evidence gate. Six criteria: follows codebase patterns, simplest working solution, edge cases handled, tests pass, no regressions, solves the actual problem. Implementation in Why My AI Agent Has a Quality Philosophy.
7. Author’s named failure mode taxonomy. Seven modes documented in the Jiro quality system, enforced by 95 hooks across the Claude Code toolchain. See Quality Philosophy for the full taxonomy and detection methods.
8. Agent output measured from git diff --stat across 30 Ralph loop sessions in January-February 2026, averaging 140-200 meaningful lines per minute (excluding whitespace, imports, and boilerplate). Human baseline estimated from author’s own pre-agent commit history: 20-40 lines per minute during focused coding sessions. These numbers are illustrative and vary by task type.
9. Author’s post-session verification logs, tracked in ~/.claude/state/verification/. 23 false positives caught across approximately 400 agent sessions from May 2025 through February 2026 (5.75% false-positive rate on agent self-reported test status).