Anthropic Measured What Works. My Hooks Enforce It.
Anthropic analyzed 9,830 conversations on Claude.ai during a single week in January 2026.[1] The analysis used CLIO, a privacy-preserving tool that classifies conversation patterns without reading individual messages. The researchers tracked 11 observable behaviors from a 24-behavior AI Fluency Framework developed by Professors Rick Dakan and Joseph Feller.[2]
The core finding: 85.7% of conversations include iteration and refinement. Iterative conversations exhibit 2.67 fluency behaviors on average, roughly double the 1.33 in non-iterative conversations. Users in iterative conversations are 5.6x more likely to question the model’s reasoning and 4x more likely to identify missing context.[1]
Iteration is the variable that separates effective AI use from mediocre AI use. Anthropic measured it. The question is whether iteration happens consistently or only when humans remember to do it.
TL;DR
Anthropic’s AI Fluency Index found that iterative refinement doubles fluency markers across 9,830 conversations. The “artifact paradox” explains why iteration does not happen by default: when the model produces polished output, users become more directive but less evaluative. Fact-checking drops 3.7 percentage points. Identifying missing context drops 5.2 percentage points. Questioning reasoning drops 3.1 percentage points. The cognitive mechanism is processing fluency: polished output triggers automatic trust (Kahneman’s System 1) and suppresses critical evaluation (System 2). A quality loop forces the iteration that single-pass skips: mandatory review, evidence gate, integration check, and repeat until all criteria cite evidence. The hooks enforce what Anthropic measured. The model cannot skip iteration because the infrastructure requires it.
What Anthropic Measured
The AI Fluency Index tracks observable behaviors, not subjective quality assessments. The framework defines AI fluency as “the ability to work effectively, efficiently, ethically, and safely within emerging modalities of Human-AI interaction.”[2] The 24 behaviors span four dimensions: Delegation, Description, Discernment, and Diligence. Eleven are directly observable in conversation. The remaining thirteen occur outside the chat interface (evaluating output in production, sharing results with colleagues, verifying against external sources).
The 11 observable behaviors include iteration and refinement, questioning reasoning, identifying missing context, clarifying goals, specifying formats, providing examples, and fact-checking. The researchers classified each conversation against this taxonomy using Claude Sonnet as the analysis model.
Three findings matter for engineering infrastructure.
Finding 1: Iteration is the strongest signal. 85.7% of conversations include at least some iteration. Conversations with iteration show roughly double the fluency behaviors (2.67 vs 1.33). Users who iterate are 5.6x more likely to question reasoning and 4x more likely to identify gaps.[1] Iteration is not a nice-to-have. It is the behavior most strongly associated with effective AI use.
Finding 2: Polished output suppresses evaluation. 12.3% of conversations involved artifact generation (code, documents, interactive tools). When the model produces artifacts, users become more directive: clarifying goals increased 14.7 percentage points, specifying format increased 14.5 points, providing examples increased 13.4 points. But evaluation dropped: identifying missing context fell 5.2 points, fact-checking fell 3.7 points, questioning reasoning fell 3.1 points.[1] Users directed better but evaluated less.
Finding 3: Few users set up collaboration. Only 30% of conversations included explicit collaboration instructions such as “push back if my assumptions are wrong” or “tell me what I’m missing.”[1] The default mode is delegation, not dialogue. Most users treat the model as an executor rather than a collaborator.
The Artifact Paradox
Anthropic named the pattern but did not name the mechanism. Cognitive science has a precise term for it: processing fluency.
Processing fluency is the subjective experience of ease or difficulty associated with a mental task. Alter and Oppenheimer documented that stimuli which are semantically primed, visually clear, or easy to process are judged as more true, more reliable, and more trustworthy, regardless of actual accuracy.[3] Oppenheimer showed that the fluency heuristic operates automatically: people use processing ease as a proxy for quality without conscious awareness.[4]
Kahneman’s System 1/System 2 framework explains why. System 1 processes information automatically, associating cognitive ease with truth. System 2 engages deliberate analysis but requires effort and motivation. Polished AI output is high in processing fluency. The code compiles. The formatting is clean. The explanation is coherent. System 1 flags it as “good” before System 2 has a chance to evaluate whether it is correct.[5]
Kahneman identified the specific failure mode: “It’s a mistake for people to have confidence in a judgment because it made for a good story when in fact confidence should be based on the quality and quantity of evidence.”[5] Replace “good story” with “clean code” and the artifact paradox is WYSIATI (What You See Is All There Is) applied to AI-generated output.
West et al. formalized the complementary finding from the model’s side. In a pair of papers, one presented at ICLR 2024, they demonstrated that generative models acquire production capabilities that exceed their evaluation capabilities.[6] The model generates expert-level code in seconds while making errors no human expert would make. The model cannot reliably evaluate its own output because generation and evaluation are separate capabilities that scale differently.
The paradox compounds: the model produces polished output it cannot properly evaluate, and the human, encountering that polish, reduces their own evaluation. Neither party verifies. Both assume correctness. Jeff Gothelf captured the organizational version: “Part of productivity gains comes from the appearance of quality in AI-produced output. It looks good, it looks polished, it looks done.”[7]
The artifact paradox is not a user education problem. Education helps, but Anthropic’s data shows that even users who iterate (85.7% of conversations) evaluate less when artifacts are present. The drop in fact-checking and missing-context identification occurs across the population, not just among novice users. The mechanism is cognitive, not informational. Knowing about the bias does not eliminate it.
Infrastructure eliminates it.
Mapping Findings to Infrastructure
Each Anthropic finding maps to a specific infrastructure component. The table below shows the chain from measured behavior to enforcement mechanism.
| Fluency Behavior | Anthropic’s Finding | Infrastructure Fix | Implementation |
|---|---|---|---|
| Iteration and refinement | 2x fluency markers when present | Mandatory quality loop | 7-step loop: implement, review, evaluate, refine, zoom out, repeat, report. A hook blocks completion if any step is skipped. |
| Questioning reasoning | 5.6x more likely in iterative conversations | Evidence Gate | 6 criteria requiring specific proof. “I’m confident” is not evidence. Hedging phrases trigger a block. |
| Identifying missing context | 4x more likely; -5.2pp with artifacts | Zoom Out step | Mandatory caller search, import verification, and integration test before completion. |
| Fact-checking | -3.7pp with artifacts | Independent test runner | Test suite runs after every code change. Agent cannot self-report test results. |
| Collaboration instructions | Only 30% of conversations | Auto-injected context | 9 hooks fire on every prompt, injecting date, branch, conventions, and explicit instructions to push back on assumptions. |
The hooks enforce what Anthropic measured. The model does not need to remember to iterate because the infrastructure requires it. The user does not need to remember to include collaboration instructions because hooks inject them on every prompt. Fact-checking does not depend on the user’s processing fluency because an independent test runner reports results regardless of how polished the code appears.
Why Single-Pass Fails
The quality loop described in Jiro Quality Philosophy runs seven steps: implement, review, evaluate, refine, zoom out, repeat, report. A single-pass agent executes step 1 and step 7, skipping five steps in between. Anthropic’s data quantifies what each skipped step costs.
Skipping review means the agent never re-reads its own output. The review step catches the class of errors that the agent would notice if it looked again: typos, unclear naming, off-by-one errors. Without review, these errors reach the completion report undetected.
Skipping evaluate means the Evidence Gate never runs. The six criteria (follows codebase patterns, simplest solution, edge cases handled, tests pass, no regressions, solves actual problem) never receive evidence. The completion report contains claims, not proof. Anthropic’s finding that iteration doubles fluency markers maps directly here: the Evidence Gate forces iteration by blocking reports that lack evidence.
Skipping refine means discovered issues are deferred, not fixed. A TODO comment replaces a solution. Deferred Debt accumulates. GitClear measured the downstream effect: refactoring-associated changes fell from 25% to under 10% of all changes in AI-assisted codebases, while code duplication rose from 8.3% to 12.3%.[8]
Skipping zoom out means the agent never checks integration. The function works. The callers break. Tunnel Vision goes undetected. Anthropic’s -5.2 percentage point drop in “identifying missing context” when artifacts are present describes the same failure from the human side: polished output makes context gaps invisible.
Skipping repeat means one pass through the Evidence Gate is treated as sufficient. The first pass catches issues. Fixing those issues may introduce new ones. Without a second pass, the quality of the fix is unverified. The quality loop iterates until all six criteria cite evidence on the same pass. Single-pass never reaches that standard.
From the harness data: the average quality loop catches 3.2 issues per non-trivial change. Of those, 1.1 are caught during review (step 2), 0.8 during evaluate (step 3), and 1.3 during zoom out (step 5). A single-pass agent would ship all 3.2 issues. The review-evaluate-zoom-out sequence eliminates them before the completion report.
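A Stop-style hook can make the loop non-skippable. A minimal sketch, assuming the agent appends a `STEP:<name>` marker line to a state file as it completes each step; the state-file path and marker format are hypothetical, not part of the measured harness:

```bash
#!/bin/bash
# quality-loop-gate.sh — sketch: block completion until every loop step is logged.
# Assumes the agent appends "STEP:<name>" lines to a state file (hypothetical format).

loop_gate() {
  local state_file="$1"
  local step
  for step in implement review evaluate refine zoom-out repeat report; do
    if ! grep -q "^STEP:${step}$" "$state_file" 2>/dev/null; then
      # Any missing step blocks the completion report.
      printf '{"decision":"block","reason":"Quality loop step missing: %s"}\n' "$step"
      return 1
    fi
  done
  echo '{"decision":"allow"}'
}
```

Registered to fire before the completion report, the gate returns a block decision whenever any of the seven step markers is absent, which is exactly the enforcement a single-pass agent lacks.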
The Hedging Language Problem
Anthropic found that iterative users are 5.6x more likely to question the model’s reasoning.[1] Hedging language is the inverse signal: the model expressing certainty without being questioned.
Hedging phrases include “should work,” “I’m confident,” “looks correct,” “probably fine,” and “I believe.” Each phrase substitutes a prediction for an observation. “Tests should pass” means the agent predicts the outcome. “14 tests passed, 0 failed” means the agent observed the outcome. The distinction is the difference between Phantom Verification and actual verification.
Xiong et al. found that LLMs express confidence in the 80-100% range regardless of actual accuracy, with GPT-4’s failure prediction barely above random guessing (AUROC 62.7%).[9] Kadavath et al. at Anthropic found that models are well-calibrated on familiar tasks but struggle on novel ones, with the qualifier “mostly” concealing systematic blind spots.[10] Verbalized confidence is not correlated with correctness. A model saying “I’m confident” provides zero information about whether the code works.
The hedge detector catches this pattern. A grep hook fires on every completion report and searches for configurable hedging patterns. The presence of hedging language without adjacent test output or file path citations triggers a block. The model must replace the hedge with evidence. The hook enforces the questioning behavior that Anthropic found in iterative conversations, except it operates deterministically on every conversation, not probabilistically on 85.7%.
```bash
#!/bin/bash
# hedge-detector.sh — block hedging without evidence
INPUT=$(cat)
HEDGES=$(echo "$INPUT" | grep -ciE '\bshould (work|pass|be fine)\b|\bprobably\b|\blooks correct\b|\bI.m confident\b')
if [ "$HEDGES" -gt 0 ]; then
  echo '{"decision":"block","reason":"Hedging language detected. Replace with test output or file paths."}'
else
  echo '{"decision":"allow"}'
fi
```
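The pattern itself can be smoke-tested in isolation. This wraps the same grep expression used by hedge-detector.sh in a function; the function name exists only for the demo:

```bash
# Demo: the grep expression from hedge-detector.sh, wrapped for direct testing.
detect_hedges() {
  echo "$1" | grep -ciE '\bshould (work|pass|be fine)\b|\bprobably\b|\blooks correct\b|\bI.m confident\b'
}

detect_hedges "14 tests passed, 0 failed."   # observation, no hedge → prints 0
detect_hedges "The fix should work now."     # prediction, hedge → prints 1
```

The first sentence is an observation and passes; the second is a prediction and trips the detector.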
What You Can Implement Today
Anthropic’s data points to three minimum interventions that capture the highest-value fluency behaviors.
One collaboration hook. Inject instructions on every prompt that tell the model to push back on assumptions, identify missing context, and question its own reasoning. Anthropic found only 30% of users do this manually. A hook does it on 100% of prompts. Five lines of bash.
```bash
#!/bin/bash
# inject-collaboration.sh — force evaluative behavior
echo "INSTRUCTIONS: Push back if assumptions are wrong."
echo "Identify missing context before proceeding."
echo "Question your own reasoning at each step."
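Wiring the script in is one registration in Claude Code’s settings.json. The shape below follows the documented hook-registration format; the script path is an example:

```json
{
  "hooks": {
    "UserPromptSubmit": [
      {
        "hooks": [
          { "type": "command", "command": ".claude/hooks/inject-collaboration.sh" }
        ]
      }
    ]
  }
}
```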
One evidence gate. Block completion reports that use hedging language instead of citing evidence. The gate operationalizes Anthropic’s finding that iteration doubles fluency markers by making iteration mandatory. The model cannot report “done” without evidence for each quality criterion.
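A reduced sketch of such a gate, checking just two of the six criteria: a cited test result and a cited file path. The regexes are illustrative, not the full gate:

```bash
#!/bin/bash
# evidence-gate.sh — sketch: block a "done" report that cites no evidence.
# A real gate would map each of the six criteria to its own required evidence.

evidence_gate() {
  local report="$1"
  # Fact-checking evidence: an explicit pass count, not a prediction.
  if ! echo "$report" | grep -qE '[0-9]+ (tests? )?passed'; then
    echo '{"decision":"block","reason":"No test results cited."}'
    return
  fi
  # Located-change evidence: at least one concrete file path.
  if ! echo "$report" | grep -qE '[A-Za-z0-9_./-]+\.(sh|py|ts|go|rs|js)\b'; then
    echo '{"decision":"block","reason":"No file paths cited."}'
    return
  fi
  echo '{"decision":"allow"}'
}
```

“Done. Probably fine.” blocks; “Updated src/gate.sh; 14 tests passed, 0 failed.” passes.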
One independent verifier. Run the test suite after every code change and inject results into the conversation. The verifier addresses the artifact paradox directly: regardless of how polished the output looks, the test results report what actually works. Fact-checking does not depend on human processing fluency because the hook automates it.
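A verifier in this spirit can be sketched as a wrapper that runs whatever test command the project uses (a placeholder here) and reports the observed outcome verbatim, never a summary written by the agent:

```bash
#!/bin/bash
# verify.sh — sketch: run the test suite and report what was observed.
# "$@" is the project's real test command (placeholder), e.g. a pytest or npm invocation.

run_verifier() {
  local output
  if output=$("$@" 2>&1); then
    echo "TESTS PASSED"
    echo "$output" | tail -n 5    # raw runner output, not a paraphrase
  else
    echo "TESTS FAILED"
    echo "$output" | tail -n 20
  fi
}
```

Because the hook injects the runner’s own output, the report cannot be more polished than the result it describes.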
The three hooks together enforce the three behaviors Anthropic found most strongly associated with effective AI use: iteration, questioning reasoning, and identifying missing context. Each hook is deterministic. Each fires on every interaction. None depends on the user remembering to activate it.
Anthropic measured what works. The hooks make it non-optional.
Sources
1. Kyle Swanson, Daniel Bent, Zack Ludwig, Rick Dakan, and Joseph Feller, “Anthropic Education Report: The AI Fluency Index,” Anthropic Research, February 23, 2026, anthropic.com. 9,830 anonymized conversations on Claude.ai, January 20-26, 2026. Iterative conversations: 2.67 fluency behaviors (vs 1.33 non-iterative). 5.6x more likely to question reasoning. Artifact paradox: -5.2pp missing context identification, -3.7pp fact-checking.
2. Rick Dakan and Joseph Feller, “The AI Fluency Framework,” developed in collaboration with Anthropic, supported by Ireland’s Higher Education Authority, 2025, aifluencyframework.org. 24 behaviors across 4 dimensions: Delegation, Description, Discernment, Diligence. Definition: “The ability to work effectively, efficiently, ethically, and safely within emerging modalities of Human-AI interaction.”
3. Adam L. Alter and Daniel M. Oppenheimer, “Uniting the Tribes of Fluency to Form a Metacognitive Nation,” Personality and Social Psychology Review 13(3), 219-235, 2009, doi.org. Processing fluency causes stimuli to be judged as more true, reliable, and trustworthy regardless of actual accuracy.
4. Daniel M. Oppenheimer, “The Secret Life of Fluency,” Trends in Cognitive Sciences 12(6), 237-241, 2008, sciencedirect.com. The fluency heuristic operates automatically: processing ease serves as a proxy for quality without conscious awareness.
5. Daniel Kahneman, Thinking, Fast and Slow (New York: Farrar, Straus and Giroux, 2011). System 1 (automatic) associates cognitive ease with truth. System 2 (deliberate) requires effort. WYSIATI: conclusions drawn from available information without accounting for what is missing. Quote: “It’s a mistake for people to have confidence in a judgment because it made for a good story when in fact confidence should be based on the quality and quantity of evidence.”
6. Peter West et al., “The Generative AI Paradox: What It Can Create, It May Not Understand,” ICLR 2024, arxiv.org; Peter West et al., “The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate,” arXiv:2402.06204, 2024, arxiv.org. Generative capabilities exceed evaluation capabilities. Models produce expert-level output they cannot reliably self-evaluate.
7. Jeff Gothelf, “Confusing Artifact Quality with Certainty,” 2024, jeffgothelf.com. “Part of productivity gains comes from the appearance of quality in AI-produced output. It looks good, it looks polished, it looks done.”
8. William Harding and Matthew Kloster, “Coding on Copilot: 2023 Data Suggests Downward Pressure on Code Quality,” GitClear, January 2024, gitclear.com. 153 million changed lines. Refactoring fell from 25% to under 10%. Code duplication rose from 8.3% to 12.3%.
9. Miao Xiong, Zhiyuan Hu, Xinyang Lu, et al., “Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs,” ICLR 2024, arxiv.org. LLMs express confidence in the 80-100% range regardless of accuracy. GPT-4 failure prediction AUROC: 62.7%.
10. Saurav Kadavath, Tom Conerly, Amanda Askell, et al., “Language Models (Mostly) Know What They Know,” Anthropic, arXiv:2207.05221, July 2022, arxiv.org. Well-calibrated on familiar tasks, systematic blind spots on novel tasks.
11. CodeRabbit, “State of AI vs. Human Code Generation Report,” December 17, 2025, coderabbit.ai. 470 PRs analyzed. AI-authored: 1.7x more issues, 1.75x more logic errors. Polished output masks higher defect rates.
12. Laura Tacho, “AI-Assisted Engineering: Q4 Impact Report,” DX, November 4, 2025, getdx.com. 121,000+ developers. Productivity plateaued at ~10% despite 91% adoption.
13. Author’s analysis. Quality loop described in “Jiro Quality Philosophy.” Hook system described in “Anatomy of a Claw.” Failure modes described in “What Actually Breaks When You Run AI Agents Unsupervised.” 10% wall explained in “The 10% Wall.”