Metacognitive AI: Teaching Your Agent Self-Evaluation
I told my agent to fix a failing test. The agent read the error, identified the assertion mismatch, changed the expected value to match the actual output, and reported: “Test fixed. All tests pass.” It was right. The test passed. The fix was also completely wrong.
The test was failing because the function returned incorrect data. The agent “fixed” the test by making it expect the wrong answer. It followed my instruction perfectly: fix the failing test. What I meant was: fix the code that the test is testing. The agent had no mechanism to distinguish between these two interpretations because nothing in its instruction set asked it to evaluate why a test fails before deciding how to fix it.
That gap has a name. It’s the gap between action-level instructions and metacognitive instructions. I use “metacognitive” as a functional analogy, not a cognitive science claim: LLMs do not have internal monitoring processes the way humans do. What they have is instruction-following behavior that can be directed toward self-evaluation. The framework works because it changes which instructions the agent follows during verification, not because the agent develops genuine self-awareness. Most people only write the first kind of instruction.
TL;DR
Two levels of AI agent instruction exist. Action-level instructions tell the agent what to do: “validate inputs,” “write tests,” “follow RESTful conventions.” Metacognitive instructions tell the agent how to evaluate whether it’s doing it well: “if you find yourself saying should instead of did, you haven’t verified,” “if three fixes fail, stop and question the architecture,” “confidence is not evidence.” Most agent configurations contain exclusively action-level instructions. The metacognitive layer separates an agent that produces plausible output from one that produces correct output. I have run a production metacognitive system for nine months with seven named failure modes, a six-criterion evidence gate, and hedging language detection enforced through 95 hooks.
The Two Levels of Agent Instruction
Every agent instruction operates at one of two levels.
Action-level instructions define behavior:
# Action-level examples
- Use type hints on all functions
- Write tests for edge cases
- Follow RESTful conventions for API endpoints
- Validate all user input at boundaries
Action-level instructions are necessary. They tell the agent what correct behavior looks like. But they share a structural limitation: they assume the agent will execute them faithfully. They don’t account for how the agent evaluates its own compliance.
Metacognitive instructions define self-monitoring:
# Metacognitive examples
- If you catch yourself thinking "just try changing X and see if it works" — STOP.
That's a signal to investigate, not guess.
- If you've searched the same files three times — you're stuck.
Step back and question your assumptions.
- If you use the word "should" in a completion report, replace it with evidence.
Run the command. Paste the output.
- After three failed fixes, stop fixing. The problem is architectural.
The distinction matters because action-level instructions tell the agent what the destination looks like. Metacognitive instructions tell the agent how to detect when it’s heading in the wrong direction. One prevents wrong actions. The other prevents wrong reasoning: the patterns of thought that produce wrong actions in the first place.
The obra/superpowers project on GitHub explores this territory through skills that teach agents to watch their own reasoning for failure signals.1 The insight: most skills operate on the action level (do X, don’t do Y). The metacognitive level operates differently (notice when you’re about to do Y).
The False Evidence Table
The single most effective metacognitive tool I’ve built is a table that defines what does NOT count as evidence.2
When I tell an agent “verify your work,” the agent produces verification. But the verification is often a restatement of intention, not a demonstration of outcome. “Tests should pass.” “The implementation follows best practices.” “I’m confident this is correct.” Each of these statements sounds like evidence. None of them is evidence.
The false evidence table pre-blocks specific shortcuts by naming them:
| Claim | Required Evidence | NOT Sufficient (False Evidence) |
|---|---|---|
| “Tests pass” | Test output pasted with 0 failures | “Tests should pass” or “I ran them earlier” |
| “Follows patterns” | Name the pattern AND the file where it exists | “I followed best practices” |
| “Simplest solution” | Name rejected alternatives and why | “It’s clean” |
| “Edge cases handled” | List each edge case and its handling | “I considered edge cases” |
| “No regressions” | Name the files/features checked | “Nothing else should be affected” |
| “Solves the problem” | State the user’s need and how this addresses it | “It implements the feature” |
The third column is where the value lives. Without it, the agent fills the second column with plausible-sounding restatements of its own confidence. With it, the table names and blocks each specific shortcut before the agent takes it.3
The table goes beyond typical prompt engineering. It creates a self-monitoring framework rather than a behavioral directive. The table doesn’t tell the agent what to do differently. It tells the agent what to watch for in its own output. The agent monitors its own responses against the NOT Sufficient column and, when it detects a match, knows to replace the shortcut with actual evidence.
The pattern scales. Any domain-specific claim can be added. For security reviews: “No vulnerabilities” requires “specific vulnerability classes checked and findings,” not “I reviewed the code.” For accessibility: “WCAG compliant” requires “axe or Lighthouse audit output,” not “I checked the contrast.”
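The check against the NOT Sufficient column can also be made mechanical. Here is a minimal sketch, with an illustrative phrase list rather than my production Evidence Gate, of what string-level false-evidence detection looks like:

```python
# Minimal false-evidence check: scan a completion report for phrases the
# table marks as NOT Sufficient. The phrase list is illustrative, not the
# production Evidence Gate.
FALSE_EVIDENCE = {
    "Tests pass": ["tests should pass", "i ran them earlier"],
    "Follows patterns": ["best practices"],
    "No regressions": ["nothing else should be affected"],
    "Edge cases handled": ["i considered edge cases"],
}

def find_false_evidence(report: str) -> list[tuple[str, str]]:
    """Return (claim, shortcut phrase) pairs found in a completion report."""
    lowered = report.lower()
    return [
        (claim, phrase)
        for claim, phrases in FALSE_EVIDENCE.items()
        for phrase in phrases
        if phrase in lowered
    ]

if __name__ == "__main__":
    report = "Implemented the endpoint. Tests should pass. I followed best practices."
    for claim, phrase in find_false_evidence(report):
        print(f"'{claim}': shortcut detected ('{phrase}') -- provide actual evidence")
```

String matching catches only the named shortcuts; a reworded shortcut slips through, which is why the table itself, loaded into context, does most of the work.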
Named Failure Modes as Metacognitive Guardrails
Humans have named cognitive biases: confirmation bias, anchoring, Dunning-Kruger. The names matter. Once you can name the bias, you can watch for it. AI agents need the same vocabulary for their failure patterns.
I documented seven failure modes my agent exhibited repeatedly, gave each a name, and added detection signals:4
| Failure Mode | What It Looks Like | Detection Signal |
|---|---|---|
| Shortcut Spiral | Skipping verification steps to report faster | Completion report without evidence for each step |
| Confidence Mirage | “I’m confident” replacing actual verification | Hedging language in the report |
| Good-Enough Plateau | Working code that isn’t clean, tested, or documented | Hesitation when asked quality questions |
| Tunnel Vision | Polishing one function while breaking adjacent code | “Nothing else affected” without checking |
| Phantom Verification | Claiming tests pass without running them now | Evidence from a previous session |
| Deferred Debt | Leaving TODO/FIXME/HACK in committed code | Any such comment in the diff |
| Hollow Report | “Done” without citing specifics | Completion report missing evidence for any criterion |
The names make the failures detectable. Without them, the agent produces a Confidence Mirage and neither the agent nor the user recognizes it as a pattern. With them, the instruction becomes: “If you catch yourself exhibiting any named failure mode, STOP and restart from the Evaluate step.”
The monitoring is metacognitive in the precise sense: the agent watches its own cognitive process (am I skipping verification? am I using confidence as a substitute for evidence?) rather than its output (is this code correct?). The monitoring shapes generation as part of the agent’s reasoning process, which is why it catches errors that output-level review misses.
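Some detection signals are mechanical enough to script directly. Deferred Debt, for example, is visible in the diff itself. A minimal sketch, assuming a unified diff on stdin and not the production hook:

```python
# Minimal Deferred Debt check: flag TODO/FIXME/HACK markers on lines a
# unified diff adds. Illustrative sketch, not the production hook.
import re
import sys

DEBT_MARKERS = re.compile(r"\b(TODO|FIXME|HACK)\b")

def deferred_debt(diff_text: str) -> list[str]:
    """Return added lines that carry a debt marker."""
    return [
        line
        for line in diff_text.splitlines()
        if line.startswith("+")
        and not line.startswith("+++")
        and DEBT_MARKERS.search(line)
    ]

if __name__ == "__main__":
    hits = deferred_debt(sys.stdin.read())
    for line in hits:
        print(f"Deferred Debt: {line.strip()}")
    sys.exit(1 if hits else 0)
```

The semantic signals (hesitation, tunnel vision) still rely on the agent monitoring its own output against the table; the scriptable ones are worth scripting because they cost nothing to run on every diff.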
Anthropic’s own reference skill implementations support this approach. In my analysis of the official Claude Code skills, prohibitions (“NEVER X”) proved more effective than suggestions (“consider Y”) because they name the specific evasion rather than the general action.5 Named failure modes are specific prohibitions: “NEVER exhibit Phantom Verification” outperforms “always run tests” because it blocks the evasion rather than restating the action.
Hedging Language Detection
The simplest metacognitive monitor I implemented detects specific words in agent output:
Red flag words: should, probably, seems to, likely, I believe,
I'm confident, looks correct, appears to
Any time the agent uses one of these words in a completion report, the word itself is evidence of insufficient verification.6 “Tests should pass” means the agent didn’t run them. “It seems to work” means the agent eyeballed it. “I’m confident” means the agent is substituting internal state for external evidence.
The implementation is mechanical. The hook system intercepts the agent’s output and flags hedging language. The agent then replaces the hedging word with the verification it should have performed:
- “Tests should pass” becomes: runs tests, pastes output showing 0 failures
- “It looks correct” becomes: cites the specific assertion or check that confirms correctness
- “I’m confident” becomes: lists the evidence that creates that confidence
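The detection step itself is a few lines. A minimal sketch of the word check, mirroring the red-flag list above (hook wiring omitted; this is not the production implementation):

```python
# Minimal hedging-language detector. The list mirrors the red-flag words
# above; detection only, hook wiring is a separate concern.
import re

HEDGES = [
    "should", "probably", "seems to", "likely", "i believe",
    "i'm confident", "looks correct", "appears to",
]

def hedging_phrases(report: str) -> list[str]:
    """Return the hedging words or phrases present in a completion report."""
    lowered = report.lower()
    return [h for h in HEDGES if re.search(rf"\b{re.escape(h)}\b", lowered)]

if __name__ == "__main__":
    report = "Tests should pass. I'm confident the validation looks correct."
    for hedge in hedging_phrases(report):
        print(f"Hedge detected: '{hedge}' -- replace with the verification it stands in for")
```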
The pattern comes from obra’s verification-before-completion work.1 The cognitive science parallel is well-established: in human metacognition, self-report accuracy (“I understand this”) correlates poorly with actual understanding.3 People who say “I get it” often don’t. People who can explain it usually do. The same applies to AI agents: an agent that can cite specific evidence understands the problem. An agent that says “I’m confident” may not.
The Three-Fix Circuit Breaker
Metacognition isn’t only about detecting bad reasoning. It’s also about detecting when to stop.
The three-fix escalation rule: if three attempted fixes for the same problem have failed, the agent must stop and question the architecture fundamentally.7 Not try a fourth fix. Not search for a different angle on the same approach. Stop. Step back. Ask whether the problem is in the solution or in the problem definition.
The rule acts as a circuit breaker for debugging loops. Without it, agents enter what I call the Fix Spiral: each failed attempt produces a new theory, which produces a new fix, which fails, which produces a new theory. The agent can cycle through dozens of fixes without ever questioning whether it’s working on the right problem. The 23-agent runaway incident was a width version of this: agents achieved consensus on the wrong question because no metacognitive check asked “is this question correctly scoped?”
The three-fix rule works because it forces a phase transition in the agent’s reasoning. Fixes 1-3 operate within the current mental model. Fix 4 (which is actually “stop fixing”) forces the agent to construct a new mental model. In practice, this usually means re-reading the error from the beginning, re-examining the assumptions, or asking the user for clarification, all of which are more productive than a fourth guess.
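The bookkeeping behind the rule is trivial, which is part of why it works: the agent, or a wrapper around it, only has to count. A sketch with a hypothetical FixTracker helper, not the production system:

```python
# Minimal three-fix circuit breaker: count failed fix attempts per issue
# and trip after the third. Hypothetical helper, not the production system.
from collections import defaultdict

class FixTracker:
    """Track failed fix attempts and signal when to stop and re-evaluate."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self.attempts: dict[str, int] = defaultdict(int)

    def record_failure(self, issue_id: str) -> bool:
        """Record a failed fix. Returns True when the circuit breaker trips."""
        self.attempts[issue_id] += 1
        return self.attempts[issue_id] >= self.limit

    def resolved(self, issue_id: str) -> None:
        """Clear the counter once the issue is actually fixed."""
        self.attempts.pop(issue_id, None)

if __name__ == "__main__":
    tracker = FixTracker()
    for attempt in range(1, 5):
        if tracker.record_failure("flaky-auth-test"):
            print(f"Attempt {attempt}: STOP. Question the architecture, not the fix.")
            break
        print(f"Attempt {attempt}: try another fix within the current model.")
```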
Before and After: What Changes
Here’s a concrete example. I asked an agent to add input validation to an API endpoint.
Without metacognitive instructions:
Added Pydantic model with field validators for email format, password
length (8+), and username pattern (alphanumeric). Tests should pass.
The implementation follows FastAPI best practices.
The output contains three hedging signals: “should pass” (didn’t run tests), “follows best practices” (didn’t name which practices), and no evidence of edge case consideration. A Confidence Mirage wrapped in a Hollow Report.
With metacognitive instructions (same task, same agent):
Added Pydantic model with field validators:
- Email: regex validation + max length 254 (RFC 5321)
- Password: min 8, max 128, requires 1 uppercase + 1 digit
- Username: alphanumeric + underscore, 3-30 chars
Test output (just ran):
tests/test_auth.py::test_valid_registration PASSED
tests/test_auth.py::test_invalid_email_rejected PASSED
tests/test_auth.py::test_short_password_rejected PASSED
tests/test_auth.py::test_empty_username_rejected PASSED
tests/test_auth.py::test_unicode_username_rejected PASSED
5 passed in 0.34s
Edge cases handled: empty string (returns 422), null (returns 422),
unicode injection in username (rejected by alphanumeric pattern),
overlength email (RFC limit enforced).
Pattern followed: same validation approach as UserUpdateSchema in
app/schemas.py line 45.
The difference is not that the second agent wrote better code. Both agents may have written identical code. The difference is that the second agent verified its work against specific evidence criteria and reported the evidence rather than its confidence.
Building Your Own Metacognitive Layer
The framework is portable. You don’t need my specific system. You need three components:
1. A False Evidence Table. Define what DOESN’T count as proof for the claims your agent makes most often. Start with the six criteria above and add domain-specific rows. The third column (NOT Sufficient) is where the value lives.
2. Named Failure Modes. Document the three to five ways your agent fails most often. Give each a name. Add detection signals. Include the instruction: “If you catch yourself exhibiting any named failure mode, stop and re-evaluate.”
3. Hedging Language Detection. List the specific words that signal insufficient verification in your domain. Add the instruction: “Replace any hedging word with the evidence that would eliminate the hedge.”
These three components compose into a metacognitive layer that sits on top of any action-level instructions. The action-level instructions define what correct behavior looks like. The metacognitive layer defines how the agent detects its own departure from correct behavior.
The implementation can be as simple as adding a section to your CLAUDE.md or AGENTS.md:
## Self-Monitoring
### When to stop and re-evaluate
- If you've searched the same files 3+ times: you're stuck.
- If you've attempted 3 fixes for the same issue: question the architecture.
- If you use "should" or "probably" in your response: replace with evidence.
### What doesn't count as evidence
[your false evidence table here]
### Named failure modes to watch for
[your failure modes here]
Whether the enforcement happens through hooks (deterministic, can’t be skipped), rules files (loaded into context), or inline instructions (reliant on model compliance) determines the reliability of the metacognitive layer. Hooks are strongest because they intercept at the tool-use level, not the prompt level. But even prompt-level metacognitive instructions noticeably improve agent output quality, because they change the agent’s evaluation criteria rather than only its actions.
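As a rough illustration of hook-level enforcement, the hedging check from earlier can run as an external command that fails loudly when a report contains red-flag words. The wiring is an assumption here: how the report reaches the script and which exit code blocks depend on your hook runner’s contract, so check the Claude Code hooks documentation before copying this.

```python
#!/usr/bin/env python3
# Sketch of hook-level enforcement: read a completion report on stdin and
# exit nonzero if hedging language is present, so the hook runner can block.
# The input format and blocking exit code are assumptions -- adapt both to
# your hook mechanism's actual contract.
import sys

# Substring match kept deliberately simple; see the word-boundary version earlier.
HEDGES = ["should", "probably", "seems to", "i believe", "looks correct", "i'm confident"]

def main() -> int:
    report = sys.stdin.read().lower()
    found = [h for h in HEDGES if h in report]
    if found:
        print(
            f"Hedging language detected: {', '.join(found)}. "
            "Replace each hedge with the evidence that eliminates it.",
            file=sys.stderr,
        )
        return 2  # nonzero: signal the runner to reject the report
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Because the script runs outside the model’s context window, it cannot be deprioritized the way a prompt instruction can, which is the whole argument for hook-level enforcement.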
Exercise: Audit your last three agent interactions. Copy the completion reports. Highlight every instance of “should,” “probably,” or “I’m confident.” For each highlighted word, write the specific evidence that would have replaced the hedge. If the evidence exists but the agent did not cite it, the agent needs hedging detection. If the evidence does not exist, the agent skipped verification entirely — it needs the false evidence table.
The ordering matters: start with the false evidence table (highest impact, easiest to implement), then add named failure modes once you’ve observed recurring patterns, then add hedging detection as a mechanical check. Each layer catches errors the previous layer misses. Together they form a feedback loop: the agent monitors its own output, detects known failure patterns, and self-corrects before reporting completion.
Minimal Reproducible Setup
You can test the core metacognitive pattern in five minutes. Create a file called .claude/CLAUDE.md (or add to your existing one) with this content:
## Evidence Gate
Before reporting any task as complete, provide evidence for ALL criteria:
| Claim | Required Evidence | NOT Sufficient |
|-------|------------------|----------------|
| "Tests pass" | Paste test output showing 0 failures | "Tests should pass" |
| "Follows patterns" | Name the pattern AND file where it exists | "Best practices" |
| "Edge cases handled" | List each case and handling | "Considered edge cases" |
## Red Flag Words
If your completion report contains: should, probably, seems to,
I believe, looks correct, I'm confident — STOP.
Replace the word with the evidence that eliminates the hedge.
## Circuit Breaker
After 3 failed fixes for the same issue, STOP.
Question whether you are solving the right problem.
Ask the agent to perform any non-trivial task (e.g., “add input validation to this endpoint”). Compare the completion report with and without the evidence gate. The difference is visible on the first task: without the gate, the agent reports confidence. With the gate, the agent reports evidence.
The setup above is a subset of the full system. It omits named failure modes and hook-based enforcement but demonstrates the core mechanism: defining what does NOT count as evidence changes the agent’s verification behavior immediately.
Production Results After Nine Months
The metacognitive system has been running across 12 active projects since May 2025. Measuring its impact precisely is difficult because there is no control group (I cannot run the same nine months of development without the system). The directional signals are clear:
Before metacognitive instructions (first three months, May-July 2025): The agent regularly produced Confidence Mirages — reports that said “tests pass” without running them, “follows patterns” without naming which patterns. I caught these during manual review, which meant every agent completion required full human verification. The agent was a drafting tool, not an autonomous one.
After the false evidence table (August 2025): False evidence claims dropped to near zero in agent completion reports. A representative example: commit 3a8f2c1 (September 2025, blakecrosley.com) shows an agent completion report citing specific test output, file paths reviewed, and edge cases handled — the same type of task that previously produced “tests should pass” and “follows best practices.” The table names and blocks each specific shortcut. The agent cannot say “tests should pass” when the instruction explicitly lists that phrase as NOT Sufficient. The mechanical check works because it matches string patterns, not semantic intent.
After named failure modes (October 2025): Debugging loops shortened. Before the three-fix circuit breaker, the agent would attempt 8-12 fixes for a single problem before I intervened. After the circuit breaker, the agent escalates or asks for clarification after three attempts. The escalation usually leads to a correct diagnosis faster than a fourth guess would have.
After hedging detection via hooks (December 2025): The hook system intercepts hedging language mechanically, which means the agent cannot skip the check even under complex multi-step tasks where prompt-level instructions sometimes get deprioritized. The combination of all three layers (table + failure modes + hooks) produces agent output that I review for correctness rather than for completeness — the agent reliably reports what it actually did rather than what it intended to do.
The system does not eliminate errors. It eliminates a specific class of errors: the class where the agent’s self-assessment diverges from its actual performance.
The pattern has independent corroboration from multiple sources. Anthropic’s constitutional AI work trains models against self-evaluation criteria rather than only external feedback.9 obra’s superpowers project independently discovered that “verification before completion” skills change agent behavior more than action-level instructions.1 OpenAI’s process reward models (PRMs) found that evaluating each reasoning step independently produces more reliable outputs than evaluating only the final answer — the same principle the false evidence table applies to agent verification steps.10 Shinn et al.’s Reflexion framework (2023) demonstrated that LLMs given structured self-reflection prompts improved task accuracy by 20-30% on coding benchmarks compared to simple retry, establishing that the format of self-evaluation matters as much as its presence.11 The false evidence table in the present system is a manually specified version of the same principle: define the evaluation criteria explicitly, and the agent’s output quality improves because its self-monitoring improves.
What Metacognition Cannot Do
Metacognitive programming makes AI agents more reliable. It does not make them wise.
The false evidence table catches specific shortcuts. It doesn’t catch novel shortcuts the table doesn’t name. Named failure modes detect known patterns. They don’t detect patterns that haven’t been named yet. Hedging language detection catches surface-level confidence substitution. It doesn’t catch an agent that has genuinely convinced itself (in whatever sense “convinced” applies) that wrong output is correct.
More fundamentally, metacognitive instructions approximate taste but don’t produce it. The Jiro system can prevent `except: pass` and force test evidence. It cannot determine whether the architecture is right, whether the naming captures intent, or whether the solution addresses the actual problem versus the stated one. Those judgments require the kind of contextual reasoning that current models approximate but don’t reliably perform.
Someone replied to one of my tweets about the Jiro system with an observation I’ll paraphrase: you’re basically trying to teach the loop restraint, taste, and something approximating a moral pause, the very qualities the base Ralph pattern explicitly optimizes against in the name of throughput.8
They were right. Metacognitive programming is structural scaffolding for qualities the machine doesn’t have. The scaffolding is load-bearing. Without it, the machine produces Confidence Mirages at scale. With it, the machine produces verified output at scale. The gap between those two outcomes is the difference between an agent you can trust to run overnight and an agent you need to babysit.
But the scaffolding is not the building. The building — taste, judgment, the ability to know when the right answer to a question is a different question — remains human. The metacognitive layer makes that limitation visible rather than hiding it behind confident prose.
Key Takeaways
For engineers building agent systems:
- Write metacognitive instructions, not only action-level instructions. Action-level instructions define correct behavior. Metacognitive instructions define how the agent detects its own departure from correct behavior. The second kind is what separates plausible output from verified output.
- Name your agent’s failure modes. Once a failure pattern has a name (Confidence Mirage, Phantom Verification, Shortcut Spiral), the agent can watch for it. Unnamed failures repeat indefinitely.
For teams scaling AI-assisted workflows:
- Build a false evidence table before scaling. Define what does NOT count as proof for each claim your agent makes. The third column (NOT Sufficient) pre-blocks the specific shortcuts agents take when asked to “verify.”
- Hedging language is a reliable signal. Any time an agent says “should,” “probably,” or “I’m confident” in a completion report, the agent has not performed the verification it claims. Detect and replace mechanically.
The Metacognitive Audit
Want to evaluate your own agent instructions? The interactive tool below analyzes any CLAUDE.md, AGENTS.md, or system prompt and scores it across the metacognitive dimensions described in this post.
Paste your agent instructions, and the audit will identify: what percentage of your instructions are action-level versus metacognitive, which named failure modes are covered, whether hedging language detection exists, and where the gaps are.
FAQ
What is the difference between action-level and metacognitive AI agent instructions?
Action-level instructions define correct behavior: “use type hints,” “write tests,” “validate input.” Metacognitive instructions define how the agent monitors its own compliance with those behaviors: “if you use the word ‘should’ in a verification report, you haven’t actually verified,” “if three fixes fail, question the architecture.” Action-level instructions prevent wrong outputs. Metacognitive instructions prevent the wrong reasoning patterns that produce wrong outputs. Most agent configurations contain only the first kind, which is why agents produce plausible but unverified work.
How do I build a metacognitive layer for an AI coding agent?
Start with the false evidence table. List the five claims your agent makes most often (“tests pass,” “follows patterns,” “handles edge cases”) and define what does NOT count as proof for each. The third column (NOT Sufficient) blocks specific shortcuts before the agent takes them. Add named failure modes after you observe the same error pattern three times. Add hedging language detection last as a mechanical check. Each layer catches errors the previous layer misses. The entire system can start as a markdown section in your CLAUDE.md and grow from there.
Does metacognitive programming work with models other than Claude?
The principles are model-agnostic. Any language model that follows instructions can benefit from false evidence tables, named failure modes, and hedging language detection. The specific enforcement mechanism varies: Claude Code supports hooks that intercept agent output at the tool-use level (deterministic, cannot be bypassed). Other frameworks may rely on system prompts or post-processing scripts. The enforcement strength determines reliability — prompt-level metacognitive instructions improve output quality, but hook-level enforcement guarantees the check runs every time.
What are signs that an AI agent needs metacognitive instructions?
Four signals: (1) The agent reports “done” but the work contains obvious errors (a Hollow Report, where the agent skipped verification). (2) The same type of mistake recurs across sessions, an unnamed failure mode repeating because nothing watches for it. (3) The agent’s completion reports use hedging language (“should work,” “looks correct”) instead of citing evidence — a Confidence Mirage. (4) The agent enters long debugging loops without questioning its approach — a missing circuit breaker. If any of these patterns appear more than twice, the agent needs metacognitive scaffolding.
Part of the Claude Code Mastery series. The hooks system implements the enforcement layer described here. Context engineering as architecture explains how the metacognitive layer fits within the broader seven-layer context system.
1. obra/superpowers on GitHub. The superpowers project includes skills for systematic debugging and verification that teach Claude Code agents to monitor their own reasoning patterns rather than only their outputs. The metacognitive framing in the present article draws on this approach. github.com/obra/superpowers
2. The false evidence table structure draws on the verification-before-completion pattern in the obra/superpowers project. I adapted the concept into the Evidence Gate, a six-criterion verification system enforced through hooks. See the Jiro quality philosophy post for the full implementation.
3. The third column (NOT Sufficient) addresses what the academic literature calls “metacognitive illusions”: cases where an agent’s self-assessment of its own performance diverges from actual performance. In cognitive science, this is well-documented: students who rate themselves as “understanding” material often perform poorly on tests of that material. Dunning, D., Johnson, K., Ehrlinger, J., & Kruger, J. (2003). Why people fail to recognize their own incompetence. Current Directions in Psychological Science, 12(3), 83-87. doi.org/10.1111/1467-8721.01235
4. The seven named failure modes emerged from nine months of production use. Each was documented after observing the pattern at least three times across different projects and task types. The full system is described in Why My AI Agent Has a Quality Philosophy.
5. Author’s analysis of official Claude Code skills available through the Claude Code CLI skill system. Prohibitions (“NEVER X”) proved more effective than suggestions (“consider Y”) because they name the specific evasion. The observation that mindset-oriented skills outperform procedural guides in adoption is based on community reports in the Claude Code Discord and GitHub discussions, not a controlled study. See Anthropic’s Claude Code documentation: docs.anthropic.com/en/docs/claude-code.
6. The insight that hedging language signals insufficient verification has empirical support beyond agent systems. Nisbett, R.E. and Wilson, T.D. (1977). “Telling More Than We Can Know: Verbal Reports on Mental Processes.” Psychological Review, 84(3), 231-259. doi.org/10.1037/0033-295X.84.3.231. Nisbett and Wilson found that verbal self-reports of reasoning processes are often confabulated — people describe plausible-sounding explanations rather than accurate accounts of their actual reasoning. The same dynamic appears in agent output: “I’m confident this is correct” is a plausible-sounding self-report, not evidence of verification.
7. The three-fix escalation rule functions as a circuit breaker pattern applied to debugging. The pattern is analogous to the circuit breaker in distributed systems (Nygard, M. Release It!, 2007, Pragmatic Bookshelf): fail fast, escalate, try a different approach. After three failed attempts within the same mental model, continuing on the same path yields diminishing returns.
8. Paraphrased from a reply to @blakecrosley on X, February 2026. The original tweet discussed the tension between the Ralph loop’s velocity optimization and the Jiro system’s quality friction. The responder’s observation that the base loop “explicitly optimizes against restraint in the name of throughput” accurately describes the design tension the metacognitive layer addresses.
9. Anthropic, “Constitutional AI: Harmlessness from AI Feedback,” 2022. arxiv.org/abs/2212.08073. The constitutional AI approach trains the model to evaluate its own outputs against a set of principles. The metacognitive system described here is a runtime version of the same idea: instead of training-time self-evaluation, the false evidence table and named failure modes provide evaluation criteria at inference time through the instruction set.
10. Lightman, H., Kosaraju, V., Burda, Y., et al., “Let’s Verify Step by Step,” arXiv, 2023. arxiv.org/abs/2305.20050. OpenAI’s process reward model research demonstrated that evaluating intermediate reasoning steps (process supervision) outperforms evaluating only the final answer (outcome supervision) on mathematical problem-solving. The false evidence table applies the same principle at the instruction level: requiring evidence for each verification claim rather than accepting a final “done” report.
11. Shinn, N., Cassano, F., Gopinath, A., et al., “Reflexion: Language Agents with Verbal Reinforcement Learning,” NeurIPS 2023. arxiv.org/abs/2303.11366. Reflexion demonstrated 20-30% accuracy improvements on HumanEval and ALFWorld benchmarks by giving LLM agents structured self-reflection prompts after failures. The framework validates the core insight of the present system: the format and specificity of self-evaluation instructions matter as much as their presence.