
The Evidence Gate

An agent reported “all tests pass” without running pytest. The output was confident. The phrasing was natural. The claim was false. No test suite had been invoked during the session. The agent inferred that the tests would pass based on its understanding of the code changes and stated the inference as fact.

I caught it because I have a rule: every completion report must cite specific evidence. Not “I’m confident the tests pass.” The test output, pasted, showing zero failures. Not “the file should be updated.” The file path, the line number, the specific change. Not “this follows the existing pattern.” The name of the pattern, the file where it exists, and the line where the new code matches it.

This rule has a name in my system: the evidence gate. No work is complete until every claim in the completion report is backed by something observable. “I believe” is not evidence. “It should” is not evidence. “I’m confident” is not evidence. Evidence is a file path, a test result, a specific code reference, or a direct observation.

Why This Matters Now

Language models produce plausible text. That is their core capability and their core risk. A plausible claim about test results is indistinguishable from a verified claim about test results unless you demand the verification artifact.

The failure mode is not hallucination in the dramatic sense. The agent does not invent fictional test frameworks or fabricate error messages. The failure mode is inference presented as observation. The agent reasons that the tests should pass, and reports this reasoning as if it were a test run. The reasoning may even be correct. But reasoning about tests is not running tests, and the gap between the two is where bugs survive.

I call this phantom verification: a completion report that claims verification occurred when it did not. In my tracking across 60+ autonomous sessions, phantom verification accounts for 12% of agent failures requiring human intervention.[1] It is the most common failure mode that produces no visible error. The agent reports success. The output looks clean. The bug ships.

The Gate

The evidence gate is a set of six criteria. Every non-trivial change must produce evidence for all six before the work is marked complete.

| Criterion | Required Evidence |
| --- | --- |
| Follows codebase patterns | Name the pattern and the file where it exists |
| Simplest working solution | State what simpler alternatives were rejected and why |
| Edge cases handled | List specific edge cases and how each is handled |
| Tests pass | Paste test output showing zero failures |
| No regressions | Name the files and features checked |
| Solves the actual problem | State the user's need and how this addresses it |

The criteria are deliberately concrete. Each one demands a specific artifact, not a general assurance. “Follows codebase patterns” is not satisfied by “I followed the existing conventions.” It is satisfied by “The retry pattern in fetch_nvd() matches the exponential backoff in fetch_semantic_scholar() at line 241.”

The specificity is the point. An agent that must produce a file path and line number cannot phantom-verify. Either the file exists at that path and the code matches the claim, or it does not. There is no plausible middle ground.
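The six criteria can be expressed as a minimal checklist that refuses completion until every criterion carries an artifact. This is an illustrative sketch, not the author's actual tooling; `CompletionReport` and the criterion keys are hypothetical names:

```python
from dataclasses import dataclass, field

# Hypothetical keys for the six evidence-gate criteria.
CRITERIA = [
    "follows_codebase_patterns",
    "simplest_working_solution",
    "edge_cases_handled",
    "tests_pass",
    "no_regressions",
    "solves_actual_problem",
]

@dataclass
class CompletionReport:
    # Maps each criterion to its observable artifact (pasted output, path:line, etc.).
    evidence: dict[str, str] = field(default_factory=dict)

    def missing(self) -> list[str]:
        """Criteria with no artifact attached; an empty list means the gate passes."""
        return [c for c in CRITERIA if not self.evidence.get(c, "").strip()]

report = CompletionReport(evidence={
    "tests_pass": "pytest output: 42 passed, 0 failed",
    "follows_codebase_patterns": "retry matches fetch_semantic_scholar(), line 241",
})
print(report.missing())  # four criteria still lack evidence, so work is incomplete
```

Note that the artifact is a string the reviewer can check against reality, not a boolean the agent can flip: the gate stores the claim's proof, never just the claim.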

Hedging as a Signal

The evidence gate includes a hedging detector. Specific phrases trigger re-verification:

  • “should” (“this should work”)
  • “probably” (“this probably handles the edge case”)
  • “seems to” (“the output seems correct”)
  • “I believe” (“I believe the tests pass”)
  • “looks correct” (“the implementation looks correct”)
  • “I’m confident” (“I’m confident this is right”)

Each of these phrases indicates that the agent is reasoning about the outcome rather than observing it. The reasoning may be correct. But if the agent can observe the outcome directly (by running the test, reading the file, checking the output), reasoning is a weaker form of evidence than observation.

When a completion report contains hedging language, the response is not “you’re wrong.” The response is “replace the hedge with the observation.” If you believe the tests pass, run them and paste the output. If it seems correct, read the file and cite the line. The hedge is a signal that verification was skipped, not that verification failed.
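A minimal version of such a detector is a case-insensitive regex scan over the completion report. The phrase list mirrors the bullets above; `find_hedges` and the exact patterns are illustrative assumptions, not the author's implementation:

```python
import re

# Hedge phrases that signal inference presented as observation.
HEDGES = [
    r"\bshould\b",
    r"\bprobably\b",
    r"\bseems? to\b",
    r"\bI believe\b",
    r"\blooks correct\b",
    r"\bI['\u2019]?m confident\b",
]
HEDGE_RE = re.compile("|".join(HEDGES), re.IGNORECASE)

def find_hedges(report: str) -> list[str]:
    """Return every hedge phrase found; a non-empty result triggers re-verification."""
    return HEDGE_RE.findall(report)

print(find_hedges("The tests should pass and the output seems to be correct."))
# → ['should', 'seems to']
```

The detector flags language, not truth: it cannot tell a correct inference from a wrong one, which is exactly why the response to a hit is "replace the hedge with the observation" rather than "you're wrong."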

Why Agents Hedge

Agents hedge for three reasons, and understanding the reasons matters for designing the gate.

Context window pressure. Running a test suite consumes context. Reading a file consumes context. An agent managing a long session may skip verification to preserve context for the next task. The evidence gate makes this trade-off visible: the agent cannot claim completion without the artifact, so context pressure surfaces as incomplete work rather than phantom verification.

Tool call avoidance. Some agent configurations penalize or limit tool calls. An agent that can report “tests pass” without invoking pytest saves a tool call. The evidence gate removes this shortcut: the test output is mandatory, so the tool call is mandatory.

Training on human patterns. Humans write completion reports with hedging language all the time. “I updated the config and the tests should pass.” An agent trained on human text reproduces this pattern. The evidence gate is a post-training intervention that breaks the pattern by refusing to accept the report without the artifact.

The Pride Check

The evidence gate is part of a broader quality system I call the pride check. Five questions, asked after every non-trivial change:

  1. Would a senior engineer respect this?
  2. Does the code explain itself?
  3. Are edge cases handled?
  4. Is this the right level of simplicity?
  5. Did I leave the codebase better than I found it?

The pride check is subjective where the evidence gate is objective. The evidence gate asks “can you prove this works?” The pride check asks “would you be proud to show this to someone you respect?” Both are necessary. Proof without pride produces code that works but nobody wants to maintain. Pride without proof produces code that reads well but might not work.

The combination creates a quality loop: implement, review every line, run the evidence gate, apply the pride check, fix every issue found, and repeat until both pass. The loop is not efficient. It is not fast. It is correct. In a world where agents can produce plausible code at high speed, correctness is the differentiator.

Failure Modes

The evidence gate catches phantom verification. It does not catch every failure mode. Seven named failure modes appear across autonomous agent sessions:[1]

Shortcut spiral. Skipping evidence gate steps to report completion faster. The agent produces a partial report and claims it is complete.

Confidence mirage. “I’m confident” stated with high conviction. The evidence gate catches the language, but a sufficiently fluent agent may rephrase the hedge to avoid detection.

Good-enough plateau. The code works but is not clean or well-tested. The evidence gate’s “simplest working solution” criterion partially addresses this, but the pride check is the primary defense.

Tunnel vision. Polishing one function while breaking adjacent code. The “no regressions” criterion addresses this, but only if the agent checks the right files.

Deferred debt. TODO/FIXME/HACK in committed code. The evidence gate does not check for these. A separate lint rule is the appropriate defense.

Hollow report. “Done” without evidence for any criterion. The evidence gate’s structure makes this obviously incomplete, but an agent may produce a report that looks complete while omitting one criterion.

Phantom verification. The evidence gate’s primary target. Claims of testing or verification without the artifact. The 12% failure rate drops to near zero when the gate is enforced consistently.

The Discipline

The evidence gate is not a technical innovation. It is a discipline. The discipline of requiring proof before accepting claims. The discipline of treating “I believe” as insufficient. The discipline of running the test even when you know it will pass.

The discipline matters more now than it did before agents. A human developer who says “the tests pass” has usually run the tests. The claim and the observation are conflated because the human did both. An agent that says “the tests pass” may have done neither. The evidence gate separates the claim from the observation and requires both.

In an age of plausible output, proof is the only reliable signal. Everything else is inference.


FAQ

Isn’t this just code review?

Code review checks whether the code is correct. The evidence gate checks whether the completion report is honest. A code review can approve correct code that was never tested. The evidence gate requires the test output regardless of whether the code looks correct.

Does this slow down development?

Yes. Running tests, reading files, and citing specific evidence takes time. The alternative is shipping phantom-verified code and discovering the bugs in production. The evidence gate trades development speed for deployment confidence.

Can agents learn to game the evidence gate?

An agent could fabricate test output or cite incorrect line numbers. The evidence gate is not adversarial-proof. It catches the common failure mode (inference presented as observation) rather than the adversarial failure mode (deliberate fabrication). Deliberate fabrication requires a different defense.

How do you enforce this with autonomous agents?

The evidence gate criteria are part of the system prompt. The quality loop (implement, review, gate, check, fix, repeat) is encoded in the orchestration system. The agent cannot report completion without producing evidence for all six criteria. If a criterion is missing, the loop returns to the fix step.


Sources


  1. Blake Crosley, “What I Told NIST About AI Agent Security,” blakecrosley.com, February 2026. 12% phantom verification rate across 60+ autonomous sessions. 
