Reward the Tool Before the Answer

Blake Crosley April 24, 2026 13 min read

ai agents reinforcement-learning claude-code research engineering

From the guide: Claude Code Comprehensive Guide

An agent that returns “All tests pass. The refactored query produces identical results to the original” without a single test invocation in its tool log is the structural failure pattern that any orchestrator running tools learns to detect, name, and gate. The completion sentence references work the agent never did. The reasoning in the session log can be sound, the SQL can look correct, and the report can still be a costume the model sewed for a tool call that did not happen.

session log, tool-call grep:
  tool:read           app/db/queries.py
  tool:edit           app/db/queries.py
  tool:read           tests/test_queries.py
  [no tool:bash entries matching pytest]
  [no tool:bash entries at all]

The pattern recurs across agent runtimes. The model writes an answer-shaped string about test passage, query confirmation, file coordination, or coherent refactor. The tool log, checked independently, does not contain the call the answer claims. If the work had been subtly wrong in an edge case the model’s reasoning did not cover, the bug would have shipped behind a completion report claiming verification.

The orchestrator should not score the answer when the tool call that was supposed to produce it did not happen. The answer is not the unit of quality. The pair (tool-call, answer) is the unit of quality. If the tool half is missing, the answer half is unscoreable.

The rule is straightforward to encode at the scaffolding layer. Grep the completion report for hedging language (should pass, I believe, likely, I’m confident, appears to), cross-reference against the session’s tool-call log, and if the report makes a tool-dependent claim without a matching tool invocation, demand cited evidence before allowing the session to close.

TL;DR

A completion report is not scoreable unless the tool call it depends on actually ran.
Four failure modes share the same shape: fluent answer text with missing or invalid tool evidence.
The fix is to grade tool calls before answers: deterministic evidence first, verdict second.

Four Answer-Shaped Failure Modes

The four modes share a shape. The model’s answer is a plausible summary of what a competent agent would have done. The model’s tools, checked independently, do not support the summary. The answer shape works because the grader in the loop accepts language that mentions the right verbs.

Phantom verification. The completion report claims tests passed without any test-runner call in the session’s bash invocations. The detection rule reads completion reports against the tool-call log; a claim like all tests pass with no tool:bash entry matching a test-runner invocation fails closed.

Malformed tool scenery. A report says I queried the table and confirmed the index is in use, and the tool log shows a psql call that exited with status 2 because the database name was wrong. The output of that call is empty. The agent reads the empty output, decides it means the query succeeded silently, and reports the silence as confirmation. The exit-code gate fails closed on any non-zero exit status from bash tool calls cited in the completion report.¹

Skipped dependency. A report names a coordinated change across several files: I updated the migration and the tests. The migration file appears in the edit log; the test file appears only in the completion-report sentence. No tool:read on the test file occurred. The file-read audit asserts that any file named in the completion report must appear in the tool-call log as read or written.

Summary laundering. Three small edits across three unrelated areas of the codebase, reported as a coherent story: I cleaned up the logic, improved the error messages, and added retries. Viewed in the tool log, the three edits have no thematic relationship. The drift detector computes cosine similarity between the original task description and the completion-report summary; a drop below a threshold fires a manual-review flag.

Each mode is an answer that looks right plus a tool call that did not happen, or a tool call that happened but did not produce the evidence the answer claims. The fix lives at the same layer in every case. The orchestrator decides whether the answer is scoreable, not whether it is correct. The decision is one-directional: if the tool evidence is missing, the answer is not scoreable and the session gets flagged for human review. If the tool evidence is present, the answer can then be evaluated. The orchestrator refuses to collapse the two questions into one.

Evidence Before Verdict: The Jiro Gate Is The Spine

The Jiro Quality Philosophy names the gate that the four hooks above are four implementations of: quality claims require evidence, not feelings.² The scaffolding-layer rule follows directly. No answer is scoreable unless the tool call that produced it produced evidence. The evidence is the gate. The gate is one-directional.

Every detector above is the gate at a different substrate. Hedging-language detection is the gate at the natural-language layer. Exit-code checking is the gate at the shell layer. File-read auditing is the gate at the filesystem layer. Narrative-drift detection is the gate at the embedding layer. Four substrates, one rule, one direction. If evidence fails, verdict is refused. If evidence holds, verdict proceeds. There is no composition in the other direction; no amount of confident-sounding verdict text is allowed to manufacture evidence retroactively.

The Steve Test is the gate one altitude up: would Blake sign his name to this?³ The question is not does the answer look right. The question is would Blake sign his name to the answer. The signature requires evidence the answer is grounded in verified tool calls. An answer that skipped the tool is not signable because there is no gate to point to when the answer turns out to be wrong in production.

Minimum Worthy Product closes the frame.⁴ Minimum is a scope constraint, not a quality discount. A minimum completion report is a report. A minimum worthy completion report has tool-call evidence behind every claim. Cutting scope is not license to cut evidence. Answer-shaped failures are the scope-cut without evidence-cut pathology at the agent-output layer.

What The Adjacent Literature Already Says

The scaffolding-layer rule has predecessors at the training layer that name the same shape. ReAct (Yao et al., 2022) interleaves reasoning traces with tool actions and shows that grounding chains of thought in tool calls beats free-form reasoning on tool-using benchmarks.⁵ Toolformer (Schick et al., 2023) trains models to insert tool calls into their own outputs through a self-supervised loop where the supervision signal is whether the inserted call reduces downstream loss.⁶ OpenAI’s Let’s Verify Step by Step (Lightman et al., 2023) shows that process-level supervision on reasoning steps beats outcome-level supervision when the reasoning chains are long.⁷ Each of these is a different angle on the same general claim: graders that reward only the final answer leave the model free to fake the steps in between.

The scaffolding rule is the runtime, deterministic version of that claim. Where ReAct interleaves reasoning with action, the rule asserts that the action must actually have happened. Where Toolformer trains tools into the output distribution, the rule asserts that the inserted tool call must have produced evidence the answer cites. Where process supervision rewards reasoning steps, the rule rewards the deterministic side-effects of those steps: exit codes, schema validation, file-write paths.

A Tool-Supervised RL Paper Names The Gradient Shape

Researchers at Northeastern University and Amazon AGI published Visual Reasoning through Tool-supervised Reinforcement Learning on arXiv in April 2026.⁸ Their setup trains a multimodal model on three visual tool families spanning five operations (zoom-in, rotate, flip, draw line, draw point) with two reward schedules: joint (one reward signal blending tool quality and answer quality) and sequential (a stage-one reward on tool quality, then a stage-two reward on answer quality after the tool-supervision stage). Both stages run for the same number of GRPO updates (200 each, per the paper’s training details). The sequential curriculum beats the joint schedule on most reported benchmarks, with the exact margin varying by dataset. The authors name the joint training failure mode optimization conflicts among heterogeneous tasks.⁸

The training-level failure rhymes with the scaffolding-level one. When the reward signal asks for an answer, the optimizer finds whatever local minimum satisfies the reward with the least work. The cheapest local minimum is a well-formed-looking answer with underspecified tool calls. The scaffolding layer calls that phantom verification. The training literature calls it specification gaming.⁹ Skalse and coauthors gave the general class formal treatment: reward hacking emerges when the optimization target is a proxy that does not perfectly track the true reward.¹⁰

The visual tools the Amazon and Northeastern authors chose are not incidental. Each has cheap deterministic ground truth: did the zoom center on the correct region, did the rotate apply the right angle, did the draw hit the right coordinates. The stage-one reward can score these without reference to the final answer. The same condition is what the exit-code gate exploits at the scaffolding layer. Bash status 0 is deterministic evidence that the process completed without reporting an error; status 127 is deterministic evidence that the intended binary was not found.¹¹ JSON schema validation is deterministic evidence for the output matched the expected shape. File-write-path assertion is deterministic evidence for the write landed at the expected location. Wherever deterministic supervision is free, the evidence gate can hold the line without involving the model in its own grading.

The paper is one of the cleaner gradient-form demonstrations of the rule with a two-stage fix. The scaffolding version of the rule is older and broader: any system that uses tools and gets graded on answers ends up needing some version of it. Different substrate, related shape. Evidence first, verdict second, no composition in the other direction.

Three Reads For Operators Who Will Never Train A Model

The paper ports to scaffolding design even if training is out of scope.

Grade tool calls and answers on separate tracks. An orchestrator that blends tool-quality and answer-quality into one score pushes the agent to satisfy whichever side is cheaper. Keep retry budgets on tools separate from quality scores on answers. If a tool call was malformed, do not let the text that followed it contribute to the answer’s score.¹¹¹

Use deterministic tool supervision where it is free. Exit codes. JSON schema validators. File-write-path assertions. Output-shape tests. The paper’s tool families exist partly because their ground truth is cheap; in production, the same cheap ground truth shows up in exit codes and schemas. Ship those gates. Every deterministic assertion in the pre-answer path closes one row in the failure taxonomy above.¹¹

Sequence before mix. A subagent that does tool-only work (lint, type-check, format, test) before a second subagent that produces the answer runs the paper’s two-stage curriculum at the orchestration layer. Deterministic rather than learned. Cheaper to ship than a custom training run. No learned reward-convergence problem at that layer, though the second subagent can still produce a bad answer; the rule cuts the failure mode that mixes the two.¹²

The harder case covers tools whose correctness is not ground-truthable without human judgment: code-writing, prose-writing, search queries, SQL. The stage-one reward in those domains is not free. The noisy case responds to degraded signals: syntax checks, test pass/fail, search-result-quality proxies. Imperfect, but the structural benefit of separated objectives remains. A two-stage curriculum on a noisy stage-one signal, benchmarked against a one-stage curriculum on the same signal, would tell us whether the separation-as-invariant holds up under production conditions or collapses when the ground truth gets soft.

Until that research lands, the scaffolding layer carries the load. Reliable orchestrators tend to encode some version of this rule. Sometimes as a hook. Sometimes as a retry budget. Sometimes as a subagent-dispatch rule. Always as the refusal to score the answer when the tool did not run.

Reward the tool before the answer, or the answer becomes a costume for a tool that never ran. The four failure modes are four cuts of that same shape. The ToolsRL paper rhymes with the scaffolding rule at the gradient layer. The fix at both altitudes lines up around one direction. Evidence first. Verdict second. The gate refuses to compose otherwise.

FAQ

What is phantom verification in AI agents?

Phantom verification is when an agent reports that verification happened even though the tool call never ran. A completion report saying all tests pass with no test-runner invocation in the tool log is the canonical case. The fix is to compare tool-dependent claims against the tool-call log before scoring the answer.

Why should tool calls be graded before answers?

Tool calls should be graded first because answers can imitate evidence. If an answer claims tests passed, a query ran, or a file changed, the orchestrator needs deterministic proof that the relevant tool ran and succeeded. Only then is the answer scoreable. The rule keeps fluent text from manufacturing confidence after the fact.

What are answer-shaped failures?

Answer-shaped failures are plausible completion reports whose language matches the expected result but whose tool evidence does not support the claim. The post names four: phantom verification, malformed tool scenery, skipped dependency, and summary laundering. Each looks normal until the report is checked against reads, writes, exit codes, and task history.

How does tool-supervised reinforcement learning relate to agent orchestration?

Tool-supervised reinforcement learning separates reward for tool quality from reward for final answer quality. The orchestration version is deterministic: score the tool call first with exit codes, schemas, file assertions, or logs, then score the answer. Both systems avoid blended rewards where the model can satisfy the grader with a good-looking answer and weak tool use.

References

Anthropic, “Hooks reference,” code.claude.com docs. PreToolUse, PostToolUse, UserPromptSubmit, and the lifecycle taxonomy that exit-code gates implement against. ↩↩
Author’s analysis in The Jiro Quality Philosophy. Evidence gate: quality claims require evidence, not feelings. ↩
Author’s analysis in The Steve Test. “Would I sign my name to this?” as the taste gate above Jiro’s evidence gate. ↩
Author’s analysis in Minimum Worthy Product. Minimum as scope constraint; worthy as quality bar. ↩
Shunyu Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models,” arXiv:2210.03629, 2022. Interleaved reasoning and tool action on knowledge-intensive and decision-making tasks. ↩
Timo Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools,” arXiv:2302.04761, 2023. Self-supervised tool-use insertion via downstream-loss reduction. ↩
Hunter Lightman et al., “Let’s Verify Step by Step,” arXiv:2305.20050, 2023. Process supervision (rewarding individual reasoning steps) outperforming outcome supervision on math reasoning. ↩
Qihua Dong, Gozde Sahin, Pei Wang, Zhaowei Cai, Robik Shrestha, Hao Yang, and Davide Modolo (Northeastern University and Amazon AGI), “Visual Reasoning through Tool-supervised Reinforcement Learning,” arXiv:2604.19945, April 2026. ↩↩
Victoria Krakovna et al., “Specification gaming: the flip side of AI ingenuity,” DeepMind blog, April 2020. Baseline framing of reward hacking under misspecified objectives. ↩
Joar Skalse et al., “Defining and Characterizing Reward Hacking,” arXiv:2209.13085, 2022. Formal treatment of reward hacking as optimizing an imperfect proxy reward in MDPs. ↩
POSIX.1-2017, “Shell Command Language: Exit Status,” IEEE/Open Group. Status 127 = command not found; 126 = not executable. ↩↩↩
Anthropic, “Subagents reference,” code.claude.com docs. Subagent dispatch and scope constraints. ↩