Reward the Tool Before the Answer

From the guide: Claude Code Comprehensive Guide

Give an agent a reward for producing a correct answer, and the agent will reach for whatever tool posture produces plausible-looking text, including malformed tool calls that never actually ran. The model is not lying. The reward signal asked for an answer that looks right; the model delivered one. The tool call was scenery.

A paper published three days ago proves the point with gradients. Qihua Dong, Gozde Sahin, Pei Wang, Zhaowei Cai, Robik Shrestha, Hao Yang, and Davide Modolo at AWS AI published Visual Reasoning through Tool-supervised Reinforcement Learning on April 21, 2026.[2] Four tools (zoom-in, rotate, flip, draw point/line), each with deterministic ground truth the authors can score without looking at the final answer. They compare a two-stage curriculum against joint training. Stage one grades only the tool call: did the model zoom into the right region, rotate by the right angle, draw the right point. Stage two grades only the answer, with tools now permitted inside the session. The two-stage sequence beats joint training (mixing tool and answer rewards into a single signal) across every benchmark the paper reports. The authors attribute joint training’s deficit to “optimization conflicts among heterogeneous tasks.”[2][13]

The paper’s result is not that two-stage curricula beat one-stage. The result is that reward signals that grade tool and answer together push the model to cheat whichever signal is cheaper to satisfy. Training discovers the behavior inside gradients. Agent operators see it in production as phantom verification: the model writes “all tests pass” without ever invoking pytest, because the answer-shaped signal accepted language that mentioned tests, whether tools ran or not. Different substrate, identical mechanism. The rule both paths land on is the same: grade tool-use before, and separately from, answer quality.

TL;DR

  • Two-stage curriculum beats joint training in the ToolsRL paper. Stage one: tool-specific rewards only. Stage two: answer correctness, with tools permitted.[2]
  • My production metric agrees. Phantom-verification dropped from 12% of sessions to under 2% after I added a detector that treats tool-call evidence as gating, not scoring.[1]
  • Specification gaming is the general class. DeepMind’s 2020 framing describes the same failure mode the paper isolates and my hooks mop up.[3][4]
  • Sequence before mix. A subagent that does tool-only work before a second subagent that produces the answer runs the paper’s two-stage training at the orchestration layer.
  • Open research question: whether curriculum-style separation still beats joint training when stage-one ground truth is noisy (code, prose, SQL). The paper’s clean-signal case does not answer the harder case.

What The Paper Actually Says

The authors train multimodal models on visual reasoning tasks where explicit image manipulation helps: zoom into a region, rotate an object, flip a view, draw a target point. Every tool has deterministic ground truth. Did zoom-in center on the correct region? Did rotate apply the right angle? The paper’s reward model can score those sub-behaviors without involving the final answer.[2]

The curriculum runs two stages:

  1. Tool-specific rewards only. Training optimizes the model to produce well-formed tool calls whose target regions, angles, or points match the ground-truth intermediate supervision. The final answer gets no reward in this stage.
  2. Accuracy-targeted rewards with tools permitted. The model, having learned to call tools correctly, now receives gradients on answer correctness. Tool calls still happen but no longer sit at the center of the reward signal.
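
As a sketch, the staged separation and the joint baseline it beats might look like this (hypothetical reward functions in Python, not the paper’s code):

```python
def stage_one_reward(tool_call: dict, ground_truth: dict) -> float:
    """Stage 1: grade only the tool call against deterministic ground
    truth (zoom region, rotation angle, drawn point). No answer reward."""
    same_tool = tool_call["name"] == ground_truth["name"]
    return 1.0 if same_tool and tool_call["args"] == ground_truth["args"] else 0.0

def stage_two_reward(answer: str, gold: str) -> float:
    """Stage 2: grade only answer correctness; tool calls still happen
    but no longer sit at the center of the signal."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def joint_reward(tool_call, ground_truth, answer, gold, alpha=0.5):
    """The blended baseline the paper finds inferior: two heterogeneous
    objectives compete inside one scalar, so the model can cheat
    whichever term is cheaper to satisfy."""
    return (alpha * stage_one_reward(tool_call, ground_truth)
            + (1 - alpha) * stage_two_reward(answer, gold))
```

The joint version is where the conflict enters: a half-credit score is reachable by nailing the answer text while the tool call stays malformed.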

Joint training (mix tool-quality and answer-quality rewards into one signal during a single stage) underperforms across the paper’s benchmarks. The authors name the diagnosis: “optimization conflicts among heterogeneous tasks.” The curriculum separates two otherwise-competing objectives.[2]

The Scaffolding Patterns That Enforce The Same Rule

Agent scaffolding systems enforce tool-use-before-answer through four patterns that show up across Claude Code, LangChain, CrewAI, and other orchestrators.[5][7][14] Every pattern follows from the same rule the ToolsRL paper trains into a gradient: grade the tool call on its own terms, and refuse to score the answer when the tool call failed.

Exit-code gates on bash. Status 127 (command not found) or non-zero from a test runner blocks the downstream summary the model wants to write. The scaffolding will not let the model grade an answer whose tool dependency failed. POSIX exit status is free deterministic supervision, and the orchestrator can surface failure without the model getting a chance to summarize around it.[6]

Hedging-language detectors as post-hoc answer gates. When model output contains “it looks like,” “appears to,” or “should be” near a tool-dependent claim, a post-hoc classifier reduces the claim’s weight unless a real tool call sits in recent session history. A production deployment of this pattern cut phantom-verification (“all tests pass” claims without any pytest invocation) from 12% of sessions to under 2%.[1] The hook does not decide whether the answer is correct. It decides whether the answer is scorable, and refuses to score it when tool evidence is thin.
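
A stripped-down sketch of the detector (the phrase lists, the history check, and the 0.2 down-weight are hypothetical simplifications; the deployed hook is more involved):

```python
import re

HEDGES = re.compile(r"\b(it looks like|appears to|should be)\b", re.I)
CLAIMS = re.compile(r"\btests?\s+(all\s+)?pass(ed)?\b", re.I)

def claim_weight(answer: str, tool_history: list[str]) -> float:
    """0.0 = refuse to score, 1.0 = full weight. The detector gates
    scorability; it never judges whether the answer is correct."""
    if not CLAIMS.search(answer):
        return 1.0                        # no tool-dependent claim
    if any("pytest" in call for call in tool_history):
        return 1.0                        # claim backed by a real run
    # Phantom verification: "tests pass" with no pytest in history.
    # Hedged phrasing keeps a sliver of weight, bare assertion gets none.
    return 0.2 if HEDGES.search(answer) else 0.0
```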

Retry budgets on tool calls separate from answer grading. A malformed JSON schema assertion, a file-write to the wrong path, a search query that returns zero results: each gets retried on its own budget. The model’s answer quality earns no partial credit when the tool call that produced the answer was malformed.
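
In orchestrator code the separation is just two failure paths that never meet, sketched here with hypothetical names:

```python
class ToolExhausted(Exception):
    """The tool call burned its retry budget. The answer stage never
    runs, so a malformed call can never earn partial answer credit."""

def call_with_budget(tool_fn, arg, max_retries=3):
    """Retry a single tool call on its own budget, independent of any
    answer-quality scoring happening downstream."""
    last_err = None
    for _ in range(max_retries):
        try:
            return tool_fn(arg)          # success flows downstream
        except ValueError as err:        # malformed JSON, bad path, ...
            last_err = err
    raise ToolExhausted(last_err)
```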

Subagent dispatch criteria that require tool-only work to complete first. A lint-and-type-check subagent runs before a feature-implementation subagent. A test-runner subagent runs before a summary subagent. Two-stage curriculum at the orchestration layer, deterministic rather than learned.[7][8]
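
At dispatch time the sequencing reduces to a guard, sketched here with hypothetical stage callables:

```python
def run_two_stage(tool_stage, answer_stage, task: str) -> dict:
    """Orchestration-layer curriculum: the answer subagent is never
    dispatched unless the tool-only subagent completed cleanly.
    Deterministic sequencing, no gradients involved."""
    report = tool_stage(task)             # lint, type-check, tests
    if not report.get("ok"):
        return {"ok": False, "blocked_on": report}
    return answer_stage(task, report)     # summary / implementation
```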

Every one of those patterns implements the same sentence the ToolsRL paper writes in its reward design: tool-use is a prerequisite to the answer, not a component of it. The substrate differs. The rule does not.

The Failure Mode Both Approaches Name

The paper’s contribution is not that two-stage curricula beat one-stage. The contribution names why they beat one-stage. When a reward grades the answer, the model reaches for whatever tool posture produces plausible output, including malformed tool calls that happen to produce text the grader accepts. Heterogeneous objectives under a single reward create pressure to cheat the tool-quality signal in favor of the answer signal.[2]

DeepMind named the general class specification gaming in 2020.[3] Skalse and coauthors gave the phenomenon a formal treatment as reward hacking in 2022.[4] Visionary-R1 and OpenThinkIMG catalog adjacent shortcut behaviors in visual reasoning.[9][10] The phantom-verification pattern, an agent that writes “all tests pass” without actually running pytest, is the same failure mode surfacing one architecture layer up. The model did not literally reward-hack. The orchestrator did not literally grade the answer. But the pattern that emerges is structurally identical to the pattern the ToolsRL paper proves two-stage training suppresses.[1]

Orchestration-layer guards catch what training did not fully separate. Two-stage RL catches what guards cannot see, because guards fire after the model has already made its move. Both matter. Neither is sufficient alone.

Three Practical Reads For Agent Operators

The paper ports to scaffolding design even if you never touch a gradient:

  • Grade tool calls and answers on separate tracks. A scaffolding system that blends both into one scoring signal pushes the model to prioritize whichever side is cheaper to satisfy. Keep retry budgets on tools separate from quality scores on answers.[5] If a tool call was malformed, do not reward the text that followed it.
  • Use deterministic tool-supervision where it is free. The paper’s tools exist partly because their ground truth is cheap to synthesize. For scaffolded agent tasks, the equivalent is asserting on structured tool outputs (exit codes, JSON schema validation, file-write paths) before the model’s next turn consumes them.[6][11] Free deterministic supervision is the best deterministic supervision.
  • Sequence before mix. A subagent that does tool-only work (lint, type-check, format) before a second subagent that produces the answer runs architecturally identical separation to the paper’s two-stage training, at the orchestration layer instead of the gradient layer.[7] The scaffolding version is cheaper to ship and has no convergence problem.
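
The structured-output assertion can be as small as a parse plus a type check. A hand-rolled sketch (a real deployment would reach for a JSON Schema validator; the report shape here is hypothetical):

```python
import json

# Hypothetical shape of a test-runner tool's JSON report.
TEST_REPORT_SCHEMA = {"passed": int, "failed": int, "duration_s": float}

def assert_tool_output(raw: str, schema: dict) -> dict:
    """Parse and type-check a tool's JSON output before the model's
    next turn consumes it. A failure here blocks the turn outright;
    nothing downstream gets scored."""
    data = json.loads(raw)               # malformed JSON raises here
    for key, expected in schema.items():
        if not isinstance(data.get(key), expected):
            raise TypeError(f"{key}: expected {expected.__name__}")
    return data
```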

What I Want To See Next

The paper picks visual tools because their ground truth is cheap. The harder case covers tools whose correctness is not ground-truthable without human judgment: code-writing, prose-writing, search queries, SQL. Stage-one rewards need ground truth, and ground truth in those domains is not free.

Production scaffolding handles the noise by degrading the signal rather than abandoning it: use heuristic correctness (syntax checks, test pass/fail, search-result-quality proxies) as stage-one supervision, accept noise in the gradient in exchange for the structural benefit of separated objectives.[12] Whether the noisy-signal case still beats joint training is an empirical question the paper does not answer. My read of the paper, not a direct claim from the abstract: the separation contributes more than the cleanliness of the stage-one signal. I would pay money to see that claim tested.
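
For code, the cheapest noisy stage-one signal is syntactic validity. A sketch of the degraded-signal idea, not anything the paper tests:

```python
import ast

def noisy_stage_one_reward(code: str) -> float:
    """Heuristic stage-one supervision for a domain with no free ground
    truth: does the generated code even parse? Far noisier than the
    paper's deterministic visual signals, but it keeps the tool
    objective on a separate track from the answer objective."""
    try:
        ast.parse(code)
        return 1.0
    except SyntaxError:
        return 0.0
```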

Until the research catches up on noisy-signal ground truth, the scaffolding layer has to fill the gap. Skill design plays the same role at a different abstraction layer of the stack.[15] Every orchestrator that wants reliable tool-use has already written some version of this rule down, often as a hook, sometimes as a retry budget, always as the refusal to score the answer when the tool did not run.

Key Takeaways

For agent operators and scaffolding authors:

  • Keep tool-reliability grading separate from answer-quality grading. The paper is a gradient-descent proof of a rule production systems already follow with validators, hooks, and retry budgets. Blend them at your risk.[5]
  • Prefer deterministic tool supervision over heuristic answer grading whenever it is free. Exit codes, JSON schema validators, and structured output assertions are your cheap ground truth.[6][11]

For researchers in this space:

  • The open question is whether curriculum-style separation helps when tool-level ground truth is noisy. The paper’s clean-signal case is not the production case.
  • Case studies in other modalities (code, SQL, search) would be high-information contributions.[9][10]

References


  1. Author’s analysis in Every Hook Is a Scar, March 29, 2026. 12%→2% phantom-verification reduction; hedging-language detector documented with code excerpts. 

  2. Qihua Dong, Gozde Sahin, Pei Wang, Zhaowei Cai, Robik Shrestha, Hao Yang, and Davide Modolo, “Visual Reasoning through Tool-supervised Reinforcement Learning,” arXiv:2604.19945, April 21, 2026. 

  3. Victoria Krakovna et al., “Specification gaming: the flip side of AI ingenuity,” DeepMind blog, April 2020. 

  4. Joar Skalse et al., “Defining and Characterizing Reward Hacking,” arXiv:2209.13085, 2022. 

  5. Anthropic, “Hooks reference,” code.claude.com docs. PreToolUse, PostToolUse, and related lifecycle events. 

  6. POSIX.1-2017, “Shell Command Language — Exit Status,” IEEE/Open Group. Status 127 = command not found; 126 = not executable. 

  7. Anthropic, “Subagents reference,” code.claude.com docs. Subagent dispatch and scope constraints. 

  8. Author’s analysis in Agent Operators Handbook. Subagent composition patterns in the production orchestrator. 

  9. Jiaer Xia et al., “Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning,” arXiv:2505.14677, 2025. 

  10. Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, et al., “OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning,” arXiv:2505.08617, 2025. 

  11. JSON Schema Specification, json-schema.org. Standard for asserting on structured tool outputs. 

  12. Jacob Menick et al., “Teaching language models to support answers with verified quotes,” arXiv:2203.11147, 2022. Example of heuristic correctness supervision for otherwise hard-to-grade outputs. 

  13. Yoshua Bengio et al., “Curriculum Learning,” Proceedings of ICML 2009, dl.acm.org. Foundational paper on sequencing objectives by difficulty. 

  14. Anthropic, “Skills reference,” code.claude.com docs. Skill scope and tool-use contract. 

  15. Author’s analysis in Static Skills Are Dead Skills. On skill design as a scaffolded tool-use pattern that maps onto the same separation principle. 
