Agent Execution Traces Are the Runtime Contract
Three new agent papers make the same claim from different angles: the final answer is the weakest unit to trust. SHEPHERD turns agent execution into a typed, forkable trace. The AI Workflow Store argues that repeated agent work should run as engineered reusable workflows instead of improvised plans. WildClawBench grades agents inside native command-line runtimes with real tools, side-effect audits, and trajectory checks, not final answers alone.123
Agent reliability now lives in the execution trace, the workflow artifact, and the runtime evaluator. A chat transcript can explain what the agent says it did. A trace can show what it touched. A workflow can constrain what it may do next time. A native-runtime benchmark can measure whether the model, tools, state, and control loop worked together.
I already argued that managed agents are absorbing runtime infrastructure. I also argued that the cleanup layer is the real agent market. This post is narrower: the contract underneath both arguments is the agent’s execution record. If you cannot inspect, fork, replay, reuse, and grade the trace, you do not yet have an agent system you can trust at scale.
The adjacent pieces cover the control surface, the proof gate, and the skill loop: Chat Is the Wrong Interface for AI Agents, The Evidence Gate, and Static Skills Are Dead Skills. The trace contract sits underneath all three.
TL;DR
Agent systems keep moving away from final-answer evaluation. SHEPHERD records every agent-environment interaction as a typed event in a Git-like trace where earlier states can be forked and replayed.1 The AI Workflow Store proposes reusable, hardened workflows that amortize proper design, testing, adversarial evaluation, and staged rollout across many users instead of paying that cost on every prompt.2 WildClawBench shows why the runtime matters: its 60 long-horizon tasks run inside real CLI agent runtimes with real tools, average roughly 8 minutes and 20+ tool calls, and use hybrid grading that audits artifacts and environment side effects.3
The practical shift: stop asking only whether the answer is right. Ask whether the trace is inspectable, whether the workflow is reusable, and whether the evaluation ran in the runtime where the agent actually works.
Key Takeaways
For agent builders:

- Treat the execution trace as the product contract. Log tool calls, arguments, exit states, file changes, side effects, and decision points in a structure that another process can inspect.
- Promote repeated high-stakes tasks into vetted workflows. Improvisation belongs in discovery; repeated work deserves a reusable artifact with tests and constraints.

For evaluation teams:

- Grade the model plus runtime, not the model in isolation. WildClawBench reports that changing the CLI runtime alone can shift a single model by up to 18 points.3
- Keep deterministic checks separate from semantic judging. File existence, format validity, workspace cleanliness, and service side effects should not require an LLM judge.3

For operators:

- Do not buy “agent reliability” if the vendor cannot show the trace. A transcript, diff, or success sentence is not enough.
- Keep local judgment rules close to the product. Managed traces can show what happened; they cannot decide what deserves to ship.
Why Is the Final Answer Too Weak?
Final answers compress the wrong information.
An agent can report that tests passed without running tests. It can describe a migration without reading the downstream callers. It can produce the right final artifact through a tool path that touched data the user never intended to expose. The answer may look clean while the runtime path stays unsafe, wasteful, or impossible to reproduce.
That is the core argument in Reward the Tool Before the Answer: the answer is not scoreable when the tool evidence behind it is missing. The recent research moves the same idea below the completion report. The trace itself becomes the object other agents, graders, and operators need to inspect.
WildClawBench names the benchmark-side version of the problem. The authors argue that many agent benchmarks still lean on synthetic sandboxes, short tasks, mock APIs, and final-answer checks. Their benchmark instead runs actual CLI agents in Docker containers and grades produced artifacts, environment state, and semantic criteria after the agent exits.3 The difference matters because long-horizon work fails through side effects and runtime choices, not only through wrong text.
What Does SHEPHERD Add?
SHEPHERD treats an agent execution like a first-class object another agent can operate on.1
The paper defines meta-agents as higher-order agents that supervise, optimize, or train other agents. Those meta-agents need more than a transcript. They need to read execution while it happens, fork before risky turns, replay from earlier states, and compare branches without contaminating the parent run.
SHEPHERD gives them that substrate. The runtime records every agent-environment interaction as a typed event in a Git-like execution trace. Every action becomes part of a commit graph. A meta-agent can subscribe to the typed event stream, check out an earlier commit, fork a scope, replay the suffix, and merge the branch it wants.1
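To make that shape concrete, here is a minimal sketch of what a typed, forkable trace could look like. It is not SHEPHERD’s API: the `TraceEvent` and `Trace` names, the event kinds, and the content-hash commit ids are all illustrative.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TraceEvent:
    """One typed agent-environment interaction (schema is illustrative, not SHEPHERD's)."""
    kind: str                  # e.g. "tool_call", "file_write", "agent_spawn"
    payload: dict[str, Any]    # arguments, exit code, diff, emitted artifact, ...
    parent: str | None         # commit id of the preceding event, None at the root

    @property
    def commit_id(self) -> str:
        raw = json.dumps({"kind": self.kind, "payload": self.payload,
                          "parent": self.parent}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()[:12]


class Trace:
    """Append-only commit graph that a supervisor can read, fork, and replay."""

    def __init__(self) -> None:
        self.events: dict[str, TraceEvent] = {}
        self.head: str | None = None

    def commit(self, kind: str, payload: dict[str, Any]) -> str:
        event = TraceEvent(kind, payload, self.head)
        self.events[event.commit_id] = event
        self.head = event.commit_id
        return event.commit_id

    def fork(self, at: str) -> Trace:
        """Branch from an earlier commit; the parent run never sees the branch."""
        branch = Trace()
        branch.events = dict(self.events)   # shared history, isolated future
        branch.head = at
        return branch

    def replay_suffix(self, since: str | None) -> list[TraceEvent]:
        """Walk back from head to `since` and return that suffix in execution order."""
        suffix: list[TraceEvent] = []
        cursor = self.head
        while cursor is not None and cursor != since:
            suffix.append(self.events[cursor])
            cursor = self.events[cursor].parent
        return list(reversed(suffix))
```

A meta-agent watching such a graph can checkpoint before a risky tool call, fork a branch, try the alternative, and merge back only the suffix that passes its checks.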
The trace carries a semantic promise that normal chat logs do not carry:
| Property | Why it matters |
|---|---|
| Typed events | Supervisors can reason over operations instead of parsing prose. |
| Exact rewind | A failed path can return to a known prior state. |
| Isolated fork | Alternative branches cannot leak changes into the parent run. |
| Replay | A grader can re-run only the affected suffix instead of starting over. |
| Cache reuse | Branching becomes cheap enough to use during real agent work. |
The reported numbers make the substrate concrete. SHEPHERD forks the agent process and filesystem faster than Docker in the authors’ benchmark and reports prompt-cache reuse above 95% on replay. In their examples, a live supervisor raises CooperBench joint pass rate from 28.8% to 54.7%, and a Tree-RL setup lifts TerminalBench-2 performance from 34.2% to 39.4% in the reported configuration.1
Do not overread those numbers as a universal production guarantee. The important point is the shape: supervision, optimization, and training all improve when the runtime gives another process structured access to execution, not just a final result.
What Does the AI Workflow Store Add?
The AI Workflow Store paper attacks the same reliability problem from the reuse side.2
The authors argue that the common agent loop asks a model to synthesize and execute a plan in seconds or minutes. That speed short-circuits the processes that made conventional software tolerable: requirements work, design, testing, adversarial evaluation, staged deployment, monitoring, and feedback. The paper calls many on-the-fly agent executions closer to improvised prototypes than production-grade systems.2
Their proposed answer is not “make the model think longer.” The answer is a shared store of hardened, reusable workflows. An agent should match a user request to a vetted workflow when one exists, parameterize it for the user’s details, and execute that constrained workflow instead of inventing a new tool chain every time.2
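The paper describes the store at the architecture level rather than as a concrete interface. A rough sketch of the match, parameterize, execute loop might look like the following; the `Workflow` and `WorkflowStore` names and the deliberately naive keyword matcher are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Workflow:
    """A vetted, reusable workflow artifact (fields are illustrative, not the paper's schema)."""
    name: str
    required_params: set[str]
    allowed_tools: set[str]               # boundary a real executor would enforce (not enforced here)
    steps: list[Callable[[dict], dict]]   # tested sequence, executed in order

    def run(self, params: dict) -> dict:
        missing = self.required_params - params.keys()
        if missing:
            raise ValueError(f"workflow {self.name!r} missing parameters: {sorted(missing)}")
        state = dict(params)
        for step in self.steps:
            state = step(state)
        return state


class WorkflowStore:
    """Match a request to a vetted workflow before falling back to improvisation."""

    def __init__(self) -> None:
        self._workflows: dict[str, Workflow] = {}

    def publish(self, workflow: Workflow) -> None:
        self._workflows[workflow.name] = workflow

    def match(self, request: str) -> Workflow | None:
        # Real matching would be semantic; keyword lookup keeps the sketch small.
        for name, workflow in self._workflows.items():
            if name in request.lower():
                return workflow
        return None


def handle(store: WorkflowStore, request: str, params: dict) -> dict:
    """Prefer the vetted path; improvise only when no workflow exists yet."""
    workflow = store.match(request)
    if workflow is None:
        return {"mode": "improvised", "note": "no vetted workflow; explore, then promote"}
    return {"mode": "workflow", "result": workflow.run(params)}
```

The interesting field is `allowed_tools`: the vetted artifact, not the model in the moment, declares what the run may touch, and a real executor would enforce that boundary.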
That idea sharpens the skill conversation. A skill file that only says “here is how to do X” still leaves too much improvisation inside the runtime. A workflow store asks for a stronger artifact:
| Weak artifact | Stronger artifact |
|---|---|
| Prompt pattern | Parameterized workflow |
| One user’s workaround | Reusable capability |
| Best-effort tool plan | Tested sequence with constraints |
| Safety instruction | Deterministic boundary |
| Per-prompt cost | Amortized engineering cost |
The paper’s key economic claim is practical: rigorous engineering may cost more time and compute than an on-the-fly run, so the cost has to amortize across users and repeated requests.2 That argument fits the way serious agent work already feels. The first time you do a high-stakes workflow, you explore. By the second and third run, you should stop re-exploring the whole thing from scratch.
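The paper makes the amortization argument qualitatively. As a back-of-the-envelope illustration with invented numbers, the break-even point is just the hardening cost divided by the per-run savings:

```python
def break_even_runs(hardening_cost: float,
                    improvised_cost_per_run: float,
                    workflow_cost_per_run: float) -> float:
    """Runs needed before a hardened workflow beats repeated improvisation.
    Purely illustrative arithmetic; the paper states the argument qualitatively."""
    savings_per_run = improvised_cost_per_run - workflow_cost_per_run
    if savings_per_run <= 0:
        return float("inf")   # hardening never pays off on cost alone
    return hardening_cost / savings_per_run


# Invented numbers: 40 hours of hardening, 2.0 improvised hours per run, 0.25 hardened.
# break_even_runs(40, 2.0, 0.25) -> ~22.9 runs, amortized across every user of the store.
```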
What Does WildClawBench Add?
WildClawBench gives the evaluation version of the contract.3
The benchmark contains 60 human-authored tasks across six categories. It includes bilingual and multimodal work. Each task runs inside a reproducible Docker container hosting an actual CLI runtime such as OpenClaw, Claude Code, Codex, or Hermes Agent. The tasks use real tools instead of mock-service APIs, and the authors report an average of roughly 8 minutes and more than 20 tool calls per run.3
The grading design matters more than the leaderboard. WildClawBench combines deterministic artifact checks, environment-state audits of side effects, and an LLM/VLM judge only where semantic verification needs one. The benchmark withholds grading-only assets until after the agent exits, which prevents the agent from seeing the answer key during execution.3
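This is not WildClawBench’s harness, but a sketch of how the layers can stay separate: deterministic checks and a side-effect audit run as plain code, and the judge is injected as a callable so the deterministic layers never depend on it. The expected artifact (`report.json`) and the clean-worktree audit are invented for the example.

```python
import json
import subprocess
from pathlib import Path
from typing import Callable


def deterministic_checks(workspace: Path) -> dict[str, bool]:
    """Checks that need no judge: the artifact exists, it parses, no junk left behind."""
    report = workspace / "report.json"        # invented expected artifact for this sketch
    checks = {"artifact_exists": report.exists()}
    if checks["artifact_exists"]:
        try:
            json.loads(report.read_text())
            checks["artifact_parses"] = True
        except json.JSONDecodeError:
            checks["artifact_parses"] = False
    else:
        checks["artifact_parses"] = False
    checks["no_stray_tempfiles"] = not list(workspace.glob("*.tmp"))
    return checks


def environment_audit(workspace: Path) -> dict[str, bool]:
    """Side-effect audit run after the agent exits, here just a clean-worktree check."""
    result = subprocess.run(["git", "-C", str(workspace), "status", "--porcelain"],
                            capture_output=True, text=True)
    return {"worktree_clean": result.returncode == 0 and result.stdout.strip() == ""}


def grade(workspace: Path, semantic_judge: Callable[[Path], float]) -> dict:
    """Deterministic layers first; the judge only sees runs they cannot rule out."""
    result: dict = {**deterministic_checks(workspace), **environment_audit(workspace)}
    if all(result.values()):
        result["semantic_score"] = semantic_judge(workspace)
    return result
```

Running the judge only when the deterministic and side-effect layers already hold is a design choice of this sketch; the point is that a semantic score never gets the chance to paper over a dirty workspace.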
The headline result: the best reported configuration reaches 62.2% overall, every other model stays below 60% in the OpenClaw run, and switching the runtime can move one model’s score by up to 18 points.3 The paper’s conclusion follows: the runtime is part of the evaluated system. The model alone is not the product.
That result should make teams more careful with agent benchmarks. A high score in a short, synthetic, final-answer benchmark does not answer the question most operators care about: can the agent perform a long task in the actual runtime, with the actual tools, while leaving the environment in the intended state?
What Is the Contract?
Put the three papers together and the contract becomes clear.
| Layer | Artifact | The question it answers |
|---|---|---|
| Execution | Typed trace | What did the agent do, in order, with what side effects? |
| Reuse | Workflow artifact | Does repeated work run through a vetted path or a fresh improvisation? |
| Evaluation | Native-runtime benchmark | Does the model plus runtime complete realistic work under real tool constraints? |
| Judgment | Product standard | Does the verified output deserve to ship? |
Each layer prevents a different lie.
The trace prevents the agent from laundering a missing tool call into a plausible answer. The workflow prevents a repeated task from pretending it needs fresh improvisation forever. The native-runtime benchmark prevents a model score from pretending the runtime does not matter. The product standard prevents a verified artifact from pretending it is worthy just because it passed checks.
That last layer still matters. A trace can prove what happened. A workflow can constrain what happens. A benchmark can measure task completion. None of those layers can decide whether the result respects the user, the product, or the standard behind the work. That decision still belongs to the team.
What Should Operators Change Now?
Start with trace completeness.
If the runtime cannot produce a structured record of tool calls, arguments, exit codes, file changes, spawned agents, and emitted artifacts, fix that before adding more autonomy. A weak trace makes every downstream claim expensive to verify.
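An append-only JSONL file already clears that bar, as long as the fields are machine-readable. A minimal sketch, with invented event kinds and field names:

```python
import json
import time


def log_event(log_path: str, kind: str, **fields) -> None:
    """Append one structured event per line; field names here are illustrative.
    The format matters less than the fact that another process can parse it."""
    record = {"ts": time.time(), "kind": kind, **fields}
    with open(log_path, "a") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


# The minimum events worth capturing, with invented field values:
# log_event("trace.jsonl", "tool_call", tool="pytest", args=["-q"], exit_code=0)
# log_event("trace.jsonl", "file_write", path="src/app.py", bytes_written=1243)
# log_event("trace.jsonl", "agent_spawn", child="linter", reason="pre-merge check")
```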
Then separate trace grading from answer grading. A completion report that claims tests passed should first prove the test command ran and exited successfully. A report that names a changed file should prove the file was read or written. A report that summarizes an external action should prove the action’s side effects match the expected state. Only after the trace supports the claim should the answer be judged for quality.
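A small gate over that kind of trace is enough to refuse the most common unsupported claims. This sketch assumes the JSONL events from the previous snippet; the report fields (`tests_passed`, `changed_files`) are invented:

```python
import json


def load_trace(path: str) -> list[dict]:
    """Read the JSONL trace written by the logging sketch above."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def tests_actually_passed(trace: list[dict], test_tool: str = "pytest") -> bool:
    """Accept a 'tests passed' claim only if the trace shows the tool ran and exited 0."""
    return any(e.get("kind") == "tool_call"
               and e.get("tool") == test_tool
               and e.get("exit_code") == 0
               for e in trace)


def file_was_touched(trace: list[dict], path: str) -> bool:
    """A report naming a changed file must be backed by a read or write event."""
    return any(e.get("kind") in ("file_read", "file_write") and e.get("path") == path
               for e in trace)


def gate(report: dict, trace: list[dict]) -> list[str]:
    """Return unsupported claims; only an empty list moves on to answer grading."""
    failures = []
    if report.get("tests_passed") and not tests_actually_passed(trace):
        failures.append("claims tests passed but no successful test run in the trace")
    for path in report.get("changed_files", []):
        if not file_was_touched(trace, path):
            failures.append(f"claims a change to {path} with no read or write event")
    return failures
```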
Next, identify repeated workflows. Every recurring agent job should carry a promotion question: does the next run deserve a reusable workflow artifact? Source scanning, guide refreshes, translation releases, dependency updates, incident triage, and content publication all become better when the runtime stops re-inventing the sequence.
Finally, evaluate in the runtime you ship. Mock tools and synthetic tasks can still help during development, but they should not carry the release decision. The release decision needs the same tool boundaries, filesystem state, time budgets, and side-effect checks the real agent will face.
Quick Summary
The agent trace is becoming the reliability contract. SHEPHERD shows how meta-agents can supervise and branch execution when the runtime exposes typed, replayable traces. The AI Workflow Store argues that repeated work should move from on-the-fly improvisation into reusable engineered workflows. WildClawBench shows that native runtime, tools, side effects, and trajectory audits materially change measured performance. Final answers still matter, but they sit at the end of the contract, not at the center.
FAQ
Is an execution trace the same thing as observability?
No. Observability tells operators what happened. A contract-quality execution trace must also be structured enough for another process to inspect, fork, replay, and grade. Logs help humans debug. Typed traces let supervisors, evaluators, and workflow builders operate on execution directly.
Does SHEPHERD make agents safe automatically?
No. SHEPHERD supplies a substrate for observation, fork, replay, and meta-agent intervention. A bad supervisor can still make bad decisions. The gain is that the supervisor can act on a structured execution object instead of parsing a chat transcript.
Does the AI Workflow Store mean agents should never improvise?
No. Agents still need exploration when no vetted workflow exists or when the task is genuinely novel. The point is promotion. Once a task recurs and carries real stakes, the system should turn the successful path into a reusable workflow with constraints, tests, and maintenance.
Does WildClawBench prove one agent runtime is best?
No. WildClawBench shows that runtime choice materially changes measured performance under its task set and experimental setup. Treat that as evidence that the runtime belongs in the evaluation, not as a permanent ranking of products.
What should a team build first?
Build the trace first. Then add gates that refuse unsupported claims. Then promote recurring work into workflows. Fancy orchestration without a trustworthy trace only makes failures harder to reconstruct.
References
1. Simon Yu, Derek Chong, Ananjan Nandi, Dilara Soylu, Jiuding Sun, Christopher D. Manning, and Weiyan Shi, “SHEPHERD: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace,” arXiv:2605.10913v1, May 11, 2026. Primary source for SHEPHERD’s typed Git-like execution trace, fork/replay semantics, Lean-mechanized core operations, fork and prompt-cache reuse measurements, CooperBench result, and TerminalBench-2 result.
2. Roxana Geambasu, Mariana Raykova, Pierre Tholoniat, Trishita Tiwari, Lillian Tsai, and Wen Zhang, “Engineering Robustness into Personal Agents with the AI Workflow Store,” arXiv:2605.10907v1, May 11, 2026. Primary source for the on-the-fly agent-loop critique, proposed AI Workflow Store, hardened reusable workflow framing, SE lifecycle requirements, and amortized-reuse argument.
3. Shuangrui Ding et al., “WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation,” arXiv:2605.10912v1, May 11, 2026. Primary source for the 60-task native-runtime benchmark, bilingual and multimodal task mix, real CLI runtimes, roughly 8-minute and 20+ tool-call averages, hybrid grading design, 62.2% top reported score, and harness-choice score shifts.