The Agent Interface Is the Harness

OpenAI describes Codex as a cloud software-engineering agent that can read files, edit files, and run tests in an isolated environment; Anthropic documents hooks that can inspect and deny tool calls before they execute.[4][3] Those are not side details. They are the product.

The prompt box gets the attention because the prompt box feels like the interface. The real agent interface sits around the prompt: tool access, permission rules, memory loading, trace capture, evidence requirements, recovery controls, and release gates. That layer determines how the agent behaves after the user stops typing.

An agent product does not become trustworthy because it has better placeholder text. It becomes trustworthy when the surface around the model turns intent into governed work.

TL;DR

The agent interface is the operating layer. Chat can collect intent, but the surrounding surface decides what the agent may see, what it may do, what it must prove, and when a human must intervene. Microsoft’s guidelines frame human-AI interaction as behavior across time, and NIST frames trustworthiness as something teams incorporate into design, development, use, and evaluation.[1][2]

That means agent UX cannot stop at conversation design. The interface has to encode authority, memory, tool boundaries, evidence, and taste. If the interface does not carry those constraints, the agent will improvise them.

“Agentic Design Is Control Surface Design” names the visible surface. The frame below names the operating layer behind it.

Key Takeaways

For product teams:

  • Treat the prompt box as the intake surface, not the operating surface.
  • Design the agent’s permission, trace, memory, evidence, and recovery paths before polishing chat.

For design engineers:

  • Put quality rules where the agent acts: before tool calls, after edits, before release, and at completion.
  • Make invisible state inspectable enough for a human to stay responsible for the outcome.

For teams adopting agents:

  • Ask whether the interface reveals what the agent saw, changed, skipped, and verified.
  • Do not accept fluent final prose as proof of governed work.

The Interface Decides What The Agent Can Become

Every agent session starts with a user intent, but intent alone does not determine behavior.

The agent’s behavior also depends on:

| Interface Layer | Behavioral Effect |
| --- | --- |
| Tools | Defines the actions the agent can take |
| Permissions | Defines when the agent must stop or ask |
| Memory | Defines what prior context shapes the run |
| Trace | Defines what later review can inspect |
| Evidence | Defines what counts as done |
| Recovery | Defines how failure stays reversible |
| Taste | Defines what the system should refuse |

Those layers change the work as much as the model does. The same model behaves differently when it can run tests, when it can only edit files, when it sees a release gate, when it must cite sources, or when a stop gate blocks premature completion.

The product team that treats those layers as “settings” misunderstands the medium. Settings sit outside the work. Agent interface layers become the work’s shape.

Microsoft’s human-AI interaction guidelines make a useful older point: AI systems need to communicate status, support correction, and respond to failures across interaction time.[1] Agents sharpen that requirement because the system can act between user turns. The interface can no longer say, “The model answered.” The interface has to say, “The system acted under these constraints.”

Tool Access Is Interface Design

Tool access looks technical. It is also UX.

An agent that can only answer from memory has one kind of interface. An agent that can search files has another. An agent that can run shell commands, edit code, open browsers, call APIs, and deploy software needs a different contract with the user.

The Model Context Protocol describes a common pattern: AI applications connect to external systems such as local files, databases, tools, and workflows.[5] That connection expands capability, but capability alone does not equal quality. Every new tool adds a question the interface must answer:

| Tool Question | Interface Requirement |
| --- | --- |
| What can the agent touch? | Scope and permission boundary |
| What did the agent send? | Inspectable tool payload |
| What came back? | Output, error, and side-effect record |
| What changed? | Diff, artifact, or state summary |
| Who approved it? | Permission record |
| Can it be reversed? | Recovery path |

A tool list buried in configuration cannot carry that burden. The user needs a surface that makes tool authority legible while work happens.
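That burden can live in a concrete record. A minimal sketch in TypeScript, with illustrative field names rather than anything from MCP or a real SDK; each field answers one row of the table above:

```typescript
// Sketch: a per-call authority record. Every name here is an illustrative
// assumption, not part of MCP or any real SDK.

type RiskTier = "read" | "write" | "destructive" | "release";

interface ToolCallRecord {
  tool: string;                   // what can the agent touch?
  scope: string[];                // e.g. ["repo:src/**", "db:staging"]
  payload: unknown;               // what did the agent send?
  result?: {
    output: string;               // what came back?
    error?: string;
    sideEffects: string[];
  };
  changes?: string[];             // what changed? diffs, artifacts, state summaries
  approvedBy?: "policy" | "user"; // who approved it?
  reversible: boolean;            // can it be reversed?
  tier: RiskTier;                 // how much ceremony the call deserves
}
```

A call the surface cannot fill in is a call the user cannot supervise.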

Claude Code’s PreToolUse hook shows the primitive. A hook can receive the tool name and input before execution, then allow, deny, ask, defer, or modify the call.[3] That mechanism belongs in the mental model of agent interface design. The interface should expose the same decision point to the user at the right altitude.

Low-risk reads can pass quietly. Destructive shell commands need stronger friction. Public releases need a final gate. Customer-impacting changes need audit. The right interface does not ask the user to approve everything. The right interface gives every action the level of ceremony it deserves.
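A minimal sketch of that tiering as a PreToolUse-style gate. The stdin and stdout shapes follow the hooks reference cited above, but treat the exact schema, tool names, and risk rules as assumptions, not a verified contract:

```typescript
// Sketch of a PreToolUse-style gate (schema and tool names assumed from
// the hooks reference). Reads { tool_name, tool_input } JSON on stdin and
// emits a permission decision on stdout.
import { stdin, stdout } from "node:process";

type Decision = "allow" | "ask" | "deny";

function decide(toolName: string, input: Record<string, unknown>): Decision {
  // Low-risk reads pass quietly.
  if (toolName === "Read" || toolName === "Grep") return "allow";
  if (toolName === "Bash") {
    const cmd = String(input["command"] ?? "");
    // Destructive shell commands get a hard stop.
    if (/\brm\s+-rf\b|git\s+push\s+--force/.test(cmd)) return "deny";
    // Release-shaped commands get a final human gate.
    if (/\bdeploy\b|\bpublish\b/.test(cmd)) return "ask";
  }
  // Everything unclassified defaults to human review.
  return "ask";
}

let raw = "";
stdin.setEncoding("utf8");
stdin.on("data", (chunk) => (raw += chunk));
stdin.on("end", () => {
  const event = JSON.parse(raw) as {
    tool_name: string;
    tool_input: Record<string, unknown>;
  };
  const decision = decide(event.tool_name, event.tool_input);
  stdout.write(
    JSON.stringify({
      hookSpecificOutput: {
        hookEventName: "PreToolUse",
        permissionDecision: decision,
        permissionDecisionReason: `risk policy decision for ${event.tool_name}`,
      },
    }),
  );
});
```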

Memory Is Part Of The Product

Memory often enters agent products as infrastructure: context windows, files, summaries, vector stores, caches, project instructions, and retrieval systems. The user experiences those systems as product behavior.

When an agent remembers the design standard, the product feels coherent. When an agent forgets a constraint from 40 minutes earlier, the product feels careless. When an agent retrieves stale guidance, the product feels haunted by an old decision.

Memory needs an interface because memory changes responsibility. The user cannot supervise what the user cannot inspect.

The interface should separate at least four memory states:

| Memory State | User-Facing Meaning |
| --- | --- |
| Active | The agent can use it now |
| Available | The agent can retrieve it if needed |
| Compacted | The system summarized it and may have lost detail |
| Stale | The system has a record, but trust should drop |

Without that distinction, the user has to infer memory quality from agent behavior. That is backwards. The interface should reveal enough memory state for the user to intervene before the agent builds on the wrong premise.
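A sketch of those four states as data the interface can render. The names and the staleness window are illustrative assumptions:

```typescript
// Sketch: memory entries carry an explicit state the interface can render,
// so the user never has to infer memory quality from agent behavior.

type MemoryState = "active" | "available" | "compacted" | "stale";

interface MemoryEntry {
  id: string;
  summary: string;
  state: MemoryState;
  lastVerifiedAt: Date; // when the system last confirmed the content is current
  lossy?: boolean;      // compaction may have dropped detail
}

// Assumption: fourteen days without verification is where trust should drop.
const STALE_AFTER_MS = 14 * 24 * 60 * 60 * 1000;

// Demote entries the system can no longer vouch for, so trust drops visibly.
function surfaceState(entry: MemoryEntry, now = new Date()): MemoryState {
  if (now.getTime() - entry.lastVerifiedAt.getTime() > STALE_AFTER_MS) return "stale";
  if (entry.lossy) return "compacted";
  return entry.state;
}
```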

The same point applies to personal or team philosophy. A quality doctrine hidden in a prompt may or may not survive a long session. A doctrine encoded into skills, hooks, templates, checks, and completion gates has more surface area. The model can still miss. The operating layer can catch more misses because the rule lives where the work happens.

Evidence Turns Output Into Work

The final answer is the weakest proof unit in an agent session.

A final answer can say tests passed when no test ran. A final answer can say citations were verified when the source does not support the claim. A final answer can say the deployment succeeded while the public route returns 404 from cache. Fluent prose can hide failure.

Evidence has to become a surface. The user should see the claim, the support, and the gap:

| Claim Type | Required Evidence |
| --- | --- |
| Code changed | File paths and diffs |
| Tests passed | Command, exit status, and relevant output |
| Content is accurate | Source links and claim-source alignment |
| SEO path works | Rendered metadata, schema, and discovery files |
| Release succeeded | Live route status and cache state |
| Translation is ready | Local gate, D1 rows, live pages, and review status |

That evidence surface changes agent behavior. When the system knows completion requires evidence, the agent searches for proof during the task instead of writing a confident summary at the end.
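A minimal sketch of that mechanism: claims carry evidence or they hold the task open. Every name here is illustrative:

```typescript
// Sketch: a completion gate that maps claims to evidence and keeps the
// task open while any claim lacks proof.

interface Claim {
  statement: string; // e.g. "tests passed"
  evidence?: { kind: "command" | "file" | "source" | "route"; ref: string };
}

function evidenceGate(claims: Claim[]): { done: boolean; gaps: string[] } {
  const gaps = claims.filter((c) => !c.evidence).map((c) => c.statement);
  return { done: gaps.length === 0, gaps };
}

// A fluent summary with one unproven claim stays open:
console.log(
  evidenceGate([
    { statement: "tests passed", evidence: { kind: "command", ref: "npm test → exit 0" } },
    { statement: "live route works" }, // no evidence yet, so the gate reports the gap
  ]),
); // { done: false, gaps: ["live route works"] }
```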

The evidence gate exists for that reason. It forces the agent to connect claims to observed behavior. “Agent Execution Traces Are the Runtime Contract” pushes the same argument deeper: the trace carries more truth than the final answer because the trace preserves the path.

NIST’s AI Risk Management Framework matters here because trustworthiness enters design, development, use, and evaluation, not only model selection.[2] Evidence is where those phases meet the user’s screen.

Recovery Belongs In The Main Flow

Agent interfaces often treat failure as an exception. Agent work makes failure routine.

A search query misses. A test fails. A permission gate blocks. A translation check finds a formatting mismatch. A deployment succeeds, but a CDN serves stale HTML. A good interface does not panic at those states. A good interface makes recovery obvious.

Recovery requires five controls:

| Control | Purpose |
| --- | --- |
| Pause | Stop motion without losing state |
| Resume | Continue after review or external fix |
| Retry | Repeat a failed step with changed inputs |
| Fork | Explore an alternate path without overwriting the first |
| Roll back | Undo reversible work or mark irreversible work for repair |

The recovery path should sit near the trace and evidence surfaces. The user should not have to copy a failed command out of a transcript, infer the working directory, and reconstruct the agent’s state manually. The interface already knows the failed step. The interface should provide the next responsible action.
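A sketch of the five controls as pure transitions on a run record, so each one operates on the trace the interface already holds rather than on a transcript. All names are illustrative:

```typescript
// Sketch: recovery controls as explicit transitions on a run.

type RunStatus = "running" | "paused" | "failed" | "done";

interface Run {
  id: string;
  status: RunStatus;
  steps: { cmd: string; ok?: boolean }[]; // ok is unset until the step reports
}

// Stop motion without losing state.
const pause = (r: Run): Run => ({ ...r, status: "paused" });

// Continue after review or an external fix.
const resume = (r: Run): Run => ({ ...r, status: "running" });

// Repeat a failed step with changed inputs; the original stays in the trace.
const retry = (r: Run, cmd: string): Run => ({
  ...r,
  status: "running",
  steps: [...r.steps, { cmd }],
});

// Explore an alternate path; the first run stays untouched.
const fork = (r: Run): Run => ({ ...r, id: `${r.id}-fork`, steps: [...r.steps] });

// Drop work back to the last good step; a run with no failure is unchanged.
const rollback = (r: Run): Run => {
  const firstBad = r.steps.findIndex((s) => s.ok === false);
  return firstBad === -1 ? r : { ...r, steps: r.steps.slice(0, firstBad) };
};
```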

That principle applies to content work too. When a translation quality gate fails, the interface should show the failing locale, failing segment, reason, and repair path. When a public page fails live verification, the interface should show whether the app failed, the database failed, or the edge cache served stale output. The agent should not call a release done until the user-visible path works.

Taste Is Not A Prompt

AI coding makes implementation cheaper. Cheaper implementation raises the value of judgment.

The important question shifts from “can the agent make something?” to “should this version exist?” That question belongs in the interface as much as in the human reviewer.

Taste shows up as constraints:

  • remove the unnecessary step;
  • refuse the clever path that weakens the product;
  • preserve consistency across artifacts;
  • verify the public path instead of celebrating local success;
  • protect private machinery from public copy;
  • choose the smaller, sharper solution over the busier one.

An agent can receive those values as prose. Prose helps. Prose alone does not guarantee behavior. The values need operational forms: a blog skill that blocks lazy phrases, a citation verifier that rejects unsupported claims, a release verifier that checks live pages, a stop gate that rejects completion without evidence, and design rules that prevent visual drift.
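One taste rule in operational form, as a sketch: a check that refuses lazy phrases instead of hoping a prompt survives the session. The banned list is an invented example, not anyone’s real style guide:

```typescript
// Sketch: a taste rule that lives where the work happens. The model can
// still write a lazy phrase; the gate catches it before the draft ships.

const BANNED = [/in today.s fast-paced world/i, /\bdelve\b/i, /game[- ]changer/i];

// Refuse the draft and record why, so the interface can show what the
// system refused instead of silently shipping it.
function tasteCheck(draft: string): { pass: boolean; refusals: string[] } {
  const refusals = BANNED.filter((re) => re.test(draft)).map(String);
  return { pass: refusals.length === 0, refusals };
}
```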

The interface is where taste becomes inspectable. The user sees what the system refused, what it simplified, what it verified, and what it left unproven. That record matters because agent output will only get cheaper. The scarce thing will be the standard that decides what survives.

A Practical Agent Interface Map

Teams can start with a plain map. No futuristic dashboard required.

| Surface | Minimum Viable Version |
| --- | --- |
| Intent intake | Prompt, task type, repo or workspace scope |
| Plan | Assumptions, intended tools, acceptance criteria |
| Permission | Risk-tiered queue with full payloads |
| Memory | Active instructions, loaded files, stale warnings |
| Trace | Timeline of tool calls, outputs, and side effects |
| Evidence | Claims mapped to commands, files, sources, or gaps |
| Recovery | Pause, retry, fork, rollback, cancel |
| Release | User-visible route, schema, discovery, translation, cache |
| Taste | Refusals, simplifications, standards, and final worthiness |

The map works because each surface answers a user responsibility. The user does not need every raw event. The user needs enough visibility and control to remain accountable for the result.

That distinction prevents two common mistakes. One mistake hides everything behind chat and calls the result magic. The other exposes every internal event and calls the result transparent. Strong agent interface design does neither. It gives the operator the right control at the right moment.
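For teams that want the map as a contract, here is the same table as one hedged type. Every name is illustrative, not drawn from a real product:

```typescript
// Sketch: the practical interface map as a single top-level contract.
// Each field is a surface the operator can open.

interface AgentRunView {
  intake:     { prompt: string; taskType: string; scope: string[] };
  plan:       { assumptions: string[]; intendedTools: string[]; acceptance: string[] };
  permission: { tool: string; payload: unknown;
                tier: "read" | "write" | "destructive" | "release" }[];
  memory:     { activeInstructions: string[]; loadedFiles: string[]; staleWarnings: string[] };
  trace:      { step: number; tool: string; output: string; sideEffects: string[] }[];
  evidence:   { claim: string; proof?: string }[]; // an absent proof is a visible gap
  recovery:   ("pause" | "retry" | "fork" | "rollback" | "cancel")[];
  release:    { route: string; liveStatus?: number; cacheState?: string };
  taste:      { refusals: string[]; simplifications: string[] };
}
```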

Quick Summary

The agent interface is the operating layer. The prompt collects intent, but tools, permissions, memory, traces, evidence, recovery, and taste determine what actually happens. OpenAI’s Codex and Claude Code’s hooks show the direction: agent products already include execution environments, tool calls, and policy decision points.[4][3] MCP widens the connection between agents and external systems.[5] NIST and Microsoft provide the older trust and human-AI design frame.[2][1]

The product question is no longer whether the agent can answer. The product question is whether the surrounding surface governs autonomous work well enough for a human to trust, inspect, interrupt, repair, and sign the result.

FAQ

What does “the interface is the harness” mean?

The phrase means the interface does more than display agent output. It defines the operating layer around the model: tools, permissions, memory, traces, evidence, recovery, and standards. Those parts shape behavior before the final answer appears.

Can a chat interface still work for agents?

Chat can work as an intake surface and a lightweight review lane. Chat fails when it becomes the only operating surface. Agent work needs random access, permission review, trace inspection, memory visibility, and recovery controls.

How does this differ from prompt engineering?

Prompt engineering shapes instruction. Interface design shapes authority, state, and accountability. A prompt can tell an agent to verify work. A release surface can require live route evidence before the task can close.

What should a team build first?

Build the trace and evidence surfaces first. The trace shows what happened. The evidence surface shows what proves the result. Permissions, recovery, and memory become easier to design once the team can inspect the work path.

References


  1. Saleema Amershi et al., “Guidelines for Human-AI Interaction,” Microsoft Research, CHI 2019. Primary source for 18 human-AI interaction guidelines validated with 49 design practitioners. 

  2. National Institute of Standards and Technology, “AI Risk Management Framework,” NIST. Source for the framework’s voluntary risk-management purpose and its design, development, use, and evaluation framing. 

  3. Anthropic, “Hooks reference,” Claude Code Docs. Source for hook events, PreToolUse input fields, and decision control that can allow, deny, ask, defer, or modify tool calls before execution. 

  4. OpenAI, “Introducing Codex,” OpenAI, May 2025. Source for Codex as a cloud software-engineering agent, its independent sandboxed tasks, and its ability to read files, edit files, and run commands. 

  5. Model Context Protocol, “What is the Model Context Protocol?” Source for MCP as an open standard that connects AI applications to external systems such as data sources, tools, and workflows. 

Related Posts

Agentic Design Is Control Surface Design

Chat Is the Wrong Interface for AI Agents

The Ralph Loop: How I Run Autonomous AI Agents Overnight