Agents Need Supervision Surfaces
OpenAI now describes the Codex app as a command center for managing multiple agents, running work in parallel, and supervising coordinated teams across the software lifecycle.[1] The product direction confirms the interface shift: the hard problem has moved from “can the agent act?” to “can the human supervise action at scale?”
Agents need supervision surfaces: places where a person can see state, review risk, approve sensitive tools, inspect traces, recover from failure, and sign the result with evidence. Better chat helps expression. Supervision surfaces govern work.
TL;DR
Chat remains useful for intent. It fails as the sole surface for autonomous work because agent runs contain tool calls, permissions, traces, memory, failed branches, and completion claims. OpenAI’s Codex cloud docs describe background tasks in sandboxed environments, real-time progress monitoring, terminal-log citations, and test-output evidence.[2] OpenAI’s Agents SDK exposes human-in-the-loop approvals and built-in tracing for tool calls, handoffs, guardrails, and custom events.[3][4] Anthropic’s Claude Code hooks expose lifecycle points such as PreToolUse, PostToolUse, PermissionRequest, and Stop.[5]
The product lesson: supervision is not one modal at the end. It is a set of surfaces that sit beside the agent while work happens.
Key Takeaways
For agent product teams:

- Build a supervision queue before adding another chat polish feature. The queue should show blocked runs, risky actions, stale evidence, failed checks, and review-ready artifacts.
- Treat approvals, traces, and recovery as primary UX. The user should not reconstruct tool state from a transcript.

For design engineers:

- Give every agent action an altitude: silent, summarized, interrupting, or blocked. Read-only work should not look like production mutation.
- Design the review object, not only the message. A review object contains the tool payload, risk reason, diff, evidence, and next action.

For teams adopting coding agents:

- Measure whether an operator can answer: what is running, what is waiting, what changed, what failed, what needs approval, and what remains unverified.
- Use chat for delegation. Use supervision surfaces for responsibility.
What Is A Supervision Surface?
A supervision surface is a user interface for accountable agent work.
It does not try to show every token. It shows the parts that decide whether the agent should continue:
| Surface | User Question |
|---|---|
| Run queue | Which agents need attention? |
| State panel | What phase is each run in? |
| Approval queue | Which tool calls need a human decision? |
| Trace timeline | What happened, in what order? |
| Evidence panel | What proves the result? |
| Recovery controls | How do I pause, resume, retry, fork, or roll back? |
| Review packet | What can I sign, reject, or send back? |
The difference from chat is random access. Chat says “read the scroll.” A supervision surface says “inspect the risky part, then decide.”
That matters when one person runs multiple agents. A single agent can remain conversational for a while. Five long-running agents become operations. The interface has to prioritize, summarize, and route attention.
Why Does Chat Fail As The Operating Surface?
Chat fails because it has the wrong shape for work that moves.
Agent work produces events: plans, searches, file reads, file writes, shell commands, browser actions, API calls, test runs, rejected paths, failed retries, and final evidence. A transcript can contain those events, but a transcript cannot organize them by risk, phase, or responsibility.
OpenAI’s Codex app announcement names the shift directly. Developers now delegate work, run tasks in parallel, and supervise agents across projects; older IDE and terminal surfaces do not fit that mode.[1] That wording matters because supervision requires a different layout than prompting. The operator needs a board, not a scroll.
Microsoft’s 2019 human-AI interaction guidelines still supply the base design frame: AI systems should communicate status, support correction, and handle failures across interaction time.[6] Agents make those old guidelines operational. Status now means “which tool call is pending?” Correction now means “reject and resume this run.” Failure now means “show the failed command, changed assumption, and repair path.”
The mistake is treating supervision as friction. Poor supervision adds friction. Good supervision removes cognitive load because it puts the decision in the right place.
What Should The Run Queue Show?
The run queue should show attention, not activity.
An activity feed tells the user everything that happened. A supervision queue tells the user what needs judgment. The queue can compress most events into a few statuses:
| Run Status | What The Operator Needs |
|---|---|
| Planning | Goal, scope, likely tools, acceptance criteria |
| Acting | Current tool, target, expected side effect |
| Waiting | Approval, credential, missing input, external blocker |
| Verifying | Test command, source check, rendered path, review gate |
| Repairing | Failed check, changed hypothesis, next retry |
| Review-ready | Artifact, diff, evidence, unresolved gaps |
| Blocked | Reason, owner, restart option |
OpenAI’s Codex cloud docs describe tasks that can run in the background, including in parallel, inside their own cloud environments.[2] Parallel background work changes the attention model. The user should not poll each thread. The system should route blocked, risky, and review-ready work to one place.
The queue should avoid false urgency. A failed lint check on a draft branch and a production deployment mismatch do not deserve the same visual weight. The interface should reserve interruption for irreversible actions, public releases, security-sensitive operations, and decisions where the agent lacks enough context to continue responsibly.
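A minimal sketch of this attention model: the queue compresses runs into the statuses above, then sorts by how badly each run needs judgment rather than by recency. All type and field names here are illustrative assumptions, not part of any SDK.

```typescript
// Hypothetical run model: status from the table above, plus the
// highest-risk pending action and the time since the last state change.
type RunStatus =
  | "planning" | "acting" | "waiting" | "verifying"
  | "repairing" | "review-ready" | "blocked";

type Risk = "read" | "write" | "deploy" | "destroy";

interface Run {
  id: string;
  status: RunStatus;
  risk: Risk;          // highest-risk pending action in the run
  ageMinutes: number;  // time since last state change
}

// Attention, not activity: blocked and waiting runs first, then
// review-ready work, weighted up by risk; quiet progress sinks down.
function attentionScore(run: Run): number {
  const statusWeight: Record<RunStatus, number> = {
    blocked: 100, waiting: 80, "review-ready": 60,
    repairing: 40, verifying: 20, acting: 10, planning: 10,
  };
  const riskWeight: Record<Risk, number> = {
    destroy: 30, deploy: 20, write: 10, read: 0,
  };
  return statusWeight[run.status] + riskWeight[run.risk] + Math.min(run.ageMinutes, 30);
}

function buildQueue(runs: Run[]): Run[] {
  return [...runs].sort((a, b) => attentionScore(b) - attentionScore(a));
}
```

The weights are arbitrary; the design point is that sort order encodes responsibility, so a blocked write outranks a healthy run no matter how recently the healthy run emitted events.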
How Should Approvals Work?
Approvals should work like a queue of review objects, not like a string of modal interruptions.
OpenAI’s Agents SDK human-in-the-loop flow pauses execution until a person approves or rejects sensitive tool calls. The docs describe pending approvals as interruptions, with RunState used to serialize and resume after decisions.[3] The same page notes that approval applies across nested agent tools and MCP tools, not only the current top-level agent.[3]
Anthropic’s Claude Code hook docs expose the same design shape from another angle. PreToolUse runs before a tool call and can block it. PermissionRequest fires when a permission dialog appears. PostToolUse and PostToolUseFailure fire after successful or failed tool calls, and Stop fires when Claude finishes responding.[5]
Those primitives point to the right surface:
| Approval Field | Why It Belongs In The UI |
|---|---|
| Tool name | Identifies capability class |
| Arguments | Shows what the agent wants to do |
| Target | Names file, database, host, route, account, or branch |
| Risk tier | Sets visual and procedural weight |
| Agent reason | Explains why the call belongs in the plan |
| Expected side effect | Separates read, write, network, deploy, spend, or delete |
| Decision | Approve once, always approve, reject, defer, rewrite |
The right approval surface lets low-risk reads pass quietly, batches medium-risk decisions, and interrupts for high-risk changes. The user should not approve a shell command while reading a paragraph. The user should approve a typed operation with enough context to stay accountable.
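The review object and its tiered routing can be sketched directly from the table above. These shapes are assumptions for illustration; they are not real Agents SDK or Claude Code types.

```typescript
// Illustrative review object: one typed operation per sensitive tool call,
// with enough context for an accountable decision.
type RiskTier = "low" | "medium" | "high";
type SideEffect = "read" | "write" | "network" | "deploy" | "spend" | "delete";

interface ReviewObject {
  toolName: string;               // capability class, e.g. "shell"
  args: Record<string, unknown>;  // what the agent wants to do
  target: string;                 // file, database, host, route, account, branch
  riskTier: RiskTier;
  agentReason: string;            // why the call belongs in the plan
  sideEffect: SideEffect;
}

type Routing = "auto-pass" | "batch" | "interrupt";

// Low-risk reads pass quietly, medium-risk decisions batch for review,
// high-risk changes interrupt the operator.
function route(review: ReviewObject): Routing {
  if (review.riskTier === "high") return "interrupt";
  if (review.riskTier === "low" && review.sideEffect === "read") return "auto-pass";
  return "batch";
}
```

The decision set (approve once, always approve, reject, defer, rewrite) would hang off the review object itself, so approving a shell command never happens mid-paragraph in a transcript.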
What Should A Trace Surface Prove?
A trace surface should prove sequence, cause, and consequence.
OpenAI’s Agents SDK tracing docs say tracing records a run across LLM generations, tool calls, handoffs, guardrails, and custom events, then supports debugging, visualization, and monitoring in development and production.[4] That description makes trace a product primitive, not only developer instrumentation.
The supervision trace should answer five questions:
| Question | Required Trace Detail |
|---|---|
| What did the agent see? | Files, sources, prompts, retrieved context |
| What did it do? | Tool calls, arguments, outputs, exit states |
| What changed? | Diffs, generated artifacts, external state |
| Why did it change course? | Failed checks, denied permissions, new evidence |
| What proves completion? | Commands, source links, live routes, review status |
The trace does not need private reasoning. It needs operational evidence. A user does not need hidden chain-of-thought to evaluate a release. The user needs the command output, route status, cache state, D1 rows, translation gate, source checks, and remaining native-review gap.
That distinction protects both trust and taste. Exposing too much internal detail turns the interface into noise. Showing too little turns the product into theater.
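A trace record keyed to the five questions might look like the sketch below. The shapes are assumptions for illustration, not the Agents SDK span format.

```typescript
// Each event answers one of the five trace questions:
// read → "what did it see?", tool → "what did it do?",
// change → "what changed?", course-correction → "why did it change course?",
// evidence → "what proves completion?"
type TraceKind = "read" | "tool" | "change" | "course-correction" | "evidence";

interface TraceEvent {
  seq: number;       // order within the run
  kind: TraceKind;
  summary: string;   // e.g. "npm test", "wrote src/app.ts"
  payload?: string;  // command output, diff, or source link
}

// "What proves completion?" — collect evidence events in order.
// A run with no evidence events has a claim but no proof.
function completionProof(trace: TraceEvent[]): string[] {
  return trace
    .filter((e) => e.kind === "evidence")
    .sort((a, b) => a.seq - b.seq)
    .map((e) => e.summary);
}
```

The same record can feed both a developer debugging view and the operator's evidence panel, which is why trace belongs in the product, not only in instrumentation.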
How Should Recovery Fit The Flow?
Recovery belongs beside the failed event.
Agent systems fail constantly in normal work: an install command times out, a formatter changes unrelated files, a browser smoke test finds a stale cache, a translation gate rejects a locale, or a source link returns 403 to a script. A good supervision surface treats those moments as expected states.
The recovery controls should stay concrete:
| Control | Responsible Use |
|---|---|
| Pause | Stop new side effects while preserving state |
| Resume | Continue after approval or external fix |
| Retry | Repeat a failed step with changed input |
| Fork | Explore an alternate plan without overwriting the first |
| Revert | Undo local reversible changes |
| Escalate | Ask a human or another agent for review |
| Close with gap | Finish only with explicit unresolved work |
OpenAI’s Codex app announcement describes agents working in isolated copies of code so users can explore different paths and check out changes locally while an agent continues.[1] That isolation helps recovery, but the interface still needs to show which path won, which path failed, and which work remains unsafe to merge.
The product should never make the user reconstruct recovery from raw logs. The failed step already knows its command, working directory, output, and target. The surface should put the responsible next action on that event.
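Deriving controls from the failed event itself can be sketched like this. All names are illustrative assumptions; the point is that the event carries enough context to offer the responsible next action in place.

```typescript
// The failed step already knows its command, working directory, and output;
// the surface derives which recovery controls apply from that context.
interface FailedStep {
  command: string;
  cwd: string;
  output: string;
  reversible: boolean;       // did the step only touch local, undoable state?
  hasAlternatePlan: boolean; // is there another path worth exploring?
}

type RecoveryAction =
  | "pause" | "retry" | "fork" | "revert" | "escalate" | "close-with-gap";

function recoveryOptions(step: FailedStep): RecoveryAction[] {
  // Always available: stop side effects, repeat with changed input,
  // ask for review, or finish with an explicit unresolved gap.
  const options: RecoveryAction[] = ["pause", "retry", "escalate", "close-with-gap"];
  if (step.reversible) options.push("revert");      // undo local reversible changes
  if (step.hasAlternatePlan) options.push("fork");  // explore without overwriting
  return options;
}
```

Resume is omitted here because it applies after a pause or external fix rather than at the moment of failure; a fuller model would track it as a state transition.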
What Makes A Supervision Surface Worthy?
A supervision surface becomes worthy when it reduces work without reducing responsibility.
The easy version adds more panels. The worthy version removes doubt. The user should gain faster answers to the questions that matter:
- Which run needs me?
- Which action can cause damage?
- Which result has proof?
- Which result only has prose?
- Which branch should survive?
- Which gap remains unsolved?
NIST’s AI Risk Management Framework frames trustworthiness as something teams incorporate into the design, development, use, and evaluation of AI products and systems.[7] Supervision surfaces live exactly at that intersection. They make design carry operational risk. They make use produce evidence. They make evaluation visible before the user signs off.
MCP widens the same responsibility. The Model Context Protocol connects AI applications to external data sources, tools, and workflows so agents can access information and perform tasks.[8] More connected tools mean a larger action surface. Larger action surfaces require better supervision, not more faith.
The design standard should stay simple: an agent product should not maximize autonomous motion. It should maximize accountable progress.
How Do You Start Building One?
Start with the smallest useful supervision surface:
- Run list: one row per active agent, with phase, age, blocker, and next decision.
- Approval queue: one object per sensitive tool call, with arguments, target, risk, and approve/reject/defer controls.
- Trace table: one row per meaningful event, filterable by read, write, shell, browser, source, test, deploy, and review.
- Evidence panel: one claim-to-proof table for the final result.
- Recovery menu: pause, resume, retry, fork, and close-with-gap from the event that failed.
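The evidence panel from that list can be sketched as a claim-to-proof check: each completion claim either cites proof or stays flagged. Shapes are illustrative assumptions.

```typescript
// One row per claim in the final result: either it carries proof
// (a command output, source link, or review status) or it is only prose.
interface Claim {
  text: string;    // e.g. "all tests pass"
  proof?: string;  // e.g. "npm test: 42 passed, 0 failed"
}

// Final signoff stays tied to proof: surface the claims that lack it.
function unverifiedClaims(claims: Claim[]): string[] {
  return claims.filter((c) => !c.proof).map((c) => c.text);
}
```

A signoff button would stay disabled, or carry an explicit close-with-gap label, while this list is non-empty.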
The first version can look boring. Tables, filters, badges, and expandable rows beat an elegant transcript that hides risk. The taste problem comes after the information architecture is honest: reduce noise, reserve warning color, group low-risk events, expose high-risk payloads, and keep final signoff tied to proof.
Agentic design is control surface design. The agent interface is the operating layer. HTML can preserve spatial information that Markdown drops. Supervision surfaces combine those frames: they turn autonomous work into inspectable, spatial, accountable operations.
Quick Summary
Agents do not need a better transcript as much as they need supervision surfaces. A serious agent interface needs a run queue, approval queue, trace timeline, evidence panel, and recovery controls. OpenAI, Anthropic, Microsoft, NIST, and MCP documentation all point toward the same product shape: autonomous systems need visible status, tool governance, reviewable traces, and human decisions at the right altitude.[1][3][4][5][6][7][8]
Chat can stay as the delegation lane. Supervision has to become the work surface.
FAQ
What is an agent supervision surface?
An agent supervision surface is a UI for monitoring and controlling autonomous agent work. It shows run state, pending approvals, tool traces, evidence, failures, and recovery controls. Chat collects intent. A supervision surface helps the operator decide what the agent may do next and whether the result deserves signoff.
Why is chat not enough for AI agents?
Chat is sequential and append-only. Agent work needs random access to state, risk, approvals, traces, diffs, test output, and unresolved gaps. A transcript can record those events, but it cannot prioritize them by risk or route human attention across parallel agents.
What should teams build first?
Teams should build a run queue and approval queue first. Those two surfaces immediately reveal blocked work and sensitive actions. Add a trace table next because evidence, recovery, and final review all depend on the event record.
How is a supervision surface different from observability?
Observability helps builders debug the system. Supervision helps operators govern work while it happens. The two share data, but they serve different users. A production trace can feed both a developer debugging view and a human approval surface.
Does every agent need human approval?
No. Every agent needs calibrated supervision. Low-risk reads can run silently. Medium-risk changes can batch for review. High-risk actions should pause for approval. Public releases, destructive commands, customer-impacting actions, and money movement deserve stronger gates.
References
1. OpenAI, “Introducing the Codex app,” OpenAI, 2 February 2026, updated 4 March 2026. Source for the Codex app as a multi-agent command center, parallel agent workflows, isolated code copies, skills, Automations, review queues, sandboxing, permission requests, and supervision framing.
2. OpenAI, “Codex web,” OpenAI Developers. Source for Codex as a coding agent that can read, edit, and run code in background cloud tasks, including parallel work in its own cloud environment.
3. OpenAI, “Human-in-the-loop,” OpenAI Agents SDK. Source for approval flows that pause execution, return pending approvals as interruptions, serialize and resume RunState, and support approvals across function tools, shell tools, apply-patch tools, MCP servers, hosted MCP tools, and nested agent tools.
4. OpenAI, “Tracing,” OpenAI Agents SDK. Source for built-in tracing of LLM generations, tool calls, handoffs, guardrails, custom events, traces, spans, and development or production monitoring.
5. Anthropic, “Hooks reference,” Claude Code Docs. Source for Claude Code lifecycle hooks, including PreToolUse, PermissionRequest, PostToolUse, PostToolUseFailure, PostToolBatch, subagent events, and Stop.
6. Saleema Amershi et al., “Guidelines for Human-AI Interaction,” Microsoft Research, CHI 2019. Source for the 18 generally applicable human-AI interaction guidelines and the 49-practitioner validation study.
7. National Institute of Standards and Technology, “AI Risk Management Framework,” NIST. Source for incorporating trustworthiness considerations into the design, development, use, and evaluation of AI products, services, and systems.
8. Model Context Protocol, “What is the Model Context Protocol?” Source for MCP as an open-source standard connecting AI applications to external systems, including local files, databases, tools, and workflows.