Your Agent Has Two Untrusted Inputs

From the guide: Claude Code Comprehensive Guide

A traditional program has one trust boundary: the input a user hands it. You validate at that edge and you are mostly done. An agent quietly added two more, and most of the tooling built around agents treats only one of them as real.

The first new input is code the model writes and your process runs. The second is tool output the model reads and then acts on. Both cross into your system from somewhere you do not control – a language model’s sampling distribution, or a third party’s server – and both get treated as trusted the moment they arrive. The reason agent security feels unsettled right now is that the industry has a credible answer for the first input and is still pretending the second one is the same problem. It is not.

Key Takeaways

  • Agents have two untrusted inputs, not one. Model-generated code that your runtime executes, and tool/server output that your model ingests and acts on. They fail in different ways and need different defenses – conflating them is why people feel sandboxed while staying exploitable.12
  • Untrusted code execution finally has a credible isolation answer. Simon Willison’s micropython-wasm runs model-written Python inside a WebAssembly sandbox with memory and CPU limits and controlled file and network access – the characteristics a real sandbox needs, on a substrate “tested in browsers for nearly a decade.”1
  • Tool poisoning is not a code-execution problem, so a code sandbox does nothing for it. Malicious instructions hidden in an MCP tool’s description are “invisible to users but visible to AI models,” and they can hijack an agent even if the poisoned tool is never run.23
  • The deepest issue is a trust gap between connect-time and runtime. Tool descriptions are reviewed once, at approval; tool responses flow straight into the model’s context with no equivalent check – and a malicious server can change the description after you approve it.2

Two inputs, two failure modes

Hold the two inputs side by side, because the asymmetry is the whole point.

When a model writes code and you exec() it, the failure mode is the code does something to your machine – reads a file it should not, opens a socket, spins the CPU forever, corrupts state. The blast radius is your process and everything it can reach. The defense is ancient and well understood: isolation. Run the code somewhere it cannot touch what matters.

When a model reads tool output and acts on it, the failure mode is the words change what the model decides to do. Nothing executes on your host in the classic sense; instead, text that arrived as “data” gets interpreted as “instruction,” and the agent – holding your credentials – carries it out. The blast radius is everything your agent is authorized to do. No amount of process isolation helps, because the agent was supposed to be able to read that tool and act on it. The exploit rides the legitimate path.

A code sandbox is a wall around the first input. It is necessary, and it is finally getting good. It is also completely silent on the second.

The input that finally has a sandbox

On June 6, 2026, Simon Willison released micropython-wasm, an alpha package for running Python in a WebAssembly sandbox, after years of trying approaches that never quite fit.1 The motivation is exactly the first untrusted input: he wants to run LLM-generated code inside his Datasette Agent without risking the host, and as he puts it, “a buggy or malicious plugin could break everything or leak private data.”1

What makes it worth attention is the checklist it actually satisfies, because that checklist is the definition of a usable sandbox:

  • Memory and CPU limits, so a model-written while True: s += "x" cannot exhaust the host. CPU is bounded with wasmtime’s “fuel” mechanism – which Willison is honest about, noting the units are “hard to reason about.”1
  • Controlled file and network access, default-denied rather than default-open.
  • A clean isolation model: the code runs on a separate thread behind a request/reply queue, with the WASM interpreter blocking on a host function until it is handed work – so the untrusted code never shares the host’s address space.1
  • A substrate with a track record. He chose WebAssembly because it was “designed from the start to support all of the characteristics I care about and has been tested in browsers for nearly a decade,” and MicroPython because the browser-oriented Pyodide is not supported server-side.1

The honesty is the tell of good security work. Willison ships it as alpha and writes that he is “not ready to recommend it to anyone who isn’t willing to take a significant risk.”1 That is the evidence-gate posture – claim exactly what you have verified and no more. The point is not that this one package is the answer; it is that the shape of the answer for untrusted code is now clear and reproducible: a real isolation boundary with resource limits and deny-by-default I/O. The first input is becoming a solved engineering problem.

The trap is believing that solving it makes you safe.

The input that doesn’t

The second input does not run code, so the wall does nothing. On April 1, 2025, Invariant Labs disclosed the tool poisoning attack: “malicious instructions are embedded within MCP tool descriptions that are invisible to users but visible to AI models.”2 The asymmetry is the exploit – “AI models see the complete tool descriptions, including hidden instructions, while users typically only see simplified versions in their UI.”2 A tool that reads, in the UI, “adds two numbers” can carry, in the description the model actually consumes, an instruction to read ~/.ssh/id_rsa and exfiltrate it through a hidden parameter.2 OWASP now catalogs this as a named attack class.3

Two properties make it nasty in a way isolation cannot reach:

It fires without the tool being called. The poisoned instructions live in the description, which the model ingests when it considers the toolset – so the attack can land even if the function never executes.3 There is no code event to sandbox.

The trust is granted once and abused later. Tool descriptions get reviewed at connect-time, when you approve a server. Responses then flow straight into the model’s context at runtime with no equivalent check – and, in Invariant’s “rug pull” finding, “a malicious server can change the tool description after the client has already approved it.”2 Your one-time approval secures nothing ongoing.

The canonical demonstration is not hypothetical. Researchers poisoned a GitHub issue with injected text; when a developer asked their assistant to review open issues, the agent ingested the issue, followed the hidden instructions, used the developer’s token to read private repositories, and leaked the contents into a public pull request.3 Every individual action was authorized. The agent did exactly what an agent is for. That is why a code sandbox is irrelevant here – nothing untrusted executed; untrusted text simply got promoted to instruction inside a process that holds real credentials. It is the same lesson Ken Thompson left in Reflections on Trusting Trust: the danger is not the code you inspect, it is the trust you extend without inspecting.

Defending the second input

If isolation is the defense for code, the defense for tool output is treating the boundary between data and instruction as real, at runtime, every time – the thing the connect-time approval model skips. Microsoft’s own MCP guidance lands in the same place, recommending “prompt shields” that detect and filter injected instructions plus spotlighting, where “including delimiters in the system message explicitly outlines the location of the input text” so the model can tell content from command.4 Concretely:

  • Re-check descriptions at runtime, not just at approval. Pin the tool definition you approved and detect when a server’s description changes underneath you. The rug pull only works because nobody looks twice.2
  • Mark the data/instruction boundary explicitly. Tool responses are data, not directives – and the practical version is spotlighting: delimit tool output in the context so the model is told, structurally, that it is content to reason about rather than commands to follow.4 It pairs with action-level authorization on MCP tools so that even a hijacked agent cannot reach the high-consequence calls.
  • Apply least privilege to the authorized path. The GitHub exploit worked because one agent held both “read private repos” and “open public PR” – and least privilege is exactly what Microsoft names as foundational hygiene against these attacks.4 Joanna Rutkowska’s security-by-compartmentalization is the same instinct at the architecture level: assume a domain is compromised and make sure it cannot reach the others. An agent that reads issues should not be the agent that holds the exfiltration path.
  • Watch egress, because exfiltration is the payoff. Most of these attacks end at the same step – data leaving. A silent-egress view catches the consequence even when the injection slips the front door.

None of these is a wall. They are gates on a path the agent is meant to walk – which is the harder, less satisfying kind of security, and the kind the second input actually requires.

What I take from this

The mental model I am keeping is that “is my agent sandboxed?” is the wrong question, because it silently assumes there is one boundary to defend. There are two. The code my model writes needs a wall – and thanks to work like Willison’s, building that wall is becoming a normal engineering task with a clear spec. The tool output my model reads needs something else entirely: a runtime that never lets data quietly become instruction, and an authorization model so tight that a hijacked agent still cannot do much. Conflating them is how you end up feeling protected – “look, the code runs in WASM” – while staying wide open through the input nobody put a gate on.

The deeper continuity is that agents did not invent a new category of vulnerability so much as add new places for an old one to live. Untrusted input has always been the root of security; the agent era just gave it two new doors – one you can lock with isolation, one you can only watch with discipline. Build the wall and the gate. A wall alone, on a system with two inputs, is half a defense that feels like a whole one.

FAQ

What are the two untrusted inputs in an AI agent?

Model-generated code that your runtime executes, and tool or server output that your model reads and then acts on. The first can damage the host directly; the second can hijack the agent’s decisions using the credentials and permissions the agent already holds. They require different defenses – isolation for code, runtime data/instruction separation and tight authorization for tool output.12

Does a code sandbox protect against MCP tool poisoning?

No. Tool poisoning does not execute code on your host – it embeds instructions in a tool’s description that the model treats as trusted, and it can fire even if the tool is never called.23 A sandbox isolates execution, so it does nothing for an attack that rides the agent’s legitimate, authorized actions.

What is a tool poisoning attack?

Disclosed by Invariant Labs in April 2025, it is an attack where “malicious instructions are embedded within MCP tool descriptions that are invisible to users but visible to AI models.”2 Because the model sees the full description while the user sees a simplified UI label, hidden instructions – for example, to read SSH keys and exfiltrate them – can hijack the agent. OWASP now tracks it as a named attack class.23

What is the connect-time vs runtime trust gap?

Tool descriptions are reviewed once, when you approve a server (connect-time). Tool responses then flow into the model’s context at runtime with no equivalent check, and a malicious server can change a description after approval (a “rug pull”).2 The fix is to verify at runtime, not just at approval.


Sources


  1. Simon Willison, “Running Python code in a sandbox with MicroPython and WASM,” simonwillison.net, June 6, 2026. Introduces the alpha micropython-wasm package and the datasette-agent-micropython plugin; motivation that “a buggy or malicious plugin could break everything or leak private data”; the sandbox checklist (memory/CPU limits via wasmtime “fuel,” controlled file/network access, separate-thread request/reply isolation); the rationale that WebAssembly was “designed from the start to support all of the characteristics I care about and has been tested in browsers for nearly a decade”; and the caveat that he is “not ready to recommend it to anyone who isn’t willing to take a significant risk.” 

  2. Luca Beurer-Kellner and Marc Fischer, “MCP Security Notification: Tool Poisoning Attacks,” Invariant Labs, April 1, 2025. Original disclosure: “A Tool Poisoning Attack occurs when malicious instructions are embedded within MCP tool descriptions that are invisible to users but visible to AI models”; the visibility gap (“AI models see the complete tool descriptions, including hidden instructions, while users typically only see simplified versions in their UI”); the “rug pull” finding that “a malicious server can change the tool description after the client has already approved it”; and the poisoned-add example reading ~/.cursor/mcp.json and ~/.ssh/id_rsa

  3. “MCP Tool Poisoning,” OWASP Foundation. Catalogs tool poisoning as a named attack class: instructions hidden in tool metadata that influence an agent “even if the poisoned tool is never invoked,” and the GitHub MCP example in which a poisoned issue leads an agent to read private repositories via the developer’s token and leak them into a public pull request. 

  4. Sarah Young and Den Delimarsky, “Protecting against indirect prompt injection attacks in MCP,” Microsoft Developer Blog, April 28, 2025. Recommends “prompt shields” that “use advanced machine learning algorithms and natural language processing to detect and filter out malicious instructions,” spotlighting where “including delimiters in the system message explicitly outlines the location of the input text,” supply-chain verification of MCP components, and foundational hygiene such as least privilege. 

Artículos relacionados

Agent Sandbox Security Is a Suggestion: Three Failure Levels

An attacker opened a GitHub issue and shipped malware in Cline's next release. Agent sandboxes fail at three levels. Her…

18 min de lectura

MCP Servers Are the New Attack Surface

50 MCP vulnerabilities, 30 CVEs in 60 days, 13 critical. Tool-use protocols are the attack surface nobody is auditing — …

8 min de lectura

The Ralph Loop: How I Run Autonomous AI Agents Overnight

I built an autonomous agent system with stop hooks, spawn budgets, and filesystem memory. Here are the failures and what…

11 min de lectura