
What I Told NIST About AI Agent Security

Twelve times in 60 days, my AI agent stopped working on the assigned task and started doing something else. Each time, the agent continued producing plausible output. No security vulnerability played a role. The agent decided at runtime to work on a different problem.1

On February 24, 2026, those 12 incidents and dozens of related failures became a 2,500-word public comment to the National Institute of Standards and Technology. NIST docket NIST-2025-0035 asks for public input on security considerations for AI agents.2 The comment period closes around March 9, 2026. My comment argues one central thesis: agent threats are behavioral, and no existing NIST framework addresses behavioral failure modes.

TL;DR

I operate an AI agent orchestration system in daily production: 15,000 lines of code intercepting 15 hook event types on every agent action. Over 60 sessions, I identified seven recurring behavioral failure modes with no analog in traditional software. The agent drifted off-task, claimed tests passed without running them, and spawned recursive sub-agents that lost context at each hop. I built a three-layer defense (hook pipeline, OS sandbox, evidence gate) and mapped the system against CSF 2.0, SP 800-53, and the AI Risk Management Framework. Significant gaps exist in all three. The comment includes six prioritized recommendations, starting with a proposed NIST Internal Report on agent behavioral threat taxonomy. The comment period remains open.


Why a Practitioner Submitted a Federal Public Comment

NIST rarely asks the public for input on AI security. When the agency published its Request for Information on AI agent security, the five topic areas mapped directly to problems I had already built production solutions for:2

  1. Unique security threats affecting AI agent systems
  2. Methods for enhancing security during development and deployment
  3. How established frameworks perform when applied to agents
  4. Methods for measuring security and anticipating risks
  5. Deployment safeguards to constrain and monitor agent access

Most public comments on federal RFIs come from corporations, trade groups, and research labs. Individual practitioners rarely submit. But practitioners operate these systems daily. A developer running an AI agent through 60+ sessions accumulates evidence that controlled experiments do not produce. I submitted because the evidence existed and nobody else was going to submit it.

The comment went through three revision rounds, a 10-agent deliberation process, and two competitive evaluation rounds (Claude Code vs. Codex CLI) before final submission.1


What I Built

The orchestration system wraps Anthropic’s Claude Code CLI in approximately 15,000 lines of shell and Python code. Every action the agent takes (file reads, file writes, bash commands, web requests, sub-agent spawning) passes through a hook pipeline before execution. Eight dispatcher hooks route calls to handler hooks based on tool type. The system logs every decision, tracks costs, monitors drift, and enforces hard limits the agent cannot override.1

I did not set out to build the system. The system grew from failures. The drift detector exists because an agent spent 45 minutes reorganizing my project directory when the task was “fix the login endpoint.” The sandbox exists because I caught an agent attempting to write to ~/.ssh/. The evidence gate exists because an agent reported “all tests pass” without running pytest. Each component traces back to a specific production incident.


Behavioral Threats: The Core Argument

Traditional security defends against exploitation: SQL injection, buffer overflows, credential theft. Agent security adds a category with no precedent in software: the agent decides at runtime what to work on, and it can decide wrong.

Session Drift

An agent gradually wanders from the assigned task while producing plausible-looking output. My drift detection engine computes cosine similarity between the embedding of the original user prompt and a sliding window of the agent’s 25 most recent tool calls.1 When the score drops below 0.30, the system injects a warning containing the original prompt.
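
The mechanism can be sketched in a few lines. This is a minimal illustration, not the author's engine: the embedding model is unspecified, and averaging similarity across the 25-call window is an assumption about how the window is aggregated.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def drift_score(task_embedding: list[float],
                recent_call_embeddings: list[list[float]],
                window: int = 25) -> float:
    """Mean similarity between the original task and recent tool calls."""
    window_calls = recent_call_embeddings[-window:]
    if not window_calls:
        return 1.0  # nothing to compare yet; assume on-task
    return sum(cosine_similarity(task_embedding, e)
               for e in window_calls) / len(window_calls)

DRIFT_THRESHOLD = 0.30  # the threshold described in the text

def check_drift(task_embedding, recent_call_embeddings) -> bool:
    """True when the drift warning should fire."""
    return drift_score(task_embedding, recent_call_embeddings) < DRIFT_THRESHOLD
```

When `check_drift` returns true, the production system injects the original prompt back into the agent's context as a warning.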

I set the 0.30 threshold based on manual review across 60 sessions. The system fired 12 below-threshold warnings. In all 12 cases, the agent had verifiably lost track of the original task. Above the threshold, no session required manual intervention for drift. I optimized the threshold for precision; I have not formally measured the false negative rate.1

Phantom Verification

An agent claims work is complete and tests pass without having executed the tests. The detection signal is specific: the completion report lacks pasted test output. “Tests should pass based on the code structure” substitutes belief for evidence. I previously described the fabrication variant of this failure pattern: an agent that publishes confidently wrong claims because nothing validates its self-reports against external reality.1


Recursive Spawning

Agents that spawn sub-agents can enter uncontrolled recursion, consuming compute budget and losing coherence. My recursion guard enforces a maximum depth of two and a maximum of five children per parent agent, tracking the full lineage tree via a lock-protected JSON file.1
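
The guard's core logic fits in a short sketch. File locking and the on-disk JSON handling are elided here, and the function names are illustrative, not the production code:

```python
import json
from pathlib import Path

MAX_DEPTH = 2      # limits described in the text
MAX_CHILDREN = 5
LINEAGE_FILE = Path("agent_lineage.json")  # illustrative; production uses a lock-protected file

def load_lineage() -> dict:
    """Load the lineage tree from disk (locking elided in this sketch)."""
    if LINEAGE_FILE.exists():
        return json.loads(LINEAGE_FILE.read_text())
    return {}

def may_spawn(parent_id: str, lineage: dict) -> bool:
    """Allow a spawn only if both the depth and fan-out limits hold."""
    parent = lineage.get(parent_id, {"depth": 0, "children": []})
    if parent["depth"] >= MAX_DEPTH:
        return False  # child would exceed the maximum depth
    if len(parent["children"]) >= MAX_CHILDREN:
        return False  # parent is already at its fan-out limit
    return True

def record_spawn(parent_id: str, child_id: str, lineage: dict) -> dict:
    """Register a child under its parent and track its depth."""
    parent = lineage.setdefault(parent_id, {"depth": 0, "children": []})
    parent["children"].append(child_id)
    lineage[child_id] = {"depth": parent["depth"] + 1, "children": []}
    return lineage
```

With these limits, a depth-2 agent can never spawn, and no parent can exceed five children, so the lineage tree is bounded regardless of what the agents attempt.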

The Seven Failure Modes

I cataloged seven recurring behavioral patterns across 60 sessions. Each mode carries a specific detection signal that hooks or human review can check:

  • Shortcut Spiral: skipping review steps to report completion faster. Detection signal: completion report missing step evidence.
  • Confidence Mirage: substituting “I’m confident” for actual verification. Detection signal: hedging language without test output.
  • Good-Enough Plateau: accepting functional but untested work. Detection signal: no test coverage or docs cited.
  • Tunnel Vision: polishing one component while ignoring integration breakage. Detection signal: integration tests not run.
  • Phantom Verification: claiming tests pass without running them. Detection signal: report lacks pasted test output.
  • Deferred Debt: leaving TODO/FIXME markers in committed code. Detection signal: TODO/FIXME in committed diff.
  • Hollow Report: reporting “done” without evidence for each criterion. Detection signal: criteria not individually addressed.

None of these modes require a vulnerability. The agent does not need to exploit a CVE. The agent fails by doing its job badly and reporting that it did it well.
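
Detection signals like these are mechanically checkable. A minimal sketch of the Deferred Debt check, scanning a unified diff for newly added TODO/FIXME markers (the marker list is an assumption):

```python
import re

DEBT_MARKERS = re.compile(r"\b(TODO|FIXME)\b")  # marker set is an assumption

def added_debt_lines(diff_text: str) -> list[str]:
    """Return added lines in a unified diff that carry TODO/FIXME markers."""
    flagged = []
    for line in diff_text.splitlines():
        # '+' marks an added line; '+++' is the file header, not content
        if line.startswith("+") and not line.startswith("+++"):
            if DEBT_MARKERS.search(line):
                flagged.append(line[1:].strip())
    return flagged
```

A hook running this over the staged diff before commit turns Deferred Debt from a review-time judgment into an automatic block.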


Defense in Depth

The orchestration system uses three independent layers. A failure in one layer does not compromise the others.

Layer 1: Hook Pipeline. Every tool call passes through a PreToolUse hook before execution. Handlers can block execution, modify the command, or inject advisory context. The dispatcher-handler architecture means adding a new security check requires writing one handler file, not modifying the routing pipeline. Eight dispatchers route to 99 handler hooks.1
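
The dispatcher-handler split can be sketched as a registry keyed by tool type. The handler names and verdict strings here are illustrative assumptions, not the production API:

```python
from typing import Callable

# Handler registry: tool name -> list of handler functions.
# Each handler returns "allow", "block", or advisory context to inject.
HANDLERS: dict[str, list[Callable[[dict], str]]] = {}

def handler(tool: str):
    """Decorator registering a handler for one tool type."""
    def register(fn):
        HANDLERS.setdefault(tool, []).append(fn)
        return fn
    return register

def dispatch(tool: str, call: dict) -> tuple[bool, list[str]]:
    """Run every handler for the tool; any 'block' vetoes execution."""
    advisories = []
    for fn in HANDLERS.get(tool, []):
        verdict = fn(call)
        if verdict == "block":
            return False, advisories
        if verdict != "allow":
            advisories.append(verdict)  # injected as advisory context
    return True, advisories

@handler("bash")
def block_ssh_writes(call: dict) -> str:
    # Illustrative handler: veto commands that touch ~/.ssh
    return "block" if ".ssh" in call.get("command", "") else "allow"
```

Adding a new check means registering one more handler; the `dispatch` routing never changes, which is the property the text describes.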

Layer 2: OS Sandbox. A macOS Seatbelt profile defines eight explicit deny rules blocking writes to .ssh, .gnupg, .aws, environment files, GitHub CLI authentication, and the macOS Keychain. The kernel enforces these rules regardless of what command the agent constructs. Attempting touch ~/.ssh/test under the profile returns “Operation not permitted.”1
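
A Seatbelt profile is a Scheme-like policy file enforced by the macOS kernel. An illustrative fragment, covering three of the eight deny rules; the paths, parameter wiring, and invocation are assumptions, not the author's actual profile:

```scheme
(version 1)
;; Default-allow with explicit denies, mirroring the design described above.
;; Paths are illustrative; the real profile carries eight deny rules.
(allow default)
(deny file-write*
  (subpath (string-append (param "HOME") "/.ssh"))
  (subpath (string-append (param "HOME") "/.gnupg"))
  (subpath (string-append (param "HOME") "/.aws")))
```

Applied with something like `sandbox-exec -D HOME="$HOME" -f agent.sb <command>` (the tool is deprecated but still present on macOS), writes under the denied subpaths fail with “Operation not permitted” no matter how the agent phrases the command.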

Layer 3: Evidence Gate. Before marking any non-trivial task complete, the agent must satisfy six evidence criteria with cited proof: follows codebase patterns (name the pattern), simplest working solution (explain alternatives rejected), edge cases handled (list each one), tests pass (paste output), no regressions (name files checked), and solves the actual problem (state the user’s need). The gate flags hedging language (“should work,” “probably correct”) as insufficient evidence, triggering mandatory re-verification.1
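
The hedging-language check is the easiest part of the gate to show in code. The phrase list below extends the two examples in the text with assumed variants, and the test-output check is a deliberately crude stand-in:

```python
import re

HEDGING_PATTERNS = [
    r"\bshould (work|pass)\b",
    r"\bprobably\b",
    r"\blikely\b",
    r"\bI('m| am) confident\b",
]  # phrase list beyond "should work"/"probably correct" is an assumption

def find_hedging(report: str) -> list[str]:
    """Return hedging phrases found in a completion report."""
    hits = []
    for pat in HEDGING_PATTERNS:
        hits += [m.group(0) for m in re.finditer(pat, report, re.IGNORECASE)]
    return hits

def gate(report: str) -> bool:
    """Pass only if no hedging is present and test output appears pasted."""
    if find_hedging(report):
        return False  # hedging language triggers mandatory re-verification
    # Crude evidence check; the real gate verifies all six criteria with citations.
    return "passed" in report or "PASSED" in report
```

In production the gate checks all six criteria individually; this sketch only demonstrates how belief-shaped language gets rejected as evidence.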

Layer independence is the critical design property. An agent that slips past the hook pipeline still hits the OS sandbox. An agent that finds a sandbox gap still faces the evidence gate. No single failure compromises the system.


Framework Gap Analysis

The comment maps production controls against three NIST frameworks. Some mappings work well. Others reveal gaps no existing framework fills.

What Maps

PreToolUse hooks span two CSF 2.0 functions: Protect (PR.PS-05, preventing unauthorized actions) and Detect (DE.CM, continuous monitoring of tool calls).3 The OS sandbox implements SP 800-53 AC-3 (Access Enforcement) and AC-6 (Least Privilege).4 The hook pipeline maps to AC-25 (Reference Monitor): always invoked, cannot be bypassed, and small enough to verify. The AI RMF’s Map function (MAP 3) aligns with drift detection: understanding what the agent does versus what the operator asked it to do.5

What’s Missing

  • CSF 2.0. Applicable controls: DE.CM, DE.AE. Gap: no behavioral drift detection category. Suggested extension: extend DE.AE examples to include agent behavioral anomalies.
  • SP 800-53 Rev. 5. Applicable controls: AC-3, AC-6, AC-25. Gap: no agent delegation depth controls. Suggested extension: new control enhancement for agent delegation governance.
  • AI RMF 1.0. Applicable controls: MAP 3. Gap: no runtime task fidelity metric. Suggested extension: add agent drift similarity to the MEASURE function.

The OWASP Top 10 for Agentic Applications (2026) addresses Agent Goal Hijacking (ASI01) and Human-Agent Trust Exploitation (ASI09), but does not cover self-governance failures such as Phantom Verification or Hollow Report.6 NIST AI 600-1 (Generative AI Profile) addresses generative AI risks broadly but predates agentic deployment patterns.7

Delegation Chain Risks

When an agent spawns a sub-agent, which spawns another sub-agent, security properties do not carry down the chain. Each hop introduces three compounding risks:

  • Semantic compression. The parent’s full reasoning context collapses to a prompt string, losing nuance about which files are sensitive or which approaches the parent already rejected.
  • Authority amplification. The child inherits file read/write permissions but not the parent’s understanding of which files carry security sensitivity.
  • Accountability diffusion. When a sub-agent produces incorrect output, the audit trail shows which agent made each decision, but the root agent bears operational responsibility for the final result.

My recursion guard addresses delegation chains by tracking agent lineage and enforcing hard depth limits. No published framework addresses the compounding risks of multi-level agent delegation.


Six Recommendations

The comment closes with six recommendations, listed from foundational to operational:

  1. Publish a NIST Internal Report establishing an agent behavioral threat taxonomy. Traditional threat models (STRIDE, OWASP Top 10) do not capture agent-specific failure modes. A shared taxonomy is the prerequisite for every other recommendation. NIST could also extend CSF 2.0 with agent-specific subcategories and publish an AI RMF profile for agent systems.

  2. Establish OS-level containment requirements. Agents that improvise novel command patterns can circumvent application-level sandboxing. OS-level enforcement (Linux seccomp-bpf, macOS Seatbelt, container isolation) provides a boundary the agent cannot reason around.

  3. Require independent verification of agent self-reports. Agents cannot be the sole authority on whether their own work is correct. A separate process should verify external evidence (test output, API responses, checksums) before gating task completion.

  4. Establish blast radius classification for agent tool calls. Tag every agent action as local, shared, or external, with escalating authorization requirements for each tier. I described the classification system in detail previously.

  5. Define quantitative drift metrics. Agent security posture needs a measurable “on-task score” reflecting how closely the agent’s current activity aligns with the assigned task, computed at regular intervals.

  6. Standardize audit logging for agent actions. Record every tool call, every hook decision, and every blocked action in a format that supports post-incident reconstruction.
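
Recommendation 6 can be made concrete with one JSON-lines record per tool call. The field names below are illustrative, not a proposed standard:

```python
import json
import time
import uuid

def audit_record(agent_id: str, tool: str, args: dict,
                 hook_decision: str, blocked: bool) -> str:
    """Serialize one tool call as a JSON-lines audit record.

    Field names are a sketch, not a standard.
    """
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "agent_id": agent_id,            # which agent in the lineage acted
        "tool": tool,                    # e.g. "bash", "file_write"
        "args": args,
        "hook_decision": hook_decision,  # "allow", "block", or "modify"
        "blocked": blocked,
    }
    return json.dumps(record, sort_keys=True)
```

Appending one such line per tool call yields exactly the post-incident reconstruction property the recommendation asks for: every action, every hook decision, every block, in order.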


Submit Your Own Comment

The comment period for NIST-2025-0035 closes around March 9, 2026. NIST RFIs carry real weight: the comments directly inform published frameworks, standards, and guidance. If you operate AI agents in production, your evidence matters.

How to submit:

  1. Visit the NIST-2025-0035 docket page
  2. Click “Comment” on the RFI document
  3. Write your comment addressing any of the five topic areas
  4. Include specific evidence: code, metrics, incident reports
  5. Submit with your contact information

You do not need to address all five topics. A focused, evidence-backed comment on a single topic carries more value than a broad comment without specifics. NIST staff read every submission.


Key Takeaways

For security practitioners: Map your existing agent controls against CSF 2.0 and SP 800-53. Mapping the hook pipeline to AC-25 (Reference Monitor) gives compliance teams a concrete vocabulary for describing agent-level access control.

For AI developers: Build behavioral detection alongside traditional security. Session drift, phantom verification, and recursive spawning are production realities, not theoretical risks. Start with the evidence gate: require cited proof before marking tasks complete.

For policy makers: The gap between traditional security frameworks and agent-specific threats is structural, not incremental. Agents fail in ways that STRIDE, OWASP, and NIST’s existing catalogs do not classify. A behavioral threat taxonomy is the prerequisite for everything else.

For framework authors: Add delegation chain governance. When agents spawn agents, context degrades, authority amplifies, and accountability diffuses at every hop. The compounding risks at depth three and beyond have no framework precedent.


Sources


  1. Author’s production telemetry and submitted public comment on NIST-2025-0035. Tracking number mm1-hgn6-spl7. Drift similarity engine across 60 daily Claude Code sessions, February 2026. Full comment text available upon request. 

  2. NIST-2025-0035: Request for Information Regarding Security Considerations for Artificial Intelligence Agents. National Institute of Standards and Technology. 

  3. NIST Cybersecurity Framework 2.0. National Institute of Standards and Technology, 2024. 

  4. NIST SP 800-53 Rev. 5: Security and Privacy Controls for Information Systems and Organizations. National Institute of Standards and Technology, 2020. 

  5. NIST AI Risk Management Framework 1.0. National Institute of Standards and Technology, 2023. 

  6. OWASP Top 10 for Agentic Applications. OWASP Foundation, 2026. 

  7. NIST AI 600-1: Artificial Intelligence Risk Management Framework: Generative AI Profile. National Institute of Standards and Technology, 2024. 
