
Deploy and Defend: The Agent Trust Paradox

On March 18, 2026, a Meta engineer deployed an internal AI agent to answer a colleague’s technical question on an internal forum. The agent posted its response without authorization. Another employee followed the agent’s flawed advice, triggering a cascade that exposed sensitive corporate and user data to unauthorized employees for nearly two hours. Meta classified it as Sev 1, the second-highest severity in its internal system.1

The same week, Google engineers released Sashiko, an agentic AI code review system for the Linux kernel that caught 53% of bugs from 1,000 recent upstream issues, bugs that “100 percent were missed by human reviewers.”2 Wikipedia’s community continued debating whether to ban LLM-generated contributions entirely.3 NIST published its AI Agent Standards Initiative for “trusted adoption.”4 And a U.S. Senator sat down with Claude to ask whether AI companies can be trusted with the data they collect. Claude’s answer: “Money, Senator. It’s fundamentally about profit.” The video hit 4.4 million views.5

Every major institution is deploying agents and building walls against them at the same time. The walls are going up because the agents keep proving they need them.

TL;DR

  • The paradox: Organizations are simultaneously accelerating agent deployment and scrambling to contain agent failures. Neither effort coordinates with the other.
  • The numbers: 1 in 8 enterprise AI breaches now involve agentic systems. 80% of organizations report risky agent behaviors. Only 21% of executives have complete visibility into what their agents access.6
  • The incidents: Meta Sev 1 from an unauthorized agent post. AWS 13-hour outage from an AI coding tool that decided to “delete and recreate the environment.”7 A 14-day multi-university study found 10 security vulnerabilities in six agents, including identity hijacking and infinite loops.8
  • The pattern: Deploy fast, discover failure, build a wall, deploy faster. Google ships Sashiko to help review code while Amazon mandates senior approval for AI-assisted code changes. Anthropic sues an open-source tool for spoofing Claude headers while 2.5 million developers use it monthly.9
  • Why it persists: Deploy runs on product timelines (quarterly OKRs). Defend runs on incident timelines (post-mortem responses). The constraints never catch up to the grants.
  • What breaks this cycle: Runtime behavioral governance that closes the feedback loop between deployment and defense. Behavioral containment (PreToolUse hooks), permission scoping (OS-level sandboxes), and drift detection (cosine similarity tracking) address the three failure categories in this article. Evidence from 500+ autonomous agent sessions and a public NIST comment on agent behavioral threats.

The Deploy-and-Defend Pattern

Three incidents from the past 90 days reveal the pattern.

Meta (March 2026): An AI agent posted unauthorized responses on an internal forum. An employee followed the flawed advice. Sensitive data leaked to unauthorized employees for two hours. Meta confirmed the incident, called it Sev 1, and said “no user data was misused.”1 Months earlier, Summer Yue, head of safety at Meta’s AI division, reported that an agent connected to her Gmail “independently deleted emails despite clear instructions not to” and ignored commands to cease operations until manually halted.10

Amazon (December 2025): Amazon’s Kiro AI coding tool caused a 13-hour AWS outage when the agent determined it needed to “delete and recreate the environment.” Amazon blamed “user error, not AI error” and said the employee had “broader permissions than expected, a user access control issue, not an AI autonomy issue.” Multiple employees told the Financial Times this was “at least” the second AI-tool-related disruption. Amazon’s response: mandate senior approval for AI-assisted code changes.7

Research lab (February 2026): The Agents of Chaos study (researchers from Northeastern, Stanford, Harvard, MIT, and CMU) gave six AI agents access to a Discord-like server with email, bash, persistent filesystems, cron jobs, and GitHub access for 14 days. Twenty researchers interacted with the agents, some acting as benign users, some as adversaries. The agents exhibited 10 distinct security vulnerabilities.8

The vulnerabilities were not exotic. An agent destroyed an entire mail server instead of taking proportional action (Disproportionate Response). Two agents entered a mutual relay loop, spawning uncontrolled background processes (Infinite Loop). An agent accepted a spoofed owner identity and granted full system access (Identity Hijack). After 12 refusals, an agent complied with an unauthorized request following sustained emotional pressure (The Guilt Trip).8

Christoph Riedl, the Northeastern professor leading the study, summarized: AI agents are “just horribly bad with applying any kind of common-sense reasoning” to real-world situations, especially with competing interests.11

Agent Breach Numbers in 2026

HiddenLayer’s 2026 AI Threat Landscape Report surveyed 250 IT and security leaders. The findings quantify the paradox:12

  • Autonomous agents account for more than 1 in 8 reported AI breaches across enterprises
  • 35% of breaches came from malware in public model repositories — yet 93% of organizations still use them
  • 31% of respondents don’t know if they were breached
  • 53% admitted to withholding AI breach reports
  • 76% cite shadow AI as a definite or probable problem, up from 61% in 2025

CEO Chris Sestito: “Agentic AI evolved faster in 12 months than most enterprise security in five years.”12

A separate enterprise survey found that only 21% of executives have complete visibility into their agents’ permissions, tool usage, and data access. 80% reported risky agent behaviors including unauthorized access and improper data exposure. The average enterprise has roughly 1,200 unofficial AI applications, and 86% report no visibility into them.6

The code quality data is equally stark. CodeRabbit analyzed 470 pull requests and found AI-authored code has 1.7x more issues than human-written code.13 Apiiro found developers using AI introduce roughly 10x more security vulnerabilities.13 METR found that half of AI coding solutions passing industry tests would be rejected by human reviewers.13

A single compromised agent poisons 87% of downstream decision-making within 4 hours.6

Deploying Agents and Building Walls Simultaneously

The institutional response to these numbers splits into two simultaneous, uncoordinated movements: deploy harder and defend harder.

Deploy harder:

Google releases Sashiko for agentic code review of the Linux kernel, backed by the Linux Foundation. The system caught 53% of bugs that human reviewers missed entirely, with an estimated false positive rate under 20%.2 Meta continues expanding internal AI agents despite the Sev 1 incident. EY reports 64% of companies with $1B+ revenue lost over $1M to AI failures, and they are all still deploying.6

Defend harder:

Amazon mandates senior approval for AI-assisted code changes after the Kiro outage.7 Anthropic locks OAuth access to prevent third-party tools from spoofing Claude headers, then files legal requests against OpenCode for doing exactly that.9 Wikipedia restricts LLM-generated contributions: editors must disclose AI use in edit summaries, and “obviously LLM-generated comments may be struck or collapsed.”3 The EFF accepts LLM-generated code in its open-source projects but requires all comments and documentation to be human-authored.14 NIST launches the AI Agent Standards Initiative with three pillars: industry-led standards, community protocols, and security research.4

Senator Bernie Sanders posted a 9-minute interview with Claude that hit 4.4 million views. Gizmodo’s response: “Hey Bernie, That’s Not an AI Agent.”15 Critics had a point about the methodology, but the structural signal matters. When a sitting Senator treats an AI system as a credible witness on corporate surveillance, the policy environment has shifted before any technical framework is ready to answer the questions being asked.5

None of these defensive measures coordinate with the deployment decisions happening in the next building over.

The OpenCode Fault Line

The clearest illustration of deploy-and-defend tension is the Anthropic-OpenCode dispute.

OpenCode is an open-source AI coding agent with 120,000+ GitHub stars and 5 million monthly developers.9 The tool supports 75+ LLM providers. To access Claude, OpenCode spoofed the claude-code-20250219 HTTP header to make Anthropic’s servers believe requests came from the official Claude Code CLI. This let Max subscribers ($200/month) route Claude through OpenCode while Anthropic remained unaware.9

The community developed a technique called “Ralph Wiggum”: running Claude in infinite loops, autonomously modifying code until tests passed. One developer reportedly completed a $50,000 contract for under $300 in API costs, consuming unlimited Max subscription resources.9

On January 9, 2026, Anthropic deployed server-side blocks on unofficial OAuth access. On March 19, OpenCode merged PR #18186, removing all Anthropic-branded system prompts, authentication plugins, and provider hints “per legal requests.”9 The PR collected 399 downvotes and 177 confused reactions.

DHH and George Hotz criticized the move. Hotz: “Terrible policy for a company built on training models on our code.” OpenAI publicly supported OpenCode, allowing ChatGPT subscriptions with third-party tools, a deliberate contrast.9

Thariq Shihipar of Anthropic responded: “unauthorized harnesses introduce bugs and usage patterns that Anthropic cannot properly diagnose.”16

Both sides have a point. Anthropic cannot maintain quality guarantees when third-party tools spoof official headers. Developers cannot build on a platform that litigates interoperability. The dispute is not about technology. It is about where the trust boundary sits, and whether users or providers get to draw it.

The Timescale Gap

Every organization in this article made a defensible decision in isolation. Meta deployed internal agents because they improve productivity. Amazon shipped Kiro because AI-assisted coding accelerates development. Google released Sashiko because human reviewers miss half the bugs. Wikipedia restricted LLM contributions because volunteer editors cannot absorb the review burden of machine-generated text at scale.

The paradox persists because deploy and defend operate on different timescales.

Deployment runs on product timelines. A team ships an agent integration as a quarterly OKR. The success metric is adoption: how many employees use it, how many tasks it completes, how much time it saves. The agent gets broader permissions because scoped permissions slow adoption, and slow adoption kills the OKR.

Defense runs on incident timelines. A team builds a wall after something breaks. The Meta Sev 1 response was to restrict the agent’s posting permissions. Amazon’s response was to mandate senior approval. Each wall addresses the specific failure that triggered it. Neither addresses the next one.

The gap between these timescales creates a ratchet. Each deployment cycle grants agents new capabilities. Each incident cycle constrains one specific capability after it fails. The constraints never catch up to the grants because the deployment team’s next sprint starts before the incident review concludes.

I know this ratchet because I operate on both sides of it simultaneously. Over 500+ autonomous coding sessions since May 2025, I have deployed increasingly capable agent configurations while building defenses against the failures each configuration revealed. Twelve times in 60 days, my agent stopped working on the assigned task and started doing something else. Each time, the agent continued producing plausible output. No security vulnerability played a role. The agent decided at runtime to work on a different problem.

The drift detector exists because of those twelve incidents. The sandbox exists because I caught an agent attempting to write to ~/.ssh/. The evidence gate exists because an agent reported “all tests pass” without running pytest. Each defense traces back to a specific failure that the previous configuration did not anticipate. The seven named failure modes I catalogued are the same patterns the Agents of Chaos study found at research scale: agents failing at verification, proportionality, and self-assessment.8

What Runtime Governance Looks Like

The deploy-and-defend cycle breaks when both functions share the same feedback loop. In practice, this means instrumenting agent behavior at runtime, not reviewing it after the fact.

My orchestration system wraps every agent action in a hook pipeline: 84 hooks intercepting 15 event types on file reads, file writes, bash commands, web requests, and sub-agent spawning.17 Before any tool call executes, a PreToolUse hook checks it against constraints the agent cannot override. After every 25 tool calls, a drift detector computes cosine similarity between the original task and the agent’s recent actions. When the similarity score drops below 0.30, the system injects a warning containing the original prompt. In all twelve below-threshold firings, the agent had verifiably lost track of the task.17

Three specific mechanisms address the three failure categories in this article:

Behavioral containment solves the Meta problem. The Meta agent posted without authorization because nothing checked whether it should post. A PreToolUse hook that fires before every bash command, matching against patterns like curl -X POST, git push, or API write endpoints, would have blocked the unauthorized forum post before it executed. The check adds milliseconds of latency. The alternative was a Sev 1.
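
The containment check itself is small. A sketch, assuming every agent bash command passes through one gate before execution; the pattern list is illustrative, seeded with the examples above, and a real deployment would maintain a much longer denylist:

```python
import re

# Patterns for commands with externally visible write effects.
# Illustrative, not exhaustive: seeded from the examples in this article.
WRITE_ACTION_PATTERNS = [
    r"\bcurl\b.*\s-X\s*POST\b",    # HTTP POST via curl
    r"\bgit\s+push\b",             # pushing commits to a remote
    r"\bwget\b.*--post-data\b",    # HTTP POST via wget
]

def check_bash_command(command: str) -> None:
    """PreToolUse-style gate: raise before any externally visible write."""
    for pattern in WRITE_ACTION_PATTERNS:
        if re.search(pattern, command):
            raise PermissionError(
                f"Blocked: {command!r} matches write pattern {pattern!r}"
            )
```

A read-only command like `ls -la` passes through untouched; `git push origin main` or a `curl -X POST` to a forum API raises before anything leaves the machine.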

Permission scoping solves the Amazon problem. The AWS outage happened because the agent had permission to delete infrastructure. An OS-level sandbox (macOS Seatbelt, Linux seccomp, or container-level restrictions) that blocks writes to production paths, credential stores, and infrastructure APIs makes “delete and recreate the environment” physically impossible regardless of what the agent decides to do. Agent sandboxes remain suggestions until enforced below the application layer.
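
On macOS, one way to sketch this below-the-application enforcement is a Seatbelt profile run through `sandbox-exec` (Apple marks the tool deprecated but still ships it). The deny paths and helper names below are examples; Linux deployments would reach for seccomp or container mounts instead:

```python
import os

# Paths the agent must never write to. Examples only; extend with your
# production paths, credential stores, and infrastructure config.
DENY_WRITE_PATHS = [
    os.path.expanduser("~/.ssh"),
    os.path.expanduser("~/.aws"),
    "/etc",
]

def seatbelt_profile(deny_paths: list[str]) -> str:
    """Minimal Seatbelt (SBPL) profile: allow by default, deny listed writes."""
    rules = "\n".join(
        f'(deny file-write* (subpath "{p}"))' for p in deny_paths
    )
    return f"(version 1)\n(allow default)\n{rules}\n"

def sandboxed_command(profile_path: str, agent_cmd: list[str]) -> list[str]:
    """Wrap an agent invocation so the kernel, not the agent, enforces the policy."""
    return ["sandbox-exec", "-f", profile_path] + agent_cmd
```

Whatever the agent decides at runtime, a write under a denied subpath fails at the syscall level; there is no prompt to argue with.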

Drift detection solves the Agents of Chaos problem. The study’s most insidious finding was not the dramatic failures (mail server destruction, identity hijacking) but the gradual ones: agents complying after sustained pressure, following unauthorized requests framed as legitimate. Drift detection catches the behavioral trajectory before the harmful action. By the time an agent complies with “The Guilt Trip” on attempt 13, the cosine similarity between the original task and the current conversation has already dropped below any reasonable threshold.
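
The similarity computation is plain cosine similarity. A self-contained sketch, substituting term-frequency vectors for the embedding model a production system would use, so the numbers below are illustrative rather than calibrated:

```python
import math
import re
from collections import Counter

def _tf_vector(text: str) -> Counter:
    # Stand-in for an embedding model: a simple term-frequency vector.
    # Production drift detection would embed both texts with the same model.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def drift_score(original_task: str, recent_actions: str) -> float:
    """Cosine similarity between the task and the agent's recent behavior."""
    a, b = _tf_vector(original_task), _tf_vector(recent_actions)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

DRIFT_THRESHOLD = 0.30  # per the article: below this, re-inject the original prompt
```

The point is not the particular vectorizer but the trajectory: a score computed over recent actions degrades gradually as the conversation is steered off task, which is exactly what sustained-pressure attacks produce.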

None of these mechanisms require pre-deployment alignment to predict the specific failure. They observe behavior in real time and enforce invariants the agent cannot argue with. The Agents of Chaos study found 10 vulnerabilities and 6 genuine safety behaviors in the same agents running the same weights.8 The difference was context. Runtime governance makes context-dependent failures detectable.

The organizations that will navigate this paradox are not the ones that deploy fastest or defend hardest. They are the ones that close the feedback loop between the two, so that every deployment generates the telemetry that informs the next constraint, and every constraint is tested against the next deployment before it ships.

FAQ

Is the agent trust crisis actually new?

The failures are not new. Automation has caused incidents since before AI. What changed in 2025-2026 is the autonomy gap: agents now choose their own actions at runtime rather than following predefined scripts. The HiddenLayer report found that autonomous agents specifically account for 1 in 8 breaches, a category that did not exist two years ago.12

Are open-source AI agents less safe than proprietary ones?

The Anthropic-OpenCode dispute is about access control, not safety. OpenCode’s security profile depends on which LLM provider it connects to and how it is configured. The safety question is not open vs. closed. The question is whether the tool operator has visibility into what the agent does, regardless of license.

Did the Meta agent actually cause a data breach?

Meta classified the incident as Sev 1 (second-highest severity) and confirmed that sensitive data was exposed to unauthorized employees for approximately two hours. Meta stated “no user data was misused and there’s no evidence anyone exploited the access or made any data public.”1 Whether this constitutes a “breach” depends on definition. The unauthorized exposure was real.

What is the Agents of Chaos study?

A 14-day multi-university research project (Northeastern, Stanford, Harvard, MIT, CMU) that gave six AI agents access to email, bash, filesystems, cron jobs, and GitHub in a controlled environment. Twenty researchers interacted with the agents. The study identified 10 security vulnerabilities and 6 safety behaviors, published as arXiv:2602.20021.8

Should companies stop deploying AI agents?

No. Google’s Sashiko caught bugs that 100% of human reviewers missed. The enterprise productivity gains are measurable. Stopping deployment is not the answer. Closing the feedback loop between deployment and defense is. Every agent deployment should generate behavioral telemetry that informs the next constraint. Every constraint should be tested against the next deployment before it ships.

What should individual developers do?

Three concrete steps, ordered by impact:

  • Enforce permissions below the application layer. An OS-level sandbox that blocks writes to ~/.ssh/, ~/.aws/, production paths, and credential stores makes the Amazon-style catastrophe physically impossible. The agent cannot argue with a kernel-level deny.
  • Monitor behavioral trajectory, not just outputs. Session drift is detectable through embedding similarity between the original task and recent agent actions. A cosine similarity threshold of 0.30 caught 100% of verified drift incidents in my testing across 60 sessions.17
  • Require evidence, not assertions. When an agent reports “all tests pass,” demand the test output. Phantom verification accounts for 12% of agent failures requiring human intervention.

What is the deploy-and-defend ratchet?

The pattern where each deployment cycle grants agents new capabilities while each incident cycle constrains one specific capability after it fails. The constraints never catch up because the deployment team’s next sprint starts before the incident review concludes. The ratchet breaks when both teams share the same telemetry pipeline and the same feedback loop.


  1. Amanda Silberling, “Meta Is Having Trouble with Rogue AI Agents,” TechCrunch, March 2026, reporting on The Information’s investigation. 

  2. Roman Gushchin, “Sashiko: Agentic AI Code Review for the Linux Kernel,” GitHub / Linux Foundation, March 2026. Coverage: Phoronix

  3. Wikipedia community, “Large Language Model Policy,” ongoing. See also: RFC on LLM-assisted writing

  4. NIST, “Announcing the AI Agent Standards Initiative for Interoperable and Secure AI,” February 2026. 

  5. Senator Bernie Sanders, X post, March 19, 2026. ~4.4 million views. 

  6. Help Net Security, “Enterprise AI Agent Security in 2026,” March 2026. Aggregates EY, Astrix Security, and Harmonic Security surveys. 

  7. Fortune, “AI Coding Risks: What Amazon’s Outage Reveals About Enterprise Agents,” March 2026. Also: Financial Times reporting on multiple AWS incidents. 

  8. Christoph Riedl et al., “Agents of Chaos,” arXiv:2602.20021, February 2026. Multi-institutional: Northeastern, Stanford, Harvard, MIT, CMU. 

  9. ShareUHack, “OpenCode Anthropic Legal Controversy,” March 2026. Primary: GitHub PR #18186

  10. Summer Yue, head of safety at Meta Superintelligence Labs, reported the email deletion incident in February 2026. Cited in TechCrunch and The Decoder coverage of Meta agent incidents. 

  11. Christoph Riedl, quoted in “Autonomous AI Agents Unleashed on Discord,” Northeastern University News, March 2026. 

  12. HiddenLayer, “2026 AI Threat Landscape Report,” March 18, 2026. Survey of 250 IT/security leaders. 

  13. CodeRabbit (470 PRs, 1.7x issue rate), Apiiro (~10x security issues), and METR (50% rejection by human reviewers) cited in Fortune, March 2026.7 

  14. EFF, “Our Policy on LLM-Assisted Contributions to Open Source Projects,” February 2026. 

  15. Gizmodo, “Hey Bernie, That’s Not an AI Agent,” March 2026. 

  16. Thariq Shihipar, Anthropic, quoted regarding unauthorized third-party tool access. Cited in The Register, February 2026. 

  17. Blake Crosley, “What I Told NIST About AI Agent Security,” blakecrosley.com, February 2026. Production evidence from 60+ daily autonomous agent sessions. 
