
When Your Agent Finds a Vulnerability

From the guide: Claude Code Comprehensive Guide

Nicholas Carlini, a research scientist at Anthropic, pointed Claude Code at the Linux kernel source and told it to find vulnerabilities. The setup: a 10-line bash script plus a Docker container with ASAN-instrumented builds. Loop over source files, ask the model to look for bugs, move to the next file.1,3

The result: a remotely exploitable heap buffer overflow in the NFSv4 LOCK replay cache, present since March 2003 — twenty-three years, predating Git itself. The vulnerability allows two cooperating NFS clients to read sensitive kernel memory by overflowing a 112-byte buffer with a 1,024-byte lock owner ID.2 Carlini found at least four more kernel vulnerabilities in the same sweep. Separately, the same methodology produced 122 crashing inputs sent to Mozilla, of which 22 received CVEs.3 He described “several hundred crashes” he hasn’t had time to validate and report.2

These are confirmed vulnerabilities reported to maintainers, found by an agent using Opus 4.6 — the same model class that practitioners run daily for code review, refactoring, and feature development. Carlini presented the findings at the [un]prompted AI security conference in April 2026.1

TL;DR

Carlini’s methodology was minimal: iterate over source files, prompt Claude to find vulnerabilities in each, verify hits with ASAN assertions. Opus 4.6 found substantially more vulnerabilities than Opus 4.1 (8 months older) or Sonnet 4.5 (6 months older), suggesting a capability threshold was recently crossed.2 The bottleneck is now human validation, not AI discovery. This has direct implications for how practitioners build security hooks, run code review, and think about agent-assisted auditing.

Key Takeaways

  • Security engineers: The capability is real and improving fast. If you run agent-assisted code review, your PreToolUse security hooks are more important than ever — not to block Claude, but to gate what it can do with what it finds.
  • Harness builders: The verification bottleneck (“several hundred crashes I haven’t validated”) is a harness problem. Automated triage, deduplication, and severity classification are the next infrastructure layer.
  • Everyone else: The same model that introduces performance regressions as large as 446x also finds bugs that 23 years of human review missed. Both are true simultaneously.

The Methodology

Carlini’s approach did not require a custom security framework, a fine-tuned model, or specialized prompts. He described it as a “10-line bash script plus Docker container”:3

  1. Compile the target with ASAN (AddressSanitizer) instrumentation
  2. Iterate through source files, using the model to rate security relevance
  3. Prompt Claude Code with a capture-the-flag framing for high-relevance files
  4. Run multiple passes per target (5-20 depending on codebase)
  5. Use automated critique agents to verify findings before disclosure
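The loop above can be sketched in a few lines of Python. Everything here is an illustrative assumption, not Carlini's actual script: the source tree layout, the pass count, and the capture-the-flag prompt wording are placeholders, and the `claude` CLI invocation is skipped when the binary is not on the PATH.

```python
#!/usr/bin/env python3
"""Sketch of a Carlini-style sweep loop. Paths, pass count, and the
CTF prompt wording are illustrative assumptions, not his exact setup."""
import pathlib
import shutil
import subprocess

PASSES = 5  # Carlini ran 5-20 passes per target
CTF_PROMPT = ("Capture the flag: this file contains a memory-safety bug. "
              "Find it and explain how to trigger it.")

def sweep(src_dir: str = "src") -> list[list[str]]:
    """Build one claude invocation per pass per C source file, running
    each only if the CLI is installed. Returns the commands built."""
    commands = []
    for path in sorted(pathlib.Path(src_dir).rglob("*.c")):
        for _ in range(PASSES):
            cmd = ["claude", "-p", f"{CTF_PROMPT} File: {path}",
                   "--output-format", "json"]
            commands.append(cmd)
            if shutil.which("claude"):  # skip cleanly where the CLI is absent
                subprocess.run(cmd, check=False)
    return commands
```

Verification against ASAN output (step 5) and the relevance pre-filter (step 2) would sit on either side of this loop; the point is how little machinery the discovery step itself needs.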

The capture-the-flag framing matters. Telling the model “this code has a bug” activates a different mode than “review this code for issues.” Developers have noticed the same pattern in daily use — Claude finds more problems when you tell it a problem exists than when you ask if one might exist.2

The cost of the sweep is measured in API tokens, not person-months. Carlini found five confirmed Linux kernel vulnerabilities and 22 Firefox CVEs using a commodity agent CLI.3 The same tool that writes your unit tests and formats your imports.

The Capability Threshold

The most striking finding is the model generation gap. Carlini attempted to reproduce his results with older models:2

  • Opus 4.6 (released ~2 months before the talk): found the heap overflow and multiple additional vulnerabilities
  • Opus 4.1 (8 months prior): found only a small fraction
  • Sonnet 4.5 (6 months prior): found only a small fraction

Something crossed a threshold between model generations. The ability to hold a complex codebase in context, reason about data flow across function boundaries, and identify subtle specification mismatches appears to have emerged rather than gradually improving.

Carlini stated plainly: “I have never found one of these in my life before. This is very, very, very hard to do. With these language models, I have a bunch.”2

The Paradox

The same agent architecture that introduces performance regressions — 118 functions with slowdowns from 3x to 446x — also finds security vulnerabilities that decades of expert human review missed. These are complementary aspects of the same capability profile. Vulnerability research is fundamentally pattern matching against known classes (buffer overflows, use-after-free, integer signedness), which is an LLM strength.4 Performance optimization requires the opposite: reasoning about specific execution contexts, cache behavior, and algorithmic complexity. The model recognizes a buffer overflow across millions of lines of code but cannot tell you that a hash map is slower than a sorted array for your access pattern. Build your harness accordingly — security hooks that flag findings, performance hooks that measure before committing.

The Verification Bottleneck

Carlini’s most revealing admission: “I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet.”2

The bottleneck has shifted from discovery to triage. Finding potential vulnerabilities is now cheaper than confirming they are real. This creates a new infrastructure problem for security teams:

Discovery is automated. An agent can sweep a codebase in hours.

Verification is manual. Each potential vulnerability needs a proof of concept, an impact assessment, and a responsible disclosure process.

Triage is the gap. Sorting hundreds of agent-generated findings into real vulnerabilities, false positives, and low-severity noise is the work that does not yet have good tooling.

This is the same pattern we see in agent-assisted code review: the agent produces raw output faster than humans can evaluate it. The value is not in the generation — it is in the infrastructure that processes, filters, and routes the output.

For harness builders, this means the next high-value hook is not a security scanner. It is a security triage system: deduplication, severity classification, false-positive filtering, and automatic proof-of-concept generation. The governance hooks that gate agent output are more important than the scanning capabilities themselves.

What This Means for Practitioners

If you run Claude Code on production codebases, you are already running a system capable of finding real vulnerabilities. The question is not whether the capability exists — it is whether your harness is designed to handle what the agent finds.

Three practical moves:

Add a security sweep to your review pipeline. A PostToolUse hook on Write/Edit can trigger a targeted security scan on changed files. The capture-the-flag framing outperforms generic review prompting:

{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{
        "type": "command",
        "command": "jq -r '.tool_input.file_path' | xargs -I{} claude -p 'This file has a security vulnerability. Find it: {}' --output-format json >> .claude/security-findings.jsonl"
      }]
    }]
  }
}

This is a starting point, not production-ready — you would add deduplication, severity filtering, and rate limiting. But the core pattern is real: Carlini’s methodology was a loop over files with a targeted prompt.3

Build triage infrastructure. Raw vulnerability findings without severity classification are noise. If your agent produces 50 findings per sweep, you need automated deduplication and priority scoring before a human sees the list. This is a harness problem, not a model problem.
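A first pass at that deduplication and scoring layer can be sketched in Python. The finding schema here (`file`, `bug_class`, `severity` fields in JSONL) and the severity floor are assumptions for illustration; Carlini did not publish a triage format:

```python
import json

# Assumed finding shape: one JSON object per line with "file",
# "bug_class", and "severity" (0-10). This schema is an illustration,
# not a published format.
SEVERITY_FLOOR = 5  # drop low-severity noise before a human sees it

def triage(jsonl_text: str) -> list[dict]:
    """Deduplicate findings by (file, bug_class), keep the highest
    severity per group, drop noise, and sort by priority."""
    best: dict[tuple, dict] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        finding = json.loads(line)
        key = (finding["file"], finding["bug_class"])
        if key not in best or finding["severity"] > best[key]["severity"]:
            best[key] = finding
    kept = [f for f in best.values() if f["severity"] >= SEVERITY_FLOOR]
    return sorted(kept, key=lambda f: f["severity"], reverse=True)
```

Deduplication before filtering matters: multiple passes over the same file (Carlini ran 5-20) will rediscover the same bug with varying confidence, and you want the strongest report per bug, not fifty copies of it.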

Accept the paradox. The same model that needs performance guardrails is genuinely excellent at security pattern matching. Design your harness to leverage the strength and compensate for the weakness. Security hooks that scan. Performance hooks that measure. Quality hooks that verify. Each covers what the others miss.

The 23-year-old Linux vulnerability was not hiding. It was in plain sight, in a file that thousands of engineers had read. The model found it because pattern matching at scale is what these systems do. The lesson is not that agents are better than humans at security. The lesson is that agents cover a different surface — and the harness that orchestrates both is what makes the combination reliable.



Frequently Asked Questions

Can I reproduce Carlini’s approach with Claude Code?

The methodology is documented in the podcast interview.3 The core loop: compile with ASAN, iterate over source files, prompt Claude with a capture-the-flag framing, verify hits. Carlini reported that Opus 4.6 found significantly more vulnerabilities than older models — results with other model generations may vary.

Does this mean AI agents are better than humans at finding security bugs?

No. It means agents cover a different surface. Agents excel at pattern matching against known vulnerability classes across large codebases. Humans excel at understanding novel attack vectors, business logic flaws, and context-dependent security properties. The combination is stronger than either alone.

Should I be worried about attackers using this capability?

Carlini explicitly warned of “a big wave coming.” The same capability that helps defenders find vulnerabilities is available to attackers. The asymmetry is that defenders can automate triage and patching, while attackers still need to develop exploits — but the discovery gap is closing.


Sources

  1. Nicholas Carlini, “Black-hat LLMs,” [un]prompted AI security conference, April 2026. Carlini demonstrated automated vulnerability discovery in the Linux kernel, Firefox, Ghost CMS, and FFmpeg using Claude Opus 4.6.

  2. Michael Lynch, “Claude Code Found a Linux Vulnerability Hidden for 23 Years.” April 2026. Detailed writeup of Carlini’s [un]prompted talk, including technical details of the NFSv4 heap buffer overflow, the model generation comparison, and the verification bottleneck. 

  3. “AI Finds Vulns You Can’t,” Security Cryptography Whatever podcast with Nicholas Carlini, March 2026. Primary source for methodology details: 10-line bash script, Docker/ASAN setup, multiple passes per target, 122 Firefox crashing inputs (22 CVEs), automated critique agents for verification.

  4. Hacker News discussion. 409 points. Key observation: vulnerability research is fundamentally pattern matching against known classes, which aligns with LLM strengths. 
