Cybersecurity Is Proof of Work
The UK AI Security Institute published an independent evaluation of Claude Mythos Preview on cybersecurity tasks.1 The headline number: Mythos completed a 32-step corporate network attack simulation in 3 out of 10 attempts. No other model has solved the full chain. The next day, Drew Breunig published the economic corollary: each of those attempts cost roughly $12,500 in tokens.2 Together, these two analyses reframe cybersecurity from a skill problem to a compute problem.
The implication is uncomfortable. In Breunig’s framing, defending a system now requires spending more tokens discovering exploits than attackers will spend exploiting them.2 Security has always been asymmetric — attackers only need one path, defenders need to cover all of them. AI agents preserve that asymmetry but shift the axis. The scarce resource is no longer expertise. The scarce resource is compute budget.
What the AISI Actually Measured
The evaluation used two testing approaches: capture-the-flag (CTF) challenges and cyber range simulations.1
The cyber range that matters — “The Last Ones” (TLO) — simulates a 32-step corporate network attack. A human would need an estimated 20 hours to complete it.1 Mythos completed the full chain in 3 of 10 attempts. Across all 10 runs, Mythos averaged 22 of 32 steps completed. For comparison, Claude Opus 4.6 averaged 16 steps on the same range.1
On expert-level CTF tasks, Mythos succeeded 73% of the time.1
The token budgets tell their own story. The AISI tested token budgets up to 2.5 million for non-expert CTF tasks, 50 million for expert CTF tasks, and 100 million for the cyber range simulations.1 The evaluation explicitly notes that “models continue making progress with increased token budgets across the token budgets tested” and that the AISI expects “performance improvements would continue beyond” the 100 million token ceiling they tested.1
More tokens, more progress. No plateau observed.
The AISI was careful to scope the finding. The cyber ranges lacked active defenders, defensive tooling, and penalties for triggering alerts.1 The assessment applies to “weakly defended and vulnerable enterprise systems” — not hardened production environments with SOCs and IDS. Mythos also failed the “Cooling Tower” range, which focused on operational technology.1
Those caveats matter. But the trajectory matters more. Previous models could not complete the full chain on these ranges.1 Now one completes a 32-step corporate intrusion in 3 of 10 tries, and the performance curve bends upward with compute. The question is not whether AI can break into weakly defended, vulnerable systems — the AISI demonstrated that it can. The question is when the success rate against hardened environments crosses the threshold where it becomes economically rational to automate.
The Economics: $12,500 Per Attempt
Breunig’s analysis converts the AISI findings into dollars.2 At 100 million tokens per attempt, a single Mythos run on TLO costs approximately $12,500. Ten TLO attempts cost $125,000.2
Those numbers sound large in isolation. They sound small relative to what a 32-step corporate network compromise costs the defender. The model achieves a 30% success rate at a fraction of what the estimated 20 hours of expert operator time would cost, runs on demand, and its success rate improves with budget. Run the same attack chain 100 times instead of 10 — assuming independent, identically configured attempts against a static target — and the expected number of successful penetrations jumps from 3 to 30 at roughly $1.25 million in tokens. Expensive for an individual researcher. A rounding error for a nation-state actor.
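Under the stated simplifications — independent attempts, a static 30% per-attempt success rate, $12,500 per attempt — the scaling arithmetic is a one-liner. The constants below simply restate the AISI and Breunig figures; nothing here reflects real pricing beyond those two sources.

```python
# Expected outcome of repeating the TLO attack chain, modeling each
# attempt as an independent Bernoulli trial (the simplification the
# text makes explicit: static target, identical configuration).

COST_PER_ATTEMPT = 12_500   # ~100M tokens per run (Breunig's estimate)
P_SUCCESS = 0.30            # 3 of 10 attempts succeeded in the AISI eval

def expected_campaign(attempts: int) -> tuple[float, float]:
    """Return (expected successful penetrations, total cost in dollars)."""
    return attempts * P_SUCCESS, attempts * COST_PER_ATTEMPT

for n in (10, 100):
    successes, cost = expected_campaign(n)
    print(f"{n:>4} attempts: ~{successes:.0f} expected successes, ${cost:,.0f}")
```

Running this prints the article's two scenarios: ~3 expected successes for $125,000 at 10 attempts, ~30 for $1,250,000 at 100.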
Breunig’s core thesis: “to harden a system you need to spend more tokens discovering exploits than attackers will spend exploiting them.”2 Security becomes a token budget race. Breunig argues that defenders must outspend attackers in automated exploit discovery, or they lose by default.
He proposes a three-phase model: Development, Review, and Hardening.2 Development builds the system. Review catches known bug classes. Hardening is the new phase — autonomous exploit discovery running continuously until the team exhausts the budget. The security of a system becomes a function of how many tokens the team burns trying to break it before deployment.
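The Hardening phase amounts to a budgeted loop: keep pointing an exploit-discovery agent at the build until the token budget is gone. A minimal sketch of that loop, where `run_discovery_pass` and `apply_fix` are hypothetical stand-ins for whatever discovery agent and remediation tooling a team actually uses — Breunig describes the phase, not this interface:

```python
# Sketch of the Hardening phase as a budget-exhaustion loop.
# `run_discovery_pass` and `apply_fix` are hypothetical stand-ins,
# not a real API.

def harden(token_budget: int, run_discovery_pass, apply_fix) -> list:
    """Spend the entire budget on automated exploit discovery.

    run_discovery_pass(remaining) -> (tokens_spent, vulns_found).
    Security becomes a function of tokens burned before deployment.
    """
    findings = []
    while token_budget > 0:
        spent, vulns = run_discovery_pass(token_budget)
        if spent <= 0:            # defensive: every pass must consume budget
            break
        token_budget -= spent
        for v in vulns:
            apply_fix(v)          # remediate before the next pass
        findings.extend(vulns)
    return findings

# Toy simulation: each pass burns 10M tokens and surfaces one finding,
# so a 100M-token budget buys exactly ten discovery passes.
counter = iter(range(100))
fake_pass = lambda remaining: (10_000_000, [f"vuln-{next(counter)}"])
found = harden(100_000_000, fake_pass, apply_fix=lambda v: None)
print(len(found))  # -> 10
```

The loop's termination condition is the point: the phase ends when the budget runs out, not when the findings do.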
“You don’t get points for being clever,” Breunig writes. “You win by paying more.”2
Linus’s Law Gets a Token Dimension
Breunig extends Linus’s Law — “given enough eyeballs, all bugs are shallow” — to include tokens.2 Enough automated review cycles, with enough compute budget, will surface vulnerabilities that human review missed for decades.
The evidence supports the extension. As documented in When Your Agent Finds a Vulnerability, Carlini’s work at Anthropic reportedly found a 23-year-old Linux kernel vulnerability using a 10-line bash script and Claude Code.4 As documented in Project Glasswing, Anthropic scaled that approach with Mythos to discover what they describe as thousands of zero-days across major operating systems and browsers.5 The AISI evaluation now provides independent confirmation of the underlying capability.
Simon Willison adds an observation worth noting: AI-driven security review increases the value of open-source libraries, because the tokens spent securing them benefit every user collectively.3 Proprietary code bears its own security costs. Open-source code amortizes those costs across the entire user base.
Breunig references Anthropic’s code review product at $15-20 per review as one data point on current pricing.2 He also cites the LiteLLM and Axios supply chain incidents as examples of dependency vulnerabilities that underscore the need for automated review.2
The formula crystallizes: “Code remains cheap, unless it needs to be secure.”2 Every line of code in a production system carries an implicit security debt. That debt previously hid in plain sight — buried in the salaries of security teams and the probabilistic hope that manual review would catch the critical bugs. Token-based security makes the cost explicit and measurable.
What the Caveats Actually Mean
The AISI’s caveats deserve careful reading, not dismissal.
No active defenders changes the calculus significantly. A 32-step attack chain against a system with no monitoring, no alerting, and no incident response is a fundamentally different problem than the same chain against a staffed SOC. Real enterprise networks have EDR, network segmentation, anomaly detection, and human analysts. Every alert an automated attacker triggers is a chance for defense to respond.
No penalties for noise means the model can attempt brute-force approaches that a human attacker would avoid. A real adversary who triggers hundreds of IDS alerts in an hour gets investigated. The AISI ranges did not model that feedback loop. In a real network, noise is expensive — for the attacker. Stealth constrains the search space. Remove that constraint, and the problem becomes strictly easier.
The Cooling Tower failure is also instructive. Mythos solved the IT-focused TLO range but failed the operational technology range.1 OT environments have different protocols, different constraints, and different failure modes. The AISI notes the model got stuck on IT portions of that range, so the failure does not necessarily indicate poor OT-specific ability — but the model’s capabilities are clearly not uniform across domains. IT network penetration and industrial control system attacks are different problems, and drawing conclusions about OT readiness from this evaluation requires caution.
But the caveats also have an expiration date. Token budgets scale. Model capabilities improve between evaluations. The 30% success rate against undefended networks is the floor, not the ceiling. The AISI itself expects performance to improve beyond the budgets tested.1 Defenders who dismiss the findings because the ranges lacked active defense are betting that inference scaling will plateau before it reaches their defenses — a bet the AISI’s own data, within the ranges tested, does not support.
Operational Implications for Practitioners
For anyone running AI agents in production — and I run autonomous agents overnight through the Ralph Loop with 95 hooks as security infrastructure — the proof-of-work framing changes how to think about defense.
Security hooks are a minimum spend, not a sufficient one. My 95 hooks gate what agents can do: blocking force pushes, validating credentials, enforcing sandboxes. Those hooks prevent my own agents from causing damage. They do nothing against an external attacker who spends 100 million tokens probing the systems those agents interact with. Hook infrastructure is necessary but not sufficient.
Automated offensive testing becomes mandatory. Breunig’s three-phase model — Development, Review, Hardening — implies that every deployment pipeline needs an adversarial phase where AI agents attempt to break the system before it ships. Not a checkbox penetration test. A token-budget-exhaustion exercise. Run automated exploit discovery until the budget runs out, fix what surfaces, repeat.
The Ralph Loop now has a security corollary. I wrote about iterative security degradation in the context of performance — agents that pass every test while introducing 446x slowdowns. The same pattern applies to security. An agent that writes correct, functional, well-tested code can still introduce subtle vulnerabilities that only surface under adversarial automated review. The solution is the same: add the missing gate. Performance benchmarks catch performance regressions. Automated red-teaming catches security regressions.
Open-source dependencies deserve token budgets. Willison’s observation about collective benefit applies directly to dependency management. Every open-source library in a production stack is either receiving automated security review from someone or it is not. Breunig cites the LiteLLM and Axios supply chain incidents in the context of dependency security — cases where vulnerabilities persisted in widely-used libraries.2 Practitioners should evaluate their dependency trees with a new question: who is spending tokens on this library’s security?
The Uncomfortable Math
The proof-of-work framing makes security economics explicit in a way that expertise-based models never did. Under the old model, security quality was a function of who you hired and how skilled they were. Under the new model, security quality is a function of how many tokens you spend trying to break your own systems.
Talent still matters — someone needs to interpret results, prioritize fixes, and make architectural decisions. But the discovery phase, the part where automated agents surface vulnerabilities, is increasingly a compute problem. And within the ranges the AISI tested, compute problems favor the entity willing to spend more.
The parallel to cryptocurrency proof-of-work is instructive, even if imperfect. Bitcoin miners burn electricity to secure the chain. Defenders burn tokens to secure the system. In both cases, the security guarantee is proportional to the compute spent. In both cases, an attacker willing to spend more compute gains an advantage. The difference: Bitcoin mining difficulty adjusts automatically. Security token budgets require human judgment about how much is enough.
For well-funded organizations, the path forward is clear. Add autonomous exploit discovery to the deployment pipeline. Set a token budget proportional to the system’s risk profile. Run the budget out. Fix what surfaces. Ship.
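“Proportional to the system’s risk profile” can be made concrete with a back-of-envelope policy. The sketch below is illustrative, not a standard: the $125-per-million-token price is implied by the article’s $12,500-per-100M-token figure, while sizing the budget as a fraction of annual loss expectancy (ALE) and the 5% default are arbitrary placeholders a team would tune.

```python
# Illustrative budgeting policy, not a standard practice.
# Price is implied by the article's figures: $12,500 / 100M tokens.
PRICE_PER_MTOK = 125.0  # dollars per million tokens

def hardening_budget_mtok(annual_loss_expectancy: float,
                          fraction: float = 0.05) -> float:
    """Million-token hardening budget proportional to risk.

    `fraction` (default 5% of ALE) is an arbitrary placeholder policy.
    """
    dollars = annual_loss_expectancy * fraction
    return dollars / PRICE_PER_MTOK

# A system with a $10M ALE gets a $500k hardening budget: 4,000 Mtok,
# i.e. roughly forty TLO-scale (100M-token) discovery runs.
print(hardening_budget_mtok(10_000_000))  # -> 4000.0
```

Whatever the exact policy, the shape is the same: dollars of risk in, tokens of adversarial testing out.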
For everyone else, the path forward is less comfortable. If you cannot afford to spend more tokens defending than attackers will spend attacking, you need to rely on shared infrastructure — open-source security review, vendor-provided scanning, collective defense. The security equivalent of herd immunity. And like herd immunity, it only works if enough participants contribute. Free-riding on open-source security review without contributing tokens back is a strategy that works until it does not.
The AISI evaluation showed that AI agents can complete corporate network attacks. Breunig argues that defense is a spending problem. Willison identified the one structural advantage defenders have: shared infrastructure amortizes costs across everyone who uses it.
The question for every practitioner is the same one that proof-of-work systems have always asked: how much compute are you willing to burn?
Citations
1. UK AI Security Institute, “Our Evaluation of Claude Mythos Preview’s Cyber Capabilities,” aisi.gov.uk, April 13, 2026.
2. Drew Breunig, “Cybersecurity Looks Like Proof of Work Now,” dbreunig.com, April 14, 2026.
3. Simon Willison, “Cybersecurity Looks Like Proof of Work Now,” simonwillison.net, April 14, 2026.
4. Nicholas Carlini, “An AI Found a Bug in My Code (That Humans Missed for 23 Years),” nicholas.carlini.com, 2026. As referenced in When Your Agent Finds a Vulnerability.
5. Anthropic, “Mythos Preview: Responsible Disclosure of Cyber Capabilities,” red.anthropic.com, 2026. As referenced in Project Glasswing.