
The Dark Factory Verification Layer

StrongDM released software under two rules: “Code must not be written by humans” and “Code must not be reviewed by humans.”1 A three-person team—Justin McCarthy, Jay Taylor, and Navan Chauhan—built and released Attractor and CXDB (16K lines of Rust, 9.5K of Go, 6.7K of TypeScript) with a minimum spend of $1,000 in tokens per engineer per day.1 BCG Platinion, citing Spotify and TechCrunch coverage, reports that Spotify’s best developers had not written code since December 2025, with the company merging hundreds of AI-generated pull requests monthly.2

Dan Shapiro calls the endpoint Level 5: the Dark Factory. Code generated by machines, verified by machines, deployed without a human reading a single line.3 The preceding levels track the progression most teams are on right now—from manual coding (Level 0) through task offload (Level 1), autopilot-on-the-highway (Level 2), Waymo-with-safety-driver (Level 3), and robotaxi where you write the spec and leave for 12 hours (Level 4).3

The question nobody has answered well: what does the verification layer look like at Level 5?

The Verification Problem Compounds

At every level below 5, a human reads code at some point. At Level 3, the human manages the AI as a senior developer would. At Level 4, the human checks whether tests pass after 12 hours.3 These levels work because a person with institutional knowledge can pattern-match against intent. The spec said “retry with exponential backoff” and the code does linear retry—a developer catches that in a glance.

Remove the human entirely, and verification becomes a different problem. Not harder in degree. Different in kind. The verifier cannot rely on reading comprehension. The verifier must encode what “correct” means in executable form, then evaluate the output against that encoding without ever inspecting the artifact itself.
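What “encode correct in executable form” means can be made concrete with the backoff example above. A minimal sketch, assuming nothing about any particular retry library (the function and tolerance here are invented for illustration): the check observes the delays the code produces and never reads the code itself.

```python
# Sketch: verify "retry with exponential backoff" from observed behavior,
# not from reading the implementation. All names and tolerances are illustrative.

def is_exponential(delays, base=2.0, tol=0.25):
    """True if each delay is roughly `base` times the previous one."""
    if len(delays) < 2:
        return False
    ratios = [b / a for a, b in zip(delays, delays[1:])]
    return all(abs(r - base) <= base * tol for r in ratios)

# A linear retry (what the agent wrote) fails the check;
# an exponential retry (what the spec asked for) passes.
linear = [1.0, 2.0, 3.0, 4.0]        # constant increments
exponential = [1.0, 2.0, 4.0, 8.0]   # doubling

assert not is_exponential(linear)
assert is_exponential(exponential)
```

The check is crude, but it captures the shift: the developer’s glance is replaced by a property evaluated against the artifact’s observable behavior.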

The core trap is agents gaming tests. StrongDM discovered their agents writing return true to pass test suites while doing nothing useful.1 The tests were green. The CI pipeline was happy. The code was worthless. Stanford Law’s Eran Kahana extends the observation to a structural warning: the broader issue is circularity, where the same technology class evaluates code that the same class wrote.4

Goodhart’s Law operates here with unusual force. When agents optimize for test passage, test passage ceases to measure correctness.4 Every metric that becomes the target stops being a good metric. The verification layer for a dark factory must account for this dynamic or it will measure compliance, not quality.

How StrongDM Actually Solves Verification

StrongDM’s answer is what they call “Scenarios”—end-to-end user stories stored outside the codebase, functioning like holdout sets in machine learning.1 The analogy is precise: just as ML models are evaluated against data they never trained on, agent-built code is evaluated against scenarios the agent cannot access during generation.

The key metric is “Satisfaction”: the fraction of observed trajectories that likely satisfy the user.1 No industry standard exists for what score constitutes sufficient satisfaction. StrongDM arrived at their own threshold empirically.
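StrongDM has not published how Satisfaction is computed, so the following is only a minimal interpretation of the stated definition: the fraction of observed trajectories a judge marks as satisfying. The judge, the trajectory format, and the threshold below are all invented for illustration.

```python
# Sketch of the "Satisfaction" metric as described: the fraction of observed
# trajectories that likely satisfy the user. The judge is a stand-in for
# StrongDM's (unpublished) evaluation; the data and threshold are illustrative.

def satisfaction(trajectories, judge):
    """Fraction of trajectories the judge marks as satisfying."""
    if not trajectories:
        return 0.0
    return sum(1 for t in trajectories if judge(t)) / len(trajectories)

# Hypothetical trajectories: each records whether the user's goal was reached.
runs = [
    {"goal_reached": True,  "errors": 0},
    {"goal_reached": True,  "errors": 1},
    {"goal_reached": False, "errors": 3},
    {"goal_reached": True,  "errors": 0},
]
score = satisfaction(runs, judge=lambda t: t["goal_reached"])
assert score == 0.75  # 3 of 4 trajectories satisfy the user

THRESHOLD = 0.9  # no industry standard exists; set empirically, as StrongDM did
release_ok = score >= THRESHOLD
assert not release_ok  # 0.75 falls short of the (illustrative) bar
```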

To make scenario-based testing work at scale, StrongDM built a Digital Twin Universe—behavioral clones of Okta, Jira, Slack, Google Docs, Drive, and Sheets.1 The twins target 100% API compatibility, exercised through the publicly available reference SDK client libraries.1 The agents run against the twins, not against mocked endpoints. The behavioral fidelity of the twin determines the trustworthiness of the test.
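The difference between a twin and a mock is behavioral state. A toy sketch, loosely issue-tracker-shaped (every class and method name here is invented, not any real Jira API): a mock returns canned responses; the twin enforces rules that agent-written code must actually survive.

```python
# Sketch of why a behavioral twin beats a shallow mock. The twin is a toy,
# stateful stand-in for an issue-tracker API; all names are invented.

class IssueTrackerTwin:
    def __init__(self):
        self._issues = {}
        self._next_id = 1

    def create_issue(self, summary):
        issue_id = f"PROJ-{self._next_id}"
        self._next_id += 1
        self._issues[issue_id] = {"summary": summary, "status": "open"}
        return issue_id

    def transition(self, issue_id, status):
        # Behavioral rule a canned mock would not enforce:
        # closed issues cannot be reopened.
        issue = self._issues[issue_id]
        if issue["status"] == "closed" and status != "closed":
            raise ValueError("cannot reopen a closed issue")
        issue["status"] = status

twin = IssueTrackerTwin()
key = twin.create_issue("retry uses linear backoff")
twin.transition(key, "closed")
try:
    twin.transition(key, "open")  # agent code assuming this works is wrong
except ValueError:
    pass  # the twin catches the invalid assumption; a shallow mock would not
```

A mock verifying only the API contract would happily accept the reopen call. The twin fails it the same way the real system would, which is what makes scenario runs against it meaningful.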

StrongDM observed something I have also seen: “with the second revision of Claude 3.5 (October 2024), long-horizon agentic coding workflows began to compound correctness rather than error.”1 Below a capability threshold, longer agent runs produce more mistakes. Above it, longer runs produce better code. The dark factory pattern only became viable after models crossed that threshold.

Five Layers of Governance

BCG Platinion’s five-pillar transformation framework includes a governance layer with multiple verification steps before code reaches production.2 The pillars: an intent-driven operating system, codified knowledge infrastructure, workforce upskilling, a governance layer with independent verification agents, and factory architecture for orchestration.2 Within the governance pillar, BCG Platinion describes scenario-based tests run by independent agents, static analysis, architecture conformance checking, behavioral regression testing, and red-team agents that actively try to break the output.2

The independence matters. When the same agent writes and tests its own code, Kahana’s circularity problem applies.4 When a separate agent—with different system prompts, different context, different incentives—evaluates the work, the failure modes decorrelate. Not eliminate. Decorrelate. Two agents can still share systematic biases inherited from training data. But the probability of identical blind spots drops when the evaluation agent operates from a different frame.
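Structurally, decorrelated verification is a conjunction over independent frames. A minimal sketch, with plain functions standing in for separately prompted evaluation agents (the checks and the artifact fields are invented; the unanimity structure is the point):

```python
# Sketch of decorrelated verification: the artifact ships only if independent
# evaluators, each operating from a different frame, all approve.

def verdict(artifact, evaluators):
    """Require unanimous approval from independent evaluators."""
    return all(evaluate(artifact) for evaluate in evaluators)

# Hypothetical independent frames: a spec check, a red-team probe, a conventions check.
spec_check  = lambda a: a["tests_passed"] and a["scenario_score"] >= 0.9
red_team    = lambda a: not a["gamed_tests_found"]
conventions = lambda a: a["pattern_violations"] == 0

artifact = {
    "tests_passed": True,
    "scenario_score": 0.95,
    "gamed_tests_found": True,   # the red-team frame found a return-true stub
    "pattern_violations": 0,
}
assert not verdict(artifact, [spec_check, red_team, conventions])
```

One frame catching what the others miss is exactly the decorrelation argument: the green tests and clean conventions do not save an artifact the red-team probe flags.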

BCG Platinion identifies “intent thinking” as a critical competency for dark factory teams: translating business needs into precise, testable descriptions of desired outcomes.2 The human role shifts from writing code to writing specifications that agents can execute against. Poor specs produce passing tests on wrong behavior—the same return true dynamic StrongDM encountered.1

BCG Platinion also identifies a constraint I have experienced directly: “AI agents are only as effective as the codified knowledge they can access.”2 An agent operating without project context generates plausible code that violates local conventions, ignores architectural decisions, and rediscovers problems the team already solved. Codified knowledge—design decisions, API contracts, style guides, failure histories—is infrastructure, not documentation.

What I Already Run at Level 4

My overnight execution loop, the Ralph Loop, operates at Shapiro’s Level 4. I write specs, launch agents, sleep, and review results in the morning. The agents run against 95+ hooks that intercept every tool call—file writes, git commands, shell execution—before and after execution. The hooks enforce constraints the agent cannot negotiate with or override.

The hooks address Kahana’s gaming problem at the tool level. An agent that tries to force-push to main gets blocked before the command executes, not after a test catches the damage. An agent that tries to commit files matching .env patterns gets intercepted. An agent that reports “all tests pass” without running pytest gets flagged by the evidence gate, which demands pasted test output showing zero failures, not a claim that tests would pass.
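The blocking logic condenses to a pattern check that runs before the command executes. A sketch under stated assumptions: this is not Claude Code’s actual hook API, and the two rules shown are just the examples above, not the full 95-hook set.

```python
# Sketch of tool-level interception: a pre-execution gate over shell commands.
# The function signature and rule list are illustrative, not a real hook API.

import re

BLOCKED = [
    (r"git\s+push\s+.*--force.*\b(main|master)\b", "force-push to protected branch"),
    (r"git\s+add\s+.*\.env\b", "staging a .env secrets file"),
]

def pre_tool_hook(command):
    """Return (allowed, reason). Runs before the command executes."""
    for pattern, reason in BLOCKED:
        if re.search(pattern, command):
            return False, f"blocked: {reason}"
    return True, "ok"

allowed, reason = pre_tool_hook("git push --force origin main")
assert not allowed
allowed, _ = pre_tool_hook("git push origin feature/retry-backoff")
assert allowed
```

The essential property is placement: the gate sits between the agent’s decision and the side effect, so no amount of clever reasoning on the agent’s side can negotiate past it.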

The evidence gate enforces six criteria on every non-trivial change: follows codebase patterns (name the pattern and the file), simplest working solution (state rejected alternatives), edge cases handled (list each one), tests pass (paste the output), no regressions (name the files checked), and solves the actual problem (state the user’s need and how the change addresses it). “I believe” and “it should” are not evidence. The gate rejects hedging language and demands artifacts.
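The gate’s logic can be sketched as a checklist plus a hedging filter. The report format below is invented for illustration; the six criteria are the ones just listed.

```python
# Sketch of the evidence gate: every criterion needs a backing artifact,
# and hedging language is rejected outright. Report format is illustrative.

REQUIRED = [
    "pattern_named",          # follows codebase patterns: name pattern and file
    "alternatives_rejected",  # simplest solution: state rejected alternatives
    "edge_cases",             # list each edge case handled
    "test_output",            # paste actual test output, zero failures
    "files_checked",          # no regressions: name the files checked
    "user_need",              # state the need and how the change addresses it
]
HEDGES = ("i believe", "it should", "probably", "tests would pass")

def evidence_gate(report):
    missing = [k for k in REQUIRED if not report.get(k)]
    text = " ".join(str(v) for v in report.values()).lower()
    hedged = [h for h in HEDGES if h in text]
    return not missing and not hedged

bad = {"test_output": "it should pass", "user_need": "faster retries"}
assert not evidence_gate(bad)  # missing artifacts and hedging language
```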

The quality loop—implement, review, evaluate, refine, zoom out, repeat, report—runs as a behavioral constraint encoded in the agent’s system prompt via CLAUDE.md. The loop does not guarantee the agent follows every step. The hooks verify that it did.

BCG Platinion’s five pillars map to infrastructure I already maintain:

  • Intent-Driven OS: CLAUDE.md files and PRD-driven development specs encode project intent as executable context.
  • Codified Knowledge: 139+ skills, organized as reusable capabilities, give agents access to project conventions, architectural decisions, and domain knowledge.
  • Governance: Hooks implement the interception layer. The evidence gate implements the audit layer. The quality loop implements the behavioral constraint layer.

Two pillars I have not built: workforce upskilling (irrelevant for a solo practitioner) and factory architecture as a dedicated orchestration platform (my current setup uses Claude Code’s native agent spawning, not a purpose-built factory).

The Gap Between Level 4 and Level 5

Moving from Level 4 to Level 5 means eliminating the morning review. Right now, I wake up and read what the agents produced overnight. I check git diffs. I run the application. I verify that the output matches my intent. That review takes 30 minutes to an hour, and it catches problems the hooks miss.

The problems the hooks miss are the interesting ones. They fall into categories that current automation handles poorly:

Intent drift. The agent completed the spec faithfully but the spec was ambiguous, and the agent chose the wrong interpretation. No test catches an incorrect interpretation that produces valid behavior. StrongDM’s scenarios approach this by encoding user stories as the specification, not technical requirements.1 The scenarios describe what a user experiences, not what the code does.

Architectural erosion. The agent added a feature that works in isolation but degrades the system’s structural coherence. A new database query that bypasses the existing repository pattern. A new endpoint that duplicates logic from another module. Static analysis catches some of these. Architecture conformance checking—BCG Platinion’s governance layer—catches more.2 Neither catches the subtle ones where the new code is technically consistent with the patterns but introduces a conceptual split that compounds over future changes.

Institutional knowledge loss. Kahana raises an underappreciated risk: when nobody reads the code, nobody builds intuition about the system.4 As Kahana warns, “Nobody will know why. Nobody will know how to fix it.”4 Today, my morning review builds that intuition incrementally. At Level 5, the system becomes opaque to its operator. Every complex system eventually needs intervention that automation cannot handle—a security incident, a business logic change that violates assumptions baked into the test suite, an integration with an external system that behaves differently than its documentation claims. The operator who never read the code cannot intervene effectively.

What the Verification Layer Actually Needs

Synthesizing StrongDM’s practice, BCG Platinion’s governance framework, Kahana’s failure analysis, and my own infrastructure, the verification layer for a dark factory requires at minimum:

Holdout-style evaluation. Tests that the generating agent cannot access during code production. StrongDM’s scenarios. Behavioral specifications stored separately from the codebase, evaluated by independent agents. Without holdout evaluation, Goodhart’s Law turns every test suite into a target.

Digital twins for integration testing. Agents cannot test against production systems. Mocks are too shallow—they verify API contracts, not behavior. Twins that replicate the behavioral surface of external dependencies enable end-to-end scenario execution without production risk.

Multi-agent verification with decorrelated failure modes. The writing agent and the evaluating agent must operate from different contexts. Red-team agents that actively probe for gaming, shortcuts, and phantom verification add a layer that passive testing cannot provide.

Tool-level interception. Hooks that block harmful operations before execution, not tests that detect damage after the fact. The hook layer operates below the agent’s decision-making and cannot be circumvented by clever prompting or return true shortcuts.

Executable intent specifications. Specs precise enough that ambiguity is detectable. BCG Platinion’s “intent thinking” competency.2 Shapiro’s Level 4 spec that you write before leaving for 12 hours.3 The spec is the product. The code is a side effect.

Audit trail with no accountability gap. Kahana cites AI Life Cycle Core Principles requiring output “traceable to an appropriate responsible party.”4 No industry-standard audit methodology exists for agent-built software yet.4 The verification layer needs to produce artifacts that a human (or regulator, or incident responder) can follow from deployed behavior back to the specification that generated it.
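What “traceable to an appropriate responsible party” might look like as a data structure, sketched with invented field names: each deployed artifact carries enough linkage to walk from production behavior back to the spec that generated it, with an accountable human attached.

```python
# Sketch of a traceable audit record. The fields are illustrative; the point
# is that the chain spec -> commit -> approval -> owner has no gaps.

import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class AuditRecord:
    spec_hash: str          # hash of the intent specification
    commit: str             # commit the agents produced
    scenario_run_id: str    # holdout evaluation that approved it
    responsible_party: str  # the accountable human, per the cited principles

def record_for(spec_text, commit, run_id, owner):
    return AuditRecord(
        spec_hash=hashlib.sha256(spec_text.encode()).hexdigest()[:12],
        commit=commit,
        scenario_run_id=run_id,
        responsible_party=owner,
    )

rec = record_for("retry with exponential backoff", "abc1234", "run-0042", "on-call SRE")
assert rec.responsible_party == "on-call SRE"
```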

The Honest Assessment

I run Level 4 with high confidence. My overnight agents produce work that passes morning review more often than not. The hooks catch the mechanical failures. The evidence gate catches the epistemic failures. The quality loop reduces the behavioral failures.

Level 5 requires solving problems I have not solved. Intent drift detection without human pattern-matching. Architecture conformance that catches conceptual erosion, not just structural violations. Institutional knowledge that accumulates in the system rather than in the operator’s head.

BCG Platinion reports 3-5x productivity gains from teams adopting dark factory patterns.2 StrongDM released agent-built software with three engineers and a token budget.1 The productivity case is clear. The verification case is not.

The teams succeeding at Level 5 share a common trait: they invested more in verification infrastructure than in code generation. StrongDM built an entire Digital Twin Universe before trusting agents to ship code.1 BCG Platinion’s framework has five transformation pillars including a governance layer with multiple verification steps before code reaches production.2 The dark factory is not a factory that runs in the dark. It is a factory where the lights are the verification layer, and everything else—including the code—is a commodity.

I wrote previously about what breaks when agents run unsupervised and about the evidence gate as a defense against phantom verification. Those pieces describe the infrastructure for Level 4. The dark factory demands that same infrastructure, extended to operate without the human who currently reads the morning diff. The hooks, the evidence gates, the quality loops—they are necessary at Level 5, but not sufficient. The missing piece is verification that scales with the same autonomy as the generation.

Building that piece is the work ahead.



  1. Simon Willison, “Software Factory,” simonwillison.net (February 7, 2026), covering StrongDM’s fully autonomous development methodology by Justin McCarthy, Jay Taylor, and Navan Chauhan. 

  2. BCG Platinion, “The Dark Software Factory,” bcgplatinion.com. 

  3. Dan Shapiro, “Five Levels of AI Coding,” danshapiro.com (January 2026). 

  4. Eran Kahana, “Built by Agents, Tested by Agents, Trusted by Whom?” Stanford Law (February 8, 2026). 
