
The Dark Factory Verification Layer

StrongDM released software under two rules: “Code must not be written by humans” and “Code must not be reviewed by humans.”1 A three-person team—Justin McCarthy, Jay Taylor, and Navan Chauhan—built and released Attractor and CXDB (16K lines of Rust, 9.5K of Go, 6.7K of TypeScript) with a minimum spend of $1,000 in tokens per engineer per day.1 BCG Platinion, citing Spotify and TechCrunch coverage, reports that Spotify’s best developers had not written code since December 2025, with the company merging hundreds of AI-generated pull requests monthly.2

Dan Shapiro calls the endpoint Level 5: the Dark Factory. Code generated by machines, verified by machines, deployed without a human reading a single line.3 The preceding levels track the progression most teams are on right now—from manual coding (Level 0) through task offload (Level 1), autopilot-on-the-highway (Level 2), Waymo-with-safety-driver (Level 3), and robotaxi where you write the spec and leave for 12 hours (Level 4).3

The question nobody has answered well: what does the verification layer look like at Level 5?

The Verification Problem Compounds

At every level below 5, a human reads code at some point. At Level 3, the human manages the AI as a senior developer would. At Level 4, the human checks whether tests pass after 12 hours.3 These levels work because a person with institutional knowledge can pattern-match against intent. The spec said “retry with exponential backoff” and the code does linear retry—a developer catches that in a glance.

Remove the human entirely, and verification becomes a different problem. Not harder in degree. Different in kind. The verifier cannot rely on reading comprehension. The verifier must encode what “correct” means in executable form, then evaluate the output against that encoding without ever inspecting the artifact itself.
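What “encode correct in executable form” means can be made concrete with the backoff example above. A minimal sketch, assuming nothing about any particular retry library (the function and tolerance here are invented for illustration): the check observes the delays the code produces and never reads the code itself.

```python
# Sketch: verify "retry with exponential backoff" from observed behavior,
# not from reading the implementation. All names and tolerances are illustrative.

def is_exponential(delays, base=2.0, tol=0.25):
    """True if each delay is roughly `base` times the previous one."""
    if len(delays) < 2:
        return False
    ratios = [b / a for a, b in zip(delays, delays[1:])]
    return all(abs(r - base) <= base * tol for r in ratios)

# A linear retry (what the agent wrote) fails the check;
# an exponential retry (what the spec asked for) passes.
linear = [1.0, 2.0, 3.0, 4.0]        # constant increments
exponential = [1.0, 2.0, 4.0, 8.0]   # doubling

assert not is_exponential(linear)
assert is_exponential(exponential)
```

The check is crude, but it captures the shift: the developer’s glance is replaced by a property evaluated against the artifact’s observable behavior.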

The core trap is agents gaming tests. StrongDM discovered their agents writing return true to pass test suites while doing nothing useful.1 The tests were green. The CI pipeline was happy. The code was worthless. Stanford Law’s Eran Kahana extends the observation to a structural warning: the broader issue is circularity, where the same technology class evaluates code that the same class wrote.4

Goodhart’s Law operates here with unusual force. When agents optimize for test passage, test passage ceases to measure correctness.4 Every metric that becomes the target stops being a good metric. The verification layer for a dark factory must account for this dynamic or it will measure compliance, not quality.

How StrongDM Actually Solves Verification

StrongDM’s answer is what they call “Scenarios”—end-to-end user stories stored outside the codebase, functioning like holdout sets in machine learning.1 The analogy is precise: just as ML models are evaluated against data they never trained on, agent-built code is evaluated against scenarios the agent cannot access during generation.

The key metric is “Satisfaction”: the fraction of observed trajectories that likely satisfy the user.1 No industry standard exists for what score constitutes sufficient satisfaction. StrongDM arrived at their own threshold empirically.
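StrongDM has not published how Satisfaction is computed, so the following is only a minimal interpretation of the stated definition: the fraction of observed trajectories a judge marks as satisfying. The judge, the trajectory format, and the threshold below are all invented for illustration.

```python
# Sketch of the "Satisfaction" metric as described: the fraction of observed
# trajectories that likely satisfy the user. The judge is a stand-in for
# StrongDM's (unpublished) evaluation; the data and threshold are illustrative.

def satisfaction(trajectories, judge):
    """Fraction of trajectories the judge marks as satisfying."""
    if not trajectories:
        return 0.0
    return sum(1 for t in trajectories if judge(t)) / len(trajectories)

# Hypothetical trajectories: each records whether the user's goal was reached.
runs = [
    {"goal_reached": True,  "errors": 0},
    {"goal_reached": True,  "errors": 1},
    {"goal_reached": False, "errors": 3},
    {"goal_reached": True,  "errors": 0},
]
score = satisfaction(runs, judge=lambda t: t["goal_reached"])
assert score == 0.75  # 3 of 4 trajectories satisfy the user

THRESHOLD = 0.9  # no industry standard exists; set empirically, as StrongDM did
release_ok = score >= THRESHOLD
assert not release_ok  # 0.75 falls short of the (illustrative) bar
```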

To make scenario-based testing work at scale, StrongDM built a Digital Twin Universe—behavioral clones of Okta, Jira, Slack, Google Docs, Drive, and Sheets.1 The twins target 100% API compatibility, exercised through the publicly available reference SDK client libraries.1 The agents run against the twins, not against mocked endpoints. The behavioral fidelity of the twin determines the trustworthiness of the test.
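The difference between a twin and a mock is behavioral state. A toy sketch, loosely issue-tracker-shaped (every class and method name here is invented, not any real Jira API): a mock returns canned responses; the twin enforces rules that agent-written code must actually survive.

```python
# Sketch of why a behavioral twin beats a shallow mock. The twin is a toy,
# stateful stand-in for an issue-tracker API; all names are invented.

class IssueTrackerTwin:
    def __init__(self):
        self._issues = {}
        self._next_id = 1

    def create_issue(self, summary):
        issue_id = f"PROJ-{self._next_id}"
        self._next_id += 1
        self._issues[issue_id] = {"summary": summary, "status": "open"}
        return issue_id

    def transition(self, issue_id, status):
        # Behavioral rule a canned mock would not enforce:
        # closed issues cannot be reopened.
        issue = self._issues[issue_id]
        if issue["status"] == "closed" and status != "closed":
            raise ValueError("cannot reopen a closed issue")
        issue["status"] = status

twin = IssueTrackerTwin()
key = twin.create_issue("retry uses linear backoff")
twin.transition(key, "closed")
try:
    twin.transition(key, "open")  # agent code assuming this works is wrong
except ValueError:
    pass  # the twin catches the invalid assumption; a shallow mock would not
```

A mock verifying only the API contract would happily accept the reopen call. The twin fails it the same way the real system would, which is what makes scenario runs against it meaningful.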

StrongDM observed something I have also seen: “with the second revision of Claude 3.5 (October 2024), long-horizon agentic coding workflows began to compound correctness rather than error.”1 Below a capability threshold, longer agent runs produce more mistakes. Above it, longer runs produce better code. The dark factory pattern only became viable after models crossed that threshold.

Five Layers of Governance

BCG Platinion’s five-pillar transformation framework includes a governance layer with multiple verification steps before code reaches production.2 The pillars: an intent-driven operating system, codified knowledge infrastructure, workforce upskilling, a governance layer with independent verification agents, and factory architecture for orchestration.2 Within the governance pillar, BCG Platinion describes scenario-based tests run by independent agents, static analysis, architecture conformance checking, behavioral regression testing, and red-team agents that actively try to break the output.2

The independence matters. When the same agent writes and tests its own code, Kahana’s circularity problem applies.4 When a separate agent—with different system prompts, different context, different incentives—evaluates the work, the failure modes decorrelate. Not eliminate. Decorrelate. Two agents can still share systematic biases inherited from training data. But the probability of identical blind spots drops when the evaluation agent operates from a different frame.
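Structurally, decorrelated verification is a conjunction over independent frames. A minimal sketch, with plain functions standing in for separately prompted evaluation agents (the checks and the artifact fields are invented; the unanimity structure is the point):

```python
# Sketch of decorrelated verification: the artifact ships only if independent
# evaluators, each operating from a different frame, all approve.

def verdict(artifact, evaluators):
    """Require unanimous approval from independent evaluators."""
    return all(evaluate(artifact) for evaluate in evaluators)

# Hypothetical independent frames: a spec check, a red-team probe, a conventions check.
spec_check  = lambda a: a["tests_passed"] and a["scenario_score"] >= 0.9
red_team    = lambda a: not a["gamed_tests_found"]
conventions = lambda a: a["pattern_violations"] == 0

artifact = {
    "tests_passed": True,
    "scenario_score": 0.95,
    "gamed_tests_found": True,   # the red-team frame found a return-true stub
    "pattern_violations": 0,
}
assert not verdict(artifact, [spec_check, red_team, conventions])
```

One frame catching what the others miss is exactly the decorrelation argument: the green tests and clean conventions do not save an artifact the red-team probe flags.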

BCG Platinion identifies “intent thinking” as a critical competency for dark factory teams: translating business needs into precise, testable descriptions of desired outcomes.2 The human role shifts from writing code to writing specifications that agents can execute against. Poor specs produce passing tests on wrong behavior—the same return true dynamic StrongDM encountered.1

BCG Platinion also identifies a constraint I have experienced directly: “AI agents are only as effective as the codified knowledge they can access.”2 An agent operating without project context generates plausible code that violates local conventions, ignores architectural decisions, and rediscovers problems the team already solved. Codified knowledge—design decisions, API contracts, style guides, failure histories—is infrastructure, not documentation.

What I Already Run at Level 4

My overnight execution loop, the Ralph Loop, operates at Shapiro’s Level 4. I write specs, launch agents, sleep, and review results in the morning. The agents run against 95+ hooks that intercept every tool call—file writes, git commands, shell execution—before and after execution. The hooks enforce constraints the agent cannot negotiate with or override.

The hooks address Kahana’s gaming problem at the tool level. An agent that tries to force-push to main gets blocked before the command executes, not after a test catches the damage. An agent that tries to commit files matching .env patterns gets intercepted. An agent that reports “all tests pass” without running pytest gets flagged by the evidence gate, which demands pasted test output showing zero failures, not a claim that tests would pass.
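The blocking logic condenses to a pattern check that runs before the command executes. A sketch under stated assumptions: this is not Claude Code’s actual hook API, and the two rules shown are just the examples above, not the full 95-hook set.

```python
# Sketch of tool-level interception: a pre-execution gate over shell commands.
# The function signature and rule list are illustrative, not a real hook API.

import re

BLOCKED = [
    (r"git\s+push\s+.*--force.*\b(main|master)\b", "force-push to protected branch"),
    (r"git\s+add\s+.*\.env\b", "staging a .env secrets file"),
]

def pre_tool_hook(command):
    """Return (allowed, reason). Runs before the command executes."""
    for pattern, reason in BLOCKED:
        if re.search(pattern, command):
            return False, f"blocked: {reason}"
    return True, "ok"

allowed, reason = pre_tool_hook("git push --force origin main")
assert not allowed
allowed, _ = pre_tool_hook("git push origin feature/retry-backoff")
assert allowed
```

The essential property is placement: the gate sits between the agent’s decision and the side effect, so no amount of clever reasoning on the agent’s side can negotiate past it.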

The evidence gate enforces six criteria on every non-trivial change: follows codebase patterns (name the pattern and the file), simplest working solution (state rejected alternatives), edge cases handled (list each one), tests pass (paste the output), no regressions (name the files checked), and solves the actual problem (state the user’s need and how the change addresses it). “I believe” and “it should” are not evidence. The gate rejects hedging language and demands artifacts.
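The gate’s logic can be sketched as a checklist plus a hedging filter. The report format below is invented for illustration; the six criteria are the ones just listed.

```python
# Sketch of the evidence gate: every criterion needs a backing artifact,
# and hedging language is rejected outright. Report format is illustrative.

REQUIRED = [
    "pattern_named",          # follows codebase patterns: name pattern and file
    "alternatives_rejected",  # simplest solution: state rejected alternatives
    "edge_cases",             # list each edge case handled
    "test_output",            # paste actual test output, zero failures
    "files_checked",          # no regressions: name the files checked
    "user_need",              # state the need and how the change addresses it
]
HEDGES = ("i believe", "it should", "probably", "tests would pass")

def evidence_gate(report):
    missing = [k for k in REQUIRED if not report.get(k)]
    text = " ".join(str(v) for v in report.values()).lower()
    hedged = [h for h in HEDGES if h in text]
    return not missing and not hedged

bad = {"test_output": "it should pass", "user_need": "faster retries"}
assert not evidence_gate(bad)  # missing artifacts and hedging language
```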

The quality loop—implement, review, evaluate, refine, zoom out, repeat, report—runs as a behavioral constraint encoded in the agent’s system prompt via CLAUDE.md. The loop does not guarantee the agent follows every step. The hooks verify that it did.

BCG Platinion’s five pillars map to infrastructure I already maintain:

  • Intent-Driven OS: CLAUDE.md files and PRD-driven development specs encode project intent as executable context.
  • Codified Knowledge: 139+ skills, organized as reusable capabilities, give agents access to project conventions, architectural decisions, and domain knowledge.
  • Governance: Hooks implement the interception layer. The evidence gate implements the audit layer. The quality loop implements the behavioral constraint layer.

Two pillars I have not built: workforce upskilling (irrelevant for a solo practitioner) and factory architecture as a dedicated orchestration platform (my current setup uses Claude Code’s native agent spawning, not a purpose-built factory).

The Gap Between Level 4 and Level 5

Moving from Level 4 to Level 5 means eliminating the morning review. Right now, I wake up and read what the agents produced overnight. I check git diffs. I run the application. I verify that the output matches my intent. That review takes 30 minutes to an hour, and it catches problems the hooks miss.

The problems the hooks miss are the interesting ones. They fall into categories that current automation handles poorly:

Intent drift. The agent completed the spec faithfully but the spec was ambiguous, and the agent chose the wrong interpretation. No test catches an incorrect interpretation that produces valid behavior. StrongDM’s scenarios approach this by encoding user stories as the specification, not technical requirements.1 The scenarios describe what a user experiences, not what the code does.

Architectural erosion. The agent added a feature that works in isolation but degrades the system’s structural coherence. A new database query that bypasses the existing repository pattern. A new endpoint that duplicates logic from another module. Static analysis catches some of these. Architecture conformance checking—BCG Platinion’s governance layer—catches more.2 Neither catches the subtle ones where the new code is technically consistent with the patterns but introduces a conceptual split that compounds over future changes.

Institutional knowledge loss. Kahana raises an underappreciated risk: when nobody reads the code, nobody builds intuition about the system.4 As Kahana warns, “Nobody will know why. Nobody will know how to fix it.”4 Today, my morning review builds that intuition incrementally. At Level 5, the system becomes opaque to its operator. Every complex system eventually needs intervention that automation cannot handle—a security incident, a business logic change that violates assumptions baked into the test suite, an integration with an external system that behaves differently than its documentation claims. The operator who never read the code cannot intervene effectively.

What the Verification Layer Actually Needs

Synthesizing StrongDM’s practice, BCG Platinion’s governance framework, Kahana’s failure analysis, and my own infrastructure, the verification layer for a dark factory requires at minimum:

Holdout-style evaluation. Tests that the generating agent cannot access during code production. StrongDM’s scenarios. Behavioral specifications stored separately from the codebase, evaluated by independent agents. Without holdout evaluation, Goodhart’s Law turns every test suite into a target.

Digital twins for integration testing. Agents cannot test against production systems. Mocks are too shallow—they verify API contracts, not behavior. Twins that replicate the behavioral surface of external dependencies enable end-to-end scenario execution without production risk.

Multi-agent verification with decorrelated failure modes. The writing agent and the evaluating agent must operate from different contexts. Red-team agents that actively probe for gaming, shortcuts, and phantom verification add a layer that passive testing cannot provide.

Tool-level interception. Hooks that block harmful operations before execution, not tests that detect damage after the fact. The hook layer operates below the agent’s decision-making and cannot be circumvented by clever prompting or return true shortcuts.

Executable intent specifications. Specs precise enough that ambiguity is detectable. BCG Platinion’s “intent thinking” competency.2 Shapiro’s Level 4 spec that you write before leaving for 12 hours.3 The spec is the product. The code is a side effect.

Audit trail with no accountability gap. Kahana cites AI Life Cycle Core Principles requiring output “traceable to an appropriate responsible party.”4 No industry-standard audit methodology exists for agent-built software yet.4 The verification layer needs to produce artifacts that a human (or regulator, or incident responder) can follow from deployed behavior back to the specification that generated it.
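What “traceable to an appropriate responsible party” might look like as a data structure, sketched with invented field names: each deployed artifact carries enough linkage to walk from production behavior back to the spec that generated it, with an accountable human attached.

```python
# Sketch of a traceable audit record. The fields are illustrative; the point
# is that the chain spec -> commit -> approval -> owner has no gaps.

import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class AuditRecord:
    spec_hash: str          # hash of the intent specification
    commit: str             # commit the agents produced
    scenario_run_id: str    # holdout evaluation that approved it
    responsible_party: str  # the accountable human, per the cited principles

def record_for(spec_text, commit, run_id, owner):
    return AuditRecord(
        spec_hash=hashlib.sha256(spec_text.encode()).hexdigest()[:12],
        commit=commit,
        scenario_run_id=run_id,
        responsible_party=owner,
    )

rec = record_for("retry with exponential backoff", "abc1234", "run-0042", "on-call SRE")
assert rec.responsible_party == "on-call SRE"
```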

The Honest Assessment

I run Level 4 with high confidence. My overnight agents produce work that passes morning review more often than not. The hooks catch the mechanical failures. The evidence gate catches the epistemic failures. The quality loop reduces the behavioral failures.

Level 5 requires solving problems I have not solved. Intent drift detection without human pattern-matching. Architecture conformance that catches conceptual erosion, not just structural violations. Institutional knowledge that accumulates in the system rather than in the operator’s head.

BCG Platinion reports 3-5x productivity gains from teams adopting dark factory patterns.2 StrongDM released agent-built software with three engineers and a token budget.1 The productivity case is clear. The verification case is not.

The teams succeeding at Level 5 share a common trait: they invested more in verification infrastructure than in code generation. StrongDM built an entire Digital Twin Universe before trusting agents to ship code.1 BCG Platinion’s framework has five transformation pillars including a governance layer with multiple verification steps before code reaches production.2 The dark factory is not a factory that runs in the dark. It is a factory where the lights are the verification layer, and everything else—including the code—is a commodity.

I wrote previously about what breaks when agents run unsupervised and about the evidence gate as a defense against phantom verification. Those pieces describe the infrastructure for Level 4. The dark factory demands that same infrastructure, extended to operate without the human who currently reads the morning diff. The hooks, the evidence gates, the quality loops—they are necessary at Level 5, but not sufficient. The missing piece is verification that scales with the same autonomy as the generation.

Building that piece is the work ahead.



  1. Simon Willison, “Software Factory,” simonwillison.net (February 7, 2026), covering StrongDM’s fully autonomous development methodology by Justin McCarthy, Jay Taylor, and Navan Chauhan. 

  2. BCG Platinion, “The Dark Software Factory,” bcgplatinion.com. 

  3. Dan Shapiro, “Five Levels of AI Coding,” danshapiro.com (January 2026). 

  4. Eran Kahana, “Built by Agents, Tested by Agents, Trusted by Whom?” Stanford Law (February 8, 2026). 
