The Blind Judge: Scoring Claude Code vs Codex in 36 Duels
Thomas Ricouard (@Dimillian) put it better than any benchmark: “Claude Code is like a very, very mid-refactor that I know he can execute. Codex is a state-of-the-art architecture. I’m not sure yet if it can actually do it without breaking stuff.”1
I stopped wondering and started measuring. I built a system that races Claude Code against Codex CLI on the same task, labels the outputs randomly as Alpha and Beta, and scores both plans blind on five dimensions before revealing which agent wrote which. Thirty-six duels later, the scoreboard says Claude won 8 of 12 decided duels. But the scoreboard is not the point. The point is what the blind judge produces after scoring: a synthesis that combines the strongest elements from both plans into something better than either contestant delivered alone.
TL;DR
Thirty-six duels. Blind evaluation on five dimensions (Correctness, Completeness, Simplicity, Decomposition, Actionability). Claude Code won 8, Codex CLI won 3, one undecided, across 13 duels with structured judgment manifests (12 with a declared winner). The real output is not a winner. It is the synthesis step that cherry-picks the best elements from both plans and produces an implementation brief stronger than either agent alone. The companion post Thinking With Ten Brains covers collaborative deliberation.12 The blind judge covers competitive evaluation. The methodology matters more than the scoreboard.
Why Comparison Is Hard
Everyone is comparing AI coding agents right now. Nobody agrees on the results.
The problem is structural. Model comparisons degrade along three axes: vibes (you tried one task on each and went with your gut), recency bias (the last success overwrites all previous failures), and task-specific strengths (the model that wins on your refactoring task loses on your security review). These are not bad observations. They are bad experiments.
Alex Finn (@AlexFinn) runs a dual validation workflow where two models check each other’s output.2 The dual-check approach catches errors that either model alone would miss. The insight is sound: independent evaluation surfaces disagreements, and disagreements are where the bugs hide.
@doodlestein runs 10+ agents simultaneously — Claude, Codex, and Gemini — using canned prompts he calls “idea wizards” to attack the same problem from different angles.3 A planner that excels at decomposition might miss a correctness bug that a more detail-oriented agent catches immediately.
Both workflows improve on single-model evaluation. Neither eliminates the biggest threat: confirmation bias. If you know which model wrote which plan, you will score the model you trust more generously. Every time. Not because you are careless, but because bias operates below awareness. The missing piece is blinding. If the evaluator does not know which agent produced which output, confirmation bias has nothing to attach to.
The Blind Evaluation Protocol
The /duel system works in five phases:
- Parallel execution. Both agents receive the same task prompt and project context. Claude Code runs in one process, Codex CLI in another. Neither sees the other’s output.
- Random labeling. A coin flip assigns one agent to “Alpha” and the other to “Beta.” The system writes the mapping to `agent-mapping.json` and seals it. Neither the judge nor I see the mapping until after scoring.
- Blind scoring. A judge agent reads both plans and scores each on five dimensions, 0-4 per dimension, maximum 20 points total. The judge sees only “Alpha plan” and “Beta plan.”
- Winner recommendation. The judge declares a winner (or undecided) with a confidence level and written reasoning.
- Synthesis. The judge combines the strongest elements from both plans into a refined implementation brief. The synthesis is the actual output.
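The labeling phase is the load-bearing part of the protocol, and it is small enough to sketch. This is an illustrative reconstruction, not the actual /duel implementation; the `agent-mapping.json` filename comes from the protocol above, everything else is assumed.

```python
import json
import random


def assign_blind_labels(agents=("claude", "codex"), seal_path="agent-mapping.json"):
    """Coin-flip assignment of contestants to the 'Alpha'/'Beta' labels.

    The mapping is written out ("sealed") before any scoring happens, so
    neither the judge nor the operator can consult it until judgment is done.
    """
    shuffled = list(agents)
    random.shuffle(shuffled)
    mapping = {"Alpha": shuffled[0], "Beta": shuffled[1]}
    with open(seal_path, "w") as f:
        json.dump(mapping, f)
    return mapping
```

The judge process would be handed only the two plan files; the sealed mapping is read back only after the judgment manifest is written.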
The five scoring dimensions:
| Dimension | What It Measures | 0 | 4 |
|---|---|---|---|
| Correctness | Are the technical claims and fixes actually right? | Fundamental errors | Every claim verified against code |
| Completeness | Does the plan cover all requirements and edge cases? | Major gaps | Comprehensive with edge cases handled |
| Simplicity | Is this the minimum correct solution? | Over-engineered | Right-sized, no unnecessary scope |
| Decomposition | Are steps well-ordered with clear dependencies? | Monolithic or tangled | Clean phases, parallelism identified |
| Actionability | Can a developer start executing immediately? | Vague direction | Specific files, lines, commands |
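A judgment manifest built from that rubric can be checked mechanically. The five dimension names and the 0-4 scale come from the table above; the dict layout is a guess, since the post does not show the `judgment.json` schema.

```python
DIMENSIONS = ("correctness", "completeness", "simplicity",
              "decomposition", "actionability")


def total_score(scores: dict) -> int:
    """Validate one plan's five 0-4 dimension scores and return the 0-20 total."""
    if set(scores) != set(DIMENSIONS):
        raise ValueError(f"expected dimensions {DIMENSIONS}, got {tuple(scores)}")
    for dim, value in scores.items():
        if not 0 <= value <= 4:
            raise ValueError(f"{dim} score {value} outside the 0-4 rubric")
    return sum(scores.values())
```

Feeding in the security duel's scores (3, 3, 2, 3, 2 for one plan; 4, 3, 4, 3, 4 for the other) reproduces 13/20 and 18/20 totals.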
The key design decision: synthesis is not a 50/50 blend. It heavily weights the winner’s core strategy while cherry-picking genuine insights from the loser. Early attempts at equal-weight synthesis produced incoherent plans that inherited the worst properties of both. Weighted synthesis produces plans that are structurally sound (from the winner) and thoroughly hardened (from the loser’s valid edge cases).
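The weighting rule reads as a selection pass rather than a merge. The plan shape below (a strategy plus stories and flagged insights) is hypothetical; only the winner-weighted policy itself comes from the protocol described above.

```python
def synthesize(winner: dict, loser: dict) -> dict:
    """Winner-weighted synthesis: adopt the winner's framework wholesale,
    then append only those of the loser's insights the judge validated.

    No 50/50 blending: the loser's structure is discarded entirely.
    """
    brief = {
        "strategy": winner["strategy"],       # core structure from the winner
        "stories": list(winner["stories"]),
    }
    for insight in loser["insights"]:
        if insight["validated"]:              # genuine insight, not diplomacy
            brief["stories"].append(insight["story"])
    return brief
```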
Case Study: The Security Remediation Duel
In February 2026, a three-agent security audit found 7 CRITICAL and 7 HIGH findings in ResumeGeni, a FastAPI application with Supabase auth and Stripe payments.4 I had already shipped two trivial fixes. Nine remained. I ran a duel to generate the remediation plan.
Both agents received the same briefing: the finding list with file paths and line numbers, the architecture context, the constraint that a proven fix pattern already existed in one file, and the requirement to produce a phased deployment plan.
Alpha’s plan: 11 stories for 9 findings, organized into three deployment waves. A test baseline story (SEC-01) blocked all subsequent work. Deployment gates with specific metrics: auth success rate, 5xx monitoring, webhook reject counts. Thorough rejected-alternatives discussion. Stories used a What/Why/Success structure with line ranges.
Beta’s plan: Direct 1:1 mapping of findings to stories. Three deployment waves: Critical as a single unit, High-priority as independently deployable, Cleanup. Investigation-before-fix for the middleware finding. Specific line numbers, function names, import paths, and curl commands for verification per story.
The correctness gap told the story. Beta caught two things Alpha missed entirely.
First: the middleware finding (C3) flagged `get_user(jwt=...)` as a session contamination vector. Beta correctly identified that `get_user()` is a stateless verification call. gotrue-py only calls `_save_session()` in `verify_otp` and `exchange_code_for_session`, not in `get_user`. Alpha treated it as definitively needing the same fix as the other two routers, which would add unnecessary per-request overhead on every authenticated request. Beta said: investigate first, fix only if confirmed.
Second: the magic link and passkeys routers use both `admin.generate_link()` (which requires the SERVICE_KEY singleton) and `verify_otp()` (which needs a fresh per-request client). Alpha’s plan applied the fresh-client pattern uniformly. An implementer following that plan would break admin operations. Beta explicitly called out the split: fresh client for `verify_otp`, shared singleton for `admin.generate_link()`.
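The split Beta flagged maps to a simple pattern: one module-level client for admin operations, one factory for per-request verification. This sketch stubs the Supabase client so the pattern runs without credentials; real code would build clients with supabase-py's `create_client()`, and the helper names are borrowed from the synthesis story, not the actual codebase.

```python
class StubClient:
    """Stand-in for a Supabase client; real code would call create_client()."""
    def __init__(self, key: str):
        self.key = key


SERVICE_KEY = "service-role-key-placeholder"

# Shared singleton: admin.generate_link() requires the SERVICE_KEY client,
# and admin calls carry no per-user session state, so one instance is safe.
_admin_client = StubClient(SERVICE_KEY)


def _create_auth_client() -> StubClient:
    # Fresh anonymous client per request: verify_otp() can persist session
    # state, so sharing one client risks cross-user session contamination.
    return StubClient("anon-key-placeholder")


def verify_magic_link(token_hash: str) -> StubClient:
    client = _create_auth_client()  # fresh client ONLY for verify_otp
    # client.auth.verify_otp({"token_hash": token_hash, "type": "magiclink"})
    return client


def generate_magic_link(email: str) -> StubClient:
    # Admin operation stays on the shared SERVICE_KEY singleton.
    # _admin_client.auth.admin.generate_link({"type": "magiclink", "email": email})
    return _admin_client
```

Applying the fresh-client factory to both call sites, as Alpha's plan did, would hand `generate_link()` an anon-key client and break the admin path.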
The scores:
| Dimension | Alpha | Beta |
|---|---|---|
| Correctness | 3 | 4 |
| Completeness | 3 | 3 |
| Simplicity | 2 | 4 |
| Decomposition | 3 | 3 |
| Actionability | 2 | 4 |
| Total | 13/20 | 18/20 |
Alpha was Codex. Beta was Claude. High confidence.4
The synthesis combined Beta’s technical precision with Alpha’s operational rigor. Here is one story from the synthesis output, showing how it merged both plans:
Story 1.1 (C1 — Magic Link Shared Singleton): In `magic_link.py`, add `_create_auth_client()`. Use fresh anon client ONLY for `verify_otp` (line 224). Keep shared singleton for `admin.generate_link()` (line 213), which needs SERVICE_KEY.
That story inherits Beta’s precise line numbers and the critical admin/anon client split, wrapped in a structure that slots into Alpha’s three-wave deployment sequence. The full synthesis kept Beta’s investigation-first approach for C3, Beta’s specific curl verification commands, Alpha’s deployment gates (auth success rate monitoring, 5xx tracking), and Alpha’s regression testing strategy (E2E Playwright auth suite after Wave 1, Stripe test webhook after Wave 2). The result: a 3-wave plan with 12 stories, executable within a day, with operational guardrails that neither plan alone provided.
The Scoreboard (and Why It Misleads)
Across 36 duels, 13 produced structured judgment manifests. One manifest declared undecided, leaving 12 with a clear winner:
| Task Type | Winner | Confidence |
|---|---|---|
| Job ingestion system design | Claude | Medium |
| Job ingestion code review | Codex | High |
| Job page UX design | Claude | High |
| ATS integration review | Claude | High |
| Job corpus expansion planning | Claude | High |
| Deliberation architecture | Codex | Low |
| NIST RFI public comment | Claude | High |
| NIST RFI revision | Claude | High |
| Codebase deep review | Claude | High |
| Security remediation planning | Claude | High |
| Calibration task | Codex | Low |
| Codebase analysis | Undecided | - |
Claude: 8. Codex: 3. Undecided: 1.11
Do not treat the scoreboard as a model benchmark. It is not one.
Claude’s wins concentrate in review, verification, and security tasks: 7 of 8 wins are high-confidence on tasks involving code review, security analysis, or technical evaluation. Codex’s one high-confidence win came on a code review task where its procedural thoroughness and explicit dependency chains outperformed Claude’s less structured approach. The other two wins were low-confidence. The pattern: Claude produces more actionable, technically precise plans. Codex produces stronger operational process and broader theoretical coverage.
Ricouard was right. Planning quality versus execution reliability is a real axis.1 But the scoreboard reflects my task mix (heavy on security and architecture review), not some objective model ranking. Someone running duels on greenfield feature development or infrastructure automation would likely see different results. Nathan Lambert’s analysis of the post-benchmark era makes the same point: traditional benchmarks no longer convey meaningful signal when the fine margins between Opus 4.6 and Codex 5.3 depend on task shape and evaluation methodology.10
The scoreboard tells you about my workflow. It does not tell you which model is “better.”
The Synthesis Is the Product
The winning plan is not the point. The synthesis is.
Every duel produces three artifacts: Plan Alpha, Plan Beta, and the Synthesis. The synthesis follows a consistent structure: adopt the winner’s core strategy, incorporate the loser’s valid edge cases, remove unnecessary scope from both. It is not diplomatic. It does not split the difference. It makes explicit choices about which elements to keep and which to discard, with written justification for each.
In the job corpus expansion duel, Claude’s plan activated existing infrastructure first (seed scripts for 8,780 known boards the system was not yet polling), then expanded to new ATS platforms, then built discovery systems.6 Codex’s plan started with a codebase audit and instrumentation spec before ingesting a single job. Claude won on simplicity and actionability. But Codex identified something Claude missed: the need for a board lifecycle state machine (active/failing/quarantined). Codex also flagged a deduplication regression audit to prevent volume expansion from masking duplicate explosion. The synthesis kept Claude’s activate-first sequencing and incorporated Codex’s observability and lifecycle tracking as Phase 1.5, after initial seeding delivered measurable results.
The same pattern appeared in the job ingestion system duel, where Claude’s plan reused existing APScheduler and registry tables while Codex proposed a more thorough two-table provenance schema. The synthesis adopted Claude’s pragmatic architecture and cherry-picked Codex’s dedup hash improvement.7
In the ATS review duel, Claude found P0 runtime crashes (method signature mismatches in the scheduler that would silently break job tracking) that Codex missed entirely.5 Codex found scheduler overlap prevention and admin endpoint abuse vectors that Claude did not flag. The synthesis started with Claude’s P0 fixes and supplemented with Codex’s operational hardening.
The pattern across 36 duels: the judge model consistently produces syntheses that are stronger than either contestant’s plan. The judge is not smarter. The adversarial structure forces full coverage.9 Each agent independently identifies risks and edge cases. The judge sees all of them. The synthesis inherits the union of both agents’ insights minus their individual blind spots.
The pattern connects to a broader finding from multi-agent deliberation: independence is the mechanism. Ten deliberation agents evaluating a decision from different perspectives catch biases that any single agent misses. Two dueling agents attacking the same task from different architectures catch implementation gaps that either agent alone would ship. The synthesis step is the same in both systems: combine independent evaluations into a single artifact that benefits from all perspectives.
I document the orchestration layer that supports both systems separately. What matters here is that dueling and deliberation serve complementary functions. Deliberation answers “should we build this?” Dueling answers “what is the strongest way to build it?”
When to Duel vs. When to Deliberate
Both systems use independent evaluation and synthesis. They serve different decision types.
| Decision Type | Tool | Why |
|---|---|---|
| Architecture decisions | Deliberation | 10 specialist perspectives catch risks across domains |
| “Should we build this?” | Deliberation | Cost Analyst, Maintenance Pessimist, UX Advocate |
| Implementation plans | Dueling | Competitive pressure produces more actionable plans |
| “How should we build this?” | Dueling | Two agents find different bugs and edge cases |
| Technical review | Dueling | Different review styles catch different defect categories |
| Risk assessment | Deliberation | Named thinking frameworks (inversion, pre-mortem) |
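The table reduces to a small routing rule. The decision-type labels and destinations come straight from the table; the function itself is an illustrative helper, not part of the actual tooling.

```python
# Decision types mapped to the system that handles them, per the table above.
ROUTES = {
    "architecture decision": "deliberate",
    "should we build this": "deliberate",
    "risk assessment": "deliberate",
    "implementation plan": "duel",
    "how should we build this": "duel",
    "technical review": "duel",
}


def route(decision_type: str) -> str:
    """Pick a system for a decision type; unknown types default to
    deliberation, the more conservative path."""
    return ROUTES.get(decision_type.lower(), "deliberate")
```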
My pattern: deliberate the design, duel the implementation plan, execute the synthesis.
A security remediation decision goes through deliberation first (“Is this the right prioritization? Are we missing systemic issues?”), then dueling (“What is the strongest phased plan to execute these fixes?”), then execution from the judge’s synthesis. The deliberation system and the dueling system share infrastructure but serve distinct purposes in the decision pipeline.
What I Got Wrong
Early duels had no blind labeling. I read both plans knowing which model wrote which. Confirmation bias was real and measurable: I consistently scored Claude higher on Actionability before blinding, then saw the gap narrow (though not disappear) after introducing random Alpha/Beta assignment. The blinding protocol is not optional.
I started with three scoring dimensions (Correctness, Completeness, Actionability). Two duels in, I realized I was conflating plan structure with plan content. Adding Simplicity (is this over-engineered?) and Decomposition (are the steps well-ordered?) separated those concerns and produced more useful scoring.
First synthesis attempts blended both plans equally. The results were incoherent: Alpha’s testing strategy grafted onto Beta’s deployment sequence, with neither plan’s assumptions holding. Weighted synthesis, where the judge explicitly adopts the winner’s framework and selectively incorporates the loser’s insights, was the breakthrough.
N=36 on my task mix is not a model benchmark. It is a workflow tool evaluation. The dueling system tells me which agent produced the stronger plan for my specific task in my specific codebase. Extrapolating to “Claude is better than Codex” would be the same vibes-based reasoning the system exists to eliminate.
I use Claude to judge duels between Claude and Codex. I acknowledge the conflict.8 The mitigation is structural: blind labeling, structured dimensions, and the fact that Codex won 3 duels and came close on several others. A stronger test would run the same duels through a non-Claude judge (Gemini or GPT) and compare scoring distributions. I have not done that yet. Until I do, the 8-3 split should carry an asterisk: the judge and one contestant share a model family.
References
1. Thomas Ricouard (@Dimillian), post on X, February 2026. Direct quote comparing Claude Code and Codex CLI: planning quality versus execution reliability as distinct evaluation axes. ↩
2. Alex Finn (@AlexFinn), post on X, February 2026. Dual validation workflow running Codex and Claude Code side by side, each plan validated against the other. ↩
3. @doodlestein, post on X, February 2026. Runs 10+ agents (Claude, Codex, Gemini) simultaneously using canned “idea wizard” prompts to attack the same problem from different architectures. ↩
4. Author’s duel session, `20260224-215831-security-remediation-plan-for-resumegeni`. Blind Alpha/Beta assignment, 5-dimension scoring, high-confidence judgment. Full session record includes `judgment.json`, `plan-claude.md`, `plan-codex.md`, and `agent-mapping.json`. ↩
5. Author’s duel session, `20260221-155640-review-the-resumegeni-ats-integration-im`. Claude (Alpha) identified P0 runtime crashes with specific line numbers. Codex (Beta) produced 13 procedural steps without pinpointing the actual bugs. Claude scored 18/20, Codex 13/20. High confidence. ↩
6. Author’s duel session, `20260224-103926-you-are-investigating-how-to-massively-e`. Job corpus expansion from 160K to 500K. Claude scored 20/20, Codex 13/20. Claude activated existing infrastructure first; Codex started with codebase audit. ↩
7. Author’s duel session, `20260221-120648-resumegeni-phase-1-build-modular-job-ing`. Job ingestion system design. Medium confidence, Claude (Beta) won on simplicity (4 vs 2) and actionability (4 vs 3). Codex (Alpha) had stronger theoretical completeness. ↩
8. Perez et al., “Red Teaming Language Models with Language Models,” arXiv:2202.03286, 2022. Demonstrates that LLMs can evaluate other LLMs through adversarial testing. The same-family evaluation bias concern is the author’s own observation, informed by the general finding that model-generated evaluations carry systematic biases. ↩
9. Van Valen, Leigh, “A New Evolutionary Law,” Evolutionary Theory, 1973. The Red Queen hypothesis: organisms must constantly adapt to maintain relative fitness against co-evolving competitors. Applied here by analogy: the adversarial structure of dueling creates similar pressure for plan quality. ↩
10. Nathan Lambert, “Opus 4.6, Codex 5.3, and the post-benchmark era,” Interconnects, February 2026. Argues that traditional benchmarks no longer convey meaningful signal when model differences depend on task shape and evaluation methodology. ↩
11. Author’s scoreboard across 36 total duels, 13 with structured judgment manifests, 12 with declared winners. Claude: 8 wins (7 high-confidence). Codex: 3 wins (1 high-confidence). Undecided: 1. Remaining 23 duels were calibration runs or lacked the structured judgment pipeline. ↩
12. Author’s companion post on collaborative multi-agent evaluation. See “Thinking With Ten Brains: How I Use Agent Deliberation as a Decision Tool.” ↩