The Blind Judge: Scoring Claude Code vs Codex in 36 Duels
Thomas Ricouard (@Dimillian) put it better than any benchmark: “Claude Code is like a very, very mid-refactor that I know he can execute. Codex is a state-of-the-art architecture. I’m not sure yet if it can actually do it without breaking stuff.”1
I stopped wondering and started measuring. I built a system that races Claude Code against Codex CLI on the same task, labels the outputs randomly as Alpha and Beta, and scores both plans blind on five dimensions before revealing which agent wrote which. Thirty-six duels later, the scoreboard says Claude won 8 of 12 decided duels. But the scoreboard is not the point. The point is what the blind judge produces after scoring: a synthesis that combines the strongest elements from both plans into something better than either contestant delivered alone.
TL;DR
Thirty-six duels. Blind evaluation on five dimensions (Correctness, Completeness, Simplicity, Decomposition, Actionability). Claude Code won 8, Codex CLI won 3, one undecided, across 13 duels with structured judgment manifests (12 with a declared winner). The real output is not a winner. It is the synthesis step that cherry-picks the best elements from both plans and produces an implementation brief stronger than either agent alone. The companion post Thinking With Ten Brains covers collaborative deliberation.12 The blind judge covers competitive evaluation. The methodology matters more than the scoreboard.
Why Comparison Is Hard
Everyone is comparing AI coding agents right now. Nobody agrees on the results.
The problem is structural. Model comparisons degrade along three axes: vibes (you tried one task on each and went with your gut), recency bias (the last success overwrites all previous failures), and task-specific strengths (the model that wins on your refactoring task loses on your security review). These are not bad observations. They are bad experiments.
Alex Finn (@AlexFinn) runs a dual validation workflow where two models check each other’s output.2 The dual-check approach catches errors that either model alone would miss. The insight is sound: independent evaluation surfaces disagreements, and disagreements are where the bugs hide.
@doodlestein runs 10+ agents simultaneously — Claude, Codex, and Gemini — using canned prompts he calls “idea wizards” to attack the same problem from different angles.3 A planner that excels at decomposition might miss a correctness bug that a more detail-oriented agent catches immediately.
Both workflows improve on single-model evaluation. Neither eliminates the biggest threat: confirmation bias. If you know which model wrote which plan, you will score the model you trust more generously. Every time. Not because you are careless, but because bias operates below awareness. The missing piece is blinding. If the evaluator does not know which agent produced which output, confirmation bias has nothing to attach to.
The Blind Evaluation Protocol
The /duel system works in five phases:
- Parallel execution. Both agents receive the same task prompt and project context. Claude Code runs in one process, Codex CLI in another. Neither sees the other’s output.
- Random labeling. A coin flip assigns one agent to “Alpha” and the other to “Beta.” The system writes the mapping to `agent-mapping.json` and seals it. Neither the judge nor I see the mapping until after scoring.
- Blind scoring. A judge agent reads both plans and scores each on five dimensions, 0-4 per dimension, maximum 20 points total. The judge sees only “Alpha plan” and “Beta plan.”
- Winner recommendation. The judge declares a winner (or undecided) with a confidence level and written reasoning.
- Synthesis. The judge combines the strongest elements from both plans into a refined implementation brief. The synthesis is the actual output.
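The labeling phase is the load-bearing part of the protocol, and it is small enough to sketch. This is an illustrative reconstruction, not the actual /duel implementation; the `agent-mapping.json` filename comes from the protocol above, everything else is assumed.

```python
import json
import random


def assign_blind_labels(agents=("claude", "codex"), seal_path="agent-mapping.json"):
    """Coin-flip assignment of contestants to the 'Alpha'/'Beta' labels.

    The mapping is written out ("sealed") before any scoring happens, so
    neither the judge nor the operator can consult it until judgment is done.
    """
    shuffled = list(agents)
    random.shuffle(shuffled)
    mapping = {"Alpha": shuffled[0], "Beta": shuffled[1]}
    with open(seal_path, "w") as f:
        json.dump(mapping, f)
    return mapping
```

The judge process would be handed only the two plan files; the sealed mapping is read back only after the judgment manifest is written.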
The five scoring dimensions:
| Dimension | What It Measures | 0 | 4 |
|---|---|---|---|
| Correctness | Are the technical claims and fixes actually right? | Fundamental errors | Every claim verified against code |
| Completeness | Does the plan cover all requirements and edge cases? | Major gaps | Comprehensive with edge cases handled |
| Simplicity | Is this the minimum correct solution? | Over-engineered | Right-sized, no unnecessary scope |
| Decomposition | Are steps well-ordered with clear dependencies? | Monolithic or tangled | Clean phases, parallelism identified |
| Actionability | Can a developer start executing immediately? | Vague direction | Specific files, lines, commands |
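A judgment manifest built from that rubric can be checked mechanically. The five dimension names and the 0-4 scale come from the table above; the dict layout is a guess, since the post does not show the `judgment.json` schema.

```python
DIMENSIONS = ("correctness", "completeness", "simplicity",
              "decomposition", "actionability")


def total_score(scores: dict) -> int:
    """Validate one plan's five 0-4 dimension scores and return the 0-20 total."""
    if set(scores) != set(DIMENSIONS):
        raise ValueError(f"expected dimensions {DIMENSIONS}, got {tuple(scores)}")
    for dim, value in scores.items():
        if not 0 <= value <= 4:
            raise ValueError(f"{dim} score {value} outside the 0-4 rubric")
    return sum(scores.values())
```

Feeding in the security duel's scores (3, 3, 2, 3, 2 for one plan; 4, 3, 4, 3, 4 for the other) reproduces 13/20 and 18/20 totals.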
The key design decision: synthesis is not a 50/50 blend. It heavily weights the winner’s core strategy while cherry-picking genuine insights from the loser. Early attempts at equal-weight synthesis produced incoherent plans that inherited the worst properties of both. Weighted synthesis produces plans that are structurally sound (from the winner) and thoroughly hardened (from the loser’s valid edge cases).
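The weighting rule reads as a selection pass rather than a merge. The plan shape below (a strategy plus stories and flagged insights) is hypothetical; only the winner-weighted policy itself comes from the protocol described above.

```python
def synthesize(winner: dict, loser: dict) -> dict:
    """Winner-weighted synthesis: adopt the winner's framework wholesale,
    then append only those of the loser's insights the judge validated.

    No 50/50 blending: the loser's structure is discarded entirely.
    """
    brief = {
        "strategy": winner["strategy"],       # core structure from the winner
        "stories": list(winner["stories"]),
    }
    for insight in loser["insights"]:
        if insight["validated"]:              # genuine insight, not diplomacy
            brief["stories"].append(insight["story"])
    return brief
```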
Case Study: The Security Remediation Duel
In February 2026, a three-agent security audit found 7 CRITICAL and 7 HIGH findings in ResumeGeni, a FastAPI application with Supabase auth and Stripe payments.4 I had already shipped two trivial fixes. Nine remained. I ran a duel to generate the remediation plan.
Both agents received the same briefing: the finding list with file paths and line numbers, the architecture context, the constraint that a proven fix pattern already existed in one file, and the requirement to produce a phased deployment plan.
Alpha’s plan: 11 stories for 9 findings, organized into three deployment waves. A test baseline story (SEC-01) blocked all subsequent work. Deployment gates with specific metrics: auth success rate, 5xx monitoring, webhook reject counts. Thorough rejected-alternatives discussion. Stories used a What/Why/Success structure with line ranges.
Beta’s plan: Direct 1:1 mapping of findings to stories. Three deployment waves: Critical as a single unit, High-priority as independently deployable, Cleanup. Investigation-before-fix for the middleware finding. Specific line numbers, function names, import paths, and curl commands for verification per story.
The correctness gap told the story. Beta caught two things Alpha missed entirely.
First: the middleware finding (C3) flagged `get_user(jwt=...)` as a session contamination vector. Beta correctly identified that `get_user()` is a stateless verification call. gotrue-py only calls `_save_session()` in `verify_otp` and `exchange_code_for_session`, not in `get_user`. Alpha treated it as definitively needing the same fix as the other two routers, which would add unnecessary per-request overhead on every authenticated request. Beta said: investigate first, fix only if confirmed.
Second: the magic link and passkeys routers use both `admin.generate_link()` (which requires the SERVICE_KEY singleton) and `verify_otp()` (which needs a fresh per-request client). Alpha’s plan applied the fresh-client pattern uniformly. An implementer following that plan would break admin operations. Beta explicitly called out the split: fresh client for `verify_otp`, shared singleton for `admin.generate_link()`.
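The split Beta flagged maps to a simple pattern: one module-level client for admin operations, one factory for per-request verification. This sketch stubs the Supabase client so the pattern runs without credentials; real code would build clients with supabase-py's `create_client()`, and the helper names are borrowed from the synthesis story, not the actual codebase.

```python
class StubClient:
    """Stand-in for a Supabase client; real code would call create_client()."""
    def __init__(self, key: str):
        self.key = key


SERVICE_KEY = "service-role-key-placeholder"

# Shared singleton: admin.generate_link() requires the SERVICE_KEY client,
# and admin calls carry no per-user session state, so one instance is safe.
_admin_client = StubClient(SERVICE_KEY)


def _create_auth_client() -> StubClient:
    # Fresh anonymous client per request: verify_otp() can persist session
    # state, so sharing one client risks cross-user session contamination.
    return StubClient("anon-key-placeholder")


def verify_magic_link(token_hash: str) -> StubClient:
    client = _create_auth_client()  # fresh client ONLY for verify_otp
    # client.auth.verify_otp({"token_hash": token_hash, "type": "magiclink"})
    return client


def generate_magic_link(email: str) -> StubClient:
    # Admin operation stays on the shared SERVICE_KEY singleton.
    # _admin_client.auth.admin.generate_link({"type": "magiclink", "email": email})
    return _admin_client
```

Applying the fresh-client factory to both call sites, as Alpha's plan did, would hand `generate_link()` an anon-key client and break the admin path.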
The scores:
| Dimension | Alpha | Beta |
|---|---|---|
| Correctness | 3 | 4 |
| Completeness | 3 | 3 |
| Simplicity | 2 | 4 |
| Decomposition | 3 | 3 |
| Actionability | 2 | 4 |
| Total | 13/20 | 18/20 |
Alpha was Codex. Beta was Claude. High confidence.4
The synthesis combined Beta’s technical precision with Alpha’s operational rigor. Here is one story from the synthesis output, showing how it merged both plans:
Story 1.1 (C1 — Magic Link Shared Singleton): In `magic_link.py`, add `_create_auth_client()`. Use fresh anon client ONLY for `verify_otp` (line 224). Keep shared singleton for `admin.generate_link()` (line 213), which needs SERVICE_KEY.
That story inherits Beta’s precise line numbers and the critical admin/anon client split, wrapped in a structure that slots into Alpha’s three-wave deployment sequence. The full synthesis kept Beta’s investigation-first approach for C3, Beta’s specific curl verification commands, Alpha’s deployment gates (auth success rate monitoring, 5xx tracking), and Alpha’s regression testing strategy (E2E Playwright auth suite after Wave 1, Stripe test webhook after Wave 2). The result: a 3-wave plan with 12 stories, executable within a day, with operational guardrails that neither plan alone provided.
The Scoreboard (and Why It Misleads)
Across 36 duels, 13 produced structured judgment manifests. One manifest declared undecided, leaving 12 with a clear winner:
| Task Type | Winner | Confidence |
|---|---|---|
| Job ingestion system design | Claude | Medium |
| Job ingestion code review | Codex | High |
| Job page UX design | Claude | High |
| ATS integration review | Claude | High |
| Job corpus expansion planning | Claude | High |
| Deliberation architecture | Codex | Low |
| NIST RFI public comment | Claude | High |
| NIST RFI revision | Claude | High |
| Codebase deep review | Claude | High |
| Security remediation planning | Claude | High |
| Calibration task | Codex | Low |
| Codebase analysis | Undecided | - |
Claude: 8. Codex: 3. Undecided: 1.11
Do not treat the scoreboard as a model benchmark. It is not one.
Claude’s wins concentrate in review, verification, and security tasks: 7 of 8 wins are high-confidence on tasks involving code review, security analysis, or technical evaluation. Codex’s one high-confidence win came on a code review task where its procedural thoroughness and explicit dependency chains outperformed Claude’s less structured approach. The other two wins were low-confidence. The pattern: Claude produces more actionable, technically precise plans. Codex produces stronger operational process and broader theoretical coverage.
Ricouard was right. Planning quality versus execution reliability is a real axis.1 But the scoreboard reflects my task mix (heavy on security and architecture review), not some objective model ranking. Someone running duels on greenfield feature development or infrastructure automation would likely see different results. Nathan Lambert’s analysis of the post-benchmark era makes the same point: traditional benchmarks no longer convey meaningful signal when the fine margins between Opus 4.6 and Codex 5.3 depend on task shape and evaluation methodology.10
The scoreboard tells you about my workflow. It does not tell you which model is “better.”
The Synthesis Is the Product
The winning plan is not the point. The synthesis is.
Every duel produces three artifacts: Plan Alpha, Plan Beta, and the Synthesis. The synthesis follows a consistent structure: adopt the winner’s core strategy, incorporate the loser’s valid edge cases, remove unnecessary scope from both. It is not diplomatic. It does not split the difference. It makes explicit choices about which elements to keep and which to discard, with written justification for each.
In the job corpus expansion duel, Claude’s plan activated existing infrastructure first (seed scripts for 8,780 known boards the system was not yet polling), then expanded to new ATS platforms, then built discovery systems.6 Codex’s plan started with a codebase audit and instrumentation spec before ingesting a single job. Claude won on simplicity and actionability. But Codex identified something Claude missed: the need for a board lifecycle state machine (active/failing/quarantined). Codex also flagged a deduplication regression audit to prevent volume expansion from masking duplicate explosion. The synthesis kept Claude’s activate-first sequencing and incorporated Codex’s observability and lifecycle tracking as Phase 1.5, after initial seeding delivered measurable results.
The same pattern appeared in the job ingestion system duel, where Claude’s plan reused existing APScheduler and registry tables while Codex proposed a more thorough two-table provenance schema. The synthesis adopted Claude’s pragmatic architecture and cherry-picked Codex’s dedup hash improvement.7
In the ATS review duel, Claude found P0 runtime crashes (method signature mismatches in the scheduler that would silently break job tracking) that Codex missed entirely.5 Codex found scheduler overlap prevention and admin endpoint abuse vectors that Claude did not flag. The synthesis started with Claude’s P0 fixes and supplemented with Codex’s operational hardening.
The pattern across 36 duels: the judge model consistently produces syntheses that are stronger than either contestant’s plan. The judge is not smarter. The adversarial structure forces full coverage.9 Each agent independently identifies risks and edge cases. The judge sees all of them. The synthesis inherits the union of both agents’ insights minus their individual blind spots.
The pattern connects to a broader finding from multi-agent deliberation: independence is the mechanism. Ten deliberation agents evaluating a decision from different perspectives catch biases that any single agent misses. Two dueling agents attacking the same task from different architectures catch implementation gaps that either agent alone would ship. The synthesis step is the same in both systems: combine independent evaluations into a single artifact that benefits from all perspectives.
I document the orchestration layer that supports both systems separately. What matters here is that dueling and deliberation serve complementary functions. Deliberation answers “should we build this?” Dueling answers “what is the strongest way to build it?”
When to Duel vs. When to Deliberate
Both systems use independent evaluation and synthesis. They serve different decision types.
| Decision Type | Tool | Why |
|---|---|---|
| Architecture decisions | Deliberation | 10 specialist perspectives catch risks across domains |
| “Should we build this?” | Deliberation | Cost Analyst, Maintenance Pessimist, UX Advocate |
| Implementation plans | Dueling | Competitive pressure produces more actionable plans |
| “How should we build this?” | Dueling | Two agents find different bugs and edge cases |
| Technical review | Dueling | Different review styles catch different defect categories |
| Risk assessment | Deliberation | Named thinking frameworks (inversion, pre-mortem) |
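The table reduces to a small routing rule. The decision-type labels and destinations come straight from the table; the function itself is an illustrative helper, not part of the actual tooling.

```python
# Decision types mapped to the system that handles them, per the table above.
ROUTES = {
    "architecture decision": "deliberate",
    "should we build this": "deliberate",
    "risk assessment": "deliberate",
    "implementation plan": "duel",
    "how should we build this": "duel",
    "technical review": "duel",
}


def route(decision_type: str) -> str:
    """Pick a system for a decision type; unknown types default to
    deliberation, the more conservative path."""
    return ROUTES.get(decision_type.lower(), "deliberate")
```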
My pattern: deliberate the design, duel the implementation plan, execute the synthesis.
A security remediation decision goes through deliberation first (“Is this the right prioritization? Are we missing systemic issues?”), then dueling (“What is the strongest phased plan to execute these fixes?”), then execution from the judge’s synthesis. The deliberation system and the dueling system share infrastructure but serve distinct purposes in the decision pipeline.
What I Got Wrong
Early duels had no blind labeling. I read both plans knowing which model wrote which. Confirmation bias was real and measurable: I consistently scored Claude higher on Actionability before blinding, then saw the gap narrow (though not disappear) after introducing random Alpha/Beta assignment. The blinding protocol is not optional.
I started with three scoring dimensions (Correctness, Completeness, Actionability). Two duels in, I realized I was conflating plan structure with plan content. Adding Simplicity (is this over-engineered?) and Decomposition (are the steps well-ordered?) separated those concerns and produced more useful scoring.
First synthesis attempts blended both plans equally. The results were incoherent: Alpha’s testing strategy grafted onto Beta’s deployment sequence, with neither plan’s assumptions holding. Weighted synthesis, where the judge explicitly adopts the winner’s framework and selectively incorporates the loser’s insights, was the breakthrough.
N=36 on my task mix is not a model benchmark. It is a workflow tool evaluation. The dueling system tells me which agent produced the stronger plan for my specific task in my specific codebase. Extrapolating to “Claude is better than Codex” would be the same vibes-based reasoning the system exists to eliminate.
I use Claude to judge duels between Claude and Codex. I acknowledge the conflict.8 The mitigation is structural: blind labeling, structured dimensions, and the fact that Codex won 3 duels and came close on several others. A stronger test would run the same duels through a non-Claude judge (Gemini or GPT) and compare scoring distributions. I have not done that yet. Until I do, the 8-3 split should carry an asterisk: the judge and one contestant share a model family.
References
1. Thomas Ricouard (@Dimillian), post on X, February 2026. Direct quote comparing Claude Code and Codex CLI: planning quality versus execution reliability as distinct evaluation axes. ↩
2. Alex Finn (@AlexFinn), post on X, February 2026. Dual validation workflow running Codex and Claude Code side by side, each plan validated against the other. ↩
3. @doodlestein, post on X, February 2026. Runs 10+ agents (Claude, Codex, Gemini) simultaneously using canned “idea wizard” prompts to attack the same problem from different architectures. ↩
4. Author’s duel session, `20260224-215831-security-remediation-plan-for-resumegeni`. Blind Alpha/Beta assignment, 5-dimension scoring, high-confidence judgment. Full session record includes `judgment.json`, `plan-claude.md`, `plan-codex.md`, and `agent-mapping.json`. ↩
5. Author’s duel session, `20260221-155640-review-the-resumegeni-ats-integration-im`. Claude (Alpha) identified P0 runtime crashes with specific line numbers. Codex (Beta) produced 13 procedural steps without pinpointing the actual bugs. Claude scored 18/20, Codex 13/20. High confidence. ↩
6. Author’s duel session, `20260224-103926-you-are-investigating-how-to-massively-e`. Job corpus expansion from 160K to 500K. Claude scored 20/20, Codex 13/20. Claude activated existing infrastructure first; Codex started with codebase audit. ↩
7. Author’s duel session, `20260221-120648-resumegeni-phase-1-build-modular-job-ing`. Job ingestion system design. Medium confidence, Claude (Beta) won on simplicity (4 vs 2) and actionability (4 vs 3). Codex (Alpha) had stronger theoretical completeness. ↩
8. Perez et al., “Red Teaming Language Models with Language Models,” arXiv:2202.03286, 2022. Demonstrates that LLMs can evaluate other LLMs through adversarial testing. The same-family evaluation bias concern is the author’s own observation, informed by the general finding that model-generated evaluations carry systematic biases. ↩
9. Van Valen, Leigh, “A New Evolutionary Law,” Evolutionary Theory, 1973. The Red Queen hypothesis: organisms must constantly adapt to maintain relative fitness against co-evolving competitors. Applied here by analogy: the adversarial structure of dueling creates similar pressure for plan quality. ↩
10. Nathan Lambert, “Opus 4.6, Codex 5.3, and the post-benchmark era,” Interconnects, February 2026. Argues that traditional benchmarks no longer convey meaningful signal when model differences depend on task shape and evaluation methodology. ↩
11. Author’s scoreboard across 36 total duels, 13 with structured judgment manifests, 12 with declared winners. Claude: 8 wins (7 high-confidence). Codex: 3 wins (1 high-confidence). Undecided: 1. Remaining 23 duels were calibration runs or lacked the structured judgment pipeline. ↩
12. Author’s companion post on collaborative multi-agent evaluation. See “Thinking With Ten Brains: How I Use Agent Deliberation as a Decision Tool.” ↩