
Thinking With Ten Brains: How I Use Agent Deliberation as a Decision Tool

From the guide: Claude Code Comprehensive Guide

I was three hours into designing a memory retrieval system for my Claude Code harness when I decided to run the decision through my multi-agent deliberation system instead. Ten AI agents evaluated the project independently. Nine of them had opinions about architecture, security, and performance. The tenth asked a question I had not thought to ask: “What does the problem you are solving actually cost?”

The answer killed the project. The token overhead I planned to optimize away cost less per month than a coffee. The retrieval system I planned to build would take 200-400 hours of engineering. Break-even: 18 to 36 years.1

Every other agent had produced useful analysis. The Technical Architect’s design was clean. The Security Analyst found real risks. The Performance Engineer’s math was precise. But none of them questioned whether the project should exist. I certainly had not questioned it. I was already anchored on the solution. The Cost Analyst had no such anchor, because it evaluates every proposal from zero.

TL;DR

You cannot remove cognitive biases by being aware of them. Kahneman proved this decades ago: even experts who study biases fall prey to them.2 Multi-agent deliberation is a structural intervention, not a prompting trick. Ten AI agents with different evaluation priorities force the externalization of reasoning, making blind spots visible before they become commitments. I built the architecture in January 2026 and have used it for two months on decisions ranging from memory systems to blog strategy to API design. This post is about the practice: how to think with ten brains, when to do it, and when it makes things worse.


The Problem With Your Own Head

Daniel Kahneman spent a career documenting a structural failure in human cognition. System 1 generates fast, intuitive judgments. System 2 is supposed to check them. In practice, System 2 operates in “a comfortable low-effort mode” and endorses System 1’s conclusions without scrutiny.2 Kahneman’s central finding: the oversight system is lazy. It rubber-stamps the intuition.

This maps directly to how most people use AI. You ask one agent a question. The agent generates an answer (System 1). You read the answer and decide if it sounds right (System 2). But your System 2 is evaluating the answer through the same biases that shaped the question. You anchored on your first framing. You gave the agent context that confirmed your existing hypothesis. The agent, trained to be helpful, reinforced your direction. At no point did anyone challenge the premise.

Here are the biases that hit hardest in engineering decisions:

| Bias | How It Appears | What Catches It |
| --- | --- | --- |
| Confirmation | Seeking data that supports the planned approach | Agent with opposing mandate |
| Anchoring | First estimate dominates all subsequent thinking | Independent estimation from multiple agents |
| Sunk cost | "We already built the foundation, might as well continue" | Cost Analyst who evaluates from zero |
| Availability | Overweighting the most recent production incident | Agent with access to historical patterns |
| Dunning-Kruger | Confident in areas where you lack depth | Domain specialist agent |
| Survivorship | "The last three deploys went fine" | Maintenance Pessimist who asks about the failures you forgot |

The counter-strategies are well documented: devil’s advocate processes, pre-mortem analysis, structured decision frameworks, external feedback loops.3 The problem is execution. Running a pre-mortem requires assembling people, scheduling time, and overcoming social pressure. Seeking a devil’s advocate requires finding someone willing to disagree with the person who signs their performance review.

Multi-agent deliberation removes the execution barrier. The agents are always available. They have no social incentives to agree. They evaluate independently by design, not by discipline.


Deliberation as Externalized Thinking

Sam Altman frames writing as “externalized thinking”: when a problem feels confusing, writing it down forces clarity.4 The same mechanism applies to structured debate. When ten agents articulate their reasoning in parallel, the reasoning becomes an artifact you can inspect.

This is not a new idea. Marvin Minsky proposed in The Society of Mind that intelligence emerges from the interaction of many smaller, simpler agents, not from a single sophisticated process.5 Andrew Ng identified three patterns for multi-agent systems: debate (propose, critique, revise), collaboration (parallel specialists with a synthesizer), and adversarial evaluation (red team versus blue team).6 Edward de Bono’s Six Thinking Hats framework, published in 1985, assigns parallel perspectives (facts, emotions, caution, optimism, creativity, process) to prevent groups from anchoring on a single mode of thinking.7

My deliberation system implements all three patterns simultaneously. The ten research agents are specialists (Ng’s collaborative pattern). The Debate and Synthesis agents create structured disagreement (Ng’s debate pattern). The Maintenance Pessimist and Security Analyst function as adversarial evaluators. Each agent maps to a thinking hat:

| Agent | De Bono's Hat | Thinking Mode |
| --- | --- | --- |
| Technical Architect | White | Facts, feasibility, integration patterns |
| Cost Analyst | White | Data, economics, break-even analysis |
| UX Advocate | Red | User feelings, cognitive load, friction |
| Security Analyst | Black | Risks, vulnerabilities, failure modes |
| Maintenance Pessimist | Black | Technical debt, long-term costs |
| Innovation Scout | Green | Novel approaches, alternatives |
| Performance Engineer | Yellow | Efficiency gains, optimization potential |
| Quality Guardian | Blue | Process, testing strategy, observability |
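
The hat assignments can be sketched as data. This is an illustrative structure only: the agent names, hats, and focus areas come from the table above, but the dictionary format and helper function are assumptions, not the harness's actual configuration.

```python
# Hypothetical roster structure. Names and hat assignments mirror the table
# in this post; the data shape itself is invented for illustration.
AGENTS = {
    "Technical Architect":   {"hat": "White",  "focus": "facts, feasibility, integration patterns"},
    "Cost Analyst":          {"hat": "White",  "focus": "data, economics, break-even analysis"},
    "UX Advocate":           {"hat": "Red",    "focus": "user feelings, cognitive load, friction"},
    "Security Analyst":      {"hat": "Black",  "focus": "risks, vulnerabilities, failure modes"},
    "Maintenance Pessimist": {"hat": "Black",  "focus": "technical debt, long-term costs"},
    "Innovation Scout":      {"hat": "Green",  "focus": "novel approaches, alternatives"},
    "Performance Engineer":  {"hat": "Yellow", "focus": "efficiency gains, optimization potential"},
    "Quality Guardian":      {"hat": "Blue",   "focus": "process, testing strategy, observability"},
}

def agents_wearing(hat: str) -> list[str]:
    """Return every agent assigned a given thinking hat."""
    return [name for name, spec in AGENTS.items() if spec["hat"] == hat]
```

Encoding the roster as data rather than prose makes the coverage question checkable: you can see at a glance that two agents wear the Black hat, so dissent is structurally guaranteed.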

The architecture is documented elsewhere. What matters here is the practice. Deliberation forces you to externalize the decision into a format where biases become visible. You stop asking “is this a good idea?” and start reading ten independent answers to “what could go wrong, what does the math say, and what alternatives exist?”

Pedro Domingos describes the ideal AI as a “mental exoskeleton”: something that extends your thinking rather than replacing it, represents your interests rather than flattering your conclusions.8 A deliberation panel that includes a devil’s advocate, a cost analyst, and a maintenance pessimist is exactly that. It amplifies the parts of your cognition that are structurally weak.


Case Study: The Memory Architecture Decision

In February 2026, I ran the first live test of the deliberation system on the question from the opening: what memory architecture should my Claude Code harness use across 12 active projects?1

My harness injects MEMORY.md files into every conversation. These files contain project decisions, patterns, error history, and architecture notes. The problem: most of that context is irrelevant to any given session. Only 5-10% of loaded memory matters for the current task. The rest is wasted tokens. An obvious optimization target.

Initial confidence scored 0.50, well below the 0.70 threshold that triggers deliberation. The system deployed all ten research agents. Each investigated independently with context isolation: agents could not see each other’s findings during research.
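
A minimal sketch of that trigger logic, assuming a harness where each proposal carries an initial confidence score in [0, 1]. The 0.70 threshold and the context-isolation rule come from this post; the function names and the `Finding` type are invented for illustration.

```python
from dataclasses import dataclass

DELIBERATION_THRESHOLD = 0.70  # below this, the full panel is deployed

@dataclass
class Finding:
    agent: str
    notes: str

def should_deliberate(initial_confidence: float) -> bool:
    """Low confidence triggers the ten-agent deliberation."""
    return initial_confidence < DELIBERATION_THRESHOLD

def run_research(agents, question, investigate):
    """Context isolation: each agent receives only the question,
    never another agent's findings."""
    return [Finding(agent=a, notes=investigate(a, question)) for a in agents]
```

The isolation is the point of the design: because `investigate` is called with the question alone, no agent can anchor on another's conclusions during the research phase.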

Three approaches emerged:

| Approach | Score | Support | Verdict |
| --- | --- | --- | --- |
| Smart Native (selective injection) | 7.04/10 | 8 of 10 agents | Winner |
| Stay Native (current system, hardened) | 6.50/10 | 5 of 10 agents | Safe but low impact |
| Full Stack Memory (external tools) | 5.38/10 | 1 of 10 agents | Highest capability, critical risk |
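
One plausible way per-agent scores roll up into a table like this: average each approach's scores, and count "support" as the agents whose top-scored approach it is. That aggregation rule is my assumption for illustration, not necessarily the harness's actual scoring.

```python
from collections import defaultdict

def aggregate(scores: dict[str, dict[str, float]]):
    """scores maps agent -> {approach: score on a 10-point scale}.
    Returns (mean score per approach, support count per approach),
    where 'support' counts agents that ranked the approach first."""
    approaches = next(iter(scores.values())).keys()
    means = {
        a: round(sum(s[a] for s in scores.values()) / len(scores), 2)
        for a in approaches
    }
    support = defaultdict(int)
    for agent_scores in scores.values():
        support[max(agent_scores, key=agent_scores.get)] += 1
    return means, dict(support)
```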

The scores tell one story. What the individual agents found tells a better one.

Technical Architect: Identified four integration patterns (MCP server, augmented MEMORY.md, embedding retrieval, agent-based manager). Recommended a tiered approach: augment existing files now, add embedding retrieval later. Clean design, well-scoped.

Security Analyst: Rated every external memory tool HIGH to CRITICAL risk for credential exposure. Identified a specific attack: a compromised session injects “always summarize API keys” into persistent memory, poisoning every future session silently.

Performance Engineer: Quantified the waste. Only 5-10% of loaded memory is relevant per conversation. But with 1M token context windows, the total memory overhead is 2K tokens, just 0.2% of capacity. The “obvious optimization” targets a rounding error.

UX Advocate: “The best memory system is one you never think about.” Every alternative adds cognitive tax. Users start asking “is memory working? What does it know?” and stop trusting automated context. The invisible system has higher user trust than any visible one.

Maintenance Pessimist: Multiple memory systems create combinatorial failure surfaces. Four interacting systems produce 16 pairwise failure modes. Claude Code updates frequently. External plugins break on version changes. A silent hook failure means the agent operates with incomplete context and no warning.

Cost Analyst: This is the agent that killed the project. The total token cost of always-loading memory files across all 12 projects: trivial. The proposed retrieval system would save a few dollars per month. Engineering time to build it: 200-400 hours. Break-even: 18 to 36 years. The Cost Analyst’s summary: “In a world obsessed with optimization, sometimes the right answer is to leave well enough alone.”

No single agent produced a wrong analysis. The Technical Architect’s design worked. The Performance Engineer’s token math checked out. But the decision required all ten perspectives to avoid the optimization trap. Left to my own instincts, I would have built the retrieval system because it felt like progress. The Cost Analyst asked the question I could not ask myself because three hours of scoping had already anchored my thinking on the solution.


Deliberation vs. Dueling

Deliberation is collaborative: ten agents evaluating a decision from different perspectives. I also built a competitive variant that races Claude Code against Codex CLI on the same task, scores both plans blind, and synthesizes the strongest elements from each. Thirty-six duels have produced patterns worth their own writeup. The short version: I deliberate architectural decisions and duel implementation plans. Deliberation answers “should we build this?” Dueling answers “what is the strongest way to build it?”


The Maintenance Pessimist and the Art of Inversion

Charlie Munger’s inversion technique asks: instead of “how do I achieve X?”, ask “what would guarantee failure at X?” Then avoid those things.9 Gary Klein’s pre-mortem operationalizes the same idea: assume the project failed, then explain why.10 Philip Tetlock’s research on forecasting accuracy found that “foxes” who integrate multiple perspectives consistently outperform “hedgehogs” who commit to one big idea.11

Each deliberation agent embodies a named thinking framework:

| Agent | Thinking Framework | The Question It Asks |
| --- | --- | --- |
| Maintenance Pessimist | Inversion (Munger) | "What will make us regret this in 6 months?" |
| Security Analyst | Pre-mortem (Klein) | "It shipped and got breached. What did we miss?" |
| Innovation Scout | Fox thinking (Tetlock) | "What approaches from other domains apply here?" |
| Cost Analyst | First principles | "What does the math actually say?" |
| UX Advocate | Empathy mapping | "How does the user experience this failure?" |

The Maintenance Pessimist is the most valuable agent in my system. Not because it is the smartest or the most thorough, but because it asks the question I am least likely to ask myself. When I am excited about building something, the last thing I want to think about is what it will cost to maintain in six months. The Maintenance Pessimist has no excitement. It has no sunk cost. It evaluates the proposal as if it already exists and asks what breaks.

In the memory architecture deliberation, the Maintenance Pessimist identified that four interacting memory systems produce 16 pairwise failure modes. Claude Code updates frequently. External plugins break on version changes. Silent hook failures mean the agent operates with incomplete context and no warning. These are not hypothetical risks. They are predictions based on patterns the pessimist has been trained to recognize.

Kahneman described the pre-mortem as one of the most effective debiasing techniques he knows of, because it legitimizes dissent.2 A deliberation agent that is designed to dissent removes the social cost entirely.


The Evidence Gate: Do Not Let Yourself Self-Report

My harness uses an Evidence Gate pattern for every completion report.12 The rule: feelings are not evidence. “I believe this works” is not a claim. Running the test suite and pasting the output is a claim.

| Criterion | Required Evidence | NOT Sufficient |
| --- | --- | --- |
| Follows codebase patterns | Name the pattern and file | "I followed best practices" |
| Simplest working solution | Name rejected alternatives and why | "It's clean" |
| Edge cases handled | List specific cases and how each resolves | "I considered edge cases" |
| Tests pass | Paste test output | "Tests should pass" |
| No regressions | Name related files and features checked | "Nothing else should be affected" |

Hedging language is a red flag: “should,” “probably,” “seems to,” “I believe,” “looks correct.” Each word signals that verification did not happen.12 This applies to human reasoning too. When you catch yourself saying “I’m pretty confident this is the right approach,” that is not evidence. That is System 2 rubber-stamping System 1.
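
A check like this is trivially mechanizable. The word list comes from this post; the regex and function are an illustrative sketch, not the harness's actual quality-loop rule.

```python
import re

# Hedge phrases named in the Evidence Gate: each one signals that
# verification did not happen.
HEDGES = re.compile(
    r"\b(should|probably|seems to|i believe|looks correct)\b",
    re.IGNORECASE,
)

def flag_hedging(report: str) -> list[str]:
    """Return every hedge phrase found in a completion report.
    A non-empty result means the report asserts belief, not evidence."""
    return [m.group(0) for m in HEDGES.finditer(report)]
```

A gate built on this would bounce the report back for re-verification whenever the list is non-empty, rather than trusting the agent's self-assessment.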

Multi-agent deliberation enforces the Evidence Gate structurally. The Cost Analyst does not say “this probably makes economic sense.” It says “$9/month current cost, $5/month savings, 200-400 hours to build, 18-36 year break-even.” The Security Analyst does not say “the security posture looks reasonable.” It says “memory poisoning scenario: compromised session injects credential-harvesting instructions into persistent memory.”

The most effective debiasing mechanism I have found is not a checklist or a philosophy. It is a system where the agents cannot self-report. They must produce evidence, and that evidence gets evaluated by other agents who have no incentive to agree.


When NOT to Deliberate

Deliberation has failure modes too. The system adds 2-4 minutes and $2-3 per invocation at full scale. More importantly, it can overcorrect.

I ran a deliberation on a straightforward API endpoint refactor. Ten agents produced concerns about backward compatibility, migration paths, rate limiting, error handling, monitoring, and documentation. The endpoint served two internal consumers. The deliberation generated 14 action items for what should have been a 20-line change. I ignored 12 of them and shipped the refactor. The deliberation was technically correct and the risks were real, but the decision was a two-way door.13

Jeff Bezos distinguishes Type 1 decisions (irreversible, one-way doors) from Type 2 decisions (reversible, two-way doors). Type 1 decisions demand careful deliberation: database schema changes, security architecture, public API contracts. Type 2 decisions demand speed: internal refactors, documentation updates, feature flag experiments.13 Applying heavyweight process to lightweight decisions is its own form of waste.

Rules I follow:

Deliberate when:

- The decision is irreversible or expensive to reverse
- Multiple trade-offs require specialist evaluation
- Your confidence is below 0.70 (you feel uncertain but cannot articulate why)
- The domain is outside your primary expertise

Just decide when:

- The change is behind a feature flag or easily reverted
- The scope is contained (one file, one function, one endpoint)
- You have made this type of decision successfully before
- The cost of being wrong is lower than the cost of deliberating

Never deliberate on:

- Documentation fixes
- Variable renames
- Test fixture updates
- Log message changes
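
The rules above could be encoded as a simple triage gate. The fields and the 0.70 threshold mirror the lists in this section; the function itself is a sketch of the policy, not harness code.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    irreversible: bool = False       # one-way door
    multi_tradeoff: bool = False     # needs specialist evaluation
    confidence: float = 1.0          # your own confidence in [0, 1]
    outside_expertise: bool = False
    trivial: bool = False            # doc fix, rename, fixture, log message

def triage(d: Decision) -> str:
    if d.trivial:
        return "just decide"         # never deliberate on these
    if (d.irreversible or d.multi_tradeoff
            or d.confidence < 0.70 or d.outside_expertise):
        return "deliberate"          # Type 1 territory
    return "just decide"             # Type 2: speed wins
```

Note the ordering: triviality short-circuits everything else, which is exactly what keeps a variable rename from triggering a ten-agent panel.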

The 10% of decisions that warrant deliberation produce 90% of the value. Deliberating everything produces analysis paralysis. Deliberating nothing ships the biases you cannot see.


What I Have Learned in Two Months

The system has run approximately 40 deliberations since January 2026. Patterns:

  1. The Cost Analyst is the most underrated agent. Engineers instinctively reach for the Performance Engineer and Security Analyst. The Cost Analyst has killed more bad ideas than any other persona by asking the one question engineers hate: “what does this actually cost?”

  2. Consensus below 0.70 means the question is wrong. When agents cannot agree, the problem is usually ambiguous framing, not genuine disagreement. Re-scoping the question and running again produces better results than forcing convergence.

  3. The Maintenance Pessimist catches what post-mortems find too late. Every concern the Maintenance Pessimist raised about the memory architecture has since been validated by the actual experience of maintaining simpler systems.

  4. Two agents capture 80% of the value. The minimum viable pattern: one agent argues FOR, one argues AGAINST. Independence is the mechanism. Ten agents are better, but two agents are infinitely better than one.

  5. Deliberation improves the question, not just the answer. The most common outcome is not “the winning approach.” It is “the question reframed in a way that makes the answer obvious.”
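
The minimum viable pattern from point 4 fits in a dozen lines. `ask` here is a stand-in for whatever model call you use, not a real API; the prompts are illustrative. The structural property that matters is that neither debater ever sees the other's argument.

```python
def minimal_deliberation(proposal: str, ask) -> dict:
    """Two independent agents with opposing mandates, then a synthesis.
    `ask` is any callable mapping a prompt string to a response string."""
    advocate = ask(f"Argue FOR this proposal. Steelman it:\n{proposal}")
    critic = ask(f"Argue AGAINST this proposal. What guarantees failure?\n{proposal}")
    # Only the synthesis step sees both arguments.
    verdict = ask(
        f"Proposal:\n{proposal}\n\nFOR:\n{advocate}\n\nAGAINST:\n{critic}\n"
        "Weigh both and recommend: proceed, revise, or reject."
    )
    return {"for": advocate, "against": critic, "verdict": verdict}
```

Independence is enforced by construction: the FOR and AGAINST calls share only the proposal, so neither can anchor on the other.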


References


  1. Author’s deliberation session delib-20260207-082618-9105e6. 10 research agents, 3 approaches generated, winning approach scored 7.04/10 with 8/10 agent support. Full session record in Obsidian vault. 

  2. Kahneman, Daniel, Thinking, Fast and Slow, Farrar, Straus and Giroux, 2011. System 2 operates in “a comfortable low-effort mode” and endorses System 1 conclusions without scrutiny. 

  3. Author’s vault note, “20 Cognitive Biases That Mess Up Your Decisions.” Counter-strategies: devil’s advocate process, pre-mortem analysis, structured decision frameworks, external feedback loops. 

  4. Altman, Sam. “I think of writing as externalized thinking. If I have a very hard problem or if I feel a little bit confused about something, I have to write it down.” Via @StartupArchive_. 

  5. Minsky, Marvin, The Society of Mind, Simon & Schuster, 1986. Intelligence emerges from the interaction of many smaller, simpler agents, not from a single sophisticated process. 

  6. Ng, Andrew. Multi-agent AI patterns: debate (propose-critique-revise), collaboration (parallel specialists with synthesizer), adversarial (red team vs. blue team). Reported March 2024. 

  7. de Bono, Edward, Six Thinking Hats, Little, Brown and Company, 1985. Six parallel perspectives prevent anchoring on a single thinking mode. 

  8. Domingos, Pedro. AI as “mental exoskeleton”: extend rather than replace human cognition, represent user interests rather than flattering conclusions. 

  9. Munger, Charlie. Inversion thinking: instead of “How do I achieve X?”, ask “What would guarantee failure at X?” Then avoid those things. Frequently cited in Berkshire Hathaway shareholder meetings. 

  10. Klein, Gary, “Performing a Project Premortem,” Harvard Business Review, September 2007. Assume the project failed, then explain why. Based on research by Mitchell, Russo, and Pennington (1989) showing prospective hindsight increases identification of failure reasons by 30%. 

  11. Tetlock, Philip E., Expert Political Judgment: How Good Is It? How Can We Know?, Princeton University Press, 2005. “Foxes” who integrate multiple perspectives consistently outperform “hedgehogs” who commit to one idea. Expanded in Superforecasting (Crown, 2015). 

  12. Author’s Evidence Gate pattern. Implementation in Quality Loop rules (~/.claude/rules/quality-loop.md). Hedging language triggers mandatory re-verification. See also Jiro Quality Philosophy. 

  13. Bezos, Jeff, 2015 Letter to Amazon Shareholders (SEC filing). Type 1 decisions: irreversible, one-way doors requiring careful deliberation. Type 2 decisions: reversible, two-way doors requiring speed. 
