
The Protege Pattern

From the guide: Claude Code Comprehensive Guide

A 7-billion parameter model solved 42.4% of SWE-bench Verified tasks. The previous small-model record was 17.0%. The model did not get smarter. The model learned when to ask for help.1

Kon et al. trained a Qwen2.5-Coder-7B-Instruct model to collaborate with a frontier model as an expert. The expert answered roughly four questions per task, consuming 11% of total tokens.1 The remaining 89% of tokens came from the small model executing routine operations: reading files, running tests, applying patches. The cost dropped from $0.54-$1.24 per instance (expert-only) to $0.13-$0.15 (protege with expert).1 An 8.2x cost reduction with a 25.4-point performance gain over the previous small-model state of the art.

The result validates a pattern that practitioners have been converging on independently: the protege pattern.

TL;DR

The protege pattern splits agent work between a small, cheap model (the protege) that handles routine execution and a frontier model (the expert) that handles judgment calls. SWE-Protege demonstrated a 25.4-point improvement and 8.2x cost reduction.1 Anthropic’s own multi-agent research system uses the same tier split: Claude Opus for the lead agent, Claude Sonnet for subagents.3 The pattern works because most agent work is mechanical. Routing that mechanical work to a model 5x cheaper per token reclaims 80% of the cost budget without sacrificing quality on the decisions that matter.


The Expert-Protege Framework

SWE-Protege frames the relationship with precision.1 The protege is the sole decision-maker. The expert never initiates. The protege chooses when to escalate, what question to ask, and how to incorporate the response. Reinforcement learning trains the protege to optimize two competing objectives: solve the task AND minimize expert usage.
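The control flow the paper describes can be sketched as a loop the protege drives end to end. This is an illustrative skeleton, not the paper's implementation: all class and method names (protege.decide, expert.answer, env.execute) are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # "local" | "ask_expert" | "submit"
    payload: str = ""

def protege_loop(task: str, protege, expert, env, max_steps: int = 50):
    context = [task]
    for _ in range(max_steps):
        action = protege.decide(context)  # the protege is the sole decision-maker
        if action.kind == "ask_expert":
            # The expert never initiates; it only answers when asked.
            context.append(expert.answer(action.payload))
        elif action.kind == "submit":
            return action.payload  # final patch
        else:
            # Routine operations (reads, tests, patches) stay on the cheap tier.
            context.append(env.execute(action.payload))
    return None  # step budget exhausted without a submission
```

The key structural property is that escalation is just another action in the protege's action space, which is what makes it trainable by reinforcement learning.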

The RL reward structure penalizes three failure modes:

Degenerative looping. The protege asks the same question repeatedly. The penalty discourages learned helplessness.

Unproductive collaboration. The protege asks a question, ignores the answer, and continues with its original plan. The penalty discourages performative escalation.

Over-reliance. The protege routes every decision to the expert. The penalty discourages the protege from becoming a pass-through layer.

The result is a protege that develops genuine judgment about its own limitations. The 7B model learned to distinguish between tasks it could handle alone (file reads, test execution, straightforward patches) and tasks requiring expert intervention (architectural decisions, ambiguous requirements, multi-file dependency analysis).1
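The three penalties above can be made concrete with a reward-shaping sketch. The weights, heuristics, and function signature here are hypothetical; the paper's actual reward terms differ in detail.

```python
def shaped_reward(solved: bool, questions: list[str],
                  answer_used: list[bool], total_steps: int) -> float:
    """Illustrative reward: task success minus the three collaboration penalties."""
    reward = 1.0 if solved else 0.0

    # Penalty 1: degenerative looping -- the same question asked repeatedly.
    repeats = len(questions) - len(set(questions))
    reward -= 0.2 * repeats

    # Penalty 2: unproductive collaboration -- answers received but never used.
    ignored = sum(1 for used in answer_used if not used)
    reward -= 0.1 * ignored

    # Penalty 3: over-reliance -- escalating on too large a share of steps.
    if total_steps and len(questions) / total_steps > 0.5:
        reward -= 0.3

    return reward
```

Each penalty targets one failure mode: asking again costs more than asking once, asking without incorporating costs more than not asking, and routing everything upward costs more than acting alone.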


Why Routing Works

The academic foundation for model routing predates SWE-Protege. RouteLLM demonstrated that routing between a strong model and a weak model achieves up to 3.66x cost savings while maintaining 95% of the strong model’s quality.11 The router learns which queries require frontier capability and which queries a smaller model handles equally well.

IBM Research found similar results with a “frugal” routing method: calling smaller, specialized models sequentially until one produces a confident answer.14 The approach achieves up to 85% cost reduction on straightforward queries.

The underlying insight is distributional. Most agent operations are not hard. Reading a file, running a grep, applying a well-defined patch, executing a test suite: these operations require correct execution, not deep reasoning. A model 5x cheaper per token handles them identically to a frontier model.7 The hard operations (diagnosing a subtle bug, choosing between architectural approaches, evaluating whether a solution is correct) benefit from frontier reasoning. The protege pattern routes each operation to the appropriate tier.

Anthropic’s own documentation makes the tier split explicit. The “Choosing the Right Model” guide recommends Haiku for “sub-agent tasks” and Opus for “professional software engineering” and “advanced agents.”8 The guidance is not marketing. The guidance reflects measured performance differences across task complexity distributions.


Production Implementations

Three production systems demonstrate the protege pattern at scale.

Anthropic’s multi-agent research system. Claude Opus leads, Claude Sonnet executes as subagents.3 The system outperformed single-agent Claude Opus by 90.2% on internal evaluation. The improvement came not from a better model but from better task decomposition. Sonnet subagents consumed the bulk of tokens on research operations while Opus focused reasoning budget on synthesis and judgment.

Carlini’s C compiler. Sixteen parallel Claude agents produced a 100,000-line Rust-based C compiler that builds bootable Linux 6.9.4 Cost: $20,000 across ~2,000 sessions. While all agents ran at the same tier, the project revealed the self-organizing property that the protege pattern formalizes: agents naturally gravitated toward “the next most obvious problem.”4 No central orchestrator assigned tasks.

Chris Lattner reviewed the compiler and identified the boundary between what AI agents handle well and where human judgment remains essential: “Lower barriers to implementation do not reduce the importance of engineers; instead, they elevate the importance of vision, judgment, and taste.”56 The agents excelled at assembling known techniques. The agents struggled with “the open-ended generalization required for production-quality systems.”5

Model routing in practice. The “What Claude Code Chooses” study analyzed 2,430 tool picks across three Claude models.9 Opus 4.6 showed forward-looking preferences (Drizzle 100% vs Prisma 0%), while Sonnet 4.5 made more conventional choices.9 The divergence drew significant community discussion.10 Different tiers bring different biases to ambiguous decisions. A protege making routine tool selections does not need frontier reasoning. A protege encountering an ambiguous architectural choice benefits from escalation.


Cost Arithmetic

The economics make the pattern compelling even before considering performance gains.

At current Anthropic pricing, the tier spread is exactly 5x:7

Model       Input     Output     Role
Opus 4.6    $5/MTok   $25/MTok   Expert
Haiku 4.5   $1/MTok   $5/MTok    Protege

A typical agent session consumes 50,000-200,000 tokens in each direction. Assuming 100K input and 100K output tokens at Opus-only pricing, a session costs $0.50 input + $2.50 output = $3.00. If the protege handles 80% of tokens and the expert handles 20%, the same session costs:

  • Protege (80K tokens): $0.08 input + $0.40 output = $0.48
  • Expert (20K tokens): $0.10 input + $0.50 output = $0.60
  • Total: $1.08 (64% savings)

SWE-Protege achieved even more aggressive savings because the expert consumed only 11% of tokens, not 20%.1 Across 100 agent sessions per day, the difference compounds: $300/day at expert-only versus $108/day with protege routing. Over a month: $9,000 versus $3,240.
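The arithmetic above is simple enough to verify directly. Pricing matches the table (Opus 4.6 at $5/$25 per MTok, Haiku 4.5 at $1/$5); the 80/20 token split is this article's working assumption.

```python
def session_cost(in_tok: int, out_tok: int,
                 in_rate: float, out_rate: float) -> float:
    """Dollar cost for one session, given $/MTok rates."""
    return (in_tok * in_rate + out_tok * out_rate) / 1_000_000

opus_only = session_cost(100_000, 100_000, 5, 25)    # $3.00
routed = (session_cost(80_000, 80_000, 1, 5)         # protege share: $0.48
          + session_cost(20_000, 20_000, 5, 25))     # expert share:  $0.60
savings = 1 - routed / opus_only                     # ~64%
```

Dropping the expert's share from 20% to SWE-Protege's measured 11% pushes the same calculation further in the protege's favor.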

The SWE-bench leaderboard provides the performance context.12 Claude 4.5 Opus at high reasoning achieves 76.8% resolve rate at $0.754 per instance. A protege-routed approach at 42.4% resolve rate costs $0.13-$0.15 per instance.1 For tasks within the protege’s capability, the cost-per-solved-task favors routing. For tasks requiring frontier reasoning, the expert remains available on demand.


The Collaborativeness Phenomenon

Wang et al. discovered a property that explains why the protege pattern produces better results than either model alone.13 The “Mixture-of-Agents” paper found that models generate better responses when presented with outputs from other models, even when those other models are less capable.13

The finding inverts the expected hierarchy. A frontier model reading a small model’s initial analysis and file readings produces better output than the frontier model starting from scratch. The small model’s work is not just cheap labor offloaded from the expert. The small model’s work provides structured context that improves the expert’s reasoning.

Anthropic’s multi-agent research confirmed the pattern: upgrading subagents from Sonnet 3.7 to Sonnet 4 produced “a larger performance gain than doubling the token budget on Claude Sonnet 3.7.”3 Model quality at the protege tier matters. A better protege makes a better expert.


What You Can Build

Three escalation patterns map to progressively more autonomous implementations.

Pattern 1: Confidence-based routing. The simplest implementation. The protege generates a response and a confidence score. Below a threshold, the query routes to the expert. RouteLLM provides an open-source framework for training the router.11 Start here.
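A minimal sketch of confidence-based routing, under the assumption that the protege call returns an answer plus a self-reported confidence in [0, 1]. The threshold and callables are placeholders, not a RouteLLM API.

```python
CONFIDENCE_THRESHOLD = 0.7  # tune against a calibration dataset

def route(query: str, ask_protege, ask_expert) -> str:
    """Answer with the protege when confident; otherwise escalate."""
    answer, confidence = ask_protege(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer
    # Pass the protege's draft along: per the collaborativeness
    # phenomenon, drafts improve the expert's output.
    return ask_expert(query, draft=answer)
```

In practice the threshold should come from calibration data, since raw model confidences are often poorly calibrated out of the box.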

Pattern 2: Task-type routing. Classify operations by type and route deterministically. File reads, test execution, and formatting to Haiku. Code review, architecture decisions, and ambiguous requirements to Opus. Anthropic’s “Building Effective Agents” guide calls this the routing pattern: “classifying inputs and directing easy/common questions to smaller, cost-efficient models.”2
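Task-type routing reduces to a lookup table once the taxonomy exists. The operation names and tier assignments below are illustrative, not a prescribed schema.

```python
ROUTING_TABLE = {
    "read_file": "haiku",
    "run_tests": "haiku",
    "format_code": "haiku",
    "code_review": "opus",
    "architecture_decision": "opus",
    "ambiguous_requirement": "opus",
}

def pick_model(operation: str) -> str:
    # Unknown operations default to the expert tier: paying 5x on a
    # misclassified routine task is cheaper than a bad judgment call.
    return ROUTING_TABLE.get(operation, "opus")
```

The default-to-expert choice is the conservative one; teams optimizing harder for cost can invert it once the taxonomy has good coverage.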

Pattern 3: Learned escalation. The SWE-Protege approach. Train the protege to decide its own escalation points through reinforcement learning.1 The protege develops genuine judgment about its limitations. The most sophisticated and highest-performing pattern, but requires RL infrastructure and expert-labeled training data.

Each pattern trades implementation complexity for cost savings and autonomy. Pattern 1 requires a confidence calibration dataset. Pattern 2 requires a task taxonomy. Pattern 3 requires RL training runs. All three outperform single-tier deployment on cost-adjusted performance.


Key Takeaways

  • The protege pattern is not load balancing. The protege makes decisions about its own limitations. The expert provides judgment, not throughput.
  • Most agent work is mechanical. Routing that work to a 5x cheaper model reclaims the cost budget for the decisions that require frontier reasoning.
  • Better proteges make better experts. The collaborativeness phenomenon means small-model outputs improve frontier-model reasoning.13
  • Lattner’s observation applies to the pattern itself: “As writing code is becoming easier, designing software becomes more important than ever.”5 The protege handles the easier writing. The expert handles the harder designing.

Part of the AI Engineering series. See also: Context Is the New Memory, Claude Code as Infrastructure, and The 10% Wall.


  1. Kon, P.T.J., Pradeep, A., Chen, A., Ellis, A.P., Hunt, W., Wang, Z., Yang, J., & Thompson, S. “SWE-Protege: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents.” arXiv:2602.22124. 42.4% Pass@1 on SWE-bench Verified, 8.2x cost reduction, expert consulted ~4 times per task. 

  2. Schluntz, E. & Zhang, B. “Building Effective Agents.” Anthropic Research Blog. Routing pattern: easy questions to Haiku, hard questions to Sonnet/Opus. 

  3. Hadfield, J. et al. “How We Built Our Multi-Agent Research System.” Anthropic Engineering Blog. Opus lead + Sonnet subagents, 90.2% improvement over single-agent Opus. 

  4. Carlini, N. “Building a C Compiler with a Team of Parallel Claudes.” Anthropic Engineering Blog. 16 agents, $20K, 100K lines, bootable Linux. 

  5. Lattner, C. “The Claude C Compiler: What It Reveals About the Future of Software.” Modular Blog. “Lower barriers to implementation elevate the importance of vision, judgment, and taste.” 

  6. Willison, S. “The Claude C Compiler.” Simon Willison’s Weblog. Commentary synthesizing Carlini and Lattner perspectives. 

  7. Anthropic Model Pricing. Pricing page. Opus 4.6: $5/$25 MTok. Haiku 4.5: $1/$5 MTok. 5x tier spread. 

  8. Anthropic. “Choosing the Right Model.” API Documentation. Haiku for “sub-agent tasks,” Opus for “professional software engineering.” 

  9. Ong, E. & Vikati, A. “What Claude Code Actually Chooses.” Amplifying Research. 2,430 tool picks, Opus shows forward-looking preferences. 

  10. Hacker News. “What Claude Code Chooses.” Discussion. 573 points, 213 comments. 

  11. Ong, I. et al. “RouteLLM: Learning to Route LLMs with Preference Data.” ICLR 2025. arXiv:2406.18665. 3.66x cost savings, 95% quality retention. 

  12. SWE-bench. “SWE-bench Leaderboards.” swebench.com. Claude 4.5 Opus: 76.8% at $0.754/instance. 

  13. Wang, J. et al. “Mixture-of-Agents Enhances Large Language Model Capabilities.” ICLR 2025 Spotlight. arXiv:2406.04692. Weaker models improve stronger models through structured collaboration. 

  14. IBM Research. “LLM Routing for Quality, Low-Cost Responses.” IBM Research Blog. Up to 85% cost reduction with frugal routing. 
