Building AI Systems: From RAG to Agents
Most teams start with RAG, discover its limits, then bolt on fine-tuning. Between them, the two techniques cover retrieval and reasoning. Neither solves orchestration: deciding when to act, how many agents to spawn, when to stop, and what consensus means. I built a multi-agent deliberation system (3,500 lines of Python, 86 hooks, 141 tests) that handles all three: retrieval, reasoning, and orchestration.
TL;DR
RAG retrieves documents at query time. Fine-tuning modifies model weights with domain data. Both are retrieval and reasoning tools. Neither handles orchestration: coordinating multiple agents, validating consensus, or deciding when a task needs one model call versus twelve. I hit this wall building a blog quality system that needed parallel linting, depth scoring, citation verification, and LLM evaluation across 33 posts. The solution was an agent orchestration layer with confidence-triggered deliberation, spawn budget management, and four-check consensus validation. This post covers the RAG-vs-fine-tuning decision, then goes where most guides stop: what happens when you need agents.
Part 1: RAG vs Fine-Tuning
A 2024 Databricks study found that 78% of enterprise AI teams chose RAG first, yet 34% later discovered fine-tuning would have been the better approach, wasting an average of 3.2 months in implementation time.1
The decision isn’t either/or. It’s about matching the technique to the problem.
When RAG Wins
Frequently changing knowledge. RAG retrieves current documents at query time. When the knowledge base updates daily (product docs, support articles, regulatory filings), RAG serves current information without retraining.2
Source attribution requirements. RAG cites specific documents because retrieval produces an explicit source list. In regulated industries (healthcare, finance, legal), source attribution is often a compliance requirement.3
Large knowledge bases. A RAG system over 10 million documents performs comparably to one over 1 million if retrieval quality is consistent. Fine-tuned models hit capacity limits determined by model size.4
When Fine-Tuning Wins
Domain-specific reasoning patterns. RAG provides information. Fine-tuning provides capability. A model fine-tuned on medical diagnosis conversations learns differential diagnosis patterns: how to weight symptoms, when to consider rare conditions, how to formulate follow-up questions. RAG can supply the medical knowledge, but the reasoning pattern requires weight adjustment.5
Strict output format requirements. Fine-tuning enforces structured output (JSON, XML, specific schemas) more reliably than prompt engineering. For systems where format failures cause downstream errors, fine-tuning provides higher reliability.6
Latency-critical applications. RAG adds retrieval latency: embed the query, search the vector database, retrieve documents, inject them into the prompt. For applications with sub-200ms response targets, eliminating retrieval through fine-tuning may be necessary.7
The Comparison Matrix
| Dimension | RAG | Fine-Tuning | Both |
|---|---|---|---|
| Knowledge freshness | Hours | Weeks-months | Hours |
| Setup cost | Low-medium | Medium-high | High |
| Per-query cost | Higher (retrieval + generation) | Lower (generation only) | Highest |
| Source attribution | Native | Difficult | Partial |
| Output format control | Moderate | High | High |
| Domain reasoning | Weak | Strong | Strong |
| Knowledge base size | Unlimited | Limited by model | Unlimited |
| Latency | Higher | Lower | Highest |
| Hallucination control | Better (grounded in docs) | Varies | Best |
The Combined Approach
Most production systems combine both techniques. Fine-tune the model on domain reasoning patterns and output formats. Use RAG to provide current, attributable knowledge at query time. The fine-tuned model knows how to reason about the domain. The RAG system provides what to reason about.8
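The wiring is simpler than it sounds. Here is a minimal sketch of the combined pattern; the retriever and model interfaces are illustrative placeholders, not any particular library's API: retrieve at query time, then hand the grounded context to the model that was fine-tuned for the domain's reasoning patterns and output format.

```python
# Sketch of the combined pattern: RAG supplies fresh, citable context;
# the fine-tuned model supplies domain reasoning and format discipline.
# All interfaces below are illustrative placeholders.

def answer(query: str, retriever, finetuned_model, k: int = 5) -> dict:
    # Retrieve current documents so answers stay fresh and attributable.
    docs = retriever.search(query, top_k=k)

    # Ground the prompt in retrieved context; the fine-tuned weights already
    # encode how to reason about the domain and which schema to emit.
    context = "\n\n".join(f"[{d.id}] {d.text}" for d in docs)
    prompt = (
        f"Context documents:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above and cite document ids."
    )

    response = finetuned_model.generate(prompt)
    return {"answer": response, "sources": [d.id for d in docs]}
```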
Part 2: When You Need Agents
RAG and fine-tuning handle retrieval and reasoning. Neither handles orchestration: deciding whether a task needs one model call or twelve, when to spawn parallel workers, how to validate their outputs, and when to escalate to a human.
I hit this wall building my blog quality infrastructure. I had 33 blog posts to evaluate, fix, and verify. A single model call per post wasn’t enough. Each post needed linting (12 modules), depth scoring (5 signals), readability analysis, citation verification, and LLM evaluation. Running these sequentially took too long. Running them in parallel without coordination produced conflicts and inconsistent results.
The solution wasn’t more RAG or better fine-tuning. It was an agent orchestration layer.
What Agent Orchestration Requires
The traditional ML pipeline assumes linear flow: data, preprocess, model, evaluate, deploy.9 Agent systems are non-linear. An agent might:
- Evaluate its own confidence and decide it needs help
- Spawn 5 parallel sub-agents with different expertise
- Collect and rank their outputs
- Detect when agents converge too quickly (groupthink)
- Validate consensus against quality thresholds
- Generate a final recommendation
Each step requires infrastructure that RAG and fine-tuning don’t provide.
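Stripped to its control flow, the loop looks roughly like this. Every name below is a hypothetical stand-in for the shape of the process, not one of the real modules:

```python
# Hypothetical skeleton of one deliberation cycle: self-assess, fan out,
# rank, check for groupthink, validate consensus. Shape only, not the
# actual implementation.

def deliberate(task, orchestrator):
    confidence = orchestrator.score_confidence(task)
    if confidence >= orchestrator.threshold(task):
        return orchestrator.single_pass(task)          # one call is enough

    agents = orchestrator.spawn_parallel(task, n=5)    # distinct personas
    ranked = orchestrator.rank(orchestrator.collect(agents))

    if orchestrator.converged_too_fast(ranked):        # groupthink guard
        extra = orchestrator.adversarial_round(agents, ranked)
        ranked = orchestrator.rank(extra)

    consensus = orchestrator.validate_consensus(ranked)
    return orchestrator.recommend(ranked, consensus)
```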
Part 3: What I Built
The Architecture
My deliberation system spans 3,500 lines of Python across 12 modules:
Deliberation System
├── confidence.py Triggers deliberation based on ambiguity/complexity/stakes
├── state_machine.py Workflow: idle → research → ranking → consensus
├── agents.py 5+ persona templates (Architect, Security, Performance...)
├── context_isolation.py RLM L0-L3 layers prevent context contamination
├── ranking.py Stack ranking with weighted consensus calculation
├── spawner.py Parallel agent spawning via Task API
├── conformity.py Detects groupthink and premature convergence
├── mailbox.py Multi-round debate protocol
├── memory.py Cross-session learning and persona tracking
├── scaling.py Dynamic agent count based on complexity
├── prd_generator.py Converts decisions into product requirements
└── observability.py Session metrics and audit replay
The system sits on top of 86 hooks that intercept operations at six lifecycle points (PreToolUse, PostToolUse, Stop, and three others). Every agent spawn, every file write, every git command passes through validation.
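Conceptually, the hook layer is a registry of validators keyed by lifecycle point. A stripped-down illustration (the decorator, registry, and example rule here are mine, not the actual hook code):

```python
# Illustrative hook registry: validators register against lifecycle points,
# and every intercepted operation is dispatched through all matching hooks.

from collections import defaultdict

HOOKS = defaultdict(list)  # lifecycle point -> registered validators

def hook(point: str):
    def register(fn):
        HOOKS[point].append(fn)
        return fn
    return register

@hook("PreToolUse")
def block_unreviewed_push(event: dict):
    # Example rule: refuse shell commands that push without review.
    if event.get("tool") == "Bash" and "git push" in event.get("command", ""):
        raise PermissionError("git push requires an explicit review step")

def dispatch(point: str, event: dict):
    # Agent spawns, file writes, and git commands all arrive here as events.
    for validator in HOOKS[point]:
        validator(event)
```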
The Confidence Trigger
Not every task needs five agents debating. I built a confidence scoring algorithm that evaluates four dimensions (the same weighted-composite pattern I use for knowledge triage):
- Ambiguity - Does the query have multiple valid interpretations? (Pattern matches: “best way,” “should I,” “compare vs,” “pros and cons”)
- Domain complexity - Does it require specialized knowledge? (Pattern matches: “architecture,” “security,” “performance,” “database schema”)
- Stakes - Is the decision reversible? (Pattern matches: “production,” “breaking change,” “delete,” “security vulnerability”)
- Context dependency - Does it require understanding the broader system?
The score maps to three levels:
- HIGH (0.85+): Proceed without deliberation
- MEDIUM (0.70-0.84): Proceed with a confidence note logged
- LOW (below 0.70): Trigger full multi-agent deliberation
The threshold adapts by task type. Security decisions require 85% consensus. Documentation changes need only 50%. This prevents over-engineering simple tasks while ensuring risky decisions get proper scrutiny.
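A condensed sketch of the trigger follows. The dimension weights and pattern lists are illustrative; only the 0.85 and 0.70 cutoffs mirror the levels above.

```python
# Weighted-composite confidence trigger. Weights and patterns below are
# illustrative; only the level cutoffs mirror the text.

import re

DIMENSIONS = {
    # name: (weight, patterns that lower confidence when they match)
    "ambiguity":  (0.30, [r"best way", r"should i", r"compare .* vs", r"pros and cons"]),
    "complexity": (0.30, [r"architecture", r"security", r"performance", r"database schema"]),
    "stakes":     (0.30, [r"production", r"breaking change", r"delete", r"vulnerability"]),
    "context":    (0.10, [r"our system", r"existing pipeline", r"integration"]),
}

def confidence_score(query: str) -> float:
    q = query.lower()
    penalty = sum(
        weight
        for weight, patterns in DIMENSIONS.values()
        if any(re.search(p, q) for p in patterns)
    )
    return max(0.0, 1.0 - penalty)

def deliberation_level(score: float) -> str:
    if score >= 0.85:
        return "HIGH: proceed without deliberation"
    if score >= 0.70:
        return "MEDIUM: proceed, log a confidence note"
    return "LOW: trigger multi-agent deliberation"
```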
The Spawn Budget Problem
My first implementation used depth-based recursion limits: agent at depth 0 spawns depth 1, which spawns depth 2, blocked at depth 3. This failed immediately. Deliberation agents need to run in parallel, not in series. Five agents at depth 1 isn’t deep recursion. It’s wide collaboration.
The fix: a spawn budget model. The root agent receives a budget (12 agents maximum). It spends that budget spawning parallel workers. Workers inherit the remaining budget but can’t exceed it. This prevents runaway chains while allowing the parallel execution that deliberation requires.
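A minimal sketch of the budget idea, written as a shared counter rather than the real spawner module:

```python
# Spawn budget as a shared pool: every spawn anywhere in the agent tree
# draws from one counter, so width is allowed but runaway chains are not.
# Illustrative sketch, not the actual spawner module.

class SpawnBudget:
    def __init__(self, total: int = 12):
        self.remaining = total

    def spawn(self, n: int) -> None:
        if n > self.remaining:
            raise RuntimeError(
                f"spawn budget exceeded: requested {n}, {self.remaining} remaining"
            )
        self.remaining -= n

budget = SpawnBudget(12)
budget.spawn(5)   # five reviewers at depth 1: wide, not deep, and allowed
budget.spawn(7)   # pool now exhausted
# budget.spawn(1) would raise, regardless of how shallow the tree is.
```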
The real-world test came when I ran 10 review agents across translated blog posts. The recursion guard blocked agents 4 through 10 because it counted spawns as depth increments. After switching to the budget model, all 10 ran successfully. The depth never exceeded 1. The width expanded to match the task.10
Consensus Validation
After agents complete, a post-deliberation hook runs four checks:
- Phase readiness - Has the deliberation progressed past research into ranking?
- Agent quorum - Did at least 2 agents complete? (Configurable per task type)
- Consensus threshold - Does agreement meet the required level? (70% base, 85% for security)
- Dissent documentation - If agents disagree, are concerns recorded?
Check 4 was the insight I didn’t expect. Early runs produced “consensus” where agents simply agreed with the first response. The conformity detector now flags premature convergence: if all agents agree within the first round with high similarity scores, the system forces a second round of adversarial analysis.
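In code, the gate reduces to something like this. The 2-agent quorum and 70%/85% thresholds come from the checks above; the session data shape is an assumption for illustration.

```python
# Sketch of the four-check consensus gate. Thresholds mirror the text;
# the session dictionary's shape is an assumption for illustration.

def validate_consensus(session: dict, task_type: str = "default") -> list[str]:
    failures = []

    # 1. Phase readiness: past research, into ranking or consensus.
    if session["phase"] not in ("ranking", "consensus"):
        failures.append("phase not ready")

    # 2. Agent quorum: at least two agents completed (configurable per task).
    if len(session["completed_agents"]) < 2:
        failures.append("quorum not met")

    # 3. Consensus threshold: 70% base, 85% for security decisions.
    required = 0.85 if task_type == "security" else 0.70
    if session["agreement"] < required:
        failures.append(f"consensus {session['agreement']:.0%} below {required:.0%}")

    # 4. Dissent documentation: disagreement is fine, undocumented dissent is not.
    if session["agreement"] < 1.0 and not session.get("dissent_notes"):
        failures.append("dissent not documented")

    return failures  # empty list means the deliberation passes the gate
```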
What I Learned the Hard Way
Atomic file writes matter. Multiple agents writing to the same state file simultaneously corrupted JSON. The fix: write to a .tmp file, then mv it into place. On the same filesystem, mv is a rename(), which the OS guarantees is atomic. This one-line change eliminated an entire category of race conditions.
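The equivalent in Python uses os.replace, which carries the same same-filesystem atomicity guarantee as mv:

```python
# Atomic state write: readers see the old file or the new file, never a
# half-written one. Write to a temp file in the same directory, then
# rename over the target.

import json, os, tempfile

def write_state_atomic(path: str, state: dict) -> None:
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())     # ensure bytes are on disk before the swap
        os.replace(tmp_path, path)   # atomic on the same filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```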
Context isolation prevents groupthink. Each agent receives independent context through four layers (L0: base knowledge, L1: task-specific, L2: persona-specific, L3: round-specific). Without isolation, agents converge on the first plausible answer rather than exploring the solution space. With isolation, the Security agent and the Performance agent reach genuinely different conclusions because they start from different assumptions.
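A sketch of what layered assembly looks like; the layer names come from the text, while the persona briefs and assembly function are illustrative:

```python
# Illustrative context assembly: each agent gets its own stack of layers,
# so Security and Performance start from different assumptions instead of
# anchoring on whichever answer appears first.

PERSONA_BRIEFS = {
    "Security":    "Assume hostile input. Prioritize least privilege and auditability.",
    "Performance": "Assume scale. Prioritize latency budgets and resource cost.",
    "Architect":   "Assume change. Prioritize module boundaries and maintainability.",
}

def build_context(base: str, task: str, persona: str, round_num: int) -> str:
    layers = [
        base,                                            # L0: base knowledge
        f"Task: {task}",                                 # L1: task-specific
        PERSONA_BRIEFS[persona],                         # L2: persona-specific
        f"Round {round_num}: answer independently, "     # L3: round-specific
        "without reference to other agents' responses.",
    ]
    return "\n\n".join(layers)
```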
Test agent infrastructure harder than application code. The system has 141 tests: 48 bash integration tests for hook behavior, 81 Python unit tests for library modules, and 12 end-to-end pipeline simulations. Every failure story in my project memory (spawn budget blocking, quote detection false positives, blog plan files accidentally served as posts) became a test case after the fix. Agent bugs are harder to reproduce than application bugs because they depend on timing, ordering, and concurrent state.
The Human-Agent Division
| Human Responsibility | Agent Responsibility |
|---|---|
| Problem definition | Pipeline execution |
| Confidence thresholds | Execution within thresholds |
| Consensus requirements | Consensus computation |
| Quality gate criteria | Quality gate enforcement |
| Error analysis | Error detection |
| Architecture decisions | Architecture options |
| Domain context injection | Documentation generation |
The pattern: humans own decisions that require organizational context, ethical judgment, or strategic direction. Agents own decisions that require computational search across large possibility spaces.11 I explored this division further in my agent architecture analysis.
The agentic ML engineer does not hand-code pipelines. The agentic ML engineer defines objectives, constraints, and evaluation criteria. Agents handle the implementation loop: propose, execute, evaluate, iterate. The human role shifts from builder to architect: setting guardrails, reviewing outputs, and making judgment calls that require domain context agents lack.12
Key Takeaways
For engineers starting with AI systems:
- Start with RAG for any use case involving frequently changing knowledge or source attribution requirements; RAG provides a functional baseline in days, while fine-tuning takes weeks of data preparation
- Combine RAG and fine-tuning when the application needs both domain reasoning AND current knowledge
- If you need more than one model call per task, you need agent orchestration, and that’s a different engineering problem than RAG or fine-tuning
For teams building agent systems:
- Build confidence scoring before building agents; most tasks don’t need deliberation, and the system that knows when to use agents is more valuable than the agents themselves
- Use spawn budgets, not depth limits, for parallel agent architectures; depth limits block the wide collaboration patterns that agent deliberation requires
- Test consensus quality, not just consensus existence; premature agreement is worse than no agreement because it creates false confidence
References
1. Databricks, “State of Enterprise AI Architecture,” Databricks Research, 2024.
2. Lewis, Patrick et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” NeurIPS 2020.
3. Gao, Luyu et al., “Precise Zero-Shot Dense Retrieval without Relevance Labels,” ACL 2023.
4. Borgeaud, Sebastian et al., “Improving Language Models by Retrieving from Trillions of Tokens,” ICML 2022.
5. Singhal, Karan et al., “Large Language Models Encode Clinical Knowledge,” Nature, 620, 172-180, 2023.
6. Hu, Edward J. et al., “LoRA: Low-Rank Adaptation of Large Language Models,” ICLR 2022.
7. Miao, Xupeng et al., “Towards Efficient Generative LLM Serving: A Survey from Algorithms to Systems,” arXiv:2312.15234, 2023.
8. Anthropic, “RAG + Fine-Tuning: A Practical Architecture Guide,” Anthropic Cookbook, 2024.
9. Sculley, D. et al., “Hidden Technical Debt in Machine Learning Systems,” NeurIPS 2015.
10. Author’s experience building multi-agent deliberation infrastructure, documented in project MEMORY.md. 86 hooks, 141 tests, 12 Python modules.
11. Sambasivan, Nithya et al., “‘Everyone Wants to Do the Model Work, Not the Data Work’: Data Cascades in High-Stakes AI,” CHI 2021, ACM.
12. Hollmann, Noah et al., “Large Language Models for Automated Machine Learning,” arXiv:2402.08355, 2024.