Building AI Systems: From RAG to Agents
Most teams start with RAG, discover its limits, then bolt on fine-tuning. Both solve retrieval and reasoning. Neither solves orchestration: deciding when to act, how many agents to spawn, when to stop, and what consensus means. I built a multi-agent deliberation system (3,500 lines of Python, 86 hooks, 141 tests) that handles all four of those decisions. This post is part of my AI engineering series on building production agent infrastructure.
RAG retrieves external documents at query time to ground model responses in current data, while AI agents orchestrate multi-step workflows by deciding when to act, which tools to invoke, and how to validate results. RAG and fine-tuning are retrieval and reasoning techniques; agents are the orchestration layer that coordinates them. The critical gap most teams discover is that neither RAG nor fine-tuning handles coordination — spawning parallel workers, enforcing consensus, or managing execution budgets.
TL;DR
RAG retrieves documents at query time. Fine-tuning modifies model weights with domain data. Both are retrieval and reasoning tools. Neither handles orchestration: coordinating multiple agents, validating consensus, or deciding when a task needs one model call versus twelve. I hit this wall building a blog quality system that needed parallel linting, depth scoring, citation verification, and LLM evaluation across 33 posts. The solution was an agent orchestration layer with confidence-triggered deliberation, spawn budget management, and four-check consensus validation. This post covers the RAG-vs-fine-tuning decision, then goes where most guides stop: what happens when you need agents.
Part 1: RAG vs Fine-Tuning
A 2024 Databricks study found that 78% of enterprise AI teams chose RAG first, yet 34% later discovered fine-tuning would have been the better approach, wasting an average of 3.2 months in implementation time.1
The decision isn’t either/or. It’s about matching the technique to the problem.
When RAG Wins
Frequently changing knowledge. RAG retrieves current documents at query time. When the knowledge base updates daily (product docs, support articles, regulatory filings), RAG serves current information without retraining.2
Source attribution requirements. RAG cites specific documents because retrieval produces an explicit source list. In regulated industries (healthcare, finance, legal), source attribution is often a compliance requirement.3
Large knowledge bases. A RAG system over 10 million documents performs comparably to one over 1 million if retrieval quality is consistent. Fine-tuned models hit capacity limits determined by model size.4
When Fine-Tuning Wins
Domain-specific reasoning patterns. RAG provides information. Fine-tuning provides capability. A model fine-tuned on medical diagnosis conversations learns differential diagnosis patterns: how to weight symptoms, when to consider rare conditions, how to formulate follow-up questions. RAG can supply the medical knowledge, but the reasoning pattern requires weight adjustment.5
Strict output format requirements. Fine-tuning enforces structured output (JSON, XML, specific schemas) more reliably than prompt engineering. For systems where format failures cause downstream errors, fine-tuning provides higher reliability.6
Latency-critical applications. RAG adds retrieval latency: embed the query, search the vector database, retrieve documents, inject them into the prompt. For applications with sub-200ms response targets, eliminating retrieval through fine-tuning may be necessary.7
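The four retrieval steps named above (embed, search, retrieve, inject) can be sketched in a few lines. This is a toy illustration, not a production pipeline: the bag-of-words `embed` function stands in for a real embedding model, the in-memory `DOCS` dictionary stands in for a vector database, and every document id and string is made up for the example. Each step is a separate call, which is where the added latency comes from, and the returned source list is what makes attribution straightforward.

```python
import math
from collections import Counter

# Toy corpus; ids and text are illustrative assumptions, not from a real system.
DOCS = {
    "refund-policy": "refunds are issued within 14 days of purchase",
    "shipping-faq": "orders ship within 2 business days",
    "api-limits": "the api allows 100 requests per minute",
}


def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words term count."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, k: int = 2) -> list[tuple[str, float]]:
    """Steps 1-3: embed the query, score every document, keep the top k."""
    q = embed(query)
    scored = [(doc_id, cosine(q, embed(text))) for doc_id, text in DOCS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def build_prompt(query: str, k: int = 2) -> tuple[str, list[str]]:
    """Step 4: inject retrieved text into the prompt and return the source ids."""
    sources = [doc_id for doc_id, _ in retrieve(query, k)]
    context = "\n".join(f"[{doc_id}] {DOCS[doc_id]}" for doc_id in sources)
    prompt = f"Answer using only the sources below.\n{context}\n\nQuestion: {query}"
    return prompt, sources
```

Calling `build_prompt("how many requests per minute does the api allow", k=1)` returns a prompt containing the `api-limits` document and a one-element source list that can be shown to the user or logged for compliance.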
The Comparison Matrix
| Dimension | RAG | Fine-Tuning | Both |
|---|---|---|---|
| Knowledge freshness | Hours | Weeks-months | Hours |
| Setup cost | Low-medium | Medium-high | High |
| Per-query cost | Higher (retrieval + generation) | Lower (generation only) | Highest |
| Source attribution | Native | Difficult | Partial |
| Output format control | Moderate | High | High |
| Domain reasoning | Weak | Strong | Strong |
| Knowledge base size | Unlimited | Limited by model | Unlimited |
| Latency | Higher | Lower | Highest |
| Hallucination control | Better (grounded in docs) | Varies | Best |
The Combined Approach
Most production systems combine both techniques. Fine-tune the model on domain reasoning patterns and output formats. Use RAG to provide current, attributable knowledge at query time. The fine-tuned model knows how to reason about the domain. The RAG system provides what to reason about.8 This is compound context at the systems level: each retrieval enriches the reasoning layer, and the reasoning layer improves future retrieval.
Part 2: When You Need Agents
RAG and fine-tuning handle retrieval and reasoning. Neither handles orchestration: deciding whether a task needs one model call or twelve, when to spawn parallel workers, how to validate their outputs, and when to escalate to a human. I explore the foundational idea — that context is architecture — in a separate post, and my AGENTS.md patterns piece covers how to encode orchestration rules into agent configuration.
I hit this wall building my blog quality infrastructure. I had 33 blog posts to evaluate, fix, and verify. A single model call per post wasn’t enough. Each post needed linting (12 modules), depth scoring (5 signals), readability analysis, citation verification, and LLM evaluation. Running these sequentially took too long. Running them in parallel without coordination produced conflicts and inconsistent results.
The solution wasn’t more RAG or better fine-tuning. It was an agent orchestration layer.
What Agent Orchestration Requires
The traditional ML pipeline assumes linear flow: data, preprocess, model, evaluate, deploy.9 Agent systems are non-linear. An agent might:
- Evaluate its own confidence and decide it needs help
- Spawn 5 parallel sub-agents with different expertise
- Collect and rank their outputs
- Detect when agents converge too quickly (groupthink)
- Validate consensus against quality thresholds
- Generate a final recommendation
Each step requires infrastructure that RAG and fine-tuning don’t provide.
Part 3: What I Built
The Architecture
My deliberation system spans 3,500 lines of Python across 12 modules:
Deliberation System
├── confidence.py Triggers deliberation based on ambiguity/complexity/stakes
├── state_machine.py Workflow: idle → research → ranking → consensus
├── agents.py 5+ persona templates (Architect, Security, Performance...)
├── context_isolation.py RLM L0-L3 layers prevent context contamination
├── ranking.py Stack ranking with weighted consensus calculation
├── spawner.py Parallel agent spawning via Task API
├── conformity.py Detects groupthink and premature convergence
├── mailbox.py Multi-round debate protocol
├── memory.py Cross-session learning and persona tracking
├── scaling.py Dynamic agent count based on complexity
├── prd_generator.py Converts decisions into product requirements
└── observability.py Session metrics and audit replay
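The workflow listed for state_machine.py (idle → research → ranking → consensus) can be sketched as an explicit transition table. This is a minimal illustration under assumptions: the `Phase` and `DeliberationState` names are hypothetical, and only strict forward transitions are modeled here. A real implementation might also allow loops back to earlier phases for extra debate rounds.

```python
from enum import Enum


class Phase(Enum):
    IDLE = "idle"
    RESEARCH = "research"
    RANKING = "ranking"
    CONSENSUS = "consensus"


# Allowed forward transitions: idle -> research -> ranking -> consensus.
TRANSITIONS = {
    Phase.IDLE: {Phase.RESEARCH},
    Phase.RESEARCH: {Phase.RANKING},
    Phase.RANKING: {Phase.CONSENSUS},
    Phase.CONSENSUS: set(),
}


class DeliberationState:
    """Tracks the current phase and rejects skipped or backward steps."""

    def __init__(self) -> None:
        self.phase = Phase.IDLE
        self.history = [Phase.IDLE]

    def advance(self, target: Phase) -> None:
        if target not in TRANSITIONS[self.phase]:
            raise ValueError(
                f"illegal transition {self.phase.value} -> {target.value}"
            )
        self.phase = target
        self.history.append(target)
```

Making illegal transitions raise, rather than silently succeed, means a hook can trust the recorded phase when it later checks whether a deliberation is ready for consensus.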
The system sits on top of 86 hooks that intercept operations at six lifecycle points (PreToolUse, PostToolUse, Stop, and three others). Every agent spawn, every file write, every git command passes through validation.
The Confidence Trigger
Not every task needs five agents debating. I built a confidence scoring algorithm that evaluates four dimensions (the same weighted-composite pattern I use for knowledge triage):
- Ambiguity - Does the query have multiple valid interpretations? (Pattern matches: “best way,” “should I,” “compare vs,” “pros and cons”)
- Domain complexity - Does it require specialized knowledge? (Pattern matches: “architecture,” “security,” “performance,” “database schema”)
- Stakes - Is the decision reversible? (Pattern matches: “production,” “breaking change,” “delete,” “security vulnerability”)
- Context dependency - Does it require understanding the broader system?
The score maps to three levels:
- HIGH (0.85+): Proceed without deliberation
- MEDIUM (0.70-0.84): Proceed with a confidence note logged
- LOW (below 0.70): Trigger full multi-agent deliberation
Once deliberation triggers, the consensus threshold adapts by task type. Security decisions require 85% agreement among agents. Documentation changes need only 50%. This prevents over-engineering simple tasks while ensuring risky decisions get proper scrutiny.
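A minimal sketch of a weighted-composite scorer along these lines follows. The phrase lists come from the dimensions above and the level cutoffs from the three levels, but several pieces are assumptions for illustration: the weights, the choice to treat each dimension as a binary hit, splitting "compare vs" into two patterns, and passing context dependency in as a number (no patterns are given for it). It is not the contents of the real confidence.py.

```python
import re

# Phrases per dimension, taken from the list above ("compare vs" split in two).
PATTERNS = {
    "ambiguity": ["best way", "should i", "compare", "vs", "pros and cons"],
    "complexity": ["architecture", "security", "performance", "database schema"],
    "stakes": ["production", "breaking change", "delete", "security vulnerability"],
}

# Hypothetical weights; they sum to 1.0 so the penalty stays in [0, 1].
WEIGHTS = {"ambiguity": 0.25, "complexity": 0.25, "stakes": 0.35, "context": 0.15}


def dimension_hit(text: str, phrases: list[str]) -> float:
    """Return 1.0 if any phrase appears as whole words, else 0.0."""
    return 1.0 if any(
        re.search(rf"\b{re.escape(phrase)}\b", text) for phrase in phrases
    ) else 0.0


def confidence(query: str, context_dependency: float = 0.0) -> float:
    """Start at full confidence and subtract a weighted penalty per dimension."""
    text = query.lower()
    risk = {name: dimension_hit(text, phrases) for name, phrases in PATTERNS.items()}
    risk["context"] = max(0.0, min(1.0, context_dependency))
    penalty = sum(WEIGHTS[name] * risk[name] for name in WEIGHTS)
    return round(1.0 - penalty, 2)


def level(score: float) -> str:
    """Map a score onto the three levels described above."""
    if score >= 0.85:
        return "HIGH"
    if score >= 0.70:
        return "MEDIUM"
    return "LOW"
```

With these assumed weights, a query that trips only the ambiguity patterns lands in MEDIUM, while one that trips both ambiguity and stakes falls to LOW and would trigger deliberation.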
The Spawn Budget Problem
My first implementation used depth-based recursion limits: agent at depth 0 spawns depth 1, which spawns depth 2, blocked at depth 3. This failed immediately. Deliberation agents need to run in parallel, not in series. Five agents at depth 1 isn’t deep recursion. It’s wide collaboration.
The fix: a spawn budget model. The root agent receives a budget (12 agents maximum). It spends that budget spawning parallel workers. Workers inherit the remaining budget but can’t exceed it. This prevents runaway chains while allowing the parallel execution that deliberation requires.
The real-world test came when I ran 10 review agents across translated blog posts. The recursion guard blocked agents 4 through 10 because it counted spawns as depth increments. After switching to the budget model, all 10 ran successfully. The depth never exceeded 1. The width expanded to match the task.10
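The budget model can be sketched as a shared counter that every spawn draws from, regardless of depth. The class and method names below (`SpawnBudget`, `reserve`, `BudgetExhausted`) are hypothetical, and the lock is an assumption about how concurrent spawn requests might be serialized. Only the 12-agent ceiling and the 10-reviewer scenario come from the text.

```python
import threading


class BudgetExhausted(RuntimeError):
    """Raised when a spawn request exceeds what is left of the budget."""


class SpawnBudget:
    """A pool of spawn slots shared by the root agent and all its workers."""

    def __init__(self, limit: int = 12) -> None:
        self.limit = limit
        self._spent = 0
        self._lock = threading.Lock()

    @property
    def remaining(self) -> int:
        return self.limit - self._spent

    def reserve(self, count: int = 1) -> None:
        """Claim `count` slots atomically, or fail without claiming any."""
        if count < 1:
            raise ValueError("count must be positive")
        with self._lock:
            left = self.limit - self._spent
            if count > left:
                raise BudgetExhausted(f"requested {count}, only {left} left")
            self._spent += count


budget = SpawnBudget(limit=12)
budget.reserve(10)  # ten parallel reviewers, all at depth 1
budget.reserve(2)   # a worker spends from the same shared pool
```

Because the limit counts total agents rather than nesting level, ten siblings at depth 1 are allowed, while a runaway chain still stops once the pool is empty.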
Consensus Validation
After agents complete, a post-deliberation hook runs four checks:
- Phase readiness - Has the deliberation progressed past research into ranking?
- Agent quorum - Did at least 2 agents complete? (Configurable per task type)
- Consensus threshold - Does agreement meet the required level? (70% base, 85% for security)
- Dissent documentation - If agents disagree, are concerns recorded?
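The four checks can be sketched as one function that returns a named result per check. This is an illustration under assumptions: the `Deliberation` record and its fields are hypothetical, agreement is measured as the share of agents backing the most popular option, and dissent counts as documented when the vote is unanimous or at least one note exists. The thresholds (70% base, 85% security, 50% documentation) and the quorum of 2 are the figures given in this post.

```python
from collections import Counter
from dataclasses import dataclass, field

PHASE_ORDER = ["idle", "research", "ranking", "consensus"]
THRESHOLDS = {"default": 0.70, "security": 0.85, "documentation": 0.50}
MIN_QUORUM = 2


@dataclass
class Deliberation:
    phase: str
    votes: dict[str, str]  # agent name -> chosen option (completed agents only)
    dissent_notes: list[str] = field(default_factory=list)
    task_type: str = "default"


def agreement(votes: dict[str, str]) -> float:
    """Share of agents that picked the most popular option."""
    if not votes:
        return 0.0
    top_count = Counter(votes.values()).most_common(1)[0][1]
    return top_count / len(votes)


def validate(d: Deliberation) -> dict[str, bool]:
    """Run the four post-deliberation checks and report each result."""
    share = agreement(d.votes)
    required = THRESHOLDS.get(d.task_type, THRESHOLDS["default"])
    return {
        "phase_readiness": PHASE_ORDER.index(d.phase) >= PHASE_ORDER.index("ranking"),
        "agent_quorum": len(d.votes) >= MIN_QUORUM,
        "consensus_threshold": share >= required,
        "dissent_documented": share == 1.0 or bool(d.dissent_notes),
    }
```

Returning a dictionary instead of a single boolean keeps the failure reason visible, so a hook can report which check blocked the result.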
The fourth check, dissent documentation, was the insight I didn’t expect. Early runs produced “consensus” where agents simply agreed with the first response. The conformity detector now flags premature convergence: if all agents agree within the first round with high similarity scores, the system forces a second round of adversarial analysis.
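One way to sketch premature-convergence detection is to compare every pair of first-round responses and flag the round when all pairs are near-duplicates. The post does not say how similarity is computed, so the word-set Jaccard measure and the 0.8 cutoff below are assumptions for illustration. A real detector might use embeddings instead.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two responses, from 0.0 to 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def premature_convergence(
    responses: list[str], round_number: int, threshold: float = 0.8
) -> bool:
    """True when every pair of first-round responses is highly similar."""
    if round_number != 1 or len(responses) < 2:
        return False
    return all(
        jaccard(a, b) >= threshold for a, b in combinations(responses, 2)
    )
```

When this returns True, the orchestrator would schedule an adversarial second round rather than accept the result. A single dissimilar pair is enough to clear the flag.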
What I Learned the Hard Way
Atomic file writes matter. Multiple agents writing to the same state file simultaneously corrupted JSON. The fix: write to a .tmp file, then mv it over the target. Within a single filesystem, mv is a rename, which POSIX specifies as atomic, so readers see either the old file or the new one and never a partial write. This one-line change eliminated an entire category of race conditions.
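For state written from Python rather than a shell hook, the same write-then-rename pattern might look like the sketch below. The function name `atomic_write_json` is hypothetical. `os.replace` performs the rename, and the temporary file is created in the target's own directory so both paths sit on the same filesystem.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, data: dict) -> None:
    """Write JSON to a temp file beside `path`, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The rename protects readers from torn files, but it does not merge concurrent updates: if two agents write at once, the last rename wins. Updates that must be combined still need a lock or a single writer.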
Context isolation prevents groupthink. Each agent receives independent context through four layers (L0: base knowledge, L1: task-specific, L2: persona-specific, L3: round-specific). Without isolation, agents converge on the first plausible answer rather than exploring the solution space. With isolation, the Security agent and the Performance agent reach genuinely different conclusions because they start from different assumptions.
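The four layers can be sketched as an immutable record built separately for each agent. Only the layer names (L0 through L3) and their purposes come from the description above. The `AgentContext` and `build_contexts` names, the field names, and the idea of rendering the layers into one prompt string are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentContext:
    l0_base: str      # shared base knowledge
    l1_task: str      # the task under deliberation
    l2_persona: str   # this agent's persona brief only
    l3_round: str     # instructions for the current debate round

    def render(self) -> str:
        """Join the four layers into the text handed to one agent."""
        return "\n\n".join(
            [self.l0_base, self.l1_task, self.l2_persona, self.l3_round]
        )


def build_contexts(
    base: str, task: str, personas: dict[str, str], round_note: str
) -> dict[str, AgentContext]:
    """Build one context per persona; no agent sees another's brief."""
    return {
        name: AgentContext(base, task, brief, round_note)
        for name, brief in personas.items()
    }
```

Because each context is frozen and built from its own persona brief, one agent's framing cannot leak into another's prompt, and nothing can be appended to a shared context between rounds by accident.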
Test agent infrastructure harder than application code. The system has 141 tests: 48 bash integration tests for hook behavior, 81 Python unit tests for library modules, and 12 end-to-end pipeline simulations. Every failure story in my project memory (spawn budget blocking, quote detection false positives, blog plan files accidentally served as posts) became a test case after the fix. Agent bugs are harder to reproduce than application bugs because they depend on timing, ordering, and concurrent state.
The Human-Agent Division
| Human Responsibility | Agent Responsibility |
|---|---|
| Problem definition | Pipeline execution |
| Confidence thresholds | Execution within thresholds |
| Consensus requirements | Consensus computation |
| Quality gate criteria | Quality gate enforcement |
| Error analysis | Error detection |
| Architecture decisions | Architecture options |
| Domain context injection | Documentation generation |
The pattern: humans own decisions that require organizational context, ethical judgment, or strategic direction. Agents own decisions that require computational search across large possibility spaces.11 I explored this division further in my agent architecture analysis.
The agentic ML engineer does not hand-code pipelines. The agentic ML engineer defines objectives, constraints, and evaluation criteria. Agents handle the implementation loop: propose, execute, evaluate, iterate. The human role shifts from builder to architect: setting guardrails, reviewing outputs, and making judgment calls that require domain context agents lack.12
Key Takeaways
For engineers starting with AI systems:
- Start with RAG for any use case involving frequently changing knowledge or source attribution requirements; RAG provides a functional baseline in days, while fine-tuning takes weeks of data preparation
- Combine RAG and fine-tuning when the application needs both domain reasoning AND current knowledge
- If you need more than one model call per task, you need agent orchestration, and that’s a different engineering problem than RAG or fine-tuning (for a head-to-head comparison of orchestration approaches, see Claude Code vs Codex)
For teams building agent systems:
- Build confidence scoring before building agents; most tasks don’t need deliberation, and the system that knows when to use agents is more valuable than the agents themselves
- Use spawn budgets, not depth limits, for parallel agent architectures; depth limits block wide collaboration patterns that agent deliberation requires
- Test consensus quality, not just consensus existence; premature agreement is worse than no agreement because it creates false confidence
FAQ
What is retrieval-augmented generation (RAG)?
RAG is a technique that retrieves relevant documents from a knowledge base at query time and injects them into the LLM’s context alongside the user’s question. This grounds the model’s response in specific, current information rather than relying solely on training data. RAG excels when knowledge changes frequently, source attribution is required, or the knowledge base is large. A 2024 Databricks study found 78% of enterprise AI teams chose RAG first for their initial AI architecture.
How do AI agents differ from RAG systems?
RAG handles retrieval and grounding — giving the model the right information to reason about. Agents handle orchestration — deciding when to act, how many model calls a task needs, when to spawn parallel workers, and how to validate outputs. A RAG system answers questions. An agent system coordinates multiple model calls, manages consensus validation, and decides whether a task needs one inference or twelve. The two are complementary: agents often use RAG as one tool among many.
When should I use RAG versus fine-tuning for an AI application?
Use RAG when your knowledge changes frequently (daily updates), when you need source attribution (regulated industries), or when the knowledge base is very large. Use fine-tuning when you need domain-specific reasoning patterns (medical diagnosis, legal analysis), strict output format compliance, or sub-200ms latency. Most production systems combine both: fine-tune for domain reasoning and output format, then use RAG to provide current, attributable knowledge at query time. See the comparison matrix above for a detailed breakdown.
What is agent orchestration and when do I need it?
Agent orchestration is the infrastructure layer that coordinates multiple AI model calls for a single task — deciding when to spawn parallel workers, how to validate their outputs, when consensus is reached, and when to escalate to a human. You need orchestration when a task requires more than one model call: parallel analysis from different perspectives, confidence-triggered deliberation, or multi-step pipelines with quality gates. My deliberation system uses spawn budgets, conformity detection, and four-check consensus validation to coordinate up to 12 agents per task.
How do you prevent AI agents from producing low-quality consensus?
Three mechanisms prevent shallow agreement. First, context isolation gives each agent independent context (base knowledge, task-specific, persona-specific, round-specific layers) so agents reach genuinely different conclusions. Second, a conformity detector flags premature convergence — if all agents agree in the first round with high similarity scores, the system forces adversarial analysis. Third, consensus thresholds adapt by task type: security decisions require 85% agreement while documentation changes need only 50%. The Ralph Loop architecture enforces these constraints through hooks that fire at every lifecycle point.
References
1. Databricks, “State of Enterprise AI Architecture,” Databricks Research, 2024.
2. Lewis, Patrick et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” NeurIPS 2020.
3. Gao, Luyu et al., “Precise Zero-Shot Dense Retrieval without Relevance Labels,” ACL 2023.
4. Borgeaud, Sebastian et al., “Improving Language Models by Retrieving from Trillions of Tokens,” ICML 2022.
5. Singhal, Karan et al., “Large Language Models Encode Clinical Knowledge,” Nature, 620, 172-180, 2023.
6. Hu, Edward J. et al., “LoRA: Low-Rank Adaptation of Large Language Models,” ICLR 2022.
7. Miao, Xupeng et al., “Towards Efficient Generative LLM Serving: A Survey from Algorithms to Systems,” arXiv:2312.15234, 2023.
8. Anthropic, “RAG + Fine-Tuning: A Practical Architecture Guide,” Anthropic Cookbook, 2024.
9. Sculley, D. et al., “Hidden Technical Debt in Machine Learning Systems,” NeurIPS 2015.
10. Author’s experience building multi-agent deliberation infrastructure, documented in project MEMORY.md. 86 hooks, 141 tests, 12 Python modules.
11. Sambasivan, Nithya et al., “‘Everyone Wants to Do the Model Work, Not the Data Work’: Data Cascades in High-Stakes AI,” CHI 2021, ACM.
12. Hollmann, Noah et al., “Large Language Models for Automated Machine Learning,” arXiv:2402.08355, 2024.