Context Engineering Is Architecture: 650 Files Later
My CLAUDE.md started at 50 lines. Six months later, it had grown into a 650-file distributed architecture across seven layers. The evolution revealed that context engineering is software architecture for a substrate where memory degrades with every token — not prompt engineering with more files.
Context determines what an AI agent can think, what it remembers, and what it forgets. Every other design decision is downstream. What follows is six months of production context engineering for Claude Code: the architecture, the failures, and the system that survived them.
TL;DR
Böckeler’s context engineering article (martinfowler.com, 2026)1 and Miessler’s clear-thinking framework2 advance the conversation, but both underestimate what production use demands. Context engineering requires designing a system where the right instructions reach the agent at the right time, and the wrong instructions never load at all. My system uses nine rules, 40 skills, 19 agents, 84 hooks, and 14 config files distributed across a seven-layer hierarchy. The architecture lives in which files load when. What gets excluded matters as much as what gets included.
Context Is Not a File
The weekly “how to write a CLAUDE.md” post misses the point. Writing a good CLAUDE.md is necessary but insufficient, the same way writing good code is necessary but insufficient for building a good system. Architecture is the structure that determines how components interact. In agent systems, context is that structure.
What started at 50 lines soon became a monolith: a single CLAUDE.md, 200 lines long, covering coding standards, project structure, preferences, corrections, philosophy, credentials, and active project state. It worked for a month. Then three things happened simultaneously:
- The file grew past 300 lines and started conflicting with itself (rule A said “keep it simple,” rule B said “add thorough error handling”)
- Context from project A leaked into project B (iOS-specific rules polluting a web development session)
- The agent spent tokens reading instructions that were irrelevant to the current task
The symptoms point to architectural problems, not documentation problems: coupling, scope leakage, and resource waste. The same forces drive software architecture decisions.3
The Seven Layers
After six months of refactoring, my context system settled into a seven-layer hierarchy. Each layer serves a distinct purpose and loads at a specific time:
| Layer | Contents | When It Loads | Count |
|---|---|---|---|
| 1. Core | CLAUDE.md: philosophy, active projects, corrections | Every session start | One file, 205 lines |
| 2. Rules | Domain-specific constraints (API design, security, testing, git) | Every session start | Nine files |
| 3. Skills | Reusable knowledge modules with procedures and examples | On demand (invoked or auto-activated by hooks) | 40 directories |
| 4. Agents | Specialized reviewer/generator specifications | On demand (via Task tool) | 19 files |
| 5. Hooks | Automatic context injection at lifecycle events | Event-driven (session start, pre-commit, post-tool) | 84 scripts |
| 6. Config | Numerical parameters (thresholds, budgets, limits) | Referenced by hooks and skills | 14 JSON files |
| 7. State | Live tracking (agent lineage, confidence calibration, costs) | Referenced by hooks | 36 files |
The critical insight is layer separation. Rules load every session because they apply universally. Skills load on demand because they’re domain-specific. Hooks fire at events because timing matters. Loading all 650 files at session start would exhaust the context window before the agent read the first user message.4
A concrete example of each layer:
Layer 1 (Core): ~/.claude/CLAUDE.md — philosophy, active projects, corrections
Layer 2 (Rules): ~/.claude/rules/security.md — "Validate ALL user input at boundaries"
Layer 3 (Skills): ~/.claude/skills/fastapi/SKILL.md — endpoint patterns, Pydantic models
Layer 4 (Agents): ~/.claude/agents/security-reviewer.md — OWASP-focused review spec
Layer 5 (Hooks): ~/.claude/hooks/pre-commit-check.sh — blocks commits with TODO/FIXME
Layer 6 (Config): ~/.claude/config/thresholds.json — {"max_spawn_children": 12}
Layer 7 (State): ~/.claude/state/agent-lineage.json — tracks active agent tree
Each file in the hierarchy has a single responsibility and a clear loading trigger. The hook (Layer 5) fires at pre-commit regardless of project type. The skill (Layer 3) loads only when a FastAPI task begins. The distinction determines token cost.
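The load-timing distinction can be sketched as a small dispatcher. This is a hypothetical illustration, not Claude Code's actual loading mechanism; the file names mirror the layer examples above.

```python
# Layer names and files mirror the examples above; the loader itself is an
# invented sketch, not Claude Code's real implementation.

SESSION_START, ON_DEMAND, EVENT = "session_start", "on_demand", "event"

LAYERS = {
    "CLAUDE.md": SESSION_START,                # Layer 1: always
    "rules/security.md": SESSION_START,        # Layer 2: always
    "skills/fastapi/SKILL.md": ON_DEMAND,      # Layer 3: when the task needs it
    "agents/security-reviewer.md": ON_DEMAND,  # Layer 4: via Task tool
    "hooks/pre-commit-check.sh": EVENT,        # Layer 5: at lifecycle events
}

def files_to_load(trigger, requested=frozenset()):
    """Files entering context for a trigger; on-demand files need a request."""
    loaded = []
    for path, policy in LAYERS.items():
        if policy != trigger:
            continue
        if policy == ON_DEMAND and path not in requested:
            continue  # exists on disk, but costs no tokens this session
        loaded.append(path)
    return loaded

print(files_to_load(SESSION_START))
# ['CLAUDE.md', 'rules/security.md'] -- skills, agents, hooks stay out
```

The point of the sketch is the asymmetry: session start loads two cheap files unconditionally, while everything else stays at zero token cost until something asks for it.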
Three Failures That Shaped the Architecture
Failure 1: The Monolithic CLAUDE.md
My original CLAUDE.md contained iOS rules, web rules, API design patterns, git conventions, credential management, and philosophical principles, all in one file. The agent read all of it every session, even when building a web project that would never touch SwiftUI.
The cost: Irrelevant context consumed an estimated 4,000 tokens per session. Over 50 sessions, that totals roughly 200,000 tokens spent on instructions the agent never used.
The fix: Extract domain-specific content into rules/ files. rules/security.md loads every session (security applies everywhere). rules/ios-security.md loads only when the session involves iOS projects. The context-window management post documents the token economics that drove this decision.
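The resulting split can be sketched as follows; the file names and project-type keys are illustrative, and the selection logic would in practice live in a hook rather than a standalone function.

```python
# Universal rules always load; domain rules load only for matching projects.
# File names and project types are hypothetical examples.

UNIVERSAL_RULES = ["rules/security.md", "rules/git.md"]
DOMAIN_RULES = {
    "ios": ["rules/ios-security.md", "rules/swiftui.md"],
    "web": ["rules/api-design.md"],
}

def session_rules(project_type):
    """Rules to load for a session: universal plus the matching domain set."""
    return UNIVERSAL_RULES + DOMAIN_RULES.get(project_type, [])

# A web session never pays the token cost of iOS rules:
print(session_rules("web"))
# ['rules/security.md', 'rules/git.md', 'rules/api-design.md']
```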
Failure 2: Skills That Went Stale
I created a fastapi skill with examples from FastAPI 0.109. Three months later, the project used FastAPI 0.115 with different patterns. The skill silently taught outdated conventions. The agent produced working code that didn’t match the current codebase.
The cost: A 45-minute debugging session tracing why a new endpoint used a different dependency injection pattern than every other endpoint. The pattern came from the stale skill, not from the codebase.
The fix: Skills reference versioned documentation and include “last verified” dates. More importantly, the quality loop requires that generated code matches existing codebase patterns, a metacognitive check that catches stale-skill drift regardless of the skill’s content.
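A minimal staleness check, assuming each skill file carries a `last-verified: YYYY-MM-DD` line. That convention is invented for this sketch, not a Claude Code feature; the quarterly window matches the cadence described in the FAQ below.

```python
from datetime import date, timedelta
import re

STALE_AFTER = timedelta(days=90)  # quarterly review cadence

def is_stale(skill_text, today):
    """True if the skill's last-verified date is missing or older than 90 days."""
    m = re.search(r"last-verified:\s*(\d{4})-(\d{2})-(\d{2})", skill_text)
    if not m:
        return True  # no date recorded: treat as stale
    verified = date(*map(int, m.groups()))
    return today - verified > STALE_AFTER

print(is_stale("# FastAPI skill\nlast-verified: 2025-11-01", date(2026, 2, 15)))
# True: 106 days since verification
```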
Failure 3: Conflicting Rules at Different Layers
CLAUDE.md said “prefer simple solutions.” A skill said “add thorough error handling with retry logic and circuit breakers.” Both were reasonable in isolation. Together, they produced inconsistent output (sometimes minimal, sometimes over-engineered) depending on which instruction the agent weighted more heavily in a given turn.
The cost: Unpredictable output quality. The same type of task would produce different quality levels across sessions, making the system unreliable. Here are the actual conflicting instructions:
# In CLAUDE.md (Layer 1):
- Prefer simple solutions. Avoid over-engineering.
# In skills/fastapi/SKILL.md (Layer 3):
- Add thorough error handling with retry logic and circuit breakers
for all external API calls.
Both are correct in isolation. “Prefer simple solutions” is good general advice. “Add retry logic for external calls” is good domain-specific advice. But an agent that reads both will sometimes skip retry logic (following the simplicity principle) and sometimes over-engineer a simple function (following the error handling instruction).
The fix: Establish a clear precedence hierarchy. CLAUDE.md sets principles (“prefer simplicity”). Rules set constraints (“validate all user input”). Skills set procedures (“here’s how to build a FastAPI endpoint”). When they conflict, the more specific layer wins: a skill’s procedure overrides a principle’s guidance for the specific task the skill covers. The resolution mirrors how programming languages handle scope: local overrides global.5 The fix for the example above: add a qualifier to the skill — “Add retry logic for external API calls (this overrides the general simplicity principle for network-dependent code because external calls fail intermittently and retries are not over-engineering).”
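The precedence rule reduces to a one-liner: when two loaded layers address the same topic, take the instruction from the most specific one. A sketch, with illustrative layer names and instruction strings:

```python
# Higher number = more specific scope, following the hierarchy above.
PRECEDENCE = {"core": 1, "rules": 2, "skills": 3}

def resolve(instructions):
    """Given (layer, text) pairs on one topic, return the most specific text."""
    return max(instructions, key=lambda pair: PRECEDENCE[pair[0]])[1]

winner = resolve([
    ("core", "Prefer simple solutions."),
    ("skills", "Add retry logic for external API calls."),
])
print(winner)  # the skill's procedure wins for the task it covers
```

The LLM analog is probabilistic rather than deterministic, which is exactly why the text recommends making the override explicit in the skill instead of trusting the model to compute this `max` reliably.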
Context Budget as Architectural Constraint
The context window is not infinite. Claude’s standard 200K token context window6 sounds large until you measure what consumes it. (A 1M token beta exists via API header, which scales the budget proportionally but does not eliminate the trade-offs: indiscriminate loading still crowds out the working memory the task itself needs.)
| Consumer | Typical Tokens | % of Window |
|---|---|---|
| System prompt + CLAUDE.md + rules | 8,000-12,000 | 4-6% |
| Conversation history | 20,000-80,000 | 10-40% |
| File reads (per file) | 5,000-20,000 | 2.5-10% |
| Tool outputs (per call) | 1,000-10,000 | 0.5-5% |
| Skill/agent context (on demand) | 3,000-15,000 | 1.5-7.5% |
Token consumption measurements come from 50 Claude Code sessions I tracked between August 2025 and February 2026.7 A 90-minute intensive session can exhaust the window through normal file reads and tool interactions. Note that the 200K window is the total input capacity; the model’s effective attention — the portion it attends to with high fidelity — is a different and smaller number. Context budget discipline matters even when total token usage is well below the limit, because instructions at the beginning of a long context receive more reliable attention than instructions in the middle. The context budget forces architectural trade-offs:
Always load core philosophy, active corrections, and security rules. All three are cheap (4-6% of budget) and universally applicable. The cost of loading them when unneeded is negligible. The cost of omitting them when needed is incorrect output.
Load on demand skills, agent specifications, and reference documentation. Each costs 5-15% of the window per skill and applies to a single domain. Loading them only when relevant preserves budget for the actual work. The trigger for loading a skill is the task type, not the session type — a web development session that encounters a database migration needs the database skill even though it didn’t load at startup.
Never load stale handoff documents, deprecated configurations, or historical state. Every stale file consumes budget without improving output. Worse, stale context actively degrades output by teaching outdated patterns. The compounding engineering pattern requires periodic pruning of context that no longer earns its token cost.
Context budgeting mirrors the trade-off that drives database indexing, caching strategies, and memory management in traditional software.8 The constraint differs (tokens instead of bytes), but the architectural discipline is identical.
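A back-of-envelope tally using midpoints from the table above shows how quickly a hypothetical heavy session approaches the limit. The session shape (three file reads, ten tool calls, one skill) is illustrative.

```python
WINDOW = 200_000  # standard Claude context window

# Midpoints of the ranges in the table above; session shape is hypothetical.
consumers = {
    "system_prompt_and_rules": 10_000,
    "conversation_history": 50_000,
    "file_reads_3_files": 3 * 12_000,     # 36,000
    "tool_outputs_10_calls": 10 * 5_000,  # 50,000
    "one_on_demand_skill": 9_000,
}

used = sum(consumers.values())
print(f"{used:,} tokens used ({used / WINDOW:.0%} of the window)")
# 155,000 tokens used (78% of the window)
```

At 78% utilization there is little headroom left for corrections or a follow-up task, which is why the pressure monitor described later warns at 70%.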
Context Propagation in Multi-Agent Systems
When the main agent spawns a subagent via the Task tool, choosing what context to propagate drives every other design decision. My multi-agent deliberation system uses 10 research agents. Each agent needs enough context to evaluate independently, but too much shared context causes the convergence problem documented in the boids post: agents that share too much context produce identical conclusions.
The propagation rules:
| Context Type | Propagated? | Why |
|---|---|---|
| Core philosophy | Yes | Consistency across agents |
| Domain rules | Yes | Shared quality standards |
| Task-specific instructions | Yes | The actual work |
| Conversation history | No | Independence requires isolation |
| Other agents’ findings | No (until synthesis) | Prevents premature convergence |
| Skill procedures | Selectively | Only skills relevant to the agent’s role |
The Ralph architecture solves a related problem: context propagation across time (iterations), not across agents (parallel processes). Both share the same principle: propagate constraints and principles, isolate implementation details.
Propagation in Practice
A concrete example clarifies the trade-off. When the main agent spawns a security reviewer subagent, the Task prompt includes:
## Propagated Context
- Core philosophy: Quality is the only variable (from CLAUDE.md)
- Security rules: OWASP Top 10, input validation at boundaries (from rules/security.md)
- Task: Review the authentication endpoint changes in app/auth.py
## NOT Propagated
- The main agent's conversation history (50+ messages about UI layout)
- Findings from the architecture reviewer (would bias security analysis)
- The FastAPI skill (irrelevant to a security-focused review)
The security reviewer receives 2,000 tokens of context instead of 40,000. It evaluates the code against security criteria without the distraction of UI decisions or the bias of another agent’s conclusions. The isolation produces an independent assessment. If both the security reviewer and the architecture reviewer flag the same concern independently, the finding has higher confidence than if one reviewer saw the other’s report and agreed.
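The propagation rules can be expressed as a simple filter over typed context items. The tags and payloads here are invented for the sketch; the Task tool does not expose typed context items like this.

```python
# Item tags are hypothetical; they encode the propagation table above.
PROPAGATE = {"core_philosophy", "domain_rules", "task"}

def subagent_context(items, role_skills):
    """Keep propagated types; skills pass only if relevant to the role."""
    kept = []
    for kind, payload in items:
        if kind in PROPAGATE:
            kept.append(payload)
        elif kind == "skill" and payload in role_skills:
            kept.append(payload)
    return kept  # history and other agents' findings never pass

items = [
    ("core_philosophy", "Quality is the only variable"),
    ("domain_rules", "OWASP Top 10; validate input at boundaries"),
    ("task", "Review the authentication changes in app/auth.py"),
    ("conversation_history", "50+ messages about UI layout"),
    ("other_agent_findings", "architecture reviewer's report"),
    ("skill", "skills/fastapi/SKILL.md"),
]
print(subagent_context(items, role_skills=set()))  # only the first three survive
```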
Observing Context Quality
Token count is a proxy metric. The real measure of context quality is: does the agent produce the right output on the first attempt?
After tracking 50 sessions, I identified three quality signals:
First-attempt success rate. How often does the agent’s first response require no correction? With the monolithic CLAUDE.md, the rate was roughly 60%. After the seven-layer architecture, it rose to approximately 80%. The improvement came from removing irrelevant context (fewer distractions) and adding domain-specific skills (more relevant knowledge).9
Here is a representative before/after from the tracking log. The task is identical: “add rate limiting to an API endpoint.”
Before (monolithic CLAUDE.md, July 2025). The agent’s first response imported a rate limiting library not used in the project (it followed an example from the stale FastAPI skill that was loaded alongside iOS rules, security rules, and 150 lines of unrelated content). The correction: “use the existing RateLimiter class in app/middleware.py.” One correction, categorized as “factual error — missing codebase context.”
After (seven-layer architecture, November 2025). Same task type, different endpoint. The agent’s first response used the existing RateLimiter middleware, matched the parameter naming convention in adjacent endpoints, and cited the pattern source: “following the rate limiting pattern in app/middleware.py:42.” No correction needed. The FastAPI skill (Layer 3) loaded on demand and contained current patterns; the iOS rules did not load because the session involved no iOS work.
To track this yourself: tag each agent session in a log with “accepted” or “corrected” after the first response, then compute the ratio monthly. The granularity does not need to be precise — directional trends over 20+ sessions are more useful than exact percentages.
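The tracking log itself can be this simple; the dates and outcomes below are invented for illustration.

```python
from collections import Counter

# A hand-rolled session log: (date, first-response outcome).
log = [
    ("2025-11-03", "accepted"),
    ("2025-11-04", "corrected"),
    ("2025-11-05", "accepted"),
    ("2025-11-06", "accepted"),
]

counts = Counter(outcome for _, outcome in log)
rate = counts["accepted"] / len(log)
print(f"first-attempt success: {rate:.0%} over {len(log)} sessions")
# first-attempt success: 75% over 4 sessions
```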
Correction type. When the agent does require correction, is the error factual (wrong API, wrong pattern) or a judgment call (wrong approach, wrong priority)? Factual errors indicate missing context — the agent lacked the information needed to produce the right answer. Judgment errors indicate conflicting or ambiguous context — the agent had the information but weighed competing instructions incorrectly. The distinction matters because the fix differs: factual errors require adding context (a new rule or skill), while judgment errors require clarifying context (resolving a conflict between existing layers). Tracking correction types reveals which layer needs attention.
Context pressure at task completion. How much context budget remains when the task finishes? If the window is 90% full before the task is half done, the context architecture is loading too much irrelevant material. My hooks system includes a context pressure monitor that warns when utilization exceeds 70%.
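A sketch of such a monitor, using the 70% threshold from the text; the token counts passed in are illustrative, and obtaining the real count would depend on the tooling available.

```python
WINDOW = 200_000
WARN_AT = 0.70  # threshold from the text

def pressure_warning(tokens_used):
    """Return a warning string above the threshold, else None."""
    utilization = tokens_used / WINDOW
    if utilization >= WARN_AT:
        return f"context pressure {utilization:.0%}: prune context or split the task"
    return None

print(pressure_warning(150_000))  # 75% utilization triggers a warning
print(pressure_warning(80_000))   # 40% utilization: None
```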
A Session Walkthrough: What Loads When
A concrete walkthrough clarifies how the seven layers interact during a real task. The scenario: I ask Claude Code to “add rate limiting to the /api/ab/event endpoint” in the blakecrosley.com project.
Session start (0s). Claude Code loads Layer 1 (CLAUDE.md, 205 lines, ~800 tokens) and Layer 2 (all 9 rules files, ~3,200 tokens). Total context consumed: ~4,000 tokens (2% of 200K window). The agent now knows my coding philosophy, security requirements, API design conventions, and git workflow — regardless of what I ask it to do.
Task begins (2s). The agent reads the task and identifies it as a FastAPI endpoint modification. Layer 5 (hooks) fires: the UserPromptSubmit hook injects the current date and session context (~200 tokens). The agent does not load the SwiftUI skill, the iOS security rules, or the Midjourney prompt skill. Those files exist but are irrelevant. Excluding them saves ~8,000 tokens.
File reads (5-30s). The agent reads app/main.py, the existing endpoint code, and the test file. Each file read consumes 5,000-15,000 tokens of context. After reading three files: ~30,000-50,000 tokens consumed (15-25% of window).
Skill activation (if needed). If the agent needs FastAPI-specific patterns, it loads Layer 3 (skills/fastapi/SKILL.md, ~2,500 tokens). The skill loads because the task requires it, not because the session started. If this were a CSS-only task, the FastAPI skill would never load.
Implementation and verification (30s-5min). The agent writes code, runs tests, and reports results. By task completion, context consumption is typically 60,000-100,000 tokens (30-50% of window). The remaining budget allows for follow-up questions, corrections, or a second task in the same session.
The key observation: Layers 1 and 2 consume 2% of the window and apply to every task. Layers 3-7 load selectively and consume 5-25% only when relevant. If all 650 files loaded at session start, they would consume approximately 150,000-200,000 tokens, leaving almost no room for the actual work.
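Tallying the walkthrough's figures (midpoints where the text gives ranges) shows what the selective loading buys:

```python
# Figures from the walkthrough above; midpoints used where a range is given.
stages = {
    "layer_1_core": 800,
    "layer_2_rules": 3_200,
    "hook_injection": 200,
    "three_file_reads": 40_000,  # midpoint of 30,000-50,000
    "fastapi_skill": 2_500,
}

total = sum(stages.values())
print(f"{total:,} tokens before implementation begins")
# 46,700 tokens: under a quarter of a 200K window
```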
The Distributed Architecture Pattern
Nothing in the seven-layer system is unique to AI agents. It mirrors established software patterns:
| Software Pattern | Context Equivalent |
|---|---|
| Environment variables | Core CLAUDE.md (always loaded, global) |
| Configuration files | Rules (loaded at startup, domain-specific) |
| Libraries/modules | Skills (loaded on demand, self-contained) |
| Microservices | Agents (isolated, communicating through protocols) |
| Event handlers | Hooks (triggered by lifecycle events) |
| Database | State files (persistent, queryable) |
| API contracts | Config schemas (shared numerical parameters) |
The parallel is not metaphorical. The same forces (coupling, cohesion, scope, lifecycle) drive identical architectural decisions.10 Context engineering is software engineering for a substrate where “memory” is a scarce, degradable resource rather than an abundant, persistent one.11
Independent validation of this architecture comes from the obra/superpowers project, which independently evolved a similar multi-layer structure for Claude Code context: core principles in a global file, domain skills loaded on demand, and verification procedures separated from behavioral instructions.12 The convergence suggests the layered architecture is not an idiosyncratic design choice but a natural response to the constraints of the substrate.
Key Takeaways
For engineers building agent systems:
- Design context like software architecture, not documentation. What loads when, what overrides what, and what propagates to subagents determines agent behavior more than any individual instruction. Apply the same separation-of-concerns discipline you use for code.
- Separate layers by lifecycle. Universal rules load every session. Domain-specific knowledge loads on demand. Per-session state stays transient. Mixing lifecycles in one file creates the coupling problems that software architecture exists to solve.
For teams scaling AI workflows:
- Treat the context window as a budget. System prompts, file reads, and tool outputs consume the 200K token window. Every persistent instruction competes with working memory. Measure what you load and prune what doesn’t earn its tokens.
- Propagation rules determine multi-agent quality. Subagents need shared principles for consistency and isolated state for independence. Propagating too much context causes convergence. Propagating too little causes incoherence.
Exercise: Audit your context system. Create this directory structure and populate it with your existing instructions:
mkdir -p .claude/{rules,skills,agents,hooks,config,state}
# Move domain-specific sections from CLAUDE.md into rules/
# Move reusable procedures into skills/
# Move agent specs into agents/
Then answer: which files load every session that should load on demand? Which on-demand files are missing? Track the answers in a simple log — tag each session with “first-attempt correct” or “needed correction” and review the ratio after 20 sessions.
FAQ
How should I structure context files for AI coding agents?
Extract domain-specific content into separate files when it exceeds 20-30 lines and when it would be irrelevant in at least some sessions. If you write iOS-specific rules and web-specific rules in the same CLAUDE.md, the agent reads both regardless of the current project. Keep universal constraints (security rules, coding standards, philosophical principles) in the core file. Move domain-specific patterns (FastAPI endpoints, SwiftUI conventions, database queries) into on-demand files that load only when the task requires them.
How do I know when a skill has gone stale?
Two signals: the agent produces code that works but doesn’t match the current codebase patterns, or the agent references APIs or patterns from an older version of a framework. Add a “last verified” date to each skill and check it quarterly. More importantly, the quality loop (does generated code match existing patterns?) catches stale-skill drift regardless of the skill’s content.
What is the right size for a context budget?
For a 200K token window, reserve 4-6% for system prompt and rules (always loaded), leave 40-60% for conversation history and file reads (the actual work), and budget 10-15% for on-demand skills and agent context. If context pressure exceeds 70% before the task is half done, you are loading too much irrelevant material. The specific percentages depend on task complexity — a multi-file refactor consumes more context than a single-function bug fix.
How do I handle conflicting instructions between layers?
Apply scope-based precedence: the more specific layer wins. Skills (local scope) override rules (module scope), which override CLAUDE.md (global scope). When you discover a conflict, make the resolution explicit rather than relying on the model to resolve it probabilistically. Add a note in the more specific layer: “This overrides the general principle of X for this specific context because Y.”
This post builds on Context Window Management (token economics) and The Ralph System (filesystem memory). The Claude Code hooks system implements the automation layer. For agent coordination patterns, see Multi-Agent Deliberation and Boids to Agents.
1. Birgitta Böckeler, “Context Engineering for Coding Agents,” martinfowler.com, February 2026. martinfowler.com/articles/exploring-gen-ai/context-engineering-coding-agents.html. Böckeler’s colleague Bharani Subramaniam defines context engineering as “curating what the model sees so that you get a better result.” The definition is correct. The present post argues that the structure of how that information is organized and delivered is an architectural discipline, not a documentation exercise.
2. Daniel Miessler, “How to Talk to AI,” danielmiessler.com, June 2025. danielmiessler.com/blog/how-to-talk-to-ai. Miessler argues that the real skill underlying both prompt engineering and context engineering is clear thinking: the ability to articulate exactly what you want to accomplish. The framing complements the present post, which focuses on the structural discipline of organizing context rather than thinking clearly about it.
3. The software architecture parallel is deliberate. Robert C. Martin, Clean Architecture: A Craftsman’s Guide to Software Structure and Design, Prentice Hall, 2017, Part III (Design Principles) and Part V (Architecture). Martin identifies the same forces: coupling, cohesion, and the separation of concerns. The difference in AI context systems is that “memory” is ephemeral and bounded, adding a constraint that traditional architecture does not face.
4. The 650-file count is the author’s measurement as of February 2026. Global context: ~400 files (rules, skills, agents, hooks, configs, state, docs, handoffs). Project-specific context (blakecrosley.com): ~250 files (PRDs, docs, plans, workflows, i18n configs). Only a fraction loads per session.
5. Variable scope resolution in programming languages (local, enclosing, global, built-in) is the direct analogy. Python’s LEGB rule defines the same hierarchy: local scope, enclosing function scope, global scope, built-in scope. See Python Software Foundation, “Execution Model,” section 4.2.2, “Resolution of names.” docs.python.org/3/reference/executionmodel.html. Skills (local scope) override rules (module scope), which override CLAUDE.md (global scope). The analogy breaks down slightly because LLM instruction following is probabilistic, not deterministic, but the architectural principle holds.
6. Anthropic, “Models overview,” platform.claude.com, 2025. platform.claude.com/docs/en/docs/about-claude/models. All current Claude models (Opus 4.6, Sonnet 4.6, Haiku 4.5) specify a 200K token context window, with Opus 4.6 and Sonnet 4.6 supporting 1M tokens in beta.
7. Token consumption measurements from 50 Claude Code sessions I tracked between August 2025 and February 2026. See Context Window Management for the full methodology.
8. The analogy between token budgets and memory hierarchies follows the framework in Hennessy, J.L. and Patterson, D.A., Computer Architecture: A Quantitative Approach, 6th edition, Morgan Kaufmann, 2017, Chapter 2 (Memory Hierarchy Design). Hennessy and Patterson’s treatment of cache hierarchies, locality of reference, and the cost of memory access at different levels maps directly to context engineering: frequently needed context (L1 cache / core rules) loads fastest, while rarely needed context (disk / on-demand skills) loads only when referenced.
9. First-attempt success rate is a rough metric based on the author’s subjective assessment of whether the first response required correction. It is not a controlled experiment. The directional improvement (60% to 80%) is consistent across session types but should not be cited as precise measurement.
10. James Lewis and Martin Fowler, “Microservices,” martinfowler.com, March 2014. martinfowler.com/articles/microservices.html. Lewis and Fowler define microservices as “a suite of small services, each running in its own process and communicating with lightweight mechanisms.” The forces they identify (independent deployment, decentralized governance, bounded contexts) map directly to the agent isolation and protocol-based communication in the context architecture described here.
11. The “context as architecture” framing draws from Xu et al., “Everything is Context: Agentic File System Abstraction for Context Engineering,” arXiv, December 2025. arxiv.org/abs/2512.05470. The paper proposes a file-system abstraction for managing context in generative AI systems, treating diverse knowledge artifacts, memory, and tools as context within token constraints. The theoretical framework supports the practical architecture described here.
12. obra/superpowers, github.com/obra/superpowers. The superpowers project organizes agent context into core instructions, on-demand skills, and verification procedures — a structure that converged with the seven-layer architecture described here despite being developed independently. The convergence supports the thesis that layered context is a natural architectural response to bounded context windows.