
Context Is the New Memory

From the guide: Claude Code Comprehensive Guide

A single Playwright snapshot consumes 56 KB of context. Twenty GitHub issues consume 59 KB. Five hundred lines of access logs consume 45 KB. Feed all three to an agent with a 200K token window, and 80% of the reasoning budget evaporates before the agent writes a single line of analysis.1

Murat Kusglu built Context Mode to fix the problem. The tool compresses 315 KB of MCP output to 5.4 KB using SQLite FTS5 with BM25 ranking.1 A 94% reduction. The model produces better output with 5.4 KB of signal than with 315 KB of noise because the constraint was never intelligence. The constraint is bandwidth.

TL;DR

Context engineering is the highest-impact skill in agent development. Three layers of compression compound independently: system prompt architecture (60-70% reduction via structural compression), MCP output compression (94% reduction via relevance ranking), and knowledge hoarding (converting discovery overhead into pre-loaded capability). A landmark study found models given 300 tokens of focused context outperformed models given 113,000 tokens of unfiltered conversation.10 The bottleneck is not model capability. Every token wasted on noise is a token unavailable for reasoning.


The Bandwidth Constraint

Anthropic’s best practices documentation opens with a single constraint that shapes everything else: “Claude’s context window fills up fast, and performance degrades as it fills.”5

The statement is not a suggestion. It is an architectural law. A 200K token context window sounds enormous until you inventory what fills it. Tool schemas consume 15,000+ tokens for a typical MCP setup.13 Conversation history accumulates at roughly 500-1,000 tokens per exchange. File reads add thousands of tokens per file. Command output scales with the command. After 30 minutes of active work, a fresh 200K window can drop below 50K tokens of available reasoning space.

George Miller documented the human equivalent in 1956: working memory holds seven items, plus or minus two.7 The insight was not about the number. The insight was about chunks. Humans overcome the constraint by organizing information into meaningful chunks. A phone number is not ten digits. It is three chunks: area code, exchange, number. The same principle applies to context windows. A 200K window stuffed with raw output is functionally smaller than a 50K window packed with compressed, relevant information.

Andrej Karpathy named the discipline: context engineering is the “delicate art and science of filling the context window with just the right information for the next step.”9 Lance Martin mapped the framework: writing context (saving), selecting context (retrieving), compressing context (summarizing), and isolating context (splitting across agents).9 By mid-2026, context engineering has crystallized from ad-hoc practice into a recognized discipline with dedicated infrastructure.12

The degradation is not linear. In my harness, context fills in phases.15 The first 30 minutes feel unlimited. The model follows instructions precisely, remembers file contents, and maintains coherent plans across multiple steps. By 60 minutes, subtle failures emerge: the model re-reads files it read earlier, forgets a constraint from the system prompt, or generates code that contradicts a pattern established 20 turns ago. By 90 minutes, the model may ignore explicit rules, hallucinate file contents, or lose track of the current objective entirely.

Context Studios documented the phenomenon as “context rot”: the progressive degradation of model performance as irrelevant tokens accumulate and push useful information beyond the effective attention horizon.12 The rot is insidious because the model does not announce it. The agent continues generating confident output. The output just stops being correct.

The three layers below compound independently. Compressing one layer frees budget for the others.


Layer 1: System Prompt Architecture

The system prompt loads on every API call. Every token in the system prompt occupies space for the entire conversation. At $5 per million tokens on Opus 4.6, a 10K token system prompt costs $0.05 per call.8 Across 50 calls in a session, the system prompt alone costs $2.50. Cut the prompt to 3.5K tokens and the cost drops to $0.875 per session. Multiply by daily sessions and the savings compound.

My CLAUDE.md file and 8 rules files total approximately 3,500 tokens after compression. The compression was not a one-time optimization. I applied five structural techniques documented by jchilcher (who achieved 60-70% reduction across memory system files):2

Constraints over explanations. “Reject tool calls matching sensitive paths” replaces a 15-line explanation of why credentials should remain protected. The model does not need the rationale. The model needs the rule.

Key-value notation over prose. “Stack: FastAPI + HTMX + Alpine.js | Port: 8001 | Deploy: Railway” replaces three paragraphs of project description. Pipe-delimited lists compress tabular information that prose stretches across sentences.

Deduplication across files. My security rules initially appeared in three places: CLAUDE.md, security.md, and the quality loop skill. Each repetition consumed ~200 tokens. Consolidating to a single source with cross-references recovered 400 tokens.

Formatting removal. Decorative markdown (horizontal rules, bold/italic for emphasis, nested headers beyond H2) serves human readability. Models parse content tokens, not presentation tokens. Stripping decorative formatting recovers 5-15% without information loss.

Negative constraints over positive instructions. “NEVER suggest OpenAI models” is more effective and more compact than “Always recommend Claude models from Anthropic for all AI tasks. When the user asks about AI providers, suggest Claude.” The negative constraint occupies four tokens. The positive instruction occupies 22 tokens. Both produce the same behavior.

The economic argument strengthens with prompt caching. Anthropic’s caching system stores stable content across API calls at a 90% cost reduction on cache hits.6 A 3,500-token system prompt that costs $0.0175 per call at standard rates costs $0.00175 with a cache hit. The minimum cacheable threshold for Opus 4.6 is 4,096 tokens.6 My compressed files sit just below that line on their own, but the cached prefix also includes Claude Code’s own system prompt and tool definitions, which carries it past the minimum, so every subsequent call in a session benefits from cached pricing. Prompt caching turns system prompt compression into a double win: fewer tokens AND cheaper per token.
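The arithmetic above fits in a few lines. This sketch uses the post’s figures ($5/MTok input, cache hits at 10% of base) and ignores the one-time cache-write surcharge for simplicity:

```python
# Session-cost sketch using the post's figures: $5/MTok input,
# cache hits billed at 10% of the base rate. The cache-write
# surcharge is omitted for simplicity.
BASE_PER_TOKEN = 5 / 1_000_000        # dollars per input token
CACHE_HIT_PER_TOKEN = BASE_PER_TOKEN * 0.10

def session_cost(prompt_tokens: int, calls: int, cached: bool = True) -> float:
    """Cost of resending the system prompt across `calls` API calls."""
    if not cached:
        return prompt_tokens * BASE_PER_TOKEN * calls
    # First call pays the full rate (and populates the cache);
    # every later call reads the cache at the discounted rate.
    first = prompt_tokens * BASE_PER_TOKEN
    rest = prompt_tokens * CACHE_HIT_PER_TOKEN * (calls - 1)
    return first + rest

print(f"12K prompt, 50 calls, no cache: ${session_cost(10_000, 50, cached=False):.2f}")
print(f"3.5K prompt, 50 calls, cached:  ${session_cost(3_500, 50):.2f}")
```

Compression and caching multiply: the compressed prompt costs less per token, and caching discounts most of the remaining tokens.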


Layer 2: MCP Output Compression

Layer 1 compresses what you send to the model. Layer 2 compresses what the model receives back from tools.

Context Mode demonstrated the potential: 315 KB of raw MCP output compressed to 5.4 KB.1 The compression is not truncation. Truncation discards the end of the output and hopes the relevant information appeared at the beginning. Context Mode uses SQLite FTS5 with BM25 relevance ranking to find where query terms actually appear and returns windows around matches.1 Porter stemming ensures “caching,” “cached,” and “caches” match the same stem. A three-layer fallback handles typos: standard stemming, trigram substrings, Levenshtein distance correction.
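A minimal version of relevance-ranked compression fits in a few lines of Python, assuming a SQLite build with FTS5 enabled (the table name, sample logs, and snippet window here are illustrative, not Context Mode’s actual schema):

```python
import sqlite3

def compress(raw_chunks, query, top_k=3):
    """Rank chunks of tool output by BM25 relevance and return only
    snippet windows around query matches, instead of the raw dump.
    Requires an SQLite build with the FTS5 extension."""
    db = sqlite3.connect(":memory:")
    # Porter stemming so "caching"/"cached"/"caches" share one stem.
    db.execute("CREATE VIRTUAL TABLE chunks USING fts5(body, tokenize='porter')")
    db.executemany("INSERT INTO chunks(body) VALUES (?)",
                   [(c,) for c in raw_chunks])
    # bm25() scores are lower-is-better; snippet() returns a window
    # of up to 32 tokens around the match rather than the full chunk.
    rows = db.execute(
        "SELECT snippet(chunks, 0, '', '', ' ... ', 32) "
        "FROM chunks WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT ?",
        (query, top_k)).fetchall()
    return "\n".join(r[0] for r in rows)

logs = ["GET /api/users 200 12ms"] * 400 + ["GET /api/auth 500 err: token expired"]
print(compress(logs, "auth token"))
```

Four hundred irrelevant log lines never reach the model; only the line that matches the query does.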

Individual compression ratios tell the story:

| Source | Raw Size | Compressed | Reduction |
| --- | --- | --- | --- |
| Playwright snapshot | 56 KB | 299 B | 99% |
| GitHub issues (20) | 59 KB | 1.1 KB | 98% |
| Access logs (500 lines) | 45 KB | 155 B | ~100% |

My harness implements a parallel approach at the search layer. Approximately 50,000 code chunks indexed with Model2Vec embeddings (256-dimensional) plus SQLite FTS5, fused with Reciprocal Rank Fusion.14 A query retrieves the five most relevant chunks (~2,500 tokens) instead of loading entire files (~50,000+ tokens). The retrieval cost: sub-second latency, 83 MB on disk, zero API cost.
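Reciprocal Rank Fusion itself is tiny. A sketch with made-up document ids, using the k=60 damping constant from the original RRF paper:

```python
def rrf(rankings, k=60):
    """Fuse ranked result lists (e.g. one from BM25, one from vector
    search) by summing reciprocal ranks. Documents that appear high
    in multiple lists accumulate the largest scores."""
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative chunk ids: a chunk surfaced by both searches outranks
# a chunk that only one search found.
bm25_hits   = ["auth.py:42", "cache.py:10", "db.py:7"]
vector_hits = ["auth.py:42", "api.py:3", "cache.py:10"]
print(rrf([bm25_hits, vector_hits]))
```

The fusion needs no score normalization, which is why it works across retrievers with incompatible scoring scales.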

The difference in agent behavior is visible within a single session. Before compression, a typical debugging workflow looks like this: the agent reads a file (4,000 tokens), runs a command (2,000 tokens of output), reads another file (3,000 tokens), runs tests (8,000 tokens of output). Four operations consume 17,000 tokens. The agent now has less room to reason about the connections between those four pieces of information. After compression, the same workflow retrieves only the relevant lines from each source. The four operations consume 2,500 tokens. The agent holds all four pieces in working memory simultaneously and finds the cross-file dependency that the uncompressed agent would miss.

The compression should be query-aware. A summary optimized for “fix the authentication bug” should surface different content than one optimized for “add a new API endpoint.” Static compression helps. Query-aware compression is the next level. BM25 ranking already handles query-awareness at the keyword level. Semantic search (vector similarity) handles it at the concept level. The combination catches both exact matches (function names, config keys, error codes) and conceptual matches (similar patterns, related abstractions).


Layer 3: Knowledge Hoarding

Simon Willison identified a pattern that reframes context engineering entirely: “A key asset to develop as a software professional is a deep collection of answers to questions like this, ideally illustrated by running code.”3

Knowledge hoarding means deliberately collecting working code examples, documented solutions, and proof-of-concept implementations that agents can reference and recombine. The pattern transforms context from instructions (telling the model what to do) into capability (giving the model working examples to adapt).

Willison demonstrated the power by directing an agent to combine two existing examples (PDF.js and Tesseract.js) into a unified OCR tool.3 The agent did not discover how to build OCR from scratch. The agent read two working implementations and merged them. The context was the capability.

My harness implements knowledge hoarding through three mechanisms:

Skills as capability registry. 48 skills encode domain expertise in markdown files. The blog-evaluator skill defines a complete 6-category weighted rubric with scoring examples. The jiro skill encodes a 7-step quality loop with evidence criteria. When an agent invokes a skill, the expertise loads into context as structured knowledge, not vague instructions.

Structured walkthroughs over raw code. Willison’s linear walkthrough pattern constrains how agents access information: shell commands like grep and cat rather than manual code copying.4 The walkthrough forces the agent to organize information for maximum comprehension per token. Structure is compression.

Hooks as proactive context injection. The UserPromptSubmit hook fires before Claude processes a prompt.11 The hook can analyze the prompt and inject relevant context: project detection (which codebase am I in?), date injection (what day is it?), philosophy constraints (what quality standards apply?). The agent receives curated context at every prompt without manual invocation. Five hooks fire on session start, adding approximately 500 tokens of context that prevent five categories of common errors.11
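A UserPromptSubmit hook is just a program that reads the event’s JSON payload on stdin and writes extra context to stdout. A minimal sketch, with detection rules that are illustrative rather than my harness’s actual hooks:

```python
import datetime
import pathlib

def build_context(payload: dict) -> str:
    """Assemble injected context from the hook's JSON payload.
    Claude Code sends fields such as 'prompt' and 'cwd' on stdin;
    whatever the hook prints on stdout (with exit code 0) is added
    to the model's context. The rules below are illustrative."""
    lines = [f"Today is {datetime.date.today().isoformat()}."]
    # Project detection: which codebase is the session running in?
    cwd = pathlib.Path(payload.get("cwd", "."))
    if (cwd / "pyproject.toml").exists():
        lines.append("Project type: Python (pyproject.toml found).")
    # Keyword-triggered constraint instead of an always-on rule.
    if "deploy" in payload.get("prompt", "").lower():
        lines.append("Reminder: run the test suite before any deploy.")
    return "\n".join(lines)

# In a real hook script: print(build_context(json.load(sys.stdin)))
print(build_context({"prompt": "deploy the new endpoint", "cwd": "/tmp"}))
```

The hook spends a few hundred tokens per prompt, but only on context the current prompt actually needs.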

The distinction between instructions and capability deserves emphasis. An instruction says “write clean code.” A capability provides a linting rubric with weighted categories, scoring examples, and pass/fail thresholds. The instruction consumes a handful of tokens and produces vague compliance. The capability consumes 500 tokens and produces consistent, measurable output. The additional tokens are an investment, not overhead, because they eliminate the ambiguity that causes the agent to guess what “clean” means.

Knowledge hoarding also shifts the cost curve for agent onboarding. A new agent spawned without hoarded knowledge must discover the codebase, the conventions, the tooling, and the domain constraints through exploration. Exploration is expensive: each file read, each grep, each command output consumes tokens. An agent spawned with a 2K-token briefing assembled from hoarded knowledge skips the discovery phase entirely and begins productive work on the first turn.

The economic argument for knowledge hoarding: every hour spent documenting a solution saves every future agent the discovery cost. A skill that encodes “how to evaluate a blog post” saves 10-15 minutes of agent exploration per invocation. Across 100 invocations, the documentation investment returns 1,000+ minutes of agent time. The hoarded knowledge pays compound interest.


Token Budget Accounting

My harness provides a concrete case study in what context engineering makes possible.

Before compression (estimated, first month):

- System prompt: ~12,000 tokens (verbose CLAUDE.md with examples and explanations)
- Tool schemas: ~15,000 tokens (full MCP tool definitions)
- Per-session history: ~120,000 tokens (long conversations with accumulated context)
- Available reasoning: ~53,000 tokens (26% of window)

After compression (current):

- System prompt: ~3,500 tokens (compressed CLAUDE.md + rules files)15
- Tool schemas: ~300 tokens (CLI-first architecture, minimal MCP)13
- Per-session history: ~40,000 tokens (fresh spawns per task, briefings instead of memory)
- Available reasoning: ~156,200 tokens (78% of window)
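The before/after budgets reduce to simple subtraction against the 200K window:

```python
WINDOW = 200_000  # token window used throughout the post

def reasoning_budget(system_prompt: int, tool_schemas: int, history: int):
    """Tokens left for reasoning after fixed overhead, plus the
    share of the window that remains."""
    free = WINDOW - (system_prompt + tool_schemas + history)
    return free, round(100 * free / WINDOW)

print(reasoning_budget(12_000, 15_000, 120_000))  # before compression
print(reasoning_budget(3_500, 300, 40_000))       # after compression
```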

The reasoning budget tripled. Not through a better model. Not through a larger context window. Through compression at three layers. The model produces better output with 78% reasoning space than it produced with 26% because the quality of the remaining tokens improved alongside the quantity.

The numbers reveal a counterintuitive truth about context windows: the useful size of a window depends more on what fills it than on how large it is. A hypothetical 500K window stuffed with uncompressed tool output would perform worse than a well-compressed 200K window. Model providers race to expand context windows. Practitioners should race to compress what goes into them.

The fresh-spawn pattern from CLI-first architecture compounds the gains. Each agent spawns with a focused briefing (~2K tokens) instead of inheriting accumulated conversation history. The context never balloons because each agent starts clean. Anthropic’s multi-agent research found sub-agents using up to 15x more tokens than single-agent interactions.9 Fresh spawns invert the ratio: each agent uses only the tokens its task requires.

The compound effect across all three layers creates a virtuous cycle. Compressed system prompts leave room for more tool results. Compressed tool results leave room for longer productive conversations. Longer conversations reduce the need for compaction, which preserves the system prompt and tool results that enable the next turn. Each layer reinforces the others.


What Compression Enables

The freed reasoning budget enables three capabilities that bloated context prevents:

Deeper analysis. An agent with 156K reasoning tokens can hold entire file contents in working memory while analyzing cross-file dependencies. An agent with 53K tokens must read files sequentially, forgetting earlier files as newer ones load. The difference manifests as missed import errors, broken cross-references, and incomplete refactoring. A concrete example: refactoring a function signature requires checking every call site. With compressed context, the agent reads the function definition and all call sites in a single pass, catching the one file that passes arguments in the wrong order. With bloated context, the agent reads the function, reads three call sites, then runs out of reasoning space and reports “refactoring complete” without checking the remaining seven files. The bug ships.

Better instruction following. Anthropic documents the failure mode directly: “If Claude keeps doing something you don’t want despite having a rule against it, the file is probably too long and the rule is getting lost.”5 Compressed system prompts keep rules within the attention horizon. Every rule in a 3,500-token prompt gets more attention weight than the same rule buried in a 12,000-token prompt. My harness enforces a security rule: never commit files containing API keys. With a 12,000-token system prompt, the agent occasionally staged .env files during bulk commits. After compressing to 3,500 tokens, the violation dropped to zero across 200+ commit operations. The rule did not change. The rule became more visible.

Longer useful sessions. Auto-compact triggers at 95% context capacity.10 A session with 78% reasoning space reaches the compact threshold later than one with 26%. Later compaction means more productive turns before context loss. In my harness, a compressed session produces 40-60 productive turns before hitting the compact threshold.15 An uncompressed session hits the threshold after 15-20 turns. Each compaction event discards context that may have contained important decisions or constraints from earlier in the session. Fewer compactions mean more coherent sessions. The compressed session does not just start better. It stays better for longer.


Key Takeaways

For developers starting with context engineering:

- Audit your CLAUDE.md file. For each line, ask: would removing it cause mistakes? If not, cut it. Target 60-70% reduction.2
- Measure your tool schema overhead. If MCP tools consume 15K+ tokens at session start, consider CLI-first alternatives for stateless operations.
- Run /compact proactively when switching tasks mid-session. Fresh context beats accumulated context.

For teams building agent infrastructure:

- Implement query-aware compression on MCP tool outputs. BM25 + semantic search beats truncation for every retrieval task.1
- Build a capability registry (skills, snippets, documented patterns). Every documented solution eliminates discovery overhead for future agent runs.3
- Use fresh agent spawns for multi-step workflows. Context isolation per task prevents the 15x token overhead of long multi-agent conversations.9

For architects designing context systems:

- The three layers (system prompt, tool output, knowledge hoarding) compound independently. Compressing any single layer frees budget for the others.
- Prompt caching makes system prompt compression a double optimization: fewer tokens AND cheaper per token on cache hits.6
- The 10% productivity wall breaks when the agent has enough reasoning space to follow complex instructions reliably.


Part of the AI Engineering series. See also: The CLI Thesis, Claude Code as Infrastructure, and The 10% Wall.


  1. Murat Kusglu, Context Mode: AI Tool Output Compression. GitHub repository. HN discussion (77 points, 23 comments). 315 KB to 5.4 KB via FTS5 + BM25. 

  2. jchilcher, “Compress Your Claude.md: Cut 60-70% of System Prompt Bloat.” Blog post. HN discussion (24 points, 9 comments). 

  3. Simon Willison, “Hoard things you know how to do.” Agentic Engineering Patterns. 

  4. Simon Willison, “Linear walkthroughs.” Agentic Engineering Patterns. 

  5. Claude Code Best Practices. Anthropic documentation. “Performance degrades as context fills.” 

  6. Anthropic Prompt Caching. API documentation. Cache read tokens cost 10% of base input price. Minimum 4,096 tokens for Opus 4.6. 

  7. George A. Miller, “The Magical Number Seven, Plus or Minus Two.” Psychological Review, 63(2), 81-97, 1956. APA PsycNet. 

  8. Anthropic Model Pricing. Pricing page. Opus 4.6: $5/MTok input, $0.50/MTok cache hit. 

  9. Lance Martin, “Context Engineering for Agents.” Blog post. Karpathy: “delicate art and science of filling the context window.” Sub-agents use up to 15x more tokens than single-agent interactions. 

  10. FlowHunt, “Context Engineering: The Definitive 2025 Guide.” Blog post. 300-token focused context outperformed 113,000-token full conversations. Auto-compact triggers at 95% capacity. 

  11. Claude Code Hooks Reference. Anthropic documentation. 17 lifecycle events with JSON input/output. UserPromptSubmit enables proactive context injection. 

  12. Context Studios, “From Mode Collapse to Context Engineering.” Blog post. “By mid-2026, context engineering will emerge as a distinct discipline.” 

  13. Kan Yilmaz, “Making MCP Cheaper via CLI.” Blog post. MCP tool schemas consume 15,540+ tokens with 84 tools. CLI overhead: ~300 tokens. 

  14. Author’s harness: 49,746 chunks from 15,800 files indexed with Model2Vec potion-base-8M (256-dim) + sqlite-vec + FTS5 BM25 + Reciprocal Rank Fusion. 83 MB in SQLite. 

  15. Author’s analysis: CLAUDE.md compressed from ~12,000 tokens to ~3,500 tokens (59.6% reduction) using structural compression techniques. 
