
The Agent Didn't Get Smarter

Six months ago, a coding task took an entire session of explanation. Last week, the same kind of task took one sentence. The model did not change between those two sessions. Claude Opus 4.6 served both. Same weights, same architecture, same context window, same capabilities.

The project changed.

The Wrong Conversation

The AI productivity conversation is almost entirely about model capabilities. Which model is fastest. Which model writes the best code. Which model handles the longest context. The implicit assumption is that the model is the variable: upgrade the model, improve the output.

This assumption is wrong for long-running projects. On a project I have been working on for six months with 500+ agent sessions, the model contributes maybe 30% of the session quality. The other 70% comes from the accumulated project infrastructure: convention documents, decision memories, handoff artifacts, behavioral hooks, codified skills, and test coverage.

A better model on a bare project produces better output than a worse model on a bare project. A worse model on a project with 500 sessions of accumulated context often produces better output than a better model on a bare project. The infrastructure dominates the model.

Evidence

The market page performance fix illustrates the point. One sentence: “fix the market page performance.” The agent:

  1. Read a handoff document from a previous session that diagnosed the bottleneck
  2. Identified the correct code path (market_hub(), not _fetch_market_data())
  3. Implemented a paginated database query with an aggregate RPC
  4. Wrote tests
  5. Deployed

The Austin market page went from 14 seconds to 108 milliseconds, a 132x improvement from a single prompt.1

This did not happen because the model is smart. It happened because the handoff document existed. The handoff captured a diagnosis that survived three code review corrections and two priority reorderings across four days. Without the handoff, the agent would have started from scratch. It would have investigated the wrong code path (as the first draft of the handoff did). It would have proposed unnecessary HTMX partials (as the original plan did). The handoff contained the errors already made and corrected. The agent inherited the corrected understanding.

The model’s contribution was reading the handoff and implementing the fix. The infrastructure’s contribution was having the right handoff to read.
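The shape of that fix can be sketched in a few lines. Everything below is illustrative: the schema, page size, and both function bodies are assumptions, with only the market_hub() and company_markets names taken from the incident above, and an in-memory SQLite database standing in for the real aggregate RPC. The point is the before/after: slicing and summing in Python versus pushing LIMIT/OFFSET and the aggregate into the database.

```python
import sqlite3

# Hypothetical stand-in for the real schema; names beyond
# company_markets / market_hub are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE company_markets (id INTEGER PRIMARY KEY, city TEXT, listings INTEGER)"
)
conn.executemany(
    "INSERT INTO company_markets (city, listings) VALUES (?, ?)",
    [(f"city-{i}", i % 50) for i in range(1000)],
)

PAGE_SIZE = 25

def market_hub_slow(page: int) -> dict:
    """The diagnosed anti-pattern: fetch every row, then slice and sum in Python."""
    rows = conn.execute(
        "SELECT city, listings FROM company_markets ORDER BY id"
    ).fetchall()
    total_listings = sum(r[1] for r in rows)   # aggregate computed in Python
    start = page * PAGE_SIZE
    return {"rows": rows[start:start + PAGE_SIZE], "total": total_listings}

def market_hub_fast(page: int) -> dict:
    """The fix: let the database paginate and aggregate."""
    rows = conn.execute(
        "SELECT city, listings FROM company_markets ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE, page * PAGE_SIZE),
    ).fetchall()
    (total_listings,) = conn.execute(
        "SELECT SUM(listings) FROM company_markets"
    ).fetchone()
    return {"rows": rows, "total": total_listings}

# Both paths return the same page and the same total; only the work moves.
assert market_hub_slow(3) == market_hub_fast(3)
```

The slow version's cost scales with the whole table on every request; the fast version's cost scales with one page. That is the kind of distinction a handoff document captures once so no later session has to rediscover it.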

What Changes and What Doesn’t

Between session 1 and session 500 on the same project, exactly one thing stays constant: the model. Everything else changes.

What changes:

  • The CLAUDE.md grows from empty to comprehensive. Convention questions disappear.
  • Memory files accumulate. Decisions are cached. Trade-offs are recorded. The project stops relitigating settled questions.
  • Hooks accumulate. Each one prevents a class of failure that occurred in a previous session. 84 hooks intercepting 15 event types, each one a scar from a past incident.
  • Skills accumulate. Repetitive workflows become one-command operations. The nightcheck that took an entire session to design now runs in 2 minutes.
  • Tests accumulate. The agent makes bolder changes because it can verify them immediately.
  • Handoff documents accumulate. Complex investigations persist across session boundaries.

What stays the same:

  • The model. Same weights. Same capabilities. Same tendency to drift off task, phantom-verify test results, and propose unnecessary abstractions.

The model’s failure modes are constant. The infrastructure’s ability to catch those failure modes grows with every session. Session 500 is better than session 1 not because the model improved, but because the infrastructure learned to compensate for the model’s constant weaknesses.
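A hook in this sense is a small, mechanical guard, not intelligence. As a hypothetical illustration in Claude Code's settings format (the PreToolUse event is real; the matcher and the guard script are invented here), one such scar-turned-hook might look like:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "python scripts/check_deploy_guard.py"
          }
        ]
      }
    ]
  }
}
```

The guard script runs before every shell command the agent attempts; if it exits non-zero, the action is blocked. One past incident, one permanent interception.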

The Investment Frame

If the model is not the variable, then model selection is not the primary investment decision. The primary investment is in context infrastructure.

A team that spends $200/month on Claude Opus and invests heavily in CLAUDE.md files, memory systems, hooks, skills, and test coverage will outperform a team that spends $200/month on Claude Opus with no infrastructure investment. The model cost is identical. The output quality diverges because the infrastructure diverges.

This reframes the productivity question. The question is not “which model should we use?” The question is “what have we built around the model that makes every session better than the last?”

The organizations I see struggling with AI productivity are not using the wrong model. They are starting every session from scratch. No convention document. No memory system. No hooks. No skills. No accumulated context. Every session is session 1, regardless of how many sessions came before.

The Model Will Improve

Models will continue improving. Claude Opus 5 will be better than Claude Opus 4.6. The improvement is real and valuable. But the improvement is additive, not multiplicative.

A model that is 20% better at code generation produces 20% better output on a bare project. The same model with 500 sessions of accumulated context produces output that is qualitatively different, not just quantitatively better. The context infrastructure does not add 20% to the model’s capability. It provides the diagnosis, the constraints, the verification criteria, and the operational history that the model cannot produce on its own regardless of how capable it is.

No model, however capable, can know that market_hub() loads all company_markets rows and paginates in Python unless something tells it. The handoff document tells it. The model reads and acts. The intelligence is distributed between the model (reading, reasoning, implementing) and the infrastructure (knowing, constraining, verifying).

Session 500

Session 500 looks like this: I state what I want in one sentence. The agent reads the CLAUDE.md and knows the conventions. It reads the memory files and knows the decisions. It reads the handoff and knows the diagnosis. It runs into a hook that prevents the same mistake another agent made three months ago. It checks its work against the test suite. It reports completion with evidence for every claim.

Session 1 looks like this: I explain the database schema, the routing conventions, the template inheritance, the cache layer, the deployment pipeline, and the testing patterns. The agent asks clarifying questions. It proposes an approach that violates three conventions. I correct it. It implements the fix. It reports “tests pass” without running pytest.

The model is the same in both sessions. The project is not.


FAQ

Doesn’t model quality still matter?

Yes. A stronger model reads context more effectively, reasons about trade-offs more accurately, and implements solutions more cleanly. Model quality sets the floor. Infrastructure raises the ceiling. On a mature project, the ceiling matters more than the floor.

Is this specific to coding agents?

No. Any AI workflow where the same task domain recurs across sessions benefits from accumulated context. Writing, research, analysis, customer support. The specific infrastructure differs (style guides instead of CLAUDE.md, knowledge bases instead of hooks), but the dynamic is the same: the project gets better because the context around the model accumulates.

What about multimodal models or reasoning models?

Same principle. A reasoning model that can think for 10 minutes about a problem still needs to know what problem to think about. The handoff document, the convention file, and the memory system provide the problem definition. The model provides the reasoning. Better reasoning on a well-defined problem produces better results than inferior reasoning, but better reasoning on an undefined problem produces better-sounding confusion.

How do I start if I have zero context infrastructure?

Write a CLAUDE.md file that describes your project conventions. That single file is the highest-leverage investment. Everything else compounds from there.2
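That first file does not need to be elaborate. A hypothetical starter, with every specific detail invented for illustration, might be no more than:

```markdown
# Project conventions

- Python 3.12, Flask, HTMX. All templates inherit from base.html.
- Database access goes through db/queries.py; no raw SQL in route handlers.
- Run `pytest -q` before claiming any test result, and paste the output.
- Deploy with `make deploy`; never push to main directly.
```

Four lines like these eliminate four categories of recurring correction, and every later session starts from them for free.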


Sources


  1. Blake Crosley, “Compound Context: Why AI Projects Get Better the Longer You Stay With Them,” blakecrosley.com, March 2026. 

  2. Anthropic, “Manage Claude’s memory,” Anthropic Documentation, 2026. 
