
The Forgetting Agent: Why Multi-Turn Conversations Collapse

Ninety minutes into building my deliberation system, the agent stopped referencing architecture it had discussed thirty minutes earlier. Session logs showed Claude had compressed away the module dependency graph to make room for new tool outputs. The agent continued writing code, but the code no longer reflected the cross-module contracts it had established in the first hour. Tests passed. Integration failed. The agent had forgotten its own design.

That failure cost me a full day of debugging. The research now explains why it happened.

TL;DR

Microsoft Research and Salesforce tested 15 LLMs across 200,000+ simulated conversations and found a 39% average performance drop from single-turn to multi-turn interaction.[1] The degradation starts in as few as two turns. Three independent mechanisms drive the collapse: context compression discards critical state, reasoning coherence fragments as token budgets shrink, and coordination between agents breaks down without shared ground truth. Longer context windows do not fix any of these. The Ralph loop pattern (fresh context per iteration with filesystem state) sidesteps compression loss but introduces its own costs. Below: the research, the three mechanisms, detection methods you can run today, and a protocol for multi-turn resilience.


The 90-Minute Cliff

My context-is-architecture post documented a seven-layer context system spanning 650 files. Building that system required extended coding sessions where the agent needed to hold complex architectural state: module boundaries, dependency chains, hook execution order, and cross-file contracts.

I measured session quality across 30 Ralph loop iterations in January and February 2026. The data showed a consistent pattern:

Minutes 0-30:   Precise multi-file edits, correct cross-references
Minutes 30-60:  Occasional missed imports, still recoverable
Minutes 60-90:  Single-file tunnel vision, loses architectural context
Minutes 90+:    Repetitive attempts, contradicts earlier decisions

The quality cliff appeared regardless of task type. Long refactoring sessions, test suite builds, and documentation passes all degraded on the same curve. What varied was the severity: tasks requiring more cross-file state hit the cliff harder than isolated single-file work.

I attributed the pattern to context window pressure and built the Ralph loop to work around it. Spawn a fresh Claude instance per iteration; inject state from the filesystem; never rely on conversational memory beyond one iteration. The pattern works. But the MSR/Salesforce study published in May 2025 revealed that the problem is more structural than context window size alone.


Three Mechanisms of Multi-Turn Collapse

Laban et al. decomposed multi-turn degradation into independent mechanisms, and the distinction matters because each requires a structurally different intervention.[1]

Mechanism 1: Context Compression

Every AI conversation operates within a finite token budget. As the conversation grows, the system compresses earlier turns to make room for new content. The compression is lossy. Architectural decisions documented in turn 3 may not survive to turn 15.

I caught this directly during the deliberation system build. The agent established a module dependency graph in the first 20 minutes: deliberation_engine.py depends on consensus_calculator.py, which depends on vote_aggregator.py. By minute 75, the agent had compressed away the dependency chain and wrote an import cycle. The code was syntactically valid. The circular import caused a runtime crash.

Detection: Track the ratio of cross-file references in agent output over time. When the agent stops referencing files it discussed earlier, compression has likely discarded the relevant context.

```shell
# Count unique file references in commit messages for the current window.
# Run once per 30-minute window; a declining count signals compression loss.
git log --since="30 minutes ago" --pretty=format:"%s" | \
  grep -oP '[a-z_]+\.(py|js|ts)' | sort -u | wc -l
```

Mechanism 2: Reasoning Coherence Loss

The MSR/Salesforce study found that multi-turn degradation decomposes into two components: a minor loss in aptitude and a significant increase in unreliability.[1] Aptitude measures whether the model can produce a correct answer at all. Reliability measures whether it does so consistently.

In single-turn mode, models achieved roughly 90% average performance across six generation tasks. In multi-turn mode, performance dropped to approximately 65%: a 25-point absolute decline. The critical finding: “When LLMs take a wrong turn in multi-turn conversation, they get lost and do not recover.”[1]

Reasoning coherence loss manifests as the agent contradicting its own earlier decisions. Not because the system compressed the context away (mechanism 1), but because the model’s reasoning chain fragmented across turns. Each turn’s reasoning is locally sound but globally inconsistent.

Du et al.’s work on cognitive decision routing addresses this mechanism directly.[2] Inspired by Kahneman’s dual-process theory (fast intuitive responses vs. slow deliberate reasoning), their system adapts reasoning depth based on task demands. The insight: not every agent turn requires the same depth of reasoning, and applying uniform depth wastes budget on trivial steps while under-investing in critical decisions.

Detection: Look for contradictions between early and late session output. If the agent advocates approach A in minute 15 and approach B in minute 60 without acknowledging the change, coherence has degraded.

Mechanism 3: Coordination Failure

Multi-agent systems compound multi-turn degradation with coordination failure. When two or more agents collaborate on a task, each agent’s context degrades independently. An agent that has forgotten a shared constraint can’t coordinate around it.

Bhardwaj et al.’s Agent Context Protocols address this by establishing structured communication channels between agents.[3] Their framework achieved 28.3% accuracy on AssistantBench by defining explicit protocols for context sharing, error propagation, and state synchronization. Krishnan’s Unified Agent Communication Protocol extends this with zero-trust security boundaries between agents.[4]

I encountered coordination failure during a 10-agent deliberation where three reviewers evaluated the same code change. By the fourth review round, the agents had diverged on what the “current version” of the code looked like. Each agent’s context held a different snapshot. Their reviews contradicted each other not because they disagreed, but because they reviewed different code.

Detection: In multi-agent workflows, compare the state assumptions each agent holds. If agents reference different versions of the same artifact, coordination has failed.
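A minimal sketch of that comparison, assuming each agent works in its own workspace directory (the paths and helper name are hypothetical): hash the shared artifact in every workspace and flag any mismatch.

```shell
# Hypothetical helper: compare one shared artifact across agent workspaces.
# Any hash mismatch means the agents are reviewing different code.
detect_divergence() {
  artifact="$1"; shift
  first_hash=""
  for workspace in "$@"; do
    hash=$(sha256sum "$workspace/$artifact" | cut -d' ' -f1)
    if [ -z "$first_hash" ]; then
      first_hash="$hash"
    elif [ "$hash" != "$first_hash" ]; then
      echo "DIVERGED: $workspace/$artifact"
      return 1
    fi
  done
  echo "IN SYNC: $artifact"
}
```

Run it before each review round, e.g. `detect_divergence engine.py reviewer_1 reviewer_2 reviewer_3`, and pause the round on any `DIVERGED` output.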


Why Longer Context Windows Do Not Fix It

The intuitive response to multi-turn degradation is “give the model more tokens.” The MSR/Salesforce study disproves this intuition with a clever experimental design.

They tested a “Concat” condition: present the full multi-turn conversation as a single concatenated prompt. The Concat condition achieved 95.1% of single-turn performance.[1] The context length was identical to the multi-turn condition. The information content was identical. The only difference was the interaction structure: one turn vs. many turns.

The 39% degradation is not a context length problem. Doubling the context window from 200K to 400K tokens would not eliminate the degradation, because the degradation comes from the turn boundaries themselves, not from running out of space.

The Concat finding matches my production data. Claude operates with roughly 200,000 tokens of context. My context window management measurements showed that the longest single-session runs (3+ hours, heavy tool use) consume approximately 180,000 tokens before compaction triggers. But quality degrades well before the window fills. The 90-minute cliff occurs at roughly 60-70% context utilization, not at the boundary. The resulting cognitive debt compounds as the agent produces code faster than a developer can verify it.

Du et al.’s cognitive decision routing reframes the problem: the issue is not how many tokens the model can hold, but how efficiently the model allocates reasoning resources across those tokens.[2] Their system achieved a 34% reduction in computational costs with a 23% improvement in consistency by routing simple decisions through fast reasoning and complex decisions through deliberate reasoning.


The Fresh-Context Solution (And Its Costs)

The Ralph loop solves mechanism 1 (compression) and partially solves mechanism 2 (coherence) by never running a conversation long enough for either to manifest. Each iteration spawns a fresh Claude instance with a full 200K-token context. State persists through the filesystem, not through conversational memory.

```shell
# Simplified Ralph loop iteration (from jiro-artisan.sh)
while [ "$stories_remaining" -gt 0 ]; do
  # Orient: inject current state from the filesystem
  state=$(cat jiro.state.json)
  progress=$(cat jiro.progress.json)
  git_state=$(git diff --stat HEAD)

  # Spawn a fresh context with the injected state as a single prompt
  claude --print "State: $state
Progress: $progress
Git: $git_state
Task: implement next story from prd.json"

  # Update filesystem state (including $stories_remaining) from agent
  # output, so the loop terminates when the story queue empties
  update_state_from_output
done
```

Each iteration gets the full context budget. No compression artifacts from previous turns. No coherence fragments from earlier reasoning chains. The filesystem serves as the agent’s external memory: jiro.state.json tracks the current story, jiro.progress.json records completed work across iterations, and git diff provides ground truth about what actually changed.
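To make the external-memory idea concrete, here is an illustrative sketch of those two state files; the field names are examples for exposition, not a fixed schema.

```shell
# Illustrative state files for the orient step (field names are examples,
# not a fixed schema).
cat > jiro.state.json << 'EOF'
{
  "current_story": "S-014",
  "stories_remaining": 3,
  "blocked": false
}
EOF

cat > jiro.progress.json << 'EOF'
{
  "completed": ["S-011", "S-012", "S-013"],
  "notes": "vote_aggregator.py refactor merged; RRF chosen for aggregation"
}
EOF
```

The next iteration reads these files cold and continues from `current_story`, with no dependence on the previous conversation.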

Zhang, Kraska, and Khattab’s Recursive Language Models take a complementary approach: instead of spawning fresh instances, the model offloads context to a Python REPL environment and reasons over context in code rather than token space.[5] RLM-Qwen3-8B outperformed its baseline by 28.3% on long-context tasks by treating long prompts as external data structures rather than internal memory. Where the Ralph loop externalizes state to files, RLMs externalize state to code. Both patterns solve the same compression problem through different mechanisms.

Nanda et al.’s Wink system addresses what happens when degradation is already underway.[6] Analyzing over 10,000 real-world agent trajectories, they found that misbehaviors (specification drift, repetitive loops, tool call failures) occur in approximately 30% of all sessions. Wink observes the agent’s trajectory and provides targeted course-correction, resolving 90% of single-intervention misbehaviors. The detection is real-time: Wink identifies degradation patterns as they emerge rather than waiting for a failure to propagate through the codebase.

The Costs

Fresh-context iteration is not free. Three costs:

1. Orient overhead. Every iteration spends tokens re-reading state that the previous iteration already understood. My measurements show 15-20% of each iteration’s token budget goes to the orient step: reading state files, scanning recent git history, rebuilding enough context to continue. A 200K-token iteration starts with approximately 160-170K tokens of usable capacity.

2. Lost implicit knowledge. Conversational context carries implicit knowledge that filesystem state cannot capture: the reasoning behind a design choice, the alternatives considered and rejected, the nuance of why approach A was chosen over approach B. The orient step injects facts (what changed, what’s next). The reasoning (why) evaporates between iterations.

3. Coordination cost. If multiple Ralph loops run concurrently (parallel story implementation), each loop maintains independent state. Coordinating between loops requires explicit merge logic and conflict resolution that a single long session handles implicitly.

The cost-benefit calculation is clear: for sessions under 60 minutes, a single conversation is more efficient. Beyond 90 minutes, the fresh-context pattern produces higher-quality output despite the orient overhead. The crossover point depends on task complexity: high cross-file state pushes the crossover earlier; isolated single-file work pushes it later.


Measuring Degradation Before It Hits

You don’t need to wait for a production failure to detect multi-turn degradation. Three methods, from simplest to most thorough:

Method 1: Context Pressure Monitoring

Track context utilization in real time. My context-pressure.sh hook runs after every tool call and warns when utilization exceeds 60%:

```shell
# Simplified context pressure check (approximates tokens as bytes / 4)
context_used=$(wc -c < "$CONVERSATION_LOG" | awk '{print int($1/4)}')
context_max=200000
utilization=$(( context_used * 100 / context_max ))

if [ "$utilization" -gt 60 ]; then
  echo "[WARN] Context at ${utilization}% — quality degradation likely"
fi

if [ "$utilization" -gt 80 ]; then
  echo "[CRITICAL] Context at ${utilization}% — start new session"
fi
```

Method 2: Cross-Reference Tracking

Monitor how many distinct files the agent references per output. A declining trend signals compression loss:

```shell
# Track file reference diversity across the last five commits
# (--format overrides --oneline, so only --format is needed)
for commit in $(git log -5 --format="%H"); do
  files=$(git diff-tree --no-commit-id --name-only -r "$commit" | wc -l)
  echo "$commit: $files files touched"
done
```

Method 3: Contradiction Detection

Compare the agent’s architectural statements across time. If the agent claims “module A depends on module B” at minute 20 and “module A has no external dependencies” at minute 70, coherence has degraded. The automated version: diff the agent’s EXPLAIN statements (or design comments) between early and late session outputs.
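A rough automated version, assuming you capture the agent's design statements per time window to plain-text files (the file paths and the "depends on" phrasing are illustrative): list dependency claims made early that have vanished from later output.

```shell
# Hypothetical check: dependency claims stated early but absent later are
# candidates for silent coherence loss.
contradiction_check() {
  early="$1"; late="$2"
  claims=$(mktemp -d)
  # Normalize to one "X depends on Y" claim per line
  grep -o '[a-z_]*\.py depends on [a-z_]*\.py' "$early" | sort -u > "$claims/early"
  grep -o '[a-z_]*\.py depends on [a-z_]*\.py' "$late"  | sort -u > "$claims/late"
  # Lines unique to the early file = claims the agent silently dropped
  comm -23 "$claims/early" "$claims/late"
}
```

Any output is a prompt to ask the agent directly: "You stated X earlier; is that still true?" A genuine design change gets acknowledged; coherence loss gets a contradiction.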


A Protocol for Multi-Turn Resilience

Three tiers, each addressing a different mechanism. Start with Tier 1 and add layers as needed.

| Tier | Mechanism Addressed | Intervention | Implementation Cost |
|------|---------------------|--------------|---------------------|
| 1 | Compression | Checkpoint state to filesystem every 30 minutes | Low: 5-minute setup |
| 2 | Coherence | Fresh-context iterations after 60-90 minutes | Medium: requires state serialization |
| 3 | Coordination | Explicit state synchronization between agents | High: requires protocol design |

Tier 1: State Checkpointing

Every 30 minutes, serialize the agent’s current architectural understanding to a file. Not the full conversation, but the structural state: what modules exist, how they connect, what constraints apply.

```shell
# Pre-compaction checkpoint (runs before Claude compresses context)
mkdir -p .claude/checkpoints
cat > ".claude/checkpoints/$(date +%s).md" << 'CHECKPOINT'
## Architectural State
- Module graph: [current understanding]
- Active constraints: [list]
- Design decisions made this session: [list with reasoning]
CHECKPOINT
```

If the agent’s behavior degrades, restore from the checkpoint rather than continuing with degraded context.
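A minimal restore sketch, assuming the checkpoint layout above: pick the newest checkpoint and feed it to a fresh session rather than continuing the degraded one.

```shell
# Pick the newest checkpoint by modification time (empty output if none exist)
latest_checkpoint() {
  ls -t .claude/checkpoints/*.md 2>/dev/null | head -1
}

latest=$(latest_checkpoint)
if [ -n "$latest" ]; then
  # Start fresh and inject the checkpointed architectural state
  claude --print "Restore this architectural state, then continue: $(cat "$latest")"
fi
```

The point of restoring into a fresh session rather than the current one: the degraded context stays degraded, and appending a checkpoint to it only adds pressure.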

Tier 2: Fresh-Context Iterations

For sessions exceeding 60 minutes, switch to the Ralph loop pattern. The key is the orient step: inject enough state for the new context to continue productively without re-reading the entire conversation history.

Required state for the orient step:

1. Current task and acceptance criteria
2. Files modified in the previous iteration (from git diff)
3. Architectural decisions and their reasoning
4. Known constraints and failure modes
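Assembling those items into an orient prompt might look like this sketch; the state file names follow the loop example above, while `decisions.md` and `constraints.md` are hypothetical places to persist reasoning and constraints between iterations.

```shell
# Build the orient prompt from the four state sources; each falls back to a
# placeholder so a first iteration still gets a well-formed prompt.
build_orient_prompt() {
  task=$(cat jiro.state.json 2>/dev/null || echo "unknown")
  changed=$(git diff --stat HEAD~1 2>/dev/null || echo "no previous iteration")
  decisions=$(cat decisions.md 2>/dev/null || echo "none recorded")
  constraints=$(cat constraints.md 2>/dev/null || echo "none recorded")
  printf 'Task: %s\nChanged last iteration: %s\nDecisions: %s\nConstraints: %s\n' \
    "$task" "$changed" "$decisions" "$constraints"
}
```

The fresh instance then starts with `claude --print "$(build_orient_prompt)"` instead of an empty prompt.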

Tier 3: Agent Coordination Protocols

For multi-agent workflows, establish a shared state document that all agents read and write. The document serves as ground truth, preventing the divergence I observed during deliberation reviews.

```json
{
  "version": 7,
  "last_updated": "2026-02-22T14:30:00Z",
  "active_files": ["engine.py", "calculator.py", "aggregator.py"],
  "constraints": [
    "No circular imports between modules",
    "All public functions require type annotations"
  ],
  "decisions": [
    {"decision": "Use RRF for vote aggregation", "reasoning": "Handles rank-only data", "turn": 3}
  ]
}
```

Every agent reads this document at the start of its turn and updates it at the end. Conflicts trigger a coordination pause rather than silent divergence.
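The read-check-write cycle can be sketched with plain shell, using the version field for optimistic concurrency. This uses sed to avoid a jq dependency; a production version would use real JSON tooling and file locking.

```shell
# Hedged sketch: compare-and-bump on the shared state's version field.
# Assumes shared_state.json in the current directory and GNU sed.
update_shared_state() {
  expected=$1
  current=$(sed -n 's/.*"version": *\([0-9]*\).*/\1/p' shared_state.json | head -1)
  if [ "$current" != "$expected" ]; then
    # Another agent wrote first: pause for coordination instead of clobbering
    echo "CONFLICT: expected v$expected, found v$current"
    return 1
  fi
  sed -i "s/\"version\": *$current/\"version\": $((current + 1))/" shared_state.json
  echo "OK: now v$((current + 1))"
}
```

An agent that read version 7 calls `update_shared_state 7` before writing; if a sibling agent has already moved the document to version 8, the call fails and triggers the coordination pause instead of silent divergence.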


Key Takeaways

  • Multi-turn degradation is structural, not a context length problem. The MSR/Salesforce study showed 39% degradation even when context length remained constant. Turn boundaries, not token limits, drive the collapse.[1]
  • Three independent mechanisms require three different interventions. Compression loss needs state checkpointing. Coherence loss needs fresh-context iteration. Coordination failure needs shared state protocols.
  • The 90-minute cliff is real and measurable. Track context utilization, cross-reference diversity, and architectural contradictions to detect degradation before production failures surface.
  • Fresh-context iteration works but costs 15-20% overhead. The Ralph loop pattern trades orient overhead for full context budgets per iteration. The trade-off favors fresh context beyond 60-90 minutes.
  • Adaptive reasoning allocation outperforms uniform depth. Du et al.’s cognitive decision routing achieved 34% cost reduction with 23% consistency improvement by matching reasoning depth to task demands.[2]

FAQ

Why do LLMs degrade in multi-turn conversations?

LLMs degrade in multi-turn conversations through three independent mechanisms. Context compression discards earlier information to fit new content within the token budget. Reasoning coherence fragments as the model's chain of thought spans multiple turns, producing locally sound but globally inconsistent outputs. Coordination between multiple agents fails when each agent's context degrades independently. Microsoft Research and Salesforce documented a 39% average performance drop across 15 LLMs and 200,000+ conversations, with degradation beginning in as few as two turns.

Do longer context windows fix multi-turn degradation?

Longer context windows do not fix multi-turn degradation. The MSR/Salesforce study tested a "Concat" condition where the full conversation was presented as a single prompt, achieving 95.1% of single-turn performance. The same content split across multiple turns dropped to roughly 65%. The degradation comes from turn boundaries themselves, not from context length limitations. Doubling the context window would not eliminate the 39% performance gap.

What is the fresh-context iteration pattern for AI agents?

Fresh-context iteration spawns a new AI instance for each work cycle rather than continuing a single long conversation. State persists through external storage (filesystem, database) rather than conversational memory. Each iteration reads the current state, performs work, and writes updated state back. The pattern eliminates compression artifacts and coherence fragmentation at the cost of 15-20% overhead for the "orient" step where the new instance reads and processes the external state. Production data shows the pattern outperforms single-session approaches for tasks exceeding 60-90 minutes.

How do you detect multi-turn degradation before it causes failures?

Three detection methods work in practice. Context pressure monitoring tracks token utilization and warns when it exceeds 60% (quality degradation likely) or 80% (start a new session). Cross-reference tracking monitors how many distinct files the agent references per output; a declining trend signals compression loss. Contradiction detection compares the agent's architectural claims across time; if the agent's understanding of module dependencies changes between early and late session outputs without an explicit decision, coherence has degraded.


References


  1. Laban, Philippe, et al., “LLMs Get Lost In Multi-Turn Conversation,” arXiv:2505.06120, May 2025. Microsoft Research and Salesforce Research. Tested 15 LLMs across 8 model families on 200,000+ simulated conversations.

  2. Du, Y., et al., “Cognitive Decision Routing in Large Language Models: When to Think Fast, When to Think Slow,” arXiv:2508.16636, August 2025. Achieved a 34% reduction in computational costs with a 23% improvement in consistency.

  3. Bhardwaj, et al., “Agent Context Protocols Enhance Collective Inference,” arXiv:2505.14569, May 2025. Introduces structured communication protocols for multi-agent coordination, achieving 28.3% accuracy on AssistantBench.

  4. Krishnan, “Beyond Context Sharing: A Unified Agent Communication Protocol,” arXiv:2602.15055, February 2026. Proposes standardized agent-to-agent orchestration with zero-trust security boundaries.

  5. Zhang, Alex L., Tim Kraska, and Omar Khattab, “Recursive Language Models,” arXiv:2512.24601, December 2025. MIT CSAIL. RLM-Qwen3-8B outperforms its baseline by 28.3% on long-context tasks by offloading context to a Python REPL environment.

  6. Nanda, Rahul, et al., “Wink: Recovering from Misbehaviors in Coding Agents,” arXiv:2602.17037, February 2026. Misbehaviors occur in approximately 30% of agent trajectories; Wink resolves 90% of single-intervention cases.

  7. Author’s session quality measurements across 30 Ralph loop iterations, January-February 2026. Data collected from jiro.progress.json session logs and git diff --stat output per iteration. Orient overhead measured by token count of state injection vs. total iteration budget.

  8. Author’s context-is-architecture system. Seven-layer hierarchy across 650 files, documented in Context Engineering Is Architecture.

  9. Author’s multi-agent deliberation system. 10-agent consensus with 3-reviewer autonomous code review, documented in The Deliberation System.
