
Your Agent Has Memory You Didn't Write


I spent most of today writing a practitioner’s reference for Hermes Agent. One of the load-bearing sections is about SOUL.md — the file where you pin your agent’s identity. Voice, tone, preferences, behavioral guardrails. The whole premise of the section is that you put the identity there, the agent reads it at the top of every system prompt, and the agent behaves accordingly. Explicit memory. Declarative. Auditable. Version-controlled. The right kind of memory, the kind a serious practitioner should care about.

A paper landed on arXiv yesterday that I caught in a signal scan tonight, and reading it has made me hold the SOUL.md premise more loosely than I did earlier today.1

The paper is called ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models.1 The authors describe it as the first systematic benchmark for implicit memory in LLMs — the memory that (in their framing) shapes what an agent automatically enacts, as distinct from the explicit memory that shapes what it consciously recalls.1 Top performers score below 66%.1 The authors also report a “dramatic” asymmetry inside that score,1 which I’ll unpack with appropriate hedging further down.

TL;DR

Existing memory benchmarks measure explicit recall — given a fact you told the model, can it retrieve it. ImplicitMemBench measures a different memory system: the one that (per the authors) shapes automatic behavior “without conscious retrieval,” drawn from standard cognitive-science constructs (procedural memory, priming, classical conditioning).1 On a 300-item first-attempt-scoring benchmark, no model the authors tested exceeded 66% overall: DeepSeek-R1 scored 65.3%, Qwen3-32B 64.1%, GPT-5 63.0%, and the authors describe top performers as “far below human baselines.”1 The headline number isn’t the whole story — the abstract also reports a “dramatic” asymmetry: 17.6% on inhibition versus 75.0% on preference, a ~4× gap, framed as a “universal bottleneck” the authors say needs “architectural innovations beyond parameter scaling.”1 I’m reading the asymmetry — with the hedge that the abstract doesn’t publish the full methodology behind those two numbers — as consistent with a folklore failure mode I’ve been watching in agent work: systems that reinforce recently-seen preferences fast and fail to unlearn recently-seen failures. If that reading is right, it reframes the conversation about agent identity, safety, and skill evolution from “what did you put in the prompt?” to “what might the session be quietly shaping that your explicit pins can’t audit?” The reframing is my extension of the paper, not the paper’s own claim.

Key Takeaways

The bullets below are my reading of what the paper’s findings imply for practitioners, not claims the paper itself makes. The paper tests 17 LLMs on a 300-item cognitive-science benchmark; it does not evaluate production agent harnesses or prompting strategies. I label each takeaway accordingly.

  • Extension: pinning identity in SOUL.md, AGENTS.md, CLAUDE.md, system prompts, or persistent memory files is explicit declarative memory, which existing benchmarks already show models do well. ImplicitMemBench measures a different memory system entirely, and models score below 66% on it.1 The practitioner implication — that explicit identity pins may not propagate to first-attempt automatic behavior — is my inference, not the paper’s.
  • Extension: the 17.6% vs 75.0% asymmetry, if it generalizes beyond the benchmark, would predict an agent that absorbs recently-seen preferences fast and is slow to stop repeating recently-seen failures. The paper reports the two numbers and labels them “dramatic” and “universal,”1 but does not publish per-item methodology for how “preference” and “inhibition” were operationalized, and does not test this pattern in agent harnesses. The production-behavior reading is mine.
  • Extension: every token that lands in the context window from a tool call, MCP response, scraped web page, or prompt-injection attempt is in-context behavioral influence — not training in any weight-update sense, but influence on the next first-attempt response that the explicit prompt layer cannot cleanly audit. The paper does not make this claim directly; I’m extending the implicit-memory framing to context-window content.
  • Paper claim: the 17-model evaluation reveals “severe limitations,” “dramatic asymmetries,” and “universal bottlenecks requiring architectural innovations beyond parameter scaling.”1 The authors frame the gap as architectural. I’m reading that as weak evidence against “more prompt engineering will fix this,” but the paper does not specifically test prompting mitigations, so treat that reading as my hypothesis, not theirs.

What the Paper Measures

The paper’s framing is that existing memory benchmarks for LLM agents “evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval.”1 The gap they identify: “effective assistants must automatically apply learned procedures or avoid failed actions without explicit reminders.”1 If the only way your agent can avoid a mistake is for you to re-tell it not to make the mistake every turn, you are not building on implicit memory; you are paying the explicit-memory cost on every request.

ImplicitMemBench tests three constructs drawn directly from cognitive science accounts of non-declarative memory, quoted from the abstract:1

  1. Procedural Memory — “one-shot skill acquisition after interference.” Can the model, after being shown how to do something once, actually execute it again later when other instructions have intervened? This is the memory system that lets a human learn to ride a bike: you don’t recall how to ride, you just ride, even after years away from the bike.
  2. Priming — “theme-driven bias via paired experimental/control instances.” Does seeing one class of thing make the model more likely to produce that class of thing on the next unrelated task, without the model being aware that the priming happened?
  3. Classical Conditioning — “Conditioned Stimulus–Unconditioned Stimulus (CS–US) associations shaping first decisions.” If the model has been exposed to a stimulus-response pairing, does that pairing show up as a bias on a totally new task where neither the CS nor the US is the point of the question?

The authors use a 300-item suite under a unified “Learning/Priming-Interfere-Test protocol with first-attempt scoring.”1 First-attempt scoring is important. A model that can self-correct after being told it got something wrong is fine — but the research question here is whether the memory shaped the automatic first response. If the first response is wrong and the correction only happens after explicit feedback, the implicit memory system (as the paper defines it) failed on that item. The authors summarize their contribution with one line I want to lift directly: the benchmark “reframes evaluation from ‘what agents recall’ to ‘what they automatically enact’.”1
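To make the protocol concrete, here is a minimal sketch of what first-attempt scoring under a Learning/Interfere/Test loop could look like. The phase names follow the abstract; every type, function, and parameter below is my illustration, not the paper's actual harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    """One invented benchmark item: learn, interfere, then probe."""
    learn_prompt: str              # e.g. a one-shot skill demonstration
    interference: list[str]        # unrelated turns between learning and test
    test_prompt: str               # the probe whose FIRST response is scored
    passes: Callable[[str], bool]  # predicate over the first response

def run_item(model: Callable[[list[str]], str], item: Item) -> bool:
    """Score only the first attempt: one model call, no feedback, no retries."""
    history = [item.learn_prompt, *item.interference, item.test_prompt]
    return item.passes(model(history))  # a later self-correction would not count

def score(model: Callable[[list[str]], str], items: list[Item]) -> float:
    """Fraction of items whose first attempt passed."""
    return sum(run_item(model, it) for it in items) / len(items)
```

The design point the sketch is meant to surface: `run_item` makes exactly one model call, so a model that would have fixed its answer after feedback still fails the item.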

The Results

The headline number: “no model exceeds 66% overall.”1

  • DeepSeek-R1 — 65.3%
  • Qwen3-32B — 64.1%
  • GPT-5 — 63.0%

The top performers above are described as “far below human baselines,” though the abstract does not publish the exact human baseline number or a full per-model ranking.1 Seventeen models total are evaluated in the paper.1

The headline hides the sub-result. The authors write that “analysis uncovers dramatic asymmetries (inhibition 17.6% vs. preference 75.0%) and universal bottlenecks requiring architectural innovations beyond parameter scaling.”1 I want to be careful here about what the numbers mean — the abstract does not give a full methodology breakdown for how those two numbers were computed, so my gloss of them is an inference from the abstract’s wording, not a read of the paper’s internal definitions. With that hedge flagged:

  • Preference: 75.0% (paper’s number). My gloss, pending the full paper: this number looks consistent with models being relatively good at showing that they’ve been implicitly pulled toward a stimulus — priming and CS–US pairings that bias behavior in a particular direction land correctly about three-quarters of the time.
  • Inhibition: 17.6% (paper’s number). My gloss, pending the full paper: this number looks consistent with models being dramatically worse at showing they’ve been implicitly pushed away from a stimulus — the “don’t do that again” signal landing correctly fewer than one time in five. I’m inferring the behavioral meaning from the word “inhibition” and the paper’s framing of classical conditioning; the abstract does not spell out the operationalization.

The authors explicitly label the asymmetry “dramatic” and attribute it to “universal bottlenecks,”1 and the word universal matters: the authors present this as a pattern across their 17-model evaluation, not a one-model artifact. I am not going to claim that the bottleneck is a “prompting problem” or “not a prompting problem” — the paper doesn’t test prompting as a mitigation, and saying either would go past what the abstract supports.
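For concreteness, the abstract's reported numbers can be transcribed into a quick sanity check. The only computed value here is the ratio, which is where the rough 4x shorthand earlier in this post comes from; everything else is transcription.

```python
# Top-performer overall scores and the asymmetry pair, as reported in the abstract.
overall = {"DeepSeek-R1": 65.3, "Qwen3-32B": 64.1, "GPT-5": 63.0}
inhibition, preference = 17.6, 75.0

assert max(overall.values()) < 66.0  # "no model exceeds 66% overall"
gap = preference / inhibition        # 75.0 / 17.6, roughly 4.26
```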

What the Asymmetry Actually Means

I want to be precise about what I’m claiming here, because this is the part where it’s tempting to over-read a benchmark.

What the paper shows. On a 300-item cognitively grounded benchmark scored on first-attempt answers, LLMs are dramatically worse at demonstrating implicit inhibition than implicit preference, by a factor of roughly four, across every model tested. The authors call this a universal bottleneck that cannot be fixed by scaling.

What I’m claiming — separately from the paper. This asymmetry pattern maps onto a failure mode I have been watching in my own agent work for months, without previously having a name for it. Agent harnesses (in my experience) seem surprisingly good at absorbing context that points toward a preferred style, tool, or approach — the agent’s behavior drifts toward whatever you fed it most recently, fast. They seem surprisingly bad at not repeating a failure they just watched happen — the agent tries the same broken command, the same wrong tool, the same stale path, even after those failed in the same session. That’s folklore, not a measurement — it’s my practitioner impression, not a controlled study. The ImplicitMemBench numbers are consistent with that folklore, which is why I care about the paper. They do not, by themselves, validate the folklore — and I do not want to claim the paper gives my folklore “a number” when the paper measured something tighter and more controlled than anything I’ve been observing.

What I’m not claiming. I am not claiming that ImplicitMemBench specifically measured agent-harness behavior or production Claude Code / Cursor / Codex workflows. It didn’t. It measured 17 models against a structured cognitive-science protocol. The mapping from benchmark to production behavior is my extension, labeled as such, and I don’t want anyone reading this to think the paper made that claim for me.

With those labels in place: the distinction the benchmark is drawing — between explicit recall of an instruction and automatic first-attempt behavior under priming/conditioning — is the distinction I want my own agent work to start taking seriously. You can tell the agent “don’t do X” and the explicit recall will likely work — it can repeat “don’t do X” back to you when asked. What ImplicitMemBench is measuring is a different thing: does the agent automatically not do X on the next first-attempt decision, in the absence of any explicit reminder? I don’t know whether production agent harnesses inherit the benchmark’s 17.6% aggregate inhibition number on first-attempt behavior in the wild — that mapping is untested, and I am not claiming it. I’m claiming something weaker: the distinction between “can recall the rule” and “automatically enacts the rule” is sharper than I had been treating it, and the paper’s results are part of why.

The SOUL.md Illusion

The Hermes guide I was writing today treats SOUL.md as the agent’s primary identity pin. Slot #1 in every system prompt. Tone, voice, guardrails. The guide makes a version of the argument that every persistent-memory system for agents has made for the last two years: if you put the identity in the right declarative memory file, the agent’s behavior stays aligned with it.

That argument is not wrong, but ImplicitMemBench is giving me a reason to be less confident about how completely it holds. SOUL.md is explicit declarative memory — the memory system existing benchmarks already measure and models already do well on. Models can recall its contents on demand; that’s the easy part. The harder question, and the one I don’t think SOUL.md answers: does the explicit pin meaningfully override the implicit priming, conditioning, and first-attempt bias that build up as a session fills with tool outputs, retrieved documents, prior assistant turns, user corrections, and everything else that shapes first-attempt behavior without any retrieval step? I don’t know. The paper doesn’t test SOUL.md or any equivalent identity-pin file, and I don’t want to claim it answers that question for me.

Here’s the worry, framed as a hypothesis rather than a finding. If you pin an identity in SOUL.md that says “be terse and factual,” and then the session fills up with a long, narrative-style conversation thread from the user, the implicit-memory framing predicts that the first-attempt behavior on the next turn should be shaped partly by the priming, even while the explicit pin still holds on recall. Whether the priming actually wins on average in production — I can’t prove that from this paper, and I’m not going to try. The SOUL.md illusion, as I’m naming it: the possibility that you pinned the recall of the identity rather than the automatic enactment of it, and those two things are not the same.

I’m not saying don’t write SOUL.md. I’m still going to write it — and the Hermes guide will still recommend it — because explicit declarative memory is load-bearing for the things it’s good at. What I am saying, labeled clearly as my own extrapolation: if you are building anything that depends on the agent not repeating a mistake, not drifting toward a recently-seen style, not being pulled off-task by a priming signal you didn’t intend, I would not bet the reliability budget on SOUL.md alone, and I would not assume that making SOUL.md longer or more specific solves it. The paper uses the phrase “architectural innovations beyond parameter scaling,”1 which I read — cautiously — as weak evidence that prompt-engineering mitigations will not close the gap the benchmark measures. The paper itself does not test prompt-engineering mitigations, so I cannot say it proves they fail; I can only say it does not give me confidence that they’ll work.

What the Paper Doesn’t Say (and What I’m Adding)

The paper is a benchmark paper. It measures a gap, it quantifies it, it argues the gap is architectural. It does not prescribe specific harness-level mitigations or claim anything about specific production agent systems. Everything in this section is my framing, not the paper’s.

Implication 1: every token in the context window is in-context behavioral influence. If the implicit memory framing holds outside the benchmark — and I’m speculating here, not reporting — every token landing in the context window from a tool call, a retrieved document, or an intermediary response is shaping the next turn’s first-attempt behavior in ways that reading the explicit prompt cannot cleanly audit. I’ve previously written about the silent egress attack surface (untrusted tool outputs carrying injected instructions) and your agent having a middleman you didn’t vet (untrusted LLM API routers between your client and the model). Neither of those posts claimed implicit memory as the causal mechanism — they claimed prompt injection and supply-chain compromise as the mechanisms. ImplicitMemBench offers a possible additional lens on why those attacks work the way they do: even if the hostile tool output or the compromised router never explicitly “tells” the agent what to do, the contents of what it returns could be priming the agent’s next decision. That’s a hypothesis ImplicitMemBench is consistent with, not a finding the paper reports.

Implication 2: session length might be a reliability hazard, not just a cost hazard. The folklore observation is that agents get worse over long sessions and the folklore explanation is context-window pressure. ImplicitMemBench isn’t a session-length study at all — it’s a 300-item first-attempt-scoring benchmark under a Learning/Priming-Interfere-Test protocol,1 which measures a different thing than “what happens over 30 turns in a production session.” I don’t want to pretend it maps directly onto production sessions. What I’m suggesting — as a hypothesis — is that the mechanism the paper names (implicit priming and classical conditioning landing in first-attempt decisions without retrieval) is a candidate alternative explanation for the folklore drift, and it’s worth taking seriously even though the paper doesn’t test it in that frame. My operational rule in the meantime: run sessions shorter than your context window allows, not as long as it permits. That’s cheap insurance against whatever the real mechanism turns out to be.

Implication 3: the “static skills are dead skills” argument needs a footnote. I wrote Static Skills Are Dead Skills earlier this week arguing that skills stop improving the moment they ship unless you build a trajectory feedback loop. That argument assumed the failure mode was absence — absence of aggregation, absence of a pattern detector, absence of an evolver. Reading ImplicitMemBench against that earlier post, I want to flag a possible second failure mode layered on top: even with trajectory-driven skill updates, the update landing in the skill file (explicit declarative memory) might not propagate cleanly to first-attempt automatic behavior if first-attempt behavior is being driven by something that operates closer to the implicit-memory layer. I don’t know that it is — the paper doesn’t test skill updates — but it’s a concern I didn’t have when I wrote the earlier post, and I’m flagging it as a concern rather than a conclusion.

Implication 4: the measurement problem for agent quality may be getting harder. Most existing agent evals measure either functional task completion (did the agent solve the problem) or explicit fact recall (did the agent remember what you told it). ImplicitMemBench introduces, on its own protocol, a third dimension: first-attempt automatic behavior under implicit priming. If that dimension turns out to matter in production — which I don’t know, and the paper doesn’t test — any serious quality loop for agent work needs a measurement hook for it, and most loops today don’t have one. I’m treating that as a TODO for my own quality system rather than a prescription for yours.

What to Actually Do

Nothing in this section is prescribed or tested by the paper. This is my reading — working forward from my own prior arguments, using ImplicitMemBench as one more piece of evidence — of what the findings imply for practitioners building against current harnesses. Label accordingly.

Stop assuming explicit pins are sufficient. Keep writing SOUL.md, AGENTS.md, CLAUDE.md, and memory files — but treat them as necessary-not-sufficient. The thing I’m updating is my own default assumption that “if it’s in the system prompt, it holds.” The paper doesn’t test that assumption; it tests adjacent questions and reports scores that make me want to hold my own assumption more loosely than I did yesterday.

Shorten sessions deliberately. The folklore observation is that agents get worse over long sessions. The folklore explanation I’ve been using is “context pressure.” ImplicitMemBench is not a study of session length — it uses a controlled Learning/Priming-Interfere-Test protocol, not long-running production sessions1 — but the mechanism it names (implicit priming and classical conditioning landing without retrieval) is a candidate alternative explanation for that folklore. The operational rule I’m adopting: when a session is drifting, don’t fight it with more explicit correction — /new the session and start fresh. Whether the drift is context-window pressure, implicit priming, or something else, a clean session resets whichever of those is actually the cause.
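That rule can be made mechanical. A sketch of a deliberate reset policy follows, with the caveat that every threshold is an invented placeholder and the drift signal (a running count of explicit corrections) is the crudest proxy available, not anything the paper measures.

```python
# Placeholder thresholds for a deliberate session-reset rule. None of these
# numbers come from the paper; tune them against your own session logs.
MAX_TURNS = 20        # reset well before the context window forces you to
MAX_CORRECTIONS = 3   # repeated explicit corrections as a cheap drift proxy

def should_reset(turns: int, corrections: int) -> bool:
    """True when starting a fresh session is cheaper than fighting drift in place."""
    return turns >= MAX_TURNS or corrections >= MAX_CORRECTIONS
```

Whatever the real mechanism turns out to be, a policy like this fires the /new reset on a schedule instead of on frustration.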

Treat inhibition as hard to enforce in the prompt. If you need your agent to not do something, do not rely on having told it not to. Build a structural guard — a linter, a pre-tool hook, a sandbox policy, a tool that refuses the call — that enforces the prohibition at the code layer. My Jiro quality loop argument has been that hard gates have to be outside the model for a reason; I already held that position before this paper. ImplicitMemBench adds a specific pattern (the 17.6% aggregate inhibition number1) that is consistent with the argument I’ve been making, though the paper does not itself test prompting or agent harnesses, and I don’t want to overclaim that it proves the position.
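A minimal sketch of what a code-layer guard could look like: a pre-tool hook that refuses matching shell commands before they ever execute. The hook signature and the deny-list patterns are hypothetical, not any particular harness's API; the point is only that the prohibition lives outside the model.

```python
import re

# Hypothetical deny-list for a pre-tool hook; patterns are examples only.
FORBIDDEN = [
    re.compile(r"\brm\s+-rf\s+/"),       # destructive wipes
    re.compile(r"\bcurl\b.*\|\s*sh\b"),  # pipe-to-shell installs
]

def pre_tool_hook(tool: str, command: str) -> None:
    """Raise before execution, so the model's first attempt never reaches the shell."""
    if tool == "bash":
        for pat in FORBIDDEN:
            if pat.search(command):
                raise PermissionError(f"blocked by policy: {command!r}")
```

The prompt can still say "don't do X"; the hook is what makes X impossible on the first attempt, which is exactly the case the inhibition number suggests the model layer handles worst.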

Audit context for what it primes, not just how many tokens it is. Token count is the measurement everyone has. If the implicit-priming framing is a useful lens — and I’m treating it as a hypothesis I want to test, not a settled result — then a 20k-token context full of narrative user-persona content might shape first-attempt behavior toward narrative outputs more than a 60k-token context full of structured code. I don’t have tooling for that kind of content-axis audit yet, and I’m not sure anyone does. The minimum-viable version is: look at your recent sessions and ask “what would a human reading this context be primed toward?” Whether that question is actually predictive of agent behavior is empirical and I am not going to pretend the paper decides it.
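The minimum-viable version can at least be approximated in code. A crude sketch, with the loud caveat that the categories and cue strings are invented, and that character share of "code-looking" content is not a validated priming signal:

```python
# Invented heuristic for a minimum-viable content-axis audit: what fraction of
# the context (by characters) looks like code versus prose? The cue list is a
# placeholder, not a validated signal of what the context primes toward.
CODE_CUES = ("def ", "import ", "{", "};", "=>")

def content_mix(chunks: list[str]) -> dict[str, float]:
    """Rough code-vs-prose share of a list of context chunks."""
    code = sum(len(c) for c in chunks if any(cue in c for cue in CODE_CUES))
    total = sum(len(c) for c in chunks) or 1
    return {"code": code / total, "prose": 1.0 - code / total}
```

Even a heuristic this crude lets you compare two sessions on a content axis instead of only on token count, which is the comparison the paragraph above says almost nobody is making.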

Log first-attempt disposition, not just final disposition. If you’re running any kind of trajectory capture against your skills, separate “what the agent tried first” from “what the agent landed on after correction.” ImplicitMemBench’s first-attempt scoring protocol1 is the methodological argument for why that separation matters: the final disposition measures the agent plus the correction loop, while first-attempt measures what the agent actually produced before external feedback. For any quality loop where user experience depends on the first response landing right, you need the first-attempt number, and almost nothing logs it separately today.
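A sketch of that separation, with the schema entirely invented: the one design point is that `first_attempt` must be captured before the correction loop overwrites it.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TurnRecord:
    """Invented trajectory schema: preserve the first attempt alongside the final."""
    task_id: str
    first_attempt: str   # what the agent produced before any feedback
    final: str           # what it landed on after the correction loop
    corrections: int = 0

def first_attempt_rate(records: list[TurnRecord],
                       ok: Callable[[str], bool]) -> float:
    """Mirror first-attempt scoring: ignore the final disposition entirely."""
    return sum(ok(r.first_attempt) for r in records) / len(records)
```

Scoring `final` instead would measure the agent plus its correction loop, which is the conflation the paragraph above argues most logging does today.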


FAQ

Does ImplicitMemBench test any agent harness specifically?

No. It tests 17 LLMs directly on a 300-item benchmark under a Learning/Priming-Interfere-Test protocol with first-attempt scoring.1 It is not a harness benchmark. It does not evaluate Claude Code, Cursor, Codex, Hermes, or any production agent loop. The mapping I draw in this post from the benchmark results to agent-harness production behavior is my extension, labeled as such throughout, and is not a finding from the paper.

Is the 17.6% vs. 75.0% asymmetry a per-model result or an aggregate?

The abstract describes the asymmetry as part of the authors’ analysis of the overall benchmark results across models, and labels it as evidence of “universal bottlenecks.”1 I read that as the asymmetry showing up consistently across the 17 models tested, with the specific numbers reflecting the aggregate pattern. The abstract does not publish a per-model breakdown, and I am not going to invent one. For the full per-model breakdown, the paper is the source.

Why might this matter more for production agents than for existing benchmarks?

Partial hedge on this one. ImplicitMemBench itself uses a multi-step protocol (Learning/Priming-Interfere-Test),1 so it is not the case that this benchmark is “single-shot” — I don’t want to repeat the usual careless line about benchmarks. What does seem worth flagging — as practitioner speculation, not a finding of the paper — is that most other agent evals people look at measure either functional task completion or explicit fact recall, both of which favor models. If the implicit-memory gap reported by this paper is real beyond its own protocol (and I don’t know that it is), those other evals are missing a dimension of production behavior that users actually experience in long-running sessions. I’m treating that as a testable hypothesis, not a conclusion.

Does this contradict your SOUL.md advice in the Hermes guide?

No — it adds a boundary condition. The Hermes guide recommends SOUL.md as the primary identity pin because explicit declarative memory is still load-bearing for what it does well: consistent recall of identity, auditable version control, predictable behavior under direct questioning. What the Hermes guide did not cover — because nothing existed to measure it until this paper dropped — is that the explicit identity pin may not propagate to first-attempt automatic behavior under priming and classical conditioning. You still want SOUL.md. You also want structural guards outside it.

Can prompt engineering fix any of this?

The honest answer is that the paper does not test prompting as a mitigation strategy, so I can’t tell you with paper-authority. What I can say: the authors frame the gap as “requiring architectural innovations beyond parameter scaling,”1 which is a stronger claim than “better prompts will help” but isn’t quite “no prompt can help.” For the inhibition side specifically (17.6% aggregate), my practitioner intuition — which you should discount relative to the paper itself — is that structural guards outside the model are a safer bet than prompt instructions. But that’s me, not the paper.

Is this one of the “memory benchmark” papers I’ve been seeing a lot of recently?

No, and the paper explicitly distinguishes itself from them. The abstract’s framing is that existing memory benchmarks evaluate explicit recall of facts — give the model a fact, ask the model to retrieve it. ImplicitMemBench is measuring a different thing entirely: automatic behavior adaptation without any retrieval step.1 That’s the paper’s contribution and the reason it was accepted to ACL 2026 Main Conference.1

Where does this sit relative to your earlier posts on agent memory?

This post is a direct companion to Static Skills Are Dead Skills. That earlier post argued that skills need trajectory aggregation to stay alive, and I assumed the failure mode was pure absence — if you could just get the trajectory data and run a pattern detector, you’d be fine. ImplicitMemBench is telling me there’s a second failure mode layered on top: even with perfect trajectory-driven skill updates, first-attempt behavior may not reflect the update because the update landed in explicit memory and the decisions are being driven by implicit memory. The earlier post is still correct about what it claimed; this post is an update about what it didn’t know to claim.

Could this be a measurement artifact?

Possibly. The paper is new — submitted 9 April 2026, accepted to ACL 2026 Main Conference — and single benchmarks can measure artifacts of their specific protocols as easily as they measure real phenomena.1 I am not going to pretend otherwise. The reason I think it isn’t just an artifact is that the failure mode it describes — agents reinforcing preferences fast while failing to unlearn failures — is folklore I have been watching without a name for it for more than a year. The benchmark doesn’t have to be perfectly calibrated for the direction of the result to be the thing practitioners should act on.


References


  1. Chonghan Qin, Xiachong Feng, Weitao Ma, Xiaocheng Feng, Lingpeng Kong, “ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models,” arXiv:2604.08064 [cs.AI], submitted 9 April 2026, accepted to ACL 2026 Main Conference. Primary source for: the framing of explicit versus implicit memory in LLM agents (“existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval”); the three cognitively grounded constructs of the benchmark (Procedural Memory = “one-shot skill acquisition after interference”; Priming = “theme-driven bias via paired experimental/control instances”; Classical Conditioning = “Conditioned Stimulus–Unconditioned Stimulus (CS–US) associations shaping first decisions”); the benchmark design (300-item suite, unified Learning/Priming-Interfere-Test protocol with first-attempt scoring); the evaluation coverage (17 models); the specific top-performer scores (DeepSeek-R1 65.3%, Qwen3-32B 64.1%, GPT-5 63.0%, no model exceeding 66% overall, all described as “far below human baselines”); the asymmetry finding (“dramatic asymmetries (inhibition 17.6% vs. preference 75.0%) and universal bottlenecks requiring architectural innovations beyond parameter scaling”); and the reframing phrase (“reframes evaluation from ‘what agents recall’ to ‘what they automatically enact’”). All direct quotes in this post are from the published abstract. Claims about how the benchmark findings apply to production agent harnesses, including SOUL.md, AGENTS.md, Claude Code, Hermes, MCP, and session-length effects, are my own framing, clearly labeled as such throughout, and are not attributed to the paper. 
