Static Skills Are Dead Skills

From the guide: Claude Code Comprehensive Guide

Last night I shipped a Settings Reference section to the Claude Code guide. Fifteen entries. Every citation grep’d against a line number. I shipped it on conviction after the critique loop came back clean. By the time I was committing the .md file, I already knew I’d need a v3 — not because I’d done anything wrong, but because the guide changes, the underlying product changes, the user queries shift, and the section I’d just shipped would start drifting the minute I walked away from it.

A skill, whether it’s a Markdown reference section or an agent skill definition in .claude/skills/, is only alive while somebody is watching its trajectory. The minute you stop watching, it becomes static. Static skills decay in place.

A new arXiv paper from Ma, Yang, Ji, Wang, and Wang (“SkillClaw: Let Skills Evolve Collectively with Agentic Evolver,” April 2026) formalizes this problem at the research level.[1] Their opening framing, quoted directly: “Large language model (LLM) agents such as OpenClaw rely on reusable skills to perform complex tasks, yet these skills remain largely static after deployment. As a result, similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users, preventing the system from improving with experience.”[1]

I’ve been living that failure mode for months. So have you, if you’re building skills for any agent harness.

TL;DR

Agent skills get shipped, then stop improving. Users discover the same failure modes independently and never feed those discoveries back into the skill itself. Ma et al. frame this as a collective intelligence problem: cross-user and over-time interactions are signals about when a skill works or fails, but no ecosystem-level mechanism exists to aggregate them into skill updates. Their SkillClaw framework proposes treating aggregated trajectories as the evolution signal, running an autonomous evolver that identifies recurring behavioral patterns and translates them into refinements or capability extensions.[1] The abstract cites “OpenClaw” as an example LLM agent that uses reusable skills — I have not been able to identify OpenClaw as a specific shipping product from the abstract alone, and I am not going to speculate about it in this post. What I am going to claim is that the structural problem the paper describes maps onto anyone building skills for Claude Code, Codex, Cursor, or their own harness. The take: if your skill library is not continuously ingesting trajectories from real use, it’s dead from the day you ship it.

Key Takeaways

  • Skill authors: The work is not done when the skill ships. The work is done when you have a loop that watches how the skill gets used, catches recurring failure modes, and feeds them back into the skill definition. Shipping is the beginning of the skill’s life, not the end.
  • Harness builders: Log every skill invocation with its trajectory — the inputs, the tool calls, the outputs, the error state. That log is the evolution signal. If you are not logging it, you are not improving your skills; you are maintaining them.
  • Jiro-minded practitioners: The SkillClaw paper is academic language for the Shokunin pattern applied to skills. The skill is the craft. The trajectories are the practice. The evolution is the pursuit of mastery. Static = dead.

What the Paper Actually Says

I’m going to walk through the abstract claims with care, then clearly mark where I’m extending the framing.

The problem statement (from the abstract). LLM agents rely on reusable skills to perform complex tasks. These skills remain largely static after deployment. Similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across users. The system does not improve with experience.[1]

That’s a claim about a specific failure mode, not a claim that all skills decay. A skill that never gets invoked doesn’t decay. A skill that gets invoked by one user who never reports issues doesn’t decay visibly. The decay shows up when you have multiple users, each encountering their own version of the same failure, and the system has no way to aggregate those encounters into a single update. (That last sentence is my framing, not the paper’s.)

The existing gap (from the abstract). The abstract states that while cross-user interactions “provide complementary signals about when a skill works or fails, existing systems lack a mechanism to convert such heterogeneous experiences into reliable skill updates.”[1] This is the load-bearing claim. It is not that nobody has thought about skill improvement. It is that no ecosystem-level mechanism aggregates trajectories, identifies recurring patterns, and translates them into updates.

The SkillClaw pipeline (from the abstract). The abstract describes a continuous pipeline: SkillClaw “aggregates trajectories generated during use and processes them with an autonomous evolver, which identifies recurring behavioral patterns and translates them into updates to the skill set by refining existing skills or extending them with new capabilities.”[1] The updated skills are maintained in a shared repository and synchronized across users, so improvements discovered in one context propagate system-wide without requiring user effort.[1]
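To make that pipeline concrete, here is a minimal sketch of the shape I read into the abstract. Everything here is my own illustration, not the paper's implementation: the names `Trajectory`, `evolve_skills`, and the recurrence threshold are assumptions I'm making to show the aggregate-then-refine loop.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class Trajectory:
    # One recorded use of a skill: which skill ran and how it ended.
    skill: str
    outcome: str  # e.g. "accepted", "reverted", or a failure tag

def evolve_skills(trajectories, min_recurrences=3):
    """Hypothetical evolver step: aggregate trajectories across users,
    find failure modes that recur, and emit candidate skill updates."""
    failures = Counter(
        (t.skill, t.outcome) for t in trajectories if t.outcome != "accepted"
    )
    # Each recurring (skill, failure) pair becomes a proposed refinement.
    return [
        {"skill": skill, "failure": outcome,
         "proposal": f"add handling for `{outcome}` to {skill}"}
        for (skill, outcome), n in failures.items()
        if n >= min_recurrences
    ]
```

The point of the sketch is the threshold: a single bad session is noise, but the same (skill, failure) pair showing up across many users is exactly the signal the abstract says existing systems throw away.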

The evaluation (from the abstract). The paper evaluates SkillClaw on a benchmark called WildClawBench using Qwen3-Max as the underlying model. The abstract’s own phrasing is grammatically broken in the published version: “experiments on WildClawBench show that limited interaction and feedback, it significantly improves the performance of Qwen3-Max in real-world agent scenarios.”[1] I read this as: with limited interaction and feedback, SkillClaw still produces significant performance improvements over the baseline. The abstract does not publish specific numbers — the full paper presumably does.

That’s the paper as the abstract describes it. The authors propose that multi-user agent ecosystems with shared skills benefit from automated trajectory aggregation feeding automated skill updates, and they report that their implementation significantly improves Qwen3-Max performance under limited-feedback conditions.

What the Paper Doesn’t Say (and What I’m Adding)

The abstract cites “OpenClaw” as one example (“LLM agents such as OpenClaw”) of an agent that uses reusable skills. I do not know what OpenClaw is from the abstract alone — I could not quickly identify it as a specific shipping product. The paper’s framework (SkillClaw) is presented as a solution for multi-user agent ecosystems generally, not for OpenClaw specifically, so the question “what is OpenClaw” is mostly tangential to the argument. I am flagging it so that nobody reads this post and walks away thinking the paper is about Claude Code. It isn’t. It names OpenClaw as an example and proposes SkillClaw as a general mechanism.

What I am claiming — separately from the paper — is that the structural problem the paper describes maps onto a real problem I’ve been living in the Claude Code skill ecosystem. That claim is mine, not the paper’s. Here’s why I think it maps.

Skills in the Claude Code ecosystem are shipped as static artifacts. A skill is a SKILL.md file (or a bundle of supporting files) that describes how a task should be performed. You write it once. You commit it. You reference it with a slash command or via @skill-name typeahead. Once it ships, it is a static artifact. There is no automatic mechanism that watches how the skill gets used in practice and updates the skill definition based on what works and what fails.

Different users hit the same failure modes independently. Every skill I’ve shipped has at least one recurring failure mode that only shows up under specific conditions. Someone invokes the skill with an input I didn’t anticipate, hits the edge case, works around it manually, and moves on. Another person, somewhere else, hits the same edge case and does their own workaround. The skill itself is unchanged.

The aggregate signal is real but unused. If I could see every trajectory from every invocation of every skill I’ve shipped, I could identify the recurring failure modes in an afternoon. That signal exists — it’s in every individual user’s session history. It’s just not aggregated anywhere, so nobody acts on it.

The fix is either manual or missing. Right now, the only mechanism for skill improvement is me noticing a problem in my own usage, or someone filing an issue, or someone opening a PR. Those are all user-effort-required pathways. The SkillClaw paper’s core insight — that the trajectory data already exists and should be converted into skill updates automatically — is exactly the mechanism we’re missing.

That’s my claim about how the paper’s framing applies to Claude Code. It’s not what the paper says. It’s how I’m reading the paper against my own work.

The Shokunin Pattern, Applied to Skills

There’s a framing I keep coming back to when I think about craft. Jiro Ono, the sushi master, is the canonical example. Sixty years of the same work. Every day, watching what happens at the counter, adjusting the technique, refining the rice temperature, the knife angle, the timing of the shari. The work itself is the training signal. The practitioner is the aggregator.

I wrote about the Shokunin / quality-loop framing a while back. The core idea: the craft is the feedback loop. You do the work, you watch the work, you notice what broke, you adjust, you do the work again. Over and over. The mastery lives in the delta between what you intended and what actually happened, and in your willingness to carry that delta into the next attempt.

A static skill breaks that loop. You ship the skill. You stop watching. The delta between what the skill intended and what actually happens accumulates in a hundred different sessions that you never see. The skill does not get better because the craftsman is not at the counter.

The SkillClaw paper proposes an automated aggregator — not a replacement for the human, but a mechanism that watches all the trajectories, notices what broke across sessions, and proposes updates back into the skill definition. That is not a crazy ambition. It’s actually the minimum bar if you want a skill to survive its own deployment.

What This Looks Like in Practice

If I wanted to build the SkillClaw pattern against Claude Code skills I maintain today, here’s what I’d need:

1. A trajectory log for every skill invocation. Every time a skill runs, the inputs, the tool calls it makes, the outputs, the error states, and the final disposition (did the user accept the result? revert it? rewrite it?). This already exists at the session level in Claude Code — the question is whether it’s aggregated across sessions and extracted for the skill owner.

2. A pattern detector. Something that reads the trajectory log and identifies recurring patterns: same input class leading to same failure, same tool call failing in the same way, same edge case showing up under different user contexts. This is not AGI — it’s clustering on structured trajectory data.

3. A proposal generator. Given a detected pattern, draft a candidate update to the skill: a new handling branch, an additional example, an extra constraint in the SKILL.md body. The update is a proposal, not a shipped change.

4. A gate. Every proposed update goes through human review, factual verification (the same hard gate I apply to everything else), and a critique loop before it ships. The automation does the aggregation, not the shipping.

5. Distribution. When a proposed update is accepted, it propagates to every user of that skill. In a centralized ecosystem this is trivial (update the canonical skill, everyone pulls). In a distributed ecosystem this is harder.
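Steps 3 and 4 are the part I care most about getting right, so here is a hedged sketch of the proposal-plus-gate shape. The `Proposal` type and `ship` function are my own illustration of the invariant, not any existing API: automation drafts, only a human flips the approval bit.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    # A candidate skill update drafted from a detected pattern.
    skill: str
    change: str             # e.g. "add handling branch for empty-input case"
    approved: bool = False  # the gate: automation never sets this to True

def ship(proposal: Proposal, skill_bodies: dict) -> bool:
    """Apply a proposed update to the skill definition only if a human
    has approved it. Returns True if the update was applied."""
    if not proposal.approved:
        return False  # aggregation is automated; shipping is not
    skill_bodies[proposal.skill] = (
        skill_bodies.get(proposal.skill, "") + "\n" + proposal.change
    )
    return True
```

The design choice worth naming: `ship` refuses by default. If the gate is a code path rather than a policy document, the autonomous evolver can run all day without ever mutating a skill nobody reviewed.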

Most of this is already present in Claude Code: session logging exists, skill definitions are versioned, and the critique loop is operational. The missing piece is the aggregation and pattern-detection layer that connects session trajectories to skill updates.

The Uncomfortable Implication

Every skill I’ve shipped in the last six months is dead in exactly the sense the SkillClaw paper describes. I write the skill. I use it myself. I notice problems. I fix them in the skills I use. The skills get better for me. They don’t get better for anyone else unless that person independently notices the same problem and files something.

The work I did last night on the Settings Reference is exactly this pattern. The Claude Code guide is a shared artifact. Users query it for specific config keys. I can see the GSC data telling me which config keys get searched. That’s aggregated trajectory data — it’s literally telling me which skills in the guide are getting invoked and where the results are landing. And until I went looking at that data, the guide was static. It had been static for weeks. Not because nobody was watching the trajectories, but because I was the only person who could watch them, and I had other things to do.

The SkillClaw paper is the academic formalization of the problem. The practical mechanism is simpler: if you don’t have an automatic pipeline from trajectory data to skill updates, your skills are aging in place. They might still work for some users under some conditions. They are not getting better.

The only question is whether you accept that your skills are dead the moment they ship, or whether you build the watcher that keeps them alive.

The Minimum Viable Aggregator

Before I started this post, I had zero trajectory aggregation on my skills. None. I had session history I could read manually, but nothing that surfaced patterns across sessions. That is exactly the static-skill pathology the paper describes, and I was running it.

Here is the smallest actual thing I can ship against it right now, today: a single text file that logs every skill invocation across my own sessions, append-only, with timestamp + skill name + input shape + final disposition (accepted / revised / reverted). No pattern detector. No autonomous evolver. Just the log.

That file is the minimum viable aggregator. It is not SkillClaw. It is the input layer SkillClaw would need if it existed, and it’s the input layer I need before I can even see whether my skills have recurring failure modes. Without it, I’m guessing. With it, I can at least scan the log by hand when I’m reviewing a skill and ask: did this thing break in the same way three times this month?

That’s the commitment. One file. Append-only. Logged per invocation. Reviewed when I review the skill.
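As a sketch, the whole commitment fits in one function. I'm assuming JSON Lines as the append-only format and the field names `ts`, `skill`, `input_shape`, and `disposition` are my own convention, not anything Claude Code defines:

```python
import json
import time

def log_invocation(path, skill, input_shape, disposition):
    """Append one record per skill invocation to an append-only JSONL file.
    Dispositions: "accepted" / "revised" / "reverted" (my convention)."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "skill": skill,
        "input_shape": input_shape,   # rough shape of the input, not the content
        "disposition": disposition,
    }
    # Open in append mode so the log is never rewritten, only extended.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

One line per invocation, grep-able by hand, and trivially parseable later if a pattern detector ever materializes. That's the entire input layer.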

If that works, the next layer is the pattern detector. If the pattern detector works, the next layer is the proposal generator. The ambition of the paper is a full autonomous evolver running across a multi-user ecosystem. The ambition for me is to not be running in the dark.
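For what that next layer might look like: a first-cut pattern detector over an append-only JSONL log is a counting pass, not machine learning. This sketch assumes each log line is a JSON object with `skill` and `disposition` fields (my convention from the log-file commitment above, not an existing format), and answers exactly the question "did this thing break in the same way three times?":

```python
import json
from collections import Counter

def recurring_failures(path, threshold=3):
    """Scan a JSONL invocation log and return (skill, disposition) pairs
    that ended in a non-accepted state `threshold` or more times."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            if rec["disposition"] != "accepted":
                counts[(rec["skill"], rec["disposition"])] += 1
    return {pair: n for pair, n in counts.items() if n >= threshold}
```

Anything fancier (clustering on input shape, drafting proposals) layers on top of this, but even the counting pass beats reviewing skills from memory.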


FAQ

Is “OpenClaw” in the paper the same as Claude Code?

No, and I also cannot tell you what OpenClaw is. The abstract mentions “LLM agents such as OpenClaw” as one example of an agent that uses reusable skills, without defining it. I could not quickly identify it as a specific shipping product from the abstract alone. The important thing is that the paper’s SkillClaw framework is presented as a general solution for multi-user agent ecosystems, not as a solution specifically for OpenClaw or Claude Code. Whatever OpenClaw is, the paper is not a Claude Code paper, and my claims about Claude Code in this post are my own, not the paper’s.[1]

What’s the actual novel contribution of the paper?

Per the abstract: a framework for collective skill evolution in multi-user agent ecosystems that (1) aggregates trajectories across users and time, (2) runs an autonomous evolver to detect recurring patterns, and (3) translates patterns into updates to skills in a shared repository that synchronize across users.[1] The novelty is not “skills can be improved” — that’s obvious. The novelty is proposing that the improvement loop should be autonomous and trajectory-driven, not human-driven.

Does the paper report specific improvement numbers?

The abstract describes the improvement as “significant” on a benchmark called WildClawBench using Qwen3-Max, under limited-feedback conditions, but does not publish specific numbers.[1] For numbers, the full paper is the source.

Why is this different from a git pull request against a skill definition?

A PR is a human-initiated mechanism. Someone has to notice the problem, write the fix, file the PR, review it, merge it. Every step requires human effort. The SkillClaw framework the paper proposes is autonomous aggregation — the system notices the pattern across many users, proposes the fix itself, and synchronizes the update without any single user having to file anything.[1] Whether that autonomous version is desirable or safe for any specific ecosystem is a separate question. The paper’s contribution is showing that it’s technically coherent.

Does this apply to my custom Claude Code skills?

The paper does not make claims about any specific Claude Code skill ecosystem. My claim — separate from the paper — is that the structural problem (skills shipped as static artifacts, failure modes rediscovered by each user independently, no aggregation mechanism) does apply to Claude Code skills, and that anyone building skills for Claude Code or any similar harness should be thinking about how to build a trajectory-driven improvement loop. That’s my opinion, not a finding from the paper.

What’s the Shokunin connection?

The Shokunin / quality-loop framing argues that mastery comes from the delta between what you intended and what actually happened, carried into the next attempt. Static skills break that loop because the deltas accumulate in sessions the craftsman never sees. SkillClaw is the academic version of closing that loop — automating the collection of deltas and feeding them back into the skill. The discipline is the same; the mechanism is different.


References


  1. Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, “SkillClaw: Let Skills Evolve Collectively with Agentic Evolver,” arXiv:2604.08377, April 2026. Primary source for the problem statement (static skills after deployment, rediscovered failure modes across users), the SkillClaw pipeline description (trajectory aggregation → autonomous evolver → shared skill repository → cross-user synchronization), and the evaluation (WildClawBench benchmark, Qwen3-Max, improvement described as “significant” with limited interaction and feedback — abstract does not publish specific numbers). The abstract cites “OpenClaw” as an example LLM agent but does not define it; I do not make claims about what OpenClaw is beyond what the abstract says. Claims about how the SkillClaw framing applies to Claude Code skills specifically are my own, clearly labeled as such, and are not attributed to the paper. 
