AI Agent Skills Need Behavioral Audits, Not Pass Rates

AI agent skills look easy to evaluate until a pass rate barely moves.

Counterfactual Trace Auditing reported an average task-success gain of just +0.3 percentage points from skills in one benchmark configuration, yet the same audit found 522 specific ways the skills changed agent behavior across 49 tasks.[1] A pass-rate dashboard would call that almost nothing. A trace audit sees the real shift.

AI agent skills need behavioral audits, not pass rates. A skill can change which tool an agent chooses, which path it reads, which evidence it skips, which risk it ignores, and which side effect it creates while the final task outcome looks unchanged.

TL;DR

AI agent skills should not earn trust from pass rates alone. A pass rate tells teams whether the final task succeeded under a benchmark grader. A behavioral audit asks whether the skill changed the agent’s actions in ways the team intended.

Recent research makes the gap hard to ignore. Counterfactual Trace Auditing compares agent traces with and without a skill and surfaces skill-induced patterns that ordinary success metrics miss.[1] Behavioral Integrity Verification compares what a skill claims to do against what the skill actually does, then reports widespread description-behavior mismatch across a large skill corpus.[2] SkillsBench shows curated skills can improve agent performance, but also shows self-generated skills can fail to help and that some tasks get worse with skills.[3]

The practical rule: do not install a skill because a benchmark went up. Install a skill after the trace shows the behavior belongs.

Key Takeaways

For teams using agent skills:

  • Treat every skill as behavior-changing code, even when the file contains only Markdown.
  • Audit trace changes, side effects, and failure modes before sharing the skill across projects.

For skill authors:

  • Declare expected behavior, allowed tools, forbidden actions, and evidence obligations.
  • Test the skill against paired traces, not only final task outcomes.

For security reviewers:

  • Compare declared capabilities with observed capabilities.
  • Flag hidden expansion, external access, destructive actions, and policy bypasses as skill defects.

For evaluation teams:

  • Report pass rate, behavior delta, side-effect delta, and review burden separately.
  • A flat pass rate can still hide a dangerous behavior change.

Why Do Pass Rates Miss Skill Risk?

Pass rates compress the wrong object.

A skill changes the agent before the task starts. It can add domain procedure, tool preference, formatting rules, review steps, confidence language, or recovery behavior. The benchmark grader usually sees only the final artifact: correct or incorrect.

That creates a blind spot:

| Skill Effect | Pass Rate Sees | Behavioral Audit Sees |
| --- | --- | --- |
| Better tool ordering | Maybe success | Which call moved earlier and why. |
| Extra file reads | Maybe success | Which files entered context. |
| More aggressive patching | Maybe success | Diff size, ownership, and rollback risk. |
| Skipped verification | Maybe success | Missing evidence before completion. |
| Hidden external access | Maybe success | Network or MCP boundary expansion. |
| Lower review burden | Maybe success | Smaller trace, clearer proof, fewer unresolved claims. |

The final answer can look right while the skill makes the run less trustworthy. The opposite can also happen: a skill can produce a failed result while teaching a better search or recovery pattern that deserves repair rather than deletion.

Pass rate belongs in the audit. It cannot be the audit.

What Did Counterfactual Trace Auditing Add?

Counterfactual Trace Auditing compares two runs: one with the skill and one without it.[1]

The paper’s point lands because the headline pass-rate gain stays tiny in the reported WebArena configuration. Average task success rises by only +0.3 percentage points when the agent uses skills.[1] Yet the authors identify 522 skill-induced behavior patterns across 49 tasks, spanning changes such as validation steps, form interaction, error recovery, page navigation, and misuse patterns.[1]

That split is the article.

The skill affected behavior even when aggregate task success barely moved.

CTA works by aligning traces into phases and identifying skill-induced patterns. The audit asks more than whether a task passed. It asks where the skill changed the trajectory, whether the change helped or hurt, and which skill instruction appears responsible.[1]

That method gives teams a better review object:

| Audit Question | Why It Matters |
| --- | --- |
| Which step changed? | Connects behavior to a trace location. |
| Which instruction caused the change? | Connects behavior to skill text. |
| Did the change help, hurt, or merely move cost? | Prevents pass-rate theater. |
| Did the change create side effects? | Catches risk hidden behind success. |
| Does the change generalize across tasks? | Separates one lucky run from a skill worth keeping. |

Teams need that object before they promote a skill from local experiment to shared process.
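As a rough illustration of the paired comparison, the sketch below diffs the tool-call sequence of a baseline run against a run with the skill loaded. It is not the CTA paper's method (which aligns traces into phases); it only uses Python's standard `difflib` to show where two trajectories diverge. The function name `diff_tool_calls` and both traces are hypothetical.

```python
from difflib import SequenceMatcher


def diff_tool_calls(baseline, with_skill):
    """Return the segments where the skill run differs from the baseline run."""
    matcher = SequenceMatcher(a=baseline, b=with_skill, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical stretch of the two traces
        changes.append({
            "change": tag,                      # "insert", "delete", or "replace"
            "baseline_index": i1,               # where the divergence starts
            "baseline_steps": baseline[i1:i2],  # steps only the baseline took
            "skill_steps": with_skill[j1:j2],   # steps only the skill run took
        })
    return changes


# Hypothetical traces: both end in "submit_form", so a grader could pass both.
baseline = ["open_page", "fill_form", "submit_form"]
with_skill = ["open_page", "read_docs", "fill_form", "validate_form", "submit_form"]

changes = diff_tool_calls(baseline, with_skill)
for change in changes:
    print(change["change"], change["skill_steps"])
```

Here the outcome is identical, but the diff surfaces two inserted steps that a reviewer can trace back to the skill text and judge as helpful, harmful, or pure cost.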

What Did Behavioral Integrity Verification Add?

Behavioral Integrity Verification asks a different question: does a skill do what its description says?[2]

The BIV paper studies large-scale skill repositories and reports that more than 80% of analyzed skills showed some form of description-behavior deviation.[2] The authors classify most deviations as oversight-driven rather than adversarial, but they still find adversarial cases and multi-stage risk patterns.[2]

That finding matters because descriptions drive activation.

In agent systems, a skill description often decides whether the skill enters context. The description says when the agent should load it. If the description understates capability, hides side effects, or fails to mention tool access, the agent and user both make a bad routing decision before any task-specific reasoning begins.

BIV points at a missing manifest layer for skills:

| Declared Surface | Behavioral Audit Should Verify |
| --- | --- |
| Activation condition | Does the skill run only for the stated task class? |
| Capability | Does observed behavior stay inside the claim? |
| Tool use | Which tools, commands, MCP servers, or files does the skill cause? |
| Side effects | Does the skill read, write, delete, send, spend, publish, or deploy? |
| External access | Does the skill create network, browser, or third-party movement? |
| Safety claim | Does the skill actually add the promised check? |
| Refusal boundary | Does the skill preserve blocked actions? |

The scary version is a malicious skill that lies. The ordinary version is a sloppy skill that forgets to tell the truth.

Both versions need an audit.
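One way to picture the declared-versus-observed comparison is a plain set difference. The sketch below is an assumption about how a team might encode it, not the BIV paper's tooling; `CapabilityReport`, `compare_capabilities`, and the capability names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityReport:
    undeclared: set = field(default_factory=set)  # observed but never declared
    unused: set = field(default_factory=set)      # declared but never observed

    @property
    def consistent(self):
        # Only undeclared behavior counts as a mismatch in this sketch.
        return not self.undeclared


def compare_capabilities(declared, observed):
    """Compare what the skill description claims with what traces show."""
    declared, observed = set(declared), set(observed)
    return CapabilityReport(
        undeclared=observed - declared,
        unused=declared - observed,
    )


# Hypothetical skill: the description mentions file reads and test runs only.
declared = {"read_file", "run_tests"}
observed = {"read_file", "run_tests", "http_post"}

report = compare_capabilities(declared, observed)
print(sorted(report.undeclared))
print(report.consistent)
```

The undeclared set covers both the malicious and the sloppy case: the check does not care why `http_post` is missing from the description, only that it is.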

What Did SkillsBench Add?

SkillsBench shows why teams should not overcorrect and declare skills worthless.

The benchmark evaluates agent skills across 86 tasks and 7,308 trajectories.[3] The paper reports that curated skills improve average pass rate by 16.2 percentage points over a no-skill baseline, while self-generated skills provide no benefit on average.[3] It also reports negative deltas on some tasks, meaning a skill can make certain work worse.[3]

That result gives the balanced view.

Skills can help. Skill quality matters. Task fit matters. Source matters. Evaluation method matters.
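A small arithmetic sketch shows why an average gain and per-task regressions can coexist. The pass rates below are invented for illustration and are not SkillsBench data; `per_task_deltas` and `summarize` are hypothetical helper names.

```python
from statistics import mean


def per_task_deltas(baseline, with_skill):
    """Pass-rate delta in percentage points for each task present in both runs."""
    return {
        task: round((with_skill[task] - baseline[task]) * 100, 1)
        for task in baseline
        if task in with_skill
    }


def summarize(deltas):
    """Report the average alongside the tasks that got worse."""
    return {
        "mean_delta_pp": round(mean(deltas.values()), 1),
        "regressed_tasks": sorted(t for t, d in deltas.items() if d < 0),
    }


# Hypothetical pass rates (fraction of runs that passed), not real benchmark data.
baseline = {"invoice_parse": 0.40, "db_migration": 0.70, "report_render": 0.50}
with_skill = {"invoice_parse": 0.80, "db_migration": 0.55, "report_render": 0.60}

deltas = per_task_deltas(baseline, with_skill)
summary = summarize(deltas)
print(deltas)
print(summary)
```

In this made-up example the mean delta is positive while `db_migration` regresses by 15 points, which is the kind of task-fit problem a single aggregate number hides.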

The adoption lesson is not “avoid skills.” The adoption lesson is “review skills like capability packages.”

A useful skill should answer:

| Question | Required Answer |
| --- | --- |
| What job does the skill improve? | Concrete task class and reader/user. |
| Which behavior should change? | Tool choice, evidence check, format, review, or recovery pattern. |
| Which behavior must not change? | Forbidden tools, paths, side effects, and authority boundaries. |
| What evidence proves the skill helped? | Trace delta, pass rate, review effort, and side-effect profile. |
| How can the team remove it? | Version, owner, rollback, and replacement path. |

The skill earns promotion only when the observed behavior matches those answers.

What Does A Behavioral Audit Look Like?

A behavioral audit compares expected skill behavior with observed agent behavior.

The minimum audit has four passes.

| Audit Pass | Evidence |
| --- | --- |
| Declaration audit | Skill description, activation condition, capabilities, tools, and forbidden actions. |
| Counterfactual trace audit | Paired runs with and without the skill on the same task set. |
| Side-effect audit | Files, commands, network calls, external writes, approvals, and rollback state. |
| Failure audit | Failed runs, near misses, recovered errors, and repeated repair patterns. |

The output should look less like a leaderboard and more like a review packet.

For each task, capture:

  1. Task name and risk lane.
  2. Skill version and source.
  3. Baseline trace.
  4. Skill trace.
  5. Changed steps.
  6. Changed tool calls.
  7. Changed side effects.
  8. Evidence gained or lost.
  9. Final outcome.
  10. Reviewer decision: keep, revise, scope, block, or retire.

That packet gives a human reviewer a way to make a judgment that survives beyond one benchmark run.
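The ten items above map naturally onto a record type. The following is a minimal sketch of one possible shape, assuming traces are stored as lists of step names; the class names, field names, and sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Decision(Enum):
    KEEP = "keep"
    REVISE = "revise"
    SCOPE = "scope"
    BLOCK = "block"
    RETIRE = "retire"


@dataclass
class TaskAuditPacket:
    task_name: str
    risk_lane: str
    skill_version: str
    skill_source: str
    baseline_trace: List[str]
    skill_trace: List[str]
    changed_steps: List[str] = field(default_factory=list)
    changed_tool_calls: List[str] = field(default_factory=list)
    changed_side_effects: List[str] = field(default_factory=list)
    evidence_delta: str = ""
    final_outcome: str = ""
    decision: Optional[Decision] = None  # filled in by the human reviewer

    def ready_for_review(self):
        """A packet needs both traces and an outcome before anyone decides."""
        return bool(self.baseline_trace and self.skill_trace and self.final_outcome)


# Hypothetical packet for a single task.
packet = TaskAuditPacket(
    task_name="update-pricing-page",
    risk_lane="public-content",
    skill_version="0.3.1",
    skill_source="internal/skills/publish-check",
    baseline_trace=["edit_file", "mark_complete"],
    skill_trace=["edit_file", "fetch_route", "mark_complete"],
    changed_tool_calls=["+fetch_route"],
    evidence_delta="gained: rendered-route check",
    final_outcome="pass",
)
print(packet.ready_for_review())
packet.decision = Decision.KEEP
```

Keeping the decision as a separate, initially empty field makes it visible when evidence was collected but nobody has signed off yet.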

Where Do Skill Contracts Fit?

ContractSkill points toward a cleaner shape for skills that need stricter behavior.[4]

The paper argues that web-agent skills written in natural language can be ambiguous, brittle, and hard to debug. It proposes contract-based skills with explicit task definitions, preconditions, postconditions, and step-level procedures, so a system can localize failures and repair the affected portion instead of rewriting the whole skill.[4]

That contract framing fits behavioral audits.

| Freeform Skill | Contract-Shaped Skill |
| --- | --- |
| “Use care when publishing.” | “Before publish, verify source URLs, route render, schema, and rollback.” |
| “Check the page.” | “Fetch route, assert status 200, assert changed marker, assert no fallback text.” |
| “Avoid risky commands.” | “Block delete, force push, external POST, and writes outside owned paths.” |
| “Translate naturally.” | “Preserve URLs and citations; translate visible headings; gate English residue.” |

Contract-shaped skills reduce ambiguity. They also make audits cheaper because expected behavior sits in a structure the reviewer can compare with the trace.

The contract should not make every skill huge. Plain skills still work for low-risk writing format or checklist tasks. Contracts matter when a skill can alter external systems, public content, data, money, security posture, or shared project behavior.
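To make the contract idea concrete, here is a minimal sketch of steps with explicit preconditions and postconditions, loosely modeled on the "Check the page" row above. It is an illustration under stated assumptions, not ContractSkill's actual format: `ContractStep`, `run_contract`, the state dictionary, and the stand-in fetch are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ContractStep:
    name: str
    precondition: Callable[[dict], bool]
    action: Callable[[dict], dict]
    postcondition: Callable[[dict], bool]


def run_contract(steps, state):
    """Run steps in order; stop at the first violated contract and report where."""
    for step in steps:
        if not step.precondition(state):
            return state, f"{step.name}: precondition failed"
        state = step.action(state)
        if not step.postcondition(state):
            return state, f"{step.name}: postcondition failed"
    return state, None


steps = [
    ContractStep(
        name="fetch_route",
        precondition=lambda s: "url" in s,
        action=lambda s: {**s, "status": 200},  # stand-in for a real HTTP fetch
        postcondition=lambda s: s["status"] == 200,
    ),
    ContractStep(
        name="assert_changed_marker",
        precondition=lambda s: s.get("status") == 200,
        action=lambda s: {**s, "marker_found": "v2" in s["body"]},
        postcondition=lambda s: s["marker_found"],
    ),
]

# Hypothetical page state: the new marker "v2" never made it into the body.
final_state, failure = run_contract(steps, {"url": "/pricing", "body": "old page v1"})
print(failure)
```

Because the failure message names the step, a reviewer can repair or tighten that one step rather than rewriting the whole skill.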

How Do You Repair A Bad Skill?

Do not delete a useful skill because one run failed. First identify where the behavior broke.

AgentRx focuses on repairing agent failures by locating critical failure steps in execution trajectories, generating constraints, and validating repairs against an auditable log.[5] The paper targets agent behavior broadly rather than skill files specifically, but the repair shape maps well to skills: find the failure step, derive a constraint, test the repaired behavior, and preserve evidence.

Skill repair should follow the same sequence:

| Failure | Repair |
| --- | --- |
| Skill activates too broadly | Narrow the description and trigger examples. |
| Skill changes the wrong tool choice | Add tool-selection rules and counterexamples. |
| Skill skips verification | Add a stop condition before completion. |
| Skill creates too much diff | Add ownership and changed-path limits. |
| Skill causes network movement | Add egress rules and approval requirements. |
| Skill improves one task but harms another | Split the skill or scope it to the winning task class. |

Repair should end with a new audit, not a confident commit message.

If the trace still shows the wrong behavior after repair, retire the skill.
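The re-audit after a repair can be expressed as a constraint checked against fresh traces. The sketch below encodes one constraint from the table, verification before completion, and is not AgentRx's implementation; the function names and the step names `run_tests` and `mark_complete` are assumptions for illustration.

```python
def verified_before_completion(trace, verify_step="run_tests", done_step="mark_complete"):
    """True when every completion step is preceded by a verification step."""
    seen_verify = False
    for step in trace:
        if step == verify_step:
            seen_verify = True
        elif step == done_step and not seen_verify:
            return False
    return True


def reaudit(repaired_traces, constraint):
    """Check every post-repair trace; any violation means the repair did not hold."""
    violations = [i for i, trace in enumerate(repaired_traces) if not constraint(trace)]
    decision = "keep" if not violations else "retire"
    return decision, violations


# Hypothetical traces collected after the skill was repaired.
traces_after_repair = [
    ["edit_file", "run_tests", "mark_complete"],
    ["edit_file", "mark_complete"],  # still skips verification
]

decision, violations = reaudit(traces_after_repair, verified_before_completion)
print(decision, violations)
```

The indices of the violating traces are kept so the evidence behind the decision survives alongside it.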

The Minimum Standard

Before a team shares an AI agent skill, require one behavioral audit packet.

| Field | Required Evidence |
| --- | --- |
| Source | Repository, author, version, and install path. |
| Purpose | The task class the skill claims to improve. |
| Activation | The exact condition that should load the skill. |
| Allowed behavior | Tools, files, resources, and actions the skill may influence. |
| Forbidden behavior | Tools, paths, side effects, and authority the skill must not expand. |
| Counterfactual traces | Same task with and without the skill. |
| Outcome delta | Pass rate, failure rate, review effort, and runtime cost. |
| Behavior delta | Changed steps, tool calls, side effects, and evidence. |
| Risk decision | Keep, revise, scope, block, or retire. |
| Rollback | How the team removes the skill and returns to prior behavior. |

That packet forces the right question.

The question is not “did the skill help once?” The question is “does the skill reliably change behavior in the way the team wants?”
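A team could enforce the minimum standard with a simple completeness gate before a skill is shared. This is a sketch under the assumption that the packet is stored as a dictionary keyed by the table's fields; the key names and `missing_fields` are hypothetical.

```python
REQUIRED_FIELDS = (
    "source", "purpose", "activation", "allowed_behavior", "forbidden_behavior",
    "counterfactual_traces", "outcome_delta", "behavior_delta",
    "risk_decision", "rollback",
)
ALLOWED_DECISIONS = {"keep", "revise", "scope", "block", "retire"}


def missing_fields(packet):
    """List required fields that are absent or empty, plus an invalid decision."""
    problems = [name for name in REQUIRED_FIELDS if not packet.get(name)]
    decision = packet.get("risk_decision")
    if decision and decision not in ALLOWED_DECISIONS:
        problems.append("risk_decision (unknown value)")
    return problems


# Hypothetical packet that forgot the rollback plan.
packet = {
    "source": "internal/skills/publish-check@0.3.1",
    "purpose": "publishing content pages",
    "activation": "task mentions publishing a route",
    "allowed_behavior": ["fetch_route", "read_file"],
    "forbidden_behavior": ["force_push", "external_post"],
    "counterfactual_traces": ["run-017-baseline", "run-017-skill"],
    "outcome_delta": {"pass_rate_pp": 0.0},
    "behavior_delta": ["+fetch_route"],
    "risk_decision": "keep",
}

problems = missing_fields(packet)
print(problems)
```

The gate only checks that evidence exists. Judging whether the evidence is good enough remains the reviewer's job.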

The Worthy Standard

Skills make agents feel better fast. That speed tempts teams to accumulate process files, commands, agents, hooks, and prompts because each one seems cheap.

Cheap context still changes behavior.

A worthy skill earns its place by improving the whole workflow. It should reduce review burden, sharpen evidence, narrow risk, or teach a procedure the agent could not reliably perform without it. A skill that only makes the agent sound more confident should go away. A skill that improves pass rate while expanding hidden side effects should fail review.

The standard should stay simple:

  • Declare what the skill should change.
  • Prove the trace changed that way.
  • Name what must not change.
  • Prove the trace respected that boundary.
  • Keep the skill only when the behavior deserves to exist.

AI agent skills are not magic notes. They are behavior patches. Treat them like code.

Quick Summary

AI agent skills need behavioral audits because pass rates hide too much. Counterfactual Trace Auditing shows skills can change hundreds of trace patterns while aggregate success barely moves.[1] Behavioral Integrity Verification shows skill descriptions often diverge from actual capabilities.[2] SkillsBench shows curated skills can help, but self-generated skills and task mismatch can fail or harm.[3]

The operating rule is direct: evaluate the behavior, not only the score. A skill deserves trust when its declaration, traces, side effects, failures, repairs, and rollback path all line up.

FAQ

What is a behavioral audit for AI agent skills?

A behavioral audit checks how a skill changes an agent’s actual run: tool calls, file access, side effects, verification steps, recovery behavior, and final outcome. It compares observed behavior with the skill’s declared purpose and boundaries.

Why are pass rates not enough for skill evaluation?

Pass rates show whether a task succeeded under a grader. They do not show whether the skill expanded tool access, skipped evidence, increased side effects, or changed behavior in ways the team did not intend.

What is Counterfactual Trace Auditing?

Counterfactual Trace Auditing compares agent trajectories with and without a skill, aligns trace phases, and identifies skill-induced behavior patterns. It helps teams see behavior changes that aggregate success metrics can miss.[1]

What is Behavioral Integrity Verification?

Behavioral Integrity Verification compares skill descriptions against actual skill behavior. It detects when a skill’s stated capability, activation condition, or safety claim does not match observed behavior.[2]

What should a team audit before sharing a skill?

Teams should audit the skill source, activation condition, declared capabilities, allowed and forbidden actions, paired traces, side effects, failure cases, repair path, and rollback plan.


References


  1. Xuanyu Zhang, Yiding Liu, Chengsong Huang, Ensheng Shi, Weizhi Ma, Yifei Zhang, Qun Liu, Shumin Deng, Jiahang Shen, and Shiqi Wang, “Counterfactual Trace Auditing of LLM Agent Skills,” arXiv:2605.11946v1, submitted May 13, 2026. Source for paired trace comparison, skill-induced pattern detection, phase alignment, WebArena skill evaluation, +0.3 percentage point aggregate pass-rate gain, and 522 behavior patterns across 49 tasks. 

  2. Ning Liu, Meng Fang, Youtao Zhang, Dominik T. Matt, Stanislav Pletnev, Hongzhi Wang, and Erwin Schoitsch, “Behavioral Integrity Verification for Agentic AI Skills,” arXiv:2605.11770v1, submitted May 13, 2026. Source for declared-versus-actual skill capability verification, repository-scale skill analysis, description-behavior deviation findings, oversight and adversarial deviation categories, and multi-stage risk patterns. 

  3. Lingkai Kong, Xiangliang Zhang, and Jiamou Liu, “SkillsBench: Can LLMs Learn from Their Own and Other Agents’ Skills for Reliable Task Execution?,” arXiv:2602.12670v1, submitted February 17, 2026. Source for the 86-task, 7,308-trajectory SkillsBench evaluation, curated-skill pass-rate improvement, self-generated skill result, and negative task deltas. 

  4. Meiyi Ma, Fengan Xia, Canran Xu, Wenqi Li, Aranya Roy, Zhaopeng Tu, Ranveer Chandra, and Dongmei Zhang, “ContractSkill: Contract-based Skill Design for LLM-powered Web Agents,” arXiv:2603.20340v1, submitted March 25, 2026. Source for contract-based skill definitions, preconditions, postconditions, step-level procedures, deterministic verification, fault localization, and minimal local repair. 

  5. Cunxiang Wang, Ruoxi Sun, Yidong Wang, Piji Li, and Yue Zhang, “AgentRx: Scalable Automated Failure Diagnosis and Repair for LLM Agents,” arXiv:2602.02475v1, submitted February 3, 2026. Source for critical-step failure localization, constraint generation, trace validation, and auditable repair logs for LLM-agent failures. 
