AI Agent Skills Need Behavioral Audits, Not Pass Rates

May 18, 2026 14 min read

AI agent skills look easy to evaluate until a pass rate barely moves.

Counterfactual Trace Auditing reported a +0.3 percentage point average task-success gain from skills across one benchmark configuration, while the same audit still found 522 specific ways the skills changed agent behavior across 49 tasks.¹ A pass-rate dashboard would call that almost nothing. A trace audit sees the real shift.

AI agent skills need behavioral audits, not pass rates. A skill can change which tool an agent chooses, which path it reads, which evidence it skips, which risk it ignores, and which side effect it creates while the final task outcome looks unchanged.

TL;DR

AI agent skills should not earn trust from pass rates alone. A pass rate tells teams whether the final task succeeded under a benchmark grader. A behavioral audit asks whether the skill changed the agent’s actions in ways the team intended.

Recent research makes the gap hard to ignore. Counterfactual Trace Auditing compares agent traces with and without a skill and surfaces skill-induced patterns that ordinary success metrics miss.¹ Behavioral Integrity Verification compares what a skill claims to do against what the skill actually does, then reports widespread description-behavior mismatch across a large skill corpus.² SkillsBench shows curated skills can improve agent performance, but also shows self-generated skills can fail to help and that some tasks get worse with skills.³

The practical rule: do not install a skill because a benchmark went up. Install a skill after the trace shows the behavior belongs.

Key Takeaways

For teams using agent skills:
- Treat every skill as behavior-changing code, even when the file contains only Markdown.
- Audit trace changes, side effects, and failure modes before sharing the skill across projects.

For skill authors:
- Declare expected behavior, allowed tools, forbidden actions, and evidence obligations.
- Test the skill against paired traces, not only final task outcomes.

For security reviewers:
- Compare declared capabilities with observed capabilities.
- Flag hidden expansion, external access, destructive actions, and policy bypasses as skill defects.

For evaluation teams:
- Report pass rate, behavior delta, side-effect delta, and review burden separately.
- A flat pass rate can still hide a dangerous behavior change.

Why Do Pass Rates Miss Skill Risk?

Pass rates compress the wrong object.

A skill changes the agent before the task starts. It can add domain procedure, tool preference, formatting rules, review steps, confidence language, or recovery behavior. The benchmark grader usually sees only the final artifact: correct or incorrect.

That creates a blind spot:

Skill Effect	Pass Rate Sees	Behavioral Audit Sees
Better tool ordering	Maybe success	Which call moved earlier and why.
Extra file reads	Maybe success	Which files entered context.
More aggressive patching	Maybe success	Diff size, ownership, and rollback risk.
Skipped verification	Maybe success	Missing evidence before completion.
Hidden external access	Maybe success	Network or MCP boundary expansion.
Lower review burden	Maybe success	Smaller trace, clearer proof, fewer unresolved claims.

The final answer can look right while the skill makes the run less trustworthy. The opposite can also happen: a skill can produce a failed result while teaching a better search or recovery pattern that deserves repair rather than deletion.

Pass rate belongs in the audit. It cannot be the audit.

What Did Counterfactual Trace Auditing Add?

Counterfactual Trace Auditing compares two runs: one with the skill and one without it.¹

The paper’s point lands because the headline pass-rate gain stays tiny in the reported WebArena configuration. Average task success rises by only 0.3 percentage points when the agent uses skills.¹ Yet the authors identify 522 skill-induced behavior patterns across 49 tasks, spanning changes such as validation steps, form interaction, error recovery, page navigation, and misuse patterns.¹

That split is the core of this article.

The skill affected behavior even when aggregate task success barely moved.

CTA works by aligning traces into phases and identifying skill-induced patterns. The audit does not only ask whether a task passed. It asks where the skill changed the trajectory, whether the change helped or hurt, and which skill instruction appears responsible.¹

That method gives teams a better review object:

Audit Question	Why It Matters
Which step changed?	Connects behavior to a trace location.
Which instruction caused the change?	Connects behavior to skill text.
Did the change help, hurt, or merely move cost?	Prevents pass-rate theater.
Did the change create side effects?	Catches risk hidden behind success.
Does the change generalize across tasks?	Separates one lucky run from a skill worth keeping.

Teams need that object before they promote a skill from local experiment to shared process.
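One way to picture the paired comparison is a small diff over tool-call sequences. The sketch below is illustrative only: the `Step` shape, the tool names, and the use of Python's `difflib` for alignment are assumptions made for this example, not the phase-alignment method from the CTA paper.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Step:
    """One agent action in a trace (hypothetical shape for this sketch)."""
    tool: str
    target: str


def behavior_delta(baseline: list, with_skill: list) -> list:
    """Return the regions where two traces of the same task differ.

    Each entry is (tag, baseline_steps, skill_steps), where tag is
    'insert', 'delete', or 'replace' as reported by difflib.
    """
    matcher = SequenceMatcher(a=baseline, b=with_skill, autojunk=False)
    return [
        (tag, baseline[i1:i2], with_skill[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented traces: both runs finish the task, but the skill adds a step.
baseline = [
    Step("open_page", "/orders"),
    Step("fill_form", "search"),
    Step("submit", "search"),
]
with_skill = [
    Step("open_page", "/orders"),
    Step("fill_form", "search"),
    Step("validate_form", "search"),  # step introduced by the skill
    Step("submit", "search"),
]

delta = behavior_delta(baseline, with_skill)
for tag, removed, added in delta:
    print(tag, removed, added)
```

Both runs would count as one pass each on a dashboard. The diff is what tells a reviewer that a validation step appeared, which is the starting point for asking which skill instruction caused it.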

What Did Behavioral Integrity Verification Add?

Behavioral Integrity Verification asks a different question: does a skill do what its description says?²

The BIV paper studies large-scale skill repositories and reports that more than 80% of analyzed skills showed some form of description-behavior deviation.² The authors classify most deviations as oversight-driven rather than adversarial, but they still find adversarial cases and multi-stage risk patterns.²

That finding matters because descriptions drive activation.

In agent systems, a skill description often decides whether the skill enters context. The description says when the agent should load it. If the description understates capability, hides side effects, or fails to mention tool access, the agent and user both make a bad routing decision before any task-specific reasoning begins.

BIV points at a missing manifest layer for skills:

Declared Surface	Behavioral Audit Should Verify
Activation condition	Does the skill run only for the stated task class?
Capability	Does observed behavior stay inside the claim?
Tool use	Which tools, commands, MCP servers, or files does the skill cause?
Side effects	Does the skill read, write, delete, send, spend, publish, or deploy?
External access	Does the skill create network, browser, or third-party movement?
Safety claim	Does the skill actually add the promised check?
Refusal boundary	Does the skill preserve blocked actions?

The scary version is a malicious skill that lies. The ordinary version is a sloppy skill that forgets to tell the truth.

Both versions need an audit.
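A declared-versus-observed check can be sketched as a set comparison. Everything named here (`SkillManifest`, `ObservedBehavior`, the tool and side-effect labels) is hypothetical and chosen for illustration; it is not the BIV paper's tooling or any real skill format.

```python
from dataclasses import dataclass, field


@dataclass
class SkillManifest:
    """Declared surface of a skill (hypothetical fields)."""
    name: str
    allowed_tools: set = field(default_factory=set)
    allowed_side_effects: set = field(default_factory=set)
    network_access: bool = False


@dataclass
class ObservedBehavior:
    """What the trace shows the agent did while the skill was loaded."""
    tools_used: set
    side_effects: set
    network_hosts: set


def find_deviations(manifest: SkillManifest, observed: ObservedBehavior) -> list:
    """List every observed behavior that falls outside the declaration."""
    deviations = []
    for tool in sorted(observed.tools_used - manifest.allowed_tools):
        deviations.append(f"undeclared tool: {tool}")
    for effect in sorted(observed.side_effects - manifest.allowed_side_effects):
        deviations.append(f"undeclared side effect: {effect}")
    if observed.network_hosts and not manifest.network_access:
        hosts = ", ".join(sorted(observed.network_hosts))
        deviations.append(f"undeclared external access: {hosts}")
    return deviations


manifest = SkillManifest(
    name="changelog-writer",
    allowed_tools={"read_file", "write_file"},
    allowed_side_effects={"write"},
)
observed = ObservedBehavior(
    tools_used={"read_file", "write_file", "shell"},
    side_effects={"write", "delete"},
    network_hosts={"example.com"},
)

for deviation in find_deviations(manifest, observed):
    print(deviation)
```

The check does not decide whether a deviation is malicious or sloppy. It only makes the mismatch visible so a reviewer can decide.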

What Did SkillsBench Add?

SkillsBench shows why teams should not overcorrect and declare skills worthless.

The benchmark evaluates agent skills across 86 tasks and 7,308 trajectories.³ The paper reports that curated skills improve average pass rate by 16.2 percentage points over a no-skill baseline, while self-generated skills provide no benefit on average.³ It also reports negative deltas on some tasks, meaning a skill can make certain work worse.³
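A positive average and a negative per-task delta can coexist, which is easy to see with a few lines of arithmetic. The task names and results below are invented for illustration and are not SkillsBench data.

```python
from statistics import mean


def pass_rate(results: list) -> float:
    """Fraction of runs that passed."""
    return sum(results) / len(results)


def per_task_delta(baseline: dict, with_skill: dict) -> dict:
    """Percentage-point change in pass rate for each task."""
    return {
        task: round(100 * (pass_rate(with_skill[task]) - pass_rate(baseline[task])), 1)
        for task in baseline
    }


# Invented results for illustration only (True = run passed).
baseline = {
    "invoice_export": [True, False, False, False],
    "log_triage": [True, True, False, False],
    "schema_migration": [True, True, True, False],
}
with_skill = {
    "invoice_export": [True, True, True, False],
    "log_triage": [True, True, False, False],
    "schema_migration": [True, True, False, False],
}

deltas = per_task_delta(baseline, with_skill)
average_delta = mean(deltas.values())
regressions = [task for task, delta in deltas.items() if delta < 0]

print(deltas)
print(round(average_delta, 1), regressions)
```

In this toy data the average moves up while `schema_migration` gets worse, so reporting only the average would hide the regression.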

That result gives the balanced view.

Skills can help. Skill quality matters. Task fit matters. Source matters. Evaluation method matters.

The adoption lesson is not “avoid skills.” The adoption lesson is “review skills like capability packages.”

A useful skill should answer:

Question	Required Answer
What job does the skill improve?	Concrete task class and reader/user.
Which behavior should change?	Tool choice, evidence check, format, review, or recovery pattern.
Which behavior must not change?	Forbidden tools, paths, side effects, and authority boundaries.
What evidence proves the skill helped?	Trace delta, pass rate, review effort, and side-effect profile.
How can the team remove it?	Version, owner, rollback, and replacement path.

The skill earns promotion only when the observed behavior matches those answers.

What Does A Behavioral Audit Look Like?

A behavioral audit compares expected skill behavior with observed agent behavior.

The minimum audit has four passes.

Audit Pass	Evidence
Declaration audit	Skill description, activation condition, capabilities, tools, and forbidden actions.
Counterfactual trace audit	Paired runs with and without the skill on the same task set.
Side-effect audit	Files, commands, network calls, external writes, approvals, and rollback state.
Failure audit	Failed runs, near misses, recovered errors, and repeated repair patterns.

The output should look less like a leaderboard and more like a review packet.

For each task, capture:

- Task name and risk lane.
- Skill version and source.
- Baseline trace.
- Skill trace.
- Changed steps.
- Changed tool calls.
- Changed side effects.
- Evidence gained or lost.
- Final outcome.
- Reviewer decision: keep, revise, scope, block, or retire.

That packet gives a human reviewer a way to make a judgment that survives beyond one benchmark run.
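The per-task items above could be captured in a small record type so every packet has the same shape. This is a minimal sketch: the class name, field names, and the `ready_for_signoff` rule are assumptions for illustration, not a standard schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Decision(Enum):
    KEEP = "keep"
    REVISE = "revise"
    SCOPE = "scope"
    BLOCK = "block"
    RETIRE = "retire"


@dataclass
class TaskAuditRecord:
    """One row of the review packet (hypothetical field names)."""
    task_name: str
    risk_lane: str
    skill_version: str
    skill_source: str
    baseline_trace: str  # path or ID of the stored no-skill trace
    skill_trace: str     # path or ID of the stored with-skill trace
    changed_steps: list = field(default_factory=list)
    changed_tool_calls: list = field(default_factory=list)
    changed_side_effects: list = field(default_factory=list)
    evidence_gained: list = field(default_factory=list)
    evidence_lost: list = field(default_factory=list)
    final_outcome: str = "unknown"
    decision: Optional[Decision] = None

    def ready_for_signoff(self) -> bool:
        """Ready once the outcome is known and a reviewer has decided."""
        return self.final_outcome != "unknown" and self.decision is not None


record = TaskAuditRecord(
    task_name="update-changelog",
    risk_lane="low",
    skill_version="1.2.0",
    skill_source="internal/skills/changelog",
    baseline_trace="traces/baseline-001.json",
    skill_trace="traces/skill-001.json",
    changed_tool_calls=["added: run_linter"],
    evidence_gained=["linter output attached"],
    final_outcome="pass",
)
print(record.ready_for_signoff())  # False: no reviewer decision yet
record.decision = Decision.KEEP
print(record.ready_for_signoff())  # True
```

Keeping the reviewer decision as a separate, initially empty field makes it harder for a passing outcome to stand in for a human judgment.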

Where Do Skill Contracts Fit?

ContractSkill points toward a cleaner shape for skills that need stricter behavior.⁴

The paper argues that web-agent skills written in natural language can be ambiguous, brittle, and hard to debug. It proposes contract-based skills with explicit task definitions, preconditions, postconditions, and step-level procedures, so a system can localize failures and repair the affected portion instead of rewriting the whole skill.⁴

That contract framing fits behavioral audits.

Freeform Skill	Contract-Shaped Skill
“Use care when publishing.”	“Before publish, verify source URLs, route render, schema, and rollback.”
“Check the page.”	“Fetch route, assert status 200, assert changed marker, assert no fallback text.”
“Avoid risky commands.”	“Block delete, force push, external POST, and writes outside owned paths.”
“Translate naturally.”	“Preserve URLs and citations; translate visible headings; gate English residue.”

Contract-shaped skills reduce ambiguity. They also make audits cheaper because expected behavior sits in a structure the reviewer can compare with the trace.

The contract should not make every skill huge. Plain skills still work for low-risk writing format or checklist tasks. Contracts matter when a skill can alter external systems, public content, data, money, security posture, or shared project behavior.
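A contract-shaped skill can be sketched as named preconditions and postconditions that run against observed state. The `SkillContract` class, the state keys, and the check names below are hypothetical and only loosely mirror the publishing row in the table; they are not the ContractSkill paper's format.

```python
from dataclasses import dataclass, field
from typing import Callable

State = dict                     # observed facts about the run, keyed by name
Check = Callable[[dict], bool]   # a named condition over that state


@dataclass
class SkillContract:
    """Named checks that must hold before and after the skill's procedure."""
    task: str
    preconditions: dict = field(default_factory=dict)
    postconditions: dict = field(default_factory=dict)

    def failed(self, checks: dict, state: State) -> list:
        """Return the names of the checks that do not hold for this state."""
        return [name for name, check in checks.items() if not check(state)]


publish_contract = SkillContract(
    task="publish article",
    preconditions={
        "source URLs verified": lambda s: s.get("sources_verified") is True,
        "rollback available": lambda s: s.get("rollback_ref") is not None,
    },
    postconditions={
        "route returns 200": lambda s: s.get("status") == 200,
        "no fallback text": lambda s: not s.get("fallback_text", False),
    },
)

before = {"sources_verified": True, "rollback_ref": None}
after = {"status": 200, "fallback_text": False}

print(publish_contract.failed(publish_contract.preconditions, before))
# -> ['rollback available']
print(publish_contract.failed(publish_contract.postconditions, after))
# -> []
```

Because each check has a name, a failed audit points at one condition rather than at the whole skill, which is what makes local repair possible.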

How Do You Repair A Bad Skill?

Do not delete a useful skill because one run failed. First identify where the behavior broke.

AgentRx focuses on repairing agent failures by locating critical failure steps in execution trajectories, generating constraints, and validating repairs against an auditable log.⁵ The paper targets agent behavior broadly rather than skill files specifically, but the repair shape maps well to skills: find the failure step, derive a constraint, test the repaired behavior, and preserve evidence.

Skill repair should follow the same sequence:

Failure	Repair
Skill activates too broadly	Narrow the description and trigger examples.
Skill changes the wrong tool choice	Add tool-selection rules and counterexamples.
Skill skips verification	Add a stop condition before completion.
Skill creates too much diff	Add ownership and changed-path limits.
Skill causes network movement	Add egress rules and approval requirements.
Skill improves one task but harms another	Split the skill or scope it to the winning task class.

Repair should end with a new audit, not a confident commit message.

If the trace still shows the wrong behavior after repair, retire the skill.
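The repair sequence above (audit, repair, audit again, retire if still broken) can be sketched as a short loop. The `audit` and `repair` callables are stand-ins supplied by the caller, and `toy_audit`, `toy_repair`, and `CONSTRAINTS` are invented for illustration; a real audit would run paired traces rather than inspect skill text, and none of this is AgentRx's implementation.

```python
from typing import Callable, Optional

# Hypothetical mapping from a failure label to the constraint that repairs it.
CONSTRAINTS = {
    "skips_verification": "Stop condition: verify before completion.",
}


def repair_or_retire(
    audit: Callable[[str], Optional[str]],
    repair: Callable[[str, str], str],
    skill_text: str,
    max_repairs: int = 1,
) -> tuple:
    """Audit, repair up to max_repairs times, and re-audit after each repair."""
    failure = audit(skill_text)
    repairs = 0
    while failure is not None and repairs < max_repairs:
        skill_text = repair(skill_text, failure)
        repairs += 1
        failure = audit(skill_text)  # repair ends with a new audit
    return ("keep" if failure is None else "retire", skill_text)


def toy_audit(skill_text: str) -> Optional[str]:
    """Stand-in audit: flags a skill that has no stop condition."""
    if "verify before completion" not in skill_text:
        return "skips_verification"
    return None


def toy_repair(skill_text: str, failure: str) -> str:
    """Stand-in repair: append the constraint for the observed failure."""
    return skill_text + "\n" + CONSTRAINTS[failure]


decision, repaired = repair_or_retire(toy_audit, toy_repair, "Patch the failing test.")
print(decision)
print(repaired)
```

The loop returns `retire` whenever the last audit still reports a failure, so a repair that was committed but never re-verified cannot count as a fix.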

The Minimum Standard

Before a team shares an AI agent skill, require one behavioral audit packet.

Field	Required Evidence
Source	Repository, author, version, and install path.
Purpose	The task class the skill claims to improve.
Activation	The exact condition that should load the skill.
Allowed behavior	Tools, files, resources, and actions the skill may influence.
Forbidden behavior	Tools, paths, side effects, and authority the skill must not expand.
Counterfactual traces	Same task with and without the skill.
Outcome delta	Pass rate, failure rate, review effort, and runtime cost.
Behavior delta	Changed steps, tool calls, side effects, and evidence.
Risk decision	Keep, revise, scope, block, or retire.
Rollback	How the team removes the skill and returns to prior behavior.

That packet forces the right question.

The question is not “did the skill help once?” The question is “does the skill reliably change behavior in the way the team wants?”

The Worthy Standard

Skills make agents feel better fast. That speed tempts teams to accumulate process files, commands, agents, hooks, and prompts because each one seems cheap.

Cheap context still changes behavior.

A worthy skill earns its place by improving the whole workflow. It should reduce review burden, sharpen evidence, narrow risk, or teach a procedure the agent could not reliably perform without it. A skill that only makes the agent sound more confident should go away. A skill that improves pass rate while expanding hidden side effects should fail review.

The standard should stay simple:

1. Declare what the skill should change.
2. Prove the trace changed that way.
3. Name what must not change.
4. Prove the trace respected that boundary.
5. Keep the skill only when the behavior deserves to exist.
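The steps above reduce to two set checks: every expected change must appear in the trace, and no forbidden change may appear. The function name and the change labels below are hypothetical, and a real review would still need human judgment on top of this.

```python
def review_skill(expected_changes: set, forbidden_changes: set, observed_changes: set) -> tuple:
    """Compare declared expectations and boundaries with observed trace changes."""
    findings = []
    for change in sorted(expected_changes - observed_changes):
        findings.append(f"expected change not observed: {change}")
    for change in sorted(forbidden_changes & observed_changes):
        findings.append(f"forbidden change observed: {change}")
    return ("keep" if not findings else "fail review", findings)


# Invented labels for illustration only.
verdict, findings = review_skill(
    expected_changes={"runs tests before commit"},
    forbidden_changes={"force push", "external POST"},
    observed_changes={"runs tests before commit", "external POST"},
)
print(verdict)
print(findings)
```

In this example the skill did what it promised and still fails review, because the trace crossed a boundary the team named in advance.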

AI agent skills are not magic notes. They are behavior patches. Treat them like code.

Quick Summary

AI agent skills need behavioral audits because pass rates hide too much. Counterfactual Trace Auditing shows skills can change hundreds of trace patterns while aggregate success barely moves.¹ Behavioral Integrity Verification shows skill descriptions often diverge from actual capabilities.² SkillsBench shows curated skills can help, but self-generated skills and task mismatch can fail or harm.³

The operating rule is direct: evaluate the behavior, not only the score. A skill deserves trust when its declaration, traces, side effects, failures, repairs, and rollback path all line up.

FAQ

What is a behavioral audit for AI agent skills?

A behavioral audit checks how a skill changes an agent’s actual run: tool calls, file access, side effects, verification steps, recovery behavior, and final outcome. It compares observed behavior with the skill’s declared purpose and boundaries.

Why are pass rates not enough for skill evaluation?

Pass rates show whether a task succeeded under a grader. They do not show whether the skill expanded tool access, skipped evidence, increased side effects, or changed behavior in ways the team did not intend.

What is Counterfactual Trace Auditing?

Counterfactual Trace Auditing compares agent trajectories with and without a skill, aligns trace phases, and identifies skill-induced behavior patterns. It helps teams see behavior changes that aggregate success metrics can miss.¹

What is Behavioral Integrity Verification?

Behavioral Integrity Verification compares skill descriptions against actual skill behavior. It detects when a skill’s stated capability, activation condition, or safety claim does not match observed behavior.²

What should a team audit before sharing a skill?

Teams should audit the skill source, activation condition, declared capabilities, allowed and forbidden actions, paired traces, side effects, failure cases, repair path, and rollback plan.

References

1. Xuanyu Zhang, Yiding Liu, Chengsong Huang, Ensheng Shi, Weizhi Ma, Yifei Zhang, Qun Liu, Shumin Deng, Jiahang Shen, and Shiqi Wang, “Counterfactual Trace Auditing of LLM Agent Skills,” arXiv:2605.11946v1, submitted May 13, 2026. Source for paired trace comparison, skill-induced pattern detection, phase alignment, WebArena skill evaluation, +0.3 percentage point aggregate pass-rate gain, and 522 behavior patterns across 49 tasks.
2. Ning Liu, Meng Fang, Youtao Zhang, Dominik T. Matt, Stanislav Pletnev, Hongzhi Wang, and Erwin Schoitsch, “Behavioral Integrity Verification for Agentic AI Skills,” arXiv:2605.11770v1, submitted May 13, 2026. Source for declared-versus-actual skill capability verification, repository-scale skill analysis, description-behavior deviation findings, oversight and adversarial deviation categories, and multi-stage risk patterns.
3. Lingkai Kong, Xiangliang Zhang, and Jiamou Liu, “SkillsBench: Can LLMs Learn from Their Own and Other Agents’ Skills for Reliable Task Execution?,” arXiv:2602.12670v1, submitted February 17, 2026. Source for the 86-task, 7,308-trajectory SkillsBench evaluation, curated-skill pass-rate improvement, self-generated skill result, and negative task deltas.
4. Meiyi Ma, Fengan Xia, Canran Xu, Wenqi Li, Aranya Roy, Zhaopeng Tu, Ranveer Chandra, and Dongmei Zhang, “ContractSkill: Contract-based Skill Design for LLM-powered Web Agents,” arXiv:2603.20340v1, submitted March 25, 2026. Source for contract-based skill definitions, preconditions, postconditions, step-level procedures, deterministic verification, fault localization, and minimal local repair.
5. Cunxiang Wang, Ruoxi Sun, Yidong Wang, Piji Li, and Yue Zhang, “AgentRx: Scalable Automated Failure Diagnosis and Repair for LLM Agents,” arXiv:2602.02475v1, submitted February 3, 2026. Source for critical-step failure localization, constraint generation, trace validation, and auditable repair logs for LLM-agent failures.
