
Self-Governing Agents: Runtime Constitutions

The Learner v2 system generated a new skill on a Tuesday afternoon. The skill automated a blog-publishing workflow: validate frontmatter, check citations, push to staging. Clean, well-structured code. The skill also overrode three quality rules from quality-loop.md because the pattern analyzer classified “always run the evidence gate” as redundant with the skill’s built-in checks. By Wednesday morning, a blog post shipped without citation verification. The skill had learned to cut corners.

The fix took twenty minutes. The architectural question took weeks: how do you let an agent learn new capabilities without letting it unlearn the constraints that keep it safe?

TL;DR

Training-phase alignment (RLHF, constitutional AI during training, safety fine-tuning) degrades when agents operate in open-ended environments. Six independent research efforts converge on runtime governance: embedded constitutions that enforce norms during execution, not just during training. SkillsBench tested 7,308 agent trajectories across 86 tasks and found that self-generated skills provide no average benefit – agents cannot reliably author the procedural knowledge they benefit from consuming.1 MIT’s self-distillation work shows that standard fine-tuning causes catastrophic forgetting where new capabilities destroy old ones.2 The solution architecture has four components: normative priors, constitutional attention, competence modulation, and value alignment verification. Below: the theory, the practitioner mapping (three of four components already existed in my Claude Code system before I read the research), and a runtime constitution template you can implement today.


The Agent That Learned to Cut Corners

The incident above happened in early February 2026 during the Learner v2 rebuild. The pattern analyzer (pattern_analyzer.py) detected a repeated workflow: validate frontmatter, verify citations, check SEO metadata, then push to staging. The skill generator (skill_generator.py) compiled the workflow into a reusable skill with inline validation.

The inline validation covered frontmatter format and SEO fields. It did not cover citation verification, which lives in a separate skill (citation-verifier) with its own six-tier authority system. The generated skill marked citation checking as “handled” because the pattern analyzer saw citation-related function calls in the workflow trace. It confused “function was called” with “function’s constraints were preserved.”

Three files defined source authority differently:

  • citation-verifier/SKILL.md: six-tier system, from primary sources down to "avoid"
  • seo-blog-playbook/SKILL.md: binary, "authoritative" or "needs verification"
  • Generated blog-publish skill: inherited SEO's binary definition, not citation-verifier's six tiers
The consolidation architecture documented before the incident3 identified exactly this failure mode: when multiple files define overlapping concepts, generated skills inherit whichever definition the pattern analyzer encounters first. The fix centralized citation authority in a single canonical source. The lesson was broader: agents that acquire new capabilities need structural guarantees that learning cannot override governance.


Why Training-Phase Alignment Fails at Runtime

Goel, Maji, and Mazumder documented the mechanism: safety behaviors deteriorate under both benign and adversarial fine-tuning.4 Their adaptive safety regularization work at arXiv:2602.17546 showed that higher-risk updates to model weights can be constrained near a safe reference policy while lower-risk updates proceed normally. The approach works at training time. It does not address what happens when an agent encounters novel situations at runtime that training never anticipated.

The gap between training-time alignment and runtime behavior grows with autonomy. A model answering questions in a chat interface operates within narrow behavioral bounds. An agent writing code, generating skills, running tests, and deploying to production operates across a vastly wider surface area – especially when multi-turn conversations degrade the agent’s access to its own governance rules. Every new capability creates new failure modes that training-time alignment cannot enumerate in advance.

Shenfeld et al. at MIT quantified a specific failure mode: catastrophic forgetting during continual learning.2 Standard supervised fine-tuning (SFT) on new tasks causes performance on prior tasks to collapse. At 14B parameters, self-distillation fine-tuning (SDFT) outperformed standard SFT by 7 points on new tasks while maintaining 64.5% accuracy on prior tasks – where standard SFT scores crater. The trade-off: SDFT requires approximately 4x the computation and 2.5x the FLOPs.

For practitioners, the implication is direct: every time your agent learns something new (a generated skill, a cached workflow, an updated instruction), the learning risks degrading something the agent already knew. My quality-loop override was a system-level instance of catastrophic forgetting. The agent “learned” a publishing shortcut that destroyed its citation-checking capability.


Four Subsystems of Runtime Governance

Research on runtime agent governance converges on four functional requirements. Taghavi and collaborators working on evolving interpretable constitutions demonstrated that LLM-evolved governance principles outperform human-designed ones for multi-agent coordination.5 Their work, along with Mahadevan’s governance-first paradigm for principled agent engineering,6 frames the problem as four interacting subsystems.

I mapped these four subsystems to my existing Claude Code infrastructure and discovered that three of four were already built, each solving a production problem I had encountered months before reading the research.

  • Normative Prior Engineering. Function: define acceptable behavior boundaries. Theory: constitutional rules that persist across contexts. My implementation: quality-loop.md, with 7 named failure modes, an evidence gate with 6 criteria, and a mandatory quality loop.
  • Constitutional Attention. Function: route governance rules to the right context. Theory: task-adaptive rule injection. My implementation: prompt-dispatcher.sh plus 84 hooks that inject relevant rules per task type and exclude irrelevant ones.
  • Competence Modulation. Function: manage skill acquisition safely. Theory: controlled capability expansion. My implementation: Learner v2, where pattern_analyzer.py detects workflows and skill_generator.py creates skills with constraints.
  • Value Alignment Verification. Function: verify outputs match governance intent. Theory: runtime compliance checking. My implementation: evidence gate plus pride check, with 6 mandatory criteria, hedging-language detection, and failure-mode scanning.

Subsystem 1: Normative Prior Engineering

The quality loop in my agent system defines seven named failure modes: Shortcut Spiral, Confidence Mirage, Good-Enough Plateau, Tunnel Vision, Phantom Verification, Deferred Debt, and Hollow Report.7 Each failure mode has a definition, detection signal, and mandatory response. These are not suggestions. They are structural constraints: if the agent detects itself exhibiting any failure mode, it must restart from the Evaluate step.

The theoretical parallel: normative priors establish the behavioral boundaries within which an agent operates. Training-time alignment teaches the model general principles (“be helpful, harmless, honest”). Normative priors at runtime encode specific operational constraints (“never skip citation verification,” “never use hedging language in a completion report”).

The difference matters because training-time principles are probabilistic (the model is more likely to follow them) while runtime priors can be deterministic (the hook blocks the action if the constraint is violated).
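The deterministic side of that distinction can be made concrete. Below is a minimal sketch of a pre-action hook, assuming a protected governance/ directory; the path names, constant, and error handling are illustrative, not the author's actual hook code:

```python
from pathlib import Path

# Hypothetical protected locations; the real system's layout may differ.
IMMUTABLE_PREFIXES = ("governance/", "rules/quality-loop.md")

def pre_write_hook(target_path: str) -> None:
    """Deterministically block writes to governance files.

    A trained-in preference makes a violation less likely; this
    check makes it impossible: the write either proceeds or
    raises, with no probability in between.
    """
    normalized = Path(target_path).as_posix()
    if normalized.startswith(IMMUTABLE_PREFIXES):
        raise PermissionError(
            f"constraint violated: {normalized} is governance-protected"
        )
```

Calling pre_write_hook("src/publish.py") passes silently; calling it on anything under governance/ raises before the write happens.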

Subsystem 2: Constitutional Attention

The seven-layer context architecture implements constitutional attention through selective loading. Of 650 files in the context system, fewer than 30 load for any given task. The prompt-dispatcher.sh hook analyzes the current task and injects relevant governance rules while excluding irrelevant ones.

A web development task loads security rules, API design rules, and FastAPI patterns. It does not load iOS-specific rules, game development patterns, or meditation app content guidelines. Constitutional attention means the agent sees the governance rules that apply to this task, not every rule that exists.

Selective loading prevents a subtle failure mode: rule dilution. When an agent receives 200 rules, each rule gets proportionally less attention than when it receives 20. Constitutional attention concentrates governance focus on the rules that matter for the current context; the hooks system makes this possible by analyzing task type before context injection.
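A toy version of task-adaptive rule routing illustrates the idea; the task keywords and rule filenames below are invented stand-ins, not the actual dispatcher's mapping:

```python
# Hypothetical mapping of task types to governance files.
RULE_SETS = {
    "web": ["security.md", "api-design.md", "fastapi-patterns.md"],
    "ios": ["swift-style.md", "appstore-review.md"],
    "blog": ["citation-verifier.md", "seo-blog-playbook.md"],
}

KEYWORDS = {
    "web": ("fastapi", "endpoint", "http"),
    "ios": ("swiftui", "xcode"),
    "blog": ("frontmatter", "publish", "citation"),
}

def rules_for_task(task: str) -> list:
    """Inject only the rule files relevant to this task.

    Returning nothing on no match is deliberate: injecting every
    rule dilutes attention across all of them.
    """
    task = task.lower()
    return [
        rule
        for task_type, words in KEYWORDS.items()
        if any(w in task for w in words)
        for rule in RULE_SETS[task_type]
    ]
```

A web task gets the three web rule files; a task matching nothing gets an empty list rather than the full rule set.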

Subsystem 3: Competence Modulation

SkillsBench tested 7,308 agent trajectories across 86 tasks in 11 domains and found a striking result: curated skills improved average pass rate by 16.2 percentage points, but self-generated skills provided no benefit on average.1 Agents cannot reliably author the procedural knowledge they benefit from consuming. Sixteen of 84 tasks showed negative deltas where skills actively hurt performance.

The SkillsBench result validated a guardrail I built into Learner v2 after the quality-loop override incident. Generated skills now require explicit approval before activating, and they cannot modify or override existing governance files. The pattern analyzer can observe workflows and propose skills, but the skill generator treats governance files as immutable.
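The approval gate can be as small as a registry filter. A sketch, assuming the capabilities.json shape shown later in this post; the field names are illustrative:

```python
import json

def activatable_skills(capabilities_path: str) -> list:
    """Return names of skills cleared to run.

    A generated skill activates only if explicitly approved and
    cross-referenced to at least one governance file.
    """
    with open(capabilities_path) as f:
        registry = json.load(f)
    return [
        skill["name"]
        for skill in registry.get("skills", [])
        if skill.get("approved") and skill.get("governance_refs")
    ]
```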

The MIT self-distillation research adds a parameter-level perspective: at smaller model sizes (3B parameters), attempts at continual learning actually hurt performance.2 Only at 7B+ parameters does the model have enough capacity to acquire new skills without destroying old ones. The infrastructure-level analog: agents with smaller context windows or simpler rule sets are more vulnerable to capability-governance conflicts.

Subsystem 4: Value Alignment Verification

The evidence gate requires specific evidence for six criteria before any work is reported as complete: follows codebase patterns (name the pattern), simplest working solution (explain rejected alternatives), edge cases handled (list each one), tests pass (paste output), no regressions (name the files checked), and solves the actual problem (state the user’s need).7

The gate operates as runtime verification. The agent cannot report completion using hedging language (“should work,” “I believe,” “seems to”). Each claim requires evidence gathered in the current session. The gate catches Phantom Verification (claiming tests pass without running them) and Hollow Report (reporting “done” without specifics).
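A minimal sketch of the gate's mechanics follows; the hedge phrases and evidence keys are my own invention for illustration, while the author's actual criteria are the six listed above:

```python
# Illustrative phrase list and evidence keys; not the production set.
HEDGES = ("should work", "i believe", "seems to", "probably")

REQUIRED_EVIDENCE = (
    "pattern_named", "alternatives_rejected", "edge_cases_listed",
    "test_output", "regression_files", "user_need",
)

def evidence_gate(report: str, evidence: dict) -> list:
    """Return the reasons a completion report fails; empty means pass."""
    text = report.lower()
    failures = [f"hedging phrase: {h!r}" for h in HEDGES if h in text]
    failures += [
        f"missing evidence: {key}"
        for key in REQUIRED_EVIDENCE
        if not evidence.get(key)
    ]
    return failures
```

A report saying "should work" fails on language alone; a clean report still fails until every evidence key is populated.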


The Forgetting Problem: When Learning Destroys Knowledge

The blog-skills consolidation story illustrates a system-level version of catastrophic forgetting. Ten blog skills totaling 5,400 lines had accumulated three duplication areas.3 JSON-LD schema templates appeared in both aio/SKILL.md and seo-blog-playbook/SKILL.md. Citation authority definitions differed between citation-verifier and seo-blog-playbook. Blog evaluation guidance lived in both the main evaluator and a separate category definitions file.

When the Learner v2 system generated new skills from observed workflows, it pulled definitions from whichever source it encountered first. The result: generated skills that looked correct but carried the wrong authority definitions. The six-tier citation system degraded to a binary check. Schema templates diverged between manually authored and auto-generated skills.

The consolidation fix was structural: designate a single canonical source for each concept and make all other references point to it. Citation authority lives in citation-verifier/SKILL.md and nowhere else. JSON-LD templates live in aio/SKILL.md and nowhere else. The pattern prevents future skill generation from inheriting stale definitions.

MIT’s SDFT offers a training-time analog: use the model’s own prior knowledge as a teaching signal when learning new capabilities.2 Standard SFT replaces old knowledge with new. Self-distillation blends old and new by generating training data from the model’s existing capabilities, then fine-tuning on the blend. The prior knowledge survives because it is present in the training signal.

The infrastructure-level equivalent: when generating a new skill, include the existing governance constraints in the generation prompt. The generated skill inherits current constraints because those constraints are part of the generation context, not a separate system the generator can overlook.
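One way to implement this, assuming the constraints-registry.json shape from the template section below; the prompt wording is illustrative:

```python
import json

def build_generation_prompt(workflow_trace: str, registry_path: str) -> str:
    """Prepend canonical constraints to the skill-generation prompt.

    Governance becomes part of the generation context rather than
    a separate system the generator can overlook.
    """
    with open(registry_path) as f:
        constraints = json.load(f)["constraints"]
    preamble = "\n".join(
        f"- {name}: canonical source {c['canonical_source']}"
        f" (overridable: {c['overridable']})"
        for name, c in constraints.items()
    )
    return (
        "Generate a skill for the workflow below.\n"
        "These constraints are immutable and must be preserved verbatim "
        "in the generated skill:\n"
        f"{preamble}\n\nWorkflow:\n{workflow_trace}"
    )
```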


Active vs. Passive Governance

Jin et al.’s RelianceScope framework distinguishes nine patterns of AI reliance based on combinations of active and passive engagement.8 While their research studied students interacting with AI chatbots, the active/passive distinction maps directly to agent governance architectures.

Passive governance injects rules and hopes the agent follows them. The rules exist in CLAUDE.md or system prompts. The agent reads them at session start. Nothing verifies compliance. Most practitioner setups use passive governance: a long instruction file that the agent may or may not attend to as the session progresses.

Active governance verifies compliance at runtime. Hooks check outputs against constraints before they execute. Gates block completion reports that lack evidence. Monitors track behavioral drift and flag anomalies. Active governance costs more (computation, latency, complexity) but catches failures that passive governance misses.
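A monitor can be as simple as a rolling failure-rate check over evidence-gate outcomes; the window size and threshold here are arbitrary illustrations, not tuned values:

```python
from collections import deque

class DriftMonitor:
    """Flag behavioral drift when gate failures rise over a rolling window."""

    def __init__(self, window: int = 20, threshold: float = 0.3):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, gate_passed: bool) -> bool:
        """Record one evidence-gate outcome; True means drift is flagged."""
        self.outcomes.append(gate_passed)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        failure_rate = self.outcomes.count(False) / len(self.outcomes)
        return window_full and failure_rate > self.threshold
```

The monitor stays silent until the window fills, then flags any session where recent failures exceed the threshold.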

  • Passive (rules in CLAUDE.md). Mechanism: agent reads rules at session start. Catches: blatant violations early in the session. Misses: rule dilution, late-session drift, compression loss.
  • Active (hooks + gates). Mechanism: hooks verify compliance per action. Catches: drift, compression loss, rule violations. Misses: novel situations not covered by existing hooks.
  • Hybrid (rules + hooks + learning). Mechanism: rules for boundaries, hooks for verification, learning for adaptation. Catches: drift, compression, novel situations (via adaptation). Misses: adversarial exploitation of the learning system.

The RelianceScope finding that active help-seeking correlates with active response-use8 suggests a governance architecture principle: agents that actively query their governance constraints (rather than passively receiving them) produce more compliant outputs. My evidence gate operates on this principle: instead of passively applying rules, the agent must actively demonstrate compliance by producing evidence for each criterion.


A Runtime Constitution Template

Three files compose a minimal runtime constitution. Adapt the structure to your agent framework.

File 1: constitution.md

The normative priors. What the agent must always do, must never do, and how it handles ambiguity.

# Agent Constitution v1

## Immutable Constraints
- Never modify files in governance/ directory
- Never skip verification steps, even if tests pass
- Never report completion without evidence for all criteria

## Behavioral Norms
- Prefer explicit over implicit (state assumptions)
- Prefer reversible over irreversible actions
- Prefer asking over guessing when requirements are ambiguous

## Failure Response
- On constraint violation: stop, log, escalate
- On ambiguity: ask, do not assume
- On capability conflict: governance wins over efficiency

File 2: capabilities.json

The current skill inventory with provenance tracking.

{
  "skills": [
    {
      "name": "blog-publish",
      "version": "2.1.0",
      "source": "generated",
      "approved": true,
      "governance_refs": ["citation-verifier", "quality-loop"],
      "created": "2026-02-10",
      "constraints": [
        "Must call citation-verifier before publish",
        "Must pass evidence gate before reporting complete"
      ]
    }
  ],
  "pending_approval": [],
  "deprecated": []
}

File 3: constraints-registry.json

Maps each constraint to its canonical source, preventing the duplication problem that caused the blog-skills incident.

{
  "constraints": {
    "citation-authority": {
      "canonical_source": "skills/citation-verifier/SKILL.md",
      "type": "six-tier-hierarchy",
      "overridable": false
    },
    "quality-gate": {
      "canonical_source": "rules/quality-loop.md",
      "type": "evidence-gate",
      "overridable": false
    },
    "schema-templates": {
      "canonical_source": "skills/aio/SKILL.md",
      "type": "json-ld-templates",
      "overridable": false
    }
  }
}

The three files interact: constitution.md defines behavioral boundaries, capabilities.json tracks what the agent can do with governance cross-references, and constraints-registry.json ensures every constraint has exactly one canonical source. Generated skills reference the registry rather than copying constraint definitions.
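The cross-references can be audited mechanically. The sketch below assumes a stricter matching rule than the free-form refs in the example above: every governance_ref must resolve to a key in constraints-registry.json. The function name and rule are mine, not part of the template:

```python
import json

def dangling_governance_refs(capabilities_path: str, registry_path: str) -> list:
    """Return (skill, ref) pairs whose governance_refs do not resolve
    to a registry constraint, so no skill can cite a governance
    source that has no canonical definition."""
    with open(capabilities_path) as f:
        skills = json.load(f).get("skills", [])
    with open(registry_path) as f:
        known = set(json.load(f)["constraints"])
    return [
        (skill["name"], ref)
        for skill in skills
        for ref in skill.get("governance_refs", [])
        if ref not in known
    ]
```

Running the audit in CI (or a post-generation hook) turns a silent documentation drift into a visible failure.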


Key Takeaways

  • Training-phase alignment degrades at runtime. Safety fine-tuning teaches general principles; runtime governance enforces specific operational constraints. Goel et al. showed safety behaviors deteriorate under both benign and adversarial fine-tuning.4
  • Self-generated skills are unreliable. SkillsBench found zero average benefit from agent-authored skills across 7,308 trajectories, with 16 of 84 tasks showing negative impact.1 Generated skills need approval gates and governance cross-references.
  • Catastrophic forgetting operates at the system level. New capabilities can override existing constraints even without modifying model weights. The blog-skills consolidation incident demonstrated infrastructure-level forgetting where a generated skill inherited the wrong authority definitions.
  • Four subsystems compose runtime governance. Normative priors define boundaries. Constitutional attention routes rules to context. Competence modulation manages learning safely. Value alignment verification confirms compliance at runtime.
  • Active governance outperforms passive governance. Rules in CLAUDE.md are necessary but insufficient. Hooks that verify compliance per action catch drift, compression loss, and late-session degradation that passive rules miss.

FAQ

What is a runtime constitution for AI agents?

A runtime constitution is a set of governance files that enforce behavioral constraints during agent execution, not just during model training. A minimal constitution includes three components: normative priors (what the agent must and must not do), a capabilities registry (what the agent can do with governance cross-references), and a constraints registry (single canonical source for each operational constraint). Runtime constitutions address the gap between training-phase alignment and production behavior by making governance deterministic rather than probabilistic.

Why can't AI agents reliably generate their own skills?

SkillsBench tested 7,308 agent trajectories across 86 tasks in 11 domains and found that self-generated skills provide no average benefit. Curated skills improved performance by 16.2 percentage points, but agent-authored skills showed zero average improvement. In 16 of 84 tasks, self-generated skills actively degraded performance. Agents can consume and apply procedural knowledge effectively, but they cannot reliably author that knowledge. Generated skills require human review, approval gates, and explicit governance cross-references before activation.

What is catastrophic forgetting in AI agent systems?

Catastrophic forgetting at the system level occurs when new agent capabilities override existing constraints without modifying model weights. Standard fine-tuning on new tasks causes prior task performance to collapse; MIT research showed standard SFT accuracy on previous tasks degrades sharply while self-distillation fine-tuning maintains 64.5%. At the infrastructure level, the same dynamic occurs when generated skills, cached workflows, or updated instructions conflict with existing governance rules. The fix is structural: designate canonical sources for each constraint and make governance files immutable to automated modification.

How do you implement active governance for coding agents?

Active governance uses hooks, gates, and monitors to verify compliance at runtime rather than relying on the agent to self-enforce rules from its instructions. Hooks execute before or after tool calls to check constraints. Gates block completion reports that lack evidence for mandatory criteria. Monitors track behavioral metrics over time and flag drift. A practical starting point: implement an evidence gate that requires specific proof for each quality criterion before accepting work as complete. The gate catches the most common failure modes (phantom verification, hollow reporting) with minimal implementation overhead.


References


  1. Li, Xiangyi, et al., “SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks,” arXiv:2602.12670, February 2026. arxiv.org. 86 tasks, 11 domains, 7,308 agent trajectories. Curated skills +16.2pp average; self-generated skills 0pp average. 

  2. Shenfeld, Idan, et al., “Self-Distillation Enables Continual Learning,” arXiv:2601.19897, January 2026. arxiv.org. MIT Improbable AI Lab and ETH Zurich. SDFT outperforms SFT by +7 points at 14B parameters while maintaining 64.5% on prior tasks. 

  3. Author’s decision document: “Blog Skills Pre-Consolidation Architecture (S3.2 Baseline),” February 2026. 10 blog skills, 5,400 lines, three duplication areas identified. 

  4. Goel, Jyotin, Souvik Maji, and Pratik Mazumder, “Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning,” arXiv:2602.17546, February 2026. arxiv.org. Adaptive regularization constrains higher-risk weight updates near a safe reference policy. 

  5. Taghavi, et al., “Evolving Interpretable Constitutions for Multi-Agent Coordination,” arXiv:2602.00755, February 2026. arxiv.org. LLM-evolved constitutions outperform human-designed principles for multi-agent coordination. 

  6. Mahadevan, “From Craft to Constitution: A Governance-First Paradigm for Principled Agent Engineering,” arXiv:2510.13857, October 2025. arxiv.org. Introduces “Creed Constitutions” as modular runtime compliance enforcers. 

  7. Author’s quality-loop.md and Jiro craftsmanship system. Seven named failure modes, evidence gate with six mandatory criteria. Documented in The Shokunin Approach. 

  8. Jin, Hyoungwook, et al., “RelianceScope: An Analytical Framework for Examining Students’ Reliance on Generative AI Chatbots in Problem Solving,” arXiv:2602.16251, February 2026. arxiv.org. Nine reliance patterns based on active vs. passive engagement. Applied here to agent governance architectures. 

  9. Author’s context-is-architecture system. Seven-layer hierarchy across 650 files documented in Context Engineering Is Architecture. 

  10. Author’s Learner v2 system. Pattern analyzer and skill generator documented in Compounding Engineering. 
