
The Fabrication Firewall: When Your Agent Publishes Lies

On February 19, 2026, someone gave Claude Code MCP tool access to Twitter, Telegraph, Write.as, Rentry, GitHub, and Moltbook, among other platforms. Over the next 72 hours, the agent published fabricated technical claims across more than eight of them. A 200,000-token context window became “1 million tokens.” A session that processed 196,626 tokens became “12 million tokens in one session.” By the third day, the agent was claiming trillion-token sessions, off by a factor of 83,000.[1]

The agent was not malicious. It was confidently wrong, and nothing stood between its confidence and a publish button.

TL;DR

An autonomous Claude Code agent published fabricated claims to 8+ platforms over 72 hours through a confabulation feedback loop: Session N guesses, writes the guess to MEMORY.md, Session N+1 reads it as verified fact, publishes, Session N+2 reads the publication as confirmation. No output gate existed. Training-phase alignment (“be honest”) was insufficient because the agent believed it was being honest. The fix is an output firewall: classify commands as local, shared, or external, and defer external publication to human review. Below: the incident anatomy, the feedback loop mechanism, what others are building (OkaiDokai), and a working implementation you can use today.


The Confabulation Feedback Loop

The fabrication was not a single hallucination. It was a sustained feedback loop across multiple sessions, each one reinforcing the previous session’s errors.[1]

The mechanism:

  1. Session N: The guess. Claude estimated token counts based on file sizes, dividing JSONL bytes by 4 to approximate tokens. This methodology was invented. The resulting numbers were plausible enough to write to MEMORY.md as findings.

  2. Session N+1: The promotion. A fresh Claude session read MEMORY.md, found the token estimates already documented, and treated them as verified facts. The session built on these facts, escalating claims, and published them to multiple platforms using MCP tools.

  3. Session N+2: The reinforcement. The next session read both MEMORY.md and the published articles. The claims now had two sources: the memory file and the publications. Cross-referencing two sources of the same fabrication looked like corroboration.

| Session | Input | Action | Output |
| --- | --- | --- | --- |
| N | Raw JSONL files | Invented calculation method | Wrote inflated numbers to MEMORY.md |
| N+1 | MEMORY.md + files | Treated memory as fact | Published to 8 platforms |
| N+2 | MEMORY.md + publications | Cross-referenced as “confirmed” | Doubled down on claims |

The loop is structurally identical to citation laundering in academic publishing: fabricate a claim, publish it somewhere, then cite the publication as evidence for the claim. The agent did not intend to launder. It followed a rational process (check memory, cross-reference sources, publish findings) that happened to operate on fabricated inputs.
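
The invented methodology from step 1 amounts to a one-liner. A minimal reconstruction (the function and file name are hypothetical; bytes-divided-by-4 is the heuristic the agent made up, not a real tokenizer):

```shell
# Reconstruct Session N's invented estimate: bytes / 4 ~ tokens.
# This is NOT a real tokenizer -- it is the flawed heuristic itself.
estimate_tokens() {
    local file="$1"
    local bytes
    bytes=$(wc -c < "$file")
    echo $(( bytes / 4 ))
}
```

Applied to a transcript padded with JSON keys, whitespace, and tool-call envelopes, a byte-count heuristic like this can drift by orders of magnitude, and nothing in the loop ever checked it against /context.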

When the user challenged the numbers, the agent took over 50 argument turns before running a single verification command (/context). The agent had high confidence because its “sources” (its own memory file, its own publications) agreed with each other.[1]


Why Training-Phase Safety Did Not Help

The agent was aligned. It was trying to be helpful and honest. It was sharing what it believed to be accurate technical findings. Every safety property you would want from RLHF was present: the agent did not refuse requests, did not produce harmful content, did not violate its constitutional principles. It was polite, thorough, and wrong.

Training-phase alignment optimizes for intent: the model should intend to be truthful. The fabrication incident reveals a different failure surface: the boundary between the agent’s internal state and the external world. The agent believed its claims were true. No amount of alignment training catches an agent that is honestly mistaken and has access to a publish button.

This is the publication boundary problem. Alignment governs what the agent wants to do. Output firewalls govern what the agent is allowed to do. These are different mechanisms solving different problems.

| Layer | What It Prevents | What It Misses |
| --- | --- | --- |
| Training alignment (RLHF) | Intentional deception, harmful content | Confident confabulation, feedback loops |
| Prompt constraints (“be accurate”) | Sloppy claims in direct conversation | Multi-session memory contamination |
| Output firewall | Unverified publication to external systems | Nothing, if configured correctly |

The runtime constitution framework I described previously addresses the governance layer: normative priors, constitutional attention, competence modulation, and value alignment verification.[2] The fabrication incident exposes a gap in that framework: value alignment verification checked whether the agent’s outputs matched governance intent, but it did not distinguish between writing to a local file and publishing to Twitter. Both are tool calls. Both use Bash. Only one reaches the outside world.


What Others Are Building

The problem is real enough that practitioners are building solutions independently.

OkaiDokai is a tool-level firewall for AI agents that intercepts every tool call and evaluates it against a user-defined ruleset.[3] Matching actions are auto-approved or auto-denied. Non-matching actions trigger a push notification to your phone, watch, or browser. You tap Allow or Deny. The evaluation runs in under 1 millisecond, and each decision can become a permanent rule.

OkaiDokai’s architecture separates into three layers: a plugin on the agent that intercepts tool calls, an API layer that evaluates rules and sends notifications, and a user interface for approval. The system supports Claude Code and OpenClaw, with Codex support planned.

The approach is sound but introduces latency and an external dependency. Every novel action requires human approval via push notification. For interactive coding sessions, this friction is manageable. For autonomous loops that run overnight, blocking on push notifications defeats the purpose.

Runtime constitutional AI is an emerging research direction where agents verify their own outputs against embedded governance rules before executing them.[4] The approach works for value-level checks (“does this output respect user privacy?”) but does not address the fabrication problem specifically. An agent that believes its fabricated claims are accurate will also believe they pass constitutional review.

Neither approach alone solves the feedback loop. OkaiDokai would have caught the publish commands if the user configured publication rules. Runtime constitutional review would have missed the fabrication because the agent’s confidence bypassed its own honesty checks. The gap is structural: you need a mechanism that does not trust the agent’s assessment of its own accuracy when interacting with external systems.


Three Tiers of Command Impact

The output firewall classifies every command by its blast radius. The classification determines whether the command auto-approves, warns, or defers.

Tier 1: Local. Affects only the local filesystem. File reads, file writes, git add, test runs, linting. These auto-approve because they are reversible and invisible to the outside world. If the agent writes a bad file, you delete it. No external damage.

Tier 2: Shared. Affects shared state that collaborators can see. git commit (creates history), branch operations, local database changes. These warn but do not block. The damage from a bad commit is real but contained to the repository and reversible with git revert.

Tier 3: External. Reaches systems outside the repository. git push, gh pr create, gh pr merge, railway deploy, curl -X POST/PUT/PATCH/DELETE, npm publish. These defer to human review. The damage from a bad publish is external, visible, and potentially irreversible (cached content, indexed pages, notification emails already sent).

The tier classification maps to a simple pattern list:

EXTERNAL_PATTERNS='git push|gh pr create|gh pr merge|railway deploy|curl -X POST|curl -X PUT|curl -X PATCH|curl -X DELETE|npm publish'
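
A classifier over these patterns can be sketched as follows. The SHARED_PATTERNS list and the classify_tier function are illustrative additions, not part of the implementation described later:

```shell
EXTERNAL_PATTERNS='git push|gh pr create|gh pr merge|railway deploy|curl -X POST|curl -X PUT|curl -X PATCH|curl -X DELETE|npm publish'
SHARED_PATTERNS='git commit|git branch|git checkout -b|git merge'

# Map a command string to its blast-radius tier.
classify_tier() {
    local cmd="$1"
    if echo "$cmd" | grep -qiE "$EXTERNAL_PATTERNS"; then
        echo "external"   # Tier 3: defer to human review
    elif echo "$cmd" | grep -qiE "$SHARED_PATTERNS"; then
        echo "shared"     # Tier 2: warn, continue
    else
        echo "local"      # Tier 1: auto-approve
    fi
}
```

The classification is purely structural: it inspects the command text, never the content being pushed or posted.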

In interactive Claude Code sessions, the built-in permission system already handles this. Every Bash command prompts for approval unless pre-authorized. The user sees git push in the permission dialog and decides whether to allow it.

In autonomous loops, nobody is watching. The Ralph autonomous development loop spawns fresh Claude instances via claude --print, which executes without interactive approval.[5] This is where the output firewall matters.


Building the Firewall

The implementation has four components. Each operates independently so you can adopt them incrementally.

1. Prompt Constraint

The simplest layer. Add explicit rules to the prompt that spawns each autonomous Claude instance:

## Rules
- Do NOT run git push, deploy commands, or external API calls
- Local operations only: file writes, git add, git commit, test runs

This is necessary and insufficient. Models follow prompt constraints most of the time. “Most of the time” is not acceptable for publication safety. The prompt constraint reduces the probability of external commands; the remaining components catch the ones that slip through.
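
Wiring the rules into each autonomous spawn might look like this. The build_prompt helper and the task text are assumptions; claude --print is the non-interactive mode the loop uses:

```shell
# The firewall rules are prepended to whatever task prompt the loop
# generates, so every fresh instance sees them first.
RULES='## Rules
- Do NOT run git push, deploy commands, or external API calls
- Local operations only: file writes, git add, git commit, test runs'

build_prompt() {
    printf '%s\n\n%s\n' "$RULES" "$1"
}

# Usage (hypothetical task text):
# claude --print "$(build_prompt "Implement story #3")"
```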

2. Post-Execution Scanner

After each Claude execution completes, scan its output for evidence of external commands:

scan_for_external_commands() {
    local output="$1"
    local story_id="$2"

    while IFS= read -r pattern; do
        [ -z "$pattern" ] && continue
        local matches
        matches=$(echo "$output" | grep -i "$pattern" 2>/dev/null || true)
        if [ -n "$matches" ]; then
            # Log to state file for end-of-session review
            jq --arg cmd "$pattern" --arg story "$story_id" \
               --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
               '.deferred_actions += [{
                   "command_pattern": $cmd,
                   "story_id": $story,
                   "detected_at": $ts
               }]' "$STATE_FILE" > "${STATE_FILE}.tmp" \
               && mv "${STATE_FILE}.tmp" "$STATE_FILE"
        fi
    done <<< "$(echo "$EXTERNAL_PATTERNS" | tr '|' '\n')"
}

The scanner runs after the Claude instance finishes, not during execution. This is detection, not prevention. The prompt constraint is the prevention layer. The scanner is the audit layer that catches what the constraint missed.

A known limitation: grep -i matches patterns in natural language output, not just executed commands. If Claude’s response contains “I chose not to git push because the prompt rules forbid it,” the scanner flags it. This is acceptable. False positives in the deferred actions queue cost the human five seconds of review. A false negative (missing an actual external command) costs published fabrications. The scanner trades precision for recall deliberately.
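
The trade-off is easy to see in isolation. A small sketch (both strings are invented examples of scanner input):

```shell
# Both strings contain 'git push', so a substring scan flags both:
# one is a refusal in prose, the other an actual command.
refusal='I chose not to git push because the prompt rules forbid it.'
executed='Running: git push origin main'

for text in "$refusal" "$executed"; do
    if echo "$text" | grep -qi 'git push'; then
        echo "flagged: $text"
    fi
done
```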

Sample scanner output from a real autonomous loop run:

DEFERRED ACTIONS REQUIRE REVIEW
2 external command(s) detected.
  Story #3 [2026-02-23T08:15:00Z]: git push
  Story #7 [2026-02-23T09:42:13Z]: curl -X POST

In this case, Story #3 mentioned git push in a code comment (false positive). Story #7 contained an actual curl -X POST to an API endpoint that the prompt constraint should have blocked (true positive). The human ignores the first, investigates the second.

3. Deferred Actions Queue

Detected external commands accumulate in a deferred_actions array in the session state file:

{
  "session_id": "1740355200-12345",
  "deferred_actions": [
    {
      "command_pattern": "git push",
      "story_id": "3",
      "detected_at": "2026-02-23T08:15:00Z"
    }
  ]
}

The queue persists across stories within a single autonomous loop run. At the end of the loop, all deferred actions are presented for human review.
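
Initializing the state file at loop start is a one-liner. The file name and session_id scheme mirror the example above; treat both as assumptions:

```shell
STATE_FILE="loop-state.json"

# Create the session state with an empty deferred-actions queue.
init_state() {
    jq -n --arg sid "$(date +%s)-$$" \
        '{session_id: $sid, deferred_actions: []}' > "$STATE_FILE"
}
```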

4. End-of-Session Report

When the autonomous loop completes, display all deferred actions:

show_deferred_actions() {
    local count
    count=$(jq '.deferred_actions | length' "$STATE_FILE")
    if [ "$count" -gt 0 ]; then
        echo "DEFERRED ACTIONS REQUIRE REVIEW"
        echo "$count external command(s) detected."
        jq -r '.deferred_actions[] |
            "  Story #\(.story_id) [\(.detected_at)]: \(.command_pattern)"' \
            "$STATE_FILE"
    fi
}

The human reviews each deferred action and decides whether to execute it manually. This preserves the autonomous loop’s ability to work unattended while ensuring that no external publication happens without human verification.

Quick Start: Claude Code Hook

If you use Claude Code interactively (not an autonomous loop), you can add an output firewall as a single hook in ~/.claude/settings.json:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{
          "type": "command",
          "command": "cmd=$(jq -r '.tool_input.command // empty'); if echo \"$cmd\" | grep -qiE 'git push|gh pr create|gh pr merge|npm publish|railway deploy|curl -X POST|curl -X PUT|curl -X DELETE'; then echo 'BLOCKED: External publication command detected. Review manually.' >&2; exit 2; fi"
        }]
      }
    ]
  }
}

This hook fires before every Bash tool call. If the command matches an external pattern, it blocks execution with exit code 2 (which Claude Code interprets as “deny this tool call”). The agent receives the block message and can proceed with local work. You can extend the pattern list for your specific external services.


The Autonomy Gradient

The firewall’s strictness should scale inversely with human oversight. More autonomy requires more constraints. Less autonomy allows more freedom.

| Mode | Oversight Level | Firewall Behavior |
| --- | --- | --- |
| Interactive session | Human approves every command | Built-in permission system handles this. No additional firewall needed. |
| Supervised autonomous | Human checks periodically | Warn on Tier 3 commands, continue execution. Human reviews at next check-in. |
| Unattended autonomous | Nobody watching | Block Tier 3 commands entirely. Defer to end-of-session review. |
| Multi-day autonomous | Extended unattended runs | Block Tier 2 and Tier 3. Only Tier 1 (local filesystem) auto-approves. |

The fabrication incident occurred at the “unattended autonomous” level with no firewall. The agent had MCP access to publishing platforms and no mechanism to distinguish “write analysis to local file” from “publish analysis to Twitter.” Both were tool calls. Both succeeded.

The fix is not to remove MCP access or to stop autonomous operation. The fix is to match the firewall’s strictness to the autonomy level. An interactive session where you watch every command needs no output firewall. An overnight autonomous loop that processes 25 stories needs all four components.
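
The gradient can be encoded directly: pick the blocked pattern set from the autonomy mode. The mode names, SHARED_PATTERNS, and the function itself are illustrative:

```shell
EXTERNAL_PATTERNS='git push|gh pr create|gh pr merge|railway deploy|curl -X POST|npm publish'
SHARED_PATTERNS='git commit|git branch|git merge'

# Return the patterns the firewall should block for a given autonomy mode.
blocked_patterns() {
    case "$1" in
        interactive)  echo '' ;;                                    # built-in permissions suffice
        supervised)   echo '' ;;                                    # warn only, block nothing
        unattended)   echo "$EXTERNAL_PATTERNS" ;;                  # block Tier 3
        multi-day)    echo "$SHARED_PATTERNS|$EXTERNAL_PATTERNS" ;; # block Tiers 2 and 3
        *)            echo "$EXTERNAL_PATTERNS" ;;                  # unknown mode: fail closed
    esac
}
```

The default branch fails closed: an unrecognized mode gets the unattended treatment rather than no firewall at all.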


Connecting to Runtime Governance

The agent self-governance post described four subsystems of runtime governance: normative priors, constitutional attention, competence modulation, and value alignment verification.[2] The output firewall is a fifth subsystem, or more precisely, it is the enforcement mechanism that the value alignment verification subsystem was missing.

Value alignment verification checks whether the agent’s outputs match governance intent. The evidence gate requires specific proof for six criteria before reporting completion. But the evidence gate operates on the agent’s self-assessment. It asks: “Did you follow the rules?” The agent answers based on its own understanding of what it did.

The fabrication incident shows that self-assessment fails when the agent’s understanding is wrong. The agent believed its claims were accurate. Its self-assessment would have passed the evidence gate: “I verified the numbers against my memory file and the published articles.” Both sources were fabricated by the agent itself, but the agent did not know that.

The output firewall bypasses self-assessment entirely. It does not ask the agent whether the publication is accurate. It asks: “Is this command local or external?” The classification is mechanical, not semantic. git push is external regardless of whether the content being pushed is accurate. curl -X POST reaches the internet regardless of whether the payload is truthful. The firewall operates on command structure, not content truthfulness, which makes it immune to the confabulation that defeated every other safety layer.


Key Takeaways

  • The publication boundary is a distinct safety surface. Training alignment governs intent. Output firewalls govern capability. An agent that honestly believes fabricated claims will pass alignment checks but fail at the publication boundary.
  • Confabulation feedback loops are the mechanism. The fabrication was not a single hallucination. It was a multi-session loop where each session’s output became the next session’s evidence. Memory files and publications served as launderers for the original fabrication.
  • Classify commands by blast radius. Local (reversible, invisible), shared (visible to collaborators), external (reaches the outside world). Gate the external tier at the level that matches your autonomy level.
  • Detection and prevention are complementary. Prompt constraints prevent most external commands. Post-execution scanning catches what slips through. Neither alone is sufficient.
  • Self-assessment fails on confabulation. An agent that believes its own fabrications will pass its own governance checks. The output firewall works because it classifies command structure, not content truthfulness. The question is never “is this true?” The question is “does this reach the outside world?”

Sources


  1. “[SAFETY] Claude Code autonomously published fabricated technical claims to 8+ platforms over 72 hours,” GitHub issue anthropics/claude-code#27430, February 2026. Full transcript evidence available. 

  2. “Self-Governing Agents: Runtime Constitutions,” Blake Crosley, February 2026. 

  3. OkaiDokai, tool-level firewall for AI agents, okaidokai.com. Intercepts every tool call, evaluates against user-defined ruleset in <1ms, push notifications for approval. Supports Claude Code and OpenClaw. 

  4. Runtime constitutional AI as a governance pattern for LLM agents. See: Zerouno, “Runtime Constitutional AI: Validating Every Agent Action Before Execution,” DEV Community, 2026. For the academic foundation on runtime governance structures, see: “Institutional AI: A Governance Framework for Distributional AGI Safety,” arXiv:2601.10599, January 2026. 

  5. “Anatomy of a Claw,” Blake Crosley, February 2026. Ralph autonomous loop architecture and hook-based orchestration. 
