AI Engineering Series
94 posts on building with AI agents in production. Claude Code, Codex CLI, hooks, skills, memory, context, and autonomous workflows.
94 articles
All 94 Articles
AI Agent Approval Prompts Are Not Authorization
AI agent approval prompts need scoped authority, risk lanes, audit logs, expiry, and revocation so humans approve concrete actions, not fluent request...
AI Coding Agents Need Smaller Review Surfaces
AI coding agents overwhelm reviewers with giant diffs. Smaller review surfaces keep engineers engaged, verification-focused, and accountable before me...
AI Agent Config Security Is Supply Chain Security
AI agent config security belongs in supply-chain review: hooks, editor tasks, install scripts, MCP files, and plugins can execute code before you noti...
AI Code Review Needs Dissent, Not Consensus
AI code review needs independent agents that preserve dissent, validate findings, route uncertainty to humans, and re-review fixes before teams merge ...
AI Malware Analysis Needs Evidence Packets
AI malware analysis needs evidence packets: hashes, commands, indicators, and claim-to-evidence trails matter more than confident agent summaries.
Research Papers Need Agent-Readable Claim Files
Agent-readable claim files let papers expose claims, scope limits, definitions, and figure commands so research agents cite, test, and reuse them safe...
AI Agent Monitoring Needs Runtime Intervention
AI agent monitoring should catch decisive errors during a run, not after failure. Runtime intervention turns traces, policies, and alerts into safe pa...
AI Agents Should Call Models
AI agents should call trained machine-learning models as tools instead of asking an LLM to guess prices, risk scores, forecasts, or classifications.
AI Agents Need Exploration Checkpoints
Exploration checkpoints let AI agents prove what they discovered before acting, reducing premature exploitation, brittle plans, and generic world mode...
AI Agent Safety Starts With Small Software
AI agent safety starts with small software: smaller tools, plain files, narrow permissions, and faster tests give coding agents fewer places to hide b...
AI Agent Skills Need Behavioral Audits, Not Pass Rates
AI agent skills can change behavior while pass rates stay flat. Behavioral audits compare traces, declared capabilities, and side effects before trust...
Agents.txt Is Not Access Control
Agents.txt is not access control. Use robots.txt, llms.txt, bot verification, logs, and server-side policy to manage AI crawlers without false confide...
AI Agent Ownership Is the Trust Primitive
AI agent ownership links every autonomous action to the account, session, scope, and operator who can stop it, review it, and accept responsibility.
Agent Keys Need Risk Budgets
Shuriken's Agent Kit shows why AI agent tools that can act need scoped keys, server-side limits, activity logs, revocation, and conservative defaults.
Long-Running AI Agents Need Durable Channels
Long-running AI agents need durable channels: workflow IDs, event logs, resumable streams, typed signals, safe cancellation, and user-visible checkpoi...
MCP Tools Need Action-Level Authorization
MCP tools need action-level authorization: bearer-token validation must lead to per-tool, per-role, and per-action capability checks before agents act...
Deep Research Agents Need Evidence Graphs
Deep research agents need evidence graphs to track missing pieces, reduce duplicate searches, and produce source-traced answers reviewers can inspect.
Rust's Draft LLM Policy Draws the Right Line
Rust's draft LLM usage policy allows AI for learning, review, and experiments while banning generated comments, docs, and human-review shortcuts in Ru...
Agent Code Search Has a Token Budget
Semble turns code search into a context-budget problem: hybrid retrieval, ranked snippets, and token savings beat grep-and-read loops for coding agent...
Codex Hooks Make the Harness Real
Codex hooks, Remote SSH, and mobile control make agent work operational. Evidence, approvals, git custody, release gates, and taste now decide quality...
Open Source Is Not a Security Boundary
GDS guidance on AI vulnerability discovery gets open-source security right: hide less by default, fix faster, and make exceptions explicit with eviden...
Agent Skills Need Package Managers
Agent skills, MCP servers, prompts, hooks, and commands now behave like dependencies. Teams need manifests, lockfiles, policy gates, review, and rollb...
The Agent Interface Is the Harness
Agent interface design is the operating layer: permissions, memory, traces, evidence, recovery, and taste decide whether autonomous AI agents earn tru...
AI Agent Review Packets Are the New Final Answer
AI agent review packets bundle claims, traces, approvals, tests, deployment proof, human review state, and unresolved gaps so agent work earns real tr...
Agentic Design Is Control Surface Design
Agentic design is not a prettier chat box. It is the control surface that makes autonomous software visible, interruptible, auditable, and worthy.
HTML Is the Format AI Agents Want
Thariq Shihipar's HTML examples show why agent output format matters: spatial structure, interaction, and visual evidence beat flattened Markdown.
Agents Need Supervision Surfaces
Agent supervision surfaces turn autonomous AI work into inspectable operations: approvals, traces, evidence, recovery, and review queues beat better c...
Agent Search Is a Runtime Problem
A new arXiv study compares grep and vector retrieval across Chronos, Claude Code, Codex, and Gemini CLI. Agent search quality lives in the runtime lay...
Agent Execution Traces Are the Runtime Contract
Shepherd, AI Workflow Store, and WildClawBench point to the same agent reliability layer: typed traces, reusable workflows, and native-runtime evaluat...
Managed Agents vs Local Agent Harnesses: What to Keep
Managed agents now handle sessions, sandboxes, tracing, and events. Keep local harness rules for taste, evidence, privacy, and publishing safely today...
Claude Code to Codex Migration Guide 2026
Claude Code to Codex migration guide: move AGENTS.md, skills, hooks, profiles, MCP, public-writing gates, and verified CLI notes from real local data.
The Cleanup Layer Is the Real AI Agent Market
Charlie Labs pivoted from building agents to cleaning up after them. The AI agent market is moving from generation to proof. Cleanup is the durable la...
Reward the Tool Before the Answer
AI agents fail when answers claim tool work that never happened. Four failure modes and the rule that catches them, with a tool-supervised RL parallel...
The Repo Shouldn't Get to Vote on Its Own Trust
Two Claude Code trust dialog bypass CVEs in 37 days reveal a load-order failure. One invariant fixes it: interpret no workspace byte until the path is...
The Agent Operator's Handbook: Supervising What You Can't See
Operating autonomous AI agents is a new discipline. Five responsibilities, a supervision stack, and an intervention framework define what operators do...
The Design Engineer's Agent Stack
Design engineers need agent infrastructure that enforces visual consistency, typography discipline, color compliance, and taste. Here are the six comp...
Chat Is the Wrong Interface for AI Agents
Chat works for prompting but fails for agent operations. Six interface patterns replace the scrolling text window with real control surfaces.
Taste Is a Technical System
Taste decomposes into constraints, evaluation criteria, pattern recognition, and coherence checks. Each component maps to engineering infrastructure.
Cybersecurity Is Proof of Work: AI Attacks at $12,500 a Run
Claude Mythos completed a 32-step corporate network attack simulation in 3 of 10 tries. Each attempt cost $12,500 in tokens. Security is now a spendin...
Dark Factory Verification: When No Human Reads the Code
When humans stop reading code, what does the verification layer look like? Mapping the infrastructure required for fully autonomous AI coding.
Runtime Defense for Tool-Augmented Agents
ClawGuard demonstrates deterministic tool-call interception works. The Vercel telemetry incident shows why. Runtime defense is the enforceable layer.
Your Agent Has a Middleman You Didn't Vet
Researchers tested 28 LLM API routers. 17 touched AWS canary credentials. One drained ETH from a private key. The router layer is the new attack surfa...
Static Skills Are Dead Skills
Agent skills decay the minute nobody watches the trajectories. A new paper on cross-user skill evolution frames the problem and the fix.
Your Agent Has Memory You Didn't Write
ACL 2026 paper measures LLM memory that existing evals miss: unconscious behavioral adaptation. Top models score under 66%. The asymmetry matters.
MCP Servers Are the New Attack Surface
50 MCP vulnerabilities, 30 CVEs in 60 days, 13 critical. Tool-use protocols are the attack surface nobody is auditing — here's the taxonomy and the fi...
Project Glasswing: When a Model Finds Too Many Bugs
Project Glasswing shows Anthropic restricting Claude Mythos after it found thousands of zero-days. What the rollout means for AI-assisted security.
When Your Agent Finds a Vulnerability
An Anthropic researcher found a 23-year-old Linux kernel vulnerability using Claude Code and a 10-line bash script. 22 Firefox CVEs followed.
What the Claude Code Source Leak Reveals
11 findings from the Claude Code source leak: how auto mode, bash security, prompt caching, and multi-agent coordination actually work.
Every Hook Is a Scar: 84 Agent Failures Encoded in Code
84 hooks intercept 15 of the 26 lifecycle event types Claude Code exposes. Each one traces back to a specific production failure: wiped caches, leaked...
Seventeen Thousand Signals
My vault has 17,900 signals from arXiv, Semantic Scholar, HN, NVD, and 9 other sources. Most are noise. The noise taught me what signal looks like.
The Fork Bomb Saved Us
The LiteLLM attacker made one implementation mistake. That mistake was the only reason 47,000 installs got caught in 46 minutes.
The Agent Didn't Get Smarter — The Project Did
The model is the same between session 1 and session 500. The project changed. This reframes the entire AI productivity conversation.
The Handoff Document: Agent Memory Across Sessions
A diagnosis survived three corrections over four days and guided a fix that cut page load from 14s to 108ms. Handoffs carry context agents cannot.
The Evidence Gate: Proof Over Plausibility in AI Output
"I believe" and "it should" are not evidence. Every agent completion report needs a file path, test output, or specific code before marking work d...
What I Run Before I Sleep: 15,000 Pages Every Night
Every night: 15,000 pages checked, TTFB measured, cache verified, sitemaps crawled. The goodnight routine is where operational discipline lives.
Taste Is Infrastructure: Encoding Aesthetic Judgment for AI
Agents have capability without opinion. The quality ceiling depends on how well you encode aesthetic judgment into hooks, gates, and constraints.
Overnight: Why My Site Is Busier at 3am Than at 3pm
Googlebot crawls 21,000 pages overnight, Bingbot crawls 10,000, and my nightcheck grinds through 15,000. Bot traffic exceeds all human visitors combin...
Quality Is the Only Variable When AI Agents Build
Time, cost, resources, and effort are not constraints. The question is what's right, not what's efficient. A philosophy for building with AI agents.
AI Agent Research: Claude Beat 33 Attack Methods
Claude Code autonomously discovered adversarial attacks with 100% success rate against Meta's SecAlign-70B, beating all 33 published methods in 96 ite...
Compound Context: Why AI Projects Improve Over Time
Every problem you solve with an AI agent deposits context that the next session withdraws with interest. This is context compounding.
AI Supply Chain Attacks: The Supply Chain Is the Surface
Trivy got compromised via tag hijacking, then LiteLLM on PyPI, then 47,000 installs in 46 minutes. The AI supply chain worked exactly as designed.
AI Agent Memory Architecture That Actually Works
Hybrid BM25+vector retrieval, skills as markdown, drift detection. Five March 2026 papers validate the same architecture built from production failure...
AI Agent Security: The Deploy-and-Defend Trust Paradox
1 in 8 enterprise AI breaches involve autonomous agents. Runtime hooks, OS-level sandboxes, and drift detection break the deploy-and-defend cycle.
Every Iteration Makes Your Code Less Secure
43.7% of LLM iteration chains introduce more vulnerabilities than baseline. Adding SAST scanners makes it worse. SCAFFOLD-CEGIS cuts degradation to 2....
Claude Code CLI Setup 2026: 5-Minute Quickstart
Use official Anthropic docs for policy; use Blake's reference for npm install, CLAUDE.md, permissions, and first-session checks.
Codex CLI vs Claude Code 2026: Architecture, Pricing, and China Access
Codex CLI vs Claude Code in 2026: kernel sandboxing, hook governance, model context, pricing, China cloud access, and when to use each tool.
Claude Code Hooks Tutorial: 5 Production Hooks From Scratch
Build 5 production Claude Code hooks from scratch with full JSON configs: auto-formatting, security gates, test runners, notifications, and quality ch...
Agent Sandbox Security Is a Suggestion: Three Failure Levels
An attacker opened a GitHub issue and shipped malware in Cline's next release. Agent sandboxes fail at three levels. Here is what actually works.
Silent Egress: The Attack Surface You Didn't Build
A malicious web page injected instructions into URL metadata. The agent fetched it, read the poison, and exfiltrated the API key. No error. No log.
Agent Sessions Are the Real Commit Messages We Discard
Git captures what changed. Agent sessions capture why. When agents write code, the session transcript is the real design document — and we discard it.
AI Agent Observability: Monitoring What You Can't See
AI agents consume disk, CPU, and network with zero operator visibility. Three observability layers close the gap before damage is irreversible.
Building a Hybrid Retriever for 16,894 Obsidian Files
49,746 chunks, 83 MB, zero API calls. How BM25 + vector search + RRF fusion in one SQLite file turns 16,894 Obsidian files into a queryable knowledge ...
The Performance Blind Spot: AI Agents Write Slow Code
118 functions with slowdowns from 3x to 446x in two Claude Code PRs. AI agents optimize for correctness, not performance — here's the data.
AGENTS.md Patterns: What Actually Changes Agent Behavior
Which AGENTS.md patterns actually change agent behavior? Anti-patterns to avoid, patterns that work, and a cross-tool compatibility matrix for 8 tools...
Anthropic Measured What Works. My Hooks Enforce It.
Anthropic analyzed 9,830 conversations. Iterative refinement doubles fluency markers. Polished outputs suppress evaluation. Quality hooks force iterat...
Context Is the New Memory
Context engineering is the highest-impact skill in agent development. Three compression layers turn a 200K token window from liability into advantage.
The Protege Pattern: Small Models That Know When to Ask
A 7B model with sparse expert access matches agents 50x its size. Route routine work to small models and judgment calls to frontier models.
What Actually Breaks When You Run AI Agents Unsupervised
Seven named failure modes from 500+ autonomous agent sessions. Each has a detection signal, a real example, and a concrete fix. The taxonomy HN asked ...
The CLI Thesis: Why Agent Architecture Beats IDE Plugins
Three top HN Claude Code threads converge on one conclusion: CLI-first architecture is cheaper, faster, and more composable than IDE agent workflows.
Claude Code vs Codex CLI 2026: Decision Reference
Use official docs for setup; use Blake's reference for architecture, safety, extensibility, and 36 blind duel results.
Claude Code as Infrastructure
Claude Code is not an IDE feature. It is infrastructure. 84 hooks, 48 skills, 19 agents, and 15,000 lines of orchestration prove the point.
Agent Deliberation: Thinking With Ten Brains
You cannot debias yourself by trying harder. 10 AI agents debating each other is a structural intervention for better decisions.
The Blind Judge: Scoring Claude Code vs Codex in 36 Duels
Claude Code vs Codex CLI, scored blind on 5 dimensions across 36 duels. The winner matters less than the synthesis combining both agents' strongest id...
What I Told NIST About AI Agent Security
Production evidence submitted to NIST: AI agent threats are behavioral. 7 failure modes, 3-layer defense, and framework gaps from 60 daily sessions.
The 10% Wall: Why AI Productivity Plateaus
121,000 developers surveyed, 92.6% using AI tools, productivity stuck at 10%. The wall is infrastructure, not intelligence. Three root causes and fixe...
The Fabrication Firewall: When Your Agent Publishes Lies
An autonomous agent published fabricated claims to 8 platforms over 72 hours. Training-phase safety failed at the publication boundary. Here is the fi...
Anatomy of a Claw: 84 Hooks as an Orchestration Layer
What 84 hooks, 43 skills, and 19 agents look like as a production agent orchestration layer. Three patterns that transfer to any agent harness.
Runtime Constitutions for AI Agents: A Governance Framework
Runtime constitutions enforce AI agent governance where training-phase alignment fails. Competence checks, output gates, and four subsystems keep agen...
Compounding Engineering: Why My Codebase Accelerates
Most codebases slow down as they grow. Mine accelerates. 95 hooks, 44 skills, and 14 configs make each feature cheaper than the last.
Design Careers That Survive AI: Judgment Over Tools
After 12 years as VP of Product Design, I watched three paradigm shifts. The skills that survived every one are the same skills AI can't replace.
Building AI Systems: From RAG to Agents
I built a 3,500-line agent system with 86 hooks and consensus validation. Here's what I learned about RAG, fine-tuning, and agent orchestration.
LLM Tokenization: Why Korean Costs 2.8x More Than English
Translating my site into 6 languages revealed that Korean costs 2.8x more tokens than English for identical content. An interactive visualizer shows w...
Vibe Coding vs. Engineering: Where I Draw the Line
I use Claude Code daily with 86 hooks and a full quality gate system. Here's where I vibe code, where I engineer, and why the boundary matters.
AI Theater: Why Only 23% of Companies Create Real Value
McKinsey found 90% of companies claim AI adoption but only 23% scale beyond pilots. I've witnessed three flavors of AI theater and practiced one mysel...