The Design Engineer's Agent Stack
Design engineers need a different agent stack than pure engineers. The standard agent infrastructure optimizes for correctness: tests pass, types check, linting rules hold. Nobody has built the equivalent for design quality: the infrastructure that ensures agents produce work that looks considered, not merely functional. The six components of the design engineer's agent stack are typography hooks, color system hooks, layout validation, Lighthouse gates, accessibility linting, and visual regression testing. Together, they encode craft into the pipeline.
The gap is visible in every AI-generated interface. The spacing is inconsistent. The font sizes drift outside the scale. Hardcoded hex values bypass the token system. Layout shifts appear on mobile because nobody checked CLS after the agent modified the CSS. The agent passed every test, satisfied every type check, and produced output that a code reviewer would approve, because code reviewers evaluate logic, not visual coherence. The design engineer notices the problems immediately. The agent infrastructure notices nothing, because nobody told it what to look for.
Agent infrastructure for engineers has matured rapidly. Hooks block dangerous git commands. Evidence gates require proof before marking work complete. Quality loops mandate re-reading every line. Engineering quality decomposes into verifiable properties (correctness, performance, security, type safety), and each property maps to a tool that produces binary outcomes.
Design quality decomposes too. Taste is a technical system with four encodable components: constraints, evaluation criteria, pattern recognition, and coherence. The first three map directly to automated infrastructure. Coherence requires human judgment, but those three cover enough ground to prevent the most common design failures an agent produces. Typography violations, color drift, layout instability, performance regressions, and accessibility failures are all detectable by machines. The design engineer’s agent stack detects them.
What Design Engineers Need From Agents
A pure engineer asks: does the code work? A design engineer asks six additional questions, each targeting a different dimension of visual quality.
Visual consistency. Spacing values follow the 8-point grid or the defined spacing scale. Alignment respects the vertical rhythm. Proportional relationships between elements remain stable across viewport sizes. An agent that adds a new card component using margin-top: 13px instead of var(--space-md) has introduced visual noise that no test will catch.
Typography discipline. Every font size in the codebase maps to a step in the type scale. No rogue sizes. No inline overrides that bypass the custom properties. Weight usage follows the established hierarchy: 700 for headings, 400 for body, 300 for metadata. An agent that sets a subtitle to font-size: 19px has invented a step that does not exist in the scale, and the visual hierarchy fractures.
Color system compliance. Every color value references a design token. No hardcoded hex values outside :root. Contrast ratios meet WCAG AA at minimum, AAA where possible. The zero-color system on my site uses four opacity tiers against absolute black, and every tier passes AAA. An agent that introduces color: #cccccc has bypassed the token system and created a contrast relationship that nobody validated.
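The contrast requirement is mechanically checkable. A sketch of the WCAG 2.1 relative-luminance math in Node; an opacity tier composited over absolute black is equivalent to a gray of the same value, which is what makes a black-background tier system easy to validate (the 0.65 tier below is illustrative):

```javascript
// Linearize one sRGB channel per the WCAG 2.1 relative-luminance formula.
function linearize(c) {
  const s = c / 255;
  return s <= 0.03928 ? s / 12.92 : ((s + 0.055) / 1.055) ** 2.4;
}

// Relative luminance of a 6-digit hex color.
function luminance(hex) {
  const n = parseInt(hex.slice(1), 16);
  return 0.2126 * linearize((n >> 16) & 255) +
         0.7152 * linearize((n >> 8) & 255) +
         0.0722 * linearize(n & 255);
}

// WCAG contrast ratio between two colors, in the range 1..21.
function contrast(fg, bg) {
  const [hi, lo] = [luminance(fg), luminance(bg)].sort((a, b) => b - a);
  return (hi + 0.05) / (lo + 0.05);
}

// A white tier at the given opacity, composited over #000000.
function tierHex(opacity) {
  const v = Math.round(255 * opacity).toString(16).padStart(2, '0');
  return `#${v}${v}${v}`;
}

console.log(contrast('#ffffff', '#000000').toFixed(1)); // "21.0": the maximum ratio
console.log(contrast(tierHex(0.65), '#000000') >= 7);   // true: a 65% tier clears AAA for body text
```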
Performance awareness. No Cumulative Layout Shift. First Contentful Paint stays within budget. Total Blocking Time does not regress. The agent must understand that visual changes have performance consequences. A CSS change that triggers layout recalculation on every scroll event is a performance bug, regardless of how the change looks.
Accessibility. Semantic HTML structure. Proper heading hierarchy. ARIA attributes where needed, absent where not. Color contrast verification. Focus indicators. Screen reader compatibility. The Lighthouse audit catches the measurable subset, but the agent must also maintain structural semantics that automated tools miss.
Taste. The hardest to encode. Coherence between elements. Restraint in decoration. Intentional white space rather than accidental emptiness. Taste is the quality that distinguishes a layout that follows every rule but feels wrong from a layout that follows every rule and feels right. Automated checks catch violations. The taste layer catches non-violations that still lack consideration.
Six Components of the Design Engineer’s Stack
Each component maps to a specific failure mode I have observed in agent-generated output. The components are not theoretical. Each one exists because something went wrong, the same origin story behind every hook in my 95-hook infrastructure.
1. Typography Hooks
A typography hook validates that every font-size declaration in a commit references a CSS custom property from the type scale. The hook scans changed files for raw pixel or rem values that do not map to a defined step.
```bash
#!/bin/bash
# PreToolUse hook: read the tool-call JSON from stdin and pull the text the
# agent is about to write (Write/Edit) or run (Bash).
INPUT=$(cat)
CONTENT=$(echo "$INPUT" | jq -r '.tool_input.content // .tool_input.new_string // .tool_input.command // empty')

# Catch font-size declarations that bypass the type scale
if echo "$CONTENT" | grep -qE 'font-size:\s*[0-9.]+(px|rem|em)'; then
  cat << EOF
{"decision": "block", "reason": "Font size must use a --font-size-* token"}
EOF
fi
```
The hook is blunt. A more refined version parses the value and checks whether the pixel equivalent matches any step in the 13-step scale. The point is not sophistication. The point is that the agent cannot introduce a rogue font size without the infrastructure flagging it. Bringhurst’s principle of harmonious type relationships holds not because the agent understands harmony, but because the hook enforces the scale that embodies it.1
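The refined version described above reduces to a membership test. A sketch in Node, with hypothetical step values standing in for the real `--font-size-*` pixel equivalents:

```javascript
// Hypothetical 13-step scale in pixels; substitute the real token values.
const SCALE_PX = [12, 13, 14, 16, 18, 20, 24, 28, 32, 40, 48, 56, 64];

// Normalize a CSS length to pixels, assuming a 16px root size for rem/em.
function toPx(value) {
  const m = value.trim().match(/^([\d.]+)(px|rem|em)$/);
  return m ? parseFloat(m[1]) * (m[2] === 'px' ? 1 : 16) : null;
}

// A font size is valid only if its pixel equivalent is a scale step.
function onScale(value) {
  const px = toPx(value);
  return px !== null && SCALE_PX.includes(px);
}

console.log(onScale('1rem'));  // true: 16px is a step
console.log(onScale('19px'));  // false: a rogue size that fractures the hierarchy
```

The same function runs in a hook or in CI; the scale lives in one array instead of being re-derived per check.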
Font weight deserves separate validation. My system uses three weights: 700, 400, and 300. An agent that sets a card title to font-weight: 600 has introduced a weight that contradicts the established hierarchy. A typography hook catches the deviation before it reaches production.
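The weight check is set membership against the three weights named above. A minimal sketch:

```javascript
// The three weights the system allows, per the established hierarchy.
const ALLOWED_WEIGHTS = new Set([300, 400, 700]);

// Return every numeric font-weight in a CSS string that is off-hierarchy.
function weightViolations(css) {
  return [...css.matchAll(/font-weight:\s*(\d+)/g)]
    .map((m) => Number(m[1]))
    .filter((w) => !ALLOWED_WEIGHTS.has(w));
}

console.log(weightViolations('.card-title { font-weight: 600; }')); // [600]
console.log(weightViolations('h1 { font-weight: 700; }'));          // []
```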
2. Color System Hooks
Color drift is the most common design failure in agent-generated CSS. The agent knows that text should be white on a dark background. The agent does not know that #ffffff should be var(--color-text-primary), or that secondary text at 65% opacity is var(--color-text-secondary) and not rgba(255,255,255,0.60).
The color hook scans for hardcoded color values outside :root and the design token definitions:
```bash
# Block hardcoded colors outside token definitions
if echo "$DIFF" | grep -vE '^\+.*:root' | \
   grep -qE '#[0-9a-fA-F]{3,8}|rgba?\('; then
  cat << EOF
{"decision": "block", "reason": "Use color tokens, not hardcoded values"}
EOF
fi
```
The zero-color design system, the same brutalist constraint that drives the entire site’s visual identity, makes enforcement straightforward because the palette has exactly ten tokens. Any color value that does not match one of those tokens is wrong by definition. A broader palette would require more nuanced validation. The constraint-based approach simplifies the hook because the constraint simplifies the design.
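With an exact-match palette, the validation is set membership. A sketch with hypothetical token values (a real implementation would read them out of `:root`; only five stand-ins are shown):

```javascript
// Hypothetical token values standing in for the real ten-token palette.
const TOKEN_VALUES = new Set([
  '#000000',
  'rgba(255,255,255,1.00)',
  'rgba(255,255,255,0.65)',
  'rgba(255,255,255,0.40)',
  'rgba(255,255,255,0.20)',
]);

// Flag any literal color that is not one of the defined token values.
function colorViolations(css) {
  const literals = css.match(/#[0-9a-fA-F]{3,8}|rgba?\([^)]*\)/g) || [];
  return literals.filter(
    (v) => !TOKEN_VALUES.has(v.toLowerCase().replace(/\s+/g, ''))
  );
}

console.log(colorViolations('p { color: #cccccc; }')); // ['#cccccc']
```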
3. Layout Validation
Layout validation catches two categories of failure: Cumulative Layout Shift introduced by CSS changes, and responsive breakpoint regressions.
CLS detection requires measuring the page before and after the change. A pre-commit hook cannot run a browser, but a CI pipeline can. The infrastructure runs Lighthouse in headless Chrome against the staging deployment, compares CLS values to the previous build, and blocks the merge if the delta exceeds 0.01. Google considers CLS below 0.1 “good.” My threshold is 10x stricter because I have seen what 0.493 CLS looks like and will not regress.
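The delta comparison can be sketched against Lighthouse's standard JSON report, where CLS lives at `audits['cumulative-layout-shift'].numericValue` (report loading and fetching are elided):

```javascript
// Pull CLS from a parsed Lighthouse JSON report.
function cls(report) {
  return report.audits['cumulative-layout-shift'].numericValue;
}

// Gate: fail when CLS regresses by more than the budget (0.01 per the text)
// versus the previous build.
function clsGate(prevReport, currReport, budget = 0.01) {
  return cls(currReport) - cls(prevReport) <= budget;
}

const prev = { audits: { 'cumulative-layout-shift': { numericValue: 0.0 } } };
const curr = { audits: { 'cumulative-layout-shift': { numericValue: 0.02 } } };
console.log(clsGate(prev, curr)); // false: a 0.02 regression exceeds the 0.01 budget
```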
Responsive validation requires checking the layout at defined breakpoints. A visual regression tool captures screenshots at 375px (mobile), 768px (tablet), and 1440px (desktop), then compares them to baseline images. A five-pixel shift in the header at 375px that looks fine at 1440px surfaces in the mobile comparison. The agent that modified a max-width property without testing responsive behavior gets caught by the infrastructure that tests responsive behavior automatically.
4. Lighthouse Gates
A Lighthouse gate runs a full audit before every merge to the main branch. The gate enforces four thresholds:
| Category | Threshold |
|---|---|
| Performance | 100 |
| Accessibility | 100 |
| Best Practices | 100 |
| SEO | 100 |
The thresholds are not aspirational. They reflect the current production scores. Any commit that lowers any score below 100 gets blocked. The gate runs in CI using lighthouse-ci, and the results feed back into the pull request as a status check.
```yaml
# lighthouse-ci assertion configuration (lighthouserc.yml)
ci:
  assert:
    assertions:
      "categories:performance": ["error", { "minScore": 1 }]
      "categories:accessibility": ["error", { "minScore": 1 }]
      "categories:best-practices": ["error", { "minScore": 1 }]
      "categories:seo": ["error", { "minScore": 1 }]
      "cumulative-layout-shift": ["error", { "maxNumericValue": 0.01 }]
```
The Lighthouse gate catches performance regressions that no human reviewer would notice. An agent that adds an unoptimized image, a render-blocking script, or a CSS file that triggers a flash of unstyled content fails the gate before the change reaches production. The gate does not understand why the change caused a regression. The gate does not need to understand. It blocks the regression, and the agent receives the failure reason in its context for the next attempt.
5. Accessibility Linting
Accessibility validation splits into two layers: static analysis and runtime evaluation.
Static analysis runs axe-core against the rendered HTML. The WCAG 2.1 AA ruleset catches missing alt text, improper heading hierarchy, insufficient color contrast, missing form labels, and ARIA misuse. The check runs in the same headless Chrome instance as the Lighthouse gate and adds negligible overhead.
```javascript
// axe-core integration in CI. Runs inside an async Playwright test,
// where `page` is the Playwright Page under audit.
const { AxeBuilder } = require('@axe-core/playwright');

const results = await new AxeBuilder({ page })
  .withTags(['wcag2a', 'wcag2aa', 'wcag21aa'])
  .analyze();

if (results.violations.length > 0) {
  process.exit(1); // Block the merge
}
```
The runtime layer catches issues that static analysis misses: focus management after HTMX swaps, keyboard navigation through dynamic content, screen reader announcements for state changes. These checks require scripted interaction, not just DOM inspection. The no-build approach keeps the page simple enough that the accessibility surface area remains manageable.
Accessibility linting is the component most engineers already understand. The design engineer’s addition is not the tooling but the threshold: zero violations, not “acceptable” violations. The same philosophy drives 100/100/100/100 Lighthouse scores: perfection as the baseline, not the aspiration.
6. Visual Regression Testing
Visual regression testing compares screenshots of the current build against approved baselines. The comparison uses perceptual diffing algorithms that detect changes a human would notice while ignoring changes a human would not (sub-pixel rendering differences, anti-aliasing variations).
Tools like Percy, Chromatic, and BackstopJS automate the comparison. The pipeline captures screenshots at each defined breakpoint, runs perceptual diffing against the baseline, and flags any page where the diff exceeds the threshold. A 0.1% pixel difference in a footer is noise. A 2% shift in the hero section is a regression.
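The flag-or-ignore decision around the diff is simple arithmetic. A sketch: a tool such as pixelmatch reports a changed-pixel count, and the pipeline flags the page only when the changed fraction exceeds the threshold (the 0.5% threshold below is illustrative, sitting between the footer-noise and hero-regression examples above):

```javascript
// Decide whether a perceptual diff is noise or a regression.
function diffVerdict(changedPixels, width, height, threshold = 0.005) {
  const fraction = changedPixels / (width * height);
  return { fraction, flagged: fraction > threshold };
}

const area = 1440 * 3000; // a full-page desktop capture (illustrative size)
console.log(diffVerdict(area * 0.001, 1440, 3000).flagged); // false: footer noise
console.log(diffVerdict(area * 0.02, 1440, 3000).flagged);  // true: hero regression
```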
Visual regression is the closest automated approximation of “does the page look right?” Perceptual diffing cannot evaluate whether a layout change is an improvement or a degradation, only that a change occurred. The human reviews flagged diffs and approves or rejects them. The automation’s value is coverage: testing every page at every breakpoint on every commit, a task no human performs manually.
How the Stack Maps to My Infrastructure
The six components connect to decisions already documented across the design engineering content on this site.
The typography hooks enforce the 13-step type scale, a content-driven progression where the scale exists as CSS custom properties and the hooks ensure those properties are the only font sizes in the codebase. The color system hooks enforce the zero-color design system: ten tokens, four opacity tiers, no brand colors, non-optional. The Lighthouse gates maintain the 100/100/100/100 score and prevent any commit from undoing the CSS extraction and render-blocking elimination that achieved those numbers.
The no-build approach simplifies the entire stack. No source maps to reconcile. No tree-shaking ambiguity. No transpilation layer between authored and shipped CSS. What the agent writes is what ships, which means what the hooks validate is what the user sees.
The evidence gate applies to design reviews the same way it applies to engineering reviews. “The typography looks right” is not evidence. “Every font-size declaration in the diff maps to a --font-size-* token, verified by the typography hook” is evidence. The design system provides the tokens that the hooks enforce. Without tokens, there is nothing to validate against. Without hooks, the tokens are suggestions. Nathan Curtis identified the dynamic: a system without governance degrades into documentation that nobody reads.2
The Taste Layer
The six components catch violations. Typography hooks catch wrong font sizes. Color hooks catch hardcoded values. Layout validation catches CLS. Lighthouse gates catch performance regressions. Accessibility linting catches WCAG failures. Visual regression catches unintended changes.
None of these catch the output that follows every rule but still feels wrong.
A card component with correct font sizes, proper tokens, zero CLS, perfect Lighthouse scores, full WCAG compliance, and no visual regression — but with spacing that makes the title crowd the image, a line length that strains readability, and a hover state that feels abrupt rather than considered. Every automated check passes. The card is correct. The card is not good.
Taste operates above the rule layer. Constraints catch what violates the rules. Evaluation criteria catch what fails the metrics. Pattern recognition catches what the second look reveals. Coherence catches what only the whole-system view exposes. The six automated components handle constraints and evaluation criteria. Pattern recognition and coherence require the quality loop: the mandated second (and third, and fourth) pass through the work, each time checking not whether the rules hold but whether the result deserves to ship.
The quality loop is where the design engineer earns the “engineer” half of the title. An engineer who ships code that passes tests is doing the minimum. A design engineer who ships interfaces that pass automated checks and survive the quality loop maintains a standard that machines cannot yet evaluate. The pride check asks five questions, and the last (“did I leave it better?”) has no automated equivalent. Neither does the Steve criterion: would Blake sign his name to this?
The Compound Effect
Each component prevents a specific category of design failure. Together, the components produce a compound effect that exceeds the sum of individual checks.
An agent session without the stack produces output that drifts. Font sizes accumulate outside the scale. Color values hardcode instead of tokenize. Performance regresses by small increments that no single commit triggers but that accumulate across weeks. The drift is invisible in any individual diff and obvious in the aggregate.
An agent session with the stack cannot drift. The hooks block every deviation from the type scale. The color system rejects every hardcoded value. The Lighthouse gate catches every performance regression. The agent inherits the design engineer’s standards not because the agent understands those standards but because the infrastructure enforces them. The agent does not need taste. The agent needs constraints, and the constraints embody taste.
Jony Ive described Apple’s design process as “relentless refinement”: quality through iteration on a fixed set of principles, not innovation through novelty.3 The design engineer’s agent stack operationalizes the same idea. The principles are fixed in tokens, scales, and thresholds. The refinement is relentless because the automation runs on every commit.
The design engineer who encodes standards into the agent stack does more than maintain quality during autonomous generation. That engineer scales quality. Every session, every agent, every commit inherits the same constraints. The human still evaluates coherence, still runs the quality loop, still asks whether the output deserves to ship. But the human no longer catches font size violations, hardcoded colors, or CLS regressions. The stack caught those first. The human’s attention goes entirely to the questions machines cannot answer.
FAQ
Do I need all six components to start?
No. Start with the component that addresses your most common failure mode. Typography hooks and color system hooks provide the highest return because they catch the most frequent agent-generated design defects. Add Lighthouse gates and accessibility linting next. Visual regression and layout validation are the most infrastructure-heavy components and belong later in the adoption sequence.
Does the stack work with build tools?
The stack works with any frontend architecture. The no-build approach simplifies implementation because there is no transformation layer between authored and shipped code. With build tools, hooks must validate the source files while Lighthouse and visual regression validate the built output. The components remain the same. The integration points change.
Can agents learn taste without the stack?
Current language models do not have taste. Models produce statistically likely output, and statistically likely output trends toward the median of the training data. The stack does not teach agents taste. The stack constrains agents so that the pipeline rejects tasteless output before it ships. The distinction matters: encoding taste as infrastructure proves more reliable than hoping the model internalizes it from the prompt.
How do visual regression tests handle intentional changes?
Intentional changes produce expected visual diffs. The workflow flags the diffs, and the human reviews and approves them, updating the baseline. The value of visual regression is not preventing change but surfacing unintended change. An agent that modifies a button color also shifts the card layout by three pixels. The color change is intentional, the layout shift is not, and the visual regression test catches the side effect.
Sources
1. Robert Bringhurst, The Elements of Typographic Style, Hartley & Marks, 4th edition, 2012. Bringhurst establishes that typographic harmony follows mathematical ratios derived from musical intervals.
2. Nathan Curtis, "Governance and Evolution," Medium, 2019. Curtis documents the governance failure mode in unmanaged design systems, where tokens and guidelines exist but compliance degrades without enforcement mechanisms.
3. Ian Parker, "The Shape of Things to Come," The New Yorker, February 23, 2015. Ive describes Apple's design process as iterative refinement within fixed constraints rather than open-ended exploration.