Taste Is Infrastructure

An agent can build a functional interface in minutes. Correct HTML, valid CSS, working interactivity. The output passes every automated check. It also looks like it was designed by committee in 2019.

The problem is not capability. The agent can generate any layout, any color scheme, any typography pairing you describe. The problem is that you have to describe it. The agent has no opinion about what looks good. It reaches for defaults: system fonts, equal spacing, safe colors, centered everything. The output works. It does not feel considered.

This is the taste gap. The distance between functional and good. Between correct and considered. Between an interface that passes validation and one that a designer would respect.

Taste does not scale through prompting. You cannot describe every aesthetic judgment in every prompt. “Make it look professional” produces generic output. “Use 1.5x body font size for section spacing, pair Inter with Source Serif Pro, use a 4:1 contrast ratio on secondary text” produces better output but requires you to make every decision yourself, which defeats the purpose of having an agent.

Taste scales when it becomes infrastructure.

What Taste Actually Is

Taste is pattern recognition applied to aesthetic quality. A designer who has reviewed thousands of layouts develops intuition about what works: which font pairings create tension, which spacing ratios feel balanced, which color distributions guide the eye. The intuition is not mystical. It is statistical. The designer has seen enough good work to recognize the patterns that make it good.

The patterns are learnable because they are structural. Typography conventions, spacing ratios, color relationships, hierarchy depth, component composition rules. These are not arbitrary preferences. They are solutions to perceptual problems that recur across every design domain. A heading needs to be visually distinct from body text. A card needs whitespace to breathe. A color palette needs a dominant, a secondary, and an accent. The specific values vary. The structural relationships are consistent.

If the patterns are structural and consistent, they can be extracted from professional work, encoded in a queryable format, and applied by agents at generation time.

The LICA Insight

A research team published a dataset of 1,550,244 multi-layer graphic design compositions as structured JSON layer trees.1 Not images. Not pixels. JSON objects where every design decision is a queryable field: font family, font size, line height, letter spacing, color, position, opacity, z-order, parent-child relationships.

The dataset includes 971,850 unique templates across 20 design categories, 2,700 distinct font families, and 27,261 animated layouts with per-component keyframe data. Each composition is a tree of typed components (text, image, vector, group) with rich per-element metadata.

The insight is that taste becomes computable when design decisions are represented as data rather than pixels. You do not need computer vision to analyze what font size professionals use for headings in social media layouts. You query a database.

Three Layers

Encoding taste into infrastructure requires three layers, each building on the last.

Layer 1: Design knowledge graph. Parse professional layout trees into a queryable database. Not flat statistics. Relationships: when designers use a 48px bold sans-serif heading, what do they pair it with for body text? What spacing ratios appear between parent and child components? How do poster layouts differ structurally from social media layouts? The output is a database you can query: “given a sans-serif heading at 32px, what body text properties appear in professional layouts?”
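A minimal sketch of this kind of extraction, walking parsed layer trees and tallying heading-to-body font pairings. The field names (`type`, `fontFamily`, `fontSize`, `children`) and the 28px heading threshold are assumptions for illustration, not the actual LICA schema:

```python
# Sketch: extract heading -> body font pairings from JSON layer trees.
# Field names and the heading-size threshold are assumptions, not the
# actual LICA schema.
from collections import Counter

def walk(node):
    """Yield every node in a layer tree, depth-first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)

def heading_body_pairs(layout, heading_min_size=28):
    """Collect (heading font, body font) pairings from one layout."""
    texts = [n for n in walk(layout) if n.get("type") == "text"]
    headings = [t for t in texts if t["fontSize"] >= heading_min_size]
    bodies = [t for t in texts if t["fontSize"] < heading_min_size]
    return [(h["fontFamily"], b["fontFamily"]) for h in headings for b in bodies]

# Toy layout standing in for one parsed composition.
layout = {
    "type": "group",
    "children": [
        {"type": "text", "fontFamily": "Inter", "fontSize": 48},
        {"type": "text", "fontFamily": "Source Serif Pro", "fontSize": 16},
    ],
}

pairings = Counter(heading_body_pairs(layout))
print(pairings.most_common(1))
# → [(('Inter', 'Source Serif Pro'), 1)]
```

Run this over every layout in the corpus and the counter becomes the answer to queries like the one above: the most common body fonts conditioned on a given heading.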

Layer 2: Layout embedder. A small model trained on Apple Silicon that takes a layout’s component tree and produces a vector embedding. Training uses contrastive learning: real professional layouts as positives, perturbed layouts (randomized spacing, swapped fonts, broken hierarchy) as negatives. The model learns an embedding space where professional layouts cluster and broken layouts are distant. When an agent produces a layout, embed it, find the 5 nearest professional layouts. If it is far from all of them, something is wrong.
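The negative-generation step can be sketched concretely. This toy version assumes layouts are flat lists of components with `fontFamily` and `marginTop` fields; a real implementation would perturb the full tree:

```python
# Sketch: generate a perturbed negative from a professional layout
# for contrastive training. Component fields are assumptions.
import random

def perturb(layout, rng):
    """Break a layout: randomize spacing, swap fonts between components."""
    broken = [dict(c) for c in layout]           # copy, leave original intact
    for comp in broken:
        comp["marginTop"] = rng.randint(0, 120)  # randomized spacing
    fonts = [c["fontFamily"] for c in broken]
    rng.shuffle(fonts)                           # swapped fonts
    for comp, font in zip(broken, fonts):
        comp["fontFamily"] = font
    return broken

rng = random.Random(0)
positive = [
    {"fontFamily": "Inter", "fontSize": 48, "marginTop": 0},
    {"fontFamily": "Source Serif Pro", "fontSize": 16, "marginTop": 24},
]
negative = perturb(positive, rng)  # one (positive, negative) training pair
```

The contrastive loss then pulls embeddings of real layouts together and pushes perturbed ones away, which is what makes nearest-neighbor distance a usable quality signal.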

Layer 3: Taste referee. Claude reads the embedder output, the nearest professional layouts, and the relevant knowledge graph patterns. It produces a judgment grounded in professional precedent: not “this spacing feels off” but “professional layouts in this category use 1.5x the body font size for section spacing, yours uses 0.8x.”
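The quantitative grounding behind a judgment like that one is simple to sketch. Here the 1.5x norm is taken from the example in the text and would in practice come from the knowledge graph; the function name and tolerance are assumptions:

```python
# Sketch: ground a spacing correction in a retrieved professional norm.
# norm_ratio would come from the knowledge graph; 1.5 is the example
# from the text, and the 0.1 tolerance is an assumption.
def spacing_judgment(body_font_px, section_gap_px, norm_ratio=1.5):
    """Return a correction grounded in professional precedent, or None."""
    ratio = section_gap_px / body_font_px
    if abs(ratio - norm_ratio) < 0.1:
        return None  # within tolerance of the professional pattern
    return (
        f"Professional layouts in this category use {norm_ratio}x the body "
        f"font size for section spacing; yours uses {ratio:.1f}x."
    )

print(spacing_judgment(body_font_px=16, section_gap_px=13))
```

Claude's role is to assemble many such grounded measurements, plus the nearest professional layouts, into a coherent critique rather than a bare list of numbers.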

The serving pattern is identical to semantic search: embed the query, find nearest neighbors in a vector index, return structured context. The infrastructure is the same. The domain is different.
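The lookup itself fits in a few lines. The vectors below are toy placeholders for the embedder's output; a production index would use a vector store rather than a dict:

```python
# Sketch of the serving pattern: rank professional layouts by cosine
# similarity to a query embedding. Vectors are toy placeholders.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(query_vec, index, k=5):
    """Return the k professional layouts closest to the query embedding."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

index = {
    "poster-001": [0.9, 0.1, 0.0],
    "social-042": [0.1, 0.9, 0.1],
    "doc-310":    [0.0, 0.2, 0.9],
}
print(nearest([0.8, 0.2, 0.1], index, k=2))
# → ['poster-001', 'social-042']
```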

Why Statistics Are Not Enough

A frequency distribution of font sizes across 1.5 million layouts tells you what is common. It does not tell you what is good. The most popular font pairing is not the best font pairing. It is the safest one. An agent with frequency statistics produces median work. Median is not taste.

Taste requires understanding why a design choice works in context, not just how often it appears across all contexts. A bold sans-serif heading at 48px works in a poster layout because the viewing distance is large and the hierarchy needs to be immediately legible. The same heading in a mobile card layout would be overwhelming. Frequency statistics cannot capture this contextual reasoning. A contrastive model that learns the difference between professional layouts and perturbed versions in specific categories can.

The contrastive approach also handles negative space. A professional layout often works because of what it leaves out: the empty space that creates rhythm, the elements that are not there. Frequency statistics count what is present. A contrastive model learns from the full structure, including the structure of absence.

Taste as Competitive Advantage

Every team using AI agents to generate interfaces faces the same taste gap. The default output of every agent is the same generic aesthetic. The teams that close the gap first will produce visibly better work at the same speed.

The gap will not close through better prompting. It will not close through better models. Models improve at reasoning, code generation, and instruction following. They do not improve at aesthetic judgment because aesthetic judgment is not in the training objective. A model trained to predict the next token will converge on the most likely output, which is the median aesthetic.

The gap closes through infrastructure: datasets that encode professional design decisions, models that learn structural quality, and referee systems that ground agent output in professional precedent. The team that builds this infrastructure compounds its advantage with every project, because the knowledge graph grows, the embedder improves, and the referee becomes more precise.

This is compound context applied to design. Each professional layout analyzed deposits a pattern. Each pattern makes the next generation better. The portfolio of design knowledge grows slowly at first and then becomes the reason your output looks different from everyone else’s.


FAQ

Is this just a design system?

A design system defines tokens and components. Taste infrastructure evaluates whether a composition using those tokens and components is aesthetically successful. A design system tells you which colors to use. Taste infrastructure tells you whether the way you combined them works.

Do you need 1.5 million layouts?

No. The full LICA dataset is 1.5 million compositions, but the publicly available subset is 1,183 layouts with full JSON trees. That is enough to extract meaningful patterns for a focused domain (social media, presentations, documents). The knowledge graph grows with every layout you analyze.

Can a small model really learn taste?

The model does not need to generate designs. It needs to distinguish professional layouts from perturbed ones. This is a classification/embedding task, not a generation task. A 2-5M parameter model trained with contrastive learning on structured component trees is sufficient for nearest-neighbor quality assessment.

How does the referee work?

Claude reads three inputs: the agent’s generated layout (as structured data, not a screenshot), the 5 nearest professional layouts from the embedder, and relevant patterns from the knowledge graph. It produces specific corrections grounded in professional precedent. The referee reasons about the gap between the generated layout and the professional reference set.


Sources


  1. Elad Hirsch, Shubham Yadav, Mohit Garg, Purvanshi Mehta, “LICA: Layered Image Composition Annotations for Graphic Design Research,” arXiv:2603.16098, March 2026. 1,550,244 compositions, 971,850 templates, 27,261 animated layouts. Dataset: github.com/purvanshi-lica/lica-dataset
