HTML Is the Format AI Agents Want

From the guide: Claude Code Comprehensive Guide

On May 8, 2026, Thariq Shihipar, an engineer working on Claude Code at Anthropic, published a personal site collecting 20 HTML artifacts an agent produced across eight categories of knowledge work, with one argument: when the answer carries spatial structure, interaction, or visual evidence, HTML beats Markdown.12

HTML beats Markdown for agent output because spatial structure, interaction, and visual evidence carry information that prose flattens. The format the agent emits is the control surface the human inspects, not a wrapper around it.

The post landed six days before a May 14 arXiv paper argued that agent search quality lives in the runtime, not the retriever.3 The two make the same point: format and runtime are the substrate, not the wrapper. The component matters only after the surface around it turns model output into something a person can verify.

TL;DR

Thariq Shihipar shipped a companion site with 20 HTML examples spanning code review, design systems, prototyping, exploration, diagrams, research, reports, and editor surfaces.1 The thesis: Markdown linearizes information that arrives spatially. Diffs, call graphs, side-by-side comparisons, and interactive prototypes carry meaning that prose flattens. The 8K-token GPT-4 launch era pushed Markdown as a token-efficient default; current Claude context-window documentation lists 200K-token and 1M-token models, changing the trade-off for many artifact sizes.45 For server-rendered, no-build web stacks like FastAPI plus HTMX, the post supplies the agent-side argument: HTML is the format the model wants to produce and the format the browser already renders. Routing through Markdown adds a translation step that drops fidelity at both ends.6

Key Takeaways

For agent builders:

  • Stop defaulting to Markdown for agent output when the answer is a comparison, diff, flow chart, or navigable structure. Request HTML and let the agent commit to spatial layout.1
  • Treat the model’s output format as part of the tool surface. A single rendered artifact is more inspectable than a transcript that scrolls past.7

For interface designers:

  • HTML is the medium your design system already ships in. Routing through Markdown introduces a translation step that loses fidelity, then a second translation step on render.1
  • The control surface is whatever the agent produces. If the human cannot see what the agent saw, the surface is broken.7

For teams running server-rendered no-build stacks:

  • The bet on HTML over a build pipeline now has agent-side validation. The format the model wants to produce and the format the browser already renders are the same.6
  • A server-rendered site removes a translation layer twice: once at the build step, once at the agent-output step. Both removals compound.

What Thariq Actually Argued

Shihipar works on Claude Code at Anthropic; the post is on his personal site, not Anthropic’s official blog.2 The companion gallery contains 20 self-contained HTML files an agent produced, grouped into eight categories of work where HTML structurally beats Markdown.1

His core claims:

  • “Diffs and call-graphs are spatial information; markdown flattens them.” Why it bites: a side-by-side diff with severity-coded annotations communicates faster than a numbered list of file paths.1
  • “HTML is the medium your design system ships in.” Why it bites: producing component variants in HTML matches the format the design system already targets; Markdown forces a translation step.1
  • “Motion and interaction can’t be described, only felt.” Why it bites: a prototype with real easing curves and clickable flows communicates in seconds what a paragraph of prose cannot.1
  • Markdown’s token-efficiency case was an artifact of small context windows. Why it bites: the 8K-token GPT-4 launch era is over; current Claude context-window docs list much larger 200K-token and 1M-token budgets.45

The second claim is the load-bearing one for anyone building web infrastructure. If the design system ships HTML, the agent should produce HTML. Anything else introduces a lossy round trip.

The 20 Examples Are the Argument

The categories on Shihipar’s gallery cover the work most people now hand to a coding agent:1

  • Code review: annotated diffs with severity-coded inline notes; module maps with highlighted call paths.
  • Exploration: side-by-side code approaches; visual design options laid out for selection rather than read sequentially.
  • Design: living design system pages; component variant sheets that render the variants.
  • Prototyping: animation sandboxes with real easing curves; interactive flows that respond to clicks.
  • Diagrams: inline SVG figures; annotated flow charts; box-and-arrow architecture sketches.
  • Research: collapsible sections; interactive concept explainers with live demonstrations.
  • Reports: formatted timelines and charts where the structure carries the meaning.
  • Editors: custom interfaces with export functionality embedded in the artifact.

Each one is an HTML page the model produced in one shot. The shared pattern: the answer is spatial or interactive, and the rendered artifact preserves what a Markdown response would have to describe in prose.

Why The Default Was Markdown

Markdown won the agent-output default for two reasons that no longer apply.

First, the GPT-3.5 and GPT-4 generation hit context windows in the 4K to 8K range during the period when the chat-output convention solidified.4 Markdown’s terseness was a real forcing function: a token spent on <div class="..."> was a token unavailable for analysis. Claude’s current context-window docs list 200K-token contexts for many models and 1M-token contexts for Opus 4.1 and Sonnet 4.6.5 The token-efficiency argument has decayed for many inspection artifacts.

Second, terminal renderers and chat windows render Markdown trivially, while HTML requires a webview or a browser tab. The surface convenience kept Markdown as the path of least resistance even after the token argument died.
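The decayed token argument is easy to put numbers on. The figures below are illustrative assumptions, not measurements from the post: assume a mid-sized inspection artifact runs about 3,000 tokens as Markdown and that HTML markup roughly doubles it.

```python
# Back-of-envelope arithmetic for the markup-overhead trade-off.
# Both constants are illustrative assumptions, not measurements.
ARTIFACT_MD_TOKENS = 3_000   # a mid-sized review written as Markdown
HTML_OVERHEAD = 2.0          # assume HTML tags roughly double the token count

artifact_html_tokens = int(ARTIFACT_MD_TOKENS * HTML_OVERHEAD)

# Share of each context window the HTML version would consume.
for window in (8_192, 200_000, 1_000_000):
    share = artifact_html_tokens / window
    print(f"{window:>9,}-token window: {share:6.1%} of the budget")
```

Under these assumptions the HTML version consumes most of an 8K window but only a few percent of a 200K one, which is the whole shape of the argument.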

Shihipar’s post carries weight because the author works on Claude Code at Anthropic. The 20 examples are concrete artifacts, not abstract claims.2 Simon Willison’s coverage on the same day reproduced the pattern with a Linux security exploit explainer rendered as an interactive HTML page rather than a Markdown writeup.8

What HTML Preserves That Markdown Drops

Four properties carry the argument:

  • Spatial relationships. Markdown: linearized into headings and lists. HTML: preserved as layout, columns, side-by-side panes.
  • Interaction. Markdown: described in prose (“click here to expand”). HTML: embodied via real DOM events and CSS transitions.
  • Density without scroll. Markdown: long scroll, no jump targets beyond headings. HTML: collapsibles, in-page anchors, floating navigation.
  • Visual hierarchy. Markdown: carried by the reader’s mental model of headings. HTML: carried by the layout the eye actually scans.

Each property maps to a class of agent task that gets harder when you flatten the output to prose. A diff is a spatial comparison; a flow chart is a graph; a design system review is a visual judgment. Forcing those through Markdown asks the reader to reconstruct what the renderer could have shown.
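The spatial-relationships point can be demonstrated with nothing beyond the Python standard library: difflib.HtmlDiff emits a complete, self-contained HTML page that renders two file versions as a side-by-side table, the same kind of one-shot artifact the gallery collects. A minimal sketch:

```python
import difflib

def render_side_by_side(old_lines, new_lines,
                        old_label="before", new_label="after"):
    """Return a self-contained HTML page showing a side-by-side diff."""
    return difflib.HtmlDiff(wrapcolumn=80).make_file(
        old_lines, new_lines, fromdesc=old_label, todesc=new_label)

if __name__ == "__main__":
    before = ["def greet(name):\n", "    print('hi ' + name)\n"]
    after = ["def greet(name: str) -> None:\n", "    print(f'hi {name}')\n"]
    with open("diff.html", "w", encoding="utf-8") as f:
        f.write(render_side_by_side(before, after))
    # Open diff.html in a browser: additions, deletions, and context
    # line up column by column instead of arriving as a numbered list.
```

The same comparison flattened to Markdown would arrive as two fenced blocks the reader has to align mentally.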

The Runtime Connection

Agent search quality lives in the runtime, not the retriever. That post argued the retrieval method matters less than the harness around it: prompt shape, tool surface, transcript formatting, result delivery, retry behavior.3

The HTML argument extends the same frame to output. The model can produce the right answer in any format. The format you ask for is part of the runtime contract. Different formats produce different verifiable surfaces:

  • Markdown delivery: the user reads top to bottom, decides what matters, and reconstructs structure mentally.
  • HTML delivery: the model commits to a structure, the renderer makes that structure scannable, and the user inspects rather than reads.

Same data, different control surface. Agentic design is control surface design. The format the agent emits is part of that surface, not packaging around it.7

What This Means for the No-Build Stack

The FastAPI plus HTMX guide on this site makes the case for server-rendered HTML over a JavaScript build pipeline.6 Shihipar’s post supplies the agent-side argument:

  • The model wants to produce HTML.
  • The browser wants to render HTML.
  • Inserting Markdown or JSX between them adds two lossy translation steps.

A no-build server-rendered site removes the build-time translation. Producing HTML directly from the agent removes the output-time translation. The compounding payoff: the same format runs from the model through the route through the browser without intermediate forms.
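A minimal sketch of that end-to-end path, using only the standard library so it stands alone. The guide’s stack is FastAPI, where the route shape is the same; agent_artifact here is a hypothetical stand-in for a model call that was asked to answer in HTML:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def agent_artifact() -> str:
    # Hypothetical stand-in for a model response requested directly as HTML.
    return ("<!doctype html><title>Diff review</title>"
            "<h1>Diff review</h1><p>2 findings, 1 blocking.</p>")

class ArtifactHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The agent's HTML goes to the browser unchanged:
        # no Markdown conversion, no build step in between.
        body = agent_artifact().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port: int = 8000) -> None:
    HTTPServer(("127.0.0.1", port), ArtifactHandler).serve_forever()
```

In a FastAPI app the equivalent is a route returning an HTMLResponse; either way, no translation layer sits between the model’s output and the renderer.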

This is not a claim that React or Markdown are wrong everywhere. It is a claim that the cost of the translation steps is now visible from both ends, and a stack that avoids both gets simpler in proportion.

Format Matters. Runtime Matters. Both Are the Substrate.

The agent-search paper and the HTML post arrived six days apart and argue the same shape:13

  • The retriever is a component. The runtime is the substrate.
  • The model is a component. The output format is the substrate.

Component thinking keeps offering local upgrades: switch the retriever, add memory, swap the model, refine the prompt. Substrate thinking changes the surface that the user sees and the surface that the agent produces. Both of this week’s findings push the work toward the second frame.

The practical move: when an agent answer carries spatial information, ask for HTML. When the agent runs in a harness, instrument the harness before instrumenting the model. Both moves compound. Neither is a silver bullet on its own.


FAQ

Did Anthropic publish this post?

No. Thariq Shihipar published it on his personal site, thariqs.github.io/html-effectiveness/.1 He works on Claude Code at Anthropic, so the authority signal is strong, but the post is not an Anthropic publication.2

Does the argument apply to every agent task?

No. The post explicitly targets work where spatial structure, interaction, or visual evidence carry meaning. For short factual answers or terminal-bound output, Markdown remains a fine default.1

What about token cost?

The cost case for Markdown was tied to small context windows. Claude’s current context-window docs list 200K-token and 1M-token models, which changes the HTML verbosity trade for the artifact sizes the post showcases.5

Does this break Claude Code’s existing Markdown defaults?

No. The argument is about output you ask the model to produce on demand for inspection, not about the transcript or terminal output. You can still ask for HTML in a single prompt and get a self-contained artifact back.1

How does this connect to the agent-search runtime paper?

Both arguments point at the substrate around the model rather than the model itself. Search quality depends on the harness; output quality depends on the format. The component is necessary; the substrate is what makes the component dependable.3

What should a FastAPI plus HTMX team do with this?

Treat HTML as a first-class output target for any AI feature you ship. The same format runs from the model through the route through the browser, and the no-build stack already optimizes for that path.6


References


  1. Thariq Shihipar, “The Unreasonable Effectiveness of HTML,” personal site, 8 May 2026. Primary source for the 20 HTML artifacts, the eight work categories (exploration, code review, design, prototyping, diagrams, research, reports, editors), the spatial-information argument (“diffs and call-graphs are spatial information; markdown flattens them”), the design-system claim (“HTML is the medium your design system ships in”), the interaction claim (“motion and interaction can’t be described, only felt”), and the position that HTML preserves user agency in agent loops.

  2. Thariq Shihipar, personal site. Source for Shihipar’s statement that he is currently working on Claude Code at Anthropic and for the personal-site provenance of the HTML article. 

  3. Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, “Is Grep All You Need? How Agent Harnesses Reshape Agentic Search,” arXiv:2605.15184v1, submitted 14 May 2026. Source for the runtime-vs-component framing applied to agent search across Chronos, Claude Code, Codex CLI, and Gemini CLI on a 116-question LongMemEval-S subset. 

  4. OpenAI, “GPT-4 Research,” OpenAI, 14 March 2023. Source for GPT-4’s 8,192-token launch context length and limited access to the 32,768-context gpt-4-32k variant. 

  5. Anthropic, “Context windows,” Claude API Docs. Source for the current documentation that Opus 4.1 and Sonnet 4.6 have a 1M-token context window while other Claude models, including Sonnet 4.5 and Sonnet 4, have a 200K-token context window. 

  6. Blake Crosley, “FastAPI + HTMX: The No-Build Full-Stack,” blakecrosley.com guide, updated 15 May 2026. Source for the no-build server-rendered architecture argument, including the claim that HTMX eliminates the JavaScript build pipeline while producing 100/100/100/100 Lighthouse scores. 

  7. Blake Crosley, “Agentic Design Is Control Surface Design,” blakecrosley.com blog, 15 May 2026. Source for the control-surface frame: agentic design as the discipline of making autonomous software visible, interruptible, inspectable, and worthy of trust, with output format as part of that surface. 

  8. Simon Willison, “Using Claude Code: The Unreasonable Effectiveness of HTML,” simonwillison.net, 8 May 2026. Secondary coverage and additional context on Shihipar’s post, including the worked example of a Linux security exploit explainer rendered as a richly interactive HTML page. 
