Agents.txt Is Not Access Control

DreamHost now documents that Web Hosting plans automatically include default robots.txt and agents.txt files when a site has not provided custom versions.1

That small hosting detail points at a larger shift. Websites now speak to at least three audiences at once: search crawlers, AI crawlers, and inference-time assistants looking for clean context. File names make the shift look tidy. robots.txt says what automated clients may crawl. llms.txt gives LLMs a curated map. agents.txt gestures at agent-facing policy. None of those files should make an operator feel protected.

Agents.txt is not access control. Treat crawler files as public policy hints and discovery aids. Real control still comes from server-side authorization, bot identity verification, rate limits, logs, cache behavior, and evidence that the crawlers you care about actually saw the current files.

TL;DR

The Robots Exclusion Protocol standard says crawler rules are “not a form of access authorization.”2 Google also warns that a URL disallowed in robots.txt can still appear in Search if other pages link to it.3 DreamHost’s own bot-control article says robots files act as suggestions to compliant search engines and that bad bots may ignore the file or use misleading user agents.1

AI crawlers add more policy dimensions. OpenAI separates OAI-SearchBot for ChatGPT search from GPTBot for training-related crawling and says ChatGPT-User represents user-triggered actions where robots.txt may not apply.4 Google says Google-Extended has no separate HTTP user-agent string and works as a robots.txt product token for training and grounding preferences, not for Google Search inclusion.5 The crawler-control file now needs purpose-level policy, not a single allow-or-block switch.

Use agents.txt if your host, platform, or agent ecosystem expects it. Use llms.txt if you want inference-time tools to understand your best pages. Keep robots.txt accurate because major crawlers still use it. Then verify requests at the server edge and read the logs. A text file can express intent. It cannot stop an untrusted client.

Key Takeaways

For site owners:

  • Publish robots.txt for crawl policy, llms.txt for AI-readable context, and agents.txt only as an agent-facing hint.
  • Do not put private routes, secret file names, internal prompts, or sensitive paths in any public crawler file.
  • Check logs after changes. A policy file matters only if the right crawler fetches it and changes behavior.

For SEO and AIO teams:

  • Separate search visibility from training permissions and user-triggered fetches.
  • Make the allow list explicit for bots you want, such as search crawlers and AI answer surfaces.
  • Pair crawler files with sitemap, canonical, schema, and llms.txt verification.

For security teams:

  • Treat user-agent strings as claims, not identity.
  • Verify crawlers with reverse DNS or published IP ranges where the operator supports it.
  • Enforce sensitive-resource access with authentication, WAF rules, application policy, and rate limits, not crawler etiquette.

What Changed With Agents.txt?

robots.txt has existed for decades. The RFC defines a robots.txt file that service owners make available so crawlers can decide which URIs they may access.2 The basic file shape looks familiar:

User-agent: *
Disallow: /private-draft/
Sitemap: https://example.com/sitemap.xml
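
To see how a compliant client reads that file, here is a minimal sketch using Python's standard-library urllib.robotparser. The bot name ExampleBot and the URLs are placeholders, not real crawlers or routes. The parser models what a polite client decides for itself; nothing in it stops a client that never asks.

```python
from urllib.robotparser import RobotFileParser

# The same three lines shown above, held in memory instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private-draft/
Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text the way a compliant crawler would."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "ExampleBot" is a hypothetical crawler name used only for illustration.
print(rules.can_fetch("ExampleBot", "https://example.com/private-draft/notes"))
print(rules.can_fetch("ExampleBot", "https://example.com/blog/"))
print(rules.site_maps())
```

The first check comes back as a refusal, the second as permission, and the sitemap URL is returned as listed. One caveat: as far as I know, this standard-library parser matches paths by prefix and does not interpret * or $ wildcards, so treat it as a teaching aid, not as a model of any specific crawler.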

agents.txt enters a different moment. The web no longer receives only search-engine crawlers. It receives training crawlers, answer-engine crawlers, ad-safety crawlers, browser assistant fetches, user-triggered LLM fetches, archive crawlers, SEO tools, and spam bots that borrow names from legitimate crawlers.

DreamHost’s documentation matters because it moves agents.txt from a niche idea into default hosting behavior for at least one mainstream host. The article says DreamHost automatically includes default robots.txt and agents.txt files for Web Hosting plans and lets site owners override either file by placing a custom file at the site root.1 That does not make agents.txt a standard with enforcement semantics. It does make the filename more likely to appear in the wild.

The safe reading is narrow:

| File | Best role | Bad assumption |
| --- | --- | --- |
| robots.txt | Crawl preference for compliant crawlers. | “Blocked means private.” |
| llms.txt | Curated LLM-readable map for inference-time use. | “Listed means ranked or cited.” |
| agents.txt | Agent-facing policy hint where a platform looks for it. | “A bot must obey it.” |
| Sitemap | Complete URL discovery for indexable public pages. | “Submitted means indexed.” |
| Server logs | Evidence of what actually happened. | “No visible referrer means no crawler used the page.” |

The file names should not compete. They should form a policy packet: what crawlers may request, what AI systems should read, what agents should know, and what the server actually observed.

Robots.txt Still Matters, But It Does Not Protect

Crawler files fail when teams use them as security boundaries.

The RFC makes the boundary explicit. The protocol asks automated clients to honor rules when accessing URIs; it does not authorize access.2 Google says the same operationally: if another page links to a disallowed URL, Google may still find and index the URL address and other public link information even without crawling the blocked page content.3 DreamHost warns that robots rules act as suggestions to compliant search engines and that bad bots may ignore the file or use false user agents.1

Those facts lead to a simple rule: never put anything in robots.txt, agents.txt, or llms.txt that would damage you if copied into a search result, scraped into a dataset, or displayed by an LLM.

Bad crawler files expose more than they protect:

User-agent: *
Disallow: /internal-product-roadmap/
Disallow: /legal-private/
Disallow: /prompt-drafts/
Disallow: /customers/acme-renewal-risk/

The file above tells every visitor where sensitive material might live. A compliant crawler may avoid those paths. An attacker receives a directory map.

A safer file states public crawl policy without naming sensitive inventory:

User-agent: *
Allow: /
Disallow: /*.md$
Sitemap: https://example.com/sitemap.xml

That version expresses a real preference without revealing private structure. If /prompt-drafts/ exists, the server should protect it with authentication and noindex headers where appropriate. The crawler file should not carry the burden.
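
A pre-publish check can catch the first kind of file before it ships. The sketch below is one rough way to do it: it scans Allow and Disallow lines for words that suggest sensitive inventory. The hint list is my own assumption, not a standard, and a real site would tune it.

```python
import re

# Hypothetical hint words; adjust for your own naming habits.
SENSITIVE_HINTS = ("internal", "private", "secret", "prompt", "customer", "admin", "legal")

RULE = re.compile(r"(?i)^(allow|disallow)\s*:\s*(\S+)")


def flag_revealing_rules(robots_text: str) -> list[str]:
    """Return rule paths that contain a sensitive-looking word."""
    flagged = []
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        match = RULE.match(line)
        if not match:
            continue
        path = match.group(2)
        if any(hint in path.lower() for hint in SENSITIVE_HINTS):
            flagged.append(path)
    return flagged


BAD_FILE = """\
User-agent: *
Disallow: /internal-product-roadmap/
Disallow: /legal-private/
Disallow: /prompt-drafts/
Disallow: /customers/acme-renewal-risk/
"""

SAFER_FILE = """\
User-agent: *
Allow: /
Disallow: /*.md$
Sitemap: https://example.com/sitemap.xml
"""

print(flag_revealing_rules(BAD_FILE))    # all four paths are flagged
print(flag_revealing_rules(SAFER_FILE))  # nothing is flagged
```

A keyword scan will miss creative path names and will flag some harmless ones. It is a tripwire for review, not proof that a file is safe.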

AI Crawlers Need Purpose-Level Policy

Search crawler policy used to feel binary: allow Googlebot, block noisy SEO tools, keep private pages private with server controls.

AI crawler policy adds purpose. A site owner may want a page to appear in ChatGPT search results while opting that same page out of model-training use. OpenAI’s crawler documentation makes that split explicit. It says OAI-SearchBot supports ChatGPT search features, while GPTBot crawls content that may be used for training OpenAI’s generative AI foundation models.4 OpenAI also says those settings are independent: a webmaster can allow OAI-SearchBot while disallowing GPTBot.4

Google draws a similar boundary in a different way. The Google crawler documentation says Google-Extended has no separate HTTP request user-agent string; existing Google user agents perform the crawl, and Google-Extended acts as a robots.txt product token.5 Google says the token controls whether crawled site content may support future Gemini model training and grounding, and it does not affect Google Search inclusion or ranking.5

Those two examples show why a flat block list misses the point. The real policy matrix asks:

| Purpose | Example signal | Operator question |
| --- | --- | --- |
| Search discovery | Googlebot, Bingbot, OAI-SearchBot | Do I want the page surfaced in search or answer results? |
| Training preference | GPTBot, Google-Extended | Do I want the page used for model-training or model-grounding workflows? |
| User-triggered fetch | ChatGPT-User, browser assistants | Did a human ask the assistant to retrieve the page? |
| Site understanding | llms.txt, schema, RSS | Did I give AI systems a clean explanation of the public content? |
| Abuse traffic | Spoofed user agents, scraper tools | Did the request prove identity and behave within policy? |

The policy file should match the purpose. Do not disallow every AI user agent and then wonder why AI search surfaces ignore the site. Do not allow every AI crawler and then complain when training crawlers consume pages you meant only for user-facing search. Separate the purposes, state the preference, and verify behavior.
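
One way to sanity-check a purpose-level file before publishing is to run it through a parser and print the decision per token. The policy text below is an illustrative assumption (search surfaces allowed, training tokens disallowed), not a recommendation, and the standard-library parser only approximates how each operator evaluates rules. Google-Extended is included as a robots.txt token; per Google's documentation it never appears as an HTTP user agent, so the result models a stated preference, not a request you will see in logs.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: allow search and answer surfaces, opt out of training.
POLICY = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

TOKENS = ("OAI-SearchBot", "GPTBot", "Google-Extended", "Googlebot")

parser = RobotFileParser()
parser.parse(POLICY.splitlines())


def decisions(path: str) -> dict[str, bool]:
    """Map each token to the allow/deny answer a compliant client would reach."""
    url = f"https://example.com{path}"
    return {token: parser.can_fetch(token, url) for token in TOKENS}


for token, allowed in decisions("/blog/agents-txt/").items():
    print(f"{token:16} {'allow' if allowed else 'disallow'}")
```

The output is the policy matrix in miniature: the two search-facing tokens resolve to allow, the two training tokens to disallow. If the printed matrix does not match the outcome you wanted, fix the file before a crawler reads it.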

Llms.txt Solves A Different Problem

llms.txt does not replace robots.txt. Jeremy Howard’s proposal describes /llms.txt as a way to provide information that helps LLMs use a website at inference time.6 The same proposal says llms.txt can coexist with current web standards: sitemaps list pages for search engines, while llms.txt offers a curated overview for LLMs and can complement robots.txt with context for allowed content.6

That distinction matters for AIO work.

robots.txt answers: “May this crawler request this path?”

llms.txt answers: “If an assistant reads my site, what should it understand first?”

agents.txt may answer: “What should agentic clients know about desired behavior?”

Those questions sit near each other, but they do not collapse into one file. A serious site should treat AI discovery like a release surface:

  1. Publish canonical pages with clear titles and descriptions.
  2. Add structured data that matches the visible page.
  3. Keep sitemap and RSS output current.
  4. Publish llms.txt and llms-full.txt for curated AI context.
  5. Publish robots.txt with explicit crawler policy.
  6. Add agents.txt only if the platform or agent ecosystem gives the file a concrete reader.
  7. Check logs to confirm crawlers request the changed files.

Skipping the last step turns AIO into a hope ritual. The crawler file exists. The route returns 200. No evidence proves the intended clients saw it.
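
A first pass at that evidence can come straight from the access log. The sketch below assumes Combined Log Format lines and reports which policy files each crawler token fetched with a 200. The sample lines, IP addresses, and user-agent strings are made up for illustration, and matching on the user-agent only records the client's claim.

```python
import re

# Combined Log Format: ip, identity, user, [time], "request", status, bytes, "referer", "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

POLICY_FILES = {"/robots.txt", "/llms.txt", "/llms-full.txt", "/agents.txt"}


def policy_fetches(lines, crawler_tokens):
    """Map each crawler token to the policy files it fetched with status 200."""
    seen = {token: set() for token in crawler_tokens}
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m["status"] != "200":
            continue
        path = m["path"].split("?", 1)[0]  # ignore query strings
        if path not in POLICY_FILES:
            continue
        for token in crawler_tokens:
            if token.lower() in m["agent"].lower():
                seen[token].add(path)
    return seen


# Fabricated sample lines; real user-agent strings differ.
SAMPLE_LOG = [
    '203.0.113.7 - - [18/May/2026:02:14:09 +0000] "GET /robots.txt HTTP/1.1" 200 312 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [18/May/2026:02:15:41 +0000] "GET /llms.txt HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.9 - - [18/May/2026:02:15:44 +0000] "GET /agents.txt HTTP/1.1" 404 0 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
]

print(policy_fetches(SAMPLE_LOG, ["GPTBot", "OAI-SearchBot"]))
```

An empty set for a crawler you care about is the useful result here: the file exists, but nothing shows that client read it.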

Verification Belongs At The Edge

User-agent strings do not prove identity. A random script can send User-Agent: Googlebot. A scraper can send User-Agent: GPTBot. Policy that trusts the header alone gives the most generous treatment to the easiest spoof.

Google documents two verification paths for requests that claim to come from Google: reverse DNS plus forward DNS for one-off checks, and published IP range matching for larger systems.7 OpenAI publishes IP-address JSON files for OAI-SearchBot, GPTBot, and ChatGPT-User in its crawler documentation.4 Those mechanisms do not cover every crawler. They do establish the right shape: identity requires evidence beyond a string.
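
Both verification shapes fit in a few lines of standard-library Python. This is a sketch under stated assumptions: the CIDR ranges come from documentation-only address blocks, and the hostname suffix .crawler.example is a placeholder. Load the real ranges and suffixes from each operator's published documentation. The resolver functions are injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket

# Placeholder ranges from documentation-only blocks, NOT real crawler ranges.
PUBLISHED_RANGES = ["192.0.2.0/24", "2001:db8::/32"]


def ip_in_published_ranges(ip: str, cidrs) -> bool:
    """True if the source IP falls inside any published CIDR block."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def reverse_then_forward(ip, allowed_suffixes,
                         reverse=socket.gethostbyaddr,
                         forward=socket.getaddrinfo) -> bool:
    """Reverse-resolve the IP, check the hostname suffix, then confirm the
    hostname resolves forward to the same IP."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False
    # Suffixes carry a leading dot so "evilcrawler.example" cannot pass.
    if not hostname.lower().rstrip(".").endswith(tuple(allowed_suffixes)):
        return False
    try:
        infos = forward(hostname, None)
    except OSError:
        return False
    target = ipaddress.ip_address(ip)
    return any(ipaddress.ip_address(info[4][0]) == target for info in infos)


print(ip_in_published_ranges("192.0.2.44", PUBLISHED_RANGES))
print(ip_in_published_ranges("198.51.100.9", PUBLISHED_RANGES))
```

Range matching is cheap enough to run on every request. The DNS round trip is slower, so it suits one-off checks or a cached verdict per IP.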

The minimum edge policy should record:

| Evidence | Why it matters |
| --- | --- |
| User-agent | Shows the client’s claim. |
| Source IP and ASN | Helps separate cloud scrapers from verified crawler ranges. |
| Reverse DNS or IP-range result | Proves identity where the operator supports verification. |
| Requested path | Shows what content the client actually touched. |
| robots.txt fetch timing | Shows whether the client checked policy before crawling. |
| Status code and cache result | Shows what the crawler received. |
| Rate and path pattern | Reveals abuse even from named bots. |

That log packet turns crawler policy from opinion into evidence. If GPTBot keeps requesting disallowed paths, you can prove it. If a fake Googlebot hammers private-looking URLs from a residential proxy, you can block it without punishing real Googlebot. If OAI-SearchBot never requests the changed article, you know why the page has not surfaced in ChatGPT search.
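
The log packet can be modeled as a small record plus two questions: did the client fetch robots.txt before content, and did it request paths the policy disallows? The field names, the sample events, and the one-rule policy below are assumptions for illustration. Real crawlers cache robots.txt, so the timing check only means something over a long enough window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from urllib.robotparser import RobotFileParser


@dataclass(frozen=True)
class CrawlEvent:
    when: datetime
    user_agent: str          # the client's claim
    source_ip: str
    identity_verified: bool  # result of a reverse DNS or IP-range check
    path: str
    status: int
    cache: str               # e.g. "HIT" or "MISS"


def fetched_robots_before_content(events) -> bool:
    """True if a robots.txt fetch precedes the first non-robots request."""
    robots_times = [e.when for e in events if e.path == "/robots.txt"]
    content_times = [e.when for e in events if e.path != "/robots.txt"]
    if not robots_times:
        return False
    return not content_times or min(robots_times) <= min(content_times)


def disallowed_hits(events, rules: RobotFileParser, token: str) -> list[str]:
    """Paths requested under this token that the published policy disallows."""
    return [
        e.path for e in events
        if token.lower() in e.user_agent.lower()
        and not rules.can_fetch(token, e.path)
    ]


# Hypothetical policy and events.
rules = RobotFileParser()
rules.parse(["User-agent: GPTBot", "Disallow: /drafts/"])

start = datetime(2026, 5, 18, 2, 0, 0)
claimed = "Mozilla/5.0 (compatible; GPTBot/1.0)"
events = [
    CrawlEvent(start, claimed, "192.0.2.10", True, "/robots.txt", 200, "MISS"),
    CrawlEvent(start + timedelta(seconds=5), claimed, "192.0.2.10", True, "/blog/post", 200, "HIT"),
    CrawlEvent(start + timedelta(seconds=9), claimed, "192.0.2.10", True, "/drafts/a", 200, "MISS"),
]

print(fetched_robots_before_content(events))
print(disallowed_hits(events, rules, "GPTBot"))
```

In this made-up window the client checked policy first and still requested one disallowed path. That pairing, with a verified identity attached, is the kind of record worth sending to an operator.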

A Practical AI Crawler Policy Packet

Do not start with the file. Start with the outcome.

| Outcome | Required control |
| --- | --- |
| Search engines should index public pages. | Sitemap, canonical tags, schema, fast 200 responses, and allowed search crawlers. |
| AI answer engines should understand the site. | Clean articles, schema, RSS, llms.txt, and source pages with explicit summaries. |
| Training crawlers should avoid specific content. | Purpose-specific robots.txt groups, plus server enforcement where policy or law requires it. |
| Private content must remain private. | Authentication, authorization, no public links, no crawler-file disclosure, and no cache leak. |
| Bad bots should not drain resources. | Rate limits, WAF rules, verified-bot exceptions, and abuse logs. |
| Policy changes should be auditable. | Route checks, crawler fetch logs, deployment timestamps, and a short review packet. |

That packet gives each layer the right job. robots.txt communicates preference. llms.txt communicates context. agents.txt communicates agent-facing intent where a reader exists. The server enforces. The logs prove.
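
The enforcement layer is where rate limits and verified-bot exceptions live. As a rough illustration, the sketch below uses a token bucket per client, with a larger budget for clients that passed identity verification. The class names and the limits are hypothetical, and a production setup would usually do this in a WAF or reverse proxy, not in application code.

```python
import time


class TokenBucket:
    """Token bucket: spend one token per request, refill steadily over time."""

    def __init__(self, capacity: float, refill_per_second: float, now: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class EdgeLimiter:
    """Per-client buckets; verified crawlers get a larger budget.

    Each limit is (burst capacity, refill per second). The defaults are
    placeholder numbers, not tuning advice.
    """

    def __init__(self, unverified=(2, 1.0), verified=(5, 5.0)):
        self.limits = {False: unverified, True: verified}
        self.buckets = {}

    def allow(self, client_ip: str, verified: bool, now=None) -> bool:
        now = time.monotonic() if now is None else now
        key = (client_ip, verified)
        if key not in self.buckets:
            capacity, refill = self.limits[verified]
            self.buckets[key] = TokenBucket(capacity, refill, now)
        return self.buckets[key].allow(now)


limiter = EdgeLimiter()
# An unverified client exhausts its small burst quickly...
print([limiter.allow("203.0.113.7", verified=False, now=0.0) for _ in range(3)])
# ...while a verified crawler gets more room in the same instant.
print([limiter.allow("192.0.2.10", verified=True, now=0.0) for _ in range(6)])
```

The verified flag should come from an identity check such as IP-range or DNS verification, never from the user-agent header alone. Otherwise the exception becomes a reward for spoofing.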

On my own site, crawler work follows that split. The public policy file welcomes legitimate crawlers and blocks raw Markdown paths that crawlers had inferred from code-block examples. The AI context files give assistants a curated route into the public writing. The overnight crawl census tells me whether crawlers saw errors, stale cache, missing routes, or old URLs that should now return 410. The policy file gives intent. The logs decide whether the intent worked.

What To Put In Agents.txt

Until the ecosystem settles, keep agents.txt boring and public.

Good candidates:

  • Site contact and policy URL.
  • Pointers to robots.txt, sitemap, llms.txt, and RSS.
  • A statement of preferred public-content use.
  • A warning that private or authenticated routes require authorization.
  • A support address for crawl issues.

Bad candidates:

  • Secret paths.
  • Internal prompt rules.
  • Non-public API routes.
  • Customer names.
  • Security exceptions.
  • Instructions that would harm the site if copied by a hostile client.

The right standard for agents.txt is not “Would a good agent appreciate this?” The right standard is “Would I be comfortable if a bad agent, a search result, and a random user all read this file?”

The Better Mental Model

Crawler files are signs on a public road.

A sign can say “delivery entrance,” “do not enter,” or “start here.” Respectful drivers follow the sign. Reckless drivers ignore it. The sign still helps because most legitimate traffic wants clear instructions. The sign fails when you treat it like a locked door.

AI crawlers make the signs more important and less sufficient at the same time. More important because AI systems need clear public context, purpose-specific policy, and route maps. Less sufficient because user agents multiply, training and search split apart, and bad clients can impersonate good ones.

The answer is not to give up on crawler files. The answer is to lower their authority to the right level. Publish clear public policy. Verify who requests the files. Watch what they fetch. Enforce private boundaries at the server. Treat every claim about “AI visibility” as unproven until logs and live routes support it.

That is the difference between AIO theater and real crawler operations.


FAQ

What is agents.txt?

agents.txt is an emerging agent-facing text file some hosts or tools may serve at the site root. DreamHost documents default agents.txt files for Web Hosting plans, but that documentation does not make the file an access-control standard. Treat it as a public hint until a specific agent platform documents exactly how it reads and applies the file.1

Does robots.txt block AI crawlers?

Compliant crawlers may honor robots.txt, and major operators document specific tokens for their crawlers. OpenAI documents OAI-SearchBot and GPTBot controls, while Google documents Google-Extended as a product token for training and grounding preferences.4,5 robots.txt still does not authenticate the client, hide content, or stop a bot that chooses to ignore the file.2,3

Should I publish llms.txt?

Publish llms.txt if you want AI assistants to find a curated map of your public content. The proposal frames llms.txt as inference-time context, not as a replacement for sitemap or robots.txt.6 A useful file points to the pages you actually want agents to understand.

Can a URL blocked by robots.txt still appear in Google Search?

Yes. Google says a URL blocked by robots.txt can still appear if other pages link to it, even though Google will not crawl or index the blocked page content.3 Use authentication, noindex where crawl access is allowed, and server-side policy for pages that must stay out of public results.

How do I tell a real crawler from a fake one?

Use more than the user-agent string. Google documents reverse DNS plus forward DNS checks and published IP range matching.7 OpenAI publishes IP-address JSON files for its documented bots.4 Where a crawler operator does not publish verification data, classify the request as a claim and rate-limit or challenge it according to behavior.

What is the safest crawler-file setup for a public site?

Use robots.txt for crawler policy, sitemap for URL discovery, llms.txt for curated AI context, and agents.txt only for public agent-facing guidance. Keep sensitive paths out of all public files. Then verify live routes, cache state, crawler fetches, and server logs before saying the setup works.

References


  1. DreamHost, “Control bots, spiders, and crawlers,” DreamHost Knowledge Base. Accessed May 18, 2026. 

  2. Koster, M., Illyes, G., Zeller, H., and Sassman, L., “RFC 9309: Robots Exclusion Protocol,” IETF, September 2022. 

  3. Google Search Central, “Introduction to robots.txt,” Google for Developers. 

  4. OpenAI, “Overview of OpenAI Crawlers,” OpenAI API Documentation. 

  5. Google Crawling Infrastructure, “Google’s common crawlers,” Google for Developers. 

  6. Jeremy Howard, “The /llms.txt file,” llms-txt proposal, September 3, 2024. 

  7. Google Crawling Infrastructure, “Verify requests from Google crawlers and fetchers,” Google for Developers, last updated March 20, 2026. 
