robots.txt vs llms.txt vs sitemap.xml: what each is for

5/23/2026 · 6 min read

robots.txt is a fence, sitemap.xml is a map, llms.txt is a brief. Three small files, three different jobs. robots.txt controls who is allowed to crawl your site. sitemap.xml tells search crawlers what is worth crawling. llms.txt curates what AI assistants should read first when answering questions about you. The files are not interchangeable, and most online comparisons blur the boundary between them.

Get the metaphors right and the rest falls into place.

robots.txt is a fence

robots.txt sits at the root of your site at /robots.txt and tells well-behaved bots which paths they are allowed to fetch. It is an access policy expressed as a plain-text allowlist or denylist. It is voluntary; malicious crawlers ignore it. Major search engines and the public AI crawlers respect it. The protocol was standardized as RFC 9309 in 2022, formalizing decades of de-facto convention.

A minimum-viable robots.txt looks like this:

User-agent: *
Disallow: /admin/
Disallow: /api/internal/

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml

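This file lets every bot crawl everything except /admin/ and /api/internal/, blocks OpenAI's GPTBot entirely, and points crawlers at the sitemap. A crawler follows only the most specific group that matches its name, so GPTBot obeys its own group and ignores the * group. The Sitemap directive is the one bridge between robots.txt and sitemap.xml. The two files are otherwise separate concerns.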
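To see how a standards-following parser reads the example above, here is a minimal sketch using Python's standard-library urllib.robotparser. The bot names and URLs are taken from the example; real crawlers apply their own matching details (wildcards, for instance), so treat the verdicts as an approximation rather than a guarantee of any vendor's behavior.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/internal/

User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# (user agent, URL) pairs to test against the rules.
CHECKS = [
    ("Googlebot", "https://example.com/blog/welcome"),
    ("Googlebot", "https://example.com/admin/users"),
    ("GPTBot", "https://example.com/blog/welcome"),
]

for agent, url in CHECKS:
    verdict = "allowed" if rules.can_fetch(agent, url) else "blocked"
    print(f"{agent:10} {url} -> {verdict}")

# site_maps() returns the Sitemap lines, or None if there are none (Python 3.8+).
print("Sitemaps:", rules.site_maps())
```

Running a check like this in CI is a cheap way to catch a rule that blocks more than intended.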

What robots.txt does not do:

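It does not remove pages from search indexes. Disallow blocks crawling; it does not deindex content already known to a search engine. Use a noindex robots meta tag or the X-Robots-Tag HTTP header for that, and leave the page crawlable, because a crawler that is blocked from the page never sees the noindex directive.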
It does not protect private content. The file is publicly readable. Anyone can fetch /robots.txt and read the paths you are hiding. Authentication is for protection; robots.txt is for crawl politeness.
It does not bind crawlers that ignore it. Scrapers, spam bots, and many AI-training crawlers fetch your content regardless. Server-side blocking by user-agent or IP is the only enforcement layer.
It does not control rendering safely. Disallowing the path of a JavaScript file used by indexed pages can confuse search engines about page content. Disallow paths carefully on JS-heavy sites.
It does not express a single AI policy. Blocking GPTBot stops the OpenAI training crawler, but ChatGPT can still cite your site through separate user agents (OAI-SearchBot for search and ChatGPT-User for user-initiated browsing), and other AI products use different bot names. The relationship between training and retrieval is per-vendor, so each bot needs its own rule.
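The two server-side mechanisms mentioned in the list above (blocking by user-agent and sending a noindex header) can be sketched as a small WSGI middleware. This is an illustration under stated assumptions, not a hardened implementation: the token "examplescraper" and the /drafts/ prefix are hypothetical, and user-agent strings are trivially spoofed, so real enforcement usually adds IP or rate-based rules.

```python
# Hypothetical policy values, for illustration only.
BLOCKED_AGENT_TOKENS = ("examplescraper",)
NOINDEX_PREFIXES = ("/drafts/",)


def crawl_policy(app):
    """Wrap a WSGI app with user-agent blocking and a noindex header."""

    def wrapped(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()

        # Enforcement: refuse the request instead of asking politely.
        if any(token in agent for token in BLOCKED_AGENT_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]

        path = environ.get("PATH_INFO", "/")

        def start_with_policy(status, headers, exc_info=None):
            # Deindexing: the page stays fetchable, but carries noindex.
            if path.startswith(NOINDEX_PREFIXES):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, start_with_policy)

    return wrapped


def hello_app(environ, start_response):
    """Stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


application = crawl_policy(hello_app)
```

Paths that rely on the noindex header must not also be disallowed in robots.txt, or compliant crawlers will never fetch the response that carries the header.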

sitemap.xml is a map

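sitemap.xml can live anywhere on your domain (commonly /sitemap.xml) and tells search crawlers which URLs you want them to consider. Under the protocol, a sitemap may only list URLs at or below its own directory unless it is referenced from robots.txt, which is one reason the root is the usual home. Each entry can include a last-modified date, a priority hint, and a change-frequency hint; Google documents that it ignores the last two, so lastmod is the field worth getting right. The format is defined by the sitemap protocol at sitemaps.org. It is the canonical way to surface URLs that crawlers might not find through internal linking.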

A minimum-viable sitemap looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-05-21</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/welcome</loc>
    <lastmod>2026-05-10</lastmod>
  </url>
</urlset>
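A file like this is easier to keep accurate when it is generated rather than hand-edited. The sketch below builds the same two-entry sitemap with Python's standard library. The PAGES inventory is a hypothetical stand-in for whatever source of truth a site already has (a CMS export, a build manifest), and each date is assumed to be the last real content change.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical page inventory: (absolute URL, date of last real content change).
PAGES = [
    ("https://example.com/", date(2026, 5, 21)),
    ("https://example.com/blog/welcome", date(2026, 5, 10)),
]


def build_sitemap(pages):
    """Return a sitemap document as a string for the given (url, date) pairs."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for loc, modified in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # special characters are escaped on output
        ET.SubElement(url, "lastmod").text = modified.isoformat()
    ET.indent(urlset)  # pretty-printing only; requires Python 3.9+
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"


sitemap_xml = build_sitemap(PAGES)
print(sitemap_xml)
```

Writing the declaration by hand keeps the output independent of how a given Python version renders it.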

What sitemap.xml does not do:

It does not force indexing. Listing a URL in a sitemap is a suggestion, not a command. Search engines decide what to index based on content quality, internal linking, and many other signals.
It does not replace internal linking. Pages that nothing on your site links to are weak candidates for indexing even when they appear in the sitemap. Site structure carries more weight than a sitemap entry.
It does not make inflated lastmod values believable. Stamping every lastmod with today's date on every build makes the whole file untrustworthy, and major search engines discount dates they cannot trust. Report the date of the last meaningful content change and nothing else.
It does not target AI assistants in any specific way. Some AI crawlers use sitemaps to discover URLs, but the format was designed for search-engine indexing, not for LLM retrieval.

llms.txt is a brief

llms.txt is the newest of the three, introduced in 2024 with the spec hosted at llmstxt.org. It is a single markdown file at /llms.txt that gives AI assistants a curated, structured view of your most important content. The point is to save the LLM from crawling and synthesizing the entire site every time someone asks about you, by pointing it at a short list of pages that already say what matters.

A minimum-viable llms.txt looks like this:

# Example.com

> A short paragraph describing what Example.com is, the audience it serves, and the kind of questions an AI assistant should be able to answer from this site.

## Docs

- [Getting started](https://example.com/docs/start.md): The 5-minute onboarding guide.
- [API reference](https://example.com/docs/api.md): The full HTTP API surface.

## Blog

- [Why we added llms.txt](https://example.com/blog/llms-txt.md): Our reasoning and implementation.

## Optional

- [Detailed pricing](https://example.com/pricing.md): Plan limits and feature matrix.

Only the H1 title is required by the spec. The blockquote summary and the H2 sections (Docs, Blog, etc.) are optional, but the H2 sections are where the link lists live. A section named Optional has a special meaning: its links are lower priority and may be skipped by an LLM with limited context. Linked targets are usually served as markdown variants (page.md alongside page.html) for cleaner LLM ingestion, although the standard does not require it.
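To make the structure concrete, here is a minimal sketch of how a consumer might read such a file: collect the title, group links by H2 section, and drop the Optional section when context is tight. The regular expression covers only the "- [title](url): note" shape used in the example, and the function names are hypothetical; this is not a full markdown parser and not how any particular AI product is known to work.

```python
import re

# A shortened version of the example llms.txt, inlined for the sketch.
LLMS_TXT = """\
# Example.com

> A short paragraph describing what Example.com is.

## Docs

- [Getting started](https://example.com/docs/start.md): The 5-minute onboarding guide.
- [API reference](https://example.com/docs/api.md): The full HTTP API surface.

## Optional

- [Detailed pricing](https://example.com/pricing.md): Plan limits and feature matrix.
"""

# Matches "- [title](url)" with an optional ": note" suffix.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_llms_txt(text):
    """Return (site_title, {section_name: [(title, url, note), ...]})."""
    site_title = None
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif line.startswith("# ") and site_title is None:
            site_title = line[2:].strip()
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                sections[current].append(
                    (match["title"], match["url"], match["note"] or "")
                )
    return site_title, sections


def links_to_read(sections, include_optional=True):
    """Flatten section links in file order, optionally skipping 'Optional'."""
    return [
        url
        for name, links in sections.items()
        if include_optional or name != "Optional"
        for _title, url, _note in links
    ]


title, sections = parse_llms_txt(LLMS_TXT)
print(title, links_to_read(sections, include_optional=False))
```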

What llms.txt does not do:

It is not robots.txt. robots.txt controls access; llms.txt curates what to read. An LLM that respects llms.txt still respects robots.txt for crawl permission.
It is not a sitemap. A sitemap aims for completeness for indexers; llms.txt aims for curation for synthesizers. Listing every page defeats the point.
It does not bind any specific model. Adoption is voluntary and uneven. Some AI products use llms.txt as a primary source, others ignore it entirely. Treat it as an opportunity to be cited, not a guarantee.
It does not replace the linked pages. The LLM still fetches the targets you list. Make those destinations high-signal; a broken or thin linked page wastes the brief.

We rolled out llms.txt on this blog recently and wrote up the why and how in the WebPixie llms.txt adoption post. Read that for the practical implementation side, including how to handle the markdown variants.

How the three files interact in practice

Crawlers and AI assistants read combinations of the files in different orders, depending on what they are doing:

A search crawler (Googlebot, Bingbot) fetches robots.txt first to check permission, then sitemap.xml to discover URLs, then crawls the discovered URLs. robots.txt is the gate; sitemap.xml is the hint sheet.
An AI training crawler (GPTBot, ClaudeBot) fetches robots.txt to check whether crawling is allowed at all. If allowed, it typically does not use llms.txt; it crawls from the homepage or sitemap. Google-Extended works at the same layer, but it is a robots.txt token honored by Google's existing crawlers rather than a separate bot.
An AI assistant doing live retrieval (the ChatGPT browser, Perplexity, Gemini grounding) may fetch llms.txt first when it exists and then fetch only the listed pages, though support varies by product. If llms.txt is absent or unsupported, it falls back to sitemap-driven or search-driven retrieval.

Conflicts are rare because the files address different layers, but a few are worth watching for:

robots.txt blocks a path that llms.txt links to. The AI assistant has the brief but cannot fetch the target. Audit your llms.txt against your robots.txt before shipping.
sitemap.xml includes URLs that robots.txt disallows. Major search engines treat the disallow as authoritative; the sitemap entry is wasted. Remove or update one to match the other.
llms.txt lists pages that no longer exist. Dead links waste the assistant's fetches and leave it without the context you meant to provide. Generate llms.txt from a source of truth at build time instead of maintaining it by hand.
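The first two conflicts above can be caught before shipping with a small audit script. The sketch below is one way to do it with the standard library, assuming all three files are already loaded as strings or lists. The sample data, the /docs/internal/ path, and the agent name "ExampleAssistant" are hypothetical. Checking for 404s needs real HTTP requests and is left out here.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical sample inputs for the audit.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /docs/internal/
"""

LLMS_TXT = """\
# Example.com

## Docs

- [Getting started](https://example.com/docs/start.md): Onboarding.
- [Design notes](https://example.com/docs/internal/design.md): Architecture overview.
"""

SITEMAP_URLS = [
    "https://example.com/",
    "https://example.com/admin/login",
]


def markdown_link_targets(markdown_text):
    """Extract absolute http(s) link targets from markdown-style links."""
    return re.findall(r"\]\((https?://[^)\s]+)\)", markdown_text)


def blocked_urls(robots_txt, urls, agent):
    """Return the URLs that robots.txt disallows for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Conflict 1: llms.txt links that an assistant would not be allowed to fetch.
llms_conflicts = blocked_urls(
    ROBOTS_TXT, markdown_link_targets(LLMS_TXT), "ExampleAssistant"
)

# Conflict 2: sitemap entries that search crawlers are told not to crawl.
sitemap_conflicts = blocked_urls(ROBOTS_TXT, SITEMAP_URLS, "Googlebot")

print("llms.txt links blocked by robots.txt:", llms_conflicts)
print("sitemap URLs blocked by robots.txt:", sitemap_conflicts)
```

Failing the build when either list is non-empty keeps the three files from drifting apart.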

Which to ship first

Most sites already have a robots.txt; if you do not, ship one. It is the smallest, oldest, and least optional of the three. The second priority depends on what you want.

If discoverability in search is the goal, prioritize sitemap.xml. It is well-supported, low-risk, and search-engine-friendly. llms.txt can come later.
If citation by AI assistants is part of your traffic strategy, prioritize llms.txt. Major AI products are still building llms.txt support; early adoption pays off when adoption widens.
If you have the time, ship both. They are not mutually exclusive, and the marginal cost of generating one when you already have the other is small. Most static-site generators have plugins for both.

Three files, three jobs, one mental model. robots.txt fences off what crawlers should not touch. sitemap.xml maps what they should consider. llms.txt briefs AI assistants on what matters most. Ship robots.txt first, add the other two in the order your goals dictate, keep all three in sync, and they do their work without anyone reading another comparison post on the subject.
