Should you block AI crawlers? Honest answer for 2026

6/5/2026 · 6 min read

Most sites should allow the major AI training crawlers in 2026, not block them. The cost of being invisible inside AI assistant answers is now larger than the upside of being uncrawled for training. The exceptions are real but narrow: IP-sensitive content, regulated data, content where commercial moats depend on scarcity. For everyone else, blocking is the lazy default that quietly costs you discovery.

Most existing posts on this topic hedge to a 'balanced hybrid approach' without explaining what the actual tradeoff looks like per provider. This post takes a clearer line. It walks through what each AI provider's bots actually do, separates training crawl from live retrieval, and gives the concrete robots.txt for each realistic decision.

The three providers and their bots in 2026

Every major AI provider now runs more than one crawler. The names are confusing, the purposes overlap, and most robots.txt advice on the internet treats them as one thing. They are not.

OpenAI: three bots, three jobs

OpenAI documents three bots in its bots reference:

GPTBot collects content used to train future OpenAI models. Verifiable against OpenAI's GPTBot IP list.
OAI-SearchBot crawls the web for ChatGPT search results. Verifiable against OpenAI's OAI-SearchBot IP list.
ChatGPT-User fetches pages on demand when a ChatGPT user asks a question that requires browsing. Verifiable against OpenAI's ChatGPT-User IP list.

Blocking GPTBot stops your content from feeding new model training runs. Blocking OAI-SearchBot stops you from being a citable source in ChatGPT search results. Blocking ChatGPT-User stops live retrieval when a user explicitly asks ChatGPT about your site. The three have very different consequences, and most blanket 'block ChatGPT' advice treats them as one.

Anthropic: three bots, three jobs

Anthropic documents the same structural pattern in Anthropic's crawler FAQ:

ClaudeBot collects content for training Anthropic's Claude models.
Claude-User performs live retrieval when a Claude user directs it to fetch a page.
Claude-SearchBot analyses content to improve Claude's search results.

IP ranges are published at Anthropic's crawler IP list for verification. Anthropic notes explicitly that their bots respect robots.txt 'do not crawl' directives.
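Verifying a request against a published IP list comes down to a CIDR membership check. The sketch below is a minimal illustration, not a provider-endorsed method: the function name is hypothetical, and the ranges are placeholder documentation blocks (RFC 5737), not real crawler addresses. In practice you would load the ranges from the provider's published list.

```python
import ipaddress

# Placeholder CIDR blocks from the RFC 5737 documentation ranges.
# Assumption: real values would come from the provider's published IP list.
EXAMPLE_PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/25"]


def ip_in_published_ranges(ip, cidr_ranges):
    """Return True if `ip` falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed-version lists are safe to pass in.
    return any(address in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


if __name__ == "__main__":
    for candidate in ("192.0.2.17", "203.0.113.5"):
        verdict = ip_in_published_ranges(candidate, EXAMPLE_PUBLISHED_RANGES)
        print(candidate, "matches a published range:", verdict)
```

A user-agent string on its own is trivially spoofed, which is why the IP check matters: a request claiming to be a known bot from an address outside the published ranges is not that bot.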

Google: one control token, no separate bot

Google takes a different approach, documented in Google's common crawlers documentation.

Google-Extended is not a separate crawler with its own user-agent string. It is a robots.txt control token. The actual fetching is done by Googlebot; Google-Extended tells Google whether your content can be used for training Gemini and the Gemini API. The critical clarification from Google's own docs: 'Google-Extended does not impact a site's inclusion in Google Search nor is it used as a ranking signal in Google Search.' Blocking Google-Extended does not affect Search.
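For log analysis, the taxonomy above can be held as a small lookup table. This is a sketch under assumptions: the purpose labels and function name are made up for illustration, and the user-agent strings in the usage lines are simplified examples rather than the exact headers the providers send. Google-Extended is deliberately absent, because it is a robots.txt token and never appears as a user-agent in requests.

```python
from typing import NamedTuple, Optional


class BotInfo(NamedTuple):
    provider: str
    purpose: str  # illustrative labels: "training", "search", "user-retrieval"


# Tokens and purposes as described in the provider sections above.
AI_BOTS = {
    "GPTBot": BotInfo("OpenAI", "training"),
    "OAI-SearchBot": BotInfo("OpenAI", "search"),
    "ChatGPT-User": BotInfo("OpenAI", "user-retrieval"),
    "ClaudeBot": BotInfo("Anthropic", "training"),
    "Claude-SearchBot": BotInfo("Anthropic", "search"),
    "Claude-User": BotInfo("Anthropic", "user-retrieval"),
}


def classify_user_agent(user_agent: str) -> Optional[BotInfo]:
    """Return the matching BotInfo, or None if no known AI token is present."""
    lowered = user_agent.lower()
    for token, info in AI_BOTS.items():
        if token.lower() in lowered:
            return info
    return None


if __name__ == "__main__":
    # Simplified, illustrative header values.
    print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(classify_user_agent("Mozilla/5.0 (compatible; Googlebot/2.1)"))
```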

What allowing actually buys you

The honest case for allowing training crawlers comes down to citation visibility. It is stronger in 2026 than it was in 2023.

Citation share in AI answers. ChatGPT, Claude, and Gemini increasingly answer questions by citing or summarising specific pages. If your content is not in the training set and not in the live retrieval surface, you are not in the answer.
Long-tail discovery. AI assistants surface niche content that traditional search undervalues. A small blog with deep expertise on one topic can earn AI citations it could never earn against high-DA SEO competitors on Google. Blocking gives up that lane entirely.
Trust signals during sales cycles. If your buyers research vendors through ChatGPT or Claude before making contact, a site blocked from AI retrieval can surface as "unable to verify" in answers, which reads as a credibility gap.
Compounding inertia. Sites that established AI-citation footprints early benefit from retrieval models that learned them as canonical sources. Catching up later is harder than not blocking now.

These are not hypothetical benefits any more. Anecdotal reports from publishers suggest some AI-driven referral traffic, though hard numbers are scarce and vary widely by site type.

What blocking actually costs you

The cost of blocking has shifted since 2023. Three points worth noting:

Invisibility in AI answers. The most direct cost. Competitors who allow crawlers appear in answers where you do not. For SEO-equivalent visibility purposes, this is now a real lane to compete in.
No exit clause from existing training sets. Blocking GPTBot today does not remove your content from models already trained on it. Major training datasets sampled the public web years ago and content from those scrapes is already baked in. Blocking is forward-looking, not retroactive.
Operational complexity. Maintaining an accurate bot blocklist requires keeping up with new user-agent strings as providers add them. Anthropic added Claude-SearchBot after the initial ClaudeBot; OpenAI added OAI-SearchBot and ChatGPT-User after the initial GPTBot. A blocklist that has gone out of date allows exactly what it was trying to prevent.
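The staleness problem can be checked mechanically. The sketch below, with hypothetical function names, reports which known AI tokens a robots.txt never names. It assumes the token list is one you keep current yourself, and it only checks whether a token is mentioned: it does not evaluate the rules, so a site relying on a `User-agent: *` group would need a fuller check.

```python
# Tokens discussed in this post; extend the list as providers add bots.
KNOWN_AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "Google-Extended",
]


def user_agents_in(robots_txt):
    """Collect the lowercased User-agent tokens named in a robots.txt body."""
    tokens = set()
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            tokens.add(value.strip().lower())
    return tokens


def unlisted_tokens(robots_txt, known_tokens=KNOWN_AI_TOKENS):
    """Return the known tokens that the robots.txt does not mention at all."""
    named = user_agents_in(robots_txt)
    return [token for token in known_tokens if token.lower() not in named]


if __name__ == "__main__":
    # A blocklist written when only the first training crawlers existed.
    old_blocklist = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: ClaudeBot\nDisallow: /\n"
    print(unlisted_tokens(old_blocklist))
```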

The per-provider decision

The three providers behave differently. The decision is not one-size-fits-all.

OpenAI

Most sites should allow ChatGPT-User and OAI-SearchBot, because those are the surfaces where ChatGPT actually cites you. The training crawl (GPTBot) is the one with the lowest direct upside (training-set inclusion has the longest delay between crawl and benefit) and the loudest objections (IP, compensation). Reasonable position: allow ChatGPT-User and OAI-SearchBot, block GPTBot. More permissive: allow all three.

Anthropic

Same pattern. Allow Claude-User and Claude-SearchBot so you can be cited in Claude responses and Claude-powered search. Block ClaudeBot training if you object to training-set inclusion, allow it if you do not. The middle ground is well supported by Anthropic's own bot taxonomy.

Google

Special case. Google-Extended is not a separate bot whose crawling you can block; it only controls whether your content trains Gemini. Blocking has no Search-visibility downside per Google's own documentation. The decision here is purely about training-set inclusion, with no AI-citation tradeoff (Gemini's citation behaviour is governed by separate Google products, not by Google-Extended). For most sites, the choice is roughly neutral; for IP-sensitive content, blocking Google-Extended is the cheapest opt-out available.

The hidden middle ground: block training, allow retrieval

Most online advice treats the decision as binary: block all AI crawlers, or allow all AI crawlers. The bot taxonomies above make a third position available.

Block training crawl, allow live retrieval and search. In concrete terms: Disallow GPTBot, allow ChatGPT-User and OAI-SearchBot. Disallow ClaudeBot, allow Claude-User and Claude-SearchBot. Disallow Google-Extended, leave Googlebot alone (which you would not have blocked anyway).

This middle ground says: 'do not train on me without consent or compensation, but do cite me when a user asks about me directly.' It preserves discovery and citation upside while objecting to training-set extraction. It is the most defensible default for an editorial content business: you keep the citation lane open and only give up the training-set inclusion you were most ambivalent about.

The position has tradeoffs. Some critics argue the distinction is naive: training data drives the same models that do live retrieval, so blocking training while allowing retrieval is morally inconsistent. Others argue the distinction is exactly the point: training is a permanent extraction, retrieval is an in-the-moment citation. Pick a side, but pick it knowingly.

How to actually configure robots.txt

Three concrete configurations cover almost every realistic scenario. Pick the one that matches your position and paste it into your /robots.txt.

Allow everything (default for most public sites)

Do nothing. The default of no rules is the most permissive option. AI crawlers respect robots.txt; if you have no rules excluding them, they fetch by default. For most content businesses interested in citation, this is the right baseline.

Block training only, allow retrieval and search

The middle-ground configuration:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# ChatGPT-User, OAI-SearchBot, Claude-User, Claude-SearchBot
# are allowed by default (no Disallow rules above)

This blocks the two training crawlers (GPTBot and ClaudeBot) and opts out of Gemini training through the Google-Extended token, while leaving the live-retrieval and search bots free to fetch. It is the configuration that matches the block-training, allow-retrieval position above.
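Before deploying, it is worth confirming that the file does what you intend. A minimal sketch using Python's standard-library `urllib.robotparser` follows; the helper name and the example.com URL are placeholders. The standard-library parser is only an approximation of how each provider's crawler interprets the file, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

MIDDLE_GROUND = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def fetch_permissions(robots_txt, agents, url="https://example.com/post"):
    """Map each agent token to whether the rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = [
        "GPTBot", "OAI-SearchBot", "ChatGPT-User",
        "ClaudeBot", "Claude-User", "Claude-SearchBot",
        "Google-Extended", "Googlebot",
    ]
    for agent, allowed in fetch_permissions(MIDDLE_GROUND, agents).items():
        print(f"{agent:18} {'allowed' if allowed else 'blocked'}")
```

The same helper works for any other configuration: pass in the file contents and the tokens you care about.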

Block everything

The maximum-blocking configuration:

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Use this only if you have a specific reason to be invisible to AI products. Most sites do not.
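The configurations differ only in which tokens they list, so the file can be generated from a single decision rather than hand-edited. The sketch below is one way to do that; the posture names (`allow-all`, `block-training`, `block-all`) and the function name are invented for this example.

```python
# Grouping follows the taxonomy in this post.
TRAINING_TOKENS = ["GPTBot", "ClaudeBot", "Google-Extended"]
RETRIEVAL_TOKENS = ["OAI-SearchBot", "ChatGPT-User", "Claude-User", "Claude-SearchBot"]

# Hypothetical posture names mapped to the tokens each one disallows.
POSTURES = {
    "allow-all": [],
    "block-training": TRAINING_TOKENS,
    "block-all": TRAINING_TOKENS + RETRIEVAL_TOKENS,
}


def build_robots_txt(posture):
    """Return robots.txt rules for the given posture (empty string = no rules)."""
    if posture not in POSTURES:
        raise ValueError(f"unknown posture: {posture!r}")
    groups = [f"User-agent: {token}\nDisallow: /\n" for token in POSTURES[posture]]
    return "\n".join(groups)  # blank line between groups


if __name__ == "__main__":
    print(build_robots_txt("block-training"))
```

The output contains only the AI-specific groups. If your site already has other robots.txt rules, append these groups to the existing file rather than replacing it.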

When blocking actually makes sense

Blocking has real use cases. The honest list is short:

IP-sensitive content. Original research, paywalled journalism, proprietary datasets, internal documentation accidentally exposed. The cost of inclusion in a training set may exceed the benefit of being cited.
Regulated content. Healthcare, legal, financial advice content that comes with compliance obligations. Regulators are paying increasing attention to AI training data sourcing; blocking is the conservative posture.
Commercial moats. If your value proposition depends on customers needing to come to your site to access the answer, citation by AI assistants is a direct threat, not an opportunity. Few sites are actually in this position, but those that are tend to know it.
Contractual constraints. Some B2B content contracts require that content not be used for training third-party models. Honour the contract; configure the blocklist accordingly.

If none of these applies to your site, blocking AI training is reflex, not strategy.

The default decision should be 'allow', with the option to block training-specific bots if you have a real reason. The hidden middle ground (block training, allow retrieval) is the most coherent position for editorial content businesses. The maximalist block-everything posture is rarely justified and usually costs more than it saves.

For background on the access-layer file these decisions get expressed alongside, see the WebPixie llms.txt adoption post. The bots respect robots.txt as the gate; llms.txt is the brief on what to read once they are inside.
