AI crawler

AI crawler — An automated agent that fetches web content for AI model training or for real-time retrieval in AI answers.

Overview

An AI crawler is an automated client operated by an AI provider that fetches web content for one of several distinct purposes: training data collection, building a search index, or real-time ("live") retrieval to answer a user's query. Different crawlers from the same vendor often serve different purposes and are identified by different user-agent strings, and there is no unified cross-vendor governance standard.

Distinguishing the purpose of a crawler matters because a site owner may wish to permit live retrieval (which can drive citations) while restricting training-data collection, or vice versa.

Common crawlers and purposes

User-agent (examples)	Operator	Typical purpose
GPTBot	OpenAI	Training data collection
OAI-SearchBot	OpenAI	Search indexing for ChatGPT search
ChatGPT-User	OpenAI	Live retrieval for a user request
ClaudeBot	Anthropic	Training / crawling
PerplexityBot	Perplexity	Indexing for answers
Google-Extended	Google	robots.txt control token (not a separate crawler) governing use for Gemini/Vertex training

User-agent names and purposes change; operators publish their own documentation, and the table is illustrative rather than exhaustive.
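As a rough illustration of how the table might be used, the sketch below maps a request's User-Agent header to an operator and purpose. The lookup table simply mirrors the illustrative rows above, the function name `classify_user_agent` and the sample header strings are hypothetical, and Google-Extended is left out on the assumption that it is a robots.txt token that does not appear in request headers. A user-agent string can be spoofed, so a match indicates what a client claims to be, not a verified identity.

```python
from typing import Optional, Tuple

# Illustrative only: copied from the table above, which is itself not exhaustive.
# Operators publish the authoritative, current user-agent names.
KNOWN_CRAWLERS = {
    "GPTBot": ("OpenAI", "training data collection"),
    "OAI-SearchBot": ("OpenAI", "search indexing"),
    "ChatGPT-User": ("OpenAI", "live retrieval"),
    "ClaudeBot": ("Anthropic", "training / crawling"),
    "PerplexityBot": ("Perplexity", "indexing for answers"),
}


def classify_user_agent(header: str) -> Optional[Tuple[str, str, str]]:
    """Return (token, operator, purpose) for a known AI crawler, else None.

    Uses a case-insensitive substring match on the User-Agent header.
    """
    lowered = header.lower()
    for token, (operator, purpose) in KNOWN_CRAWLERS.items():
        if token.lower() in lowered:
            return token, operator, purpose
    return None


if __name__ == "__main__":
    # Hypothetical header strings, for demonstration only.
    print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(classify_user_agent("Mozilla/5.0 (X11; Linux x86_64)"))
```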

How access is governed

Access is expressed through robots.txt directives keyed to each user-agent, and increasingly through purpose-specific tokens (for example, Google-Extended for training opt-out). These mechanisms are advisory: they work only if the crawler honors them. They are also separate from llms.txt, which offers content guidance rather than access control.
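To make the per-user-agent model concrete, here is a minimal sketch that generates robots.txt text from a purpose-level policy. The purpose labels, the grouping of tokens under each purpose, and the function name `build_robots_txt` are assumptions made for illustration, loosely following the table above; a real policy should use the names each operator currently documents.

```python
from typing import Iterable

# Assumed grouping of tokens by purpose, based on the illustrative table.
TOKENS_BY_PURPOSE = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended"],
    "search_index": ["OAI-SearchBot", "PerplexityBot"],
    "live_retrieval": ["ChatGPT-User"],
}


def build_robots_txt(allowed_purposes: Iterable[str]) -> str:
    """Emit one robots.txt group per token: allow everything or disallow everything."""
    allowed = set(allowed_purposes)
    groups = []
    for purpose, tokens in TOKENS_BY_PURPOSE.items():
        rule = "Allow: /" if purpose in allowed else "Disallow: /"
        for token in tokens:
            groups.append(f"User-agent: {token}\n{rule}")
    # Groups are separated by blank lines, as is conventional in robots.txt.
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Permit live retrieval and search indexing, opt out of training.
    print(build_robots_txt({"live_retrieval", "search_index"}))
```

The output is still only a request: it takes effect to the extent that each crawler chooses to honor it.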

Distinction from related terms

An AI crawler is not a single type of bot: training crawlers, search-index crawlers, and live-retrieval fetchers behave and are governed differently.
Allowing live-retrieval crawlers is not the same as allowing training crawlers; robots.txt can permit one and block the other.

Examples

Blocking GPTBot but allowing ChatGPT-User blocks training collection while permitting live retrieval for user queries.
This site's robots.txt explicitly allows major AI crawlers on content paths.
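The first example can be checked with Python's standard `urllib.robotparser`. This is a minimal sketch: the robots.txt text, the `example.com` URL, and the helper name `is_allowed` are hypothetical, and the parser only models how a compliant crawler would interpret the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the training crawler, allow the live-retrieval fetcher.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether a rule-abiding client with this user-agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/ai-crawler"  # placeholder URL
    for agent in ("GPTBot", "ChatGPT-User"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

Under these rules the check should report that GPTBot is refused and ChatGPT-User is permitted.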
