AI crawler
Overview
An AI crawler is an automated client operated by an AI provider that fetches web content for one of several distinct purposes: training data collection, building a search index, or real-time ("live") retrieval to answer a user's query. Different crawlers from the same vendor often serve different purposes and are identified by different user-agent strings, and there is no unified cross-vendor governance standard.
Distinguishing the purpose of a crawler matters because a site owner may wish to permit live retrieval (which can drive citations) while restricting training-data collection, or vice versa.
Common crawlers and purposes
| User-agent or token (examples) | Operator | Typical purpose |
|---|---|---|
| GPTBot | OpenAI | Training data collection |
| OAI-SearchBot | OpenAI | Search indexing for ChatGPT search |
| ChatGPT-User | OpenAI | Live retrieval for a user request |
| ClaudeBot | Anthropic | Training / crawling |
| PerplexityBot | Perplexity | Indexing for answers |
| Google-Extended | Google | Controls use for Gemini/Vertex training |
User-agent names and purposes change; operators publish their own documentation, and the table is illustrative rather than exhaustive.
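As a rough illustration of how the table might be used, the sketch below maps the listed names to an operator and purpose and looks them up in a request's User-Agent header. The dictionary `CRAWLER_PURPOSES`, the function `classify_user_agent`, and the sample header string are all hypothetical names and values chosen for this example; substring matching is an assumption, and a User-Agent header can be spoofed, so the result is a hint rather than proof of identity.

```python
from typing import Optional, Tuple

# Copied from the table above; illustrative, not exhaustive, and likely to drift
# as operators rename or repurpose their crawlers.
# Google-Extended is left out because the text treats it as a purpose-specific
# robots.txt token rather than a crawler identified in request headers.
CRAWLER_PURPOSES = {
    "GPTBot": ("OpenAI", "training data collection"),
    "OAI-SearchBot": ("OpenAI", "search indexing"),
    "ChatGPT-User": ("OpenAI", "live retrieval"),
    "ClaudeBot": ("Anthropic", "training / crawling"),
    "PerplexityBot": ("Perplexity", "indexing for answers"),
}


def classify_user_agent(user_agent: str) -> Optional[Tuple[str, str]]:
    """Return (operator, purpose) if a known crawler name appears in the header."""
    header = user_agent.lower()
    for name, info in CRAWLER_PURPOSES.items():
        # Case-insensitive substring match (an assumption of this sketch).
        if name.lower() in header:
            return info
    return None


# Made-up header string for demonstration only, not a real crawler's header.
sample = "Mozilla/5.0 (compatible; GPTBot/1.0; hypothetical-example)"
print(classify_user_agent(sample))
print(classify_user_agent("Mozilla/5.0 (ordinary browser)"))
```

A site that wants stronger guarantees would typically combine a lookup like this with whatever verification the operator documents, since the name alone is easy to forge.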
How access is governed
Access is expressed through robots.txt directives keyed to each user-agent, and increasingly through purpose-specific tokens (for example Google-Extended for training opt-out). These mechanisms are advisory and depend on the crawler honoring them; they are separate from the content-guidance role of llms.txt.
- An AI crawler is not a single type of bot: training crawlers, search-index crawlers, and live-retrieval fetchers behave and are governed differently.
- Allowing live-retrieval crawlers is not the same as allowing training crawlers; robots.txt can permit one and block the other.
Examples
- Blocking GPTBot but allowing ChatGPT-User blocks training collection while permitting live retrieval for user queries.
- This site's robots.txt explicitly allows major AI crawlers on content paths.
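The first example can be checked with Python's standard-library `urllib.robotparser`. This is a minimal sketch: the robots.txt text, the helper `allowed`, and the `example.com` URL are hypothetical, and it only shows what a compliant crawler would conclude, since the directives are advisory.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the training crawler, permit live retrieval.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

# Hypothetical content URL used only for the check.
URL = "https://example.com/articles/ai-crawler"


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Report whether a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return parser.can_fetch(user_agent, url)


print(allowed("GPTBot", URL))        # False: training collection is blocked
print(allowed("ChatGPT-User", URL))  # True: live retrieval is permitted
```

Crawlers with no matching group and no `User-agent: *` group are allowed by default under this parser, so a restrictive policy needs an explicit rule for each name it wants to block.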