AI crawler

From llmref.wiki
AI crawler — An automated agent that fetches web content for AI model training or for real-time retrieval in AI answers.

Overview

An AI crawler is an automated client operated by an AI provider that fetches web content for one of several distinct purposes: training data collection, building a search index, or real-time ("live") retrieval to answer a user's query. Different crawlers from the same vendor often serve different purposes and are identified by different user-agent strings, and there is no unified cross-vendor governance standard.

Distinguishing the purpose of a crawler matters because a site owner may wish to permit live retrieval (which can drive citations) while restricting training-data collection, or vice versa.

Common crawlers and purposes

User-agent (examples)   Operator     Typical purpose
GPTBot                  OpenAI       Training data collection
OAI-SearchBot           OpenAI       Search indexing for ChatGPT search
ChatGPT-User            OpenAI       Live retrieval for a user request
ClaudeBot               Anthropic    Training data collection
PerplexityBot           Perplexity   Indexing for answers
Google-Extended         Google       Controls use for Gemini/Vertex training (a robots.txt token, not a separate crawler)

User-agent names and purposes change; operators publish their own documentation, and the table is illustrative rather than exhaustive.
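
As a rough illustration of how the table might be used in log analysis, the following Python sketch maps a request's User-Agent header to an operator and purpose. The function name, the purpose labels, and the sample header strings are assumptions made for this example, not vendor-published values. Because User-Agent headers can be spoofed, a match is a hint rather than proof of identity.

```python
from typing import Dict, Optional, Tuple

# Tokens and purposes follow the table above; illustrative, not exhaustive.
# Google-Extended is omitted because it is a robots.txt token and is not
# sent as a request User-Agent value.
KNOWN_AI_AGENTS: Dict[str, Tuple[str, str]] = {
    "GPTBot": ("OpenAI", "training"),
    "OAI-SearchBot": ("OpenAI", "search-index"),
    "ChatGPT-User": ("OpenAI", "live-retrieval"),
    "ClaudeBot": ("Anthropic", "training"),
    "PerplexityBot": ("Perplexity", "search-index"),
}


def classify_user_agent(header: str) -> Optional[Tuple[str, str, str]]:
    """Return (token, operator, purpose) for the first known token
    found in a User-Agent header, or None if no token matches."""
    lowered = header.lower()
    for token, (operator, purpose) in KNOWN_AI_AGENTS.items():
        if token.lower() in lowered:
            return token, operator, purpose
    return None


# Simplified, made-up header strings for demonstration only.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))
```

The first call returns the GPTBot entry and the second returns None, since an ordinary browser header contains none of the listed tokens.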

How access is governed

Access is expressed through robots.txt directives keyed to each user-agent, and increasingly through purpose-specific tokens (for example Google-Extended for training opt-out). These mechanisms are advisory and depend on the crawler honoring them; they are separate from the content-guidance role of llms.txt.
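
The per-user-agent model can be sketched with Python's standard urllib.robotparser module, which evaluates robots.txt rules the way a well-behaved crawler might before fetching. The robots.txt content, the example.com URLs, the /articles/ path, and the may_fetch helper are hypothetical and chosen only for illustration. Nothing here enforces access: the check only has an effect if the crawler chooses to run it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group per user-agent token, plus a default.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Allow: /articles/
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


article = "https://example.com/articles/ai-crawler"
draft = "https://example.com/drafts/notes"

print(may_fetch(ROBOTS_TXT, "PerplexityBot", article))    # True
print(may_fetch(ROBOTS_TXT, "PerplexityBot", draft))      # False
print(may_fetch(ROBOTS_TXT, "Google-Extended", article))  # False
print(may_fetch(ROBOTS_TXT, "SomeOtherBot", draft))       # True (falls to *)
```

Python's parser applies the first matching rule within a group, so the Allow line is placed before the broader Disallow line; other parsers may resolve conflicts by longest match instead.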

Distinction from related terms

  • An AI crawler is not a single type of bot: training crawlers, search-index crawlers, and live-retrieval fetchers behave and are governed differently.
  • Allowing live-retrieval crawlers is not the same as allowing training crawlers; robots.txt can permit one and block the other.

Examples

  • Disallowing GPTBot while allowing ChatGPT-User in robots.txt opts out of OpenAI's training collection but still permits live retrieval for user queries.
  • This site's robots.txt explicitly allows major AI crawlers on content paths.
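
The first example above can be sketched as a small Python script that renders a robots.txt file from a per-token policy and then checks the result with the standard urllib.robotparser module. The render_robots_txt helper, the policy dictionary, and the example.com URL are hypothetical; a real site would usually apply path-level rules rather than all-or-nothing ones.

```python
from typing import Dict
from urllib.robotparser import RobotFileParser


def render_robots_txt(policy: Dict[str, bool]) -> str:
    """Render one robots.txt group per user-agent token.
    True allows the whole site for that token; False blocks it."""
    groups = []
    for token, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {token}\n{rule}")
    return "\n\n".join(groups) + "\n"


# Block training collection, permit live retrieval for user requests.
policy = {"GPTBot": False, "ChatGPT-User": True}
robots_txt = render_robots_txt(policy)
print(robots_txt)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

page = "https://example.com/page"
print(parser.can_fetch("GPTBot", page))        # False
print(parser.can_fetch("ChatGPT-User", page))  # True
```

Tokens that the policy does not list are unaffected, because robots.txt permits access unless a matching group disallows it.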
