Technical AEO

robots.txt and AI Crawlers: Control Which AI Bots Access Your Content

Feb 19, 2025 · 8 min read

Learn how to configure robots.txt for AI crawlers including GPTBot, ClaudeBot, and PerplexityBot. Control what gets indexed for AI training and citation.

The AI crawler landscape in 2025

Infographic: robots.txt & AI Crawlers — Bot Directory & Configuration

AI Crawler Directory — 2025

| Bot Name | Company | Purpose | Role |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training data | Training |
| ChatGPT-User | OpenAI | Browse / SearchGPT | Citation |
| PerplexityBot | Perplexity | Live search retrieval | Citation |
| Google-Extended | Google | AI training (Bard/Gemini) | Training |
| Googlebot | Google | Search + AI Overviews | Both |
| ClaudeBot | Anthropic | Training data | Training |
| Bytespider | ByteDance | Training data | Training |
| cohere-ai | Cohere | Training data | Training |

robots.txt Configuration Examples

Allow all AI bots:

User-agent: *
Allow: /

Block training, allow citation:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

Block all AI bots:

User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Key Insight

Blocking GPTBot prevents training data inclusion but does not block ChatGPT-User (the live search bot). These are separate user-agents. Block the training crawler and allow the citation crawler for maximum control.

Source: OpenAI, Google, Anthropic, Perplexity official documentation · User-agent strings as of 2025

Every major AI company now operates its own web crawler. These bots serve two distinct purposes: some crawl content to include in AI model training datasets, while others crawl content to power real-time retrieval for AI search answers. Understanding the difference is critical for making robots.txt decisions.

| Bot Name | Company | Purpose | User-Agent String |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training + Browse | GPTBot |
| ChatGPT-User | OpenAI | Browse (real-time) | ChatGPT-User |
| ClaudeBot | Anthropic | Training + retrieval | ClaudeBot |
| PerplexityBot | Perplexity | Search retrieval | PerplexityBot |
| Gemini | Google | Integrated with Googlebot | Googlebot |
| FacebookBot | Meta | Training | FacebookBot |
| Amazonbot | Amazon | Alexa training | Amazonbot |

robots.txt basics for AI crawler management

The robots.txt file lives at your domain root (e.g., yourdomain.com/robots.txt) and uses a simple syntax to control which bots can access which paths. All reputable AI crawlers respect robots.txt directives.

The key syntax elements are: User-agent (specifies which bot), Disallow (blocks access to a path), and Allow (explicitly permits access). A wildcard * as the User-agent applies to all bots not specifically listed.
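You can sanity-check how these directives resolve using Python's standard-library robots.txt parser. The robots.txt content, bot names, and example.com URLs below are hypothetical, chosen only to illustrate the syntax:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot gets its own group blocking the whole
# site; every other bot falls through to the * group, which blocks
# only /private/.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))        # → False (own group blocks all)
print(rp.can_fetch("SomeOtherBot", "https://example.com/blog/post"))  # → True  (falls to *)
print(rp.can_fetch("SomeOtherBot", "https://example.com/private/x"))  # → False (blocked by *)
```

Note that a bot with its own `User-agent` group ignores the `*` group entirely, which is why selective rules must be repeated per bot.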

When to block vs allow AI bots

This is the most consequential robots.txt decision for AI visibility. Blocking AI crawlers prevents your content from appearing in AI-generated answers — a major missed opportunity. However, there are legitimate cases for selective blocking.

Reasons to block AI crawlers

  • Proprietary data you don't want in training sets
  • Paywalled content that should require login
  • Customer-specific data or private content
  • Content you plan to license to AI companies directly

Reasons to allow AI crawlers

  • Marketing and informational content
  • Blog posts and articles designed to build authority
  • FAQ and documentation pages
  • Any content where AI citation drives traffic

Citation crawlers vs training crawlers: different implications

An important distinction most robots.txt guides miss: some AI bots crawl for real-time search retrieval (which drives citations), while others crawl for training data (which doesn't directly produce citations).

If you want to prevent your content from being used in training while still allowing it to appear in AI search results, you need to block specific user agents selectively. For example, OpenAI operates both GPTBot (training) and ChatGPT-User (Browse). You can block one without blocking the other.

Blocking training does not prevent citations

If you block GPTBot but allow ChatGPT-User, your content can still appear in ChatGPT Browse answers — it just won't be included in future training datasets. This is a valid strategy for publishers who want AI citation without training data participation.
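A quick way to confirm that the two user-agents resolve independently is to run the two-group configuration through Python's `urllib.robotparser` (the example.com URL is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# The selective-blocking strategy: block the training crawler (GPTBot),
# allow the live-search / citation crawler (ChatGPT-User).
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))        # → False (no training crawl)
print(rp.can_fetch("ChatGPT-User", "https://example.com/article"))  # → True  (citation crawl OK)
```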

Configuration examples

Allow all AI bots to crawl all content (maximum AI visibility):

User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

Block training bots but allow citation/search bots:

# Block OpenAI training crawler
User-agent: GPTBot
Disallow: /

# Allow ChatGPT Browse (citation crawler)
User-agent: ChatGPT-User
Allow: /

# Allow Perplexity (search citations)
User-agent: PerplexityBot
Allow: /

Block all AI bots from a specific path (e.g., paywalled content):

User-agent: GPTBot
Disallow: /premium/

User-agent: PerplexityBot
Disallow: /premium/

User-agent: ClaudeBot
Disallow: /premium/

Monitoring AI crawler access

Once you've configured your robots.txt, monitor server logs to verify that bots are respecting your directives. Most server log analysis tools can filter by user-agent string. Look for GPTBot, PerplexityBot, ClaudeBot, and FacebookBot in your access logs.
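As a minimal sketch of this kind of log filtering, a few lines of Python can tally AI-bot hits from raw access-log lines. The log entries below are fabricated examples in combined log format; real entries will vary by server configuration:

```python
from collections import Counter

# Fabricated access-log lines (combined log format) for illustration.
log_lines = [
    '203.0.113.7 - - [19/Feb/2025:10:01:02 +0000] "GET /blog/aeo HTTP/1.1" 200 5123 "-" '
    '"Mozilla/5.0; compatible; GPTBot/1.1; +https://openai.com/gptbot"',
    '198.51.100.4 - - [19/Feb/2025:10:02:10 +0000] "GET /premium/report HTTP/1.1" 200 9120 "-" '
    '"Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"',
    '192.0.2.9 - - [19/Feb/2025:10:03:44 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (Windows NT 10.0) Chrome/121.0"',
]

AI_BOTS = ("GPTBot", "ChatGPT-User", "PerplexityBot", "ClaudeBot", "FacebookBot")

# Count requests per AI crawler by substring match on the user-agent.
hits = Counter()
for line in log_lines:
    for bot in AI_BOTS:
        if bot in line:
            hits[bot] += 1

print(dict(hits))  # → {'GPTBot': 1, 'PerplexityBot': 1}
```

Cross-referencing the requested paths against your robots.txt rules (here, PerplexityBot fetching /premium/) is how you spot bots that are ignoring your directives.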

RankAsAnswer's bot verification feature tracks which AI crawlers are accessing your pages and whether your robots.txt configuration is achieving the intended access control.
