robots.txt and AI Crawlers: Control Which AI Bots Access Your Content
Learn how to configure robots.txt for AI crawlers including GPTBot, ClaudeBot, and PerplexityBot. Control what gets indexed for AI training and citation.
The AI crawler landscape in 2025
AI Crawler Directory — 2025
| Bot Name | Company | Purpose | Crawl Type | Block = No Citations? |
|---|---|---|---|---|
| GPTBot | OpenAI | Training data | Training | — |
| ChatGPT-User | OpenAI | Browse / SearchGPT | Citation | ✓ |
| PerplexityBot | Perplexity | Live search retrieval | Citation | ✓ |
| Google-Extended | Google | AI training (Bard/Gemini) | Training | — |
| Googlebot | Google | Search + AI Overviews | Both | ✓ |
| ClaudeBot | Anthropic | Training data | Training | — |
| Bytespider | ByteDance | Training data | Training | — |
| cohere-ai | Cohere | Training data | Training | — |
Every major AI company now operates its own web crawler. These bots serve two distinct purposes: some crawl content to include in AI model training datasets, while others crawl content to power real-time retrieval for AI search answers. Understanding the difference is critical for making robots.txt decisions.
robots.txt basics for AI crawler management
The robots.txt file lives at your domain root (e.g., yourdomain.com/robots.txt) and uses a simple syntax to control which bots can access which paths. All reputable AI crawlers respect robots.txt directives.
The key syntax elements are: User-agent (specifies which bot), Disallow (blocks access to a path), and Allow (explicitly permits access). A wildcard * as the User-agent applies to all bots not specifically listed.
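To illustrate these elements together, here is a minimal sketch (the paths are placeholders, not recommendations):

```
# Applies to every bot not named in a more specific group
User-agent: *
Disallow: /private/

# Rules for one specific crawler override the wildcard group
User-agent: GPTBot
Disallow: /
Allow: /blog/
```

Note that a bot matches the most specific `User-agent` group available to it, so the `GPTBot` rules above replace (rather than add to) the wildcard rules for that crawler.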
When to block vs allow AI bots
This is the most consequential robots.txt decision for AI visibility. Blocking AI crawlers prevents your content from appearing in AI-generated answers — a major missed opportunity. However, there are legitimate cases for selective blocking.
Reasons to block AI crawlers
- Proprietary data you don't want in training sets
- Paywalled content that should require login
- Customer-specific data or private content
- Content you plan to license to AI companies directly
Reasons to allow AI crawlers
- Marketing and informational content
- Blog posts and articles designed to build authority
- FAQ and documentation pages
- Any content where AI citation drives traffic
Citation crawlers vs training crawlers: different implications
An important distinction most robots.txt guides miss: some AI bots crawl for real-time search retrieval (which drives citations), while others crawl for training data (which doesn't directly produce citations).
If you want to prevent your content from being used in training while still allowing it to appear in AI search results, block specific user agents selectively. For example, OpenAI operates both GPTBot (training) and ChatGPT-User (browsing and search retrieval); you can block one without blocking the other.
Key takeaway: blocking training crawlers does not prevent citations. Citations come from search and retrieval crawlers, which can be allowed independently.
Configuration examples
Allow all AI bots to crawl all content (maximum AI visibility):
```
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /
```
Block training bots but allow citation/search bots:
```
# Block OpenAI training crawler
User-agent: GPTBot
Disallow: /

# Allow ChatGPT Browse (citation crawler)
User-agent: ChatGPT-User
Allow: /

# Allow Perplexity (search citations)
User-agent: PerplexityBot
Allow: /
```
Block all AI bots from a specific path (e.g., paywalled content):
```
User-agent: GPTBot
Disallow: /premium/

User-agent: PerplexityBot
Disallow: /premium/

User-agent: ClaudeBot
Disallow: /premium/
```
Monitoring AI crawler access
Once you've configured your robots.txt, monitor server logs to verify that bots are respecting your directives. Most server log analysis tools can filter by user-agent string. Look for GPTBot, PerplexityBot, ClaudeBot, and FacebookBot in your access logs.
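As a sketch of this kind of log filtering (the log lines, file source, and bot list here are illustrative assumptions; adapt them to your server's actual log format), a few lines of Python can tally hits per AI crawler:

```python
import re
from collections import Counter

# User-agent substrings to look for; extend this list as new crawlers appear
AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "Bytespider"]
BOT_PATTERN = re.compile("|".join(re.escape(bot) for bot in AI_BOTS))

def count_ai_crawlers(log_lines):
    """Return a Counter of AI-crawler hits found in access-log lines."""
    counts = Counter()
    for line in log_lines:
        match = BOT_PATTERN.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts

# Example with synthetic access-log lines:
sample = [
    '203.0.113.5 - - [01/Mar/2025] "GET /blog/ HTTP/1.1" 200 "-" "GPTBot/1.1"',
    '198.51.100.7 - - [01/Mar/2025] "GET / HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
    '203.0.113.5 - - [01/Mar/2025] "GET /faq/ HTTP/1.1" 200 "-" "GPTBot/1.1"',
]
print(count_ai_crawlers(sample))
```

In practice you would feed this from your real access log (for example, iterating over the lines of an Nginx or Apache log file) rather than an in-memory list. Note that user-agent strings can be spoofed, so matching alone confirms what a request *claimed* to be, not what it was.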
RankAsAnswer's bot verification feature tracks which AI crawlers are accessing your pages and whether your robots.txt configuration is achieving the intended access control.