
robots.txt templates for AI crawlers

Updated June 5, 2026 · 5 min read

Your robots.txt file is the gatekeeper for AI search visibility. Get it wrong and you're invisible, not by accident but by instruction. Get it right and the major AI engines are free to crawl your pages and cite them. Here are templates for the most common situations, with the exact user-agent tokens you need.

Why robots.txt matters for AI — and which crawlers to know

AI answer engines use two types of crawlers: search crawlers that fetch live pages to build the answer a user sees right now, and training crawlers that build the background knowledge baked into the model. They have different user-agent strings and you can allow or block each independently.

The search crawlers are the ones that determine whether you appear in today's AI answers. The training crawlers influence how well AI engines know your brand in the long run — blocking them shrinks your brand's footprint in model knowledge over time.

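  • Search crawlers (allow these to be cited in live AI answers): OAI-SearchBot, ChatGPT-User (ChatGPT); PerplexityBot, Perplexity-User (Perplexity); Claude-SearchBot, Claude-User (Anthropic Claude); Googlebot (Google Search, which also feeds AI Overviews). The -User agents fetch a page when a person asks for it rather than crawling on a schedule.
  • Training crawlers (control these based on your content strategy): GPTBot (OpenAI model training); ClaudeBot (Anthropic training); Google-Extended (a control token rather than a separate crawler: it governs whether your content is used for Gemini training and grounding, and does not affect Google Search or AI Overviews); CCBot (Common Crawl); Meta-ExternalAgent (Meta AI).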

Template 1 — Allow all AI crawlers (recommended for most businesses)

This is the right default for any business that wants customers to discover them through AI search. It names no crawler individually: the wildcard group lets every crawler, AI or otherwise, fetch every path:

  • User-agent: *
  • Allow: /
  • Sitemap: https://yourdomain.com/sitemap.xml

Template 2 — Allow live-answer crawlers, block training crawlers

Use this if you want to appear in AI-generated answers today but prefer your content not be used to train future models. Common for publishers, content creators, and sites with proprietary information. To opt out of Meta's crawler as well, add a matching Disallow group for Meta-ExternalAgent:

  • # Allow search crawlers (for live AI answers)
  • User-agent: OAI-SearchBot
  • Allow: /
  • User-agent: PerplexityBot
  • Allow: /
  • User-agent: Claude-SearchBot
  • Allow: /
  • # Block training crawlers
  • User-agent: GPTBot
  • Disallow: /
  • User-agent: ClaudeBot
  • Disallow: /
  • User-agent: Google-Extended
  • Disallow: /
  • User-agent: CCBot
  • Disallow: /
  • # Default: allow everything else
  • User-agent: *
  • Allow: /
  • Sitemap: https://yourdomain.com/sitemap.xml
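One way to sanity-check Template 2 before deploying it is to parse it with Python's standard-library urllib.robotparser and ask what each crawler may fetch. This is a minimal sketch: the helper name crawl_permissions and the sample page URL are placeholders for illustration, and Python's parser only approximates how each engine reads the file, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Template 2 as it would appear in the deployed file (plain lines, no bullets).
TEMPLATE_2 = """\
# Allow search crawlers (for live AI answers)
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Default: allow everything else
User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""

SEARCH = ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot"]
TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]

# Placeholder page used only for this check.
PAGE = "https://yourdomain.com/pricing"


def crawl_permissions(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user-agent token to whether it may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


permissions = crawl_permissions(TEMPLATE_2, SEARCH + TRAINING, PAGE)
for agent, allowed in permissions.items():
    print(f"{agent:18} {'allowed' if allowed else 'blocked'}")
```

Agents without a named group, such as ChatGPT-User or Googlebot, fall through to the wildcard group and stay allowed.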

Template 3 — Selectively block specific areas

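Use this to keep crawlers out of specific paths (login pages, admin areas, draft content) while leaving the rest of your site open to AI crawlers. It is usually better practice than blanket rules. Keep in mind that robots.txt is a public file and only a request to crawlers, so anything genuinely private still needs authentication. A crawler that has its own named group follows only that group, so repeat these Disallow lines in any named group you add: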

  • User-agent: *
  • Disallow: /admin/
  • Disallow: /account/
  • Disallow: /checkout/
  • Disallow: /draft/
  • Allow: /
  • Sitemap: https://yourdomain.com/sitemap.xml
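The path rules in Template 3 can be checked the same way. The sketch below uses Python's urllib.robotparser; the sample paths and the can_fetch helper are hypothetical. It also illustrates one gotcha: a crawler that has its own named group does not inherit the wildcard group's Disallow lines.

```python
from urllib.robotparser import RobotFileParser

TEMPLATE_3 = """\
User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /checkout/
Disallow: /draft/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""


def can_fetch(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "https://yourdomain.com" + path)


# Hypothetical paths, chosen only to exercise the rules.
# Python's parser applies the first matching rule in file order. Because the
# Disallow lines come before "Allow: /", this gives the same answer here as
# the longest-match rule most engines use.
for path in ["/", "/blog/robots-guide", "/admin/users", "/draft/launch-post"]:
    verdict = "allowed" if can_fetch(TEMPLATE_3, "PerplexityBot", path) else "blocked"
    print(f"{path:22} {verdict}")

# Gotcha: adding a named group for one crawler replaces the wildcard rules
# for that crawler, so /admin/ becomes fetchable for it again.
WITH_NAMED_GROUP = "User-agent: OAI-SearchBot\nAllow: /\n\n" + TEMPLATE_3
named_group_leak = can_fetch(WITH_NAMED_GROUP, "OAI-SearchBot", "/admin/users")
print("OAI-SearchBot can fetch /admin/users:", named_group_leak)
```

If you combine Template 3 with named groups, copy the Disallow lines into each named group as well.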

How to verify your robots.txt is working

  • Load yourdomain.com/robots.txt in a browser and confirm it serves your intended rules — no BOM characters, no extra whitespace before 'User-agent'.
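  • Open the robots.txt report in Google Search Console to confirm Google fetched the file and parsed it without errors. The older robots.txt Tester has been retired, so check specific paths and user-agents with a robots.txt parser instead.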
  • Run an AI Search Readiness audit, which checks each major AI crawler against your robots.txt and flags accidental blocks.
  • After any change, wait 24–48 hours and re-audit — engines cache robots.txt and the update takes time to propagate.
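Parts of this checklist can be scripted. The sketch below, using only Python's standard library, flags a leading BOM or whitespace before 'User-agent' and lists which of the AI crawlers named in this article are blocked for a given URL. The function names and the sample file are illustrative assumptions, and it does not replace a check against the engines themselves.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the crawler lists in this article.
AI_AGENTS = [
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "Perplexity-User",
    "Claude-SearchBot", "Claude-User", "Googlebot",
    "GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "Meta-ExternalAgent",
]


def lint_robots(raw: bytes) -> list[str]:
    """Report a UTF-8 BOM and any whitespace before a User-agent line."""
    problems = []
    if raw.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
        raw = raw[3:]
    lines = raw.decode("utf-8", errors="replace").splitlines()
    for number, line in enumerate(lines, start=1):
        if line[:1].isspace() and line.strip().lower().startswith("user-agent"):
            problems.append(f"line {number}: whitespace before User-agent")
    return problems


def blocked_agents(raw: bytes, url: str) -> list[str]:
    """List the AI user-agents that may not fetch `url`."""
    # "utf-8-sig" drops a BOM if present. Python's parser also tolerates
    # leading whitespace, which is why the lint step is kept separate.
    text = raw.decode("utf-8-sig", errors="replace")
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, url)]


# Made-up file with both formatting problems and one blocked crawler.
SAMPLE = b"\xef\xbb\xbf  User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"

print(lint_robots(SAMPLE))
print(blocked_agents(SAMPLE, "https://yourdomain.com/"))
```

To run it against a live site, read the bytes with urllib.request.urlopen("https://yourdomain.com/robots.txt").read() and pass them to both functions.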


Frequently asked questions

Does robots.txt actually stop AI crawlers?

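The major AI companies (OpenAI, Anthropic, Google, Perplexity) state that their crawlers honor robots.txt. It is a voluntary standard (RFC 9309), not an enforcement mechanism. Disallowing a user-agent stops those crawlers from fetching your pages; it does not remove content they have already collected. User-triggered fetchers such as Perplexity-User may not apply robots.txt in the same way, because they act on a person's request, and less reputable scrapers may ignore the file entirely.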

What happens if I already have a robots.txt with a blanket Disallow?

A rule like 'User-agent: *' followed by 'Disallow: /' blocks every crawler that does not have its own named group, which usually means all AI crawlers. You're invisible to those engines. Replace it with specific Disallow rules for only the paths you actually want to keep crawlers out of, and allow everything else.
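The effect of swapping a blanket rule for targeted ones can be seen in a short before-and-after check with Python's urllib.robotparser. The replacement rules, the agent list, and the /pricing path below are hypothetical examples, not a recommendation for any particular site.

```python
from urllib.robotparser import RobotFileParser

BLANKET = "User-agent: *\nDisallow: /\n"
# Hypothetical replacement: block only the paths that should stay uncrawled.
TARGETED = "User-agent: *\nDisallow: /admin/\nDisallow: /checkout/\n"

AGENTS = ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "GPTBot"]


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "https://yourdomain.com" + path)


before = {agent: allowed(BLANKET, agent, "/pricing") for agent in AGENTS}
after = {agent: allowed(TARGETED, agent, "/pricing") for agent in AGENTS}

print("blanket rule: ", before)
print("targeted rule:", after)
```

Under the blanket rule every agent is refused the public page; under the targeted rules the same page is open while /admin/ and /checkout/ stay disallowed.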

Can I block training crawlers without hurting my AI search visibility today?

Yes. Template 2 above does exactly that. Blocking the training agents (GPTBot, ClaudeBot, Google-Extended, CCBot) while allowing the search crawlers (OAI-SearchBot, PerplexityBot, Claude-SearchBot) lets you appear in today's AI results while opting out of model training. The trade-off is slower brand-knowledge growth in future model versions.