AI & Search

AI Crawler

GPTBot | AI Bot | LLM Crawler

Lukas Horvath, Co-founder

What is an AI Crawler?

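An AI crawler is an automated program that fetches web pages on behalf of an AI company, either to train its models or to retrieve fresh content for live answers. Common examples include GPTBot from OpenAI, ClaudeBot from Anthropic, PerplexityBot, and Meta-ExternalAgent. Google-Extended is usually listed alongside them, but it is a robots.txt control token rather than a separate crawler: Google's existing crawlers do the fetching, and the token governs whether that content may be used for Google's AI models. Each of the others identifies itself with its own user agent string, which means you can allow or block it by name in your robots.txt. Rate-limiting is better handled at the server or CDN, because robots.txt has no standardized rate-limit directive.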

Why it matters

Your robots.txt is now a strategic decision, not a developer afterthought. Block every AI crawler and you limit the use of your content in training datasets, but you also disappear from ChatGPT Search and Perplexity when those tools fetch pages live. Google AI Overviews is a separate case: it draws on regular Googlebot crawling, so it is governed by your Search controls rather than by Google-Extended. Allow them all and you trade control for reach. Most teams pick the middle: block the training crawlers, allow the retrieval ones. Either way, the decision should be deliberate, owned by marketing and legal together, and reviewed quarterly. Treating these bots as an afterthought is how brands end up cited inconsistently or not at all.
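The "block training, allow retrieval" middle ground can be sanity-checked before it ships. The sketch below is illustrative only: the robots.txt content, the example.com URL, and the check_policy helper are assumptions made for this example, not a recommended policy. It uses Python's standard urllib.robotparser to report how each user agent token would be treated.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block training tokens, allow retrieval agents.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""


def check_policy(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return {agent: True if the policy lets that agent fetch url}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


AGENTS = ["GPTBot", "Google-Extended", "OAI-SearchBot", "PerplexityBot"]
result = check_policy(ROBOTS_TXT, AGENTS, "https://example.com/pricing")

for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

A check like this only confirms what the file says. Whether a given bot actually honors it is a separate question that the file itself cannot answer.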

How it works

AI crawlers work the same way Googlebot does: they request pages over HTTP, parse the HTML, and store what they find. The difference is what happens next. Training crawlers like GPTBot feed content into model training pipelines, and the Google-Extended token controls the same use for content Google has already crawled. Retrieval agents like ChatGPT-User, OAI-SearchBot, and PerplexityBot fetch or index pages so that content can appear when a user asks a question. The major operators say their crawlers honor robots.txt, so you can disallow specific user agents, but compliance is voluntary and user-triggered fetchers may not always apply the same rules. For that reason some teams also use Cloudflare's AI bot controls or similar tools to block or rate-limit aggressive bots that ignore robots.txt or scrape at scale. Your server logs show which bots visit, how often, and which pages they hit, which is useful data when you are deciding the policy.
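As a rough illustration of the server-log point, the sketch below counts hits per AI bot and page. It assumes logs in the common "combined" access-log format. The sample lines, IP addresses, and simplified user agent strings are made up for the example, and the token list is an assumption to adapt to your own traffic. Google-Extended is left out on purpose because it is a robots.txt token and does not appear in request headers.

```python
import re
from collections import Counter

# Tokens to look for inside the User-Agent header (assumed list; extend as needed).
AI_BOT_TOKENS = [
    "GPTBot",
    "ChatGPT-User",
    "OAI-SearchBot",
    "ClaudeBot",
    "PerplexityBot",
    "Meta-ExternalAgent",
]

# Matches the request, status, size, referer, and user agent fields
# of a combined-format access log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_ai_bot_hits(lines):
    """Count requests per (bot token, path) for known AI bot tokens."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        user_agent = match.group("ua").lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                hits[(token, match.group("path"))] += 1
                break
    return hits


# Made-up sample lines for illustration; real user agent strings are longer.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2025:10:00:05 +0000] "GET /blog/post HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.5 - - [01/Jan/2025:10:01:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

hits = count_ai_bot_hits(SAMPLE_LOG)
for (bot, path), count in hits.most_common():
    print(f"{bot} {path} {count}")
```

User agent strings can be spoofed, so counts like these are a starting point for the policy discussion rather than proof of who is crawling.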

Related terms

Robots.txt
SEO/AEO/GEO
A small file at the root of your site that tells search engine crawlers which pages they can and can't access, useful for keeping junk pages and crawler traps…

LLMs.txt
AI & Search
A plain markdown file you put at the root of your website that tells AI models which pages matter most and how to read them, like robots.txt, but for large…

AI Citation
AI & Search
When an AI model credits your website as a source in its answer: the new equivalent of ranking on page one, and increasingly the most important metric to…

Structured Data for LLMs
AI & Search
Marking up your website content with schema, clean HTML, and machine-readable structure so AI models can extract and cite it accurately: the technical…

Indexing
SEO/AEO/GEO
The process search engines use to store and organize web pages so they can show up in results. If your page isn't indexed, it can't rank, and most sites have…

Answer Engine Optimization
SEO/AEO/GEO
Optimizing your content so AI answer engines like ChatGPT, Perplexity, and Google AI Overviews quote you directly when buyers in your market ask a question…

Brand Mentions in AI
AI & Search
How often and in what context your company gets named when people ask AI models questions in your category: the new equivalent of share of voice in organic…
