You Blocked the AI Bots. Now ChatGPT Can't Cite You. Let's Fix That.
Block GPTBot to save bandwidth, and you might still get cited by ChatGPT. The difference between training crawlers and retrieval bots, explained.
Kemal Esensoy·Modified on June 20, 2026
A few months ago I wrote a post telling you to block the AI bots. They were hammering servers, eating bandwidth, and giving nothing back. I stand by most of it.
But I left out a part. And that part is starting to cost people.
Here's the thing nobody told me when I went on my "block everything" crusade: some of those bots are the reason ChatGPT and Perplexity can mention your business at all. Block them all with one angry copy-paste, and you don't just save bandwidth. You quietly delete yourself from the answers your customers are now reading instead of Google. So when people ask me "should I allow AI crawlers or not," the honest answer is: it depends which ones, and almost nobody explains the difference.
Let me fix that.
I Told You to Block the AI Bots. Here's the Part I Skipped.
My earlier advice still holds up. That post, "AI Bots Are Crawling Your Website to Death. Here's How to Stop Them.", covered the real problem. Training crawlers were generating thousands of requests, spiking server load, and not sending a single human back to your site in return.
What I treated as one big swarm of "AI bots" is actually two completely different jobs wearing the same coat. And the robots.txt rule that protects your bandwidth from one of them can make you invisible to the other.
That's the mistake. I lumped them together. So did most of the advice you've read. The fix isn't "allow everything" or "block everything." It's knowing which bot does what.
The One Distinction That Changes Everything: Training vs Retrieval
There are two kinds of AI crawlers, and they want completely different things from your site.
Training crawlers scrape your content to stuff into a dataset. They build the model. They do not link back, they do not send traffic, they just take. GPTBot, ClaudeBot, Google-Extended, CCBot, and Bytespider all live here. This is the group that justified my whole rant. Blocking them costs you basically nothing in visibility.
Retrieval crawlers are the opposite. These fetch your page so an AI can answer a live question and cite you right now. OpenAI's OAI-SearchBot, Anthropic's Claude-SearchBot, and PerplexityBot index you for AI search. ChatGPT-User and Claude-User fetch your page in real time when someone asks a question that touches your topic. This is the group that puts your name in the answer. This is how you get cited by ChatGPT in the first place.
Block the training crawler and you are still fully citable, as long as you let the retrieval crawler through. That single sentence is the whole post. Everything else is just the details.
The Bot-by-Bot Table: What Each One Actually Does
Stop guessing from the user-agent name. Here is what each major crawler is for and what it costs you to block it.
| Bot | Company | Job | Block it and... |
|---|---|---|---|
| GPTBot | OpenAI | Training | Your content stays out of model training. No visibility loss. |
| OAI-SearchBot | OpenAI | Search index | You can vanish from ChatGPT search results. |
| ChatGPT-User | OpenAI | Live fetch | You lose live citations (but it may ignore the block anyway). |
| ClaudeBot | Anthropic | Training | Excluded from training. No visibility loss. |
| Claude-SearchBot | Anthropic | Search index | You hurt your visibility in Claude search. |
| Claude-User | Anthropic | Live fetch | You drop out of live Claude answers. |
| PerplexityBot | Perplexity | Search index | Lower visibility in Perplexity (if it respects the block). |
| Googlebot | Google | Search index | You disappear from Google entirely. Never block this. |
| Google-Extended | Google | Training token | Opts out of Gemini training only. Ranking unaffected. |
| Bingbot | Microsoft | Search index | You lose Bing and Microsoft Copilot. |
| Bytespider | ByteDance | Training | Excluded from training at ByteDance (TikTok's parent). Aggressive, often ignores rules. |
| CCBot | Common Crawl | Training | Excluded from the Common Crawl dataset, which feeds many LLMs indirectly. |
The pattern jumps out once you see it laid out. The "Training" rows are safe to block. The "Search index" and "Live fetch" rows are the ones that decide whether an AI can recommend you.
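If it helps to see that pattern as something you can run, here is a minimal Python sketch. The roles come straight from the table. The `BOT_ROLES` name, the `classify` and `safe_to_block` helpers, and the sample user-agent strings are my own illustration, not anything the vendors publish. It only matches the bot token inside the User-Agent header, so treat it as a first pass: headers can be faked.

```python
# Roles copied from the table above. Names and helpers are illustrative.
BOT_ROLES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Google-Extended": "training",
    "Bytespider": "training",
    "CCBot": "training",
    "OAI-SearchBot": "search index",
    "Claude-SearchBot": "search index",
    "PerplexityBot": "search index",
    "Googlebot": "search index",
    "Bingbot": "search index",
    "ChatGPT-User": "live fetch",
    "Claude-User": "live fetch",
}


def classify(user_agent: str) -> str:
    """Return the role of the first known bot token found in a User-Agent header."""
    lowered = user_agent.lower()
    for token, role in BOT_ROLES.items():
        if token.lower() in lowered:
            return role
    return "unknown"


def safe_to_block(user_agent: str) -> bool:
    """Only the training rows cost nothing in visibility when blocked."""
    return classify(user_agent) == "training"


# Simplified sample headers (assumptions, not the exact strings the bots send).
TRAINING_UA = "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"
SEARCH_UA = "Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)"
HUMAN_UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/125.0 Safari/537.36"

print(classify(TRAINING_UA), safe_to_block(TRAINING_UA))
print(classify(SEARCH_UA), safe_to_block(SEARCH_UA))
print(classify(HUMAN_UA), safe_to_block(HUMAN_UA))
```

The lookup is a plain substring match, which is enough here because none of these tokens is contained in another.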
The Google Trap Nobody Explains: Google-Extended Is Not Googlebot
This is where people panic and shoot themselves in the foot.
Google-Extended is not a crawler. It is a control token that opts your content out of Gemini's training. Blocking it does not touch your Google ranking, and it does not pull you out of AI Overviews. Those run off the regular Googlebot index, not Google-Extended. I dug into this in Google's own AI SEO guidance, and the takeaway is boring but important: the normal index still does the heavy lifting.
So you can block Google-Extended to keep your stuff out of Gemini training and lose nothing in search. What you must never block is Googlebot. Do that and you are gone from Google, AI Overviews included. There is no clever workaround. The only "opt out of AI Overviews" lever is the nosnippet tag, which also kills your normal search snippets. That trade is almost never worth it.
Do AI Bots Even Respect robots.txt? (Mostly Yes, One Big Exception)
Here is the uncomfortable bit: robots.txt is a polite request, not a locked door.
OpenAI, Anthropic, and Google honor it. Block GPTBot and GPTBot stays out. Good. But in August 2025, Cloudflare caught Perplexity ignoring robots.txt entirely. When its declared bot got blocked, traffic kept coming from rotating IP addresses spoofing a regular Chrome browser on a Mac. Cloudflare delisted Perplexity as a verified bot over it. Perplexity's defense was basically "an agent acting for a user isn't a bot and shouldn't have to obey robots.txt."
There is a second wrinkle. The live-fetch bots, such as ChatGPT-User and Perplexity's equivalent, Perplexity-User, are triggered by a real person asking a question. OpenAI says in its own docs that ChatGPT-User may bypass robots.txt because it counts as a user action, not automated crawling. So "allowing" those two in robots.txt is partly symbolic. The real lever for staying citable is allowing the search-index bots: OAI-SearchBot, Claude-SearchBot, and PerplexityBot. Those are the ones that actually decide whether you show up.
The Actual Cost: What Crawl-to-Refer Ratios Tell You
If you want a number to justify blocking the training bots, here it is.
According to Cloudflare's data from early 2026, ClaudeBot crawled roughly 24,000 pages for every single visitor it referred back. GPTBot sat around 1,276 to 1. Compare that to DuckDuckGo at about 1.5 to 1. The training crawlers take an enormous amount and send almost nobody back. That is the honest case for blocking them.
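The ratio itself is nothing fancy: pages crawled divided by visitors referred. Here is a tiny Python sketch in case you want to run the same math on your own logs. The function name is mine, and the example counts are made up purely to reproduce the ratios quoted above. They are not Cloudflare's raw numbers.

```python
def crawl_to_refer_ratio(pages_crawled: int, visitors_referred: int) -> float:
    """Pages a bot fetched for every visitor it sent back."""
    if visitors_referred == 0:
        return float("inf")  # the bot took pages and sent nobody
    return pages_crawled / visitors_referred


# Hypothetical counts chosen only to match the ratios in the text.
examples = {
    "ClaudeBot": (24_000, 1),
    "GPTBot": (1_276, 1),
    "DuckDuckGo": (3, 2),
}

for name, (crawled, referred) in examples.items():
    print(f"{name}: {crawl_to_refer_ratio(crawled, referred):,.1f} to 1")
```

To use it for real, count a bot's requests in your access logs for a period and divide by the sessions your analytics attributes to that company's product over the same period.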
But the trend on the other side is just as real. AI bot traffic is up more than 300% since the start of 2025. Adobe reported AI-referred traffic to US retailers jumped 393% year over year in early 2026, and those visitors tend to buy. The retrieval side is becoming actual traffic, not a rounding error. If you want to see what this looks like on your own site, Google Search Console now breaks out AI traffic so you can stop guessing.
So the math is simple. Block the takers. Keep the ones that send people.
Copy-Paste robots.txt: Three Setups
Enough theory. Here are three configs you can actually use.
Setup A: Let AI cite me, don't train on me. This is my default recommendation for most sites.
# Block training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: meta-externalagent
Disallow: /
# Allow retrieval and live citation
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: PerplexityBot
Allow: /
Setup B: Block everything. Maximum bandwidth savings, and you accept being invisible to AI search. Use the training block above plus a Disallow: / for the retrieval bots too. Only do this if you genuinely do not care about AI referrals.
Setup C: Retrieval only. The robots.txt is identical to Setup A. What changes is the reason you pick it: your main goal is visibility in AI answers, and keeping your content out of training data is a side benefit.
One rule that overrides all three: never block Googlebot or Bingbot. That kills normal search, AI Overviews, and Copilot in one move.
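Before you deploy any of these, it is worth a quick sanity check that the file says what you think it says. Below is a minimal sketch using Python's standard `urllib.robotparser`. The `build_robots` and `check` helpers and the `example.com` URL are placeholders I made up for the example. One caveat: this shows how Python's parser reads the rules, which is a reasonable proxy but not a guarantee of how each crawler interprets them.

```python
from urllib.robotparser import RobotFileParser

TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot",
            "Bytespider", "Applebot-Extended", "meta-externalagent"]
RETRIEVAL = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
             "Claude-User", "PerplexityBot"]


def build_robots(blocked, allowed):
    """Build robots.txt lines: one group per bot, blocked first, then allowed."""
    lines = []
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    return lines


def check(lines, agents, url="https://example.com/pricing"):
    """Map each agent name to True if the rules let it fetch the URL."""
    parser = RobotFileParser()
    parser.parse(lines)
    return {agent: parser.can_fetch(agent, url) for agent in agents}


everyone = TRAINING + RETRIEVAL + ["Googlebot", "Bingbot"]

# Setup A: block training, allow retrieval.
setup_a = check(build_robots(TRAINING, RETRIEVAL), everyone)
# Setup B: block both groups.
setup_b = check(build_robots(TRAINING + RETRIEVAL, []), everyone)

for agent in everyone:
    print(f"{agent:20} A={setup_a[agent]!s:5} B={setup_b[agent]!s:5}")
```

In both setups Googlebot and Bingbot come back as allowed, because no group names them and there is no catch-all `User-agent: *` block. That is exactly the behavior you want.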
How to Check Who's Actually Hitting You
Do not trust the user-agent string alone, because spoofing is exactly what got Perplexity in trouble.
Grep your server logs for the user-agent names above. Then verify them against the published IP ranges each company puts out, like OpenAI's gptbot.json and searchbot.json files. A request claiming to be GPTBot from an IP that is not on OpenAI's list is a spoofer, and you can block it without guilt. This is also how you avoid the opposite mistake: whitelisting a fake bot. If you have ever stared at a flood of weird requests in your error tracker, you know the feeling. I wrote about one such night in the post about Sentry errors from bot crawlers. Logs tell the truth that headers won't.
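Here is roughly what that check looks like in Python, as a sketch under a few assumptions. I am assuming your access log uses the common "combined" format, with the IP first and the user agent as the last quoted field. I am also assuming the published range file has the shape `{"prefixes": [{"ipv4Prefix": "..."}]}`, so confirm that against the real file before relying on it. The IP range in the sample is a reserved documentation range, not a real OpenAI one, and the helper names are mine.

```python
import ipaddress
import json
import re

# ASSUMPTION: the range file looks like {"prefixes": [{"ipv4Prefix": "..."}]}.
# The CIDR below is a reserved documentation range (RFC 5737), not a real bot range.
SAMPLE_RANGES_JSON = '{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}]}'

# Combined log format: client IP comes first, the User-Agent is the last quoted field.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .* "(?P<ua>[^"]*)"$')


def load_networks(raw_json):
    """Turn the JSON range file into a list of ip_network objects."""
    prefixes = json.loads(raw_json)["prefixes"]
    return [ipaddress.ip_network(p["ipv4Prefix"]) for p in prefixes if "ipv4Prefix" in p]


def verdict(line, bot_token, networks):
    """Return 'verified', 'spoofed', 'unverifiable', or 'not this bot' for one log line."""
    match = LOG_LINE.match(line.strip())
    if match is None or bot_token.lower() not in match["ua"].lower():
        return "not this bot"
    try:
        ip = ipaddress.ip_address(match["ip"])
    except ValueError:
        return "unverifiable"  # first field was a hostname, not an IP
    return "verified" if any(ip in net for net in networks) else "spoofed"


# Illustrative log lines built from a simplified GPTBot header.
GPTBOT_UA = "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"
REQUEST = '- - [20/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-"'
real_line = f'192.0.2.10 {REQUEST} "{GPTBOT_UA}"'
fake_line = f'203.0.113.7 {REQUEST} "{GPTBOT_UA}"'
human_line = f'203.0.113.7 {REQUEST} "Mozilla/5.0 (Macintosh) Chrome/125.0"'

networks = load_networks(SAMPLE_RANGES_JSON)
for line in (real_line, fake_line, human_line):
    print(verdict(line, "GPTBot", networks))
```

In practice you would download the vendor's current range file instead of using `SAMPLE_RANGES_JSON`, then run `verdict` over every line of your log and block the IPs that come back as spoofed.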
Cloudflare Changed the Default (and You Might Be Blocking Everything Without Knowing)
Here is the one that catches people completely off guard.
Since July 2025, Cloudflare has blocked AI crawlers by default on new domains. Over a million sites flipped on the one-click block. Cloudflare sits in front of roughly a fifth of the entire web. Which means a huge number of site owners are blocking every AI bot, retrieval included, and have no idea they did it. They never edited a robots.txt file. The setting was just on.
There is also pay-per-crawl now, where your site can return an HTTP 402 and charge bots to read you. Interesting for big publishers, overkill for most of us. The point for you is simpler: go check your Cloudflare dashboard. You might be invisible to AI search by accident, and no amount of clever robots.txt will fix a block that lives one layer above it.
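If you cannot get into the dashboard right away, a rough outside-in test is to request the same page twice, once with a browser-style user agent and once with a bot-style one, and compare the status codes. The Python sketch below does that with the standard library. The function names and the wording of the results are my own. Treat the outcome as a hint only: you are sending a bot's name from your own IP, and a firewall that verifies bot IPs may block that request for being fake rather than for being an AI crawler.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses arrive as exceptions


def interpret(browser_status: int, bot_status: int) -> str:
    """Compare what a browser sees with what a bot-named request sees."""
    if bot_status == 402:
        return "pay-per-crawl: the bot is being asked to pay"
    if browser_status < 400 <= bot_status:
        return "bot blocked above robots.txt"
    if bot_status < 400:
        return "bot reaches the page"
    return "site error for everyone"


# Usage (makes real network requests, so run it against your own site):
#   browser = fetch_status("https://your-site.example/", "Mozilla/5.0 Chrome/125.0")
#   bot = fetch_status("https://your-site.example/", "OAI-SearchBot/1.0")
#   print(interpret(browser, bot))
```

A 200 for the browser and a 403 for the bot name, with nothing in your robots.txt to explain it, is the signature of a block living one layer up.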
So, Should You Allow AI Crawlers? My Honest Answer.
Block the pure training bots. They take a fortune in bandwidth and hand you almost nothing. Allow the retrieval and search bots, because they are how an AI tells someone your business exists. That is the whole answer, and it is the opposite of the one-line "block them all" advice I gave you a few months back.
I will be straight with you though: this is a moving target. The bot names change, the companies redefine what counts as a bot, and "an agent isn't a crawler" arguments are going to keep muddying the rules. Anyone selling you a permanent answer is selling something. Whether any of this is worth obsessing over depends on how much AI traffic you actually get, which is a separate and very honest question I poked at in GEO Is the New Snake Oil (Or Is It?).
What you can do today is stop treating "AI crawlers" as one thing. Check your logs, check your Cloudflare settings, and make a deliberate choice instead of an accidental one. If you want to make sure what these tools say about your business is actually accurate, that is a related rabbit hole worth your time: how to control your brand in AI search results.
If you would rather not untangle robots.txt and crawler logs yourself, that is the kind of unglamorous SEO work I do for clients. Let's talk if you want a second set of eyes on what AI search can actually see on your site.
About the Author
Kemal Esensoy
Kemal Esensoy, founder of Wunderlandmedia, started his journey as a freelance web developer and designer. He conducted web design courses with over 3,000 students. Today, he leads an award-winning full-stack agency specializing in web development, SEO, and digital marketing.