AI Bots Are Crawling Your Website to Death. Here's How to Stop Them.
AI crawlers are hammering websites with aggressive traffic. Here is how to detect them, block them, and protect your server before it is too late.
A client called me last month because their site was "suddenly slow." No code changes, no new plugins, no traffic spikes in Google Analytics. Everything looked normal.
Except it wasn't.
When I dug into the server logs, I found the problem. Their little Hetzner VPS was getting hammered by AI crawlers. Thousands of requests per hour from bots with names like GPTBot, ClaudeBot, and something called Meta-ExternalAgent. The server's CPU was pinned at 95%, and the bandwidth bill was climbing fast.
Here's the uncomfortable truth: bots now account for 52% of all web traffic. More than half of the requests hitting your server aren't from humans. And the AI crawler portion of that has surged 300% since 2024.
Your site might be slow right now. And it might not be your fault.
Your Server Isn't Slow. It's Being Eaten Alive.
This isn't a theoretical problem. It's happening right now, at scale.
A Ukrainian 3D model site called Triplegangers got crushed when OpenAI's crawlers hit them from 600 different IP addresses simultaneously. They updated their robots.txt. The crawlers kept coming. At peak, AI crawlers can hit 39,000 requests per minute. For context, that's the kind of traffic most small business sites see in an entire month.
42% of small businesses have reported performance or bandwidth strain from bot traffic in the past 12 months. And most of them didn't even realize bots were the cause. They blamed their hosting, their CMS, their developer. Sound familiar?
Who's Crawling You (and Why They Don't Care About Your Permission)
Let me introduce you to the usual suspects.
Meta-ExternalAgent is the worst offender. It accounts for 52% of all AI crawler traffic. Meta's bot scrapes content for training their AI models, and it does so aggressively.
GPTBot (OpenAI) grew 305% between May 2024 and May 2025. OpenAI's crawlers combined now make 3.8 times the request volume of Googlebot. Let that sink in. The AI company's crawler is hitting your site nearly four times harder than Google's.
ChatGPT-User crawls at about 2,400 pages per hour when it visits. That's the one that fires when someone asks ChatGPT a question and it goes looking for live data.
Bytespider (ByteDance/TikTok) is the one that really makes me angry. It's been documented ignoring robots.txt entirely and spoofing its user agent to look like a regular browser. You can't even politely ask it to stop.
Then there's ClaudeBot (Anthropic), CCBot (Common Crawl), Google-Extended, and Applebot-Extended. Some of these are more respectful than others. But together, they add up to a wall of traffic your server wasn't built to handle.
Some site owners are going the opposite direction entirely, creating an llms.txt file to actually help AI understand their content. Whether that makes sense depends entirely on your business model.
The robots.txt Lie (And Why It's Still Worth Setting Up)
Here's something most "just update your robots.txt" guides won't tell you: robots.txt is a suggestion, not a law.
There is no technical enforcement. No legal requirement (yet). A well-behaved crawler reads your robots.txt and respects it. A badly-behaved one ignores it completely.
The data backs this up. A study found that 70.6% of top news sites that explicitly blocked ChatGPT-User in their robots.txt still appeared in AI-generated citations. The bots either crawled before the block was added, used cached data, or simply ignored the directive.
The IETF is working on extensions to the robots.txt standard specifically for AI crawlers, and there's a proposed standard called Web Bot Auth that would use cryptographic identity verification. But that's future stuff. Right now, it's the wild west.
That said, you should still set up robots.txt blocks. Here's why: the reputable crawlers (GPTBot, ClaudeBot, Google-Extended) actually do respect it. And having it in place is a baseline that costs nothing.
Here's a robots.txt block you can copy right now:
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
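If you maintain that list by hand, it's easy for an entry to drift out of sync with the bots you actually want blocked. Here's a small sanity-check sketch; the robots.txt path is an assumption, so point it at the file your server actually serves:

```shell
#!/bin/sh
# Report which AI crawlers are listed in a robots.txt file and which are missing.
check_robots() {
  robots="$1"
  for bot in GPTBot ChatGPT-User ClaudeBot CCBot Meta-ExternalAgent Bytespider Google-Extended Applebot-Extended; do
    if grep -q "^User-agent: $bot$" "$robots"; then
      echo "$bot: listed"
    else
      echo "$bot: MISSING"
    fi
  done
}

# Assumed path; adjust to your document root.
check_robots /var/www/html/robots.txt
```

Run it after every edit so a typo in one bot name doesn't silently leave that crawler unblocked.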
The ai-robots-txt GitHub repo maintains an updated list of all known AI crawlers if you want to be thorough. And if you want to understand how robots.txt fits into the bigger picture, technical SEO is where this all lives.
How to Actually See What's Hitting Your Server
Google Analytics won't help you here. It filters out bot traffic by design. To see what's really happening, you need raw server logs.
If you're running Nginx, your access log lives at /var/log/nginx/access.log. For Apache, it's /var/log/apache2/access.log. Here are two commands that will tell you everything you need to know:
Find the top 20 user agents hitting your server:
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
Find the top 20 IPs by request count:
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
What you're looking for: massive request counts from single IPs, user agent strings containing "bot," "crawler," "spider," or any of the AI bot names listed above. If one IP is making 10,000+ requests per day and it's not Googlebot, you've found your problem.
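To zero in on just the AI crawlers, the two commands above can be combined into one pass that counts hits per bot name. This is a sketch; the log path is the Debian-style Nginx default, so adjust it for your setup:

```shell
#!/bin/sh
# Count access-log requests per known AI crawler, busiest first.
count_ai_bots() {
  log="$1"
  for bot in GPTBot ChatGPT-User ClaudeBot CCBot Bytespider Meta-ExternalAgent; do
    # grep -c prints 0 for bots that never appear in the log
    printf '%s\t%s\n' "$(grep -c "$bot" "$log")" "$bot"
  done | sort -rn
}

count_ai_bots /var/log/nginx/access.log
```

If the top line shows tens of thousands of requests for a single bot, you've found the culprit worth blocking first.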
For something more visual, GoAccess gives you a real-time dashboard from your log files. It takes about 30 seconds to set up and it's free.
Blocking at the Server Level: The Only Thing That Actually Works
robots.txt is the polite request. Server-level blocking is the bouncer.
For Nginx, add this to your server block:
if ($http_user_agent ~* "GPTBot|ClaudeBot|CCBot|Bytespider|Meta-ExternalAgent|ChatGPT-User") {
    return 444;
}
The 444 status code is Nginx-specific. It drops the connection silently without sending any response. The crawler gets nothing. No headers, no body, no indication the server even exists. It's the digital equivalent of hanging up the phone.
For rate limiting (letting some bots through but throttling them), one caveat: Nginx doesn't allow the limit_req directive inside an if block, so the standard pattern is a map. Requests whose key comes out empty are never counted, which means only the matched bots get throttled. In your http block:

map $http_user_agent $ai_bot_key {
    default "";
    ~*(GPTBot|ClaudeBot) $binary_remote_addr;
}

limit_req_zone $ai_bot_key zone=ai_bots:10m rate=5r/m;

Then, inside the relevant server or location block:

limit_req zone=ai_bots burst=2 nodelay;
For Apache, the equivalent uses mod_rewrite:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} GPTBot|ClaudeBot|CCBot|Bytespider|Meta-ExternalAgent [NC]
RewriteRule .* - [F,L]
If you use Cloudflare, they've made this dead simple. Their "Block AI Bots" toggle identifies over 200 bot types and blocks them with one click. Over 80% of Cloudflare customers have already turned this on.
I've dealt with server resource problems before, and the fix is always the same: be surgical about what you block, don't just throw more resources at the problem.
The Uncomfortable Question: Should You Block Them at All?
This is where it gets nuanced. And I think the "block everything" crowd is missing something important.
Remember that stat about 70.6% of blocked sites still appearing in AI citations? That means blocking training crawlers doesn't prevent your content from showing up in AI answers. The data is already out there, cached, embedded in models. The horse has left the barn.
But here's the flip side: if you block everything, including search-integrated crawlers like ChatGPT-User and Google's AI tools, you might be cutting yourself off from an emerging traffic channel. AI search through Perplexity, ChatGPT, and Google AI Overviews is growing. Blocking those crawlers means your site won't appear in those results.
I wrote about how your reaction to AI killing your traffic can make things worse. The same principle applies here. A knee-jerk "block everything" response might feel satisfying, but it's not always strategic.
My recommendation: be selective. Block the training-only crawlers that offer you zero value (Meta-ExternalAgent, CCBot, Bytespider). Rate-limit the ones connected to search products (GPTBot, ChatGPT-User). And monitor the impact on both your server performance and your visibility in AI-powered search results.
What I Actually Did for My Clients (and What I'd Do for Yours)
For that client I mentioned at the start, here's the layered approach I implemented:
Layer 1: robots.txt. Blocked all AI training crawlers. Took 5 minutes. Free.
Layer 2: Nginx rules. Added the return 444 block for the aggressive crawlers. Added rate limiting for the search-connected ones. Another 10 minutes.
Layer 3: Cloudflare. For clients who use it, I turned on the AI bot blocking toggle and set up a custom rule to allow specific bots through at reduced rates.
Layer 4: Monitoring. Set up a cron job to check the access logs weekly for new bot patterns. New crawlers appear constantly. This is an arms race, not a one-time fix.
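That Layer 4 check doesn't need to be fancy. Here's a sketch of the weekly script; the log path and the 5,000-request threshold are assumptions, so tune both to your traffic before dropping it into /etc/cron.weekly:

```shell
#!/bin/sh
# Weekly bot sweep: print every user agent that exceeds a request threshold.
report_heavy_agents() {
  log="$1"
  threshold="$2"
  # Pull the user-agent field from combined-format logs, tally, and
  # keep only the agents whose request count exceeds the threshold.
  awk -F'"' '{print $6}' "$log" 2>/dev/null \
    | sort | uniq -c | sort -rn \
    | awk -v t="$threshold" '$1 > t'
}

# Assumed log path and threshold; adjust for your server.
report_heavy_agents /var/log/nginx/access.log 5000
```

Anything it prints is a candidate for a new robots.txt entry or a server-level block; if it prints nothing, the current rules are holding.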
The result? 75% reduction in bot traffic and roughly 30% lower hosting costs from reduced bandwidth and CPU usage. The site went from feeling sluggish to snappy, and the client stopped getting overuse warnings from their hosting provider.
I'll be honest: this isn't solved. New bots show up every month. Bytespider keeps spoofing. Some crawlers rotate through thousands of IPs. It's a moving target, and I don't have a permanent fix. Nobody does.
But you can make it manageable. Start with the robots.txt. Add the server blocks. Monitor your logs. And accept that this is now part of running a website in 2026.
If your site needs broader optimization beyond just bot management, that's a different conversation. But if your server is struggling and you can't figure out why, check the logs first. The answer might not be what you expect.
Need help blocking AI crawlers that are killing your site's performance? Let's talk about getting your server back under control.
About the Author
Kemal Esensoy
Kemal Esensoy, founder of Wunderlandmedia, started his journey as a freelance web developer and designer. He conducted web design courses with over 3,000 students. Today, he leads an award-winning full-stack agency specializing in web development, SEO, and digital marketing.