robots.txt — blocking Bytespider & GPTBot

Head_on_a_Stick
Posts: 14114
Joined: 2014-06-01 17:46
Location: London, England
Has thanked: 81 times
Been thanked: 135 times

robots.txt — blocking Bytespider & GPTBot

#1 Post by Head_on_a_Stick »

I've just noticed that this site's robots.txt attempts to disallow the Bytespider bot. That bot doesn't read robots.txt, so you'll have to block its IP addresses if you want to stop it:

https://stackoverflow.com/questions/579 ... user-agent
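
Since it ignores robots.txt, another option is to refuse its requests at the web server itself. A minimal sketch, assuming an nginx front end (no idea what this forum actually runs), matching on the Bytespider user-agent string:

Code: Select all

# Sketch only, assuming nginx: refuse requests whose user agent contains
# "Bytespider", since the crawler does not honour robots.txt.
# Place inside the relevant server block.
if ($http_user_agent ~* "Bytespider") {
    return 403;
}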

And while we're on the subject, OpenAI have just published the user agent for their GPTBot; I'm blocking it on my site:

Code: Select all

User-agent: GPTBot
Disallow: /

Reference: https://searchengineland.com/gptbot-ope ... ler-430360

Perhaps something to consider here as well? This place crawls enough as it is IMO :mrgreen:
deadbang

donald
Debian Developer, Site Admin
Posts: 1338
Joined: 2021-03-30 20:08
Has thanked: 238 times
Been thanked: 288 times

Re: robots.txt — blocking Bytespider & GPTBot

#2 Post by donald »

Some time ago we contacted the owner of Bytespider's IP block via their abuse@ address to find a solution for the massive resource hogging and noncompliance with the robots.txt file. They were happy to work with us and cull the spider herd.

Prior to that, a few months ago, the GPTBot preview crawled us at a staggering pace and we blocked entire ranges of their company-owned IP addresses. Today they crawl us at a nominal pace; they are crawling as I write this, with no issues at all.
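
For illustration, blocking whole address ranges like that can be done with a couple of deny rules on the front-end web server. A sketch assuming nginx, using RFC 5737 documentation CIDRs as placeholders rather than any ranges that were actually blocked:

Code: Select all

# Sketch assuming nginx; goes in the http, server, or location block.
# The CIDRs are RFC 5737 documentation placeholders, not real crawler ranges.
deny 192.0.2.0/24;
deny 198.51.100.0/24;
allow all;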

We take spiders seriously because Debian takes Bugs seriously. :D
Typo perfectionish.


"The advice given above is all good, and just because a new message has appeared it does not mean that a problem has arisen, just that a new gremlin hiding in the hardware has been exposed." - FreewheelinFrank
