robots.txt — blocking Bytespider & GPTBot

Head_on_a_Stick
Posts: 14114
Joined: 2014-06-01 17:46
Location: London, England
Has thanked: 81 times
Been thanked: 135 times

robots.txt — blocking Bytespider & GPTBot

#1 Post by Head_on_a_Stick »

I've just noticed that this site's robots.txt attempts to disallow the Bytespider bot. That bot doesn't read robots.txt, so you'll have to block its IP addresses if you want to stop it:

https://stackoverflow.com/questions/579 ... user-agent
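
Since it ignores robots.txt, another option is to refuse its requests at the web server itself. A minimal sketch, assuming an nginx front end (no idea what this forum actually runs), matching on the Bytespider user-agent string:

Code: Select all

# Sketch only, assuming nginx: refuse requests whose user agent contains
# "Bytespider", since the crawler does not honour robots.txt.
# Place inside the relevant server block.
if ($http_user_agent ~* "Bytespider") {
    return 403;
}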

And while we're on the subject, OpenAI have just published the user agent for their GPTBot; I'm blocking it on my site:

Code: Select all

User-agent: GPTBot
Disallow: /

Reference: https://searchengineland.com/gptbot-ope ... ler-430360

Perhaps something to consider here as well? This place crawls enough as it is IMO :mrgreen:
deadbang

donald
Debian Developer, Site Admin
Posts: 1338
Joined: 2021-03-30 20:08
Has thanked: 238 times
Been thanked: 288 times

Re: robots.txt — blocking Bytespider & GPTBot

#2 Post by donald »

Some time ago we contacted the owner of Bytespider's IP block via their abuse@ address to find a solution for the massive resource hogging and noncompliance with the robots.txt file. They were happy to work with us and cull the spider herd.

Prior to that, a few months ago, the GPTBot preview crawled us at a staggering pace and we blocked entire ranges of their company-owned IP addresses. Today they crawl us at a nominal pace; they are crawling as I write this, with no issues at all.
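
For illustration, blocking whole address ranges like that can be done with a couple of deny rules on the front-end web server. A sketch assuming nginx, using RFC 5737 documentation CIDRs as placeholders rather than any ranges that were actually blocked:

Code: Select all

# Sketch assuming nginx; goes in the http, server, or location block.
# The CIDRs are RFC 5737 documentation placeholders, not real crawler ranges.
deny 192.0.2.0/24;
deny 198.51.100.0/24;
allow all;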

We take spiders seriously because Debian takes Bugs seriously. :D
Typo perfectionish.


"The advice given above is all good, and just because a new message has appeared it does not mean that a problem has arisen, just that a new gremlin hiding in the hardware has been exposed." - FreewheelinFrank
