robots.txt — blocking Bytespider & GPTBot

Code of conduct, suggestions, and information on forums.debian.net.
Head_on_a_Stick
Posts: 14114
Joined: 2014-06-01 17:46
Location: London, England
Has thanked: 81 times
Been thanked: 133 times


#1 Post by Head_on_a_Stick »

I've just noticed that this site's robots.txt attempts to disallow the Bytespider bot. That bot doesn't honour robots.txt, so you'll have to block its IP addresses if you want to stop it:

https://stackoverflow.com/questions/579 ... user-agent

And while we're on the subject, OpenAI have just published the user-agent for their GPTBot; I'm blocking it on my site:


User-agent: GPTBot
Disallow: /

Reference: https://searchengineland.com/gptbot-ope ... ler-430360
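For anyone who wants to sanity-check a rule like this before deploying it, here is a minimal sketch using Python's standard `urllib.robotparser`. The `example.org` URL, the `SomeOtherBot` name and the `is_allowed` helper are placeholders I made up for illustration. Note that this only shows what a crawler that obeys robots.txt would do; it says nothing about bots that ignore the file.

```python
from urllib.robotparser import RobotFileParser

# The two rule lines quoted in the post above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a robots.txt-compliant crawler may fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.org/viewtopic.php"  # placeholder URL (assumption)
    print(is_allowed("GPTBot", url))        # False: matched by the Disallow rule
    print(is_allowed("SomeOtherBot", url))  # True: no rule applies to it
```

Crawlers without a matching `User-agent` group stay allowed here because the snippet contains no `User-agent: *` group.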

Perhaps something to consider here as well? This place crawls enough as it is IMO :mrgreen:
deadbang

donald
Debian Developer, Site Admin
Posts: 1106
Joined: 2021-03-30 20:08
Has thanked: 189 times
Been thanked: 248 times

Re: robots.txt — blocking Bytespider & GPTBot

#2 Post by donald »

Some time ago we contacted the owner of Bytespider's IP block via abuse@ to find a solution to the massive resource hogging and non-compliance with the robots.txt file. They were happy to work with us and cull the spider herd.

Prior to that, a few months ago, the preview version of GPTBot crawled us at a staggering pace and we blocked entire regions of their campus-owned IP address space. Today they crawl us at a nominal pace; they are crawling now as I write this, with no issues.
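For readers curious what range-based blocking looks like in principle, here is a rough sketch with Python's standard `ipaddress` module. It is not how this site does it. The CIDR ranges are reserved documentation ranges used as stand-ins (no real crawler ranges are given in this thread), and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# Hypothetical blocklist: RFC 5737 documentation ranges, used only as
# placeholders. Real crawler ranges would have to come from your own logs.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(client_ip: str, networks=BLOCKED_NETWORKS) -> bool:
    """Return True if `client_ip` falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


if __name__ == "__main__":
    print(is_blocked("192.0.2.17"))   # True: inside 192.0.2.0/24
    print(is_blocked("203.0.113.5"))  # False: not in either range
```

In practice a check like this usually lives in the firewall or web server rather than in application code, since that drops the traffic before it costs any resources.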

We take spiders seriously because Debian takes Bugs seriously. :D
Typo perfectionish.


"The advice given above is all good, and just because a new message has appeared it does not mean that a problem has arisen, just that a new gremlin hiding in the hardware has been exposed." - FreewheelinFrank
