Probably not copyright infringement. But it is probably (hopefully?) a violation of CFAA, both because it is effectively DDoSing you, and they are ignoring robots.txt.
Big thing worth asking here: depending on what 'Amazon' means (i.e. known Amazon-specific IPs vs. generic cloud IPs), it could just be someone running a crawler on AWS.
Or folks failing the AWS shared responsibility model, with their compromised machines running botnets on AWS.
Or folks quasi-spoofing 'AmazonBot' because they think it will have a better not-blocked rate than anonymous or other requests...
From the information in the post, it sounds like the last one to me. That is, someone else spoofing an Amazonbot user agent. But it could potentially be all three.
* There is knowledge that the intended access was unauthorised
* There is an intention to secure access to any program or data held in a computer
I imagine US law has similar definitions of unauthorized access?
`robots.txt` is the universal standard for defining what is unauthorised access for bots. No programmer could argue they aren't aware of this, and ignoring it, for me personally, is enough to show knowledge that the intended access was unauthorised. Is that enough for a court? Not a goddamn clue. Maybe we need to find out.
robots.txt isn't legally binding. It's a convention (only formalized as RFC 9309 in 2022), and even that RFC makes compliance voluntary. In US law, at least, a bot scraping a site doesn't involve a human assenting to anything, so the ToS arguably don't constitute a contract. According to the Robotstxt organization itself: “There is no law stating that /robots.txt must be obeyed, nor does it constitute a binding contract between site owner and user, but having a /robots.txt can be relevant in legal cases.”
The last part basically means the robots.txt file can be circumstantial evidence of intent, but there need to be other factors at the heart of the case.
I mean, it might just be a matter of the right UK person filing a case. My main frame of reference is UK libel/slander law, and if my US brain applies that logic here, the burden of proof ends up on showing non-infringement.
I wind up in jail for ten years if I download an episode of iCarly; Sam Altman inhales every last byte on the internet and gets a ticker tape parade. Make it make sense.
> I'm working on a proof of work reverse proxy to protect my server from bots in the future.
Honestly I think this might end up being the mid-term solution.
For legitimate traffic it's not too onerous, and recognized users can easily have bypasses. For bulk traffic it's extremely costly, and can be scaled to make it more costly as abuse happens. Hashcash is a near-ideal corporate-bot combat system, and layers nicely with other techniques too.
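For reference, the core of a hashcash-style check is tiny. Here is a minimal sketch in Python (illustrative only, not the author's actual proxy; the SHA-256 construction and the 20-bit difficulty are arbitrary assumptions):

    import hashlib
    import secrets

    def make_challenge() -> str:
        # Random server-issued challenge, e.g. handed out with a cookie or form field.
        return secrets.token_hex(16)

    def leading_zero_bits(digest: bytes) -> int:
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
                continue
            bits += 8 - byte.bit_length()
            break
        return bits

    def verify(challenge: str, nonce: str, difficulty: int = 20) -> bool:
        # The client must find a nonce whose hash has `difficulty` leading zero bits.
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        return leading_zero_bits(digest) >= difficulty

    def solve(challenge: str, difficulty: int = 20) -> str:
        # What a legitimate client (a snippet of JavaScript, say) does once per session;
        # cheap for one visitor, expensive when multiplied across a bulk crawl.
        n = 0
        while not verify(challenge, str(n), difficulty):
            n += 1
        return str(n)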
Using status code 418 (I'm a teapot), while cute, actually works against you: even well-behaved bots don't know how to handle it, so they may not treat it as a permanent status and will try to recrawl again later.
Plus you'll want to allow access to /robots.txt.
Of course, if they're hammering new connections, then automatically adding temporary firewall rules if the user agent requests anything but /robots.txt might be the easiest solution. Well or just stick Cloudflare in front of everything.
I put my personal site behind Cloudflare last year specifically to combat AI bots. It's very effective, but I hate that the web has devolved to a state where using a service like Cloudflare is practically no longer optional.
We have had the same problem at my client for the last couple of months, but from Facebook (using their published IP ranges). They don't even respect 429 (Too Many Requests) responses, and the business is hesitant to outright ban them in case it impacts Open Graph previews or Facebook advertising tooling.
I ran into this weeks ago and was super impressed: you solve a self-hosted captcha and log in as "anonymous". I use cgit currently but have dabbled with Fossil previously, and if bots were a problem I'd absolutely consider this.
Just add a forged link on the main page pointing to a page that doesn't exist, and when it's hit, block that IP. Maybe that way they'll only ever crawl the first page?
Excuse my technical ignorance, but is it actually trying to get all the files in your git repo? Couldn’t you just have everything behind an user/pass if so?
Author of the article here. The behavior of the bot seems like this:
    while (true) {
        const page = await load_html_page(read_from_queue());
        save_somewhere(page);
        for (const link of page.links) {
            enqueue(link);
        }
    }
This means that every link on every page gets enqueued and saved to do something. Naturally, this means that every file of every commit gets enqueued and scraped.
Having everything behind auth defeats the point of making the repos public.
It seems like git self-hosters frequently encounter DDoS issues. I know it's not typical for free software, but I wonder if gating file contents behind a login and allowing registrations could be the answer for self-hosting repositories on the cheap.
You can still have other limits by IP. 429s tend to slow the scrapers, and it means you are spending a lot less on bandwidth and compute when they get too aggressive. Monitor and adjust the regex list over time as needed.
Note that if SEO is a goal, this does make you vulnerable to blackhat SEO by someone faking a UA of a search engine you care about and eating their 6 req/minute quota with fake bots. You could treat Google differently.
This approach won't solve for the case where the UA is dishonest and pretends to be a browser - that's an especially hard problem if they have a large pool of residential IPs and emulate / are headless browsers, but that's a whole different problem that needs different solutions.
For Google, just read their publicly published list of crawler IPs. They're broken down into 3 JSON files by category: one set of IPs is for GoogleBot (the web crawler), one is for special requests like those from Google Search Console, and one is for special crawlers related to things like Google Ads.
You can ingest this IP list periodically and set rules based on those IPs instead. Makes you not prone to the blackhat SEO tactic you mentioned. In fact, you could completely block GoogleBot UA strings that don’t match the IPs, without harming SEO, since those UA strings are being spoofed ;)
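A sketch of what that check can look like in Python. The URL and JSON shape match what Google has published for Googlebot at the time of writing; treat both as assumptions to re-verify against their documentation:

    import ipaddress
    import json
    import urllib.request

    GOOGLEBOT_RANGES = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

    def load_googlebot_networks():
        # Fetch the published prefixes; refresh this on a timer (daily is plenty).
        with urllib.request.urlopen(GOOGLEBOT_RANGES) as resp:
            data = json.load(resp)
        nets = []
        for entry in data["prefixes"]:
            prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
            nets.append(ipaddress.ip_network(prefix))
        return nets

    def is_real_googlebot(ip: str, networks) -> bool:
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

Any request that claims a GoogleBot user agent but fails is_real_googlebot() can then be dropped without SEO risk.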
Unless we start chopping these tech companies down, there's not much hope for the public internet. They now have an incentive to crawl anything they can, and they have vastly more resources than even most governments. Most resources I need to host in an internet-facing way are behind key auth, and I'm not sure I see a way around doing that for at least a while.
What are the actual rules/laws about scraping? I have a few projects I'd like to do that involve scraping but have always been conscious about respecting the host's servers, plus whether private content is copyrighted. But sounds like AI companies don't give a shit lol. If anyone has a good resource on the subject I'd be grateful!
If you go to a police station and ask them to arrest Amazon for accessing your website too often, will they arrest Amazon, or laugh at you?
While facetious in nature, my point is that people walking around in real brick-and-mortar locations simply do not care. If you want police to enforce laws, those are the kinds of people that need to care about your problem. Until that occurs, you'll have to work around the problem.
Do they keep retrieving the same data from the same links over and over again, like they're stuck in a forever loop that runs week after week?
Or are they crawling your site in a hyper-aggressive way but actually getting more and more data, so it might take them, say, 2 days to crawl over everything and then they go away?
Indeed: https://marcusb.org/hacks/quixotic.html. Try not blocking the LLM bot traffic and instead start injecting spurious content to ""improve"" their data. Markov chains at their finest!
Depends on what you are hosting. I found that source code repository viewers in particular (OP mentions Gitea, but I have seen it with others as well) are really troublesome: each and every commit that exists in your repository can potentially cause dozens if not hundreds of new unique pages to exist (diff against previous version, diff against current version, show file history, show file blame, etc...).
Plus many of these repo viewers seem to pull that information directly from the source repository without much caching involved.
This is different from typical blogging or forum software, which is often designed to be able to handle really huge websites and thus have strong caching support. So far, nobody expected source code viewers to be so popular that performance could be an issue, but with AI scrapers this is quickly changing.
Gitea in particular is a worst case for this. Gitea shows details about every file at every version and every commit if you click enough. The bots click every link. This fixed cost adds up when hundreds of IPs are at different levels of clicking of every link.
Seems some of these bots are behaving abusively on sites with lots of links (like git forges). I have some sites receiving 200 requests per day and some receiving 1 million requests per day from these AI bots, depending on the design of the site.
Has any group of tech companies ever behaved so badly in so many ways so uniformly? There was 90s Microsoft, that was bad enough to get the DOJ involved, but that was one actor.
It’s like the friggin tobacco companies or something. Is anyone being the “good guys” on this?
Unacceptable, sorry this is happening. Do you know about fail2ban? You can have it automatically filter IPs that violate certain rules. One rule could be matching on the bot trying certain URLs. You might be able to get some kind of honeypot going with that idea. Good luck
They list a service for each address, so maybe you could block all the non-Route 53 IP addresses. Although that assumes they aren’t using the Route 53 IPs or unlisted IPs for scraping (the page warns it’s not a comprehensive list).
Regardless, it sucks that you have to deal with this. The fact that you’re a customer makes it all the more absurd.
One time I had an AWS customer (Nokia Deepfield) spamming my shadowsocks server logs with unauthenticated attempts. I emailed AWS to curtail the abuse and they sided with their customer, ofc.
Not a crawler writer but have FAFOd with data structures in the past to large career success.
...
The closest you could possibly do with any meaningful influence is option C, with the following general observations:
1. You'd need to 'randomize' the generated output link
2. You'd also want to maximize cacheability of the replayed content to minimize work.
3. Add layers of obfuscation on the frontend side, for instance a 'hidden' link (maybe with some prompt fuckery if you are brave) inside the HTML, with a random bad link on your normal pages.
4. Randomize parts of the honeypot link pattern. At some point someone monitoring logs/etc will see that it's a loop and blacklist the path.
5. Keep up at 4 and eventually they'll hopefully stop crawling.
---
On the lighter side...
1. do some combination of above but have all honeypot links contain the right words that an LLM will just nope out of for regulatory reasons.
That said, all the above will do is minimize pain (except, perhaps ironically, the joke response, which will more likely get you blacklisted, but could also potentially get you on a list or earn you a TLA visit)...
... Most pragmatically, I'd start by suggesting the best option is a combination of nonlinear rate limiting, both on the ramp-up and the ramp-down. That is, the faster requests come in, the more you increment their `valueToCheckAgainstLimit`. The longer it's been since the last request, the more you decrement.
Also pragmatically, if you can extend that to put together even semi-sloppy code to then scan when a request to a junk link that results in a ban immediately results in another IP trying to hit the same request... well ban that IP as soon as you see it, at least for a while.
With the right sort of lookup table, IP Bans can be fairly simple to handle on a software level, although the 'first-time' elbow grease can be a challenge.
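A minimal sketch of that nonlinear ramp-up/ramp-down score in Python. The constants are arbitrary, and the key can be an IP, an ASN, or whatever bucket you prefer:

    import time

    class NonlinearLimiter:
        def __init__(self, limit: float = 300.0, decay_per_s: float = 1.0):
            self.limit = limit
            self.decay_per_s = decay_per_s
            self.state = {}  # key (IP, ASN, ...) -> (score, last_seen)

        def allow(self, key: str) -> bool:
            now = time.time()
            prev = self.state.get(key)
            if prev is None:
                score, last = 0.0, now - 60.0  # treat unseen clients as having been idle
            else:
                score, last = prev
            gap = max(now - last, 1e-3)
            score = max(0.0, score - self.decay_per_s * gap)  # ramp down while idle
            score += min(10.0, 1.0 / gap)                     # bigger penalty the faster requests arrive
            self.state[key] = (score, now)
            return score < self.limit

On a False return, serve a 429 (or a decoy page, as suggested elsewhere in the thread) instead of the real content.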
Personally I'm not trying to block the bots, I'm trying to avoid the bandwidth bill.
I've recently blocked everything that isn't offering a user agent. If it had only pulled text I probably wouldn't have cared, but it was pulling images as well (bot designers, take note - you can have orders of magnitude less impact if you skip the images).
For me personally, what's left isn't eating enough bandwidth for me to care, and I think any attempt to serve some bots is doomed to failure.
If I really, really hated chatbots (I don't), I'd look at approaches that poison the well.
Most of those artists aren’t any better though. I’m on a couple of artists’ forums and outlets like Tumblr, and I saw firsthand the immediate, total 180 re: IP protection when genAI showed up. Overnight, everybody went from “copying isn’t theft, it leaves the original!” and other such mantras to being die-hard IP maximalists. To say nothing of how they went from “anything can be art and it doesn’t matter what tools you’re using” to forming witch-hunt mobs against people suspected of using AI tooling. AI has made a hypocrite out of everybody.
Manga nerds on Tumblr aren't the artists I'm worried about. I'm talking about people whose intellectual labor is being laundered by gigacorps and the inane defenses mounted by their techbro serfdom.
It's time for a lawyer letter. See the Computer Fraud and Abuse Act prosecution guidelines.[1] In general, the US Justice Department will not consider any access to open servers that's not clearly an attack to be "unauthorized access". But,
"However, when authorizers later expressly revoke authorization—for example, through unambiguous written cease and desist communications that defendants receive and understand—the Department will consider defendants from that point onward not to be authorized."
So, you get a lawyer to write an "unambiguous cease and desist" letter. You have it delivered to Amazon by either registered mail or a process server, as recommended by the lawyer. Probably both, plus email.
Then you wait and see if Amazon stops.
If they don't stop, you can file a criminal complaint. That will get Amazon's attention.
[1] https://www.justice.gov/jm/jm-9-48000-computer-fraud
Do we need a "robots must respect robots.txt" law?
If we did, bot authors would comply by just changing their User-Agent to something different that’s not expressly forbidden.
(Disallowing * isn’t usually an option since it makes you disappear from search engines).
> Then you wait and see if Amazon stops.
That’s if the requests are actually coming from Amazon, which seems very unlikely given some of the details in the post (rotating user agents, residential IPs, seemingly not interpreting robots.txt). The Amazon bot should come from known Amazon IP ranges and respect robots.txt. An Amazon engineer confirmed it in another comment: https://news.ycombinator.com/item?id=42751729
The blog post mentions things like changing user agent strings, ignoring robots.txt, and residential IP blocks. If the only thing that matches Amazon is the “AmazonBot” User Agent string but not the IP ranges or behavior then lighting your money on fire would be just as effective as hiring a lawyer to write a letter to Amazon.
Honestly, I figure that being on the front page of Hacker News like this is more than shame enough to get a human from the common sense department to read and respond to the email I sent politely asking them to stop scraping my git server. If I don't get a response by next Tuesday, I'm getting a lawyer to write a formal cease and desist letter.
Someone from Amazon already responded: https://news.ycombinator.com/item?id=42751729
> If I don't get a response by next Tuesday, I'm getting a lawyer to write a formal cease and desist letter.
Given the details, I wouldn’t waste your money on lawyers unless you have some information other than the user agent string.
It's computer science; nothing changes on the corpo side until they get a lawyer letter.
And even then, it's probably not going to be easy
No one gives a fuck in this industry until someone turns up with bigger lawyers. This is behaviour which is written off with no ethical concerns as ok until that bigger fish comes along.
Really puts me off it.
Lol you really think an ephemeral HN ranking will make change?
It's not unheard of. But neither would I count on it.
It did yesterday!
https://news.ycombinator.com/item?id=42740516
There's only one way to find out!
I like the solution in this comment: https://news.ycombinator.com/item?id=42727510.
Put a link somewhere in your site that no human would visit, disallow it in robots.txt (under a wildcard because apparently OpenAI’s crawler specifically ignores wildcards), and when an IP address visits the link ban it for 24 hours.
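A minimal sketch of that honeypot in Python, assuming Flask; the trap path and the 24-hour window are arbitrary choices, and behind a reverse proxy you would read the client IP from the appropriate forwarded header rather than remote_addr:

    import time
    from flask import Flask, request, abort

    app = Flask(__name__)
    banned: dict[str, float] = {}  # ip -> unix time the ban expires
    BAN_SECONDS = 24 * 3600

    @app.before_request
    def reject_banned_ips():
        expires = banned.get(request.remote_addr)
        if expires and expires > time.time():
            abort(403)

    # Link to this from an invisible <a> tag and disallow it in robots.txt;
    # no human should ever reach it, so anything that does gets banned.
    @app.route("/honeypot-do-not-follow")
    def honeypot():
        banned[request.remote_addr] = time.time() + BAN_SECONDS
        abort(403)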
I had to deal with some bot activity that used a huge address space, and I tried something very similar: when a condition confirming a bot was detected, I banned that IP for 24h.
But due to the number of IPs involved, this had no real impact on the volume of traffic.
My suggestion is to look very closely at the headers you receive (varnishlog is very nice for this). If you stare at them long enough, you might spot something that all those requests have in common that would allow you to easily identify them (like a very specific and unusual combination of reported language and geolocation, or the same outdated browser version, etc.).
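Programmatically, that boils down to counting header combinations and flagging ones that are far too popular. A rough sketch; the chosen header fields are just examples of what tends to be revealing:

    from collections import Counter

    def suspicious_fingerprints(requests, min_count=1000):
        # `requests` is an iterable of per-request header dicts, e.g. parsed out of
        # varnishlog or an access log configured to record these extra fields.
        combos = Counter(
            (r.get("User-Agent"), r.get("Accept-Language"), r.get("Accept-Encoding"))
            for r in requests
        )
        return [(combo, n) for combo, n in combos.most_common() if n >= min_count]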
My favorite example of this was how folks fingerprinted the active probes of the Great Firewall of China. It has a large pool of IP addresses to work with (i.e. all ISPs in China), but the TCP timestamps were shared across a small number of probing machines:
"The figure shows that although the probers use thousands of source IP addresses, they cannot be fully independent, because they share a small number of TCP timestamp sequences"
https://censorbib.nymity.ch/pdf/Alice2020a.pdf
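A toy version of that trick: probes that share a machine share a TCP timestamp clock, so their (capture time, TSval) observations fall on the same line and can be grouped by the inferred clock origin. The 1000 Hz tick rate here is an assumption; the paper effectively estimates it per group:

    from collections import defaultdict

    HZ = 1000  # assumed TCP timestamp tick rate

    def cluster_by_timestamp_clock(observations, tolerance_s=2.0):
        """observations: iterable of (src_ip, capture_time_unix, tsval)."""
        clusters = defaultdict(set)  # rounded clock origin -> set of source IPs
        for src_ip, t, tsval in observations:
            origin = t - tsval / HZ            # when this clock would have read zero
            key = round(origin / tolerance_s)  # bucket origins within ~tolerance_s
            clusters[key].add(src_ip)
        return clusters

    # Thousands of source IPs collapsing into a handful of clusters is the
    # giveaway that they are fronted by a small number of real machines.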
This is a very cool read, thanks
If you just block the connection, you send a signal that you are blocking it, and they will change it. You need to impose cost per every connection through QoS buckets.
If they rotate IPs, ban by ASN, have a page with some randomized pseudo-content in the source (not static), and explain that the traffic allocated to this ASN has exceeded normal user limits and has been rate limited (to a crawl).
Have graduated responses starting at a 72-hour ban where every page thereafter, regardless of URI, results in that page and rate limit. Include a contact email address that is dynamically generated per bucket, and validate that all inbound mail matches DMARC for Amazon. Be ready to provide a log of abusive IP addresses.
That way if Amazon wants to take action, they can, but it's in their ballpark. You gatekeep what they can do on your site with your bandwidth. Letting them run hog wild and steal bandwidth from you programmatically is unacceptable.
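A rough sketch of the per-ASN bookkeeping in Python. The IP-to-ASN lookup assumes the maxminddb library with a GeoLite2-ASN database (hypothetical path); the 72-hour starting ban comes from the comment above, and the escalation steps are made up:

    import time
    import maxminddb

    asn_db = maxminddb.open_database("GeoLite2-ASN.mmdb")    # hypothetical path
    BAN_STEPS = [72 * 3600, 7 * 24 * 3600, 30 * 24 * 3600]   # escalating ban lengths

    strikes: dict[int, int] = {}         # asn -> how many bans it has accumulated
    banned_until: dict[int, float] = {}  # asn -> unix time the current ban expires

    def asn_for(ip: str) -> int:
        record = asn_db.get(ip) or {}
        return record.get("autonomous_system_number", 0)

    def should_serve_decoy(ip: str) -> bool:
        return banned_until.get(asn_for(ip), 0) > time.time()

    def record_abuse(ip: str) -> None:
        asn = asn_for(ip)
        step = min(strikes.get(asn, 0), len(BAN_STEPS) - 1)
        banned_until[asn] = time.time() + BAN_STEPS[step]
        strikes[asn] = strikes.get(asn, 0) + 1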
Maybe ban ASNs /s
This was indeed one mitigation used by a site to prevent bots hosted on AWS from uploading CSAM and generating bogus reports to the site's hosting provider.[1]
In any case, I agree with the sarcasm. Blocking data center IPs may not help the OP, because some of the bots are resorting to residential IP addresses.
[1] https://news.ycombinator.com/item?id=26865236
Ya if it's also coming from residences it's probably some kind of botnet
Why work hard… Train a model to recognize the AI bots!
Because you have to decide in less than 1ms, using AI is too slow in that context
You can delay the first request from an IP by a lot more than that without causing problems.
Train with a bdt.
This isn't a problem domain that models are capable of solving.
Ultimately in two party communications, computers are mostly constrained by determinism, and the resulting halting/undecidability problems (in core computer science).
All AI models are really bad at solving stochastic types of problems. They can approximate generally only to a point, after which it falls off. Temporal consistency in time series data is also a major weakness. Throw the two together, and models can't really solve it. They can pattern match to a degree, but that is the limit.
When all you have is a Markov generator and $5 billion, every problem starts to look like a prompt. Or something like that.
Uggh, web crawlers...
8ish years ago, at the shop I worked at we had a server taken down. It was an image server for vehicles. How did it go down? Well, the crawler in question somehow had access to vehicle image links we had due to our business. Unfortunately, the perfect storm of the image not actually existing (can't remember why, mighta been one of those weird cases where we did a re-inspection without issuing new inspection ID) resulted in them essentially DOSing our condition report image server. Worse, there was a bug in the error handler somehow, such that the server process restarted when this condition happened. This had the -additional- disadvantage of invalidating our 'for .NET 2.0, pretty dang decent' caching implementation...
It comes to mind because, I'm pretty sure we started doing some canary techniques just to be safe (Ironically, doing some simple ones were still cheaper than even adding a different web server.... yes we also fixed the caching issue... yes we also added a way to 'scream' if we got too many bad requests on that service.)
When I was writing a crawler for my search engine (now offline), I found almost no crawler library actually compliant with the real world. So I ended up going to a lot of effort to write one that complied with Amazon and Google's rather complicated nested robots files, including respecting the cool off periods as requested.
... And then found their own crawlers can't parse their own manifests.
Could you link the source of your crawler library?
It's about 700 lines of the worst Python ever. You do not want it. I would be too embarrassed to release it, honestly.
It complied, but it was absolutely not fast or efficient. I aimed at compliance first, good code second, but never got to the second because of more human-oriented issues that killed the project.
Upvoted because we’re seeing the same behavior from all AI and SEO bots. They’re BARELY respecting robots.txt, and they're hard to block. And when they crawl, they spam and drive up load so high they crash many servers for our clients.
If AI crawlers want access they can either behave, or pay. The consequence will be almost universal blocks otherwise!
> The consequence will be almost universal blocks otherwise!
How? The difficulty of doing that is the problem, isn't it? (Otherwise we'd just be doing that already.)
> (Otherwise we'd just be doing that already.)
Not quite what the original commenter meant but: WE ARE.
A major consequence of this reckless AI scraping is that it turbocharged the move away from the web and into closed ecosystems like Discord. Away from the prying eyes of most AI scrapers ... and the search engine indexes that made the internet so useful as an information resource.
Lots of old websites & forums are going offline as their hosts either cannot cope with the load or send a sizeable bill to the webmaster who then pulls the plug.
What do you mean by "barely" respecting robots.txt? Wouldn't that be more binary? Are they respecting some directives and ignoring others?
I believe that a number of AI bots only respect robots.txt entries that explicitly name their static user agent. They ignore wildcard user agents.
That counts as barely imho.
I found this out after OpenAI was decimating my site and ignoring the wildcard deny-all. I had to add entries specifically for their three bots to get them to stop.
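For anyone fighting the same thing, the workaround looks something like the robots.txt below. The wildcard block alone is what reportedly gets ignored; the per-bot entries are what finally work. The bot names are the three crawlers OpenAI currently documents (GPTBot, ChatGPT-User, OAI-SearchBot); double-check their docs for the current list:

    User-agent: *
    Disallow: /

    User-agent: GPTBot
    Disallow: /

    User-agent: ChatGPT-User
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /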
Even some non-profits ignore it now; the Internet Archive stopped respecting it years ago: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
IA actually has technical and moral reasons to ignore robots.txt. Namely, they want to circumvent this stuff because their goal is to archive EVERYTHING.
Isn’t this a weak argument? OpenAI could also say their goal is to learn everything, feed it to AI, advance humanity etc etc.
OAI is using others' work to resell it in models. IA uses it to preserve the history of the web.
there is a case to be made about the value of the traffic you'll get from oai search though...
I also don't think they hit servers repeatedly so much
As I recall, this is outdated information. Internet Archive does respect robots.txt and will remove a site from its archive based on robots.txt. I have done this a few years after your linked blog post to get an inconsequential site removed from archive.org.
This is highly annoying and rude. Is there a complete list of all known bots and crawlers?
https://darkvisitors.com/agents
https://github.com/ai-robots-txt/ai.robots.txt
Amazonbot doesn't respect the `Crawl-Delay` directive. To be fair, Crawl-Delay is non-standard, but it is claimed to be respected by the other 3 most aggressive crawlers I see.
And how often does it check robots.txt? ClaudeBot will make hundreds of thousands of requests before it re-checks robots.txt to see that you asked it to please stop DDoSing you.
One would think they'd at least respect the cache-control directives. Those have been in the web standards since forever.
Here's Google, complaining of problems with pages they want to index but I blocked with robots.txt:

    New reason preventing your pages from being indexed

    Search Console has identified that some pages on your site are not being indexed
    due to the following new reason:

    Indexed, though blocked by robots.txt

    If this reason is not intentional, we recommend that you fix it in order to get
    affected pages indexed and appearing on Google.

    Open indexing report

    Message type: [WNC-20237597]
Is there some way a website could sell that data to the AI bots as a large zip file rather than being constantly DDoSed?
Or they could at least have the courtesy to scrape during night time / off-peak hours.
No, because they won't pay for anything they can get for free. There's only one situation where an AI company will pay for data, and that's when it's owned by someone with scary enough lawyers to pressure them into paying up. Hence why OpenAI has struck licensing deals with a handful of companies while continuing to bulk-scrape unlicensed data from everyone else.
There is a project whose goal is to avoid this crawling-induced DDoS by maintaining a single web index: https://commoncrawl.org/
Is existing intellectual property law not sufficient? Why aren't companies being prosecuted for large-scale theft?
> The consequence will be almost universal blocks otherwise!
Who cares? They've already scraped the content by then.
Bold to assume that an AI scraper won't come back to download everything again, just in case there's any new scraps of data to extract. OP mentioned in the other thread that this bot had pulled 3TB so far, and I doubt their git server actually has 3TB of unique data, so the bot is probably pulling the same data over and over again.
FWIW that includes other scrapers, Amazon's is just the one that showed up the most in the logs.
If they only needed a one-time scrape, we really wouldn't be seeing noticeable bot traffic today.
That's the spirit!
If they're AI bots it might be fun to feed them nonsense. Just send back megabytes of "Bezos is a bozo" or something like that. Even more fun if you could cooperate with many other otherwise-unrelated websites, e.g. via time settings in a modified tarpit.
Global tarpit is the solution. It makes sense anyway even without taking AI crawlers into account. Back when I had to implement that, I went the semi manual route - parse the access log and any IP address averaging more than X hits a second on /api gets a -j TARPIT with iptables [1].
Not sure how to implement it in the cloud though, never had the need for that there yet.
[1] https://gist.github.com/flaviovs/103a0dbf62c67ff371ff75fc62f...
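A rough sketch of that semi-manual loop in Python. The log path and format, the threshold, and the TARPIT target (which needs xtables-addons) are all assumptions; it also assumes you run it from cron against a freshly rotated log so the file roughly covers the window you care about:

    import re
    import subprocess
    from collections import Counter

    LOG = "/var/log/nginx/access.log.1"  # hypothetical: the most recently rotated log
    WINDOW_SECONDS = 600                 # rough time span that log covers
    MAX_RPS = 5

    # Combined-log-format line starting with the client IP, filtered to /api requests.
    line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (/api\S*)')

    def offenders():
        hits = Counter()
        with open(LOG) as fh:
            for line in fh:
                m = line_re.match(line)
                if m:
                    hits[m.group(1)] += 1
        return [ip for ip, n in hits.items() if n / WINDOW_SECONDS > MAX_RPS]

    for ip in offenders():
        # -j TARPIT holds the connection open and ignores it instead of rejecting it.
        subprocess.run(
            ["iptables", "-A", "INPUT", "-s", ip, "-p", "tcp", "-j", "TARPIT"],
            check=False,
        )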
One such tarpit (Nepenthes) was just recently mentioned on Hacker News: https://news.ycombinator.com/item?id=42725147
Their site is down at the moment, but luckily they haven't stopped Wayback Machine from crawling it: https://web.archive.org/web/20250117030633/https://zadzmo.or...
Quixotic[0] (my content obfuscator) includes a tarpit component, but for something like this, I think the main quixotic tool would be better - you run it against your content once, and it generates a pre-obfuscated version of it. It takes a lot less of your resources to serve than dynamically generating the tarpit links and content.
0 - https://marcusb.org/hacks/quixotic.html
How do you know their site is down? You probably just hit their tarpit. :)
I would think public outcry by influencers on social media (such as this thread) is a better deterrent, and it also establishes a public datapoint and exhibit for future reference... as it is hard to scale the tarpit.
This doesn't work with the kind of highly distributed crawling that is the problem now.
Don't we have intellectual property law for this tho?
Don't worry, though, because IP law only applies to peons like you and me. :)
I don’t think I’d assume this is actually Amazon. The author is seeing requests from rotating residential IPs and changing user agent strings
> It's futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more.
Impersonating crawlers from big companies is a common technique for people trying to blend in. The fact that requests are coming from residential IPs is a big red flag that something else is going on.
I work for Amazon, but not directly on web crawling.
Based on the internal information I have been able to gather, it is highly unlikely this is actually Amazon. Amazonbot is supposed to respect robots.txt and should always come from an Amazon-owned IP address (You can see verification steps here: https://developer.amazon.com/en/amazonbot).
I've forwarded this internally just in case there is some crazy internal team I'm not aware of pulling this stunt, but I would strongly suggest the author treats this traffic as malicious and lying about its user agent.
Randomly selected IPs from my logs show that 80% of them have a matching forward-confirmed reverse DNS domain. The most aggressive ones were from the amazonbot domain.
Believe what you want though. Search for `xeiaso.net` in ticketing if you want proof.
Reverse DNS doesn't mean much, they can set it to anything; can you forward match them to any amazon domain?
It's forward-confirmed reverse DNS. I assumed that everyone does that by default.
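For anyone unfamiliar, forward-confirmed reverse DNS means: reverse-resolve the IP, check the hostname sits under the crawler's domain, then forward-resolve that hostname and confirm it maps back to the same IP. A Python sketch; the exact Amazonbot suffix should be taken from the verification page linked above, it is only an assumption here:

    import socket

    def fcrdns_matches(ip: str, expected_suffix: str = ".amazonbot.amazon") -> bool:
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)  # reverse (PTR) lookup
        except socket.herror:
            return False
        if not hostname.rstrip(".").endswith(expected_suffix):
            return False
        try:
            infos = socket.getaddrinfo(hostname, None)  # forward lookup
        except socket.gaierror:
            return False
        # The forward lookup must contain the original IP, otherwise the PTR is spoofed.
        return any(info[4][0] == ip for info in infos)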
So you're saying the IPs are residential, but their reverse DNS points to an amazonbot domain? Does that even make sense?
> The author is seeing requests from rotating residential IPs and changing user agent strings
This type of thing is commercially available as a service[1]. Hundreds of Millions of networks backdoored and used as crawlers/scrapers because of an included library somewhere -- and ostensibly legal because somewhere in some ToS they had some generic line that could plausibly be extended to using you as a patsy for quasi-legal activities.
[1] https://brightdata.com/proxy-types/residential-proxies
Yes, we know, but the accusation is that Amazon is the source of the traffic.
If the traffic is coming from residential IPs then it’s most likely someone using these services and putting “AmazonBot” as a user agent to trick people.
I wouldn't put it past any company these days doing crawling in an aggressive manner to use proxy networks.
With the amount of "if cloud IP then block" rules in place for many things (to weed out streaming VPNs and "potential" ddos-ing) I wouldn't doubt that at all.
I had this same issue recently. My Forgejo instance started to use 100 % of my home server's CPU as Claude and its AI friends from Meta and Google were hitting the basically infinite links at a high rate. I managed to curtail it with robots.txt and a user agent based blocklist in Caddy, but who knows how long that will work.
Whatever happened to courtesy in scraping?
> Whatever happened to courtesy in scraping?
Money happened. AI companies are financially incentivized to take as much data as possible, as quickly as possible, from anywhere they can get it, and for now they have so much cash to burn that they don't really need to be efficient about it.
Not only money, but also a culture of "all your data belong to us", because our AI is going to save you and the world.
The hubris reminds me of the dot-com era. That bust left a huge wreckage; not sure how this one is going to land.
It's gonna be rough. If you can't make money charging people $200 a month for your service then something is deeply wrong.
Need to act fast before the copyright cases in the courts get decided.
> Whatever happened to courtesy in scraping?
When various companies got signal that at least for now they have a huge overton window of what is acceptable for AI to ingest, they are going to take all they can before regulation even tries to clamp down.
The bigger danger, is that one of these companies even (or, especially) one that claims to be 'Open', does so but gets to the point of being considered 'too big to fail' from an economic/natsec interest...
Mind sharing a decent robots.txt and/or user-agent list to block the AI crawlers?
Linked upthread: https://github.com/ai-robots-txt/ai.robots.txt/blob/main/rob...
Any of the big chat models should be able to reproduce it :)
When will the last Hacker News realize that Meta and OpenAI and every last massive tech company were always going to screw us all over for a quick buck?
Remember, Facebook famously made it easy to scrape your friends from MySpace, and then banned that exact same activity from their site once they got big.
Wake the f*ck up.
The same thing that happened to courtesy in every other context: it only existed in contexts where there was no profit to be made in ignoring it. The instant that stopped being true, it was ejected.
Are you sure it isn't a DDoS masquerading as Amazon?
Requests coming from residential ips is really suspicious.
Edit: the motivation for such a DDoS might be targeting Amazon, by taking down smaller sites and making it look like amazon is responsible.
If it is Amazon, one place to start is blocking all the IP ranges they publish. Although it sounds like there are requests outside those ranges...
You should check out sites like grass dot io (I refuse to give them traffic).
They pay you for your bandwidth while they resell it to 3rd parties, which is why a lot of bot traffic looks like it comes from residential IPs.
Yes, but the point is that big company crawlers aren’t paying for questionably sourced residential proxies.
If this person is seeing a lot of traffic from residential IPs then I would be shocked if it’s really Amazon. I think someone else is doing something sketchy and they put “AmazonBot” in the user agent to make victims think it’s Amazon.
You can set the user agent string to anything you want, as we all know.
I used to work for malware detection for a security company, and we looked at residential IP proxy services.
They are very, very, very expensive for the amount of data you get; you pay per bit. Even with Amazon's money, the numbers quickly become untenable.
It was literally cheaper for us to subscribe to business ADSL/cable/fiber services at our corporate office buildings and trunk them together.
I wonder if anyone has checked whether Alexa devices serve as a private proxy network for AmazonBot’s use.
Yes, people have probably analyzed Alexa traffic once or twice over the years.
You joke, but do people analyze it continuously forever also? Because if we’re being paranoid, that’s something you’d need to do in order to account for random updates that are probably happening all the time.
They could be using echo devices to proxy their traffic…
Although I’m not necessarily gonna make that accusation, because it would be pretty serious misconduct if it were true.
To add: it’s also kinda silly on the surface of it for Amazon to use consumer devices to hide their crawling traffic, but still leave “Amazonbot” in their UA string… it’s pretty safe to assume they’re not doing this.
I worked for Microsoft doing malware detection back 10+ years ago, and questionably sourced proxies were well and truly on the table
>> but the point is that big company crawlers aren’t paying for questionably sourced residential proxies.
> I worked for Microsoft doing malware detection back 10+ years ago, and questionably sourced proxies were well and truly on the table
Big Company Crawlers using questionably sourced proxies - this seems striking. What can you share about it?
They worked on malware detection. The most likely reason is very obvious: if you only allow traffic from residential addresses to your Command & Control server, you make anti-malware research (which is most likely coming from either a datacenter or an office building) an awful lot harder - especially when you give non-residential IPs a different and harmless response instead of straight-up blocking them.
they probably can't because some of the proxies were used by TLAs is my guess...
> Yes, but the point is that big company crawlers aren’t paying for questionably sourced residential proxies
You'd be surprised...
>> Yes, but the point is that big company crawlers aren’t paying for questionably sourced residential proxies
> You'd be surprised...
Surprised by what? What do you know?
It’s not residential proxies. It’s Amazon using IPs they sublease from residential ISPs.
Wild. While I'm sure the service is technically legal since it can be used for non-nefarious purposes, signing up for a service like that seems like a guarantee that you are contributing to problematic behavior.
My site (Pinboard) is also getting hammered by what I presume are AI crawlers. It started out this summer with Chinese and Singapore IPs, but now I can't even block by IP range and have to resort to captchas. The level of traffic is enough to immediately crash the site, and I don't even have any interesting text for them to train on, just links.
I'm curious how OP figured out it's Amazon's crawler to blame. I would love to point the finger of blame.
The best way to fight this would not be to block them; that clearly doesn't cost Amazon or the others anything.
What if instead it were possible to feed the bots clearly damaging and harmful content?
If done on a larger scale, and Amazon discovered the poisoned pills, they would have to spend money rooting them out quickly and try to stop their bots from ingesting them.
Of course, nobody wants to have that stuff on their own site, which is the biggest problem with this.
> What if instead it were possible to feed the bots clearly damaging and harmful content?
With all respect, you're completely misunderstanding the scope of AI companies' misbehaviour.
These scrapers already gleefully chow down on CSAM and all other likewise horrible things. OpenAI had some of their Kenyan data-tagging subcontractors quit on them over this. (2023, Time)
The current crop of AI firms do not care about data quality. Only quantity. The only thing you can do to harm them is to hand them 0 bytes.
You would go directly to jail for things even a tenth as bad as Sam Altman has authorized.
I've seen this tarpit recommended for exactly this purpose. It creates endless nests of directories and endless garbage content as the site is crawled, so a bot can spend hours in it.
https://zadzmo.org/code/nepenthes/
If by damaging content you mean incorrect content, another comment in this thread points to someone doing exactly that: https://marcusb.org/hacks/quixotic.html
Why is it my responsibility to piss into the wind because these billionaire companies keep getting to break the law with impunity?
I'd love it if Amazon could give me some AWS credit as a sign of good faith to make up for the egress overages their bots (and others) are causing, but the ad revenue from this post will probably cover it. Unblock ads and I come out even!
Before I configured Nginx to block them:
- Bytespider (59%) and Amazonbot (21%) together accounted for 80% of the total traffic to our Git server.
- ClaudeBot drove more traffic through our Redmine in a month than it saw in the 5 years prior, combined.
I’m working on a centralized platform[1] to help web crawlers be polite by default by respecting robots.txt, 429s, etc, and sharing a platform-wide TTL cache just for crawlers. The goal is to reduce global bot traffic by providing a convenient option to crawler authors that makes their bots play nice with the open web.
[1] https://crawlspace.dev
Are you sure you're not just encouraging more people to run them?
Even so, traffic funnels through the same cache, so website owners would see the same number of hits whether there was 1 crawler or 1,000 crawlers on the platform.
How does it compare to common crawl?
CommonCrawl’s data is up to a month old. Here you can write a custom crawler, and have a REST/RAG API that you can use with your apps and agents for the data that your crawler finds. All the while, it builds up a platform-wide cache, so duplicate/redundant requests don’t reach their origin servers.
Can demonstrably ignoring robots.txt help the copyright-infringement lawsuits against the "AI" companies, their partners, and their customers?
Probably not copyright infringement. But it is probably (hopefully?) a violation of the CFAA, both because it is effectively DDoSing you and because they are ignoring robots.txt.
Maybe worth contacting law enforcement?
Although it might not actually be Amazon.
Big thing worth asking here: depending on what 'Amazon' means (i.e. known Amazon-specific IPs vs. general AWS cloud IPs), it could just be someone running a crawler on AWS.
Or folks failing the AWS 'shared responsibility model', whose compromised instances are running botnets on AWS.
Or folks quasi-spoofing 'AmazonBot' because they think it will be blocked less often than anonymous or other requests...
From the information in the post, it sounds like the last one to me. That is, someone else spoofing an Amazonbot user agent. But it could potentially be all three.
On what legal basis?
In the UK, the Computer Misuse Act applies if:
* There is knowledge that the intended access was unauthorised
* There is an intention to secure access to any program or data held in a computer
I imagine US law has similar definitions of unauthorized access?
`robots.txt` is the universal standard for defining what is unauthorised access for bots. No programmer could argue they aren't aware of this, and ignoring it, for me personally, is enough to show knowledge that the intended access was unauthorised. Is that enough for a court? Not a goddamn clue. Maybe we need to find out.
robots.txt isn't a standard. It is a suggestion, and not legally binding AFAIK. In US law at least a bot scraping a site doesn't involve a human being and therefore the TOS do not constitute a contract. According to the Robotstxt organization itself: “There is no law stating that /robots.txt must be obeyed, nor does it constitute a binding contract between site owner and user, but having a /robots.txt can be relevant in legal cases.”
The last part basically means the robots.txt file can be circumstantial evidence of intent, but there needs to be other factors at the heart of the case.
> `robots.txt` is the universal standard
Quite the assumption, you just upset a bunch of alien species.
Dammit. Unchecked geocentric model privilege, sorry about that.
I mean, it might just be a matter of the right UK person filing a case. My main point of reference is UK libel/slander law, and if my US brain extrapolates from that, the burden of proof would fall on the accused to show non-infringement.
(But again, I don't know UK law.)
Universal within the scope of the Internet.
I wind up in jail for ten years if I download an episode of iCarly; Sam Altman inhales every last byte on the internet and gets a ticker tape parade. Make it make sense.
Terms of use contract violation?
Robots.txt is completely irrelevant. TOU/TOS are also irrelevant unless you restrict access to only those who have agreed to terms.
Good thought, but zero chance this holds up in court.
>I'm working on a proof of work reverse proxy to protect my server from bots in the future.
Honestly I think this might end up being the mid-term solution.
For legitimate traffic it's not too onerous, and recognized users can easily have bypasses. For bulk traffic it's extremely costly, and can be scaled to make it more costly as abuse happens. Hashcash is a near-ideal corporate-bot combat system, and layers nicely with other techniques too.
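The verification side is what makes this attractive: the server does one hash per request while the client has to brute-force. A minimal sketch of the idea in Python (not anyone's actual implementation; the difficulty constant is arbitrary):

```python
import hashlib
import secrets

DIFFICULTY_BITS = 20  # each extra bit roughly doubles the client's work


def issue_challenge() -> str:
    # Server hands this to the client along with the difficulty.
    return secrets.token_hex(16)


def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: str) -> bool:
    # Cheap for the server: a single SHA-256 per request.
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS


def solve(challenge: str) -> str:
    # Expensive for the client: brute force until enough leading zero bits.
    counter = 0
    while not verify(challenge, str(counter)):
        counter += 1
    return str(counter)


if __name__ == "__main__":
    ch = issue_challenge()
    n = solve(ch)  # a real browser would do this in JS/WASM
    print(ch, n, verify(ch, n))
```

A real deployment would also bind the challenge to an expiry and hand back a signed cookie on success, so legitimate users only pay the cost once per session.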
I'm gonna publish more about this tomorrow. I didn't expect this to blow up so much.
Using status code 418 (I'm a teapot), while cute, actually works against you: even well-behaved bots don't know how to handle it, so they may not treat it as a permanent status and will try to recrawl later.
Plus you'll want to allow access to /robots.txt.
Of course, if they're hammering you with new connections, then automatically adding temporary firewall rules whenever that user agent requests anything but /robots.txt might be the easiest solution. Or just stick Cloudflare in front of everything.
Sounds like a job for nepenthes: https://news.ycombinator.com/item?id=42725147
Has anyone tried using Cloudflare Bot Management, and how effective is it against bots like these?
I put my personal site behind Cloudflare last year specifically to combat AI bots. It's very effective, but I hate that the web has devolved to a state where using a service like Cloudflare is practically no longer optional.
I, too, hate this future. This[0] might be a satisfying way to fight back.
[0] https://zadzmo.org/code/nepenthes/
Maybe you can use
https://www.cloudflare.com/lp/ppc/bot-management
We have had the same problem at my client for the last couple of months, but from Facebook (using their IP ranges). They don't even respect 429 responses, and the business is hesitant to outright ban them in case it impacts Open Graph previews or Facebook advertising tooling.
I wonder if there is a good way to copy something out of fossil scm or externalize this component for more general use.
https://fossil-scm.org/home/doc/trunk/www/antibot.wiki
I ran into this a few weeks ago and was super impressed: you solve a self-hosted captcha and log in as "anonymous". I currently use cgit but have dabbled with Fossil before, and if bots were a problem I'd absolutely consider this.
Just add a hidden link on the main page pointing to a page that doesn't exist; when it's hit, block that IP. Maybe that way they'll only manage to crawl the first page?
Excuse my technical ignorance, but is it actually trying to get all the files in your git repo? Couldn't you just put everything behind a user/pass if so?
Author of the article here. The behavior of the bot seems like this:
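(Roughly, as reconstructed from the access logs; this is a sketch of the apparent logic, not the bot's actual code.)

```python
from collections import deque


def crawl(start_url, fetch, extract_links):
    # Naive breadth-first crawl: no depth limit, no per-host budget,
    # no robots.txt check -- just "follow everything you see".
    queue = deque([start_url])
    seen = set()
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        page = fetch(url)               # one request per URL, from ever-changing IPs
        for link in extract_links(page):
            queue.append(link)          # every diff/blame/raw link on a forge lands here
```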
This means that every link on every page gets enqueued for later fetching. Naturally, every file of every commit ends up enqueued and scraped.

Having everything behind auth defeats the point of making the repos public.
>Having everything behind auth defeats the point of making the repos public.
Maybe add a captcha? Can be something simple and ad hoc, but unique enough to throw off most bots.
That's what I'm working on right now.
What is the proof that a hit from a residential IP address is actually Amazon? And if you have a way to tell, why not make use of it?
It seems like git self-hosters frequently encounter DDoS issues. I know it's not typical for free software, but I wonder if gating file contents behind a login and allowing registrations could be the answer for self-hosting repositories on the cheap.
If you use nginx to front it, consider something like this in the `http` block of your config:
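Something along these lines; the UA regex list and the numbers are placeholders to tune for your own traffic:

```nginx
# Collapse known crawler UAs into a per-bot bucket; everything else is exempt.
map $http_user_agent $bot_bucket {
    default          "";
    ~*amazonbot      amazonbot;
    ~*bytespider     bytespider;
    ~*claudebot      claudebot;
    ~*gptbot         gptbot;
    ~*googlebot      googlebot;   # or give search engines you care about their own zone
}

# One shared 6 req/minute bucket per bot name; anything over gets a 429.
limit_req_zone $bot_bucket zone=bots:10m rate=6r/m;
limit_req_status 429;

server {
    # ...your existing server config...
    location / {
        limit_req zone=bots burst=10 nodelay;
        # proxy_pass / root / etc. as before
    }
}
```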
You can still have other limits by IP. 429s tend to slow the scrapers, and you end up spending a lot less on bandwidth and compute when they get too aggressive. Monitor and adjust the regex list over time as needed.

Note that if SEO is a goal, this makes you vulnerable to blackhat SEO: someone faking the UA of a search engine you care about can eat its 6 req/minute quota with fake bots. You could treat Google differently.
This approach won't solve the case where the UA is dishonest and pretends to be a browser. That's an especially hard problem if they have a large pool of residential IPs and emulate (or are) headless browsers, but it's a whole different problem that needs different solutions.
For Google, just read their publicly published list of crawler IPs. They're broken down into 3 JSON files by category: one set of IPs is for Googlebot (the web crawler), one is for special requests like those from Google Search Console, and one is for special crawlers related to things like Google Ads.
You can ingest this IP list periodically and set rules based on those IPs instead. Makes you not prone to the blackhat SEO tactic you mentioned. In fact, you could completely block GoogleBot UA strings that don’t match the IPs, without harming SEO, since those UA strings are being spoofed ;)
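A rough sketch of the verification side in Python; the URL and the JSON field names are from memory, so double-check them against Google's crawler verification docs before relying on this:

```python
import ipaddress
import json
import urllib.request

# Googlebot's published ranges; Google publishes separate files for the
# special crawlers and user-triggered fetchers. URL from memory -- verify it.
GOOGLEBOT_RANGES_URL = "https://developers.google.com/search/apis/ipranges/googlebot.json"


def load_networks(url: str):
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    nets = []
    for entry in data.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            nets.append(ipaddress.ip_network(prefix))
    return nets


def is_real_googlebot(ip: str, networks) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


if __name__ == "__main__":
    networks = load_networks(GOOGLEBOT_RANGES_URL)  # refresh periodically
    # A request claiming a Googlebot UA from an IP outside these ranges is spoofed.
    print(is_real_googlebot("66.249.66.1", networks))  # example address
```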
Unless we start chopping these tech companies down, there's not much hope for the public internet. They now have an incentive to crawl anything they can, and they have vastly more resources than even most governments. Most resources I need to host in an internet-facing way are behind key auth, and I'm not sure I see a way around doing that for at least a while.
That is an unworkable solution considering most of the people here are employed by these companies.
> About 10% of the requests do not have the amazonbot user agent.
Is there any bot string in the user agent? I'd wonder if it's GPTBot as I believe they don't respect a robots.txt deny wildcard.
What are the actual rules/laws about scraping? I have a few projects I'd like to do that involve scraping but have always been conscious about respecting the host's servers, plus whether private content is copyrighted. But sounds like AI companies don't give a shit lol. If anyone has a good resource on the subject I'd be grateful!
If you go to a police station and ask them to arrest Amazon for accessing your website too often, will they arrest Amazon, or laugh at you?
While facetious in nature, my point is that people walking around in real brick-and-mortar locations simply do not care. If you want police to enforce laws, those are the kinds of people that need to care about your problem. Until that occurs, you'll have to work around the problem.
Oh, police will happily enforce IP law, if the MPAA or RIAA want them to. You only get a pass if you're a thought leader.
How many TB is your repo?
Do they keep retrieving the same data from the same links over and over again, as if stuck in an endless loop that runs week after week?
Or are they crawling your site in a hyper-aggressive way but actually getting new data each time, so that it might take them, say, 2 days to get through it all and then they go away?
Indeed, see https://marcusb.org/hacks/quixotic.html: instead of blocking LLM bot traffic, start injecting spurious content to "improve" their data. Markov chains at their finest!
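For anyone curious what that looks like under the hood, here is a toy sketch of the general Markov-chain trick (not what quixotic actually does; the seed file is a placeholder):

```python
import random
from collections import defaultdict


def build_chain(text: str, order: int = 2):
    # Map each n-gram of words to the words that followed it in the source text.
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def babble(chain, length: int = 60) -> str:
    # Walk the chain: locally plausible word sequences, globally meaningless.
    key = random.choice(list(chain))
    out = list(key)
    for _ in range(length):
        nxt = random.choice(chain.get(key, [random.choice(out)]))
        out.append(nxt)
        key = tuple(out[-len(key):])
    return " ".join(out)


if __name__ == "__main__":
    source = open("some_page_of_your_prose.txt").read()  # placeholder seed text
    print(babble(build_chain(source)))
```

Serve the output on the trap pages instead of your real content and the scraper happily hoovers up word salad.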
Earlier: https://news.ycombinator.com/item?id=42740095
Recent related discussion: https://news.ycombinator.com/item?id=42549624
I’m surprised everyone else’s servers are struggling to handle a couple of bot scrapes.
I run a couple of public facing websites on a NUC and it just… chugs along? This is also amidst the constant barrage of OSINT attempts at my IP.
Depends on what you are hosting. I found that source code repository viewers in particular (OP mentions Gitea, but I have seen it with others as well) are really troublesome: each and every commit in your repository can potentially cause dozens if not hundreds of new unique pages to exist (diff against previous version, diff against current version, show file history, show file blame, etc...). Plus, many repo viewers take this information directly from the source repository without much caching involved, as far as I can tell. This is different from typical blogging or forum software, which is often designed to handle really huge websites and thus has strong caching support. Until now, nobody expected source code viewers to be so popular that performance could be an issue, but with AI scrapers this is quickly changing.
Gitea in particular is a worst case for this. Gitea shows details about every file at every version of every commit if you click enough, and the bots click every link. That fixed cost adds up when hundreds of IPs are each at a different stage of clicking through every link.
Seems some of these bots are behaving abusively on sites with lots of links (like git forges). I have some sites receiving 200 requests per day and some receiving 1 million requests per day from these AI bots, depending on the design of the site.
Has any group of tech companies ever behaved so badly in so many ways so uniformly? There was 90s Microsoft, that was bad enough to get the DOJ involved, but that was one actor.
It’s like the friggin tobacco companies or something. Is anyone being the “good guys” on this?
Unacceptable, sorry this is happening. Do you know about fail2ban? You can have it automatically ban IPs that violate certain rules, and one rule could match on the bot requesting certain URLs. You might be able to get some kind of honeypot going with that idea. Good luck.
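For example, with a trap URL that only bots will request (a sketch assuming nginx's default access log location; the path, filter name, and ban time are placeholders):

```ini
# /etc/fail2ban/filter.d/bot-trap.conf
[Definition]
failregex = ^<HOST> .* "GET /secret-trap-do-not-follow[^"]*"

# /etc/fail2ban/jail.d/bot-trap.local
[bot-trap]
enabled  = true
port     = http,https
filter   = bot-trap
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400
```

fail2ban then inserts the firewall rule for you, so you don't have to script iptables yourself.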
They said it is coming from different IP addresses every time, so fail2ban wouldn't help.
Amazon does publish every IP address range used by AWS, so there is the nuclear option of blocking them all pre-emptively.
https://docs.aws.amazon.com/vpc/latest/userguide/aws-ip-rang...
I'd do that, but my DNS is via route 53. Blocking AWS would block my ability to manage DNS automatically as well as certificate issuance via DNS-01.
They list a service for each address, so maybe you could block all the non-Route 53 IP addresses. Although that assumes they aren’t using the Route 53 IPs or unlisted IPs for scraping (the page warns it’s not a comprehensive list).
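A rough sketch of how you might slice that file; the keep-list of services is a guess at what you would still need, so adjust for your own setup:

```python
import ipaddress
import json
import urllib.request

AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

# Services whose ranges you may still want reachable (Route 53 and friends).
KEEP_SERVICES = {"ROUTE53", "ROUTE53_HEALTHCHECKS", "ROUTE53_RESOLVER"}


def aws_blocklist():
    with urllib.request.urlopen(AWS_RANGES_URL, timeout=10) as resp:
        data = json.load(resp)
    blocked = []
    for entry in data["prefixes"]:  # IPv4; "ipv6_prefixes" holds the v6 list
        if entry["service"] not in KEEP_SERVICES:
            blocked.append(ipaddress.ip_network(entry["ip_prefix"]))
    return blocked


if __name__ == "__main__":
    nets = aws_blocklist()
    print(f"{len(nets)} CIDRs to block")
    # Feed these into your firewall as *inbound* drops only, so outbound calls
    # to the Route 53 API and DNS-01 certificate issuance keep working.
```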
Regardless, it sucks that you have to deal with this. The fact that you’re a customer makes it all the more absurd.
If you only block new inbound requests, it shouldn't impact your route 53 or DNS-01 usage.
It’ll most likely eventually help, as long as they don’t have an infinite address pool.
Do these bots use some client software (browser plugin, desktop app) that’s consuming unsuspecting users bandwidth for distributed crawling?
Monitor access logs for links that only crawlers can find.
Edit: oh, I got your point now.
One time I had an AWS customer (Nokia Deepfield) spamming my shadowsocks server logs with unauthenticated attempts. I emailed AWS asking them to curtail the abuse, and they sided with their customer, of course.
Feed them false data. If enough people do it (I'm looking at you, HN), their AI will be inaccurate to the point of being useless.
It's already useless, and companies seem to be doubling down.
https://ip-ranges.amazonaws.com/ip-ranges.json ?
Crazy how what seemed like an excellent landmark case around web crawling turned around like this so quickly because of AI.
I've had several clients hit by badly behaved AI bots in the last few months. Sad, because it's easy to honor robots.txt.
Suffering from it as well. Why can't they just `git clone` and do their stuff? =)
Lol this isn’t Amazon. This is an impersonator. They’re crawling your git repo because people accidentally push secrets to git repos.
Back to Gopher. They'll never get us there!
Cloudflare free plan has bot protection.
Return "402 Payment Required" and block?
No. Feed them shit. Code with deliberate security vulns and so on.
https://marcusb.org/hacks/quixotic.html
If blacklisting user-agents doesn't work, whitelist the clients you actually want to accept.
It's been pounding one of my sites too. Here's the URL it's trying to get to. I wonder if someone will ever figure it out and stop it.
> /wp-content/uploads/2014/09/contact-us/referanslar/petrofac/wp-content/uploads/2014/09/products_and_services/products_and_services/catalogue/references/capabilities/about-company/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/themes/domain/images/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/themes/domain/images/wp-content/uploads/2014/09/wp-content/themes/domain/images/about-company/corporate-philosophy/index.htm
"User-Agent":["Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"
I bet that URL (or a predecessor) resolves to a 404 page with a broken relative link on it, so every crawl of the 404 page tacks another path segment onto the URL. Technically that one is your problem.
Probably dumb question, but any enlightenment would be welcome to help me learn:
Could this be prevented by having a link that when followed would serve a dynamically generated page that does all of the following:
A) Insert some fake content outlining the oligarchs' more lurid rumours, or whichever disinformation you choose to push
B) Embed links to assets hosted by the oligarchs' companies so they get hit with some bandwidth
C) Dynamically create new random pages that link back to themselves
And thus create an infinite loop, similar to a gzip bomb, which could potentially taint the model if done by enough people.
Not a crawler writer but have FAFOd with data structures in the past to large career success.
...
The closest you could get to anything with meaningful influence is option C, with a few general observations:
1. You'd need to 'randomize' the generated output link
2. You'd also want to maximize cacheability of the replayed content to minimize your own work.
3. Add layers of obfuscation on the frontend side, for instance a hidden link (maybe with some prompt fuckery if you are brave) inside the HTML, with a random bad link on your normal pages.
4. Randomize parts of the honeypot link pattern. At some point someone monitoring logs/etc will see that it's a loop and blacklist the path.
5. Keep at it with 4 and eventually they'll hopefully stop crawling.
---
On the lighter side...
1. Do some combination of the above, but have all honeypot links contain the right words that an LLM will just nope out of for regulatory reasons.
That said, all the above will do is minimize the pain (except, perhaps ironically, the joke option, which will more likely get you blacklisted, and potentially land you on a list or earn you a TLA visit)...
... Most pragmatically, I'd start by suggesting that the best option is nonlinear rate limiting, on both the ramp-up and the ramp-down: the faster requests come in, the more you increment their `valueToCheckAgainstLimit`; the longer it's been since the last request, the more you decrement it (rough sketch below).
Also pragmatically, if you can extend that with even semi-sloppy code that notices when a request to a junk link gets an IP banned and another IP immediately tries the same request, then ban that second IP as soon as you see it, at least for a while.
With the right sort of lookup table, IP Bans can be fairly simple to handle on a software level, although the 'first-time' elbow grease can be a challenge.
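To make the nonlinear part concrete, here is a rough Python sketch; the thresholds and the ramp curve are arbitrary and would need tuning:

```python
import time
from collections import defaultdict

LIMIT = 100.0        # score at which requests start getting rejected
DECAY_PER_SEC = 1.0  # how quickly a quiet client earns forgiveness


class NonlinearLimiter:
    """Per-client score: grows superlinearly with request rate, decays with idle time."""

    def __init__(self):
        self.score = defaultdict(float)
        self.last_seen = {}

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_seen.get(client_id, now)
        self.last_seen[client_id] = now

        # Ramp-down: the longer since the last request, the more we decrement.
        s = max(0.0, self.score[client_id] - elapsed * DECAY_PER_SEC)

        # Ramp-up: the shorter the gap between requests, the bigger the increment.
        s += min(50.0, 1.0 / max(elapsed, 0.01))

        self.score[client_id] = s
        return s < LIMIT


if __name__ == "__main__":
    limiter = NonlinearLimiter()
    decision = True
    for _ in range(300):
        decision = limiter.allow("198.51.100.7")  # key by IP, ASN, or UA as you like
        time.sleep(0.01)                          # ~100 req/s gets cut off quickly
    print("last decision:", decision)
```

The same score table is also a natural place to hang the junk-link IP-ban logic off of.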
He seems to have a mistake in his rule?
He's got "(Amazon)" while Amazon lists their useragent as "(Amazonbot/0.1;"
The author's pronouns can be found here: https://github.com/Xe
It's a regular expression.
No evidence provided that this is amazonbot or AI related. Someone is just upset that their website is getting traffic, which seems asinine.
HN when it's a photographer, writer, or artist concerned about IP laundering: "Fair use! Information wants to be free!"
HN when it's bots hammering some guy's server "Hey this is wrong!"
A lot of you are unfamiliar with the tragedy of the commons. I have seen the paperclip maximizer – and he is you lot.
https://en.wikipedia.org/wiki/Tragedy_of_the_commons
I think there's a difference between crawling websites at a reasonable pace and hammering the server to the point it's unusable.
Nobody has problems with the Google Search indexer trying to crawl websites in a responsible way
For sure.
I'm really just pointing out the inconsistent technocrat attitude towards labor, sovereignty, and resources.
This submission has nothing to do with IP laundering. The bot is straining their server and causing OP technical issues.
Commentary is often second- and third-order.
True, but it tends to flow there organically. This comment was off topic from the start.
Their*
Fixed.
Thanks!
Personally I'm not trying to block the bots, I'm trying to avoid the bandwidth bill.
I've recently blocked everything that isn't offering a user agent. If it had only pulled text I probably wouldn't have cared, but it was pulling images as well (bot designers, take note - you can have orders of magnitude less impact if you skip the images).
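If you happen to front with nginx, for example, that's a tiny snippet in the server block:

```nginx
# Reject requests that send no User-Agent header at all.
if ($http_user_agent = "") {
    return 403;
}
```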
For me personally, what's left isn't eating enough bandwidth for me to care, and I think any attempt to serve some bots is doomed to failure.
If I really, really hated chatbots (I don't), I'd look at approaches that poison the well.
HN isn't a monolith.
Tell that to the moderation team.
Most of those artists aren't any better, though. I'm on a couple of artists' forums and outlets like Tumblr, and I saw firsthand the immediate, total 180 on IP protection when genAI showed up. Overnight, everybody went from "copying isn't theft, it leaves the original!" and other such mantras to being die-hard IP maximalists. To say nothing of how they went from "anything can be art and it doesn't matter what tools you're using" to forming witch-hunt mobs against people suspected of using AI tooling. AI has made a hypocrite out of everybody.
Manga nerds on Tumblr aren't the artists I'm worried about. I'm talking about people whose intellectual labor is being laundered by gigacorps and the inane defenses mounted by their techbro serfdom.
Something something man understand, something something salary depends on.