For the site I work for, about 20-30% of our monthly hosting costs go towards servicing bot/scraping traffic. We’ve generally priced this into the cost of doing business, as we’ve prioritised making our site as freely accessible as possible.
But after this week, where some amateur did real damage to us with a ham-fisted attempt to scrape too much too quickly, we’re forced to degrade the experience for ALL users by introducing captchas and other techniques we’d really rather not.
I had a particularly bad time not so long ago, when a customer's site - a shop - was brought to its knees because someone, probably a competitor, hired some scraper-company of some sort to scrape every product and price.
The scraper would systematically go through every single product page.
And by scraper, I mean - 100's of them. All at the same time, using the old trick of 1 scraper requesting 3 or 4 product pages at a time then pausing for a while.
They used umpteen different IP address blocks from all over the globe - but mainly using OVH vps IP address blocks from France.
Now, maybe if they'd just thrown, say, 5 or 10 of the scraper "units" at the site, no one would have noticed in amongst Googlebot (which the customer wanted anyway, because they use Google Shopping to try to bring in more sales).
But no. This shower of arseholes threw 100's of scraper "tasks" at the site. They got greedy.
Now, the site was robust enough to handle this massive load - barely. However, having to do that /and/ also handle normal day-to-day traffic? Nah. The bastards got greedy and, like you, I spent a few days unfucking the damage they were causing.
Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.
Not everybody in this space is out to destroy your site. Some of us actively try to put as little load on your site as possible. My scraper puts less load on sites than I do when I browse them normally, I've measured it. Really sucks when we get lumped together with the other abusers and blocked.
At their request, we built a method to flag accounts for data poisoning. Once flagged, those accounts would start getting plausible-ish looking garbage data.
It was pretty effective. One competitor went offline for a few days about a week after that started, and had a more limited offering when they came back up.
My favorite is Varnish,[0] which I have used with great success for _many_ web sites throughout the years. Even a web site that served 10+ million requests per day ran from a single web server for a long time, a decade-ish ago.
Wait till you find out what half of Google's business is based on (spoiler - scraping).
I really don't think scraping itself is an issue 90% of the time. It's the behavior of the out-of-control scrapers that is the problem. A well-behaved scraper should barely be noticeable, if at all.
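To make "well behaved" a bit more concrete, here is a minimal sketch of a per-host throttle a scraper could call before every request. The class name `PoliteThrottle` and the 5 second default are my own assumptions, not anything from a particular library; pick an interval that suits the site you are hitting.

```python
import time
from urllib.parse import urlparse


class PoliteThrottle:
    """Enforce a minimum gap between requests to the same host (hypothetical helper)."""

    def __init__(self, min_interval=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between hits per host (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last_hit = {}  # host -> time the most recent request was allowed

    def wait(self, url):
        """Sleep until this host may be hit again; return the seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_hit.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out.
        self._last_hit[host] = now + delay
        return delay
```

Call `throttle.wait(url)` right before each fetch. Different hosts are tracked separately, so slowing down for one site does not stall the rest of a crawl.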
if a scraper is effectively DDoSing you, call it what it is -- a denial of service attack.
i've found from experience that most scraping attempts originate against host-sites that are generally user-hostile; no APIs to use, JS tricks to bother user browsing, or groups that profit from first-mover advantage and thus try to obscure data.
So, if your sites are commonly the victim of scrapers that are harvesting publicly available data i've found that it's more useful to ask myself what alternatives I could provide those that feel the need to scrape.
As for a 'lack of ethics' on how publicly available data is wrangled -- well, i'll just say that I feel that it remains the responsibility of the administrator rather than being something to push the blame onto clients for. There are plenty of technical avenues to pursue before appealing to morals and ethics for help.
At the same time, even as someone who runs a web crawler, I have zero qualms about blocking misbehaving bots.
Not saying that doesn't suck - it does, it's why many ideas don't work in practice as an online service.
We chose to switch to the JS challenge screen as it requires no human interaction. We now block 75% (estimated to the best of our knowledge) of bot traffic but some customers are livid over the challenge screen.
Also, not all types of company will provide API endpoints. It all depends on the type of site - for example, an online shop might not wish to provide easily accessible data on offered products and prices, to their competitors who may wish to undercut them. Why would an online shop do that?
If an amateur can do damage to you, then I have some bad news for you...
I believe the point wasn't surprise that damage occurred at all, but frustration that damage can occur just out of laziness/ignorance rather than malice.
It’s just selfish. If you’re going to take the product of other people’s work in a manner they don’t consent to, at least do it in a way that doesn’t cost them twice over.
I asked them if my customer could pay to access this data point via their API and they quoted 3600 EUR/month! Enter the scraper...
Can't speak for the op but we have APIs and move the ones scraping and reselling our content to APIs. The majority are just a worthless suck on resources though.
SERP: Search Engine Results Page
That said, I do a lot of SEO work.
Still, it should be best practice to define any acronym or initialism the first time you use it
Rate Limiting by login,
Limiting data to known workflows ...
But our most fruitful effort was when we removed limits and started giving "bad" data. By bad I mean altering the price up or down by a small percentage. This hit them in the pocket but again, wasn't a golden bullet. If the customer made a transaction on the altered figure, we informed them and took it at the correct price.
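A rough sketch of the "bad data" idea, under my own assumptions: the skew is derived from a hash of account and product so a flagged account sees the same wrong number every time, and checkout always uses the real price. The function names, the `flagged` parameter and the 3% cap are hypothetical, not how any particular shop did it.

```python
import hashlib
from decimal import Decimal, ROUND_HALF_UP


def displayed_price(real_price, account_id, product_id, flagged, max_pct=3):
    """Return the price to show; flagged accounts get a small, stable skew."""
    if not flagged:
        return real_price
    # Hash account + product so the same account always sees the same skew.
    digest = hashlib.sha256(f"{account_id}:{product_id}".encode()).digest()
    # Map the first two bytes onto the range [-max_pct, +max_pct] percent.
    fraction = Decimal(int.from_bytes(digest[:2], "big")) / Decimal(65535)
    pct = (fraction * 2 - 1) * Decimal(max_pct)
    skewed = real_price * (Decimal(1) + pct / Decimal(100))
    return skewed.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def checkout_price(real_price):
    """Transactions always settle at the real price."""
    return real_price
```

Keeping the skew deterministic matters: if the number changed on every request, a scraper could spot the poisoning just by fetching the same page twice.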
It's a cool problem to tackle but it is just an arms race.
Like displaying a table with semantic elements, then divs, then using an iframe with css grid and floating values over the top.
This almost seems like a problem for AI to solve.
Plus, it's one you're going to lose. I was once asked at an All-Hands why we don't defend ourselves against bots even more vigorously.
My answer was: "Because I don't know how to build a publicly available website that I could not scrape myself if I really wanted to."
Is that legal? It would be a big blow to trust if I was the customer, but that's without knowing what you were selling and in what market.
Taking a stab at answering it: you scrape the data and build a business around selling it. Stock prices? But that's boring, plus how many others are doing it? I bet a lot.
Artificial scarcity - every week you release a "limited edition item", but if you do the math, it's not limited edition at all if you integrate over a year.
Prices (are yours high, low compared to competition?), reviews, locations of physical stores, search result placement (where does your widget show up when someone searches "widget" on your site?), just to name a few use cases.
With only 10 dongles and 10 dataplans, you can have a lot of IP addresses that are extremely hard to block. It's a one-time investment, paying proxy providers is a fixed cost.
We tried to get some, but all of the ones we could get were various levels of broken or unsupported.
>>Because I could not fully trust the other customers with whom I shared the proxy bandwidth. What if I share proxy servers with criminals that do more malicious stuff than the somewhat innocent SERP scraping?
I feel like I'm getting a glimpse into the dark underbelly of the web.
We use it for data-entry on a government website. A human would average around 10 minutes of clicking and typing, where the bot takes maybe 10 seconds. Last year we did 12000 entries. Good bot.
It should be a matter of a simple GET request to fetch plain HTML and parse the OpenGraph meta tags out of that. There are many open source libraries to do that for you, depending on your language.
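If you don't want a dependency, the plain GET plus OpenGraph parse fits in a few lines of Python standard library. This is only a sketch: `OpenGraphParser`, `parse_open_graph` and `fetch_open_graph` are names I made up, the User-Agent string is a placeholder, and it only looks at `<meta property="og:...">` tags.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and attr.get("content") is not None:
            # Keep the first value seen for each property.
            self.tags.setdefault(prop, attr["content"])


def parse_open_graph(html):
    """Return a dict of og:* properties found in an HTML string."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags


def fetch_open_graph(url, user_agent="example-link-preview/0.1"):
    """One plain GET, no JavaScript. The User-Agent value is a placeholder."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return parse_open_graph(response.read().decode(charset, errors="replace"))
```

Usage would be something like `fetch_open_graph("https://example.com/some-article")`, which returns a dict keyed by `og:title`, `og:image` and so on for whatever tags the page declares.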
If bot blocks really are a problem, a SaaS solution like Microlink could probably do it for you.
Microlink is a good tip, thanks!
I used to scrape websites to generate content for higher SERPs.
Ended up going into the adult industry lols. (https://javfilms.net)
I've always wondered, and since you're right here... how do sites like this make money?
It looks like you're probably crawling all the JAV vendors, finding free clips of today's releases, embedding them in your own site to draw traffic, and making money with affiliate links to buy the full content?
Am I missing anything? It seems hard to believe you'd get enough affiliate signups to make it worthwhile.
I can imagine your site as being a few hours a year of script maintenance and a money printer, or a 40hr/week SEO job with 1000s of similar sites across the adult industry.
I'd love to know anything you're willing to share about how the business works.
https://github.com/NikolaiT/Crawling-Infrastructure
And here I am writing about it (but it's quite old): https://incolumitas.com/2019/08/31/web-scraping-puppeteer-aw...
I believe the future will make us more free by using more bot / AI technology since who wants to spend their whole day in front of a computer and research information if machines can do the job just fine?
In the past we've had the most success defeating bots by just finding stupid tricks to use against them. Identify the traffic, identify anything that is correlated with the botnet traffic, and throw a monkey wrench at it. They're only using one User Agent? Return fake results. 90% of the botnet traffic is coming from one network source (country/region/etc)? Cause "random" network delays and timeouts. They still won't quit? During attacks, redirect to captchas for specific pages. During active attacks this is enough to take them out for days to weeks while they figure it out and work around it.
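The "identify anything that is correlated" step can be as dumb as counting. A minimal sketch, assuming you already have the suspect requests as a list of dicts with keys like `user_agent` and `country` (that shape, the function name and the 90% threshold are all assumptions for illustration):

```python
from collections import Counter


def dominant_value(requests, field, threshold=0.9):
    """Return the value of `field` seen in at least `threshold` of requests, else None."""
    counts = Counter(r[field] for r in requests if r.get(field))
    total = sum(counts.values())
    if total == 0:
        return None
    value, hits = counts.most_common(1)[0]
    return value if hits / total >= threshold else None
```

If `dominant_value(suspect_requests, "user_agent")` comes back with a single string, that is the thing to throw the monkey wrench at; if it returns `None`, try another field such as `country`.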
Then graduated to JavaScript for surrounding logic e.g. data transformation
I had assumed I'd quickly give up and move to a headless browser, BUT I can't bring myself to move away from tiny CPU utilization of curl.
Throwing together a "plugin" probably takes me less than 20 minutes normally.
I'll probably have a look at using prowl to ping my phone.
And if I get more serious I'll look at auto authenticate options on npm. But I'm not sure if the overhead of maintaining a bunch of spoofy requests will be worth it.