Scrape like the big boys

(incolumitas.com)

374 points · incolumitas · 4y ago · 189 comments

86 comments · 17 top-level

ebbp4y ago· 27 in thread

Having spent a week battling a particularly inconsiderate scraping attempt, I’m quite unsurprised by the juvenile tone and fairly glib approach to the ethics of bots/scraping presented by the piece.

For the site I work for, about 20-30% of our monthly hosting costs go towards servicing bot/scraping traffic. We’ve generally priced this into the cost of doing business, as we’ve prioritised making our site as freely accessible as possible.

But after this week, where some amateur did real damage to us with a ham-fisted attempt to scrape too much too quickly, we’re forced to degrade the experience for ALL users by introducing captchas and other techniques we’d really rather not.

scarygliders4y ago

Right with you there.

I had a particularly bad time not so long ago, when a customer's site - a shop - was brought to its knees because someone, probably a competitor, hired some scraper-company of some sort to scrape every product and price.

The scraper would systematically go through every single product page.

And by scraper, I mean - 100's of them. All at the same time, using the old trick of 1 scraper requesting 3 or 4 product pages at a time then pausing for a while.

They used umpteen different IP address blocks from all over the globe - but mainly using OVH vps IP address blocks from France.

Now, maybe if they'd just thrown, say, 5 or 10 of the scraper "units" at the site, no one would have noticed in amongst Googlebot (which they wanted to use anyway because they are using Google Shopping to try to bring in more sales).

But no. This shower of arseholes threw 100's of scraper "tasks" at the site. They got greedy.

Now, the site was robust enough to handle this load - barely - which was massive, however, having to do that /and/ also handle normal day-to-day traffic? Nah. The bastards got greedy and like you I spent a few days unfucking the damage they were causing.

Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

matheusmoreira4y ago

> Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

Not everybody in this space is out to destroy your site. Some of us actively try to put as little load on your site as possible. My scraper puts less load on sites than I do when I browse them normally, I've measured it. Really sucks when we get lumped together with the other abusers and blocked.

1 more reply

_lqaf4y ago

In a past life, we were consulting with a startup that offered a subscription data service. They were very sensitive about scrapers, especially on the time limited try-before-you-buy accounts, which competitors were abusing.

At their request, we built a method to flag accounts for data poisoning. Once flagged, those accounts would start getting plausible-ish looking garbage data.

It was pretty effective. One competitor went offline for a few days about a week after that started, and had a more limited offering when they came back up.

2 more replies

thatwasunusual4y ago

It sucks when this happens, but it's easily avoidable by using a caching frontend of some sort.

My favorite is Varnish,[0] which I have used with great success for _many_ web sites throughout the years. Even a web site that served 10+ million requests per day ran from a single web server for a long time, a decade-ish ago.

[0] https://varnish-cache.org/

funnyflamigo4y ago

> Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

Wait till you find out what half of Google's business is based on (spoiler - scraping).

I really don't think scraping itself is an issue 90% of the time. It's the behavior of the out of control scrapers that are the problem. A well behaved scraper should barely be noticeable, if at all.

2 more replies

serf4y ago

>Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

if a scraper is effectively DDoSing you, call it what it is -- a denial of service attack.

i've found from experience that most scraping attempts originate against host-sites that are generally user-hostile; no APIs to use, JS tricks to bother user browsing, or groups that profit from first-mover advantage and thus try to obscure data.

So, if your sites are commonly the victim of scrapers that are harvesting publicly available data i've found that it's more useful to ask myself what alternatives I could provide those that feel the need to scrape.

As for a 'lack of ethics' on how publicly available data is wrangled -- well, i'll just say that I feel that it remains the responsibility of the administrator rather than being something to push the blame onto clients for. There are plenty of technical avenues to pursue before appealing to morals and ethics for help.

un-devmox4y ago

This and the post you are replying to both sound like sabotage by a competitor rather than legit data collecting.

mdoms4y ago

If your site is so poorly written it can't handle a few hundred computers trying to do something as simple as loading your product pages then sorry, but that's on you. The information is on the public web and scrapers are as entitled to access it as any web browser.

1 more reply

marginalia_nu4y ago

Bots are one of those things that are easy to build and hard to get right, and there's really no way of preparing for the chaotic reality of real web pages other than fixing the problems as they show up. Weird and unexpected interactions are going to happen. Crawling the real web involves navigating a fractal of unexpected, undocumented and non-standard corner cases. Nobody gets that right on the first try. Because of that I do think we need to be a bit patient with bots.

At the same time, even as someone who runs a web crawler, I have zero qualms about blocking misbehaving bots.

chillfox4y ago

I kinda feel like rate limiting your requests to individual domains and IP addresses is an easy thing that goes a long way towards getting it right.
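
As a rough illustration of that idea, here is a minimal sketch of a per-host limiter for a crawler. The class name `DomainRateLimiter` and the 2-second default interval are assumptions made for the example, not something from the thread, and it keys on the hostname only; limiting per resolved IP address would need a DNS lookup on top.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper: the name and defaults are illustrative only.
    """

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until the host of `url` may be hit again; return seconds waited."""
        host = urlparse(url).netloc
        now = self._clock()
        delay = max(0.0, self._next_allowed.get(host, now) - now)
        if delay:
            self._sleep(delay)
        # The next request to this host is allowed min_interval after this one.
        self._next_allowed[host] = now + delay + self.min_interval
        return delay
```

Calling `limiter.wait(url)` before every fetch keeps requests to any single host spaced out, while requests to different hosts are not held up by each other.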

1 more reply

devwastaken4y ago

If an amateur can do that to your service by scraping, imagine what someone can do if they actually intend to do you harm. With cloud pricing models someone could find a little misconfiguration or oversight and put you in the hole in operating costs. Anti-abuse is a necessary design when your service is exposed to the internet.

Not saying that doesn't suck - it does, it's why many ideas don't work in practice as an online service.

paco33464y ago

I'm right there with you. I'm the lead engineer for an automotive SaaS provider (with ~6000 customers and ~4 billion requests per month) and we recently started moving all our services to Cloudflare's WAF to take advantage of their bot protection. We were getting scrapes from botnets in the 100,000+ per minute range, which was affecting performance.

We chose to switch to the JS challenge screen as it requires no human interaction. We now block 75% (estimated to the best of our knowledge) of bot traffic but some customers are livid over the challenge screen.

Andoryuuta4y ago

I'm really surprised that the JS challenges helped so much, given that there are open source libraries for bypassing them (e.g. cloudscraper[0]).

[0]: https://github.com/venomous/cloudscraper

1 more reply

EdwardDiego4y ago

What were they scraping, if I can ask? Was it targeted or just wget -r style?

1 more reply

RobSm4y ago

Why do you think those bots were scraping your data in the first place?

kulikalov4y ago

Why not create an API endpoint and charge a mild cost for that data? You’ll make money instead of spending it.

scarygliders4y ago

Do you honestly believe all site scraper people/companies are ethical enough to go to whoever pays /them/ to scrape data from a competitor's site and say "oh they offer an API to access this data let's pay for that", instead of "why pay for that data when we can scrape it right off their site"?

Also, not all types of company will provide API endpoints. It all depends on the type of site - for example, an online shop might not wish to provide easily accessible data on offered products and prices, to their competitors who may wish to undercut them. Why would an online shop do that?

4 more replies

ebbp4y ago

My point was more that we can accept, and live with, scrapers but expect some minimal level of consideration if you’re going to abuse our very expensively gathered dataset. Sending us 10x daily traffic so you can scrape quicker than the fair usage policy of our API allows is just… poor etiquette? Unkind? Not really sure how to phrase it. I’m exhausted after multiple 18-hour days trying to keep our website online for the public.

krzyk4y ago

As a programmer that just sometimes wants to check if a given item is available in store, I would like to be able to use an API for that. But if it is not available, one has to scrape.

taytus4y ago

>where some amateur did real damage to us

If an amateur can do damage to you, then I have some bad news for you...

convolutionart4y ago

This is nonsense. It's always easier to destroy than to build/maintain. If you got any real advice, by all means...

Goronmon4y ago

> If an amateur can do damage to you, then I have some bad news for you...

I believe the point wasn't surprise that damage occurred at all, but frustration that damage can occur just out of laziness/ignorance rather than malice.

1 more reply

ebbp4y ago

To be clear, the “damage” they did was to our bottom line. Most sites don’t capacity plan for random cliff walls of 2-10x traffic (clearly we should!). We’re scalable enough to handle that traffic after a period, but a) it caused intermittent periods of low availability (costing us money because we didn’t generate income the way we normally do) and b) it cost us money from scaling all our services up.

It’s just selfish. If you’re going to take the product of other people’s work in a manner they don’t consent to, at least do it in a way that doesn’t cost them twice over.

jtdev4y ago

Considering the demand for your content, why haven’t you created and provided an API? Maybe you could monetize?

throwaway29934y ago

I wrote a scraper a couple of years ago to get a single data point from a website where my client was already a paying customer. This website had an API, which they were also paying for, but the API didn't cover that data point, so at the time they had one of their admin people populating that missing piece of data manually, which was taking them around ten minutes a day.

I asked them if my customer could pay to access this data point via their API and they quoted 3600 EUR/month! Enter the scraper...

ebbp4y ago

We do offer an API - the scrapers are trying to circumvent using that, presumably.

5 more replies

chewmieser4y ago

Like everyone and their brother has a web spider. And some of them are VERY badly designed. We block them when they use too many resources, although we'd rather just let them be.

Can't speak for the op but we have APIs and move the ones scraping and reselling our content to APIs. The majority are just a worthless suck on resources though.

joekrill4y ago· 11 in thread

A little pet-peeve I have is when an obscure(ish) acronym is used and never defined. Is SERP a well-known acronym? Perhaps this is a niche blog and I'm not the intended audience.

tptacek4y ago

Yes; a SERP is a Google search result page. It's the most important acronym in SEO.

nomdep4y ago

I don’t remember ever hearing it, and I’ve been in the industry for some time

3 more replies

marginalia_nu4y ago

The word SERP feels like a bit of a shibboleth for SEO-people. They seem to take it for granted, the rest of the world just looks puzzled when they hear it.

daveguy4y ago

I had to look it up.

SERP: Search Engine Results Page

fergie4y ago

Unintroduced acronyms should always be avoided.

praptak4y ago

Depends on the audience-acronym pair. I don't think HTTP needs an introduction in a technical article, OTOH (on the other hand ;) ) a general newspaper should probably expand HTTP but not WWW.

runnerup4y ago

Perhaps a stroll through your own comment history (or the comments of any other HN (hacker news) user) would illuminate a lot of places where acronyms are used without introductions. TBH (to be honest) though, I'm not sure if every one of those should always have one or sometimes not.

nsotelo4y ago

As English speakers we often take for granted acronyms such as DB or even USA. For foreigners these can also be inscrutable.

wiether4y ago

On HN I'm used to SE meaning Software Engineer so I came up with "Software Engineer Ranting Board" before asking Google to give me the SERP that would provide me with the true meaning of SERP.

joncp4y ago

An all-too-common occurrence in HN comments as well.

bgroat4y ago

Not the OP, but I thought it was well known.

That said, I do a lot of SEO work.

Still, it should be best practice to define any acronym or initialism the first time you use it

biosed4y ago· 9 in thread

I used to lead Sys Eng for a FTSE 100 company. Our data was valuable but only for a short amount of time. We were constantly scraped, which cost us in hosting etc. We even saw competitors use our figures (good ones used it to offset their prices, bad ones just used it straight). As the article suggests, we couldn't block mobile operator IPs; some had over 100k customers behind them. Forcing the users to log in did little as the scrapers just created accounts. We had a few approaches that minimised the scraping:

Rate limiting by login,

Limiting data to known workflows ...

But our most fruitful effort was when we removed limits and started giving "bad" data. By bad I mean alter the price up or down by a small percentage. This hit them in the pocket but again, wasn't a golden bullet. If the customer made a transaction on the altered figure, we informed them and took it at the correct price.
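
To make the "alter the price up or down by a small percentage" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function name `perturbed_price`, the 2% default bound, and the choice to derive the shift from a hash of the account and item so that a flagged account sees a stable figure. The comment above does not say how the alteration was actually computed.

```python
import hashlib
from decimal import Decimal, ROUND_HALF_UP


def perturbed_price(true_price, account_id, item_id, max_pct=Decimal("2")):
    """Return `true_price` shifted by up to +/- max_pct percent.

    Hypothetical helper: the shift is derived from a hash of
    (account_id, item_id), so the same account gets the same altered
    figure on every request for that item.
    """
    digest = hashlib.sha256(f"{account_id}:{item_id}".encode()).digest()
    # Map the first 4 bytes of the hash to a fraction in [-1, 1].
    fraction = Decimal(int.from_bytes(digest[:4], "big")) / Decimal(2**32 - 1) * 2 - 1
    factor = 1 + fraction * max_pct / 100
    return (Decimal(true_price) * factor).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
```

The real price still has to be kept alongside, since any transaction would be settled at the correct figure as described above.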

It's a cool problem to tackle but it is just an arms race.

rootusrootus4y ago

I know a guy at Nike that had to deal with a similar problem. As I recall, they basically gave in -- instead of trying to fight the scrapers, they built them an API so they'd quit trashing the performance of the retail site with all the scraping.

matheusmoreira4y ago

Yes. That's exactly what everyone should do.

2 more replies

chadwittman4y ago

The real Jedi move

1 more reply

gonzo414y ago

I think there's an opportunity for a new JS framework to have something like randomly generated dom that will always display the page and elements the same to a human but constantly break paths for computers.

Like displaying a table with semantic elements, then divs, then using an iframe with css grid and floating values over the top.

This almost seems like a problem for AI to solve.

4 more replies

endymi0n4y ago

> It's a cool problem to tackle but it is just an arms race.

Plus, it's one you're going to lose. I was once asked at an All-Hands why we don't defend ourselves against bots even more vigorously.

My answer was: "Because I don't know how to build a publicly available website that I could not scrape myself if I really wanted to."

wolverine8764y ago

> But our most fruitful effort was when we removed limits and started giving "bad" data. By bad I mean alter the price up or down by a small percentage. ... If the customer made a transaction on the altered figure, we informed them and took it at the correct price.

Is that legal? It would be a big blow to trust if I was the customer, but that's without knowing what you were selling and in what market.

killingtime744y ago

It’s legal if it’s in the contract. Standard for contracts to allow for mistakes and confirmations of prices

1 more reply

ransom15384y ago

I love the honey pot approach. Put tons of valued hrefs on the page that are invisible (CSS) that the scraper would find. Then just rate limit that IP address and randomize the data coming back. Profit.
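
A minimal sketch of the server-side bookkeeping this implies, assuming the hidden links point at known trap paths. The class name `HoneypotTracker` and the example path are hypothetical, and the CSS hiding, rate limiting and data randomization themselves are left out.

```python
class HoneypotTracker:
    """Flag clients that request links hidden from human visitors.

    Hypothetical helper: trap paths are URLs that only appear in
    CSS-hidden hrefs, so a normal visitor should never request them.
    """

    def __init__(self, trap_paths):
        self.trap_paths = set(trap_paths)
        self.flagged = set()

    def record(self, client_ip, path):
        """Record one request; return True if the client is (now) flagged."""
        if path in self.trap_paths:
            self.flagged.add(client_ip)
        return client_ip in self.flagged
```

A request handler could call `record()` on every hit and switch flagged clients over to throttled or randomized responses.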

histriosum4y ago

I think this falls into the "arms race" trap, though. If you can make an href invisible via CSS, then the scraper can certainly be written to understand CSS, and thus filter out the invisible hrefs..

hall0ween4y ago· 5 in thread

Basic question, how does one profit from scraping data and what kinda data?

Taking a stab at answering it: you scrape the data and build a business around selling it. Stock prices? But that's boring, plus how many others are doing it? I bet a lot.

throw12346512344y ago

1. Be job site. 2. Have employees that cost money call facilities and get job listings. 3. Establishing relationships with facilities to list jobs. 4. Buy job listings from 3rd parties. 5. List them for free hoping to make margin. 6. Scraper steals all jobs, lags site, and gets value of hard work for free.

hall0ween4y ago

ahh thanks

3234y ago

These are scraping artificially limited releases of clothes/shoes. You buy a shoe at $100 and immediately sell it at $1000.

Artificial scarcity - every week you release a "limited edition item", but if you do the math, it's not limited edition at all if you integrate over a year.

ushtaritk4214y ago

Anything you might look up or keep track of online that helps your business is probably being scraped by someone who is using it themselves or selling access to the curated data set.

Prices (are yours high, low compared to competition?), reviews, locations of physical stores, search result placement (where does your widget show up when someone searches "widget" on your site?), just to name a few use cases.

ushtaritk4214y ago

Here's a project that's been in the news recently that relies heavily on scraped data. http://www.thebillionpricesproject.com/

IceWreck4y ago· 4 in thread

The author says proxies are expensive and then proceeds to spend a shitton of money buying all that hardware.

incolumitasOP4y ago

4G proxies are just so much better than so-called "residential" or straight datacenter proxies. It makes sense to create your own 4G proxy farm if you conduct business in that area.

With only 10 dongles and 10 data plans, you can have a lot of IP addresses that are extremely hard to block. It's a one-time investment; paying proxy providers is an ongoing fixed cost.

bsder4y ago

Where do you get 4G dongles that don't suck nowadays?

We tried to get some, but all of the ones we could get were various levels of broken or unsupported.

palijer4y ago

That was not the author's main argument against proxies; that was just an additional point. You ignored the primary argument in your judgment.

>>Because I could not fully trust the other customers with whom I shared the proxy bandwidth. What if I share proxy servers with criminals that do more malicious stuff than the somewhat innocent SERP scraping?

RandomThrow3214y ago

Can they not call out a secondary point?

1 more reply

abc034y ago· 3 in thread

I scrape government sites a lot as they don't provide APIs. For mobile proxies, I use the proxidize dongles and mobinet.io (free, with Android devices). As stated in the article, with CGNAT it's basically impossible to block them, as in my case half the country couldn't access the sites anymore (if you place them in several locations and use one carrier each there).

exhilaration4y ago

Wow, this is super interesting:

https://proxidize.com/

https://mobinet.io/

I feel like I'm getting a glimpse into the dark underbelly of the web.

rdtwo4y ago

Is it just one ip per dongle at a time? Or can you have multiple ips on the same device.

abc034y ago

Just one IP at a time but you can change every 5 Min or more if you like

1 more reply

max0024y ago· 3 in thread

It's easy to detect Chrome headless, so scraping with it is not really how the "big" boys do it :D The only scrapers/bots that are really hard to detect are the ones running and controlling a real browser and not Chromium. I do a lot of research against antibot systems, sometimes on a Friday night. If you spend each one in the pub, it doesn't mean you're normal.

beaugunderson4y ago

puppeteer-extra and undetected-chromedriver beg to differ :)

max0024y ago

Not really, I did test it (and use it for some cases), but there are still sites that detect it. I can, and so can anyone who checks the WebGL renderer name, though this can be worked around by faking the driver name, but that's just one of many ways :) It's an ongoing fight. If you don't move your mouse, or type faster than 95% of my portal users, I can detect you with a JS script written in under 1 minute.

incolumitasOP4y ago

neals4y ago· 2 in thread

For a particularly hard to scrape website, using some kind of bot protection that I just couldn't reliably get past (if anybody wants to know what that was exactly, I'll go and check it), I now have a small Intel NUC running Firefox that listens to a local server and uses Tampermonkey to perform commands. Works like a charm and I can actually see what it's doing and where it's going wrong. (though it's not scalable, of course)

We use it for data-entry on a government website. A human would average around 10 minutes of clicking and typing, where the bot takes maybe 10 seconds. Last year we did 12000 entries. Good bot.

funnyflamigo4y ago

I'm curious what bot protection it was? It couldn't have been trying too hard unless you were employing multiple anti-fingerprinting techniques, I'm assuming you used firefox's built in anti-fingerprinting?

nkozyra4y ago

You can use chromium/chrome/cdp and turn headless off and see the same thing.

wilg4y ago· 2 in thread

Not the same kind of scraping, but does anyone have thoughts/resources/best practices for doing link previews (like Twitter/iMessage/Facebook)?

kall4y ago

You shouldn’t really need to do any scraping tricks to get that, because it’s data the websites (usually) want to give to bots. Or are people getting bot block screens from Cloudflare et al. for that basic action these days?

It should be a matter of a simple GET request to fetch plain HTML and parse the OpenGraph meta tags out of that. There are many open source libraries to do that for you depending on your language.
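
For illustration, the parsing half can be sketched with nothing but the Python standard library. The names `OpenGraphParser` and `parse_open_graph` are made up for this example; the HTML would come from the plain GET request mentioned above (for instance via `urllib.request.urlopen`), which is left out here so the sketch runs offline.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        if prop.startswith("og:") and attr.get("content") is not None:
            # Keep the first value seen for each property.
            self.tags.setdefault(prop, attr["content"])


def parse_open_graph(html):
    """Return a dict of OpenGraph properties found in `html`."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags
```

A link preview would then typically read `og:title`, `og:description` and `og:image` from the returned dict, falling back to the page `<title>` when they are missing.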

If bot blocks really are a problem, a SaaS solution like Microlink could probably do it for you.

wilg4y ago

Bot blocks are definitely an issue for certain sites, I've implemented it that way currently.

Microlink is a good tip, thanks!

InvOfSmallC4y ago· 1 in thread

Where I was working we stopped caring about IPs, browsers, etc. because it was just a race. What we did was analyze click behaviour and act on that. When we recognized a bot, we went on serving it a fake page. That cut costs a little because the fake pages were static. In general it took them a lot of time to discover the pattern, and it was way more manageable for us.

ahofmann4y ago

We did the same and the bot developers wrote bots that acted like humans. It took them not very long to find out.

kerokerokero4y ago· 1 in thread

Thanks for the share. Great stuff.

I used to scrape websites to generate content for higher SERPs.

Ended up going into the adult industry lols. (https://javfilms.net)

anon90014y ago

Neat! I've run across your site organically :P

I've always wondered, and since you're right here... how do sites like this make money?

It looks like you're probably crawling all the JAV vendors, finding free clips of today's releases, embedding them in your own site to draw traffic, and making money with affiliate links to buy the full content?

Am I missing anything? It seems hard to believe you'd get enough affiliate signups to make it worthwhile.

I can imagine your site as being a few hours a year of script maintenance and a money printer, or a 40hr/week SEO job with 1000s of similar sites across the adult industry.

I'd love to know anything you're willing to share about how the business works.

devops0004y ago· 1 in thread

Could you share your code for AWS Lambda and Puppeteer? It’s definitely interesting for other websites

incolumitasOP4y ago

Sure.

https://github.com/NikolaiT/Crawling-Infrastructure

And here I am writing about it (but it's quite old): https://incolumitas.com/2019/08/31/web-scraping-puppeteer-aw...

chrisMyzel4y ago

We are seeing a lot of bot traffic too but chose to accept it as reality. We are aware that if thousands of bots create unpredictable cost surges, there is something wrong with our product; it should not put such heavy loads on our servers in the first place to fulfil its mission.

I believe the future will make us more free by using more bot / AI technology since who wants to spend their whole day in front of a computer and research information if machines can do the job just fine?

throwaway9843934y ago

If you want to avoid bot detection, learn how bot detection works. A lot of commercial "webapp firewalls" and the like actually have minimum requirements before they flag certain traffic as a botnet; stay below those limits and you can keep hammering away. Sometimes those limits are quite high.

In the past we've had the most success defeating bots by just finding stupid tricks to use against them. Identify the traffic, identify anything that is correlated with the botnet traffic, and throw a monkey wrench at it. They're only using one User Agent? Return fake results. 90% of the botnet traffic is coming from one network source (country/region/etc)? Cause "random" network delays and timeouts. They still won't quit? During attacks, redirect to captchas for specific pages. During active attacks this is enough to take them out for days to weeks while they figure it out and work around it.

KuhlMensch4y ago

Doing a bit of low-stakes monitoring of webpages lately. It started (as I'm assuming it often does) with right-clicking a network request in Chrome and selecting "copy as curl"

Then graduated to JavaScript for surrounding logic e.g. data transformation

I had assumed I'd quickly give up and move to a headless browser, BUT I can't bring myself to move away from tiny CPU utilization of curl.

Throwing together a "plugin" probably takes me less than 20 minutes normally.

I'll probably have a look at using prowl to ping my phone.

And if I get more serious I'll look at auto authenticate options on npm. But I'm not sure if the overhead of maintaining a bunch of spoofy requests will be worth it.

DeathArrow4y ago

You can put some wasm crypto mining code and at least profit from bots. :D

mrg3_20134y ago

wow! That was an interesting read.

j / k navigate · click thread line to collapse

189 comments

86 comments · 17 top-level

ebbp4y ago· 27 in thread

scarygliders4y ago

Right with you there.

The scraper would systematically go through every single product page.

And by scraper, I mean - 100's of them. All at the same time, using the old trick of 1 scraper requesting 3 or 4 product pages at a time then pausing for a while.

They used umpteen different IP address blocks from all over the globe - but mainly using OVH vps IP address blocks from France.

But no. This shower of arseholes threw 100's of scraper "tasks" at the site. They got greedy.

Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

matheusmoreira4y ago

> Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

1 more reply

_lqaf4y ago

At their request, we built a method to flag accounts for data poisoning. Once flagged, those accounts would start getting plausible-ish looking garbage data.

It was pretty effective. One competitor went offline for a few days about a week after that started, and had a more limited offering when they came back up.

2 more replies

thatwasunusual4y ago

It sucks when this happens, but it's easily avoidable by using a caching frontend of some sort.

[0] https://varnish-cache.org/

funnyflamigo4y ago

> Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

Wait till you find out what half of Google's business is based on (spoiler - scraping).

I really don't think scraping itself is an issue 90% of the time. It's the behavior of the out of control scrapers that are the problem. A well behaved scraper should barely be noticeable, if at all.

2 more replies

serf4y ago

>Seriously, I hate scrapers. I hate the people who make scrapers. I hate their lack of ethics. Fuck those guys.

if a scraper is effectively DDoSing you, call it what it is -- a denial of service attack.

un-devmox4y ago

This and the post you are replying to both sound like sabotage by a competitor rather than legit data collecting.

mdoms4y ago

1 more reply

marginalia_nu4y ago

At the same time, even as someone who runs a web crawler, I have zero qualms about blocking misbehaving bots.

chillfox4y ago

I kinda feel like rate limiting your request to individual domains and IP addresses is an easy thing that goes a long way towards getting it right.

1 more reply

devwastaken4y ago

Not saying that doesn't suck - it does, it's why many ideas don't work in practice as an online service.

paco33464y ago

Andoryuuta4y ago

I'm really surprised that the JS challenges helped so much, given that there are open source libraries for bypassing them (e.g. cloudscraper[0]).

[0]: https://github.com/venomous/cloudscraper

1 more reply

EdwardDiego4y ago

What were they scraping, if I can ask? Was it targeted or just wget -r style?

1 more reply

RobSm4y ago

Why do you think those bots were scraping your data in the first place?

kulikalov4y ago

Why not create api endpoint and charge mild cost for that data? You’ll make money instead of spending it.

scarygliders4y ago

4 more replies

ebbp4y ago

krzyk4y ago

As a programmer that just sometimes wants to check if given item is available in store I would like to be able to use API for that. But if it is not available one has to scrape.

taytus4y ago

>where some amateur did real damage to us

If an amateur can do damage to you, then I have some bad news for you...

convolutionart4y ago

This is nonsense. It's always easier to destroy than to build/mantain. If you got any real advice, by all means...

Goronmon4y ago

If an amateur can do damage to you, then I have some bad news for you...

I believe the point wasn't surprise that damage occurred at all, but frustration that damage can occur just out laziness/ignorance rather than malice.

1 more reply

ebbp4y ago

It’s just selfish. If you’re going to take the product of other people’s work in a manner they don’t consent to, at least do it in a way that doesn’t cost them twice over.

jtdev4y ago

Considering the demand for your content, why haven’t you created and provided an API? Maybe you could monetize?

throwaway29934y ago

I asked them if my customer could pay to access this data point via their API and they quoted 3600 EUR/month! Enter the scraper...

ebbp4y ago

We do offer an API - the scrapers are trying to circumvent using that, presumably.

5 more replies

chewmieser4y ago

Like everyone and their brother has a web spider. And some of them are VERY badly designed. We block them when they use too many resources, although we'd rather just let them be.

Can't speak for the op but we have APIs and move the ones scraping and reselling our content to APIs. The majority are just a worthless suck on resources though.

joekrill4y ago· 11 in thread

A little pet-peeve I have is when an obscure(ish) acronym is used and never defined. Is SERP a well-known acronym? Perhaps this is a niche blog and I'm not the intended audience.

tptacek4y ago

Yes; a SERP is a Google search result page. It's the most important acronym in SEO.

nomdep4y ago

I don’t remember never ever hearing it and I’ve been in the industry for some time

3 more replies

marginalia_nu4y ago

The word SERP feels like a bit of a shibboleth for SEO-people. They seem to take it for granted, the rest of the world just looks puzzled when they hear it.

daveguy4y ago

I had to look it up.

SERP: Search Engine Results Page

fergie4y ago

Unintroduced acronyms should always be avoided.

praptak4y ago

Depends on the audience-acronym pair. I don't think HTTP needs an introduction in a technical article, OTOH (on the other hand ;) ) a general newspaper should probably expand HTTP but not WWW.

runnerup4y ago

nsotelo4y ago

As English speakers we often take for granted acronyms such as DB or even USA. For foreigners these can also be inscrutable.

wiether4y ago

On HN I'm used to SE meaning Software Engineer so I came up with "Software Engineer Ranting Board" before asking Google to give me the SERP that would provide me with the true meaning of SERP.

joncp4y ago

An all-too-common occurrence in HN comments as well.

bgroat4y ago

Not the OP, but I thought it was well known.

That said, I do a lot of SEO work.

Still, it should be best practice to define any acronym or initialism the first time you use it.

biosed4y ago· 9 in thread

Rate Limiting by login,

Limiting data to known workflows ...

It's a cool problem to tackle but it is just an arms race.
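A minimal sketch of what "rate limiting by login" could look like, assuming a token bucket kept per account. The class name, capacity, and refill rate are illustrative assumptions, not anything described in this thread, and a real deployment would keep the buckets in shared storage rather than in process memory.

```python
import time


class LoginRateLimiter:
    """Token bucket per login: bursts up to `capacity`, refills at `rate` tokens per second."""

    def __init__(self, capacity=5, rate=1.0, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._buckets = {}  # login -> (tokens, timestamp of last request)

    def allow(self, login):
        now = self.clock()
        # A login seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(login, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at the bucket size.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._buckets[login] = (tokens - 1, now)
            return True
        self._buckets[login] = (tokens, now)
        return False
```

Calling `allow(login)` before serving a request and answering HTTP 429 when it returns `False` is one way to wire this in. It only slows down scrapers that share an account, which is part of why this stays an arms race.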

rootusrootus4y ago

matheusmoreira4y ago

Yes. That's exactly what everyone should do.

chadwittman4y ago

The real Jedi move

gonzo414y ago

Like displaying a table with semantic elements, then divs, then using an iframe with css grid and floating values over the top.

This almost seems like a problem for AI to solve.

endymi0n4y ago

> It's a cool problem to tackle but it is just an arms race.

Plus, it's one you're going to lose. I was once asked at an All-Hands why we don't defend ourselves against bots even more vigorously.

My answer was: "Because I don't know how to build a publicly available website that I could not scrape myself if I really wanted to."

wolverine8764y ago

Is that legal? It would be a big blow to trust if I was the customer, but that's without knowing what you were selling and in what market.

killingtime744y ago

It’s legal if it’s in the contract. It's standard for contracts to allow for mistakes and confirmations of prices.

ransom15384y ago

histriosum4y ago

I think this falls into the "arms race" trap, though. If you can make an href invisible via CSS, then the scraper can certainly be written to understand CSS, and thus filter out the invisible hrefs..
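To illustrate the scraper's side of that, here is a rough sketch that drops links hidden with inline CSS or the `hidden` attribute. It is an assumption-laden simplification: it only inspects each `<a>` tag's own attributes, so links hidden by a stylesheet, a class, or a hidden ancestor get through. Catching those would take computed styles from a real browser. The names `VisibleLinkParser` and `visible_links` are made up for this example.

```python
import re
from html.parser import HTMLParser

# Inline declarations that make an element invisible (deliberately not exhaustive).
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


class VisibleLinkParser(HTMLParser):
    """Collects hrefs from <a> tags that are not hidden via their own attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        hidden = "hidden" in attrs or HIDDEN_STYLE.search(attrs.get("style") or "")
        if href and not hidden:
            self.links.append(href)


def visible_links(html):
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links
```

The site can respond by moving the rule into an external stylesheet, the scraper by rendering the page, and so on, which is the arms race in miniature.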

hall0ween4y ago· 5 in thread

Basic question, how does one profit from scraping data and what kinda data?

Taking a stab at answering it: you scrape the data and build a business around selling it. Stock prices? But that's boring, plus how many others are doing it? I bet a lot.

throw12346512344y ago

hall0ween4y ago

ahh thanks

3234y ago

These are scraping artificially limited releases of clothes/shoes. You buy a shoe at $100 and immediately sell it at $1000.

Artificial scarcity - every week you release a "limited edition item", but if you do the math, it's not limited edition at all if you integrate over a year.

ushtaritk4214y ago

Anything you might look up or keep track of online that helps your business is probably being scraped by someone who is using it themselves or selling access to the curated data set.

ushtaritk4214y ago

Here's a project that's been in the news recently that relies heavily on scraped data. http://www.thebillionpricesproject.com/

IceWreck4y ago· 4 in thread

The author says proxies are expensive and then proceeds to spend a shitton of money buying all that hardware.

incolumitasOP4y ago

4G proxies are just so much better than so-called "residential" or straight datacenter proxies. It makes sense to create your own 4G proxy farm if you conduct business in that area.

With only 10 dongles and 10 data plans, you can have a lot of IP addresses that are extremely hard to block. It's a one-time investment, whereas paying proxy providers is an ongoing fixed cost.

bsder4y ago

Where do you get 4G dongles that don't suck nowadays?

We tried to get some, but all of the ones we could get were various levels of broken or unsupported.

palijer4y ago

That was not the author's main argument against proxies; that was just an additional point. You ignored the primary argument in your judgment.

RandomThrow3214y ago

Can they not call out a secondary point?

abc034y ago· 3 in thread

exhilaration4y ago

Wow, this is super interesting:

https://proxidize.com/

https://mobinet.io/

I feel like I'm getting a glimpse into the dark underbelly of the web.

rdtwo4y ago

Is it just one IP per dongle at a time? Or can you have multiple IPs on the same device?

abc034y ago

Just one IP at a time, but you can change it every 5 minutes or more if you like.

max0024y ago· 3 in thread

beaugunderson4y ago

puppeteer-extra and undetected-chromedriver beg to differ :)

max0024y ago

incolumitasOP4y ago

neals4y ago· 2 in thread

We use it for data-entry on a government website. A human would average around 10 minutes of clicking and typing, where the bot takes maybe 10 seconds. Last year we did 12000 entries. Good bot.

funnyflamigo4y ago

nkozyra4y ago

You can use chromium/chrome/cdp and turn headless off and see the same thing.

wilg4y ago· 2 in thread

Not the same kind of scraping, but does anyone have thoughts/resources/best practices for doing link previews (like Twitter/iMessage/Facebook)?

kall4y ago

It should be a matter of a simple GET request to fetch plain HTML and parse the OpenGraph meta tags out of that. There are many open source libraries to do that for you depending on your language.

If bot blocks really are a problem, a SaaS solution like Microlink could probably do it for you.
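As a rough illustration of the "simple GET plus OpenGraph meta tags" approach, here is a sketch using only the Python standard library. The function names and the User-Agent string are placeholders invented for this example. It ignores redirect edge cases, non-HTML responses, and sites that only emit `name=` instead of `property=` attributes.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        content = attrs.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value if a property is repeated.
            self.tags.setdefault(prop[3:], content)


def parse_open_graph(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags


def fetch_preview(url, timeout=10):
    # Hypothetical helper: one plain GET, then parse. The User-Agent is a placeholder.
    req = Request(url, headers={"User-Agent": "link-preview-sketch/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return parse_open_graph(resp.read().decode(charset, errors="replace"))
```

`fetch_preview(url)` would return something like a dict with `title`, `description`, and `image` keys when the page provides them. Pages behind bot protection will typically fail at the GET step, which is where a hosted service comes in.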

wilg4y ago

Bot blocks are definitely an issue for certain sites, I've implemented it that way currently.

Microlink is a good tip, thanks!

InvOfSmallC4y ago· 1 in thread

ahofmann4y ago

We did the same and the bot developers wrote bots that acted like humans. It took them not very long to find out.

kerokerokero4y ago· 1 in thread

Thanks for the share. Great stuff.

I used to scrape websites to generate content for higher SERPs.

Ended up going into the adult industry lols. (https://javfilms.net)

anon90014y ago

Neat! I've run across your site organically :P

I've always wondered, and since you're right here... how do sites like this make money?

Am I missing anything? It seems hard to believe you'd get enough affiliate signups to make it worthwhile.

I can imagine your site as being a few hours a year of script maintenance and a money printer, or a 40hr/week SEO job with 1000s of similar sites across the adult industry.

I'd love to know anything you're willing to share about how the business works.

devops0004y ago· 1 in thread

Could you share your code for AWS Lambda and Puppeteer? It’s definitely interesting for other websites.

incolumitasOP4y ago

Sure.

https://github.com/NikolaiT/Crawling-Infrastructure

And here I am writing about it (but it's quite old): https://incolumitas.com/2019/08/31/web-scraping-puppeteer-aw...

chrisMyzel4y ago

throwaway9843934y ago

KuhlMensch4y ago

Doing a bit of low-stakes monitoring of webpages lately. It started (as I'm assuming it often does) with right-clicking a network request in Chrome and selecting "copy as curl"

Then graduated to JavaScript for surrounding logic e.g. data transformation

I had assumed I'd quickly give up and move to a headless browser, BUT I can't bring myself to move away from the tiny CPU utilization of curl.

Throwing together a "plugin" probably takes me less than 20 minutes normally.

I'll probably have a look at using prowl to ping my phone.

And if I get more serious I'll look at auto authenticate options on npm. But I'm not sure if the overhead of maintaining a bunch of spoofy requests will be worth it.
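For readers wondering what the "surrounding logic" of such a low-stakes monitor might amount to, here is a hedged sketch of the change-detection part only. It is written in Python rather than the curl-plus-JavaScript setup described above, and the function names and JSON state file are assumptions made for the example. Fetching the page and sending the notification are left out.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(body: bytes) -> str:
    """Hash the response body so only a short digest has to be stored."""
    return hashlib.sha256(body).hexdigest()


def check_for_change(name, body, state_file):
    """Return True if `body` differs from the last body recorded under `name`."""
    path = Path(state_file)
    state = json.loads(path.read_text()) if path.exists() else {}
    digest = fingerprint(body)
    # The first sighting of a page is recorded but not reported as a change.
    changed = name in state and state[name] != digest
    state[name] = digest
    path.write_text(json.dumps(state))
    return changed
```

Hashing the whole body will flag every rotating ad or timestamp, so in practice one would probably extract just the field of interest before fingerprinting it.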

DeathArrow4y ago

You can put some wasm crypto mining code and at least profit from bots. :D

mrg3_20134y ago

wow! That was an interesting read.
