
madamelic · 9y ago · 0 points

> Scraping against the TOS is super bad netizen stuff, and I don't think people should be posting positive reviews of people doing this. Breaking captchas and the like is basically blackhat work and should be looked down upon, not congratulated as I see in this thread.

Not really.

Scraping, in my opinion, isn't black hat unless you are actually affecting their service or stealing info.

If you are slamming the site with requests because of your scraping, yeah you need to knock it off. If you throttle your scraper in proportion to the size of their site, you aren't really harming them.
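
To make "throttle your scraper" concrete, here is a minimal sketch of a fixed-interval throttle in Python. The `throttled` helper, its `fetch` callback, and the 5-second default are illustrative assumptions, not anything taken from the scraper under discussion; a sensible interval depends on the site.

```python
import time
from typing import Callable, Iterable, Iterator


def throttled(urls: Iterable[str],
              fetch: Callable[[str], str],
              min_interval: float = 5.0,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield fetch(url) for each URL, leaving at least min_interval
    seconds between the start of one request and the start of the next."""
    last_start = None
    for url in urls:
        if last_start is not None:
            # Sleep only for whatever part of the interval has not elapsed yet.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        yield fetch(url)
```

`fetch` is whatever function actually performs the request; `clock` and `sleep` are injectable only so the pacing can be checked without really waiting.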

With regard to "stealing info", as long as you aren't taking info and selling it as your own (which it seems OP is indeed doing), that is just fine.

tl;dr: Scraping isn't bad / blackhat as long as you aren't affecting their service or business.

16 comments · 3 top-level

muglug · 9y ago · 10 in thread

> If you throttle your scraper in proportion to the size of their site, you aren't really harming them.

And do you understand their site infrastructure to know whether you're doing harm? It's perfectly possible that your script somehow bypasses safeguards they had in place to deal with heavy usage, and now their database is locking unnecessarily.

cookiecaper · 9y ago

Eh, this is pretty weak. Scrapers are no different from other browsing devices. The web speaks HTTP. There's no reason that using another HTTP browser would cause any disparate impact just by virtue of not being a conventional desktop browser -- you've thrown out a pretty absurd hypothetical. In fact, scrapers usually cause less impact because they usually don't download images or execute JavaScript.

I did an analysis and a session browsed with my specialized browser would always consume less than 100K of bandwidth (and often far less), whereas a session browsed with a conventional desktop browser would consume at least 1.2 MB, even if everything was cached, and sometimes up to 5 MB. In addition, on the desktop, a JavaScript heartbeat was sent back every few seconds, so all of that data was conserved too.

Because we were a specialized browser used by people looking for a very specific piece of data, we could employ caching mechanisms that meant that each person could get their request fulfilled without having to hit the data source's servers. We also had a regular pacing algorithm that meant our users were contacting the site way less than they would've been if they were using a conventional desktop browser.
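
As a rough illustration of the kind of shared caching described here, the following is a minimal time-to-live cache sketch in Python. It is not the service's actual mechanism, which the comment does not describe; `TTLCache`, `fetch`, and the 300-second lifetime are all assumed names and values.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Serve repeated lookups from memory and go upstream only when
    the stored copy is missing or older than ttl_seconds."""

    def __init__(self,
                 fetch: Callable[[str], str],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)
        self.upstream_requests = 0

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no request reaches the source site
        self.upstream_requests += 1
        value = self._fetch(key)
        self._store[key] = (now + self._ttl, value)
        return value
```

With a cache like this in front of the data source, many users asking for the same record within the lifetime window cost the source one request rather than one request each.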

Our service saved the data source a large amount of resource cost. When we were shut down, their site struggled for about two weeks to return to stability. I think they had anticipated the opposite effect.

Our service also saved our users a large amount of time. We were accessing publicly-available factual data that was not copyrightable (but only available from this one source's site). There's no reason that the user should be able to choose between Firefox and Chrome but not a task-specialized browser.

It is true that some people will (usually accidentally) cause a DDoS with scrapers because the target site is not properly configured, but the same thing could be done with desktop browsers. It doesn't mean that scrapers should be disadvantaged.

hchasestevens · 9y ago

A small counterpoint to this -- in the airline industry, it's relatively commonplace for seat reservations to be made for a user _before_ payment has occurred. In this case, if you're mirroring normal browser activity, you can (temporarily) reduce availability on a flight, potentially even bumping up the price for other, legitimate users, and almost certainly causing the airline to incur costs beyond normal bandwidth and server costs. I'm sure there are many other domains for which this is also the case, however rare.

muglug · 9y ago

> you've thrown out a pretty absurd hypothetical

Not even remotely absurd. Where is the data your scraper is consuming coming from? It's almost always served from some sort of data repository (SQL or otherwise). That data costs far more per MB to serve up quickly than JS/CSS/images.

Suppose, for example, you host a blogging platform that has one very popular user. Most accounts on your site don't get a ton of visitors, and that one very popular user's posts are all stored in cache.

Then along comes a scraper. He thinks, "Hey, this site is serving up a million page impressions a day. It can definitely handle me scraping the site".

But when he runs the scraper, he fills up the cache with a ton of data that it doesn't need, causing cache evictions and general performance degradation for everyone else.
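
The cache-pollution scenario above can be reproduced with a toy LRU cache. This is only a sketch under assumptions: the cache size (100 entries), the 50 popular pages, and the 1,000 long-tail pages are made-up numbers, and a real site's cache may use a different eviction policy.

```python
from collections import OrderedDict


class LRUCache:
    """Least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: "OrderedDict[str, bool]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> None:
        if key in self.items:
            self.items.move_to_end(key)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1  # a miss stands in for a trip to the database
            self.items[key] = True
            if len(self.items) > self.capacity:
                self.items.popitem(last=False)  # evict least recently used


def popular_hit_rate(crawl_pages, popular_pages, capacity=100):
    """Warm the cache with popular pages, run a crawl, then measure
    how many popular-page requests are still served from cache."""
    cache = LRUCache(capacity)
    for page in popular_pages:
        cache.get(page)
    for page in crawl_pages:
        cache.get(page)
    cache.hits = cache.misses = 0
    for page in popular_pages:
        cache.get(page)
    return cache.hits / len(popular_pages)


popular = [f"popular/{i}" for i in range(50)]
long_tail = [f"tail/{i}" for i in range(1000)]

print(popular_hit_rate([], popular))         # 1.0: no crawl, everything stays cached
print(popular_hit_rate(long_tail, popular))  # 0.0: the crawl evicted every popular page
```

The crawl itself is slow-path traffic, and afterwards the regular visitors pay for it too, which is the degradation being described.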

DashRattlesnake · 9y ago

> I did an analysis and a session browsed with my specialized browser would always consume less than 100K of bandwidth (and often far less), whereas a session browsed with a conventional desktop browser would consume at least 1.2 MB, even if everything was cached, and sometimes up to 5 MB. In addition, on the desktop, a JavaScript heartbeat was sent back every few seconds, so all of that data was saved too.

Bandwidth is certainly part of it, but there's also database and app-server load (which may be the actual bottleneck) that a scraper isn't necessarily bypassing.

chucksmash · 9y ago

Have run into exactly this before. Wrote a scraper that retrieved results from a trivia league website. Tried to be a polite scraper (<1 request per second) but the site still crashed - even with 5 seconds of sleep between requests. They were doing something weird with DB connection management (maybe just forgetting to close it and letting it time out? I remember figuring it out but it's been quite a while) and so after N very reasonably spaced queries the site would reproducibly start throwing an uncaught MAX_DB_CONNECTIONS_EXCEEDED and just be down for everybody everywhere who might've wanted to use it.

baddox · 9y ago

It seems like you could easily hit those scaling issues by manually browsing the website. While I agree that it sucks to take down a site by scraping, in that specific case it sounds like the performance issues are their fault and not yours. That said, once I realized the effect my scraping had, I would (hopefully) cease my scraping.

chucksmash · 9y ago

Now that I think about it a bit more, I think my hypothesis was that DB connections were allocated at the session level and that without cookies enabled each request initiated a new session.

I'd consider that a bug not a feature but I still think it's incumbent on me, the guy scraping the website, not to trigger it.
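
The hypothesis above (one DB connection per session, and a new session for every cookieless request) can be sketched as a toy model. Everything here is assumed for illustration: `ToySite`, the limit of 10 connections, and the absence of timeouts are not details of the actual trivia site.

```python
import itertools


class ToySite:
    """Toy model: each new session holds one DB connection and never releases it."""

    def __init__(self, max_db_connections: int) -> None:
        self.max_db_connections = max_db_connections
        self.open_sessions = set()
        self._ids = itertools.count(1)

    def handle(self, session_cookie=None):
        """Serve one request; return the session cookie the client should send back."""
        if session_cookie in self.open_sessions:
            return session_cookie  # known session: reuse its connection
        if len(self.open_sessions) >= self.max_db_connections:
            raise RuntimeError("MAX_DB_CONNECTIONS_EXCEEDED")
        new_id = next(self._ids)
        self.open_sessions.add(new_id)
        return new_id


def scrape(site: ToySite, n_requests: int, keep_cookies: bool) -> int:
    """Make n_requests and return how many sessions the site ended up holding."""
    cookie = None
    for _ in range(n_requests):
        returned = site.handle(cookie)
        if keep_cookies:
            cookie = returned  # send the cookie back on the next request
    return len(site.open_sessions)


print(scrape(ToySite(10), 100, keep_cookies=True))  # 1: a single session is reused
# scrape(ToySite(10), 11, keep_cookies=False) raises RuntimeError on the 11th request
```

In this model the spacing between requests does not matter at all, only their count, which would be consistent with the site failing even at 5 seconds between requests. A scraper that keeps a cookie jar would sidestep the problem.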

bigiain · 9y ago

I mostly agree with the post's author on the "I'm just automating something I'd otherwise be doing manually" argument. If the local weather service publishes, say, barometric charts on their site - but has a TOS that prohibits me from scraping, and my alternative was to just hit their site every day and right-click and save-as on the chart - I feel absolutely no compunction about automating that. You need to be careful of the slippery slope though: when it's easy to grab every day's local barometric chart, it becomes too easy to think "Hey, I just need to stick that in a loop and I can grab 1000 different charts every day!". I'd personally _not_ do that. If it's something I'm likely to do "by hand" but would occasionally miss a day or three, I'll automate it no matter what the TOS says.

antisthenes · 9y ago

You're saying the site should pay him a consulting fee for the free load-testing service he provides?

wwweston · 9y ago

One fair baseline is whether or not the custom User Agent you're using to scrape has request timing that's on the order of what a fast human visitor might do. If the site can't handle that, it's certainly not the UA's fault.

minimaxir · 9y ago · 3 in thread

> tl;dr: Scraping isn't bad / blackhat as long as you aren't affecting their service or business.

Analyzing data that you're not allowed to access gives you/your company a competitive advantage, which is affecting their service/business even if it's not posted/distributed publicly.

Nicksil · 9y ago

I don't follow your argument. How does one get their scraper access to data they would otherwise not be able to access through 'normal' browsing techniques?

Raphmedia · 9y ago

Example I know of: You can scrape your competitor's Facebook pages since their creation and output nice graphs of which posts generated what kind of likes and subscriptions. This data is usually limited to the owner of the page.

ori_b · 9y ago

By ignoring robots.txt and bypassing captchas.
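
For contrast, honoring robots.txt takes only a few lines with Python's standard `urllib.robotparser`. This is a minimal sketch: the `allowed` helper, the `my-scraper` user agent, and the example rules are assumptions, and the rules are parsed from a string here so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = """\
User-agent: *
Disallow: /private/
"""

print(allowed(rules, "my-scraper", "https://example.com/public/page"))   # True
print(allowed(rules, "my-scraper", "https://example.com/private/data"))  # False
```

A scraper that skips this check, or that fetches the disallowed path anyway, is doing what the comment describes.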

sjwright · 9y ago

> Scraping, in my opinion, isn't black hat unless you are ... stealing info.

And as a webmaster, how can I tell the difference before it's too late?
