
0 points · minimaxir · 9y ago · 0 comments

And you don't see this as unethical?

4 comments · 1 top-level

cglee · 9y ago · 3 in thread

This is a huge topic of debate, but the consensus from "experts" seems to be that if it's public data that's not behind some auth, then it's fair. By the way, this is what web crawlers, like Google's crawler, do.

See: https://www.quora.com/Is-website-scraping-legal-and-ethical

But I don't necessarily want to go down this rabbit hole here, as it detracts from the interesting technical issues outlined in the article. If you haven't read the entire article, I recommend that you do, because they had to overcome many interesting technical challenges while working on this.

minimaxir (OP) · 9y ago

A little web scraping is fine, if and only if the website lacks a canonical way to get the data (i.e. an API). Ignoring the API and scraping hard enough that you have to write an article assessing the scalability of the scraping is bad.

michaelrm · 9y ago

A bit more context may be helpful. While scraping scalability was a concern for us (for latency reasons), it was only significant when retrieving historical data as a one-time job, after which we throttled back. A later segment talks about a priority queue object (SRRPQQ) which we use in lieu of scaling further.

cglee · 9y ago

It's weird that you're dictating rules about scraping that aren't outlined in GitHub's TOS.

The article actually mostly talks about how they scaled working with the accumulated data, and only a little about scaling the scraper.

Just as the interesting part about Google is how they process and index the data, not their crawler, the most interesting part of this article is not the scraping but how they handled and processed all the data.

Edit: I found GitHub's policy on scraping. I hope the link brings some closure to this concern: https://help.github.com/articles/github-terms-of-service-dra...
