0 points · scarygliders · 4y ago

Do you honestly believe all site scraper people/companies are ethical enough to go to whoever pays /them/ to scrape data from a competitor's site and say "oh they offer an API to access this data let's pay for that", instead of "why pay for that data when we can scrape it right off their site"?

Also, not all types of company will provide API endpoints. It all depends on the type of site - for example, an online shop might not wish to provide easily accessible data on offered products and prices, to their competitors who may wish to undercut them. Why would an online shop do that?

15 comments · 4 top-level

jadell · 4y ago · 6 in thread

I run a large scraper farm against several large sites. They're not online shops, and we don't compete with them. But they do have hundreds of thousands of data points that we use to provide reports and analytics for our clients, who also do not compete with the sites.

I absolutely would pay for an API that provides that data. I'd be willing to pay 10x more than the cost of maintaining and running the scrapers.

But the sites being scraped have no interest in that.

CWuestefeld · 4y ago

Have you tried approaching those sites and asking them to provide an API, pointing out that it would be easier for both of you in the long run? Or are you just assuming they wouldn't do it?

Because right now, the bots - which comprise probably 2/3 of my traffic - are causing me huge headaches, and I sure wish that the people running them would tell me what the heck they want.

jadell · 4y ago

Yes, we have. And no, they are not interested.

texasbigdata · 4y ago

Building and maintaining the scraper is not the cost they would use to measure it internally. It’s the cost to build the API and support it, plus perhaps any perverse incentive it creates where even more data flows out to competitors.

jadell · 4y ago

For all intents and purposes, this isn't competitive data for them. There aren't really competitors in the space anyway, the barrier to entry is ridiculous. In fact, by law, operators in the industry are required to share this particular data with each other and industry regulators. But they don't share it with outside parties in the aggregate form we need it in. Hence, the scraping.

RobSm · 4y ago

Building an API is 5 times easier than building routes for your public webpages, which are basically an 'API' as well.

wolverine876 · 4y ago

And the cost of being scraped.

zivkovicp · 4y ago · 5 in thread

Well, you don't need an api, just a CSV file with a catalog.

The scraping company WILL use the API/CSV file... they will probably also still charge their customer for scraping, so it's a win-win :D

You can think of it this way, the prices and product data are publicly visible already on the website, there are no real secrets, none of it is password protected.

You can be principled and insist on blocking bots and spend a lot of time and money on tools, people, and ultimately hosting because the bots will always win; or you can offer the data for free/minimal fee and serve it with almost zero cost and cache it so you can do that with a micro sized server.

You can always lie about some of the prices if you want, but you will just encourage bots again.

Ethics are nice, but let's be honest, very lacking. Sometimes it's better to be pragmatic.
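
To make the "CSV file with a catalog" idea above concrete, here is a minimal sketch, not anything the commenter posted. It assumes a tiny in-memory product list, a hypothetical `/catalog.csv` path, and a one-hour cache lifetime; the helper names `build_catalog_csv`, `catalog_response`, and `CatalogHandler` are made up for illustration. The catalog is rendered once, and repeat visitors that send a matching `If-None-Match` header get a bodyless 304, which is what keeps the serving cost low.

```python
import csv
import hashlib
import io
from http.server import BaseHTTPRequestHandler

# Hypothetical catalog rows; a real shop would export these from its database.
PRODUCTS = [
    {"sku": "A-100", "name": "Widget", "price": "9.99"},
    {"sku": "B-200", "name": "Gadget", "price": "24.50"},
]


def build_catalog_csv(products):
    """Render the catalog once as CSV bytes so every request can reuse it."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["sku", "name", "price"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(products)
    return buf.getvalue().encode("utf-8")


def catalog_response(body, if_none_match=None):
    """Return (status, headers, payload) for a catalog request.

    Clients that already hold the current version get a bodyless 304.
    """
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {
        "ETag": etag,
        "Cache-Control": "public, max-age=3600",  # assumed one-hour freshness
    }
    if if_none_match == etag:
        return 304, headers, b""
    headers["Content-Type"] = "text/csv; charset=utf-8"
    headers["Content-Length"] = str(len(body))
    return 200, headers, body


CATALOG = build_catalog_csv(PRODUCTS)


class CatalogHandler(BaseHTTPRequestHandler):
    """Serves the prebuilt catalog at the (hypothetical) path /catalog.csv."""

    def do_GET(self):
        if self.path != "/catalog.csv":
            self.send_error(404)
            return
        status, headers, payload = catalog_response(
            CATALOG, self.headers.get("If-None-Match")
        )
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, something like `HTTPServer(("", 8000), CatalogHandler).serve_forever()` from `http.server` would do; in practice a static file behind a CDN would likely achieve the same effect with even less effort.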

scarygliders (OP) · 4y ago

> You can think of it this way, the prices and product data are publicly visible already on the website, there are no real secrets, none of it is password protected.

There's the problem right there. The prices and product data are publicly visible - because there is a target audience of /humans/ for whom the site is designed and by whom it is intended to be used. The site is not there to cater for a competitor's scrapers.

I don't care how much people couch their unethical behaviour in "the data is publicly available", the basic fact is most if not all websites exist for human eyeballs to look at them. They do not exist for arseholes to DOS them by inundating them with scrapers.

zo1 · 4y ago

From my perspective, the problem is that the data that is offered isn't really "for humans". The data is for convincing the humans to buy/pay or worse, browse and watch ads as a result.

But overall, information is one of those goods that has intrinsic properties like no other. It can be copied, infinitely. And we haven't yet figured out the dynamics of how to reason about it, so it feels like we're pretending they're physical goods.

Edit. Side note. I'd go further and say that some of the data is even worse, it's "offered" with the real intention being to confuse the users into performing non-optimally in the market. Look at Amazon/Ebay/AliExpress/Google listings for evidence of that. Just take Google - Google is an ML and scraping powerhouse, and the best they can muster is to be spammed with fake websites and duplicate/confusing listings.

TeMPOraL · 4y ago

> the basic fact is most if not all websites exist for human eyeballs to look at them.

There's a whole ethical subthread here of websites trying to make the experience for those humans miserable, and taking away the agency necessary to protect oneself from that. A browser is a user agent. So is a screen reader. So is a script one writes to not deal with bullshit fluff, when all one wants is a simple table of products, features and prices.

zivkovicp · 4y ago

I agree 100%, but it is a fact of life, and sometimes it's better to just minimize the fuss and focus on the things that matter.

Your argument is perfectly valid and applies to offline activities as well (what stops a competitor from walking through the aisles of a Walmart or Costco?), but this is a battle that can't be won, there are too many parasitic actors. It is human nature.

0xdeadbeefbabe · 4y ago

Let's not encourage these unethical people to even think of using human eyeballs and manual data entry for their scraping instead of bots. That sounds pretty darn unethical.

matheusmoreira · 4y ago

> Why would an online shop do that?

Because otherwise the HTML will become the API.

kulikalov · 4y ago

Ethical - of course not. Practical.

Valuable public data is going to be scraped - this is inevitable. Even paywalled or signup protected valuable data is going to be scraped.

Why not sell valuable data for a reasonable price, then?
