Web Scraping in 2016

lerpa9y ago

> Who is anyone to tell me what I can and can't automate in my life?

You are exactly right. But although a site can deny you access for any arbitrary reason (it's their website, after all), governments obviously think they are the ones to enforce this crap.

What if the ToS say you can only access a site while jumping hoops? Only read the ToS after a while and wasn't hooping? Well too bad, now you are being sued for reading the main page _and_ the ToS page without jumping around.

This comment Terms of Service: If you read any of this text you owe lerpa $1.000.000 to be paid up until 09/01/2016.

tangue9y ago

It would be OK if it weren't "You can't scrape my site. Unless of course you're Google." This double standard drives me mad.

sjwright9y ago

As the owner of a large website, I don't care what you think. I block by default and whitelist when I decide it's in my interest.

If you don't think this is reasonable, chances are you've never run a large website, or analyzed the logs of a large website. You'd be astonished how much robotic activity you'll receive. If left unchecked it can easily swamp legitimate traffic.

Unless you have a way for me to automatically identify "honourable" scrapers such as yourself as distinct from the thousands upon thousands of extremely dodgy scrapers from across the world, my policy shall remain.

Analemma_9y ago

Why is it a double standard? Google scraping usually benefits the site with increased traffic and revenue, in a way most other scraping does not. Saying "you can scrape me if it benefits me" isn't totally in keeping with the principles of the open web, but it's not hypocritical.

_c_9y ago

Worse is that Google tries to stop scraping. It's like they don't want anyone to see past the first page of results.

They could scrape your website, and then they prevent you from scraping your own data back.

The whole process is silly; it reflects the duct tape and chicken wire nature of the www.

No one should have to "scrape" or "crawl".

Data should be put into an open universal format (no tags) and submitted when necessary (rsynced) to a public access archive, mirrored around the world.

This to bridge the gap until we reach a more content addressable system (cf. location based).

Clients (text readers, media players, whatever) can download and transform the universally formatted data into markup, binary, etc. -- whatever they wish, but all the design creativity and complexity of "web pages" or "web apps" can be handled at the network edge, client-side.

"Crawling" should not be necessary.

No one should have to store HTML tags and other window dressing for data.

Dream on.

codeddesign9y ago

Double standard? The difference is that Googlebot is built to be unobtrusive. I can easily build a scraper that will quickly DDoS a site. Take LinkedIn, for example: if they allow 10,000 people to send 100 scraping requests per second every day, that is stolen bandwidth that LinkedIn has to pay for, and the scrapers get free data. Google has standards that sites usually benefit from, not to mention that they let you disallow their bot. It just doesn't work the same way with some random developer building a scraper.

desireco429y ago

Google and others with legitimate reasons obey robots.txt.
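For readers who have not seen what obeying robots.txt looks like in code, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `examplebot` user agent, and the example.com URLs are all made up for illustration; a real crawler would fetch the file from the target site instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
allowed_public = may_fetch(parser, "examplebot", "https://example.com/public/page")
allowed_private = may_fetch(parser, "examplebot", "https://example.com/private/data")
delay = parser.crawl_delay("examplebot")  # seconds requested by the site, or None
```

A polite scraper would skip any URL for which `may_fetch` returns False and wait at least `delay` seconds between requests.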

hobs9y ago

This post is kind of crazy, aggrandizing bad behavior and misuse of other's resources against their will.

Scraping against the TOS is super bad netizen stuff, and I don't think people should be posting positive reviews of people doing this. Breaking captchas and the like is basically blackhat work and should be looked down upon, not congratulated as I see in this thread.

madamelic9y ago

>Scraping against the TOS is super bad netizen stuff, and I don't think people should be posting positive reviews of people doing this. Breaking captchas and the like is basically blackhat work and should be looked down upon, not congratulated as I see in this thread.

Not really.

Scraping, in my opinion, isn't black hat unless you are actually affecting their service or stealing info.

If you are slamming the site with requests because of your scraping, yeah you need to knock it off. If you throttle your scraper in proportion to the size of their site, you aren't really harming them.

In regards to "stealing info", as long as you aren't taking info and selling it as your own (which it seems OP is indeed doing), that is just fine.

tl;dr: Scraping isn't bad / blackhat as long as you aren't affecting their service or business.
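As a rough illustration of what throttling a scraper can mean in practice, the sketch below enforces a minimum gap between consecutive requests. The `Throttle` class and the 2-second interval are assumptions made for this example, not anything from the article; the right interval depends on the site.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time at which the previous request was released

    def wait(self):
        """Block until min_interval has passed since the last call; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


# Usage: call throttle.wait() before every request to the same site.
throttle = Throttle(min_interval=2.0)
```

Calling `throttle.wait()` before each fetch caps the request rate at one per `min_interval` seconds, however fast the rest of the scraper runs.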

angry_octet9y ago

'Bad netizen stuff'? Is this a comment from 1997? Breaking captchas is 'blackhat'? What cozy hippy internet alternative reality does this come from?

These are the same websites and companies that are loading evercookies and doing browser fingerprinting, that break as much as possible the anonymity citizens should enjoy, with Real Name policies, using network analysis to find who your friends are and what your politics and buying habits are, that routinely rip private information from your cell phone and share it with oppressive regimes.

You're not in Kansas anymore Toto.

malz9y ago

> misuse of other's resources against their will

Nonsense, there is no implication that this activity is illicit. Many sites (I have worked with hundreds) are happy to be included in my service, but don't have the technical ability to provide a data feed. They were delighted when I told them I could aggregate their content without any extra work on their part.

We respect TOS, we respect robots.txt and so on. Just because you study scraping techniques doesn't mean you intend to break the law.

> Breaking captchas and the like is basically blackhat work

Um, captchas only work if they work. If breaking them is trivial, they shouldn't exist. Don't shoot the messenger for pointing out the front door is unlocked.

Fiahil9y ago

Sometimes, scraping a website is the only way at your disposal to fetch relevant information, you've paid for (or not). It could be simply the opening schedule of your local administration or the status of different pieces of public infrastructure.

If your administration doesn't have the resources (and that's often the case) to maintain a proper JSON API for you to fetch with a fancy Python lib, then it's not "super bad netizen stuff" to scrape a few HTML/PDF/XLS files, parse them, and display them for convenient public consumption on your personal website (while paying for the bandwidth).

It's 2016. State-companies holding a third party responsible for their own outages and poor planning is _bad faith_[1]. ETL? Never heard of it?

[1]: https://citymapper.com/i/1208/soutenez-citymapper-et-lopen-d... (French)

pierrebai9y ago

To know the TOS of a page, you need to read it. To know which links are part of a site and which are not, you need to follow the link. Having a TOS as part of the page content is akin to having a sign in a room, only readable by entering the room, that says "you are not allowed to enter this room".

Yes, this defense is being petty about details, but I find businesses using post-hoc discoverable limitations to limit people's rights annoying.

Mikushi9y ago

Instagram or Facebook, they thrive on stolen or relinked content and monetize it day in day out.

Being amazed at this kind of bad behaviour where the targets are some of the most despicable companies on the web is a bit ironic. Scrape away, these companies hurt the web, let's hurt them (even though, all the scraping in the world won't have any impact).

flukus9y ago

> This post is kind of crazy, aggrandizing bad behavior and misuse of other's resources against their will.

How so? I send a web request, they send me the content in a response. If they aren't happy with that then they should refuse my request.

TeMPOraL9y ago

I disagree. DOSing a site is bad behaviour, regardless of how you do this. But accessing it in an automated way instead of a browser? Not really. The deal on the Internet is like this: a website owner can provide whatever they want, and a visitor can read it however they want. Discriminating visitors based on whether or not they seem to be bots instead of people is going beyond what the site provider should do. So is detecting and blocking people using adblockers.

https://en.wikibooks.org/wiki/UK_Database_Law#Database_Right

madshiva9y ago

So you think people wasting their time solving captchas is the solution? People are paid to solve captchas; there's always something like that, and then the users suffer more. It's not a solution at all.

twa9279y ago

How can a TOS have legal power in the case of scraping? A website is a public property. If I'm visiting it without logging in, I don't have a chance to accept the TOS.

Imagine a hotel that makes guests sign a document saying they will not make photographs of the building. If I'm not a guest, I can take photographs of it and I can't even know that would be illegal.

buro99y ago

The UK has a database law:

If you scrape, and effectively reconstitute a database, then so long as the database originally had a "substantial investment" in its "obtaining, verifying or presenting the contents" then yup... you have breached the database right, which is a modified form of copyright.

You may access said database (via the web), but as soon as you start reconstituting the database from scraping... you're in breach.

It's a law, it is illegal in the UK, I'm sure most countries have some equivalent law on their books, all of the EU does. The law looks recent, but UK copyright and patent used to cover it, the 1997 date is just a separate statute to clarify the position.

duaneb9y ago

> A website is a public property.

This isn't even true metaphorically. It's like a shop front: there may be public access, but it is NOT public property.

cookiecaper9y ago

>How can a TOS have legal power in the case of scraping? A website is a public property. If I'm visiting it without logging in, I don't have a chance to accept the TOS.

This is called "clickwrap". There is usually a notice in the footer of each page that says something like "By using this site, you agree to our Terms of Service." Typically, this kind of notice has been held enforceable. More recently, judges have been demanding that such notices be placed more prominently before they're held enforceable (e.g., somewhere above the fold), but that's it.

>Imagine a hotel that makes guests sign a document saying they will not make photographs of the building. If I'm not a guest, I can take photographs of it and I can't even know that would be illegal.

The reasonable laws that exist in meatspace are not applicable online, because once you hit someone else's server, you're considered to be on their property and they have the right to control what you do there. There is no "public property" from which to safely stand and take photographs in the internet.

Also, photographs of structures may not be free to use. Architectural copyrights went into effect in the early 90s and have a term of either 90 or 120 years. Thus, if you take a photograph of a building built in 1991 and the year is not yet 2111, there is a chance that the architect can claim infringement.

https://en.wikipedia.org/wiki/EBay_v._Bidder%27s_Edge

sseveran9y ago

One of the original court cases covering this was eBay vs Bidders Edge.

The courts have generally disagreed with that interpretation.

dragonwriter9y ago

> A website is a public property.

No, it's not. It may be in public view, but that's a different issue.

gsnedders9y ago

Both the LinkedIn and OKC cases involved the scrapers using logged in accounts.

hooph00p9y ago

> A website is public property.

This is a gross misunderstanding of how the internet works.

minimaxir9y ago

That analogy is not equitable. If you take photographs of a building while on the building's property, they have the right to tell you to stop, or call the police to escort you off if you refuse to do so.

pyre9y ago

A better metaphor for this would be the "Sunday Flyers" that come in the newspaper (e.g. for big box stores like Best Buy). They sent that information to you, they can not then attempt to restrict how you use that information (though they have tried to claim copyright over pricing against sites that aggregate the flyers).

cookiecaper9y ago

Yes, it's important to understand that in the United States, web scraping is usually an illegal activity under the CFAA. If you draw enough attention, your scrape target will notice and threaten you, and probably follow through with the suit. Since the CFAA prescribes both civil and criminal penalties, you may even find yourself in jail for accessing data without the company's approval. Aaron Swartz was being prosecuted under these provisions for scraping public domain data.

The CFAA is a really bad law and creates the network effect lock-in that we all considered a natural part of the web. It doesn't have to be that way -- users should be free to use any browsing appliance they want, including so-called "scrapers".

Big companies like Google not only got their start by flagrantly violating the CFAA, copyright, and privacy laws, but they continue to do so. The moral of the story is hurry up and get big before you get sued or arrested.

There's a long history of ridiculous web scraping rulings based on technical misunderstandings by neophyte judges, including Ticketmaster v. RMG, where infringement was found because the company scraped data out of a page with the Ticketmaster logo on it.

Facebook sued a company called Power Ventures which read out only the user's own data. The founder was found personally liable for $3 million in damages. Facebook did this because they don't want it to be easy for their users to move between social media services. If it's easy, Facebook has to compete on merit instead of just keeping switching costs high. Facebook doesn't like that, so they sue people who make it possible -- and the law says they should win.

We badly need a revised law, but the powers-that-be will strongly oppose it because it would threaten their monopoly over web properties. They continue to flaunt their strategic ignorance of these laws and then take shelter behind them to stop risk from small innovators (i.e., having to compete fair and square).

In the real world, we have a lot of laws that mostly prevent this kind of bad behavior. In cyberspace, the structure is such that most of those laws are not applicable. We need to update and port the pro-small-business logic we have for meatspace companies so that it counts online too. The state of affairs online is really bad.

I want to get a law called the "Consumer Data Freedom Act" passed, which would allow users to access any web property with any non-disruptive browsing device, including custom scrapers that don't impose much more load than a typical user browsing session would.

0xdeadbeefbabe9y ago

Yes laws are neat and a reason for attending law school I suppose. I'm of the simpleton opinion that TCP/IP and the other protocols are the law of the net, and you ought to start with those.

kuschku9y ago

Or you just move to a locale where scraping is legal, and any contractual terms saying otherwise are null and void.

I’d assume a lot of HN users are from such locales.

We don’t always have to assume US laws apply globally – they don’t.

cookiecaper9y ago

I was actually searching for such a jurisdiction as my startup was shut down by a company that invoked the CFAA late last year. What do you suggest? The EU is even worse than the US when it comes to data freedom and tech access. The law on the books in many former British colonies provides marginally more protection (the "Telecommunications Act"), but it'd probably still be disputable, and you'd be shut down anyway unless you had millions sitting around with your lawyers' name on it.

madamelic9y ago

Scraping being illegal is as dumb as saying it's illegal to take photos in public. You aren't affecting anyone if you do it respectfully.

downandout9y ago

Obviously it's a good idea to follow TOS. But as a practical matter, they have to know that you're doing it before they can take action. You wouldn't want to put up a site announcing that you're selling scraped LinkedIn data, for example. But if that data is valuable to your business - collecting names of people that work in certain positions at certain companies so that you can do targeted snail mail campaigns for example - you could quietly scrape and use the data without issue. Use proxies and prosper.

cookiecaper9y ago

This only goes so far, and if you get found out, you're looking at willful infringement (which usually triples damages) and probably criminal charges under the CFAA. However, it should be acknowledged that there are many people making quiet livings off scrapes that are not legal. There are even a few companies making loud livings off such scrapes, like Google.

If you're not going to run it totally anonymously, you should be prepared to jettison and repackage it when you get found out (so that you appear to be complying with the C&D).

Scraping is a huge part of the web, and everyone does it. It sucks that it has to live underground because only big companies can duke it out in court.

siegecraft9y ago

I think the lesson learned from those lawsuits was to always have some sort of 3rd-party intermediary scraping consultancy firm you engage that is totally not just your business under another name.

kbenson9y ago

> (Disclosure: I have developed a Facebook Page Post Scraper [https://github.com/minimaxir/facebook-page-post-scraper] which explicitly follows the permissions set by the Facebook API.)

I played with the idea of creating some social aggregation type service with some friends (as a business). The more I read about FB's past behavior with regard to this, and how essential they are to any sort of service like that, the less viable it looked, so I canned the project. Regardless of what their TOS say, if you get on their radar and they send you a cease-and-desist, it's game over. Facebook is not in the business of subverting their revenue stream, so if you are making money off them and it's preventing them from capitalizing on their users, don't expect to last long if you exist by their grace.

Really, there's an interesting space between so small nobody cares and large enough that getting shut down is a real problem. A lot of projects start small and end up (relatively) large, but without a good way to pay for the service itself. While not every service needs to be a business and make money, once you reach the level where you risk either being shut out of your data source or you need to somehow work out an understanding with that source, how do you approach that when being able to pay is off the table? Not to mention the problem approaching before you have to and forcing the situation, or waiting too long and risking the wrath of the source because you've abused their service as long as you have. Has anyone else been in this situation and found an approach that works?

headmelted9y ago

One thing I'm not quite clear on here.

I understand the use of ToS clauses to prevent scraping but I do kind of wonder to what extent they have authority here.

IANAL, but surely this would fall under copyright law? While re-publishing copyright-protected data without consent is probably unlawful in your region (like scraping an art site and re-posting the images), I wouldn't think just scraping data points for a different purpose (like scraping amazon for the purposes of price comparison) is nearly so clear cut (or enforceable), but maybe I'm just naive.

cookiecaper9y ago

The content falls under copyright law. The problem is that you have to enter the company's servers to obtain this data, and the CFAA says that the company can treat their public-facing web servers like private property, and if you're caught "trespassing", you can be sued and jailed. Scraping plaintiffs are usually granted an injunction based on "trespass to chattels" (among other rationales), i.e., trespass to an individual's property (as opposed to land).

Companies like PriceZombie are forced to stop because the CFAA says that Amazon can prevent them from accessing their servers by decree alone. A ToS isn't even really necessary for this, but it helps them pin down their argument.

PriceZombie could try to get the data from third-party caches, but it only solves part of the problem, because copyright and trademarks come back into the picture once you have a replica of the target page. In Ticketmaster v. RMG Technologies, the judge found RMG infringing on Ticketmaster's trademarks and copyrights because the page they were scraping included Ticketmaster's logo. The judge said the copy of the full page that existed momentarily in RAM while the scraper extracted the non-copyrightable data constituted a copy that infringed on Ticketmaster's rights, even though the logo was never used by the application in any way, it just happened to be on the page.

ChuckMcM9y ago

I was going to post something similar. When you go to all that trouble to get around what the web site owner is pretty clearly trying to prevent, that is convincing evidence that you are breaking the terms of service. And breaking the terms of service for a web site has been held to be a civil violation (a number of times on eBay and Amazon) and potentially a CFAA violation by the Justice Department.

downandout9y ago

Actually it's been held that TOS violations are NOT subject to the criminal provisions of the CFAA.

yeowMeng9y ago

I am no expert, but I always thought you could scrape without consequence provided you never distribute your scrapings?

jurgenwerk9y ago

There are hundreds of paid services that scrape Google heavily (search engine ranking trackers). How are they legal?

lossolo9y ago

They are probably doing it from a country where it's legal. In most countries there is no law that would be applicable in this case.

cookiecaper9y ago

They aren't, or at least, they won't be if Google decides it doesn't like them anymore and decides to bring the matter to court.

The CFAA says it's a crime to exceed "authorized access". Authorized access is whatever the server's owner says it is. If they change their mind, you must cease and desist or risk both civil and criminal penalties. A contract defining the length and nature of your authorization from the server's owner would go a long way to establishing your rights to access, but no one is going to give that to a small player.

knicholes9y ago

Isn't Google search based off of Google "scraping" the web?

- https://github.com/fake-name/ExHentai-Archival

sandGorgon9y ago

The code is pretty cool. Thanks for releasing that! May I ask why you built your own scraper infrastructure instead of building it on top of a known framework like Scrapy (which is in Python as well)?

jbmorgado9y ago

I actually wonder if that constraint in the ToS has any legal validity in the EU, since here that kind of stuff typically cannot be enforced legally.

fake-name9y ago

I do a significant amount of scraping for hobby projects, albeit mostly open websites. As a result, I've gotten pretty good at circumventing rate-limiting and most other controls.

I suspect I'm one of those bad people your parents tell you to avoid - by that I mean I completely ignore robots.txt.

At this point, my architecture has settled on a distributed RPC system with a rotating swarm of clients. I use RabbitMQ for message passing middleware, SaltStack for automated VM provisioning, and Python everywhere for everything else. Using some randomization, and a list of the top n user agents, I can randomly generate about 800K unique but valid-looking UAs. Selenium+PhantomJS gets you through non-captcha Cloudflare. Backing storage is Postgres.

Database triggers do row versioning, and I wind up with what is basically a mini internet-archive of my own, with periodic snapshots of a site over time. Additionally, I have a readability-like processing layer that re-writes the page content in hopes of making the resulting layout actually pleasant to read on, with pluggable rulesets that determine page element decomposition.
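To make the trigger-based row versioning idea concrete, here is a small sketch. It is not the schema from the linked repositories: it swaps in SQLite (via Python's standard `sqlite3`) for Postgres so it runs anywhere, and the `pages`/`pages_history` tables and the helper functions are hypothetical names chosen for the example.

```python
import sqlite3

# Assumed schema: one row per URL, plus a history table filled by a trigger.
SCHEMA = """
CREATE TABLE pages (
    url     TEXT PRIMARY KEY,
    content TEXT NOT NULL
);
CREATE TABLE pages_history (
    url        TEXT NOT NULL,
    content    TEXT NOT NULL,
    version_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Before a row is overwritten, copy its old contents into the history table.
CREATE TRIGGER pages_versioning
BEFORE UPDATE ON pages
FOR EACH ROW
BEGIN
    INSERT INTO pages_history (url, content) VALUES (OLD.url, OLD.content);
END;
"""


def open_store():
    """Create an in-memory database with the versioning trigger installed."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def save_snapshot(conn, url, content):
    """Update the page if it exists (firing the trigger), otherwise insert it."""
    cur = conn.execute("UPDATE pages SET content = ? WHERE url = ?", (content, url))
    if cur.rowcount == 0:
        conn.execute("INSERT INTO pages (url, content) VALUES (?, ?)", (url, content))
    conn.commit()


def history(conn, url):
    """Return earlier versions of a page, oldest first."""
    rows = conn.execute(
        "SELECT content FROM pages_history WHERE url = ? ORDER BY rowid", (url,)
    )
    return [row[0] for row in rows]
```

Because the archiving happens inside the database, the scraper code only ever writes the latest version, and the older snapshots accumulate on their own.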

At this point, I have a system that is, as far as I can tell, definitionally a botnet. The only thing is I actually pay for the hosts.

---

Scaling something like this up to high volume is really an interesting challenge. My hosts are physically distributed, and just maintaining the RabbitMQ socket links is hard. I've actually had to do some hacking on the RabbitMQ library to let it handle the various ways I've seen a socket get wedged, and I still have some reliability issues in the SaltStack-DigitalOcean interface where VM creation gets stuck in an infinite loop, leading to me bleeding all my hosts. I also had to implement my own message fragmentation on top of RabbitMQ, because literally no AMQP library I found could reliably handle large (>100K) messages without eventually wedging.
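For anyone wondering what message fragmentation on top of a queue might look like, here is a transport-agnostic sketch. It is not the code from the repositories below and makes no RabbitMQ calls: `Fragment`, `fragment`, and `Reassembler` are hypothetical names, and the 100,000-byte chunk size is an assumption based on the threshold mentioned above.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    message_id: str   # shared by all fragments of one message
    index: int        # position of this fragment within the message
    total: int        # number of fragments in the message
    payload: bytes


def fragment(message: bytes, chunk_size: int = 100_000):
    """Split a message into fragments no larger than chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    message_id = uuid.uuid4().hex
    chunks = [message[i:i + chunk_size] for i in range(0, len(message), chunk_size)]
    chunks = chunks or [b""]  # an empty message still travels as one fragment
    return [Fragment(message_id, i, len(chunks), c) for i, c in enumerate(chunks)]


class Reassembler:
    """Collect fragments, in any arrival order, and rebuild complete messages."""

    def __init__(self):
        self._pending = {}  # message_id -> {index: payload}

    def add(self, frag: Fragment):
        """Store a fragment; return the full message once every part has arrived."""
        parts = self._pending.setdefault(frag.message_id, {})
        parts[frag.index] = frag.payload
        if len(parts) < frag.total:
            return None
        del self._pending[frag.message_id]
        return b"".join(parts[i] for i in range(frag.total))
```

Each fragment would be published as its own small queue message; the consumer feeds them to `Reassembler.add` and acts only when it returns the rebuilt payload.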

There are other fun problems too, like the fact that I have a postgres database that's ~700 GB in size, which means you have to spend time considering your DB design and doing query optimization too. I apparently have big data problems in my bedroom (My home servers are in my bedroom closet).

---

It's all on github, FWIW:

Manager: https://github.com/fake-name/ReadableWebProxy

Agent and salt scheduler: https://github.com/fake-name/AutoTriever

kough9y ago

Yet another incredible technical achievement due to someone's quest for more porn (https://github.com/fake-name/AutoTriever/blob/master/setting...).

fake-name9y ago

That's a separate project:

- https://github.com/fake-name/PatreonArchiver

- https://github.com/fake-name/xA-Scraper

- https://github.com/fake-name/DanbooruScraper

Or... well, 4 separate projects. Whoops?

At one point, a friend and I were looking at trying to basically replicate the google deep-dream neural net thing, only with a training set of porn. It turns out getting a well tagged dataset for training is somewhat challenging.

Well-tagged hentai is trivially accessible, though. I think there's probably a paper or two in there about the demographics of the two fan groups. People are fascinating.

Next up, automate the consumption too!

monsoon229y ago

How do you circumvent cloud provider IP blocks? For example, one site blocks all requests from AWS EC2 servers.

fake-name9y ago

None of the sites I'm scraping do that, mostly.

I'm not scraping high value sites like that (I mostly target amateur original content). It's not really of interest to other businesses. As such, I tend to just run into things like normal cloud-flare wrapped sites, and one place that tried to detect bots and return intentionally garbled data.

If I run into that sort of thing, I guess we'll see.

nickysielicki9y ago

I've never used this, and it's incredibly shady considering the users probably do not realize that their Hola browser plugin does this, but Hola runs a paid VPN service where you can get thousands of low-bandwidth connections on unique residential IP addresses, provided generously through their "free" VPN users.... It's essentially a legitimate attempt at running a botnet as a service.

But if the end justifies the means... http://luminati.io/

siegecraft9y ago

I'm sure there'd be a ton of people who would love to pay to use your platform (who cares if the source is available, I don't want to run my own because once the code is written, it's ops that's hard). But then I suppose it would be hard to stay unnoticed.

fake-name9y ago

Yeah, running this thing publicly would be a huge mess from a copyright perspective, since it literally re-hosts everything as a core part of how it works.

As it is, I think I'm OK, since it's basically just a "website DVR" type thing, for my own use.

Really, if nothing else, the project has been enormously educational for me. I've learnt a boatload about distributed systems, learned a bit of SQL, dicked about with databases a bunch, and actually experienced deploying a complex multi-component application across multiple disparate data centers.

atmosx9y ago

Similar, paid solution: https://scrapinghub.com/crawlera/

uptown9y ago

What would the rough costs be to run the 800k UA scenario?

fake-name9y ago

To be clear, I have a pool of 800K theoretical UAs derived from the mechanism I use to generate them, not 800K clients.

Regarding costs, I really have no idea. It depends on how rapidly you cycle the UA, and how fast whatever you're scraping is.

franciskimOP9y ago

take my money!

Greg-J9y ago

How can I get ahold of you directly?

fake-name9y ago

connorw at imaginaryindustries dot com

chmars9y ago

Wow, that's impressive!

XCSme9y ago

Tbh I didn't enjoy the article, it just seems like someone who has just learned about Node.js tried to explain (and mostly failed) how to use some packages to scrape a page. I was expecting to learn some new techniques, but all it explained was how to make a few API calls in order to solve a very specific problem. Also, there was the overall arrogant tone: "I found their interview approach a bit of a turn off so I did not proceed to the next interview and ignored her emails ", this just shows a lot of immaturity.

jhwhite9y ago

>Also, there was the overall arrogant tone: "I found their interview approach a bit of a turn off so I did not proceed to the next interview and ignored her emails ", this just shows a lot of immaturity.

I'm not saying you're one of these people, but it's frustrating when companies do this to potential employees and the potential employee is told by friends and management-type people, "Well, that's the company; you just have to deal with it."

When someone flips it on the company then it's immature.

I applied somewhere recently and they invited me out to a pre-interview lunch. That went well so they called me in for an interview. That went well and the VP told me he would call me back to set up a second (third?) interview.

I never heard back from him. An ex-coworker there went to the VP to find out what was going on and the VP said he decided he wanted someone with more experience in the specific area they're working in.

But last he told me was he liked me and would schedule another interview, then when he changed his mind he never let me know.

I think people on both sides should be courteous and respectful through the process, but if employers are treating interviewees poorly then they shouldn't be surprised when they start getting treated poorly.

wtracy9y ago

>When someone flips it on the company then it's immature.

First, it's hard to know when companies are doing this intentionally versus when things just get lost in the shuffle. (Never attribute to malice what can be explained by incompetence, and all that.) Meanwhile, the author was clearly ignoring the interviewer intentionally.

Second, the fact that Company A treated you rudely doesn't give you license to treat unrelated Company B rudely. For that matter, I'm not sure that the fact that Employee 1 at Company A treated you rudely gives you moral license to treat Employee 2 at Company A rudely. Show a little compassion for someone trapped in a dead-end job trying to put food on their family's table, for crying out loud.

fizx9y ago

As someone who has written a scraping framework, this article is useful AF.

> but all it explained was how to make a few API calls in order to solve a very specific problem.

Yeah, the very specific problems everyone runs into time after time. He presents specific solutions, and reasonable context. If I was googling for one of these problems, I'd be very happy to run into this page.

> Also, there was the overall arrogant tone: "I found their interview approach a bit of a turn off so I did not proceed to the next interview and ignored her emails "

Your arrogance is my matter-of-fact.

zasz9y ago

You can't count as 'matter-of-fact' if you're not even bothering to communicate.

franciskimOP9y ago

Thanks for your feedback, I do appreciate it.

kentt9y ago

Not many people can take criticism in stride like that. You are awesome.

state9y ago

As a counterpoint: I think this article is fantastic. API restrictions are incredibly annoying when they pertain to what I consider to be my data. Should data and interface be so tightly joined? Of course not.

kafkaesq9y ago

I had a different take -- if one hasn't been keeping up with the "arms race" of modern web scraping (and the countermeasures some sites are adopting these days), or with the JS scraping ecosystem generally, then it seems like this article could make for a decent introduction.

nathancahill9y ago

Part of the turnoff for me was the middle-schooler tone and vocabulary. Good walkthrough with good code examples though, obviously written by a very smart JS dev.

franciskimOP9y ago

Ok, I'll try to explain to this thread. I actually thought about removing the Facebook part, but I kept it in there because that is kind of how I felt and it is real. The middle-schooler tone and vocab is probably because I don't read a lot of books, and English is my 2nd language.

In reply to XCSme - no I am not new to Node and my point of the post is to illustrate some of the techniques that I haven't seen published anywhere to HN and the community. My focus is quite different from what you think it is, so maybe it is my bad for bad writing skills, I'm still new to writing and learning.

https://contently.com/strategist/2015/01/28/this-surprising-...

23andwalnut9y ago

Since when is a 'middle-schooler vocabulary' a bad thing? I distinctly remember learning on hn (when the Hemingway app became popular) that simple is better for readability.

franciskimOP9y ago

Ok guys, I've elaborated a bit - tried to make the aim of the post a bit clearer, and I've removed the stuff about Facebook because I don't want to discomfort other readers. It's 5:25AM, it's been a crazy morning and I've got work tomorrow! XD

namelezz9y ago

> "I found their interview approach a bit of a turn off so I did not proceed to the next interview and ignored her emails ", this just shows a lot of immaturity.

I believe you should treat others how you want to be treated. FYI, recruiters do not usually follow up with rejected candidates and many are unresponsive. It's their way of telling candidates they are no longer interested.

XCSme9y ago

When I interviewed for Google the recruiter was really nice and called to tell me that I didn't make it. I also gave her feedback on what I thought went wrong in the interview and what was wrong with their interview process. I don't know how the ones at Facebook are, but my recruiter did her job well and I appreciated that, even though some interviewers messed up.

elchief9y ago· 7 in thread

To fight scrapers, we show some values as images that look like text (but not all the time)

And we insert random (non-visible) html and css classes in our site to screw with em, and use randomized css classnames. This fucks with xpaths and css selectors.

You can't stop them, but you can make their lives painful.

elmigranto9y ago

> To fight scrapers, we show some values as images that look like text

You are fighting screen readers more than anything; as well as legitimate plugins, form autofills, etc. If this is for captcha, you are fighting all the users as well.

> And we insert random (non-visible) html and css classes in our site to screw with em, and use randomized css classnames.

Legitimate browser plugins, etc. I'd just use electron or selenium with `nth-child`, `:visible`, `[class*="…"]`, etc.

What you are effectively doing is wasting time on useless stuff. This is even more useless than trying to prevent copying of DVDs or pirating games.

BrandonMarc9y ago

> What you are effectively doing is wasting time on useless stuff. This is even more useless than trying to prevent copying of DVDs or pirating games.

Can you be so sure? The Union blockade of the Confederacy had plenty of holes, and smugglers / privateers / blockade-runners made good money getting through (when they survived) ... but that doesn't mean the blockade wasn't effective all the same at weakening the Confederate military and economy.

elchief9y ago

Except traffic from known scrapers (or what appear to be) is down 20%

Sure, xpath and css selector experts can figure it out, but that's not everyone

emodendroket9y ago

This also hurts accessibility for disabled users.

soared9y ago

Not if the img alt text is the same

insulanian9y ago

It would be better to invest that time in making an API so they don't need to scrape.

franciskimOP9y ago

Haven't come across those yet but yes I guess it could be painful.

mack739y ago· 6 in thread

Corporations will abuse your personal integrity whenever they get a chance, while abiding the law. Corporations will cry like babies when their publicly available data (their livelihood) gets scraped. They will take you to court.

They consider their data to be theirs, even though they published it on the internet. They consider your data (your personal integrity) to be theirs as well, because how can you assume personal integrity when you are surfing the internet?

I have high hopes that the judicial system some time not too far from now will realize that since the law should be a reflection of the current moral standings it will always be behind, trying to catch up with us and that those who break the law while not breaking the current moral standings are still "good citizens" unworthy of prison or fines.

I guess Google won this iteration of the internet because of the double standard site owners stand by: allowing Google to scrape anything while hindering any competitors from doing the same. There will only be a true competitor to Google when, in the next iteration of the internet, we realize that searching vast amounts of data (the internet) is a solved problem, that anyone can do as good a job as Google, and move on to the next quirk, around which there will be competition. In the end that quirk will be solved and we'll have a winner, signaling that it is time to move on to the next iteration.

dspillett9y ago

> Corporations will abuse your personal integrity whenever they get a chance, while abiding the law.

Call me cynical if you will, but I'd leave "while abiding the law" out of that, or at least replace it with "while hoping they aren't breaking the law". Due diligence on these matters is often sadly lacking. They'll take the information first and only consider any such implications when/if they come up later.

Large organisations like Google probably will make the up-front effort to remain legal, because they are in the public eye enough for lack of doing so to attract a lot of unwanted press, but you don't have to get a lot smaller than that to start finding companies who are a lot less careful (or in some cases wilfully negligent).

cm21879y ago

I would use Microsoft as a precedent. Sure they will attempt to stay legal but by pushing it as far as they can.

For instance the browser choice script that came with Windows imposed by the EU never worked. It was a "bug". Somehow they must have omitted to test the feature...

Until last year Microsoft started playing nice, and I think Google and Facebook have become the new corporate villains. But recently the Windows team seems to be minded to challenge them in that position.

niftich9y ago

Often, it's indeed cheaper to pay a government-mandated fine than lose market opportunities afforded by behavior that later runs afoul of some law or regulation.

nerdponx9y ago

The difference is that Google didn't agree to not scrape your data. You, as per their TOS, agreed not to scrape theirs, as part of the condition of using their service.

Bootvis9y ago

Which TOS?

I might have accepted terms when I created a Google Account but in no way do I agree to a TOS by visiting a URL.

joantune9y ago

Regarding the scraping and the legality of it all. I wonder if it's still illegal if you respect the robots.txt and other meta elements in html standards.

If Google's actions were illegal, I'm sure that they would have been sued even if their scraping and indexing usually is helpful for the website owner

franciskimOP9y ago· 6 in thread

Sorry guys, hit by traffic - just scaling my EC2 at the moment.

niftich9y ago

No worries, we had your page scraped just in case ;)

Google Cache link: http://webcache.googleusercontent.com/search?q=cache:https:/...

Archive.is link: http://archive.is/DQccs

franciskimOP9y ago

haha :)

fareesh9y ago

Is it common for developers in the eCommerce space to use scrapers as a means to aggressively push automated price-match algorithms? I've been asked to do this a number of times, was just curious as to how prevalent it is.

hk__29y ago

Yes, everybody scrapes the prices of the others.

ksahin9y ago

I'd just like to know: how much traffic did you get?

franciskimOP9y ago

ok on m4.4xlarge now :)

prashnts9y ago· 5 in thread

A neat trick I sometimes use to "scrape" data from sites that use jquery ajax to load data is to plug in a middleware in jquery xhr:

      $.ajaxSetup({
        dataFilter: function (data, type) {
          if (this.url === 'some url that you want to watch!') {
            // Do anything with the response here; `data` is the raw response
            // body (`this.data` would be the request payload instead)
            awesomeMethod(data)
          }
          return data
        }
      })

I remember last using it with an infinite-scroll page with a periodic callback that scrolled the page down every 2 seconds, and the `awesomeMethod` just initiated the download. Pasted it all in dev-tools console, and the cheap "scraper" was ready!

macromaniac9y ago

Another trick: you can hover over elements in Chrome with F12 and the inspect tool, then right click > Copy > Copy selector, and Chrome will generate one for you. That way you don't have to actually do any work.

With a selector it's easy to grab data, here's a linux command that gets every user that posted in this thread:

  lynx -base -source 'https://news.ycombinator.com/item?id=12345693' | hxnormalize -x | \
    hxselect -c -s '\n' "td > table > tbody > tr > td.default > div:nth-child(1) > span > a.hnuser"

Here are the most frequent commenters:

     27 cookiecaper
     22 franciskim
      6 fake-name
      4 niftich
      4 flukus
      4 elmigranto
      4 downandout
      3 tedunangst
      3 siegecraft
      3 muglug
      3 minimaxir
      3 madamelic

pault9y ago

You can also build a chrome extension if you need to navigate to multiple pages and use a long-running scraping process. I've done this several times and it's really easy to get one up and running if you use an extension boilerplate (30 minutes tops).

esac9y ago

do you have something? i was going to write the very same extension (but distributed so i could add it to my pc and my friends) but never did that

[1]https://github.com/adam-s/playboy-fm/blob/master/server/scra...

franciskimOP9y ago

Yup I've been known to do this as well :) I'd have a Node.js + Mongo endpoint ready on the other side.

zappo29389y ago

Why not use Nightmare with Node.js + Mongo?

Here is an example of injecting a jQuery script into a page with jQuery loaded and getting nicely formatted information returned. [1]

dchuk9y ago· 5 in thread

Scraping with Selenium in Docker is pretty great, especially because you can use the Docker API itself to spin up/shut down containers at will. So you can spin up a container to hit a specific URL in a second, scrape whatever you're looking for, then kill the container. This can be done via a job queue (sidekiq if you're using Ruby) to do all sorts of fun stuff.

That aside, hitting Insta like this is playing with fire, because you're really dealing with Facebook and their legal team.

spikej9y ago

Serious question: What do you gain from having an extra layer like docker?

siegecraft9y ago

Well it does make it extra easy to deploy a scrape node to any type of machine you might encounter (and having a diverse set of source IPs is extra important for scraping; that means you might need to deploy to AWS, Azure, Google Cloud, rackspace, digitalocean, random vps provider X and so on). So instead of having to have custom provisioning profiles for every hosting provider/image combination, you just need to get docker running on a host and you're good to go.

dchuk9y ago

Because you can use pre-packaged Selenium in Docker images with a few commands: https://github.com/SeleniumHQ/docker-selenium

franciskimOP9y ago

Selenium grid runs in docker, so it's easy to have multiple instances running. Better control.

franciskimOP9y ago

True that, I hope Zuck sues me so I'll get extra famous

writeslowly9y ago· 5 in thread

Have you run into any issues from running all of your scrapers off of AWS, or just from sites detecting that you're accessing large numbers of pages in some sort of obvious pattern? I guess I was hoping there would be sites with more interesting ways to screw with web scrapers (rearranging certain page elements or something) than just throwing up a CAPTCHA.

madamelic9y ago

Most really don't. A lot of big sites don't seem to care, at least in my experience.

The few that I've seen just 'ban' your IP for a few minutes. If you hit Wikipedia too much too quickly, they will essentially refuse to serve you for a while. It was a number of years ago I was doing it, but basically you would be scraping then you would just stop getting info (Maybe I wasn't reading response codes and could've realized quicker what was happening)

detaro9y ago

Wikipedia provides you with an API and guidelines on how to use it, so you really shouldn't be scraping it directly or so much you hit enforced limits.

[1] https://www.npmjs.com/package/downcache

franciskimOP9y ago

I'm not actually doing a lot of hits, so it's generally been ok. I can just rotate my IP or solve the CAPTCHA.

freshhawk9y ago

A surprisingly small number of sites care. There are some really fun things one can do with random class/id/order variations. It's also fun to feed garbage data to scrapers when you can identify them with very high probability.

But there seems to be little demand for these kinds of systems and just throttling/blocking/CAPTCHA solutions are much simpler.

xur179y ago

There are definitely some sites that block entire ip blocks (ex: all of aws). The only real way around this is to use proxies, but if a site's trying to block you, it's probably best to comply, and just stop.

slig9y ago· 4 in thread

I wonder how effective the CloudFlare anti-scraper protection is against this approach of breaking captchas.

Also, I find it interesting that big websites don't just block all traffic from AWS IPs as they do with Tor.

user59944619y ago

There can be legitimate traffic coming from AWS, if not the site itself.

It's especially true when the site provides an API and is meant to be integrated by people/companies. In which case, the AWS traffic is likely to include major and/or important and/or paying customers. You really don't want to block that.

On the other hand, Tor is likely to be 90% evil. When in doubt, just block it. (That makes me think, I should run some proper stats and maybe publish a blog post about that. )

slig9y ago

> There can be legitimate traffic coming from AWS, if not the site itself.

The traffic from the site itself, if it's hosted there, would come from the intranet IP address, right? Not the public facing one.

> It's especially true when the site provides an API and is meant to be integrated by people/companies. In which case, the AWS traffic is likely to include major and/or important and/or paying customers. You really don't want to block that.

Agreed, but it's fairly easy to block the AWS IP traffic on web endpoints and not on the API endpoints.
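
A minimal sketch of that split, assuming IPv4 only and a site whose API lives under a hypothetical `/api/` prefix. The CIDR ranges passed in are placeholders (AWS publishes its real ranges, which an actual block list would be built from), and none of the function names come from the thread.

```javascript
// Convert a dotted IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  return ip.split('.').reduce((acc, octet) => (acc * 256 + Number(octet)) >>> 0, 0);
}

// True when `ip` falls inside the CIDR block, e.g. '192.0.2.0/24'.
function inCidr(ip, cidr) {
  const [base, bitsText] = cidr.split('/');
  const bits = Number(bitsText);
  // A /0 mask must be 0; shifting a 32-bit value by 32 is a no-op in JS.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// Block listed ranges on web pages, but leave API paths open.
// The '/api/' prefix is an assumption about how the site is laid out.
function shouldBlock(ip, path, blockedCidrs) {
  if (path.startsWith('/api/')) return false;
  return blockedCidrs.some((cidr) => inCidr(ip, cidr));
}
```

In a real deployment this decision would more likely sit in the load balancer or firewall than in application code.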

franciskimOP9y ago

I think I might have some trouble with some reCAPTCHA stuff, but there must be ways around it. I agree with you on your point about AWS.

greglindahl9y ago

There are a fair number of people in China etc running personal VPNs on AWS.

ben_jones9y ago· 4 in thread

Currently getting 502 Gateway. Guessing this post is also trending on reddit and we hugged it to death :(.

franciskimOP9y ago

Just upgraded my EC2 :)

franciskimOP9y ago

I'm on Reddit?

soared9y ago

Google Analytics > Acquisition > Source/Medium > type "reddit" in search bar. Add secondary dimension "referral path"

ben_jones9y ago

It's a total guess. HN rarely hugs sites to death compared to Reddit (IMO).

stupidcar9y ago· 3 in thread

I wrote a fairly complex spidering and scraping script in Node a few months ago. I found downcache[1] to be absolutely invaluable, particularly as I was debugging my parsing scripts, as I was able to rerun them relatively quickly over the cached responses.
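
The idea of caching responses on disk can be sketched roughly as follows. This is not downcache's API, just an illustration using Node's standard library; `createDiskCache`, `cacheFileFor`, and the `download` callback are hypothetical names.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Hash the URL so that any URL maps to a safe, fixed-length file name.
function cacheFileFor(cacheDir, url) {
  const name = crypto.createHash('sha256').update(url).digest('hex') + '.html';
  return path.join(cacheDir, name);
}

// Returns a function that serves `url` from disk when it is cached, and
// otherwise calls `download(url)` (a hypothetical async fetcher that
// resolves to a string) and stores the result for next time.
function createDiskCache(cacheDir, download) {
  fs.mkdirSync(cacheDir, { recursive: true });

  return async function get(url) {
    const file = cacheFileFor(cacheDir, url);
    if (fs.existsSync(file)) return fs.readFileSync(file, 'utf8');

    const body = await download(url);
    fs.writeFileSync(file, body);
    return body;
  };
}
```

With something like this in place, a parser can be rerun over the same pages without touching the network again; expiry and error handling are left out.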

However, when the network was no longer a bottleneck, I found that the speed and single-threaded nature of Node became one. It wasn't really that slow, relatively speaking, but I had a few hundred gigs of HTML to chew through every time I made a correction, so it was important to keep the turnaround as fast as possible.

I eventually managed to manually partition the task so I could launch separate Node scripts to handle different parts of it, but it wasn't a perfect split, and there was a fair bit of duplicated work, where a shared cache would have helped a great deal.

In retrospect, I should have thrown my JS away and started again in something with easy threading like Java or C#. But -- familiar story -- I'd underestimated the complexity of the task to begin with, and by the time I understood, I'd sunk a lot of time into writing my JS parsing code and didn't fancy converting it all to another language, particularly when it always seemed like "just one more" correction to the parsing would make everything work right. In the end, what was supposed to take a weekend took about three months of work, off and on, to finish.

ralusek9y ago

Threading in node is very easy, just use clusters. Alternatively, take any of the CPU intensive activity, like parsing the HTML and formatting as JSON, and just put that on an AWS lambda.

You can invoke as many lambdas from your application as you want in parallel and you're not going to be bottlenecked by your CPU :)

stupidcar9y ago

Clustering in Node creates isolated child processes, not threads. I needed to have shared queues, in-memory caches, and hashes to coordinate workers and avoid them doing duplicate work.

I did consider using clustering and having some master process coordinate everything, and using some shared-memory caching library. But it would not be "easy" to set up, especially compared to something like Java where you get thread pools and synchronized thread-safe collections out of the box.
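
For what it's worth, one way to avoid duplicated work without shared memory is to partition deterministically by hashing, so each process can decide on its own which URLs belong to it. This is a different technique from the manual split described above and it does not provide a shared cache; `workerFor` and `shareFor` are hypothetical names.

```javascript
const crypto = require('crypto');

// Deterministically assign each URL to one of `workerCount` workers, so
// separate processes never pick the same URL and need no shared queue.
function workerFor(url, workerCount) {
  const digest = crypto.createHash('sha1').update(url).digest();
  return digest.readUInt32BE(0) % workerCount;
}

// Each worker process filters the full URL list down to its own share.
function shareFor(urls, workerIndex, workerCount) {
  return urls.filter((url) => workerFor(url, workerCount) === workerIndex);
}
```

The split is only as even as the hash makes it, and it does nothing for work that spans several URLs, so it is a partial answer at best.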

And Lambda would have been totally impractical. As I said, I had hundreds of gigs of data to process. If I'd been uploading this over my puny ADSL upstream every time, I'd still be waiting for a single run to complete.

I'm not trashing Node. I like it. There's a reason I used it in the first place, after all. But for this particular use-case, I didn't find it was a very good fit.

http://jakeaustwick.me/python-web-scraping-resource/

Bjartr9y ago

Wonder how difficult it would have been to pull the JS portion into Java by way of Rhino.

mosburger9y ago· 3 in thread

> AngelList even detects PhantomJS (have not seen other sites do this).

I run a site that aggregates/crawls job boards for remote job postings, and AngelList has been VERY difficult to crawl for various reasons, but you can easily get PhantomJS to work (I have). Having said that, I've never felt very good about the fact that I'm defeating their attempts to block me (even though I feel like I'm doing them a favor) and will likely retire that bot soon.

It kinda sucks that I'm just grabbing publicly-available content in a very low-bandwidth way, but I really can't convince myself that what I'm doing is very ethical.

My to-do list includes making my crawler into a more well-behaved bot and that will have to go.

ixtli9y ago

I think you may want to decouple your ethical analysis from which private company is making the most money. Remember that the only functional difference between you and somewhere like kayak.com or padmapper is business relationships.

franciskimOP9y ago

I think PhantomJS has a bit of a giveaway in the headers where 2 lines are reversed compared to a normal Chrome, I always thought AngelList detected this flaw. Although I have heard there are builds of PhantomJS where this flaw does not exist.
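
A server-side check along those lines might look like the sketch below. It assumes Node's `rawHeaders` layout (alternating names and values), and the expected order in the usage comment is a made-up placeholder, not the order Chrome or PhantomJS actually sends.

```javascript
// Extract header names, in arrival order, from Node's `rawHeaders`
// layout: [name1, value1, name2, value2, ...].
function headerNames(rawHeaders) {
  return rawHeaders.filter((_, i) => i % 2 === 0).map((name) => name.toLowerCase());
}

// True when the names from `expectedOrder` that are present in the request
// appear in that same relative order.
function followsOrder(rawHeaders, expectedOrder) {
  const names = headerNames(rawHeaders);
  const positions = expectedOrder
    .map((name) => names.indexOf(name))
    .filter((position) => position !== -1);
  return positions.every((p, i) => i === 0 || positions[i - 1] < p);
}

// Hypothetical usage inside an http request handler:
//   if (!followsOrder(req.rawHeaders, ['host', 'user-agent', 'accept'])) { /* flag it */ }
```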

ralusek9y ago

Mind giving a quick description of getting PhantomJS to work?

Jake2329y ago· 2 in thread

Not wanting to thread hijack, but just going to post an article I wrote a few years back as it covers a few other things that are still relevant; and often still gets referenced. Maybe it'll help some people out in combination with OP's post.

mdaniel9y ago

I was surprised to not see Scrapy listed, but then I saw there were some comments about it - but seriously, doing by hand what Scrapy has spent years perfecting is highly suboptimal.

I guess the distinction is between whether one wants to just "toy around" or run the spider for-real.

franciskimOP9y ago

Nice post Jake!

jgmmo9y ago· 2 in thread

Good stuff.

I do a good bit of scraping, and made RubyRetriever[1] to make my life easier but it seems like I'm getting roadblocked on occasion, probably due to some of the things you mention in your article.

Is there any way for a site to verify that only their JS and CSS files are linked? Like preventing injection?

[1]: https://github.com/joenorton/rubyretriever

throwanem9y ago

You could inspect the src attributes of script tags, and the href attributes of link tags with rel="stylesheet", for acceptable domains. I doubt it would cover all cases, but it might be a start.
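
A rough sketch of that check, assuming the asset URLs have already been collected from the page. `foreignAssets` and the host names are hypothetical; the commented browser snippet shows one way the inputs might be gathered.

```javascript
// Given the URLs found in <script src> and <link rel="stylesheet" href>,
// return the ones whose host is not on the allow-list.
function foreignAssets(assetUrls, pageUrl, allowedHosts) {
  const allowed = new Set(allowedHosts);
  return assetUrls.filter((raw) => {
    const host = new URL(raw, pageUrl).hostname; // also resolves relative URLs
    return !allowed.has(host);
  });
}

// In a browser the inputs could be collected like this (not run here):
//   const urls = [
//     ...[...document.querySelectorAll('script[src]')].map((s) => s.src),
//     ...[...document.querySelectorAll('link[rel="stylesheet"]')].map((l) => l.href),
//   ];
//   foreignAssets(urls, location.href, [location.hostname]);
```

Inline scripts and styles have no URL to inspect, which is one of the cases this would not cover.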

franciskimOP9y ago

I got the 100th star on the repo! What do you mean by the verifying part?

IANAD9y ago· 2 in thread

> But if you are automating your exact actions that happen via a browser, can this be blocked?

Yes, by checking times between actions and number of actions in a time period, and blocking atypical activity. I was IP banned from a site once for a few months, after trying to scrape it too much and hitting links on the site that were hidden from humans.
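
A rough sketch of that kind of check, assuming the site can attribute requests to a client id such as an IP address. The thresholds and names are invented for illustration and are not taken from any particular site.

```javascript
// Flags a client that makes too many requests inside a sliding window, or
// whose gaps between requests are suspiciously uniform.
function createActivityMonitor({ windowMs = 60000, maxRequests = 30, minJitterMs = 50 } = {}) {
  const history = new Map(); // client id -> request timestamps (ms)

  return function record(clientId, nowMs) {
    // Keep only the timestamps that still fall inside the window.
    const recent = (history.get(clientId) || []).filter((t) => nowMs - t < windowMs);
    recent.push(nowMs);
    history.set(clientId, recent);

    if (recent.length > maxRequests) return 'too-many-requests';

    // With enough samples, near-identical gaps suggest a timer, not a person.
    if (recent.length >= 5) {
      const gaps = recent.slice(1).map((t, i) => t - recent[i]);
      const spread = Math.max(...gaps) - Math.min(...gaps);
      if (spread < minJitterMs) return 'robotic-timing';
    }
    return 'ok';
  };
}
```

A scraper that sleeps for a fixed interval trips the second rule; one that adds random jitter does not, and would have to be caught by volume or other signals.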

The random wait settings specified in the post are better than nothing, but still too flimsy. You would need to put hours between requests, only request during certain 15-hour periods, take days off, and eventually you aren't scraping regularly enough to do much good.

Scraping is not an API, and I should know: I used to do it for a living. It's unreliable. It requires constant maintenance. APIs can break too, but they are meant for the sort of consumption you are trying for.

If you scrape for a living, only do it as a side job.

red_blobs9y ago

It really depends on the data you are scraping. My main business relies on scraping and my data mining application has been running for over 5 years. If you have enough IP addresses available to you, it becomes almost impossible to distinguish it from normal users hitting the site...and bandwidth has gotten so cheap, the overhead is very affordable.

I've noticed that most sites actually don't change that often. I deal with changes once or twice every 3 months.

"If you scrape for a living, only do it as a side job."

This is true if you are scraping the low hanging fruit. I scrape 40+ sources (I do have access to a few APIs as well) and then have to extract the patterns/data I need to then integrate it into my business model. This is all automatic now and I only work on upgrading for speed and efficiency.

If you have to scan millions of urls daily from 1 site, it's probably not going to work out. You need to figure out clever ways of getting the data and using it without breaking any laws or pissing off the site owner.

deejbee9y ago

Not scraping but banks don't even do this for their security which I found surprising. I just finished building a chrome extension (https://chrome.google.com/webstore/detail/uyp-free-blasts-th...) that auto-logins into pretty much any bank or financial web site without having to type anything. The key difference to other password managers is it can auto-fill pretty much anything.

I guess it's part password manager (it stores passwords encrypted in browser storage, not remotely) and part automation wizard :)

Twisell9y ago· 2 in thread

What bothers me the most is that I recently wanted to extract an archive of all the threads I participated in from an Internet forum. The webmaster told me that the BBS he uses doesn't provide such a function and that I just had to download each thread manually... (300+ threads in my case).

He then said that it doesn't bother him if I scrape these threads. I'm currently figuring out how to handle his site's cookie-protected search feature, so that my painstaking effort (I'm not a dev, more a DB guy) can be reproduced more easily by other users of this service.

But this shouldn't happen in the first place, because all posts on this service are stored in a cleanly organized MySQL DB. Yet as no method is provided, the only way to get structured data back is by scraping (the webmaster told me that no, he won't run custom SQL because "he doesn't want to mess up his DB").

So even though all the data is publicly available through the Internet forum, only a geek can download a personal archive... or Google, because Google scrapes and stores everything.

CalRobert9y ago

It's overkill for most things, but I have found that on occasion the best way to scrape stuff behind annoying frontends is with Selenium. pysaunter is a useful library that's one layer of abstraction higher, if you're familiar with Python.

CalRobert9y ago

Well I see now that I'm really late to the party with that comment.

skeletonjelly9y ago· 2 in thread

Hooray Melbourne! Would be interested seeing this at a meetup group if you were thinking of presenting.

danieltrembath9y ago

Another 3000'nder. Would be great to see this turned into a talk somewhere.

skeletonjelly9y ago

For sure. Trying to think which ones. Probably the MelbJS one and maybe dddmelb? You could modify it to talk at the OWASP one perhaps. Which ones have you been to?

rch9y ago· 2 in thread

There is so much that's missing from this. What about gathering tokens from customers vs. paying for social data feeds? How about canned services like 80legs?

franciskimOP9y ago

Hmm yeah there are a lot of other things that I could write about. 80legs seems like another Scrapy type of SaaS? Not sure what you mean about gathering tokens from customers.

rch9y ago

I've heard of companies that scrape on behalf of customers who will walk marketing people through the process of creating an API token to help mitigate rate limiting.

danso9y ago· 1 in thread

This good list of tactics underscores, for me, how the state of the Web has made it a lot more difficult to teach web scraping as a fun exercise for newbie programmers. It used to be you could get by with an assumption that what you see in the browser is what you get when you download the raw HTML...but that's increasingly less often the case. So now you have to teach how to debug via the console and network panel, on top of basic HTTP concepts (such as query parameters).

(Even more problematic is that college kids today seem to have a decaying understanding of what a URL is, given how much web navigation we do through the omnibar or apps, particularly on mobile, but that's another issue).
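
For teaching purposes, query parameters can be shown with the standard `URL` class. A small sketch, assuming a listing that paginates through a hypothetical `page` parameter (the example.com addresses are placeholders):

```javascript
// Build the URL for page `n` of a listing, keeping any existing parameters.
function pageUrl(baseUrl, n) {
  const url = new URL(baseUrl);
  url.searchParams.set('page', String(n));
  return url.toString();
}

// Read the parameters back out of any URL as a plain object.
function queryOf(rawUrl) {
  return Object.fromEntries(new URL(rawUrl).searchParams);
}

console.log(pageUrl('https://example.com/list?sort=new', 3));
// prints "https://example.com/list?sort=new&page=3"
```

Comparing the URL built here with what the network panel shows is a quick way to see which requests a page is really making.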

I've been archiving a few government sites to preserve them for web scraping exercises [0] (the Texas death penalty site is a classic, for both being relatively simple at first, and being incredibly convoluted depending on what level of detail you want to scrape [1]). But I imagine even government sites will move more toward AJAX/app-like sites, if the trend at the federal level means anything.

That said, I think the analytics.usa.gov site is a great place to demonstrate the difference between server-generated HTML and client-rendered HTML.

But as someone who just likes doing web-scraping, I feel the tools have mostly kept up with the changes to the web. It's been relatively easy, for example, to run Selenium through Python to mimic user action [2]. Same with PhantomJS through node, which has vastly improved how accurately it renders pages for screenshots compared to what I remember a few years back

[0] https://github.com/wgetsnaps

[1] https://github.com/wgetsnaps/tdcj-state-tx-us--death_row

[2] https://gist.github.com/dannguyen/8a6fa49253c1d6a0eb92

niftich9y ago

It's unfortunate that nearly every webpage these days is a Javascript State Machine which you have to execute in a sandbox and inspect its internal state to get stuff out of.

On a blog post by Paul Kinlan ('Open Web Advocate' at Google and Chromium) [1], I lamented that we ended up here instead of the semantic web because the semantic web was hard to execute. Instead, every web page is a black-box, only navigable by an intelligent and/or sufficiently persuadable human.

But this is also why I don't buy ethical arguments against scraping. Sure, legally any company can unilaterally set any TOS prohibition against behavior they don't want, and these terms may be tested in court. But navigating a page in an automated manner that's designed to resemble interactions of humans (ie. through Selenium) is in my opinion ethical, because it merely time-shifts a user's activity.

[1] https://news.ycombinator.com/item?id=12206846

nreece9y ago· 1 in thread

At Feedity (https://feedity.com), we "index" webpages to generate custom feeds. Over the years, we've designed our system to use a mix of technologies like .NET (C#) and node.js, and implemented a bunch of tweaks and optimizations for seamless & scalable access to public content.

LunaSea9y ago

Any tips and tricks you are able to share about the technologies you guys developed? It would be especially interesting to see what you use for text extraction from HTML.

lamby9y ago· 1 in thread

Whilst they mean well, I find this fundamentally deceptive -- the arduous parts of "real world" scraping simply aren't in the parsing and extraction of data from the target page, the typical focus of these "scrape the web with X" articles.

The difficulties are invariably in "post-processing": working around incomplete data on the page, handling errors gracefully and retrying in some (but not all) situations, keeping on top of layout/URL/data changes to the target site, not hitting your target site too often, logging into the target site if necessary and rotating credentials and IP addresses, respecting robots.txt, target site being utterly braindead, keeping users meaningfully informed of scraping progress if they are waiting on it, target site adding and removing data resulting in a null-leaning database schema, sane parallelisation in the presence of prioritisation of important requests, difficulties in monitoring a scraping system due to its implicitly non-deterministic nature, and general problems associated with long-running background processes in web stacks.

Et cetera.

In other words, extracting the right text on the page is the easiest and trivial part by far, with little practical difference between an admittedly cute jQuery-esque parsing library or even just using a blunt regular expression.

It would be quixotic to simply retort that sites should provide "proper" APIs but I would love to see more attempts at solutions that go beyond the superficial.
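
As one small illustration of the "retrying in some (but not all) situations" item, here is a sketch of a retry policy. The list of retryable status codes and the delays are assumptions chosen for the example, and `doRequest` is a hypothetical function resolving to an object with a `status` field.

```javascript
// Only failures that are plausibly temporary are worth retrying.
const RETRYABLE_STATUS = new Set([429, 500, 502, 503, 504]);

function shouldRetry(status, attempt, maxAttempts = 4) {
  return attempt < maxAttempts && RETRYABLE_STATUS.has(status);
}

// Exponential backoff: 1s, 2s, 4s, ... capped at maxMs.
function backoffMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1));
}

// `sleep` is injectable so the loop can be exercised without real waiting.
async function fetchWithRetry(doRequest, url, sleep = (ms) => new Promise((r) => setTimeout(r, ms))) {
  for (let attempt = 1; ; attempt++) {
    const res = await doRequest(url);
    if (res.status < 400) return res;
    if (!shouldRetry(res.status, attempt)) {
      throw new Error(`giving up on ${url}: HTTP ${res.status} after ${attempt} attempt(s)`);
    }
    await sleep(backoffMs(attempt));
  }
}
```

A 404 fails immediately while a 503 gets a few spaced-out attempts; network-level errors, which would surface as exceptions from `doRequest`, are not handled here.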

dismantlethesun9y ago

> the arduous parts of "real world" scraping simply aren't in the parsing and extraction of data from the target page, the typical focus of these "scrape the web with X" articles.

I can agree with this after having written a scraper as part of core business functionality (we paid a company for access, but access was just to bare HTML blobs and CSV and not an actual API).

However, to what degree you want to do all this is negotiable whereas the 'core' of screen-scraping is not---all scrapers have to first figure out how to get text, parse it, then stick it back in their system.

An example of what I mean when I say 'negotiable' is....

> working around incomplete data on the page

Deciding how to do this depends on your problem domain. Sometimes, we'd get bad computed data from our source but not care, because it just meant putting more work into calculating it from a rawer source.

> not hitting your target site too often

If they publish how often you are allowed to scrape, this isn't too difficult. If not, then trial and error is the only solution. On occasion, a site simply just doesn't know/care. For example, in my case, the site was static content behind a CDN, so that if we were anywhere under 200 req/second then no flags would ever be raised.

For most smaller sites that you are unofficially scraping, you may be limited to 1 request every 2 seconds.
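
A spacing rule like "1 request every 2 seconds" could be enforced with something like the sketch below. `createThrottle`, `politeGet` and the injected `now` clock are hypothetical names, and the interval is whatever you settle on for the site in question, not a universal limit.

```javascript
// Returns a function that reports how long to wait before the next request
// so that requests end up spaced at least minIntervalMs apart.
function createThrottle(minIntervalMs, now = Date.now) {
  let nextAllowedAt = 0; // timestamp at which the next request may be sent
  return function reserveSlot() {
    const t = now();
    const startAt = Math.max(t, nextAllowedAt);
    nextAllowedAt = startAt + minIntervalMs;
    return startAt - t; // milliseconds the caller should wait
  };
}

// Usage sketch: wait for the reserved slot, then issue the request.
async function politeGet(reserveSlot, doRequest) {
  const waitMs = reserveSlot();
  if (waitMs > 0) await new Promise((resolve) => setTimeout(resolve, waitMs));
  return doRequest();
}
```

For example, `const reserveSlot = createThrottle(2000)` gives one shared slot reservation function that every request to that site goes through.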

KennyCason9y ago· 1 in thread

As someone who does a lot of scraping, I was happy to learn about Antigate :)

KennyCason9y ago

Just joking as I don't scrape unless scraping is allowed. :)

etatoby9y ago· 1 in thread

Does anybody know what the author means by "lead" (noun)?

I don't think it's any of the regular meanings: http://www.ldoceonline.com/search/?q=Lead

But it doesn't seem to be any of these slang terms either: http://www.urbandictionary.com/define.php?term=lead

stedaniels9y ago

https://en.wikipedia.org/wiki/Lead_generation

ge959y ago· 1 in thread

How do you push a button like hit next on a paginated page?

oli56799y ago

Right click on the 'next' button in chrome and use 'inspect element' to find its id/class/css selector and then:

browser.findElement(webdriverio.By.id('Next')).click();

pault9y ago

I don't know why more people don't use chrome extensions for scraping. Using a boilerplate[1], you can get a scraper up and running in minutes. Start a node server that serves up urls and stores parsed data, and run the scraper in the browser. Best of all, you can watch it running and debug if something goes wrong. I know it doesn't scale well if you're running a SaaS, but for personal projects and research/data normalization it's the lowest barrier to entry, in my opinion.

[1] http://extensionizr.com

headmelted9y ago

I actually love Selenium for this purpose, for much the same reasons the author mentions here.

It's almost impossible for a website to reliably detect that a client web browser is being automated, and I find I can make Selenium scripts much more adaptable to breaking changes in websites when they occur than I can when hooking up my code directly.

I actually disagree with the contention that Selenium is slower than directly scraping though. The Firefox driver has always been lightning fast for me and the bottleneck is almost always server requests that would have been necessary either way.

kingkool689y ago

It's trivial to scrape public Instagram URLs...

https://github.com/kingkool68/zadieheimlich/blob/master/func...

zzzcpan9y ago

> But if you are automating your exact actions that happen via a browser, can this be blocked?

Of course it can! You won't be able to defeat even the simplest attempt at anti-scraping based on statistical data. Even something as basic as the site keeping a list of individual rate limits for /16 subnets of actual visiting users, and you are in trouble.
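
To make the /16 idea concrete, here is a rough site-side sketch in JavaScript. The function names, the fixed-window counting and the limit values are all assumptions for illustration; a real deployment would derive limits from observed traffic statistics.

```javascript
// Map an IPv4 address to its /16 prefix, e.g. "203.0.113.7" -> "203.0.0.0/16".
function subnet16(ip) {
  const parts = ip.split(".");
  if (parts.length !== 4) throw new Error(`not an IPv4 address: ${ip}`);
  return `${parts[0]}.${parts[1]}.0.0/16`;
}

// Fixed-window request counter per /16 subnet.
// limit and windowMs are illustrative parameters, not recommended values.
function createSubnetLimiter(limit, windowMs, now = Date.now) {
  const windows = new Map(); // subnet -> { startedAt, count }
  return function allow(ip) {
    const key = subnet16(ip);
    const t = now();
    let w = windows.get(key);
    if (!w || t - w.startedAt >= windowMs) {
      w = { startedAt: t, count: 0 }; // start a fresh window for this subnet
      windows.set(key, w);
    }
    w.count += 1;
    return w.count <= limit;
  };
}
```

Because the counter is keyed on the subnet rather than the single address, rotating between addresses in the same /16 does not reset it.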

kevindeasis9y ago

Does cheerio account for single page apps? In any case thanks for the tutorial!

Anyway, I added your stuff here along with other data mining resources:

https://github.com/kevindeasis/awesome-fullstack#web-scrapin...

frostymarvelous9y ago

While everyone is busy debating whether scraping is bad or legal, I just can't stop thinking about Antigate.

Of the sweatshops that must have been set up to deliver this service. That, to me, is the true horror of this story.

unixhero9y ago

And from the trenches:

- rails application

- scraping with nokogiri gem on Ruby

- simple models doing the scraping in rails app

- some scraping is parsed with CSS selectors - nokogiri

- some scraping is parsed with regex - nokogiri

- persisting to DB, Text, even Google docs

- presentation on web, text, pdf, xls

Boom

rezashirazian9y ago

When I was building liisted.com I scraped using Selenium and it worked great.

minimaxir9y ago· 49 in thread

OKCupid did a DMCA takedown for researchers releasing scraped data: https://www.engadget.com/2016/05/17/publicly-released-okcupi...

(Disclosure: I have developed a Facebook Page Post Scraper [https://github.com/minimaxir/facebook-page-post-scraper] which explicitly follows the permissions set by the Facebook API.)

franciskimOP9y ago

Who is anyone to tell me what I can and can't automate in my life?

ccvannorman9y ago

It's baloney.

lerpa9y ago

> Who is anyone to tell me what I can and can't automate in my life?

You are exactly right. But although a site can deny you access for any arbitrary reason (it's their website, after all), obviously governments think they are the ones to enforce this crap.

This comment Terms of Service: If you read any of this text you owe lerpa $1.000.000 to be paid up until 09/01/2016.

tangue9y ago

It would be OK if it wasn't "You can't scrape my site. Unless of course you're Google". This double standard drives me mad.

sjwright9y ago

As the owner of a large website, I don't care what you think. I block by default and whitelist when I decide it's in my interest.

Analemma_9y ago

_c_9y ago

Worse is that Google tries to stop scraping. It's like they don't want anyone to see past the first page of results.

They could scrape your website and then they prevent you from scraping your own data back.

The whole process is silly; it reflects the duct tape and chicken wire nature of the www.

No one should have to "scrape" or "crawl".

Data should be put into an open universal format (no tags) and submitted when necessary (rsynced) to a public access archive, mirrored around the world.

This is to bridge the gap until we reach a more content addressable system (cf. location based).

"Crawling" should not be necessary.

No one should have to store HTML tags and other window dressing for data.

Dream on.

codeddesign9y ago

desireco429y ago

Google and others with legitimate reasons obey robots.txt.
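
For anyone wondering what obeying robots.txt looks like in code, here is a deliberately simplified JavaScript sketch. `parseDisallows` and `isAllowed` are hypothetical names. It only understands `User-agent` and `Disallow` lines with plain prefix matching; wildcards, `Allow` rules and crawl delays are not handled, and rules for `*` are merged with rules for the named agent, which is stricter than what real crawlers do.

```javascript
// Collect the Disallow prefixes that apply to userAgent (or to "*").
function parseDisallows(robotsTxt, userAgent = "*") {
  const prefixes = [];
  let applies = false;    // does the current group apply to us?
  let inAgentRun = false; // true while reading consecutive User-agent lines
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === "user-agent") {
      const match = value === "*" || value.toLowerCase() === userAgent.toLowerCase();
      // Several User-agent lines in a row share one group of rules.
      applies = inAgentRun ? applies || match : match;
      inAgentRun = true;
    } else {
      inAgentRun = false;
      if (field === "disallow" && applies && value !== "") prefixes.push(value);
    }
  }
  return prefixes;
}

// A path is allowed if no collected Disallow prefix matches its start.
function isAllowed(path, disallows) {
  return !disallows.some((prefix) => path.startsWith(prefix));
}
```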

hobs9y ago

This post is kind of crazy, aggrandizing bad behavior and misuse of others' resources against their will.

madamelic9y ago

Not really.

Scraping, in my opinion, isn't black hat unless you are actually affecting their service or stealing info.

In regards to "stealing info", as long as you aren't taking info and selling it as your own (which it seems OP is indeed doing), that is just fine.

tl;dr: Scraping isn't bad / blackhat as long as you aren't affecting their service or business.

angry_octet9y ago

'Bad netizen stuff'? Is this a comment from 1997? Breaking captchas is 'blackhat'? What cozy hippy internet alternative reality does this come from?

You're not in Kansas anymore Toto.

malz9y ago

> misuse of others' resources against their will

We respect TOS, we respect robots.txt and so on. Just because you study scraping techniques doesn't mean you intend to break the law.

> Breaking captchas and the like is basically blackhat work

Um, captchas only work if they work. If breaking them is trivial, they shouldn't exist. Don't shoot the messenger for pointing out the front door is unlocked.

Fiahil9y ago

It's 2016. State-companies holding a third party responsible for their own outages and poor planning is _bad faith_[1]. ETL? Never heard of it?

[1]: https://citymapper.com/i/1208/soutenez-citymapper-et-lopen-d... (french)

pierrebai9y ago

Yes, this defense is being petty about details, but I find businesses using post-hoc discoverable limitations to limit people's rights annoying.

Mikushi9y ago

Instagram or Facebook, they thrive on stolen or relinked content and monetize it day in day out.

flukus9y ago

> This post is kind of crazy, aggrandizing bad behavior and misuse of others' resources against their will.

How so? I send a web request, they send me the content in a response. If they aren't happy with that then they should refuse my request.

TeMPOraL9y ago

https://en.wikibooks.org/wiki/UK_Database_Law#Database_Right

madshiva9y ago

twa9279y ago

How can TOS have legal power in the case of scraping? A website is a public property. If I'm visiting it without logging in, I don't have a chance to accept TOS.

Imagine a hotel that makes guests sign a document saying they will not make photographs of the building. If I'm not a guest, I can take photographs of it and I can't even know that would be illegal.

buro99y ago

The UK has a database law:

You may access said database (via the web), but as soon as you start reconstituting the database from scraping... you're in breach.

duaneb9y ago

> A website is a public property.

This isn't even true metaphorically. It's like a shop front: there may be public access, but it is NOT public property.

cookiecaper9y ago

>How can TOS have legal power in the case of scraping? A website is a public property. If I'm visiting it without logging in, I don't have a chance to accept TOS.

>Imagine a hotel that makes guests sign a document saying they will not make photographs of the building. If I'm not a guest, I can take photographs of it and I can't even know that would be illegal.

https://en.wikipedia.org/wiki/EBay_v._Bidder%27s_Edge

sseveran9y ago

One of the original court cases covering this was eBay vs Bidders Edge.

The courts have generally disagreed with that interpretation.

dragonwriter9y ago

> A website is a public property.

No, it's not. It may be in public view, but that's a different issue.

gsnedders9y ago

Both the LinkedIn and OKC cases involved the scrapers using logged in accounts.

hooph00p9y ago

> A website is public property.

This is a gross misunderstanding of how the internet works.

minimaxir9y ago

pyre9y ago

cookiecaper9y ago

0xdeadbeefbabe9y ago

Yes laws are neat and a reason for attending law school I suppose. I'm of the simpleton opinion that TCP/IP and the other protocols are the law of the net, and you ought to start with those.

kuschku9y ago

Or you just move to a locale where scraping is legal, and any contractual terms saying otherwise are null and void.

I’d assume a lot of HN users are from such locales.

We don’t always have to assume US laws apply globally – they don’t.

cookiecaper9y ago

madamelic9y ago

Scraping being illegal is as dumb as saying it's illegal to take photos in public. You aren't affecting anyone if you do it respectfully.

downandout9y ago

cookiecaper9y ago

If you're not going to run it totally anonymously, you should be prepared to jettison and repackage it when you get found out (so that you appear to be complying with the C&D).

Scraping is a huge part of the web, and everyone does it. It sucks that it has to live underground because only big companies can duke it out in court.

siegecraft9y ago

I think the lesson learned from those lawsuits was to always have some sort of 3rd-party intermediary scraping consultancy firm you engage that is totally not just your business under another name.

kbenson9y ago

> (Disclosure: I have developed a Facebook Page Post Scraper [https://github.com/minimaxir/facebook-page-post-scraper] which explicitly follows the permissions set by the Facebook API.)

headmelted9y ago

One thing I'm not quite clear on here.

I understand the use of ToS clauses to prevent scraping but I do kind of wonder to what extent they have authority here.

cookiecaper9y ago

ChuckMcM9y ago

downandout9y ago

Actually it's been held that TOS violations are NOT subject to the criminal provisions of the CFAA.

yeowMeng9y ago

I am no expert, but I always thought you could scrape without consequence provided you never distribute your scrapings?

jurgenwerk9y ago

There are hundreds of paid services that scrape Google heavily (search engine ranking trackers). How are they legal?

lossolo9y ago

They are probably doing it from a country where it's legal. In most countries there is no law that would be applicable in this case.

cookiecaper9y ago

They aren't, or at least, they won't be if Google decides it doesn't like them anymore and decides to bring the matter to court.

knicholes9y ago

Isn't Google search based off of Google "scraping" the web?

- https://github.com/fake-name/ExHentai-Archival

sandGorgon9y ago

The code is pretty cool. Thanks for releasing that! May I ask why you built your own scraper infrastructure and did not build it on top of a known framework like scrapy (which is in Python as well)?

jbmorgado9y ago

I actually wonder if that constraint in the ToS has any legal validity in the EU. Here, that kind of stuff typically cannot be enforced legally.

fake-name9y ago· 14 in thread

I do a significant amount of scraping for hobby projects, albeit mostly open websites. As a result, I've gotten pretty good at circumventing rate-limiting and most other controls.

I suspect I'm one of those bad people your parents tell you to avoid - by that I mean I completely ignore robots.txt.

At this point, I have a system that is, as far as I can tell, definitionally a botnet. The only thing is I actually pay for the hosts.

---

It's all on github, FWIW:

Manager: https://github.com/fake-name/ReadableWebProxy

Agent and salt scheduler: https://github.com/fake-name/AutoTriever

kough9y ago

Yet another incredible technical achievement due to someone's quest for more porn (https://github.com/fake-name/AutoTriever/blob/master/setting...).

fake-name9y ago

That's a separate project:

- https://github.com/fake-name/PatreonArchiver

- https://github.com/fake-name/xA-Scraper

- https://github.com/fake-name/DanbooruScraper

Or... well, 4 separate projects. Whoops?

Well-tagged hentai is trivially accessible, though. I think there's probably a paper or two in there about the demographics of the two fan groups. People are fascinating.

Next up, automate the consumption too!

monsoon229y ago

How do you circumvent cloud provider IP blocks? For example, one site blocks all requests from AWS EC2 servers.

fake-name9y ago

None of the sites I'm scraping do that, mostly.

If I run into that sort of thing, I guess we'll see.

nickysielicki9y ago

But if the end justifies the means... http://luminati.io/

siegecraft9y ago

fake-name9y ago

Yeah, running this thing publicly would be a huge mess from a copyright perspective, since it literally re-hosts everything as a core part of how it works.

As it is, I think I'm OK, since it's basically just a "website DVR" type thing, for my own use.

atmosx9y ago

Similar, paid solution: https://scrapinghub.com/crawlera/

uptown9y ago

What would the rough costs be to run the 800k UA scenario?

fake-name9y ago

To be clear, I have a pool of 800K theoretical UAs derived from the mechanism I use to generate them, not 800K clients.

Regarding costs, I really have no idea. It depends on how rapidly you cycle the UA, and how fast whatever you're scraping is.

franciskimOP9y ago

take my money!

Greg-J9y ago

How can I get ahold of you directly?

fake-name9y ago

connorw at imaginaryindustries dot com

chmars9y ago

Wow, that's impressive!

XCSme9y ago· 14 in thread

jhwhite9y ago

When someone flips it on the company then it's immature.

But the last he told me was that he liked me and would schedule another interview; then, when he changed his mind, he never let me know.

wtracy9y ago

>When someone flips it on the company then it's immature.

fizx9y ago

As someone who has written a scraping framework, this article is useful AF.

> but all it explained was how to make a few API calls in order to solve a very specific problem.

> Also, there was the overall arrogant tone: "I found their interview approach a bit of a turn off so I did not proceed to the next interview and ignored her emails "

Your arrogance is my matter-of-fact.

zasz9y ago

You can't count as 'matter-of-fact' if you're not even bothering to communicate.

franciskimOP9y ago

Thanks for your feedback, I do appreciate it.

kentt9y ago

Not many people can take criticism in stride like that. You are awesome.

state9y ago

kafkaesq9y ago

nathancahill9y ago

Part of the turnoff for me was the middle-schooler tone and vocabulary. Good walkthrough with good code examples though, obviously written by a very smart JS dev.

franciskimOP9y ago

https://contently.com/strategist/2015/01/28/this-surprising-...

23andwalnut9y ago

Since when is a 'middle-schooler vocabulary' a bad thing? I distinctly remember learning on hn (when the Hemingway app became popular) that simple is better for readability.

franciskimOP9y ago

namelezz9y ago

> "I found their interview approach a bit of a turn off so I did not proceed to the next interview and ignored her emails ", this just shows a lot of immaturity.

XCSme9y ago

elchief9y ago· 7 in thread

To fight scrapers, we show some values as images that look like text (but not all the time)

And we insert random (non-visible) html and css classes in our site to screw with em, and use randomized css classnames. This fucks with xpaths and css selectors.

You can't stop them, but you can make their lives painful.

elmigranto9y ago

> To fight scrapers, we show some values as images that look like text

You are fighting screen readers more than anything; as well as legitimate plugins, form autofills, etc. If this is for captcha, you are fighting all the users as well.

> And we insert random (non-visible) html and css classes in our site to screw with em, and use randomized css classnames.

Legitimate browser plugins, etc. I'd just use electron or selenium with `nth-child`, `:visible`, `[class*="…"]`, etc.

What you are effectively doing is wasting time on useless stuff. This is even more useless than trying to prevent copying of DVDs or pirating games.

BrandonMarc9y ago

> What you are effectively doing is wasting time on useless stuff. This is even more useless than trying to prevent copying of DVDs or pirating games.

elchief9y ago

Except traffic from known scrapers (or what appear to be) is down 20%

Sure, xpath and css selector experts can figure it out, but that's not everyone

emodendroket9y ago

This also hurts accessibility for disabled users.

soared9y ago

Not if the img alt text is the same

insulanian9y ago

It would be better to invest that time in making an API so they don't need to scrape.

franciskimOP9y ago

Haven't come across those yet but yes I guess it could be painful.

mack739y ago· 6 in thread

dspillett9y ago

> Corporations will abuse your personal integrity whenever they get a chance, while abiding the law.

cm21879y ago

I would use Microsoft as a precedent. Sure they will attempt to stay legal but by pushing it as far as they can.

For instance the browser choice script that came with Windows imposed by the EU never worked. It was a "bug". Somehow they must have omitted to test the feature...

niftich9y ago

Often, it's indeed cheaper to pay a government-mandated fine than lose market opportunities afforded by behavior that later runs afoul of some law or regulation.

nerdponx9y ago

The difference is that Google didn't agree to not scrape your data. You, as per their TOS, agreed not to scrape theirs, as part of the condition of using their service.

Bootvis9y ago

Which TOS?

I might have accepted terms when I created a Google Account but in no way do I agree to a TOS by visiting a URL.

joantune9y ago

Regarding the scraping and the legality of it all. I wonder if it's still illegal if you respect the robots.txt and other meta elements in html standards.

If Google's actions were illegal, I'm sure that they would have been sued even if their scraping and indexing usually is helpful for the website owner

franciskimOP9y ago· 6 in thread

Sorry guys, hit by traffic - just scaling my EC2 at the moment.

niftich9y ago

No worries, we had your page scraped just in case ;)

Google Cache link: http://webcache.googleusercontent.com/search?q=cache:https:/...

Archive.is link: http://archive.is/DQccs

franciskimOP9y ago

haha :)

fareesh9y ago

hk__29y ago

Yes, everybody scrapes the prices of the others.

ksahin9y ago

I'd just like to know: how much traffic did you get?

franciskimOP9y ago

ok on m4.4xlarge now :)

prashnts9y ago· 5 in thread

A neat trick I sometimes use to "scrape" data from sites that use jquery ajax to load data is to plug in a middleware in jquery xhr:

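      // Hook every jQuery AJAX response before jQuery parses it.
      $.ajaxSetup({
        dataFilter: function (data, type) {
          // `this` is the request's settings object; `data` is the raw response.
          if (this.url === 'some url that you want to watch!') {
            // Do anything with the data here
            awesomeMethod(data)
          }
          return data
        }
      })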

macromaniac9y ago

With a selector it's easy to grab data. Here's a Linux command that gets every user that posted in this thread:

  lynx -base -source 'https://news.ycombinator.com/item?id=12345693' | hxnormalize -x | \
    hxselect -c -s '\n' "td > table > tbody > tr > td.default > div:nth-child(1) > span > a.hnuser"

Here are the most frequent commenters:

     27 cookiecaper
     22 franciskim
      6 fake-name
      4 niftich
      4 flukus
      4 elmigranto
      4 downandout
      3 tedunangst
      3 siegecraft
      3 muglug
      3 minimaxir
      3 madamelic
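
The tally above is just a frequency count over the extracted usernames. For anyone doing the counting in a script rather than the shell, a small JavaScript sketch might look like this; `topCommenters` is a hypothetical name, and breaking ties alphabetically is my choice here, not necessarily how the list above was ordered.

```javascript
// Count how often each name occurs and return the most frequent ones
// as [name, count] pairs, highest count first.
function topCommenters(names, limit = 12) {
  const counts = new Map();
  for (const name of names) counts.set(name, (counts.get(name) || 0) + 1);
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit);
}
```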

pault9y ago

esac9y ago

Do you have something? I was going to write the very same extension (but distributed, so I could add it to my PC and my friends' PCs) but never did.

[1]https://github.com/adam-s/playboy-fm/blob/master/server/scra...

franciskimOP9y ago

Yup I've been known to do this as well :) I'd have a Node.js + Mongo endpoint ready on the other side.

zappo29389y ago

Why not use Nightmare with Node.js + Mongo?

Here is an example of injecting a jQuery script into a page with jQuery loaded and getting nicely formatted information returned. [1]

dchuk9y ago· 5 in thread

That aside, hitting Insta like this is playing with fire, because you're really dealing with Facebook and their legal team.

spikej9y ago

Serious question: What do you gain from having an extra layer like docker?

siegecraft9y ago

dchuk9y ago

Because you can use pre-packaged Selenium in Docker images with a few commands: https://github.com/SeleniumHQ/docker-selenium

franciskimOP9y ago

Selenium grid runs in docker, so it's easy to have multiple instances running. Better control.

franciskimOP9y ago

True that, I hope Zuck sues me so I'll get extra famous

writeslowly9y ago· 5 in thread

madamelic9y ago

Most really don't. A lot of big sites don't seem to care, at least in my experience.

detaro9y ago

Wikipedia provides you with an API and guidelines on how to use it, so you really shouldn't be scraping it directly or so much you hit enforced limits.

[1] https://www.npmjs.com/package/downcache

franciskimOP9y ago

I'm not actually doing a lot of hits, so it's generally been ok. I can just rotate my IP or solve the CAPTCHA.

freshhawk9y ago

But there seems to be little demand for these kinds of systems and just throttling/blocking/CAPTCHA solutions are much simpler.

xur179y ago

slig9y ago· 4 in thread

I wonder how effective the CloudFlare anti-scraper protection is against this approach of breaking captchas.

Also, I find it interesting that big websites don't just block all traffic from AWS IPs as they do with Tor.

user59944619y ago

There can be legitimate traffic coming from AWS, if not the site itself.

On the other hand, Tor is likely to be 90% evil. When in doubt, just block it. (That makes me think, I should run some proper stats and maybe publish a blog post about that.)

slig9y ago

> There can be legitimate traffic coming from AWS, if not the site itself.

The traffic from the site itself, if it's hosted there, would come from the intranet IP address, right? Not the public facing one.

Agreed, but it's fairly easy to block the AWS IP traffic on web endpoints and not on the API endpoints.

franciskimOP9y ago

I think I might have some trouble with some reCAPTCHA stuff, but there must be ways around it. I agree with you on your point about AWS.

greglindahl9y ago

There are a fair number of people in China etc running personal VPNs on AWS.

ben_jones9y ago· 4 in thread

Currently getting 502 Gateway. Guessing this post is also trending on reddit and we hugged it to death :(.

franciskimOP9y ago

Just upgraded my EC2 :)

franciskimOP9y ago

I'm on Reddit?

soared9y ago

Google Analytics > Acquisition > Source/Medium > type "reddit" in search bar. Add secondary dimension "referral path"

ben_jones9y ago

It's a total guess. HN rarely hugs sites to death compared to Reddit (IMO).

stupidcar9y ago· 3 in thread

ralusek9y ago

Threading in node is very easy, just use clusters. Alternatively, take any of the CPU intensive activity, like parsing the HTML and formatting as JSON, and just put that on an AWS lambda.

You can invoke as many lambdas from your application as you want in parallel and you're not going to be bottlenecked by your CPU :)

stupidcar9y ago

Clustering in Node creates isolated child processes, not threads. I needed to have shared queues, in-memory caches, and hashes to coordinate workers and avoid them doing duplicate work.

I'm not trashing Node. I like it. There's a reason I used it in the first place, after all. But for this particular use-case, I didn't find it was a very good fit.
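
To illustrate the coordination problem: since cluster workers don't share memory, one common workaround is to keep the queue and the "seen" set in a single coordinating process and hand work out from there. Below is a rough JavaScript sketch of that central piece only. `WorkQueue` is a hypothetical name, and the message passing between the coordinator and the workers is left out.

```javascript
// De-duplicating work queue meant to live in ONE coordinating process.
// Workers would ask that process for the next URL instead of keeping
// their own queues, so no URL is fetched twice.
class WorkQueue {
  constructor() {
    this.pending = [];     // URLs waiting to be fetched, in FIFO order
    this.seen = new Set(); // every URL ever enqueued
  }

  // Returns true if the URL was new and got queued, false if it was a duplicate.
  enqueue(url) {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.pending.push(url);
    return true;
  }

  // Returns the next URL to fetch, or null when the queue is empty.
  next() {
    return this.pending.length > 0 ? this.pending.shift() : null;
  }
}
```

The trade-off is that every hand-off costs a message round trip, which is part of why this feels clumsier than shared-memory threads.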