OkCupid issued a DMCA takedown against researchers who released scraped data: https://www.engadget.com/2016/05/17/publicly-released-okcupi...
Since both of these incidents, I now only scrape a) through the API, following its rate limits, or b) if there is no API and the data has the explicit purpose of being shared publicly (e.g. blogs), while following robots.txt. Of course, most companies have a do-not-scrape clause in their ToS anyway, to my personal frustration.
(Disclosure: I have developed a Facebook Page Post Scraper [https://github.com/minimaxir/facebook-page-post-scraper] which explicitly follows the permissions set by the Facebook API.)
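For what it's worth, the "follow robots.txt" rule in (b) can be sketched roughly as below. This is a simplified check, not a full parser: it only understands User-agent, Allow and Disallow lines with plain path prefixes (no wildcards, no Crawl-delay), matches agents by substring, and lets the longest matching prefix win. The function names and the bot name in the usage comment are made up for illustration.

```javascript
// Parse robots.txt text into groups of { agents, rules }.
function parseRobots(text) {
  const groups = [];
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === 'allow' || field === 'disallow') && current) {
      // An empty Disallow means "nothing is disallowed", so it adds no rule.
      if (value !== '') {
        current.rules.push({ allow: field === 'allow', path: value });
      }
      lastWasAgent = false;
    } else {
      lastWasAgent = false;
    }
  }
  return groups;
}

// Decide whether `path` may be fetched by `userAgent`.
function isAllowed(groups, userAgent, path) {
  const ua = userAgent.toLowerCase();
  // Prefer a group that names this agent, otherwise fall back to "*".
  const group =
    groups.find(g => g.agents.some(a => a !== '*' && a !== '' && ua.includes(a))) ||
    groups.find(g => g.agents.includes('*'));
  if (!group) return true;
  let best = null;
  for (const rule of group.rules) {
    if (!path.startsWith(rule.path)) continue;
    // Longest matching prefix wins; Allow wins a tie.
    if (!best || rule.path.length > best.path.length ||
        (rule.path.length === best.path.length && rule.allow)) {
      best = rule;
    }
  }
  return best ? best.allow : true;
}

// Usage (hypothetical bot name):
//   const groups = parseRobots(robotsTxtText);
//   if (isAllowed(groups, 'my-side-project-bot', '/blog/some-post')) { /* fetch */ }
```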
For me it's purely for personal use and my little side projects. I don't even like the word scraping because it comes loaded with so many negative connotations (which sparked this whole comment thread) - and for a good reason - it reflects the demand in the market. People want cheap leads to spam, and that's a bad use of technology.
Generally I tend to focus more on words and phrases like 'automation' and 'scripting a bot'. I'm just automating my life, I'm writing a bot to replace what I would have to do on a daily basis - like looking on Facebook for some gifs and videos then manually posting them to my site. Would I spend an hour each and every day doing this? No, I'm much lazier than that.
Who is anyone to tell me what I can and can't automate in my life?
It's baloney.
You are exactly right. But although a site can deny you access for any arbitrary reason (it's their website, after all), governments obviously think they are the ones to enforce this crap.
What if the ToS say you can only access a site while jumping through hoops? Only read the ToS after a while and weren't hooping? Well, too bad, now you are being sued for reading the main page _and_ the ToS page without jumping around.
This comment Terms of Service: If you read any of this text you owe lerpa $1.000.000 to be paid up until 09/01/2016.
If you don't think this is reasonable, chances are you've never run a large website, or analyzed the logs of a large website. You'd be astonished how much robotic activity you'll receive. If left unchecked it can easily swamp legitimate traffic.
Unless you have a way for me to automatically identify "honourable" scrapers such as yourself as distinct from the thousands upon thousands of extremely dodgy scrapers from across the world, my policy shall remain.
They could scrape your website and then prevent you from scraping your own data back.
The whole process is silly; it reflects the duct tape and chicken wire nature of the www.
No one should have to "scrape" or "crawl".
Data should be put into an open universal format (no tags) and submitted when necessary (rsynced) to a public access archive, mirrored around the world.
This to bridge the gap until we reach a more content addressable system (cf. location based).
Clients (text readers, media players, whatever) can download and transform the universally formatted data into markup, binary, etc. -- whatever they wish, but all the design creativity and complexity of "web pages" or "web apps" can be handled at the network edge, client-side.
"Crawling" should not be necessary.
No one should have to store HTML tags and other window dressing for data.
Dream on.
Scraping against the TOS is super bad netizen stuff, and I don't think people should be posting positive reviews of people doing this. Breaking captchas and the like is basically blackhat work and should be looked down upon, not congratulated as I see in this thread.
Not really.
Scraping, in my opinion, isn't black hat unless you are actually affecting their service or stealing info.
If you are slamming the site with requests because of your scraping, yeah you need to knock it off. If you throttle your scraper in proportion to the size of their site, you aren't really harming them.
In regards to "stealing info", as long as you aren't taking info and selling it as your own (which it seems OP is indeed doing), that is just fine.
tl;dr: Scraping isn't bad / blackhat as long as you aren't affecting their service or business.
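On the throttling point above, here is a minimal sketch of spacing requests out. It only computes how long the caller should wait before the next request; the class name and the one-second interval in the usage comment are assumptions, and picking a delay "in proportion to the size of their site" is left to whoever runs it.

```javascript
// Hands out request slots that are at least `minDelayMs` apart.
class Throttle {
  constructor(minDelayMs) {
    this.minDelayMs = minDelayMs;
    this.nextSlot = 0; // earliest time (ms) at which the next request may start
  }

  // Reserves the next slot and returns how many ms the caller should wait.
  reserve(nowMs) {
    const start = Math.max(nowMs, this.nextSlot);
    this.nextSlot = start + this.minDelayMs;
    return start - nowMs;
  }
}

// Usage (hypothetical, inside an async function):
//   const throttle = new Throttle(1000); // at most one request per second
//   for (const url of urls) {
//     const waitMs = throttle.reserve(Date.now());
//     await new Promise(resolve => setTimeout(resolve, waitMs));
//     // ... fetch url here ...
//   }
```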
These are the same websites and companies that are loading evercookies and doing browser fingerprinting, that break as much as possible the anonymity citizens should enjoy, with Real Name policies, using network analysis to find who your friends are and what your politics and buying habits are, that routinely rip private information from your cell phone and share it with oppressive regimes.
You're not in Kansas anymore Toto.
Nonsense, there is no implication that this activity is illicit. Many sites (I have worked with hundreds) are happy to be included in my service, but don't have the technical ability to provide a data feed. They were delighted when I told them I could aggregate their content without any extra work on their part.
We respect TOS, we respect robots.txt and so on. Just because you study scraping techniques doesn't mean you intend to break the law.
> Breaking captchas and the like is basically blackhat work
Um, captchas only work if they work. If breaking them is trivial, they shouldn't exist. Don't shoot the messenger for pointing out the front door is unlocked.
If your administration doesn't have the resources (and that's often the case) to maintain a proper JSON API for you to fetch with a fancy python lib, then it's not "super bad netizen stuff" to scrape a few HTML/PDF/XLS files, parse them and display them for convenient public consumption on your personal website (while paying for the bandwidth).
It's 2016. State-companies holding a third party responsible for their own outages and poor planning is _bad faith_[1]. ETL? Never heard of it?
[1]: https://citymapper.com/i/1208/soutenez-citymapper-et-lopen-d... (french)
Yes, this defense is being petty about details, but I find businesses using post-hoc discoverable limitations to limit people's rights annoying.
Being amazed at this kind of bad behaviour where the targets are some of the most despicable companies on the web is a bit ironic. Scrape away, these companies hurt the web, let's hurt them (even though, all the scraping in the world won't have any impact).
How so? I send a web request, they send me the content in a response. If they aren't happy with that then they should refuse my request.
Imagine a hotel that makes guests sign a document saying they will not make photographs of the building. If I'm not a guest, I can take photographs of it and I can't even know that would be illegal.
https://en.wikibooks.org/wiki/UK_Database_Law#Database_Right
If you scrape, and effectively reconstitute a database, then so long as the database originally had a "substantial investment" in its "obtaining, verifying or presenting the contents" then yup... you have breached the database right, which is a modified form of copyright.
You may access said database (via the web), but as soon as you start reconstituting the database from scraping... you're in breach.
It's a law, it is illegal in the UK, I'm sure most countries have some equivalent law on their books, all of the EU does. The law looks recent, but UK copyright and patent used to cover it, the 1997 date is just a separate statute to clarify the position.
This isn't even true metaphorically. It's like a shop front: there may be public access, but it is NOT public property.
This is called "browsewrap" (as opposed to "clickwrap", where you have to click an "I agree" button). There is usually a notice in the footer of each page that says something like "By using this site, you agree to our Terms of Service." Typically, this kind of notice has been held enforceable. More recently, judges have been demanding that such notices be placed more prominently before they're held enforceable (e.g., somewhere above the fold), but that's it.
>Imagine a hotel that makes guests sign a document saying they will not make photographs of the building. If I'm not a guest, I can take photographs of it and I can't even know that would be illegal.
The reasonable laws that exist in meatspace are not applicable online, because once you hit someone else's server, you're considered to be on their property and they have the right to control what you do there. There is no "public property" from which to safely stand and take photographs in the internet.
Also, photographs of structures may not be free to use. Architectural copyrights went into effect in the early 90s and have a term of either 90 or 120 years. Thus, if you take a photograph of a building built in 1991 and the year is not yet 2111, there is a chance that the architect can claim infringement.
https://en.wikipedia.org/wiki/EBay_v._Bidder%27s_Edge
The courts have generally disagreed with that interpretation.
No, it's not. It may be in public view, but that's a different issue.
This is a gross misunderstanding of how the internet works.
The CFAA is a really bad law and creates the network effect lock-in that we all considered a natural part of the web. It doesn't have to be that way -- users should be free to use any browsing appliance they want, including so-called "scrapers".
Big companies like Google not only got their start by flagrantly violating the CFAA, copyright, and privacy laws, but they continue to do so. The moral of the story is hurry up and get big before you get sued or arrested.
There's a long history of ridiculous web scraping rulings based on technical misunderstandings by neophyte judges, including Ticketmaster v. RMG, where infringement was found because the company scraped data out of a page with the Ticketmaster logo on it.
Facebook sued a company called Power Ventures which read out only the user's own data. The founder was found personally liable for $3 million in damages. Facebook did this because they don't want it to be easy for their users to move between social media services. If it's easy, Facebook has to compete on merit instead of just keeping switching costs high. Facebook doesn't like that, so they sue people who make it possible -- and the law says they should win.
We badly need a revised law, but the powers-that-be will strongly oppose it because it would threaten their monopoly over web properties. They continue to flaunt their strategic ignorance of these laws and then take shelter behind them to stop risk from small innovators (i.e., having to compete fair and square).
In the real world, we have a lot of laws that mostly prevent this kind of bad behavior. In cyberspace, the structure is such that most of those laws are not applicable. We need to update and port the pro-small-business logic we have for meatspace companies so that it counts online too. The state of affairs online is really bad.
I want to get a law called the "Consumer Data Freedom Act" passed, which would allow users to access any web property with any non-disruptive browsing device, including custom scrapers that don't impose much more load than a typical user browsing session would.
I’d assume a lot of HN users are from such locales.
We don’t always have to assume US laws apply globally – they don’t.
If you're not going to run it totally anonymously, you should be prepared to jettison and repackage it when you get found out (so that you appear to be complying with the C&D).
Scraping is a huge part of the web, and everyone does it. It sucks that it has to live underground because only big companies can duke it out in court.
I played with the idea of creating some social aggregation type service with some friends (as a business). The more I read about FB's past behavior with regard to this, and how essential they are to any sort of service like that, the less viable it looked, so I canned the project. Regardless of what their TOS say, if you get on their radar and they send you a cease-and-desist, it's game over. Facebook is not in the business of subverting their revenue stream, so if you are making money off them and it's preventing them from capitalizing on their users, don't expect to last long if you exist by their grace.
Really, there's an interesting space between so small nobody cares and large enough that getting shut down is a real problem. A lot of projects start small and end up (relatively) large, but without a good way to pay for the service itself. While not every service needs to be a business and make money, once you reach the level where you risk either being shut out of your data source or you need to somehow work out an understanding with that source, how do you approach that when being able to pay is off the table? Not to mention the problem of approaching them before you have to and forcing the situation, or waiting too long and risking the wrath of the source because you've abused their service as long as you have. Has anyone else been in this situation and found an approach that works?
I understand the use of ToS clauses to prevent scraping but I do kind of wonder to what extent they have authority here.
IANAL, but surely this would fall under copyright law? While re-publishing copyright-protected data without consent is probably unlawful in your region (like scraping an art site and re-posting the images), I wouldn't think just scraping data points for a different purpose (like scraping amazon for the purposes of price comparison) is nearly so clear cut (or enforceable), but maybe I'm just naive.
Companies like PriceZombie are forced to stop because the CFAA says that Amazon can prevent them from accessing their servers by decree alone. A ToS isn't even really necessary for this, but it helps them pin down their argument.
PriceZombie could try to get the data from third-party caches, but it only solves part of the problem, because copyright and trademarks come back into the picture once you have a replica of the target page. In Ticketmaster v. RMG Technologies, the judge found RMG infringing on Ticketmaster's trademarks and copyrights because the page they were scraping included Ticketmaster's logo. The judge said the copy of the full page that existed momentarily in RAM while the scraper extracted the non-copyrightable data constituted a copy that infringed on Ticketmaster's rights, even though the logo was never used by the application in any way, it just happened to be on the page.
The CFAA says it's a crime to exceed "authorized access". Authorized access is whatever the server's owner says it is. If they change their mind, you must cease and desist or risk both civil and criminal penalties. A contract defining the length and nature of your authorization from the server's owner would go a long way to establishing your rights to access, but no one is going to give that to a small player.
I suspect I'm one of those bad people your parents tell you to avoid - by that I mean I completely ignore robots.txt.
At this point, my architecture has settled on a distributed RPC system with a rotating swarm of clients. I use RabbitMQ for message passing middleware, SaltStack for automated VM provisioning, and python everywhere for everything else. Using some randomization, and a list of the top n user agents, I can randomly generate ~800K unique but valid-looking UAs. Selenium+PhantomJS gets you through non-captcha Cloudflare. Backing storage is Postgres.
Database triggers do row versioning, and I wind up with what is basically a mini internet-archive of my own, with periodic snapshots of a site over time. Additionally, I have a readability-like processing layer that re-writes the page content in hopes of making the resulting layout actually pleasant to read on, with pluggable rulesets that determine page element decomposition.
At this point, I have a system that is, as far as I can tell, definitionally a botnet. The only thing is I actually pay for the hosts.
---
Scaling something like this up to high volume is really an interesting challenge. My hosts are physically distributed, and just maintaining the RabbitMQ socket links is hard. I've actually had to do some hacking on the RabbitMQ library to let it handle the various ways I've seen a socket get wedged, and I still have some reliability issues in the SaltStack-DigitalOcean interface where VM creation gets stuck in an infinite loop, leading to me bleeding all my hosts. I also had to implement my own message fragmentation on top of RabbitMQ, because literally no AMQP library I found could reliably handle large (>100K) messages without eventually wedging.
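To illustrate the general idea of message fragmentation mentioned above (not the code from the linked repos, which is Python): split a large payload into numbered parts on the sending side and put them back together on the receiving side, regardless of arrival order. The field names (messageId, index, total) and the chunk size are assumptions, and the broker itself is left out entirely.

```javascript
// Split a Buffer into parts of at most `maxChunkBytes` each.
function fragment(payload, maxChunkBytes, messageId) {
  const total = Math.max(1, Math.ceil(payload.length / maxChunkBytes));
  const parts = [];
  for (let i = 0; i < total; i++) {
    parts.push({
      messageId,
      index: i,
      total,
      data: payload.subarray(i * maxChunkBytes, (i + 1) * maxChunkBytes),
    });
  }
  return parts;
}

// Collects parts per messageId until a message is complete.
class Reassembler {
  constructor() {
    this.pending = new Map(); // messageId -> { parts, received }
  }

  // Returns the full Buffer once every part has arrived, otherwise null.
  add(part) {
    let entry = this.pending.get(part.messageId);
    if (!entry) {
      entry = { parts: new Array(part.total), received: 0 };
      this.pending.set(part.messageId, entry);
    }
    // Ignore duplicates so a redelivered part is not counted twice.
    if (entry.parts[part.index] === undefined) {
      entry.parts[part.index] = part.data;
      entry.received++;
    }
    if (entry.received < part.total) return null;
    this.pending.delete(part.messageId);
    return Buffer.concat(entry.parts);
  }
}
```

Each part would then go out as its own small message, so no single message gets anywhere near the size that was causing trouble.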
There are other fun problems too, like the fact that I have a postgres database that's ~700 GB in size, which means you have to spend time considering your DB design and doing query optimization too. I apparently have big data problems in my bedroom (My home servers are in my bedroom closet).
---
It's all on github, FWIW:
Manager: https://github.com/fake-name/ReadableWebProxy
Agent and salt scheduler: https://github.com/fake-name/AutoTriever
- https://github.com/fake-name/ExHentai-Archival
- https://github.com/fake-name/PatreonArchiver
- https://github.com/fake-name/xA-Scraper
- https://github.com/fake-name/DanbooruScraper
Or... well, 4 separate projects. Whoops?
At one point, a friend and I were looking at trying to basically replicate the google deep-dream neural net thing, only with a training set of porn. It turns out getting a well tagged dataset for training is somewhat challenging.
Well-tagged hentai is trivially accessible, though. I think there's probably a paper or two in there about the demographics of the two fan groups. People are fascinating.
Next up, automate the consumption too!
I'm not scraping high value sites like that (I mostly target amateur original content). It's not really of interest to other businesses. As such, I tend to just run into things like normal Cloudflare-wrapped sites, and one place that tried to detect bots and return intentionally garbled data.
If I run into that sort of thing, I guess we'll see.
But if the end justifies the means... http://luminati.io/
As it is, I think I'm OK, since it's basically just a "website DVR" type thing, for my own use.
Really, if nothing else, the project has been enormously educational for me. I've learnt a boatload about distributed systems, learned a bit of SQL, dicked about with databases a bunch, and actually experienced deploying a complex multi-component application across multiple disparate data centers.
Regarding costs, I really have no idea. It depends on how rapidly you cycle the UA, and how fast whatever you're scraping is.
I'm not saying you're one of these people, but it's frustrating when companies do this to potential employees and the potential employee is told by friends and other management type people, "well that's the company, you just have to deal with it".
When someone flips it on the company then it's immature.
I applied somewhere recently and they invited me out to a pre-interview lunch. That went well so they called me in for an interview. That went well and the VP told me he would call me back to set up a second (third?) interview.
I never heard back from him. An ex-coworker there went to the VP to find out what was going on and the VP said he decided he wanted someone with more experience in the specific area they're working in.
But last he told me was he liked me and would schedule another interview, then when he changed his mind he never let me know.
I think people on both sides should be courteous and respectful through the process, but if employers are treating interviewees poorly then they shouldn't be surprised when they start getting treated poorly.
First, it's hard to know when companies are doing this intentionally versus when things just get lost in the shuffle. (Never attribute to malice what can be explained by incompetence, and all that.) Meanwhile, the author was clearly ignoring the interviewer intentionally.
Second, the fact that Company A treated you rudely doesn't give you license to treat unrelated Company B rudely. For that matter, I'm not sure that the fact that Employee 1 at Company A treated you rudely gives you moral license to treat Employee 2 at Company A rudely. Show a little compassion for someone trapped in a dead-end job trying to put food on their family's table, for crying out loud.
> but all it explained was how to make a few API calls in order to solve a very specific problem.
Yeah, the very specific problems everyone runs into time after time. He presents specific solutions, and reasonable context. If I was googling for one of these problems, I'd be very happy to run into this page.
> Also, there was the overall arrogant tone: "I found their interview approach a bit of a turn off so I did not proceed to the next interview and ignored her emails "
Your arrogance is my matter-of-fact.
In reply to XCSme - no I am not new to Node, and the point of my post is to illustrate some of the techniques that I haven't seen published anywhere to HN and the community. My focus is quite different from what you think it is, so maybe it is my bad for bad writing skills, I'm still new to writing and learning.
https://contently.com/strategist/2015/01/28/this-surprising-...
I believe you should treat others how you want to be treated. FYI, recruiters do not usually follow up with rejected candidates and many are unresponsive. It's their way of telling candidates they are no longer interested.
And we insert random (non-visible) html and css classes in our site to screw with em, and use randomized css classnames. This fucks with xpaths and css selectors.
You can't stop them, but you can make their lives painful.
You are fighting screen readers more than anything; as well as legitimate plugins, form autofills, etc. If this is for captcha, you are fighting all the users as well.
> And we insert random (non-visible) html and css classes in our site to screw with em, and use randomized css classnames.
Legitimate browser plugins, etc. I'd just use electron or selenium with `nth-child`, `:visible`, `[class*="…"]`, etc.
What you are effectively doing is wasting time on useless stuff. This is even more useless than trying to prevent copying of DVDs or pirating games.
Can you be so sure? The Union blockade of the Confederacy had plenty of holes, and smugglers / privateers / blockade-runners made good money getting through (when they survived) ... but that doesn't mean the blockade wasn't effective all the same at weakening the Confederate military and economy.
Sure, xpath and css selector experts can figure it out, but that's not everyone
They consider their data to be theirs, even though they published it on the internet. They consider your data (your personal integrity) to be theirs as well, because how can you assume personal integrity when you are surfing the internet?
I have high hopes that the judicial system some time not too far from now will realize that since the law should be a reflection of the current moral standings it will always be behind, trying to catch up with us and that those who break the law while not breaking the current moral standings are still "good citizens" unworthy of prison or fines.
I guess Google won this iteration of the internet because of the double standards site owners stand by, to allow Google to scrape anything while hindering any competitors from doing the same. There will only be a true competitor to Google when we in the next iteration of the internet realize that searching vast amounts of data (the internet) is a solved problem, that anyone can do as good a job as Google, and move on to the next quirk, around which there will be competition, and in the end that quirk will be solved, we'll have a winner, signaling that it is time to move on to the next iteration.
Call me cynical if you will, but I'd leave "while abiding the law" out of that, or at least replace it with "while hoping they aren't breaking the law". Due diligence on these matters is often sadly lacking. They'll take the information first and only consider any such implications when/if they come up later.
Large organisations like Google probably will make the up-front effort to remain legal, because they are in the public eye enough for lack of doing so to attract a lot of unwanted press, but you don't have to get a lot smaller than that to start finding companies who are a lot less careful (or in some cases wilfully negligent).
For instance the browser choice script that came with Windows imposed by the EU never worked. It was a "bug". Somehow they must have omitted to test the feature...
Until last year Microsoft started playing nice, and I think Google and Facebook have become the new corporate villains. But recently the Windows team seems to be minded to challenge them in that position.
I might have accepted terms when I created a Google Account but in no way do I agree to a TOS by visiting a URL.
If Google's actions were illegal, I'm sure that they would have been sued even if their scraping and indexing usually is helpful for the website owner
Google Cache link: http://webcache.googleusercontent.com/search?q=cache:https:/...
Archive.is link: http://archive.is/DQccs
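  $.ajaxSetup({
    // dataFilter sees the raw response of every jQuery AJAX request on the page
    dataFilter: function (data, type) {
      // `this` is the settings object of the request, so `this.url` is its URL
      if (this.url === 'some url that you want to watch!') {
        // Do anything with the data here
        awesomeMethod(data)
      }
      // Always hand the response back so the page keeps working
      return data
    }
  })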
I remember last using it with an infinite-scroll page with a periodic callback that scrolled the page down every 2 seconds, and the `awesomeMethod` just initiated the download. Pasted it all in dev-tools console, and the cheap "scraper" was ready!
With a selector it's easy to grab data. Here's a Linux command that gets every user that posted in this thread:
lynx -base -source 'https://news.ycombinator.com/item?id=12345693' | hxnormalize -x | \
hxselect -c -s '\n' "td > table > tbody > tr > td.default > div:nth-child(1) > span > a.hnuser"
Here are the most frequent commenters:
  27 cookiecaper
  22 franciskim
  6 fake-name
  4 niftich
  4 flukus
  4 elmigranto
  4 downandout
  3 tedunangst
  3 siegecraft
  3 muglug
  3 minimaxir
  3 madamelic
Here is an example of injecting a jQuery script into a page with jQuery loaded and getting nicely formatted information returned. [1]
[1]https://github.com/adam-s/playboy-fm/blob/master/server/scra...
That aside, hitting Insta like this is playing with fire, because you're really dealing with Facebook and their legal team.
The few that I've seen just 'ban' your IP for a few minutes. If you hit Wikipedia too much too quickly, they will essentially refuse to serve you for a while. It was a number of years ago I was doing it, but basically you would be scraping then you would just stop getting info (Maybe I wasn't reading response codes and could've realized quicker what was happening)
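On the "maybe I wasn't reading response codes" part, a small sketch of what checking them could look like. It assumes the site signals rate limiting with HTTP 429 or 503 and possibly a Retry-After header given in seconds, which may not be what any particular site (Wikipedia included) actually sends. The function name and default delays are made up.

```javascript
// Returns how long to back off (in ms) before retrying, or null when the
// status does not look like a rate-limit signal.
function retryDelayMs(status, retryAfter, attempt, { baseMs = 1000, maxMs = 5 * 60 * 1000 } = {}) {
  if (status !== 429 && status !== 503) return null;
  // Only the "number of seconds" form of Retry-After is handled here; the
  // HTTP-date form falls through to exponential backoff.
  const seconds = Number(retryAfter);
  if (retryAfter != null && retryAfter !== '' && Number.isFinite(seconds) && seconds >= 0) {
    return Math.min(seconds * 1000, maxMs);
  }
  // No usable header: double the wait on every attempt, up to maxMs.
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

The point is just that the scraper stops and waits when told to, instead of continuing to hammer a server that has already stopped answering.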
But there seems to be little demand for these kinds of systems and just throttling/blocking/CAPTCHA solutions are much simpler.
Also, I find it interesting that big websites don't just block all traffic from AWS IPs as they do with Tor.
It's especially true when the site provides an API and is meant to be integrated by people/companies. In which case, the AWS traffic is likely to include major and/or important and/or paying customers. You really don't want to block that.
On the other hand, Tor is likely to be 90% evil. When in doubt, just block it. (That makes me think, I should run some proper stats and maybe publish a blog post about that. )
The traffic from the site itself, if it's hosted there, would come from the intranet IP address, right? Not the public facing one.
> It's especially true when the site provides an API and is meant to be integrated by people/companies. In which case, the AWS traffic is likely to include major and/or important and/or paying customers. You really don't want to block that.
Agreed, but it's fairly easy to block the AWS IP traffic on web endpoints and not on the API endpoints.
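A rough sketch of that split: block requests from listed IPv4 ranges on the web endpoints while leaving the API alone. The ranges below are placeholder documentation blocks, not real AWS ranges, and the '/api/' prefix is an assumption about how the endpoints are laid out.

```javascript
// Convert dotted IPv4 text to a number (plain arithmetic avoids 32-bit sign issues).
function ipv4ToInt(ip) {
  return ip.split('.').reduce((acc, octet) => acc * 256 + Number(octet), 0);
}

// True when `ip` falls inside the CIDR block, e.g. inCidr('10.1.2.3', '10.0.0.0/8').
function inCidr(ip, cidr) {
  const [base, bitsText] = cidr.split('/');
  const size = 2 ** (32 - Number(bitsText));
  const start = Math.floor(ipv4ToInt(base) / size) * size;
  const value = ipv4ToInt(ip);
  return value >= start && value < start + size;
}

// Placeholder ranges (reserved for documentation); a real list would come
// from the hosting provider's published ranges.
const BLOCKED_RANGES = ['203.0.113.0/24', '198.51.100.0/24'];

// Block listed ranges on web pages only; API paths are always let through.
function shouldBlock(ip, path) {
  if (path.startsWith('/api/')) return false;
  return BLOCKED_RANGES.some(range => inCidr(ip, range));
}
```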
However, when the network was no longer a bottleneck, I found that the speed and single-threaded nature of Node became one. It wasn't really that slow, relatively speaking, but I had a few hundred gigs of HTML to chew through every time I made a correction, so it was important to keep the turnaround as fast as possible.
I eventually managed to manually partition the task so I could launch separate Node scripts to handle different parts of it, but it wasn't a perfect split, and there was a fair bit of duplicated work, where a shared cache would have helped a great deal.
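One way the split could have been made less manual, sketched under assumptions: hash some stable key (a file name, a URL) and use it to pick the worker, so the same key always lands on the same process and nothing is parsed twice. The hash is an FNV-1a style hash over UTF-16 code units and the function names are made up. This does not replace a shared cache, it only avoids two scripts picking up the same input.

```javascript
// FNV-1a style 32-bit hash over the string's UTF-16 code units.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Assign every key to exactly one of `workerCount` buckets.
function partition(keys, workerCount) {
  const buckets = Array.from({ length: workerCount }, () => []);
  for (const key of keys) {
    buckets[fnv1a(key) % workerCount].push(key);
  }
  return buckets;
}

// Usage (hypothetical): worker number N only processes partition(allFiles, 4)[N].
```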
In retrospect, I should have thrown my JS away and started again in something with easy threading like Java or C#. But -- familiar story -- I'd underestimated the complexity of the task to begin with, and by the time I understood, I'd sunk a lot of time into writing my JS parsing code and didn't fancy converting it all to another language, particularly when it always seemed like "just one more" correction to the parsing would make everything work right. In the end, what was supposed to take a weekend took about three months of work, off and on, to finish.
You can invoke as many lambdas from your application as you want in parallel and you're not going to be bottlenecked by your CPU :)
I did consider using clustering and having some master process coordinate everything, and using some shared-memory caching library. But it would not be "easy" to set up, especially compared to something like Java where you get thread pools and synchronized thread-safe collections out of the box.
And Lambda would have been totally impractical. As I said, I had hundreds of gigs of data to process. If I'd been uploading this over my puny ADSL upstream every time, I'd still be waiting for a single run to complete.
I'm not trashing Node. I like it. There's a reason I used it in the first place, after all. But for this particular use-case, I didn't find it was a very good fit.
I run a site that aggregates/crawls job boards for remote job postings, and AngelList has been VERY difficult to crawl for various reasons, but you can easily get PhantomJS to work (I have). Having said that, I've never felt very good about the fact that I'm defeating their attempts to block me (even though I feel like I'm doing them a favor) and will likely retire that bot soon.
It kinda sucks that I'm just grabbing publicly-available content in a very low-bandwidth way, but I really can't convince myself that what I'm doing is very ethical.
My to-do list includes making my crawler into a more well-behaved bot and that will have to go.
I guess the distinction is between whether one wants to just "toy around" or run the spider for-real.
I do a good bit of scraping, and made RubyRetriever[1] to make my life easier but it seems like I'm getting roadblocked on occasion, probably due to some of the things you mention in your article.
Is there any way for a site to verify that only their JS and CSS files are linked? Like preventing injection?
Yes, by checking times between actions and number of actions in a time period, and blocking atypical activity. I was IP banned from a site once for a few months, after trying to scrape it too much and hitting links on the site that were hidden from humans.
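A rough sketch of the kind of check described above (time between actions plus number of actions in a period). The class name, the thresholds and the idea of keying on a client id are all made up for illustration; real sites presumably do something more elaborate.

```javascript
// Flags clients whose actions come too close together or too often.
class ActivityMonitor {
  constructor({ windowMs, maxActions, minGapMs }) {
    this.windowMs = windowMs;     // length of the sliding window
    this.maxActions = maxActions; // actions allowed inside one window
    this.minGapMs = minGapMs;     // smallest human-looking gap between actions
    this.history = new Map();     // clientId -> timestamps inside the window
  }

  // Records one action and returns true when the activity looks atypical.
  record(clientId, nowMs) {
    const recent = (this.history.get(clientId) || [])
      .filter(t => nowMs - t < this.windowMs);
    const last = recent[recent.length - 1];
    const tooFast = last !== undefined && nowMs - last < this.minGapMs;
    recent.push(nowMs);
    this.history.set(clientId, recent);
    return tooFast || recent.length > this.maxActions;
  }
}

// Usage (hypothetical thresholds):
//   const monitor = new ActivityMonitor({ windowMs: 60000, maxActions: 30, minGapMs: 200 });
//   if (monitor.record(clientIp, Date.now())) { /* block or challenge */ }
```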
The random wait settings specified in the post are better than nothing, but still too flimsy. You would need to put hours between requests, only request during certain 15-hour periods, take days off, and eventually you aren't scraping regularly enough to do much good.
Scraping is not an API, and I should know - I used to do it for a living. It's unreliable. It requires constant maintenance. APIs can break too, but they are meant for the sort of consumption you are trying for.
If you scrape for a living, only do it as a side job.
I've noticed that most sites actually don't change that often. I deal with changes once or twice every 3 months.
"If you scrape for a living, only do it as a side job."
This is true if you are scraping the low hanging fruit. I scrape 40+ sources (I do have access to a few APIs as well) and then have to extract the patterns/data I need to then integrate it into my business model. This is all automatic now and I only work on upgrading for speed and efficiency.
If you have to scan millions of urls daily from 1 site, it's probably not going to work out. You need to figure out clever ways of getting the data and using it without breaking any laws or pissing off the site owner.
I guess it's part password manager (it stores passwords encrypted in browser storage, not remotely) and part automation wizard :)
He then said that it doesn't bother him if I scrape these threads. And I'm currently figuring out how to manage his site's cookie-protected search feature, so that my painstaking effort (I'm not a dev, more a DB guy) could be reproduced more easily by other users of this service.
But this shouldn't happen in the first place, because all posts of this service are stored in a cleanly organized MySQL DB. Yet as no method is provided, the only way to get structured data back is by scraping (the webmaster told me that no, he won't run custom SQL because "he doesn't want to mess up his DB").
So even if all the data is publicly available through the internet forum, only a geek can download a personal archive... or Google, because Google scrapes and stores everything.
(Even more problematic is that college kids today seem to have a decaying understanding of what a URL is, given how much web navigation we do through the omnibar or apps, particularly on mobile, but that's another issue).
I've been archiving a few government sites to preserve them for web scraping exercises [0] (the Texas death penalty site is a classic, both for being relatively simple at first and for being incredibly convoluted depending on what level of detail you want to scrape [1]). But I imagine even government sites will move more toward AJAX/app-like sites, if the trend at the federal level means anything.
That said, I think the analytics.usa.gov site is a great place to demonstrate the difference between server-generated HTML and client-rendered HTML.
But as someone who just likes doing web-scraping, I feel the tools have mostly kept up with the changes to the web. It's been relatively easy, for example, to run Selenium through Python to mimic user action [2]. Same with PhantomJS through Node, which has vastly improved how accurately it renders pages for screenshots compared to what I remember from a few years back.
[0] https://github.com/wgetsnaps
[1] https://github.com/wgetsnaps/tdcj-state-tx-us--death_row
On a blog post by Paul Kinlan ('Open Web Advocate' at Google and Chromium) [1], I lamented that we ended up here instead of the semantic web because the semantic web was hard to execute. Instead, every web page is a black-box, only navigable by an intelligent and/or sufficiently persuadable human.
But this is also why I don't buy ethical arguments against scraping. Sure, legally any company can unilaterally set any TOS prohibition against behavior they don't want, and these terms may be tested in court. But navigating a page in an automated manner that's designed to resemble interactions of humans (i.e. through Selenium) is in my opinion ethical, because it merely time-shifts a user's activity.
The difficulties are invariably in "post-processing": working around incomplete data on the page, handling errors gracefully and retrying in some (but not all) situations, keeping on top of layout/URL/data changes to the target site, not hitting your target site too often, logging into the target site if necessary and rotating credentials and IP addresses, respecting robots.txt, the target site being utterly braindead, keeping users meaningfully informed of scraping progress if they are waiting on it, the target site adding and removing data resulting in a null-leaning database schema, sane parallelisation in the presence of prioritisation of important requests, difficulties in monitoring a scraping system due to its implicitly non-deterministic nature, and general problems associated with long-running background processes in web stacks.
Et cetera.
In other words, extracting the right text on the page is by far the easiest and most trivial part, with little practical difference between an admittedly cute jQuery-esque parsing library and just using a blunt regular expression.
It would be quixotic to simply retort that sites should provide "proper" APIs but I would love to see more attempts at solutions that go beyond the superficial.
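As one small illustration of the "retrying in some (but not all) situations" item, here is a rough sketch. The two exception classes and the `fetch` callable are hypothetical stand-ins for whatever HTTP client is actually in use; the split between transient and permanent errors is an assumption about where a typical scraper would draw the line.

```python
import time


class TransientError(Exception):
    """Assumed worth retrying: timeouts, HTTP 429/503, dropped connections."""


class PermanentError(Exception):
    """Assumed not worth retrying: HTTP 404, a parse failure after a layout change."""


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying only TransientError with exponential backoff.

    Any other exception, including PermanentError, propagates on the first failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller.
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` in as a parameter is just a convenience so the backoff can be exercised without real waiting.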
I can agree with this after having written a scraper as part of core business functionality (we paid a company for access, but access was just to bare HTML blobs and CSV and not an actual API).
However, to what degree you want to do all this is negotiable whereas the 'core' of screen-scraping is not---all scrapers have to first figure out how to get text, parse it, then stick it back in their system.
An example of what I mean when I say 'negotiable' is....
> working around incomplete data on the page
Deciding how to do this depends on your problem domain. Sometimes, we'd get bad computed data from our source but not care, because it just meant putting more work into calculating it from a rawer source.
> not hitting your target site too often
If they publish how often you are allowed to scrape, this isn't too difficult. If not, then trial and error is the only solution. On occasion, a site simply just doesn't know/care. For example, in my case, the site was static content behind a CDN, so that if we were anywhere under 200 req/second then no flags would ever be raised.
For most smaller sites that you are scraping unofficially, you may be limited to 1 request every 2 seconds.
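A limit like "1 request every 2 seconds" is simple to enforce on the client side. The following is only a sketch; the `Throttle` class is a hypothetical helper, and the 2-second default simply mirrors the figure above.

```python
import time


class Throttle:
    """Enforces a minimum interval between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_request = None

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self.clock()
        if self.last_request is not None:
            remaining = self.min_interval - (now - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_request = now
```

Calling `throttle.wait()` immediately before each request keeps the scraper at or under the limit, however quickly the rest of the loop runs.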
I don't think it's any of the regular meanings: http://www.ldoceonline.com/search/?q=Lead
But it doesn't seem to be any of these slang terms either: http://www.urbandictionary.com/define.php?term=lead
browser.findElement(webdriverio.By.id('Next')).click();
It's almost impossible for a website to reliably detect that a client web browser is being automated, and I find I can make Selenium scripts much more adaptable to breaking changes in websites when they occur than I can when hooking up my code directly.
I actually disagree with the contention that Selenium is slower than directly scraping though. The Firefox driver has always been lightning fast for me and the bottleneck is almost always server requests that would have been necessary either way.
https://github.com/kingkool68/zadieheimlich/blob/master/func...
Of course it can! You won't be able to defeat even the simplest attempt at anti-scraping based on statistical data. Even something like keeping a list of individual rate limits for /16 subnets of actual visiting users puts you in trouble.
Anyways I added your stuff here along with other data mining resources:
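To show why per-subnet limits hurt, here is a simplified sketch that counts requests per /16 rather than per IP. `SubnetLimiter` is a hypothetical name, and the single fixed limit is an assumption made for brevity; the idea described above would derive an individual limit for each subnet from real visitor statistics.

```python
import ipaddress
from collections import defaultdict


def subnet_16(ip):
    """Return the /16 network that contains an IPv4 address, as a string."""
    return str(ipaddress.ip_network(f"{ip}/16", strict=False))


class SubnetLimiter:
    """Counts requests per /16 subnet within one time window (hypothetical helper)."""

    def __init__(self, limit_per_window=100):
        # One fixed limit for every subnet: an illustrative simplification.
        self.limit_per_window = limit_per_window
        self.counts = defaultdict(int)

    def allow(self, ip):
        """Count a request from `ip`; return False once its /16 is over the limit."""
        key = subnet_16(ip)
        self.counts[key] += 1
        return self.counts[key] <= self.limit_per_window

    def reset(self):
        """Start a new window; meant to be called on a timer by the caller."""
        self.counts.clear()
```

With this in place, rotating between addresses inside the same /16 does nothing for the scraper, since all of them draw on one shared budget.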
https://github.com/kevindeasis/awesome-fullstack#web-scrapin...
Of the sweatshops that must have been set up to deliver this service. That, to me, is the true horror of this story.
- rails application
- scraping with nokogiri gem on Ruby
- simple models doing the scraping in rails app
- some scraping is parsed with CSS selectors - nokogiri
- some scraping is parsed with regex - nokogiri
- persisting to DB, Text, even Google docs
- presentation on web, text, pdf, xls
Boom