I've not used Scraperjs yet, but cheerio is pretty great.
CasperJS can also scrape dynamic websites. By what criteria would someone choose ScraperJS over CasperJS for that task? Are there features in ScraperJS that don't exist in CasperJS? Does it take 10x fewer lines of code to accomplish the same task? Etc.
Would it be possible to start from a list of URLs on a home page (e.g. a list of marathon runners), follow the link on each runner's name to their stats page, and save the scraped data as JSON to a text file on the local machine, for example in the C:\Runners\Data\ folder?
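Not a ScraperJS answer, but the general shape of that task can be sketched in plain Python with only the standard library. Everything named here is an assumption for illustration: the `scrape_runners` function, the `runner_N.json` file naming, and the idea that every link on the list page is a runner. A real site would need a narrower selector and a proper `parse_stats` function.

```python
import json
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url):
    """Download a page as text (static HTML only, no JavaScript is executed)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html, base_url):
    """Return absolute (url, text) pairs for the links found in html."""
    parser = LinkCollector()
    parser.feed(html)
    return [(urljoin(base_url, href), text) for href, text in parser.links]


def scrape_runners(list_url, out_dir, fetch=fetch, parse_stats=None):
    """Follow each link on the list page and save one JSON file per runner."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, (url, name) in enumerate(extract_links(fetch(list_url), list_url)):
        page = fetch(url)
        # Without a site-specific parser, the raw page is stored as-is.
        stats = parse_stats(page) if parse_stats else page
        path = out_dir / f"runner_{index}.json"
        path.write_text(json.dumps({"name": name, "url": url, "stats": stats}, indent=2),
                        encoding="utf-8")
        saved.append(path)
    return saved
```

On Windows the output folder from the question could be passed as `scrape_runners(list_url, r"C:\Runners\Data")`. The `fetch` parameter is swappable so the crawl logic can be tried against canned pages before pointing it at a live site.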
Also, does anyone know of a reliable and tested C# / .NET / ASP.NET web page scraper?
[0] http://msdn.microsoft.com/en-us/library/system.net.http.http... [1] http://htmlagilitypack.codeplex.com/
I've been exploring GitHub but could not find a well-maintained framework (or at least one updated within the last month).
I haven't used the Ruby version, but I am pretty happy with the Python port of it. It's lighter and faster than PhantomJS, but it won't interpret JavaScript.
Note: I haven't tried this myself!
On the right, you'll notice that under the sidebar "Opening This Week" is a movie titled "Love Is Strange".
With that in mind, press Ctrl+U (view HTML source).
Try to search for the word "Strange" anywhere in the source. (It's not there.) If it's not there, how did it get shown on the screen?!
The answer is that it is loaded "dynamically". A simple scraper that only works on a static download of the HTML source won't be able to retrieve that string. You need a web scraper that can process dynamic pages (that is, execute JavaScript).
Btw, you'll notice that you can find the string "Strange" via F12 (Developer Tools). That's because the F12 inspector shows the HTML after the DOM has been dynamically modified by JavaScript, whereas Ctrl+U shows only the source as it was downloaded.
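To make the Ctrl+U vs. F12 difference concrete, here is a small Python sketch. Both HTML snippets are made up for illustration (the `opening-this-week` id and the `/js/load-sidebar.js` path are assumptions, not the real page's markup): one stands in for the downloaded source, the other for the DOM after a script has filled in the sidebar.

```python
# Hypothetical downloaded source (what Ctrl+U shows): the sidebar is empty.
STATIC_SOURCE = """
<html><body>
  <div id="opening-this-week"></div>
  <script src="/js/load-sidebar.js"></script>
</body></html>
"""

# Hypothetical DOM after the script ran (what the F12 inspector shows).
RENDERED_DOM = """
<html><body>
  <div id="opening-this-week"><a href="/m/love_is_strange">Love Is Strange</a></div>
  <script src="/js/load-sidebar.js"></script>
</body></html>
"""


def in_markup(html, needle):
    """True if needle appears anywhere in the markup (case-insensitive)."""
    return needle.lower() in html.lower()


def needs_dynamic_scraper(static_html, rendered_html, needle):
    """True when the text exists only after JavaScript has modified the DOM."""
    return in_markup(rendered_html, needle) and not in_markup(static_html, needle)


if __name__ == "__main__":
    print(needs_dynamic_scraper(STATIC_SOURCE, RENDERED_DOM, "Strange"))
```

A scraper that only ever sees `STATIC_SOURCE` has nothing to extract, no matter how good its selectors are.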
I assume the Simple version is written entirely in Node.js, so it parses the HTML content but does no dynamic scripting.
The important thing to note is that in the Dynamic version you can't use closures in the functions you pass in, because they won't be executed within your Node.js context; they run inside PhantomJS.
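A rough analogy in Python, not the actual PhantomJS mechanism: if a function is shipped to another runtime as source text, variables from the surrounding scope do not travel with it. The helper `run_in_fresh_context` and the `pick` functions below are made up for illustration.

```python
def run_in_fresh_context(source, func_name, *args):
    """Define a function from source text in an empty namespace and call it.

    This loosely mimics handing a function to a separate runtime: only the
    source text and the explicit arguments cross the boundary.
    """
    namespace = {}
    exec(source, namespace)
    return namespace[func_name](*args)


selector = "title"  # exists only in the calling context

# Relies on a variable from the caller's scope: breaks in the other context.
USES_OUTER_VARIABLE = """
def pick(page):
    return page[selector]
"""

# Receives everything it needs as arguments: works in the other context.
TAKES_ARGUMENT = """
def pick(page, selector):
    return page[selector]
"""

page = {"title": "Love Is Strange"}

try:
    run_in_fresh_context(USES_OUTER_VARIABLE, "pick", page)
    closure_worked = True
except NameError:
    # `selector` was never defined in the fresh namespace.
    closure_worked = False

result = run_in_fresh_context(TAKES_ARGUMENT, "pick", page, selector)
```

The practical takeaway is the same in both worlds: pass whatever the function needs as explicit arguments instead of relying on the enclosing scope.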
As for a use case, I do it for https://myshopdata.com to allow retailers to extract their product information with rich content and variation support (even when the content is loaded by the user interacting with a dropdown of variations). It then lets you publish this to marketplaces while keeping the information in sync through monitoring.
Also, do you install scraperjs into each project directory you want to use it for? Or just install it once?
For more information about npm install: https://www.npmjs.org/doc/cli/npm-install.html
I'll give it another go with this library next week!
Disclaimer: I wrote it.