Show HN: Scraperjs – A versatile web scraper

(github.com)

192 points · ruipgil · 11y ago · 36 comments

jasode · 11y ago

It would be helpful if the documentation compared how Scraperjs is different from, or better than, CasperJS for scraping. CasperJS is the older and more well-known wrapper around PhantomJS so comparisons would help people decide what the appropriate tool would be.

http://casperjs.org/

tsenkov · 11y ago

I guess the biggest difference is that Casper isn't a Node.js module, so interacting with Node (and using the npm package archive) becomes hard. I am releasing something similar in a couple of weeks, but aimed at JS sandboxing instead of scraping. :)

jdc0589 · 11y ago

For cases where you don't need a full-featured browser to get the data you need, Scraperjs using the Cheerio backend should be WAY faster than casper/phantom.

I've not used Scraperjs yet, but cheerio is pretty great.

findjashua · 11y ago

I think you'd need CasperJS when you need to perform browser actions (log in, click a particular button, fill in a form, etc.). But if you just want to scrape content (e.g., episode URLs of The Daily Show from Hulu), then ScraperJS should be enough (and faster?)

thibauts · 11y ago

As far as I can see, it doesn't have much in common with casperjs, apart from the fact that it can use phantomjs.

jasode · 11y ago

If you mean that the syntax is different, yes, I get that.

CasperJS can also scrape dynamic websites. By what criteria would someone choose ScraperJS over CasperJS for that task? Are there features in ScraperJS that don't exist in CasperJS? Does it take 10x fewer lines of code to accomplish the same task? Etc.

justboxing · 11y ago

This is awesome. I am very new to scraping, so bear with me if this is very obvious.

Would it be possible to follow a list of URLs from a home page (e.g., a list of marathon runners), then follow the link in each runner's name to their stats page, and save the scraped data as JSON to a text file on the local machine, for example in the C:\Runners\Data\ folder?

Also, does anyone know of a reliable and tested C# / .NET / ASP.NET web page scraper?

cwbrandsma · 11y ago

On the second question: typically a web scraper just interacts with the output of a web server, so it shouldn't matter whether it is ASP.NET or any other system.

misterbwong · 11y ago

Mostly this. In ASP.NET/C# you're probably looking at using the built-in HttpClient lib [0] and an HTML parser lib like HTMLAgilityPack [1]. I've used this combo in the past and am happy with it.

[0] http://msdn.microsoft.com/en-us/library/system.net.http.http...
[1] http://htmlagilitypack.codeplex.com/

guiomie · 11y ago

Since this runs on Node.js, you could use edge.js to use scraperjs in your .NET project.

ruipgil (OP) · 11y ago

On the first question, yes, it's possible and easily done with the Router.
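For anyone wondering what that two-step flow looks like in general, here is a rough, dependency-free sketch. It does not use Scraperjs or its Router API; the page contents, URLs, and helper names (`extractLinks`, `extractStats`, `crawl`) are all made up for illustration, in-memory strings stand in for HTTP responses, and the output goes to the OS temp directory rather than C:\Runners\Data\.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical pages standing in for real HTTP responses.
const pages = {
  '/runners': '<a href="/runner/1">Ann</a> <a href="/runner/2">Bob</a>',
  '/runner/1': '<h1>Ann</h1><span class="time">3:05:10</span>',
  '/runner/2': '<h1>Bob</h1><span class="time">3:41:52</span>',
};

// Step 1: collect the per-runner links from the index page.
function extractLinks(html) {
  return Array.from(html.matchAll(/href="([^"]+)"/g), (m) => m[1]);
}

// Step 2: pull the fields of interest out of one stats page.
// (Regexes are only good enough for this toy markup; use a real parser on real HTML.)
function extractStats(html) {
  const name = /<h1>([^<]*)<\/h1>/.exec(html)[1];
  const time = /class="time">([^<]*)</.exec(html)[1];
  return { name, time };
}

// Follow every link found on the index page and save the results as JSON.
function crawl(indexUrl, outDir) {
  const runners = extractLinks(pages[indexUrl]).map((url) => extractStats(pages[url]));
  fs.mkdirSync(outDir, { recursive: true });
  const outFile = path.join(outDir, 'runners.json');
  fs.writeFileSync(outFile, JSON.stringify(runners, null, 2));
  return outFile;
}

const outFile = crawl('/runners', path.join(os.tmpdir(), 'runners-data'));
console.log('saved to', outFile);
```

With a real scraper, the `pages[...]` lookups would be replaced by actual requests, and the two extract functions would be the scraping callbacks for the index route and the per-runner route.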

justboxing · 11y ago

Thank you!

pibefision · 11y ago

Could someone recommend a similar but Ruby-based framework? Just because I'm more skilled in Ruby than in Node (not for trolling purposes).

I've been exploring GitHub but could not find a well-maintained framework (or at least one updated within the last month).

wc- · 11y ago

Check out Mechanize: https://github.com/sparklemotion/mechanize

I haven't used the Ruby version, but I am pretty happy with the Python port of it. It's lighter and faster than phantom, but it won't do JavaScript interpretation.

jwarren · 11y ago

I found a tutorial that makes it look pretty easy to do with Mechanize: http://readysteadycode.com/howto-scrape-websites-with-ruby-a...

Note: I haven't tried this myself!

findjashua · 11y ago

While I haven't tried any, I think if you want to handle dynamic JavaScript content, you'd have to go with a JS library. Feel free to correct me if I'm wrong.

riffraff · 11y ago

You can do it in pure Ruby with one of the WebKit wrappers (e.g. poltergeist [0]).

[0] https://github.com/teampoltergeist/poltergeist

joeyspn · 11y ago

I've always used Mechanize+BeautifulSoup in Python... I think Mechanize also has a Ruby lib...

mr5iff · 11y ago

I don't quite get the point of the DynamicScraper... Any real use cases for that?

jasode · 11y ago

For example, go to http://www.imdb.com

On the right, you'll notice that under the sidebar "Opening This Week" is a movie titled "Love Is Strange".

With that in mind, press Ctrl+U (view HTML source).

Try to search for the word "Strange" anywhere in the source. (It's not there.) If it's not there, how did it get shown on the screen?!

The answer is that it is "dynamically" loaded. A simple scraper that only works on a static download of the HTML source won't be able to retrieve that string. You need a web scraper that can process dynamic pages (i.e., execute JavaScript).

Btw, you'll notice that you can find the string "Strange" via F12 (Developer Tools). That's because the F12 inspector shows the HTML after the DOM has been dynamically modified by JavaScript, whereas Ctrl+U does not.
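A toy reproduction of that difference, using only Node's built-in `vm` module. Everything here is hypothetical (the markup, the script path, the element id), and the stand-in `document` object is nowhere near a real DOM; it only shows why searching the downloaded source fails while running the page's script succeeds.

```javascript
const vm = require('vm');

// What the server sends: an empty list plus a reference to an external script.
const staticHtml =
  '<ul id="opening-this-week"></ul><script src="/widgets/opening.js"></script>';

// What the browser would fetch and run afterwards (hypothetical content).
const externalScript = `
  const titles = ["Love Is Strange"];
  document.getElementById("opening-this-week").innerHTML =
    titles.map((t) => "<li>" + t + "</li>").join("");
`;

// A static scraper only ever sees the downloaded markup.
function staticScrape(html) {
  return html.includes('Strange');
}

// A dynamic scraper executes the script first, then reads the modified DOM.
// Here a minimal fake `document` plays the role of the browser.
function dynamicScrape(script) {
  const elements = { 'opening-this-week': { innerHTML: '' } };
  const document = { getElementById: (id) => elements[id] };
  vm.runInNewContext(script, { document });
  return elements['opening-this-week'].innerHTML;
}

console.log(staticScrape(staticHtml));      // false
console.log(dynamicScrape(externalScript)); // <li>Love Is Strange</li>
```

The first call is the Ctrl+U view, the second is roughly what the F12 inspector (or a PhantomJS-backed scraper) gets to see.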

martin-adams · 11y ago

The latter probably runs the script as though you are within the context of a web page (so full Ajax/JS support).

I assume the Simple version might be written completely in Node.js, so it parses the HTML content but does no dynamic scripting.

The important thing to note is that with the Dynamic one you can't use closures in your scraping functions, because they won't be executed within your Node.js context but inside PhantomJS.
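To illustrate the closure caveat without PhantomJS: the sketch below uses Node's `vm` module as a stand-in for "another JavaScript runtime". It assumes the function is shipped across as source text, which is how this kind of bridge generally behaves; it is not taken from Scraperjs's internals, and the names (`selector`, `scrapeFn`, `scrapeWithArg`) are made up.

```javascript
const vm = require('vm');

const selector = '.title a'; // exists only in the Node.js scope

// Relies on a closure over `selector`.
function scrapeFn() {
  return typeof selector === 'undefined' ? 'selector is not defined here' : selector;
}

// Called in the same process, the closure works as usual.
const local = scrapeFn();

// "Shipping" the function elsewhere: only its source text travels,
// so the captured variable is gone on the other side.
const remote = vm.runInNewContext('(' + scrapeFn.toString() + ')()');

// Workaround: pass the value explicitly, serialized alongside the function.
function scrapeWithArg(sel) {
  return 'using ' + sel;
}
const fixed = vm.runInNewContext(
  '(' + scrapeWithArg.toString() + ')(' + JSON.stringify(selector) + ')'
);

console.log(local);  // .title a
console.log(remote); // selector is not defined here
console.log(fixed);  // using .title a
```

So anything the scraping function needs has to be passed in as serializable data rather than captured from the surrounding scope.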

As for a use case, I do it for https://myshopdata.com to allow retailers to extract their product information with rich content and variation support (even if loaded by the user interacting with a dropdown on variations). It then allows you to publish this in marketplaces, while keeping the information in sync by monitoring.

riffraff · 11y ago

I _think_ the latter interprets JavaScript while the former only allows you to read the rendered HTML?

halcyondaze · 11y ago

If you're interested in scraping in Python, then I recommend giving this a read: http://jakeaustwick.me/python-web-scraping-resource/

cridenour · 11y ago

I think Scrapy is a better Python scraping tool.

baldfat · 11y ago

And I prefer to use R for scraping. There are so many ways to scrape, and it really comes down to personal preference. It is a good time to be alive :) So many good choices.

brianzelip · 11y ago

It's unclear to me how to actually run this. Only executing the two commands listed under the Installing section does not run it - I had to `cd` into the scraperjs dir, then `npm install`, then continue with the second Install command (`grunt test`) to actually test.

Also, do you install scraperjs into each project directory you want to use it for? Or just install it once?

ruipgil (OP) · 11y ago

Scraperjs is meant to be used as an npm package. Running "npm install <package-name>" downloads the latest version of the package into the node_modules folder next to the closest package.json file (if there's none, it will go to your ~/ folder). At that point you can just use it with "require('scraperjs')". The testing part is a bit foggier, and I'll add more information to the README in due time. To test, you've got to npm-install with the save-dev flag (npm install --save-dev scraperjs), which also adds the package to your development dependencies; this is so that people who just want to use the package won't need to download all of scraperjs' development dependencies.

For more information about npm install: https://www.npmjs.org/doc/cli/npm-install.html

jdrock · 11y ago

Let us know if you'd like to integrate this with http://www.80legs.com!

andrejewski · 11y ago

If anyone is interested in just scraping links between webpages with JavaScript, I made Slinky (https://github.com/andrejewski/slinky). The API is simple and easily overridable.

jwarren · 11y ago

Nice! Could've used that this weekend when I got caught in callback hell trying to build a simple NodeJS scraper. Ended up doing it in PHP just because I know it well.

I'll give it another go with this library next week!

roux_rc · 11y ago

Artoo is soooo much better :) https://medialab.github.io/artoo/

bshimmin · 11y ago

I really like the router aspect of this. That's a nice idea and not (to the best of my limited memory) one I can recall seeing in any other scraper.

novaleaf · 11y ago

If you want a scraper as a service, you can try: https://PhantomJsCloud.com

Disclaimer: I wrote it.

woah · 11y ago

Looks pretty good, shame about the promises.
