Ask HN: What are best tools for web scraping?

502 points · pydox · 8y ago · 228 comments

189 comments · 100 top-level

sharmi8y ago· 14 in thread

If you are a programmer, scrapy[0] will be a good bet. It can handle robots.txt, request throttling by ip, request throttling by domain, proxies and all other common nitty-gritties of crawling. The only drawback is handling pure javascript sites. We have to manually dig into the api or add a headless browser invocation within the scrapy handler.

Scrapy can also pause and resume crawls [1], run crawlers in a distributed setup [2], etc. It is my go-to option.

[0] https://scrapy.org/

[1] https://doc.scrapy.org/en/latest/topics/jobs.html

[2] https://github.com/rmax/scrapy-redis
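
To make the robots.txt and per-domain throttling points concrete, here is a rough standard-library sketch of the kind of bookkeeping Scrapy takes care of for you. It is not Scrapy's API: `PoliteGate`, the sample robots.txt and the delay value are all made up for illustration.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it per site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


class PoliteGate:
    """Checks robots.txt rules and tracks a minimum delay per domain."""

    def __init__(self, robots_txt, delay_seconds=1.0, user_agent="*"):
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        self.delay = delay_seconds
        self.user_agent = user_agent
        self.last_hit = {}  # domain -> time of the last recorded request

    def allowed(self, url):
        return self.parser.can_fetch(self.user_agent, url)

    def wait_time(self, url, now=None):
        """Seconds to wait before the next request to this URL's domain."""
        now = time.monotonic() if now is None else now
        last = self.last_hit.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.delay - (now - last))

    def record(self, url, now=None):
        """Remember that a request to this URL's domain was just made."""
        now = time.monotonic() if now is None else now
        self.last_hit[urlparse(url).netloc] = now


gate = PoliteGate(SAMPLE_ROBOTS, delay_seconds=2.0)
```

A crawl loop would call `gate.allowed(url)`, sleep for `gate.wait_time(url)`, fetch, then call `gate.record(url)`. With Scrapy you would normally rely on its own settings for this instead.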

stoneridge8y ago

Haven't tried this[0] yet, but Scrapy should be able to handle JavaScript sites with the JavaScript rendering service Splash[1]. scrapy-splash[2] is the plugin to integrate Scrapy and Splash.

[0] https://blog.scrapinghub.com/2015/03/02/handling-javascript-...

[1] https://splash.readthedocs.io/en/stable/index.html

[2] https://github.com/scrapy-plugins/scrapy-splash

PaulHoule8y ago

HTMLUnit in Java is a good browser emulator and can be used to work JavaScript-heavy web sites, form submission, etc.

maxisme8y ago

Reading this from my phone looked like you meant there was a web scraping tool actually called “this[0]” which would be a cracking name.

arien8y ago

I've recently made a little project with scrapy (for crawling) and BeautifulSoup (for parsing html) and it works out great. One more thing to add to the above list are pipelines, they make downloading files quite easy.

Merthurian8y ago

I made a little BTC price ticker on an OLED with an Arduino. I used BeautifulSoup to get the data. Went from knowing nothing about web scraping to getting the thing working pretty quickly. Very easy to use.

dataslap8y ago

scrapy has a pretty decent parser too

harperlee8y ago

I've had mixed results with scrapy, probably due more to my inexperience than anything else, but for example retrieving a posting in idealista.com with vanilla scrapy returns an error page whereas a basic wget command retrieves the correct page.

So the learning curve for simple things makes me jump to bash scripts; scrapy might prove more valuable when your project starts to scale.

But also of course: normally the best tool is the one you already know!

Bromskloss8y ago

Would you still recommend Scrapy if the task wasn't specifically crawling?

sharmi8y ago

Nope. It is very specifically tailored to crawling. If you just need something distributed why not check out RQ [0], Gearman [1] or Celery [2]? RQ and Celery are python specific.

[0] : http://python-rq.org/docs/

[1] : http://gearman.org/

[2] : http://docs.celeryproject.org

luckystarr8y ago

I once used it to automate the, well, scraping of statistics from an affiliate network account. So you can do pretty specific stuff, as long as it involves HTTP/HTTPS requests.

dataslap8y ago

depends on the task. For example they have a decent file/image downloading middleware.

ddorian438y ago

Would you recommend it for scalable projects ? Like, crawl twitter or tumblr ?

sharmi8y ago

Yes. It beats building up your own crawler that handles all the edge cases. That said, before you reach the limits of scrapy, you will more likely be restricted by preventive measures put in place by twitter(or any other large website) to limit any one user hogging too much resources. Services like cloudflare or similar are aware of all the usual proxy servers and such and will immediately block such requests.

sklarsa8y ago

I've used it for some larger scrapes (nothing at the scale you're talking about, but still sizeable) and scrapy has very tight integration with scrapinghub.com to handle all of the deployment issues (including worker uptime, result storage, rate-limiting, etc). Not affiliated with them in any way, just have had a good experience using them in the past.

samtc8y ago· 7 in thread

I maintain ~30 different crawlers. Most of them are using Scrapy. Some are using PhantomJS/CasperJS but they are called from Scrapy via a simple web service.

All data (zip files, pdf, html, xml, json) we collect are stored as-is (/path/to/<dataset name>/<unique key>/<timestamp>) and processed later using a Spark pipeline. lxml.html is WAY faster than beautifulsoup and less prone to exception.

We have cron jobs (cron + Jenkins) that trigger dataset updates and discovery. For example, we scrape a corporate registry, so every day we update the 20k companies whose stored versions are oldest. We also implement "discovery" logic in all of our crawlers so they can find new data (e.g. a newly registered company). We use Redis to send task (update / discovery) to our crawlers.

mapster8y ago

Mind if I ask what info/data you are scraping and for what ends?

frik8y ago

> We use Redis to send task (update / discovery) to our crawlers.

Some kind of queue implemented with Redis? How does it work?

samtc8y ago

It's a simple Redis list containing JSON tasks. We have a custom Scrapy Spider hooked to next_request and item_scraped [1]. It checks (lpop) for update/discovery tasks in the list and builds a Request [2]. We crawl at most ~1 request per second, so performance is not an issue.

For every website we crawl we implement a custom discovery/update logic.

Discovery can be, for example, crawl a specific date range, seq number, postal code.... We usually seed discovery based on the actual data we have, like highest_company_number + 1000, so we get the newly registered companies.

Update is to update a single document. Like crawl document for company number 1234. We generate a Request [2] to crawl only that document.

[1] https://doc.scrapy.org/en/latest/topics/signals.html

[2] https://doc.scrapy.org/en/latest/topics/request-response.htm...
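
For readers who want to see the shape of that task list, here is a minimal sketch, assuming a plain in-memory `deque` in place of Redis and URL strings in place of Scrapy Request objects. The task fields (`type`, `company_number`, `count`) and the URL template are hypothetical, not the actual schema described above.

```python
import json
from collections import deque

# Stand-in for the Redis list; against Redis this would be a push and an LPOP.
task_list = deque()

# Hypothetical URL template for a single company document.
DOCUMENT_URL = "https://registry.example/company/{number}"


def push_task(task):
    """Queue a task as a JSON string, the way it would sit in the list."""
    task_list.append(json.dumps(task))


def next_requests(highest_known_number):
    """Pop one task (like LPOP) and return the URLs it asks us to crawl."""
    if not task_list:
        return []
    task = json.loads(task_list.popleft())
    if task["type"] == "update":
        # Update: re-crawl exactly one known document.
        return [DOCUMENT_URL.format(number=task["company_number"])]
    if task["type"] == "discovery":
        # Discovery: probe the numbers just past the highest one we hold.
        start = highest_known_number + 1
        return [DOCUMENT_URL.format(number=n)
                for n in range(start, start + task["count"])]
    raise ValueError(f"unknown task type: {task['type']}")


push_task({"type": "update", "company_number": 1234})
push_task({"type": "discovery", "count": 3})
```

In a real spider the returned URLs would be turned into Request objects inside the signal handlers mentioned above.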

thibaut_barrere8y ago

See https://sidekiq.org for instance.

CGamesPlay8y ago

Probably not what the GP uses, but Resque does this in Ruby land.

CGamesPlay8y ago

I have a similar set up! How do you monitor for failures and deal with the scrape target changing?

samtc8y ago

We monitor exceptions with Sentry. We store raw data so we don't have to hurry to fix the ETL, we only have to fix navigation logic and we keep crawling.

danso8y ago· 7 in thread

Always fascinated by how diverse the discussion and answers are for HN threads on web-scraping. Goes to show that "web-scraping" has a ton of connotations, everything from automated fetching of URLs via wget or cURL, to data management via something like scrapy.

Scrapy is a whole framework that may be worthwhile, but if I were just starting out for a specific task, I would use:

- requests http://docs.python-requests.org/en/master/

- lxml http://lxml.de/

- cssselect https://cssselect.readthedocs.io/en/latest/

Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize. But using the web developer tools you can usually figure out the requests made by the browser and then use the Session object in the Requests library to deal with stateful requests:

http://docs.python-requests.org/en/master/user/advanced/

I usually just download pages/data/files as raw files and worry about parsing/collating them later. I try to focus on the HTTP mechanics and, if needed, the HTML parsing, before worrying about data extraction.

hydragit8y ago

> Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize. But using the web developer tools you can usually figure out the requests made by the browser and then use the Session object in the Requests library to deal with stateful requests

You could also use the WebOOB (http://weboob.org) framework. It's built on requests+lxml and it provides a Browser class usable like mechanize's one (ability to access doc, select HTML forms, etc.).

It also has nice companion features like associating url patterns to some custom Page classes where you can write what data to retrieve when a page with this url pattern is browsed.

djtriptych8y ago

All great advice. I've written dozens of small purpose-built scrapers and I love your last point.

It's pretty much always a great idea to completely separate the parts that perform the HTTP fetches and the part that figures out what those payloads mean.
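
A small sketch of that separation, using only the standard library: one function stores the payload untouched, another parses it later from disk. The file name and the sample payload are made up, and `html.parser` is just a stand-in for whichever parser you prefer.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def save_raw(raw_dir, name, payload):
    """Fetch stage: store the bytes exactly as received."""
    path = Path(raw_dir) / name
    path.write_bytes(payload)
    return path


def parse_title(path):
    """Parse stage: works from disk, so it can be rerun without refetching."""
    parser = TitleParser()
    parser.feed(Path(path).read_bytes().decode("utf-8", errors="replace"))
    return parser.title.strip()


raw_dir = tempfile.mkdtemp()
# Hypothetical payload standing in for an HTTP response body.
saved = save_raw(raw_dir, "page-001.html",
                 b"<html><head><title> Example page </title></head></html>")
```

If the parsing logic turns out to be wrong, only `parse_title` needs to change; the saved files stay valid.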

Buttons8408y ago

lxml has good xpath support too; the best I've seen. I miss good xpath support in some of the other scraping options I've tried in other languages.

upofadown8y ago

>Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize.

Did the version of Mechanize written in Py2 stop being supported?

danso8y ago

Looks like it's recently been updated but no big announcement that it's Python 3 ready: https://github.com/python-mechanize/mechanize

I've also seen these alternatives:

- https://robobrowser.readthedocs.io/en/latest/

- https://github.com/MechanicalSoup/MechanicalSoup

MechanicalSoup seems well updated but the last time I tried these libraries, they were either buggy (and/or I was ignorant) and I just couldn't get things to work as I was used to in Ruby and Mechanize.

sebcat8y ago

lxml can be hit-or-miss on HTML5 docs. I've had greater success with a modified version of gumbo-parser.

danso8y ago

Ah very cool, had seen various python libraries about HTML5, but not gumbo (or at least I had starred it).

https://github.com/google/gumbo-parser

Is the modified version you use a personal version or a well-known fork?

jackschultz8y ago· 5 in thread

I've actually written about this! General tips that I've found from doing more than a few projects [0], and then an overview of Python libraries I use [1].

If you don't want to click on the links, requests and BeautifulSoup / lxml are all you need 90% of the time. Throw gevent in there and you can get a lot of scraping done in less time than you would expect.
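
To give a feel for the concurrency part, here is a rough sketch that swaps gevent for the standard library's thread pool, so it is not the gevent API. `fake_fetch` is a placeholder for a real HTTP call such as a requests GET, and the URLs are examples.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Placeholder for a real network call; returns a dummy page body.
    return f"<html>{url}</html>"


pages = fetch_all(["https://example.com/a", "https://example.com/b"], fake_fetch)
```

Because scraping is mostly waiting on the network, even a handful of workers tends to cut wall-clock time a lot, as long as you stay polite to the target site.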

And as long as we're talking about web scraping, I'm a huge fan of it. There's so much data out there that's not easily accessible and needs to be cleaned and organized. When running a learning algorithm, for example, a very hard part that isn't talked about a lot is getting the data before throwing it in a learning function or library. Of course, there's the legal side of it if companies are not happy with people being able to scrape, but that's a different topic.

I'll keep going. The best way to learn which tools are best is to do a project on your own and test them all out. Then you'll know what suits you. That's absolutely the best way to learn something about programming -- doing it instead of reading about it.

[0] https://bigishdata.com/2017/05/11/general-tips-for-web-scrap...

[1] https://bigishdata.com/2017/06/06/web-scraping-with-python-p...

Bromskloss8y ago

> BeautifulSoup / lxml

When should one use one or the other, would you say?

ivansavz8y ago

You can use the BeautifulSoup API with the `lxml` parser: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#insta...

I've heard that `lxml` can choke on certain badly-formed markup, but it's very fast. Personally, it has never failed on me.

j_s8y ago

Use https://github.com/kovidgoyal/html5-parser, which (in my limited understanding) does a better job faster and is backwards-compatible with both.

Recommendation by the author (of Calibre fame) on a similar discussion: https://news.ycombinator.com/item?id=15539853

Dedicated discussion: https://news.ycombinator.com/item?id=14588333

jackschultz8y ago

BeautifulSoup. The difference is that lxml can run a little faster in certain cases for a huge scrape, but you'll very, very rarely, if ever, need that. It's interesting and probably worthwhile to try both and know the difference, but BeautifulSoup is definitely where to start.

darpa_escapee8y ago

BeautifulSoup has a friendly API, but it is slow. It has a lxml backend, however.

If you're familiar with writing XPath queries, lxml is great.

CGamesPlay8y ago· 5 in thread

If you can get away without a JS environment, do so. Something like scrapy will be much easier than a full browser environment. If you cannot, don’t bother going halfway and just go straight for headless chrome or Firefox. Unfortunately Selenium seems to be past its useful life as Firefox dropped support and chrome has a chrome driver which wraps around it. Phantom.js is woefully out of date and since it’s a different environment than your target site was designed for just leads to problems.

AutomatedTester8y ago

I manage the WebDriver work at Mozilla making Firefox work with Selenium. I can categorically state we haven’t killed Selenium. We, over the last few years, have invested more in Selenium than other browsers.

Selenium IDE no longer works in Firefox for a number of reasons: 1) Selenium IDE didn’t have a maintainer, and 2) Selenium IDE is a Firefox add-on and Mozilla changed how add-ons work. They did this for numerous security reasons.

CGamesPlay8y ago

My apologies, I was mistaken, but I can't edit my post now. It looks like the selenium code has moved into something called geckodriver, which I suppose is a wrapper around the underlying Marionette protocol.

hugs8y ago

Firefox did not drop support for Selenium. Selenium IDE, a record/playback test creation tool, stopped working in newer versions of Firefox, but a) Selenium IDE is only one part of the Selenium project, and b) The Selenium team is working on a new version of IDE compatible with the new Firefox add-on APIs.

triangleman8y ago

Can you explain a little more? How do you drive FF/Chrome without Selenium?

softawre8y ago

https://developers.google.com/web/updates/2017/04/headless-c...

Risse8y ago· 4 in thread

If you use PHP, Simple HTML DOM[0] is an awesome and simple scraping library.

[0] http://simplehtmldom.sourceforge.net/

ge968y ago

I also have used Simple HTML Dom

One thing I haven't worked on yet is waiting for stuff to load if that is a problem. Otherwise you try to limit hitting a site either using sleep/CRON

What's also interesting is session tokens, one site I was able to hunt down the generated token bread crumb which JS produced, but it wasn't valid. Still had to visit the site, interesting.

SubZtep8y ago

Indeed it's very easy to use, I really like it. There is a newer version on Github: https://github.com/sunra/php-simple-html-dom-parser

wolco8y ago

If you use PHP, Laravel Dusk might be another good choice.

elchief8y ago· 4 in thread

Anyone who suggests a tool that can't understand JavaScript doesn't know what they are talking about

You should be using Headless Chrome or Headless Firefox with a library that can control them in a user-friendly manner

sp0rk8y ago

There are a great many sites that degrade gracefully when JS support is not available. It makes absolutely no sense to waste the resources required to run a full headless browser when simple HTTP requests will retrieve the same information faster, more efficiently, and in a way that's easier to parallelize.

xur178y ago

A lot of times you can also watch the api calls JS pages (or apps) make and retrieve nice structured json data.

I personally avoid executing js unless it's necessary, as it adds more complexity, and is noticeably more brittle.

jordanpg8y ago

Yes, but a great many sites don't, and for those, you need Selenium + browser, full stop.

bdcravens8y ago

I haven't dug deep recently, but if you need to automate browser download dialog this wasn't possible with Headless Chrome. (I'd love to find out that this has changed, and you can control it as well as you can with Selenium)

bootcat8y ago· 3 in thread

One important avenue for scraping AJAX-heavy websites that block PhantomJS is Google Chrome's extension support. An extension can mirror the DOM and send it to an external server for processing, where we can use Python's lxml and XPath to reach the appropriate nodes. This worked for me to scrape Google, before we hit the captcha. If anyone is interested, I can share the code I wrote to scrape websites!

If you can scrape findthecompany database ? I have done it successfully !!

visarga8y ago

> This worked for me to scrape Google, before we hit the captcha.

If Google wanted to give back something to the community, it would offer cheap automated searches (current prices are absurd). Another thing - more depth after the first 1000 results. Sometimes you want to know the next result. We shouldn't need to do all these stupid things to batch query a search engine, it should be open. That makes it all the more important to invent an open-source, federated search engine, so we can query to our heart's content (and have privacy).

zapperdapper8y ago

Agree 100% too.

As for 'federated search engine' - it's not 'federated' per se but check out Gigablast search engine. Open source (source on GitHub) and a TOTALLY AWESOME piece of software written by one guy. You can do good searches at the Gigablast site[1], or set up your own search engine. Gigablast also offers an API (I may be wrong but I think DuckDuckGo uses that API for some tasks).

[1] http://gigablast.com

bootcat8y ago

I absolutely agree, and I am thinking about strategies to even automate the captcha, using crowdsourcing or, better, AI/ML (which is not trivial).

duckduckgo is good but not there yet.

Would you be interested to work on a search engine ? Some projects are bitfunnel and so forth.

bantersaurus8y ago· 3 in thread

beautifulsoup

oddeyed8y ago

Also good is RoboBrowser which combines beautifulsoup with Requests to get a nice 'Browser' abstraction. It also has good built-in functionality for filling in forms.

cjsuk8y ago

Using this as well with Requests to automate eBay/gumtree/craigslist. Works very well

djaychela8y ago

Any details on this anywhere, or is it not for public consumption? I'm just getting started in Python and want to do something with Gumtree and eBay as an idea to help me in a different sphere.

marvinpinto8y ago· 2 in thread

I would recommend using Headless Chrome along with a library like puppeteer[0]. You get the advantage of using a real browser with which you run pages' javascript, load custom extensions, etc.

[0]: https://github.com/GoogleChrome/puppeteer

pteredactyl8y ago

I second this. I built using beautiful soup before and found Puppeteer much easier when interacting with the web. Especially nasty .NET sites.

elyrly8y ago

Simple and straight forward, +1

indescions_20178y ago· 2 in thread

Headless Chrome, Puppeteer, NodeJS (jsdom), and MongoDB. Fantastic stack for web data mining. Async based using promises for explicit user input flow automation.

jdc05898y ago

I had a ton of issues with JsDom historically. They could have been fixed, but Cheerio always worked out better for me.

c0nfused8y ago

I agree with headless chrome.

I have used it with a locally hosted extension to allow easy access to dom and JavaScript after load. Then dumped results to a node app. Was very happy with the results.

levi_n8y ago· 2 in thread

I use a combination of Selenium and Python packages (BeautifulSoup). I'm primarily interested in scraping data that is supplied via JavaScript, and I find Selenium to be the most reliable way to scrape that info. I use BS when the scraped page has a lot of data, which slows Selenium down, so I pipe the page source from Selenium, with all JavaScript rendered, into BS.

I use explicit waits exclusively (no direct calls like `driver.find_foo_by_bar`), and find it vastly improves selenium reliability. (Shameless plug) I have a python package, Explicit[1], that makes it easier to use explicit waits.

[1] https://pypi.python.org/pypi/explicit
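
For anyone unfamiliar with the idea, an explicit wait is just "poll a condition until it holds or a timeout expires". Here is a generic standard-library sketch of that pattern. It is neither the Explicit package's API nor Selenium's own WebDriverWait; `wait_until` and `element_ready` are illustrative names only.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for "has the element been rendered yet?"
attempts = []


def element_ready():
    attempts.append(1)
    return "rendered" if len(attempts) >= 3 else None


value = wait_until(element_ready, timeout=2.0, poll_interval=0.01)
```

With a real browser, `condition` would look up the element and return it once present, which is what makes this more reliable than a fixed sleep.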

bluntfang8y ago

>I'm primarily interested in scraping data that is supplied via javascript, and I find Selenium to be the most reliable way to scrape that info.

Have you found that you aren't able to find accessible APIs to request against? Have you ever tried to contact the administrators to see if there's an API you could access? Are you scraping data that would be against ToS if you tried to get it in a way that would benefit both you and the target web site?

levi_n8y ago

>Have you found that you aren't able to find accessible APIs to request against?

I'm scraping from variety of different websites (1000+) that my org doesn't own. Reconfiguring to hit APIs would be complex, and a maintenance problem, both of which I easily avoid by using selenium to drive an actual browser, at the expense of time.

>Have you ever tried to contact the administrators to see if there's an API you could access?

Just not feasible given the scope and breadth of the scraping.

>Are you scraping data that would be against ToS if you tried to get it in a way that would benefit both you and the target web site?

I inspect and respect the robots.txt

giarc8y ago· 2 in thread

For non-coders, import.io is great. However, they used to have a generous free plan that has since gone away (you are limited to 500 records now). Still a great product; the problem is they don't have a small plan (starts at $299/month and goes up to $9,999).

adventured8y ago

I was looking at services in this area a few weeks ago to automate a small need I had and ran across these guys. They offer a free 5,000 monthly request basic plan. I gave it a try, worked fine (I ended up building my own solution for greater control). It's just for scraping open graph (with some fall-back capability) tags though.

https://www.opengraph.io/

iagovar8y ago

I use Grepsr. Really recommend, they have a Chrome extension that works like Kimono. Really easy for non technical people. If you have someone in Marketing or whatever that needs some data, maybe the only thing that they need to know is to use CSS Selectors and so on.

jmkni8y ago· 2 in thread

I've had a surprising amount of success with the HTML Agility Pack in .net, if you have a decent understanding of HTML it's pretty usable.

inglor8y ago

Try CsQuery, it's much nicer in terms of APIs.

dsschnau8y ago

same. I'm a .NET person and i do web scraping stuff on the side, HTML Agility Pack has been easy to pick up.

Doctor_Fegg8y ago· 2 in thread

If you speak Ruby, mechanize is good: https://github.com/sparklemotion/mechanize

DrSayre8y ago

I generally use mechanize when I need to scrape something from the web. I found this awhile back and it's helped me https://www.chrismytton.uk/2015/01/19/web-scraping-with-ruby...

faitswulff8y ago

I remember trying to use mechanize as a beginning rubyist and I can't recommend it from that experience. Specifically I remember poor documentation and confusing layers of abstraction. It might be better now that I know what the DOM is and how jQuery selectors work, but my first impression was abysmal.

dsacco8y ago· 2 in thread

I've done this professionally in an infrastructure processing several terabytes per day. A robust, scalable scraping system comprises several distinct parts:

1. A crawler, for retrieving resources over HTTP, HTTPS and sometimes other protocols a bit higher or lower on the network stack. This handles data ingestion. It will need to be sophisticated these days - sometimes you'll need to emulate a browser environment, sometimes you'll need to perform a JavaScript proof of work, and sometimes you can just do regular curl commands the old fashioned way.

2. A parser, for correctly extracting specific data from JSON, PDF, HTML, JS, XML (and other) formatted resources. This handles data processing. Naturally you'll want to parse JSON wherever you can, because parsing HTML and JS is a pain. But sometimes you'll need to parse images, or outdated protocols like SOAP.

3. A RDBMS, with databases for both the raw and normalized data, and columns that provide some sort of versioning to the data in a particular point in time. This is quite important, because if you collect the raw data and store it, you can re-parse it in perpetuity instead of needing to retrieve it again. This will happen somewhat frequently if you come across new data while scraping that you didn't realize you'd need or could use. Furthermore, if you're updating the data on a regular cadence, you'll need to maintain some sort of "retrieved_at", "updated_at" awareness in your normalized database. MySQL or PostgreSQL are both fine.

4. A server and event management system, like Redis. This is how you'll allocate scraping jobs across available workers and handle outgoing queuing for resources. You want a centralized terminal for viewing and managing a) the number of outstanding jobs and their resource allocations, b) the ongoing progress of each queue, c) problems or blockers for each queue.

5. A scheduling system, assuming your data is updated in batches. Cron is fine.

6. Reverse engineering tools, so you can find mobile APIs and scrape from them instead of using web targets. This is important because mobile API endpoints a) change far less frequently than web endpoints, and b) are far more likely to be JSON formatted, instead of HTML or JS, because the user interface code is offloaded to the mobile client (iOS or Android app). The mobile APIs will be private, so you'll typically have to reverse engineer the HMAC request signing algorithm, but that is virtually always trivial, with the exception of companies that really put effort into obfuscating the code. apktool, jadx and dex2jar are typically sufficient for this if you're working with an Android device.

7. A proxy infrastructure, so that you're not constantly pinging a website from the same IP address. Even if you're being fairly innocuous with your scraping, you probably want this, because many websites have been burned by excessive spam and will conscientiously and automatically ban any IP address that issues something nominally more than a regular user, regardless of volume. Your proxies come in several flavors: datacenter, residential and private. Datacenter proxies are the first to be banned, but they're cheapest. These are proxies resold from datacenter IP ranges. Residential IP addresses are IP addresses that are not associated with spam activity and which come from ISP IP ranges, like Verizon Fios. Private IP addresses are IP addresses that have not been used for spam activity before and which are reserved for use by only your account. Naturally this is in order from lower to greater expense; it's also in order from most likely to least likely to be banned by a scraping target. NinjaProxies, StormProxies, Microleaf, etc. are all good options. Avoid Luminati, which offers residential IP addresses contributed by users who don't realize their IP addresses are being leased through the use of Hola VPN.
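
To illustrate item 3 (raw and normalized storage with timestamps), here is a minimal sketch, assuming SQLite as a stand-in for MySQL/PostgreSQL. The table names, columns, URL and sample JSON payload are invented for the example.

```python
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL/PostgreSQL
conn.executescript("""
CREATE TABLE raw_responses (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    body TEXT NOT NULL,
    retrieved_at TEXT NOT NULL
);
CREATE TABLE companies (
    company_number TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    raw_id INTEGER NOT NULL REFERENCES raw_responses(id),
    updated_at TEXT NOT NULL
);
""")


def store_raw(url, body):
    """Keep the response exactly as received, stamped with retrieval time."""
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO raw_responses (url, body, retrieved_at) VALUES (?, ?, ?)",
        (url, body, now),
    )
    return cur.lastrowid


def normalize(raw_id):
    """Re-parse a stored raw response; safe to rerun when the parser changes."""
    body, retrieved_at = conn.execute(
        "SELECT body, retrieved_at FROM raw_responses WHERE id = ?", (raw_id,)
    ).fetchone()
    record = json.loads(body)
    # updated_at mirrors the retrieval time of the raw row it came from.
    conn.execute(
        "INSERT OR REPLACE INTO companies VALUES (?, ?, ?, ?)",
        (record["number"], record["name"], raw_id, retrieved_at),
    )


raw_id = store_raw("https://api.example/company/1234",
                   '{"number": "1234", "name": "Acme Ltd"}')
normalize(raw_id)
```

Because `normalize` only reads from `raw_responses`, adding a new field later means re-running it over stored rows rather than hitting the target again.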

Each website you intend to scrape is given a queue. Each queue is assigned a specific allotment of workers for processing scraping jobs in that queue. You'll write a bunch of crawling, parsing and database querying code in an "engine" class to manage the bulk of the work. Each scraping target will then have its own file which inherits functionality from the core class, with the specific crawling and parsing requirements in that file. For example, implementations of the POST requests, user agent requirements, which type of parsing code needs to be called, which database to write to and read from, which proxies should be used, asynchronous and concurrency settings, etc should all be in here.
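
The engine-plus-target layout could look roughly like the sketch below. Every name in it (`ScrapeEngine`, `ExampleRegistryTarget`, the request dict fields, the proxy pool labels) is hypothetical, and requests are plain dicts rather than real HTTP calls.

```python
class ScrapeEngine:
    """Shared crawling plumbing; target classes only override what differs."""

    user_agent = "Mozilla/5.0 (compatible; example-bot)"  # hypothetical default
    proxy_pool = "datacenter"
    concurrency = 4

    def build_request(self, url):
        """Describe a request; a real engine would hand this to an HTTP client."""
        return {
            "method": "GET",
            "url": url,
            "headers": {"User-Agent": self.user_agent},
            "proxy_pool": self.proxy_pool,
        }

    def parse(self, raw_body):
        raise NotImplementedError("each target supplies its own parser")


class ExampleRegistryTarget(ScrapeEngine):
    """Hypothetical target file: only the site-specific bits live here."""

    proxy_pool = "residential"
    concurrency = 1

    def build_request(self, url):
        # This target needs a POST with a form body instead of a plain GET.
        request = super().build_request(url)
        request["method"] = "POST"
        request["data"] = {"query": "all"}
        return request

    def parse(self, raw_body):
        # Toy parser: one record per non-empty line.
        return [line.strip() for line in raw_body.splitlines() if line.strip()]


target = ExampleRegistryTarget()
request = target.build_request("https://registry.example/search")
```

Adding a new site then means adding one small subclass, while retries, proxies and storage stay in the shared engine.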

Once triggered in a job, the individual scraping functions will call to the core functionality, which will build the requests and hand them off to one of a few possible functions. If your code is scraping a target that has sophisticated requirements, like a JavaScript proof of work system or browser emulation, it will be handed off to functionality that implements those requirements. Most of the time, this won't be needed and you can just make your requests look as human as possible - then it will be handed off to what is basically a curl script.

Each request to the endpoint is a job, and the queue will manage them as such: the request is first sent to the appropriate proxy vendor via the proxy's API, then the response is sent back through the proxy. The raw response data is stored in the raw database, then normalized data is processed out of the raw data and inserted into the normalized database, with corresponding timestamps. Then a new job is sent to a free worker. Updates to the normalized data will be handled by something like cron, where each queue is triggered at a specific time on a specific cadence.

You'll want to optimize your workflow to use endpoints which change infrequently and which use lighter resources. If you are sending millions of requests, loading the same boilerplate HTML or JS data is a waste. JSON resources are preferable, which is why you should invest some amount of time before choosing your endpoint into seeing if you can identify a usable mobile endpoint. For the most part, your custom code is going to be in middleware and the parsing particularities of each target; BeautifulSoup, QueryPath, Headless Chrome and JSDOM will take you 80% of the way in terms of pure functionality.

kbenson8y ago

> 3. A RDBMS, with databases for both the raw and normalized data

I've found the filesystem (local or network, depending on scale) works well for the raw data. A normalized file name with a timestamp and job identifier in a hashed directory structure of some sort (I generally use $jobtype/%Y-%m-%d/%H/ as a start) works well, and reading and writing gzip is trivial (and often you can just output the raw content of gzip encoded payloads). The filesystem is an often overlooked database. If you end up needing more transactional support, or to easily identify what's been processed or not, look at how Maildir works.
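
A minimal sketch of that layout, assuming a file name of `<timestamp>-<job id>.gz` (the naming scheme is my guess, only the `$jobtype/%Y-%m-%d/%H/` directory pattern comes from the description above):

```python
import gzip
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def raw_path(root, jobtype, job_id, when):
    """Build root/jobtype/YYYY-MM-DD/HH/<timestamp>-<job_id>.gz"""
    directory = Path(root) / jobtype / when.strftime("%Y-%m-%d") / when.strftime("%H")
    return directory / f"{when.strftime('%Y%m%dT%H%M%S')}-{job_id}.gz"


def write_raw(root, jobtype, job_id, payload, when=None):
    """Gzip the raw payload into the dated directory and return its path."""
    when = when or datetime.now(timezone.utc)
    path = raw_path(root, jobtype, job_id, when)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wb") as handle:
        handle.write(payload)
    return path


def read_raw(path):
    """Read a stored payload back, decompressing it."""
    with gzip.open(path, "rb") as handle:
        return handle.read()


root = tempfile.mkdtemp()
stamp = datetime(2017, 11, 1, 14, 30, 0, tzinfo=timezone.utc)
stored = write_raw(root, "listing", "job42", b"<html>raw page</html>", when=stamp)
```

Listing a day's or an hour's worth of raw data is then just a directory walk, with no database query needed.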

After normalization, the database is ideal though.

That said, I was doing a few gigabytes a day, not a few terabytes, so you might have run into some scale issues I didn't. I was able to keep it to mostly one box for crawling and parsing, but crawlers ended up being complex and job-queue driven enough that expanding to multiple systems wouldn't have been all that much extra work (an assessment I feel confident in, having done similar things before).

tomc19858y ago

A decade ago I worked for a company that also scraped data at this scale and your advice is spot-on!

austincheney8y ago· 2 in thread

This is perhaps the fastest way to screenscrape a dynamically executed website.

1. First go get and run this code, which allows immediate gathering of all text nodes from the DOM: https://github.com/prettydiff/getNodesByType/blob/master/get...

2. Extract the text content from the text nodes and ignore nodes that contain only white space:

    let text = document.getNodesByType(3),
        a = 0,
        b = text.length,
        output = [];
    do {
        if ((/^(\s+)$/).test(text[a].textContent) === false) {
            output.push(text[a].textContent);
        }
        a = a + 1;
    } while (a < b);
    output;

That will gather ALL text from the page. Since you are working from the DOM directly you can filter your results by various contextual and stylistic factors. Since this code is small and executes stupid fast it can be executed by bots easily.

Test this out in your browser console.

AznHisoka8y ago

And how do you do #1? Node, I presume?

austincheney8y ago

No, manually go there and copy/paste the code. Then when building your scraper bot use that code.

jppope8y ago· 2 in thread

Headless chrome in the form of puppeteer (https://github.com/GoogleChrome/puppeteer) or Chromeless (https://github.com/graphcool/chromeless) or for smaller gigs use nightmare.js (http://www.nightmarejs.org/).

scrapy is fine, but selenium, phantom, etc. are all outdated IMO

blowski8y ago

> are all outdated IMO

For what reason? Genuine question.

CGamesPlay8y ago

Phantom is woefully out of date; you need a polyfill even for Function.bind. Firefox dropped support for Selenium in 47, and Chrome only supports it with a wrapper called chromedriver.

riekus8y ago· 2 in thread

Depends on your skillset and the data you want to scrape. I am testing the waters for a new business that relies on scraped data. As a non-programmer I had good success testing stuff with contentgrabber. Import.io also gets mentioned a lot. I tried out octoparse, but it wasn't stable with the scraping.

selllikesybok8y ago

I find the desktop tool by import.io a little challenging to work with. Their toy web-demo is solid for simple table extraction, though.

wtfdaemon8y ago

It's gotten light-years better since the desktop tool existed.

They've completely deprecated/sun-setted the desktop tool in favor of a greatly improved web application.

OzzyB8y ago· 2 in thread

A good host xD

Preferably one that doesn't mind giving you a bunch of IPs, and if they do, don't charge a fortune for them.

Then you can worry about what software you're gonna use.

eccfcco158y ago

Which hosts have you used, or would you recommend?

OzzyB8y ago

OVH

You can get up to 256 IPs per server and _not_ pay monthly fees -- just a $3 upfront setup charge.

You're welcome xD

frausto8y ago· 2 in thread

Been getting blocked by recaptcha more and more, do any of these tools handle dealing with that or workarounds by default? Tried routing through proxies and swapping IP addresses, slowing down, etc... Any specific ways people get around that?

jakubbalada8y ago

You can use services like Anti-captcha [1]

We have a public API on Apify for that [2]

[1] https://anti-captcha.com/mainpage

[2] https://www.apify.com/petr_cermak/anti-captcha-recaptcha

levi_n8y ago

The accepted answer on this Stack Overflow question[1] might help. tl;dr is to build your own chromedriver, but with renamed variables.

[1] https://stackoverflow.com/a/41220267/4079962

phsource8y ago· 1 in thread

For someone on a Javascript stack, I highly recommend combining a requester (e.g., "request" or "axios") with Cheerio, a server-side jQuery clone. Having a familiar, well-known interface for selection helps a lot.

We use this stack at WrapAPI (https://wrapapi.com), which we highly recommend as a tool to turn webpages into APIs. It doesn't completely do all the scraping (you still need to write a script), but it does make turning a HTML page into a JSON structure much easier.

nn7578y ago

Isn't cheerio only for static content?

mping8y ago· 1 in thread

I use nightmarejs https://github.com/segmentio/nightmare which is based on electron; I recommend it if you're on js

Cyph0n8y ago

That looks like a pretty interesting scraping library.

baldfat8y ago· 1 in thread

I use R, since that is the language I mostly use, with httr and rvest. Edit: I missed typing rvest; thanks for the comments. You use the two together.

https://cran.r-project.org/web/packages/httr/vignettes/quick...

amrrs8y ago

Rvest is also another nice option in R.

ravenstine8y ago· 1 in thread

It depends on what you're trying to do.

For most things, I use Node.js with the Cheerio library, which is basically a stripped-down version of jQuery without the need for a browser environment. I find using the jQuery API far more desirable than the clunky, hideous Beautiful Soup or Nokogiri APIs.

For something that requires an actual DOM or code execution, PhantomJS with Horseman works well, though everyone is talking about headless Chrome these days so IDK. I've not had nearly as many bad experiences with PhantomJS as others have purportedly experienced.

imjasonmiller8y ago

I have been playing around with Cheerio for a short while and it is quite cool! Although extracting comments wasn't as straightforward as I thought it would be.

Do you have any experience with processing and scraping large files using Cheerio? It doesn't support streaming does it? I am currently faced with processing a ~75 MB XML and I am not sure if Cheerio is suited for that.

mrskitch8y ago· 1 in thread

I’d recommend puppeteer or some other Chrome driver. It’s fast and resilient even on single page apps.

If you’re looking to run it on a Linux machine also take a look at https://browserless.io (full disclosure I’m the creator of that site).

mrskitch8y ago

I should note that this doesn't lock you into any particular lib, just solves the problem of running on Chrome in a service like fashion.

hmottestad8y ago· 1 in thread

If you know Java, then my go to library is Jsoup https://jsoup.org/

It lets you use jQuery-like selectors to extract data.

Like this: Elements newsHeadlines = doc.select("#mp-itn b a");

jasondc8y ago

+1 Saves a ton of time, and very simple to use

cdolan8y ago· 1 in thread

Outwit Hub, specifically the advanced or enterprise levels.

It has a GUI on it that is not designed very well, and documentation that is complete, but hard to search...

But it can do just about any type of scrape, including getting started from a command line script

selllikesybok8y ago

Second this. My go-to for years now. Inexpensive for what it does. Factor in the cost of building out its features in your home-rolled solution, and you'll be saving a ton. Plus the team is very responsive if you need support, and is open to small consulting projects if you need something beyond your own abilities.

mfontani8y ago· 1 in thread

Whatever you end up using for scraping, I beg you to pick a unique user-agent which allows a webmaster to understand which crawler it is, to better allow it to pass through (or be banned, depending).

Don't stick with the default "scrapy" or "Ruby" or "Jakarta Commons-HttpClient/...", which end up (justly) being banned more easily than unique ones, like "ABC/2.0 - https://example.com/crawler" or the like.

danso8y ago

Note that for some libraries, the agent is set to empty or whatever the default is for the tool (e.g. `curl/7.43.0` for curl). It's always worth setting it to something.

As a frequent scraper of government sites, and sometimes commercial sites for research purposes, I avoid faking a User Agent as much as possible, i.e. copying the default strings for popular browsers:

`Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36`

Almost always, if a site rejects my scraper on the basis of agent, they're doing a regex for "curl", "wget" or for an empty string. Setting a user-agent to something unique and explicit, i.e. "Dan's program by danso@myemail.com" works fine without feeling shady.

Maybe for old government sites that break on anything but IE, you'll have to pretend to be IE, but that's very rare.
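
A minimal Python sketch of setting an explicit agent with the standard library. The crawler name and contact URL are placeholders (assumptions, not a real crawler), and `build_request` is just an illustrative helper.

```python
import urllib.request

# Hypothetical identity: replace with your own crawler name and contact page.
USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/crawler)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that identifies the crawler instead of the library default."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://example.com/")
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
# To actually fetch the page: urllib.request.urlopen(req, timeout=10)
```

Most HTTP libraries have an equivalent headers option, so the same idea carries over to whatever tool you pick.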

256cats8y ago· 1 in thread

I use Node and either puppeteer[0] or plain Curl[1]. IMO Curl is years ahead of any Node.js request lib. For proxies I use (shameless plug!) https://gimmeproxy.com .

[0] https://github.com/GoogleChrome/puppeteer

[1] https://github.com/JCMais/node-libcurl

sagivo8y ago

Really nice concept.

traviswingo8y ago· 1 in thread

I’ve been using puppeteer to scrape and it’s been fantastic. Since it’s a headless browser, it can handle SPA just as well as server side loaded traditional websites. It’s also incredibly easy to use with async/await.

ajcodez8y ago

I assume this puppeteer:

- https://github.com/GoogleChrome/puppeteer

mateuszf8y ago· 1 in thread

`clj-http`, `enlive`, `cheshire` in case of `clojure` worked fine for me

tuddman8y ago

and 'hickory' [https://github.com/davidsantiago/hickory] to work with the site data however you want.

kzisme8y ago· 1 in thread

So in general what do most people use web scraping for? Is it building up their own database of things not available via an API or something? It always sounds interesting, but the need for it is what confuses me.

tmuir8y ago

I've generally used it to sort data in some way that's not available on the original webpage. Either into a csv file, making large lists easier to view, or to determine some optimum, such as the best price.

- Which squares have historically hit the most often in Superbowl Squares (http://www.picks.org/nfl/super-bowl-squares)

- Search a job website for a search term and list of locations, collecting each job title, company, location, and link, to view as one large spreadsheet, instead of having to navigate through 10 results per page.

- Collect cost of living indices in a list of cities
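
As a rough illustration of the spreadsheet and best-price cases, here is a small Python sketch. The `listings` rows are invented sample data standing in for whatever a scraper collected, and `to_csv` is a hypothetical helper.

```python
import csv
import io

# Invented sample rows; in practice these come out of your scraper.
listings = [
    {"store": "A", "item": "widget", "price": 12.50},
    {"store": "B", "item": "widget", "price": 9.99},
    {"store": "C", "item": "widget", "price": 11.00},
]


def to_csv(rows):
    """Serialize the rows as CSV text that opens as one large spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["store", "item", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# "Determine some optimum": here, the cheapest listing.
best = min(listings, key=lambda row: row["price"])
print(to_csv(listings))
print("best price:", best["store"], best["price"])
```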

21stio8y ago· 1 in thread

golang

deathemperor8y ago

I signed up for proxycrawl, used the javascript api to access a SPA website written in React and it just shows a blank page. https://api.proxycrawl.com/?token=aDcC1lB-NZ5_r4vMSN-L3A&url... (I don't mind that my token is exposed)

beernutz8y ago

The absolute best tool i have found for scraping is Visual Web Ripper.

It is not open source, and runs in windows only, but it is one of the easiest to use tools that i have found. I can set up scrapes entirely visually, and it handles complex cases like infinite scroll pages, highly javascript dependent pages and the like. I really wish there were an open source solution that was as good as this one.

I use it with one of my clients professionally. Their support is VERY good btw.

http://visualwebripper.com/

hydragit8y ago

WebOOB [0] is a good Python framework for scraping websites. It's mostly used to aggregate data from multiple websites by having each site backend implement an abstract interface (for example the CapBank abstract interface for parsing banking sites), but it can be used without that part.

On the pure scraping side, it has a "declarative parsing" to avoid painful plain-old procedural code [1]. You can parse pages by simply specifying a bunch of XPaths and indicating a few filters from the library to apply on those XPath elements, for example CleanText to remove whitespace nonsense, Lower (to lower-case), Regexp, CleanDecimal (to parse as number) and a lot more. URL patterns can be associated to a Page class of such declarative parsing. If declarative becomes too verbose, it can always be replaced locally by writing a plain-old Python method.
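
To give a feel for the declarative style, here is a stripped-down Python sketch using only the standard library. This is not WebOOB's actual API: `clean_text`, `clean_decimal`, `ACCOUNT_FIELDS` and `parse_item` are hypothetical names, and ElementTree only supports a small XPath subset and needs well-formed markup.

```python
import re
import xml.etree.ElementTree as ET


def clean_text(value):
    """Collapse runs of whitespace, in the spirit of a CleanText filter."""
    return re.sub(r"\s+", " ", value).strip()


def clean_decimal(value):
    """Parse a number out of text such as ' 1,234.50 EUR '."""
    return float(re.sub(r"[^0-9.\-]", "", value))


# Declarative part: field name -> (XPath, filters applied in order).
ACCOUNT_FIELDS = {
    "label": (".//td[@class='label']", [clean_text, str.lower]),
    "balance": (".//td[@class='balance']", [clean_text, clean_decimal]),
}


def parse_item(element, fields):
    """Apply each field's XPath and filters to one element and return a dict."""
    item = {}
    for name, (xpath, filters) in fields.items():
        value = "".join(element.find(xpath).itertext())
        for apply_filter in filters:
            value = apply_filter(value)
        item[name] = value
    return item


page = ET.fromstring(
    "<table><tr><td class='label'>  Checking\n Account </td>"
    "<td class='balance'> 1,234.50 EUR </td></tr></table>"
)
print([parse_item(row, ACCOUNT_FIELDS) for row in page.findall(".//tr")])
```

The point of the pattern is that adding a field means adding one line to the mapping rather than more procedural code.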

A set of applications is provided to visualize extracted data, and other niceties are provided to ease debugging. Simply put: « Wonderful, Efficient, Beautiful, Outshining, Omnipotent, Brilliant: meet WebOOB ».

[0] http://weboob.org/

[1] http://dev.weboob.org/guides/module.html#parsing-of-pages

zapperdapper8y ago

No one has mentioned it so I will: consider Lynx, the text-mode web browser. Being command-line, you can automate it with Bash or even Python. I have used it quite happily to crawl largeish static sites (10,000+ web pages per site). Do a `man lynx`; the options of interest are -crawl, -traversal, and -dump. Pro tip: use it in conjunction with HTML TIDY prior to the parsing phase (see below).

I have also used custom written Python crawlers in a lot of cases.

The other thing I would emphasize is that a web scraper has multiple parts, such as crawling (downloading pages) and then actually parsing the page for data. The systems I've set up in the past typically are structured like this:

1. crawl - download pages to file system

2. clean then parse (extract data)

3. ingest extracted data into database

4. query - run adhoc queries on database
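
The stages above could be sketched in Python roughly like this, using only the standard library. It is a toy under stated assumptions: the `crawled` dict stands in for pages already saved by the crawl step, the "data" extracted is just the page title, and `TitleParser`, `parse` and `run_pipeline` are illustrative names.

```python
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse(html):
    """Stage 2: clean then parse (here: extract and strip the title)."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def run_pipeline(pages):
    """pages maps url -> raw HTML, standing in for files saved by stage 1."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT)")
    for url, html in pages.items():
        # Stage 3: ingest the extracted data into the database.
        db.execute("INSERT INTO pages VALUES (?, ?)", (url, parse(html)))
    db.commit()
    return db


crawled = {
    "https://example.com/a": "<html><head><title> First page </title></head></html>",
    "https://example.com/b": "<html><head><title>Second page</title></head></html>",
}
db = run_pipeline(crawled)
# Stage 4: run ad hoc queries.
rows = db.execute("SELECT title FROM pages ORDER BY url").fetchall()
print(rows)
```

Keeping the stages separate means a parser bug only costs you a re-parse of the saved pages, not a re-crawl.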

One of the trickiest things in my experience is managing updates. So when new articles/content are added to the site you only want to have to get and add that to your database, rather than crawl the whole site again. Also detecting updated content can be tricky. The brute force approach of course is just to crawl the whole site again and rebuild the database - not ideal though!
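
One simple way to detect updated content, sketched in Python under the assumption that you keep a hash per URL from the previous crawl. `classify` and the `seen` dict are illustrative names; a real system would persist the hashes, and hashing raw HTML will flag cosmetic changes (rotating ads, timestamps) as updates.

```python
import hashlib


def content_hash(html):
    """Fingerprint a page body so two crawls can be compared cheaply."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def classify(seen, url, html):
    """Return 'new', 'updated' or 'unchanged', and record the latest hash."""
    digest = content_hash(html)
    previous = seen.get(url)
    seen[url] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "updated"


seen = {}
print(classify(seen, "https://example.com/a", "<p>v1</p>"))  # new
print(classify(seen, "https://example.com/a", "<p>v1</p>"))  # unchanged
print(classify(seen, "https://example.com/a", "<p>v2</p>"))  # updated
```

This still downloads every page, so it only saves the parse and ingest work; avoiding the download itself needs help from the site (sitemaps, feeds, or conditional requests).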

Of course, this all depends really on what you are trying to do!

deathemperor8y ago

I've just finished my research on web scraping for my company (it took me about 7 days). I started with import.io and scrapinghub.com for point-and-click scraping to see if I could do it without writing code. Ultimately, UI point-and-click scraping is for non-technical users. There is a lot of data you would find hard to scrape. For example, lazada.com.my stores the product's SKU inside an attribute that looks like <div data-sku-simple="SKU11111"></div>, which I couldn't get. import.io's pricing is also an issue: having to pay $999 a month for API access to the data is just too high.
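
For what it's worth, an attribute like that is easy to pull out once you are writing code. A minimal Python sketch with the standard library; `AttributeCollector` and `extract_attribute` are made-up names, and the HTML is the snippet from the example above.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the values of one attribute from every tag that carries it."""

    def __init__(self, attribute):
        super().__init__()
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        for name, value in attrs:
            if name == self.attribute and value is not None:
                self.values.append(value)


def extract_attribute(html, attribute):
    """Return every value of `attribute` found in the HTML, in document order."""
    collector = AttributeCollector(attribute)
    collector.feed(html)
    collector.close()
    return collector.values


html = '<div data-sku-simple="SKU11111"></div>'
print(extract_attribute(html, "data-sku-simple"))
```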

So I decided to use scrapy, the core of scrapinghub.com.

I haven't written much python before but scrapy was very easy to learn. I wrote 2 spiders and ran them on scrapinghub (their serverless cloud). Scrapinghub supports job scheduling and many other things at a cost. I prefer scrapinghub because we don't have DevOps in my team. It also supports Crawlera to prevent IP banning, Portia for point and click (still in beta, and still hard to use), and Splash for SPA websites, but it's buggy and the github repo is not under active maintenance.

For DOM query I use BeautifulSoup4. I love it. It's jQuery for python.

For SPA websites I wrote a scrapy middleware which uses puppeteer. The puppeteer is deployed on Amazon Lambda (1m free request first 365 days, more than enough for scraping) using this https://github.com/sambaiz/puppeteer-lambda-starter-kit

I am planning to use Amazon RDS to store scraped data.

cholmon8y ago

I recently stumbled across http://go-colly.org/, that looks well thought out and simple to use. It seems like a slimmed down Go version of Scrapy.

khuknows8y ago

Shameless plug - I built this tiny API for scraping and it works a treat for my uses: https://jsonify.link/

A few similar tools also exist, like https://page.rest/.

polote8y ago

I maintain about 8 crawlers and I use only vanilla Python

I have a function to help me search :

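   def find_r(value, ind, array, stop_word):
       # Walk through the markers in order, each search starting just past the previous hit.
       indice = ind
       for i in array:
           indice = value.find(i, indice) + 1
       # The wanted text runs from after the last marker up to the stop word.
       end = value.find(stop_word, indice)
       return value[indice:end], end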

You can use it like that :

   resulting_text , end_index = find_r(string, start_index, ["<td", ">"], "</td")

To find text it is quite fast and you don't need to master a framework
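
A small worked example of chaining two calls, with the function repeated so the snippet runs on its own. The `row` string is made-up sample input. Two caveats that follow from how `str.find` works: the last marker should be a single character (the `+ 1` only steps past one character), and a marker that is not found makes the search restart from the beginning of the string.

```python
def find_r(value, ind, array, stop_word):
    # Walk through the markers in order, each search starting just past the previous hit.
    indice = ind
    for i in array:
        indice = value.find(i, indice) + 1
    # The wanted text runs from after the last marker up to the stop word.
    end = value.find(stop_word, indice)
    return value[indice:end], end


row = "<tr><td>a</td><td>b</td></tr>"

# First cell: search from the start of the string.
first, end_index = find_r(row, 0, ["<td", ">"], "</td")
# Second cell: resume from where the first match ended.
second, end_index = find_r(row, end_index, ["<td", ">"], "</td")

print(first, second)  # a b
```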

jacinda8y ago

If you're specifically looking at news articles, go for the Python library Newspaper: http://newspaper.readthedocs.io/en/latest/

Auto-detection of languages, and will automatically give you things like the following:

    >>> article.parse()
    >>> article.authors
    [u'Leigh Ann Caldwell', 'John Honway']
    >>> article.text
    u'Washington (CNN) -- Not everyone subscribes to a New Year's resolution...'
    >>> article.top_image
    u'http://someCDN.com/blah/blah/blah/file.png'
    >>> article.movies
    [u'http://youtube.com/path/to/link.com', ...]

mmmnt8y ago

For very simple tasks Listly seems to be a fast and good solution: http://www.listly.io/

If you need more power, I heard good stuff about http://80legs.com/ though never tried them myself.

If you really need to do crazy shit like crawling the iOS App Store really fast and keeping things up to date, I suggest using Amazon Lambda and a custom Python parser. Though Lambda is not meant for this kind of thing, it works really well and is super scalable at a reasonable price.

btb8y ago

We have been using kapow robosuite for close to 10 years now. It's a commercial GUI-based tool which has worked well for us; it saves us a lot of maintenance time compared to our previous hand-rolled code extraction pipeline. The only problem is that it's very expensive (pricing seems catered towards very large enterprises).

So I was really hoping this thread would have revealed some newer commercial GUI-based alternatives (on-premise, not SaaS), because I don't ever want to go back to the maintenance hell of hand-rolled robots again :)

kanishkalinux8y ago

For mostly static pages, requests/pycurl + beautifulsoup is more than sufficient. For advanced scraping, take a look at scrapy.

for javascript heavy pages most people rely on selenium webdriver. However you can also try hlspy (https://github.com/kanishka-linux/hlspy), which is a little utility I made a while ago for dealing with javascript heavy pages for simple usage.

etatoby8y ago

If you need to scrape content from complex JS apps (eg. React) where it doesn't pay to reverse engineer their backend API (or worse, it's encrypted/obfuscated) you may want to look at CasperJS.

It's a very easy to use frontend to PhantomJS. You can code your interactions in JS or CoffeeScript and scrape virtually anything with a few lines of code.

If you need crawling, just pair a CasperJS script with any spider library like the ones mentioned around here.

theden8y ago

I've had good success with scrapy (https://scrapy.org/) for my personal projects

Jeaye8y ago

I've written a bit on web scraping with Clojure and Enlive here: https://blog.jeaye.com/2017/02/28/clojure-apartments/

That's what I'd use, if I had to scrape again (no JS support).

vrathee8y ago

If you are looking for SaaS or managed services, Try https://www.agenty.com/

Agenty is a cloud-hosted web scraping app, and you can set up scraping agents using their point-and-click CSS Selector Chrome extension to extract anything from HTML with these 3 modes:

- TEXT: Simple clean text

- HTML: Outer or inner HTML

- ATTR: Any attribute of an HTML tag, like image src, hyperlink href…

Or advanced modes like REGEX, XPATH etc.

You then save the scraping agent to execute on the cloud-hosted app, with advanced features like batch crawling, scheduling, and scraping multiple websites simultaneously, without worrying about IP-address blocks or speed.

doominasuit8y ago

If you need to interpret javascript, or otherwise simulate regular browsing as closely as possible, you may consider running a browser inside a container and controlling it with selenium. I have found it’s necessary to run inside the container if you do not have a desktop environment. This is better suited for specific use cases rather than mass collection because it is slower to run a full browsing stack than to only operate at the HTTP layer. I have found that alternatives like phantomJS are hard to debug. Consider opening VNC on the container for debugging. Containers like this that I know of are SeleniumHQ and elgalu/selenium.

jpetersonmn8y ago

I used to use a combo of python tools. Requests, beautifulsoup mostly. However the last few things I've built used selenium to drive headless chrome browsers. This allows me to run the javascript most sites use these days.

jancurn8y ago

Apify (https://www.apify.com) is a web scraping and automation platform where you can extract data from any website using a few simple lines of JavaScript. It's using headless browsers, so that people can extract data from pages that have complex structure, dynamic content or employ pagination.

Recently the platform added support for headless Chrome and Puppeteer, you can even run jobs written in Scrapy or any other library as long as it can be packaged as Docker container.

Disclaimer: I'm a co-founder of Apify

servitor8y ago

I agree with others, with curl and the likes you will hit insurmountable roadblocks sooner or later. It's better to go full headless browser from the start.

I use a python->selenium->chrome stack. The Page Object Model [0] has been a revelation for me. My scripts went from being a mess of spaghetti code to something that's a pleasure to write and maintain.

[0] https://www.guru99.com/page-object-model-pom-page-factory-in...
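
To show the shape of the pattern without needing a browser, here is a toy Python sketch. Everything in it is hypothetical: `FakeDriver` stands in for a real Selenium driver, and the shop.example URLs and selectors are invented. With Selenium, the lookups inside the page object would go through the real driver's element-finding calls instead.

```python
class FakeDriver:
    """Stand-in for a browser driver so the sketch runs without Selenium."""

    def __init__(self, pages):
        self.pages = pages  # url -> {selector: text}
        self.current = {}

    def get(self, url):
        self.current = self.pages[url]

    def text_of(self, selector):
        return self.current[selector]


class ProductPage:
    """Page object: the only place that knows this page's URL scheme and selectors."""

    TITLE = "h1.product-title"
    PRICE = "span.price"

    def __init__(self, driver):
        self.driver = driver

    def open(self, product_id):
        self.driver.get(f"https://shop.example/products/{product_id}")
        return self

    @property
    def title(self):
        return self.driver.text_of(self.TITLE)

    @property
    def price(self):
        # Strip the currency sign so callers get a number.
        return float(self.driver.text_of(self.PRICE).lstrip("$"))


driver = FakeDriver({
    "https://shop.example/products/42": {
        "h1.product-title": "Widget",
        "span.price": "$9.99",
    },
})
page = ProductPage(driver).open(42)
print(page.title, page.price)
```

When the site changes its markup, only the selectors in `ProductPage` need updating; the scripts that use `page.title` and `page.price` stay the same.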

sl0wik8y ago

I had great experience with www.apify.com.

Softcadbury8y ago

With node, you can use cheerio [0]. It allows you to parse HTML pages with a jQuery-like syntax. I use it in production on my project [1]

[0] https://github.com/cheeriojs/cheerio

[1] https://github.com/Softcadbury/football-peek/blob/master/ser...

colinchartier8y ago

We had a really tough time scraping dynamic web content using scrapy, and both scrapy and selenium require you to write a program (and maintain it) for every separate website that you have to scrape. If the website's structure changes you need to debug your scraper. Not fun if you need to manage more than 5 scrapers.

It was so hard that we made our own company JUST to scrape stuff easily without requiring programming. Take a look at https://www.parsehub.com

mitchtbaum8y ago

I made this https://www.drupal.org/project/example_web_scraper and produced the underlying code many years ago. The idea is to map xpath queries to your data model and use some reusable infrastructure to simply apply it. It was very good, imho (for what it was). (I'm writing this comment since I don't see any other comments with the words map or model :/ )

bbayer8y ago

I am really surprised nobody mentioned pyspider. It is simple, has a web dashboard and can handle JS pages. It can store data to a database of your choice. It can handle scheduling and recrawling. I have used it to crawl Google Play. A $5 Digital Ocean VPS with pyspider installed on it could handle millions of pages crawled, processed and saved to a database.

http://docs.pyspider.org/en/latest/

mrkeen8y ago

I made a crawler https://github.com/jahaynes/crawler

It outputs to the warc file format (https://en.wikipedia.org/wiki/Web_ARChive), in case your workflow is to gather web pages and then process them afterwards.

ngneer8y ago

https://github.com/featurist/coypu is nice for browser automation. A related question: what are good tools for database scraping, meaning replicating a backend database via a web interface (not referring to compromising the application, rather using allowed queries to fully extract the database).

dineshr938y ago

If you know java then jsoup will be very handy. [1] https://jsoup.org/

charlus8y ago

For a little diversity on tools, if you're looking for something quick that others can access the data easily - Google Apps script in a Google Sheet can be quite useful.

https://sites.google.com/site/scriptsexamples/learn-by-examp...

buildops8y ago

Why are you looking to scrape? Here's a list of some scraper bots: https://www.incapsula.com/blog/web-scraping-bots.html

What about Botscraper: http://www.botscraper.com/

wiradikusuma8y ago

I tinkered with Apache Nutch (http://nutch.apache.org/), but I found it overkill. In the end, since I use Scala, I use https://github.com/ruippeixotog/scala-scraper

laktek8y ago

One of the challenges with modern day scraping is you need to account for client-side JS rendering.

If you prefer an API as a service that can pre-render pages, I built Page.REST (https://www.page.rest). It allows you to get rendered page content via CSS selectors as a JSON response.

blueadept1118y ago

Jaunt [http://jaunt-api.com] is a good java tool.

0xdeadbeefbabe8y ago

The best tool for web scraping, for me, is something easy to deploy and redeploy; and something that doesn't rely on three working programs--eliminating selenium sounds great.

For those reasons I like https://github.com/knq/chromedp

ksahin8y ago

I wrote a blog post about Java web scraping here: https://ksah.in/introduction-to-web-scraping-with-java/

As others said, phantomJS (and now headless Chrome) are good tools to deal with heavy js websites

teremin8y ago

I use Colly[0][1] which is a young but decent scraping framework for Golang.

[0] http://go-colly.org/

[1] https://github.com/gocolly/colly

tmaly8y ago

I just tried puppeteer yesterday for the first time. It seems to work very well. My only complaint is that it is very new and does not yet have a plethora of examples.

I previously have used WWW::Mechanize in the Perl world, but single page applications with Javascript really require something with a browser engine.

RandomBookmarks8y ago

The "best tool" is different for web developers and non-coders. If you are a non-technical person that just needs some data there is:

(1) hosted services like mozenda

(2) visual automation tools like Kantu Web Automation (which includes OCR)

(3) and last but not least outsourcing the scraping on sites like Freelancer.com

thallian8y ago

I used CasperJS[0] in the past to scrape a javascript-heavy forum (ProBoards) and it worked well. But that was a few years ago; I have no idea what new strategies came up in the meantime.

[0] http://casperjs.org/

tn_8y ago

Check out Heritrix if you're looking for an open-source webscraping archival tool: https://webarchive.jira.com/wiki/spaces/Heritrix

brycematheson8y ago

Shameless plug. I wrote a blog post on how I use Powershell to scrape sites: http://brycematheson.io/webscraping-with-powershell/

jschuur8y ago

If you want to extract content and specific meta data, you might find the Mercury Web Parser useful:

https://mercury.postlight.com/web-parser/

Karupan8y ago

I've had some success using portia[1]. It's a visual wrapper over scrapy, but is actually quite useful.

[1] https://github.com/scrapinghub/portia

askz8y ago

A friend released a little tool to scrape only HTML from websites, with tor and proxy chaining

https://github.com/AlexMili/Scraptory

freeslugs8y ago

If you need simple scraping, I like traditional http request lib. For more robust scraping (ie clicking buttons / filling text), use capybara and either phantomjs or chromedriver - easy to install using homebrew!

thegrif8y ago

A ton of people recommended Scrapy - and I am always looking for senior Scrapy resources that have experience scraping at scale. Please feel free to reach out - contact info is in my profile.

sananth128y ago

If you are looking for image scraping: https://github.com/sananth12/ImageScraper

pudo8y ago

We're about to announce a new Python scraping toolkit, memorious: https://github.com/alephdata/memorious - it's a pretty lightweight toolkit, using YAML config files to glue together pre-built and custom-made components into flexible and distributed pipelines. A simple web UI helps track errors and execution can be scheduled via celery.

We looked at scrapy, but it just seemed like the wrong type of framing for the type of scrapers we build: requests, some html/xml parser, and output into a service API or a SQL store.

Maybe some people will enjoy it.

kbd8y ago

For simple tasks, curl into pup is very convenient.

https://github.com/ericchiang/pup

kopos8y ago

Scrapy [https://github.com/scrapy/scrapy] works really well.

vinitagr8y ago

https://github.com/matthewmueller/x-ray

Lxr8y ago

Python requests + lxml, with Selenium as a last resort.

fazkan8y ago

scrapy and BS4, for serious stuff. Selenium, for automating logging and other UI related stuff, you can even play games with it.

kazinator8y ago

TXR: http://www.nongnu.org/txr

crispytx8y ago

I did a little web scraping project a few years ago using:

* cURL

* regex

thejosh8y ago

If you are scraping specific pages on a site, curl. Then transform that into the language you use.

cm20128y ago

For non developers dexi.io is great.

novaleaf8y ago

i wrote a tool: PhantomJsCloud.com

it's getting a little long in the tooth, but I will be updating it soon to use a Chrome based renderer. If you have any suggestions, you can leave it here or PM me :)

aaronhoffman8y ago

This tool takes a list of URIs and crawls each site for contact info. Phone, email, twitter, etc

https://github.com/aaronhoffman/WebsiteContactHarvester

jpepinho8y ago

WebDriver.io using Selenium and PhantomJS would be a good way to go!

greyfox8y ago

i did a quick search and didnt see this listed here:

https://www.httrack.com/

etattva8y ago

Scrapy and Jsoup are best combinations

tomc19858y ago

Perl or Ruby and Regular Expressions

herbst8y ago

Nokogiri

vsupalov8y ago

That really depends on your project and tech stack. If you're into Python and are going to deal with relatively static HTML, then the Python modules Scrapy [1], BeautifulSoup [2] and the whole Python data crunching ecosystem are at your disposal. There's lots of great posts about getting such a stack off the ground and using it in the wild [3]. It can get you pretty darn far, the architecture is solid and there are lots of services and plugins which probably do everything you need.

Here's where I hit the limit with that setup: dynamic websites. If you're looking at something like discourse-powered communities or similar, and feel a bit too lazy to dig into all the ways requests are expected to look, it's no fun anymore. Luckily, there's lots of js-goodness which can handle dynamic websites, inject your javascript for convenience and more [4].

The recently published Headless Chrome [5] and puppeteer [6] (a Node API for it) are really promising for many kinds of tasks, scraping among them. You can get a first impression in this article [7]. The ecosystem does not seem to be as mature yet, but I think this will be the foundation of the next go-to scraping tech stack.

If you want to try it yourself, I've written a brief intro [8] and published a simple dockerized development environment [9], so you can give it a go without cluttering your machine or find out what dependencies you need and how the libraries are called.

[1] https://scrapy.org/

[2] https://www.crummy.com/software/BeautifulSoup/bs4/doc/

[3] http://sangaline.com/post/advanced-web-scraping-tutorial/

[4] https://franciskim.co/dont-need-no-stinking-api-web-scraping...

[5] https://developers.google.com/web/updates/2017/04/headless-c...

[6] https://github.com/GoogleChrome/puppeteer

[7] https://blog.phantombuster.com/web-scraping-in-2017-headless...

[8] https://vsupalov.com/headless-chrome-puppeteer-docker/

[9] https://github.com/vsupalov/docker-puppeteer-dev

pwaai8y ago

hey I'm working on this thing called BAML (browser automation markup language) and it looks something like this:

    OPEN http://asdf.com
    CRAWL a
    EXTRACT {'title': '.title'}

It's meant to be super simple and built from ground up to support crawling Single Page Applications.

Also, creating a terminal client (early ver: https://imgur.com/a/RYx5g) for it which will launch a Chrome browser and scrape everything. http://export.sh is still very early in the works, I'd appreciate any feedback (email in profile, contact form doesn't work).

dor_jack8y ago

If you need to perform a web-scale crawl I strongly recommend https://www.mixnode.com.

harperlee8y ago

So the learning curve for simple things makes me jump to bash scripts; scrapy might prove more valuable when your project starts to scale.

But also of course: normally the best tool is the one you already know!

Bromskloss8y ago

Would you still recommend Scrapy if the task wasn't specifically crawling?

sharmi8y ago

Nope. It is very specifically tailored to crawling. If you just need something distributed why not check out RQ [0], Gearman [1] or Celery [2]? RQ and Celery are python specific.

[0] : http://python-rq.org/docs/

[1] : http://gearman.org/

[2] : http://docs.celeryproject.org

luckystarr8y ago

I once used it to automate the, well, scraping of statistics from an affiliate network account. So you can do pretty specific stuff, as long as it involves HTTP/HTTPS requests.

dataslap8y ago

depends on the task. For example they have a decent file/image downloading middleware.

ddorian438y ago

Would you recommend it for scalable projects? Like crawling Twitter or Tumblr?

sharmi8y ago

sklarsa8y ago

samtc8y ago· 7 in thread

I maintain ~30 different crawlers. Most of them are using Scrapy. Some are using PhantomJS/CasperJS but they are called from Scrapy via a simple web service.

mapster8y ago

Mind if I ask what info/data you are scraping and for what ends?

frik8y ago

> We use Redis to send task (update / discovery) to our crawlers.

Some kind of queue implemented with Redis? How does it work?

samtc8y ago

For every website we crawl we implement a custom discovery/update logic.

Update is to update a single document. Like crawl document for company number 1234. We generate a Request [2] to crawl only that document.

[1] https://doc.scrapy.org/en/latest/topics/signals.html

[2] https://doc.scrapy.org/en/latest/topics/request-response.htm...
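
A minimal sketch of how a discovery/update task queue like this might look. The JSON message fields, the function names, and the in-process `deque` standing in for a Redis list are all assumptions made for illustration; the thread does not show the actual setup.

```python
import json
from collections import deque

# Stand-in for a Redis list; a real setup would push and pop over the network.
task_queue = deque()

def enqueue(task_type, site, document_id=None):
    """Serialize a task as JSON and append it to the queue (hypothetical schema)."""
    if task_type not in ("discovery", "update"):
        raise ValueError(f"unknown task type: {task_type}")
    task_queue.append(json.dumps(
        {"type": task_type, "site": site, "document_id": document_id}))

def handle_next():
    """Pop the oldest task and describe the crawl it would trigger."""
    task = json.loads(task_queue.popleft())
    if task["type"] == "update":
        return f"crawl document {task['document_id']} on {task['site']}"
    return f"discover new documents on {task['site']}"

enqueue("update", "example.com", document_id=1234)
enqueue("discovery", "example.com")
```

In a crawler, `handle_next` would build the actual request for the document instead of returning a description.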

thibaut_barrere8y ago

See https://sidekiq.org for instance.

CGamesPlay8y ago

Probably not what the GP uses, but Resque does this in Ruby land.

CGamesPlay8y ago

I have a similar set up! How do you monitor for failures and deal with the scrape target changing?

samtc8y ago

We monitor exceptions with Sentry. We store raw data so we don't have to hurry to fix the ETL, we only have to fix navigation logic and we keep crawling.

danso8y ago· 7 in thread

Scrapy is a whole framework that may be worthwhile, but if I were just starting out for a specific task, I would use:

- requests http://docs.python-requests.org/en/master/

- lxml http://lxml.de/

- cssselect https://cssselect.readthedocs.io/en/latest/

http://docs.python-requests.org/en/master/user/advanced/

hydragit8y ago

You could also use the WebOOB (http://weboob.org) framework. It's built on requests+lxml and it provides a Browser class usable like mechanize's one (ability to access doc, select HTML forms, etc.).

It also has nice companion features like associating url patterns to some custom Page classes where you can write what data to retrieve when a page with this url pattern is browsed.

djtriptych8y ago

All great advice. I've written dozens of small purpose-built scrapers and I love your last point.

It's pretty much always a great idea to completely separate the parts that perform the HTTP fetches and the part that figures out what those payloads mean.
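
A minimal sketch of that separation, using only the standard library. The function names, the hash-based file naming, and the choice of extracting the page title are assumptions for illustration: `fetch` is the only part that touches the network, and `parse` works purely on saved files, so it can be fixed and re-run without re-downloading anything.

```python
import hashlib
import tempfile
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen

def fetch(url, cache_dir):
    """Download a page once and keep the raw bytes on disk."""
    path = Path(cache_dir) / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if not path.exists():
        with urlopen(url, timeout=30) as response:
            path.write_bytes(response.read())
    return path

class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def parse(path):
    """Interpret a saved payload; never touches the network."""
    parser = TitleParser()
    parser.feed(Path(path).read_text(encoding="utf-8", errors="replace"))
    return parser.title.strip()

# The parser can be exercised against a saved file without any HTTP request.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.html"
    sample.write_text("<html><head><title> Hello </title></head></html>",
                      encoding="utf-8")
    parsed_title = parse(sample)
```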

Buttons8408y ago

lxml has good xpath support too; the best I've seen. I miss good xpath support in some of the other scraping options I've tried in other languages.

upofadown8y ago

>Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize.

Did the version of Mechanize written in Py2 stop being supported?

danso8y ago

Looks like it's recently been updated but no big announcement that it's Python 3 ready: https://github.com/python-mechanize/mechanize

I've also seen these alternatives:

- https://robobrowser.readthedocs.io/en/latest/

- https://github.com/MechanicalSoup/MechanicalSoup

sebcat8y ago

lxml can be hit-or-miss on HTML5 docs. I've had greater success with a modified version of gumbo-parser.

danso8y ago

Ah very cool, had seen various python libraries about HTML5, but not gumbo (or at least I had starred it).

https://github.com/google/gumbo-parser

Is the modified version you use a personal version or a well-known fork?

jackschultz8y ago· 5 in thread

I've actually written about this! General tips that I've found from doing more than a few projects [0], and then an overview of Python libraries I use [1].

[0] https://bigishdata.com/2017/05/11/general-tips-for-web-scrap...

[1] https://bigishdata.com/2017/06/06/web-scraping-with-python-p...

Bromskloss8y ago

> BeautifulSoup / lxml

When should one use one or the other, would you say?

ivansavz8y ago

You can use the BeautifulSoup API with the `lxml` parser: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#insta...

I've heard that `lxml` can choke on certain badly-formed markup, but it's very fast. Personally has never failed on me.

j_s8y ago

Use https://github.com/kovidgoyal/html5-parser, which (in my limited understanding) does a better job faster and is backwards-compatible with both.

Recommendation by the author (of Calibre fame) on a similar discussion: https://news.ycombinator.com/item?id=15539853

Dedicated discussion: https://news.ycombinator.com/item?id=14588333

jackschultz8y ago

darpa_escapee8y ago

BeautifulSoup has a friendly API, but it is slow. It has an lxml backend, however.

If you're familiar with writing XPath queries, lxml is great.

CGamesPlay8y ago· 5 in thread

AutomatedTester8y ago

CGamesPlay8y ago

hugs8y ago

triangleman8y ago

Can you explain a little more? How do you drive FF/Chrome without Selenium?

softawre8y ago

https://developers.google.com/web/updates/2017/04/headless-c...

Risse8y ago· 4 in thread

If you use PHP, Simple HTML DOM[0] is an awesome and simple scraping library.

[0] http://simplehtmldom.sourceforge.net/

ge968y ago

I also have used Simple HTML Dom

One thing I haven't worked on yet is waiting for stuff to load if that is a problem. Otherwise you try to limit hitting a site either using sleep/CRON

What's also interesting is session tokens, one site I was able to hunt down the generated token bread crumb which JS produced, but it wasn't valid. Still had to visit the site, interesting.
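
The sleep-based rate limiting mentioned above can be sketched as a small helper that enforces a minimum gap between requests. The class name, the interval value, and the injectable `clock` and `sleep` parameters are assumptions for illustration, not something from the thread.

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive requests."""
    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_request = None

    def wait(self):
        """Sleep just long enough to honor the interval, then record the time."""
        now = self.clock()
        if self.last_request is not None:
            remaining = self.min_interval - (now - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_request = now

throttle = Throttle(min_interval=0.05)
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a real scraper would fetch one page after each wait
elapsed = time.monotonic() - start
```

For jobs that run on a schedule rather than in a loop, cron handles the spacing and no in-process sleep is needed.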

SubZtep8y ago

Indeed it's very easy to use, I really like it. There is a newer version on Github: https://github.com/sunra/php-simple-html-dom-parser

wolco8y ago

If you use php laravel dusk might be another good choice.

elchief8y ago· 4 in thread

Anyone who suggests a tool that can't understand JavaScript doesn't know what they are talking about

You should be using Headless Chrome or Headless Firefox with a library that can control them in a user-friendly manner

sp0rk8y ago

xur178y ago

A lot of times you can also watch the api calls JS pages (or apps) make and retrieve nice structured json data.

I personally avoid executing js unless it's necessary, as it adds more complexity, and is noticeably more brittle.
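
A minimal sketch of that approach: call the JSON endpoint the page itself uses and read the structured data directly. The payload shape (`items`, `name`, `price`) and the function names are hypothetical; a real endpoint would be found by watching the browser's network tab.

```python
import json
from urllib.request import Request, urlopen

def fetch_json(url):
    """Request the endpoint the page itself calls and decode the JSON body."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))

def extract_items(payload):
    """Pull (name, price) pairs out of a decoded payload (hypothetical schema)."""
    return [(item["name"], item["price"]) for item in payload.get("items", [])]

# Offline demonstration with a payload shaped like the hypothetical response.
sample_payload = json.loads('{"items": [{"name": "widget", "price": 9.5}]}')
rows = extract_items(sample_payload)
```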

jordanpg8y ago

Yes, but a great many sites don't, and for those, you need Selenium + browser, full stop.

bdcravens8y ago

bootcat8y ago· 3 in thread

If you can scrape findthecompany database ? I have done it successfully !!

visarga8y ago

> This worked for me to scrape Google, before we hit the captcha.

zapperdapper8y ago

Agree 100% too.

[1] http://gigablast.com

bootcat8y ago

I absolutely agree, and I am thinking about strategies to even automate the captcha, using crowdsourcing or, better, AI/ML (which is not trivial).

duckduckgo is good but not there yet.

Would you be interested to work on a search engine ? Some projects are bitfunnel and so forth.

bantersaurus8y ago· 3 in thread

beautifulsoup

oddeyed8y ago

Also good is RoboBrowser which combines beautifulsoup with Requests to get a nice 'Browser' abstraction. It also has good built-in functionality for filling in forms.

cjsuk8y ago

Using this as well with Requests to automate eBay/gumtree/craigslist. Works very well

djaychela8y ago

Any details on this anywhere, or is it not for public consumption? I'm just getting started in Python and want to do something with Gumtree and eBay as an idea to help me in a different sphere.

marvinpinto8y ago· 2 in thread

I would recommend using Headless Chrome along with a library like puppeteer[0]. You get the advantage of using a real browser with which you run pages' javascript, load custom extensions, etc.

[0]: https://github.com/GoogleChrome/puppeteer

pteredactyl8y ago

I second this. I built using beautiful soup before and found Puppeteer much easier when interacting with the web. Especially nasty .NET sites.

elyrly8y ago

Simple and straight forward, +1

indescions_20178y ago· 2 in thread

Headless Chrome, Puppeteer, NodeJS (jsdom), and MongoDB. Fantastic stack for web data mining. Async based using promises for explicit user input flow automation.

jdc05898y ago

I had a ton of issues with JsDom historically. They could have been fixed, but Cheerio always worked out better for me.

c0nfused8y ago

I agree with headless chrome.

I have used it with a locally hosted extension to allow easy access to dom and JavaScript after load. Then dumped results to a node app. Was very happy with the results.

levi_n8y ago· 2 in thread

[1] https://pypi.python.org/pypi/explicit

bluntfang8y ago

>I'm primarily interested in scraping data that is supplied via javascript, and I find Selenium to be the most reliable way scrape that info.

levi_n8y ago

>Have you found that you aren't able to find accessible APIs to request against?

>Have you ever tried to contact the administrators to see if there's an API you could access?

Just not feasible given the scope and breadth of the scraping.

>Are you scraping data that would be against ToS if you tried to get it in a way that would benefit both you and the target web site?

I inspect and respect the robots.txt
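
A minimal sketch of a robots.txt check using the standard library's `urllib.robotparser`. The robots.txt content and the user-agent name are made up for illustration; a live crawler would call `set_url(...)` and `read()` to download the real file instead of parsing hard-coded lines.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for the demonstration.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

parser = RobotFileParser()
parser.parse(robots_lines)

agent = "example-scraper"  # hypothetical user-agent name
allowed = parser.can_fetch(agent, "https://example.com/public/page.html")
blocked = parser.can_fetch(agent, "https://example.com/private/data.html")
delay = parser.crawl_delay(agent)
```

Checking `can_fetch` before every request, and sleeping for `crawl_delay` seconds when it is set, covers the basic rules a site publishes.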

giarc8y ago· 2 in thread

adventured8y ago

https://www.opengraph.io/

iagovar8y ago

jmkni8y ago· 2 in thread

I've had a surprising amount of success with the HTML Agility Pack in .net, if you have a decent understanding of HTML it's pretty usable.

inglor8y ago

Try CsQuery, it's much nicer in terms of APIs.

dsschnau8y ago

same. I'm a .NET person and i do web scraping stuff on the side, HTML Agility Pack has been easy to pick up.

Doctor_Fegg8y ago· 2 in thread

If you speak Ruby, mechanize is good: https://github.com/sparklemotion/mechanize

DrSayre8y ago

I generally use mechanize when I need to scrape something from the web. I found this awhile back and it's helped me https://www.chrismytton.uk/2015/01/19/web-scraping-with-ruby...

faitswulff8y ago

dsacco8y ago· 2 in thread

I've done this professionally in an infrastructure processing several terabytes per day. A robust, scalable scraping system comprises several distinct parts:

5. A scheduling system, assuming your data is updated in batches. Cron is fine.

kbenson8y ago

> 3. A RDBMS, with databases for both the raw and normalized data

After normalization, the database is ideal though.

tomc19858y ago

A decade ago I worked for a company that also scraped data at this scale and your advice is spot-on!

austincheney8y ago· 2 in thread

This is perhaps the fastest way to screenscrape a dynamically executed website.

1. First go get and run this code, which allows immediate gathering of all text nodes from the DOM: https://github.com/prettydiff/getNodesByType/blob/master/get...

2. Extract the text content from the text nodes and ignore nodes that contain only white space:

Test this out in your browser console.

AznHisoka8y ago

And how do you do #1? Node, I presume?

austincheney8y ago

No, manually go there and copy/paste the code. Then when building your scraper bot use that code.

jppope8y ago· 2 in thread

scrapy is fine but selenium, phantom, etc are all outdated IMO

blowski8y ago

> are all outdated IMO

For what reason? Genuine question.

CGamesPlay8y ago

Phantom is woefully out of date, you need a polyfill even for Function.bind. Firefox dropped support for Selenium in 47, and chromedriver only supports it with a wrapper called chromedriver.

riekus8y ago· 2 in thread

selllikesybok8y ago

I find the desktop tool by import.io a little challenging to work with. Their toy web-demo is solid for simple table extraction, though.

wtfdaemon8y ago

It's gotten light-years better since the desktop tool existed.

They've completely deprecated/sun-setted the desktop tool in favor of a greatly improved web application.

OzzyB8y ago· 2 in thread

A good host xD

Preferably one that doesn't mind giving you a bunch of IPs, and if they do, don't charge a fortune for them.

Then you can worry about what software you're gonna use.

eccfcco158y ago

Which hosts have you used, or would you recommend?

OzzyB8y ago

OVH

You can get up to 256 IPs per server and _not_ pay monthly fees -- just a $3 upfront setup charge.

You're welcome xD

frausto8y ago· 2 in thread

jakubbalada8y ago

You can use services like Anti-captcha [1]

We have a public API on Apify for that [2]

[1] https://anti-captcha.com/mainpage

[2] https://www.apify.com/petr_cermak/anti-captcha-recaptcha

levi_n8y ago

The accepted answer on this Stack Overflow question[1] might help. tl;dr is to build your own chromedriver, but with renamed variables.

[1] https://stackoverflow.com/a/41220267/4079962

phsource8y ago· 1 in thread

nn7578y ago

Isn't cheerio only for static content?

mping8y ago· 1 in thread

I use nightmarejs https://github.com/segmentio/nightmare which is based on electron; I recommend it if you're on js

Cyph0n8y ago

That looks like a pretty interesting scraping library.

baldfat8y ago· 1 in thread

I use R, since that is the language I mostly use, with httr and rvest. Edit: I originally forgot to mention rvest (thanks for the comments); you use the two together.

https://cran.r-project.org/web/packages/httr/vignettes/quick...

amrrs8y ago

Rvest is also another nice option in R.

ravenstine8y ago· 1 in thread

It depends on what you're trying to do.

imjasonmiller8y ago

I have been playing around with Cheerio for a short while and it is quite cool! Although extracting comments wasn't as straightforward as I thought it would be.

mrskitch8y ago· 1 in thread

I’d recommend puppeteer or some other Chrome driver. It’s fast and resilient even on single page apps.

If you’re looking to run it on a Linux machine also take a look at https://browserless.io (full disclosure I’m the creator of that site).

mrskitch8y ago

I should note that this doesn't lock you into any particular lib, just solves the problem of running on Chrome in a service like fashion.

hmottestad8y ago· 1 in thread

If you know Java, then my go to library is Jsoup https://jsoup.org/

It lets you use jQuery-like selectors to extract data.

Like this: Elements newsHeadlines = doc.select("#mp-itn b a");

jasondc8y ago

+1 Saves a ton of time, and very simple to use

cdolan8y ago· 1 in thread

Outwit Hub, specifically the advanced or enterprise levels.

It has a GUI on it that is not designed very well, and documentation that is complete, but hard to search...

But it can do just about any type of scrape, including getting started from a command line script

selllikesybok8y ago

mfontani8y ago· 1 in thread

Whatever you end up using for scraping, I beg you to pick a unique user-agent which allows a webmaster to understand which crawler it is, to better allow it to pass through (or be banned, depending).

danso8y ago

Note that for some libraries, the agent is set to empty or whatever the default is for the tool (e.g. `curl/7.43.0` for curl). It's always worth setting it to something.

`Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36`

Maybe for old government sites that break on anything but IE, you'll have to pretend to be IE, but that's very rare.
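
A minimal sketch of setting an identifying user-agent with the standard library. The agent string, its version, and the contact URL are hypothetical placeholders; the point is only that the header names the crawler and gives the webmaster a way to reach its operator.

```python
from urllib.request import Request

# Hypothetical identifier: a name, a version, and a contact URL.
USER_AGENT = "example-scraper/1.0 (+https://example.com/bot-info)"

def build_request(url):
    """Create a request that carries the custom User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})

request = build_request("https://example.com/")
```

Passing `request` to `urllib.request.urlopen` would then send the custom header instead of the library default.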

256cats8y ago· 1 in thread

I use Node and either puppeteer[0] or plain Curl[1]. IMO Curl is years ahead of any Node.js request lib. For proxies I use (shameless plug!) https://gimmeproxy.com .

[0] https://github.com/GoogleChrome/puppeteer

[1] https://github.com/JCMais/node-libcurl

sagivo8y ago

Really nice concept.

traviswingo8y ago· 1 in thread

ajcodez8y ago

I assume this puppeteer:

- https://github.com/GoogleChrome/puppeteer

mateuszf8y ago· 1 in thread

`clj-http`, `enlive`, `cheshire` in case of `clojure` worked fine for me

tuddman8y ago

and 'hickory' [https://github.com/davidsantiago/hickory] to work with the site data however you want.

kzisme8y ago· 1 in thread

tmuir8y ago

- Which squares have historically hit the most often in Superbowl Squares (http://www.picks.org/nfl/super-bowl-squares)

- Collect cost of living indices in a list of cities

21stio8y ago· 1 in thread

golang

deathemperor8y ago

beernutz8y ago

The absolute best tool i have found for scraping is Visual Web Ripper.

I use it with one of my clients professionally. Their support is VERY good btw.

http://visualwebripper.com/

hydragit8y ago

[0] http://weboob.org/

[1] http://dev.weboob.org/guides/module.html#parsing-of-pages

zapperdapper8y ago

I have also used custom written Python crawlers in a lot of cases.

1. crawl - download pages to file system

2. clean then parse (extract data)

3. ingest extracted data into database

4. query - run adhoc queries on database

Of course, this all depends really on what you are trying to do!
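
A minimal sketch of the ingest and query steps from the pipeline above, using an in-memory SQLite database. The table layout and the sample records are assumptions for illustration; the records stand in for whatever the crawl and parse steps would produce.

```python
import sqlite3

# Stand-in for the output of steps 1 and 2: records already extracted from pages.
records = [
    {"url": "https://example.com/a", "title": "Alpha", "price": 10.0},
    {"url": "https://example.com/b", "title": "Beta", "price": 25.0},
    {"url": "https://example.com/c", "title": "Gamma", "price": 7.5},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (url TEXT PRIMARY KEY, title TEXT, price REAL)")

# Step 3: ingest. INSERT OR REPLACE keeps re-crawls from duplicating rows.
conn.executemany(
    "INSERT OR REPLACE INTO items (url, title, price) VALUES (:url, :title, :price)",
    records,
)
conn.commit()

# Step 4: an ad hoc query over the ingested data.
cheap_titles = [row[0] for row in conn.execute(
    "SELECT title FROM items WHERE price < ? ORDER BY price", (20,))]
```

Keying the table on the URL means a page crawled twice updates its row rather than adding a second one.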

deathemperor8y ago

So I decided to use scrapy, the core of scrapinghub.com.

For DOM query I use BeautifulSoup4. I love it. It's jQuery for python.

I am planning to use Amazon RDS to store scraped data.

cholmon8y ago

I recently stumbled across http://go-colly.org/, that looks well thought out and simple to use. It seems like a slimmed down Go version of Scrapy.

khuknows8y ago

Shameless plug - I build this tiny API for scraping and it works a treat for my uses: https://jsonify.link/

A few similar tools also exist, like https://page.rest/.

polote8y ago

I maintain about 8 crawlers and I use only vanilla Python

I have a function to help me search :

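   def find_r(value, ind, array, stop_word):
       # Walk through the markers in order, starting the search at index `ind`.
       indice = ind
       for i in array:
           # Move one character past the start of the next marker.
           indice = value.find(i, indice) + 1
       # The match runs from there up to the next occurrence of `stop_word`.
       end = value.find(stop_word, indice)
       return value[indice:end], end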

You can use it like that :

   resulting_text , end_index = find_r(string, start_index, ["<td", ">"], "</td")

To find text it is quite fast and you don't need to master a framework
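
A short worked example of the helper above, repeated here so the snippet runs on its own. The sample HTML string is made up for illustration. Each call returns the cell text plus the index to resume from, so the second call picks up where the first one stopped.

```python
def find_r(value, ind, array, stop_word):
    # Same logic as the helper above: step past each marker, then cut at stop_word.
    indice = ind
    for i in array:
        indice = value.find(i, indice) + 1
    end = value.find(stop_word, indice)
    return value[indice:end], end

html = "<tr><td class='a'>Alice</td><td>Bob</td></tr>"

# First cell: search from the start of the string.
first, end_index = find_r(html, 0, ["<td", ">"], "</td")
# Second cell: resume from where the previous match ended.
second, end_index = find_r(html, end_index, ["<td", ">"], "</td")
```

One caveat worth knowing: `str.find` returns -1 when a marker is missing, so the helper does not signal "not found" and a caller looping over cells needs its own stop condition.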

jacinda8y ago

If you're specifically looking at news articles, go for the Python library Newspaper: http://newspaper.readthedocs.io/en/latest/

Auto-detection of languages, and will automatically give you things like the following:

    >>> article.parse()
    >>> article.authors
    [u'Leigh Ann Caldwell', 'John Honway']
    >>> article.text
    u'Washington (CNN) -- Not everyone subscribes to a New Year's resolution...'
    >>> article.top_image
    u'http://someCDN.com/blah/blah/blah/file.png'
    >>> article.movies
    [u'http://youtube.com/path/to/link.com', ...]

mmmnt8y ago

For very simple tasks Listly seems to be a fast and good solution: http://www.listly.io/

If you need more power, I heard good stuff about http://80legs.com/ though never tried them myself.

btb8y ago

kanishkalinux8y ago

For mostly static pages, requests/pycurl + beautifulsoup is more than sufficient. For advanced scraping, take a look at scrapy.

etatoby8y ago

If you need to scrape content from complex JS apps (eg. React) where it doesn't pay to reverse engineer their backend API (or worse, it's encrypted/obfuscated) you may want to look at CasperJS.

It's a very easy to use frontend to PhantomJS. You can code your interactions in JS or CoffeeScript and scrape virtually anything with a few lines of code.

If you need crawling, just pair a CasperJS script with any spider library like the ones mentioned around here.

theden8y ago

I've had good success with scrapy (https://scrapy.org/) for my personal projects

Jeaye8y ago

I've written a bit on web scraping with Clojure and Enlive here: https://blog.jeaye.com/2017/02/28/clojure-apartments/

That's what I'd use, if I had to scrape again (no JS support).

vrathee8y ago

If you are looking for SaaS or managed services, Try https://www.agenty.com/

Or advance mode like REGEX, XPATH etc.

doominasuit8y ago

jpetersonmn8y ago

jancurn8y ago

Recently the platform added support for headless Chrome and Puppeteer, you can even run jobs written in Scrapy or any other library as long as it can be packaged as Docker container.

Disclaimer: I'm a co-founder of Apify

servitor8y ago

I agree with others, with curl and the likes you will hit insurmountable roadblocks sooner or later. It's better to go full headless browser from the start.

[0] https://www.guru99.com/page-object-model-pom-page-factory-in...

sl0wik8y ago

I had great experience with www.apify.com.

Softcadbury8y ago

With node, you can use cheerio [0]. It allows you to parse html pages with a jQuery-like syntax. I use it in production on my project [1]

[0] https://github.com/cheeriojs/cheerio [1] https://github.com/Softcadbury/football-peek/blob/master/ser...

colinchartier8y ago

It was so hard that we made our own company JUST to scrape stuff easily without requiring programming. Take a look at https://www.parsehub.com

mitchtbaum8y ago

bbayer8y ago

http://docs.pyspider.org/en/latest/

mrkeen8y ago

I made a crawler https://github.com/jahaynes/crawler

It outputs to the warc file format (https://en.wikipedia.org/wiki/Web_ARChive), in case your workflow is to gather web pages and then process them afterwards.

ngneer8y ago

dineshr938y ago

If you know java then jsoup will be very handy. [1] https://jsoup.org/

charlus8y ago

For a little diversity on tools, if you're looking for something quick that others can access the data easily - Google Apps script in a Google Sheet can be quite useful.

https://sites.google.com/site/scriptsexamples/learn-by-examp...

buildops8y ago

Why are you looking to scrape? Here's a list of some scraper bots: https://www.incapsula.com/blog/web-scraping-bots.html

What about Botscraper: http://www.botscraper.com/

wiradikusuma8y ago

I tinkered with Apache Nutch (http://nutch.apache.org/), but I found it overkill. In the end, since I use Scala, I use https://github.com/ruippeixotog/scala-scraper

laktek8y ago

One of the challenges with modern day scraping is you need to account for client-side JS rendering.

If you prefer an API as a service that can pre-render pages, I built Page.REST (https://www.page.rest). It allows you to get rendered page content via CSS selectors as a JSON response.

blueadept1118y ago

Jaunt [http://jaunt-api.com] is a good java tool.

0xdeadbeefbabe8y ago

The best tool for web scraping, for me, is something easy to deploy and redeploy; and something that doesn't rely on three working programs--eliminating selenium sounds great.

For those reasons I like https://github.com/knq/chromedp

ksahin8y ago

I wrote a blog post about Java web scraping here: https://ksah.in/introduction-to-web-scraping-with-java/

As others said, phantomJS (and now headless Chrome) are good tools to deal with heavy js websites

teremin8y ago

I use Colly[0][1] which is a young but decent scraping framework for Golang.

[0] http://go-colly.org/ [1] https://github.com/gocolly/colly

tmaly8y ago

I just tried puppeteer yesterday for the first time. It seems to work very well. My only complaint is that it is very new and does not have a plethora of examples.

I previously have used WWW::Mechanize in the Perl world, but single page applications with Javascript really require something with a browser engine.

RandomBookmarks8y ago

The "best tool" is different for web developers and non-coders. If you are a non-technical person that just needs some data there is:

(1) hosted services like mozenda

(2) visual automation tools like Kantu Web Automation (which includes OCR)

(3) and last but not least outsourcing the scraping on sites like Freelancer.com

thallian8y ago

I used CasperJS[0] in the past to scrape a javascript heavy forum (ProBoards) and it worked well. But that was a few years ago, I have no idea what new strategies came up in the meantime.

[0] http://casperjs.org/

tn_8y ago

Check out Heritrix if you're looking for an open-source webscraping archival tool: https://webarchive.jira.com/wiki/spaces/Heritrix

brycematheson8y ago

Shameless plug. I wrote a blog post on how I use Powershell to scrape sites: http://brycematheson.io/webscraping-with-powershell/

jschuur8y ago

If you want to extract content and specific meta data, you might find the Mercury Web Parser useful:

https://mercury.postlight.com/web-parser/

Karupan8y ago

I've had some success using portia[1]. It's a visual wrapper over scrapy, but is actually quite useful.

[1] https://github.com/scrapinghub/portia

askz8y ago

A friend released a little tool to only scrape html from websites, with tor and proxy chaining

https://github.com/AlexMili/Scraptory

freeslugs8y ago

thegrif8y ago

A ton of people recommended Scrapy - and I am always looking for senior Scrapy resources that have experience scraping at scale. Please feel free to reach out - contact info is in my profile.

sananth128y ago

If you are looking for image scraping: https://github.com/sananth12/ImageScraper

pudo8y ago

We looked at scrapy, but it just seemed like the wrong type of framing for the type of scrapers we build: requests, some html/xml parser, and output into a service API or a SQL store.

Maybe some people will enjoy it.

kbd8y ago

For simple tasks, curl into pup is very convenient.

https://github.com/ericchiang/pup

kopos8y ago

Scrapy [https://github.com/scrapy/scrapy] works really well.

vinitagr8y ago

https://github.com/matthewmueller/x-ray

Lxr8y ago

Python requests + lxml, with Selenium as a last resort.

fazkan8y ago

scrapy and BS4, for serious stuff. Selenium, for automating logging and other UI related stuff, you can even play games with it.

kazinator8y ago

TXR: http://www.nongnu.org/txr

crispytx8y ago

I did a little web scraping project a few years ago using:

* cURL

* regex
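
A minimal sketch of the regex half of that approach, assuming the page was already downloaded (for example with cURL) and is available as a string. The pattern and the sample markup are illustrative only; regular expressions are brittle against real-world HTML and suit small, well-understood pages best.

```python
import re

# Matches href="..." or href='...' inside anchor tags; brittle by design.
HREF_PATTERN = re.compile(r"""<a\s[^>]*?href=["']([^"']+)["']""", re.IGNORECASE)

def extract_links(html):
    """Return every href value found in anchor tags, in document order."""
    return HREF_PATTERN.findall(html)

sample = '<p><a href="/one">One</a> and <A class="x" HREF=\'/two\'>Two</A></p>'
links = extract_links(sample)
```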

thejosh8y ago

If you are scraping specific pages on a site, curl. Then transform that into the language you use.

cm20128y ago

For non developers dexi.io is great.

novaleaf8y ago

i wrote a tool: PhantomJsCloud.com

it's getting a little long in the tooth, but I will be updating it soon to use a Chrome based renderer. If you have any suggestions, you can leave it here or PM me :)

aaronhoffman8y ago

This tool takes a list of URIs and crawls each site for contact info. Phone, email, twitter, etc

https://github.com/aaronhoffman/WebsiteContactHarvester

jpepinho8y ago

WebDriver.io using Selenium and PhantomJS would be a good way to go!

greyfox8y ago

I did a quick search and didn't see this listed here:

https://www.httrack.com/

etattva8y ago

Scrapy and Jsoup are the best combination

tomc19858y ago

Perl or Ruby and Regular Expressions

herbst8y ago

Nokogiri

vsupalov8y ago

[1] https://scrapy.org/

[2] https://www.crummy.com/software/BeautifulSoup/bs4/doc/

[3] http://sangaline.com/post/advanced-web-scraping-tutorial/

[4] https://franciskim.co/dont-need-no-stinking-api-web-scraping...

[5] https://developers.google.com/web/updates/2017/04/headless-c...

[6] https://github.com/GoogleChrome/puppeteer

[7] https://blog.phantombuster.com/web-scraping-in-2017-headless...

[8] https://vsupalov.com/headless-chrome-puppeteer-docker/

[9] https://github.com/vsupalov/docker-puppeteer-dev

pwaai8y ago

hey I'm working on this thing called BAML (browser automation markup language) and it looks something like this:

    OPEN http://asdf.com
    CRAWL a
    EXTRACT {'title': '.title'}

It's meant to be super simple and built from ground up to support crawling Single Page Applications.

dor_jack8y ago

If you need to perform a web-scale crawl I strongly recommend https://www.mixnode.com.
