Scrapy can also pause and resume crawls [1], run crawlers in a distributed fashion [2], etc. It is my go-to option.
[0] https://blog.scrapinghub.com/2015/03/02/handling-javascript-...
So the learning curve for simple things makes me jump to bash scripts; scrapy might prove more valuable when your project starts to scale.
But also of course: normally the best tool is the one you already know!
[0] : http://python-rq.org/docs/
[1] : http://gearman.org/
All the data we collect (zip files, PDF, HTML, XML, JSON) is stored as-is (/path/to/<dataset name>/<unique key>/<timestamp>) and processed later using a Spark pipeline. lxml.html is WAY faster than BeautifulSoup and less prone to exceptions.
We have cron jobs (cron + Jenkins) that trigger dataset updates and discovery. For example, we scrape a corporate registry, so every day we refresh the 20k companies with the oldest stored versions. We also implement "discovery" logic in all of our crawlers so they can find new data (e.g. newly registered companies). We use Redis to send tasks (update / discovery) to our crawlers.
Some kind of queue implemented with Redis? How does it work?
For every website we crawl we implement a custom discovery/update logic.
Discovery can mean, for example, crawling a specific date range, sequence number, or postal code. We usually seed discovery from the data we already have, such as highest_company_number + 1000, so we pick up newly registered companies.
An update refreshes a single document, for example crawling the document for company number 1234. We generate a Request [2] to crawl only that document.
[1] https://doc.scrapy.org/en/latest/topics/signals.html
[2] https://doc.scrapy.org/en/latest/topics/request-response.htm...
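The discovery/update split described above could be sketched roughly as follows. This is only an illustration: the task fields (`type`, `company_number`) and both function names are hypothetical, and the step that would push these strings onto a Redis list is left out.

```python
import json


def discovery_tasks(highest_known, lookahead=1000):
    """Build one JSON task per company number that has not been seen yet.

    Mirrors the "highest_company_number + 1000" seeding idea; the task
    shape is an assumption, not the original poster's format.
    """
    return [
        json.dumps({"type": "discovery", "company_number": number})
        for number in range(highest_known + 1, highest_known + lookahead + 1)
    ]


def update_task(company_number):
    """Build a task that re-crawls a single, already known document."""
    return json.dumps({"type": "update", "company_number": company_number})
```

A crawler worker would pop one task at a time and turn it into a single request for that company number.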
Scrapy is a whole framework that may be worthwhile, but if I were just starting out for a specific task, I would use:
- requests http://docs.python-requests.org/en/master/
- lxml http://lxml.de/
- cssselect https://cssselect.readthedocs.io/en/latest/
Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize. But using the web developer tools you can usually figure out the requests made by the browser and then use the Session object in the Requests library to deal with stateful requests:
http://docs.python-requests.org/en/master/user/advanced/
I usually just download pages/data/files as raw files and worry about parsing/collating them later. I try to focus on the HTTP mechanics and, if needed, the HTML parsing, before worrying about data extraction.
You could also use the WebOOB (http://weboob.org) framework. It's built on requests+lxml and it provides a Browser class usable like mechanize's one (ability to access doc, select HTML forms, etc.).
It also has nice companion features like associating url patterns to some custom Page classes where you can write what data to retrieve when a page with this url pattern is browsed.
It's pretty much always a great idea to completely separate the parts that perform the HTTP fetches and the part that figures out what those payloads mean.
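As a minimal sketch of that separation, the parsing half can be a pure function over bytes that were fetched and saved earlier, so it never touches the network. The example uses only the standard library's `html.parser`; `TitleParser` and `parse_title` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_title(raw, encoding="utf-8"):
    """Parse a previously fetched payload; no HTTP happens here."""
    parser = TitleParser()
    parser.feed(raw.decode(encoding, errors="replace"))
    return parser.title.strip()
```

Because the parser only sees stored bytes, it can be re-run over old downloads whenever the extraction rules change.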
Did the version of Mechanize written in Py2 stop being supported?
I've also seen these alternatives:
- https://robobrowser.readthedocs.io/en/latest/
- https://github.com/MechanicalSoup/MechanicalSoup
MechanicalSoup seems well maintained, but the last time I tried these libraries they were buggy (and/or I was ignorant), and I just couldn't get things to work the way I was used to with Ruby and Mechanize.
https://github.com/google/gumbo-parser
Is the modified version you use a personal version or a well-known fork?
If you don't want to click on the links, requests and BeautifulSoup / lxml is all you need 90% of the time. Throw gevent in there and you can get a lot of scraping done in less time than you'd think it would take.
And as long as we're talking about web scraping, I'm a huge fan of it. There's so much data out there that's not easily accessible and needs to be cleaned and organized. When running a learning algorithm, for example, a very hard part that isn't talked about a lot is getting the data before throwing it in a learning function or library. Of course, there's the legal side of it if companies are not happy with people being able to scrape, but that's a different topic.
I'll keep going. The best way to learn which tools are best is to do a project on your own and test them all out. Then you'll know what suits you. That's absolutely the best way to learn something about programming -- doing it instead of reading about it.
[0] https://bigishdata.com/2017/05/11/general-tips-for-web-scrap...
[1] https://bigishdata.com/2017/06/06/web-scraping-with-python-p...
When should one use one or the other, would you say?
I've heard that `lxml` can choke on certain badly-formed markup, but it's very fast. Personally, it has never failed on me.
Recommendation by the author (of Calibre fame) on a similar discussion: https://news.ycombinator.com/item?id=15539853
Dedicated discussion: https://news.ycombinator.com/item?id=14588333
If you're familiar with writing XPath queries, lxml is great.
Selenium IDE no longer works in Firefox for a number of reasons: 1) Selenium IDE didn't have a maintainer, and 2) Selenium IDE is a Firefox add-on and Mozilla changed how add-ons work. They did this for numerous security reasons.
One thing I haven't worked on yet is waiting for stuff to load, if that is a problem. Otherwise, you try to limit how often you hit a site, using either sleep or cron.
Session tokens are also interesting: on one site I was able to hunt down the breadcrumb trail of the token that the JS generated, but it wasn't valid. I still had to visit the site, interesting.
You should be using Headless Chrome or Headless Firefox with a library that can control them in a user-friendly manner
I personally avoid executing js unless it's necessary, as it adds more complexity, and is noticeably more brittle.
If you can scrape findthecompany database ? I have done it successfully !!
If Google wanted to give back something to the community, it would offer cheap automated searches (current prices are absurd). Another thing - more depth after the first 1000 results. Sometimes you want to know the next result. We shouldn't need to do all these stupid things to batch query a search engine, it should be open. That makes it all the more important to invent an open-source, federated search engine, so we can query to our heart's content (and have privacy).
As for 'federated search engine' - it's not 'federated' per se but check out Gigablast search engine. Open source (source on GitHub) and a TOTALLY AWESOME piece of software written by one guy. You can do good searches at the Gigablast site[1], or set up your own search engine. Gigablast also offers an API (I may be wrong but I think DuckDuckGo uses that API for some tasks).
duckduckgo is good but not there yet.
Would you be interested in working on a search engine? BitFunnel is one such project.
I have used it with a locally hosted extension to allow easy access to dom and JavaScript after load. Then dumped results to a node app. Was very happy with the results.
I use explicit waits exclusively (no direct calls like `driver.find_foo_by_bar`), and find it vastly improves selenium reliability. (Shameless plug) I have a python package, Explicit[1], that makes it easier to use explicit waits.
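The idea behind an explicit wait can be sketched without any browser at all: poll a condition until it holds or a deadline passes, instead of sleeping for a fixed time. The `wait_until` helper below is a hypothetical stand-in written for illustration; it is not Selenium's API and not the Explicit package's API. In Selenium itself, `WebDriverWait(driver, timeout).until(condition)` plays this role.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have elapsed without the condition becoming true.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

The caller passes in whatever "is the element there yet?" check suits the page, which keeps timing logic out of the scraping code.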
Have you found that you aren't able to find accessible APIs to request against? Have you ever tried to contact the administrators to see if there's an API you could access? Are you scraping data that would be against ToS if you tried to get it in a way that would benefit both you and the target web site?
I'm scraping from variety of different websites (1000+) that my org doesn't own. Reconfiguring to hit APIs would be complex, and a maintenance problem, both of which I easily avoid by using selenium to drive an actual browser, at the expense of time.
>Have you ever tried to contact the administrators to see if there's an API you could access?
Just not feasible given the scope and breadth of the scraping.
>Are you scraping data that would be against ToS if you tried to get it in a way that would benefit both you and the target web site?
I inspect and respect the robots.txt
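For anyone wanting to do the same check in code, here is a minimal sketch using the standard library's `urllib.robotparser`. The `allowed` helper and the user agent string are assumptions for illustration; the robots.txt text is passed in directly so the example does not need network access.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In a real crawler the robots.txt body would be downloaded once per host and cached rather than re-parsed for every URL.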
1. A crawler, for retrieving resources over HTTP, HTTPS and sometimes other protocols a bit higher or lower on the network stack. This handles data ingestion. It will need to be sophisticated these days - sometimes you'll need to emulate a browser environment, sometimes you'll need to perform a JavaScript proof of work, and sometimes you can just do regular curl commands the old fashioned way.
2. A parser, for correctly extracting specific data from JSON, PDF, HTML, JS, XML (and other) formatted resources. This handles data processing. Naturally you'll want to parse JSON wherever you can, because parsing HTML and JS is a pain. But sometimes you'll need to parse images, or outdated protocols like SOAP.
3. A RDBMS, with databases for both the raw and normalized data, and columns that provide some sort of versioning to the data in a particular point in time. This is quite important, because if you collect the raw data and store it, you can re-parse it in perpetuity instead of needing to retrieve it again. This will happen somewhat frequently if you come across new data while scraping that you didn't realize you'd need or could use. Furthermore, if you're updating the data on a regular cadence, you'll need to maintain some sort of "retrieved_at", "updated_at" awareness in your normalized database. MySQL or PostgreSQL are both fine.
4. A server and event management system, like Redis. This is how you'll allocate scraping jobs across available workers and handle outgoing queuing for resources. You want a centralized terminal for viewing and managing a) the number of outstanding jobs and their resource allocations, b) the ongoing progress of each queue, c) problems or blockers for each queue.
5. A scheduling system, assuming your data is updated in batches. Cron is fine.
6. Reverse engineering tools, so you can find mobile APIs and scrape from them instead of using web targets. This is important because mobile API endpoints a) change far less frequently than web endpoints, and b) are far more likely to be JSON formatted, instead of HTML or JS, because the user interface code is offloaded to the mobile client (iOS or Android app). The mobile APIs will be private, so you'll typically have to reverse engineer the HMAC request signing algorithm, but that is virtually always trivial, with the exception of companies that really put effort into obfuscating the code. apktool, jadx and dex2jar are typically sufficient for this if you're working with an Android device.
7. A proxy infrastructure, so that you're not constantly pinging a website from the same IP address. Even if you're being fairly innocuous with your scraping, you probably want this, because many websites have been burned by excessive spam and will conscientiously and automatically ban any IP address that issues something nominally more than a regular user, regardless of volume. Your proxies come in several flavors: datacenter, residential and private. Datacenter proxies are the first to be banned, but they're cheapest. These are proxies resold from datacenter IP ranges. Residential IP addresses are IP addresses that are not associated with spam activity and which come from ISP IP ranges, like Verizon Fios. Private IP addresses are IP addresses that have not been used for spam activity before and which are reserved for use by only your account. Naturally this is in order from lower to greater expense; it's also in order from most likely to least likely to be banned by a scraping target. NinjaProxies, StormProxies, Microleaf, etc are all good options. Avoid Luminati, which offers residential IP addresses contributed by users who don't realize their IP addresses are being leased through the use of Hola VPN.
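Item 3 above (raw and normalized storage with "retrieved_at" / "updated_at" awareness) might look roughly like this. It is a sketch under stated assumptions: SQLite stands in for MySQL or PostgreSQL so the example is self-contained, and the table names, column names and helper functions are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one table of untouched responses, one of parsed rows.
SCHEMA = """
CREATE TABLE raw_responses (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    body BLOB NOT NULL,
    retrieved_at TEXT NOT NULL
);
CREATE TABLE companies (
    company_number TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    raw_id INTEGER NOT NULL REFERENCES raw_responses(id),
    updated_at TEXT NOT NULL
);
"""


def _now():
    return datetime.now(timezone.utc).isoformat()


def store_raw(conn, url, body):
    """Keep the response exactly as received so it can be re-parsed later."""
    cursor = conn.execute(
        "INSERT INTO raw_responses (url, body, retrieved_at) VALUES (?, ?, ?)",
        (url, body, _now()),
    )
    return cursor.lastrowid


def upsert_company(conn, company_number, name, raw_id):
    """Write the normalized row, pointing back at the raw row it came from."""
    conn.execute(
        "INSERT OR REPLACE INTO companies "
        "(company_number, name, raw_id, updated_at) VALUES (?, ?, ?, ?)",
        (company_number, name, raw_id, _now()),
    )
```

Keeping `raw_id` on the normalized row makes it possible to trace any value back to the exact response it was parsed from.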
Each website you intend to scrape is given a queue. Each queue is assigned a specific allotment of workers for processing scraping jobs in that queue. You'll write a bunch of crawling, parsing and database querying code in an "engine" class to manage the bulk of the work. Each scraping target will then have its own file which inherits functionality from the core class, with the specific crawling and parsing requirements in that file. For example, implementations of the POST requests, user agent requirements, which type of parsing code needs to be called, which database to write to and read from, which proxies should be used, asynchronous and concurrency settings, etc should all be in here.
Once triggered in a job, the individual scraping functions will call to the core functionality, which will build the requests and hand them off to one of a few possible functions. If your code is scraping a target that has sophisticated requirements, like a JavaScript proof of work system or browser emulation, it will be handed off to functionality that implements those requirements. Most of the time, this won't be needed and you can just make your requests look as human as possible - then it will be handed off to what is basically a curl script.
Each request to the endpoint is a job, and the queue will manage them as such: the request is first sent to the appropriate proxy vendor via the proxy's API, then the response is sent back through the proxy. The raw response data is stored in the raw database, then normalized data is processed out of the raw data and inserted into the normalized database, with corresponding timestamps. Then a new job is sent to a free worker. Updates to the normalized data will be handled by something like cron, where each queue is triggered at a specific time on a specific cadence.
You'll want to optimize your workflow to use endpoints which change infrequently and which use lighter resources. If you are sending millions of requests, loading the same boilerplate HTML or JS data is a waste. JSON resources are preferable, which is why you should invest some amount of time before choosing your endpoint into seeing if you can identify a usable mobile endpoint. For the most part, your custom code is going to be in middleware and the parsing particularities of each target; BeautifulSoup, QueryPath, Headless Chrome and JSDOM will take you 80% of the way in terms of pure functionality.
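The "engine" class with one file per scraping target, as described above, could be sketched like this. Everything here is illustrative: the class names, the injected `fetch` callable, and the in-memory lists standing in for the raw and normalized databases are assumptions, and queueing, proxies and retries are left out.

```python
import json


class Engine:
    """Shared fetch -> store raw -> parse -> store normalized flow."""

    user_agent = "example-crawler/0.1"

    def __init__(self, fetch):
        self.fetch = fetch            # callable(url, headers) -> bytes
        self.raw_store = []           # stand-in for the raw database
        self.normalized_store = []    # stand-in for the normalized database

    def build_headers(self):
        return {"User-Agent": self.user_agent}

    def parse(self, raw):
        """Target-specific; each scraping target overrides this."""
        raise NotImplementedError

    def run_job(self, url):
        raw = self.fetch(url, self.build_headers())
        self.raw_store.append((url, raw))
        record = self.parse(raw)
        self.normalized_store.append(record)
        return record


class ExampleRegistry(Engine):
    """Hypothetical target whose endpoint returns JSON."""

    user_agent = "example-crawler/0.1 (registry)"

    def parse(self, raw):
        data = json.loads(raw)
        return {"company_number": data["id"], "name": data["name"].strip()}
```

Passing `fetch` in from outside means the same target class can be run against a plain HTTP client, a proxy-aware client, or a browser-backed one without changes.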
I've found the filesystem (local or network, depending on scale) works well for the raw data. A normalized file name with a timestamp and job identifier in a hashed directory structure of some sort (I generally use $jobtype/%Y-%m-%d/%H/ as a start) works well, and reading and writing gzip is trivial (and often you can just output the raw content of gzip encoded payloads). The filesystem is an often overlooked database. If you end up needing more transactional support, or to easily identify what's been processed or not, look at how Maildir works.
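A minimal sketch of that layout, assuming gzip-compressed files under `$jobtype/%Y-%m-%d/%H/`. The file name format (timestamp plus job identifier) and the function names are hypothetical choices for illustration.

```python
import gzip
from datetime import datetime, timezone
from pathlib import Path


def raw_path(root, job_type, job_id, when):
    """Return root/<jobtype>/<YYYY-MM-DD>/<HH>/<timestamp>_<job_id>.gz."""
    stamp = when.strftime("%Y%m%dT%H%M%SZ")
    return (
        Path(root)
        / job_type
        / when.strftime("%Y-%m-%d")
        / when.strftime("%H")
        / f"{stamp}_{job_id}.gz"
    )


def write_raw(path, payload):
    """Create the directories as needed and store the payload gzipped."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wb") as handle:
        handle.write(payload)


def read_raw(path):
    with gzip.open(path, "rb") as handle:
        return handle.read()
```

Callers would pass a UTC `datetime` for `when`, so the directory names and the `Z` suffix stay consistent across machines.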
After normalization, the database is ideal though.
That said, I was doing a few gigabytes a day, not a few terabytes, so you might have run into some scale issues I didn't. I was able to keep it to mostly one box for crawling and parsing, but crawlers ended up being complex and job-queue driven enough that expanding to multiple systems wouldn't have been all that much extra work (an assessment I feel confident in, having done similar things before).
1. First go get and run this code, which allows immediate gathering of all text nodes from the DOM: https://github.com/prettydiff/getNodesByType/blob/master/get...
2. Extract the text content from the text nodes and ignore nodes that contain only white space:
let text = document.getNodesByType(3),
    a = 0,
    b = text.length,
    output = [];
do {
    if ((/^(\s+)$/).test(text[a].textContent) === false) {
        output.push(text[a].textContent);
    }
    a = a + 1;
} while (a < b);
output;
That will gather ALL text from the page. Since you are working from the DOM directly you can filter your results by various contextual and stylistic factors. Since this code is small and executes stupid fast it can be executed by bots easily.
Test this out in your browser console.
scrapy is fine but selenium, phantom, etc. are all outdated IMO
For what reason? Genuine question.
They've completely deprecated/sun-setted the desktop tool in favor of a greatly improved web application.
Preferably one that doesn't mind giving you a bunch of IPs and, if they do, doesn't charge a fortune for them.
Then you can worry about what software you're gonna use.
We have a public API on Apify for that [2]
[1] https://anti-captcha.com/mainpage
[2] https://www.apify.com/petr_cermak/anti-captcha-recaptcha
We use this stack at WrapAPI (https://wrapapi.com), which we highly recommend as a tool to turn webpages into APIs. It doesn't completely do all the scraping (you still need to write a script), but it does make turning a HTML page into a JSON structure much easier.
https://cran.r-project.org/web/packages/httr/vignettes/quick...
For most things, I use Node.js with the Cheerio library, which is basically a stripped-down version of jQuery without the need for a browser environment. I find using the jQuery API far more desirable than the clunky, hideous Beautiful Soup or Nokogiri APIs.
For something that requires an actual DOM or code execution, PhantomJS with Horseman works well, though everyone is talking about headless Chrome these days so IDK. I've not had nearly as many bad experiences with PhantomJS as others have purportedly experienced.
Do you have any experience with processing and scraping large files using Cheerio? It doesn't support streaming does it? I am currently faced with processing a ~75 MB XML and I am not sure if Cheerio is suited for that.
If you’re looking to run it on a Linux machine also take a look at https://browserless.io (full disclosure I’m the creator of that site).
It lets you use jQuery-like selectors to extract data.
Like this: Elements newsHeadlines = doc.select("#mp-itn b a");
It has a GUI on it that is not designed very well, and documentation that is complete, but hard to search...
But it can do just about any type of scrape, including getting started from a command line script
Don't stick with the default "scrapy" or "Ruby" or "Jakarta Commons-HttpClient/...", which end up (justly) being banned more easily than unique ones, like "ABC/2.0 - https://example.com/crawler" or the like.
As a frequent scraper of government sites, and sometimes commercial sites for research purposes, I avoid faking a User Agent as much as possible, i.e. copying the default strings for popular browsers:
`Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36`
Almost always, if a site rejects my scraper on the basis of agent, they're doing a regex for "curl", "wget" or for an empty string. Setting a user-agent to something unique and explicit, i.e. "Dan's program by danso@myemail.com" works fine without feeling shady.
Maybe for old government sites that break on anything but IE, you'll have to pretend to be IE, but that's very rare.
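Setting a unique, explicit user agent takes only a couple of lines. Here is a minimal sketch with the standard library's `urllib.request`; the agent string, contact address and `build_request` helper are placeholders to adapt, not anything from the comments above.

```python
import urllib.request

# Placeholder identity: say who you are and how to reach you.
USER_AGENT = "ExampleResearchBot/1.0 (+https://example.com/crawler; contact@example.com)"


def build_request(url):
    """Return a Request that identifies the scraper honestly."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

The request can then be sent with `urllib.request.urlopen(build_request(url))`.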
- Which squares have historically hit the most often in Superbowl Squares (http://www.picks.org/nfl/super-bowl-squares)
- Search a job website for a search term and list of locations, collecting each job title, company, location, and link, to view as one large spreadsheet, instead of having to navigate through 10 results per page.
- Collect cost of living indices in a list of cities
It is not open source, and runs on Windows only, but it is one of the easiest to use tools that I have found. I can set up scrapes entirely visually, and it handles complex cases like infinite scroll pages, highly JavaScript-dependent pages and the like. I really wish there were an open source solution that was as good as this one.
I use it with one of my clients professionally. Their support is VERY good btw.
On the pure scraping side, it has a "declarative parsing" to avoid painful plain-old procedural code [1]. You can parse pages by simply specifying a bunch of XPaths and indicating a few filters from the library to apply on those XPath elements, for example CleanText to remove whitespace nonsense, Lower (to lower-case), Regexp, CleanDecimal (to parse as number) and a lot more. URL patterns can be associated to a Page class of such declarative parsing. If declarative becomes too verbose, it can always be replaced locally by writing a plain-old Python method.
A set of applications are provided to visualize extracted data, and other niceties are provided for debug easing. Simply put: « Wonderful, Efficient, Beautiful, Outshining, Omnipotent, Brilliant: meet WebOOB ».
[1] http://dev.weboob.org/guides/module.html#parsing-of-pages
I have also used custom written Python crawlers in a lot of cases.
The other thing I would emphasize is that a web scraper has multiple parts, such as crawling (downloading pages) and then actually parsing the page for data. The systems I've set up in the past typically are structured like this:
1. crawl - download pages to file system
2. clean then parse (extract data)
3. ingest extracted data into database
4. query - run adhoc queries on database
One of the trickiest things in my experience is managing updates. So when new articles/content are added to the site you only want to have to get and add that to your database, rather than crawl the whole site again. Also detecting updated content can be tricky. The brute force approach of course is just to crawl the whole site again and rebuild the database - not ideal though!
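One simple way to detect new versus updated content is to keep a hash of each page from the previous crawl and compare. The sketch below is an assumption-laden illustration: `classify` and the in-memory `seen` dict are hypothetical, and in practice the hashes would live in the database.

```python
import hashlib


def content_hash(body):
    return hashlib.sha256(body).hexdigest()


def classify(url, body, seen):
    """Return 'new', 'updated' or 'unchanged', and record the latest hash.

    `seen` maps each URL to the hash of the body stored on the last crawl.
    """
    digest = content_hash(body)
    previous = seen.get(url)
    seen[url] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "updated"
```

Hashing raw bytes also flags purely cosmetic changes (ads, timestamps in the markup), so hashing only the extracted fields is often the better signal.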
Of course, this all depends really on what you are trying to do!
So I decided to use scrapy, the core of scrapinghub.com.
I haven't written much Python before, but scrapy was very easy to learn. I wrote 2 spiders and ran them on scrapinghub (their serverless cloud). Scrapinghub supports job scheduling and many other things at a cost. I prefer scrapinghub because we don't have DevOps in my team. It also supports Crawlera to prevent IP banning, Portia for point and click (still in beta, and it was still hard to use), and Splash for SPA websites, but it's buggy and the GitHub repo is not under active maintenance.
For DOM query I use BeautifulSoup4. I love it. It's jQuery for python.
For SPA websites I wrote a scrapy middleware which uses puppeteer. Puppeteer is deployed on Amazon Lambda (1M free requests for the first 365 days, more than enough for scraping) using this https://github.com/sambaiz/puppeteer-lambda-starter-kit
I am planning to use Amazon RDS to store scraped data.
A few similar tools also exist, like https://page.rest/.
I have a function to help me search :
def find_r(value, ind, array, stop_word):
    indice = ind
    for i in array:
        indice = value.find(i, indice) + 1
    end = value.find(stop_word, indice)
    return value[indice:end], end
You can use it like this: resulting_text, end_index = find_r(string, start_index, ["<td", ">"], "</td")
To find text it is quite fast, and you don't need to master a framework.
Auto-detection of languages, and will automatically give you things like the following:
>>> article.parse()
>>> article.authors
[u'Leigh Ann Caldwell', 'John Honway']
>>> article.text
u'Washington (CNN) -- Not everyone subscribes to a New Year's resolution...'
>>> article.top_image
u'http://someCDN.com/blah/blah/blah/file.png'
>>> article.movies
[u'http://youtube.com/path/to/link.com', ...]
If you need more power, I heard good stuff about http://80legs.com/ though never tried them myself.
If you really need to do crazy shit like crawling the iOS App Store really fast and keeping things up to date, I suggest using Amazon Lambda and a custom Python parser. Though Lambda is not meant for this kind of thing, it works really well and is super scalable at a reasonable price.
So I was really hoping this thread would have revealed some newer commercial GUI-based alternatives (on-premise, not SaaS). Because I don't really ever want to go back to the maintenance hell of hand-rolled robots ever again :)
For JavaScript-heavy pages most people rely on selenium webdriver. However, you can also try hlspy (https://github.com/kanishka-linux/hlspy), which is a little utility I made a while ago for dealing with JavaScript-heavy pages for simple usage.
It's a very easy to use frontend to PhantomJS. You can code your interactions in JS or CoffeeScript and scrape virtually anything with a few lines of code.
If you need crawling, just pair a CasperJS script with any spider library like the ones mentioned around here.
That's what I'd use, if I had to scrape again (no JS support).
Agenty is a cloud-hosted web scraping app, and you can set up scraping agents using their point-and-click CSS Selector Chrome extension to extract anything from HTML with these 3 modes:
- TEXT : Simple clean text
- HTML : Outer or Inner HTML
- ATTR : Any attribute of a html tag like image src, hyperlink href…
Or advanced modes like REGEX, XPATH, etc.
Then save the scraping agent to execute on the cloud-hosted app, with advanced features like batch crawling, scheduling, and scraping multiple websites simultaneously, without worrying about IP-address blocks or speed.
Recently the platform added support for headless Chrome and Puppeteer, you can even run jobs written in Scrapy or any other library as long as it can be packaged as Docker container.
Disclaimer: I'm a co-founder of Apify
I use a python->selenium->chrome stack. The Page Object Model [0] has been a revelation for me. My scripts went from being a mess of spaghetti code to something that's a pleasure to write and maintain.
[0] https://www.guru99.com/page-object-model-pom-page-factory-in...
[0] https://github.com/cheeriojs/cheerio
[1] https://github.com/Softcadbury/football-peek/blob/master/ser...
It was so hard that we made our own company JUST to scrape stuff easily without requiring programming. Take a look at https://www.parsehub.com
It outputs to the warc file format (https://en.wikipedia.org/wiki/Web_ARChive), in case your workflow is to gather web pages and then process them afterwards.
https://sites.google.com/site/scriptsexamples/learn-by-examp...
What about Botscraper: http://www.botscraper.com/
If you prefer an API as a service that can pre-render pages, I built Page.REST (https://www.page.rest). It allows you to get rendered page content via CSS selectors as a JSON response.
For those reasons I like https://github.com/knq/chromedp
As others said, phantomJS (and now headless Chrome) are good tools to deal with heavy js websites
[0] http://go-colly.org/
[1] https://github.com/gocolly/colly
I previously have used WWW::Mechanize in the Perl world, but single page applications with Javascript really require something with a browser engine.
(1) hosted services like mozenda
(2) visual automation tools like Kantu Web Automation (which includes OCR)
(3) and last but not least outsourcing the scraping on sites like Freelancer.com
We looked at scrapy, but it just seemed like the wrong type of framing for the type of scrapers we build: requests, some html/xml parser, and output into a service API or a SQL store.
Maybe some people will enjoy it.
* cURL
* regex
It's getting a little long in the tooth, but I will be updating it soon to use a Chrome-based renderer. If you have any suggestions, you can leave them here or PM me :)
Here's where I hit the limit with that setup: dynamic websites. If you're looking at something like Discourse-powered communities or similar, and feel a bit too lazy to dig into all the ways requests are expected to look, it's no fun anymore. Luckily, there's lots of JS goodness which can handle dynamic websites, inject your JavaScript for convenience and more [4].
The recently published Headless Chrome [5] and puppeteer [6] (a Node API for it) are really promising for many kinds of tasks, scraping among them. You can get a first impression in this article [7]. The ecosystem does not seem to be as mature yet, but I think this will be the foundation of the next go-to scraping tech stack.
If you want to try it yourself, I've written a brief intro [8] and published a simple dockerized development environment [9], so you can give it a go without cluttering your machine or find out what dependencies you need and how the libraries are called.
[2] https://www.crummy.com/software/BeautifulSoup/bs4/doc/
[3] http://sangaline.com/post/advanced-web-scraping-tutorial/
[4] https://franciskim.co/dont-need-no-stinking-api-web-scraping...
[5] https://developers.google.com/web/updates/2017/04/headless-c...
[6] https://github.com/GoogleChrome/puppeteer
[7] https://blog.phantombuster.com/web-scraping-in-2017-headless...
OPEN http://asdf.com
CRAWL a
EXTRACT {'title': '.title'}
It's meant to be super simple and built from the ground up to support crawling Single Page Applications. I'm also creating a terminal client (early ver: https://imgur.com/a/RYx5g) for it, which will launch a Chrome browser and scrape everything. http://export.sh is still very early in the works; I'd appreciate any feedback (email in profile, contact form doesn't work).