This is Jan, founder of Apify, a web scraping and automation platform. Drawing on our team's years of experience, today we're launching Crawlee [1], the web scraping and browser automation library for Node.js that's designed for the fastest development and maximum reliability in production.
For details, see the short video [2] or read the announcement blog post [3].
Main features:
- Supports headless browsers with Playwright or Puppeteer
- Supports raw HTTP crawling with Cheerio or JSDOM
- Automated parallelization and scaling of crawlers for best performance
- Avoids blocking using smart sessions, proxies, and browser fingerprints
- Simple management and persistence of queues of URLs to crawl
- Written completely in TypeScript for type safety and code autocompletion
- Comprehensive documentation, code examples, and tutorials
- Actively maintained and developed by Apify—we use it ourselves!
- Lively community on Discord
To get started, visit https://crawlee.dev or run the following command: npx crawlee create my-crawler
If you have any questions or comments, our team will be happy to answer them here.
[2] https://www.youtube.com/watch?v=g1Ll9OlFwEQ
[3] https://blog.apify.com/announcing-crawlee-the-web-scraping-a...
I'm especially excited about the unified API for browser and HTML scraping, which is something I've had to hack on top of Scrapy in the past and it really wasn't a good experience. That, along with puppeteer-heap-snapshot, will make the common case of "we need this to run NOW, you can rewrite it later" so much easier to handle.
While I'm not particularly happy to see JavaScript begin taking over another field as it truly is an awful language, more choice is always better and this project looks valuable enough to make dealing with JS a worthwhile tradeoff.
Love that comment :D
Yeah, the ability to switch between headless and http is very important to us in production. We often hack something up quickly with headless and then later optimize it to use HTTP when we find the time.
Anyone can trash JS, except for all these C-like languages with boringly similar designs, doubly so for Python. Python itself took over the world because amateurs (scientists and other professionals, as opposed to programmers) can easily play with it.
Sorry, but this irked me. What exactly are your hang-ups with JS? It's just a JIT-compiled, dynamically typed language. By design.
It has its quirks for sure, but again, it's just a language. Truly awful it isn't.
I have no skin in the game (anymore), but boy, the sentiment repeated a gazillion times over the past decade, that JavaScript keeps creeping into places where it shouldn't be used, strongly resonated with me back then (not about this project in particular, just a general rant).
Is there any kind of detection/stealthiness benchmark compared to libraries such as puppeteer-stealth or fakebrowser?
Honestly, no matter how feature-complete and powerful a scraping tool is, the main "selling point" for me will always be stealthiness and human-like behavior, no matter how crappy the dev experience is (and IMHO that's the same for most serious scrapers/bot makers).
Will it always be free, or could it turn into a product/paid SaaS (kind of like browserless)? I'm wondering whether it's worth learning if the next cool features are going to be for paying users only.
Is this something that you use internally or is it just a way to promote your paid products?
Thanks :)
Can't say I agree. The biggest value for me is being able to respond to site changes quickly. Having a key bot offline for an extended period of time can be costly, so being able to update, test and deploy it quickly is a big selling point. The vast majority of sites, including major companies, have very rudimentary bot detection, and a high-quality proxy provider is often all you need to bypass it.
As for the advanced methods like reCAPTCHA v3 and Cloudflare, I don't know of any framework that passes those out of the box anyway, so you might as well use something that's easy to hack on and implement your own bypasses as necessary.
It helps too that the Apify devs themselves are nice and super responsive (we've had quite a few PRs merged over the last couple of years). The SDK code (and supporting libs like browser-tool, got-scraping) is clean and very easy to read/follow/extend (happy to hear too that the license is going to remain unchanged).
Crawlee does appear to do the basic checks though, like checking navigator.webdriver: https://github.com/apify/crawlee/blob/master/test/browser-po...
Last time I checked (over a year ago) I couldn't find any public code to make Chrome/Firefox properly undetectable.
That said, going to extreme lengths to be undetectable is rarely necessary, because some sites will serve up CAPTCHAs to real people on clean, uncompromised residential connections anyway.
It has an A rating in the BotD (fingerprint.js) detection. Now we're working on improving the CreepJS detection. That one is really tough though. Not even sure if anybody would use it in production environments as it must throw a lot of false positives.
It will always be free and maintained, because we're using it internally in all of our projects. We thought about adding a commercial license like Docker's: open source, but paid if you have more than $10 million in revenue or more than 250 employees. But in the end we decided that we won't do even that. It's just free and always will be free.
We don't have any benchmarks for Crawlee just yet, but we are working on those as we speak. We care deeply about bot detection. One of the features of Crawlee is generated fingerprints based on real browser data we gather; you can read more about it in the https://github.com/apify/fingerprint-suite repository, which is used under the hood in Crawlee. For scraping via HTTP requests (e.g. Cheerio/JSDOM), we develop a library called got-scraping (https://github.com/apify/got-scraping) that tries to mimic real browsers while doing fast HTTP requests.
Crawlee is and always will be open source. It originated from the Apify SDK (http://sdk.apify.com), which is a library to support development of so-called Actors on the Apify Platform (http://apify.com), so you can see it as a way for us to improve the experience of our customers. But you can use it anywhere you want; we provide ready-to-use Dockerfiles for each template.
Above is the headline from the crawlee.dev website.
I opened a PR to change it: https://github.com/apify/crawlee/pull/1480
Is there a similar guide for Crawlee?
import { Actor } from 'apify';
and then find all references to Actor and either remove them or replace them with Crawlee functions.
E.g. await Actor.openKeyValueStore() should be replaced with KeyValueStore.open()
It makes sense to add a separate example for Crawlee though. But it's true that it does not exist yet.
The typical response to people raising these issues is "buuuut xy is a private platform that can do what it wants". Yes, but why are you defending technocrats with bigger profits than many nation states' GDPs? (Reasonable) crawling should be allowed and promoted. In fact, it should be codified in law as a necessary element for the future of an open and free internet. Anyone trying to prevent it, or even worse, make it illegal, is a bad actor.
Rate limits can be applied for different reasons. If they protect the website from being overloaded, they are good in our opinion. If they protect it from competition, research or building new non-competitive, but valuable products that are not harmful to the original website, they are not ideal.
We leave it to the user to decide the ethics of their project and just provide the tools.
In my experience, headless scraping is on the order of 10-100x slower and significantly more resource intensive, even if you carefully block requests for images/ads/etc.
You should always start with traditional scraping, try as hard as you can to stick with it, and only move to headless if absolutely necessary. Sometimes, even if it will take 10x more “requests” to scrape traditionally, it’s still faster than headless.
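To put rough numbers on that trade-off, here is a minimal sketch. The 50x figure is an assumed value inside the 10-100x range mentioned above, the 10 requests per item is likewise illustrative, and `relativeCost` is a hypothetical helper, not part of any library.

```typescript
// Cost of scraping one item, measured in units of "one plain HTTP request".
// Both parameters are illustrative assumptions, not measurements.
function relativeCost(requestsPerItem: number, costPerRequest: number): number {
    return requestsPerItem * costPerRequest;
}

// Traditional scraping: assume 10 plain HTTP requests are needed per item.
const httpCost = relativeCost(10, 1);

// Headless scraping: one page load per item, assumed to be 50x as expensive.
const headlessCost = relativeCost(1, 50);

const cheaper = httpCost < headlessCost ? 'http' : 'headless';
console.log({ httpCost, headlessCost, cheaper });
```

Under these assumptions the break-even point is where the number of extra HTTP requests equals the headless slowdown factor.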
The libraries look useful. One question that wasn't obvious from the docs: how do you manage, or suggest approaching, rate limiting by domain? Ideally respecting crawl-delay in robots.txt, or just defaulting to some sane value. Most naive queue implementations make it challenging, and queue-per-domain feels annoying to manage.
At the Crawlee level, you can open new queues with one line of code and name them after the hostname. The most straightforward solution would be to run multiple Crawler instances with multiple queues, rate limit them using the options explained here https://crawlee.dev/docs/guides/scaling-crawlers and push new URLs to the respective queues based on each URL's hostname.
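A dependency-free sketch of that idea: group URLs by hostname, then space out requests within each group. The Map stands in for named request queues, and `groupByHostname`, `schedule`, and the 1000 ms delay are hypothetical names and values chosen for illustration. In a real setup each group would presumably be its own queue and crawler instance, with the rate limit set through the crawler options from the guide linked above.

```typescript
// Group URLs by hostname so each host gets its own queue.
function groupByHostname(urls: string[]): Map<string, string[]> {
    const queues = new Map<string, string[]>();
    for (const url of urls) {
        const { hostname } = new URL(url);
        const queue = queues.get(hostname) ?? [];
        queue.push(url);
        queues.set(hostname, queue);
    }
    return queues;
}

interface PlannedRequest {
    url: string;
    startAtMs: number; // offset from "now" at which the request may start
}

// Within one host, requests are spaced delayMs apart.
// Different hosts are independent, so each host starts at offset 0.
function schedule(queues: Map<string, string[]>, delayMs: number): PlannedRequest[] {
    const plan: PlannedRequest[] = [];
    for (const queue of queues.values()) {
        queue.forEach((url, index) => plan.push({ url, startAtMs: index * delayMs }));
    }
    return plan;
}

const queues = groupByHostname([
    'https://a.example/1',
    'https://b.example/1',
    'https://a.example/2',
]);
const plan = schedule(queues, 1000);
console.log(plan);
```

The per-host delay could come from a crawl-delay value in robots.txt where one is present, falling back to a default otherwise.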
If you'd like to discuss this in more depth, you can join our Discord or ask in GitHub discussions. Both are linked from the Crawlee homepage.
I would really like this but running in Python.
It allows headed crawling + avoiding blockers etc.
One issue I have with driving headless browsers in general is host RAM usage per browser/Chromium/Puppeteer instance (e.g. ~600-900 MB) for a single browser/context/page.
Could Crawlee make it easier to run more browser contexts with less RAM usage?
e.g. concurrently running multiple of these (pages requiring js execution): https://crawlee.dev/docs/examples/forms
From our experience, RAM is not the limiting factor. It's the CPU. You need at least 1 CPU core for modern browsers to work reliably at scale, so if you're using a container that has 1 GB of RAM and 0.25 of a core, it's just not worth it. If you have access to containers that have strong CPUs and not a lot of RAM, then it's a different story.
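As a rough worked example of that sizing rule, here is a small sketch. The per-browser figures (1 core, about 900 MB) are ballpark numbers taken from this thread and should be treated as assumptions; `browserCapacity` is a hypothetical helper.

```typescript
interface Machine {
    cpuCores: number;
    ramMb: number;
}

// Ballpark per-browser needs taken from this thread; treat them as assumptions.
const CORES_PER_BROWSER = 1;
const RAM_MB_PER_BROWSER = 900;

// How many browsers fit on a machine, and which resource runs out first.
function browserCapacity(m: Machine): { browsers: number; limitedBy: 'cpu' | 'ram' } {
    const byCpu = Math.floor(m.cpuCores / CORES_PER_BROWSER);
    const byRam = Math.floor(m.ramMb / RAM_MB_PER_BROWSER);
    return byCpu <= byRam
        ? { browsers: byCpu, limitedBy: 'cpu' }
        : { browsers: byRam, limitedBy: 'ram' };
}

// The container from the comment above: 1 GB of RAM, a quarter of a core.
const small = browserCapacity({ cpuCores: 0.25, ramMb: 1024 });

// A CPU-rich, RAM-poor container: here RAM becomes the bottleneck instead.
const cpuRich = browserCapacity({ cpuCores: 8, ramMb: 2048 });

console.log({ small, cpuRich });
```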
One note that may be helpful: if all you care about is the HTML, it's better to take a "snapshot" of the page by streaming the response directly to blob storage like S3. That way, if something fails and you need to retry, you can reference the saved raw data from storage instead of making another request and potentially getting blocked. Node pipelines make it really easy to chain this stuff together with other logic.
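A minimal sketch of that snapshot-then-reuse pattern. Note the simplifications: it buffers the HTML in memory rather than streaming it, and `SnapshotStore` is a hypothetical in-memory stand-in for blob storage such as S3. `getHtml` and the `Fetcher` type are likewise made up for illustration.

```typescript
type Fetcher = (url: string) => Promise<string>;

// In-memory stand-in for blob storage (hypothetical; swap in a real client).
class SnapshotStore {
    private readonly blobs = new Map<string, string>();

    async get(key: string): Promise<string | undefined> {
        return this.blobs.get(key);
    }

    async put(key: string, body: string): Promise<void> {
        this.blobs.set(key, body);
    }
}

// Return the saved snapshot if one exists; otherwise fetch once and save it.
async function getHtml(url: string, store: SnapshotStore, fetcher: Fetcher): Promise<string> {
    const saved = await store.get(url);
    if (saved !== undefined) {
        return saved; // retry path: no new request hits the website
    }
    const html = await fetcher(url);
    await store.put(url, html);
    return html;
}
```

With this shape, a failed parsing step can be retried against the stored HTML, and only a failed download triggers another request.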
For reference, I run a company that does large scale scraping / data aggregation.
To send emails, you can use any third-party tool. Check out Apify, as Crawlee is well integrated there and they have an easy-to-use email sender.
I've been using the unmaintained node-osmosis lib for years, maybe it'll motivate me to finally move from it.
Just some feedback from a developer's point of view, though: I think the documentation (both Crawlee and Apify) needs some work. It took me a while to get the difference between Crawlee and other headless crawlers like Playwright etc.
Crawlee is basically a big wrapper around open source tools like Puppeteer, Playwright, Cheerio (I would not call these crawlers though as they don't have any logic for enqueueing requests etc.)