This is Jan, the founder of Apify (https://apify.com/) — a full-stack web scraping platform. After the success of Crawlee for JavaScript (https://github.com/apify/crawlee/) and the demand from the Python community, we're launching Crawlee for Python today!
The main features are (minimal example after the list):
- A unified programming interface for both HTTP crawling (HTTPX with BeautifulSoup) and headless browser crawling (Playwright)
- Automatic parallel crawling based on available system resources
- Written in Python with type hints for enhanced developer experience
- Automatic retries on errors or when you’re getting blocked
- Integrated proxy rotation and session management
- Configurable request routing - direct URLs to the appropriate handlers
- Persistent queue for URLs to crawl
- Pluggable storage for both tabular data and files
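To make that concrete, a minimal crawler looks roughly like this (a simplified sketch; import paths reflect the launch-era package layout and may shift in later releases):

```
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Cap the crawl so the demo stays small.
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        # Store extracted records in the default dataset.
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })
        # Enqueue discovered links; the persistent queue deduplicates them.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```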
For details, you can read the announcement blog post: https://crawlee.dev/blog/launching-crawlee-python
Our team and I will be happy to answer any questions you might have here.
As a concrete example: command-f for "tier" on https://crawlee.dev/python/docs/guides/proxy-management and tell me how anyone could possibly know what `tiered_proxy_urls: list[list[str]] | None = None` should contain and why?
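My best guess, for anyone else who lands there: each inner list is one tier of proxies, ordered from cheapest to most reliable, and the crawler escalates a domain to a higher tier when it detects blocking. Something like this (placeholder URLs, and the semantics are my reading, not something the docs spell out):

```
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.proxy_configuration import ProxyConfiguration

proxy_configuration = ProxyConfiguration(
    tiered_proxy_urls=[
        # Tier 0: cheap datacenter proxies, tried first.
        ['http://datacenter-1.example.com:8000', 'http://datacenter-2.example.com:8000'],
        # Tier 1: pricier residential proxies, used once tier 0 gets blocked.
        ['http://residential-1.example.com:8000'],
    ],
)

crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)
```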
We wanted to pack as many features into the initial release as possible, because we have a local Python community conference coming up tomorrow and wanted the library ready for that.
More docs will come soon. I promise. And thanks for the shout.
I don't think this is fair. The code looks pretty readable to me.
The example should show how to literally find and target all data (.csv and .xlsx files, tables, etc.) and actually download it.
Anyone can use requests and just get the text and grep for URLs. I don't get it.
Remember: pick an example where you need to parse one thing to get thousands of other things, then hit some other endpoints, and then get the 3-5 things at each of those. Any example that doesn't look like that is not going to impress anyone.
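In Crawlee terms I'd expect that fan-out to look something like this (selectors, label, and URL are all made up):

```
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    # Stage 1: one listing page fans out into thousands of detail pages.
    @crawler.router.default_handler
    async def listing_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links(selector='a.item-link', label='DETAIL')

    # Stage 2: each detail page yields the 3-5 fields we actually want.
    @crawler.router.handler('DETAIL')
    async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
        title = context.soup.select_one('h1')
        price = context.soup.select_one('.price')
        await context.push_data({
            'url': context.request.url,
            'title': title.get_text(strip=True) if title else None,
            'price': price.get_text(strip=True) if price else None,
        })

    await crawler.run(['https://example.com/listing'])


asyncio.run(main())
```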
I'm not even clear on whether this is saying it's a framework or actually some automation tool, where automation means it actually autodetects where to look.
Now I am using Crawlee, thanks. I will keep working to integrate it better into my project; however, I can already tell it works flawlessly.
My project, with crawlee: https://github.com/rumca-js/Django-link-archive
- Crawlee has out-of-the-box support for headless browser crawling (Playwright). You don't have to install any plugin or set up middleware (quick sketch after this list).
- Crawlee has a minimalistic & elegant interface: set up your scraper with fewer than 10 lines of code, without caring about which middleware or settings need to be changed. On top of that, we also have templates, which make the learning curve much smaller.
- Complete type hint coverage, which is something Scrapy hasn't finished yet.
- Based on standard asyncio. Integrating Scrapy into a classic asyncio app requires bridging Twisted and asyncio, which is possible, but not easy, and can cause trouble.
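For illustration, a browser-based crawler is just this (rough sketch; the `headless` flag is how I remember the constructor, so double-check the reference):

```
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # No plugins or middleware to wire up: the browser crawler ships with the library.
    crawler = PlaywrightCrawler(headless=True)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # context.page is an ordinary Playwright page.
        await context.push_data({'title': await context.page.title()})

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```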
That cuts both ways, in true 80/20 fashion: it also means that anyone who isn't on the happy path Crawlee was designed for is going to have to edit your Python files (`pip install -e` type business) to achieve their goals.
I've found the API a lot better than any Python scraping API to date. Now I am tempted to try out Python with Crawlee.
The Playwright integration with gotScraping makes the entire programming experience a breeze. My crawling and scraping involves all kinds of frontend-rendered websites with a lot of modified XHR responses to be captured. And IT JUST WORKS!
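For reference, the plain Playwright pattern I use for catching those XHR responses looks roughly like this (placeholder URL; this is vanilla Playwright, not a Crawlee-specific API):

```
import asyncio

from playwright.async_api import async_playwright


async def main() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        # Log every XHR/fetch response the frontend fires while rendering.
        def on_response(response):
            if response.request.resource_type in ('xhr', 'fetch'):
                print('XHR:', response.status, response.url)

        page.on('response', on_response)
        await page.goto('https://example.com')
        await page.wait_for_load_state('networkidle')
        await browser.close()


asyncio.run(main())
```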
Thanks a ton. I will definitely use the Apify platform to scale, given the integration.
Please note that this is the first release, and we'll keep adding many more features as we go, including anti-blocking, adaptive crawling, etc. To see where this might go, check https://github.com/apify/crawlee
Detecting when blocked and switching proxy/“browser fingerprint”.
The code example on the front page has this:
`const data = await crawler.get_data()`
That looks like JavaScript? Is there a missing underscore?
Nice work though.
I was trying to build a small LangChain-based RAG on internal documents, but getting the documents from SharePoint/Confluence (we have both) is very painful.
We provide the Apify platform for publishing your scrapers as Actors for the developer community, and developers can earn money through it. You can use Crawlee for Python there as well :)
tl;dr: Crawlee is and always will be free to use and open source.
- RSS feeds are transferred via HTTP.
- BeautifulSoup can parse both HTML & XML.
(RSS uses the XML format)
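So fetching and parsing a feed takes only a few lines, e.g. (placeholder feed URL; the 'xml' parser needs lxml installed):

```
import httpx
from bs4 import BeautifulSoup

# Fetch the RSS feed over HTTP(S) and parse it as XML.
response = httpx.get('https://example.com/feed.xml')
soup = BeautifulSoup(response.text, 'xml')

for item in soup.find_all('item'):
    print(item.title.get_text(), '->', item.link.get_text())
```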
But I personally think it does some things a little easier, a little faster, and a little more conveniently than the other libraries and tools out there.
There's one thing the JS version of Crawlee has which unfortunately isn't in Python yet, but it will be there soon, and AFAIK it's unique among all libraries: it automatically detects whether a headless browser is needed or if plain HTTP will suffice, and uses the most performant option.
I find some dynamic sites purposely make themselves extremely difficult to parse, and they obfuscate the XHR calls to their API.
I've also seen some websites pollute the data when they detect scraping, which results in garbage data, but you don't know until it's verified.
Data pollution is real. Location-specific results, personalized results, A/B testing, and, my favorite, badly implemented websites are real as well.
When you encounter this, you can try scraping the data from different locations, with various tokens, cookies, referrers, etc., and often you can find a pattern that makes the data consistent. Websites hate scraping, but they hate showing wrong data to human users even more. So if you resemble a legit user, you'll most likely get correct data. But of course, there are exceptions.
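A cheap way to check whether you're being fed inconsistent data is to fetch the same URL with a few different client identities and compare response fingerprints (the URL and header values below are made up):

```
import hashlib

import httpx

URL = 'https://example.com/product/123'

# Two different client "identities"; divergent bodies hint at
# pollution, personalization, or A/B testing.
variants = {
    'plain': {},
    'browser-ish': {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Referer': 'https://www.google.com/',
        'Accept-Language': 'en-US,en;q=0.9',
    },
}

for name, headers in variants.items():
    body = httpx.get(URL, headers=headers).text
    print(name, hashlib.sha256(body.encode()).hexdigest()[:12], len(body))
```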