We wanted to share the results, as they show OLake performing very competitively, often exceeding the speed of both open-source and commercial alternatives, while offering the cost advantages of a self-hosted open-source solution.
The benchmarks covered both full initial loads and Change Data Capture (CDC) on a large dataset (billions of rows for full load, tens of millions of changes for CDC) over a 24-hour window.
Full PostgreSQL benchmark write-up: https://olake.io/docs/connectors/postgres/benchmarks
For full loads, OLake achieved throughput of around 46,262 rows/sec, processing over 4 billion rows in 24 hours.
This was essentially on par with Fivetran (46,395 RPS) and significantly faster than Debezium (14,839 RPS - 3.1x slower), Estuary (3,982 RPS - 11.6x slower on a smaller processed dataset), and Airbyte (457 RPS - 101x slower before it failed the long test).
The most striking results were in CDC performance.
For processing 50 million changes, OLake completed the task in 22.5 minutes at 36,982 rows/sec. Fivetran took 31 minutes (1.4x slower), Debezium took 60 minutes (2.7x slower), Estuary took 4.5 hours (12x slower), and Airbyte took 23 hours (63x slower).
This indicates OLake delivers significantly lower latency for propagating changes from PostgreSQL to Iceberg.
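As a sanity check, the throughput and slowdown multiples above follow directly from the published row counts and wall-clock times. A quick back-of-the-envelope calculation using only the numbers quoted in this post:

```python
# Back-of-the-envelope check of the CDC numbers quoted above.
rows = 50_000_000  # CDC changes processed

# Wall-clock times in minutes, as published.
times_min = {
    "OLake": 22.5,
    "Fivetran": 31,
    "Debezium": 60,
    "Estuary": 4.5 * 60,
    "Airbyte": 23 * 60,
}

# Throughput in rows/sec, and slowdown relative to OLake.
throughput = {t: rows / (m * 60) for t, m in times_min.items()}
slowdown = {t: m / times_min["OLake"] for t, m in times_min.items()}

print(f"OLake CDC: {throughput['OLake']:,.0f} rows/sec")
for tool in ("Fivetran", "Debezium", "Estuary", "Airbyte"):
    print(f"{tool}: {slowdown[tool]:.1f}x slower")

# Full load: "over 4 billion rows in 24 hours" implies roughly 46,000+ rows/sec,
# consistent with the published 46,262 rows/sec figure.
full_load_rps = 4_000_000_000 / (24 * 3600)
```

The results land within rounding distance of the published figures (e.g. ~37,000 vs 36,982 rows/sec for CDC); small differences presumably come from seconds-level timings in the actual runs.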
On the cost side, OLake is open source and self-hosted, so the cost is simply the infrastructure: running the benchmarks on a substantial VM (64 vCPUs, 128 GiB memory) for 24 hours cost less than $75.
Comparing this to the vendor list prices for the data synced in the tests: Fivetran's full load cost $7,446 ($1.86/M rows), Estuary's full load cost $4,462 ($12.97/M rows), Airbyte Cloud's partial full load cost $5,560 ($438.8/M rows).
For CDC, Fivetran cost $2,257 ($45.14/M rows), Estuary cost $22.72 ($0.45/M rows), and Airbyte Cloud cost $148.95 ($2.98/M rows).
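The per-million-row figures are just total cost divided by millions of rows synced. A quick check using the totals quoted above, taking the full load as ~4 billion rows and the CDC runs as roughly 50 million changes each:

```python
def cost_per_million(total_usd: float, rows: int) -> float:
    """Vendor cost normalized to dollars per million rows synced."""
    return total_usd / (rows / 1_000_000)

# Full load: Fivetran synced ~4 billion rows for $7,446.
print(round(cost_per_million(7446, 4_000_000_000), 2))

# CDC: ~50 million changes each.
print(round(cost_per_million(2257.00, 50_000_000), 2))  # Fivetran
print(round(cost_per_million(148.95, 50_000_000), 2))   # Airbyte Cloud
```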
While Estuary shows a low per-row cost for CDC in this specific test, the overall picture strongly favors the predictable, infra-based cost of self-hosted OLake, especially for large-scale replication.
In summary, these benchmarks suggest OLake can match or exceed the speed of leading proprietary tools for PostgreSQL to Iceberg replication, offers superior CDC latency compared to all tested alternatives, and provides a significantly lower and more predictable cost structure due to being open source and self-hosted.
You can find more details on the benchmarks and the tool itself in our documentation.
Happy to discuss the results and our approach.
Video (29 min): https://www.youtube.com/watch?v=qqtE_BrjVkM
If you prefer text, here's the quick TL;DR:
Why Debezium became a drag for them:
1. Long full loads on multi-million-row MongoDB collections, and any failure meant restarting from scratch.
2. Kafka and Connect infrastructure felt heavy when the end goal was "Parquet/Iceberg on S3".
3. Handling heterogeneous arrays required custom SMTs.
4. Continuous streaming only; they still had to glue together ad-hoc batch pulls for some workflows.
5. Ongoing schema drift demanded extra code to keep Iceberg tables aligned.
What changed with OLake?
-> Writes directly from MongoDB (and friends) into Apache Iceberg, no message broker in between
-> Two modes: full load for the initial dump, then CDC for ongoing changes, exposed by a single flag in the job config
-> Automatic schema evolution: new MongoDB fields appear as nullable columns; complex sub-docs land as JSON strings you can parse later
-> Resumable, chunked full loads: a pod crash resumes instead of restarting
-> Runs as either a Kubernetes CronJob or an Airflow task; config is one YAML/JSON file.
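To make the "one config file" point concrete, here is a purely illustrative sketch of what such a job config could look like. The field names below are hypothetical, not OLake's actual schema; check the docs for the real format:

```yaml
# Hypothetical sketch only; not OLake's real config schema.
source:
  type: mongodb
  uri: mongodb://user:pass@host:27017/appdb
destination:
  type: iceberg
  s3_path: s3://lake/warehouse
mode: cdc            # or "full_load" for the initial dump
streams:
  - orders
  - users
```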
Their stack in one line: MongoDB → OLake writer → Iceberg on S3 → Spark jobs → Trino / occasional Redshift, all orchestrated by Airflow and/or K8s.
Posting here because many of us still bolt Kafka onto CDC just to land files. If you only need Iceberg tables, a simpler path might exist now. Curious to hear others’ experiences with broker-less CDC tools.
(Disclaimer: I work on OLake and hosted the meetup, but the talk is purely technical.)
Check out the GitHub repo - https://github.com/datazip-inc/olake
What to expect in the call:
1. Sync data from a database into Apache Iceberg using one of the following catalogs: REST, Hive, Glue, JDBC.
2. Explore how Iceberg partitioning will play out here [new feature].
3. Query the data using a popular lakehouse query tool.
When:
1. Date: 28th April (Monday) 2025 at 16:30 IST (04:30 PM).
2. RSVP here - https://lu.ma/s2tr10oz [make sure to add it to your calendar]
Assuming most of you are unaware of OLake (we started less than a year ago), I would be glad if you could take some time out and help us build OLake for the community.
GitHub repo - https://github.com/datazip-inc/olake
Today we’re excited to introduce OLake (github.com/datazip-inc/olake, 130+ and growing fast), an open-source tool built to help you replicate database data (MongoDB for now; MySQL and Postgres under development) into a data lakehouse faster, without the hassle of managing Debezium or Kafka (at least 10x faster than Airbyte and Fivetran at a fraction of the cost; see the docs for benchmarks - https://olake.io/docs/connectors/mongodb/benchmarks).
You might think "we don't need yet another ETL tool", and fair enough, but we tried the existing tools (proprietary and open source alike) and none of them were a good fit.
We made this mistake in our first product by building a lot of connectors, and learnt the hard way to pick a pressing pain point and build a world-class solution for it.
Who is it for?
We built this for data engineers and engineering teams struggling with:
1. Debezium + Kafka setup, and the 16 MB per-document size limitation of Debezium when working with MongoDB. We are Debezium-free.
2. Lost cursors during CDC, with no option left but to resync the entire dataset.
3. Syncs running for hours with no visibility into what's happening under the hood (sync logs, completion time, which table is being replicated, etc.).
4. The complexity of setting up a Debezium + Kafka pipeline or other solutions.
5. Existing ETL tools being very generic, not optimised to sync database data to a lakehouse or to handle the associated complexities (metadata and schema management).
6. Not knowing where to restart a sync from. With OLake you get resumable syncs, visibility into exactly where the sync paused, and a stored cursor token.
What is OLake?
OLake is engineered from the ground up to address the above common pain points.
By using the native features of databases (e.g. extracting data in BSON format for MongoDB) and the modern table format of Apache Iceberg (the future going ahead), OLake delivers:
Parallelized initial loads and continuous change data capture (CDC), so you can replicate hundreds of GB in minutes into Parquet files on S3. Read about the OLake architecture - https://olake.io/blog/olake-architecture
Adaptive fault tolerance: designed to handle disruptions like a lost cursor, ensuring data integrity with minimal latency (you configure the sync speed yourself). We store state with a resume token, so you know exactly where to resume your sync.
A modular, scalable architecture with configurable batching (select the streams you want to sync) and parallelism among them, to avoid OOMs and crashes.
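The resumable-sync idea above is a generic pattern that is easy to sketch: split the work into chunks, persist a cursor after each committed chunk, and on restart skip everything already written. A minimal toy version (not OLake's implementation, just an illustration of the pattern; the file-based state store and chunk size are made up for the example):

```python
import json
import os

STATE_FILE = "sync_state.json"  # hypothetical cursor store
CHUNK_SIZE = 3

def load_state():
    # Resume from the stored cursor if a previous run crashed.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"last_id": -1}

def save_state(state):
    # Persist the cursor after every committed chunk.
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def sync(rows, sink):
    """Copy `rows` (sorted by 'id') into `sink`, skipping work a prior run committed."""
    state = load_state()
    pending = [r for r in rows if r["id"] > state["last_id"]]
    for i in range(0, len(pending), CHUNK_SIZE):
        chunk = pending[i:i + CHUNK_SIZE]
        sink.extend(chunk)                 # stand-in for "write files to Iceberg/S3"
        state["last_id"] = chunk[-1]["id"]
        save_state(state)                  # a crash after this point loses no work

rows = [{"id": i, "v": i * i} for i in range(10)]
sink = []
sync(rows, sink)
print(len(sink))  # 10
os.remove(STATE_FILE)
```

If the process dies mid-run, rerunning `sync` picks up after the last persisted `last_id` instead of rewriting everything, which is the property described above.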
Why OLake?
As your production data grows, so do the challenges of managing it. For small businesses, self-serve tools and third-party SaaS connectors are often enough, but they typically max out at around 1 TB of data per month, and you are back to square one, googling for a perfect tool that's quick and fits your budget.
If you have around 1 TB/month of database data that is likely to grow rapidly, and you are looking to replicate it to a data lake for analytics use cases, we can help. Reach out to us at hello@olake.io
We are not saying we are the perfect solution to your every problem; this open-source project is very new, and we want to build it with your support.
Join our Slack community - https://getolake.slack.com - and help us build and set the industry standard for database-to-lakehouse ETL tools, so that there is no need for "yet another" attempt to fix something that isn't broken.
About Us: OLake is a proud open-source project from Datazip, built from India and founded by data enthusiasts Sandeep Devarapalli, Shubham Baldava, and Rohan Khameshra.
Contribute - olake.io/docs/getting-started
We are calling for contributors; OLake is Apache 2.0-licensed and maintained by Datazip.