We wanted to share the results, as they show OLake performing very competitively, often exceeding the speed of both open-source and commercial alternatives, while offering the cost advantages of a self-hosted open-source solution.
The benchmarks covered both full initial loads and Change Data Capture (CDC) on a large dataset (billions of rows for full load, tens of millions of changes for CDC) over a 24-hour window.
Full PostgreSQL benchmark: https://olake.io/docs/connectors/postgres/benchmarks
For full loads, OLake achieved throughput of around 46,262 rows/sec (RPS), processing about 4 billion rows in 24 hours.
This was essentially on par with Fivetran (46,395 RPS) and significantly faster than Debezium (14,839 RPS - 3.1x slower), Estuary (3,982 RPS - 11.6x slower on a smaller processed dataset), and Airbyte (457 RPS - 101x slower before it failed the long test).
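If you want to check the ratios yourself, they follow directly from the throughput figures. A small sketch using only the numbers quoted in this post (the variable names are mine, not from the benchmark tooling):

```python
# Full-load throughput in rows/sec, as quoted in this post.
full_load_rps = {
    "OLake": 46_262,
    "Fivetran": 46_395,
    "Debezium": 14_839,
    "Estuary": 3_982,
    "Airbyte": 457,
}

SECONDS_PER_DAY = 24 * 60 * 60

# Rows moved in 24 hours if the quoted OLake rate is sustained throughout.
olake_rows_24h = full_load_rps["OLake"] * SECONDS_PER_DAY

# How many times slower each tool is relative to OLake.
slowdown = {
    tool: round(full_load_rps["OLake"] / rps, 1)
    for tool, rps in full_load_rps.items()
}

print(f"OLake rows in 24h: {olake_rows_24h:,}")
for tool, factor in slowdown.items():
    print(f"{tool}: {factor}x relative to OLake")
```

This gives roughly 3.1x for Debezium, 11.6x for Estuary and 101x for Airbyte, with Fivetran at about 1.0x.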
The most striking results were in CDC performance.
For processing 50 million changes, OLake completed the task in 22.5 minutes at 36,982 rows/sec. Fivetran took 31 minutes (1.4x slower), Debezium took 60 minutes (2.7x slower), Estuary took 4.5 hours (12x slower), and Airbyte took 23 hours (63x slower).
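The CDC numbers can be cross-checked the same way. This sketch derives OLake's run time from the row count and rate, then compares the rounded durations quoted above (names are illustrative only):

```python
CHANGES = 50_000_000   # CDC changes processed in the test
OLAKE_RPS = 36_982     # OLake CDC throughput in rows/sec, as quoted

# Time OLake needs for 50M changes at the quoted rate, in minutes.
olake_minutes = CHANGES / OLAKE_RPS / 60

# Durations in minutes, taken from the rounded figures in this post.
cdc_minutes = {
    "OLake": 22.5,
    "Fivetran": 31,
    "Debezium": 60,
    "Estuary": 4.5 * 60,
    "Airbyte": 23 * 60,
}

# Slowdown of each tool relative to OLake's duration.
relative = {
    tool: round(minutes / cdc_minutes["OLake"], 1)
    for tool, minutes in cdc_minutes.items()
}

print(f"OLake: {olake_minutes:.1f} min for {CHANGES:,} changes")
for tool, factor in relative.items():
    print(f"{tool}: {factor}x")
```

The rounded durations reproduce the 1.4x, 2.7x and 12x figures. Airbyte comes out near 61x, so the 63x quoted above presumably reflects the unrounded run time.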
This indicates OLake delivers significantly lower latency for propagating changes from PostgreSQL to Iceberg.
On the cost side, OLake is open source and self-hosted, so the cost is simply the infrastructure. Running the benchmarks on a substantial VM (64 vCPUs, 128 GiB memory) for 24 hours cost less than $75.
Compare this with the vendor list prices for the data synced in the tests: Fivetran's full load cost $7,446 ($1.86/M rows), Estuary's full load cost $4,462 ($12.97/M rows), and Airbyte Cloud's partial full load cost $5,560 ($438.8/M rows).
For CDC, Fivetran cost $2,257 ($45.14/M rows), Estuary cost $22.72 ($0.45/M rows), and Airbyte Cloud cost $148.95 ($2.98/M rows).
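For anyone who wants to sanity-check the per-row prices, here is a rough sketch. It backs out the row counts implied by each vendor's total and unit price, and estimates an upper bound for OLake's cost per million rows. That last part is my own assumption: it attributes the entire $75 VM bill to the full load at the quoted 46,262 rows/sec for 24 hours.

```python
# (total list price in USD, quoted USD per million rows), from this post.
full_load = {
    "Fivetran": (7_446, 1.86),
    "Estuary": (4_462, 12.97),
    "Airbyte Cloud": (5_560, 438.8),
}
cdc = {
    "Fivetran": (2_257, 45.14),
    "Estuary": (22.72, 0.45),
    "Airbyte Cloud": (148.95, 2.98),
}


def implied_million_rows(total_usd, usd_per_million):
    """Rows (in millions) implied by a total cost and a per-million price."""
    return total_usd / usd_per_million


# Assumption: the whole VM bill (an upper bound of $75) is charged to the
# full load, sustained at the quoted OLake rate for 24 hours.
OLAKE_VM_COST_USD = 75
olake_million_rows = 46_262 * 24 * 60 * 60 / 1_000_000
olake_usd_per_million = OLAKE_VM_COST_USD / olake_million_rows

for name, (total, unit) in full_load.items():
    print(f"full load, {name}: ~{implied_million_rows(total, unit):,.0f}M rows")
for name, (total, unit) in cdc.items():
    print(f"CDC, {name}: ~{implied_million_rows(total, unit):,.0f}M rows")
print(f"OLake full load: under ${olake_usd_per_million:.3f} per million rows")
```

All three CDC prices imply about 50M rows, which matches the test size, and under that assumption the OLake estimate works out to roughly two cents per million rows.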
While Estuary shows a low per-row cost for CDC in this specific test, the overall picture strongly favors the predictable, infra-based cost of self-hosted OLake, especially for large-scale replication.
In summary, these benchmarks suggest OLake can match or exceed the speed of leading proprietary tools for PostgreSQL to Iceberg replication, offers superior CDC latency compared to all tested alternatives, and provides a significantly lower and more predictable cost structure due to being open source and self-hosted.
You can find more details on the benchmarks and the tool itself in our documentation.
Happy to discuss the results and our approach.
Video (29 min): https://www.youtube.com/watch?v=qqtE_BrjVkM
If you prefer text, here's the quick TL;DR:
Why Debezium became a drag for them:
1. Long full loads on multi-million-row MongoDB collections, and any failure meant restarting from scratch
2. Kafka and Connect infrastructure felt heavy when the end goal was “Parquet/Iceberg on S3”
3. Handling heterogeneous arrays required custom SMTs
4. Continuous streaming only; they still had to glue together ad-hoc batch pulls for some workflows
5. Ongoing schema drift demanded extra code to keep Iceberg tables aligned
What changed with OLake?
-> Writes directly from MongoDB (and friends) into Apache Iceberg, no message broker in between
-> Two modes: full load for the initial dump, then CDC for ongoing changes, exposed by a single flag in the job config
-> Automatic schema evolution: new MongoDB fields appear as nullable columns; complex sub-docs land as JSON strings you can parse later
-> Resumable, chunked full loads: a pod crash resumes instead of restarting
-> Runs as either a Kubernetes CronJob or an Airflow task; config is one YAML/JSON file
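To make the schema-evolution bullet concrete, here is a minimal Python sketch of the behavior described: new fields become nullable columns, and nested sub-docs become JSON strings. This is not OLake's code. The function names `flatten_document` and `align_rows` and the sample documents are made up for illustration.

```python
import json


def flatten_document(doc):
    """Keep top-level scalars as-is; serialize nested dicts/lists to JSON strings."""
    row = {}
    for key, value in doc.items():
        if isinstance(value, (dict, list)):
            row[key] = json.dumps(value, sort_keys=True)
        else:
            row[key] = value
    return row


def align_rows(docs):
    """Union all columns seen so far and fill missing values with None."""
    columns = []
    rows = []
    for doc in docs:
        row = flatten_document(doc)
        for key in row:
            if key not in columns:
                columns.append(key)  # a new field becomes a new nullable column
        rows.append(row)
    aligned = [{col: row.get(col) for col in columns} for row in rows]
    return columns, aligned


# Hypothetical documents: the second one introduces two new, nested fields.
docs = [
    {"_id": 1, "name": "a"},
    {"_id": 2, "name": "b", "address": {"city": "Pune"}, "tags": ["x", "y"]},
]

columns, rows = align_rows(docs)
print(columns)
for row in rows:
    print(row)
```

Older rows simply carry `None` for the columns that appeared later, and downstream jobs can `json.loads` the string columns when they need the nested structure.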
Their stack in one line: MongoDB → OLake writer → Iceberg on S3 → Spark jobs → Trino / occasional Redshift, all orchestrated by Airflow and/or K8s.
Posting here because many of us still bolt Kafka onto CDC just to land files. If you only need Iceberg tables, a simpler path might exist now. Curious to hear others’ experiences with broker-less CDC tools.
(Disclaimer: I work on OLake and hosted the meetup, but the talk is purely technical.)
Check out the GitHub repo: https://github.com/datazip-inc/olake
What to expect in the call:
1. Sync data from a database into Apache Iceberg using one of the following catalogs (REST, Hive, Glue, JDBC)
2. Explore how Iceberg partitioning will play out here [new feature]
3. Query the data using a popular lakehouse query tool
When:
1. Date: Monday, 28 April 2025, at 16:30 IST (4:30 PM)
2. RSVP here: https://lu.ma/s2tr10oz [make sure to add it to your calendar]
Most of you are probably unaware of OLake (we started less than a year ago), so I would be glad if you could take some time to help us build OLake for the community.
GitHub repo - https://github.com/datazip-inc/olake