We wanted to share the results, as they show OLake performing very competitively, often exceeding the speed of both open-source and commercial alternatives, while offering the cost advantages of a self-hosted open-source solution.
The benchmarks covered both full initial loads and Change Data Capture (CDC) on a large dataset (billions of rows for full load, tens of millions of changes for CDC) over a 24-hour window.
Full PostgreSQL benchmark write-up: https://olake.io/docs/connectors/postgres/benchmarks
For full loads, OLake achieved throughput of around 46,262 rows/sec, processing over 4 billion rows in 24 hours.
This was essentially on par with Fivetran (46,395 RPS) and significantly faster than Debezium (14,839 RPS - 3.1x slower), Estuary (3,982 RPS - 11.6x slower on a smaller processed dataset), and Airbyte (457 RPS - 101x slower before it failed the long test).
The most striking results were in CDC performance.
For processing 50 million changes, OLake completed the task in 22.5 minutes at 36,982 rows/sec. Fivetran took 31 minutes (1.4x slower), Debezium took 60 minutes (2.7x slower), Estuary took 4.5 hours (12x slower), and Airbyte took 23 hours (63x slower).
This indicates OLake delivers significantly lower latency for propagating changes from PostgreSQL to Iceberg.
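As a sanity check, the throughput and slowdown multiples above follow directly from the published row counts and wall-clock times. A quick back-of-the-envelope calculation using only the numbers quoted in this post:

```python
# Back-of-the-envelope check of the CDC numbers quoted above.
rows = 50_000_000  # CDC changes processed

# Wall-clock times in minutes, as published.
times_min = {
    "OLake": 22.5,
    "Fivetran": 31,
    "Debezium": 60,
    "Estuary": 4.5 * 60,
    "Airbyte": 23 * 60,
}

# Throughput in rows/sec, and slowdown relative to OLake.
throughput = {t: rows / (m * 60) for t, m in times_min.items()}
slowdown = {t: m / times_min["OLake"] for t, m in times_min.items()}

print(f"OLake CDC: {throughput['OLake']:,.0f} rows/sec")
for tool in ("Fivetran", "Debezium", "Estuary", "Airbyte"):
    print(f"{tool}: {slowdown[tool]:.1f}x slower")

# Full load: "over 4 billion rows in 24 hours" implies roughly 46,000+ rows/sec,
# consistent with the published 46,262 rows/sec figure.
full_load_rps = 4_000_000_000 / (24 * 3600)
```

The results land within rounding distance of the published figures (e.g. ~37,000 vs 36,982 rows/sec for CDC); small differences presumably come from seconds-level timings in the actual runs.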
On the cost side, OLake is open source and self-hosted, so the cost is simply the infrastructure: running the benchmarks on a substantial VM (64 vCPUs, 128 GiB memory) for 24 hours cost less than $75.
Comparing this to the vendor list prices for the data synced in the tests: Fivetran's full load cost $7,446 ($1.86/M rows), Estuary's full load cost $4,462 ($12.97/M rows), Airbyte Cloud's partial full load cost $5,560 ($438.8/M rows).
For CDC, Fivetran cost $2,257 ($45.14/M rows), Estuary cost $22.72 ($0.45/M rows), and Airbyte Cloud cost $148.95 ($2.98/M rows).
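The per-million-row figures are just total cost divided by millions of rows synced. A quick check using the totals quoted above, taking the full load as ~4 billion rows and the CDC runs as roughly 50 million changes each:

```python
def cost_per_million(total_usd: float, rows: int) -> float:
    """Vendor cost normalized to dollars per million rows synced."""
    return total_usd / (rows / 1_000_000)

# Full load: Fivetran synced ~4 billion rows for $7,446.
print(round(cost_per_million(7446, 4_000_000_000), 2))

# CDC: ~50 million changes each.
print(round(cost_per_million(2257.00, 50_000_000), 2))  # Fivetran
print(round(cost_per_million(148.95, 50_000_000), 2))   # Airbyte Cloud
```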
While Estuary shows a low per-row cost for CDC in this specific test, the overall picture strongly favors the predictable, infra-based cost of self-hosted OLake, especially for large-scale replication.
In summary, these benchmarks suggest OLake can match or exceed the speed of leading proprietary tools for PostgreSQL to Iceberg replication, offers superior CDC latency compared to all tested alternatives, and provides a significantly lower and more predictable cost structure due to being open source and self-hosted.
You can find more details on the benchmarks and the tool itself in our documentation.
Happy to discuss the results and our approach.
Video (29 min): https://www.youtube.com/watch?v=qqtE_BrjVkM
If you prefer text, here's the quick TL;DR:
Why Debezium became a drag for them:
1. Long full loads on multi-million-row MongoDB collections, and any failure meant restarting from scratch.
2. Kafka and Connect infrastructure felt heavy when the end goal was "Parquet/Iceberg on S3".
3. Handling heterogeneous arrays required custom SMTs.
4. Continuous streaming only; they still had to glue together ad-hoc batch pulls for some workflows.
5. Ongoing schema drift demanded extra code to keep Iceberg tables aligned.
What changed with OLake?
-> Writes directly from MongoDB (and friends) into Apache Iceberg, no message broker in between
-> Two modes: full load for the initial dump, then CDC for ongoing changes, exposed by a single flag in the job config
-> Automatic schema evolution: new MongoDB fields appear as nullable columns; complex sub-docs land as JSON strings you can parse later
-> Resumable, chunked full loads: a pod crash resumes instead of restarting
-> Runs as either a Kubernetes CronJob or an Airflow task; config is one YAML/JSON file.
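To make the "one config file" point concrete, here is a purely illustrative sketch of what such a job config could look like. The field names below are hypothetical, not OLake's actual schema; check the docs for the real format:

```yaml
# Hypothetical sketch only; not OLake's real config schema.
source:
  type: mongodb
  uri: mongodb://user:pass@host:27017/appdb
destination:
  type: iceberg
  s3_path: s3://lake/warehouse
mode: cdc            # or "full_load" for the initial dump
streams:
  - orders
  - users
```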
Their stack in one line: MongoDB → OLake writer → Iceberg on S3 → Spark jobs → Trino / occasional Redshift, all orchestrated by Airflow and/or K8s.
Posting here because many of us still bolt Kafka onto CDC just to land files. If you only need Iceberg tables, a simpler path might exist now. Curious to hear others’ experiences with broker-less CDC tools.
(Disclaimer: I work on OLake and hosted the meetup, but the talk is purely technical.)
Check out the GitHub repo - https://github.com/datazip-inc/olake
What to expect in the call:
1. Sync data from a database into Apache Iceberg using one of the following catalogs: REST, Hive, Glue, JDBC.
2. Explore how Iceberg partitioning will play out here [new feature].
3. Query the data using a popular lakehouse query tool.
When:
1. Date: 28th April (Monday) 2025 at 16:30 IST (04:30 PM).
2. RSVP here - https://lu.ma/s2tr10oz [make sure to add it to your calendar]
Assuming most of you are unaware of OLake (we started less than a year ago), I would be glad if you could take some time out and help us build OLake for the community.
GitHub repo - https://github.com/datazip-inc/olake
Today we’re excited to introduce OLake (github.com/datazip-inc/olake, 130+ and growing fast), an open-source tool built to help you replicate database data (MongoDB for now; MySQL and Postgres under development) into a data lakehouse faster, without the hassle of managing Debezium or Kafka (at least 10x faster than Airbyte and Fivetran at a fraction of the cost; see the docs for benchmarks - https://olake.io/docs/connectors/mongodb/benchmarks).
You might think "we don't need yet another ETL tool", and fair enough, but we tried the existing tools (proprietary and open source alike) and none of them were a good fit.
We made this mistake in our first product by building a lot of connectors, and learnt the hard way to pick a pressing pain point and build a world-class solution for it.
Who is it for?
We built this for data engineers and engineering teams struggling with:
1. Debezium + Kafka setup, and the 16 MB per-document size limitation of Debezium when working with MongoDB. We are Debezium-free.
2. Lost cursors during CDC, with no option left but to resync the entire dataset.
3. Syncs running for hours with no visibility into what's happening under the hood (sync logs, completion time, which table is being replicated, etc.).
4. The complexity of setting up a Debezium + Kafka pipeline or other solutions.
5. Existing ETL tools being very generic, not optimised to sync database data to a lakehouse or to handle the associated complexities (metadata and schema management).
6. Not knowing where to restart a sync from. With OLake you get resumable syncs, visibility into exactly where the sync paused, and a stored cursor token.
What is OLake?
OLake is engineered from the ground up to address the above common pain points.
By using the native features of databases (e.g. extracting data in BSON format for MongoDB) and the modern table format of Apache Iceberg (the future going ahead), OLake delivers:
Parallelized initial loads and continuous change data capture (CDC), so you can replicate hundreds of GB in minutes into Parquet files on S3. Read about the OLake architecture - https://olake.io/blog/olake-architecture
Adaptive fault tolerance: designed to handle disruptions like a lost cursor, ensuring data integrity with minimal latency (you configure the sync speed yourself). We store state with a resume token, so you know exactly where to resume your sync.
A modular, scalable architecture with configurable batching (select the streams you want to sync) and parallelism among them, to avoid OOMs and crashes.
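The resumable-sync idea above is a generic pattern that is easy to sketch: split the work into chunks, persist a cursor after each committed chunk, and on restart skip everything already written. A minimal toy version (not OLake's implementation, just an illustration of the pattern; the file-based state store and chunk size are made up for the example):

```python
import json
import os

STATE_FILE = "sync_state.json"  # hypothetical cursor store
CHUNK_SIZE = 3

def load_state():
    # Resume from the stored cursor if a previous run crashed.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"last_id": -1}

def save_state(state):
    # Persist the cursor after every committed chunk.
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def sync(rows, sink):
    """Copy `rows` (sorted by 'id') into `sink`, skipping work a prior run committed."""
    state = load_state()
    pending = [r for r in rows if r["id"] > state["last_id"]]
    for i in range(0, len(pending), CHUNK_SIZE):
        chunk = pending[i:i + CHUNK_SIZE]
        sink.extend(chunk)                 # stand-in for "write files to Iceberg/S3"
        state["last_id"] = chunk[-1]["id"]
        save_state(state)                  # a crash after this point loses no work

rows = [{"id": i, "v": i * i} for i in range(10)]
sink = []
sync(rows, sink)
print(len(sink))  # 10
os.remove(STATE_FILE)
```

If the process dies mid-run, rerunning `sync` picks up after the last persisted `last_id` instead of rewriting everything, which is the property described above.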
Why OLake?
As your production data grows, so do the challenges of managing it. For small businesses, self-serve tools and third-party SaaS connectors are often enough, but they typically max out at around 1 TB of data per month, and you are back to square one, googling for a perfect tool that's quick and fits your budget.
If you have around 1 TB/month of database data that is likely to grow rapidly, and you are looking to replicate it to a data lake for analytics use cases, we can help. Reach out to us at hello@olake.io
We are not saying we are the perfect solution to your every problem; this open-source project is very new, and we want to build it with your support.
Join our Slack community - https://getolake.slack.com - and help us build and set the industry standard for database-to-lakehouse ETL tools, so that there is no need for "yet another" attempt to fix something that isn't broken.
About Us: OLake is a proud open-source project from Datazip, built from India and founded by data enthusiasts Sandeep Devarapalli, Shubham Baldava, and Rohan Khameshra.
Contribute - olake.io/docs/getting-started
We are calling for contributors; OLake is Apache 2.0-licensed and maintained by Datazip.