Show HN: Streambed – Stream Postgres to Iceberg on S3, Supports Postgres Wire

(github.com)

129 points · vira28 · 20d ago · 40 comments

37 comments · 11 top-level

vira28 (OP) · 20d ago · 15 in thread

Author here. For context, I was the tech lead for the Postgres team at Cloudflare, and this came directly out of a challenge I kept hitting there: BI and dashboard teams needed to run long-running analytical queries, and the answer was always to spin up another bespoke read replica or stand up an ETL dump into an analytical database and query that.

So the question I started with was: what's the fewest components I could get away with? That led to the architecture here — Streambed connects to Postgres as a logical replication subscriber (same mechanism as a read replica) and streams WAL changes straight into Apache Iceberg on S3, queryable from psql via an embedded DuckDB. There are a lot of edge cases to handle, and it's very much early days.

Welcome any feedback.

kikimora · 20d ago

To me, being able to query over psql is secondary. I’m fine with any SQL. What is very important is being able to transform the data to better suit analytical queries. That is, define custom transformations, define how the data is partitioned, and choose what indices are available.

erikcw · 20d ago

Thanks for releasing this! How do you handle DDL queries? Are table changes synchronized to the Iceberg table automatically?

Also, I recently started looking into olake[0] to serve the same purpose. What would you say differentiates Streambed?

[0] https://github.com/datazip-inc/olake

vira28 (OP) · 19d ago

Thanks for the kind words!

Short answer: yes, column-level schema changes sync to Iceberg automatically[0].

Logical replication (pgoutput in v1) doesn't actually stream DDL statements. Instead, Postgres emits a fresh Relation message describing the table's current column layout right before the next change to that table. So we diff that against the last layout we knew and infer what changed.

From there we evolve the Iceberg schema in place: flush any buffered rows under the old schema first, then write a new metadata version with the change. What's handled today:

  - ADD COLUMN — new field ID allocated; the column's Postgres DEFAULT is carried into Iceberg's initial-default/write-default, so existing rows read back correctly
  - DROP COLUMN — removed from the current schema, existing data files untouched
  - Type widening — int4→int8, float4→float8 (the changes Iceberg considers compatible)
  - REPLICA IDENTITY changes

[0] https://github.com/viggy28/streambed/pull/21
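
The "diff the new Relation layout against the last known one" step described above can be sketched roughly as follows. This is an illustration only, not Streambed's actual code: the `Column` type, the `diff_relation` function, and the change labels are hypothetical names. Real Relation messages carry type OIDs, and type names are used here only for readability.

```python
from dataclasses import dataclass

# Widening conversions named in the comment above; any other type change
# is flagged as unsupported in this sketch.
SAFE_WIDENINGS = {("int4", "int8"), ("float4", "float8")}


@dataclass(frozen=True)
class Column:
    # Hypothetical, simplified view of one column in a Relation message.
    name: str
    pg_type: str


def diff_relation(old, new):
    """Compare two column layouts and return the inferred schema changes."""
    old_by_name = {c.name: c for c in old}
    new_names = {c.name for c in new}
    changes = []

    # Columns present in the new layout: either added or possibly retyped.
    for col in new:
        prev = old_by_name.get(col.name)
        if prev is None:
            changes.append(("add_column", col.name, col.pg_type))
        elif prev.pg_type != col.pg_type:
            if (prev.pg_type, col.pg_type) in SAFE_WIDENINGS:
                changes.append(("widen_type", col.name, col.pg_type))
            else:
                changes.append(("unsupported_type_change", col.name, col.pg_type))

    # Columns that disappeared from the layout are treated as dropped.
    for col in old:
        if col.name not in new_names:
            changes.append(("drop_column", col.name, col.pg_type))

    return changes
```

Because this sketch matches columns by name, a rename would show up as a drop plus an add. Handling that case would need extra information beyond the two layouts.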

saxenaabhi · 19d ago

Hey vira28, thanks a lot for your work. This is a very promising project because the other alternatives, like supabase/etl, Kuvasz-streamer, and Sequin, all have some subtle issues.

A few questions: 1) For a Supabase project, can we set up the replication slot on a replica instead of the primary? https://sequinstream.com/docs/reference/databases#using-sequ...

2) For a PlanetScale cluster, are the replication slots on the primary or the follower nodes?

I'm asking because isn't setting up slots on the primary riskier than setting them up on replicas/followers? If they are on the primary, won't WAL buildup take the primary down?

vira28 (OP) · 19d ago

You're welcome. To keep the primary from running out of disk space, you can configure max_slot_wal_keep_size https://www.postgresql.org/docs/17/runtime-config-replicatio...

Since Supabase is vanilla Postgres, Streambed should work with a replica as the source.

Regarding PlanetScale, I haven't looked at their offerings yet.

Where do you host your DB currently? Happy to try it out with that provider as the source.
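
For the WAL buildup concern raised above, one way to watch a slot once max_slot_wal_keep_size is set is to poll the pg_replication_slots view, whose wal_status and safe_wal_size columns exist in PostgreSQL 13 and later. The sketch below only classifies values already fetched from that view. The `classify_slot` helper, its labels, and the 1 GiB warning threshold are assumptions for illustration, not part of Streambed or Postgres.

```python
# Query to run with any Postgres client; the columns come from the
# pg_replication_slots system view (PostgreSQL 13+).
SLOT_QUERY = (
    "SELECT slot_name, wal_status, safe_wal_size "
    "FROM pg_replication_slots"
)


def classify_slot(wal_status, safe_wal_size, warn_bytes=1 << 30):
    """Map one slot's wal_status/safe_wal_size to a coarse risk label.

    warn_bytes is an arbitrary illustrative threshold (1 GiB).
    """
    if wal_status is None:
        # No restart_lsn yet, so the slot is not holding WAL back.
        return "unknown"
    if wal_status == "lost":
        # Required WAL was removed; the consumer has to resync.
        return "lost"
    if wal_status == "unreserved":
        # Past max_slot_wal_keep_size; WAL may go at the next checkpoint.
        return "critical"
    if safe_wal_size is None:
        # max_slot_wal_keep_size = -1: retention is uncapped, so a stalled
        # consumer can fill the source's disk.
        return "unbounded"
    if safe_wal_size < warn_bytes:
        return "warning"
    return "ok"
```

The trade-off is that a cap protects the source's disk at the cost of invalidating a slot whose consumer falls too far behind, after which the consumer needs a fresh snapshot.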

ashtuchkin · 20d ago

Just wanted to say thank you! Very relevant to our use cases. I'll report if I find any issues.

vira28 (OP) · 19d ago

Welcome. Would love to hear your experience. Feel free to share here or in the repo. Fully open source.

kshri24 · 19d ago

> streams WAL changes straight into Apache Iceberg on S3, queryable from psql via an embedded DuckDB

Why not use Ducklake instead of Apache Iceberg? Wouldn't that simplify the architecture substantially?

vira28 (OP) · 17d ago

From what I understand, DuckLake needs a dedicated metadata database and it also ties you to DuckDB, whereas with Iceberg many engines can query the data directly.

raducu · 20d ago

> queryable from psql via an embedded DuckDB.

Noob question here from someone who has only played a bit with Iceberg and Trino: what's the reason to still do the analytics inside Postgres? Is it so that you don't eat up the IOPS/bandwidth of the main PostgreSQL disks?

alex_hirner · 19d ago

How does it compare to https://github.com/supabase/etl ?

vira28 (OP) · 17d ago

The idea is pretty similar. As per their README, Iceberg support is deprecated.

iamcreasy · 20d ago

Very cool! What would a 10,000-foot solution look like for MySQL to Iceberg on S3?

vira28 (OP) · 19d ago

Should be fairly doable using a binlog-based producer such as https://github.com/go-mysql-org/go-mysql.

BodyCulture · 19d ago

Why are your queries slow?

viveknathani_ · 20d ago · 3 in thread

interesting approach, was exploring a Postgres to Clickhouse CDC setup while helping a team sometime back, this seems better as it allows separating the compute (query server) and storage (s3) layers, and thereby allowing us to be creative in cost reductions

vira28 (OP) · 19d ago

Aside from the cost, my major motivation is to keep the infrastructure simple. The data is already there in Postgres, so I didn't want to add another data warehouse. I have also shared my thoughts on where this is heading https://viggy28.dev/article/postgres-gateway-drug/

saisrirampur · 20d ago

It depends on the use case. For real-time, customer-facing analytics, ClickHouse’s MergeTree engine is a natural fit, so a Postgres → ClickHouse CDC setup with low latencies (single-digit seconds) is better.

Replication to Iceberg/S3 is better suited for offline analytics and data warehousing use cases. You can use the same ClickHouse engine as a query layer over Iceberg data in S3.

viveknathani_ · 20d ago

makes sense!

buremba · 20d ago · 2 in thread

Looks interesting! It reminds me of pg_lake, which we evaluated for our startup https://lobu.ai but it's missing a lot of pushdown capabilities which made OLAP queries expensive.

I also tried DuckLake but that required us to move away from PG-first approach. I was thinking of using Debezium to create Iceberg on S3 for our append-only PG tables and use DuckDB. I will try Streambed out as well!

vira28 (OP) · 20d ago

Both projects are relevant. Curious, what kind of pushdown capabilities were you looking for?

nylonstrung · 20d ago

Does pushdown require support at this part of the stack, or can you just delegate to DataFusion as your query engine, which has very good pushdown?

cpard · 20d ago · 2 in thread

Replicating the Postgres WAL to S3 and Iceberg reliably is a hard problem, but it’s not accurate to say that no ETL is needed here.

Maybe you can say it’s more of an ELT pattern, but anyone who’s interested in using this for realistic analytics will have to transform the data at some point.

If an org is early enough to think that they can use a solution like this and just get into DuckDB and start spitting out reports, they will be in for a really bad experience.

Please educate people to do the right thing and realize the scope of the work they are facing. It might feel like it hurts your growth in the short term, but it will benefit you greatly in the mid-to-long term as a vendor.

kikimora · 20d ago

IDK, AWS Zero-ETL from Aurora into Redshift really helped us at some point. You're right that data transformation is very limited, if possible at all. But having data in an analytical store, being able to experiment with queries, understanding what is wrong with your OLTP schema, and then building ETL is way better than doing an upfront design.

cpard · 20d ago

Of course it is. What you describe is one of the reasons that ELT became popular. If you couple it with a variant type and schema on read, you have a very powerful and flexible architecture.

But there’s no free lunch: building and maintaining data infrastructure that is reliable requires work. Many companies don’t realise that when they start their analytical journey, and aggressive marketing doesn’t help. That’s the point I was trying to make.

karakanb · 20d ago · 1 in thread

Hi, this looks interesting, thanks for sharing. I am the builder of ingestr (https://github.com/bruin-data/ingestr), so I am very much in the same space.

I really like that you did this in Go, and I'll definitely dig a bit more into the source code to see how you tackled the CDC stuff, given that there are not many reliable CDC libraries in Go, and there are quite a few gotchas when it comes to doing CDC right. We also hand-rolled ours in ingestr, or I must say clanker-rolled, and we got quite a few things wrong in the first place.

Curious about the postgres-compatible query option: what's the usecase you have in mind there? My perception is that any org that would use Iceberg also has one or a few query engines in place, is this more for debugging stuff?

Quite cool stuff, keep it up!

vira28 (OP) · 20d ago

Hello, I checked the ingestr repo, and it is in my bookmarks. Small world.

Agreed, CDC is death by a thousand cuts. I believe Debezium has a Java library.

My initial need was Postgres compatibility. I wanted to give BI and dashboard teams an endpoint they can query as if they were querying a Postgres replica. Added more context here https://news.ycombinator.com/item?id=48350820

ryanshrott · 19d ago · 1 in thread

We ran into issues with CDC when tables had a lot of TOAST columns. The WAL records don't include the full values unless you set REPLICA IDENTITY FULL. Does Streambed handle that, or do you need the extra config?

vira28 (OP) · 19d ago

Currently, Streambed expects REPLICA IDENTITY FULL to get the before and after values of TOAST columns. Since we already have the data in object storage, we could populate them without requiring REPLICA IDENTITY FULL. Created an issue https://github.com/viggy28/streambed/issues/25 to track this feature.
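
The idea of populating unchanged TOAST values from already-stored data can be sketched as below. In pgoutput, a column in an UPDATE's new tuple can be marked as an unchanged TOASTed value (kind 'u') instead of carrying data. The sketch assumes the previous version of the row has already been looked up by key, for example from the Iceberg table. The `UNCHANGED_TOAST` sentinel and the `merge_update` helper are hypothetical names, not Streambed's implementation.

```python
# Sentinel standing in for pgoutput's "unchanged TOASTed value" column kind.
UNCHANGED_TOAST = object()


def merge_update(previous_row, new_tuple):
    """Fill unchanged-TOAST placeholders in new_tuple from previous_row.

    Both arguments are dicts keyed by column name. How previous_row is
    fetched (e.g. by primary key from object storage) is out of scope.
    """
    merged = {}
    for name, value in new_tuple.items():
        if value is UNCHANGED_TOAST:
            if name not in previous_row:
                # Without a prior value the row cannot be reconstructed.
                raise KeyError(f"no stored value for unchanged column {name!r}")
            merged[name] = previous_row[name]
        else:
            merged[name] = value
    return merged
```

A lookup like this adds a read per affected update, which is the cost traded against the extra WAL volume that REPLICA IDENTITY FULL produces on the source.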

nightfly · 19d ago · 1 in thread

vira28: It looks like nearly all of your responses to comments/questions here are flagged/dead. Probably because they all look AI written. Are you actually responding or do you have an agent answering questions for you?

bithavoc · 19d ago

I wouldn't be surprised, even the core of the project is heavily vibe-coded[0]

[0] https://github.com/viggy28/streambed/blob/a660ebb75b4744f5bd...

nitinram · 19d ago · 1 in thread

This is a nice project! We do some exporting of data from Postgres to S3, and it's a little flaky but does the job for now. Feels like this is a good project to explore using.

vira28 (OP) · 18d ago

The challenge with any CDC is making it reliable. Curious, how are you exporting to S3? Debezium, some AWS service, or a homegrown tool?

chrislusf · 19d ago

If fewer components are desired, use SeaweedFS, which supports S3 table buckets plus Iceberg catalog and maintenance. It basically stores both the Iceberg table data and metadata.

oa335 · 20d ago

nice work! we have handrolled something similar at work.

do you have any perf metrics? throughput, end-to-end latency, etc?

ApiFB-Dev · 20d ago

hmm wow very interesting idea!
