So the question I started with was: what's the fewest components I could get away with? That led to the architecture here — Streambed connects to Postgres as a logical replication subscriber (same mechanism as a read replica) and streams WAL changes straight into Apache Iceberg on S3, queryable from psql via an embedded DuckDB. There are a lot of edge cases to handle, and it's very much early days.
Welcome any feedback.
Also, I recently started looking into olake[0] to serve the same purpose. What would you say differentiates Streambed?
Short answer: yes, column-level schema changes sync to Iceberg automatically[0].
Logical replication (pgoutput in v1) doesn't actually stream DDL statements. Instead, Postgres emits a fresh Relation message describing the table's current column layout right before the next change to that table. So we diff that against the last layout we knew and infer what changed.
From there we evolve the Iceberg schema in place: flush any buffered rows under the old schema first, then write a new metadata version with the change. What's handled today:
- ADD COLUMN — new field ID allocated; the column's Postgres DEFAULT is carried into Iceberg's initial-default/write-default, so existing rows read back correctly
- DROP COLUMN — removed from the current schema, existing data files untouched
- Type widening — int4→int8, float4→float8 (the changes Iceberg considers compatible)
- REPLICA IDENTITY changes
[0] https://github.com/viggy28/streambed/pull/21Few questions: 1) For a supabase project can we setup replication slot on replica instead of primary? https://sequinstream.com/docs/reference/databases#using-sequ...
2) For a planetscale cluster are the replication slots on primary or the follower nodes?
I'm asking because isn't setting up slots on primary riskier than setting them on replicas/followers? Because If you have them primary In case of WAL buildup your primary will go down?
Since Supabase is vanilla Postgres, streambed should work with replica as the source.
reg, Planetscale, I haven't looked at their offerings yet.
Where do you host your DB currently? Happy to try out with that provider as the source.
Why not use Ducklake instead of Apache Iceberg? Wouldn't that simplify the architecture substantially?
noob question here from someone who ony played a bit with iceberg and trino: what's the reason to do the analytics stil inside the postgres -- is it so that you don't eat up the IOPS/bandwidth of the main postgresql disks?
Replication to Iceberg/S3 is better suited for offline analytics and data warehousing use cases. You can use the same ClickHouse engine to query layer Iceberg data in S3.
I also tried DuckLake but that required us to move away from PG-first approach. I was thinking of using Debezium to create Iceberg on S3 for our append-only PG tables and use DuckDB. I will try Streambed out as well!
maybe you can say it’s more of an ELT pattern but anyone who’s interested into using this for realistic analytics they will have to transform the data at some point.
If an org is early enough to think that they can use a solution like this and just get in duckdb and start spitting out reports, they will be up for a really bad experience.
Please educate people to do the right thing and realize the scope of the work they are facing, it might feel that it hurts your growth in the short term but it will benefit you greatly in the mid-long term as a vendor.
But there’s no free lunch, building and maintains data infrastructure that is reliable requires work. Many companies don’t realise that when they start their analytical journey and aggressive marketing doesn’t help. That’s the point I was trying to make.
I really like that you did this in Go, and I'll definitely dig a bit more into the source code to see how you tackled the CDC stuff, given that there is not many reliable CDC libraries in Go, and there are quite a few gotchas when it comes to doing CDC right. We also hand-rolled ours in ingestr, or I must say clanker-rolled, and we got quite a few things wrong in the first place.
Curious about the postgres-compatible query option: what's the usecase you have in mind there? My perception is that any org that would use Iceberg also has one or a few query engines in place, is this more for debugging stuff?
Quite cool stuff, keep it up!
Agree, CDC is like Death by a thousand cuts. I believe Debezium has a Java library.
My initial need was Postgres compatibilty. Wanted to give an endpoint that BI and dashboard teams can use to query as if they are querying a Postgres replica. Added more context here https://news.ycombinator.com/item?id=48350820
[0] https://github.com/viggy28/streambed/blob/a660ebb75b4744f5bd...
do you have any perf metrics? throughput, end-to-end latency, etc?