Preview: Amazon S3 Tables and Lakehouse in DuckDB

(duckdb.org)

177 points · hn1986 · 1y ago · 47 comments

35 comments · 12 top-level

dm03514 · 1y ago · 7 in thread

I've mentioned this whenever Iceberg comes up. It's wild how immature the ecosystem still is. DuckDB itself lacks the ability to write Iceberg....

https://duckdb.org/docs/stable/extensions/iceberg/overview.h...

Apache iceberg go ? Nope

https://github.com/apache/iceberg-go?tab=readme-ov-file#read...

Basically, Java Iceberg is the only mature way to do this; it's not a very accessible ecosystem.

For a side project I'm using pyiceberg to sink streaming data to iceberg (using DuckDB as the stream processor):

https://sql-flow.com/docs/tutorials/iceberg-sink

It's basically a workaround for DuckDB's lack of native support. I am very happy with the PyIceberg library as a user. It was very easy, and the native Arrow support is a glimpse into the future. Arrow as an interchange format is quite amazing. Just open up the Iceberg table and append Arrow dataframes to it!

https://github.com/turbolytics/sql-flow

Arrow is quite spectacular and it's cool to see the industry moving to standardize on it as a dataframe. For example, the ClickHouse Python client also supports Arrow-based insertion:

https://sql-flow.com/docs/tutorials/clickhouse-sink

This makes the glue code trivial to sink into these different systems as long as arrow is used.
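For anyone who wants to see what that append looks like, here is a minimal sketch. It assumes a PyIceberg catalog is already configured (for example in `~/.pyiceberg.yaml`) and that the target table exists with a schema compatible with the Arrow types produced. `rows_to_columns` and `append_rows` are hypothetical helper names for illustration, not part of PyIceberg or SQLFlow; the library calls used are `load_catalog`, `load_table`, `Table.append`, and `pyarrow.table`.

```python
from typing import Any


def rows_to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row dicts into column lists, rejecting rows whose keys differ."""
    if not rows:
        return {}
    names = list(rows[0])
    for row in rows:
        if set(row) != set(names):
            raise ValueError(f"row keys {sorted(row)} do not match {sorted(names)}")
    return {name: [row[name] for row in rows] for name in names}


def append_rows(table_name: str, rows: list[dict[str, Any]],
                catalog_name: str = "default") -> None:
    """Append a batch of rows to an existing Iceberg table as one Arrow table."""
    # Third-party imports are kept local so the pure helper above
    # can be used without pyarrow/pyiceberg installed.
    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    if not rows:
        return  # nothing to commit
    catalog = load_catalog(catalog_name)    # assumption: configured elsewhere
    table = catalog.load_table(table_name)  # e.g. "my_namespace.events" (made up)
    table.append(pa.table(rows_to_columns(rows)))
```

Each `append` call produces a new snapshot, so in a streaming sink you would probably want to batch rows before calling it rather than appending one row at a time.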

zeroshade · 1y ago

Hi! Primary developer of iceberg-go here!

We're about to merge https://github.com/apache/iceberg-go/pull/339 which will complete support for `AddFiles` to add existing parquet files to the table.

Not too far behind this is support for appending a stream of Arrow record batches, likely in the next couple weeks.

Slow and steady!

dm03514 · 1y ago

Amazing! Thank you for the update, this will be huge

Mortiffer · 1y ago

I came to the same conclusion and moved on. We had some c# applications reading some python

hn1986 (OP) · 1y ago

tracking write support here:

https://github.com/duckdb/duckdb-iceberg/issues/37

ramraj07 · 1y ago

Is there a reliable way to convert an existing Parquet directory to iceberg without moving the data?

lidavidm · 1y ago

Iceberg-go is working on it! (edit: it being write support)
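On the Python side, PyIceberg has `Table.add_files`, which registers existing Parquet files in the table metadata without rewriting or moving them. A rough sketch, assuming the Iceberg table already exists with a schema compatible with the files and that the catalog is configured elsewhere. `list_parquet_files` and `register_existing_files` are hypothetical helpers, and the directory walk below only covers a local filesystem; for S3 you would pass full `s3://` URIs instead.

```python
from pathlib import Path


def list_parquet_files(root: str) -> list[str]:
    """Return absolute paths of every *.parquet file under root, sorted."""
    return sorted(str(p.resolve()) for p in Path(root).rglob("*.parquet"))


def register_existing_files(table_name: str, root: str,
                            catalog_name: str = "default") -> int:
    """Commit the existing files to the table as data files; returns the count."""
    # Local import so the path helper works without pyiceberg installed.
    from pyiceberg.catalog import load_catalog

    paths = list_parquet_files(root)
    if not paths:
        return 0
    table = load_catalog(catalog_name).load_table(table_name)
    # Only metadata is written here; the Parquet files stay where they are.
    table.add_files(file_paths=paths)
    return len(paths)
```

As far as I know the PyIceberg docs treat this as an expert feature with caveats (for example, the files should not already carry Iceberg field IDs), so check those before pointing it at a production directory.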

barrenko · 1y ago

What the hell is iceberg now?

isjustintime · 1y ago · 6 in thread

This is pretty exciting. DuckDB is already proving to be a powerful tool in the industry.

Previously there was a strong trend of using simple S3-backed blob storage with Parquet and Athena for querying data lakes. It felt like things had gotten pretty complicated since then, but as integrations improve and Apache Iceberg gains maturity, I'm seeing a shift toward greater flexibility with less SaaS/tool sprawl in data lakes.

RobinL · 1y ago

Yes - agree! I actually wrote a blog about this just two days ago:

May be of interest to people who:

- Want to know what DuckDB is and why it's interesting

- What's good about it

- Why for orgs without huge data, we will hopefully see a lot more of 's3 + duckdb' rather than more complex architectures and services, and hopefully (IMHO) less Spark!

https://www.robinlinacre.com/recommend_duckdb/

I think most people in data science or data engineering should at least try it to get a sense of what it can do

Really for me, the most important thing is it makes it so much easier to design and test complex ETL because you're not constantly having to run queries against Athena/Spark to check they work - you can do it all locally, in CI, set up tests, etc.

pletnes · 1y ago

I have the same thoughts. However my impression is also that most orgs would choose eg databricks or something for the permission handling, web ui, ++ so what is the equivalent «full rig» with duckdb and S3 / blob storage?

1 more reply

yakshaving_jgt · 1y ago

Funny, I read TFA and came to the comments to share exactly this recent blog post of yours. Big fan of your work, Robin!

1 more reply

hn1986 (OP) · 1y ago

from the blog: "This is a very interesting new development, making DuckDB potentially a suitable replacement for lakehouse formats such as Iceberg or Delta lake for medium scale data."

I don't think we'll ever see this, honestly.

excellent podcast episode with Joe Reis - I've also never understood this whole idea of "just use Spark" or you gotta get on Redshift.

1 more reply

mritchie712 · 1y ago

if you're looking to try out duckdb + iceberg on AWS, we have a solid guide here: https://www.definite.app/blog/cloud-iceberg-duckdb-aws

raffraffraff · 1y ago

Kinda the same as metrics/logs systems using blob storage? (Eg Mimir, Loki). Because I remember the hassle of hbase, Cassandra, ELK.

reedf1 · 1y ago · 2 in thread

As a data engineering dabbler; parquet in S3 is beautiful. So is DuckDB. What an incredible match.

alexott · 1y ago

Plain Parquet has a lot of problems. That's why Iceberg and Delta arose.

timenova · 1y ago

Can you elaborate on what kinds of problems plain Parquet has?

1 more reply

sys13 · 1y ago · 2 in thread

Wonder why not Delta Lake instead, since Iceberg will merge with Delta

alexott · 1y ago

It's been supported for quite a while: https://duckdb.org/2024/06/10/delta.html

jl6 · 1y ago

It will?

whinvik · 1y ago · 2 in thread

When is write support for iceberg coming?

dm03514 · 1y ago

pfsh who needs to write data??? ;p

If you have streaming data as a source, I built a side project to write streaming data to s3 in iceberg format:

https://sql-flow.com/docs/tutorials/iceberg-sink

https://github.com/turbolytics/sql-flow

I realize it's not quite what you asked for but wanted to mention it. I'm surprised at the lack of native Iceberg write support in these tools.

PyIceberg, though, was quite easy to use, and the Arrow-based API was very helpful as well.

whinvik · 1y ago

Thanks. This looks cool.

However, my issue is the need to introduce one more tool. I feel that without a single tool to read and write to Iceberg, I would not want to introduce it to our team.

Spark is cool and all, but it takes quite a bit of effort to get working properly. And Spark seems to be the only thing right now that can read and write to Iceberg natively with a SQL-like interface.

1 more reply

yodon · 1y ago · 2 in thread

Can someone ELI5 the difference between AWS S3 Tables and AWS SimpleDB?

nattaylor · 1y ago

S3 Tables is designed for storing and optimizing tabular data in S3 using Apache Iceberg, offering features like automatic optimization and fast query performance. SimpleDB is a NoSQL database service focused on providing simple indexing and querying capabilities without requiring a schema.
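To make the S3 Tables side concrete, this is roughly how the DuckDB preview attaches a table bucket, as best I recall from the linked post. Treat the `ENDPOINT_TYPE s3_tables` option, the extension list, and the secret setup as assumptions to verify against the current DuckDB docs; the preview may also have required nightly builds of the extensions. The helper names and the example ARN are hypothetical.

```python
# Statements run once per connection before attaching (assumed, see note above).
SETUP_STATEMENTS = [
    "INSTALL aws", "LOAD aws",
    "INSTALL httpfs", "LOAD httpfs",
    "INSTALL iceberg", "LOAD iceberg",
    # Pick up credentials from the usual AWS chain (env vars, profile, role).
    "CREATE SECRET (TYPE s3, PROVIDER credential_chain)",
]


def s3_tables_attach_sql(table_bucket_arn: str, alias: str = "s3_tables_db") -> str:
    """Build the ATTACH statement for an S3 table bucket ARN."""
    if not table_bucket_arn.startswith("arn:aws:s3tables:") or "'" in table_bucket_arn:
        raise ValueError("expected an S3 Tables bucket ARN")
    if not alias.isidentifier():
        raise ValueError("alias must be a plain SQL identifier")
    return (
        f"ATTACH '{table_bucket_arn}' AS {alias} "
        "(TYPE iceberg, ENDPOINT_TYPE s3_tables);"
    )


def attach_s3_tables(con, table_bucket_arn: str, alias: str = "s3_tables_db") -> None:
    """Run setup and attach on an open DuckDB connection."""
    for statement in SETUP_STATEMENTS:
        con.execute(statement)
    con.execute(s3_tables_attach_sql(table_bucket_arn, alias))
```

After attaching, tables would be addressed as `alias.namespace.table` in ordinary `SELECT` statements, which is about as far from SimpleDB's key/attribute API as you can get.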

alex_smart · 1y ago

They are so completely different that it would be simpler if you explained what similarities you see between the two.

margorczynski · 1y ago · 1 in thread

Looks like they're going the route of Starrocks? https://www.starrocks.io/

Basically decoupling the file/data storage from the distributed computation layer.

jamesblonde · 1y ago

That is exactly what the Lakehouse is about - decoupling storage (Iceberg, Delta, Hudi) from query engine.

ayhanfuat · 1y ago · 1 in thread

Anybody tried S3 tables? How is your experience? It seems more tempting now that DuckDB supports it.

Kalanos · 1y ago

Haven't tried it. S3 Tables sounds like a great idea. However, I am wary. For it to be useful, a suite of AWS services probably needs to integrate with it. These services are all managed by different teams that don't always work well together out of the box and often compete with redundant products. For example, configuring SageMaker Studio to use an EMR cluster for Spark was a multi-day hassle with a lot of custom (insecure?) configuration. How is this different from other existing table offerings? AWS is a mess.

TheGuyWhoCodes · 1y ago

Does DuckDB just delegate the query to S3 Tables? Or does it do anything in-engine with the data files?

One thing that's missing in DuckDB is predicate pushdown for Iceberg - see https://github.com/duckdb/duckdb-iceberg/issues/2

Which puts it way behind the competition, performance wise.
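For anyone wondering what pushdown buys you here: Iceberg metadata records per-file lower and upper bounds for columns, so an engine can skip whole data files that cannot match a filter before reading any Parquet. A toy model of that pruning step follows; `DataFile` and `prune_files` are made-up names, a single integer column stands in for real statistics, and this is not how DuckDB or any particular engine implements it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    lower: int  # minimum value of the filter column in this file
    upper: int  # maximum value of the filter column in this file


def prune_files(files: list[DataFile], lo: int, hi: int) -> list[DataFile]:
    """Keep files whose [lower, upper] range overlaps the predicate lo <= col <= hi."""
    return [f for f in files if f.upper >= lo and f.lower <= hi]


files = [
    DataFile("a.parquet", 1, 10),
    DataFile("b.parquet", 11, 20),
    DataFile("c.parquet", 21, 30),
]
# WHERE col BETWEEN 12 AND 22 only needs b.parquet and c.parquet.
to_scan = prune_files(files, 12, 22)
```

Without this step every file gets opened and scanned, which is where the performance gap on selective queries over large tables comes from.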

_atyler_ · 1y ago

This is a great example of how simplicity often wins in practice. Too many systems overcomplicate storage and retrieval, assuming every use case needs full indexing or ultra-low latency. In reality, for many workloads, treating S3 like a raw table and letting the engine handle the heavy lifting makes a lot of sense. Curious to see how it performs under high concurrency—any benchmarks on that yet?

AlecBG · 1y ago

Does this support time travel queries?

Does it support reading everything from one snapshot to another? (This is missing in Athena)

If yes to both, does it respect row level deletes when it does this?

rubenvanwyk · 1y ago

Wow, DuckDB continues to be the MVP.
