https://duckdb.org/docs/stable/extensions/iceberg/overview.h...
Apache iceberg-go? Nope:
https://github.com/apache/iceberg-go?tab=readme-ov-file#read...
Basically, Java Iceberg is the only mature way to do this, and it's not a very accessible ecosystem.
For a side project I'm using pyiceberg to sink streaming data to iceberg (using DuckDB as the stream processor):
https://sql-flow.com/docs/tutorials/iceberg-sink
It's basically a workaround for DuckDB's lack of native support. I am very happy with the PyIceberg library as a user. It was very easy, and the native Arrow support is a glimpse into the future. Arrow as an interchange format is quite amazing. Just open up the Iceberg table and append Arrow dataframes to it!
https://github.com/turbolytics/sql-flow
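For anyone curious what "open the table and append" can look like, here is a minimal sketch. It assumes a PyIceberg catalog named `default` is already configured and that a table `demo.events` exists; both names, and the `MicroBatcher` helper, are made up for illustration and are not taken from sql-flow.

```python
from typing import Any, Callable


class MicroBatcher:
    """Buffer rows and hand them to `flush_fn` once `max_rows` is reached."""

    def __init__(self, flush_fn: Callable[[list], None], max_rows: int = 1000):
        self.flush_fn = flush_fn
        self.max_rows = max_rows
        self.rows: list[dict[str, Any]] = []

    def add(self, row: dict[str, Any]) -> None:
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self) -> None:
        # Nothing is written for an empty buffer.
        if self.rows:
            batch, self.rows = self.rows, []
            self.flush_fn(batch)


def make_iceberg_flush(catalog_name: str = "default", table_name: str = "demo.events"):
    """Return a flush function that appends each batch to an Iceberg table.

    The catalog and table names are hypothetical placeholders.
    """
    # Imported lazily so the batching logic above has no third-party dependencies.
    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    table = load_catalog(catalog_name).load_table(table_name)

    def flush(rows: list) -> None:
        # Build the Arrow table with the Iceberg table's own schema so types line up,
        # then append it; each append commits a new snapshot.
        table.append(pa.Table.from_pylist(rows, schema=table.schema().as_arrow()))

    return flush
```

Wiring it up would be something like `MicroBatcher(make_iceberg_flush(), max_rows=500)`, calling `add(row)` per message and `flush()` on shutdown. Batching matters here because every append is a commit, so tiny batches mean lots of small files and snapshots.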
Arrow is quite spectacular, and it's cool to see the industry moving to standardize on it as a dataframe format. For example, the ClickHouse Python client also supports Arrow-based insertion:
https://sql-flow.com/docs/tutorials/clickhouse-sink
As long as Arrow is the interchange format, the glue code for sinking into these different systems is trivial.
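To make the "trivial glue" point concrete, here is a rough sketch of two sinks behind one tiny interface. It assumes `table` is a PyIceberg table (whose `append` takes a `pyarrow.Table`) and `client` is a `clickhouse_connect` client (whose `insert_arrow` takes a table name and a `pyarrow.Table`). The function names `iceberg_sink`, `clickhouse_sink`, and `write_all` are invented for this example.

```python
from typing import Any, Callable, Iterable

# A sink is just "something that accepts a pyarrow.Table".
ArrowSink = Callable[[Any], None]


def iceberg_sink(table: Any) -> ArrowSink:
    """Wrap a PyIceberg table; append() takes an Arrow table directly."""
    return lambda batch: table.append(batch)


def clickhouse_sink(client: Any, table_name: str) -> ArrowSink:
    """Wrap a clickhouse_connect client; insert_arrow() takes an Arrow table directly."""
    return lambda batch: client.insert_arrow(table_name, batch)


def write_all(batch: Any, sinks: Iterable[ArrowSink]) -> int:
    """Send the same Arrow batch to every sink and return how many sinks were written."""
    count = 0
    for sink in sinks:
        sink(batch)
        count += 1
    return count
```

Neither wrapper converts or reshapes anything, which is the whole point: the stream processor produces Arrow once and each destination consumes it as-is.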
We're about to merge https://github.com/apache/iceberg-go/pull/339 which will complete support for `AddFiles` to add existing parquet files to the table.
Not too far behind this is support for appending a stream of Arrow record batches, likely in the next couple weeks.
Slow and steady!
Previously there was a strong trend of using simple S3-backed blob storage with Parquet and Athena for querying data lakes. It felt like things have gotten pretty complicated, but as integrations improve and Apache Iceberg gains maturity, I'm seeing a shift toward greater flexibility with less SaaS/tool sprawl in data lakes.
May be of interest to people who want to know:
- What DuckDB is and why it's interesting
- What's good about it
- Why, for orgs without huge data, we will hopefully see a lot more of 's3 + duckdb' rather than more complex architectures and services, and hopefully (IMHO) less Spark!
https://www.robinlinacre.com/recommend_duckdb/
I think most people in data science or data engineering should at least try it to get a sense of what it can do.
Really, for me, the most important thing is that it makes it so much easier to design and test complex ETL, because you're not constantly having to run queries against Athena/Spark to check they work. You can do it all locally, in CI, set up tests, etc.
I don't think we'll ever see this, honestly.
Excellent podcast episode with Joe Reis. I've also never understood this whole idea of "just use Spark" or that you gotta get on Redshift.
If you have streaming data as a source, I built a side project to write streaming data to s3 in iceberg format:
https://sql-flow.com/docs/tutorials/iceberg-sink
https://github.com/turbolytics/sql-flow
I realize it's not quite what you asked for, but I wanted to mention it. I'm surprised at the lack of native Iceberg write support in these tools.
PyIceberg, though, was quite easy to use, and the Arrow-based API was very helpful as well.
However, my issue is the need to introduce one more tool. I feel that without a single tool to read and write to Iceberg, I would not want to introduce it to our team.
Spark is cool and all, but it takes quite a bit of effort to get working properly. And Spark seems to be the only thing right now that can read and write to Iceberg natively with a SQL-like interface.
Basically decoupling the file/data storage from the distributed computation layer.
One thing that's missing in DuckDB is predicate pushdown for Iceberg - see https://github.com/duckdb/duckdb-iceberg/issues/2
That puts it way behind the competition, performance-wise.
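For readers unfamiliar with why this matters: Iceberg manifests record per-file lower and upper bounds for columns, so an engine that pushes the predicate down can skip whole files before reading them. The following is only a conceptual sketch of that pruning step with made-up file names and bounds. It is not how duckdb-iceberg or any other engine actually implements it.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    lower: int  # min value of the filtered column in this file
    upper: int  # max value of the filtered column in this file


def prune(files, lo, hi):
    """Keep only files whose [lower, upper] range can overlap the predicate lo <= x <= hi."""
    return [f for f in files if f.upper >= lo and f.lower <= hi]


# Hypothetical table with three data files, partitioned by value range.
files = [
    DataFile("a.parquet", 0, 99),
    DataFile("b.parquet", 100, 199),
    DataFile("c.parquet", 200, 299),
]

# WHERE x BETWEEN 120 AND 150 only needs to touch one of the three files.
print([f.path for f in prune(files, 120, 150)])  # ['b.parquet']
```

Without pushdown, the engine reads all three files and filters afterwards, which is where the performance gap comes from on large tables.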
Does it support reading everything from one snapshot to another? (This is missing in Athena)
If yes to both, does it respect row level deletes when it does this?