Use it with Dropwizard/Spring Boot and you get to expose REST APIs too.
I really like the way the catalog standard can decouple underlying storage as well.
My biggest concern is how inaccessible the implementations are: Java/Spark has the only mature implementation right now, and even DuckDB doesn't support writing yet.
I built out a tool to stream data to Iceberg which uses the Python Iceberg client:
https://www.linkedin.com/pulse/streaming-iceberg-using-sqlfl...
I don't remember seeing that in Delta Lake [1], which is probably because the industry-standard benchmarks use date as a column (TPC-H) or join date as a dimension table (TPC-DS), and do not query over timestamp ranges instead of dates.
Yes, that solved the 2-column high-NDV partitioning issue - if you have your IP traffic sorted on destination or source, you need Z-curves to do the same thing, which are a little easier with bit twiddling for fixed-width types (sketched below).
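Roughly, that bit twiddling looks like this - a toy sketch, with the function name and 32-bit width just for illustration:

    def z_order_key(x: int, y: int, bits: int = 32) -> int:
        """Interleave the bits of two fixed-width ints (e.g. source and
        destination IPv4 addresses) into one Z-order key; sorting by the
        key clusters rows that are close on both columns at once."""
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i)      # even bits from x
            key |= ((y >> i) & 1) << (2 * i + 1)  # odd bits from y
        return key

    # Rows sorted by z_order_key(src_ip, dst_ip) stay scannable on either column.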
Hive would write a large number of small files when partitioned like that, or you'd lose efficiency when scanning on the non-partitioned column.
This does fix the high-NDV issue, but in general Netflix wrote hidden partitioning in specifically to avoid sorting on high-NDV columns and to reduce the sort complexity on writes (most daily writes won't need any partitioned inserts at all), while clustering on a timestamp will force a sort even if it is all a single day.
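To make hidden partitioning concrete, here is a rough PyIceberg sketch (catalog and table names are made up; DayTransform is Iceberg's actual day transform). Writers and readers only ever touch the ts column, and Iceberg derives the day partition itself:

    from pyiceberg.catalog import load_catalog
    from pyiceberg.partitioning import PartitionField, PartitionSpec
    from pyiceberg.schema import Schema
    from pyiceberg.transforms import DayTransform
    from pyiceberg.types import NestedField, StringType, TimestampType

    schema = Schema(
        NestedField(1, "event_id", StringType(), required=False),
        NestedField(2, "ts", TimestampType(), required=False),
    )

    # Partition on day(ts): no separate date column, so daily appends
    # need no sort and queries on ts still prune to matching days.
    spec = PartitionSpec(
        PartitionField(source_id=2, field_id=1000,
                       transform=DayTransform(), name="ts_day")
    )

    catalog = load_catalog("default")  # assumes a configured catalog
    catalog.create_table("analytics.events", schema=schema, partition_spec=spec)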
[1] Open Table Formats:
Starting now, building lakehouse products on any table format but Iceberg seems to me like it must be a mistake.
I’d also love to see a good comparison between “regular” Iceberg and AWS’s new S3 Tables.
When AWS launched S3 Tables last month I wrote a blog post with my first impressions: https://meltware.com/2024/12/04/s3-tables
There may be more in-depth comparisons available by now, but it's at least a good starting point for understanding how S3 Tables integrates with Iceberg.
[0] https://clickhouse.com/docs/en/sql-reference/table-functions...
[1] https://clickhouse.com/docs/en/engines/table-engines/integra...
On the other hand, since one of the use cases Netflix created it for was consuming directly from real-time systems, managing file creation when the data is updated is less trivial (the CoW vs. MoR problem, and how to compact small files), which becomes important on multi-petabyte tables with lots of users and frequent updates. This is something I assume not a lot of companies pay much attention to (heck, not even Netflix), and it has big performance and cost implications.
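For context, Iceberg exposes that choice per operation through table properties. A hedged PyIceberg sketch - the table name is made up, the write.*.mode properties are the documented Iceberg ones:

    from pyiceberg.catalog import load_catalog

    table = load_catalog("default").load_table("analytics.events")

    # CoW rewrites whole data files on update/delete (read-optimized,
    # write-heavy); MoR writes delete files instead and defers merging
    # to read time, which is why small-file compaction then matters.
    with table.transaction() as tx:
        tx.set_properties({
            "write.delete.mode": "merge-on-read",
            "write.update.mode": "merge-on-read",
            "write.merge.mode": "merge-on-read",
        })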
[0] https://www.definite.app/blog/databricks-tabular-acquisition
Does anyone know if Iceberg has plans to support similar use cases?
That said, a catalog (which Delta also can have) helps a lot to keep things tidy. For example, I can write a dataset with Spark, transform it with dbt and a query engine (such as Trino), and consume the resulting dataset with any client that supports Iceberg. With a catalog, this all happens without having to register the dataset location in each of these components.
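A sketch of the consumer side with PyIceberg (catalog URI and table name are invented): everything is resolved by name, never by storage path.

    from pyiceberg.catalog import load_catalog

    # A REST catalog endpoint; any engine pointing at the same catalog
    # resolves the same table without knowing where the files live.
    catalog = load_catalog("prod", uri="http://rest-catalog:8181")
    table = catalog.load_table("analytics.daily_sales")
    df = table.scan().to_pandas()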
PyIceberg is likely the easiest way to write without Spark.
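For example, a small append with PyIceberg and Arrow might look like this (table name and rows are made up; no Spark or JVM involved):

    from datetime import datetime

    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    table = load_catalog("default").load_table("analytics.events")

    # Append a small Arrow batch directly from Python.
    batch = pa.table({
        "event_id": ["a1", "b2"],
        "ts": [datetime(2024, 12, 1, 8, 30), datetime(2024, 12, 1, 9, 15)],
    })
    table.append(batch)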
https://tower.dev/blog/picking-snowflake-open-catalog-as-a-m...
Does the query engine value-add justify Snowflake's valuation? Their data marketplace thing didn't seem to have actually worked.
This actually converges to 1:
1/2 + 1/4 + 1/8 + 1/16 + ... = 1
You just need 30kloc of maven in your pom before you get there.
Can you expand on those reasons a bit?
The dependency on a catalog in Iceberg made it more complicated for simple cases than Delta, where a directory hierarchy was sufficient - if I understood the PyIceberg docs correctly.
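For what it's worth, PyIceberg's SQLite-backed SqlCatalog keeps the simple case fairly light; a sketch, with the paths as assumptions:

    from pyiceberg.catalog.sql import SqlCatalog

    # The whole "catalog" is one local SQLite file next to the data,
    # which is about as close to Delta's directory-only setup as it gets.
    catalog = SqlCatalog(
        "local",
        uri="sqlite:////tmp/iceberg/catalog.db",
        warehouse="file:///tmp/iceberg/warehouse",
    )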
For years I used a proprietary solution like Qlik Sense for the whole journey from data extraction to a finished dashboard (mostly on-prem). Going from raw data to a finished dashboard is a matter of days (not weeks/months) with one single tool (and maybe some scripts for supporting tasks). There is some "scripting" involved for loading and transforming data, but if you already understand data models (and maybe have some SQL experience) it is very easy. The dashboard creation itself does not need any coding at all: just drag and drop and some formulas like sum(amount).
But this is a standalone tool and it is hard to integrate into your own piece of software. From my experience, software developers have a much more complicated view of data handling. Often this is just the complexity of their use cases; sometimes it is just a lack of knowledge of data preparation for analytics use cases.
Another thing that complicates stuff greatly is the focus on use cases involving cloud storage and doing all the transformations on distributed systems.
And it is often not clear what amount of data we are talking about and whether it is real-time (streaming) data or not. There is a big difference in the possible approaches if you have six hours to prepare data or if it has to be refreshed every second (or whenever new data arrives, etc.).
Long story short: yes, it is complicated to grasp. There is also a big difference between using the data for normal analytics use cases in a company (mostly read-only data models) and using the data in a (big tech) product.
I would suggest starting simple by looking into a "query engine" to extract some data from somewhere and then doing some transformations with pandas/polars/cubejs for basic understanding (a toy version is sketched below). You will need some schedulers and orchestration on the way forward, but this will depend on the real use cases and the environment you are in.
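A toy version of that starting point, with all names invented: extract a couple of columns from an existing table, then aggregate in pandas.

    from pyiceberg.catalog import load_catalog

    # Extract: pull only the columns we need from an existing table.
    table = load_catalog("default").load_table("sales.orders")
    df = table.scan(selected_fields=("region", "amount")).to_pandas()

    # Transform: the sum(amount) step, grouped per region.
    summary = df.groupby("region")["amount"].sum().reset_index()
    print(summary)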