Use it with Dropwizard/Spring Boot and you get to expose REST APIs too.
I really like the way the catalog standard can decouple underlying storage as well.
My biggest concern is how inaccessible the implementations are: Java/Spark has the only mature implementation right now, and even DuckDB doesn't support writing yet.
I built out a tool to stream data to Iceberg which uses the Python Iceberg client:
https://www.linkedin.com/pulse/streaming-iceberg-using-sqlfl...
I don't remember seeing that in Delta Lake [1], which is probably because the industry-standard benchmarks use date as a column (TPC-H) or join date as a dimension table (TPC-DS), and do not query over timestamp ranges instead of dates.
Yes, that solved the 2-column high-NDV partitioning issue - if you have your IP traffic sorted on destination or source, you need Z-curves to do the same thing, which are a little easier with bit twiddling for fixed-width types (sketched below).
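Roughly, that bit twiddling looks like this - a toy sketch, with the function name and 32-bit width just for illustration:

    def z_order_key(x: int, y: int, bits: int = 32) -> int:
        """Interleave the bits of two fixed-width ints (e.g. source and
        destination IPv4 addresses) into one Z-order key; sorting by the
        key clusters rows that are close on both columns at once."""
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i)      # even bits from x
            key |= ((y >> i) & 1) << (2 * i + 1)  # odd bits from y
        return key

    # Rows sorted by z_order_key(src_ip, dst_ip) stay scannable on either column.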
Hive would write a large number of small files when partitioned like that, or you'd lose efficiency when scanning on the non-partitioned column.
This does fix the high-NDV issue, but in general Netflix wrote hidden partitioning in specifically to avoid sorting on high-NDV columns and to reduce the sort complexity on writes (most daily writes won't need any partitioned inserts at all), while clustering on a timestamp will force a sort even if it is all a single day.
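To make hidden partitioning concrete, here is a rough PyIceberg sketch (catalog and table names are made up; DayTransform is Iceberg's actual day transform). Writers and readers only ever touch the ts column, and Iceberg derives the day partition itself:

    from pyiceberg.catalog import load_catalog
    from pyiceberg.partitioning import PartitionField, PartitionSpec
    from pyiceberg.schema import Schema
    from pyiceberg.transforms import DayTransform
    from pyiceberg.types import NestedField, StringType, TimestampType

    schema = Schema(
        NestedField(1, "event_id", StringType(), required=False),
        NestedField(2, "ts", TimestampType(), required=False),
    )

    # Partition on day(ts): no separate date column, so daily appends
    # need no sort and queries on ts still prune to matching days.
    spec = PartitionSpec(
        PartitionField(source_id=2, field_id=1000,
                       transform=DayTransform(), name="ts_day")
    )

    catalog = load_catalog("default")  # assumes a configured catalog
    catalog.create_table("analytics.events", schema=schema, partition_spec=spec)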
[1] Open Table Formats:
Starting now, building lakehouse products on any table format but Iceberg seems to me like it must be a mistake.
I’d also love to see a good comparison between “regular” Iceberg and AWS’s new S3 Tables.
When AWS launched S3 Tables last month I wrote a blog post with my first impressions: https://meltware.com/2024/12/04/s3-tables
There may be more in-depth comparisons available by now, but it's at least a good starting point for understanding how S3 Tables integrates with Iceberg.
[0] https://clickhouse.com/docs/en/sql-reference/table-functions...
[1] https://clickhouse.com/docs/en/engines/table-engines/integra...
On the other hand, since one of the use cases Netflix created it for was consuming directly from real-time systems, managing file creation when the data is updated is less trivial (the CoW vs. MoR problem, and how to compact small files), which becomes important on multi-petabyte tables with lots of users and frequent updates. This is something I assume not a lot of companies pay much attention to (heck, not even Netflix), and it has big performance and cost implications.
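For context, Iceberg exposes that choice per operation through table properties. A hedged PyIceberg sketch - the table name is made up, the write.*.mode properties are the documented Iceberg ones:

    from pyiceberg.catalog import load_catalog

    table = load_catalog("default").load_table("analytics.events")

    # CoW rewrites whole data files on update/delete (read-optimized,
    # write-heavy); MoR writes delete files instead and defers merging
    # to read time, which is why small-file compaction then matters.
    with table.transaction() as tx:
        tx.set_properties({
            "write.delete.mode": "merge-on-read",
            "write.update.mode": "merge-on-read",
            "write.merge.mode": "merge-on-read",
        })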
[0] https://www.definite.app/blog/databricks-tabular-acquisition
Does anyone know if Iceberg has plans to support similar use cases?
That said, a catalog (which Delta also can have) helps a lot to keep things tidy. For example, I can write a dataset with Spark, transform it with dbt and a query engine (such as Trino), and consume the resulting dataset with any client that supports Iceberg. With a catalog, this all happens without having to register the dataset location in each of these components.
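A sketch of the consumer side with PyIceberg (catalog URI and table name are invented): everything is resolved by name, never by storage path.

    from pyiceberg.catalog import load_catalog

    # A REST catalog endpoint; any engine pointing at the same catalog
    # resolves the same table without knowing where the files live.
    catalog = load_catalog("prod", uri="http://rest-catalog:8181")
    table = catalog.load_table("analytics.daily_sales")
    df = table.scan().to_pandas()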
PyIceberg is likely the easiest way to write without Spark.
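For example, a small append with PyIceberg and Arrow might look like this (table name and rows are made up; no Spark or JVM involved):

    from datetime import datetime

    import pyarrow as pa
    from pyiceberg.catalog import load_catalog

    table = load_catalog("default").load_table("analytics.events")

    # Append a small Arrow batch directly from Python.
    batch = pa.table({
        "event_id": ["a1", "b2"],
        "ts": [datetime(2024, 12, 1, 8, 30), datetime(2024, 12, 1, 9, 15)],
    })
    table.append(batch)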
https://tower.dev/blog/picking-snowflake-open-catalog-as-a-m...
Does the query engine value-add justify Snowflake's valuation? Their data marketplace thing didn't seem to have actually worked.
This actually converges to 1:
1/2 + 1/4 + 1/8 + 1/16 + ... = 1
You just need 30kloc of maven in your pom before you get there.
Can you expand on those reasons a bit?
The dependency on a catalog in Iceberg made it more complicated for simple cases than Delta, where a directory hierarchy was sufficient - if I understood the PyIceberg docs correctly.
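For what it's worth, PyIceberg's SQLite-backed SqlCatalog keeps the simple case fairly light; a sketch, with the paths as assumptions:

    from pyiceberg.catalog.sql import SqlCatalog

    # The whole "catalog" is one local SQLite file next to the data,
    # which is about as close to Delta's directory-only setup as it gets.
    catalog = SqlCatalog(
        "local",
        uri="sqlite:////tmp/iceberg/catalog.db",
        warehouse="file:///tmp/iceberg/warehouse",
    )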
For years I used a proprietary solution like Qlik Sense for the whole journey from data extraction to a finished dashboard (mostly on-prem). Going from raw data to a finished dashboard is a matter of days (not weeks/months) with one single tool (and maybe some scripts for supporting tasks). There is some "scripting" involved for loading and transforming data, but if you already understand data models (and maybe have some SQL experience) it is very easy. The dashboard creation itself does not need any coding at all: just drag and drop and some formulas like sum(amount).
But this is a standalone tool and it is hard to integrate into your own piece of software. From my experience, software developers have a much more complicated view of data handling. Often this is just the complexity of their use cases; sometimes it is just a lack of knowledge of data preparation for analytics use cases.
Another thing that complicates stuff greatly is the focus on use cases involving cloud storage and doing all the transformations on distributed systems.
And it is often not clear what amount of data we are talking about and whether it is real-time (streaming) data or not. There is a big difference in the possible approaches if you have six hours to prepare data or if it has to be refreshed every second (or whenever new data arrives, etc.).
Long story short: yes, it is complicated to grasp. There is also a big difference between using the data for normal analytics use cases in a company (mostly read-only data models) and using the data in a (big tech) product.
I would suggest starting simple by looking into a "query engine" to extract some data from somewhere and then doing some transformations with pandas/polars/cubejs for basic understanding (a toy version is sketched below). You will need some schedulers and orchestration on the way forward, but this will depend on the real use cases and the environment you are in.
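A toy version of that starting point, with all names invented: extract a couple of columns from an existing table, then aggregate in pandas.

    from pyiceberg.catalog import load_catalog

    # Extract: pull only the columns we need from an existing table.
    table = load_catalog("default").load_table("sales.orders")
    df = table.scan(selected_fields=("region", "amount")).to_pandas()

    # Transform: the sum(amount) step, grouped per region.
    summary = df.groupby("region")["amount"].sum().reset_index()
    print(summary)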