DuckDB 0.9 - https://news.ycombinator.com/item?id=37657736 - Sept 2023 (59 comments)
It's not about DuckDB's performance improvements per se; it's about the change to the AWS instance type they're using to get fairer benchmark results by avoiding block storage and preventing noisy neighbors.
"The team at DuckDB Labs has been hard at work improving the performance of the out-of-core hash aggregates and joins."
Aside from that, since April DuckDB has vastly improved performance of a number of queries, including aggregates, joins, window functions. [0]
[0] https://duckdb.org/2023/09/26/announcing-duckdb-090.html#cor...
(Submitted title was "DuckDB performance improvements with the latest release")
ClickBench (from ClickHouse) has some benchmarks[1] where it can be compared, but I'm not sure how up to date they are. At least a while back they were badly out of date, and I haven't looked closely at whether they're keeping things fair for everyone else :)
The draft[1] includes a comparison to DuckDB and preliminary benchmark results.
I tried converting one of my data lake Parquet files to Arrow, and the size difference is staggering: a 20 MB Parquet file became a 700 MB Arrow file.
Doesn't seem fit for a data lake at all.
> Parquet and Arrow are complementary technologies, and they make some different design tradeoffs. In particular, Parquet is a storage format designed for maximum space efficiency, whereas Arrow is an in-memory format intended for operation by vectorized computational kernels.
> The major distinction is that Arrow provides O(1) random access lookups to any array index, whilst Parquet does not. In particular, Parquet uses dremel record shredding, variable length encoding schemes, and block compression to drastically reduce the data size, but these techniques come at the loss of performant random access lookups.
Deep Dive into Common Open Formats for Analytical DBMSs https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
It’s actually faster than AWS Athena for me.
Just the other day I used it to transform an unordered 60 GB CSV file with links and texts into a 3 GB parquet file that's so fast I can create a projection for the relevant data of each partition in like a minute (which then fits in memory).
It has some minor stability issues so I'm not sure I'd build a full blown application on top of it, but for data transformation tasks it's amazing.
And no you don’t have to know the exact parquet file. You would treat the Hive partitioned data as a single dataset and DuckDB will scan it automatically. (Partition elimination, predicate pushdown etc all done automatically)
We are in the same scenario (querying lots of Parquet files in S3) and we noticed that DuckDB quickly crashes with OOMs in environments with a few gigs of RAM.
Setting the memory limit or configuring a disk swap file has not helped.
I used to have to chunk my data to avoid OOMs but I haven’t had to do that.
That said, there are a few more out-of-core strategies on the roadmap that have not yet been implemented. If you still get OOMs, chunking your data will help.
Also consider that a few gigs of RAM might not be large enough for your workload. Out-of-core strategies can only do so much.
(Head of Produck at MotherDuck and a huge fan of DuckDB)
To my knowledge the only databases that can be described as "military-grade" in terms of testing are SQLite and Postgres.
* the 1 billion row benchmarks are run on a single, uncompressed 50 GB CSV file. 50 GB should be stored in multiple files.
* the benchmarks only show the query runtime once the data has been persisted in memory. They should also show how long it takes to persist the data in memory. If query_engine_A takes 5 mins to persist in memory & 10 seconds to run the query and query_engine_B takes 2 mins to persist in memory & 20 seconds to run the query, then the amount of time to persist the data is highly relevant.
* benchmarks should also show results when the data isn't persisted in memory.
* Using a Parquet file with column pruning would make a lot more sense than a huge CSV file. The groupby dataset has 9 columns and some of the queries only require 3 columns. Needlessly persisting 6 columns in memory is really misleading for some engines.
* Some of the engines have queries that are more optimized than others. Some explicitly cast columns to int32 while others presumably use int64. The queries should be apples-to-apples across engines.
* Some engines are parallel and lazy. "Running" some of these queries is hard because lazy engines don't want to do work unless they have to. The authors have forced some of these queries to run by persisting in memory, which is another step, so that should be investigated.
* There are obvious missing query types like filtering and "compound queries" like filter, join, then aggregate.
I like these benchmarks a lot and use the h2o datasets locally all the time, but the methodology really needs to be modernized. At the bottom you can see "Benchmark run took around 105.3 hours." This is way too slow, and there are some obvious fixes that'll make the results more useful for the data community.
For a generic OLAP db, maybe. In this case, though, a single file fits one of DuckDB's use cases: analytics on the data consumed or produced by a data scientist. In such a scenario, it's not uncommon to have a multi-GB input or to dump GBs of a dataframe into a single CSV file.
I still seem to run into the workflow problem where data has to be in proximity to compute in order to function.
If I need to run joins on a 5-10 GB parquet / table, unless I have that sitting locally, the performance bottleneck is not the database.
I still find myself reaching for Databricks / Spark for most tasks for this reason.
I suppose this is what Motherduck is trying to solve? But it just doesn't feel like it's quite there yet for me. Anybody who is better at this stuff than me have thoughts?
Yes, we are indeed a good use case for this. For one, we built a fully-fledged managed storage system on top of DuckDB, with better performance and caching and the like. Two, we're going to be pretty good at reading from S3 because we've optimized that path. Three, our storage has sharing/IAM and is about to have things like zero-copy clone and time travel.
Happy to answer any Qs.
https://duckdb.org/docs/archive/0.9.1/extensions/postgres_sc...
One of the reasons we exist is because DuckDB is meant to be a single-player database. MotherDuck is doing tons of heavy lifting to turn it into a true multi-player data warehouse, with things like IAM/sharing, persistence, time travel, administration, the ecosystem, and so forth.
What's magical about MotherDuck is that virtually any DuckDB instance in the wild can connect to MotherDuck by simply running '.open motherduck:' [1], and suddenly you get all these aforementioned benefits.
(head of produck at MotherDuck)
[1] https://motherduck.com/docs/getting-started/connect-query-fr...
Any DuckDB result is easily converted to Pandas (by appending .df()) or Polars (by appending .pl()).
The conversion to Polars is effectively instantaneous because it's zero-copy: everything goes through the Arrow in-memory format.
So I usually write complex queries in DuckDB SQL but if I need to manipulate it in polars I just convert it in my workflow midstream (only takes milliseconds) and then continue working with that in DuckDB. It’s seamless due to Apache Arrow.
There's nothing in the architecture that would make it so that performance in one OLAP engine is fundamentally impossible to achieve in another.
But I still think it's surprising how much legs the Python model of wrapping C/C++/Rust libraries has. I would have assumed that if Python is the one calling the libraries, you can't do lazy evaluation, and so you hit a wall like Pandas does.
But we've seen with PyTorch compilation and Polars that you can have your cake and eat it too: keep the ease of use of Python while getting performance, given enough engineering.
When I checked at 0.7.1, when working with ~90 S3 parquet objects (x0000 rows each, so not too many) it was 25-50% faster to first download them in Go and then query them, rather than using the DuckDB S3 extension with those objects directly (the whole execution ran on the order of a couple hundred milliseconds).
It’s not a DuckDB specific issue (although there’s headroom for improvement — I don’t think DuckDB’s S3 connector is highly optimized). It’s S3.
The overhead of fetching from S3 via a naive Go implementation (goroutine per object) to disk and then running duckdb on that was lower than using duckdb end-to-end.
I was measuring the S3 overhead in both cases.
Those are written to offer scale over large data so have very different overheads and limits compared to something like duckdb. Seems odd to have them on the same chart.
Also, side note but I'd love to see the performance impact of pandas/dask with pyarrow schemas.
> Fugue provides an easier interface to using distributed compute effectively and accelerates big data projects. It does this by minimizing the amount of code you need to write, in addition to taking care of tricks and optimizations that lead to more efficient execution on distributed compute. Fugue ports Python, Pandas, and SQL code to Spark, Dask, and Ray.
https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIC...