DuckDB 0.9 - https://news.ycombinator.com/item?id=37657736 - Sept 2023 (59 comments)
It's not about DuckDB's performance improvements per se; it's about the change to the AWS instance type they're using to get fairer benchmark results by avoiding block storage and preventing noisy neighbors.
"The team at DuckDB Labs has been hard at work improving the performance of the out-of-core hash aggregates and joins."
Aside from that, since April DuckDB has vastly improved performance of a number of queries, including aggregates, joins, window functions. [0]
[0] https://duckdb.org/2023/09/26/announcing-duckdb-090.html#cor...
(Submitted title was "DuckDB performance improvements with the latest release")
ClickBench (from ClickHouse) has some benchmarks[1] where it can be compared, but I'm not sure how up to date they are. At least a while back they were badly out of date, and I haven't looked closely at whether they're keeping things fair for everyone else :)
The draft[1] includes a comparison to DuckDB and preliminary benchmark results.
I tried converting one of my data lake Parquet files to Arrow, and the size difference is staggering: a 20 MB Parquet file became a 700 MB Arrow file.
Doesn't seem fit for a data lake at all.
> Parquet and Arrow are complementary technologies, and they make some different design tradeoffs. In particular, Parquet is a storage format designed for maximum space efficiency, whereas Arrow is an in-memory format intended for operation by vectorized computational kernels.
> The major distinction is that Arrow provides O(1) random access lookups to any array index, whilst Parquet does not. In particular, Parquet uses dremel record shredding, variable length encoding schemes, and block compression to drastically reduce the data size, but these techniques come at the loss of performant random access lookups.
Deep Dive into Common Open Formats for Analytical DBMSs https://www.vldb.org/pvldb/vol16/p3044-liu.pdf
It’s actually faster than AWS Athena for me.
Just the other day I used it to transform an unordered 60 GB CSV file with links and texts into a 3 GB parquet file that's so fast I can create a projection for the relevant data of each partition in like a minute (which then fits in memory).
It has some minor stability issues so I'm not sure I'd build a full blown application on top of it, but for data transformation tasks it's amazing.
And no you don’t have to know the exact parquet file. You would treat the Hive partitioned data as a single dataset and DuckDB will scan it automatically. (Partition elimination, predicate pushdown etc all done automatically)
We are in the same scenario (querying lots of Parquet files in S3) and we noticed that DuckDB quickly crashes with OOMs in environments with a few gigs of RAM.
Setting the memory limit or configuring a disk swap file has not helped.
I used to have to chunk my data to avoid OOMs but I haven’t had to do that.
That said, there are a few more out-of-core strategies on the roadmap that have not yet been implemented. If you still get OOMs, chunking your data will help.
Also consider that a few gigs of RAM might not be large enough for your workload. Out-of-core strategies can only do so much.
(Head of Produck at MotherDuck and a huge fan of DuckDB)
To my knowledge the only databases that can be described as "military-grade" in terms of testing are SQLite and Postgres.
* the 1 billion row benchmarks are run on a single, uncompressed 50 GB CSV file. 50 GB should be stored in multiple files.
* the benchmarks only show the query runtime once the data has been persisted in memory. They should also show how long it takes to persist the data in memory. If query_engine_A takes 5 mins to persist in memory & 10 seconds to run the query and query_engine_B takes 2 mins to persist in memory & 20 seconds to run the query, then the amount of time to persist the data is highly relevant.
* benchmarks should also show results when the data isn't persisted in memory.
* Using a Parquet file with column pruning would make a lot more sense than a huge CSV file. The groupby dataset has 9 columns and some of the queries only require 3 columns. Needlessly persisting 6 columns in memory is really misleading for some engines.
* Some of the engines have queries that are more optimized than others. Some explicitly cast columns to int32 while others presumably use int64. The queries should be apples-to-apples across engines.
* Some engines are parallel and lazy. "Running" some of these queries is hard because lazy engines don't want to do work unless they have to. The authors have forced some of these queries to run by persisting in memory, which is another step, so that should be investigated.
* There are obvious missing query types like filtering and "compound queries" like filter, join, then aggregate.
I like these benchmarks a lot and use the h2o datasets locally all the time, but the methodology really needs to be modernized. At the bottom you can see "Benchmark run took around 105.3 hours." This is way too slow, and there are some obvious fixes that'll make the results more useful for the data community.
For a generic OLAP db, maybe. In this case, though, a single file fits one of DuckDB's use cases: analytics on the data consumed or produced by a data scientist. In such a scenario, it's not uncommon to have a multi-GB input or to dump GBs of a dataframe into a single CSV file.
I still seem to run into the workflow problem where data has to be in proximity to compute in order to function.
If I need to run joins on a 5-10 GB parquet / table, unless I have that sitting locally, the performance bottleneck is not the database.
I still find myself reaching for Databricks / Spark for most tasks for this reason.
I suppose this is what Motherduck is trying to solve? But it just doesn't feel like it's quite there yet for me. Anybody who is better at this stuff than me have thoughts?
Yes, we are indeed a good use case for this. For one, we built a fully-fledged managed storage system on top of DuckDB, with better performance and caching and the like. Two, we're going to be pretty good at reading from S3 because we've optimized that path. Three, our storage has sharing/IAM and is about to have things like zero-copy clone and time travel.
Happy to answer any Qs.
https://duckdb.org/docs/archive/0.9.1/extensions/postgres_sc...
One of the reasons we exist is because DuckDB is meant to be a single-player database. MotherDuck is doing tons of heavy lifting to turn it into a true multi-player data warehouse, with things like IAM/sharing, persistence, time travel, administration, the ecosystem, and so forth.
What's magical about MotherDuck is that virtually any DuckDB instance in the wild can connect to MotherDuck by simply running '.open motherduck:' [1], and suddenly you get all these aforementioned benefits.
(head of produck at MotherDuck)
[1] https://motherduck.com/docs/getting-started/connect-query-fr...
Any DuckDB result is easily converted to Pandas (by appending .df()) or Polars (by appending .pl()).
The conversion to Polars is effectively instantaneous because it's zero-copy: everything goes through the Arrow in-memory format.
So I usually write complex queries in DuckDB SQL but if I need to manipulate it in polars I just convert it in my workflow midstream (only takes milliseconds) and then continue working with that in DuckDB. It’s seamless due to Apache Arrow.
There's nothing in the architecture that would make it so that performance in one OLAP engine is fundamentally impossible to achieve in another.
But I still think it's surprising how much legs the Python model of wrapping C/C++/Rust libraries has. I would have assumed that if Python is the one calling the libraries, you can't do lazy evaluation, and so you hit a wall like Pandas does.
But we've seen with PyTorch compilation and Polars that you can have your cake and eat it too: keep the ease of use of Python while getting performance, given enough engineering.
When I checked at 0.7.1, when working with ~90 S3 parquet objects (x0000 rows each, so not too many) it was 25-50% faster to first download them in Go and then query them, rather than using the DuckDB S3 extension with those objects directly (the whole execution ran on the order of a couple hundred milliseconds).
It’s not a DuckDB specific issue (although there’s headroom for improvement — I don’t think DuckDB’s S3 connector is highly optimized). It’s S3.
The overhead of fetching from S3 via a naive Go implementation (goroutine per object) to disk and then running duckdb on that was lower than using duckdb end-to-end.
I was measuring the S3 overhead in both cases.
Those are written to offer scale over large data so have very different overheads and limits compared to something like duckdb. Seems odd to have them on the same chart.
Also, side note but I'd love to see the performance impact of pandas/dask with pyarrow schemas.
> Fugue provides an easier interface to using distributed compute effectively and accelerates big data projects. It does this by minimizing the amount of code you need to write, in addition to taking care of tricks and optimizations that lead to more efficient execution on distributed compute. Fugue ports Python, Pandas, and SQL code to Spark, Dask, and Ray.
https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIC...