For example, I helped an Israeli NGO analyze retailer pricing data (supermarkets there must publish their prices every day by law). Pandas chokes on data that large; Postgres can handle it, but aggregations are very slow. DuckDB is lightning fast.
The traditional alternative I’m familiar with is Spark, but it’s such a hassle to set up, expensive to run, and not as fast on these kinds of use cases.
I will note that familiarity with Parquet and how columnar engines work is helpful. I have gotten tremendous performance increases by storing the data sorted in a Parquet file, at the cost of some ETL overhead.
Still, it’s a very powerful and convenient tool for working with large datasets locally.
Think of it as a SQL engine for ad-hoc querying larger-than-memory datasets.
But you (or at least I) wouldn’t use it like a standard DB where data is constantly being written in; rather, it’s a tool for effectively analyzing data that’s already somewhere.
I don’t understand what it means. Can someone explain? I don’t get why they put such a complicated claim with unexplained acronyms on their homepage.
When I shop for a DB, when should I consider DuckDB compared to, for example, Postgres or MySQL? Or does it compete with Arrow or Parquet? To me it’s unclear because they don’t say what they compete against.
Think of SQLite.
> OLAP
Think data warehouse: columnar storage for analytical workloads.
If you want something "in-process" then you're probably going to have to decide between SQLite and DuckDB.
If your workload is
1) individual, fast, and frequent read-write operations (OLTP), then you should probably pick SQLite.
2) massive amounts of read-heavy analytical operations (OLAP), then you should probably pick DuckDB.
That's the decision process stated as simply as possible, but obviously there might be other options out there to consider.

Postgres and MySQL are really better suited for out-of-process (shared server) workloads where multiple clients are interacting with the data and resources/compute. They are both row-based OLTP databases, although I do believe Postgres has options covering both table types (HTAP).
Parquet is a file format, just as Avro, JSON, and CSV are.
Arrow (still grasping this one myself) is a standardized in-memory format for exchanging data between systems and processes, laid out so that the receiving side doesn't have to go through the extra step of shuffling the data around in memory. For example, the data returned from a SQL query can be used directly in a Python/Scala/etc. dataframe if both sides use Arrow.
I empathize with you about how confusing it all seems, but your curiosity will serve you well. I remember asking these very same questions, and being led down this road is what opened my mind to the world of databases and data engineering.
Google "olap vs oltp" and when you get that, then google "olap vs oltp vs htap".
Or maybe read "The Log: What every software engineer should know about real-time data's unifying abstraction". I know how lame it sounds, but this article really did change the way I think about data.
https://engineering.linkedin.com/distributed-systems/log-wha...
Now THAT is easy to understand. Thank you.
DuckDB is for when you need to do OLAP analysis and the data fits on a single node (your laptop), but it's too large for plain Excel.
Technically you could use PG/MySQL/Python+NumPy+Pandas to process that data for this use case as well, but DuckDB does it more easily and faster most of the time.
If you are a data analyst, analyzing a lot of data that isn't updated in real time, doing joins and aggregates (which are usually column-wise functions), you'd use something like an OLAP DB, and DuckDB is great for that!
It's like SQLite (OLTP), but for OLAP.
Meta comment: it's fascinating to me that so many people seem to have never heard of OLAP databases.
Not in the utterly pedestrian CRUD apps I write, anyway.
I find this 'support for pluggable database engines' intriguing. Not least because I can then claim to have used all of the database engines in anger :-)
(I know that it is probably a dumb question due to the following, but I asked anyway: https://en.wikipedia.org/wiki/SQL#Interoperability_and_stand...)
Check out Postgres Foreign Data Wrappers. That might be the best-known approach for accessing one database through another. The Supabase team wrote an interesting piece about this recently.
https://supabase.com/blog/postgres-foreign-data-wrappers-rus...
You might also want to try out DuckDB's approach to reading other DBs (or DB files). They talk about how they can "import" a SQLite DB in the above 0.7.0 announcement, and they also have some other examples in their duckdblabs GitHub org. Check out their "...-scanner" repos:
https://news.ycombinator.com/item?id=34741195 (160 points/2 days ago/97 comments)
Also:
https://news.ycombinator.com/item?id=33612898 – DuckDB 0.6.0 (36 points/89 days ago)
https://news.ycombinator.com/item?id=31355050 – Friendlier SQL with DuckDB (366 points/9 months ago/133 comments)
And many others as per dang's comment:
I would have been upset missing this announcement and related commentary if I hadn't seen it before being marked a dupe.
This was #1 before disappearing, and now the current #1 is a blog post about ClickHouse.
DuckDB – An in-process SQL OLAP database management system - https://news.ycombinator.com/item?id=34741195 - Feb 2023 (99 comments)
HN operates on the basis of not having too much repetition on the front page. As seen at https://news.ycombinator.com/item?id=34746724, there have also been lots of other DuckDB threads in recent months.
I realize a new release is rightly significant to the people working on the product and/or who are users of the product, and it would have been better for the major thread not to just be a generic post about the project. However, that distinction isn't as salient from a HN discussion point of view, because either way, the thread will fill up with comments about the product in general. You can see that quite clearly in the current thread. The important criterion from an HN point of view is "is this submission different enough to support a substantially different discussion", and in this case the answer is no, so the moderation call was correct.
It's quite impossible to learn about every major release of every major product from HN—frontpage space is the scarcest resource we have [1]. The front page could consist of nothing else and you still couldn't learn about them all from HN alone. Nor is that the purpose of the site; the purpose is intellectual curiosity [2]. Curiosity doesn't do well with repetition [3], so the median curious reader isn't served by having two big threads about the same product within days. Of course, we all have at least one project where we would love to see that, but it's a different choice in everyone's case and we have to try to serve everybody.
[1] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
[2] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...
[3] https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...
Kudos to the team for their consistently useful, interesting work. They really seem to know their audience well and to have a well-thought-out feature roadmap.
Makes you wonder whether a single, well-specced box running DuckDB is going to be 2024's Databricks killer.
Much like Redis, I admire the technology but can't think of a project I've worked on that would benefit from it.
Is it for games, maybe? Desktop or mobile apps?
DuckDB lets you process all that locally. It's the OLAP equivalent to SQLite's OLTP.
If I weren't so beholden to the vagaries and inefficiencies of C-level-endorsed enterprise software, I'd immediately be trying this out for data transformations/pipelines. I think that one big box (200+ GB RAM, a couple of cores, and fat IO/network) runs circles around an entire Spark cluster.
Is there a reason "in-core" is a specific requirement here?
I can totally see how not having to manage a standalone RDBMS makes sense. But, what's the real-world advantage over something like SQLite?
I mean, the idea of an in-memory relational engine for things like games or embedded totally makes sense, but this seems to target large datasets and deep analysis.
As far as I understand, with this model you pretty much re-ingest data from the "raw" source on startup every time. Is this correct?
Judging by the rise in interest, I'm sure there's an obvious use case I'm not seeing either.
This very specific question is what I'm trying to understand. SQLite can be run in memory and as a temporary store.
This is definitely a pretty niche case, though, so there must be something more general that this was built to do.
This has to be the main point, right? DuckDB isn't the first mover here (SQL.js, which is SQLite compiled to WASM using emscripten, seems to work fine), but perhaps DuckDB is better as a purpose-built solution.
If so, that's the opposite of what DuckDB does. Under the "When not to use DuckDB" section of their website, they say:
> "[Do not use DuckDB when] writing to a single database from multiple concurrent processes"
Honestly, that's the most baffling part to me. I can't imagine wanting a database that's locked to a single process.
Really smooth, love it!