For example, I helped an Israeli NGO analyze retailer pricing data (supermarkets there must publish their prices every day by law). Pandas chokes on data that large; Postgres can handle it, but aggregations are very slow. DuckDB is lightning fast.
The traditional alternative I’m familiar with is Spark, but it’s such a hassle to set up, expensive to run, and not as fast on these kinds of use cases.
I will note that familiarity with Parquet and how columnar engines work is helpful. I have gotten tremendous performance increases by storing the data sorted in a Parquet file, at the cost of some ETL overhead.
Still, it’s a very powerful and convenient tool for working with large datasets locally.
Think of it as a SQL engine for ad-hoc querying larger-than-memory datasets.
But you (or at least I) wouldn’t use it like a standard DB where data is constantly being written in; rather, it’s a tool for effectively analyzing data that’s already somewhere.
I don’t understand what it means. Can someone explain? I don’t get why they put such a complicated claim with unexplained acronyms on their homepage.
When I shop for a DB, when should I consider DuckDB compared to, for example, Postgres or MySQL? Or does it compete with Arrow or Parquet? To me it’s unclear because they don’t say what they compete against.
Think of SQLite.
> OLAP
Think data warehouse: columnar storage for analytical workloads.
If you want something "in-process" then you're probably going to have to decide between SQLite and DuckDB.
If your workload is
1) individual, fast, and frequent read-write operations (OLTP), then you should probably pick SQLite.
2) massive amounts of read-heavy analytical operations (OLAP), then you should probably pick DuckDB.
That's the decision process stated as simply as possible, but obviously there might be other options out there to consider.

Postgres and MySQL are really better suited for out-of-process (shared server) workloads where multiple clients are interacting with the data and resources/compute. They are both row-based OLTP databases, although I do believe Postgres has options covering both table types (HTAP).
Parquet is a file format, just as Avro, JSON, and CSV are.
Arrow (still grasping this one myself) is a standardized in-memory format for exchanging data between systems and processes, laid out so that the receiving side doesn't have to go through the extra step of shuffling the data around in memory. For example, the data returned from a SQL query can be used directly in a Python/Scala/etc. dataframe if both sides use Arrow.
I empathize with you about how confusing it all seems, but your curiosity will serve you well. I remember asking these very same questions, and being led down this road is what opened my mind to the world of databases and data engineering.
Google "olap vs oltp" and when you get that, then google "olap vs oltp vs htap".
Or maybe read "The Log: What every software engineer should know about real-time data's unifying abstraction". I know how lame it sounds, but this article really did change the way I think about data.
https://engineering.linkedin.com/distributed-systems/log-wha...
Now THAT is easy to understand. Thank you.
DuckDB is for when you need to do OLAP analysis and the data fits on a single node (your laptop), but it's too large for plain Excel.
Technically you could use PG/MySQL/Python+NumPy+Pandas to process that data for this use case as well, but DuckDB does it more easily and faster most of the time.
If you are a data analyst, analyzing a lot of data that isn't updated in real time, doing joins and aggregates (which are usually column-wise functions), you'd use something like an OLAP DB, and DuckDB is great for that!
It's like SQLite (OLTP), but for OLAP.
Meta comment: it's fascinating to me that so many people seem to have never heard of OLAP databases.
Not in the utterly pedestrian CRUD apps I write, anyway.
I find this 'support for pluggable database engines' intriguing. Not least because I can then claim to have used all of the database engines in anger :-)
(I know that it is probably a dumb question due to the following, but I asked anyway: https://en.wikipedia.org/wiki/SQL#Interoperability_and_stand...)
Check out Postgres Foreign Data Wrappers. That might be the best-known approach for accessing one database through another. The Supabase team wrote an interesting piece about this recently.
https://supabase.com/blog/postgres-foreign-data-wrappers-rus...
You might also want to try out DuckDB's approach to reading other DBs (or DB files). They talk about how they can "import" a SQLite DB in the above 0.7.0 announcement, and they also have some other examples in their duckdblabs GitHub org. Check out their "...-scanner" repos:
https://news.ycombinator.com/item?id=34741195 (160 points/2 days ago/97 comments)
Also:
https://news.ycombinator.com/item?id=33612898 – DuckDB 0.6.0 (36 points/89 days ago)
https://news.ycombinator.com/item?id=31355050 – Friendlier SQL with DuckDB (366 points/9 months ago/133 comments)
And many others as per dang's comment:
I would have been upset missing this announcement and related commentary if I hadn't seen it before being marked a dupe.
This was #1 before disappearing, and now the current #1 is a blog post about ClickHouse.
DuckDB – An in-process SQL OLAP database management system - https://news.ycombinator.com/item?id=34741195 - Feb 2023 (99 comments)
HN operates on the basis of not having too much repetition on the front page. As seen at https://news.ycombinator.com/item?id=34746724, there have also been lots of other DuckDB threads in recent months.
I realize a new release is rightly significant to the people working on the product and/or who are users of the product, and it would have been better for the major thread not to just be a generic post about the project. However, that distinction isn't as salient from a HN discussion point of view, because either way, the thread will fill up with comments about the product in general. You can see that quite clearly in the current thread. The important criterion from an HN point of view is "is this submission different enough to support a substantially different discussion", and in this case the answer is no, so the moderation call was correct.
It's quite impossible to learn about every major release of every major product from HN—frontpage space is the scarcest resource we have [1]. The front page could consist of nothing else and you still couldn't learn about them all from HN alone. Nor is that the purpose of the site; the purpose is intellectual curiosity [2]. Curiosity doesn't do well with repetition [3], so the median curious reader isn't served by having two big threads about the same product within days. Of course, we all have at least one project where we would love to see that, but it's a different choice in everyone's case and we have to try to serve everybody.
[1] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
[2] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...
[3] https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...
Kudos to the team for their consistently useful, interesting work. They really seem to know their audience well and to have a well-thought-out feature roadmap.
Makes you wonder whether a single, well-specced box running DuckDB is going to be 2024's Databricks killer.
Much like Redis, I admire the technology but can't think of a project I've worked on that would benefit from it.
Is it for games, maybe? Desktop or mobile apps?
DuckDB lets you process all that locally. It's the OLAP equivalent to SQLite's OLTP.
If I weren't so beholden to the vagaries and inefficiencies of C-level-endorsed enterprise software, I'd immediately be trying this out for data transformations/pipelines. I think that one big box (200+ GB RAM, a couple of cores, and fat IO/network) runs circles around an entire Spark cluster.
Is there a reason "in-core" is a specific requirement here?
I can totally see how not having to manage a standalone RDBMS makes sense. But, what's the real-world advantage over something like SQLite?
I mean, the idea of an in-memory relational engine for things like games or embedded totally makes sense, but this seems to target large datasets and deep analysis.
As far as I understand, with this model you pretty much re-ingest data from the "raw" source on startup every time. Is this correct?
Judging by the rise in interest, I'm sure there's an obvious use case I'm not seeing either.
This very specific question is what I'm trying to understand. SQLite can be run in memory and as a temporary store.
This is definitely a pretty niche case, though, so there must be something more general that this was built to do.
This has to be the main point, right? DuckDB isn't the first mover here (SQL.js, which is SQLite compiled to WASM using emscripten, seems to work fine), but perhaps DuckDB is better as a purpose-built solution.
If so, that's the opposite of what DuckDB does. Under the "When not to use DuckDB" section of their website, they say:
> "[Do not use DuckDB when] writing to a single database from multiple concurrent processes"
Honestly, that's the most baffling part to me. I can't imagine wanting a database that's locked to a single process.
Really smooth, love it!