We already had Pandas or similar if we wanted to load and transform some data in memory. SQL is nicer than the Pandas API, but I'm not sure that's a killer feature?
If we have a lot of data and multiple people working with it then it makes sense to centralise it in a database or warehouse where it’s then easy to access via SQL anyway.
We can query files on S3 with it and do the processing locally, but then we have network latency because compute and storage are further apart. There's a cost benefit, though, because we don't need to run a server, and that could be significant.
It’s nicely implemented but I’m not seeing a big gap that it fills?
Lots of users: (a) mentally align better with SQL than the pandas API, regardless of whether they know both; (b) want decent performance on their analyses, which they aren't getting from pandas, and won't get from many OLTP databases they're using over the wire; (c) want easy access to Parquet and CSV, locally and remotely, with minimal development overhead. It's super simple with DuckDB.
Nonetheless, some of the other things you pointed out are real tradeoffs. We're building a serverless cloud capability at MotherDuck on top of DuckDB to address some of these and optimize compute and storage based on data locality, bandwidth, and the need for collaboration amongst multiple people.
Sure, the complete and most up-to-date version of your data can be stored in a warehouse/database, but when you're slicing/dicing/filtering/sorting/exploring/munging data, it can get quite expensive to have your warehouse/database servicing those requests. Even if it's a cloud/modern warehouse like Snowflake. This is especially true if you're doing analytics on a non-OLAP database.
With duckdb you could pull down the subset of data you're working on from your warehouse/db/datalake, and then perform the last-minute analytics "in-process". That might be in your notebook, browser (WASM), etc. As a result you can expect some pretty amazing query performance, since it's all happening locally, on the subset of the data that you've selected.
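As a hedged sketch of that "pull down a subset, then iterate locally" pattern (bucket path, table, and column names are all made up here), reading a Parquet export from S3 via DuckDB's httpfs extension:

```sql
-- one-time: enable remote reads over HTTP/S3
INSTALL httpfs;
LOAD httpfs;

-- materialize just the slice you care about into a local table
CREATE TABLE working_set AS
SELECT user_id, event_type, event_ts
FROM read_parquet('s3://my-bucket/events/*.parquet')  -- hypothetical bucket
WHERE event_ts >= DATE '2023-01-01';

-- subsequent exploratory queries run entirely locally
SELECT event_type, count(*)
FROM working_set
GROUP BY 1
ORDER BY 2 DESC;
```

Only the first query touches the network; everything after that is local scans over the materialized subset.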
Then if you still want to use pandas you can point it at DuckDB and let pandas fill in the deficiencies of SQL (while still potentially pushing down some SQL to DuckDB). And you can take those dataframes and push them right back into DuckDB instead of writing back out to disk.
> We can query files on S3 with it and have the processing locally, but then we have network latency because compute and storage are further apart.
If your files on S3 are in a sensible form (Hive, Iceberg, etc.), then you can also use DuckDB to pull only the data you need from your bucket and work on it locally.
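For example (paths and partition keys here are hypothetical), with Hive-partitioned Parquet on S3, DuckDB can use the partition values in the paths plus Parquet's own metadata to fetch only what the filter needs:

```sql
-- partition columns (year, month) are derived from the directory names
SELECT count(*)
FROM parquet_scan('s3://my-bucket/logs/*/*/*.parquet', HIVE_PARTITIONING=1)
WHERE year = 2023 AND month = 1;
```

Files in non-matching partitions are never downloaded at all.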
In short, it's not about having a single killer feature, but about being a low-friction solution to my problems.
It’s a local columnar engine that I can use inside a Jupyter notebook. This lowers my cost of iteration tremendously.
Yes I can query data from Postgres and munge with Pandas.
But what if I need to iterate on a large set of parquet files (mine is 200gb on my local machine, Hive partitioned, over a billion records) and munge them with complex SQL with a high perf engine? And seamlessly join with other smaller local datasets (there are always smaller datasets that contain metadata) in CSV, Pandas and JSON format in the same SQL statement?
This is a surprisingly common use case in a lot of data science work and prior to DuckDB you could not do it easily, ergonomically or quickly with a single tool. The authors of DuckDB talked to lots of data scientists to learn their pain points and the final product shows that they really listened well.
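A sketch of the kind of single statement described above (all file, table, and column names are invented for illustration), mixing Parquet, CSV, and JSON sources in one query:

```sql
SELECT e.device_id, d.device_name, c.country_name, count(*) AS events
FROM read_parquet('events/*/*.parquet', hive_partitioning=1) AS e  -- the big Hive-partitioned dataset
JOIN read_csv_auto('device_metadata.csv') AS d USING (device_id)   -- small local CSV of metadata
JOIN read_json_auto('countries.json') AS c USING (country_code)    -- small local JSON lookup
GROUP BY 1, 2, 3;
```

A registered pandas dataframe can be referenced by name in the same way when running through the Python client.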
https://github.com/multiprocessio/dsq#comparisons
Realistically though aside from the variety of input formats that DuckDB doesn't (yet) support, I think most people should probably use DuckDB or ClickHouse-local. Tools like dsq can provide broader support or a slightly simpler UX in some cases (and even that is obviously debatable). But I think the future is more the DuckDB or ClickHouse-local way.
dsq may end up being a frontend over DuckDB some day.
`alias dckr='docker run --rm -it -v $(pwd):/data -w /data duckerlabs/ducker'`
then `dckr` gives you a DuckDB shell with PRQL, httpfs, json, parquet, postgres, sqlite, and substrait enabled.
For example, to get the first 5 lines of a CSV file named "albums.csv", you could run it with PRQL:

```
dckr -c 'from `albums.csv` | take 5;'
```
With their Postgres scanner[0] you can now easily query multiple datasources using SQL and join between them (e.g. a Postgres table with a JSON file). Something I previously strove to build with OctoSQL[1]. There's even predicate push-down to the underlying database (for Postgres)!
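A rough sketch of such a cross-source join (the connection string, schema, table, and file names are placeholders, and the extension's name has varied across DuckDB versions), using the `postgres_scan` table function:

```sql
INSTALL postgres_scanner;
LOAD postgres_scanner;

-- join a live Postgres table against a local JSON file in one statement
SELECT u.id, u.email, o.total
FROM postgres_scan('dbname=mydb host=localhost', 'public', 'users') AS u
JOIN read_json_auto('orders.json') AS o ON o.user_id = u.id;
```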
It's amazing to see how quickly DuckDB is adding new features.
Not a huge fan of C++, which is currently what's used for authoring extensions. It'd be really cool if somebody implemented a Rust extension SDK, or even something like what Steampipe[2] does for Postgres FDWs: a shim for quickly implementing non-performance-sensitive extensions for various things.
Godspeed!
[0]: https://duckdb.org/2022/09/30/postgres-scanner.html
[1]: https://github.com/cube2222/octosql
[2]: https://steampipe.io
Limited and unclear pushdown of filters is one of the issues; another is that certain data types aren't handled, which makes it impossible to scan the table at all (even if the column in question isn't used).
I think that DuckDB is also missing a PostgreSQL logical replication driver to continuously replicate a subset of tables you want to run stats on.
Syncing the full table every time is too slow.
My own Trustfall query engine is one of them: https://github.com/obi1kenobi/trustfall
For example, you can query the HackerNews APIs from your browser: "Which Twitter/GitHub users comment on stories about OpenAI?" https://play.predr.ag/hackernews#?f=1&q=IyBDcm9zcyBBUEkgcXVl...
One of its real-world use cases is at the core a Rust semver linter: https://predr.ag/blog/speeding-up-rust-semver-checking-by-ov...
Would just be a nice gui feature to have.
Was looking for something to read JSON files that will house the config via SQL, i.e. a human-readable DB as an alternative to what the BI tool is using for its config persistence.
Will give DuckDB a go, thanks for posting!
If that’s of interest, you can read our launch HN here: https://news.ycombinator.com/item?id=28304781
One of our community members has built a pretty cool DuckDB + dbt + evidence data stack that you can run entirely in GitHub Codespaces. He's calling it "modern data stack in a box".
You can see that repo here: https://github.com/matsonj/nba-monte-carlo
My plan is to PoC a tool that lets you edit Metabase config as files, and secondly something that can replicate cloud instances to other environments, like a local Docker image or a staging instance.
Often I find that the database wants to be the authority and that makes working with different formats a bit uncomfortable.
Join us and help!
Do you mean polars reading Parquet into DuckDB to process that amount of data?
is a good example of an actual database that uses plaintext files in your filesystem.
I can see the argument that doing this with JSON is better (or worse), but regardless, Recutils is an interesting idea that I wish more people knew about. I can imagine a lot of cool things emerging if people iterated on the idea.
I'd like to understand more about what that means. Does it use multiple threads each reading from a different position in the file?
I tried it against a large-ish (hundreds of MB) log file; the query was simple to write and it spat out the result very quickly.
Impressed!
I was about to comment that this is all fantastic stuff, but I've found reading through the DuckDB docs quite challenging. For these JSON table functions, though, the documentation looks much better.
https://duckdb.org/docs/extensions/json
Need to spend some more time digging in, but this json functionality combined with some kind of file partitioning (Hive or hive-like) looks promising for some of my use cases.
Incidentally, the documentation for the Hive/Parquet stuff is a good example of what I'm talking about above. For the `parquet_scan` function, where can I see all of the possible function parameters? Where can I get more information about the specifics of `FILENAME`, `HIVE_PARTITIONING`, etc.?
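For the JSON-plus-partitioning combination mentioned above, a hedged sketch (the paths are invented, and exact option support for JSON readers depends on the DuckDB version):

```sql
-- read all JSON files under one date-partitioned directory
SELECT *
FROM read_json_auto('logs/date=2023-03-01/*.json');

-- globbing across partitions works too
SELECT count(*)
FROM read_json_auto('logs/*/*.json');
```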
JSON is a row-based file format. It doesn't allow query engines to skip rows or columns when running queries, so all the data needs to be read into memory. That's really inefficient.
Column-based file formats (e.g. Parquet) let query engines skip entire columns of data. Parquet also stores metadata on row groups, which lets query engines skip rows when reading data. These enhancements can speed up queries anywhere from not at all to 100x or more, depending on how much data is skipped.
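To make the skipping concrete (file and column names here are hypothetical): a query like the one below only needs to read the `amount` and `sale_date` columns, and Parquet's per-row-group min/max statistics let the engine skip any row group whose date range can't match the filter.

```sql
SELECT sum(amount)
FROM parquet_scan('sales.parquet')
WHERE sale_date >= DATE '2023-01-01';
```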
Data Lakehouse storage systems abstract the file metadata to a separate layer, which is even better than storing it in the file footer like Parquet does.
This DuckDB functionality is cool, but I think it's best used to convert JSON files to Parquet / a Lakehouse storage system, and then query those instead. JSON is a really inefficient file format for running queries.
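That convert-then-query workflow might look like this in DuckDB (file names are placeholders):

```sql
-- one-off conversion: JSON in, Parquet out
COPY (SELECT * FROM read_json_auto('events.json'))
TO 'events.parquet' (FORMAT PARQUET);

-- subsequent queries hit the columnar file instead
SELECT count(*) FROM parquet_scan('events.parquet');
```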
jq's syntax is deliberately terse which works really really well for "one-liners", while sql queries tend to be more verbose.
I like the simplicity of DuckDB's proposal, but haven't seen much info about how fast to expect it to be in comparison with traditional RDBs, for smaller, mostly-read-only applications.
For a dataset that size, I'd probably use SQLite to avoid having to manage a persistent MySQL process, especially when it's being used as an alternative to CSV files. That is, unless there's a MySQL/Postgres server already running I can just create a new database on.
DuckDB automatically creates indexes for all general-purpose columns. However, they're not persisted.
*I use the HoneySQL (clojure) library to programmatically build up queries and execute them via the JDBC driver.
[1] https://www.vantage.sh/blog/querying-aws-cost-data-duckdb
1) Needed to increase `maximum_object_size`. 2) "Unexpected yyjson tag in ValTypeToString".
Couldn't find a reference anywhere to that error. Loads into Snowflake without a hitch - which is where I normally query large JSON files.
Could you perhaps open an issue [1] or share the file with us so we could investigate the problem?
Meanwhile, doing the same with jq (`jq '.[0]'`) completed in 11 seconds and consumed about 2.8GB of RAM.
I love DuckDB, but it does seem like something isn't right here.
* duckdb example: https://siuba.org/guide/workflows-backends.html#duckdb
* supported methods: https://siuba.org/guide/ops-support-table.html
See my comment above: https://news.ycombinator.com/item?id=35027712
edit: Looks like DuckDB lets you use SQL-style queries
Will follow DuckDB.
- CSV: 166mb
- JSON: 795mb
That said, not all data is tabular. DuckDB already supports Parquet, which supports structs and is a very good format for storing data for reporting workloads. But JSON is a standard interchange format, so a lot of people are going to want to do something with the JSON payloads they receive from API calls.
I could definitely imagine a workload where you receive JSON from an API call, load it into DuckDB or similar to help with ETL, then store results in Parquet.
If I were building a system that had to deal with large amounts of tabular data that isn't directly consumed by humans, JSON wouldn't be my first choice nor my last.