We already had Pandas or similar if we wanted to load and transform some data in memory. SQL is nicer than the Pandas API, but I'm not sure that's a killer feature?
If we have a lot of data and multiple people working with it then it makes sense to centralise it in a database or warehouse where it’s then easy to access via SQL anyway.
We can query files on S3 with it and do the processing locally, but then we have network latency because compute and storage are further apart. There's a cost benefit, though, because we don't need to run a server, and that could be significant.
It’s nicely implemented but I’m not seeing a big gap that it fills?
Lots of users: (a) mentally align better with SQL than the pandas API, regardless of whether they know both; (b) want decent performance on their analyses, which they aren't getting from pandas, and won't get from many OLTP databases they're using over the wire; (c) want easy access to Parquet and CSV, locally and remotely, with minimal development overhead. It's super simple with DuckDB.
Nonetheless, some of the other things you pointed out are real tradeoffs. We're building a serverless cloud capability at MotherDuck on top of DuckDB to address some of these and optimize compute and storage based on data locality, bandwidth, and the need for collaboration amongst multiple people.
Sure, the complete and most up-to-date version of your data can be stored in a warehouse/database, but when you're slicing/dicing/filtering/sorting/exploring/munging data, it can get quite expensive to have your warehouse/database servicing those requests. Even if it's a cloud/modern warehouse like Snowflake. This is especially true if you're doing analytics on a non-OLAP database.
With duckdb you could pull down the subset of data you're working on from your warehouse/db/datalake, and then perform the last-minute analytics "in-process". That might be in your notebook, browser (WASM), etc. As a result you can expect some pretty amazing query performance, since it's all happening locally, on the subset of the data that you've selected.
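As a hedged sketch of that "pull down a subset, then iterate locally" pattern (bucket path, table, and column names are all made up here), reading a Parquet export from S3 via DuckDB's httpfs extension:

```sql
-- one-time: enable remote reads over HTTP/S3
INSTALL httpfs;
LOAD httpfs;

-- materialize just the slice you care about into a local table
CREATE TABLE working_set AS
SELECT user_id, event_type, event_ts
FROM read_parquet('s3://my-bucket/events/*.parquet')  -- hypothetical bucket
WHERE event_ts >= DATE '2023-01-01';

-- subsequent exploratory queries run entirely locally
SELECT event_type, count(*)
FROM working_set
GROUP BY 1
ORDER BY 2 DESC;
```

Only the first query touches the network; everything after that is local scans over the materialized subset.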
Then if you still want to use pandas you can point it at DuckDB and let pandas fill in the deficiencies of SQL (while still potentially pushing down some SQL to DuckDB). And you can take those dataframes and push them right back into DuckDB instead of writing back out to disk.
> We can query files on S3 with it and have the processing locally, but then we have network latency because compute and storage are further apart.
If your files on S3 are in a sensible form (Hive, Iceberg, etc.), then you can also use DuckDB to pull only the data you need from your bucket and work on it locally.
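For example (paths and partition keys here are hypothetical), with Hive-partitioned Parquet on S3, DuckDB can use the partition values in the paths plus Parquet's own metadata to fetch only what the filter needs:

```sql
-- partition columns (year, month) are derived from the directory names
SELECT count(*)
FROM parquet_scan('s3://my-bucket/logs/*/*/*.parquet', HIVE_PARTITIONING=1)
WHERE year = 2023 AND month = 1;
```

Files in non-matching partitions are never downloaded at all.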
In short, it's not about having a single killer feature, but about being a low-friction solution to my problems.
It’s a local columnar engine that I can use inside a Jupyter notebook. This lowers my cost of iteration tremendously.
Yes I can query data from Postgres and munge with Pandas.
But what if I need to iterate on a large set of parquet files (mine is 200gb on my local machine, Hive partitioned, over a billion records) and munge them with complex SQL with a high perf engine? And seamlessly join with other smaller local datasets (there are always smaller datasets that contain metadata) in CSV, Pandas and JSON format in the same SQL statement?
This is a surprisingly common use case in a lot of data science work and prior to DuckDB you could not do it easily, ergonomically or quickly with a single tool. The authors of DuckDB talked to lots of data scientists to learn their pain points and the final product shows that they really listened well.
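A sketch of the kind of single statement described above (all file, table, and column names are invented for illustration), mixing Parquet, CSV, and JSON sources in one query:

```sql
SELECT e.device_id, d.device_name, c.country_name, count(*) AS events
FROM read_parquet('events/*/*.parquet', hive_partitioning=1) AS e  -- the big Hive-partitioned dataset
JOIN read_csv_auto('device_metadata.csv') AS d USING (device_id)   -- small local CSV of metadata
JOIN read_json_auto('countries.json') AS c USING (country_code)    -- small local JSON lookup
GROUP BY 1, 2, 3;
```

A registered pandas dataframe can be referenced by name in the same way when running through the Python client.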
https://github.com/multiprocessio/dsq#comparisons
Realistically though aside from the variety of input formats that DuckDB doesn't (yet) support, I think most people should probably use DuckDB or ClickHouse-local. Tools like dsq can provide broader support or a slightly simpler UX in some cases (and even that is obviously debatable). But I think the future is more the DuckDB or ClickHouse-local way.
dsq may end up being a frontend over DuckDB some day.
`alias dckr='docker run --rm -it -v $(pwd):/data -w /data duckerlabs/ducker'`
then `dckr` gives you a DuckDB shell with PRQL, httpfs, json, parquet, postgres, sqlite, and substrait enabled.
For example, to get the first 5 lines of a CSV file named "albums.csv", you could run it with PRQL:

```
dckr -c 'from `albums.csv` | take 5;'
```
With their Postgres scanner[0] you can now easily query multiple datasources using SQL and join between them (e.g. a Postgres table with a JSON file). Something I previously strove to build with OctoSQL[1]. There's even predicate push-down to the underlying database (for Postgres)!
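A rough sketch of such a cross-source join (the connection string, schema, table, and file names are placeholders, and the extension's name has varied across DuckDB versions), using the `postgres_scan` table function:

```sql
INSTALL postgres_scanner;
LOAD postgres_scanner;

-- join a live Postgres table against a local JSON file in one statement
SELECT u.id, u.email, o.total
FROM postgres_scan('dbname=mydb host=localhost', 'public', 'users') AS u
JOIN read_json_auto('orders.json') AS o ON o.user_id = u.id;
```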
It's amazing to see how quickly DuckDB is adding new features.
Not a huge fan of C++, which is currently what's used for authoring extensions. It'd be really cool if somebody implemented a Rust extension SDK, or even something like what Steampipe[2] does for Postgres FDWs: a shim for quickly implementing non-performance-sensitive extensions for various things.
Godspeed!
[0]: https://duckdb.org/2022/09/30/postgres-scanner.html
[1]: https://github.com/cube2222/octosql
[2]: https://steampipe.io
Limited and unclear pushdown of filters is one of the issues; another is that certain data types aren't handled, which makes it impossible to scan the table at all (even if the column in question isn't used).
I think that DuckDB is also missing a PostgreSQL logical replication driver to continuously replicate a subset of tables you want to run stats on.
Syncing the full table every time is too slow.
My own Trustfall query engine is one of them: https://github.com/obi1kenobi/trustfall
For example, you can query the HackerNews APIs from your browser: "Which Twitter/GitHub users comment on stories about OpenAI?" https://play.predr.ag/hackernews#?f=1&q=IyBDcm9zcyBBUEkgcXVl...
One of its real-world use cases is at the core a Rust semver linter: https://predr.ag/blog/speeding-up-rust-semver-checking-by-ov...
Would just be a nice gui feature to have.
Was looking for something to read JSON files that will house the config via SQL, i.e. a human-readable DB as an alternative to what the BI tool is using for its config persistence.
Will give DuckDB a go, thanks for posting!
If that’s of interest, you can read our launch HN here: https://news.ycombinator.com/item?id=28304781
One of our community members has built a pretty cool DuckDB + dbt + evidence data stack that you can run entirely in GitHub Codespaces. He's calling it "modern data stack in a box".
You can see that repo here: https://github.com/matsonj/nba-monte-carlo
My plan is to PoC a tool that lets you edit Metabase config as files, and secondly something that can replicate cloud instances to other environments, like a local Docker image or a staging instance.
Often I find that the database wants to be the authority and that makes working with different formats a bit uncomfortable.
Join us and help!
Do you mean polars reading Parquet into DuckDB to process that amount of data?
is a good example of an actual database that uses plaintext files in your filesystem.
I can see the argument that doing this with JSON is better (or worse), but regardless, Recutils is an interesting idea that I wish more people knew about. I can imagine a lot of cool things emerging if people iterated on the idea.
I'd like to understand more about what that means. Does it use multiple threads each reading from a different position in the file?
I tried it against a large-ish (hundreds of MB) log file; the query was simple to write and it spat out the result very quickly.
Impressed!
I was about to comment that this is all fantastic stuff, but I've found reading through the DuckDB docs quite challenging. For these JSON table functions, though, the documentation looks much better.
https://duckdb.org/docs/extensions/json
Need to spend some more time digging in, but this json functionality combined with some kind of file partitioning (Hive or hive-like) looks promising for some of my use cases.
Incidentally, the documentation for the Hive/Parquet stuff is a good example of what I'm talking about above. For the `parquet_scan` function, where can I see all of the possible function parameters? Where can I get more information about the specifics of `FILENAME`, `HIVE_PARTITIONING`, etc.?
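For the JSON-plus-partitioning combination mentioned above, a hedged sketch (the paths are invented, and exact option support for JSON readers depends on the DuckDB version):

```sql
-- read all JSON files under one date-partitioned directory
SELECT *
FROM read_json_auto('logs/date=2023-03-01/*.json');

-- globbing across partitions works too
SELECT count(*)
FROM read_json_auto('logs/*/*.json');
```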
JSON is a row-based file format. It doesn't allow query engines to skip rows or columns when running queries, so all the data needs to be read into memory. That's really inefficient.
Column-based file formats (e.g. Parquet) let query engines skip entire columns of data. Parquet also stores metadata on row groups, which lets query engines skip rows when reading data. These enhancements can speed up queries anywhere from not at all to 100x or more, depending on how much data is skipped.
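To make the skipping concrete (file and column names here are hypothetical): a query like the one below only needs to read the `amount` and `sale_date` columns, and Parquet's per-row-group min/max statistics let the engine skip any row group whose date range can't match the filter.

```sql
SELECT sum(amount)
FROM parquet_scan('sales.parquet')
WHERE sale_date >= DATE '2023-01-01';
```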
Data Lakehouse storage systems abstract the file metadata to a separate layer, which is even better than storing it in the file footer like Parquet does.
This DuckDB functionality is cool, but I think it's best used to convert JSON files to Parquet / a Lakehouse storage system, and then query those instead. JSON is a really inefficient file format for running queries.
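That convert-then-query workflow might look like this in DuckDB (file names are placeholders):

```sql
-- one-off conversion: JSON in, Parquet out
COPY (SELECT * FROM read_json_auto('events.json'))
TO 'events.parquet' (FORMAT PARQUET);

-- subsequent queries hit the columnar file instead
SELECT count(*) FROM parquet_scan('events.parquet');
```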
jq's syntax is deliberately terse which works really really well for "one-liners", while sql queries tend to be more verbose.
I like the simplicity of DuckDB's proposal, but haven't seen much info about how fast to expect it to be in comparison with traditional RDBs, for smaller, mostly-read-only applications.
For a dataset that size, I'd probably use SQLite to avoid having to manage a persistent MySQL process, especially when it's being used as an alternative to CSV files. That is, unless there's a MySQL/Postgres server already running I can just create a new database on.
DuckDB automatically creates indexes for all general-purpose columns. However, they're not persisted.
*I use the HoneySQL (clojure) library to programmatically build up queries and execute them via the JDBC driver.
[1] https://www.vantage.sh/blog/querying-aws-cost-data-duckdb
1) Needed to increase `maximum_object_size`. 2) "Unexpected yyjson tag in ValTypeToString".
Couldn't find a reference anywhere to that error. Loads into Snowflake without a hitch - which is where I normally query large JSON files.
Could you perhaps open an issue [1] or share the file with us so we could investigate the problem?
Meanwhile, doing the same with jq (`jq '.[0]'`) completed in 11 seconds and consumed about 2.8GB of RAM.
I love DuckDB, but it does seem like something isn't right here.
* duckdb example: https://siuba.org/guide/workflows-backends.html#duckdb
* supported methods: https://siuba.org/guide/ops-support-table.html
See my comment above: https://news.ycombinator.com/item?id=35027712
edit: Looks like DuckDB lets you use SQL-style queries
Will follow DuckDB.
- CSV: 166mb
- JSON: 795mb
That said, not all data is tabular. DuckDB already supports Parquet, which supports structs and is a very good format for storing data for reporting workloads. But JSON is a standard interchange format, so a lot of people are going to want to do something with the JSON payloads they receive from API calls.
I could definitely imagine a workload where you receive JSON from an API call, load it into DuckDB or similar to help with ETL, then store results in Parquet.
If I were building a system that had to deal with large amounts of tabular data that isn't directly consumed by humans, JSON wouldn't be my first choice nor my last.