DuckDB as the New jq (opens in new tab)

(pgrs.net)

364 pointspgr0ss2y ago73 comments

73 comments

56 comments · 23 top-level

xg152y ago· 15 in thread

The most effective combination I've found so far is jq + basic shell tools.

I still think jq's syntax and data model is unbelievably elegant and powerful once you get the hang of it - but its "standard library" is unfortunately sorely lacking in many places and has some awkward design choices in others, which means that a lot of practical everyday tasks - such as aggregations or even just set membership - are a lot more complicated than they ought to be.

Luckily, what jq can do really well is bringing data of interest into a line-based text representation, which is ideal for all kinds of standard unix shell tools - so you can just use those to take over the parts of your pipeline that would be hard to do in "pure" jq.

So I think my solution to the OP's task - get all distinct OSS licenses from the project list and count usages for each one - would be:

curl ... | jq '.[].license.key' | sort | uniq -c

That's it.

pcthrowaway2y ago

> I still think jq's syntax and data model is unbelievably elegant and powerful once you get the hang of it - but its "standard library" is unfortunately sorely lacking in many places

After a few years of stalled development, jq has been taken over recently by a new team of maintainers and is rapidly working through a lot of longstanding issues (https://github.com/jqlang/jq), so I'm not sure if this is still the case

xg152y ago

Wasn't aware of that, that's great to hear! I think if there is one utility that deserves a great maintainer team then this one. But if we saw some actual improvements in the future, that would be awesome!

I have a list of pet peeves that I'd really like to see fixed, so I'm gonna risk a bit of hope.

salmo2y ago

As an old Unix guy this is exactly how I see jq: a gateway to a fantastic library of text processing tools. I see a lot of complicated things done inside the language, which is a valid approach. But I don’t need it to be a programming language itself, just a transform to meet my next command after the pipe.

If I want logic beyond that, then I skip the shell and write “real” software.

I personally find those both to be more readable and easier to fit in my head than long complex jq expressions. But that’s completely subjective and others may find the jq expression language easier to read than shell or (choose your programming language).

digdugdirk2y ago

Your comment made me go look up jq (even more than the article did) and the first paragraph of the repo [0] feels like a secret club's secret language.

I'm very interested, but not a Linux person, do you know of any good resources for learning the Linux shell as a programming language?

[0] https://jqlang.github.io/jq/

3 more replies

hipjiveguy2y ago

I do the same thing all the time, lol

but have in the last year or so, tried to start writing things that will be read/used by other people in python or java

why? because most people don't have a clue how bash scripts work, but can read/debug python and java with ease, and both can be run as shell scripts too

the mac has sqlite installed by default as well, so there's a powerful combo available on every mac

but I totally do what you're doing all the time, and it is my default go to if I have to get something done fast :D

jb36892y ago

Driving that further - I don’t want to have to edit my query for minutes when I am in the shell. I don’t believe that the shell is conducive to complex SQL queries. You could write simpler SQL queries, but then you’re in a space where there are less verbose tools for those simpler tasks.

DuckDB seems stronger for someone who needs to create a scripting library - which also has lots of options and competition - or someone who has a very specific workflow of working with JSON dumps for a huge percentage of their time.

mightybyte2y ago

Your command line solution doesn't give quite the same result as OP. The final output in OP is sorted by the count field, but your command line incantation doesn't do that. One might respond that all you need to do is add a second "| sort" at the end, but that doesn't quite do it either. That will use string sorting instead of proper numeric sorting. In this example with only three output rows it's not an issue. But with larger amounts of data it will become a problem.

Your fundamental point about the power of basic shell tools is still completely valid. But if I could attempt to summarize OP's point, I think it would be that SQL is more powerful than ad-hoc jq incantations. And in this case, I tend to agree with OP. I've made substantial use of jq and yq over the course of years, as well as other tools for CSVs and other data formats. But every time I reach for them I have to spend a lot of time hunting the docs for just the right syntax to attack my specific problem. I know jq's paradigm draws from functional programming concepts and I have plenty of personal experience with functional programming, but the syntax and still feel very ad hoc and clunky.

Modern OLAP DB tools like duckdb, clickhouse, etc that provide really nice ways to get all kinds of data formats into and out of a SQL environment seem dramatically more powerful to me. Then when you add the power of all the basic shell tools on top of that, I think you get a much more powerful combination.

I like this example from the clickhouse-local documentation:

  $ ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' \
      | clickhouse-local --structure "user String, mem Float64" \
          --query "SELECT user, round(sum(mem), 2) as memTotal
            FROM table GROUP BY user ORDER BY memTotal DESC FORMAT Pretty"

xg152y ago

You can archive that by appending sort -n, so the whole thing becomes:

curl ... | jq '.[].license.key' | sort | uniq -c | sort -n

You can even turn it back into json by exploiting the fact that when uniq -c gets lines of json as input, it's output will be "accidentally" parseable as a sequence of json literals by jq, where every second literal is a count. You can use jq's (very weird) input function to transform each pair of literals into a "proper" json object:

curl ... | jq '.[].license.key' | sort | uniq -c | sort -n | jq '{"count":., "value":input}'

michaelcampbell2y ago

> I think it would be that SQL is more powerful than ad-hoc jq incantations.

I don't disagree, but only because I have a lot of experience with SQL so I "think" more in those terms. The blog post made perfect sense to me, and I have to go to claude/openai/copilot for whatever I need in `jq`, EVERY TIME. Because I don't use it nearly as often, so I don't have its language internalized.

Readability (and I'd posit to the point of this, power) is more a function of the reader than the code, pg's "blub paradox" notwithstanding.

I didn't know anything about duckdb, but I might give this a shot.

tleb_2y ago

For reference, sort(1) has -n, --numeric-sort: compare according to string numerical value.

eru2y ago

> I still think jq's syntax and data model is unbelievably elegant and powerful once you get the hang of it [...]

It's basically just functional programming. (Or what you would get from a functional programmer given the task of writing such a tool as jq.)

That's not to diminish jq, it's a great tool. I love it!

DanielHB2y ago

found out recently that jq can url-encode values, is there anything it _can't_ do?

xg152y ago

Finding out whether . is contained in a given array or not, evidently.

(That's not strictly true - you can do it, you just have to bend over backwards for what is essentially the "in" keyword in python, sql, etc. jq has no less than four functions that look like they should do that - in(), has(), contains() and inside(), yet they all do something slightly different)

jimbokun2y ago

The Unix philosophy continues to pass the test of time.

pphysch2y ago

Yes and no. Many UNIX philosophy proponents are abhorred by powerful binaries like jq and awk.

1 more reply

JeremyNT2y ago· 5 in thread

I have a lot of trouble understanding the benefits of this versus just working with json with a programming language. It seems like you're adding another layer of abstraction versus just dealing with a normal hashmap-like data structure in your language of choice.

If you want to work with it interactively, you could use a notebook or REPL.

WuxiFingerHold2y ago

My thoughts as well:

  const response = await fetch("https://api.github.com/orgs/golang/repos");
  const repos = await response.json();

  const groups = Map.groupBy(repos, e => e?.license?.key);
  ...

edu_guitar2y ago

if you are used to the command line and knows some basic syntax, it is less verbose then opening a REPL and reading a file. The fact that you can pipe the json data into it is also a plus, making it easier to check quickly if the response of a curl call has the fields/values you were expecting. Of course, if you are more comfortable doing that from the REPL, you get less value from learning jq. If you are fond of one liners, jq offers a lot of potential.

bdcravens2y ago

Pipelining CLI commands or bash scripts. From a security perspective, it may be preferable to not ship with a runtime.

jonfw2y ago

Use a compiled language like golang if you don't want to ship with a runtime.

If you're willing to ship w/ bash then I don't understand the opposition to JS. Either tool puts you in a scenario where somebody who can exec into your env can do whatever they want

vips7L2y ago

bash and jq are both runtimes.

1 more reply

NortySpock2y ago· 3 in thread

In a similar vein, I have found Benthos to be an incredible swiss-army-knife for transforming data and shoving it either into (or out of) a message bus, webhook, or a database.

https://www.benthos.dev/

esafak2y ago

I wish it was not based on YAML. Pipelines are code, not configuration!!

krembo2y ago

How does this defer from filebeat?

NortySpock2y ago

I don't know which filebeat you are referring to...

https://github.com/elastic/beats/tree/master/filebeat

This one? I only looked for a moment, but filebeat appears to be ingestion only. Benthos does input, output, side-effects, stream-stream joins, metrics-on-the-side, tiny-json-wrangling-webooks, and more. I find it to be like plumbers putty, closing over tooling gaps and smoothing rough edges where ordinarily you'd have to write 20 lines of stream processing code and 300 lines of error handling, reporting, and performance hacks.

hu32y ago· 2 in thread

Related, clickhouse local cli command is a speed demon to parse and query JSON and other formats such as CSV:

- "The world’s fastest tool for querying JSON files" https://clickhouse.com/blog/worlds-fastest-json-querying-too...

- "Show HN: ClickHouse-local – a small tool for serverless data analytics" https://news.ycombinator.com/item?id=34265206

mightybyte2y ago

I'll second this. Clickhouse is amazing. I was actually using it today to query some CSV files. I had to refresh my memory on the syntax so if anyone is interested:

  clickhouse local -q "SELECT foo, sum(bar) FROM file('foobar.csv', CSV) GROUP BY foo FORMAT Pretty"

Way easier than opening in Excel and creating a pivot table which was my previous workflow.

Here's a list of the different input and output formats that it supports.

https://clickhouse.com/docs/en/interfaces/formats

gkbrk2y ago

You don't even need to use file() for a lot of things recently. These just work with clickhouse local. Even wildcards work.

  select * from `foobar.csv`

  select * from `monthly-report-*.csv`

2 more replies

sshine2y ago· 2 in thread

Very cool!

I am also a big fan of jq.

And I think using DuckDB and SQL probably makes a lot of sense in a lot of cases.

But I think the examples are very geared towards being better solved in SQL.

The ideal jq examples are combinations of filter (select), map (map) and concat (.[]).

For example, finding the right download link:

  $ curl -s https://api.github.com/repos/go-gitea/gitea/releases/latest \
    | jq -r '.assets[]
             | .browser_download_url
             | select(endswith("linux-amd64"))'
  https://github.com/go-gitea/gitea/releases/download/v1.15.7/gitea-1.15.7-linux-amd64

Or extracting the KUBE_CONFIG of a DigitalOcean Kubernetes cluster from Terraform state:

  $ jq -r '.resources[]
          | select(.type == "digitalocean_kubernetes_cluster")
          | .instances[].attributes.kube_config[].raw_config' \ 
      terraform.tfstate
  apiVersion: v1
  kind: Config
  clusters:
  - cluster:
      certificate-authority-data: ...
      server: https://...k8s.ondigitalocean.com
  ...

pgr0ssOP2y ago

I think that's a fair point. Unnesting arrays in SQL can be annoying. Here is your first example with duckdb:

  duckdb -c \
    "select * from ( \
      select unnest(assets)->>'browser_download_url' as url \
      from read_json('https://api.github.com/repos/go-gitea/gitea/releases/latest') \
    ) \
    where url like '%linux-amd64'"

_flux2y ago

In case someone else was wondering, one can get a shell-consumable output from that with

  duckdb -noheader -list

HellsMaddy2y ago· 2 in thread

Jq tip: Instead of `sort_by(.count) | reverse`, you can do `sort_by(-.count)`

philsnow2y ago

only if you're sure that .count is never null:

  $ echo '[{"a": {"count": null}}]' | jq -c 'sort_by(-.count)'
  jq: error (at <stdin>:1): null (null) cannot be negated
  $ echo '[{"a": {"count": null}}]' | jq -c 'sort_by(.count) | reverse'
  [{"a":{"count":null}}]

mdaniel2y ago

this whole thread is like nerd sniping me :-D but I felt compelled to draw attention to jq's coalesce operator because I only recently learned about it and searching for the word "coalesce" in the man page is pfffft (it's official name is "Alternative operator", with alternative being "for false and null")

  $ echo '[{"a": {"count": null}}]' | jq -c 'sort_by(-(.count//0))'
  [{"a":{"count":null}}]

haradion2y ago· 1 in thread

I've found Nushell (https://www.nushell.sh/) to be really handy for ad-hoc data manipulation (and a decent enough general-purpose shell).

wraptile2y ago

Nushell is really good but the learning curve is massive.

I've been on nushell for almost a year now and still struggle to put more complex commands together. The docs are huge but not very good and the community resources are very limited (it's on Dicord smh) unfortunately. So, if anyone wants to get into it you really need to put down few days to understand the whole syntax suite but it's worth it!

hprotagonist2y ago· 1 in thread

i've been using simonw's sqlite-utils (https://sqlite-utils.datasette.io/en/stable/) for this sort of thing; given structured json or jsonl, you can throw data at an in-memory sqlite database and query away: https://sqlite-utils.datasette.io/en/stable/cli.html#queryin...

mutant2y ago

Thank you was going to say this as well, sqlite does this the same as duckdb

Sammi2y ago· 1 in thread

I work primarily in projects that use js and I mostly don't see the point in working with json in other tools than js.

I have tried jq a little bit, but learning jq is learning a new thing, which is healthy, but it also requires time and energy, which is not always available.

When I want to munge some json I use js... because that is what js in innately good at and it's what I already know. A little js script that does stdin/file read and then JSON.parse, and then map and filter some stuff, and at the end JSON.stringify to stdout/file does the job 100% of the time in my experience.

And I can use a debugger or put in console logs when I want to debug. I don't know how to debug jq or sql, so when I'm stuck I end up going for js which I can debug.

Are there js developers who reach for jq when you are already familiar with js? Is it because you are already strong in bash and terminal usage? I think I get why you would want to use sql if you are already experienced in sql. Sql is common and made for data munging. Jq however is a new dsl when I don't see the limitation of existing js or sql.

wwader2y ago

I do quite a lot of adhoc/exploratory programming to query and transform data then jq is very convenient as it works very well with "deep" data structures and the language itself it very composable.

To debug in jq you can use the debug function to prints to stderr, ex: "123 | debug | ..." or "{a:123, b:456} | debug({a}) | ... " only prints value of a "{a:123}"

schindlabua2y ago· 1 in thread

Shoutout to jqp, an interactive jq explorer.

https://github.com/noahgorstein/jqp

parentheses2y ago

This is pretty nice!

ndr2y ago

If you like lisp, and especially clojure, check out babashka[0]. This my first attempt but I bet you can do something nicer even if you keep forcing yourself to stay into a single pipe command.

  cat repos.json | bb -e ' (->> (-> *in* slurp (json/parse-string true))
                                (group-by #(-> % :license :key))
                                (map #(-> {:license (key %)
                                           :count (-> % val count)}))
                                json/generate-string
                                println)'

[0] https://babashka.org/

jeffbee2y ago

I tried this and it just seems to add bondage and discipline that I don't need on top of what is, in practice, an extremely chaotic format.

Example: trying to pick one field out of 20000 large JSON files that represent local property records.

% duckdb -json -c "select apn.apnNumber from read_json('*')" Invalid Input Error: JSON transform error in file "052136400500", in record/value 1: Could not convert string 'fb1b1e68-89ee-11ea-bc55-0242ad1302303' to INT128

Well, I didn't want that converted. I just want to ignore it. This has been my experience overall. DuckDB is great if there is a logical schema, not as good as jq when the corpus is just data soup.

mritchie7122y ago

You can also query (public) Google Sheets [0]

    SELECT * 
    FROM read_csv_auto('https://docs.google.com/spreadsheets/export? 
    format=csv&id=1GuEPkwjdICgJ31Ji3iUoarirZNDbPxQj_kf7fd4h4Ro', normalize_names=True);

0 - https://x.com/thisritchie/status/1767922982046015840?s=20

nf32y ago

I run a pretty substantial platform where I implemented structured logging to SQLite databases. Each log event is stored as a JSON object in a row. A separate database is kept for each day. Daily log files are about 35GB, so that's quite a lot of data to go through is you want to look for something specific. Being able to index on specific fields, as well as express searches as SQL queries is a real game changer IMO.

pletnes2y ago

Worth noting that both jq and duckdb can be used from python and from the command line. Both are very useful data tools!

ec1096852y ago

While jq’s syntax can be hard to remember, ChatGTP does an excellent job generating jq from an example json file and a description of how you want it parsed.

snthpy2y ago

Hi,

I very much share your sentiment and I saw a few comments mentioning PRQL so I thought it might be worth bringing up the following:

In order to make working with data at the terminal as easy and fun as possible, some time ago I created pq (prql-query) which leverages DuckDB, DataFusion and PRQL.

Unfortunately I am currently not in a position to maintain it so the repo is archived but if someone wanted to help out and collaborate we could change that.

It doesn't have much in the way of json functions out-of-the-box but in PRQL it's easy to wrap the DuckDB functions for that and with the new PRQL module system it will soon also become possible to share those. If you look through my HN comment history I did provide a JSON example before.

Anyway, you can take a look at the repo here: https://github.com/PRQL/prql-query

If interested, you can get in touch with me via Github or the PRQL Discord. I'm @snth on both.

mutant2y ago

https://github.com/mikefarah/yq

Yq handles almost every format, and IMO easier to use.

rpigab2y ago

I love jq and yq, but sometimes I don't want to invest time in learning new syntax and just fallback to some python one liner, that can if necessary become a small python script.

Something like this, I have a version of this in a shell alias:

  python3 -c "import json,sys;d=json.load(sys.stdin);print(doStuff(d['path']['etc']))"

Pretty print is done with json.dumps.

phmx2y ago

There is also a way to import a table from the STDIN (see also https://duckdb.org/docs/data/json/overview)

cat my.json | duckdb -c "CREATE TABLE mytbl AS SELECT * FROM read_json_auto('/dev/stdin'); SELECT ... FROM mytbl"

dudus2y ago

DuckDB parses JSON using yyjson internally .

https://github.com/ibireme/yyjson

jonfw2y ago

My current team produces a CLI binary that is available on every build system and everybody's dev machines

Whenever we're writing automation, if the code is nontrivial, or if it starts to include dependencies, we move the code into the CLI tool.

The reason we like this is that we don't want to have to version control tools like duckdb across every dev machine and every build system that might run this script. We build and version control a single binary and it makes life simple.

hermitcrab2y ago

if you want a very visual way to transform JSON/XML/CSV/Excel etc in a pipeline it might also be worth looking at Easy Data Transform.

j / k navigate · click thread line to collapse

73 comments

56 comments · 23 top-level

xg152y ago· 15 in thread

The most effective combination I've found so far is jq + basic shell tools.

So I think my solution to the OP's task - get all distinct OSS licenses from the project list and count usages for each one - would be:

curl ... | jq '.[].license.key' | sort | uniq -c

That's it.

pcthrowaway2y ago

> I still think jq's syntax and data model is unbelievably elegant and powerful once you get the hang of it - but its "standard library" is unfortunately sorely lacking in many places

xg152y ago

I have a list of pet peeves that I'd really like to see fixed, so I'm gonna risk a bit of hope.

salmo2y ago

If I want logic beyond that, then I skip the shell and write “real” software.

digdugdirk2y ago

Your comment made me go look up jq (even more than the article did) and the first paragraph of the repo [0] feels like a secret club's secret language.

I'm very interested, but not a Linux person, do you know of any good resources for learning the Linux shell as a programming language?

[0] https://jqlang.github.io/jq/

3 more replies

hipjiveguy2y ago

I do the same thing all the time, lol

but have in the last year or so, tried to start writing things that will be read/used by other people in python or java

why? because most people don't have a clue how bash scripts work, but can read/debug python and java with ease, and both can be run as shell scripts too

the mac has sqlite installed by default as well, so there's a powerful combo available on every mac

but I totally do what you're doing all the time, and it is my default go to if I have to get something done fast :D

jb36892y ago

mightybyte2y ago

I like this example from the clickhouse-local documentation:

  $ ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' \
      | clickhouse-local --structure "user String, mem Float64" \
          --query "SELECT user, round(sum(mem), 2) as memTotal
            FROM table GROUP BY user ORDER BY memTotal DESC FORMAT Pretty"

xg152y ago

You can archive that by appending sort -n, so the whole thing becomes:

curl ... | jq '.[].license.key' | sort | uniq -c | sort -n

curl ... | jq '.[].license.key' | sort | uniq -c | sort -n | jq '{"count":., "value":input}'

michaelcampbell2y ago

> I think it would be that SQL is more powerful than ad-hoc jq incantations.

Readability (and I'd posit to the point of this, power) is more a function of the reader than the code, pg's "blub paradox" notwithstanding.

I didn't know anything about duckdb, but I might give this a shot.

tleb_2y ago

For reference, sort(1) has -n, --numeric-sort: compare according to string numerical value.

eru2y ago

> I still think jq's syntax and data model is unbelievably elegant and powerful once you get the hang of it [...]

It's basically just functional programming. (Or what you would get from a functional programmer given the task of writing such a tool as jq.)

That's not to diminish jq, it's a great tool. I love it!

DanielHB2y ago

found out recently that jq can url-encode values, is there anything it _can't_ do?

xg152y ago

Finding out whether . is contained in a given array or not, evidently.

jimbokun2y ago

The Unix philosophy continues to pass the test of time.

pphysch2y ago

Yes and no. Many UNIX philosophy proponents are abhorred by powerful binaries like jq and awk.

1 more reply

JeremyNT2y ago· 5 in thread

If you want to work with it interactively, you could use a notebook or REPL.

WuxiFingerHold2y ago

My thoughts as well:

  const response = await fetch("https://api.github.com/orgs/golang/repos");
  const repos = await response.json();

  const groups = Map.groupBy(repos, e => e?.license?.key);
  ...

edu_guitar2y ago

bdcravens2y ago

Pipelining CLI commands or bash scripts. From a security perspective, it may be preferable to not ship with a runtime.

jonfw2y ago

Use a compiled language like golang if you don't want to ship with a runtime.

If you're willing to ship w/ bash then I don't understand the opposition to JS. Either tool puts you in a scenario where somebody who can exec into your env can do whatever they want

vips7L2y ago

bash and jq are both runtimes.

1 more reply

NortySpock2y ago· 3 in thread

In a similar vein, I have found Benthos to be an incredible swiss-army-knife for transforming data and shoving it either into (or out of) a message bus, webhook, or a database.

https://www.benthos.dev/

esafak2y ago

I wish it was not based on YAML. Pipelines are code, not configuration!!

krembo2y ago

How does this defer from filebeat?

NortySpock2y ago

I don't know which filebeat you are referring to...

https://github.com/elastic/beats/tree/master/filebeat

hu32y ago· 2 in thread

Related, clickhouse local cli command is a speed demon to parse and query JSON and other formats such as CSV:

- "The world’s fastest tool for querying JSON files" https://clickhouse.com/blog/worlds-fastest-json-querying-too...

- "Show HN: ClickHouse-local – a small tool for serverless data analytics" https://news.ycombinator.com/item?id=34265206

mightybyte2y ago

I'll second this. Clickhouse is amazing. I was actually using it today to query some CSV files. I had to refresh my memory on the syntax so if anyone is interested:

  clickhouse local -q "SELECT foo, sum(bar) FROM file('foobar.csv', CSV) GROUP BY foo FORMAT Pretty"

Way easier than opening in Excel and creating a pivot table which was my previous workflow.

Here's a list of the different input and output formats that it supports.

https://clickhouse.com/docs/en/interfaces/formats

gkbrk2y ago

You don't even need to use file() for a lot of things recently. These just work with clickhouse local. Even wildcards work.

  select * from `foobar.csv`

  select * from `monthly-report-*.csv`

2 more replies

sshine2y ago· 2 in thread

Very cool!

I am also a big fan of jq.

And I think using DuckDB and SQL probably makes a lot of sense in a lot of cases.

But I think the examples are very geared towards being better solved in SQL.

The ideal jq examples are combinations of filter (select), map (map) and concat (.[]).

For example, finding the right download link:

  $ curl -s https://api.github.com/repos/go-gitea/gitea/releases/latest \
    | jq -r '.assets[]
             | .browser_download_url
             | select(endswith("linux-amd64"))'
  https://github.com/go-gitea/gitea/releases/download/v1.15.7/gitea-1.15.7-linux-amd64

Or extracting the KUBE_CONFIG of a DigitalOcean Kubernetes cluster from Terraform state:

  $ jq -r '.resources[]
          | select(.type == "digitalocean_kubernetes_cluster")
          | .instances[].attributes.kube_config[].raw_config' \ 
      terraform.tfstate
  apiVersion: v1
  kind: Config
  clusters:
  - cluster:
      certificate-authority-data: ...
      server: https://...k8s.ondigitalocean.com
  ...

pgr0ssOP2y ago

I think that's a fair point. Unnesting arrays in SQL can be annoying. Here is your first example with duckdb:

  duckdb -c \
    "select * from ( \
      select unnest(assets)->>'browser_download_url' as url \
      from read_json('https://api.github.com/repos/go-gitea/gitea/releases/latest') \
    ) \
    where url like '%linux-amd64'"

_flux2y ago

In case someone else was wondering, one can get a shell-consumable output from that with

  duckdb -noheader -list

HellsMaddy2y ago· 2 in thread

Jq tip: Instead of `sort_by(.count) | reverse`, you can do `sort_by(-.count)`

philsnow2y ago

only if you're sure that .count is never null:

  $ echo '[{"a": {"count": null}}]' | jq -c 'sort_by(-.count)'
  jq: error (at <stdin>:1): null (null) cannot be negated
  $ echo '[{"a": {"count": null}}]' | jq -c 'sort_by(.count) | reverse'
  [{"a":{"count":null}}]

mdaniel2y ago

  $ echo '[{"a": {"count": null}}]' | jq -c 'sort_by(-(.count//0))'
  [{"a":{"count":null}}]

haradion2y ago· 1 in thread

I've found Nushell (https://www.nushell.sh/) to be really handy for ad-hoc data manipulation (and a decent enough general-purpose shell).

wraptile2y ago

Nushell is really good but the learning curve is massive.

hprotagonist2y ago· 1 in thread

mutant2y ago

Thank you was going to say this as well, sqlite does this the same as duckdb

Sammi2y ago· 1 in thread

I work primarily in projects that use js and I mostly don't see the point in working with json in other tools than js.

I have tried jq a little bit, but learning jq is learning a new thing, which is healthy, but it also requires time and energy, which is not always available.

And I can use a debugger or put in console logs when I want to debug. I don't know how to debug jq or sql, so when I'm stuck I end up going for js which I can debug.

wwader2y ago

I do quite a lot of adhoc/exploratory programming to query and transform data then jq is very convenient as it works very well with "deep" data structures and the language itself it very composable.

To debug in jq you can use the debug function to prints to stderr, ex: "123 | debug | ..." or "{a:123, b:456} | debug({a}) | ... " only prints value of a "{a:123}"

schindlabua2y ago· 1 in thread

Shoutout to jqp, an interactive jq explorer.

https://github.com/noahgorstein/jqp

parentheses2y ago

This is pretty nice!

ndr2y ago

If you like lisp, and especially clojure, check out babashka[0]. This my first attempt but I bet you can do something nicer even if you keep forcing yourself to stay into a single pipe command.

  cat repos.json | bb -e ' (->> (-> *in* slurp (json/parse-string true))
                                (group-by #(-> % :license :key))
                                (map #(-> {:license (key %)
                                           :count (-> % val count)}))
                                json/generate-string
                                println)'

[0] https://babashka.org/

jeffbee2y ago

I tried this and it just seems to add bondage and discipline that I don't need on top of what is, in practice, an extremely chaotic format.

Example: trying to pick one field out of 20000 large JSON files that represent local property records.

Well, I didn't want that converted. I just want to ignore it. This has been my experience overall. DuckDB is great if there is a logical schema, not as good as jq when the corpus is just data soup.

mritchie7122y ago

You can also query (public) Google Sheets [0]

    SELECT * 
    FROM read_csv_auto('https://docs.google.com/spreadsheets/export? 
    format=csv&id=1GuEPkwjdICgJ31Ji3iUoarirZNDbPxQj_kf7fd4h4Ro', normalize_names=True);

0 - https://x.com/thisritchie/status/1767922982046015840?s=20

nf32y ago

pletnes2y ago

Worth noting that both jq and duckdb can be used from python and from the command line. Both are very useful data tools!

ec1096852y ago

While jq’s syntax can be hard to remember, ChatGTP does an excellent job generating jq from an example json file and a description of how you want it parsed.

snthpy2y ago

Hi,

I very much share your sentiment and I saw a few comments mentioning PRQL so I thought it might be worth bringing up the following:

In order to make working with data at the terminal as easy and fun as possible, some time ago I created pq (prql-query) which leverages DuckDB, DataFusion and PRQL.

Unfortunately I am currently not in a position to maintain it so the repo is archived but if someone wanted to help out and collaborate we could change that.

Anyway, you can take a look at the repo here: https://github.com/PRQL/prql-query

If interested, you can get in touch with me via Github or the PRQL Discord. I'm @snth on both.

mutant2y ago

https://github.com/mikefarah/yq

Yq handles almost every format, and IMO easier to use.

rpigab2y ago

I love jq and yq, but sometimes I don't want to invest time in learning new syntax and just fallback to some python one liner, that can if necessary become a small python script.

Something like this, I have a version of this in a shell alias:

  python3 -c "import json,sys;d=json.load(sys.stdin);print(doStuff(d['path']['etc']))"

Pretty print is done with json.dumps.

phmx2y ago

There is also a way to import a table from the STDIN (see also https://duckdb.org/docs/data/json/overview)

cat my.json | duckdb -c "CREATE TABLE mytbl AS SELECT * FROM read_json_auto('/dev/stdin'); SELECT ... FROM mytbl"

dudus2y ago

DuckDB parses JSON using yyjson internally .

https://github.com/ibireme/yyjson

jonfw2y ago

My current team produces a CLI binary that is available on every build system and everybody's dev machines

Whenever we're writing automation, if the code is nontrivial, or if it starts to include dependencies, we move the code into the CLI tool.

hermitcrab2y ago

if you want a very visual way to transform JSON/XML/CSV/Excel etc in a pipeline it might also be worth looking at Easy Data Transform.

j / k navigate · click thread line to collapse