I still think jq's syntax and data model are unbelievably elegant and powerful once you get the hang of them - but its "standard library" is unfortunately sorely lacking in many places and has some awkward design choices in others, which means that a lot of practical everyday tasks - such as aggregations or even just set membership - are a lot more complicated than they ought to be.
Luckily, what jq can do really well is bringing data of interest into a line-based text representation, which is ideal for all kinds of standard unix shell tools - so you can just use those to take over the parts of your pipeline that would be hard to do in "pure" jq.
So I think my solution to the OP's task - get all distinct OSS licenses from the project list and count usages for each one - would be:
curl ... | jq '.[].license.key' | sort | uniq -c
That's it.
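For comparison, the same distinct-and-count step can be sketched in Python with collections.Counter. This is only an illustration: the sample list is made up, shaped like the `.[].license.key` path used above, and `count_licenses` is a hypothetical helper name.

```python
from collections import Counter


def count_licenses(repos):
    """Count how often each license key occurs.

    Repos whose "license" is missing or null are counted under None,
    similar to how jq would emit null for them.
    """
    keys = []
    for repo in repos:
        license_info = repo.get("license") or {}
        keys.append(license_info.get("key"))
    return Counter(keys)


if __name__ == "__main__":
    # Made-up sample data, not real API output.
    sample = [
        {"name": "a", "license": {"key": "mit"}},
        {"name": "b", "license": None},
        {"name": "c", "license": {"key": "mit"}},
        {"name": "d", "license": {"key": "bsd-3-clause"}},
    ]
    for key, count in count_licenses(sample).most_common():
        print(count, key)
```

It is clearly more typing than `sort | uniq -c`, which is rather the point of the shell version.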
After a few years of stalled development, jq has been taken over recently by a new team of maintainers and is rapidly working through a lot of longstanding issues (https://github.com/jqlang/jq), so I'm not sure if this is still the case.
I have a list of pet peeves that I'd really like to see fixed, so I'm gonna risk a bit of hope.
If I want logic beyond that, then I skip the shell and write “real” software.
I personally find those both to be more readable and easier to fit in my head than long complex jq expressions. But that’s completely subjective and others may find the jq expression language easier to read than shell or (choose your programming language).
I'm very interested, but not a Linux person, do you know of any good resources for learning the Linux shell as a programming language?
but have in the last year or so, tried to start writing things that will be read/used by other people in python or java
why? because most people don't have a clue how bash scripts work, but can read/debug python and java with ease, and both can be run as shell scripts too
the mac has sqlite installed by default as well, so there's a powerful combo available on every mac
but I totally do what you're doing all the time, and it is my default go to if I have to get something done fast :D
DuckDB seems stronger for someone who needs to create a scripting library - which also has lots of options and competition - or someone who has a very specific workflow of working with JSON dumps for a huge percentage of their time.
Your fundamental point about the power of basic shell tools is still completely valid. But if I could attempt to summarize OP's point, I think it would be that SQL is more powerful than ad-hoc jq incantations. And in this case, I tend to agree with OP. I've made substantial use of jq and yq over the course of years, as well as other tools for CSVs and other data formats. But every time I reach for them I have to spend a lot of time hunting the docs for just the right syntax to attack my specific problem. I know jq's paradigm draws from functional programming concepts and I have plenty of personal experience with functional programming, but the syntax still feels very ad hoc and clunky.
Modern OLAP DB tools like duckdb, clickhouse, etc that provide really nice ways to get all kinds of data formats into and out of a SQL environment seem dramatically more powerful to me. Then when you add the power of all the basic shell tools on top of that, I think you get a much more powerful combination.
I like this example from the clickhouse-local documentation:
$ ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' \
| clickhouse-local --structure "user String, mem Float64" \
--query "SELECT user, round(sum(mem), 2) as memTotal
FROM table GROUP BY user ORDER BY memTotal DESC FORMAT Pretty"
curl ... | jq '.[].license.key' | sort | uniq -c | sort -n
You can even turn it back into json by exploiting the fact that when uniq -c gets lines of json as input, its output will be "accidentally" parseable as a sequence of json literals by jq, where every second literal is a count. You can use jq's (very weird) input function to transform each pair of literals into a "proper" json object:
curl ... | jq '.[].license.key' | sort | uniq -c | sort -n | jq '{"count":., "value":input}'
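If the `input` trick feels too clever, the same pairing of count and value can be sketched in Python. This assumes the usual `uniq -c` layout (a right-aligned count, whitespace, then the original line) and that each line is a JSON literal as produced by jq; `parse_uniq_counts` is a hypothetical name.

```python
import json


def parse_uniq_counts(text):
    """Turn `uniq -c` output into a list of {"count", "value"} dicts.

    Each non-empty line is expected to look like `      3 "mit"`:
    a count, whitespace, then a JSON literal.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split off the leading count; the rest is the JSON literal.
        count, value = line.split(None, 1)
        rows.append({"count": int(count), "value": json.loads(value)})
    return rows


if __name__ == "__main__":
    # Made-up sample of what `sort | uniq -c | sort -n` might print.
    sample = '      1 null\n      2 "bsd-3-clause"\n      5 "mit"\n'
    print(json.dumps(parse_uniq_counts(sample)))
```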
I don't disagree, but only because I have a lot of experience with SQL so I "think" more in those terms. The blog post made perfect sense to me, and I have to go to claude/openai/copilot for whatever I need in `jq`, EVERY TIME. Because I don't use it nearly as often, so I don't have its language internalized.
Readability (and I'd posit to the point of this, power) is more a function of the reader than the code, pg's "blub paradox" notwithstanding.
I didn't know anything about duckdb, but I might give this a shot.
It's basically just functional programming. (Or what you would get from a functional programmer given the task of writing such a tool as jq.)
That's not to diminish jq, it's a great tool. I love it!
(That's not strictly true - you can do it, you just have to bend over backwards for what is essentially the "in" keyword in python, sql, etc. jq has no less than four functions that look like they should do that - in(), has(), contains() and inside(), yet they all do something slightly different)
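To illustrate why `contains()` is not the "in" you might expect: as I read the jq manual, it does substring matching on strings and applies that recursively inside arrays and objects. A simplified Python model of that behavior (my own approximation, not jq's actual implementation; real jq raises an error on mismatched types where this sketch just falls back to equality):

```python
def jq_contains(a, b):
    """Simplified model of jq's `a | contains(b)`.

    - strings: b must be a substring of a
    - arrays:  every element of b must be contained in some element of a
    - objects: every key of b must exist in a with a containing value
    - other:   plain equality (a simplification for this sketch)
    """
    if isinstance(a, str) and isinstance(b, str):
        return b in a
    if isinstance(a, list) and isinstance(b, list):
        return all(any(jq_contains(x, y) for x in a) for y in b)
    if isinstance(a, dict) and isinstance(b, dict):
        return all(k in a and jq_contains(a[k], v) for k, v in b.items())
    return a == b


if __name__ == "__main__":
    licenses = ["foobar", "mit"]
    # Python's `in` is exact membership ...
    print("bar" in licenses)
    # ... while the contains-style check also matches on substrings.
    print(jq_contains(licenses, ["bar"]))
```

So a check that looks like set membership can quietly match partial strings, which is the kind of surprise I mean.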
If you want to work with it interactively, you could use a notebook or REPL.
const response = await fetch("https://api.github.com/orgs/golang/repos");
const repos = await response.json();
const groups = Map.groupBy(repos, e => e?.license?.key);
...If you're willing to ship w/ bash then I don't understand the opposition to JS. Either tool puts you in a scenario where somebody who can exec into your env can do whatever they want
https://github.com/elastic/beats/tree/master/filebeat
This one? I only looked for a moment, but filebeat appears to be ingestion only. Benthos does input, output, side-effects, stream-stream joins, metrics-on-the-side, tiny-json-wrangling-webhooks, and more. I find it to be like plumber's putty, closing over tooling gaps and smoothing rough edges where ordinarily you'd have to write 20 lines of stream processing code and 300 lines of error handling, reporting, and performance hacks.
- "The world’s fastest tool for querying JSON files" https://clickhouse.com/blog/worlds-fastest-json-querying-too...
- "Show HN: ClickHouse-local – a small tool for serverless data analytics" https://news.ycombinator.com/item?id=34265206
clickhouse local -q "SELECT foo, sum(bar) FROM file('foobar.csv', CSV) GROUP BY foo FORMAT Pretty"
Way easier than opening in Excel and creating a pivot table which was my previous workflow.
Here's a list of the different input and output formats that it supports.
select * from `foobar.csv`
or select * from `monthly-report-*.csv`
I am also a big fan of jq.
And I think using DuckDB and SQL probably makes a lot of sense in a lot of cases.
But I think the examples are very geared towards being better solved in SQL.
The ideal jq examples are combinations of filter (select), map (map) and concat (.[]).
For example, finding the right download link:
$ curl -s https://api.github.com/repos/go-gitea/gitea/releases/latest \
| jq -r '.assets[]
| .browser_download_url
| select(endswith("linux-amd64"))'
https://github.com/go-gitea/gitea/releases/download/v1.15.7/gitea-1.15.7-linux-amd64
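The filter/map/concat shape of that expression maps fairly directly onto a list comprehension. A rough Python sketch, assuming the release JSON has already been fetched and parsed; `find_download_urls` is a hypothetical name and the second asset in the sample is invented for contrast:

```python
def find_download_urls(release, suffix):
    """Return the download URLs of all assets whose URL ends with `suffix`.

    Mirrors `.assets[] | .browser_download_url | select(endswith(...))`.
    """
    return [
        asset["browser_download_url"]
        for asset in release.get("assets", [])
        if asset.get("browser_download_url", "").endswith(suffix)
    ]


if __name__ == "__main__":
    # Trimmed sample; only the first URL comes from the example above.
    release = {
        "assets": [
            {"browser_download_url": "https://github.com/go-gitea/gitea/releases/download/v1.15.7/gitea-1.15.7-linux-amd64"},
            {"browser_download_url": "https://example.invalid/gitea-1.15.7-darwin-amd64"},
        ]
    }
    for url in find_download_urls(release, "linux-amd64"):
        print(url)
```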
Or extracting the KUBE_CONFIG of a DigitalOcean Kubernetes cluster from Terraform state:
$ jq -r '.resources[]
| select(.type == "digitalocean_kubernetes_cluster")
| .instances[].attributes.kube_config[].raw_config' \
terraform.tfstate
apiVersion: v1
kind: Config
clusters:
- cluster:
certificate-authority-data: ...
server: https://...k8s.ondigitalocean.com
...
duckdb -c \
"select * from ( \
select unnest(assets)->>'browser_download_url' as url \
from read_json('https://api.github.com/repos/go-gitea/gitea/releases/latest') \
) \
where url like '%linux-amd64'"
duckdb -noheader -list
$ echo '[{"a": {"count": null}}]' | jq -c 'sort_by(-.count)'
jq: error (at <stdin>:1): null (null) cannot be negated
$ echo '[{"a": {"count": null}}]' | jq -c 'sort_by(.count) | reverse'
[{"a":{"count":null}}]
$ echo '[{"a": {"count": null}}]' | jq -c 'sort_by(-(.count//0))'
[{"a":{"count":null}}]
I've been on nushell for almost a year now and still struggle to put more complex commands together. The docs are huge but not very good, and the community resources are unfortunately very limited (it's on Discord, smh). So if anyone wants to get into it, you really need to put in a few days to understand the whole syntax suite, but it's worth it!
I have tried jq a little bit, but learning jq is learning a new thing, which is healthy, but it also requires time and energy, which is not always available.
When I want to munge some json I use js... because that is what js is innately good at and it's what I already know. A little js script that does stdin/file read and then JSON.parse, and then map and filter some stuff, and at the end JSON.stringify to stdout/file does the job 100% of the time in my experience.
And I can use a debugger or put in console logs when I want to debug. I don't know how to debug jq or sql, so when I'm stuck I end up going for js which I can debug.
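The read, parse, map/filter, stringify shape described above is the same in most scripting languages. Here is a rough sketch of it in Python rather than js; the `transform` step and the field names `name` and `archived` are placeholders invented for the example, not anything from a real API.

```python
import io
import json


def transform(items):
    """Placeholder map/filter step: keep non-archived items, project the name."""
    return [item["name"] for item in items if not item.get("archived", False)]


def run(in_stream, out_stream):
    """Read JSON from in_stream, transform it, write JSON to out_stream."""
    data = json.load(in_stream)
    json.dump(transform(data), out_stream, indent=2)
    out_stream.write("\n")


if __name__ == "__main__":
    # In-memory demo with made-up data instead of a real pipe.
    sample = io.StringIO('[{"name": "a"}, {"name": "b", "archived": true}]')
    out = io.StringIO()
    run(sample, out)
    print(out.getvalue(), end="")
```

Saved as a script, calling `run(sys.stdin, sys.stdout)` instead of the demo makes it usable at the end of a shell pipe, and the usual debugger or print statements work inside `transform`.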
Are there js developers who reach for jq when you are already familiar with js? Is it because you are already strong in bash and terminal usage? I think I get why you would want to use sql if you are already experienced in sql. Sql is common and made for data munging. Jq however is a new dsl when I don't see the limitation of existing js or sql.
To debug in jq you can use the debug function, which prints to stderr, e.g. "123 | debug | ..."; or "{a:123, b:456} | debug({a}) | ...", which only prints the value of a, "{a:123}".
cat repos.json | bb -e ' (->> (-> *in* slurp (json/parse-string true))
(group-by #(-> % :license :key))
(map #(-> {:license (key %)
:count (-> % val count)}))
json/generate-string
println)'
[0] https://babashka.org/
Example: trying to pick one field out of 20000 large JSON files that represent local property records.
% duckdb -json -c "select apn.apnNumber from read_json('*')"
Invalid Input Error: JSON transform error in file "052136400500", in record/value 1: Could not convert string 'fb1b1e68-89ee-11ea-bc55-0242ad1302303' to INT128
Well, I didn't want that converted. I just want to ignore it. This has been my experience overall. DuckDB is great if there is a logical schema, not as good as jq when the corpus is just data soup.
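For the data-soup case, the "just pick one field and ignore everything else" approach can be sketched in a few lines of Python. This assumes one JSON document per file and uses the `apn.apnNumber` path from the query above; `pick_field` and `collect` are hypothetical helper names, and files that fail to parse are silently skipped, which may or may not be what you want.

```python
import json
from pathlib import Path


def pick_field(doc, path):
    """Follow `path` (a list of keys) through nested dicts.

    Returns None as soon as a key is missing or a non-dict is hit,
    so unrelated or oddly typed fields never get in the way.
    """
    current = doc
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def collect(directory, path):
    """Yield (file name, value) for every file in `directory` that has the field."""
    for file in sorted(Path(directory).iterdir()):
        if not file.is_file():
            continue
        try:
            doc = json.loads(file.read_text())
        except (OSError, ValueError):
            continue  # unreadable or not valid JSON: skip it
        value = pick_field(doc, path)
        if value is not None:
            yield file.name, value


if __name__ == "__main__":
    for name, value in collect(".", ["apn", "apnNumber"]):
        print(name, value)
```

No schema inference happens anywhere, so a stray UUID in some other field cannot break the run.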
SELECT *
FROM read_csv_auto('https://docs.google.com/spreadsheets/export?format=csv&id=1GuEPkwjdICgJ31Ji3iUoarirZNDbPxQj_kf7fd4h4Ro', normalize_names=True);
0 - https://x.com/thisritchie/status/1767922982046015840?s=20
I very much share your sentiment and I saw a few comments mentioning PRQL so I thought it might be worth bringing up the following:
In order to make working with data at the terminal as easy and fun as possible, some time ago I created pq (prql-query) which leverages DuckDB, DataFusion and PRQL.
Unfortunately I am currently not in a position to maintain it so the repo is archived but if someone wanted to help out and collaborate we could change that.
It doesn't have much in the way of json functions out-of-the-box but in PRQL it's easy to wrap the DuckDB functions for that and with the new PRQL module system it will soon also become possible to share those. If you look through my HN comment history I did provide a JSON example before.
Anyway, you can take a look at the repo here: https://github.com/PRQL/prql-query
If interested, you can get in touch with me via Github or the PRQL Discord. I'm @snth on both.
Yq handles almost every format, and IMO easier to use.
Something like this, I have a version of this in a shell alias:
python3 -c "import json,sys;d=json.load(sys.stdin);print(doStuff(d['path']['etc']))"
Pretty print is done with json.dumps.
cat my.json | duckdb -c "CREATE TABLE mytbl AS SELECT * FROM read_json_auto('/dev/stdin'); SELECT ... FROM mytbl"
Whenever we're writing automation, if the code is nontrivial, or if it starts to include dependencies, we move the code into the CLI tool.
The reason we like this is that we don't want to have to version control tools like duckdb across every dev machine and every build system that might run this script. We build and version control a single binary and it makes life simple.