https://devblogs.microsoft.com/scripting/learn-how-to-config...
One of the things I like most about the UNIX userland is that I can use small programs to edit very large files without needing lots of memory. I want programs that are designed to accommodate the possibility of line-by-line processing.
If the intent is to make output network friendly, maybe something like netstrings is useful. Easy to parse. Low memory footprint.
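For reference, a netstring is just a decimal byte length, a colon, the payload and a trailing comma, which is why it parses with so little state. A minimal Python sketch, assuming the whole buffer is already in memory (the function names `encode_netstring` and `decode_netstrings` are made up for illustration):

```python
def encode_netstring(payload: bytes) -> bytes:
    """Frame payload as b'<length>:<payload>,'."""
    return str(len(payload)).encode("ascii") + b":" + payload + b","


def decode_netstrings(data: bytes):
    """Yield each payload from a buffer of concatenated netstrings."""
    pos = 0
    while pos < len(data):
        colon = data.index(b":", pos)                      # end of the length prefix
        length = int(data[pos:colon].decode("ascii"))      # declared payload size in bytes
        start, end = colon + 1, colon + 1 + length
        if data[end:end + 1] != b",":                      # every netstring ends with a comma
            raise ValueError("malformed netstring")
        yield data[start:end]
        pos = end + 1
```

Because the length comes first, the payload may itself contain colons or commas without any escaping.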
Seems to me this JSON idea is not designed to improve performance, agility or resource efficiency but to ignore the UNIX example in favour of a different, slower, approach that is perceived as easier for some people to use. Namely those who do not want to spend the time to learn how to use an existing, faster solution with lower resource requirements.
You are misinterpreting the Unix philosophy. It's fine to use a bunch of sed, awk, grep, etc. when you're either transforming text or processing already well-structured data. But trying to write a full-fledged parser for something with only human-readable output, especially as a shell script, definitely goes against that philosophy. Congratulations, you've managed to piece together 50 commands in a pipeline and create a monstrosity that's far from the minimalist philosophy.
In fact, I would argue that by using `jc` together with `jq` you can actually create some nice pipelines for parsing the data that will be much more in line with the Unix philosophy.
Nobody ever said this was designed to improve performance, but I have a hard time believing your claims about it being significantly slower, which are not backed up by any source. Most likely, eliminating the JSON conversion would be at most an unnecessary micro-optimization. But if your code was truly performance-critical, you wouldn't be piecing it together with shell pipelines that cause a bunch of unnecessary forks, you'd write it in something like C instead.
And the "JSON was designed for the web browser" argument doesn't hold much water either. You're several decades too late for that: JSON is ubiquitous and used in a lot of non-browser contexts. Sure, some people depending on their needs may use other formats like XML or protobuf, but JSON is still very common.
> These programs do not expect large amounts of memory and many are written with the intent that they may be used to process text line-by-line
Which is only a problem if you are being very silly, don't choose NDJSON (newline-delimited JSON) and instead shove 10GB of data in a big [] array that the parser has to read in all at once. Almost every single JSON library can do NDJSON already. One of the most heavily used JSON-over-stdio applications is the Language Server Protocol, which uses JSON-RPC 2.0 and is entirely NDJSON. Same for about 15 different log-yeeting tools. Nobody has ever suggested switching LSP to plain text for performance reasons, only lower-overhead binary formats that don't throw out everything gained by having structure at all.
Large memory use by JSON is not something inherent to the encoding that plain text is somehow immune to. All sorts of CLI programs read stdin in all at once, and you don't see plain text getting slammed for exorbitant memory use.
In the context of the original post, `jc` etc, we're talking about essentially a constant sized output that's just much easier to parse, so the complaint is not relevant to those at all.
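To make the NDJSON point concrete, here is a rough Python sketch of line-by-line consumption. The helper name `iter_ndjson` and the sample records are invented; a real script would pass `sys.stdin` instead of the `StringIO` stand-in.

```python
import io
import json


def iter_ndjson(stream):
    """Yield one decoded record per non-empty line, holding only that line in memory."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for sys.stdin; the records are invented sample data.
sample = io.StringIO('{"host": "a", "ms": 12}\n{"host": "b", "ms": 30}\n')
slow = [rec["host"] for rec in iter_ndjson(sample) if rec["ms"] > 20]
```

Memory use is bounded by the longest line, the same as for any line-oriented text tool.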
You are underestimating the power of unix tools. A chain of unix tools can match or exceed the performance of C programs written by average programmers. That is a true beauty of unix and partly why it is still relevant today. The author has little idea about performance and doesn't understand how unix works; otherwise he wouldn't make arrogant claims like:
> With jc, we can make the linux world a better place until the OS and GNU tools join us in the 21’st century!
It wasn't. It was _inspired_ by JS's syntax (and that of Python), but wasn't designed for it. Crockford designed it as a lightweight data exchange format that used a familiar syntax. Quoting from the json.org website itself:
> JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.
JSON isn't terribly difficult to parse, nor does it require "generous amounts of memory". Shy of something like s-expressions, it's about as straightforward as you can get when it comes to structured data.
Netstrings are really useful but they encode strings, not structured data.
While Windows isn't a whole language OS like those, .NET and COM get pretty close to it, and that is what PowerShell knows about, instead of raw text.
This is what is missing across most traditional UNIX shells: the ability to integrate raw text, UNIX IPC (and newer ones like D-BUS/gRPC), shared libraries and structured data into a single REPL experience.
JC has streaming parsers that lazily output JSON lines for these types of commands. (ls, ping, vmstat, iostat, etc.)
The problem with JSON is that it does not support streaming by default. It's possible to use non-standard JSON-like formats to work around this, but then you're no longer using JSON!
ndjson is worth knowing about. We use it for things too large to stream.
This summarises the problem I have with JSON more succinctly. It was not designed for streaming, thus "it does not support streaming by default".
Non-standard, line-oriented JSON formats are usable, although as a user I cannot see how they offer any significant improvement over previous approaches with fewer brackets, braces, colons, commas and quotes (BBCCQ). Consider the BSD utility mtree or the BSD-version of stat. These have options to output text in "shell-friendly" formats,1 minus all the BBCC and excessive Q. Sure, people could add options to utilities to output XML, or line-oriented JSON, but generally they don't. Why is that? Perhaps there is a reason.
You said it best: "It's possible to use non-standard JSON-like formats to work around [JSON's limitations], but then you're no longer using JSON!"
Maybe JSON is just about hype or something. An attractant for today's "developers". This would explain why I am just not attracted by it.
1.
readable,
handy since probably every language has libs that work with it fine,
there are a lot of tools that work with JSON, e.g. generating code classes from JSON,
it's insanely popular,
really easy to learn
grep key file.json | awk -F: '{print $2}'
if you're already searching for a key, seems like you're just wanting the value.
granted, i hardly ever (have i actually ever??) interact with JSON this way, so i'm not exactly familiar with pitfalls.
(Granted grep still works, but...not nicely.)
awk -F: '/key/ {print $2}' file.json
~ % dig example.com | jc --dig | jq
[
  {
    "id": 61315,
    "opcode": "QUERY",
    "status": "NOERROR",
    "flags": [
      "qr",
      "rd",
      "ra"
    ],
    "query_num": 1,
    "answer_num": 1,
    "authority_num": 0,
    "additional_num": 1,
    "opt_pseudosection": {
      "edns": {
        "version": 0,
        "flags": [],
        "udp": 512
      }
    },
    "question": {
      "name": "example.com.",
      "class": "IN",
      "type": "A"
    },
    "answer": [
      {
        "name": "example.com.",
        "class": "IN",
        "type": "A",
        "ttl": 85586,
        "data": "93.184.216.34"
      }
    ],
    "query_time": 29,
    "server": "10.0.0.1#53(10.0.0.1)",
    "when": "Sun Dec 05 15:12:08 PST 2021",
    "rcvd": 56,
    "when_epoch": 1638745928,
    "when_epoch_utc": null
  }
]
Now, to find a query tool with a saner language than jq...
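For what it's worth, once the output is JSON any general-purpose language can act as the query tool. A rough Python sketch over a trimmed copy of the dig output above (the function name `a_records` is invented; in real use the text would come from stdin via something like `dig example.com | jc --dig | python3 script.py`):

```python
import json

# Trimmed copy of the jc --dig output shown above.
raw = """
[
  {
    "status": "NOERROR",
    "answer": [
      {"name": "example.com.", "class": "IN", "type": "A", "ttl": 85586, "data": "93.184.216.34"}
    ],
    "query_time": 29
  }
]
"""


def a_records(jc_dig_json):
    """Return the data field of every A record in every response."""
    responses = json.loads(jc_dig_json)
    return [
        answer["data"]
        for response in responses
        for answer in response.get("answer", [])  # a response may have no answer section
        if answer["type"] == "A"
    ]


addresses = a_records(raw)
```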
Also, there is jellex[1], which is a TUI wrapper around jello that can help you build your queries.
[0] https://github.com/kellyjonbrazil/jello [1] https://github.com/kellyjonbrazil/jellex
jq can get pretty deep, but for most things in this area I'm not sure how it could be improved upon. I would be interested in hearing alternatives, though.
https://github.com/fiatjaf/jiq
Is a realtime feedback wrapper which I find useful when crafting one-off command line uses for jq and it starts getting crazy.
Having multiple outputs is a great feature, though. I'm especially fond of tooling in Kubernetes that allows you to nicely pipe things in and out in multiple formats.
> The libxo library allows an application to generate text, XML, JSON, and HTML output using a common set of function calls. The application decides at run time which output style should be produced. The application calls a function "xo_emit" to product output that is described in a format string. A "field descriptor" tells libxo what the field is and what it means.
* https://github.com/Juniper/libxo
Then add an "--output-format" option.
As just one example, the Azure CLI defaults to human-readable output, but has an "output" parameter so you can have JSON if you want - I've never once wanted any kind of format auto-detection, and I have to say that I still don't.
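The explicit-option approach is cheap to implement. A minimal Python sketch, assuming a hypothetical tool called `mytool` with invented sample data; only the `--output-format` flag name comes from the suggestion above:

```python
import argparse
import json


def render(record, output_format):
    """Format one record as aligned text or as JSON."""
    if output_format == "json":
        return json.dumps(record)
    width = max(len(key) for key in record)
    return "\n".join(f"{key.ljust(width)}  {value}" for key, value in record.items())


def main(argv=None):
    parser = argparse.ArgumentParser(prog="mytool")  # hypothetical tool name
    parser.add_argument("--output-format", choices=["text", "json"], default="text")
    args = parser.parse_args(argv)
    record = {"cpu": 0.2, "kb_read_s": 0.12}  # invented sample data
    return render(record, args.output_format)
```

The human-readable default never changes depending on whether stdout is a terminal; scripts opt in with `--output-format json`.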
However, more modern shells fix this problem by having typed pipelines and builtins written to understand more than just a flat file of bytes.
Take _murex_ for example (disclaimer, I'm the author of that shell):
» jobs
PID   State      Background  Process  Parameters
2104  Executing  true        exec     sleep 9000000
2240  Executing  true        exec     sleep 9000000
It's readable but what if I wanted to pass it as a table?
» jobs | cat
["PID","State","Background","Process","Parameters"]
[2104,"Executing",true,"exec","sleep 9000000"]
[2240,"Executing",true,"exec","sleep 9000000"]
ok, so it auto-detects it is running as a pipe and outputs it as a jsonlines table. That would be annoying in Bash. But with a type aware shell, that shell knows it's a jsonlines table, eg
» jobs | debug | [[ /Data-Type/Murex ]]
jsonl
...but what can we do with a jsonlines table? Well you can select individual columns:
» jobs | [ PID State ]
[
    "PID",
    "State"
]
[
    "2104",
    "Executing"
]
[
    "2240",
    "Executing"
]
run SQL against it
» jobs | select * where PID > 2200
["PID","State","Background","Process","Parameters"]
["2240","Executing","true","exec","sleep 9000000"]
iterate through each row
» jobs | foreach proc { if { =$proc[0]>2200 } then { echo $proc } }
[2240,"Executing",true,"exec","sleep 9000000"]
or even just convert it into another format, like CSV
» jobs | format csv
PID,State,Background,Process,Parameters
2104,Executing,true,exec,sleep 9000000
2240,Executing,true,exec,sleep 9000000
...or YAML...
» jobs | format yaml
- - PID
  - State
  - Background
  - Process
  - Parameters
- - "2104"
  - Executing
  - "true"
  - exec
  - sleep 9000000
- - "2240"
  - Executing
  - "true"
  - exec
  - sleep 9000000
And it all just works without you having to think or even know what data format is traversing the pipeline.
However, unfortunately, none of this is possible with Bash. And thus the majority of tools are forced to be dumb to compensate.
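For anyone stuck without a type-aware shell, the jsonlines-table-to-CSV step is at least easy to script. A rough Python sketch of the idea, not how murex implements `format csv` (the function name `jsonl_table_to_csv` is invented):

```python
import csv
import io
import json


def jsonl_table_to_csv(text):
    """Convert a jsonlines table (one JSON array per line) to CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        # Keep strings as-is; re-serialise other JSON values so true stays "true".
        writer.writerow([cell if isinstance(cell, str) else json.dumps(cell) for cell in row])
    return out.getvalue()


table = """["PID","State","Background","Process","Parameters"]
[2104,"Executing",true,"exec","sleep 9000000"]
[2240,"Executing",true,"exec","sleep 9000000"]
"""
csv_text = jsonl_table_to_csv(table)
```

The difference is that here the script has to be told the input is jsonlines, which is exactly the knowledge a typed pipeline carries for you.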
Output should be readable, not structured.
"kb_read_s": 0.12
This worries me. JSON doesn’t have support for fixed point math, does it? When will some random POSIX tool spit out scientific notation at me?
Also, if you just output a flat schema, is there much of a point in this vs just:
cpu: 0.2
kb_read_s: 0.12
The difference is that you can use a JSON parser vs splitting on new lines and colons?
I do like the idea of JSON output as an option, but before every bug and mistake gets canonized as POSIX or some other standard, can we at least talk about the output format for a bit?
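On the number question: JSON has a single number type whose grammar allows an optional fraction and exponent, so a producer may legally emit `1.2e-1`, and there is no fixed-point type. A consumer can still choose how to read it. A small Python sketch with an invented document, using the standard `json` module's `parse_float` hook:

```python
import json
from decimal import Decimal

# The JSON grammar allows an exponent, so a producer may legally emit 1.2e-1.
doc = '{"cpu": 0.2, "kb_read_s": 1.2e-1}'

as_float = json.loads(doc)                         # binary floating point
as_decimal = json.loads(doc, parse_float=Decimal)  # keeps the decimal digits exactly
```

So scientific notation is the parser's problem rather than yours, but exact decimal handling is something each consumer has to opt into.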
cpu: 0.2
kb_read_s: 0.12
mac_addr: 10:AA:FF:00:55:66
dict((key.strip(), val.strip()) for line in text.split('\n') if line.strip() for key, _, val in [line.partition(':')])
Also who says we can’t have a universal parser for this format just like we have for JSON? Not everyone needs to write the one liner like above, just use the libtextformat.parse(text) or whatever we would call it.
cut -d: -f2-
Plain text doesn't have support for numbers at all, which isn't much of a solution.
Let’s put it this way: if I proposed XML as the substitution for plain text, would you rather keep plain text or switch to XML?
(I’m aware of powershell and am ignoring it consciously)
https://carreau.github.io/posts/29-JupyterCon-DisplayProtoco...
https://github.com/lmorg/murex
You’d have to learn a new shell syntax but at least it’s compatible with existing CLI tools (which Powershell isn’t)
> This way I can easily filter the data in jq or other tools without having to traverse levels.
How is doing `jq '.cpu.speed'` any harder than doing `jq '.cpu_speed'`?
IMO as long as you aren't going insane with nesting levels, it's actually better to have a proper structure than dumping everything into an ugly flat object.
Grabbing an attribute is not necessarily any harder in a deeply nested structure, but filtering based on multiple deeply nested attributes in different branches can make a query quite complex.
It certainly seems like it does when the first example for flattening is oversimplifying an already simple structure that doesn't really need flattening. Maybe that was not meant to be a serious example but rather just for ease of understanding, but then the article should have probably said so.
> filtering based on multiple deeply nested attributes in different branches can make a query quite complex
Can you elaborate on this please? Maybe I'm just too tired to think clearly at 1 am, but I don't see how filtering is any harder. You would just do something like `jq '.foo | select(.bar.baz >= 42 and .qux.moo.asd == "abc")'`.
If you are accessing deeply nested data, you have to account for the existence of keys at every layer: if cpu exists, check whether speed exists, then access speed. There is nothing wrong with deep nesting as long as you can guarantee that a key and its data will be generated, but more often than not, when the data is not generated, the key simply will not exist in the JSON.
And people do get carried away with nesting. Also it is nice to have core information available at the surface level of the JSON file.
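A sketch of what that per-layer existence check looks like in practice, in Python with invented records. The helper name `dig_path` is made up, and the field names are only loosely borrowed from the examples in this thread:

```python
def dig_path(data, *keys, default=None):
    """Follow keys through nested dicts, returning default if any level is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Invented records: the second one has no "cpu" branch at all.
records = [
    {"cpu": {"speed": 3200}, "disk": {"io": {"kb_read_s": 0.12}}},
    {"disk": {"io": {"kb_read_s": 4.5}}},
]

# Filter on attributes from two different branches without raising KeyError.
fast_and_idle = [
    rec for rec in records
    if (dig_path(rec, "cpu", "speed", default=0) >= 3000
        and dig_path(rec, "disk", "io", "kb_read_s", default=0.0) < 1.0)
]
```

With a flat schema the helper collapses to a plain `rec.get("cpu_speed", 0)`, which is roughly the trade-off being argued about.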
$ jc -h --dig
1. human review
2. scripting / automation
In case one, human readable formats are obviously preferable. But the moment you need to script a command, you want it in a machine readable format.
A perfect example of the difference between the two is how badly spaces in file names are handled. Granted, POSIX deserves a lot of the blame here too.
I think it was called "recs"
-c: Produce computer-readable output
So I tried it out:
~ networkquality -c | jq '{dl: (.dl_throughput / 1000000), ul: (.ul_throughput / 1000000)}'
awesome!
{
  "dl": 176.861488,
  "ul": 6.742952
}
(Starlink in Perth, Western Australia)
In practice this enabled generating coverage reports orders of magnitude faster than traditional gcov wrappers like lcov
1. You don't need to categorize every piece of data
2. You don't need to include everything in a single JSON file.
Deep nesting JSON is very annoying. The key-value pair structure of JSON is simply being abused at this point. Also, I really don't appreciate using numerical values as keys. Please use a list.
Optional header, selectable columns, one line per record, machine readable (raw) vs human readable numbers.
I’ve nothing against JSON output but I just don’t need it when you can print out two columns, select on the first, and print the second.
$ users -H -o name,hair |
> awk '$1 == "gorgoiler" {print $2}'
gray
Admittedly, that awk invocation is so commonly used it could probably be a lot more terse. Also, this whole house of cards collapses when you have data containing spaces.
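To illustrate the spaces problem, here is a small Python comparison. The `users` output is invented (including the "mary ann" row), and `hair_by_whitespace` only mimics awk's default field splitting:

```python
import json

# Invented output of a hypothetical `users -o name,hair` where one name contains a space.
plain = "gorgoiler gray\nmary ann red\n"
jsonl = '{"name": "gorgoiler", "hair": "gray"}\n{"name": "mary ann", "hair": "red"}\n'


def hair_by_whitespace(text, name):
    """Mimic awk: split on whitespace, match field 1, return field 2."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return fields[1]
    return None


def hair_by_json(text, name):
    """Match on the parsed name field, so embedded spaces are harmless."""
    for line in text.splitlines():
        record = json.loads(line)
        if record["name"] == name:
            return record["hair"]
    return None
```

The whitespace version cannot find "mary ann" at all and reports "ann" as the hair colour for "mary", while the JSON lines version is unaffected. A tab or NUL delimiter would also fix it, as long as the data can never contain that delimiter.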