Show HN: UXY – adding structure to Unix tools

(github.com)

126 points · rumcajz · 7y ago · 39 comments

jph · 7y ago · 7 in thread

Excellent, thank you for creating UXY!

I will donate $50 to you or your favorite charity to encourage a new feature: to-usv, which outputs Unicode separated values (USV) with unit separator U+241F and record separator U+241E.

Unicode separated values (USV) are much like comma separated values (CSV), tab separated values (TSV) a.k.a. tab delimited format (TDF), and ASCII separated values (ASV) a.k.a. DEL (Delimited ASCII).

The advantages of USV for me are that it handles text that happens to contain commas, tabs, or newlines, and that its separators have a visible character representation.

For example, USV works well for me within typical source code, such as Unix scripts, because the characters show up, are easy to copy and paste, and are easy to use within various kinds of editor search boxes.

Bonus: if the programming implementation of to-usv calls a more-flexible function that takes a unit separator string and a record separator string, then you can easily create similar commands for to-asv, to-csv, etc.
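
A minimal sketch of that "more-flexible function" idea, in Python. The names `to_separated`, `to_usv`, and `to_asv` are hypothetical and not part of UXY, and the sketch assumes no escaping at all: a field that already contains a separator is rejected rather than escaped.

```python
# Hypothetical helpers for illustration; none of these names exist in UXY.
USV_UNIT, USV_RECORD = "\u241f", "\u241e"  # the visible symbols ␟ and ␞
ASV_UNIT, ASV_RECORD = "\x1f", "\x1e"      # the ASCII control characters US and RS


def to_separated(rows, unit_sep, record_sep):
    """Join rows of string fields using the given separators.

    No escaping is performed: a field containing either separator
    raises ValueError instead of producing ambiguous output.
    """
    records = []
    for row in rows:
        for field in row:
            if unit_sep in field or record_sep in field:
                raise ValueError("field contains a separator: %r" % (field,))
        records.append(unit_sep.join(row))
    return record_sep.join(records)


def to_usv(rows):
    return to_separated(rows, USV_UNIT, USV_RECORD)


def to_asv(rows):
    return to_separated(rows, ASV_UNIT, ASV_RECORD)
```

A `to-csv` built the same way would need quoting on top of plain joining, so it would probably want a real CSV writer rather than this function.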

inimino · 7y ago

Eventually you have to deal with content that contains your separator characters, however obscure. So essentially you have two choices:

A. use some "weird" separators and hope those don't appear in your input

B. bite the bullet and escape and parse properly

Option A is perfectly reasonable for one-offs, where you can handle exceptional cases or know they won't occur because you know what's in the data. However for reusable code, you need option B, which means not using `cut` to parse CSV files, for instance (since commas can occur inside double-quoted strings). In that case, what's the benefit of using USV over an existing, more common, format?
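
To illustrate the `cut` problem, here is a small Python sketch comparing naive comma splitting (roughly what `cut -d,` does) with a parser that understands double-quoted fields. The sample data is made up for the example.

```python
import csv
import io

# Made-up sample: the second record has a comma inside a quoted field.
data = 'id,name,city\n1,"Smith, John",Prague\n'

# Naive splitting on every comma, roughly what `cut -d,` does.
naive = [line.split(",") for line in data.splitlines()]

# Parsing that honours double-quoted fields.
parsed = list(csv.reader(io.StringIO(data)))

print(naive[1])   # ['1', '"Smith', ' John"', 'Prague']
print(parsed[1])  # ['1', 'Smith, John', 'Prague']
```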

jph · 7y ago

Yes you're exactly right about escaping.

Orthogonal to escaping, the choice is what characters to use for unit separator and record separator.

If the data are for machines only, then for me the choice of characters doesn't matter. If the data are potentially for reading or editing, such as by a programmer, then my choice is to prefer typically-visible characters over typically-invisible characters and/or zero-width characters (e.g. ASV a.k.a. DEL a.k.a. ASCII 30 & 31).

My choice of USV is thus because U+241F and U+241E are visible, and also in Unicode they are semantically meaningful.

driax · 7y ago

U+241E is "SYMBOL FOR RECORD SEPARATOR". It seems a bit weird to use that as a separator instead of simply U+1E which is the ASCII character "record separator".

dbro · 7y ago

While not exactly what you asked for, I wrote something similar called csvquote ( https://github.com/dbro/csvquote ) which transforms "typical" CSV or TSV data to use the ASCII characters for field separators and record separators, and also allows for a reverse transform back to regular CSV or TSV files.

It is handy for pipelining UNIX commands so that they can handle data that includes commas and newlines inside fields. In this example, csvquote is used twice in the pipeline, first at the beginning to make the transformation to ASCII separators and then at the end to undo the transformation so that the separators are human-readable.

> csvquote foobar.csv | cut -d ',' -f 5 | sort | uniq -c | csvquote -u

It doesn't yet have any built-in awareness of UTF or multi-byte characters, but I'd be happy to receive a pull request if it's something you're able to offer.
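
A rough Python sketch of the general technique, not csvquote's actual implementation. It assumes the approach is to swap the commas and newlines that occur inside quoted fields for the ASCII unit and record separators, which is what the `cut -d ','` pipeline above suggests. The function names `sanitize` and `restore` are hypothetical.

```python
import csv
import io

UNIT_SEP = "\x1f"    # stands in for a comma inside a quoted field
RECORD_SEP = "\x1e"  # stands in for a newline inside a quoted field


def sanitize(text):
    """Turn CSV into plain comma/newline-delimited text that cut and sort can split."""
    lines = []
    for row in csv.reader(io.StringIO(text)):
        fields = [f.replace(",", UNIT_SEP).replace("\n", RECORD_SEP) for f in row]
        lines.append(",".join(fields))
    return "\n".join(lines) + "\n"


def restore(text):
    """Undo sanitize(): put commas and newlines back, quoting fields where needed."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Split on "\n" only: str.splitlines() would also split on RECORD_SEP.
    body = text[:-1] if text.endswith("\n") else text
    for line in body.split("\n"):
        fields = [f.replace(UNIT_SEP, ",").replace(RECORD_SEP, "\n")
                  for f in line.split(",")]
        writer.writerow(fields)
    return out.getvalue()
```

Between the two calls, every record is exactly one line and every comma is a real field boundary, so line-oriented tools behave. The round trip restores the data, though the quoting style of the output may differ from the original file.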

rabidrat · 7y ago

How is USV better than ASV, which would use U+001E and U+001F?

Also, is your offer available for other tabular data tools? :)

jph · 7y ago

USV is better than ASV for me because USV is visible.

For example I can write code samples such as:

  echo 'a␟b␟c␞d␟e␟f␞g␟h␟i' | tr ␟␞ '\t\n'
  a    b    c
  d    e    f
  g    h    i

Yes my offer is available for other tabular data tools. I want USV to become a good choice for data exchange. Message me at my contact information in my profile here.
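
The same split can be sketched in Python, which works on code points rather than bytes. That may matter because some `tr` implementations are byte-oriented, so the one-liner above can behave differently depending on the implementation and locale. The function name `parse_usv` is hypothetical.

```python
UNIT_SEP = "\u241f"    # ␟ SYMBOL FOR UNIT SEPARATOR
RECORD_SEP = "\u241e"  # ␞ SYMBOL FOR RECORD SEPARATOR


def parse_usv(text):
    """Split USV text into a list of records, each a list of fields (no escaping)."""
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


rows = parse_usv("a␟b␟c␞d␟e␟f␞g␟h␟i")
for row in rows:
    print("\t".join(row))
```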

kragen · 7y ago

I think you're going to need a bigger budget to establish your new proposed standard through consulting fees. Do you remember what happened to GCC's CHILL frontend?

dima55 · 7y ago · 5 in thread

This is becoming a really crowded space. Some other similar tools that make slightly different design choices and envision different use cases:

- https://github.com/dkogan/vnlog

- https://csvkit.readthedocs.io/

- https://github.com/johnkerl/miller

- https://github.com/BurntSushi/xsv

- https://github.com/eBay/tsv-utils-dlang

- https://stedolan.github.io/jq/

- http://harelba.github.io/q/

- https://github.com/BatchLabs/charlatan

- https://github.com/dinedal/textql

- https://github.com/dbohdan/sqawk

(disclaimer: vnlog is my tool)

vthriller · 7y ago

I'd argue this is more about quacking like a PowerShell than manipulating xSV/JSON in the pipeline. So here's my quick bunch of links that show the demand for that.

Here people emulate formatted and filtered ps(1) using GVariants and a bunch of CLI tools:

https://blogs.gnome.org/alexl/2012/08/10/rethinking-the-shel...

Here people use SQL to query and format data right from the shell:

https://github.com/jhspetersson/fselect

https://github.com/facebook/osquery

Also, libxo is a library that allows tools like ls(1) in FreeBSD to generate data in various formats (e.g. JSON):

https://wiki.freebsd.org/LibXo

(edit: formatting)

majkinetor · 7y ago

There are actually far more tools created so far, for example:

- https://github.com/adamwiggins/rush

- https://github.com/xonsh/xonsh

It's amazing that people still try this nowadays, now that pwsh has solved it for all.

nailer · 7y ago

> This is becoming a really crowded space.

Those who fail to understand powershell are condemned to recreate it poorly.

It'd be great for GNU to create a standard for native structured output (as well as a converter tool like the one in this post), then have other tools be able to do it.

But realistically, pwsh is Open Source, runs just fine on Unix boxes and does this now.

majkinetor · 7y ago

Amen to that

jasonpeacock · 7y ago

Also:

- https://github.com/benbernard/RecordStream

nerdponx · 7y ago · 4 in thread

Seems a lot like the Powershell model, which I have mixed feelings about. It's nice for shell scripts, but it makes day-to-day usage cumbersome. I think you can actually use Powershell on Linux, but I'm interested to see where this tool goes.

nailer · 7y ago

> It's nice for shell scripts, but it makes day-to-day usage cumbersome.

How? `ps | kill node`. No pgrep hack because ps outputs a list of processes, not a line of text. As a Unix person, Windows Terminal and pwsh are where I spend most of my day.

majkinetor · 7y ago

> In my experience Powershell is quite a bit more verbose than that.

This is a common misconception. Posh allows both verbose and shortened styles via various mechanisms: command aliases, parameter abbreviations and aliases, proxies, pipeline settings for objects, etc.

nerdponx · 7y ago

Valid. In my experience Powershell is quite a bit more verbose than that. If this manages to press both the "object oriented" and "concise" buttons then I'll be very happy to use it indeed.

adrianratnapala · 7y ago

In the Powershell model, I thought things stayed as structured objects in reality, although the UI was ready to render them as text. This seems to be about continuing to use text, but being disciplined about formatting.

If the above characterisation is right, it is a middle-ground between Powershell and traditional methods.

Also, this is not introducing a new shell language.

adrianratnapala · 7y ago · 3 in thread

> * any other escape sequence MUST be interpreted as ? (question mark) character.

Isn't it better to forbid them? Presumably you are saving the space for further extensions, but this is allowing readers to interpret them as '?'

Similarly, what is the rationale for interpreting control characters as '?'? Instead you can ban them, with the possible exception of treating tabs as spaces.

rumcajz (OP) · 7y ago

Postel's principle: Be liberal in what you accept... It means that the tool won't crash just because there's weird input.

adrianratnapala · 7y ago

But as written, the definition permits generating the weird output. Which ignores the other arm of Postel's principle.

You could have the format definition forbid these characters; and then in the section about Postel's principle have "If you choose to accept forbidden characters, you MUST treat them as '?'".

wgoodall01 · 7y ago

Isn't it better to crash than to fail silently, possibly storing malformed data?
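
The two positions in this subthread can be sketched as one decoder with a switch, in Python. The escape table below is an assumption made for illustration and may not match the actual UXY spec; only the "any other escape sequence becomes ?" rule comes from the quoted text. The name `unescape` and its `strict` flag are hypothetical.

```python
# Assumed set of recognised escapes; check the UXY spec for the real list.
KNOWN_ESCAPES = {"\\": "\\", '"': '"', "n": "\n", "t": "\t"}


def unescape(field, strict=False):
    """Decode backslash escapes in a field.

    strict=False follows the quoted rule: unknown escapes become '?'.
    strict=True rejects them instead, as the replies suggest.
    """
    out = []
    chars = iter(field)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        nxt = next(chars, None)  # None means a trailing lone backslash
        if nxt in KNOWN_ESCAPES:
            out.append(KNOWN_ESCAPES[nxt])
        elif strict:
            raise ValueError("unknown escape sequence after backslash: %r" % (nxt,))
        else:
            out.append("?")
    return "".join(out)
```

A writer that never emits unknown escapes, paired with a reader that tolerates them, would cover both arms of Postel's principle.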

rabidrat · 7y ago · 2 in thread

Very cool, I've had a similar idea myself recently! Though, why not go with a simpler format like TSV (tab-separated values)? Then you don't have to worry about quoting and escaping anything but tabs and newlines (which are very rare in tabular data).

rumcajz (OP) · 7y ago

Tabs are a nightmare to deal with when you want to align the columns. Also, I don't consider tabs to be human readable: They are too easily confused with spaces. (Case in point: make)

rabidrat · 7y ago

Fair enough, I've experienced those pains myself. But what is the strategy with UXY? kind of a semi-fixed-width format that is only partially aligned, but still requires quoting/escaping? I'm not sure it's any better than CSV or PSV (pipe), and it also doesn't interoperate with existing tools.

I'm not attacking your overall idea, btw. I've just given this a bunch of thought myself, and the design space is very tricky. My current approach would be to use ASV (ASCII codes 28-31) and abandon 'cat'-based readability in favor of a 'vcat' which gives you a better visual representation. Of course that has its issues too.. :)

bayareanative · 7y ago · 1 in thread

A related problem is the constant churn of logging.. taking structured data, destructuring it with a string serialization and then parsing it again.

This resource-wasting antipattern pops up over and over again.

Also, logs are message-oriented entries and serializing them as discrete, lengthy files is insane.

Structured data should stay structured, say a time-series / log-structured database. Destructuring should be a rare event.

xelxebar · 7y ago

I think Plan 9 gives a nice distinction. We use files as both a persistent store as well as an interface, so it seems nice to separate those two concerns out. That way you could have your logs as a UI into application state and only incur the overhead of serialization and persistence when you deem necessary.

Caveat, my Plan 9 experience is mostly theoretical.

koolba · 7y ago · 1 in thread

> uxy align

> Aligns the data with the headers. This is done by resizing the columns so that even the longest value fits into the column.

> ...

> This command doesn't work with infinite streams.

Does this do nothing with infinite streams or does it do a "rolling" alignment?

Even with an infinite stream you can keep track of the max width seen thus far and align all future output to those levels. It'll still have some jank to the initial alignment but assuming a consistent distribution of the lengths over time it'd be good enough for eyeballing the results.
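
A minimal Python sketch of that "rolling" alignment, assuming records arrive as lists of strings. It is a generator, so it never needs to see the end of the stream. The name `rolling_align` is hypothetical and this is not how `uxy align` is implemented.

```python
def rolling_align(records):
    """Yield aligned lines, widening each column as longer values are seen."""
    widths = []
    for fields in records:
        # Add columns if this record has more fields than any seen so far.
        widths.extend([0] * (len(fields) - len(widths)))
        for i, field in enumerate(fields):
            widths[i] = max(widths[i], len(field))
        # Pad every field to the widest value seen so far in its column.
        yield "  ".join(f.ljust(w) for f, w in zip(fields, widths)).rstrip()
```

Lines emitted before a wider value shows up stay narrow, which is the initial jank mentioned above. Everything after that point lines up.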

rumcajz (OP) · 7y ago

Currently it uses the alignment of the headers as the default. It's only when a field exceeds the size of the header that the output is misaligned. The next record returns to the default alignment though.

I was thinking about adding a 'trim' command that would trim long fields to fit into the default field size.
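
A Python sketch of header-based alignment with an optional trim, under the simplifying assumption that each column has a fixed width supplied by the caller. The function `align_to_header` and its `trim` flag are hypothetical and not taken from the UXY source.

```python
def align_to_header(header, widths, records, trim=False):
    """Yield the header and records padded to fixed column widths.

    With trim=False an over-long field pushes the rest of its own line out
    of alignment, and the next record is back in place. With trim=True
    over-long fields are cut to the column width instead.
    """
    yield "  ".join(name.ljust(w) for name, w in zip(header, widths)).rstrip()
    for fields in records:
        cells = []
        for field, width in zip(fields, widths):
            if trim:
                field = field[:width]  # drop whatever does not fit
            cells.append(field.ljust(width))
        yield "  ".join(cells).rstrip()
```

Trimming keeps the table readable for eyeballing but loses data, so it seems better suited to display than to further processing.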

dharmatech · 7y ago · 1 in thread

Cool project!

Have you considered having a way to render output in a graphical toolkit?

See for example:

https://github.com/dharmatech/PsReplWpf

which renders PowerShell output in WPF presentations.

dima55 · 7y ago

You can use this (I wrote it, and have been using it daily for many years): http://github.com/dkogan/feedgnuplot

mijoharas · 7y ago · 1 in thread

Can anyone elaborate on why the tool is named UXY? I couldn't find anything in the repo, and there is no wiki.

imglorp · 7y ago

Seems like an acro-mondeau of UX (user experience) and XY (tabular format). The tool normalizes some of the Unix tool outputs as a table which can be manipulated.

no_gravity · 7y ago

I think this is putting too many different functions into a single command.

    uxy ls

This looks like it "tabifies" the output of a given command. That is, it turns the output of the given command into a tab-separated format.

    uxy reformat "NAME SIZE"

This seems to collide with the above since "reformat" is not a command which will be tabified. Instead it filters stdin for two columns.

    uxy align

This seems to do the same as "column -t".

vram22 · 7y ago

For anyone interested in learning how to create their own Unix command-line tools (not just use them), feel free to check out these links to content by me (about doing such work in C and Python):

1) Developing a Linux command-line utility: an article I wrote for IBM developerWorks:

https://jugad2.blogspot.com/2014/09/my-ibm-developerworks-ar...

Follow links in the article to go to the source code of the tool described in the tutorial, and the PDF of the IBM dW article.

2) My comment, here:

https://news.ycombinator.com/item?id=19564706

on this HN thread:

Ask HN: Looking for a series on implementing classic Unix tools from scratch:

https://news.ycombinator.com/item?id=19560418
