Problem solving with Unix commands

(vegardstikbakke.com)

356 points · v3gas · 7y ago · 211 comments

134 comments · 39 top-level

dnet7y ago· 17 in thread

Removing leading zeroes doesn't require Python. One easy solution would be sed:

    $ echo -e '0001\n0010\n0002' | sed 's/^0*//'
    1
    10
    2

mshook7y ago

Yeah plus seq can generate sequences with leading zeroes (something like seq -f %04.f 1 20).

So instead of scripting, he could have generated a sorted list of numbers from the files he had. Created a file with the sequence of numbers for the range and diffed/commed the whole thing. Voilà...
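For comparison, the "generate the full range, then diff against what exists" idea can be sketched in Python as a set difference. This is only an illustration: the function name `missing_numbers`, the `_A.csv` suffix default, and the sample listing are assumptions, not anything from the article.

```python
def missing_numbers(filenames, first=1, last=500, suffix="_A.csv"):
    """Return the numbers in [first, last] that have no NNNN<suffix> file."""
    present = {
        int(name[: -len(suffix)], 10)  # base 10, so "0010" is ten, not octal
        for name in filenames
        if name.endswith(suffix) and name[: -len(suffix)].isdigit()
    }
    expected = set(range(first, last + 1))  # plays the role of `seq first last`
    return sorted(expected - present)       # plays the role of comm/diff


# Hypothetical directory listing, for illustration only.
listing = ["0001_A.csv", "0001_data.csv", "0002_data.csv", "0003_A.csv"]
print(missing_numbers(listing, first=1, last=3))  # [2]
```

In real use the listing would come from something like `os.listdir(".")`.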

pwg7y ago

The seq provided in the GNU toolset has a -w flag to turn on "equal width" mode, so one can also get zero padded numbers out (from GNU seq) by turning on that mode and zero padding the input:

    $ seq -w 0001 0003
    0001
    0002
    0003
    $

lelf7y ago

  seq -f %04g 0 3
  echo {0000..0003} # if bash

v3gasOP7y ago

Did not know that. Thanks! That's much easier.

leblancfg7y ago

Yes. Or if you're going to whip out Python, might as well make it all in Python.

masklinn7y ago

Very much my thinking, especially when there are no significant commands you need to shell out to:

    import sys, pathlib
    basedir = pathlib.Path(sys.argv[1])
    for i in range(1, 501):
        if not (basedir / f'{i:04}_A.csv').is_file():
            print(i)

jraph7y ago

    $ echo -e '0001\n0010\n0002\n0' | bc
    1
    10
    2
    0

hellabites7y ago

This is better relative to the `sed` solution as it handles the `0000` case well.

jfk137y ago

If one of the items is actually zero, this would delete it entirely, which probably isn't the desired result.
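The all-zeros case is easy to handle explicitly. A minimal Python sketch of the same stripping, assuming the input is a plain string of digits (the helper name is made up):

```python
def strip_leading_zeros(text):
    """Drop leading zeros but keep a single "0" for all-zero input."""
    return text.lstrip("0") or "0"


for item in ["0001", "0010", "0002", "0000"]:
    print(strip_leading_zeros(item))
# prints 1, 10, 2 and 0, one per line
```

The `or "0"` fallback is what a bare `sed 's/^0*//'` lacks: without it, `0000` becomes an empty line.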

neokantian7y ago

Yeah, no need for Python. Even the following seems to work fine:

    $ printf "%d\n" 003

scbrg7y ago

Well, that works up to a point. For some, that point might be considered a bit too close to zero...

    $ printf "%d\n" 009
    sh: 1: printf: 009: not completely converted
    0

lelf7y ago

  $ printf %d 010
  8
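What is happening in both cases is that a leading zero makes `printf` read the argument as octal, so `010` is eight and `009` is not a valid number at all. A rough Python sketch of that rule, assuming non-negative input with no hex prefix (`parse_like_printf` is a made-up name, not a real API):

```python
def parse_like_printf(text):
    """Mimic C-style integer parsing: a leading 0 means octal."""
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)  # raises ValueError for the digits 8 and 9
    return int(text, 10)


print(parse_like_printf("010"))  # 8, like printf %d 010
print(int("010", 10))            # 10, what was actually wanted
```

Forcing base 10 is the fix, which is what the `10#` prefix does in bash and zsh arithmetic.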

eMSF7y ago

Given the situation in the article, you might as well do this:

  ls ????_A.csv | grep -o '[1-9][0-9]*'

v3gasOP7y ago

Oh nice. sed still scares me a bit.

dahart7y ago

Definitely check out perl one liner patterns. Perl is less scary and more powerful and usually almost as short as sed commands. Perl can often replace pipelines that use both sed and awk. Perl one liners need judicious flags though, common patterns look like perl -ne, perl -pi -e, perl -lane ... these do very different things. Once you know them, it’s like a minor superpower.

SomethingOrNot7y ago

I’m not terribly experienced with Unix tools but I reckon that it might be best to just use Perl instead. Then you just have to worry about PCRE instead of PCRE in addition to old-style regexps.

Then again, Perl is even scarier.

ianai7y ago

It’s useful but highly cryptic in my usage.

pdkl957y ago· 10 in thread

Gary Bernhardt[1] gave a great talk about practical problem solving with the unix shell: "The Unix Chainsaw"[2].

"Half-assed is OK when you only need half of an ass."

In the talk, he gives several demonstrations of a key aspect of why unix pipelines are so practically useful: you build them interactively. A complicated 4 line pipeline started as a single command that was gradually refined into something that actually solves a complicated problem. This talk demonstrates the part that isn't included in the usual tutorials or "cool 1-line command" lists: the cycle of "Try something. Hit up to get the command back. Make one iterative change and try again."

[1] You might know him from his other hilarious talks like "The Birth & Death of JavaScript" or "Wat".

[2] https://www.youtube.com/watch?v=sCZJblyT_XM

SomethingOrNot7y ago

> In the talk, he gives several demonstrations of a key aspect of why unix pipelines are so practically useful: you build them interactively.

The standard Unix interface might have been interactive in the ’70s, back when hardware and peripherals were horribly non-interactive. But I don’t know why so many so-called millennial programmers (people my age) get excited about the alleged interactivity of the Unix that most people are familiar with. It doesn’t even have the cutting edge ’90s interactivity of Plan 9, what with mouse(!) selection of arbitrary text that can be piped to commands and so on. And every time someone comes up with a Unix-hosted tool that uses some kind of fold-up menu that informs you about what key combination you can type next (you know, like what all GUI programs have with Alt+x and the file|edit|view|… toolbar), people hail it as some kind of UX innovation.

jraph7y ago

I think the interactivity you describe might be a different thing from what your parent is talking about.

From what I understand, your parent talks about how the commands are built iteratively, with some kind of trial-error loop, which is a strength that is supposedly not emphasized enough. And I agree by the way. Nothing to do with how things are input.

smittywerben7y ago

I’m a lone dev that works with moderate-size data and whatever UX solution you’re thinking of is slow or doesn’t exist.

Yes, Bash is an untyped hell. But when I pipe 100 GB to stdout, my computer’s death wish is to show me the fucking data.

mnarayan017y ago

> mouse(!) selection of arbitrary text that can be piped to commands and so on

E.g.:

  xclip -out | ...

or do you mean something different?

SomethingOrNot7y ago

One interactive feature I like about Bash though (or shell or whatever) is C-x C-e to edit the current line in a text editor (FC). Great when I know how to transform some text in the editor easily but not on the command line.

mercer7y ago

This is a reason why I like languages that have REPLs, and perhaps an advantage of dynamically typed, and especially functional languages that I feel is often not emphasized enough (judging by the recent discussions on the issue).

I love how I can just open up IEx (Elixir) and work my way through various functions in my app interactively. If something doesn't work as expected, I change the code and run the 'recompile' command. Then, once things work and eventually stabilize, I add typespecs for at least some of the benefits of an explicitly typed language. In some cases I might make (unit) tests part of the process, but that depends on the situation.

yesenadam7y ago

AWK makes most of that grep/cut/count-type stuff so much easier, 1-liners need something like 3 parts instead of 10.

(My grep doesn't work like in the video, but..)

  for f in *.rb;do awk '$1 ~ /class|module/ {print $2}' $f;done |
    awk '{a[$1]++} END{for (q in a) print a[q],q}' | sort -n

This seems to do what all the sed, cut, regrepping, wc etc do. AWK (in default mode, anyway) removes leading spaces, makes cutting and counting words easy. It took about 30 seconds to write, too.
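For readers who don't know AWK, roughly the same counting logic in Python. This is a sketch under assumptions: it takes lines that have already been read from the `.rb` files, and unlike the awk version it skips lines with no second field.

```python
import re
from collections import Counter


def count_definitions(lines):
    """Count the second word of lines whose first word matches class|module."""
    counts = Counter()
    for line in lines:
        fields = line.split()  # like awk, ignores leading whitespace
        if len(fields) >= 2 and re.search(r"class|module", fields[0]):
            counts[fields[1]] += 1
    # Ascending by count, like the final `sort -n`.
    return sorted(counts.items(), key=lambda pair: pair[1])


# Made-up sample input.
sample = ["class Foo", "  module Bar", "class Foo < Base", "def x"]
print(count_definitions(sample))  # [('Bar', 1), ('Foo', 2)]
```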

matt_j7y ago

Every time I come up with some crazy bash construction someone shows me how I could have done it more elegantly with AWK and nothing else. <3

https://gregable.com/2010/09/why-you-should-know-just-little...

svsucculents7y ago

That's the tip of the iceberg. You can do anything in awk.

    for a in *.c; do
      awk 'function lensort(a,zerp, x,tmp) {for (x = 1 ; x < zerp ; x++) {tmp = a[x]; if (length(a[x + 1]) < length(a[x])) {a[x] = a[x + 1]; a[x + 1] = tmp}}} /^(int|void|char|struct|long|float)/ {while (x < 2) {getline ; arr[$1] = (substr($0,1,1) == "{") ? "\n" : $0 "\n" ; x++}} END {lensort(arr,v); for (c in arr) {printf "%s\n",(length(arr[c]) > 0) ? arr[c] : ""}}' ${a}
    done

voltagex_7y ago

I have never been able to get my head around awk - how did you learn it?

yuriko7y ago· 10 in thread

If you bother to write a python script to parse the integers, why not use python to solve the whole problem?

AnIdiotOnTheNet7y ago

This is one of the many reasons I think PowerShell did UNIX philosophy better: you don't need to parse text because the pipelines pass around typed objects. You can kinda almost get the same behavior from some UNIX commands by first having them dump everything into JSON and then having the other end parse the JSON for you, but you're still relying on a lot of text parsing. Personally I think it is high time the UNIX world put together a new toolset.

geophile7y ago

Take a look at osh (Object SHell): https://github.com/geophile/osh

It is a Python implementation of this idea: OS objects like files and processes are represented in Python. You construct pipelines as in UNIX, but passing Python objects instead of strings. E.g. to find the pids of /bin/bash commands:

    osh ps ^ select 'p: p.commandline.startswith("/bin/bash")' ^ f 'p: p.pid' $

- osh: Runs the tool, interpreting the rest of the line as an osh command.

- ^: Piping syntax.

- ps: Generate a stream of process objects, (the currently running processes).

- select: Select those processes, p, whose commandline starts with /bin/bash.

- f: Apply the function to the input stream and write function output to the output stream, (so, basically "map"). The function computes the pid of an input process.

- $: Print each item received in the input stream.

Osh also does database access (query results -> python tuples) and remote access. E.g., to get a process listing of (pid, commandline) on every node in a cluster:

    osh @clustername [ ps ^ f 'p: (p.pid, p.commandline)' ] $

useerup7y ago

Indeed, in PowerShell you could do:

    1..500 | ?{ !(Test-Path ('{0:0000}_A.csv' -f $_)) }

Explanation:

1..500 generates a sequence of numbers 1 through 500

| pipes the numbers

?{ … } is a filter that is evaluated for each item (number)

! negates the following expression

Test-Path tests that a file exists

-f formats the string left of -f with the parameters (zero-based) to the right of -f

'{0:0000}_A.csv' is a pattern which formats parameter 0 as 4 digits, zero-padded.

EDIT: Explanation

skywhopper7y ago

Why replace your hammers, screwdrivers, and chisels just because someone invented a 3D printer? They have tradeoffs. Powershell has some good ideas, and benefits from having been invented altogether, rather than evolving over four decades. But in practice it's not as efficient for doing simple things. It's oriented towards much more complex data structures, which is great... but there's no need to throw out your simpler tools just because you think they look ugly.

zokier7y ago

PowerShell sort of cheats, which enables the nice object pipelines; all cmdlets are .net modules that are run within the same runtime. That makes PowerShell much closer to "normal" programming languages with repls than traditional shells. That is also why PowerShell model is not directly a good fit to the UNIX world.

I would like to see more work done in the realm of object shells (and have some ideas myself), especially around designs that meld in more of the UNIXy way of having independent communicating processes. But it is a difficult problem domain, and many approaches would involve rewriting a lot of the base system we take for granted, which is just a huge amount of work.

PowerShell had the benefit of having stuff like WMI, COM, the whole .NET, and of course all the resources, funding and marketing from MS. Even then it has seemed to have been an uphill struggle, despite there being far more a need for PS in the Windows world.

v3gasOP7y ago

Fair point! I'd argue there's a difference in time spent in writing a python script to solve it all, and just parsing the ints as I did. Python was my first thought for how to parse the ints.

james_s_tayler7y ago

Great article. I didn't know about comm actually, so that was new for me!

With regard to the 'ints' it helps to not think about them as ints but rather just some text that follows the pattern ^[0]*[1-9]* or some equivalent to that. "Starts with any number of zeros or no zeros followed by any number of numbers between 1 to 9 or none at all."

So long as you know the repeatable pattern you can always use sed to just replace the part of the pattern you want gone with nothing which effectively deletes it from the output. sed is like a Swiss army knife in that regard because you can do nice simple deletions like that and even iterate on them if you need or you can do quite complicated capture groups if you need to as well. Sed can get you unbelievably far in terms of shaping text in a stream.

I have a few tricks I've learned with various tools that I thought were worth writing down. Hopefully you can find some more useful stuff.

https://github.com/nicostouch/grep-sed-awk-magic/blob/master...

nazri17y ago

You can use numfmt to parse the number:

    $ seq -w 0001 0005|numfmt 
    1
    2
    3
    4
    5

Or just use plain sed:

    $ seq -w 0001 0005|sed 's/^0*//'
    1
    2
    3
    4
    5

ekianjo7y ago

yeah that was super weird. Why write an article about the merits of shell tools if you put some python in the mix...

Retra7y ago

Python basically acts like a subshell with its own language... I don't see why anyone would think unix shell scripts are really that different from Python scripts, especially if you're not doing the subprocess control things that command shells are optimal for. Invoking python to do something doesn't seem any more awkward to me than invoking sed, awk, etc.

fforflo7y ago· 8 in thread

This [0] is the most complete post I've read on the topic. Lays out all the relevant tools. Spending some time going through each tool's documentation/options pays off tremendously.

[0]: https://www.ibm.com/developerworks/aix/library/au-unixtext/i...

jihadjihad7y ago

Wow, great find. Sad how hard it seems these days to come across an easy-to-follow primer on a topic without narrative fluff and/or ads everywhere. For those interested in a standalone copy there is a PDF of the content available here https://www.ibm.com/developerworks/aix/library/au-unixtext/a...

wglb7y ago

Writing clear tutorials is a fair amount of effort, more than I originally thought when I first did it.

dahart7y ago

I haven’t seen that one before and it looks pretty good. Interesting that it doesn’t mention sed or awk (edit: I'm wrong, it does mention sed & awk), let alone Perl. I would say that Perl’s so powerful for one liners and Unix text pipelines, I’d consider it required in a text processing reference.

Another one I like, and I think it’s mainly because of the philosophy contained in the title is “Ad hoc data analysis from the Unix command line” https://en.m.wikibooks.org/wiki/Ad_Hoc_Data_Analysis_From_Th...

fwip7y ago

Perhaps it's changed since you viewed it, but both sed and awk are described in the parent's link.

Sir_Cmpwn7y ago

My main problem with this reference is that it encourages the use of non-portable utilities and flags. Double check against POSIX before writing any of this into your scripts:

http://pubs.opengroup.org/onlinepubs/9699919799/

See also: http://shellhaters.org/

JdeBP7y ago

For actual text processing, I recommend a book.

* https://oreilly.com/openbook/utp/

SomethingOrNot7y ago

Seems like a book specifically about text processing for the purpose of writing/formatting/typesetting documents.

inp7y ago

Really nice link! Thanks fforflo.

ciucanu7y ago· 7 in thread

I usually do text processing in Bash, Notepad++ and Excel. Each has its own pros and cons, that's why I usually combine them.

Here you have the tools I use in Bash:

grep, tail, head, cat, cut, less, awk, sed, sort, uniq, wc, xargs, watch ...

james_s_tayler7y ago

As an aside I once found out you can replace 'sort | uniq' entirely with an obscure awk command so long as you don't require the output to be sorted. Iirc it performs twice as fast.

  cat file.txt | awk '!x[$0]++'

omaranto7y ago

The awk command prints the first occurrence of each line in the order they are found in the file. I can imagine that sometimes that might be even better than sorted order.
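The `!x[$0]++` trick works because the array entry is zero (falsy) the first time a line is seen and non-zero afterwards. A minimal Python sketch of the same order-preserving dedup (the function name is made up), relying on dicts keeping insertion order:

```python
def unique_in_order(lines):
    """Keep the first occurrence of each line, in input order."""
    return list(dict.fromkeys(lines))


print(unique_in_order(["b", "a", "b", "c", "a"]))  # ['b', 'a', 'c']
```

Like the awk version, this keeps every distinct line in memory, which is the tradeoff against `sort | uniq`.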

pnutjam7y ago

sort has a -u option on my linux...

    -u, --unique
           with -c, check for strict ordering; without -c, output only the first of an equal run

AnIdiotOnTheNet7y ago

If you're already in Windows land, you should consider leveraging PowerShell instead of bash. Pretty much all the same tooling is there, only with more descriptive names, tab completion on everything, passes typed object data instead of text parsing, etc.

fxfan7y ago

Ahem... what is powershell core? (I take exception to your if condition). As someone on Arch- I enjoy it a lot.

2038AD7y ago

Bash with Notepad++ and Excel? Do you use Wine or WSL?

ciucanu7y ago

It's kind of mandatory to use Windows in certain envs.

maratc7y ago· 4 in thread

The shortest one I could come up with, no need to use python.

`join -v 2` shows the entries in the second sorted stream that don't have a match in the first sorted stream, the rest is self-explanatory I hope.

Edit: $ join -v2 -t_ -j1 <(ls | grep _A | sort ) <(ls | grep -v _A | sort)

Is even shorter, it takes first field (-j1) where fields are separated by '_' (-t_)

zimpenfish7y ago

Slightly shorter:

    ls -v|cut -d_ -f1|uniq -c|awk '$1<2{print $2}'

Tested by creating 500 sets of dual files and removing 10 `_A` randomly.

    for i in $(seq 1 500); do j=$(printf %04d $i); touch ${j}_data.csv; touch ${j}_A.csv; done
    for i in $(seq 1 10); do q=$((RANDOM % 500)); r=$(printf %04d $q); rm -v ${r}_A.csv; done
    removed '0438_A.csv'
    removed '0327_A.csv'
    removed '0150_A.csv'
    removed '0173_A.csv'
    removed '0460_A.csv'
    removed '0194_A.csv'
    removed '0073_A.csv'
    removed '0293_A.csv'
    removed '0404_A.csv'
    removed '0153_A.csv'

And then using the code above to verify the missing files

maratc7y ago

Elegant and short! But unless I'm missing something, your script will print even the datasets that have _A but not the corresponding _data?
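That observation is easy to check with a small Python sketch of the count-per-prefix logic (what `cut -d_ -f1 | uniq -c | awk '$1<2'` amounts to). The function name and the sample listing are made up for illustration:

```python
from collections import Counter


def prefixes_with_one_file(filenames):
    """Report every prefix (the text before the first "_") seen fewer than twice."""
    counts = Counter(name.split("_", 1)[0] for name in filenames)
    return sorted(prefix for prefix, n in counts.items() if n < 2)


# 0002 lacks its _A file; 0003 lacks its _data file.
listing = ["0001_A.csv", "0001_data.csv", "0002_data.csv", "0003_A.csv"]
print(prefixes_with_one_file(listing))  # ['0002', '0003']
```

Both `0002` and `0003` are reported, so counting alone cannot tell which of the two files is missing. For the article's data that is harmless as long as every number has a `_data` file.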

taviso7y ago

You could use uniq -u to avoid the awk.

v3gasOP7y ago

Nice golf!

inp7y ago· 4 in thread

Instead of creating a Python script to convert the numbers to integers, you can use awk: "python3 parse.py" becomes "awk '{printf "%d\n", $0}'"

tyingq7y ago

Not sure I understand why it needs to even know there's numbers in the filename.

The problem seems to boil down to:

"Find all files with the pattern '[something]_data.csv' and report if '[something]_A.csv' doesn't exist"

Unless I'm missing something, all the sorting and sequence generation isn't adding anything.
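Stated that way, the check needs no number parsing at all. A minimal Python sketch, assuming the two suffixes from the article and a made-up function name:

```python
def data_without_result(filenames, data_suffix="_data.csv", result_suffix="_A.csv"):
    """Return the prefixes that have a data file but no matching result file."""
    names = set(filenames)
    return sorted(
        name[: -len(data_suffix)]
        for name in names
        if name.endswith(data_suffix)
        and name[: -len(data_suffix)] + result_suffix not in names
    )


# Hypothetical listing, for illustration only.
listing = ["0001_data.csv", "0001_A.csv", "0002_data.csv"]
print(data_without_result(listing))  # ['0002']
```

The prefix is treated as an opaque string, so zero padding never becomes an issue.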

darrenf7y ago

Why even use awk rather than the shell's (well, bash's) builtin printf?

    $ printf '%d\n' "0005"
    5

jfk137y ago

That might not always do what a naïve user expects:

    $ printf '%d\n' "0025"
    21

inp7y ago

You can apply the awk command on a pipe, so it is applied to each line of the file/stream.

wmu7y ago· 4 in thread

Easier would be just use 'cat list_of_numbers | sort | uniq -u' to get the unique entries.

phireal7y ago

Shorter still:

    sort -u < list_of_numbers

aepiepaey7y ago

And if you're using cat because it keeps the filename out of the way when editing the pipeline, then just put the redirect before the command instead, so instead of e.g.

  cat file | grep pattern | sort -u

you can write

  < file grep pattern | sort -u

and the filename is out of the way compared to

  grep pattern file | sort -u

adtac7y ago

http://porkmail.org/era/unix/award.html

now, I'll wait for someone to post a link to the "UUOC award" award

wmu7y ago

This is not the same. For sequence [5,5,4,3,3,2,1,1] "sort -u" returns [1,2,3,4,5], while "sort | uniq -u" returns [2,4].
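The difference is "one copy of everything" versus "only the things that occur exactly once". A small Python sketch of both behaviours (the function names are made up, and it compares numbers rather than text lines):

```python
from collections import Counter


def sort_u(items):
    """Like `sort -u`: every distinct item, once."""
    return sorted(set(items))


def sort_uniq_u(items):
    """Like `sort | uniq -u`: only items that occur exactly once."""
    counts = Counter(items)
    return sorted(item for item, n in counts.items() if n == 1)


sequence = [5, 5, 4, 3, 3, 2, 1, 1]
print(sort_u(sequence))       # [1, 2, 3, 4, 5]
print(sort_uniq_u(sequence))  # [2, 4]
```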

jancsika7y ago· 3 in thread

The problem with this is that there isn't a standard format forced on the args that follow the command name "cut".

What makes it worse is that there are seemingly patterns of standard format that get violated by other patterns. It's often based on when the utility was first authored and whatever ideas were floating around during the time. So sometimes characters can "clump" together behind a flag, under the assumption that multi-character flags will get two hyphens. Then some utilities or programs use a single hyphen for multi-character flags. Plus many other inconsistencies-- if I learn the basic range syntax for cut do I know the basic range syntax for imagemagick?

Those inconsistencies don't technically conflict since each only exists in the context of a particular utility. But it's a real pain to sanity to see those inconsistencies sitting on either side of a pipe, especially when one of them is wrong. (Or even when it's a single command you need but you use the wrong flag syntax.) That all adds to the cognitive load and can easily make a dev tired before it's time to go to sleep.

Oh, and that language switch from bash to python is a huge risk. If you're scripting with Python on a daily basis it probably doesn't seem like it. But for someone reading along, that language boundary is huge. Because the user is no longer limited to runtime errors and finicky arg formatting errors, but also language errors. If the command line barfs up an exception or syntax error at that boundary I'd bet most users would just give up and quit reading the rest of the blog.

Edit: clarification

skywhopper7y ago

Learning the idiosyncrasies of the tools involved is one of the tradeoffs. But there's no getting around it. These tools have been around for far too long to change them all in some misguided attempt at consistency--the semantics of most tools are so different, it wouldn't even make sense to try to enforce some consistency anyway.

You don't have to know every flag for every tool. You don't need to know if you can glob args together in a certain tool. These are different tools developed across decades by different people for different purposes. The fact that you can glue them all together on an ad-hoc basis is magical!

You learn by learning how to do one thing at a time--cutting characters 10-20, or grepping for a regex, or summing with awk, or replacing strings with sed, or translating characters with tr--and adding it to your mental toolbox. It's okay to have a syntax error because man is there and you can easily iterate the command to make it do what you want.

You aren't writing a program to stand the test of time. You're solving a problem in the moment!

gcommer7y ago

It's true that this can be a pain; but this flexibility is also bash's greatest feature: a bash script can make use of almost any other program, regardless of the particular idioms that that program's author was partial to. This is exactly why bash has been so successful for so long and likely will continue to be so for a very long time.

Any attempts to tighten this down would raise the barrier for entry and therefore reduce the ecosystem that bash can operate in.

Also, it's a bit of a false dichotomy. Any other language is also susceptible to these sorts of inconsistencies. For example: Do I specify a range as [min, max] or as two separate parameters? Is it inclusive or exclusive? etc. At some point all programming interfaces come down to conventions, and if your language only supports one then you'll only be able to interop with the subset of the broader community that agrees with you.

schoen7y ago

> Oh, and that language switch from bash to python is a huge risk.

I was thinking about that and I came up with

  sed 's/^0*//'

as an alternative to the Python program. Another option that works for the same purpose is

  xargs -n1 expr 0 +

Edit: There's an earlier subthread with several options for this: https://news.ycombinator.com/item?id=19160875

jclay7y ago· 3 in thread

After moving back to working on a Windows machine the last several years and being “forced” into using PowerShell, I now find myself using it for these sorts of tasks on Linux.

I now use PowerShell for any tasks of equal or greater complexity than the article. It’s such a massive upgrade over struggling to recall the peculiar bash syntax every time and the benefits of piping typed objects around are vast.

As a nice bonus, all of my PowerShell scripts run cross-platform without issue.

nxrabl7y ago

I've dabbled in PowerShell before, but I've always found the objects you get from cmdlets to be so much more opaque than the plain text you get from Unix output, which makes it harder to use the iterative approach to development the article and other commenters describe. Do you have any tips for poking around in PowerShell objects / a workflow that works for you?

jclay7y ago

I’ve tried to love it while using it as an interactive shell, but it’s hard for me to lose the Unix muscle memory and remember their verbose commands.

For anything more than a single pipe, or anything that requires loops or control flow, I switch to Powershell in Visual Studio Code with the PowerShell extension which has intellisense and helps to poke around the methods on each object. From there you can select subsets of your script and run with F8 which helps me prototype with quick feedback.

useerup7y ago

Use `gm` (alias for Get-Member)

e.g like

    ps | gm

Will tell you exactly which object types, member methods and properties are returned from `ps`.

    ls | gm

Will tell you that ls returns two different object types (directories and files).

pletnes7y ago· 3 in thread

Useless use of seq spotted. Seq does not exist on many systems. Bash has {0001..0500} instead.

Nice writeup though.

enriquto7y ago

Well, to be fair, bash does not exist on many systems either.

For example I have used dragonflybsd and freebsd today and they both had "seq" but no "bash".

lelf7y ago

They have jot(1)

v3gasOP7y ago

Thanks!

sagartewari017y ago· 2 in thread

My favourite one is 'pkill -9 java'. Fixes my laptop if it starts lagging.

fxfan7y ago

Does that kill electron instances too? ;)

whynotminot7y ago

It's always good sport to kill java. Warms my heart every time.

darrenf7y ago· 2 in thread

All the pipes and non-builtin commands (especially python!) look like overkill to me, I must say.

    for set in *_data.csv ; do
        num=${set/_*/}
        success=${set/data/A}
        if [ ! -e $success ] ; then echo $num ; fi
    done

ETA: likely specific to bash, since I have no experience with other shells except for dalliances with ksh and csh in the mid-90s.

phaemon7y ago

Yup, I'd probably have gone with a `for` loop also. A bit shorter:

  for set in *_data.csv; do
    [[ -f "${set/data/A}" ]] || echo "${set%_data.csv}"
  done

Edit: though I just write it out like this for formatting on HN. In real life, that would just be a one-liner:

    for set in *_data.csv; do [[ -f "${set/data/A}" ]] || echo "${set%_data.csv}"; done

hellabites7y ago

Just because I like GNU parallel:

    parallel -kj1 'f="{}"; [[ -f "${f/data/A}" ]] || echo $f' ::: *_data.csv

omaranto7y ago· 2 in thread

The article solves the problem: for which numbers x between 1 and 500 is there no file x_A.csv? It looks like in this case it is equivalent to the easier problem: for which x_data.csv is there no corresponding x_A.csv?

    cd dataset-directory
    comm -23 <(ls *_data.csv | sed s/data/A/) <(ls *_A.csv)

ebeip907y ago

This will fail for any filenames that contain newlines

omaranto7y ago

Correct. It is intended for the filenames in the article. More generally, I try to write all my shell code to silently produce hard to track down errors when a filename contains newlines, in order to punish me for my carelessness if I ever accidentally create such a filename.

Upvoter337y ago· 2 in thread

awk one liner: ls | awk '{split($1,x,"_"); split(x[2],y,"."); a[x[1]]+=1} END {for (i in a) {if (a[i] < 2) {print i}}}'

IronBacon7y ago

Zsh one liner (probably works in Bash too):

    for a in {0001..0500}; do [[ ! -f ${a}_A.csv ]] && echo $((10#${a})); done

The only trick I'm using is base transformation to remove padding in the echo...

hellabites7y ago

I didn't realize Zsh (and Bash) was capable of removing zero padding in that way.

Everybody has their own style, but I would prefer to print the missing file pattern and avoid loops.

If you have GNU parallel installed (works in bash)

     parallel -kj1 '! [[ -f "{}" ]] && echo {} || :' ::: $(jot -w %04d_A.csv - 1 501)

or if preferred

     parallel -kj1 '! [[ -f "{}" ]] && echo {} || :' ::: {0001..0500}_A.csv

sureaboutthis7y ago· 2 in thread

> I am starting to realize that the Unix command-line toolbox can fix absolutely any problem related to text wrangling.

Am I the only one who thought, "No shit, Sherlock"? This is a fundamental of UNIX that many people don't seem to grasp.

adtac7y ago

Everybody realises this at some point. Nobody ever thought "I can use this for anything" when they first saw a shell. It takes time.

SomethingOrNot7y ago

He’s an MS student. He’s just documenting and sharing his journey. As blogs do.

https://www.xkcd.com/1053/

SomethingOrNot7y ago· 2 in thread

> I am starting to realize that the Unix command-line toolbox can fix absolutely any problem related to text wrangling.

How many problems related to text wrangling arise simply by working with Unix tools?

“This philosophical framework will help you solve problems internal to philosophy.”

skywhopper7y ago

What a useless comment. The OP is an interesting walkthrough of solving a highly specific problem in a clever way using a common but often poorly understood toolset. Then you come in and leave a snarkbomb trashing the idea that learning how to use this toolset is worthwhile without providing any reasoning or alternatives.

Do you also trash posts about learning how to build your own furniture or troubleshooting car engines?

What elevated domain do you operate in that only has perfectly elegant solutions to beautifully architected problems that use only tools perfectly crafted to solve those exact problems? Doesn’t sound like very interesting work to me.

SomethingOrNot7y ago

http://www.art.net/~hopkins/Don/unix-haters/handbook.html

boomlinde7y ago· 1 in thread

A change in structure might be helpful:

    $ ls data
    0001.csv 0002.csv 0003.csv 0004.csv ...
    $ ls algorithm_a
    0001.csv 0002.csv 0004.csv ...
    $ diff -q algorithm_a data |grep ^Only |sed 's/.*: //g'
    0003.csv ...

v3gasOP7y ago

Excellent point, haha!

nickjj7y ago· 1 in thread

I think the best part about using Unix tools is it forces you to break down the problem into tiny steps.

You can see feedback every step of the way by removing and adding back new piped commands so you're never really dealing with more than 1 operation at a time which makes debugging and making progress a lot easier than trying to fit everything together at once.

mercer7y ago

It's basically functional programming. I find that my approach to writing code is very similar to how I work with the shell. The main difference, I guess, is that the command 'units' are slightly bigger, in the form of functions, but the way I iterate my solution to a problem is basically the same.

ben5097y ago· 1 in thread

I've often done this, usually not for a large dataset, but it's sometimes helpful to pipe text through Unix commands in Emacs. C-u M-| sort, for instance, will run the selection through sort and replace it in place.

If you're going the all python route, and even want to be able to run bash commands, and want something where you can feed the output into the input, I'd strongly recommend jupyter. (If you want to stay in a terminal, ipython is part of jupyter and heavily upgrades the built-in REPL and does 90% of what I'm mentioning here.)

You can break out each step into its own cell, save variables (though cell 5 will be auto-saved as a variable named _5) but the nicest thing is you can move cells around (check the keyboard shortcuts) and restart the entire kernel and rerun all your operations, essentially what you're getting with a long pipeline, only spread out over parts. And there are shortcuts like func? to pop up help on a function or func?? to see the source.

It's got some dependencies, so I'd recommend running it in a virtualenv via pipenv:

    pipenv install jupyter  # setup new virtualenv and add package
    pipenv run jupyter notebook
    pipenv --rm  # Blow away the virtualenv

Also, look into pandas if you want to slurp a CSV and query it.

omaranto7y ago

I doubt you'll find many Emacs users that would prefer "C-u M-| sort" over "M-x sort-lines".

almostarockstar7y ago· 1 in thread

This was a nice read and a good introduction to text processing with unix commands.

I agree with the other user re python usage - that you may as well use it for the whole task if you're going to use it at all - but I don't think it's a major flaw. It worked for you, right? I would suggest naming the python file a bit more descriptively though.

Interesting to read the other suggestions about dealing with this without python.

v3gasOP7y ago

Thanks! Glad to hear!

samwhiteUK7y ago· 1 in thread

I thought this was a neat demo of building up a command with UNIX tools. The python inclusion was a bit odd, yes.

I learned about sys.stdin in Python and cutting characters using the -c flag

v3gasOP7y ago

Thanks!

kritixilithos7y ago· 1 in thread

Nicely done using Unix utils. You can have a pure sed solution (save the `ls` invocation) that is much simpler, albeit obscure, that hinges on the fact that every number has a `data.csv` file.

Given a sorted list of these files (through `ls` or otherwise), the following sed code will print out the data files on which A did not succeed.

  /data/!N
  /A/d
  P;D

This relies on the fact that a data file exists for every run, successful or not, so sed simply prints the data files that have no `A` counterpart.

If you want to only print out the numbers, you can add a substitution or two towards the end.

  /data/!N
  /A/d
  s/^0*\|_.*//g;P;D

Edit: fixed the sed program

kritixilithos7y ago

Actually the following is even shorter

  /A/{N;d;}

So all together this gives the following

  ls|sed '/A/{N;d;}'
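
To spell out why that works: in a sorted listing, NNNN_A.csv comes immediately before NNNN_data.csv, so whenever sed sees an A line it pulls in the following line (its data file) and throws both away. What is left are the data files with no A file in front of them. A small sketch, with an invented listing fed in through printf instead of ls:

```shell
# Sample listing is invented for illustration; 0002 has no _A file.
listing() {
    printf '%s\n' 0001_A.csv 0001_data.csv 0002_data.csv 0003_A.csv 0003_data.csv
}

# /A/ : on a line containing "A" ...
# N   : ... append the next line (the matching data file) to the pattern space
# d   : ... and delete both without printing
# Lines that never matched /A/ are printed by default.
missing() { listing | sed '/A/{N;d;}'; }

missing    # prints 0002_data.csv
```

This leans on two assumptions from the thread: every A file has a data file, and no other part of the file name contains a capital A.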

LogicX7y ago· 1 in thread

Given the limited scope of files in the directory... not sure why it was necessary to use grep, instead of the built-in glob?

  ls dataset-directory | egrep '\d\d\d\d_A.csv'

which FWIW wouldn't even work, on multiple levels: you need -1 on ls and no files end with A.csv

  vs

  ls -1 dataset-directory/*_A?.csv

ref: http://man7.org/linux/man-pages/man7/glob.7.html

Update: apologies, apparently my client cached an older version of this page. At that time the files were named A1.csv and A2.csv.

olog-hai7y ago

Some ls man pages state the following about the -1 option: "This is the default when the output is not directed to a terminal."

I've never needed to use -1 when piping ls's output to another command.
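
A quick way to check this on your own system (just a sketch; the temporary directory, the file names and the count_entries helper are made up):

```shell
# Count directory entries through a pipe, without passing -1 to ls.
# If ls writes one name per line when stdout is not a terminal,
# the line count equals the number of files.
count_entries() { ls "$1" | wc -l | tr -d ' '; }

demo_dir=$(mktemp -d)
touch "$demo_dir/a.csv" "$demo_dir/b.csv" "$demo_dir/c.csv"
count_entries "$demo_dir"    # prints 3
rm -r "$demo_dir"
```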

mklm7y ago· 1 in thread

If you don't mind "cd dataset-directory" beforehand, a shorter and possibly more correct version would be:

  comm -1 -3 <(ls *_A.csv | sed 's/_.*$//') <(seq -w 0500) | sed 's/^0*//'

The OP's solution doesn't seem correct because of the different ordering of the two inputs of `comm': lexicographical (ls) and numeric (seq).

mklm7y ago

Although -w is supported by both GNU and BSD versions of `seq', BSD's ignores leading zeros in input. Thus a more portable approach is:

  comm -1 -3 <(ls *_A.csv | sed 's/_.*$//') <(seq -f %04.f 500) | sed 's/^0*//'
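
The ordering issue is easy to see in isolation: comm expects both inputs sorted by the same rule, and zero-padding is what makes numeric order and lexical order coincide. Here is a sketch of the same idea in plain sh. The helper names and the temp files are mine, and awk stands in for seq so the result doesn't depend on either seq flavour:

```shell
# padded N: print 0001..N, zero-padded to four digits, one per line.
padded() { awk -v n="$1" 'BEGIN { for (i = 1; i <= n; i++) printf "%04d\n", i }'; }

# missing_numbers DIR N: numbers in 1..N with no NNNN_A.csv in DIR.
missing_numbers() {
    have=$(mktemp) && want=$(mktemp) || return 1
    # Keep only names ending in _A.csv, strip the suffix, and sort explicitly
    # so both inputs to comm use the same (byte-wise) order.
    ( cd "$1" && ls ) | sed -n 's/_A\.csv$//p' | LC_ALL=C sort > "$have"
    padded "$2" | LC_ALL=C sort > "$want"
    # -1 -3 hides lines unique to $have and lines common to both,
    # leaving the numbers that appear only in $want.
    LC_ALL=C comm -1 -3 "$have" "$want" | sed 's/^0*//'
    rm -f "$have" "$want"
}
```

Usage would be something like `missing_numbers dataset-directory 500`.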

pixelbeat__7y ago· 1 in thread

Set operations are very useful. Here's a summary:

https://www.pixelbeat.org/cmdline.html#sets

pixelbeat__7y ago

https://www.pixelbeat.org/cmdline.html#sets

oh5nxo7y ago· 1 in thread

Is there a nice alternative for seq or jot? Something neater than a for-loop in awk?

jabl7y ago

In bash, you can create sequences with {A..B}. E.g.

    echo {1..10}

or to count backwards

    echo {10..0} boom!
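
Brace expansion is a bash/zsh feature, so in a plain POSIX sh script a small loop is one fallback. A sketch (`count` is a made-up name, not a standard utility):

```shell
# count FIRST LAST: print the integers from FIRST to LAST, one per line,
# counting down when FIRST is greater than LAST.
count() {
    i=$1
    if [ "$1" -le "$2" ]; then
        while [ "$i" -le "$2" ]; do echo "$i"; i=$((i + 1)); done
    else
        while [ "$i" -ge "$2" ]; do echo "$i"; i=$((i - 1)); done
    fi
}
```

With that defined, `count 1 10` and `count 10 0` cover the two cases above without needing seq, jot or bash.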

skywhopper7y ago

The brilliant fun of working with the Unix CLI toolset is that there are millions of valid ways to solve a problem. I also thought of a “better” solution of my own that took an entirely different approach than most of the ones posted here. That’s not really the point.

What’s great about this article is that it follows the process of solving the problem step by step. I find that lots of programmers I work with struggle with CLI problem solving, which I find a little surprising. But I think it all depends on how you think about problems like this.

If you start from “how can I build a function to operate on this raw data?” or “what data structure would best express the relationship between these filenames?” then you will have a hard time. But if you think in terms of “how can I mutate this data to eliminate extraneous details?” and “what tools do I have handy that can solve problems on data like this given a bit of mungeing, and how can I accomplish that bit of mungeing?” and if you can accept taking several baby steps of small operations on every line of the full dataset rather than building and manipulating abstract logical structures, then you’re well on your way to making efficient use of this remarkable toolset to solve ad hoc problems like this one in minutes instead of hours.


stiff7y ago

For learning to get things done with Unix, I recommend two old books, "The Unix Programming Environment" and "The AWK Programming Language". There are many resources to learn the various commands etc., but there is still no better place than those books to learn the "unix philosophy". This series is also good:

https://sanctum.geek.nz/arabesque/series/unix-as-ide/

ortekk7y ago

If you are using python in your pipeline, might as well go all in!

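  from pathlib import Path

  # Expected result files, e.g. 0001_A.csv; widen the range to match your dataset.
  all_possible_filenames = {f'{i:04}_A.csv' for i in range(1, 10)}

  # Names (not Path objects) of everything in the current directory.
  cur_dir_filenames = {p.name for p in Path('.').iterdir()}

  missing_filenames = all_possible_filenames - cur_dir_filenames

  print(*sorted(missing_filenames), sep='\n')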

iheartpotatoes7y ago

I got paid $175/hr as a data analyst contractor to basically run bash, grep, sed, awk, perl. The people that hired me weren't dumb, just non-programmers and became giddy as I explained regular expressions. The gig only lasted 3 months, but I taught myself out of a job: once they got the gist of it they didn't need me. Yay?

adamchainz7y ago

I learnt a lot from the book Data Science at the Command Line, now free and online at https://www.datascienceatthecommandline.com/

js27y ago

Not the most efficient solution but this is what springs to mind for me:

    seq 1000 | xargs printf '%04d_A.csv\n' | while read -r f; do test -f "$f" || echo "$f"; done

jon497y ago

Use F# with a TypeProvider. Of course, I imagine it would take some work learning F# but once you learn it the sky is the limit in what you can do with this data.

Dowwie7y ago

More power to those who enjoy writing control flow in shell, but if I need anything beyond a single line I'm going with an interactive ipython session.

dahfizz7y ago

You could use one sed command to replace your grep, cut, and python. It feels cheap to use python to massage data in a post about the Unix command line.
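
Something along these lines, perhaps (a sketch, assuming the NNNN_A.csv naming from the article; the function name is made up):

```shell
# successful_numbers: read a file listing on stdin and print the numbers
# that have an _A file, without leading zeros.
# -n suppresses the default output; the trailing p prints only the lines
# where the substitution matched, which does the job of grep.
# ^0* eats leading zeros and \([0-9][0-9]*\) keeps at least one digit,
# so 0000 becomes 0 rather than an empty line.
successful_numbers() { sed -n 's/^0*\([0-9][0-9]*\)_A\.csv$/\1/p'; }

printf '%s\n' 0001_A.csv 0001_data.csv 0010_A.csv 0011_data.csv | successful_numbers
```

On the real data that would be `ls dataset-directory | successful_numbers`.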

redka7y ago

    ls | rb 'group_by { |x| x[/\d+/] }.select { |_, y| y.one? }.keys'

https://github.com/thisredone/rb

BentFranklin7y ago

For heavier duty text processing, try

emacs -e myfuns.el

When it comes to mashing text, nothing beats emacs.

iheartpotatoes7y ago

The people that created the command line weren't L33T H4XOR NOOBS. They were brilliant PhD scientists. Let's not confuse the two.
