$ echo -e '0001\n0010\n0002' | sed 's/^0*//'
1
10
2
So instead of scripting, he could have generated a sorted list of numbers from the files he had, created a file with the sequence of numbers for the range, and diffed/commed the whole thing. Voilà...
$ seq -w 0001 0003
0001
0002
0003
$ seq -f %04g 0 3
echo {0000..0003} # if bash
import sys, pathlib

basedir = pathlib.Path(sys.argv[1])
# Print every number in 1..500 whose NNNN_A.csv file is missing.
for i in range(1, 501):
    if not (basedir / f'{i:04}_A.csv').is_file():
        print(i)
$ echo -e '0001\n0010\n0002\n0' | bc
1
10
2
0
$ printf "%d\n" 003
3
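A caveat on the printf "%d" approach above: as far as I know, bash and GNU printf read a numeric argument with a leading zero as octal, so 0010 comes out as 8 and 0008 is rejected. A minimal sketch of the difference, in Python since that is what the thread already uses. Both function names are made up for illustration, and the second one only approximates the shell's behaviour (it ignores hex and signs).

```python
def strip_padding_decimal(text: str) -> int:
    """Parse a zero-padded decimal string such as '0010' as base 10."""
    return int(text, 10)


def parse_like_shell_printf(text: str) -> int:
    """Approximate a shell printf %d: a leading zero means octal (assumption)."""
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)  # raises ValueError for the digits 8 and 9
    return int(text, 10)


if __name__ == "__main__":
    for padded in ["0003", "0010", "0100"]:
        print(padded, strip_padding_decimal(padded), parse_like_shell_printf(padded))
```

So the printf trick looks safe only while the numbers contain no 8 or 9 and stay below 0010; the sed and `10#` variants elsewhere in the thread don't have that problem.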
ls ????_A.csv | grep -o '[1-9][0-9]*'
Then again, Perl is even scarier.
"Half-assed is OK when you only need half of an ass."
In the talk, he gives several demonstrations of a key aspect of why unix pipelines are so practically useful: you build them interactively. A complicated 4-line pipeline started as a single command that was gradually refined into something that actually solves a complicated problem. This talk demonstrates the part that isn't included in the usual tutorials or "cool 1-line command" lists: the cycle of "Try something. Hit up to get the command back. Make one iterative change and try again."
[1] You might know him from his other hilarious talks like "The Birth & Death of JavaScript" or "Wat".
The standard Unix interface might have been interactive in the ’70s, back when hardware and peripherals were horribly non-interactive. But I don’t know why so many so-called millennial programmers (people my age) get excited about the alleged interactivity of the Unix that most people are familiar with. It doesn’t even have the cutting edge ’90s interactivity of Plan 9, what with mouse(!) selection of arbitrary text that can be piped to commands and so on. And every time someone comes up with a Unix-hosted tool that uses some kind of fold-up menu that informs you about what key combination you can type next (you know, like what all GUI programs have with Alt+x and the file|edit|view|… toolbar), people hail it as some kind of UX innovation.
From what I understand, your parent talks about how the commands are built iteratively, with some kind of trial-error loop, which is a strength that is supposedly not emphasized enough. And I agree by the way. Nothing to do with how things are input.
Yes, Bash is an untyped hell. But when I pipe 100 GB to stdout, it's my computer’s death wish to show me the fucking data.
E.g.:
xclip -out | ...
or do you mean something different?
I love how I can just open up IEx (Elixir) and work my way through various functions in my app interactively. If something doesn't work as expected, I change the code and run the 'recompile' command. Then, once things work and eventually stabilize, I add typespecs for at least some of the benefits of an explicitly typed language. In some cases I might make (unit) tests part of the process, but that depends on the situation.
(My grep doesn't work like in the video, but..)
for f in *.rb;do awk '$1 ~ /class|module/ {print $2}' $f;done |
awk '{a[$1]++} END{for (q in a) print a[q],q}' | sort -n
This seems to do what all the sed, cut, regrepping, wc etc. do. AWK (in default mode, anyway) removes leading spaces, makes cutting and counting words easy. It took about 30 seconds to write, too.
https://gregable.com/2010/09/why-you-should-know-just-little...
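For anyone who doesn't read awk, the counting idiom in the pipeline above (`a[$1]++`, then print the totals in END) could be sketched roughly like this in Python. The function name and the sample lines are hypothetical, and unlike the awk version this skips lines that have a keyword but no name after it.

```python
from collections import Counter


def count_definitions(lines):
    """Count names that follow a leading 'class' or 'module' keyword."""
    counts = Counter()
    for line in lines:
        fields = line.split()  # like awk, ignores leading whitespace
        # awk's `$1 ~ /class|module/` is an unanchored match on field 1
        if len(fields) >= 2 and ("class" in fields[0] or "module" in fields[0]):
            counts[fields[1]] += 1
    return counts


if __name__ == "__main__":
    sample = ["module Foo", "  class Bar < Base", "class Bar", "def baz"]
    # Ascending by count, mirroring the trailing `sort -n`
    for name, n in sorted(count_definitions(sample).items(), key=lambda kv: kv[1]):
        print(n, name)
```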
for a in *.c; do awk 'function lensort(a,zerp, x,tmp) {for (x = 1 ; x < zerp ; x++) {tmp = a[x]; if (length(a[x + 1]) < length(a[x])) {a[x] = a[x + 1]; a[x + 1] = tmp}}} /^(int|void|char|struct|long|float)/ {while (x < 2) {getline ; arr[$1] = (substr($0,1,1) == "{") ? "\n" : $0 "\n" ; x++}} END {lensort(arr,v); for (c in arr) {printf "%s\n",(length(arr[c]) > 0) ? arr[c] : ""}}' ${a}; done
It is a Python implementation of this idea: OS objects like files and processes are represented in Python. You construct pipelines as in UNIX, but passing Python objects instead of strings. E.g. to find the pids of /bin/bash commands:
osh ps ^ select 'p: p.commandline.startswith("/bin/bash")' ^ f 'p: p.pid' $
- osh: Runs the tool, interpreting the rest of the line as an osh command.
- ^: Piping syntax.
- ps: Generate a stream of process objects, (the currently running processes).
- select: Select those processes, p, whose commandline starts with /bin/bash.
- f: Apply the function to the input stream and write function output to the output stream, (so, basically "map"). The function computes the pid of an input process.
- $ Print each item received in the input stream.
Osh also does database access (query results -> python tuples) and remote access. E.g., to get a process listing of (pid, commandline) on every node in a cluster:
osh @clustername [ ps ^ f 'p: (p.pid, p.commandline)' ] $
1..500 | ?{ !(Test-Path ('{0:0000}_A.csv' -f $_)) }
Explanation:
1..500 generates a sequence of numbers 1 through 500
| pipes the numbers
?{ … } is a filter that is evaluated for each item (number)
! negates the following expression
Test-Path tests that a file exists
-f formats the string left of -f with the parameters (zero-based) to the right of -f
'{0:0000}_A.csv' is a pattern which formats parameter 0 as 4 digits, zero-padded.
EDIT: Explanation
I would like to see more work done in the realm of object shells (and have some ideas myself), especially around designs that meld in more of the UNIXy way of having independent communicating processes. But it is a difficult problem domain, and many approaches would involve rewriting a lot of the base system we take for granted, which is just a huge amount of work.
PowerShell had the benefit of having stuff like WMI, COM, the whole .NET, and of course all the resources, funding and marketing from MS. Even then it seems to have been an uphill struggle, despite there being far more of a need for PS in the Windows world.
With regard to the 'ints' it helps to not think about them as ints but rather just some text that follows the pattern ^[0]*[1-9]* or some equivalent to that. "Starts with any number of zeros or no zeros, followed by any number of digits from 1 to 9 or none at all."
So long as you know the repeatable pattern you can always use sed to just replace the part of the pattern you want gone with nothing which effectively deletes it from the output. sed is like a Swiss army knife in that regard because you can do nice simple deletions like that and even iterate on them if you need or you can do quite complicated capture groups if you need to as well. Sed can get you unbelievably far in terms of shaping text in a stream.
I have a few tricks I've learned with various tools that I thought were worth writing down. Hopefully you can find some more useful stuff.
https://github.com/nicostouch/grep-sed-awk-magic/blob/master...
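The "replace the part of the pattern you want gone with nothing" idea behind `sed 's/^0*//'` can be sketched in Python too. This is only an illustration with made-up function names. The second function is my own variation, not something from the thread: it keeps a lone "0" instead of deleting it, which the plain substitution does not.

```python
import re


def strip_leading_zeros(text: str) -> str:
    """Delete any run of zeros at the start, like sed 's/^0*//'."""
    return re.sub(r"^0*", "", text)


def strip_padding_keep_zero(text: str) -> str:
    """Variation: drop leading zeros only while another digit follows."""
    return re.sub(r"^0+(?=\d)", "", text)


if __name__ == "__main__":
    for padded in ["0001", "0010", "0"]:
        print(repr(strip_leading_zeros(padded)), repr(strip_padding_keep_zero(padded)))
```

The difference only shows up for an all-zero input, where the plain version leaves an empty string behind.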
$ seq -w 0001 0005|numfmt
1
2
3
4
5
Or just use plain sed:
$ seq -w 0001 0005|sed 's/^0*//'
1
2
3
4
5
[0]: https://www.ibm.com/developerworks/aix/library/au-unixtext/i...
Another one I like, and I think it’s mainly because of the philosophy contained in the title is “Ad hoc data analysis from the Unix command line” https://en.m.wikibooks.org/wiki/Ad_Hoc_Data_Analysis_From_Th...
http://pubs.opengroup.org/onlinepubs/9699919799/
See also: http://shellhaters.org/
Here you have the tools I use in Bash:
grep, tail, head, cat, cut, less, awk, sed, sort, uniq, wc, xargs, watch ...
cat file.txt | awk '!x[$0]++'
The shortest one I could come up with, no need to use python.
`join -v 2` shows the entries in the second sorted stream that don't have match in the first sorted stream, the rest is self-explanatory I hope.
Edit:
$ join -v2 -t_ -j1 <(ls | grep _A | sort ) <(ls | grep -v _A | sort)
is even shorter; it takes the first field (-j1), where fields are separated by '_' (-t_).
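In case the `join -v 2` semantics aren't obvious, here is a rough Python sketch of what it computes for this problem: the entries of the second list whose first field has no partner in the first list. The function name and sample file names are hypothetical, and unlike join this doesn't require sorted input.

```python
def unmatched_in_second(first, second, sep="_"):
    """Return entries of `second` whose first field has no match in `first`."""
    keys_in_first = {name.split(sep, 1)[0] for name in first}
    return [name for name in second if name.split(sep, 1)[0] not in keys_in_first]


if __name__ == "__main__":
    a_files = ["0001_A.csv", "0003_A.csv"]
    data_files = ["0001_data.csv", "0002_data.csv", "0003_data.csv"]
    print(unmatched_in_second(a_files, data_files))
```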
ls -v|cut -d_ -f1|uniq -c|awk '$1<2{print $2}'
Tested by creating 500 sets of dual files and removing 10 `_A` randomly.
for i in $(seq 1 500); do j=$(printf %04d $i); touch ${j}_data.csv; touch ${j}_A.csv; done
for i in $(seq 1 10); do q=$((RANDOM % 500)); r=$(printf %04d $q); rm -v ${r}_A.csv; done
removed '0438_A.csv'
removed '0327_A.csv'
removed '0150_A.csv'
removed '0173_A.csv'
removed '0460_A.csv'
removed '0194_A.csv'
removed '0073_A.csv'
removed '0293_A.csv'
removed '0404_A.csv'
removed '0153_A.csv'
And then using the code above to verify the missing files
0073
0150
0153
0173
0194
0293
0327
0404
0438
0460
The problem seems to boil down to:
"Find all files with the pattern '[something]_data.csv' and report if '[something]_A.csv' doesn't exist"
Unless I'm missing something, all the sorting and sequence generation isn't adding anything.
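A minimal sketch of that reading of the problem, assuming the files sit in one directory and follow the `[something]_data.csv` / `[something]_A.csv` naming. The function name and parameter names are mine, not from the article.

```python
from pathlib import Path


def missing_results(directory, data_suffix="_data.csv", result_suffix="_A.csv"):
    """Return prefixes that have a data file but no matching result file."""
    directory = Path(directory)
    missing = []
    for data_file in sorted(directory.glob(f"*{data_suffix}")):
        prefix = data_file.name[: -len(data_suffix)]
        if not (directory / f"{prefix}{result_suffix}").is_file():
            missing.append(prefix)
    return missing


if __name__ == "__main__":
    import sys

    # Usage: python missing_results.py dataset-directory
    for prefix in missing_results(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(prefix)
```

No sequence generation or sorting of two streams is needed; the data files themselves define which numbers to check.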
$ printf '%d\n' "0005"
5
sort -u < list_of_numbers
Instead of
cat file | grep pattern | sort -u
you can write
< file grep pattern | sort -u
and the filename is out of the way compared to
grep pattern file | sort -u
Now, I'll wait for someone to post a link to the "UUOC award".
What makes it worse is that there are seemingly patterns of standard format that get violated by other patterns. It's often based on when the utility was first authored and whatever ideas were floating around during the time. So sometimes characters can "clump" together behind a flag, under the assumption that multi-character flags will get two hyphens. Then some utilities or programs use a single flag for multi-character flags. Plus many other inconsistencies-- if I learn the basic range syntax for cut do I know the basic range syntax for imagemagick?
Those inconsistencies don't technically conflict since each only exists in the context of a particular utility. But it's a real pain for one's sanity to see those inconsistencies sitting on either side of a pipe, especially when one of them is wrong. (Or even when it's a single command you need but you use the wrong flag syntax.) That all adds to the cognitive load and can easily make a dev tired before it's time to go to sleep.
Oh, and that language switch from bash to python is a huge risk. If you're scripting with Python on a daily basis it probably doesn't seem like it. But for someone reading along, that language boundary is huge. Because the user is no longer limited to runtime errors and finicky arg formatting errors, but also language errors. If the command line barfs up an exception or syntax error at that boundary I'd bet most users would just give up and quit reading the rest of the blog.
Edit: clarification
You don't have to know every flag for every tool. You don't need to know if you can glob args together in a certain tool. These are different tools developed across decades by different people for different purposes. The fact that you can glue them all together on an ad-hoc basis is magical!
You learn by learning how to do one thing at a time--cutting characters 10-20, or grepping for a regex, or summing with awk, or replacing strings with sed, or translating characters with tr--and adding it to your mental toolbox. It's okay to have a syntax error because man is there and you can easily iterate the command to make it do what you want.
You aren't writing a program to stand the test of time. You're solving a problem in the moment!
Any attempts to tighten this down would raise the barrier for entry and therefore reduce the ecosystem that bash can operate in.
Also, it's a bit of a false dichotomy. Any other language is also susceptible to these sorts of inconsistencies. For example: Do I specify a range as [min, max] or as two separate parameters? Is it inclusive or exclusive? etc. At some point all programming interfaces come down to conventions, and if your language only supports one then you'll only be able to interop with the subset of the broader community that agrees with you.
I was thinking about that and I came up with
sed 's/^0*//'
as an alternative to the Python program. Another option that works for the same purpose is xargs -n1 expr 0 +
Edit: There's an earlier subthread with several options for this: https://news.ycombinator.com/item?id=19160875
I now use PowerShell for any tasks of equal or greater complexity than the article. It’s such a massive upgrade over struggling to recall the peculiar bash syntax every time and the benefits of piping typed objects around are vast.
As a nice bonus, all of my PowerShell scripts run cross-platform without issue.
For anything more than a single pipe, or anything that requires loops or control flow, I switch to PowerShell in Visual Studio Code with the PowerShell extension, which has IntelliSense and helps to poke around the methods on each object. From there you can select subsets of your script and run them with F8, which helps me prototype with quick feedback.
E.g.,
ps | gm
will tell you exactly which object types, member methods, and properties are returned from `ps`.
ls | gm
will tell you that ls returns two different object types (directories and files).
Nice writeup though.
for set in *_data.csv ; do
num=${set/_*/}
success=${set/data/A}
if [ ! -e $success ] ; then echo $num ; fi
done
ETA: likely specific to bash, since I have no experience with other shells except for dalliances with ksh and csh in the mid-90s.
for set in *_data.csv; do
[[ -f "${set/data/A}" ]] || echo "${set%_data.csv}"
done
Edit: though I just write it out like this for formatting on HN. In real life, that would just be a one-liner:
for set in *_data.csv; do [[ -f "${set/data/A}" ]] || echo "${set%_data.csv}"; done
parallel -kj1 'f="{}"; [[ -f "${f/data/A}" ]] || echo $f' ::: *_data.csv
cd dataset-directory
comm -23 <(ls *_data.csv | sed s/data/A/) <(ls *_A.csv)
for a in {0001..0500}; do [[ ! -f ${a}_A.csv ]] && echo $((10#${a})); done
The only trick I'm using is base transformation to remove padding in the echo...
Everybody has their own style, but I would prefer to print the missing file pattern and avoid loops.
If you have GNU parallel installed (works in bash)
parallel -kj1 '! [[ -f "{}" ]] && echo {} || :' ::: $(jot -w %04d_A.csv - 1 501)
or if preferred
parallel -kj1 '! [[ -f "{}" ]] && echo {} || :' ::: {0001..0500}_A.csv
Am I the only one who thought, "No shit, Sherlock"? This is a fundamental of UNIX that many people don't seem to grasp.
How many problems related to text wrangling arise simply by working with Unix tools?
“This philosophical framework will help you solve problems internal to philosophy.”
Do you also trash posts about learning how to build your own furniture or troubleshooting car engines?
What elevated domain do you operate in that only has perfectly elegant solutions to beautifully architected problems that use only tools perfectly crafted to solve those exact problems? Doesn’t sound like very interesting work to me.
$ ls data
0001.csv 0002.csv 0003.csv 0004.csv ...
$ ls algorithm_a
0001.csv 0002.csv 0004.csv ...
$ diff -q algorithm_a data |grep ^Only |sed 's/.*: //g'
0003.csv ...
You can see feedback every step of the way by removing and adding back new piped commands, so you're never really dealing with more than 1 operation at a time, which makes debugging and making progress a lot easier than trying to fit everything together at once.
If you're going the all python route, and even want to be able to run bash commands, and want something where you can feed the output into the input, I'd strongly recommend jupyter. (If you want to stay in a terminal, ipython is part of jupyter and heavily upgrades the built-in REPL and does 90% of what I'm mentioning here.)
You can break out each step into its own cell, save variables (though cell 5 will be auto-saved as a variable named _5) but the nicest thing is you can move cells around (check the keyboard shortcuts) and restart the entire kernel and rerun all your operations, essentially what you're getting with a long pipeline, only spread out over parts. And there are shortcuts like func? to pop up help on a function or func?? to see the source.
It's got some dependencies, so I'd recommend running it in a virtualenv via pipenv:
pipenv install jupyter # setup new virtualenv and add package
pipenv run jupyter notebook
pipenv --rm # Blow away the virtualenv
Also, look into pandas if you want to slurp a CSV and query it.
I agree with the other user re python usage - that you may as well use it for the whole task if you're going to use it at all - but I don't think it's a major flaw. It worked for you, right? I would suggest naming the python file a bit more descriptively though.
Interesting to read the other suggestions about dealing with this without python.
I learned about sys.stdin in Python and cutting characters using the -c flag
Given a sorted list of these files (through `ls` or otherwise), the following sed code will print out the data files for which A did not succeed.
/data/!N
/A/d
P;D
This works on the fact that there exists a data file for all successful and unsuccessful runs on data, so sed simply prints the files for which there does not exist an `A` counterpart.
If you want to only print out the numbers, you can add a substitution or two towards the end.
/data/!N
/A/d
s/^0*\|_.*//g;P;D
Edit: fixed the sed program
/A/{N;d;}
So all together this gives the following
ls|sed '/A/{N;d;}'
ls dataset-directory | egrep '\d\d\d\d_A.csv'
which FWIW wouldn't even work, on multiple levels: you need -1 on ls and no files end with A.csv vs
ls -1 dataset-directory/*_A?.csv
ref: http://man7.org/linux/man-pages/man7/glob.7.html
Update: apologies, apparently my client cached an older version of this page. At that time the files were named A1.csv and A2.csv.
I've never needed to use -1 when piping ls's output to another command.
comm -1 -3 <(ls *_A.csv | sed 's/_.*$//') <(seq -w 0500) | sed 's/^0*//'
The OP's solution doesn't seem correct because of the different ordering of the two inputs of `comm': lexicographical (ls) and numeric (seq).
comm -1 -3 <(ls *_A.csv | sed 's/_.*$//') <(seq -f %04.f 500) | sed 's/^0*//'
echo {1..10}
or to count backwards
echo {10..0}
boom!
What’s great about this article is that it follows the process of solving the problem step by step. I find that lots of programmers I work with struggle with CLI problem solving, which I find a little surprising. But I think it all depends on how you think about problems like this.
If you start from “how can I build a function to operate on this raw data?” or “what data structure would best express the relationship between these filenames?” then you will have a hard time. But if you think in terms of “how can I mutate this data to eliminate extraneous details?” and “what tools do I have handy that can solve problems on data like this given a bit of mungeing, and how can I accomplish that bit of mungeing?” and if you can accept taking several baby steps of small operations on every line of the full dataset rather than building and manipulating abstract logical structures, then you’re well on your way to making efficient use of this remarkable toolset to solve ad hoc problems like this one in minutes instead of hours.
from pathlib import Path

all_possible_filenames = {f'{i:04}A.csv' for i in range(1, 10)}
# Compare by name: iterdir() yields Path objects, not strings.
cur_dir_filenames = {p.name for p in Path('.').iterdir()}
missing_filenames = all_possible_filenames - cur_dir_filenames
print(*missing_filenames, sep='\n')
seq 1000 | xargs printf '%04d_A.csv\n' | while read -r f; do test -f $f || echo $f; done
emacs -e myfuns.el
When it comes to mashing text, nothing beats emacs.