* Using AWK and R to parse 25TB - https://news.ycombinator.com/item?id=20293579
* Command-line Tools can be 235x Faster than a Hadoop Cluster - https://news.ycombinator.com/item?id=17135841
* The State of the AWK - https://news.ycombinator.com/item?id=23240800
Among alternative awk implementations, I'm keeping an eye on frawk [0]. It aims to be faster, supports CSV, etc.
https://www.gnu.org/software/gawk/manual/html_node/Splitting...
https://github.com/e36freak/awk-libs/blob/master/csv.awk
https://raw.githubusercontent.com/Nomarian/Awk-Batteries/mas...
Surprisingly and unnecessarily so:
> ["DSV"] is to Unix what CSV (comma-separated value) format is under Microsoft Windows and elsewhere outside the Unix world. CSV (fields separated by commas, double quotes used to escape commas, no continuation lines) is rarely found under Unix.
> In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field.
> The bad results of proliferating special cases are twofold. First, the complexity of the parser (and its vulnerability to bugs) is increased. Second, because the format rules are complex and underspecified, different implementations diverge in their handling of edge cases. Sometimes continuation lines are supported, by starting the last field of the line with an unterminated double quote — but only in some products! Microsoft has incompatible versions of CSV files between its own applications, and in some cases between different versions of the same application (Excel being the obvious example here).
— The Art of Unix Programming http://www.catb.org/~esr/writings/taoup/html/ch05s02.html
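The quote's first complaint is easy to demonstrate: naive field splitting on the separator miscounts fields as soon as a quoted CSV field contains a comma. A small illustration with made-up sample data:

```shell
# Naive comma splitting breaks when a quoted CSV field contains the separator:
printf '"Smith, John",42\n' | awk -F, '{print NF}'
# prints 3 -- the quoted field was split in two, not the 2 fields intended
```

A correct CSV parser has to track quoting state instead of just splitting on commas, which is exactly the parser complexity the quote is complaining about.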
Awk sometimes proves surprisingly powerful. Just look at the concision of this awk one liner doing a fairly complex job:
zcat large.log.gz | awk '{print $0 | "gzip -v9c > large.log-"$1"_"$2".gz"}' # Breakup compressed log by syslog date and recompress. #awksome
Taken from: https://mobile.twitter.com/climagic/status/61415389723039744...
A 5 min job that probably won't get extended, saving you from having to spend 20 mins coding something up, is better than feeling annoyed that you spent the 20 mins coding up the original implementation and then had to extend it anyway.
Hopefully, you also get the benefit of additional knowledge on that future implementation as well. Why wouldn't this just be a net win?
Unless you're talking about writing hack after hack after hack, eventually leaving yourself with some incomprehensible eldritch monstrosity, in which case, don't do that?
[edit]
Here's the link to the gawk documentation, but most flavors of AWK work similarly: https://www.gnu.org/software/gawk/manual/gawk.html#Close-Fil...
I'd recommend learning Awk with some kind of real-world use of your own. BTW, it reminded me of using XSLT, which I think is another often-overlooked "good thing".
You might not have perl or python. You WILL have AWK. Only the most minimal of minimal linux systems will exclude it. Even busybox includes awk. That's how essential it's viewed.
I like awk, mind, but this is not necessarily (IME) a good argument for it.
Python: almost every conventional server. Python dependencies are so ubiquitous that you aren't likely to find a Linux install without it.
Perl: every DEB and RPM machine, and anything with Git installed. You can't really escape it, unless you're embedded.
PowerShell (yeah, I know): every Windows machine from XP onwards (though usable only from 7 onwards), and some Linux computers if installed.
Java: lots and lots of places will have this available.
Dockerized runtime of your choice: not ubiquitous, but I expect more and more developer machines and servers to gain Docker or Docker-like container support.
There really isn't any reason to stick to AWK, unless you're working directly on embedded devices or just like using it.
Arguably, the software world would be better off if more people did code with those 1970s languages, than with the ones we are stuck with now.
And that applies to Awk, too. As the author quotes Neil Ormos stating, Awk is well suited for personal computing, something which we have gotten further and further from even as computers have become more distributed. At what point in history has such a large fraction of the human race had the ability to calculate to such an amazing order of magnitude, and at what point in history has such a large fraction of the same human race not bothered with calculation?
Awk is a great tool precisely because it puts quite a lot of expressive power in the hands of an average user on a Unix system. Sure, on a Lisp machine or Smalltalk machine there really isn't the same need for Awk: the systems languages on such machines are safe enough and expressive enough to do what Awk does. But in the Unix context — which is basically what we're all living in, with even the VMS-derived Windows more-or-less adhering to the Unix model — Awk is a godsend.
edit: correct typo
Here's the source for the fork() extension that ships with gawk...it's ~150 lines or so: https://git.savannah.gnu.org/cgit/gawk.git/tree/extension/fo...
I was able to make a (terrible/joke/but-it-kinda-works) web server with gawk using the extensions that ship with it: https://gist.github.com/willurd/5720255#gistcomment-3143007
The C interop and namespaces (also in gawk) are a bridge too far for me. By the time you need one of those, it's time to look for another language. Awk is just not enough of a language to write serious programs in. And I really like awk. It has enabled great scripting not only for log files, but also for dictionaries, back in the day when it was still hard to load one into memory.
That is my opinion, it is mine, and belongs to me and I own it, and what it is too.
Those were the days when things like GC, hashmaps, file operations etc. were hard things on Unix.
In the middle of that job, my supervisor said: you know what, we're doing increasingly complicated things with awk and it's getting increasingly hacky... I've heard that Perl is like awk but better. Do you want to learn Perl and switch to that?
And so we did. My thought then was there was little that was easier in awk than Perl, you could use Perl very much like awk if you wanted, you can even use the right command-line args to have Perl have an "implied loop" like awk... but then you can do a lot more with Perl too.
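To illustrate the "implied loop": Perl's -n switch wraps your code in an implicit while(<>) loop, -a autosplits each line into @F, and -l handles newlines, which together reproduce awk's default processing model. A quick side-by-side:

```shell
# awk: implicit per-line loop, fields in $1..$NF
printf 'foo bar\nbaz qux\n' | awk '{print $2}'
# Perl: -n gives the implicit loop, -a autosplits into @F, -l chomps/adds newlines
printf 'foo bar\nbaz qux\n' | perl -lane 'print $F[1]'
# both print: bar / qux
```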
I don't use Perl anymore. Or awk.
Perl will do those things where AWK really shines and if the problem got bigger, Perl was easier to deal with.
Perl is no more complex than Python, Ruby, or Powershell. If you use any of those you can be productive with Perl in a few hours.
Perl is still used, it is just not as popular as it was in the past. Do you use Git? Large parts of Git were originally written in Perl; some parts still are, though much has been migrated to C over time.
Also, it's usually the same kind of Perl, so you don't have to worry about whether awk is the "one true" one, or mawk, or gawk...
Sadly I don't remember which book it was, but this page looks like a good start: https://ferd.ca/awk-in-20-minutes.html
This design lets you retain easy access to large sets of pre-existing libraries as well as have a "compiled/statically typed" situation, if you want. It also leverages familiarity with your existing programming languages. I adapted a similar small program like this to emit a C program, but anything else is obviously pretty easy. Easy is good. Familiar is good.
Interactivity-wise, with a TinyC/tcc fast-running compiler backend, my `rp` programs run sub-second from ENTER to completion on small data. Even with the non-optimizing tcc, they still run faster than byte-compiled/VM-interpreted mawk/gawk on a per-input-byte basis. If you take the time to do an optimized build with gcc -O3/etc., they can run much faster.
And I leave the source code around if you want to just use the program generator as a way to save keystrokes/get a fast start on a row processing program.
Anyway, I'm not trying to start a language holy war, but just exhibit how if you rotate the problem (or your head looking at the problem) ever so slightly another answer exists in this space and is quite easy. :-)
[1] https://github.com/c-blake/cligen/blob/master/examples/rp.ni...
For quickly looking at averages/errors, a simple awk one-liner will do.
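For instance, a column mean really is a one-liner (sample data made up; add ss += $1*$1 in the same pattern if you also want a variance):

```shell
# Mean of column 1 across all input lines
printf '1\n2\n3\n4\n' | awk '{s += $1; n++} END {print s/n}'
# prints: 2.5
```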
What are the most common cases where you reach for Awk instead of some other tools?
I recently used it to parse and recombine data from the OpenVPN status file. That file has a few differently formatted tables in the same file. Using Awk, I was able to change a variable as each table was encountered; thus I could change the Awk program's behavior based on which table it was operating on.
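The pattern is a small state machine: a variable records which table's header was last seen, and each rule is guarded by it. A sketch with invented section headers (the real OpenVPN status file's headers differ):

```shell
# Switch parsing behavior per table section within one file
printf 'CLIENTS\nalice 10.0.0.2\nROUTES\n10.0.0.2 alice\n' | awk '
    /^CLIENTS/ { table = "clients"; next }
    /^ROUTES/  { table = "routes";  next }
    table == "clients" { print "client:", $1 }
    table == "routes"  { print "route:",  $1 }'
# prints: client: alice / route: 10.0.0.2
```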
#!/bin/gawk -f
# /usr/local/bin/awkmail to from subj < in
BEGIN {
    smtp = "/inet/tcp/0/smtp.yourhost.com/25"  # /inet/protocol/local-port/remote-host/remote-port
    ORS = "\r\n"; r = ARGV[1]; s = ARGV[2]; sbj = ARGV[3]
    print "helo " ENVIRON["HOSTNAME"] |& smtp; smtp |& getline j; print j
    print "mail from: " s |& smtp; smtp |& getline j; print j
    if (match(r, ",")) {
        split(r, z, ",")
        for (y in z) { print "rcpt to: " z[y] |& smtp; smtp |& getline j; print j }
    }
    else { print "rcpt to: " r |& smtp; smtp |& getline j; print j }
    print "data" |& smtp; smtp |& getline j; print j
    print "From: " s |& smtp; ARGV[2] = ""  # not a file
    print "To: " r |& smtp; ARGV[1] = ""    # not a file
    if (length(sbj)) { print "Subject: " sbj |& smtp; ARGV[3] = "" }  # not a file
    print "" |& smtp
    while (getline > 0) print |& smtp
    print "." |& smtp; smtp |& getline j; print j
    print "quit" |& smtp; smtp |& getline j; print j
    close(smtp)
}
This allows me to bypass the local MTA (if present). The message ID is also returned, which can be useful to log.
The PHP fgetcsv function has been more convenient when I have had more exotic examples.
If the CSV is simple, awk remains a very good tool.
This is the counterpoint to all the "success" stories of awk users who walked away with an underspecced and underdeveloped 5-minute solution.
I used to write event-driven scripts off it - each line is a message, interpreted by awk. Something I was not able to get working with any of the awks I tried was where you append messages to the file as you are consuming it (this is kind of like code generation). I ended up doing this in python (https://github.com/cratuki/interface_script_py).
Any time I want to process a bunch of lines in a text file, awk is my first consideration.
It's also generally unreadable.