* Using AWK and R to parse 25TB - https://news.ycombinator.com/item?id=20293579
* Command-line Tools can be 235x Faster than a Hadoop Cluster - https://news.ycombinator.com/item?id=17135841
* The State of the AWK - https://news.ycombinator.com/item?id=23240800
Among alternative awk implementations, I'm keeping an eye on frawk [0]. It aims to be faster, supports CSV, etc.
https://www.gnu.org/software/gawk/manual/html_node/Splitting...
https://github.com/e36freak/awk-libs/blob/master/csv.awk
https://raw.githubusercontent.com/Nomarian/Awk-Batteries/mas...
Surprisingly and unnecessarily so:
> ["DSV"] is to Unix what CSV (comma-separated value) format is under Microsoft Windows and elsewhere outside the Unix world. CSV (fields separated by commas, double quotes used to escape commas, no continuation lines) is rarely found under Unix.
> In fact, the Microsoft version of CSV is a textbook example of how not to design a textual file format. Its problems begin with the case in which the separator character (in this case, a comma) is found inside a field. The Unix way would be to simply escape the separator with a backslash, and have a double escape represent a literal backslash. This design gives us a single special case (the escape character) to check for when parsing the file, and only a single action when the escape is found (treat the following character as a literal). The latter conveniently not only handles the separator character, but gives us a way to handle the escape character and newlines for free. CSV, on the other hand, encloses the entire field in double quotes if it contains the separator. If the field contains double quotes, it must also be enclosed in double quotes, and the individual double quotes in the field must themselves be repeated twice to indicate that they don't end the field.
> The bad results of proliferating special cases are twofold. First, the complexity of the parser (and its vulnerability to bugs) is increased. Second, because the format rules are complex and underspecified, different implementations diverge in their handling of edge cases. Sometimes continuation lines are supported, by starting the last field of the line with an unterminated double quote — but only in some products! Microsoft has incompatible versions of CSV files between its own applications, and in some cases between different versions of the same application (Excel being the obvious example here).
— The Art of Unix Programming http://www.catb.org/~esr/writings/taoup/html/ch05s02.html
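The quote's first complaint is easy to demonstrate: naive field splitting on the separator miscounts fields as soon as a quoted CSV field contains a comma. A small illustration with made-up sample data:

```shell
# Naive comma splitting breaks when a quoted CSV field contains the separator:
printf '"Smith, John",42\n' | awk -F, '{print NF}'
# prints 3 -- the quoted field was split in two, not the 2 fields intended
```

A correct CSV parser has to track quoting state instead of just splitting on commas, which is exactly the parser complexity the quote is complaining about.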
Awk sometimes proves surprisingly powerful. Just look at the concision of this awk one liner doing a fairly complex job:
zcat large.log.gz | awk '{print $0 | "gzip -v9c > large.log-"$1"_"$2".gz"}' # Breakup compressed log by syslog date and recompress. #awksome
Taken from: https://mobile.twitter.com/climagic/status/61415389723039744...
A 5 min job that probably won't get extended, saving you from having to spend 20 mins coding something up, is better than feeling annoyed that you spent the 20 mins coding up the original implementation and then had to extend it anyway.
Hopefully, you also get the benefit of additional knowledge on that future implementation as well. Why wouldn't this just be a net win?
Unless you're talking about writing hack after hack after hack, eventually leaving yourself with some incomprehensible eldritch monstrosity, in which case, don't do that?
[edit]
Here's the link to the gawk documentation, but most flavors of AWK work similarly: https://www.gnu.org/software/gawk/manual/gawk.html#Close-Fil...
I'd recommend learning Awk with some kind of real-world use of your own. BTW, it reminded me of using XSLT, which I think is another often-overlooked "good thing".
You might not have perl or python. You WILL have AWK. Only the most minimal of minimal linux systems will exclude it. Even busybox includes awk. That's how essential it's viewed.
I like awk, mind, but this is not necessarily (IME) a good argument for it.
Python: almost every conventional server. Python dependencies are so ubiquitous that you aren't likely to find a Linux install without it.
Perl: every DEB and RPM machine, and anything with Git installed. You can't really escape it, unless you're embedded.
PowerShell (yeah, I know): every Windows machine from XP onwards (though usable only from 7 onwards), and some Linux computers if installed.
Java: lots and lots of places will have this available.
Dockerized runtime of your choice: not ubiquitous, but I expect more and more developer machines and servers to gain Docker or Docker-like container support.
There really isn't any reason to stick to AWK, unless you're working directly on embedded devices or just like using it.
Arguably, the software world would be better off if more people did code with those 1970s languages, than with the ones we are stuck with now.
And that applies to Awk, too. As the author quotes Neil Ormos stating, Awk is well suited for personal computing, something which we have gotten further and further from even as computers have become more distributed. At what point in history has such a large fraction of the human race had the ability to calculate to such an amazing order of magnitude, and at what point in history has such a large fraction of the same human race not bothered with calculation?
Awk is a great tool precisely because it puts quite a lot of expressive power in the hands of an average user on a Unix system. Sure, on a Lisp machine or Smalltalk machine there really isn't the same need for Awk: the systems languages on such machines are safe enough and expressive enough to do what Awk does. But in the Unix context — which is basically what we're all living in, with even the VMS-derived Windows more-or-less adhering to the Unix model — Awk is a godsend.
edit: correct typo
Here's the source for the fork() extension that ships with gawk...it's ~150 lines or so: https://git.savannah.gnu.org/cgit/gawk.git/tree/extension/fo...
I was able to make a (terrible/joke/but-it-kinda-works) web server with gawk using the extensions that ship with it: https://gist.github.com/willurd/5720255#gistcomment-3143007
The C interop and namespaces (also in gawk) are a bridge too far for me. By the time you need one of those, it's time to look for another language. Awk is just not enough of a language to write serious programs in. And I really like awk. It has enabled great scripting not only for log files, but also for dictionaries, back in the day when it was still hard to load one into memory.
That is my opinion, it is mine, and belongs to me and I own it, and what it is too.
Those were the days when things like GC, hashmaps, file operations etc. were hard things on Unix.
In the middle of that job, my supervisor said: you know what, we're doing increasingly complicated things with awk and it's getting increasingly hacky... I've heard that Perl is like awk but better. Do you want to learn Perl and switch to that?
And so we did. My thought then was there was little that was easier in awk than Perl, you could use Perl very much like awk if you wanted, you can even use the right command-line args to have Perl have an "implied loop" like awk... but then you can do a lot more with Perl too.
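To illustrate the "implied loop": Perl's -n switch wraps your code in an implicit while(<>) loop, -a autosplits each line into @F, and -l handles newlines, which together reproduce awk's default processing model. A quick side-by-side:

```shell
# awk: implicit per-line loop, fields in $1..$NF
printf 'foo bar\nbaz qux\n' | awk '{print $2}'
# Perl: -n gives the implicit loop, -a autosplits into @F, -l chomps/adds newlines
printf 'foo bar\nbaz qux\n' | perl -lane 'print $F[1]'
# both print: bar / qux
```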
I don't use Perl anymore. Or awk.
Perl will do those things where AWK really shines and if the problem got bigger, Perl was easier to deal with.
Perl is no more complex than Python, Ruby, or Powershell. If you use any of those you can be productive with Perl in a few hours.
Perl is still used, it is just not as popular as it was in the past. Do you use Git? Large parts of Git were originally written in Perl; some parts still are, though much has been migrated to C over time.
Also, it's usually the same kind of Perl, so you don't have to worry about whether awk is the "one true" one, or mawk, or gawk...
Sadly I don't remember which book it was, but this page looks like a good start: https://ferd.ca/awk-in-20-minutes.html
This design lets you retain easy access to large sets of pre-existing libraries as well as have a "compiled/statically typed" situation, if you want. It also leverages familiarity with your existing programming languages. I adapted a similar small program like this to emit a C program, but anything else is obviously pretty easy. Easy is good. Familiar is good.
Interactivity-wise, with a TinyC/tcc fast-running compiler backend, my `rp` programs run sub-second from ENTER to completion on small data. Even with the non-optimizing tcc, they still run faster than byte-compiled/VM-interpreted mawk/gawk on a per-input-byte basis. If you take the time to do an optimized build with gcc -O3/etc., they can run much faster.
And I leave the source code around if you want to just use the program generator as a way to save keystrokes/get a fast start on a row processing program.
Anyway, I'm not trying to start a language holy war, but just exhibit how if you rotate the problem (or your head looking at the problem) ever so slightly another answer exists in this space and is quite easy. :-)
[1] https://github.com/c-blake/cligen/blob/master/examples/rp.ni...
For quickly looking at averages/errors, a simple awk one-liner will do.
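For instance, a column mean really is a one-liner (sample data made up; add ss += $1*$1 in the same pattern if you also want a variance):

```shell
# Mean of column 1 across all input lines
printf '1\n2\n3\n4\n' | awk '{s += $1; n++} END {print s/n}'
# prints: 2.5
```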
What are the most common cases where you reach for Awk instead of some other tools?
I recently used it to parse and recombine data from the OpenVPN status file. That file has a few differently formatted tables in the same file. Using Awk, I was able to change a variable as each table was encountered; thus I could change the Awk program's behavior based on which table it was operating on.
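The pattern is a small state machine: a variable records which table's header was last seen, and each rule is guarded by it. A sketch with invented section headers (the real OpenVPN status file's headers differ):

```shell
# Switch parsing behavior per table section within one file
printf 'CLIENTS\nalice 10.0.0.2\nROUTES\n10.0.0.2 alice\n' | awk '
    /^CLIENTS/ { table = "clients"; next }
    /^ROUTES/  { table = "routes";  next }
    table == "clients" { print "client:", $1 }
    table == "routes"  { print "route:",  $1 }'
# prints: client: alice / route: 10.0.0.2
```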
#!/bin/gawk -f
# /usr/local/bin/awkmail to from subj < in
BEGIN {
    smtp = "/inet/tcp/0/smtp.yourhost.com/25"  # /inet/protocol/local-port/remote-host/remote-port
    ORS = "\r\n"; r = ARGV[1]; s = ARGV[2]; sbj = ARGV[3]
    print "helo " ENVIRON["HOSTNAME"] |& smtp; smtp |& getline j; print j
    print "mail from: " s |& smtp; smtp |& getline j; print j
    if (match(r, ",")) {
        split(r, z, ",")
        for (y in z) { print "rcpt to: " z[y] |& smtp; smtp |& getline j; print j }
    }
    else { print "rcpt to: " r |& smtp; smtp |& getline j; print j }
    print "data" |& smtp; smtp |& getline j; print j
    print "From: " s |& smtp; ARGV[2] = ""  # not a file
    print "To: " r |& smtp; ARGV[1] = ""    # not a file
    if (length(sbj)) { print "Subject: " sbj |& smtp; ARGV[3] = "" }  # not a file
    print "" |& smtp
    while (getline > 0) print |& smtp
    print "." |& smtp; smtp |& getline j; print j
    print "quit" |& smtp; smtp |& getline j; print j
    close(smtp)
}
This allows me to bypass the local MTA (if present). The message ID is also returned, which can be useful to log.
The PHP fgetcsv function has been more convenient when I have had more exotic examples.
If the CSV is simple, awk remains a very good tool.
This is the counterpoint to all the "success" stories of awk users who walked away with an underspecced and underdeveloped 5-minute solution.
I used to write event-driven scripts off it - each line is a message, interpreted by awk. Something I was not able to get working with any of the awks I tried was where you append messages to the file as you are consuming it (this is kind of like code generation). I ended up doing this in python (https://github.com/cratuki/interface_script_py).
Any time I want to process a bunch of lines in a text file, awk is my first consideration.
It's also generally unreadable.