sort -t $'\t' -k2,2 -k1,1 -k3,3 contacts.tsv
sort -t $'\t' -k1,1 -k2,2 -k3,3 contacts.tsv
This assumes TSV input, but there are plenty of reasons to prefer that to CSV. If I'm working from CSV sources I usually convert to TSV first thing in my shell pipeline.
To overcome the slowdown of disk I/O, one workaround could be to use mfs or tmpfs, something like:
mkdir /dir
mount -t tmpfs tmpfs /dir
TMPDIR=/dir sort -t $'\t' -k2,2 -k1,1 -k3,3 contacts.tsv
TMPDIR=/dir sort -t $'\t' -k1,1 -k2,2 -k3,3 contacts.tsv
Personally, I gave up on sort for large files and use k/kdb+. I suspect it is faster for sorting than sort or the Go libraries, but I could be wrong.
Instead, your best bet in that case is to give sort as much physical memory as you can spare:
sort -S 95% -k1 huge.tsv
Extra disk I/O is inevitable since your dataset doesn't fit in memory. At least during a merge sort your disk reads will be O(N) and sequentially ordered.
Note: in the special case that your dataset is slightly larger than physical memory, splitting it up in advance such that one of the `sort -m` input files lives on a tmpfs should indeed be faster.
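The split-then-merge idea can be sketched roughly as below. This is only a sketch, assuming GNU coreutils; the function name `split_sort_merge`, its arguments, and the chunk size are hypothetical. For simplicity it puts every chunk in a single work directory, whereas the suggestion above would place only one chunk on tmpfs.

```shell
# Hypothetical helper: split a file into chunks, sort each chunk,
# then merge the sorted chunks with `sort -m`.
# Arguments: input file, output file, lines per chunk, work directory.
split_sort_merge() {
  local input=$1 output=$2 lines=$3 workdir=$4
  local chunk

  # Split on line boundaries into chunk.aa, chunk.ab, ...
  split -l "$lines" "$input" "$workdir/chunk."

  # Sort each chunk in place; -o may safely name the input file.
  for chunk in "$workdir"/chunk.*; do
    sort -k1,1 "$chunk" -o "$chunk"
  done

  # -m merges inputs that are already sorted, without re-sorting them.
  sort -m -k1,1 "$workdir"/chunk.* -o "$output"

  rm -f "$workdir"/chunk.*
}
```

Something like `split_sort_merge huge.tsv sorted.tsv 1000000 /dir` would then use the tmpfs mount from earlier in the thread as the work directory, provided the chunks actually fit there.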
Other things to check out if you need Very Fast Large Sorts:
- Use `sort --parallel=N` to control how many cores it uses. GNU sort defaults to the number of available processors, capped at 8.
- Use `sort --batch-size=NMERGE` to increase the number of files merged at once. Otherwise you may be doing more mergesort stages than are necessary.
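Put together, the options above might look something like this. It is a sketch assuming GNU sort and `nproc` from coreutils; the wrapper name `fast_sort` and the specific values (batch size 64, 50% of memory) are arbitrary placeholders, not tuned recommendations.

```shell
# Hypothetical wrapper combining --parallel, --batch-size and -S.
# Arguments: input file, output file.
fast_sort() {
  local input=$1 output=$2

  # --parallel: one sort thread per available core (as reported by nproc)
  # --batch-size: merge up to 64 temporary files per pass
  # -S: allow the in-memory buffer to use up to 50% of physical memory
  sort --parallel="$(nproc)" --batch-size=64 -S 50% \
       -k1,1 "$input" -o "$output"
}
```

Note that `--batch-size` is bounded by the open-file limit of the process, so very large values may need `ulimit -n` raised first.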
[1] https://github.com/johnweldon/sortcsv/blob/55818bd8e5f9feecc...
http://csvkit.readthedocs.io/en/1.0.2/scripts/csvsort.html
Some of my favorite tools it includes are csvsql and csvlook.
Use to8 to convert from UTF(32|16)(LE)? etc. to UTF8 first, then sort with this tool.
I've used iconv for years and it's never let me down.
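For the iconv route, a minimal sketch might look like the following. It assumes the input is UTF-16LE and uses plain `sort` as a stand-in for whichever sorting tool you prefer; the function name `to_utf8_and_sort` is hypothetical.

```shell
# Hypothetical helper: convert a UTF-16LE file to UTF-8, then sort it.
# Argument: input file. Sorted UTF-8 text is written to stdout.
to_utf8_and_sort() {
  local input=$1

  # -f is the source encoding, -t the target encoding
  iconv -f UTF-16LE -t UTF-8 "$input" | sort -k1,1
}
```

If the source encoding is not known in advance, it has to be detected or supplied by hand, since iconv does not guess it for you.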
> Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON.