Show HN: Command Line Tool to Sort CSV and TSV Files by Multiple Headings in Go

(github.com)

29 points · johnweldon · 8y ago · 19 comments

sigil · 8y ago

Equivalent sort(1) invocations for your examples:

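    # -kN,N limits each key to field N (a bare -kN runs to end of line); -t sets tab as the separator
    sort -t$'\t' -k2,2 -k1,1 -k3,3 contacts.tsv
    sort -t$'\t' -k1,1 -k2,2 -k3,3 contacts.tsv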

This assumes TSV input, but there are plenty of reasons to prefer that to CSV. If I'm working from CSV sources I usually convert to TSV first thing in my shell pipeline.

feelin_googley · 8y ago

When sort is used on really large files, it will automatically attempt to use disk, putting temp files in TMPDIR. This can be really slow.

To overcome the slowdown of disk I/O, perhaps a workaround could be to use mfs or tmpfs, maybe something like:

   mkdir /dir
   mount -t tmpfs tmpfs /dir
   TMPDIR=/dir sort -t$'\t' -k2,2 -k1,1 -k3,3 contacts.tsv
   TMPDIR=/dir sort -t$'\t' -k1,1 -k2,2 -k3,3 contacts.tsv

Personally, I gave up on sort for large files and use k/kdb+. I suspect it is faster for sorting than sort or the Go libraries, but I could be wrong.

sigil · 8y ago

For a dataset larger than physical memory, using a memory filesystem like tmpfs for the merge stage will either swap (|tmpfs| < |ram|) or deadlock (|tmpfs| >= |ram|).

Instead, your best bet in that case is to give sort as much physical memory as you can spare:

    sort -S 95% -k1 huge.tsv

Extra disk I/O is inevitable since your dataset doesn't fit in memory. At least during a merge sort your disk reads will be O(N) and sequentially ordered.

Note: in the special case that your dataset is slightly larger than physical memory, splitting it up in advance such that one of the `sort -m` input files lives on a tmpfs should indeed be faster.
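A rough sketch of that split-then-merge idea, assuming a headerless file keyed on its first field. The name `sort_in_chunks` and the chunk size are hypothetical, and every chunk lands in one temp directory here; to follow the suggestion above you would place one of the chunks on a tmpfs mount instead.

```shell
# sort_in_chunks FILE LINES_PER_CHUNK  (hypothetical helper for illustration)
# Splits FILE into pieces, sorts each piece on field 1, then merges them.
sort_in_chunks() {
  workdir=$(mktemp -d) || return 1
  # Assumption: all chunks share one temp dir; one of them could instead
  # be written to a tmpfs mount before the merge step.
  split -l "$2" "$1" "$workdir/chunk."
  for chunk in "$workdir"/chunk.*; do
    sort -k1,1 -o "$chunk" "$chunk"   # -o may safely name the input file
  done
  sort -m -k1,1 "$workdir"/chunk.*    # -m merges already-sorted inputs
  rm -rf "$workdir"
}
```

Usage would look like `sort_in_chunks huge.tsv 1000000 > sorted.tsv`. Plain sort(1) already does this kind of chunking internally, so splitting by hand probably only pays off when you want control over where each chunk lives.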

Other things to check out if you need Very Fast Large Sorts:

- Use `sort --parallel=N` to set how many cores sort uses. GNU sort defaults to the number of available processors, capped at 8.

- Use `sort --batch-size=NMERGE` to increase the number of files merged at once. Otherwise you may be doing more mergesort stages than are necessary.
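Put together, the options above might be combined along these lines. This assumes GNU sort and tab-separated input; `big_sort` is a hypothetical wrapper name, and the numbers are placeholders to tune for your machine rather than recommendations.

```shell
# big_sort FILE  (hypothetical wrapper; the values are placeholders to tune)
big_sort() {
  tab=$(printf '\t')
  # -S: in-memory buffer size, --parallel: sort threads,
  # --batch-size: how many temp files are merged at once (GNU extensions)
  sort -S 50% --parallel=4 --batch-size=32 -t "$tab" -k1,1 "$1"
}
```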

johnweldon (OP) · 8y ago

Thanks - I've used sort quite a bit, and I like it. I wrote this partly to just fulfil my desire to sort by named fields rather than column indexes.

sigil · 8y ago

I know the annoyance you're talking about, but I think you're better off wrapping sort(1) with something that translates from column names to indexes. Among the reasons: sortcsv buffers all input into memory [1], while sort(1) uses a divide-and-conquer merge sort to avoid this.

[1] https://github.com/johnweldon/sortcsv/blob/55818bd8e5f9feecc...
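A minimal sketch of such a wrapper, assuming TSV input whose first line is a header with unique, non-empty column names. `sort_by_name` is a hypothetical helper, not part of sort(1) or sortcsv. It looks up each requested name in the header, turns it into a `-kN,N` key, and lets sort(1) handle the rows.

```shell
# sort_by_name FILE COLUMN...  (hypothetical helper for illustration)
# Sorts the data rows of a TSV file by the named columns, header first.
sort_by_name() {
  file=$1; shift
  tab=$(printf '\t')
  header=$(head -n 1 "$file")
  keys=
  for name in "$@"; do
    # Find the 1-based position of the column whose header equals $name.
    idx=$(printf '%s\n' "$header" |
      awk -F '\t' -v name="$name" \
        '{ for (i = 1; i <= NF; i++) if ($i == name) { print i; exit } }')
    if [ -z "$idx" ]; then
      echo "sort_by_name: no column named '$name'" >&2
      return 1
    fi
    keys="$keys -k$idx,$idx"
  done
  printf '%s\n' "$header"
  # $keys is left unquoted on purpose so each -kN,N is a separate argument.
  tail -n +2 "$file" | sort -t "$tab" $keys
}
```

For example, `sort_by_name contacts.tsv last first` would sort by the `last` column and break ties on `first`, assuming the file has columns with those names. Only the header line is inspected up front; the rows themselves stream through sort(1), so its on-disk merge behaviour still applies.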

bfrog · 8y ago

I've been using https://github.com/BurntSushi/xsv which is quite nice and has a few other very handy csv tools.

danso · 8y ago

xsv is fantastic. I'm a longtime user and fan of csvkit but I've slowly switched some of my habitual usage to xsv. Note that csvkit -- not being a single program like xsv but rather a collection of utilities -- contains a few branches of functionality that xsv doesn't aim to replicate, namely csvsql (convert CSV into SQL create and insert statements) and in2csv (convert XLS and JSON to CSV).

johnweldon (OP) · 8y ago

I hadn't seen that tool, thanks for pointing it out.

bfrog · 8y ago

Indeed, perhaps it will give you some fun ideas!

z1mm32m4n · 8y ago

I’m a huge fan of csvkit, which includes a similar utility along with lots more:

http://csvkit.readthedocs.io/en/1.0.2/scripts/csvsort.html

Some of my favorite tools it includes are csvsql and csvlook.

johnweldon (OP) · 8y ago

Cool, looks like a nicely built set of utilities in python. Thanks for the link.

taylodl · 8y ago

Thanks! csvkit looks awesome!

johnweldon (OP) · 8y ago

Works great with previously shared Go command line tool jw4.us/to8 when input files are not UTF8.

Use to8 to convert from UTF(32|16)(LE)? etc. to UTF8 first, then sort with this tool.

donatj · 8y ago

Is there an advantage to to8 over iconv?

I've used iconv for years and it's never let me down.

johnweldon (OP) · 8y ago

I wrote this tool because I don't want to explicitly know the original encoding, I just want _any_ encoding to be converted to UTF8. AFAIK, iconv requires the source encoding to be specified on the command line.
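For comparison, here is one way the "don't make me name the encoding" behaviour could be approximated on top of iconv. This is only a sketch under a strong assumption, namely that UTF-16/UTF-32 input starts with a byte order mark; it is not a description of how to8 actually detects encodings, and `to_utf8` is a hypothetical name.

```shell
# to_utf8 FILE  (hypothetical helper; relies on a leading byte order mark)
to_utf8() {
  # First four bytes as lowercase hex, e.g. "fffe4100".
  bom=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  case $bom in
    # UTF-32 is checked first because its LE BOM starts with the UTF-16LE BOM.
    0000feff|fffe0000) iconv -f UTF-32 -t UTF-8 "$1" ;;
    feff*|fffe*)       iconv -f UTF-16 -t UTF-8 "$1" ;;
    # No UTF-16/32 BOM: assume the file is already UTF-8 and pass it through.
    *)                 cat "$1" ;;
  esac
}
```

Files without a BOM fall through untouched, so anything that is neither UTF-8 nor BOM-marked would still need its encoding named explicitly.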

feelin_googley · 8y ago

Can anyone provide sample input and output for the example? I find it difficult to evaluate text processing software quickly against existing solutions when there is no example given, such as: here is some sample input and here is the desired output, as is done at, e.g., unix.com.

johnweldon (OP) · 8y ago

I updated the README.md with some example usage and output. Thanks for the feedback.

mdaniel · 8y ago

While not "in Go", Homebrew showed me this tool a while back and I like it bunches:

> Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON.

https://github.com/johnkerl/miller#readme
