I guess there’s a sort of perma-computing angle here: this format is simple enough that you could pack a lot of almanac data into it, and given a working zlib, get it back out with very limited dependencies.
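A minimal sketch of that read path, using only Python's standard library (zipfile wraps zlib): pack a column-per-file TSV archive, then read one column back out. The file and column names here are made up for illustration.

```python
# Perma-computing sketch: a column-per-file TSV archive readable with
# nothing but the standard library (zipfile sits on top of zlib).
# "almanac/sunrise.tsv" etc. are hypothetical names for illustration.
import io
import zipfile

# Write a tiny archive: one deflate-compressed TSV file per column.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("almanac/sunrise.tsv", "07:42\n07:41\n07:40\n")
    zf.writestr("almanac/sunset.tsv", "16:58\n17:00\n17:01\n")

# Read back a single column without touching the other one.
with zipfile.ZipFile(buf) as zf:
    sunrise = zf.read("almanac/sunrise.tsv").decode().splitlines()

print(sunrise)  # ['07:42', '07:41', '07:40']
```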
But given the petabytes of parquet files out there, I feel like the format is here to stay, much like sqlite is here to stay.
EDIT: there is a great handy CLI tool for doing SQL on parquet, csv, sqlite3, and other tabular data formats called duckdb. Handy for wrangling and analyzing tabular data from 100 to 10m rows and up.
Human-readable comes in handy here.
Well, it's 2 AM, some dork has checked in code which breaks production, and it absolutely positively has to be fixed by 6 AM, before the customer comes in.
Your bleary eyes are scouring log files and data files, trying to find the answer...
... believe me, you will appreciate human-readable formats for both of those. You just want to cat out the entries in the db which the new code can't handle... the last thing you want to do is have to invoke some other tool or write some other script to make the data human readable.
And when you find the problem, you will want to just be able to edit a text file containing test cases to verify the fix.
You don't want to write some script to generate and insert the data... at 2 AM, you are likely to write a buggy script which may keep you from realizing that you've already fixed the problem... or worse, indicate that you have fixed the problem when you haven't.
Fewer moving parts is always better.
This is a classic XY problem. The issue isn't the data format, it's the fact that your organizational processes allow random code pushes at 2am that can break the whole thing.
Parquet, used by basically everyone, isn't human readable (and for good reason): it's for big data storage, retrieval, and processing. CSV is human readable (and for good reason): people use that data in Excel or other spreadsheeting software.
If I’m working with parquet I’ll have duckdb on hand for fiddling with parquet files. I’m much better at SQL at 2 am than I am at piping Unix tools together over N files.
I have no idea how I’d drop bad rows from this thing with a bash pipeline anyways, I need to select from one file to find the bad line numbers (grep I guess, I’ll need to look up how to cut just the line number), and then delete those lines from all the files in a zip (??). Sounds a lot harder than a single SELECT WHERE NOT or DELETE WHERE.
Most such formats support efficient querying by skipping the disk read step entirely when a chunk of data is not relevant to a query. This is done by splitting the data into segments of about 100K rows, and then calculating the min/max range for each column. That is stored separately in a header or small metadata file. This allows huge chunks of the data to be entirely skipped if it falls out of range of some query predicate.
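A hedged sketch of that min/max "zone map" idea: chunk a column, record each chunk's min/max, and skip any chunk whose range can't satisfy the predicate. The chunk size and data here are illustrative (real formats use segments on the order of 100K rows).

```python
# Zone-map skipping sketch: per-chunk min/max stats let a scan skip
# whole chunks without reading them. Sizes/data are illustrative.

def build_zone_map(values, chunk_size):
    """Return (chunks, stats) where stats[i] = (min, max) of chunks[i]."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    return chunks, [(min(c), max(c)) for c in chunks]

def scan_gt(chunks, stats, threshold):
    """Find values > threshold, reading only chunks that might contain them."""
    hits, chunks_read = [], 0
    for chunk, (lo, hi) in zip(chunks, stats):
        if hi <= threshold:      # whole chunk out of range: skip the read
            continue
        chunks_read += 1
        hits.extend(v for v in chunk if v > threshold)
    return hits, chunks_read

chunks, stats = build_zone_map(list(range(1000)), chunk_size=100)
hits, chunks_read = scan_gt(chunks, stats, 900)
print(len(hits), chunks_read)  # 99 matches found while reading 1 of 10 chunks
```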
PS: the same compression ratio advantages could be achieved by compressing columns stored as JSON arrays, but such a format could encode all Unicode characters and has a readily available decoder in all mainstream programming languages.
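A sketch of that "columns as JSON arrays" alternative: the same column-major layout, but each column is a JSON array, so any language's stock JSON decoder can read it while zlib still sees long runs of similar values. The record shape is made up for illustration.

```python
# Columnar JSON sketch: pivot rows into one JSON array per column,
# then zlib-compress each column. Decoding needs only zlib + JSON,
# both available in every mainstream language. Data is illustrative.
import json
import zlib

rows = [{"sku": "A1", "price": 111.11}, {"sku": "A2", "price": 222.22}] * 500

# Pivot row-major records into one JSON array per column, then compress.
columns = {k: [r[k] for r in rows] for k in rows[0]}
packed = {k: zlib.compress(json.dumps(v).encode()) for k, v in columns.items()}

# Round-trip a single column without touching the others.
prices = json.loads(zlib.decompress(packed["price"]))
assert prices == [r["price"] for r in rows]
```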
> Price⇥⇥0 {rows:2, distinct:2, minvalue:111.11, maxvalue:222.22} 111.11⮐222.22⮐
> Price⇥⇥1 {rows:1, distinct:1, minvalue:333.33, maxvalue:333.33} 333.33⮐
Some future programmer will be cursing my name as they try to make columnar JSON decoding performant.
What this gets right: Part of the reason you want to store columns together is that similar values compress well, so you can reduce your IO: smaller files are faster to load into memory. However in many cases (e.g. Arrow, Parquet) lightweight compression formats are preferred here, e.g. run-length encoding (1,1,1,1,1,1,5,5,5,5,3,3,3 -> 6x1, 4x5, 3x3) or dictionary encoding (if your column is enum-like, you can store each enum value as a byte flag) because they can be scanned without decoding, amplifying your savings.
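Both lightweight encodings mentioned above can be sketched in a few lines; run-length encoding collapses repeats into (count, value) pairs, and dictionary encoding maps each distinct value to a small integer code plus a lookup table.

```python
# Sketch of the two lightweight encodings: RLE for repetitive columns,
# dictionary encoding for enum-like ones. Both stay scannable without
# a full decompression pass.

def rle_encode(values):
    """[1,1,1,1,1,1,5,5,5,5,3,3,3] -> [(6, 1), (4, 5), (3, 3)]"""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1] = (runs[-1][0] + 1, v)
        else:
            runs.append((1, v))
    return runs

def dict_encode(values):
    """Map each distinct value to a small integer code."""
    dictionary = {}
    codes = [dictionary.setdefault(v, len(dictionary)) for v in values]
    return codes, list(dictionary)  # codes + value lookup table

print(rle_encode([1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 3, 3, 3]))
# [(6, 1), (4, 5), (3, 3)]
print(dict_encode(["red", "red", "blue", "red"]))
# ([0, 0, 1, 0], ['red', 'blue'])
```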
What it misses on (IMHO):
- There's a metadata field, but it doesn't contain any offsets for jumping to a specific column quickly. So if you have 8 columns of 2GB each, just getting to the 7th column means reading 12GB first, which is quite wasteful. If you store just an offset, you could be reading a handful of bytes instead. Massive savings.
- Within each column, how do you get to the range of values you want? Most columnar formats have stripes (i.e. stored in chunks of X rows each) which contain statistics (this stripe's values fall between min A and max B) that allow you to skip chunks really fast. So even within that 2GB column, you read little more than you strictly have to.
If this reminds you of an on-disk tree where you first hop to a column and then hop to some specific stripes, yeah, that's pretty much the idea.
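The footer-offset idea from the first bullet can be sketched as a toy file layout: each column is written as its own byte blob, followed by a footer mapping column name to (offset, length), so a reader can hop straight to the 7th column. The format here is invented for illustration; real formats like Parquet use a richer footer.

```python
# Toy footer-offset layout: column blobs, then a JSON footer mapping
# name -> (offset, length), then the footer's length as the last 4 bytes.
# A reader jumps to any column via two tiny reads instead of a full scan.
import io
import json
import struct

def write_columnar(columns):
    buf = io.BytesIO()
    footer = {}
    for name, blob in columns.items():
        footer[name] = (buf.tell(), len(blob))
        buf.write(blob)
    footer_bytes = json.dumps(footer).encode()
    buf.write(footer_bytes)
    buf.write(struct.pack("<I", len(footer_bytes)))  # footer length marker
    return buf.getvalue()

def read_column(data, name):
    (footer_len,) = struct.unpack("<I", data[-4:])
    footer = json.loads(data[-4 - footer_len:-4])
    offset, length = footer[name]
    return data[offset:offset + length]  # a real reader would seek() here

data = write_columnar({f"col{i}": bytes([i]) * 8 for i in range(8)})
print(read_column(data, "col6"))  # b'\x06\x06\x06\x06\x06\x06\x06\x06'
```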
-----
Sidenote: I've generally concluded that "human readable" is only a virtue for encoding formats that aren't doing heavy lifting, like the API call your web app is sending to the backend. Even in that case, your HTTP request is wrapped in gzip, wrapped in TLS, wrapped in TCP and chunked to all hell. No one complains about the burden caused by those. So what's one more layer of decoding? We can just demand to have tools that are not terrible, and the result is pretty transparent to us. The format is mostly for the computer, not you.
When I hear about stuff like terabytes of JSON just being dumped into s3 buckets and then consumed again by some other worker I have a fit because it's so easy and cheap these days not to be that wasteful.
The price you pay is that it is inefficient for single record access, or for "select * " kind of queries.
Also parquet has lots of features that'll get you to the general vicinity of a single record tolerably fast without sacrificing much in terms of storage or computational complexity. It's a small price for a big win.
Thanks for taking a look.
At least you could use something less likely to appear in data as a record separator (like 0x1E).
Otherwise it's an interesting idea!
We work with Parquet + Arrow every day at $DAYJOB in a ML and Big Data context and it's been great. We don't even think we're using it to its fullest potential, but it's never been the bottleneck for us.
It also wouldn't be that hard to make it seekable. All you would have to do is make each tsv file two columns: record-id, value.
> ZIP files are a collection of individually compressed files, with a directory as a footer to the file, which makes it easy to seek to a specific file without reading the whole file... The nature of .zip files makes it possible to seek and read just the columns required without having to read/decode the other columns.