I guess there’s a sort of perma-computing angle here: this format is simple enough that you could pack a lot of almanac data into it, and given a working zlib, get it back out with very limited dependencies.
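A minimal sketch of that read path, using only Python's standard library (zipfile wraps zlib): pack a column-per-file TSV archive, then read one column back out. The file and column names here are made up for illustration.

```python
# Perma-computing sketch: a column-per-file TSV archive readable with
# nothing but the standard library (zipfile sits on top of zlib).
# "almanac/sunrise.tsv" etc. are hypothetical names for illustration.
import io
import zipfile

# Write a tiny archive: one deflate-compressed TSV file per column.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("almanac/sunrise.tsv", "07:42\n07:41\n07:40\n")
    zf.writestr("almanac/sunset.tsv", "16:58\n17:00\n17:01\n")

# Read back a single column without touching the other one.
with zipfile.ZipFile(buf) as zf:
    sunrise = zf.read("almanac/sunrise.tsv").decode().splitlines()

print(sunrise)  # ['07:42', '07:41', '07:40']
```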
But given the petabytes of parquet files out there, I feel like the format is here to stay, much like sqlite is here to stay.
EDIT: there is a great handy CLI tool for doing SQL on parquet, csv, sqlite3, and other tabular data formats called duckdb. Handy for wrangling and analyzing tabular data from 100 to 10m rows and up.
Human-readable comes in handy here.
Well, it's 2 AM, some dork has checked in code which breaks production, and it absolutely positively has to be fixed by 6 AM, before the customer comes in.
Your bleary eyes are scouring log files and data files, trying to find the answer...
... believe me, you will appreciate human-readable formats for both of those. You just want to cat out the entries in the db which the new code can't handle... the last thing you want to do is have to invoke some other tool or write some other script to make the data human readable.
And when you find the problem, you will want to just be able to edit a text file containing test cases to verify the fix.
You don't want to write some script to generate and insert the data... at 2 AM, you are likely to write a buggy script which may keep you from realizing that you've already fixed the problem... or worse, indicate that you have fixed the problem when you haven't.
Fewer moving parts is always better.
This is a classic XY problem. The issue isn't the data format, it's the fact that your organizational processes allow random code pushes at 2am that can break the whole thing.
Parquet, used by basically everyone, isn't human readable (and for good reason): it's for big data storage, retrieval, and processing. CSV is human readable (and for good reason): people use that data in Excel or other spreadsheeting software.
If I’m working with parquet I’ll have duckdb on hand for fiddling with parquet files. I’m much better at SQL at 2 am than I am at piping Unix tools together over N files.
I have no idea how I’d drop bad rows from this thing with a bash pipeline anyways, I need to select from one file to find the bad line numbers (grep I guess, I’ll need to look up how to cut just the line number), and then delete those lines from all the files in a zip (??). Sounds a lot harder than a single SELECT WHERE NOT or DELETE WHERE.
Most such formats support efficient querying by skipping the disk read step entirely when a chunk of data is not relevant to a query. This is done by splitting the data into segments of about 100K rows, and then calculating the min/max range for each column. That is stored separately in a header or small metadata file. This allows huge chunks of the data to be entirely skipped if it falls out of range of some query predicate.
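A hedged sketch of that min/max "zone map" idea: chunk a column, record each chunk's min/max, and skip any chunk whose range can't satisfy the predicate. The chunk size and data here are illustrative (real formats use segments on the order of 100K rows).

```python
# Zone-map skipping sketch: per-chunk min/max stats let a scan skip
# whole chunks without reading them. Sizes/data are illustrative.

def build_zone_map(values, chunk_size):
    """Return (chunks, stats) where stats[i] = (min, max) of chunks[i]."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    return chunks, [(min(c), max(c)) for c in chunks]

def scan_gt(chunks, stats, threshold):
    """Find values > threshold, reading only chunks that might contain them."""
    hits, chunks_read = [], 0
    for chunk, (lo, hi) in zip(chunks, stats):
        if hi <= threshold:      # whole chunk out of range: skip the read
            continue
        chunks_read += 1
        hits.extend(v for v in chunk if v > threshold)
    return hits, chunks_read

chunks, stats = build_zone_map(list(range(1000)), chunk_size=100)
hits, chunks_read = scan_gt(chunks, stats, 900)
print(len(hits), chunks_read)  # 99 matches found while reading 1 of 10 chunks
```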
PS: the same compression ratio advantages could be achieved by compressing columns stored as JSON arrays, but such a format could encode all Unicode characters and has a readily available decoder in all mainstream programming languages.
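A sketch of that "columns as JSON arrays" alternative: the same column-major layout, but each column is a JSON array, so any language's stock JSON decoder can read it while zlib still sees long runs of similar values. The record shape is made up for illustration.

```python
# Columnar JSON sketch: pivot rows into one JSON array per column,
# then zlib-compress each column. Decoding needs only zlib + JSON,
# both available in every mainstream language. Data is illustrative.
import json
import zlib

rows = [{"sku": "A1", "price": 111.11}, {"sku": "A2", "price": 222.22}] * 500

# Pivot row-major records into one JSON array per column, then compress.
columns = {k: [r[k] for r in rows] for k in rows[0]}
packed = {k: zlib.compress(json.dumps(v).encode()) for k, v in columns.items()}

# Round-trip a single column without touching the others.
prices = json.loads(zlib.decompress(packed["price"]))
assert prices == [r["price"] for r in rows]
```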
> Price⇥⇥0 {rows:2, distinct:2, minvalue:111.11, maxvalue:222.22} 111.11⮐222.22⮐
> Price⇥⇥1 {rows:1, distinct:1, minvalue:333.33, maxvalue:333.33} 333.33⮐
Some future programmer will be cursing my name as they try to make columnar JSON decoding performant.
What this gets right: Part of the reason you want to store columns together is that similar values compress well, so you can reduce your IO: smaller files are faster to load into memory. However in many cases (e.g. Arrow, Parquet) lightweight compression formats are preferred here, e.g. run-length encoding (1,1,1,1,1,1,5,5,5,5,3,3,3 -> 6x1, 4x5, 3x3) or dictionary encoding (if your column is enum-like, you can store each enum value as a byte flag) because they can be scanned without decoding, amplifying your savings.
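Both lightweight encodings mentioned above can be sketched in a few lines; run-length encoding collapses repeats into (count, value) pairs, and dictionary encoding maps each distinct value to a small integer code plus a lookup table.

```python
# Sketch of the two lightweight encodings: RLE for repetitive columns,
# dictionary encoding for enum-like ones. Both stay scannable without
# a full decompression pass.

def rle_encode(values):
    """[1,1,1,1,1,1,5,5,5,5,3,3,3] -> [(6, 1), (4, 5), (3, 3)]"""
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1] = (runs[-1][0] + 1, v)
        else:
            runs.append((1, v))
    return runs

def dict_encode(values):
    """Map each distinct value to a small integer code."""
    dictionary = {}
    codes = [dictionary.setdefault(v, len(dictionary)) for v in values]
    return codes, list(dictionary)  # codes + value lookup table

print(rle_encode([1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 3, 3, 3]))
# [(6, 1), (4, 5), (3, 3)]
print(dict_encode(["red", "red", "blue", "red"]))
# ([0, 0, 1, 0], ['red', 'blue'])
```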
What it misses on (IMHO):
- There's a metadata field, but it doesn't contain any offsets for jumping to a specific column quickly. So if you have 8 columns of 2GB each, just getting to the 7th column means reading 12GB first, which is quite wasteful. If you store just an offset, you could be reading a handful of bytes instead. Massive savings.
- Within each column, how do you get to the range of values you want? Most columnar formats have stripes (i.e. stored in chunks of X rows each) which contain statistics (this stripe's values fall between min A and max B) that allow you to skip chunks really fast. So even within that 2GB column, you read little more than you strictly have to.
If this reminds you of an on-disk tree where you first hop to a column and then hop to some specific stripes, yeah, that's pretty much the idea.
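The footer-offset idea from the first bullet can be sketched as a toy file layout: each column is written as its own byte blob, followed by a footer mapping column name to (offset, length), so a reader can hop straight to the 7th column. The format here is invented for illustration; real formats like Parquet use a richer footer.

```python
# Toy footer-offset layout: column blobs, then a JSON footer mapping
# name -> (offset, length), then the footer's length as the last 4 bytes.
# A reader jumps to any column via two tiny reads instead of a full scan.
import io
import json
import struct

def write_columnar(columns):
    buf = io.BytesIO()
    footer = {}
    for name, blob in columns.items():
        footer[name] = (buf.tell(), len(blob))
        buf.write(blob)
    footer_bytes = json.dumps(footer).encode()
    buf.write(footer_bytes)
    buf.write(struct.pack("<I", len(footer_bytes)))  # footer length marker
    return buf.getvalue()

def read_column(data, name):
    (footer_len,) = struct.unpack("<I", data[-4:])
    footer = json.loads(data[-4 - footer_len:-4])
    offset, length = footer[name]
    return data[offset:offset + length]  # a real reader would seek() here

data = write_columnar({f"col{i}": bytes([i]) * 8 for i in range(8)})
print(read_column(data, "col6"))  # b'\x06\x06\x06\x06\x06\x06\x06\x06'
```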
-----
Sidenote: I've generally concluded that "human readable" is only a virtue for encoding formats that aren't doing heavy lifting, like the API call your web app is sending to the backend. Even in that case, your HTTP request is wrapped in gzip, wrapped in TLS, wrapped in TCP and chunked to all hell. No one complains about the burden caused by those. So what's one more layer of decoding? We can just demand to have tools that are not terrible, and the result is pretty transparent to us. The format is mostly for the computer, not you.
When I hear about stuff like terabytes of JSON just being dumped into s3 buckets and then consumed again by some other worker I have a fit because it's so easy and cheap these days not to be that wasteful.
The price you pay is that it is inefficient for single record access, or for "select * " kind of queries.
Also parquet has lots of features that'll get you to the general vicinity of a single record tolerably fast without sacrificing much in terms of storage or computational complexity. It's a small price for a big win.
Thanks for taking a look.
At least you could use something less likely to appear in data as a record separator (like 0x1E).
Otherwise it's an interesting idea!
We work with Parquet + Arrow every day at $DAYJOB in a ML and Big Data context and it's been great. We don't even think we're using it to its fullest potential, but it's never been the bottleneck for us.
It also wouldn't be that hard to make it seekable. All you would have to do is make each tsv file two columns: record-id, value.
> ZIP files are a collection of individually compressed files, with a directory as a footer to the file, which makes it easy to seek to a specific file without reading the whole file... The nature of .zip files makes it possible to seek and read just the columns required without having to read/decode the other columns.