Based on that experience I have come to one key, but maybe counter-intuitive, truth about interchange formats:
- Too much freedom is bad.
Why? Generating interchange data is cheaper than consuming it, because the creator only needs to consider the stuff they want to include, whereas the consumer needs to consider every single possible edge case and scenario the format itself can support.
This is why XML is WAY more costly to ingest than CSV, because in XML someone is going to use: attributes, CDATA, namespaces, comments, different declarations, includes, et al. In CSV they're going to use rows, a field separator, and quotes (with or without escaping). That's it. That's all it supports.
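To illustrate how small that surface is, here's a minimal sketch using Python's built-in csv module (the data is made up):

```python
import csv
import io

# The entire CSV "feature set": rows, a delimiter, and quoting.
raw = 'id,name,notes\n1,Widget,"has a, comma and ""quotes"""\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows)
# [['id', 'name', 'notes'], ['1', 'Widget', 'has a, comma and "quotes"']]
```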
Sqlite as an interchange format is a HORRIFYING suggestion, because every single feature Sqlite supports may need to be supported by consumers. Even if you curtailed Sqlite's vast feature set, you've still created something vastly more expensive to consume than XML, which itself is obnoxious.
My favorite interchange formats are, in order:
- CSV, JSON (inc. NDJSON), YAML, XML, BSON (due to type system), MessagePack, Protobuf, [Giant Gap] Sqlite, Excel (xlsx, et al)
More features mean more cost, more edge cases, more failures, more complex consumers. Keep in mind, this is ONLY about interchange formats between two parties, I have wildly different opinions about what I would use for my own application where I am only ever the creator/consumer, I actually love Sqlite for THAT.
What about Parquet and the like?
So I might do things like have every step in a pipeline begin development by reading from and writing to CSV. This helps with parallel dev work and debugging, and is easy to load into any intermediate format.
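Something like this sketch, with hypothetical file and column names, using Python's stdlib csv module:

```python
import csv

# One pipeline step: read the previous step's CSV output, derive a column,
# write the next step's CSV input. File and column names are made up.
with open("step1_out.csv", newline="") as src, \
     open("step2_in.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["total"])
    writer.writeheader()
    for row in reader:
        row["total"] = float(row["price"]) * int(row["qty"])
        writer.writerow(row)
```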
> do you use some preferred intermediate format?
This is usually dictated by speed vs money calculations, weird context issues, and familiarity. I think it's useful to look at both "why isn't this a file" and "why isn't this all in memory" perspectives.
ORC also looks good, but isn't well supported. I think Parquet is optimal for now for most analytical use cases that don't require human readability.
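For reference, a minimal sketch of the Parquet round trip, assuming pandas with a parquet engine (pyarrow or fastparquet) installed; the data is invented:

```python
import pandas as pd

# Hypothetical intermediate data from one pipeline step.
df = pd.DataFrame({"user_id": [1, 2, 3], "spend": [9.5, 0.0, 12.25]})

# Columnar, typed, compressed on disk -- but not human-readable.
df.to_parquet("spend.parquet")
roundtrip = pd.read_parquet("spend.parquet")
print(roundtrip.dtypes)  # types survive the round trip, unlike CSV
```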
What you say is fair. CSV is underspecified though: there's no company called CSV that's going to sue for trademark enforcement, and there's no official CSV standard library that everyone uses. (Some exist, but there are so many naive implementations written from first principles, because how hard could it be? Output records, use a comma and a newline (of which there are three possible options).)
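A sketch of that failure mode (made-up data):

```python
import csv
import sys

# A record with an embedded comma, quotes, and a newline -- all legal CSV.
record = ["42", 'Acme "Fresh" Foods, Inc.', "123 Main St\nSuite 5"]

# The naive from-first-principles writer: silently produces a broken row.
naive_line = ",".join(record) + "\n"   # and is that \n, \r\n, or \r?

# A real CSV writer quotes and escapes those fields correctly.
csv.writer(sys.stdout).writerow(record)
```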
How often do you deal with multiple CSV files representing the multiple tables that vendors actually use internally, vs one giant flattened CSV with hundreds of columns and lots of empty cells? I don't have your level of experience with CSVs, but I've dealt with them being a mess, where the other side implements whatever they think is reasonable given the name "comma separated values".
With sqlite, we're in the Internet age and so I presume this hypothetical developer would use the sqlite library and not implement their own library from scratch for funsies. This then leads to types, database normalization, multiple tables. I hear you that too many choices can be bad, and XML is a great example of this, but sqlite isn't XML and isn't CSV.
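To make "types, normalization, multiple tables" concrete, here's a rough sketch with Python's stdlib sqlite3 module; the schema and values are invented for illustration:

```python
import sqlite3

con = sqlite3.connect("export.db")
con.executescript("""
    CREATE TABLE store (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL
    );
    CREATE TABLE item (
        id       INTEGER PRIMARY KEY,
        store_id INTEGER NOT NULL REFERENCES store(id),
        name     TEXT NOT NULL,
        price    REAL NOT NULL          -- typed, not a bare string
    );
""")
con.execute("INSERT INTO store (id, name) VALUES (1, 'Corner Deli')")
con.execute("INSERT INTO item (store_id, name, price) VALUES (1, 'Bagel', 2.50)")
con.commit()
```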
It's hard to have this discussion in the abstract, so I'll be forthcoming about where I'm coming from, which is CSV import/export between vendors for stores, think like Doordash to UberEATS. The biggest problem we have is images of the items, and how to deal with that. It's an ongoing issue how to get them, but the failure mode, which does happen, is that when a store moves vendors, they just have to redo a lot of work that they shouldn't have to.
Ultimately the North Star I want to push towards is moving beyond CSVs, because it'll let the people who currently have to hand-edit the CSV so every row imports properly not have to do that. They'd still exist, but would instead have to deal with, well, what you see with XML files, which has its shortcomings, as you mention, but at least once you understand how a vendor is using it, individual records are generally understandable.
I was moved off that work, so I don't deal with import/export currently, but it's because sqlite is so nice to work with on personal projects (where it's appropriate) that I want to push the notion of moving to sqlite over CSVs.
I use Sqlite for a static site! Generating it out to individual pages would mean millions of individual files. So instead I serve up a sqlite database over HTTP, and use a sqlite wasm driver [0] to load (database) pages as needed. Good indexing cuts down on the number of pages it grabs, and I can even get full text search!
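Roughly what the build-side indexing looks like, sketched with Python's sqlite3 module; the table names are hypothetical and FTS5 has to be compiled into the SQLite build:

```python
import sqlite3

con = sqlite3.connect("site.db")

# Hypothetical content table for the site's pages.
con.execute(
    "CREATE TABLE IF NOT EXISTS page (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
)

# Full-text index over the pages; requires an SQLite build with FTS5 enabled.
con.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS page_fts "
    "USING fts5(title, body, content='page', content_rowid='id')"
)
con.execute("INSERT INTO page_fts(rowid, title, body) SELECT id, title, body FROM page")
con.commit()

# The wasm driver can run the same kind of query client-side, fetching only
# the database pages the index points it at.
for (rowid,) in con.execute("SELECT rowid FROM page_fts WHERE page_fts MATCH 'sqlite'"):
    print(rowid)
```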
Only feature I'm missing is compression, which is complicated because popular extensions like sqlite-zstd are written in Rust.