Based on that experience I have come to one key, but maybe counter-intuitive, truth about interchange formats:
- Too much freedom is bad.
Why? Generating interchange data is cheaper than consuming it, because the creator only needs to consider the stuff they want to include, whereas the consumer needs to consider every single possible edge case and scenario the format itself can support.
This is why XML is WAY more costly to ingest than CSV, because in XML someone is going to use: attributes, CDATA, namespaces, comments, different declarations, includes, et al. In CSV they're going to use rows, a field separator, and quotes (with or without escaping). That's it. That's all it supports.
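To illustrate how small that surface is, here's a minimal sketch using Python's built-in csv module (the data is made up):

```python
import csv
import io

# The entire CSV "feature set": rows, a delimiter, and quoting.
raw = 'id,name,notes\n1,Widget,"has a, comma and ""quotes"""\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows)
# [['id', 'name', 'notes'], ['1', 'Widget', 'has a, comma and "quotes"']]
```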
Sqlite as an interchange format is a HORRIFYING suggestion, because every single feature Sqlite supports may need to be supported by consumers. Even if you curtailed Sqlite's vast feature set, you've still created something vastly more expensive to consume than XML, which itself is obnoxious.
My favorite interchange formats are, in order:
- CSV, JSON (inc. NDJSON), YAML, XML, BSON (due to type system), MessagePack, Protobuf, [Giant Gap] Sqlite, Excel (xlsx, et al)
More features mean more cost, more edge cases, more failures, more complex consumers. Keep in mind, this is ONLY about interchange formats between two parties, I have wildly different opinions about what I would use for my own application where I am only ever the creator/consumer, I actually love Sqlite for THAT.
What about Parquet and the like?
So I might do things like have every step in a pipeline begin development by reading from and writing to CSV. This helps with parallel dev work and debugging, and is easy to load into any intermediate format.
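Something like this sketch, with hypothetical file and column names, using Python's stdlib csv module:

```python
import csv

# One pipeline step: read the previous step's CSV output, derive a column,
# write the next step's CSV input. File and column names are made up.
with open("step1_out.csv", newline="") as src, \
     open("step2_in.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["total"])
    writer.writeheader()
    for row in reader:
        row["total"] = float(row["price"]) * int(row["qty"])
        writer.writerow(row)
```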
> do you use some preferred intermediate format?
This is usually dictated by speed vs money calculations, weird context issues, and familiarity. I think it's useful to look at both "why isn't this a file" and "why isn't this all in memory" perspectives.
ORC also looks good, but isn't well supported. I think Parquet is optimal for now for most analytical use cases that don't require human readability.
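For reference, a minimal sketch of the Parquet round trip, assuming pandas with a parquet engine (pyarrow or fastparquet) installed; the data is invented:

```python
import pandas as pd

# Hypothetical intermediate data from one pipeline step.
df = pd.DataFrame({"user_id": [1, 2, 3], "spend": [9.5, 0.0, 12.25]})

# Columnar, typed, compressed on disk -- but not human-readable.
df.to_parquet("spend.parquet")
roundtrip = pd.read_parquet("spend.parquet")
print(roundtrip.dtypes)  # types survive the round trip, unlike CSV
```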
What you say is fair. CSV is underspecified though: there's no company called CSV that's going to sue for trademark enforcement, and there's no official CSV standard library that everyone uses. (Some exist, but there are so many naive implementations written from first principles, because how hard could it be? Output records, use a comma and a newline (of which there are three possible options).)
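A sketch of that failure mode (made-up data):

```python
import csv
import sys

# A record with an embedded comma, quotes, and a newline -- all legal CSV.
record = ["42", 'Acme "Fresh" Foods, Inc.', "123 Main St\nSuite 5"]

# The naive from-first-principles writer: silently produces a broken row.
naive_line = ",".join(record) + "\n"   # and is that \n, \r\n, or \r?

# A real CSV writer quotes and escapes those fields correctly.
csv.writer(sys.stdout).writerow(record)
```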
How often do you deal with multiple CSV files representing the multiple tables that vendors actually use internally, vs one giant flattened CSV with hundreds of columns and lots of empty cells? I don't have your level of experience with CSVs, but I've dealt with them being a mess, where the other side implements whatever they think is reasonable given the name "comma separated values".
With sqlite, we're in the Internet age and so I presume this hypothetical developer would use the sqlite library and not implement their own library from scratch for funsies. This then leads to types, database normalization, multiple tables. I hear you that too many choices can be bad, and XML is a great example of this, but sqlite isn't XML and isn't CSV.
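To make "types, normalization, multiple tables" concrete, here's a rough sketch with Python's stdlib sqlite3 module; the schema and values are invented for illustration:

```python
import sqlite3

con = sqlite3.connect("export.db")
con.executescript("""
    CREATE TABLE store (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL
    );
    CREATE TABLE item (
        id       INTEGER PRIMARY KEY,
        store_id INTEGER NOT NULL REFERENCES store(id),
        name     TEXT NOT NULL,
        price    REAL NOT NULL          -- typed, not a bare string
    );
""")
con.execute("INSERT INTO store (id, name) VALUES (1, 'Corner Deli')")
con.execute("INSERT INTO item (store_id, name, price) VALUES (1, 'Bagel', 2.50)")
con.commit()
```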
It's hard to have this discussion in the abstract, so I'll be forthcoming about where I'm coming from, which is CSV import/export between vendors for stores, think like Doordash to UberEATS. The biggest problem we have is images of the items, and how to deal with that. It's an ongoing issue how to get them, but the failure mode, which does happen, is that when a store moves vendors, they just have to redo a lot of work that they shouldn't have to.
Ultimately the North Star I want to push towards is moving beyond CSVs, because it'll let the people who currently have to hand-edit the CSV so every row imports properly not have to do that. They'd still exist, but would instead have to deal with, well, what you see with XML files, which has its shortcomings, as you mention, but at least once you understand how a vendor is using it, individual records are generally understandable.
I was moved off that work, so I don't deal with import/export currently, but it's because sqlite is so nice to work with on personal projects (where it's appropriate) that I want to push the notion of moving to sqlite over CSVs.
I use Sqlite for a static site! Generating it out to individual pages would mean millions of individual files. So instead I serve up a sqlite database over HTTP, and use a sqlite wasm driver [0] to load (database) pages as needed. Good indexing cuts down on the number of pages it grabs, and I can even get full text search!
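Roughly what the build-side indexing looks like, sketched with Python's sqlite3 module; the table names are hypothetical and FTS5 has to be compiled into the SQLite build:

```python
import sqlite3

con = sqlite3.connect("site.db")

# Hypothetical content table for the site's pages.
con.execute(
    "CREATE TABLE IF NOT EXISTS page (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
)

# Full-text index over the pages; requires an SQLite build with FTS5 enabled.
con.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS page_fts "
    "USING fts5(title, body, content='page', content_rowid='id')"
)
con.execute("INSERT INTO page_fts(rowid, title, body) SELECT id, title, body FROM page")
con.commit()

# The wasm driver can run the same kind of query client-side, fetching only
# the database pages the index points it at.
for (rowid,) in con.execute("SELECT rowid FROM page_fts WHERE page_fts MATCH 'sqlite'"):
    print(rowid)
```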
Only feature I'm missing is compression, which is complicated because popular extensions like sqlite-zstd are written in Rust.