> Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files,
There is, actually, RFC 4180 IIRC.
> there are many flavours that are incompatible with each other in the sense that a reader for one flavour would not be suitable for reading the other and vice versa.
"There are many flavours that deviate from the spec" is a JSON problem too.
> you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess, but there are ambiguities in it that cannot be resolved by machine).
Also a problem in JSON
> Quoting, escaping, UTF-8 support are particular problem areas,
Sure, but they are no more and no less of a problem in JSON.
There aren't vast numbers of different JSON formats. There's practically one and realistically maybe two.
Headers are in each line, utf8 has never been an issue for me and quoting and escaping are well defined and obeyed.
This is because for datasets, almost exclusively, the file is machine written and rarely messed with.
CSV files have all kinds of separators and quote characters; some parsers accept multi-line fields and some don't; people sort files, which mostly works until there's a multi-line field. Add to that all kinds of line endings, encodings, and mixed encodings where people have combined files.
I tried using ASCII record separators after dealing with so many issues with commas, semicolons, pipes, tabs, etc., and still data in the wild had these jammed into random fields.
Lots of these things don't break when you hit the issue either, the parsers happily churn on with garbage data, leading to further broken datasets.
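The guessing game can be partly automated. A minimal sketch using Python's `csv.Sniffer` (the sample data is invented; the sniffer is only a heuristic and can be wrong on short or unusual inputs):

```python
import csv
import io

# Hypothetical sample: semicolon-delimited, with commas inside quoted fields.
sample = 'name;"note, with comma";age\nalice;"hello, world";30\n'

# Sniffer guesses the delimiter and quote character from a sample of the file;
# treat the result as a hint, not a guarantee.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
rows = list(csv.reader(io.StringIO(sample), dialect=dialect))
```

Even then, a file that mixes dialects mid-stream (e.g. after someone concatenated two exports) will defeat any sniffer.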
Also, they're broken for some clients (Excel) if the first characters are a capital "ID".
> Headers are in each line
This might be my old “space and network cost savings” reflex kicking in, which is a lot less necessary these days, but this feels inefficient. It also means you don't know the whole schema until you've read the whole dataset (which might span multiple files), unless some form of external schema definition is provided.
Having said that, I accept that JSON has advantages over CSV, even if all that is done is translating a data-table into an array of objects representing one row each.
> utf8 has never been an issue for me
The main problem with UTF8 isn't with CSV generally, it is usually, much like the “first column is called ID” issue, due to Excel. Unfortunately a lot of people interact with CSVs primarily with Excel, so it gets tarred with that brush by association. Unless Excel sees the BOM sequence at the start of a CSV file, which the Unicode standards recommend against for UTF8, it assumes its characters are using the Win1252 encoding (almost, but not quite, ISO-8859-1).
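As a workaround (a sketch, assuming you are targeting Excel specifically): Python's `utf-8-sig` codec prepends the BOM on write, which is the signal Excel needs. Since Unicode recommends against a UTF-8 BOM, this is an Excel-specific hack rather than general practice:

```python
import csv

# "utf-8-sig" writes the EF BB BF byte-order mark first, nudging Excel
# into decoding the file as UTF-8 instead of Win-1252.
with open("report.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "city"])
    writer.writerow(["Søren", "København"])
```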
> Csv files have all kinds of separators
I've taken to calling them Character Separated Value files, rather than Comma, for this reason.
JSONL is handy, JSON that's in the form {data: [...hundred megs of lines]} is annoying for various parsers.
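That's the appeal of JSONL: each line is an independent document, so a reader can stream records one at a time instead of materializing one giant object. A minimal sketch (sample data invented):

```python
import io
import json

# Two records, one JSON document per line (JSONL / NDJSON).
stream = io.StringIO('{"id": 1, "ok": true}\n{"id": 2, "ok": false}\n')

# Each line parses independently; a parse error poisons one record,
# not the whole file.
records = [json.loads(line) for line in stream if line.strip()]
```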
I'm quite a fan of parquet, but never expect to receive that from a client (alas).
I've dealt with incorrect CSVs numerous times, never with incorrect JSON. Of the cases where I knew what was happening on the other system, each time the CSV came from some in-house (or similar) implementation dumping SQL output (or similar) into a text file as an MVP, while the JSON was always produced with a library.
If so, that's all the more reason to love CSV as it stands guard for JSON. If CSV didn't exist, we would instead have broken JSON implementations. (JSON and XML would likely then share a similar relationship.)
One project I worked on involved a vendor promising to send us data dumps in "CSV format". When we finally received their "CSV" we had to figure out how to deal with (a) global fields being defined in special rows above the header row, and (b) a two-level hierarchy of semicolon-delimited values nested within comma-delimited columns. We had to write a custom parser to complete the import.
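A hypothetical reconstruction of that second quirk (field names and sample values invented for illustration): parse the comma level with a real CSV reader, then split the nested semicolon level by hand:

```python
import csv
import io

# Invented example of the vendor's two-level format: commas separate
# columns, semicolons separate values nested inside one column.
raw = 'alice,"red;green;blue",42\nbob,purple,7\n'

rows = []
for name, colours, score in csv.reader(io.StringIO(raw)):
    rows.append((name, colours.split(";"), int(score)))
```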
I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.
I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.
And yet, they work.
The number of people uploading Excel files to be processed, or downloading Excel files for local visualisation and processing, would floor you. It's done multiple times a day, on multiple systems, in multiple companies.
And yet, they work.
I get your argument though - a JSON array of arrays can represent everything that CSV can, and is preferable to CSV, and is what I would choose when given the choice, but the issues with using that are not going to be fewer than the issues with CSV using RFC 4180.
That is not my experience at all. I've been processing CSV files from financial institutions for many years. The likelihood of brokenness must be around 40%. It's unbelievable.
The main reason for this is not necessarily the CSV format as such. I believe the reason is that it is often the least experienced developers who are tasked with writing export code. And many inexperienced developers seem to think that they can generate CSV without using a library because the format is supposedly so simple.
JSON is better but it doesn't help with things like getting dates right. XML can help with that but it has complexities that people get wrong all the time (such as entities), so I think JSON is the best compromise.
It's massively used, but the lack of adherence to a proper spec causes huge issues. If you have two systems that happen to talk properly to each other, great, but if you are as I was an entrypoint for all kinds of user generated files it's a nightmare.
CSV is the standard, sure, but it's easy to write code that produces something that looks right at first glance but breaks on some edge case. Or someone has just chosen a different separator or quote character, so you need to try to detect those before parsing (I had a list I'd go through, then look for the most commonly appearing non-letter character).
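That detection heuristic can be sketched in a few lines (the candidate shortlist and the fallback are assumptions; real files will still defeat it sometimes):

```python
from collections import Counter

CANDIDATES = ",;\t|"  # assumed shortlist of likely separators


def guess_separator(lines):
    """Pick the candidate character that appears most often in the sample."""
    counts = Counter(ch for line in lines for ch in line if ch in CANDIDATES)
    # Fall back to a comma if no candidate appears at all.
    return counts.most_common(1)[0][0] if counts else ","
```

e.g. `guess_separator(["a;b;c", "1;2;3"])` picks the semicolon. A fancier version would reward consistency across lines rather than raw frequency, since a stray comma in one field can otherwise outvote the real separator.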
The big problem is that the resulting semantically broken CSV files often look pretty OK, both to someone scanning them and to permissive parsers. So one system reads the file in, splits something on lines, assumes missing columns are blank, and suddenly you have the wrong number of rows; then it exports the result. Worse if it's been sorted before the export.
Of course then there are the issues around the lack of types: numbers and strings are not automatically distinguishable, leading to breakage where you do want leading zeros, often not identified until later. Or auto type detection in a system breaks because it sees a lot of number-like things and assumes it's a number column. Without types there's no verification either.
So even properly formatted CSV files need a second place for metadata about what types there are in the file.
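The leading-zero failure mode is easy to demonstrate; the field is only safe while it stays text (the sample value is invented):

```python
import csv
import io

raw = "zip\n00501\n"  # a ZIP code, not a number

# Kept as text, the value round-trips intact.
as_text = list(csv.reader(io.StringIO(raw)))[1][0]

# An importer that auto-detects "numbers" silently destroys it.
as_guessed_number = str(int(as_text))
```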
JSON has some of these problems too, it lacks dates, but far fewer.
> but the issues with using that are not going to be fewer than issues with CSV using RFC 4180.
My only disagreement here is that I've had to deal with many ingest endpoints that don't properly support that.
Fundamentally, I think nobody uses CSV files because they're a good format. They're big, slow to parse, lack proper typing, lack columnar reading, lack fast jumping to a particular place, etc.
They are ubiquitous, just not good, and they're very easy to screw up in hard to identify or fix ways.
Finally, lots of this comes up because RFC4180 is only from *2005*.
Oh, and if I'm reading the spec correctly, RFC4180 doesn't support UTF8. There was a proposed update maybe in 2022 but I can't see it being accepted as an RFC.
And there are constant issues arising from that. You basically need a small team to deal with them in every institution that is processing them.
> I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.
Salesmen of enterprise systems do not care about the issues programmers and clients have. They care about what they can sell to other businessmen. That teams on both sides then waste time and money on troubleshooting is no concern to the salesman. And I am saying that as someone who worked on an enterprise system that consumed a lot of CSV. It does not work, and the process of handling those files sometimes literally involved phone calls to the admins of other systems. More often than would be sane.
> The number of people uploading excel files to be processed or downloading excel files for local visualistation and processing would floor you.
That is perfectly fine as long as it is a manager downloading data so that he can manually analyze them. It is pretty horrible when those files are then uploaded to other systems.
SAP has been by far the worst. I never managed to get data out of it that were not completely garbage and needed hand crafted parsers.
Through a lot of often-painful manual intervention. I've seen it first-hand.
If an organization really needs something to work, it's going to work somehow—or the organization wouldn't be around any more—but that is a low bar.
In a past role, I switched some internal systems from using CSV/TSV to using Parquet and the difference was amazing both in performance and stability. But hey, the CSV version worked too! It just wasted a ton of people's time and attention. The Parquet version was far better operationally, even given the fact that you had to use parquet-tools instead of just opening files in a text editor.
Independent variations I have seen:
* Trailing commas allowed or not
* Comments allowed or not
* Multiple kinds of date serialization conventions
* Divergent conventions about distinguishing floating point types from integers
* Duplicated key names tolerated or not
* Different string escaping policies, such as, but not limited to, "\n" vs "\x0a"
There are bazillions of JSON variations.
The JSON spec does not allow trailing commas, although there are JSON supersets that do.
> Comments allowed or not
The JSON spec does not allow comments, although there are JSON supersets that do.
> Multiple kinds of date serialization conventions
The JSON spec doesn't say anything about dates; that is dependent on your application schema.
> Divergent conventions about distinguishing floating point types from integers
This is largely due to divergent ways different programming languages handle numbers. I won't say JSON handles this the best, but any file format used across multiple languages will run into problems with differences in how numbers are represented. At least there is a well-defined difference between a number and a string, unlike CSV.
> Duplicated key names tolerated or not
According to the spec, they are tolerated, although the semantics of such keys is implementation defined.
> Different string escaping policies, such as, but not limited to "\n" vs "\x0a"
Both of those decode to the same character, at least per the spec (strictly, JSON's numeric escape is "\u000a"; "\x0a" isn't valid JSON at all). That is an implementation detail of the serializer, not a different language.
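Both points are easy to check against a stock parser (Python's `json` here; other conforming parsers may treat duplicates differently, which is exactly the "implementation defined" caveat above):

```python
import json

# Duplicate keys: the grammar tolerates them; Python's parser keeps the last.
dup = json.loads('{"a": 1, "a": 2}')

# Escapes: "\n" and "\u000A" are two spellings of the same code point and
# decode to the same one-character string.
short_escape = json.loads('"\\n"')
long_escape = json.loads('"\\u000A"')
```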
Typically the big difference is that some parsers are less tolerant of in-spec values: ClickHouse had a more restrictive parser, and recently I've dealt with matrix.
Maybe I've been lucky for json and unlucky for csv.
Basically, Excel uses the equivalent of ‘file’ (https://man7.org/linux/man-pages/man1/file.1.html), sees the magic “ID”, and decides it is a SYLK file, even though .csv files starting with “ID” have outnumbered .SYLK files by millions for decades.
Does any software fully follow that spec (https://www.rfc-editor.org/rfc/rfc4180)? Some requirements that I doubt are commonly followed:
- “Each record is located on a separate line, delimited by a line break (CRLF)” ⇒ editing .csv files using the typical Unix text editor is complicated.
- “Spaces are considered part of a field and should not be ignored”
- “Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes” ⇒ fields containing lone carriage returns or new lines need not be enclosed in double quotes.
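For what it's worth, Python's default `csv` dialect happens to be close to the RFC on these points (a sketch, not a conformance claim): CRLF record terminators, quoting only where a field contains the delimiter, a quote, or a line break, and embedded quotes doubled:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default dialect: CRLF terminator, minimal quoting
writer.writerow(["plain", 'has "quotes"', "has,comma", "two\nlines"])
output = buf.getvalue()
```

The one field it leaves alone is the plain one; everything risky gets double-quoted, and the record ends in CRLF as the RFC demands.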