> Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files,
There is, actually, RFC 4180 IIRC.
> there are many flavours that are incompatible with each other in the sense that a reader for one flavour would not be suitable for reading the other and vice versa.
"There are many flavours that deviate from the spec" is a JSON problem too.
> you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess, but there are ambiguities in it that cannot be resolved by machine).
Also a problem in JSON
> Quoting, escaping, UTF-8 support are particular problem areas,
Sure, but they are no more and no less of a problem in JSON.
There aren't vast numbers of different JSON formats. There's practically one and realistically maybe two.
Headers are in each line, utf8 has never been an issue for me and quoting and escaping are well defined and obeyed.
This is because for datasets, almost exclusively, the file is machine written and rarely messed with.
CSV files have all kinds of separators and quote characters; some parsers accept multi-line fields and some don't; people sort files, which mostly works until there's a multi-line field. Add to that all kinds of line endings, encodings, and mixed encodings where people have combined files.
I tried using ASCII record separators after dealing with so many issues with commas, semicolons, pipes, tabs, etc., and still data in the wild had these jammed into random fields.
Lots of these things don't break when you hit the issue either, the parsers happily churn on with garbage data, leading to further broken datasets.
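The guessing game can be partly automated. A minimal sketch using Python's `csv.Sniffer` (the sample data is invented; the sniffer is only a heuristic and can be wrong on short or unusual inputs):

```python
import csv
import io

# Hypothetical sample: semicolon-delimited, with commas inside quoted fields.
sample = 'name;"note, with comma";age\nalice;"hello, world";30\n'

# Sniffer guesses the delimiter and quote character from a sample of the file;
# treat the result as a hint, not a guarantee.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
rows = list(csv.reader(io.StringIO(sample), dialect=dialect))
```

Even then, a file that mixes dialects mid-stream (e.g. after someone concatenated two exports) will defeat any sniffer.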
Also, they're broken for some clients (Excel) if the first characters are a capital "ID".
> Headers are in each line
This might be my old “space and network cost savings” reflex kicking in, which is a lot less necessary these days, but this feels inefficient. It also means you don't know the whole schema until you've read the whole dataset (which might span multiple files), unless some form of external schema definition is provided.
Having said that, I accept that JSON has advantages over CSV, even if all that is done is translating a data-table into an array of objects representing one row each.
> utf8 has never been an issue for me
The main problem with UTF8 isn't with CSV generally, it is usually, much like the “first column is called ID” issue, due to Excel. Unfortunately a lot of people interact with CSVs primarily with Excel, so it gets tarred with that brush by association. Unless Excel sees the BOM sequence at the start of a CSV file, which the Unicode standards recommend against for UTF8, it assumes its characters are using the Win1252 encoding (almost, but not quite, ISO-8859-1).
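As a workaround (a sketch, assuming you are targeting Excel specifically): Python's `utf-8-sig` codec prepends the BOM on write, which is the signal Excel needs. Since Unicode recommends against a UTF-8 BOM, this is an Excel-specific hack rather than general practice:

```python
import csv

# "utf-8-sig" writes the EF BB BF byte-order mark first, nudging Excel
# into decoding the file as UTF-8 instead of Win-1252.
with open("report.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "city"])
    writer.writerow(["Søren", "København"])
```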
> Csv files have all kinds of separators
I've taken to calling them Character Separated Value files, rather than Comma, for this reason.
JSONL is handy, JSON that's in the form {data: [...hundred megs of lines]} is annoying for various parsers.
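That's the appeal of JSONL: each line is an independent document, so a reader can stream records one at a time instead of materializing one giant object. A minimal sketch (sample data invented):

```python
import io
import json

# Two records, one JSON document per line (JSONL / NDJSON).
stream = io.StringIO('{"id": 1, "ok": true}\n{"id": 2, "ok": false}\n')

# Each line parses independently; a parse error poisons one record,
# not the whole file.
records = [json.loads(line) for line in stream if line.strip()]
```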
I'm quite a fan of parquet, but never expect to receive that from a client (alas).
I've dealt with incorrect CSVs numerous times, never with incorrect JSON. Of the cases where I knew what was happening on the other system, each time the CSV came from some in-house (or similar) implementation dumping SQL output (or similar) into a text file as an MVP, while the JSON was always produced with a library.
If so, that's all the more reason to love CSV as it stands guard for JSON. If CSV didn't exist, we would instead have broken JSON implementations. (JSON and XML would likely then share a similar relationship.)
One project I worked on involved a vendor promising to send us data dumps in "CSV format". When we finally received their "CSV" we had to figure out how to deal with (a) global fields being defined in special rows above the header row, and (b) a two-level hierarchy of semicolon-delimited values nested within comma-delimited columns. We had to write a custom parser to complete the import.
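A hypothetical reconstruction of that second quirk (field names and sample values invented for illustration): parse the comma level with a real CSV reader, then split the nested semicolon level by hand:

```python
import csv
import io

# Invented example of the vendor's two-level format: commas separate
# columns, semicolons separate values nested inside one column.
raw = 'alice,"red;green;blue",42\nbob,purple,7\n'

rows = []
for name, colours, score in csv.reader(io.StringIO(raw)):
    rows.append((name, colours.split(";"), int(score)))
```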
I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.
I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.
And yet, they work.
The number of people uploading Excel files to be processed, or downloading Excel files for local visualisation and processing, would floor you. It's done multiple times a day, on multiple systems, in multiple companies.
And yet, they work.
I get your argument though - a JSON array of arrays can represent everything that CSV can, and is preferable to CSV, and is what I would choose when given the choice, but the issues with using that are not going to be fewer than the issues with CSV using RFC 4180.
That is not my experience at all. I've been processing CSV files from financial institutions for many years. The likelihood of brokenness must be around 40%. It's unbelievable.
The main reason for this is not necessarily the CSV format as such. I believe the reason is that it is often the least experienced developers who are tasked with writing export code. And many inexperienced developers seem to think that they can generate CSV without using a library because the format is supposedly so simple.
JSON is better but it doesn't help with things like getting dates right. XML can help with that but it has complexities that people get wrong all the time (such as entities), so I think JSON is the best compromise.
It's massively used, but the lack of adherence to a proper spec causes huge issues. If you have two systems that happen to talk properly to each other, great, but if you are as I was an entrypoint for all kinds of user generated files it's a nightmare.
CSV is the standard, sure, but it's easy to write code that produces something that looks right at first glance but breaks on some edge case. Or someone has just chosen a different separator or quote character, so you need to try to detect those before parsing (I had a list I'd go through, then look for the most commonly appearing non-letter character).
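That detection heuristic can be sketched in a few lines (the candidate shortlist and the fallback are assumptions; real files will still defeat it sometimes):

```python
from collections import Counter

CANDIDATES = ",;\t|"  # assumed shortlist of likely separators


def guess_separator(lines):
    """Pick the candidate character that appears most often in the sample."""
    counts = Counter(ch for line in lines for ch in line if ch in CANDIDATES)
    # Fall back to a comma if no candidate appears at all.
    return counts.most_common(1)[0][0] if counts else ","
```

e.g. `guess_separator(["a;b;c", "1;2;3"])` picks the semicolon. A fancier version would reward consistency across lines rather than raw frequency, since a stray comma in one field can otherwise outvote the real separator.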
The big problem is that the resulting semantically broken CSV files often look pretty OK, both to someone scanning them and to permissive parsers. So one system reads the file in, splits something on lines, assumes missing columns are blank, and suddenly you have the wrong number of rows; then it exports the result. Worse if it's been sorted before the export.
Of course then there are the issues around the lack of types: numbers and strings are not automatically distinguishable, leading to breakage where you do want leading zeros, often not identified until later. Or auto type detection in a system breaks because it sees a lot of number-like things and assumes it's a number column. Without types there's no verification either.
So even properly formatted CSV files need a second place for metadata about what types there are in the file.
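The leading-zero failure mode is easy to demonstrate; the field is only safe while it stays text (the sample value is invented):

```python
import csv
import io

raw = "zip\n00501\n"  # a ZIP code, not a number

# Kept as text, the value round-trips intact.
as_text = list(csv.reader(io.StringIO(raw)))[1][0]

# An importer that auto-detects "numbers" silently destroys it.
as_guessed_number = str(int(as_text))
```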
JSON has some of these problems too, it lacks dates, but far fewer.
> but the issues with using that are not going to be fewer than issues with CSV using RFC 4180.
My only disagreement here is that I've had to deal with many ingest endpoints that don't properly support that.
Fundamentally, I think nobody uses CSV files because they're a good format. They're big, slow to parse, lack proper typing, lack columnar reading, lack fast jumping to a particular place, etc.
They are ubiquitous, just not good, and they're very easy to screw up in hard to identify or fix ways.
Finally, lots of this comes up because RFC4180 is only from *2005*.
Oh, and if I'm reading the spec correctly, RFC4180 doesn't support UTF8. There was a proposed update maybe in 2022 but I can't see it being accepted as an RFC.
And there are constant issues arising from that. You basically need a small team to deal with them in every institution that is processing them.
> I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.
Salesmen of enterprise systems do not care about the issues programmers and clients have. They care about what they can sell to other businessmen. That teams on both sides then waste time and money on troubleshooting is no concern to the salesman. And I am saying that as someone who worked on an enterprise system that consumed a lot of CSV. It does not work, and the process of handling those files sometimes literally involved phone calls to the admins of other systems. More often than would be sane.
> The number of people uploading excel files to be processed or downloading excel files for local visualistation and processing would floor you.
That is perfectly fine as long as it is a manager downloading data so that he can manually analyze them. It is pretty horrible when those files are then uploaded to other systems.
SAP has been by far the worst. I never managed to get data out of it that were not completely garbage and needed hand crafted parsers.
Through a lot of often-painful manual intervention. I've seen it first-hand.
If an organization really needs something to work, it's going to work somehow—or the organization wouldn't be around any more—but that is a low bar.
In a past role, I switched some internal systems from using CSV/TSV to using Parquet and the difference was amazing both in performance and stability. But hey, the CSV version worked too! It just wasted a ton of people's time and attention. The Parquet version was far better operationally, even given the fact that you had to use parquet-tools instead of just opening files in a text editor.
Independent variations I have seen:
* Trailing commas allowed or not
* Comments allowed or not
* Multiple kinds of date serialization conventions
* Divergent conventions about distinguishing floating point types from integers
* Duplicated key names tolerated or not
* Different string escaping policies, such as, but not limited to, "\n" vs "\x0a"
There are bazillions of JSON variations.
The JSON spec does not allow trailing commas, although there are JSON supersets that do.
> Comments allowed or not
The JSON spec does not allow comments, although there are JSON supersets that do.
> Multiple kinds of date serialization conventions
The JSON spec doesn't say anything about dates; that is dependent on your application schema.
> Divergent conventions about distinguishing floating point types from integers
This is largely due to divergent ways different programming languages handle numbers. I won't say JSON handles this the best, but any file format used across multiple languages will run into problems with differences in how numbers are represented. At least there is a well-defined difference between a number and a string, unlike CSV.
> Duplicated key names tolerated or not
According to the spec, they are tolerated, although the semantics of such keys is implementation defined.
> Different string escaping policies, such as, but not limited to "\n" vs "\x0a"
Both of those decode to the same character, at least per the spec (strictly, JSON's numeric escape is "\u000a"; "\x0a" isn't valid JSON at all). That is an implementation detail of the serializer, not a different language.
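Both points are easy to check against a stock parser (Python's `json` here; other conforming parsers may treat duplicates differently, which is exactly the "implementation defined" caveat above):

```python
import json

# Duplicate keys: the grammar tolerates them; Python's parser keeps the last.
dup = json.loads('{"a": 1, "a": 2}')

# Escapes: "\n" and "\u000A" are two spellings of the same code point and
# decode to the same one-character string.
short_escape = json.loads('"\\n"')
long_escape = json.loads('"\\u000A"')
```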
Typically the big difference is that some parsers are less tolerant of in-spec values: ClickHouse had a more restrictive parser, and recently I've dealt with matrix.
Maybe I've been lucky for json and unlucky for csv.
Basically, Excel uses the equivalent of ‘file’ (https://man7.org/linux/man-pages/man1/file.1.html), sees the magic “ID”, and decides it is a SYLK file, even though .csv files starting with “ID” have outnumbered .SYLK files by millions for decades.
Does any software fully follow that spec (https://www.rfc-editor.org/rfc/rfc4180)? Some requirements that I doubt are commonly followed:
- “Each record is located on a separate line, delimited by a line break (CRLF)” ⇒ editing .csv files using the typical Unix text editor is complicated.
- “Spaces are considered part of a field and should not be ignored”
- “Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes” ⇒ fields containing lone carriage returns or new lines need not be enclosed in double quotes.
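For what it's worth, Python's default `csv` dialect happens to be close to the RFC on these points (a sketch, not a conformance claim): CRLF record terminators, quoting only where a field contains the delimiter, a quote, or a line break, and embedded quotes doubled:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default dialect: CRLF terminator, minimal quoting
writer.writerow(["plain", 'has "quotes"', "has,comma", "two\nlines"])
output = buf.getvalue()
```

The one field it leaves alone is the plain one; everything risky gets double-quoted, and the record ends in CRLF as the RFC demands.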