During the '90s I was anal about using them, pissing off my teammates and users by forcing them to use these 'standard-compliant' files.
Had to give up.
The .usv separators make things easier to read at the expense of a bit more space.
The main point for me is that it makes parsing so much simpler.
Who writes .csv files by hand anyway?
The easiest example is geo: I need 20 states listed as US-CO, US-CA, etc., but one tool exported them as US CO.
- To escape the delimiter, we should enclose the value with double quotes. Ok, makes sense.
- To escape double quotes within the enclosing double quotes, we need to use 2 double quotes.
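Those two rules are exactly what Python's stdlib csv module applies by default; a quick round-trip sketch:

```python
import csv
import io

# A value containing the delimiter is enclosed in double quotes,
# and a literal double quote inside it is doubled.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(['plain', 'has,comma', 'has "quotes"'])
print(buf.getvalue())  # → plain,"has,comma","has ""quotes"""

# Reading applies the same two rules in reverse:
row = next(csv.reader(io.StringIO(buf.getvalue())))
assert row == ['plain', 'has,comma', 'has "quotes"']
```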
Many tools get it wrong. Meanwhile some tools like pgAdmin, justifiably, allow you to configure the escaping character to be a double quote or a single quote, because the CSV standard is often not respected.
Anyway, if you are looking for a desktop app for querying CSVs using SQL, I'd love to recommend my app: https://superintendent.app (offline app) -- it's more convenient than using command-line and much better for managing a lot of CSVs and queries.
They're not getting it wrong, they're just assuming a different variant.
There is no "standard" for CSV. Yes, there's an RFC, published in 2005, about 30 years after everyone was already using CSV. That's too late. You can't expect people to drop all compatibility just because someone published some document somewhere. RFC 4180 explicitly says that "it does not specify an Internet standard of any kind", although many people do take it as a "standard". But even if it did call itself a standard: it's still just some document someone published somewhere.
They should have just created a new "Comma Separated Data" (file.csd) standard or something instead of trying to retroactively redefine something that already exists. Then applications could add that as a new option, rather than "CSV, but different from what we already support". That was always going to be an uphill battle.
Never mind that RFC 4180 is simply insufficient: it doesn't specify the character encoding in the file itself, nor some other things such as alternative delimiters. If someone were to write a decent standard and market it a bit, then I could totally see this taking off, just as TOML "standardized INI files" took off.
Why? We have xlsx for the office crowd and arrow for the HPC crowd. In no universe does anyone actually have to invent another tabular data format using delimiters.
Charge me more upfront for a perpetual license, or just version the software. Say $40 today for V3, and every year charge a reasonable fee to upgrade, but allow me to keep using the software I purchased...
I've been thinking about pricing, and a lot of people did complain about it. However, many people expense their software cost, so they don't mind the yearly subscription.
I'm improving the pricing right now and a perpetual license is what I'm going with.
Looks like SQL is the main selling point for your tool. For other simpler needs, Modern CSV [1] seems suitable (and it’s cheaper too, with a one time purchase compared to a yearly subscription fee). But Modern CSV does not support SQL or other ways to create complex queries.
Works for SQLite at least, but not sure about other software.
Yes, those are potentially infinite, but a core set would be useful. As ambiguities come up, publish an addendum for clarification, and eventually, as the exceptions accumulate, a version step.
I don't understand how anyone can write a spec without concrete examples of pass/fail in their head. Perhaps there could be an informal example/counterexample syntax for those writing RFCs, which could be extracted into the 1.0 test suite.
The test suite must be a single open source repo, that accumulates acceptable edge cases until the relevant informed adults can make a call about revising the spec.
There has to be one approved, sanctioned, well-known and monitored test suite repo. It cannot be shrugged off into a free-for-all that makes it impossible to find a single canonical test suite. The interwebs are big and conflicted.
See Imre Lakatos 'Proofs and Refutations' for how this evolves.
Italic headings? Bold links? Nested lists - how many levels? Code in list? How do paragraphs interact with lists? There are many opinions and many leaky implementations of those opinions. Newlines? Embedding HTML in Markdown !?!?
It all seems so sad, because (X)HTML nailed most of these issues a very long time ago. But HTML implementations were sloppy from the outset. And XML was born with inherited bloat, then got ever more complex over time (modular specs, XLink, XPath, XSLT, DTD -> XML Schema, ...)
With Markdown, it is relatively easy to introduce some recursion into the parser, but for what spec? In what contextual cases? At what cost?
It is possible to just treat commas as whitespace. It makes implementation so much easier. It accepts missing, trailing and repeated commas. It makes elements uniform. It ignores many common errors that arise from typos or cut'n'paste. It makes JSON writers simpler, by removing the first/last special case.
A JSON parser that treats commas as whitespace can be two dozen lines in most programming languages - if you do not want line/column, chapter and verse, for the remaining error messages.
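As a sketch of how little it takes, here is a toy JSON-subset parser in Python that treats commas as whitespace. It is purely illustrative: no \uXXXX escapes, no error locations, and every function name in it is made up for this example.

```python
def parse(text):
    """Toy JSON-ish parser that treats commas as whitespace.
    Handles objects, arrays, strings (only \" and \\ escapes),
    numbers, true/false/null. A sketch, not a validator."""
    pos = 0

    def skip():
        # Commas are skipped exactly like spaces and newlines.
        nonlocal pos
        while pos < len(text) and (text[pos].isspace() or text[pos] == ','):
            pos += 1

    def value():
        nonlocal pos
        skip()
        c = text[pos]
        if c == '{':
            pos += 1
            obj = {}
            skip()
            while text[pos] != '}':
                key = value()
                skip()
                assert text[pos] == ':'
                pos += 1
                obj[key] = value()
                skip()
            pos += 1
            return obj
        if c == '[':
            pos += 1
            arr = []
            skip()
            while text[pos] != ']':
                arr.append(value())
                skip()
            pos += 1
            return arr
        if c == '"':
            pos += 1
            out = []
            while text[pos] != '"':
                if text[pos] == '\\':
                    pos += 1
                out.append(text[pos])
                pos += 1
            pos += 1
            return ''.join(out)
        for lit, val in (('true', True), ('false', False), ('null', None)):
            if text.startswith(lit, pos):
                pos += len(lit)
                return val
        start = pos
        while pos < len(text) and text[pos] in '-+.eE0123456789':
            pos += 1
        num = text[start:pos]
        return float(num) if any(ch in num for ch in '.eE') else int(num)

    return value()

# Missing, trailing and repeated commas are all accepted:
print(parse('[1,,2,3,]'))        # → [1, 2, 3]
print(parse('{"a": 1 "b": 2}'))  # → {'a': 1, 'b': 2}
```

Note how the first/last special case disappears: array and object loops simply run until the closing bracket, with skip() eating any commas in between.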
DuckDB has this problem when parallel processing of CSV is enabled.
Understandably though, because they want to process many lines in parallel.
Of course the cat emoji is escaped by the puppy emoji if it occurs in a value. The puppy emoji escapes itself when needed.
I also considered a dedicated keyboard like apl just to be dense about it.
Have each character signed by the keyboard so that we have proof of by whom it was typed and when.
People who don't work here don't get to write code. It just won't happen. haha
APL got pretty close.
TSV doesn’t have this problem. It can represent any string that doesn’t have either a tab or a newline, which is many more than CSV can.
I ended up saving my mental health by supporting two different formats: "RFC CSV" and "Excel CSV". In Excel you can, for example, use a sep=# hint at the beginning of the file to get the delimiter to work consistently. The sep annotation obviously breaks parsing for every other CSV parser, but that's why there is the other format.
Also there might be other reasons to mess with the file to get it to open correctly in Excel, like date formats, or adding a BOM to get it recognized as UTF-8, etc. (Not quite sure whether the BOM case was Excel or some other software we used to work with.)
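For the Excel flavor, the sep= hint can be stripped before handing the rest to a normal parser. A sketch (file contents made up):

```python
import csv
import io

# Excel-only hint: a first line "sep=;" tells Excel to use ';' as the delimiter.
# Other parsers choke on it, so detect and strip it first:
text = "sep=;\na;b\n1;2\n"
lines = text.splitlines()
if lines[0].lower().startswith("sep="):
    delim = lines[0][4:][:1] or ","
    lines = lines[1:]
else:
    delim = ","
rows = list(csv.reader(io.StringIO("\n".join(lines)), delimiter=delim))
print(rows)  # → [['a', 'b'], ['1', '2']]
```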
https://support.microsoft.com/en-us/office/import-or-export-...
https://support.microsoft.com/en-us/office/text-import-wizar...
"Delimiters Select the character that separates values in your text file. If the character is not listed, select the Other check box, and then type the character in the box that contains the cursor."
Maybe they should know their tools better instead of plain double-clicking and hoping for the best.
For example the web version doesn't have a dark mode. Google Sheets and Docs these days are more useful and feature-rich than Excel.
If you go with the CSV convention of two adjacent tabs => a blank cell in the middle, then rows of different lengths will not line up properly in most text editors. And "different length" depends on the client's tab width too.
If you allow any amount of tabs between columns, then you need a special way to signify an actually-blank column. And escaping for when you want to quote that.
If you say "use tabs for columns and spaces for alignment", then you've got to trim all values, which may not be desirable.
In data exchange nobody ever allows multiple tabs between columns. If there are multiple tabs with nothing in between it means the column is empty for that row.
Just like with CSV, TSV, is always a pain to edit manually so the issues there are the same. Using tabs does have a lower likelihood of conflicting with the actual data.
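The empty-column convention from data exchange falls straight out of a plain split:

```python
# Two adjacent tabs mean an empty value for the middle column:
line = "alice\t\t42"
print(line.split("\t"))  # → ['alice', '', '42']
```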
It seems like half the problems with CSV were solved back in the 70s with ASCII codes.
However, a good reason to use TSV/CSV is that import/export in spreadsheets is really easy. TSV used to have an obscure advantage: Google Sheets could export that but not CSV. They've since fixed that and you can do both now.
And of course, getting CSV out of a database is straightforward as well. Both databases and spreadsheets are of course tabular data; so the format is a good fit for that.
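For instance, dumping a query result to CSV takes a few lines with Python's stdlib (table and column names here are made up):

```python
import csv
import io
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT, qty INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", [("widget", 3), ("gad,get", 7)])

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
cur = con.execute("SELECT * FROM t")
writer.writerow([d[0] for d in cur.description])  # header row from the cursor
writer.writerows(cur)                             # remaining rows
print(buf.getvalue())
```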
Spreadsheets are nice when you are dealing with non-technical people. They make it easier to involve them in editing / managing content. Also, a spreadsheet is a great substitute for admin tools to edit this data. I was once on a project where we paid some poor freelancer to work on a convoluted tool to edit data. In the end, the customer hated it and we unceremoniously replaced it with a spreadsheet (my suggestion). Much easier to edit stuff with those. They loved it. The poor guy had worked for months on that tool with the help of a lot of misguided UX, design and product management. It got super complicated and it was tedious to use. A complete waste of time. All they needed was a simple spreadsheet and some way to get the data inside it deployed. They already knew how to use spreadsheets, so they were all over that.
Nobody on this planet wants to use e.g. LibreOffice to import your CSV file and save it as xlsx so they can open it in Excel.
Personally, since I've discovered the field/group/record/file separator characters in ASCII, I've been using them to concat fields and rows on one-to-many SQL joins. They work great for that purpose since (at least on all the projects I've done this with so far) I can be certain that none of the values in the joined data will have those characters, so no further escaping is necessary. For example, in MySQL:
SELECT
i.item_id,
GROUP_CONCAT(CONCAT_WS(0x1F, f.field_id, f.field_value) SEPARATOR 0x1E) AS field_values
FROM items i
LEFT JOIN fields f ON f.item_id = i.item_id
WHERE ...
Then split field_values with 0x1E to get each field ID and field value pair, and split each of those on 0x1F. Easy as pie.

There is still plenty of this kind of data exchange happening, and CSV is perfectly fine for it.
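The splitting step in Python, with a hypothetical field_values string shaped like the GROUP_CONCAT output above:

```python
# Two (field_id, field_value) pairs joined with the ASCII unit (0x1F)
# and record (0x1E) separators, as the query would produce them:
field_values = "1\x1fred\x1e2\x1flarge"

# Split records on 0x1E, then each record into its ID/value pair on 0x1F:
pairs = [record.split("\x1f") for record in field_values.split("\x1e")]
print(pairs)  # → [['1', 'red'], ['2', 'large']]
```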
If I'm consuming data produced by some giant tech company or mega bank or whatever, there is no chance I'll be able to get them to fix some issue I have processing it. From these kind of folks, I'd like something other than CSV.
Only once have I seen a bad .csv from a "big" company--big fish in a small pond type big. We were looking to get data out and, hey, great, .csv is a valid export format. I'm not sure exactly what was in that file, but it appeared to be the printout with some field info attached to each field. (Put this at that location on the paper, etc., one field per line.) Every output format it had was bugged in some scenario.
Tab makes far more sense here, because you are very likely able to just convert non-delimiter tabs to spaces without losing semantics.
Even considering how editors tend to mess with the tab character, there are still better choices based on frequency in typical text: |, ~, or even ;.
All IMHO, again.
I made ScrollSets, a language that compiles to CSVs! (https://scroll.pub/blog/scrollsets.html)
Here's a simple tool to turn your CSV into ScrollSet (https://scroll.pub/blog/csvToScrollSet.html)
This is what powers the CSV download on PLDB.io and how so many people collaborate on building a single CSV (https://pldb.io/csv.html)
I actually just finished a library to add proper typed parsing that works with existing CSV files. It's designed to be as compatible as possible with existing spreadsheets, while allowing for perfect escaping and infinite nesting of complex data structures and strings. I think it's an ideal compromise, as most CSV files won't change at all.
I'm not bitter, I just hate working with ETL 'teams' that struggle to output the data in a specified format - even when you specify it in the way they want you to.
it'll only remain king as long as we let it.
move to using Sqlite db files as your interchange format
I help clients deal with them frequently. For many cases they are sufficient, for other cases moving to something like parquet makes a lot of sense.
It's just much easier to keep using it, since you're already doing it.
In the meantime, how about XML? /awaits the pack of raving mad HNers
echo foo | jq -rR 'split("") | @csv'

Nothing prevents you using ndjson where you define a header and then have an array per line.
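A sketch of that ndjson-with-header convention (data made up):

```python
import json

# Hypothetical ndjson table: first line is the header,
# each subsequent line is one row as a JSON array.
lines = [
    '["name", "qty"]',
    '["widget", 3]',
    '["gad,get", 7]',
]
header = json.loads(lines[0])
rows = [dict(zip(header, json.loads(line))) for line in lines[1:]]
print(rows)  # → [{'name': 'widget', 'qty': 3}, {'name': 'gad,get', 'qty': 7}]
```

Commas and quotes inside values need no special treatment here, since JSON's own string rules apply on each line.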
CSV is explicitly about tabular data. JSON (including JSON5) is much more flexible. Flexibility can be great but also can be annoying. If you want tabular data, then a system that enables nesting isn't great.
What we need is,
- A standard (yeah, link xkcd 927; it's mentioned enough that I can recall its ID) to be announced **after** the rest of the pieces are ready.
- Libraries to work with it in major languages. One in Rust + wrappers in common languages might get good traction these days. Having support for dataframe libraries right away might be necessary too.
- Good tooling. I'm guessing one of the reasons CSV took off is that regular unix tools are able to deal with CSVs mostly fine (there are edge cases with field delimiters/commas, but it's not that bad).
The new format would ideally have types, the files would be sharded and have metadata to quickly scan them, and the tooling should be able to make simple joins, ideally automatically based on the metadata, since most of the time there's a single reasonable way to join tables.

This seems like too much work to get right from the very beginning, so maybe building on top of Apache Arrow might help reduce the solution space.
The only time people get in trouble with CSV is when they skip using those tools, hack something together, and then get it wrong.
> The new format would ideally have types, the files would be sharded and have metadata to quickly scan them
There's no need for new stuff. It would be redundant as there are several things already that do these things. Adding more isn't helpful. The problem is most of the stuff that supports CSV tends to support none of those things and fixing a lot of ancient systems to retrofit them with e.g. parquet support or whatever is a mission impossible. CSVs principle feature is that it is simply everywhere. That's hard to replicate. People have been trying for decades.
Parquet fits the bill here. It's not perfect (there is no perfect file format), but it's a practical compromise as of today, at least for new systems where a columnar format is appropriate. There are some columnar formats that are better in some aspects (like ORC and some proprietary formats) but they're not as widely supported.
It's not that CSV/TSV is bad in every situation, but more that CSV/TSV is overused for things it shouldn't be used for. (CSV is good as a tabular format for simple applications, very bad as the storage format for data lakes or anything you want to query, questionable as a data exchange format, okay as a semi-structured format for structurally simple data -- many open data platforms offer it as a download format and it generally works).
To get a sense of how much variation a CSV reader needs to handle, we can take a look at the number of arguments in Pandas' read_csv. And it still fails on some CSVs! (I've had to preprocess CSVs before pd.read_csv would work.)
https://pandas.pydata.org/pandas-docs/stable/reference/api/p...
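Pandas aside, even Python's stdlib csv module needs per-dialect knobs for the delimiter and quote character. A small sketch with made-up data, showing the same logical table in two real-world dialects:

```python
import csv
import io

# Excel-style: comma-delimited, double-quoted.
excel_style = 'name,qty\n"gad,get",7\n'
# Common European export style: semicolon-delimited, here single-quoted.
euro_style = "name;qty\n'gad,get';7\n"

rows_a = list(csv.reader(io.StringIO(excel_style)))
rows_b = list(csv.reader(io.StringIO(euro_style), delimiter=";", quotechar="'"))
print(rows_a)  # → [['name', 'qty'], ['gad,get', '7']]
assert rows_a == rows_b
```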
CSV is not king, but it is popular. But popularity doesn't mean it's good for every use case. Optimizing for human readability and easy generation means trading off other very important characteristics (type safety, legibility across different tooling, random access performance, reliability/consistency).
You can't do anything about legacy systems, but when designing a new system, you should really ask yourself: is CSV really the right choice?
(With DuckDB, the answer for me is increasingly no)
burntsushi is nine years ahead of you: https://crates.io/crates/csv
Also, what I have in mind for file sharding needs maybe a standard on top of a record/column file format. The successor to CSV should be easy to process in parallel.
Having so many formats is confusing, inefficient and leads to data loss. This article is right, CSV is king simply because it's essentially the lowest common denominator and I, like most of us, use it for that reason—at least that's so for data that can be stored in database type formats.
But take other data such as images, sound and AVI, and even text. There are dozens of sound, image and other formats. It's all a first-class mess.
For example, we fall back to the antiquated, horrible JPG format because we can't agree on better ones such as, say, JPEG 2000, there always being excuses why we can't: speed, data size, inefficient algorithms, etc.
Take word processing, for instance: why is it so hard to convert Microsoft's confounded, nasty DOC format to, say, the open-document ODT format without errors? It's almost impossible to get the layout in one format converted accurately into another. Similarly, information is lost converting from lossless TIF to, say, JPG, or from WAV to MP3, etc. What's worse is that so few seem to care about such things.
Every time a conversion is done between lossless formats and lossy ones entropy increases. That's not to say that shouldn't happen it's just that in isolation one has little or no idea about the quality of the original material. Even with ever increasing speeds, more and more storage space so many still have an obsession—in fact a fetish—of compressing data into smaller and smaller sizes using lossy formats with little regard for what's actually lost.
It's not only in sound and image formats where data integrity suffers for convenience; take the case of converting data fields from one format to another. How often has one experienced the situation where a field is truncated during conversion—where, say, 128 characters suddenly become 64 or so after conversion and there's no indication from the converter that data has actually been truncated? Many times, I'd suggest.
Another instance, is where fields in the original data don't exist in the converted format. For example, data is often lost from one's phone contacts when converted from an old phone to a new one because the new phone doesn't accommodate all the fields of the old one.
Programmers really have a damn hide for not only allowing this to occur but for not even warning the poor hapless user that some of his/her data has been lost.
That programmers have so little regard and consideration for data integrity is, I reckon, a terrible situation and a blight on the whole IT industry.
Why doesn't computer science take these issues more seriously?
Simple: cost. A company is not going to approve any project to move to a new standard. Plus you have new hires coming in with their favorite "Standard of the Day" and starting to use that standard no matter what they are told.
Management only cares about the end result (i.e., the bottom line), not how it got there.
That lack of consideration for users' data will ultimately lead to regulation. Much of a user's data is only machine-readable, so ordinary users shouldn't be expected to know when their data has been truncated after, say, a data conversion. They can't be expected to realize their data is corrupted long after the event and past the point where it can be corrected.
It's like everything else, originally there's the Wild West days when everything's a free-for-all, but regulations eventually kick in after the harm done is considered unacceptable. We've seen regulations introduced everywhere else, from foods—pure food acts, pharmaceutical—FDA, transport—NTSB, Water purity standards and so on. So eventually computing/IT will be no exception.
Unfortunately, computing/IT is still in the 'Wild West' days. Personally, I can hardly wait for those enforced regulations to become effective.