During the '90s I was anal about using them, pissing off my teammates and users by forcing them to use these 'standard-compliant' files.
Had to give up.
The .usv separators make things easier to read at the expense of a bit more space.
The main point for me is that it makes parsing so much simpler.
Who writes .csv files by hand anyway?
The easiest example is geo: I need 20 states listed as US-CO, US-CA, etc., but one tool exported them as US CO.
- To escape the delimiter, we should enclose the value with double quotes. Ok, makes sense.
- To escape double quotes within the enclosing double quotes, we need to use 2 double quotes.
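Those two rules are exactly what Python's stdlib csv module applies by default; a quick round-trip sketch:

```python
import csv
import io

# A value containing the delimiter is enclosed in double quotes,
# and a literal double quote inside it is doubled.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(['plain', 'has,comma', 'has "quotes"'])
print(buf.getvalue())  # → plain,"has,comma","has ""quotes"""

# Reading applies the same two rules in reverse:
row = next(csv.reader(io.StringIO(buf.getvalue())))
assert row == ['plain', 'has,comma', 'has "quotes"']
```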
Many tools get it wrong. Meanwhile some tools like pgAdmin, justifiably, allow you to configure the escaping character to be a double quote or a single quote, because the CSV standard is often not respected.
Anyway, if you are looking for a desktop app for querying CSVs using SQL, I'd love to recommend my app: https://superintendent.app (offline app) -- it's more convenient than using command-line and much better for managing a lot of CSVs and queries.
They're not getting it wrong, they're just assuming a different variant.
There is no "standard" for CSV. Yes, there's an RFC, published in 2005, about 30 years after everyone was already using CSV. That's too late. You can't expect people to drop all compatibility just because someone published some document somewhere. RFC 4180 explicitly says that "it does not specify an Internet standard of any kind", although many people do take it as a "standard". But even if it did call itself a standard: it's still just some document someone published somewhere.
They should have just created a new "Comma Separated Data" (file.csd) standard or something instead of trying to retroactively redefine something that already exists. Then applications could add that as a new option, rather than "CSV, but different from what we already support". That was always going to be an uphill battle.
Never mind that RFC 4180 is simply insufficient: it doesn't specify the character encoding in the file itself, nor some other things such as alternative delimiters. If someone were to write a decent standard and market it a bit, then I could totally see this taking off, just as TOML "standardized INI files" took off.
Why? We have xlsx for the office crowd and arrow for the HPC crowd. In no universe does anyone actually have to invent another tabular data format using delimiters.
Charge me more upfront for a perpetual license, or just version the software. Say $40 today for V3, and every year charge a reasonable fee to upgrade, but allow me to keep using the software I purchased...
I've been thinking about pricing, and a lot of people did complain about it. However, many people expense their software cost, so they don't mind the yearly subscription.
I'm improving the pricing right now and a perpetual license is what I'm going with.
Looks like SQL is the main selling point for your tool. For other simpler needs, Modern CSV [1] seems suitable (and it’s cheaper too, with a one time purchase compared to a yearly subscription fee). But Modern CSV does not support SQL or other ways to create complex queries.
Works for SQLite at least, but not sure about other software.
Yes, those are potentially infinite, but a core set would be useful. As ambiguities come up, publish an addendum for clarification, and eventually, as the exceptions accumulate, a version step.
I don't understand how anyone can write a spec without concrete examples of pass/fail in their head. Perhaps there could be an informal example/counterexample syntax for those writing RFCs, which could be extracted into the 1.0 test suite.
The test suite must be a single open source repo, that accumulates acceptable edge cases until the relevant informed adults can make a call about revising the spec.
There has to be one approved, sanctioned, well-known and monitored test suite repo. It cannot be shrugged off into a free-for-all that makes it impossible to find a single canonical test suite. The interwebs are big and conflicted.
See Imre Lakatos 'Proofs and Refutations' for how this evolves.
Italic headings? Bold links? Nested lists - how many levels? Code in list? How do paragraphs interact with lists? There are many opinions and many leaky implementations of those opinions. Newlines? Embedding HTML in Markdown !?!?
It all seems so sad, because (X)HTML nailed most of these issues a very long time ago. But HTML implementations were sloppy from the outset. And XML was born with inherited bloat, then got ever more complex over time (modular specs, XLink, XPath, XSLT, DTD -> XML Schema, ...)
With Markdown, it is relatively easy to introduce some recursion into the parser, but for what spec? In what contextual cases? At what cost?
It is possible to just treat commas as whitespace. It makes implementation so much easier. It accepts missing, trailing and repeated commas. It makes elements uniform. It ignores many common errors that arise from typos or cut'n'paste. It makes JSON writers simpler, by removing the first/last special case.
A JSON parser that treats commas as whitespace can be two dozen lines in most programming languages - if you do not want line/column, chapter and verse, for the remaining error messages.
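As a sketch of how little it takes, here is a toy JSON-subset parser in Python that treats commas as whitespace. It is purely illustrative: no \uXXXX escapes, no error locations, and every function name in it is made up for this example.

```python
def parse(text):
    """Toy JSON-ish parser that treats commas as whitespace.
    Handles objects, arrays, strings (only \" and \\ escapes),
    numbers, true/false/null. A sketch, not a validator."""
    pos = 0

    def skip():
        # Commas are skipped exactly like spaces and newlines.
        nonlocal pos
        while pos < len(text) and (text[pos].isspace() or text[pos] == ','):
            pos += 1

    def value():
        nonlocal pos
        skip()
        c = text[pos]
        if c == '{':
            pos += 1
            obj = {}
            skip()
            while text[pos] != '}':
                key = value()
                skip()
                assert text[pos] == ':'
                pos += 1
                obj[key] = value()
                skip()
            pos += 1
            return obj
        if c == '[':
            pos += 1
            arr = []
            skip()
            while text[pos] != ']':
                arr.append(value())
                skip()
            pos += 1
            return arr
        if c == '"':
            pos += 1
            out = []
            while text[pos] != '"':
                if text[pos] == '\\':
                    pos += 1
                out.append(text[pos])
                pos += 1
            pos += 1
            return ''.join(out)
        for lit, val in (('true', True), ('false', False), ('null', None)):
            if text.startswith(lit, pos):
                pos += len(lit)
                return val
        start = pos
        while pos < len(text) and text[pos] in '-+.eE0123456789':
            pos += 1
        num = text[start:pos]
        return float(num) if any(ch in num for ch in '.eE') else int(num)

    return value()

# Missing, trailing and repeated commas are all accepted:
print(parse('[1,,2,3,]'))        # → [1, 2, 3]
print(parse('{"a": 1 "b": 2}'))  # → {'a': 1, 'b': 2}
```

Note how the first/last special case disappears: array and object loops simply run until the closing bracket, with skip() eating any commas in between.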
DuckDB has this problem when parallel processing of CSV is enabled.
Understandably though, because they want to process many lines in parallel.
Of course the cat emoji is escaped by the puppy emoji if it occurs in a value. The puppy emoji escapes itself when needed.
I also considered a dedicated keyboard like apl just to be dense about it.
Have each character signed by the keyboard so that we have proof of by whom it was typed and when.
People who don't work here don't get to write code. It just won't happen. haha
APL got pretty close.
TSV doesn’t have this problem. It can represent any string that doesn’t have either a tab or a newline, which is many more than CSV can.
I ended up saving my mental health by supporting two different formats: "RFC CSV" and "Excel CSV". In Excel you can, for example, use a sep=# hint at the beginning of the file to get the delimiter to work consistently. The sep annotation obviously breaks parsing for every other CSV parser, but that's why there is the other format.
Also there might be other reasons to mess with the file to get it to open correctly in Excel, like date formats, or adding a BOM to get it recognized as UTF-8, etc. (Not quite sure whether the BOM case was Excel or some other software we used to work with.)
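For the Excel flavor, the sep= hint can be stripped before handing the rest to a normal parser. A sketch (file contents made up):

```python
import csv
import io

# Excel-only hint: a first line "sep=;" tells Excel to use ';' as the delimiter.
# Other parsers choke on it, so detect and strip it first:
text = "sep=;\na;b\n1;2\n"
lines = text.splitlines()
if lines[0].lower().startswith("sep="):
    delim = lines[0][4:][:1] or ","
    lines = lines[1:]
else:
    delim = ","
rows = list(csv.reader(io.StringIO("\n".join(lines)), delimiter=delim))
print(rows)  # → [['a', 'b'], ['1', '2']]
```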
https://support.microsoft.com/en-us/office/import-or-export-...
https://support.microsoft.com/en-us/office/text-import-wizar...
"Delimiters Select the character that separates values in your text file. If the character is not listed, select the Other check box, and then type the character in the box that contains the cursor."
Maybe they should know their tools better instead of plain double-clicking and hoping for the best.
For example the web version doesn't have a dark mode. Google Sheets and Docs these days are more useful and feature-rich than Excel.
If you go with the CSV convention of two adjacent tabs => a blank cell in the middle, then rows of different lengths will not line up properly in most text editors. And "different length" depends on the client's tab width too.
If you allow any amount of tabs between columns, then you need a special way to signify an actually-blank column. And escaping for when you want to quote that.
If you say "use tabs for columns and spaces for alignment", then you've got to trim all values, which may not be desirable.
In data exchange nobody ever allows multiple tabs between columns. If there are multiple tabs with nothing in between it means the column is empty for that row.
Just like with CSV, TSV, is always a pain to edit manually so the issues there are the same. Using tabs does have a lower likelihood of conflicting with the actual data.
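The empty-column convention from data exchange falls straight out of a plain split:

```python
# Two adjacent tabs mean an empty value for the middle column:
line = "alice\t\t42"
print(line.split("\t"))  # → ['alice', '', '42']
```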
It seems like half the problems with CSV were solved back in the 70s with ASCII codes.
However, a good reason to use TSV/CSV is that import/export in spreadsheets is really easy. TSV used to have an obscure advantage: Google Sheets could export that but not CSV. They've since fixed that and you can do both now.
And of course, getting CSV out of a database is straightforward as well. Both databases and spreadsheets are of course tabular data; so the format is a good fit for that.
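For instance, dumping a query result to CSV takes a few lines with Python's stdlib (table and column names here are made up):

```python
import csv
import io
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT, qty INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", [("widget", 3), ("gad,get", 7)])

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
cur = con.execute("SELECT * FROM t")
writer.writerow([d[0] for d in cur.description])  # header row from the cursor
writer.writerows(cur)                             # remaining rows
print(buf.getvalue())
```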
Spreadsheets are nice when you are dealing with non-technical people. They make it easier to involve them in editing / managing content. Also, a spreadsheet is a great substitute for admin tools to edit this data. I was once on a project where we paid some poor freelancer to work on a convoluted tool to edit data. In the end, the customer hated it and we unceremoniously replaced it with a spreadsheet (my suggestion). Much easier to edit stuff with those. They loved it. The poor guy had worked for months on that tool with the help of a lot of misguided UX, design and product management. It got super complicated and it was tedious to use. A complete waste of time. All they needed was a simple spreadsheet and some way to get the data inside it deployed. They already knew how to use spreadsheets, so they were all over that.
Nobody on this planet wants to use e.g. LibreOffice to import your CSV file and save it as xlsx so they can open it in Excel.
Personally, since I've discovered the field/group/record/file separator characters in ASCII, I've been using them to concat fields and rows on one-to-many SQL joins. They work great for that purpose since (at least on all the projects I've done this with so far) I can be certain that none of the values in the joined data will have those characters, so no further escaping is necessary. For example, in MySQL:
SELECT
i.item_id,
GROUP_CONCAT(CONCAT_WS(0x1F, f.field_id, f.field_value) SEPARATOR 0x1E) AS field_values
FROM items i
LEFT JOIN fields f ON f.item_id = i.item_id
WHERE ...
Then split field_values with 0x1E to get each field ID and field value pair, and split each of those on 0x1F. Easy as pie.

There is still plenty of this kind of data exchange happening, and CSV is perfectly fine for it.
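The splitting step in Python, with a hypothetical field_values string shaped like the GROUP_CONCAT output above:

```python
# Two (field_id, field_value) pairs joined with the ASCII unit (0x1F)
# and record (0x1E) separators, as the query would produce them:
field_values = "1\x1fred\x1e2\x1flarge"

# Split records on 0x1E, then each record into its ID/value pair on 0x1F:
pairs = [record.split("\x1f") for record in field_values.split("\x1e")]
print(pairs)  # → [['1', 'red'], ['2', 'large']]
```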
If I'm consuming data produced by some giant tech company or mega bank or whatever, there is no chance I'll be able to get them to fix some issue I have processing it. From these kind of folks, I'd like something other than CSV.
Only once have I seen a bad .csv from a "big" company--big fish in a small pond type big. We were looking to get data out and, hey, great, .csv is a valid export format. I'm not sure exactly what was in that file, but it appeared to be the printout with some field info attached to each field. (Put this at that location on the paper, etc., one field per line.) Every output format it had was bugged in some scenario.
Tab makes far more sense here, because you are very likely able to just convert non-delimiter tabs to spaces without losing semantics.
Even considering how editors tend to mess with the tab character, there are still better choices based on frequency in typical text: |, ~, or even ;.
All IMHO, again.
I made ScrollSets, a language that compiles to CSVs! (https://scroll.pub/blog/scrollsets.html)
Here's a simple tool to turn your CSV into ScrollSet (https://scroll.pub/blog/csvToScrollSet.html)
This is what powers the CSV download on PLDB.io and how so many people collaborate on building a single CSV (https://pldb.io/csv.html)
I actually just finished a library to add proper typed parsing that works with existing CSV files. It's designed to be as compatible as possible with existing spreadsheets, while allowing for perfect escaping and infinite nesting of complex data structures and strings. I think it's an ideal compromise, as most CSV files won't change at all.
I'm not bitter, I just hate working with ETL 'teams' that struggle to output the data in a specified format - even when you specify it in the way they want you to.
it'll only remain king as long as we let it.
move to using Sqlite db files as your interchange format
I help clients deal with them frequently. For many cases they are sufficient, for other cases moving to something like parquet makes a lot of sense.
It's just much easier to keep using it, since you're already doing it.
In the meantime, how about XML? /awaits the pack of raving mad HNers
echo foo | jq -rR 'split("") | @csv'

Nothing prevents you using ndjson where you define a header and then have an array per line.
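A sketch of that ndjson-with-header convention (data made up):

```python
import json

# Hypothetical ndjson table: first line is the header,
# each subsequent line is one row as a JSON array.
lines = [
    '["name", "qty"]',
    '["widget", 3]',
    '["gad,get", 7]',
]
header = json.loads(lines[0])
rows = [dict(zip(header, json.loads(line))) for line in lines[1:]]
print(rows)  # → [{'name': 'widget', 'qty': 3}, {'name': 'gad,get', 'qty': 7}]
```

Commas and quotes inside values need no special treatment here, since JSON's own string rules apply on each line.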
CSV is explicitly about tabular data. JSON (including JSON5) is much more flexible. Flexibility can be great but also can be annoying. If you want tabular data, then a system that enables nesting isn't great.
What we need is,
- A standard (yeah, link xkcd 927; it's mentioned enough that I can recall its ID) to be announced **after** the rest of the pieces are ready.
- Libraries to work with it in major languages. One in Rust + wrappers in common languages might get good traction these days. Having support for dataframe libraries right away might be necessary too.
- Good tooling. I'm guessing one of the reasons CSV took off is that regular unix tools are able to deal with CSVs mostly fine (there are edge cases with field delimiters/commas, but it's not that bad).
The new format would ideally have types, the files would be sharded and have metadata to quickly scan them, and the tooling should be able to make simple joins, ideally automatically based on the metadata, since most of the time there's a single reasonable way to join tables.

This seems like too much work to get right from the very beginning, so maybe building on top of Apache Arrow might help reduce the solution space.
The only time people get in trouble with CSV is when they skip using those tools, hack something together, and then get it wrong.
> The new format would ideally have types, the files would be sharded and have metadata to quickly scan them
There's no need for new stuff. It would be redundant as there are several things already that do these things. Adding more isn't helpful. The problem is most of the stuff that supports CSV tends to support none of those things and fixing a lot of ancient systems to retrofit them with e.g. parquet support or whatever is a mission impossible. CSVs principle feature is that it is simply everywhere. That's hard to replicate. People have been trying for decades.
Parquet fits the bill here. It's not perfect (there is no perfect file format), but it's a practical compromise as of today, at least for new systems where a columnar format is appropriate. There are some columnar formats that are better in some aspects (like ORC and some proprietary formats) but they're not as widely supported.
It's not that CSV/TSV is bad in every situation, but more that CSV/TSV is overused for things it shouldn't be used for. (CSV is good as a tabular format for simple applications, very bad as the storage format for data lakes or anything you want to query, questionable as a data exchange format, okay as a semi-structured format for structurally simple data -- many open data platforms offer it as a download format and it generally works).
To get a sense of how much variation a CSV reader needs to handle, we can take a look at the number of arguments in Pandas' read_csv. And it still fails on some CSVs! (I've had to preprocess CSVs before pd.read_csv would work.)
https://pandas.pydata.org/pandas-docs/stable/reference/api/p...
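Pandas aside, even Python's stdlib csv module needs per-dialect knobs for the delimiter and quote character. A small sketch with made-up data, showing the same logical table in two real-world dialects:

```python
import csv
import io

# Excel-style: comma-delimited, double-quoted.
excel_style = 'name,qty\n"gad,get",7\n'
# Common European export style: semicolon-delimited, here single-quoted.
euro_style = "name;qty\n'gad,get';7\n"

rows_a = list(csv.reader(io.StringIO(excel_style)))
rows_b = list(csv.reader(io.StringIO(euro_style), delimiter=";", quotechar="'"))
print(rows_a)  # → [['name', 'qty'], ['gad,get', '7']]
assert rows_a == rows_b
```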
CSV is not king, but it is popular. But popularity doesn't mean it's good for every use case. Optimizing for human readability and easy generation means trading off other very important characteristics (type safety, legibility across different tooling, random access performance, reliability/consistency).
You can't do anything about legacy systems, but when designing a new system, you should really ask yourself: is CSV really the right choice?
(With DuckDB, the answer for me is increasingly no)
burntsushi is nine years ahead of you: https://crates.io/crates/csv
Also, what I have in mind for file sharding needs maybe a standard on top of a record/column file format. The successor to CSV should be easy to process in parallel.
Having so many formats is confusing, inefficient and leads to data loss. This article is right, CSV is king simply because it's essentially the lowest common denominator and I, like most of us, use it for that reason—at least that's so for data that can be stored in database type formats.
But take other data such as images, sound and AVI, and even text. There are dozens of sound, image and other formats. It's all a first-class mess.
For example, we fall back to the antiquated, horrible JPG format because we can't agree on better ones such as, say, JPEG 2000, there always being excuses why we can't: speed, data size, inefficient algorithms, etc.
Take word processing, for instance: why is it so hard to convert Microsoft's confounded, nasty DOC format to, say, the open-document ODT format without errors? It's almost impossible to get the layout in one format converted accurately into another. Similarly, information is lost converting from lossless TIF to, say, JPG, or from WAV to MP3, etc. What's worse is that so few seem to care about such things.
Every time a conversion is done between lossless formats and lossy ones entropy increases. That's not to say that shouldn't happen it's just that in isolation one has little or no idea about the quality of the original material. Even with ever increasing speeds, more and more storage space so many still have an obsession—in fact a fetish—of compressing data into smaller and smaller sizes using lossy formats with little regard for what's actually lost.
It's not only in sound and image formats where data integrity suffers for convenience; take the case of converting data fields from one format to another. How often has one experienced the situation where a field is truncated during conversion—where, say, 128 characters suddenly become 64 or so after conversion and there's no indication from the converter that data has actually been truncated? Many times, I'd suggest.
Another instance, is where fields in the original data don't exist in the converted format. For example, data is often lost from one's phone contacts when converted from an old phone to a new one because the new phone doesn't accommodate all the fields of the old one.
Programmers really have a damn hide for not only allowing this to occur but for not even warning the poor hapless user that some of his/her data has been lost.
That programmers have so little regard and consideration for data integrity is, I reckon, a terrible situation and a blight on the whole IT industry.
Why doesn't computer science take these issues more seriously?
Simple: cost. A company is not going to approve any project to move to a new standard. Plus you have new hires coming in with their favorite "Standard of the Day" and starting to use that standard no matter what they are told.
Management only cares about the end result (i.e., the bottom line), not how it got there.
That lack of consideration for users' data will ultimately lead to regulation. Much of a user's data is only machine-readable, so ordinary users shouldn't be expected to know when their data has been truncated after, say, a data conversion. They can't be expected to realize their data is corrupted long after the event and past the point where it can be corrected.
It's like everything else, originally there's the Wild West days when everything's a free-for-all, but regulations eventually kick in after the harm done is considered unacceptable. We've seen regulations introduced everywhere else, from foods—pure food acts, pharmaceutical—FDA, transport—NTSB, Water purity standards and so on. So eventually computing/IT will be no exception.
Unfortunately, computing/IT is still in the 'Wild West' days. Personally, I can hardly wait for those enforced regulations to become effective.