And if you are in control of producing data, just produce strict RFC 4180-compliant CSV data and everybody will be able to read it just fine. There is no need to make your readers' lives difficult by using yet another non-standard data format.
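For what it's worth, producing compliant output takes almost no effort in, say, Python, whose stdlib csv module already follows the RFC's quoting and CRLF conventions; a minimal sketch:

```python
import csv
import io

# RFC 4180 wants CRLF record separators and "" to escape embedded quotes;
# Python's default dialect already does both.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\r\n", quoting=csv.QUOTE_MINIMAL)
writer.writerow(["name", "comment"])
writer.writerow(["Alice", 'said "hi", then left'])

print(buf.getvalue())
# name,comment
# Alice,"said ""hi"", then left"
```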
See also: <https://news.ycombinator.com/item?id=39679753>
> file = [header CRLF] record *(CRLF record) [CRLF]
I find it kind of wild that you have to have at least one record. Suppose I have a program that lists the events that occurred on a given day. How do I represent the fact that the program ran successfully but that there weren't any events on that day?
Not necessarily how I'd do it intuitively, but doesn't seem that crazy
Tbf, ASCII delimiter characters have been around since the 1960s. They're not exactly reinventing the wheel
Not really.
What's bad is when people keep insisting on coming up with new and amazing CSV dialects.
https://www.ietf.org/rfc/rfc4180.txt is very clear about what CSV files are supposed to look like, and the fact that people keep ignoring it for whatever reason is not the format's problem.
And no, "using another format" is not a solution to this. Because: I can just invent a new DSV dialect. Or a JSON dialect. Or a dialect where the field separator is "0xFF00FF00" and the row separator is the string `DECAFCOFFEE` encoded in EBCDIC, all other characters have to be UTF-32, except for a, b and f, which also need to be EBCDIC encoded.
> For starters, it’s rather unreadable when opened in a text editor. But I bet you don’t really do that with your CSVs all that often anyway!
Wrong. I do that with csv files all the time. In fact I even have an amazing vim plugin just for them [0]. That's pretty much the point of having a plaintext tabular data storage format: That I can view and edit it using standard text wrangling utilities.
---
There is a much simpler solution to this problem: Don't accept broken CSV. If people keep ignoring standards, that's their problem.
If you are a multi-billion dollar company creating a new integration, you can demand that your small supplier provide an RFC-4180 compliant file, and even refuse to process it if its schema or encoding is not conformant.
If you are the small supplier of a multi-billion dollar company, you will absolutely process whatever it is that they send you. If it changes, you will even adapt your processes around it.
TFA proposes a nice format that is efficient to parse and better than CSV in some ways, though not in others. Use it if you can and it makes sense.
Nevertheless, even in projects where my services are talking to something that's bigger, I will, at the very least, ask "why can't it be RFC compliant? Is there a reason?". And without blowing my own horn overly much, quite a few systems larger than mine have changed because someone asked that question.
Mm, not really. By its own admission, it is descriptive, not prescriptive:
> This section documents the format that seems to be followed by most implementations
And it came out in 2005, by which date CSVs had already been in use for some twenty or thirty years.
Yes, CSV is much, much older. In fact it predates personal computers. And it went through changes. Again: None of that matters. We have a standard, we should use the standard, and systems should demand the standard.
Standards are meant to ensure minimal-friction interoperability. If systems don't enforce standards, then there is no point in having a standard in the first place.
From the very memo you link to (RFC 4180):
> Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing CSV files.
And yes, I am aware that the standard says this. My counter question to that is: How much client-liberty do I have to accept? Where do I draw the line? How much is too much liberty?
And the answer is: there is no answer. Wherever any system draws that line, it's an arbitrary decision; except for one, which ensures the least surprise and maximum interoperability (aka the point of a standard): be "conservative", and simply demand the standard.
So this is gold. Editing xSV files has been an ongoing pain, and this plugin is just amazingly awesome. Thanks for the link to it.
Fun fact: I did. And not just for accounting systems, but all sorts of data ingestion pipelines. Did it work every time? No. Did it work in many cases? Yes. Is that better? Absolutely.
Here is the thing: If I accept broken CSV, where do I stop? What's next? Next thing my webservice backends have to accept broken HTTP? My JSON-RPC backends have to accept JSON with /* */-style block comments? My ODBC load-balancer has to accept natural language instead of SQL statements (I mean, it's the age of the LLM, I could make that possible).
The intent of the format is to be human readable and editable. Sure, tab characters can be used instead of commas (TSV files). Yes, there's that "" to escape a quote rule. Oh, and quoted values are optional; unquoted strings are fine as long as they contain no delimiter, quote, or newline characters.
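Both rules — doubled quotes and optional quoting — are easy to verify with any stock parser; a quick round-trip sketch in Python:

```python
import csv
import io

# One field with an embedded quote, one with an embedded newline:
raw = 'id,note\r\n1,"she said ""ok"""\r\n2,"line one\nline two"\r\n'
rows = list(csv.reader(io.StringIO(raw)))

assert rows[1] == ['1', 'she said "ok"']      # "" collapses to "
assert rows[2] == ['2', 'line one\nline two']  # quoting protects the newline
```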
Sure, you could make another CSV-inspired format which uses the old mainframe control characters; except, as keeps getting pointed out, even programmers often don't know how to enter raw control characters on their systems. Who even bashes those out these days? I know I have to look it up every time.
Rejoice that the format is so simple, it's all just text which software might convert to numbers or other values as it might desire.
When I was an undergrad, I had kind of an anal software engineering 101 professor who was treating the course like he was a scrum master. The deliverable was to make some dumb crud app, and a requirement was it used a "database." It was so stupid simple to write a csv to s3 or local disk that I just used that for the entire project. He tried to fail me for not following the requirements, and I had to go to the dean of CS and argue that by definition, a structured data format on a disk is absolutely a database, and I won. I got graded horribly after that though.
Yes, but that is because those characters are not meant to be entered directly. DSV values should either be created by a dedicated DSV editor or be constructed by a software library, just as you would rather use a paint program to create an image than write the image's bytes in a text editor.
Also, I'm on a mobile right now so can't verify this, but it seems the format is flawed. The reader decodes UTF-8 strings after splitting the binary buffer on the delimiter, but I believe the delimiter may be part of a multi-byte UTF-8 character.

Edit: just checked, and there's actually no chance that the delimiter the author chose could be part of the UTF-8 encoding of any character other than the delimiter itself.
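The reason: in UTF-8, every byte of a multi-byte sequence has its high bit set (0x80–0xFF), so a 7-bit byte like 0x1F can only ever encode U+001F itself. A quick check:

```python
# In UTF-8, every byte of a multi-byte sequence is >= 0x80,
# so a 7-bit delimiter byte like 0x1F can only mean U+001F itself.
fields = ["naïve", "日本語", "emoji 🎉"]
encoded = b"\x1f".join(f.encode("utf-8") for f in fields)

# Splitting the raw bytes on the delimiter before decoding is therefore safe:
decoded = [part.decode("utf-8") for part in encoded.split(b"\x1f")]
assert decoded == fields
```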
It might seem that way if you didn't actually read the article:
> So what’s the downside? This custom FEC tooling might give you a hint.
> For starters, it’s rather unreadable when opened in a text editor.
Some nitpicks; maybe someone finds them useful. Can we talk about the code design a little bit?
    class DSV:
        @property
        def delimiter(cls) -> bytes:
            return b'\x1F'

        @property
        def record_separator(cls) -> bytes:
            return b'\x1E'

        @property
        def encoding(cls) -> str:
            return 'utf-8'
It's Python; do not make premature properties for static values:

    class DSV:
        delimiter = b'\x1F'
        record_separator = b'\x1E'
        encoding = 'utf-8'
Also, it's a false inheritance relationship: a Writer is not related to configuration, and you can't make any other useful subclass of DSV (OK, maybe DSVReader, but that's it). At the very least it should be the other way around: an abstract Writer operating on instance configuration attributes, with DSVWriter defining those attributes.

Also, `self._buffer += chunk` is O(N^2). It starts to bite even for buffers as small as 100 bytes. That's fine for an example, but it's an issue for real code. At least the example buffers the incomplete record rather than the whole read chunk (good!). But it does only one split at a time (bad).
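To make the grumbling concrete, here's one way to avoid both issues, sketched independently of the article's code: accumulate into a bytearray (amortized appends) and peel off every complete record per chunk with a single split():

```python
# Sketch of an O(total) streaming record splitter: append chunks to a
# bytearray and extract all complete records per chunk in one split().
RS = b"\x1e"

def iter_records(chunks):
    buffer = bytearray()
    for chunk in chunks:
        buffer += chunk                     # amortized, unlike bytes +=
        *records, tail = buffer.split(RS)   # every complete record at once
        buffer = bytearray(tail)            # keep only the incomplete remainder
        yield from (bytes(r) for r in records)
    if buffer:
        yield bytes(buffer)                 # trailing record without final RS

chunks = [b"a\x1eb", b"b\x1e", b"c"]
assert list(iter_records(chunks)) == [b"a", b"bb", b"c"]
```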
[Grumpy mode end]
Nevertheless article is very valuable and interesting to read. CSV gotchas are well described.
I can't tell you how many times I've downloaded a CSV that didn't escape quotes or newlines correctly, or how many times Excel has failed to correctly parse a perfectly valid CSV due to some decades-old quirk.
I know that there are better formats that make these types of situations pop up less, but is a little bit of quality control too much to ask for? If you've tested your software to make sure that it can handle CSVs with line breaks, tabs, and both types of quotes, then you've seemingly done more testing than 90% of the software out there.
On that note, the LibreOffice Calc team deserves major credit for how many different CSV formats it can handle. It's saved my bacon so many times when Excel wasn't up to the task.
Any file format needs a well-specified escape strategy, because every file format is binary and may contain binary data. CSV is kinda bad not only because, in practice, there's no consensus escaping, but also because we don't communicate what the chosen escaping is!
I think a standard meta header like follows would do wonders to improve interchangeability, without having to communicate the serialization format out-of-band.
```
#csv delim=";" encoding=utf8 quote=double locale="pt-BR" header=true
```
(RFC-4180 does specify that charset and header may be specified in the MIME type)
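Since the `#csv` line is the commenter's own proposal, here's a hypothetical reader-side sketch of parsing it with shlex (the key names are assumptions, not any standard):

```python
import shlex

def parse_meta(line):
    """Parse a hypothetical '#csv key=value ...' meta header line."""
    assert line.startswith("#csv")
    # shlex handles the quoted values (delim=";" -> delim=;)
    pairs = (tok.split("=", 1) for tok in shlex.split(line[4:]))
    return dict(pairs)

meta = parse_meta('#csv delim=";" encoding=utf8 quote=double locale="pt-BR" header=true')
assert meta["delim"] == ";"
assert meta["locale"] == "pt-BR"
```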
You're confusing the concept of tabular data with the file format. If the most natural way to represent tabular data is through a 2D array, then so be it. The vast majority of people aren't complaining about the fact that they have to hardcode the meaning of "the last name is written into the fifth column", they are cursing that the fifth column has suddenly shifted into the sixth column, because the first name contained a comma.
In my experience, the best way to handle this is:
1) Use TSV (tab-separated) instead of CSV (most things that export CSV also export TSV). Strip CR characters while reading and treat LF as the newline.
2) If you have a stubborn data source that insists on CSV, convert it to TSV in a pre-process. This could be a separate step or part of your reader as you're reading in the file. That means there's a single place to handle the escaping nonsense, and you can tailor that to each data source.
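A sketch of such a pre-process using Python's stdlib parser; replacing tabs and newlines inside fields with spaces is a policy choice for this hypothetical pipeline, not part of any standard:

```python
import csv
import io

def csv_to_tsv(csv_text):
    """Convert CSV text to TSV. Tabs/newlines inside fields become spaces
    (a policy choice) so a plain line.split('\t') works downstream."""
    out = []
    for row in csv.reader(io.StringIO(csv_text)):
        cleaned = [f.replace("\t", " ").replace("\r", " ").replace("\n", " ")
                   for f in row]
        out.append("\t".join(cleaned))
    return "\n".join(out)

tsv = csv_to_tsv('a,"b,c"\r\n"multi\nline",d\r\n')
assert tsv == "a\tb,c\nmulti line\td"
```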
I have in fact seen CSV files used as an interchange format for things that include non-plaintext fields. And I've seen nested CSV files.
I really wish that were true.
Many times they just couldn't seem to find the comma. Other times there were commas in the item description (unescaped). My favourite, though, was when files were edited collaboratively using Mac, Windows and Linux machines - multiple line-end types FTW! Like I said, a long and somewhat inglorious career...
DSVs didn't work with Google Sheets, vim, or Python – I assume this is the exhaustive list of software the author would have needed support from. The question, then: If no software understands the format, what's the point?
> I first learned about these ASCII delimiters while working with .fec [Federal Election Commission] files.
And then the author instantly chose a different delimiter. Two parties and already two standards. That should have been the final red flag for this proposal.
---

Aside: CSVs have so many problems with their data format that you always have to verify them anyway.

Germans write 5.000,24€ where an American would write $5,000.24. Date strings. URL-encoded strings. Numbers as strings. Floating-point numbers.
Solving the delimiter problem accomplishes nothing.
There's also a more modern USV (Unicode Separated Values) which has visible separators.
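For reference, USV replaces the invisible ASCII controls with their visible Unicode "symbol for" counterparts; a rough round-trip sketch (separator code points as used by the USV project):

```python
# USV swaps the invisible ASCII controls for visible Unicode symbols,
# so the file stays readable in any editor.
UNIT = "\u241f"    # ␟ SYMBOL FOR UNIT SEPARATOR
RECORD = "\u241e"  # ␞ SYMBOL FOR RECORD SEPARATOR

rows = [["a", "b"], ["c,d", "e"]]
usv = RECORD.join(UNIT.join(row) for row in rows)
parsed = [rec.split(UNIT) for rec in usv.split(RECORD)]
assert parsed == rows
```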
Every time I have to do any major work with CSVs, I re-lament this exact thing.
I think the only way this could ever become more widespread is to fix all the open source tooling so that it's eventually just supported everywhere - then keep evangelizing for... ~30 yrs.
Probably you should also register a new mime type and extension and make it a new thing - don't overload .CSV any further - but make the same tooling support it.
Ban those things and it starts to become reasonable enough for general use.
I too have wondered why the hell aren't we using those special characters already ever since I discovered their existence
You still have the issue described by "" with '' if I read the examples correctly.
Am I alone in this?
The number of times I have written a simple CSV parser to correctly handle quoted strings and the like is more than I have digits, when I could have just pretended it's JSON.
Going to make a mental note to try this next time!
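If "pretend it's JSON" means one JSON array per line (an assumption about what the parent comment had in mind, i.e. JSON Lines), the stdlib already handles all the escaping:

```python
import json

rows = [["id", "note"], ["1", 'quotes " and, commas\nand newlines']]

# One JSON array per line: json.dumps escapes quotes and newlines itself,
# so no hand-rolled CSV quoting rules are needed.
jsonl = "\n".join(json.dumps(r) for r in rows)
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == rows
```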
It's less code for you and you can do neat things like zstd compression on the columns.
Bonus, it also doesn't require that you load and unload everything in memory.
I can only speculate on this, but in Perl, for fake multidimensional arrays à la `$foo{$x,$y,$z}`[^1], Perl uses ASCII character 28 (U+001C INFORMATION SEPARATOR FOUR) as its default subscript separator. Perl borrowed this feature from AWK, which uses the same character by default for the same purpose.

Based on Perl, I initially used that same character for that same purpose in a project or two. I cannot speculate on why Aho, Weinberger, and/or Kernighan chose that character. (On or before 1977.)

[^1]: Not to be confused with nested array (or hash) references in Perl, a truer form of multidimensional arrays: `$foo->[$x]->{$y}->[$z]`
Good luck with that.
The problem isn't "CSV". The problems come from:

- "excel doesn't like this CSV therefore it's not valid"
- "what do you mean the CSV I sent you is wrong? I used excel"
- "standard? What's a standard? I put info then a comma. That should be good enough for anyone"
CSV, when done right (i.e. following a standard) is a great format. It's human readable, less verbose than, say, JSON, and everybody understands it.
Just have to make sure business people (and less technical technical people) understand that CSV != Excel and vice-versa.
The purpose is to statistically analyze the numbers for anomalies or any signs of deviation from expected randomness. I do this all in python3 with various libraries. It seems to be working, but...
What is a more efficient format than csv for this kind of operation?
Edit: I have also preserved all leading zeros by conversion to strings -- csv readers don't care much for leading zeros and simply disappear them, but quotes fix that.
My rule of thumb is that anything that fits into Excel (approx 1M lines) is "small data" and can be analysed with Pandas in memory.
It's something I've never understood: why not use something more standard like SIP, or even a more structured message format? Having to parse CSV across N different packet boundaries is a royal PITA.
As others said, most of the time, if you are producing them, just produce them right, or choose another format.
If you don't then pray for the best.
?