A few months ago, I was trying to get some bulk data into ebay's proprietary TurboLister[1] program. Guess what, it can import csv but not JSON.
SQLite[2] can import csv but not JSON.
Google's terabyte ngram dataset[3] is csv (tsv) instead of JSON. I'm glad it's not JSON because it would have required extra disk space.
... plus tons of other real-world csv examples out in the wild.
Unfortunately, the csv format is very easy for programs to write but it's very difficult for programs to properly read because of the tricky parsing.
[1] http://pages.ebay.com/sellerinformation/sellingresources/tur...
[2] http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles
[3] http://storage.googleapis.com/books/ngrams/books/datasetsv2....
So no, if you control input and output, JSON is still easier to use than CSV, and just as performant. JSON stores straight arrays just fine. It's not the format's fault so many people choose to store hashes with it.
Sometimes, simplicity is better for everyone.
I've had to() parse json embedded in a field in a csv file. Unquoted of course.
Until I explained to the other developer just how stupid that was.CSV is bullshit, it's not good for anything except scenarios where you control both the export process, and the parser (so you know what delimiter is used and so on).
http://search.cpan.org/~makamaka/Text-CSV-1.32/lib/Text/CSV....
It would be awesome if someone made a table with CSV features in one dimension and application/library behaviour in the other.
Not too long ago I was struggling with one of these CSV issues and received some good advice from Hans Passant [1] on a Stack Overflow question pertaining to my problem (emphasis mine):
"It is pretty important that you don't try to fix it. That will make you responsible for bad data for a long time. Reject the file for being improperly formatted. If they hassle you about it then point out that it is not RFC-4180 compatible. There's another programmer somewhere that can easily fix this."
It makes perfect sense in hindsight. If you accept a malformed CSV file, people will expect you to accept any malformed data that has a CSV extension. You are taking on a lot of extra responsibility to cover for the lack of work by another programmer. Odds are they can make a change to fix the problem that takes a fraction of the time it would take you work around it. You just have to raise the issue.
I realize that rejecting bad files isn't really possible in every circumstance. But I have a feeling it is an option more times than you might initially think.
In my current job, the most common "invalid" CSV format we get is .xlsx files.
So I wrote an .xlsx parser (way, way faster than Apache POI).
Another interesting hiccup to consider is CSV inside individual fields - i.e. recursive CSV. There are various ways to handle this, but in my company's line of business the usual route is to duplicate that line once per CSV element found in the field.
Likely the next invalid format we'll have to parse is PDFs containing tables...
cough people doing e-invoicing with pdf's...
And after that you will have to parse PDFs containing scans (as images, not text) of pages containing tables...
I was pulling data from a medical system that I knew full well I would not be able to get changes into for YEARS (and I got to meet the vendor, who was working on a shiny new XML export system - I wonder if that has quoting issues too - it wasn't released by the time I finished working the project)
So I wound up writing perl that knew enough to fix all the common problems with the source data, and emailed me any odd lines it couldn't cope with, so I could go in and update the regular expressions. It kinda sucked, but the end result was better antibiotic coverage for a bunch of people. Worst case of a line it really couldn't handle was that person didn't get the benefit of an expert system checking that they didn't have doubled-up medicines, which is no worse than they would have had without this system.
Trying to compensate for bad CSV format will more likely cause headaches and confusion rather than clarity. It can also discourage the need for CSV writers to be rigorous about their output formatting.
It there are many "what ifs" like in the posted article. You probably need another format like JSON (preferably) or XML.
Of course you can embed comma-separated lists in XML, but with JSON it will parse them for you.
(And of course it's not as good as a protobuf, but not bad for a text format.)
However, that doesn't excuse sloppy CSV writers.
We get "CSV" files from Klarna, an invoicing company, with the payments they've processed for us. Because we're Danish and they are Swedish, it's not really weird that they would use comma as the decimal separator. So to compensate for having used the comma, they for some reason picks ", " ( that's comma + space ) as the field separator. Most good csv parsers can handle the field separator to be any character you like, as long is it's just ONE character. By picking a two character separator they've just dictated that I write my own or resort to just splitting a line on ", ".
I have an function that I usually use in projects that counts , and ; on each line to determine which one is most likly beeing used in the file.
The most annoying thing I have found in csv files is the escape sign I would like it to be \" but very often I see """ as the escape for "
Don't do things that screw up the standard tools other developers depend on.
When I add a CSV import feature to a project I'm working on, I tell people "this works with MS Excel flavor of CSV." This covers most, if not all, real world cases because in my world the people who want to import data are non-programmer types who all use Excel.
I'll often include the basic rules in the screen that accepts the import. If I ever had to accept data from something that was _not_ Excel I'd probably include a combo box on the web form that lets you pick the dialect. So far I haven't had to do that.
The only thing I might not be totally covering is how Excel handles newlines, but in practice I've never had to deal with that.
commas don't delimit field entries? CSV -> comma separated values.
new lines inside a field? i've never written a parser that would be foiled by this. could be an issue if you use a built-in tokeniser (e.g strtok, etc.). be aware.
variable number of fields? you’re probably writing this for something with an expected input form. throw errors if you see something you do not accept. make sure you catch them.
ascii/unicode? yea. it’s a fucking mess. everywhere.
just do it. handle failure gracefully. learn from your mistakes. don't be naive. consider a library if the (risk of failure):(time) ratio is skewed the wrong way. the only time i would absolutely insist that a 3rd party library be used is when crypto is involved. even then, be aware that they are not perfect.
absolutely ignore people who's argument is along the lines of "you are not smart enough to implement this standard. let someone else do it.”. fuck everything about that statement, and it’s false sense of superiority.
nothing comes for free. wether you use a library, or do your own thing, you’re going to run into problems.
In general it's not about being smart enough (although for some complicated standards maybe it's true), but rather biting off more than you realize. Everything sounds simple before you find the edge case implementation issues and have to rework and rethink a bunch of hard issues that a dozen people have already thought through. Doing it yourself is on the table, but rarely the most efficient decision.
Too many "enterprise" coworkers who don't know how to write a finite state machine. They do need to use a library.
I can't but disagree when I read stuff like this. Why shouldn't I release a library if I think it's good enough for the community? Even the powerful and versatile Ruby library for CSV parsing started as a gem from a person who didn't give a s... about advise like "do not inflict another one into this world".
Because, you know, learning, having fun and stuff.
Only thing that I don't like is that many candidates will assume that they have to fix the code within the parser, given those instructions, even if they know that a battle-tested library is how they would actually do it. I hope you accept an off-hand comment such as, "ew, why is this hand-rolled" as a sufficient indicator in favor of your solution.
It uses regex lookaheads to deal with quoting, so it's not 100% portable. But it's only about one page.
As for the other things mentioned by the OP (BOM, encoding), those should be handled by the stream, and are not the provenance of CSV per se.
Unnecessary string copy operations are what make the parser slow.
Primarily, I didn't fully understand the date objects and functions available in the languages/libraries i was using so simple things like formatting a string date seemed difficult to me.
This was an awful idea. Dreadful.
I came up with all sort of delightful helper methods to cover common use cases like adding one month to the current date. I made this decision to represent dates internally with a timestamp, so adding a month is easy, right?! No. ...What's 1 month from January 31st? February 28th? Well then what's 1 month from February 28th? The list of edge cases goes on.
Most things in life are more complicated than they, at first, seem.
> Not kidding.
We'd ll be better off really, but that ship has sailed. Using CSV for data which is only ever read by a machine is a dumb decision. Use the RS (record separator) character and many of these ambiguities disappear.
Of course, like I said, that ship has sailed. If you want your data to be read nicely by other programs you're probably stuck with CSV, TSV, or something similar.
Very little data is actually true CSV.
The code isn't particularly long (~900 lines), it's Python (hence readable) and it's well commented:
https://github.com/numpy/numpy/blob/v1.8.1/numpy/lib/npyio.p...
I can only guess that since it's astronomy data and constellation coordinates have decimal places, it's best to avoid the comma character because some countries use it as a decimal separator.
Of course you can come up with scenarios where it doesn't work, but anyone who considers themselves to be a competent programmer should be able to deal with these issues, use another data format, or just talk to whomever is giving you the data to correct their data issues.
Seriously, The overhwleming CSV_bashing in these comment really makes me worry that coders just can't handle the basics anymore.
1. Of course there are exceptions to the rules: perhaps the CSV is malformed or there are special considerations in the backend, but the general point stands.
I will solve your problem with only UNIX utilities. And I'm sure others will solve it other ways.
Usually I only need sed and tr. Sometimes lex or AWK.
Arguing about something without ever pointing to an example accomplishes nothing; it's just whining.
Post an example.
Thank you.
None of these questions are particularly daunting. CSV means "comma separated values", so if you want to play games and use other delimiters, please fuck off. If it's not a comma, then guess what: it's not delimited. New line characters are well-known, and well-understood, across all platforms and easy to detect. If you manage to fuck that up in your file, then take a look in the mirror, because the problem is you. Enforcing the practice of enclosing the target data in quotation marks among users is a good idea. It's something that should be supported and encouraged, and ignored at one's own risk.
Additionally, employing an escape character (such as backslash) to allow for the use of a quotation mark within enclosing quotation marks is a nice feature to add in. After that, the concept of a CSV file has provided enough tools, to tolerate [an arbitrarily large percentage] of all use cases. If you need something more robust, XML is thataway.
If you look at lib/csv.rb [1] it's:
* 2325 Lines
* 2161 Non-blank lines
* 950 Lines of Code
Overall I agree with the article, there's no point in reinventing the wheel if there are libraries out there. And CSV specifically is a horribly complex format to deal with. But sometimes rolling your own is the best and/or only choice you have, and you might come out the other end enjoying the experience, and having learned a lot.
As for what happened to my old CSV parser? It ended up being quite popular, but stuck in the dark ages as I'd mostly moved on from PHP years ago. But thanks to a contributor, we've recently put renewed effort into bringing the project in to modern times: https://github.com/parsecsv/parsecsv-for-php
No alternate delimiters, no backslashes.
Now I have to put up with offshore staff trying to use apostrophes (') instead of quotes (") :-(
Barring alternate delimiters, and disallowing newlines* in fields, I can write the parser for 4180 in about 30 lines of perl, reading a char at a time and flipping between about 4 states. (avoids getting root access and days of paperwork to install from CPAN)
* disallowing newlines in the data is admittedly a big restriction, but it works for many use-case/applications, and allows the caller to pull in a line before calling the parse function.
For Java, the "Ostermiller" library is pretty good for CSV handling, and has a few options for dealing with freaky variants.
I was trying to get Perl tar libraries working, when my colleague asked why I don't just use backticks to do it in the shell. Basically because I don't know that much about tar. I can use it to untar file, or create a new archive. Someone else who has written a library probably has taken the time to read through the whole manual and make it work nicely. They know the errors and warnings, and have abstracted that to a sensible level hopefully. They have thought about these things, so hopefully I won't have to.
CSV or Comma Separated Values are not only for RFC, but also for EVERYONE who wants to use this word or phrase. Pedantry sucks!
knowing my needs i could easily account for all possible muck ups and avoid the instances where ambiguity could play a part
i was then able to use the bits i pulled out of the ~4G file, now 16M, in the parser with all of its assurances
sure, edge cases justify using a tried and true library for generics, but there are also edge cases that justify mocking up your own naive implementation.. if only, like in my case, to make the dada usable in such a library