So You Want to Write Your Own CSV code

(tburette.github.io)

158 points · Monkeyget · 12y ago · 121 comments

99 comments · 39 top-level

mantrax512y ago· 11 in thread

Why are people using CSV when better (and less fuzzily defined) solutions exist, such as JSON?

In addition to aforementioned import/export data interop with MS Excel, there are tons of legacy systems (mainframes, etc) that import/export csv but not XML or JSON. The csv format is everywhere and will continue to be with us for decades. People will always look for a quality library (in whatever new programming language) that handles all tricky edge cases.

A few months ago, I was trying to get some bulk data into ebay's proprietary TurboLister[1] program. Guess what, it can import csv but not JSON.

SQLite[2] can import csv but not JSON.

Google's terabyte ngram dataset[3] is csv (tsv) instead of JSON. I'm glad it's not JSON because it would have required extra disk space.

... plus tons of other real-world csv examples out in the wild.

Unfortunately, the csv format is very easy for programs to write but it's very difficult for programs to properly read because of the tricky parsing.

[1] http://pages.ebay.com/sellerinformation/sellingresources/tur...

[2] http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles

[3] http://storage.googleapis.com/books/ngrams/books/datasetsv2....

jzwinck12y ago

If your data are rectangular and you care about performance, CSV is better than JSON just because it avoids repetitive key names everywhere. Then again, if your data are rectangular and you really care about performance, you would not use any of these (you might use HDF5, which has support in many programming languages and will destroy the others in terms of speed).

lukeschlather12y ago

JSON is almost a subset of CSV, with the understanding that you have to wrap every line in [], the document in [], and every field must be quoted. (And JSON doesn't have built-in support for headings, so you need to write a little loop instead of the library building a hash for you.)

So no, if you control input and output, JSON is still easier to use than CSV, and just as performant. JSON stores straight arrays just fine. It's not the format's fault so many people choose to store hashes with it.
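
To make the array-of-arrays idea concrete, here is a minimal Python sketch. It assumes the first row is a header; the sample rows and variable names are made up for illustration and are not from the comment.

```python
import json

# Hypothetical sample data: first row is the header, the rest are records.
rows_out = [["name", "qty"], ["Widget, large", 2], ['Say "hi"', 5]]

# Every row is a JSON array, and the document is an array of rows.
document = json.dumps(rows_out)

# Reading it back: commas and quotes inside fields need no special handling.
rows_in = json.loads(document)
header, body = rows_in[0], rows_in[1:]

# The "little loop" that builds one dict per row from the header.
records = [dict(zip(header, row)) for row in body]
```

Unlike CSV, the numbers come back as numbers rather than strings, at the cost of quoting every text field.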

xxs12y ago

The idea that JSON is the substitute made me chuckle. JSON is more verbose to boot. CSV is a poor format but JSON is not a panacea; personally, I'd never use it for anything that's not web (browser) related.

keeperofdakeys12y ago

If you have simple data, why use something as complicated as JSON? For a recent project, I had a simple CSV file with an int and float per row; using JSON would probably double the datasize. I used a simple string.split(",") for the javascript decoder, because I control the data, and know it's safe. I don't need another javascript library (I'd probably do differently if I had a standard library, not a hodge-podge of scripts).

Sometimes, simplicity is better for everyone.
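
A small Python sketch of where the plain split approach holds and where it stops holding. The two sample lines are hypothetical; the second one stands in for data you do not control.

```python
import csv
import io

simple_line = "42,3.14"    # an int and a float, no quoting involved
quoted_line = '7,"1,5"'    # a quoted field that contains the delimiter

# Naive split: fine when you know no field can contain a comma.
naive_simple = simple_line.split(",")

# The same split cuts the quoted field in two and keeps the quote marks.
naive_quoted = quoted_line.split(",")

# A CSV-aware reader keeps the quoted field together.
parsed_quoted = next(csv.reader(io.StringIO(quoted_line)))
```

So the split is a reasonable choice exactly as long as the "I control the data" assumption stays true.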

wiredfool12y ago

Why not use both?

I've had to parse json embedded in a field in a csv file. Unquoted of course.

Until I explained to the other developer just how stupid that was.

jiggy201112y ago

Importing into excel is probably a big reason.

mantrax512y ago

If Excel compatibility is the goal, one should use libraries that read and produce Excel files.

CSV is bullshit, it's not good for anything except scenarios where you control both the export process, and the parser (so you know what delimiter is used and so on).

minimaxir12y ago

CSV is far, far more ubiquitous and much more usable in non-web settings. (e.g. desktop data analysis programs)

jgalt21212y ago

true, and in those settings you largely don't see situations that trip up naive parsers such as newlines or delimiters inside fields.

chrismcb12y ago

Because Excel doesn't export JSON. Because if you have a table, offering comma delimited field is easy. But really, people you work with give you csv files, and you don't have a choice.

jstsch12y ago· 9 in thread

So, which library? CSV is a mess.

qnaal12y ago

perl's Text::CSV

http://search.cpan.org/~makamaka/Text-CSV-1.32/lib/Text/CSV....

draegtun12y ago

Here's some links which (will always) point to latest versions on MetaCPAN:

http://p3rl.org/Text::CSV | http://p3rl.org/Text::CSV_XS

earino12y ago

I really have to second this. It's fast, it's smart, it handles almost anything, and it reliably gives good error messages. It's an important part of my data science toolbelt.

MonkeygetOP12y ago

No idea. I've been bitten by a library that turned out well into a project to have pernicious flaws.

It would be awesome if someone made a table with CSV features in one dimension and application/library behaviour in the other.

justincormack12y ago

The problem is that CSV is basically non-discoverable, as it has no metadata, e.g. about encoding or delimiters or locale... That's why Excel gives you a sample and lets you change delimiters etc. You can guess, or you can write something that seems to work for a particular input source. But it is best avoided.

mrweasel12y ago

I would say use the parser included in your programming environment, but I spent way too long looking for the non-existent CSV parser in .NET.

EpicEng12y ago

Not a single language I use on a routine basis includes a CSV parser in its standard library.

rwmj12y ago

ocaml-csv can handle anything Excel can throw at it (and throw it back). Worked out well in the real world.

https://forge.ocamlcore.org/projects/csv/

mholt12y ago

Have you tried Papa Parse?

slg12y ago· 7 in thread

CSVs are a headache. Like the article says, RFC4180 doesn't necessarily represent the real world. However, sometimes you just have to reject things that aren't spec.

Not too long ago I was struggling with one of these CSV issues and received some good advice from Hans Passant [1] on a Stack Overflow question pertaining to my problem (emphasis mine):

"It is pretty important that you don't try to fix it. That will make you responsible for bad data for a long time. Reject the file for being improperly formatted. If they hassle you about it then point out that it is not RFC-4180 compatible. There's another programmer somewhere that can easily fix this."

It makes perfect sense in hindsight. If you accept a malformed CSV file, people will expect you to accept any malformed data that has a CSV extension. You are taking on a lot of extra responsibility to cover for the lack of work by another programmer. Odds are they can make a change to fix the problem that takes a fraction of the time it would take you to work around it. You just have to raise the issue.

I realize that rejecting bad files isn't really possible in every circumstance. But I have a feeling it is an option more times than you might initially think.

[1] - http://stackoverflow.com/users/17034/hans-passant
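
One way to act on the "reject, don't fix" advice is sketched below in Python, using the standard csv module in strict mode plus a field-count check. The names `RejectedFile` and `load_strict` are hypothetical, and the fixed field count is an assumption about the import, not something RFC 4180 requires.

```python
import csv
import io


class RejectedFile(ValueError):
    """Raised instead of guessing what a malformed file meant."""


def load_strict(text, expected_fields):
    # strict=True makes the csv module raise csv.Error on bad quoting
    # instead of silently accepting it.
    reader = csv.reader(io.StringIO(text, newline=""), strict=True)
    rows = []
    try:
        for row in reader:
            if len(row) != expected_fields:
                raise RejectedFile(
                    f"line {reader.line_num}: expected {expected_fields} "
                    f"fields, got {len(row)}"
                )
            rows.append(row)
    except csv.Error as exc:
        # Report where the problem is so the producer can fix it.
        raise RejectedFile(f"line {reader.line_num}: {exc}") from exc
    return rows
```

The error message carries the line number, so the rejection can be handed straight back to whoever produced the file.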

barrkel12y ago

On the other hand, the ability to handle all kinds of input can be a chief selling point of your product.

In my current job, the most common "invalid" CSV format we get is .xlsx files.

So I wrote an .xlsx parser (way, way faster than Apache POI).

Another interesting hiccup to consider is CSV inside individual fields - i.e. recursive CSV. There are various ways to handle this, but in my company's line of business the usual route is to duplicate that line once per CSV element found in the field.

Likely the next invalid format we'll have to parse is PDFs containing tables...
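
The "duplicate the line once per CSV element found in the field" route for recursive CSV could look roughly like this in Python. The function name `explode` and the sample row are hypothetical; the actual implementation described in the comment is not shown in the thread.

```python
import csv
import io


def explode(row, index):
    """Return one copy of `row` per CSV element found in row[index]."""
    # Parse the nested field as a one-line CSV document; an empty field
    # yields a single empty element so the row is not dropped.
    inner = next(csv.reader(io.StringIO(row[index])), [""])
    return [row[:index] + [item] + row[index + 1:] for item in inner]


expanded = explode(["order1", "red,green", "9"], 1)
```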

mschuster9112y ago

> Likely the next invalid format we'll have to parse is PDFs containing tables...

cough people doing e-invoicing with pdf's...

Mister_Snuggles12y ago

> Likely the next invalid format we'll have to parse is PDFs containing tables...

And after that you will have to parse PDFs containing scans (as images, not text) of pages containing tables...

brongondwana12y ago

Yeah, that's a great idea if you can.

I was pulling data from a medical system that I knew full well I would not be able to get changes into for YEARS (and I got to meet the vendor, who was working on a shiny new XML export system - I wonder if that has quoting issues too - it wasn't released by the time I finished working the project)

So I wound up writing perl that knew enough to fix all the common problems with the source data, and emailed me any odd lines it couldn't cope with, so I could go in and update the regular expressions. It kinda sucked, but the end result was better antibiotic coverage for a bunch of people. Worst case of a line it really couldn't handle was that person didn't get the benefit of an expert system checking that they didn't have doubled-up medicines, which is no worse than they would have had without this system.

mholt12y ago

This is good advice. It's the philosophy adopted by Papa Parse, http://papaparse.com - try to gracefully handle malformed CSV and report all errors so they're obvious and actually helpful by telling the user where the syntax error is at.

Trying to compensate for bad CSV format will more likely cause headaches and confusion rather than clarity. It can also discourage the need for CSV writers to be rigorous about their output formatting.

muteh12y ago

I've never really got my head round RFCs, but 4180 is only informational, not a standard. I have used exactly your argument before though, and will again. Have also been on the other side and needed to convert horribly inconsistent data to fit it.

userbinator12y ago

It's not a standard but it should very well be one, in my opinion. Whenever someone has to process CSV I always point to it to make them aware of it, that a lot of subtle points like escaping have already been defined to be done in one way. There is no good reason to NOT follow RFC4180 if you want to produce/consume CSV.

huherto12y ago· 7 in thread

CSV works for simple cases. It is trivial to parse, you shouldn't even need a library.

If there are many "what ifs" like in the posted article, you probably need another format like JSON (preferably) or XML.

EpicEng12y ago

Yet in the real world you don't often have such luxury. Often times that inconsistent CSV file that you have to parse is not of your own creation. It comes from some other data source, or perhaps you have multiple data sources spitting out their own variants. You just need to get the job done, and splitting on ',' won't work.

daigoba6612y ago

Off topic, but why JSON over XML? What are the technical advantages for using JSON instead of XML (and don't say anything about "human readable"). If you're consuming the data with JavaScript, I'll grant you that JSON has quite an edge. But most every language has standard libs for XML. Both are easy to parse, but XML is easier to validate given a schema definition.

skybrian12y ago

JSON is considerably more compact, especially if you use lists instead of maps. For a list of numbers, there is only one character of overhead per item. For a list of strings, it's three characters per item.

Of course you can embed comma-separated lists in XML, but with JSON it will parse them for you.

(And of course it's not as good as a protobuf, but not bad for a text format.)
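
The per-item overhead is easy to check. A minimal Python sketch, assuming compact separators and strings that need no escaping; the helper name `total_overhead` and the sample lists are made up for illustration.

```python
import json


def total_overhead(items):
    """Characters in the compact JSON encoding that are not item payload."""
    compact = json.dumps(items, separators=(",", ":"))
    payload = sum(len(str(item)) for item in items)
    return len(compact) - payload


numbers = [10, 20, 30, 40]
strings = ["ab", "cd", "ef", "gh"]

# Numbers: 3 commas plus the 2 enclosing brackets.
number_overhead = total_overhead(numbers)

# Strings: 8 quote marks, 3 commas, plus the 2 enclosing brackets.
string_overhead = total_overhead(strings)
```

For n items that works out to n + 1 characters for numbers and 3n + 1 for strings, which matches the "one" and "three" characters per item quoted above once the list is long enough for the brackets not to matter.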

bartonfink12y ago

Because of closing tags, XML is approximately twice as noisy as JSON encoding. If response sizes and the network traffic they entail are a concern, JSON is worth considering. If response sizes + validation are necessary, something like Protocol Buffers or Thrift may also fit, as they are widely supported as well.

tetha12y ago

If your data touches the network, conforming XML parsing is an attack vector (billion laughs, external entity exploits, ...) and non-conforming XML parsing ends up being a headache. Even more, the sheer absurd complexity of XML contains so much stuff, who knows how many more exploits by specification are in there.

chrismcb12y ago

Well a lot of times you are reading in a csv file given to you by someone else. You can't expect it to be "simple" and you need to parse it correctly

mholt12y ago

JSON is arguably more complex than CSV. Though it is at least well-defined (mostly).

However, that doesn't excuse sloppy CSV writers.

mrweasel12y ago· 6 in thread

The most retard structure I've seen in a CSV file relates to the "What if the character separating fields is not a comma?".

We get "CSV" files from Klarna, an invoicing company, with the payments they've processed for us. Because we're Danish and they are Swedish, it's not really weird that they would use comma as the decimal separator. So to compensate for having used the comma, they for some reason pick ", " (that's comma + space) as the field separator. Most good csv parsers can handle the field separator being any character you like, as long as it's just ONE character. By picking a two-character separator they've just dictated that I write my own or resort to just splitting a line on ", ".

callesgg12y ago

In general, Swedish csv files are separated with ;

I have a function that I usually use in projects that counts , and ; on each line to determine which one is most likely being used in the file.

The most annoying thing I have found in csv files is the escape sign. I would like it to be \" but very often I see """ as the escape for ".
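
A counting heuristic like the one described might look like this in Python. This is one possible reading of it (summing the counts over all lines), and the name `guess_delimiter` is hypothetical.

```python
def guess_delimiter(text, candidates=(",", ";")):
    """Pick the candidate character that occurs most often in the text."""
    counts = {candidate: 0 for candidate in candidates}
    for line in text.splitlines():
        for candidate in candidates:
            counts[candidate] += line.count(candidate)
    # On a tie, the first candidate wins.
    return max(candidates, key=lambda candidate: counts[candidate])


swedish_style = "a;b;c\n1,5;2,5;3\n"
plain_style = "a,b,c\n1,2,3\n"
```

It can still be fooled, for example by a semicolon-separated file with many decimal commas per field. Python's standard library also ships `csv.Sniffer`, which attempts a similar guess.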

chrismcb12y ago

Well the standard says to double the double quote to escape it. If you are seeing \" or even """ to escape a quote, then someone isn't following the standard.
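
For reference, here is what the doubled-quote convention looks like when produced by Python's csv module with its default settings. The sample field is made up.

```python
import csv
import io

# Write one row whose first field contains double quotes.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(['She said "hi"', "plain"])
encoded = buffer.getvalue()

# Read it back with the default reader, which undoes the doubling.
decoded = next(csv.reader(io.StringIO(encoded, newline="")))
```

The encoded line is `"She said ""hi""",plain`. Note that a quoted field ending in a quote character ends in three quotes in a row, which may be where some of the `"""` sightings come from.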

mrweasel12y ago

I believe that the Swedish Klarna files use ; but the Danish ones use the comma + space. That only adds to the stupidity though, why not have just one format?

tikumo12y ago

it can be irritating, but you can just as easily convert ", " to "|" or something by simple string replacement before parsing.

mrweasel12y ago

True, but in my mind picking ", " indicates to me that they don't care or don't know what they're doing. I often run into something similar with XML. I've had more than one partner call or write me saying that the elements in a file are not in the right order. Every single time they've admitted to not actually using an XML parser.

Don't do things that screw up the standard tools other developers depend on.

sunir12y ago

Think it through. What if there is free text in the field? "How are you, Sally?"

seanwoods12y ago· 5 in thread

This article makes it much more complicated than it needs to be. It tries to be all things to all people. In practice you're going to have to sacrifice some functionality for the sake of usability and your own sanity.

When I add a CSV import feature to a project I'm working on, I tell people "this works with MS Excel flavor of CSV." This covers most, if not all, real world cases because in my world the people who want to import data are non-programmer types who all use Excel.

I'll often include the basic rules in the screen that accepts the import. If I ever had to accept data from something that was _not_ Excel I'd probably include a combo box on the web form that lets you pick the dialect. So far I haven't had to do that.

The only thing I might not be totally covering is how Excel handles newlines, but in practice I've never had to deal with that.

alayne12y ago

I found out that Windows Excel and Mac OS Excel use different character encodings for CSV.

leni53612y ago

Does it work with Hungarian Ms Excel? It uses semicolons as delimiters.

Moto745112y ago

If all you care about is Excel compatibility you can add "sep=," on the first line. You can also use the Text Import Wizard. Changing the extension to .txt should cause Excel to show the Wizard upon opening the file.
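
A minimal Python sketch of writing a file with the `sep=` hint on the first line. The function name `to_excel_csv` and the choice of `;` as the default delimiter are assumptions for illustration.

```python
import csv
import io


def to_excel_csv(rows, delimiter=";"):
    """Serialize rows with a leading sep= line naming the delimiter."""
    buffer = io.StringIO(newline="")
    # The hint line comes first, before any header or data row.
    buffer.write(f"sep={delimiter}\r\n")
    csv.writer(buffer, delimiter=delimiter).writerows(rows)
    return buffer.getvalue()


output = to_excel_csv([["a", "b"], ["1,5", "2"]])
```

The `sep=` line is an Excel convention rather than part of RFC 4180, so other readers will most likely treat it as an ordinary data row.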

chrismcb12y ago

What is wrong with trying to be all things to all people? If you use a good solid library you don't need to tell people "this works with some versions of MS Excel" And that is the main point of the article.

blablabla12312y ago

Makes total sense to focus on the format the user actually uses. Still...why don't you use a library?

foxhill12y ago· 2 in thread

as the article mentions, CSV is not well defined. libraries are.. well, different. you'd spend as much time becoming familiar with one as you would writing a basic parser.

commas don't delimit field entries? CSV -> comma separated values.

new lines inside a field? i've never written a parser that would be foiled by this. could be an issue if you use a built-in tokeniser (e.g strtok, etc.). be aware.

variable number of fields? you’re probably writing this for something with an expected input form. throw errors if you see something you do not accept. make sure you catch them.

ascii/unicode? yea. it’s a fucking mess. everywhere.

just do it. handle failure gracefully. learn from your mistakes. don't be naive. consider a library if the (risk of failure):(time) ratio is skewed the wrong way. the only time i would absolutely insist that a 3rd party library be used is when crypto is involved. even then, be aware that they are not perfect.

absolutely ignore people whose argument is along the lines of "you are not smart enough to implement this standard. let someone else do it.”. fuck everything about that statement, and its false sense of superiority.

nothing comes for free. whether you use a library, or do your own thing, you’re going to run into problems.

B-Con12y ago

> absolutely ignore people who's argument is along the lines of "you are not smart enough to implement this standard. let someone else do it.”. fuck everything about that statement, and it’s false sense of superiority.

In general it's not about being smart enough (although for some complicated standards maybe it's true), but rather biting off more than you realize. Everything sounds simple before you find the edge case implementation issues and have to rework and rethink a bunch of hard issues that a dozen people have already thought through. Doing it yourself is on the table, but rarely the most efficient decision.

Roboprog12y ago

On the one hand, I hear you. On the other hand...

Too many "enterprise" coworkers who don't know how to write a finite state machine. They do need to use a library.

iagooar12y ago· 2 in thread

"CSV is not a well defined file-format. The RFC4180 does not represent reality. It seems as every program handles CSV in subtly different ways. Please do not inflict another one onto this world. Use a solid library."

I can't but disagree when I read stuff like this. Why shouldn't I release a library if I think it's good enough for the community? Even the powerful and versatile Ruby library for CSV parsing started as a gem from a person who didn't give a s... about advice like "do not inflict another one into this world".

chrismcb12y ago

IF your library is a solid library, then release it. What he is saying though, is don't roll your own if you can use a solid library. And if a good solid library exists, why bother writing your own?

iagooar12y ago

> And if a good solid library exists, why bother writing your own?

Because, you know, learning, having fun and stuff.

kemayo12y ago· 2 in thread

We actually use CSV-reading as an incidental part of a hiring exercise. We provide a really simple homemade CSV parser as part of a PHP project, with a "could you find and fix bugs in this?" instruction. The way to get full marks is to rip out the parser and replace it with the appropriate standard library function.

ohwaitnvm12y ago

I like this.

Only thing that I don't like is that many candidates will assume that they have to fix the code within the parser, given those instructions, even if they know that a battle-tested library is how they would actually do it. I hope you accept an off-hand comment such as, "ew, why is this hand-rolled" as a sufficient indicator in favor of your solution.

kemayo12y ago

Such a comment would be acceptable, yeah. So long as we can tell they looked at it and thought "wow, that might go incredibly wrong"...

gavinpc12y ago· 1 in thread

My most popular stackoverflow answer [1] includes a CSV writer and reader. Yeah, I'd clean it up a little if I were doing it now (return enumerator instead of array, etc). But people keep using it.

It uses regex lookaheads to deal with quoting, so it's not 100% portable. But it's only about one page.

As for the other things mentioned by the OP (BOM, encoding), those should be handled by the stream, and are not the province of CSV per se.

[1] http://stackoverflow.com/a/769713/4525

EvanPlaice12y ago

Regex lookaheads are more efficient because you're copying everything between terminal chars at once as opposed to one char at a time.

Unnecessary string copy operations are what make the parser slow.

encoderer12y ago· 1 in thread

Early on in my career, just a year out of school, I, for some absurd reason, had the idea to build my own date library.

Primarily, I didn't fully understand the date objects and functions available in the languages/libraries I was using, so simple things like formatting a string date seemed difficult to me.

This was an awful idea. Dreadful.

I came up with all sorts of delightful helper methods to cover common use cases like adding one month to the current date. I made this decision to represent dates internally with a timestamp, so adding a month is easy, right?! No. ...What's 1 month from January 31st? February 28th? Well then what's 1 month from February 28th? The list of edge cases goes on.

Most things in life are more complicated than they, at first, seem.
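
The January 31st question can be made concrete. Below is a Python sketch of one common policy, clamping to the last day of the target month. The `add_months` helper is hypothetical and the clamping rule is an assumption; other policies, such as rolling over into the next month, are equally defensible.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Add calendar months, clamping the day to the target month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month_zero_based = divmod(index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


jan31 = date(2014, 1, 31)
one_then_one = add_months(add_months(jan31, 1), 1)  # via February 28th
two_at_once = add_months(jan31, 2)
```

Even with this policy, adding one month twice lands on March 28th while adding two months at once lands on March 31st, so the operation does not compose.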

dceddia12y ago

Especially dates.

EpicEng12y ago· 1 in thread

> What if the character separating fields is not a comma?

> Not kidding.

We'd all be better off really, but that ship has sailed. Using CSV for data which is only ever read by a machine is a dumb decision. Use the RS (record separator) character and many of these ambiguities disappear.

Of course, like I said, that ship has sailed. If you want your data to be read nicely by other programs you're probably stuck with CSV, TSV, or something similar.
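
A minimal Python sketch of the record separator idea. The comment only names RS (0x1E); pairing it with the ASCII unit separator (0x1F) between fields is an assumption here, as are the function names. The sketch also assumes neither control character ever appears inside a field.

```python
UNIT_SEP = "\x1f"    # ASCII US, used here between fields (an assumption)
RECORD_SEP = "\x1e"  # ASCII RS, used between records


def encode(rows):
    """Join fields with US and records with RS; no quoting or escaping."""
    return RECORD_SEP.join(UNIT_SEP.join(fields) for fields in rows)


def decode(text):
    """Inverse of encode, valid while fields contain neither separator."""
    if not text:
        return []
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


# Commas, quotes and newlines in fields need no special treatment.
rows = [["a,b", 'say "hi"'], ["line\nbreak", ""]]
round_trip = decode(encode(rows))
```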

brianpgordon12y ago

On the other hand, there's definitely some value to being able to directly inspect and alter your data in a text editor. It would be nice to not have to deal with unprintable characters.

Sami_Lehtinen12y ago· 1 in thread

Parsing CSV is easier than handling XML or JSON. I do integrations as my job and the most common format used is CSV because it's handy, simple and reliable compared to other formats. That is exactly the reason why ini and props files are also preferred over a database for data which isn't too volatile or big. Anyone can open the datafile and see what's stored and what's wrong.

JensRantil12y ago

Have a look at `jq`: https://stedolan.github.io/jq/ It makes working with JSON a breeze.

michaelmior12y ago· 1 in thread

The best tool I've found for working with CSV files is csvkit[1]. I've run into some of the issues mentioned in the article and it's handled them all gracefully. It's basically a bunch of scripts mirroring sort, grep, cut, etc. but specifically for dealing with CSV files.

[1] http://csvkit.readthedocs.org/

voltagex_12y ago

Hey, this looks good. I've also used csvfix [1] to get me out of trouble before.

[1] http://neilb.bitbucket.org/csvfix/

michaelfeathers12y ago· 1 in thread

Easy to write, hard to read. Perfect illustration of an emergent case of Postel's Law.

astrobe_12y ago

To me it's more the perfect illustration of the broken window theory.

joshvm12y ago· 1 in thread

I trust Numpy a lot for CSV handling. It deals with lots of edge cases including missing data, weird delimiters (pipes '|' are popular in astro for some reason) and massive files. If in doubt, whack it into Excel which has been doing this stuff for decades now. I prefer using Numpy to Python's CSV library which I find a bit clunky.

Very little data is actually true CSV.

The code isn't particularly long (~900 lines), it's Python (hence readable) and it's well commented:

https://github.com/numpy/numpy/blob/v1.8.1/numpy/lib/npyio.p...

jasode12y ago

>weird delimiters (pipes '|' are popular in astro for some reason)

I can only guess that since it's astronomy data and constellation coordinates have decimal places, it's best to avoid the comma character because some countries use it as a decimal separator.

http://en.wikipedia.org/wiki/Decimal_mark

codingdave12y ago· 1 in thread

It's a flippin' CSV.

Of course you can come up with scenarios where it doesn't work, but anyone who considers themselves to be a competent programmer should be able to deal with these issues, use another data format, or just talk to whomever is giving you the data to correct their data issues.

Seriously, the overwhelming CSV-bashing in these comments really makes me worry that coders just can't handle the basics anymore.

SoftwareMaven12y ago

It's not a question of can, it's a question of should. If any engineer on my team came to me and told me he was building a CSV reader/writer, I would seriously question his judgement as an engineer[1]. My thoughts would be that either he isn't capable of seeing obvious challenges in building a "simple" CSV feature or he isn't able to prioritize his time well, focusing on useless toys at the expense of getting important work done.

1. Of course there are exceptions to the rules: perhaps the CSV is malformed or there are special considerations in the backend, but the general point stands.

yp_all12y ago· 1 in thread

Post a sample .csv file you believe is too difficult.

I will solve your problem with only UNIX utilities. And I'm sure others will solve it other ways.

Usually I only need sed and tr. Sometimes lex or AWK.

Arguing about something without ever pointing to an example accomplishes nothing; it's just whining.

Post an example.

Thank you.

josephlord12y ago

It isn't that any particular file is difficult but that the variations that you haven't even thought about might catch you out. It is the deceptive simplicity of the samples that you have at hand that may catch out your code when it hits a different (also simple but different) example in the field.

qwerty_asdf12y ago

Garbage in? Garbage out. You give me a shitty file, you get shitty results. Tough shit.

None of these questions are particularly daunting. CSV means "comma separated values", so if you want to play games and use other delimiters, please fuck off. If it's not a comma, then guess what: it's not delimited. New line characters are well-known, and well-understood, across all platforms and easy to detect. If you manage to fuck that up in your file, then take a look in the mirror, because the problem is you. Enforcing the practice of enclosing the target data in quotation marks among users is a good idea. It's something that should be supported and encouraged, and ignored at one's own risk.

Additionally, employing an escape character (such as backslash) to allow for the use of a quotation mark within enclosing quotation marks is a nice feature to add in. After that, the concept of a CSV file has provided enough tools, to tolerate [an arbitrarily large percentage] of all use cases. If you need something more robust, XML is thataway.

Dorian-Marie12y ago

> Ruby CSV library is 2321 lines.

If you look at lib/csv.rb [1] it's:

* 2325 Lines

* 2161 Non-blank lines

* 950 Lines of Code

[1]: https://github.com/ruby/ruby/blob/trunk/lib/csv.rb

jimeh12y ago

Personally I know the pain of creating a CSV parser. In late 2006 I was working on a PHP project that required a CSV parser, and what was available at the time did not come close to cutting it. So I created my own parser/generator, which among many other things included automatic delimiter character detection. It was a rather painful project to create, but I learned a lot, and found the experience really fun.

Overall I agree with the article, there's no point in reinventing the wheel if there are libraries out there. And CSV specifically is a horribly complex format to deal with. But sometimes rolling your own is the best and/or only choice you have, and you might come out the other end enjoying the experience, and having learned a lot.

As for what happened to my old CSV parser? It ended up being quite popular, but stuck in the dark ages as I'd mostly moved on from PHP years ago. But thanks to a contributor, we've recently put renewed effort into bringing the project into modern times: https://github.com/parsecsv/parsecsv-for-php

p0nce12y ago

Can we stop being liberal in what we accept from others? It only leads to an unfixable mess.

mooreds12y ago

This goes for most complex problems. The first step of any dev problem should be to make sure you understand the problem, the second to map out the main pieces and the third to make sure you are leveraging every (well maintained) library possible. There are, of course, issues with dependencies and tying yourself to code you didn't write, but what would you rather depend on--code that has had tens or hundreds of eyes on it, or code that you, and maybe one or two team members, have reviewed?

Rabidgremlin12y ago

One of my first open source projects was a JDBC driver that read CSV files. It started simply enough but once you started adding in support for all the quirks things became really complicated really quickly. Just check out all the "options" for the driver that have been added by the community over the last 14ish years http://csvjdbc.sourceforge.net/doc.html

Roboprog12y ago

CSVs were simpler back in the 80s, when there were a few products (e.g. - Lotus 123, xBASE) that all wrote RFC 4180 compliant text (and I'm pretty sure there was no RFC 4180 yet)

No alternate delimiters, no backslashes.

Now I have to put up with offshore staff trying to use apostrophes (') instead of quotes (") :-(

Barring alternate delimiters, and disallowing newlines* in fields, I can write the parser for 4180 in about 30 lines of perl, reading a char at a time and flipping between about 4 states. (avoids getting root access and days of paperwork to install from CPAN)

* disallowing newlines in the data is admittedly a big restriction, but it works for many use-case/applications, and allows the caller to pull in a line before calling the parse function.
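
The char-at-a-time, four-state approach translates to other languages too. Here is a Python sketch rather than the perl described; the state names and the function name `parse_line` are hypothetical. It assumes the caller has already pulled in one line and stripped its line ending, with comma as the only delimiter and no newlines inside fields.

```python
FIELD_START, UNQUOTED, QUOTED, QUOTE_SEEN = range(4)


def parse_line(line):
    """Split one line into fields, honoring quotes and doubled quotes."""
    fields, current, state = [], [], FIELD_START
    for ch in line:
        if state == FIELD_START:
            if ch == '"':
                state = QUOTED
            elif ch == ",":
                fields.append("")          # empty field
            else:
                current.append(ch)
                state = UNQUOTED
        elif state == UNQUOTED:
            if ch == ",":
                fields.append("".join(current))
                current, state = [], FIELD_START
            else:
                current.append(ch)
        elif state == QUOTED:
            if ch == '"':
                state = QUOTE_SEEN         # closing quote or first of a pair
            else:
                current.append(ch)
        else:  # QUOTE_SEEN
            if ch == '"':
                current.append('"')        # doubled quote is a literal quote
                state = QUOTED
            elif ch == ",":
                fields.append("".join(current))
                current, state = [], FIELD_START
            else:
                raise ValueError("unexpected character after closing quote")
    if state == QUOTED:
        raise ValueError("unterminated quoted field")
    fields.append("".join(current))        # the last field has no trailing comma
    return fields
```

An unterminated quote is reported as an error instead of being read across into the next line, which is where the no-newlines restriction shows up.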

For Java, the "Ostermiller" library is pretty good for CSV handling, and has a few options for dealing with freaky variants.

collyw12y ago

I think this example is relevant to many seemingly trivial problems, where the task seems simple but becomes complex once you think about the details a bit more.

I was trying to get Perl tar libraries working, when my colleague asked why I don't just use backticks to do it in the shell. Basically because I don't know that much about tar. I can use it to untar a file, or create a new archive. Someone else who has written a library probably has taken the time to read through the whole manual and make it work nicely. They know the errors and warnings, and have abstracted that to a sensible level hopefully. They have thought about these things, so hopefully I won't have to.

winter_blue12y ago

A good and performant alternative to CSVs is Google's protocol buffers: https://code.google.com/p/protobuf/

izietto12y ago

But it is still by far the most readable text data format out there. Which is the reason for its wide adoption. I'll be downvoted, but I really believe in this.

NaNaN12y ago

Why CSV is not just for readability? I think RFC is sometimes too pedantic, that it let CSV can handle both plain text and binary files. COMMA is not just a COMMA, but a COMMA not in different environments. Why should we use the phrase CSV or Comma Separated Values just for RFC?

CSV or Comma Separated Values are not only for RFC, but also for EVERYONE who wants to use this word or phrase. Pedantry sucks!

mrcozz12y ago

We switched from a CSV based delivery to Apache Avro files. These are binary files which have the record schema embedded in the file header. We're pretty happy with this solution for the time being and it seems to be an awesome alternative to CSV. I wonder if anyone else is doing something similar? Good article but I'd appreciate if the author gave some alternatives.

mschuster9112y ago

I usually take advantage of the fixed formats of the individual exporting tool. Everyone does it a bit differently - so what? I have a PHP parser for it and adapt it for each of my clients. It's cheaper to have a small parser, adapted to the customer's needs, than to have one 10k SLOC library to handle a boatload of files...

mantis36912y ago

CSV is really slow to work with, because you have to check for well-formedness, like you do with XML. And in the end, I always end up making specific concessions for the files that my customers use (which must be patched again and again) or having to take a hard stance on what can and can't be in the "CSV" files.

aubergene12y ago

Mike Bostock's DSV library handles pretty much all of the cases listed for encoding and decoding. Written in JavaScript, in 116 lines.

https://github.com/mbostock/dsv/blob/master/dsv.js

justifier12y ago

i recently needed to deal with a ~4G xml file.. i tried a parser but after waiting thirty minutes for it to load i decided to parse out the bits i needed manually with a bash script

knowing my needs i could easily account for all possible muck ups and avoid the instances where ambiguity could play a part

i was then able to use the bits i pulled out of the ~4G file, now 16M, in the parser with all of its assurances

sure, edge cases justify using a tried and true library for generics, but there are also edge cases that justify mocking up your own naive implementation.. if only, like in my case, to make the data usable in such a library

minimaxir12y ago

Most likely prompted by discussion on https://news.ycombinator.com/item?id=7794684

kabdib12y ago

CSV: Where the only way to win is not to play . . .

neoyagami12y ago

This article represents all my feelings when my boss says "just write a csv parser for this, it's just csv, so it ain't that hard".

itamarhaber12y ago

A non-standard standard is always a sure way to shoot yourself in the foot. Endianness also causes some confusion...

epeus12y ago

Never ever use csv to export. Use tab separated, as it takes work to type a tab in excel.


xxs12y ago

keeperofdakeys12y ago

Sometimes, simplicity is better for everyone.

wiredfool12y ago

Why not use both?

I've had to parse JSON embedded in a field in a CSV file. Unquoted, of course.

Until I explained to the other developer just how stupid that was.

jiggy201112y ago

Importing into excel is probably a big reason.

mantrax512y ago

If Excel compatibility is the goal, one should use libraries that read and produce Excel files.

CSV is bullshit, it's not good for anything except scenarios where you control both the export process, and the parser (so you know what delimiter is used and so on).


minimaxir12y ago

CSV is far, far more ubiquitous and much more usable in non-web settings. (e.g. desktop data analysis programs)

jgalt21212y ago

true, and in those settings you largely don't see situations that trip up naive parsers such as newlines or delimiters inside fields.

chrismcb12y ago

Because Excel doesn't export JSON. Because if you have a table, offering comma delimited field is easy. But really, people you work with give you csv files, and you don't have a choice.

jstsch12y ago· 9 in thread

So, which library? CSV is a mess.

qnaal12y ago

perl's Text::CSV

http://search.cpan.org/~makamaka/Text-CSV-1.32/lib/Text/CSV....

draegtun12y ago

Here's some links which (will always) point to latest versions on MetaCPAN:

http://p3rl.org/Text::CSV | http://p3rl.org/Text::CSV_XS

earino12y ago

I really have to second this. It's fast, it's smart, it handles almost anything, and it reliably gives good error messages. It's an important part of my data science toolbelt.

MonkeygetOP12y ago

No idea. I've been bitten by a library that turned out, well into a project, to have pernicious flaws.

It would be awesome if someone made a table with CSV features in one dimension and application/library behaviour in the other.

justincormack12y ago

mrweasel12y ago

I would say use the parser included in your programming environment, but I spent way too long looking for the nonexistent CSV parser in .NET.

EpicEng12y ago

Not a single language I use on a routine basis includes a CSV parser in its standard library.


rwmj12y ago

ocaml-csv can handle anything Excel can throw at it (and throw it back). Worked out well in the real world.

https://forge.ocamlcore.org/projects/csv/

mholt12y ago

Have you tried Papa Parse?

slg12y ago· 7 in thread

CSVs are a headache. Like the article says, RFC 4180 doesn't necessarily represent the real world. However, sometimes you just have to reject things that aren't to spec.

Not too long ago I was struggling with one of these CSV issues and received some good advice from Hans Passant [1] on a Stack Overflow question pertaining to my problem (emphasis mine):

I realize that rejecting bad files isn't really possible in every circumstance. But I have a feeling it is an option more times than you might initially think.

[1] - http://stackoverflow.com/users/17034/hans-passant

barrkel12y ago

On the other hand, the ability to handle all kinds of input can be a chief selling point of your product.

In my current job, the most common "invalid" CSV format we get is .xlsx files.

So I wrote an .xlsx parser (way, way faster than Apache POI).

Likely the next invalid format we'll have to parse is PDFs containing tables...

mschuster9112y ago

> Likely the next invalid format we'll have to parse is PDFs containing tables...

cough people doing e-invoicing with pdf's...

Mister_Snuggles12y ago

> Likely the next invalid format we'll have to parse is PDFs containing tables...

And after that you will have to parse PDFs containing scans (as images, not text) of pages containing tables...

brongondwana12y ago

Yeah, that's a great idea if you can.

mholt12y ago

muteh12y ago

userbinator12y ago

huherto12y ago· 7 in thread

CSV works for simple cases. It is trivial to parse, you shouldn't even need a library.

If there are many "what ifs" like in the posted article, you probably need another format like JSON (preferably) or XML.

EpicEng12y ago

daigoba6612y ago

skybrian12y ago

Of course you can embed comma-separated lists in XML, but with JSON it will parse them for you.

(And of course it's not as good as a protobuf, but not bad for a text format.)

bartonfink12y ago

tetha12y ago

chrismcb12y ago

Well a lot of times you are reading in a csv file given to you by someone else. You can't expect it to be "simple" and you need to parse it correctly

mholt12y ago

JSON is arguably more complex than CSV. Though it is at least well-defined (mostly).

However, that doesn't excuse sloppy CSV writers.

mrweasel12y ago· 6 in thread

The most retard structure I've seen in a CSV file relates to the "What if the character separating fields is not a comma?".

callesgg12y ago

In general, Swedish CSV files are separated with ;

I have a function that I usually use in projects that counts , and ; on each line to determine which one is most likely being used in the file.

The most annoying thing I have found in CSV files is the escape sign. I would like it to be \" but very often I see """ as the escape for "
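A rough sketch of the counting approach described above, in Python. The function name `guess_delimiter` and the sample data are made up for illustration, and this is only one plausible reading of "counts , and ; on each line": each line votes for whichever candidate it contains most. It ignores quoting, so a quoted field full of commas can still skew the vote.

```python
from collections import Counter


def guess_delimiter(lines, candidates=(",", ";")):
    """Guess the delimiter by letting every line vote for its most frequent candidate."""
    votes = Counter()
    for line in lines:
        counts = {c: line.count(c) for c in candidates}
        best = max(candidates, key=lambda c: counts[c])
        if counts[best] > 0:
            votes[best] += 1  # lines containing no candidate at all do not vote
    # Fall back to the first candidate when nothing voted (e.g. empty input).
    return votes.most_common(1)[0][0] if votes else candidates[0]


# Hypothetical sample: semicolon-separated, with commas used as decimal marks.
sample = ["namn;pris;antal", "kaffe;39,90;2", "te;24,50;1"]
print(guess_delimiter(sample))
```

Voting per line rather than summing over the whole file keeps one long free-text line from deciding the result on its own.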

chrismcb12y ago

Well the standard says to double the double quote to escape it. If you are seeing \" or even """ to escape a quote, then someone isn't following the standard.
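For what it's worth, here is what the doubled-quote rule looks like in practice, using Python's standard `csv` module with its default dialect. The sample values are invented; the point is only that a literal " inside a field is written as "" and read back as a single ".

```python
import csv
import io

# Write one record whose first field contains literal double quotes.
buf = io.StringIO()
csv.writer(buf).writerow(['She said "hi"', "a,b", "plain"])
encoded = buf.getvalue()
print(encoded)

# Read the same text back; the doubled quotes collapse to single ones.
decoded = next(csv.reader(io.StringIO(encoded)))
print(decoded)
```

Only the fields that need it (those containing the quote character or the delimiter) get wrapped in quotes; `plain` is written bare.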

mrweasel12y ago

I believe that the Swedish Klarna files use ; but the Danish ones use comma + space. That only adds to the stupidity though, why not have just one format?


tikumo12y ago

it can be irritating, but you can just as easily turn ", " into "|" or something with a simple string replace before parsing..

mrweasel12y ago

Don't do things that screw up the standard tools other developers depend on.


sunir12y ago

Think it through. What if there is free text in the field? "How are you, Sally?"
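To make the free-text problem concrete, a small Python comparison between splitting on commas and using a quote-aware reader. The record is made up around the example sentence; `csv.reader` is the standard library reader with default settings.

```python
import csv

# One record, three fields; the middle field is free text containing a comma.
line = '42,"How are you, Sally?",greeting'

naive = line.split(",")               # knows nothing about quoting
parsed = next(csv.reader([line]))     # quote-aware

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split yields four pieces and leaves stray quote characters in two of them; the quote-aware reader yields the three intended fields. Any pre-parse string replacement on the raw line has the same blind spot as the split.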


seanwoods12y ago· 5 in thread

The only thing I might not be totally covering is how Excel handles newlines, but in practice I've never had to deal with that.

alayne12y ago

I found out that Windows Excel and Mac OS Excel use different character encodings for CSV.

leni53612y ago

Does it work with Hungarian MS Excel? It uses semicolons as delimiters.

Moto745112y ago

chrismcb12y ago

blablabla12312y ago

Makes total sense to focus on the format the user actually uses. Still...why don't you use a library?

foxhill12y ago· 2 in thread

as the article mentions, CSV is not well defined. libraries are.. well, different. you'd spend as much time becoming familiar with one as you would writing a basic parser.

commas don't delimit field entries? CSV -> comma separated values.

new lines inside a field? i've never written a parser that would be foiled by this. could be an issue if you use a built-in tokeniser (e.g strtok, etc.). be aware.

variable number of fields? you’re probably writing this for something with an expected input form. throw errors if you see something you do not accept. make sure you catch them.

ascii/unicode? yea. it’s a fucking mess. everywhere.

nothing comes for free. whether you use a library, or do your own thing, you’re going to run into problems.
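A minimal sketch of the "throw errors if you see something you do not accept" idea for the variable-field-count case, in Python. The helper name `read_strict` and the sample inputs are hypothetical; the parsing itself is delegated to the standard `csv` module rather than a hand-written parser.

```python
import csv
import io


def read_strict(text, expected_fields):
    """Parse CSV text, raising ValueError on any record with the wrong field count."""
    reader = csv.reader(io.StringIO(text, newline=""))
    rows = []
    for row in reader:
        if len(row) != expected_fields:
            # reader.line_num is the number of source lines read so far.
            raise ValueError(
                f"line {reader.line_num}: expected {expected_fields} fields, got {len(row)}"
            )
        rows.append(row)
    return rows


good = "id,name\n1,Ada\n2,Grace\n"
bad = "id,name\n1,Ada,extra\n"

print(read_strict(good, 2))
try:
    read_strict(bad, 2)
except ValueError as err:
    print("rejected:", err)
```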

B-Con12y ago

Roboprog12y ago

On the one hand, I hear you. On the other hand...

Too many "enterprise" coworkers who don't know how to write a finite state machine. They do need to use a library.
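For anyone wondering what such a state machine looks like, here is a small Python sketch. It is an illustration only, not any library's implementation: the state names and `parse_csv` are invented, it assumes RFC 4180-style quoting (doubled quotes, no backslash escapes), and it skips things a real library handles, such as encodings and error reporting for malformed input.

```python
from enum import Enum, auto


class State(Enum):
    FIELD_START = auto()      # nothing read yet for the current field
    UNQUOTED = auto()         # inside a bare field
    QUOTED = auto()           # inside a quoted field
    QUOTE_IN_QUOTED = auto()  # just saw a quote while inside a quoted field


def parse_csv(text, delimiter=",", quote='"'):
    """Parse CSV text into a list of rows using an explicit state machine."""
    rows, row, field = [], [], []
    state = State.FIELD_START

    def end_field():
        row.append("".join(field))
        field.clear()

    def end_row():
        end_field()
        rows.append(list(row))
        row.clear()

    for ch in text:
        if state is State.QUOTED:
            if ch == quote:
                state = State.QUOTE_IN_QUOTED
            else:
                field.append(ch)  # delimiters and newlines are literal here
            continue
        if state is State.QUOTE_IN_QUOTED:
            if ch == quote:       # doubled quote means one literal quote
                field.append(quote)
                state = State.QUOTED
                continue
            state = State.UNQUOTED  # it was the closing quote; handle ch below
        if ch == delimiter:
            end_field()
            state = State.FIELD_START
        elif ch == "\n":
            end_row()
            state = State.FIELD_START
        elif ch == "\r":
            pass                  # drop CR outside quotes so CRLF endings work
        elif ch == quote and state is State.FIELD_START:
            state = State.QUOTED
        else:
            field.append(ch)
            state = State.UNQUOTED
    # Flush a final record that was not terminated by a newline.
    if field or row or state is not State.FIELD_START:
        end_row()
    return rows


text = 'id,comment\r\n1,"He said ""hi"", then\nleft"\r\n2,plain\r\n'
for record in parse_csv(text):
    print(record)
```

The key design point is that the quoted state treats commas and newlines as ordinary characters, which is exactly what line-at-a-time splitting cannot do.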

iagooar12y ago· 2 in thread

chrismcb12y ago

If your library is a solid library, then release it. What he is saying, though, is don't roll your own if you can use a solid library. And if a good solid library exists, why bother writing your own?

iagooar12y ago

> And if a good solid library exists, why bother writing your own?

Because, you know, learning, having fun and stuff.

kemayo12y ago· 2 in thread

ohwaitnvm12y ago

I like this.

kemayo12y ago

Such a comment would be acceptable, yeah. So long as we can tell they looked at it and thought "wow, that might go incredibly wrong"...

gavinpc12y ago· 1 in thread

My most popular stackoverflow answer [1] includes a CSV writer and reader. Yeah, I'd clean it up a little if I were doing it now (return enumerator instead of array, etc). But people keep using it.

It uses regex lookaheads to deal with quoting, so it's not 100% portable. But it's only about one page.

As for the other things mentioned by the OP (BOM, encoding), those should be handled by the stream, and are not the province of CSV per se.

[1] http://stackoverflow.com/a/769713/4525

EvanPlaice12y ago

Regex lookaheads are more efficient because you're copying everything between terminal chars at once as opposed to one char at a time.

Unnecessary string copy operations are what make the parser slow.

encoderer12y ago· 1 in thread

Early on in my career, just a year out of school, I, for some absurd reason, had the idea to build my own date library.

Primarily, I didn't fully understand the date objects and functions available in the languages/libraries I was using, so simple things like formatting a string date seemed difficult to me.

This was an awful idea. Dreadful.

Most things in life are more complicated than they, at first, seem.

dceddia12y ago

Especially dates.

EpicEng12y ago· 1 in thread

> What if the character separating fields is not a comma?

> Not kidding.

Of course, like I said, that ship has sailed. If you want your data to be read nicely by other programs you're probably stuck with CSV, TSV, or something similar.

brianpgordon12y ago

On the other hand, there's definitely some value to being able to directly inspect and alter your data in a text editor. It would be nice to not have to deal with unprintable characters.

Sami_Lehtinen12y ago· 1 in thread

JensRantil12y ago

Have a look at `jq`: https://stedolan.github.io/jq/ It makes working with JSON a breeze.

michaelmior12y ago· 1 in thread

[1] http://csvkit.readthedocs.org/

voltagex_12y ago

Hey, this looks good. I've also used csvfix [1] to get me out of trouble before.

1: http://neilb.bitbucket.org/csvfix/

michaelfeathers12y ago· 1 in thread

Easy to write, hard to read. Perfect illustration of an emergent case of Postel's Law.

astrobe_12y ago

To me it's more the perfect illustration of the broken window theory.

joshvm12y ago· 1 in thread

Very little data is actually true CSV.

The code isn't particularly long (~900 lines), it's Python (hence readable) and it's well commented:

https://github.com/numpy/numpy/blob/v1.8.1/numpy/lib/npyio.p...

jasode12y ago

>weird delimiters (pipes '|' are popular in astro for some reason)

I can only guess that since it's astronomy data and constellation coordinates have decimal places, it's best to avoid the comma character because some countries use it as a decimal separator.

http://en.wikipedia.org/wiki/Decimal_mark

codingdave12y ago· 1 in thread

It's a flippin' CSV.

Seriously, the overwhelming CSV-bashing in these comments really makes me worry that coders just can't handle the basics anymore.

SoftwareMaven12y ago

1. Of course there are exceptions to the rules: perhaps the CSV is malformed or there are special considerations in the backend, but the general point stands.

yp_all12y ago· 1 in thread

Post a sample .csv file you believe is too difficult.

I will solve your problem with only UNIX utilities. And I'm sure others will solve it other ways.

Usually I only need sed and tr. Sometimes lex or AWK.

Arguing about something without ever pointing to an example accomplishes nothing; it's just whining.

Post an example.

Thank you.

josephlord12y ago

qwerty_asdf12y ago

Garbage in? Garbage out. You give me a shitty file, you get shitty results. Tough shit.

Dorian-Marie12y ago

> Ruby CSV library is 2321 lines.

If you look at lib/csv.rb [1] it's:

* 2325 Lines

* 2161 Non-blank lines

* 950 Lines of Code

[1]: https://github.com/ruby/ruby/blob/trunk/lib/csv.rb

jimeh12y ago

p0nce12y ago

Can we stop being liberal in what we accept from others? It only leads to an unfixable mess.

mooreds12y ago

Rabidgremlin12y ago

Roboprog12y ago

CSVs were simpler back in the 80s, when there were a few products (e.g. - Lotus 123, xBASE) that all wrote RFC 4180 compliant text (and I'm pretty sure there was no RFC 4180 yet)

No alternate delimiters, no backslashes.

Now I have to put up with offshore staff trying to use apostrophes (') instead of quotes (") :-(
