ASCII had addressed the problem of separating entries ever since its creation: Separator control codes. There are:
x01 SOH "Start of Heading"
x02 STX "Start of Text"
x03 ETX "End of Text"
x04 EOT "End of Transmission"
x1C FS "File Separator"
x1D GS "Group Separator"
x1E RS "Record Separator"
x1F US "Unit Separator"
You can use those just fine for exchanging data as you would using CSV, but without the ambiguities of separation characters and the need to quote strings. Heck if payload data is limited to the subset ASCII/UTF-8 without control codes you can just dump anything without the need for escaping or quoting.
So my suggestion is simple. Don't use CSV or "P"SV (printable separated values). Use ASV (ASCII separated values).
Sure, if you're building some kind of system where you need to ingest data from one application from another application you control, then using a different interchange format like ASV is an option. But then people tend to use more powerful formats like JSON or XML.
That, and data in CSV format is human readable in any old text editor or even work processor which many use as a quick sanity check to make sure their data looks sane. A lot of editors will not display the ASCII control characters at all so the fields on the line get mashed together, or may even reject the file as containing what it considers to be unexpected characters.
Worse; most 'non computer people' cannot get them imported into a spreadsheet properly (for whatever reason; usually it just puts everything in one field or column, people curse and give up), so they have to edit them in Notepad or worse, in MS Word and then send them back.
Not really seeing the beauty I guess.
:help digraph
:help digraph-table
Feel free to implement mappings for quickly accessing these digraphs. Those pesky F<n> keys are perfect for this. Easy to reach, gets the job done.
If Excel decides that text between Start of Text and End of Text that begins with a "=" is a formula, then you're in the same spot.
I use CSV all the time when I am working with R. My data can come in the form of CSV, XLS, or PDF. Which would you want to work with?
I can easily look at the data. I never touch my incoming data and my output is in reports, but CSV can be the easiest way to get data into a computer.
"ASV" is only a viable option if you then also use your time machine to go back 40 years and make everyone start using it then.
So when an Excel cell contains the UPC 123456123456, we get a CSV file that contains "1.23456E+11", which is worse than useless.
The hardest ones to work with were the mom and pop shops who suddenly had some success on Amazon and came to us after fulfilling out of their garage for a year and a half. Try telling a semi-retired 60 year old electrician in the middle of Iowa that the file he sent is worthless because none of the product codes match what you have, especially when once he closes the file he doesn't have any idea where it is.
The worst part is that it's not something that can be solved with an external dependency on some new startup - that would just add another layer we'd have to go through in the error cases, which would be numerous.
God, I wish I could share some of the files we've received. I cannot conceive what sort of monster would write a data exporter that would produce these unreadable things.
I ask for XLSX files since at least it's structured, unambiguous and documented, but even better: a minimal XLSX parser is trivial (about a page) to write.
Also: Educating users on how to specify the character set in every application that the user seems to want to use is a special kind of hell.
People use Excel when they should really use a database, they use it because they want to format something on a 2-d grid, edit tabular data, make plots, do calculations, make projections, etc.
The problems go down to the data structures in use.
For instance there is nothing 2-dimensional about financial reports (and projections), really financial reports are hyperdimensional. Proper handling of dates and times is absolutely critical. Also the use of floating point with a binary exponent is a huge distraction in any kind of math tools aimed at ordinary people. (Mainframes got that right in the 1960s!)
Google Sheets is just a stripped down version of Excel and other than the ability for multiple people to work on it simultaneously, is really no better.
What are the downsides of using it for ~everything?
01101 -> 1101
To get comma separated CSVs to show properly in Excel we have to mess around with OS language settings. CSV as a format should have died years ago, it's a shame so many apps/services only export CSV files. Many developers (mainly US/UK based) are probably not aware of how much of a headache they inflict on people in other countries by using CSV files.
Why? Aren't the import settings enough?
https://support.office.com/en-us/article/Text-Import-Wizard-...
I'm not just being pedantic, it makes a big difference. If I want to change some values in a spreadsheet I should be able to just open it, change the values, save, and be certain that the document will be identical apart from the deliberate changes. This is especially important for CSV files, which are commonly used for import/export operations.
Semicolons are really better though, because they aren't used as a decimal separator unlike commas in most countries.
I don't know about Excel, but LibreOffice allows very easily to select which parameters to use when opening a CSV file, it works just fine.
If you're going to separate values with semicolons--which is perfectly reasonable--I feel like you probably shouldn't do that with a format called Comma Separated Values.
>Unfortunately that’s not the end of the story. The character might not show up, but it is still there. A quick string length check with =LEN(D4) will confirm that.
The documented way is prefixing with a ' character. It doesn't have the length issue either.
As to the root issue, I can't think of any perfect way to transfer a series of values between applications that apply different types to those values and applications that don't. At some point, something is going to have to guess.
It is suggested in comments, but the author answered
> Yes, this prevents formula expansion... once. Unfortunately Excel's own CSV exporter doesn't write the ', so if the user saves the ‘safe’ file and then loads it again all the problems are back.
:-/
As someone mentioned elsewhere this is an issue with long numbers. Excel converts them to scientific notation. Reformat and export, all good. Reopen said file, back to scientific notation.
Really anything that relies on an escape character (') or a specific format gets lost on export to CSV. It exports correctly but there is simply no way to document these formats in a CSV file and have it be compatible with anything but Excel.
Amazing.
Interestingly, in Excel removing the quotes entirely also causes a formula to be interpreted as a formula and text (even with spaces) as text and numbers as numbers.
In my testing, quotes are only needed when a field contains a comma to prevent it being interpreted as a delimiter.
If we thought about it as an API mechanism, we would parse the strings and apply rules to sanitise or reject it.
Here is a principle for thinking about data. Distinguish internal data structures (persistence, search) from interchange structures (APIs). Codebase A should not be able to directly access the structures of Codebase B. To communicate, they must use explicit APIs.
At the moment, this principle is not mainstream. The CSV loader is not sure if it is loading an interchange format or persistence format. Another, that happens regularly: (1) developer builds a database as a storage mechanism. (2) developer decides to have other separate codebases query into that database. Is the database an application-data-structure (interal) or an API (external)? It is acting as both.
1,foo,'=SUM(A1:A10),bar
and open it, then the single apostrophe is visible in the cell.
1,foo,='SUM(A1:A10),bar
Why would these apps go off executing code from a text file? How odd.
Is there a way to tell Excel or Sheets to open a CSV file without executing code?
I have never seen a way to disable a full recalculation when Excel opens a CSV file, which beyond the security implications is painful for people like me who keep their calculations on manual because I often have very heavy workbooks opened all the time.
CSV -> import on web app -> SQLi
Malicious input -> CSV download from web app -> Excel -> formula -> sneaky data exfil
CSV -> JS -> import into web app XSS (in places no other XSS existed because of the data)
CSV import -> weird CSV header -> arbitrary data loading (headers were column names.... Schema injection .. like SQLi only more hilarious
Point is apps and devs can have blind spots (knowledge gaps) or just not think of a CSV import or export like other functionality.
>calling a pentested a script kiddie
welp, my work is done here
These vulnerabilities can't be blamed on CSV so much as on the desire of application vendors to treat data as code.
Excel conflates the idea of display format and data type which is the source of countless headaches. It is legacy pain in the purest form.
Dates are a /type/ of text; parsing dates in to machine readable formats is an /entire/ other can of spam.
[1] https://genomebiology.biomedcentral.com/articles/10.1186/s13...
EDIT: LibreOffice also allows you to tell it what encoding a file uses and what character(s) are used as separators.
In any case, I imagine the Ensembl ID is still safer than other encodings in the case of invertebrates. For example, genes IDs in the Fruit fly genome look like FBgn0034730.
Some things I dislike about CSV:
* No distinction between categorical data and strings. R thinks your strings are categories, and Pandas thinks your categories are strings.
* I'm not a fan of the empty field. Pandas thinks it's a floating point NaN, while R doesn't. So is it a NaN? Is it an empty string? Does it mean Not Applicable? Does it mean Don't Know? Maybe it should be removed altogether.
* No agreement about escape characters.
* No agreement about separator characters.
* No agreement about line endings.
* No agreement about encoding. Is it ASCII, or UTF-8, or UTF-16, or Latin-whatever?
* None of the choices above are made explicit in the file itself. They all have the same extension "CSV".
These use up a bit of time whenever I get a CSV from a colleague, or even when I change operating system. Sometimes I end up clobbering the file itself.
Good things: * Human readable. * Simple.
I think the addition of some rules, and a standard interpretation of them, could go some way to improving the format.
The thing you use CSV for is not it's technical merit. You use CSV for its ubiquity. If you nailed down all those things you talk about, you would have a much, much smaller user base and there would be no reason to use CSV in the first place.
(Hey, this reminds me of a similar situation governing s/CSV/C/g...)
I think this is more of an R-ism than a standardization issue. Strings are a pretty universal data type, where as categorical data (factors) are mostly specific to the domain of statistical modeling. IMO Python is doing the correct thing here. Personally I find factors to be more trouble than they are worth, and fortunately `data.table::fread` mimics Python in this regard.
This one is your data turned out to be code. There are many, many books on all the various forms this takes. Memory corruption cat and mouse..... It is a long complex story that we can sweep up to that generalization. But it is important to know that high, medium, and low level of these issues. They form a gigantic tree. The medium level somewhere between is where devs need to threat model most of the time. But some of the time things are very specific and you just need to know about the specific thing and not it's various generalized forms, because the specific thing can really matter. E.g. simple programming mistakes lead to side channels, etc. We can generically understand a side channels easily. But it takes a ton of specific hard earned knowledge to avoid it.
I agree, it catches people off guard to think CSV files once interpreted can do more than give columns of information, but it's not an injection which is my beef.
http://blog.hackensplat.com/2013/09/never-sanitize-your-inpu...
Makes me wanna troll ops people at my own startup just for funsies.
(But please, just do me a small favor and don't submit any reports for SQL injection or information disclosure if you're using the SQL-like API that we expressly provide for the purpose of accessing public data. We get a couple clueless people sending such reports every week.)
A more common format is TSV (TAB delimited) which makes a lot more sense, however the best choice when importing data in Excel is still to change the file extension to a non-recognized extension (like - say - .txt) and in the "import wizard" set the appropriate separator and set all columns as "text".
>CSV files are just text files (the format is defined in RFC 4180) and evaluating formulas is a behavior of only a subset of the applications opening them - it's rather a side effect of the CSV format and not a vulnerability in our products which can export user-created CSVs. This issue should mitigated by the application which would be importing/interpreting data from an external source, as Microsoft Excel does (for example) by showing a warning. In other words, the proper fix should be applied when opening the CSV files, rather then when creating them.
[0]: https://sites.google.com/site/bughunteruniversity/nonvuln/cs...
Their policy makes it sound like that the second vulnerability should indeed be fixed in Google Sheets itself (it is the one opening the file, after all)
CSVs are still the most portable format for moving data around despite all of their evils of escaping characters, comma delimitation, etc...
A lot of old legacy systems know CSV and its easy to inspect visually as compares to more efficient binary formats like ORC or Paquet.
Use anything else, even XLSX which is at least a typed and openly standardized format.