The spreadsheet GUI, lack of good version tracking/history, and eagerness to coerce data types and "correct" values makes it easy to introduce errors that will go unrecognized and propagated through calculations. Unfortunately this story just keeps repeating itself.
But all of this is just a secondary concern to Excel's real trouble: it's history of incorrectly implementing numerical and statistical procedures. One could plumb the depths of this topic for hours, but here are a few highlights: regression formula accepts illegal/nonsensical inputs (e.g. collinear predictors) and gives illegal/nonsensical outputs [0], variance/standard deviation change incorrectly with sample size [0], output of a paired t-test changes when missing values are included [0], formulas are mislabeled [0], v. 2007 gives very wrong answers to 11 of 27 tests in the NIST test suite used for statistical software benchmarks [1], the random number generator was broken as late as v. 2007 [1], and calculations relying on any of 12 particular floats display an incorrect result [2]. There are plenty of other issues mentioned in the links and elsewhere; if you're interested you'll have no trouble finding them.
Remember, friends don't let friends use Excel for science. :)
[0] http://people.stern.nyu.edu/jsimonof/classes/1305/pdf/excelr...
[1] http://www.pages.drexel.edu/~bdm25/excel2007.pdf
[2] https://blogs.office.com/2007/09/25/calculation-issue-update...
Edit: clarify and add a new issue I became aware of while researching further.
Excel is great at churning out fast and dirty estimates for low impact work. The problem is when it's used for large, complex, or important problems because these are just not Excel's domain--something obvious when looking at the kinds of features and bug fixes MS has prioritized over the years.
Most engineers are more similar to business people in that regards, they see traditional programming languages as too complicated. We learned MATLAB in university, two classes, basic intro to programming and then numerical methods. Many people had to retake the first class at least once.
I'd love to bang out my calculations in IPython Notebooks, but the most important requirement of engineering calculations is that they can be documented and understood by peers. Since none of my peers are interested in learning python, it's useless. Excel is the lowest common denominator, everyone gets it.
I once read this interesting contrast on HN and it stuck with me:
"Traditional programming languages show the program but hide the data. Spreadsheets hide the program but show the data".
I do however recall being mightily annoyed that I was having to calculate ephemera in "arcane bullshit", so I can understand why many shortcut to excel.
Experience makes fools of our past selves.
If you have the data in a non csv format and use excel to transform it into csv, python and R have utilities for that job (python has pandas, I don't use R as much for data cleaning, but I know it can).
- Ask HN: Suggestions for spacewalk procedure writing?
https://news.ycombinator.com/item?id=5585535
(It is asking for alternatives to Word for EVA planning)
(OT now:) If anyone thinks that's cheating them out of money - the choices are not "read for free" and "read for the appropriate price", the choices are "read for free" or "don't read". The reason I read is to procrastinate, so the value I get out of it is actually negative (same with HN...). Even their occasionally excellent articles (like their series on asset forfeiture) is stuff I'm at most mildly interested in (as a foreigner) to distract myself. That's my issue with today's media, I don't actually feel "informed" as in "it this is good for my life that I know these things". I can't do anything about 99.99% of the stuff I read about anyway, nor is it a representative sample of reality but consists almost solely on reporting the outliers.
Entertainment has value doesn't it? It's not a simple financial value like the simplistic use of opportunity cost as being the price you could bill those hours at, but it's a useful and functional part of being human.
I don't have a problem with you reading something people broadcast to the public internet though.
32 points by pns 6 hours ago | flag | hide | past | --> web <-- | 17 comments | favorite
https://www.google.com/search?q=An%20alarming%20number%20of%...
Google Search usually gets around paywalls
"Excel automatically converting gene names to things like calendar dates or random numbers"
In this case, I think what is needed is some kind of rudimentary knowledge of data-types. Or perhaps more simply a scientific template which is actually plain text by default.
But how are people not noticing auto-correction and auto formatting taking place!
The only perfect solution is to hire a developer to build you a data entry system. The developer can build the system which they have no cause to entirely understand the science behind, and thus a human to take the blame for errors instead of excel.
Excel should really offer an easy way around this.
The SEPT9 gene is problematic enough to be memorable, though. https://en.wikipedia.org/wiki/SEPT9
When Excel encounters the first cell in a new sheet that it thinks should be auto-converted, why does it not ask if that is desirable for that sheet?
Like: "Do you want Excel to interpret and auto-convert all strings with format <X> into the type <Y> in this sheet?"
At least for conversions where the original data is lost.
The article describes in some detail how inputting SEPT2 in a cell with default formatting displays 9/2/2016, but is stored as 42615 (which you get if you later change the cell to text formatting).
"Spaghetti" doesn't even begin to describe it. "Ball of yarn under a cat-lady's sofa" comes readily to mind, as does gouging my eyes out and amputating my fingers.
The problem isn't excel. The problem is scientists.
I can think of a lot of reasons why that is, but the one-function-per-file and single flat directory structure of MATLAB programs is part of it.
Language quirks are another, but I could write an entire book about that.
How did you guess? ;)
I've been gently pointing people towards python for this exact reason. The younger generations need little convincing, but the old dogs would rather write the same shitty code.
I suppose change takes time.
Type apostrophe at beginning of the gene name ('MARCH1) or format the column for gene names as text (click column letter, then Format | Cell and select text)
If people want to use a spreadsheet application for this kind of data collection (and that is a big if I think) then they perhaps need to have some agreed lab protocols for setting up and checking the spreadsheets. This is a known issue in financial circles...
http://www.bloomberg.com/news/articles/2013-04-18/faq-reinha...
If there aren't enough resources / skilled eyes to catch these simple errors, what are the chances they would catch errors in source code too?
https://www.youtube.com/watch?v=2Cdgew5zvI4
http://www.felienne.com/archives/tag/spreadsheets
[0] pronounced Fay-lee-nuh
Disclaimers: 1) Yes, scientific peer review needs improvement. 2) Yes, spreadsheets are not ideal for science... what makes business less important?