I think the head scratching about Excel's (and probably other spreadsheets) behavior on empty cells is misguided. Yes, if you have a computer scientist mindset then it might seem very mathematically satisfying to have strict rigorous logic around empty cells but we're getting into 3-valued NULL handling like databases. That type of extra logic checking is misplaced considering the origins of spreadsheets (e.g. Dan Bricklin's VisiCalc) as a business computation tool instead of a GUI for an E.F. Codd relational db.
If a particular power user finds the typical "null" cell behavior to be "wrong", he can add formulas to his cells such as:
=IF(ISBLANK(C3), NA(), C3)
... such that #NA# (Excel's quasi "null") propagates throughout any SUM() and other calculations.
However, imagine if the situation was reversed and we had an alternate universe where Excel propagated empty cells as #NA# instead of zero to satisfy the logic of computer scientists. We'd inevitably have:
1) Microsoft adds a button wizard on the toolbar to bulk fill "null" cells with 0.
2) User forums with power-users trading VBA macros to fill in zeros of empty cells. "Ugh! How do I get rid of all these useless #NA# littering my spreadsheet???"
3) Microsoft adds an user option checkbox to "treat empty cells as zero instead of #NA#" -- which everyone ends up activating as a defacto setting. This cancels out the strict null handling the computer scientists were imposing on the spreadsheet!
I contend this alternate universe scenario is much worse for business users of Excel. In other words, you can't "force fit" the concept of NULL as a default where it is not natural. The end users will just work their way around it.
Yes, many companies pays good money for people to update those Excel sheets.
Me? I see almost every single Excel sheet that is updated over more than a week as a possible sale. But I'm not replacing Excel spreadsheets full time, at least not yet : )
* show N/A everywhere
* silently ignore sources of potential errors.
Why not make the calculation while adding a visual cue (e.g. a yellow or red indicator in the corner) that marks the field as tainted? Now the users get to decide whether they want to investigate or ignore that.Even things written by computer scientists sometimes follow the empty-is-zero model for convenience. awk's arrays behave very similarly to Excel's cell grid in this particular manner: they are conceptually of infinite size, and any element not yet set is implicitly 0. Hence you can calculate word-counts with something along the lines of count[$i]++ without first checking whether count[$i] exists; and you can also sum the counts of a predefined list of words, without throwing in if-count[$i]-is-defined checks.
No doubt, it poorly serves users who want to create huge, complicated spreadsheets, or large programs.
Spreadsheets could address the problem by having this as a cell property. Suppose you could highlight a rectangular region of the spreadsheet and mark the whole region as having "strict initialization": any cells in the region which are empty will trigger a diagnostic if they are accessed.
(Though I think the failure to use exact arbitrary precision numeric values except where an imprecise operation forces fallback to limit precision binary floating-point approximations -- and then visually distinguishing when the latter has occurred -- is a bigger problem for common spreadsheet use cases than the bad null handling.)
Spreadsheets are the dominant end-user-programmable tool in existence. Much of what people user them for is horrifying to programmers who know better.
But if spreadsheets actually did the "right" thing, they would never have been so widely adopted in the first place. That's an unprovable assertion on my part, but I strongly suspect it's true. I think it's a clear case of Worse is Better[0].
I don't think spreadsheets are widely adopted because they do the wrong thing, I think they are widely adopted because they provide UI affordances that systems that do the Right Thing didn't at the time that spreadsheets became widely adopted, and, since then, IT departments have imposed lockdown requirements which prevent anything programmable from being accessible to end-users except spreadsheets, because spreadsheets were so widely adopted before that lockdown began and end users simply refuse to give them up.
Its not, IMO, a worse-is-better situation, its simply better-is-better (in terms of UI affordances at the time of wide adoption) combined with non-technical, socially-imposed constraints which have locked the dominance in place by preventing any competition in the end-user accessible programming space.
The real right thing to do there is to give the user a warning: visually mark the field as potentially tainted, but assume the user knows what they're doing and give them a result anyway. Let them decide how much they care about correctness in that particular case.
> As for "the developer is supposed to implement it",
> next week I'll be giving my annual ethics lecture
> and I'll be pointing out to students that the codes
> of practice of the various professional societies
> all agree that your duty goes beyond simply doing
> what you are told.
>
> If you are told to write consumer software that
> gets its sums wrong, you should not do it.They want the program to just assume you know what you are doing.
This is what they want, so this is what they get.
Unfortunately, I never met another user of the software and the company eventually went nowhere. The spreadsheet was a concept very easy for bookkeepers, accountants and other non-programmers to understand. Not so much with (the better) Javelin.
Here is an add for the software from 1985: http://www.thecomputerarchive.com/archive/Software/Applicati...
We have only had the concept of zero for a thousand years and null for about fifty. Expecting there to be a common way of handling this in all domains and industries is a bit much and expecting it to be handled the way the OP prefers is highly unlikely (although I do agree with their opinion)
If the convention in accountancy is BLANK == 0 then that's their convention. Implement it using a single method that's nice and easy to instrument and discover for later on, perhaps raising warnings out to the reports produced.
Leave the worrying over not set values for when Unicode and UTC handling in excel is brought up :-)
(you can get close to programming in a spreadsheet if you know what you're doing with Insert > Name > Create, and one day I'd like to see a spreadsheet that lets you write your macro functions in Python rather than VB, but nobody is taught to use Excel that way except people who also program.)
Heck, even without the GUI requirement, "real" programming languages have only very recently added competitive functionality, with the exception of Common Lisp, which had the Cells package ages ago. Now it's getting more common to find various kinds of dataflow/reactive/data-binding constructs in mainstream languages other than Excel, but it's quite new.
How much easier would spreadsheets be to understand if you had
sum([widget.price for widget in catalogue])
instead of SUM(F3:F405)
?Teaching people to think more about their data structures, rather than teaching them to squash everything into a table even if it doesn't really fit into one, would dramatically expand their skills in this type of analysis!
That's LibreOffice :) Though it's not "rather", but "besides".
Require all formulas that refer to column A be continually changed to reference the true amount of rows? Stop at the first blank row? Stop at the last blank row? It seems more confusing to define rules for that case than to assume "undefined values are zero".
IME, that's not really all that important of a case, because this:
> Require all formulas that refer to column A be continually changed to reference the true amount of rows?
Is the normal way to handle it in Excel, so normal in fact that Excel has a couple of automated ways to handle it.
If you are doing a table (which is the only case where "sum of a column" really makes sense), then for Table T and column C, the formula is SUM(T[[#Data],[C]]).
The other is the fact that Excel automatically updates ranges in formulas as rows are inserted, etc. (though this doesn't work if you are appending, only if you are inserting within the range.)
The two cases I've seen where relying on blank row handling is a common way of dealing with data that can be added rather than relying on either structured references in tables or excels automated updates are:
1) Spreadsheets written prior to Excel supporting structured references in tables (or by people who learned Excel that long ago and haven't updated their skills), and
2) Creating forms with a fixed number of rows (usually, to be printed in hardcopy form, so that page layout is an issue), where simply zero-filling the relevant cells with a formula that doesn't display zero values would be a simple solution (since you have a fixed set of cells to fill.)
Of course, you could also have function fail by default on blank cells but take an optional parameter to treat blank cells as the appropriate identity, the same way Excel lookup functions rely on sorted data by default but can be given flag that tells them to find an exact match without relying on sorted input.
2. What if spreadsheets started with 0s in every cell, would this "solve" the problem? I don't think so, and yet it would address this argument.
Let A1 = 1, A2 = blank, A3 = 3
PRODUCT(A1:A3) is 1
A1 * A2 * A3 is 0
I think the original comments are still over the top, as it seems to be arguin that a defined convention exists but the OP disagreed with it.
However if your example is what they were actually complaining about then hell yes.
SQL behaves similarly. See: http://www.sqlfiddle.com/#!15/37025/5
> PRODUCT(A1:A3) is 1
3, I think (based on experimentation).
count++ # if count doesn't exist, it becomes 1
A previously unseen array reference materializes automatically too: count[$1]++
(I think this is garbage too, but it leads to throwaway programs that are very concise. If you write anything large, it's going to bite you on the butt: mispelled[foo] += bar; # wee A1 = 1, A2 = blank, A3 = 3
sum(A1:A3) = 4
product(A1:A3) = 3
count(A1:A3) = 2
average(A1:A3) = sum(A1:A3)/count(A1:A3) = 2
geometric_mean(A1:A3) = product(A1:A3)**(1/count(A1:A3)) = sqrt(3)
Note that the last line wouldn't work under your proposal, because the identity for "product" isn't the same as the identity for "count". I think the user would be massively confused by an error like "using conflicting identity values for cell A2", and would prefer a spreadsheet that just gave them the damn geometric mean.