1. How do you choose which specific cases to calculate manually?
Remember, you have no prior knowledge of what the correct answer looks like or where the interesting parts of the input space are. You have no way to determine whether any given set of manual calculations is representative of the work your program will be doing or covers the areas with the greatest risk of error.
2. How practical is it to make all those manual calculations?
Even in this very simple case, calculating the correct answer for a single input point manually might require hundreds of complex arithmetic operations to be performed. That’s going to be slow and error-prone. After all, isn’t that why we’re writing a program to do this for us?
Now, how does this idea scale? What if our program isn’t computing a nice analytical solution to a simple arithmetic problem, but instead running a complicated numerical method to process many thousands of data points in each input? It quickly becomes impractical to rely on this strategy for testing.
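To put a rough shape on that manual effort, here is a minimal sketch of the calculation for one input point. It assumes the usual escape-time formulation: iterate z → z² + c from z = 0 and count the steps until |z| exceeds 2. The function name `escape_count` and the cap of 100 iterations are illustrative assumptions, not anything fixed by the discussion above.

```python
def escape_count(c: complex, max_iter: int = 100) -> int:
    """Count iterations of z -> z*z + c (from z = 0) before |z| exceeds 2.

    Returns max_iter if the point has not escaped by then. Both the name
    and the default cap are hypothetical choices for this sketch.
    """
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        # One complex multiplication plus one complex addition per pass.
        z = z * z + c
    return max_iter


if __name__ == "__main__":
    for point in (0j, -1 + 0j, 1 + 0j, 2 + 2j):
        print(point, escape_count(point))
```

Each pass through the loop is a complex multiplication and a complex addition, which means several real multiplications and additions if you do it by hand. A point that never escapes runs the loop to the cap, so with a cap of 100 a single expected value already costs hundreds of hand calculations.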
3. How will having a set of known outputs you can test against drive an implementation from scratch?
I mentioned before that Mandelbrot is my second standard challenge to TDD evangelists. The first is to write add(x,y) driven by tests. After a bit of back and forth, this invariably ends up with an implementation that was generalised from however many specific cases were given to the general case. Invariably, that generalisation is the step that actually creates a useful solution to the original problem, and invariably it uses an insight that was not driven by the tests.
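As a hypothetical illustration of how that back and forth tends to go, here is a sketch in Python. The test values and the name `add_by_cases` are made up for this example. The first function is what you get by making each new failing test pass with the smallest possible change. The second is the generalisation.

```python
def add_by_cases(x, y):
    """What strictly test-driven steps produce: one branch per test case."""
    if (x, y) == (1, 1):   # added to pass the first test
        return 2
    if (x, y) == (2, 3):   # added to pass the second test
        return 5
    if (x, y) == (0, 7):   # added to pass the third test
        return 7
    raise NotImplementedError("no test has asked for this input yet")


def add(x, y):
    """The general solution. No individual test forces this step."""
    return x + y


if __name__ == "__main__":
    # Both versions pass every test written so far.
    for x, y, expected in [(1, 1, 2), (2, 3, 5), (0, 7, 7)]:
        assert add_by_cases(x, y) == expected
        assert add(x, y) == expected
    # Only the general version handles an input nobody tested.
    print(add(40, 2))
```

Both versions pass every test written so far, so the tests alone cannot tell them apart. Moving from the first to the second depends on already knowing that the answer is `x + y`.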
Our Mandelbrot scenario is the same situation, just a slightly more complicated example. No matter how many individual tests you create by choosing sample points in the input space and manually calculating the expected output, you won’t have a systematic way to work back from those answers to derive a correct general implementation of the Mandelbrot calculation. Your specific cases might be useful for verifying an existing implementation, but they give you no insight into how to write a good implementation from scratch. (If I’m dealing with a particularly strident advocate of TDD, this is the point where I mention the word “sudoku”.)
And again, we have to ask how this process scales. What if we had a more challenging problem, say extracting an audio track from a video file and running a speech recognition process on it to generate subtitles? It might actually be easier in that scenario to identify individual test cases: just take a known video file as input and write down the expected output for it, and now you have an end-to-end test. However, it’s still true that no number of end-to-end tests will necessarily offer any insight into how to structure a good implementation of the required functionality in detail, nor will they tell us how to implement any specific part or generate useful unit test cases for those parts. End-to-end test cases might help us to verify an existing implementation, but in general they won’t reliably drive a correct one from scratch.
You can’t reach a program that solves the general case by iteratively writing a failing test for a specific case, making the smallest change required to make that test pass, refactoring, and then moving on to the next test. The important step — what you called “fully implementing the add function” — is taken by implementing some other insight that is not driven by the tests. That step isn’t a refactoring either, because it very much does change the behaviour of the program.
To design and implement a good program, sometimes you just have to know what you’re doing. There is no substitute for understanding the problem you are trying to solve and how you intend to solve it. And if you have that understanding and you design and implement your program accordingly, what is the value of following the red-green-refactor process compared to simply writing any unit tests you find helpful for verifying your implementation?