Your test case is more useless than a turd in the middle of the dining room table unless you put a comment in front of it that explains what it assumes, what it attempts, and what you expect to happen as a result.
Because if you just throw in some code, you're only giving the poor bastard investigating it two puzzles to debug instead of one.
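For illustration, here's a minimal Python sketch of such a header comment; the function and test are hypothetical, not from the original discussion:

```python
# Hypothetical function under test, for illustration only.
def normalize(scores):
    """Scale a list of positive numbers so they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

def test_normalize_preserves_ratios():
    # Assumes:  input is a non-empty list of positive floats.
    # Attempts: normalizing [2.0, 6.0], which sums to 8.0.
    # Expects:  output sums to 1 and keeps the 1:3 ratio of the inputs.
    result = normalize([2.0, 6.0])
    assert abs(sum(result) - 1.0) < 1e-9
    assert abs(result[1] - 3 * result[0]) < 1e-9

test_normalize_preserves_ratios()
```

The three-line comment costs seconds to write and spares the next reader from reverse-engineering intent out of assertions.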
The obvious result of Goodhart's Law ensued, leading to test cases like you mention.
Lesson to leaders: Please stop your bad managers from pulling stupid crap like this. It wastes a lot more time in the longer run.
If you have to document your documentation, you might be missing something fundamental in how you are writing your first order documentation. Not to mention that in doing so you defeat the reason for writing your documentation in an executable form (to be able to automatically validate that the documentation is true).
Over time I'm inclined to value human-written documentation, especially when things involve integrations of multiple systems. I've had real cases where two parties point at code and say their code is correct, and in isolation the code does look correct. But when the time comes to integrate the systems, it breaks. If you have a human-readable document where intentions and expectations are specified, it's much easier to come to a common (working) solution.
Not all languages have the capability to express complex intentions, so code-as-documentation does not work most of the time.
Auto-generated API docs combined with handwritten documentation that covers what can't be expressed in code and includes some useful examples seems like the right approach to me. In practice that's the kind of doc I tend to have the best experience with. For example the Rust stdlib docs are auto-generated but the language also supports notes and (automatically unit-tested) examples in docstrings which means the API docs are filled with explanations & examples and mentions what assumptions are made about inputs.
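Python's doctest offers a similar mechanism to the Rust docstring examples mentioned above: examples live in the docstring and are executed as tests, so the prose is automatically checked for truth. A small sketch (the `clamp` function is made up for illustration):

```python
import doctest

def clamp(value, low, high):
    """Clamp value into the inclusive range [low, high].

    Assumes low <= high; behaviour is unspecified otherwise.

    >>> clamp(5, 0, 10)
    5
    >>> clamp(-3, 0, 10)
    0
    >>> clamp(42, 0, 10)
    10
    """
    return max(low, min(value, high))

# Running the embedded examples verifies the documentation is still true.
assert doctest.testmod().failed == 0
```

The "assumes low <= high" note is exactly the kind of thing that can't be expressed in the signature but belongs next to it.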
They almost convinced me somewhere in my career. But the hard truth I learnt is that most people are saying this because they aren’t capable of verbalizing what they are programming.
If your "code is doc", it should be extremely easy to add a little sentence above your method to explain what it does. And no, docs don't go stale. If your documentation no longer matches what your function does, it's probably because you should have written a brand-new function instead of changing an existing function's behavior.
And, as you note, when integrating systems you need more than just the code and comments, since the code might not even be written with the other system in mind.
It's not always feasible to document every little edge case in natural language and keep that in sync with your code. If you "document" edge cases as tests, they _have_ to be in sync with your code. It shouldn't replace traditional documentation, though, and is better suited to internal components than to a public API.
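As a sketch of what documenting edge cases as tests looks like (a hypothetical Python helper; the edge cases live as executable assertions rather than prose):

```python
def chunk(items, size):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Edge cases documented as tests: they cannot drift out of sync with the code.
def test_chunk_edge_cases():
    assert chunk([], 3) == []                     # empty input yields no chunks
    assert chunk([1, 2], 5) == [[1, 2]]           # size larger than input: one chunk
    assert chunk([1, 2, 3], 2) == [[1, 2], [3]]   # last chunk may be short

test_chunk_edge_cases()
```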
If the documentation can also be interpreted by machine to validate what it claims is true you have a nice side benefit, but not the reason for writing your documentation.
Now, you could also have a well organized test suite that goes from most obvious to most detailed, split into sections for each use-case, but this sounds a lot more tedious than "write a one-line comment describing the unit test".
No, the point of automated testing is to verify that what is under test behaves correctly and to be able to scale this verification cheaper than having humans do it. Documenting what it verifies and under what conditions is just a side effect.
A test must be reproducible. If it is not, it is not a test.
This is why I found Gherkin/Cucumber (and BDD in general) to be a total revelation when I first encountered it. No one should be writing tests any other way IMO.
The revelation of TDD, which was later rebranded as BDD to deal with the confusion that arose with other types of testing, was that if your documentation was also executable, the machine could be used to prove that the documentation is true. Gherkin/Cucumber specs themselves are not executable and require you to re-document the behavior in another language, with no facilities to ensure that the two stay consistent with each other.
If you are attentive enough to ensure that the documentation and the implementation are aligned, you may as well write it in plain English. It will give you all of the same benefits without the annoying syntax.
BDD is a QA concern, primarily used for QA tests against a written (BDD) requirement.
TDD is about unit testing, which is about testing the implementation BY developers FOR other developers.
TDD says nothing about the correctness of the software against a spec, only that a given implementation aligns with a developer's intention.
Then try to debug a "document"...
I like the idea. But having tried it at scale, it becomes a mess. Code I can understand. I can read English comments. I can't debug English.
We use Spock, which makes "comments" a very expected thing; that helps us ensure tests without comments don't pass code review.
Just use a tool that helps you and stop writing stupid tests whose impl code looks worse than the code being tested.
  "Average of list" should "be within range" in {
    forAll { (l: List[Float]) =>
      val avg = l.average
      assert(avg >= l.min && avg <= l.max)
    }
  }
This test will fail, since the property doesn't hold for e.g. empty lists. Requiring non-empty lists will still fail if we have awkward values like NaNs, etc. The following version has a better chance of passing:

  "Average of list" should "be within range" in {
    forAll { (raw: List[Float]) =>
      val l = raw.filter(n => !n.isNaN && !n.isInfinite)
      whenever (l.nonEmpty) {
        val avg = l.average
        assert(avg >= l.min && avg <= l.max)
      }
    }
  }
Getting this test to pass required us to make those assumptions explicit. Of course, it doesn't spot everything; here's an article which explores this example in more depth (in Python): https://hypothesis.works/articles/calculating-the-mean

  @Test
  public void myTestMethod_Scenario_ShouldReturnThis() { .... }
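The mean property discussed above is explored with Hypothesis in the linked article; as a rough stdlib-only approximation of the same check (my own generation strategy, restricted to integers-as-floats so the property actually holds, and without the input shrinking Hypothesis provides):

```python
import random

def average(xs):
    return sum(xs) / len(xs)

# The same assumptions the Scala version had to make explicit:
# non-empty lists of finite values.
random.seed(0)
for _ in range(1000):
    xs = [float(random.randint(-10**6, 10**6))
          for _ in range(random.randint(1, 50))]
    avg = average(xs)
    assert min(xs) <= avg <= max(xs), (xs, avg)
```

A real Hypothesis test would additionally shrink any failing list down to a minimal counterexample, which a hand-rolled loop like this doesn't do.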
It(“throws when the object belongs to another user”)
It(“does a business thing when thing is in state BLAH”)
I don't think it quite does it right, but it is of note.
(I would buy a Copilot subscription for this)
I admit to having been guilty of this myself. I have a famous anecdote-example: at a very well-paid contractor job, I explained something about how my then department's software worked to someone from another department. I must have sounded very convincing, because the person went off to change how they used our stuff. A few minutes later, after accidentally running into my boss and casually chatting with him, I realized everything I had said was total garbage. I quickly excused myself and hurried after the person to tell them to forget and ignore everything I had just explained, because it was all wrong. I think that last step is what usually doesn't happen in these cases, because we don't normally realize that such a thing just occurred.
The brain, or parts of it, is great at producing "explanations". I think one of the more established and reproducible results in psychology is that our brain first decides and acts, and only then produces some (often bullshit) "reason" when/if our conscious self asks for one. Does anybody remember if this is true, and have a link?
Relevant are Sperry & Gazzaniga's split brain experiments. Participants of these experiments had had their corpus callosum (one of the major "information" pathways between our brain's two halves) cut. This was an operation performed to keep epileptic seizures in check.
https://en.wikipedia.org/wiki/Split-brain
In these participants, specific brain "functions" such as speech were highly lateralized, meaning only one half of the brain was able to perform it to a satisfying degree.
Note that these were already not neuro-typical people prior to the experiments (given the regular, debilitating epileptic seizures), so reaching general conclusions from these experiments is hard.
Remember also that, like our brains, our bodies are highly lateralized: the right half of our brain controls the left side of our body, and the left half of the brain controls the right side. If you ever wanted proof against intelligent design, the way our brain connects to our eyes and body is one very strong argument.
Anyway, one experiment comes to mind where one half of the brain was instructed to perform some action (move the left arm, or something similar). Then the other half would be asked _why_ that arm had just been moved. It would confabulate, on the spot, totally legit-sounding but obviously bullshit reasoning. E.g. "I felt cold so I wanted to put on a coat", rather than "the experimenter instructed me to move it".
So, rather than claiming "I don't know", it would just make up a plausible reasoning. It is really unimaginable to _not_ know why you moved your arm..
And that's why we test and why tests shouldn't be allowed to fail.
Just because the scenarios described make testing hard does not change the reality of what makes tests valuable.
If pre-existing failures are halting the production pipeline and you don't like it, switch off trunk based development and see if you like the waits and constant rebasing in large projects/teams. But don't eff with the bloody tests!
At $dayjob this works well: if your CI comes up red with some unrelated test failing, you can mark the test as flakey in the UI and CI will allow your code to merge; a Jira ticket is created for the test owner to fix their test (and it's disabled for future test runs).
I think for small to medium projects, you can have all tests succeed but once the repo is large enough / has frequent enough changes, flakey tests are bound to slip in.
I've heard Google does something fancier where they take a test and run it a bunch of times after it fails, to check whether the failure is consistent.
I think the system at work only runs each test a couple of times before giving up and marking it failed.
This is pretty much the one feature that's nice; otherwise it's like a worse version of CircleCI/GitHub Actions.
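The rerun policy described above can be sketched roughly as follows (retry count and labels are made up for illustration): rerun a failing test a few times, treat consistent failure as real, and quarantine mixed results as flaky.

```python
def classify(run_test, retries=3):
    """Run a test several times and classify the overall result.

    run_test: a zero-argument callable returning True (pass) / False (fail).
    """
    outcomes = [run_test() for _ in range(retries)]
    if all(outcomes):
        return "pass"
    if not any(outcomes):
        return "fail"    # deterministic failure: block the merge
    return "flaky"       # mixed results: quarantine and file a ticket

assert classify(lambda: True) == "pass"
assert classify(lambda: False) == "fail"
```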
If testing that way is painful (and it is), then work with people to remove the pain. Tests are supposed to help developers, not constrain or punish them.
Put tests in the same repo as the SUT. Do more testing closer to the code (more service and component tests) and do less end-to-end testing. Ban "flakey" tests - they burn engineering time for questionable payoff.
Test failures can be thought of as "things developers should investigate." Make sure the tests are focused on telling you about those things as fast as possible.
Also, take the human out of the "wait for green, then submit PR" steps. Open a PR but don't alert everyone else about it until you run green, maybe?
The problem becomes: I want to know if there are significant regressions in the vendor tests, i.e. tests that were green for a long time and suddenly changed. You could flag any test that became green at some point as "required" to pass CI, but then you have tests that randomly succeed or fail depending on code you have not yet written (e.g. locking around concurrent structures). Marking these tests manually is impractical and could definitely be replaced by tooling that supports some statistical modeling of success/failure.
You may have the best testing strategy for internal code but as long as you have to test against these conformance tests it's simply unfeasible to say "sorry, only green allowed".
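The statistical flagging suggested above might look something like this sketch (thresholds are invented for illustration): a test whose historical pass rate was high but whose recent runs mostly fail is a likely regression, while a test that has always flapped is left alone.

```python
def is_regression(history, recent, min_pass_rate=0.95, max_recent_rate=0.5):
    """history/recent: lists of booleans (True = pass) for past and recent runs."""
    if not history or not recent:
        return False
    historical = sum(history) / len(history)
    current = sum(recent) / len(recent)
    # Was reliably green before, mostly red now -> flag it.
    return historical >= min_pass_rate and current <= max_recent_rate

# A long-stable test that suddenly fails is flagged...
assert is_regression([True] * 50, [False, False, False])
# ...while a test that has always been flaky is not.
assert not is_regression([True, False] * 25, [False, False, False])
```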
It'd be great if GitHub could open a PR for reviews (aka un-draft) automatically after CI succeeds. (If not in the core product, is there a bot that does that?)
The reviewer then does a git fetch, and then checks out the newly created rr/ branch. They make any small changes that aren't worth a roundtrip and push them to the rr branch. They add FIXME comments for bigger changes. They then either assign the ticket back to the developer, or go ahead and merge straight into their own dev branch. Once an rr branch is merged it's simply deleted. The dev branch is then pushed and CI will merge it to that user's master when it's green.
IntelliJ will show branches in each origin organized by "folder" if you use slashes in branch names, and gitolite (which is what we use to run our repos) can impose ACLs by branch name too. So for example only user alice can push to a branch named rr/alice/whatever in each person's repo. That ensures it's always clear where a PR/RR is coming from.
Because each user gets their own git repo and cloned set of individual CI builds, you can push experimental or WIP branches to your personal area and iterate there without bothering other people.
This workflow gets rid of things like draft PRs (which are a contradiction), it ensures each reviewer has a personal review queue, it means work and progress is tracked via the bug tracker (which understands commands in commit messages so you can mark bugs as fixed when they clear CI automatically) and it eliminates the practice of requesting dozens of tiny changes that'd be faster for the reviewer to apply themselves, because reviewer and task owner can trade commits on the rr branch using git's features to keep it all organized and mergeable.
Test starts failing, immediately send a report with the failing input, then continue with the test case minimisation and send another report when that finishes.
Concurrently, start up another long running process to look for other failures, skipping the input that caused the previous failure. We do want new inputs for the same failure though. This is the tricky one. We could probably make it work by having the prop test framework not reuse previously-failing inputs, but that’s one of the big strategies it uses to catch regressions.
[1] specifically, hypothesis on python
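The skip-known-failures loop described above might be sketched like this (a toy harness, not the actual framework API; note that it reports each failure immediately but omits the test-case minimisation step):

```python
def fuzz(gen_input, run, max_runs=1000):
    """Keep generating inputs; report every new failing input immediately,
    then skip it in later runs so the search finds *other* failures."""
    known_failures = set()
    reports = []
    for _ in range(max_runs):
        x = gen_input()
        if x in known_failures:
            continue  # already reported; we want new failures
        if not run(x):
            reports.append(x)        # immediate report with the raw input
            known_failures.add(x)    # minimisation would run concurrently
    return reports

# Toy demo: the "property" fails on even numbers; each is reported once.
inputs = iter(list(range(10)) * 5)
assert fuzz(lambda: next(inputs), lambda x: x % 2 == 1, max_runs=50) == [0, 2, 4, 6, 8]
```

As the parent notes, the hard part is that deduplicating by input conflicts with the framework's strategy of replaying previously-failing inputs to catch regressions.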
I once witnessed a team creating an app, its specs and its tests in three respective repositories, for no other reason than "each project should be in its own repository".
The added work/maintenance around that is crazy, for absolutely no gain in that case.
Phase 1. Code and test basic functions concerning any kind of arithmetic, mathematical distribution, state machines, file operations and datetimes. This documents any assumptions and makes a solid foundation.
Phase 2. Write a simulation for generating randomized inputs to test the whole system. Run it for hours. If I can't generate the inputs, find as big a variety of inputs as possible. Collect any bugs, fix, repeat. This reduces the chances of finding real time bugs by three orders of magnitude.
This has worked really well in the past, whether I'm working on games, parsers or financial software. I don't conform to corporate whatever-driven testing patterns because they are usually missing the crucial phase 2 and time phase 1 incorrectly.
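As an illustration of phase 2, here's a toy randomized round-trip loop in Python; the encoder/decoder pair is hypothetical, and the real "system" is whatever your project actually does:

```python
import random

# Hypothetical system under test: a trivial field encoder/decoder.
def encode(fields):
    return ",".join(fields)

def decode(text):
    return text.split(",") if text else []

# Phase 2: hammer the whole thing with randomized inputs, collect any bugs.
random.seed(1)
failures = []
for _ in range(10_000):
    fields = ["".join(random.choices("abc", k=random.randint(1, 5)))
              for _ in range(random.randint(0, 8))]
    if decode(encode(fields)) != fields:
        failures.append(fields)  # collect bugs, fix, repeat

assert failures == []
```

The generator deliberately avoids commas in field values; widening it to arbitrary strings is exactly the kind of step that flushes out the next batch of bugs.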
And the answer is pretty simple: pin the specific test repo version! Use lockfiles, or git submodules, or put "cd tests && git checkout 3e524575cc61" in your CI config file _and keep it in the same repo as source code_ (that part is very important!).
This solves all of author problems:
> new test case is added to the conformance test suite, but that test happens to fail. Suddenly nobody can submit any changes anymore.
Conformance test suite is pinned, so new test is not used. A separate PR has to update conformance test suite version/revision, and it must go through regular driver PR process and therefore must pass. Practically, this is a PR with 2 changes: update pin and disable new test.
> are you going to remember to update that exclusion list?
That's why you use an "expect fail" list (not an exclusion list) and keep it in the driver's dir. As you submit your PR you might see a failure saying: "congrats, test X which was expect-fail is now passing! Please remove it from the list". You'll need to make one more PR revision, but then you get working tests.
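The expect-fail mechanism described could be sketched like this (names are illustrative): a test moving from fail to pass is itself reported, prompting you to remove it from the list.

```python
# Expect-fail list, kept in the driver's directory alongside the code.
EXPECTED_FAILURES = {"conformance/test_new_feature"}

def evaluate(name, passed):
    expected_fail = name in EXPECTED_FAILURES
    if passed and expected_fail:
        return "unexpected-pass"   # "congrats... remove it from the list"
    if not passed and not expected_fail:
        return "fail"              # a real regression, blocks the PR
    return "ok"                    # pass, or a known expected failure

assert evaluate("conformance/test_new_feature", passed=True) == "unexpected-pass"
assert evaluate("conformance/test_old", passed=False) == "fail"
```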
> allowing tests to be marked as "expected to fail". But they typically also assume that the TB can be changed in lockstep with the SUT and fall on their face when that isn't the case.
And if your TB cannot be changed in lockstep with SUT, you are going to have truly miserable time. You cannot even reproduce the problems of the past! So make sure your kernel is known or at least recorded, repos are pinned. Ideally the whole machine image, with packages and all is archived somehow -- maybe via docker or raw disk image or some sort of ostree system.
> Problem #2 is that good test coverage means that tests take a very long time to run.
The described system sounds very nice, and I would love to have something like this. I suspect it will be non-trivial to get working, however. But meanwhile, there is a manual solution: have more than one test suite. "Pre-merge" tests run before each merge and contain small subset of testing. A bigger "continuous" test suite (if you use physical machines) or "every X hours" (if you use some sort of auto-scaling cloud) will run a bigger set of tests, and can be triggered manually on PRs if a developer suspects the PR is especially risky.
You can even have multiple levels (pre-merge, once per hour, 4 times per day), but this is often more trouble than it's worth.
And of course it is absolutely critical to have reproducible tests first -- if you come up to work and find a bunch of continuous failures, you want to be able to re-run with extra debugging or bisect what happened.
Indeed. Where I work we have a bunch of repos, but they always reference each other via pinned commits. We happen to use Nix, with its built in 'fetchGit' function; it's also easy to override any of these dependencies with a different revision. For example:
  { helpers ? import (fetchGit {
      url = "git://url-of-helpers.git";
      ref = "master";
      rev = "11111";
    })
  , some-library ? import (fetchGit {
      url = "git://url-of-some-library.git";
      ref = "master";
      rev = "22222";
    }) {}
  }:

  helpers.build-a-service {
    name = "my-service";
    src = ./src;
    deps = { inherit some-library; };
  }

This is a function taking two arguments ('helpers' and 'some-library'), with default arguments that fetch particular git commits. This gives us the option of calling the function with different values, e.g. to build against different commits.

We run our CI on GitHub Actions, which allows some jobs to be marked as 'required' for PRs (using branch protection rules). The normal build/test jobs use the default arguments and are marked as required: everything is pinned, so there should be no unexpected breakages.
Some of our libraries also define extra CI jobs, which are not marked as required. Those fetch the latest revision of various downstream projects which are known to use that library, and override the relevant argument with themselves. For example, the 'some-library' repo might have a test like this:
  import (fetchGit {
    url = "git://url-of-some-library.git";
    ref = "master";
    # No 'rev' given, so it will fetch 'HEAD'
  }) {
    # Build with this checkout of some-library, instead of the pinned version
    some-library = import ./. {};
  }

This lets us know if our PR would break downstream projects, were they to subsequently update their pinned dependencies (either because we've broken the library, or because the downstream project is buggy). It's useful for spotting problems early, regardless of whether the root cause is upstream or downstream.

Those tests should be as small as possible, to verify that everything is still wired together correctly.
Everything else should be either unit tests or narrow integration tests between a small handful of components. And as you said, they should live in the repository of the software they test.
Even if you do have external tests, you still need internal ones for the surface area your external tests don't check for. Unit tests and such don't make sense at all combined with a separate test repo.