For those who do this (post author or anyone else here), how do you deal with low-impact bugs?
As a concrete example, we're building a chat toolkit. One customer observed that in some versions of Firefox, when combining a particular set of features in our product, the scroll position wouldn't be remembered. This was 100% a bug. It's also an edge case of an edge case that likely only happened for this one customer, and even there, had a relatively small impact on UX for a small subset of their users. It was essentially a browser bug, and fixing it would require a big workaround that made one component of our product significantly more complex (and thus more prone to other bugs).
With a zero bug policy, we'd have to fix that before shipping anything else. But it made no business sense to do so, very much in the same sense that building a niche feature used by a tiny % of customers tends to make no sense.
But once you let that one fly, there's no zero-bug policy left, right? You can just declare any bug as "not important enough right now" and -poof!- zero bugs! Yay, time to ship features.
For context, I'm talking about a team as small and tight-knit as the author's.
If you know what the fix entails and have decided that the fix will make the product worse then that seems different than just not fixing a bug because it's not important enough to look into. If you don't know why the product is broken and what the fix will entail then there could be serious underlying problems that you don't know about and which are getting more expensive to fix with every passing day, but it sounds like that isn't a factor in this case.
To me a zero bug policy doesn't mean that the product will work for every person and every use case, but rather that it's working as intended and there aren't things that are broken for reasons that no one understands.
The policy is that we fix every bug immediately, or decide that it is so unimportant it should never be fixed.
Bugs that are unimportant are neither fixed nor tracked.
Edit: Example of an unimportant bug that I remember: “On Safari, when you resize the browser window, phantom lines appear temporarily in parts of the web page.” We decided it was unimportant because it only happened on desktop Safari, the impact was minor, and it would have been hard to fix.
I would perhaps call this "Bug first policy" vs "Feature first policy", or some such thing, rather than "0 bugs".
[0] https://en.m.wikipedia.org/wiki/Principle_of_least_astonishm...
My experience is that you pay a premium early on, but it pays dividends down the line. At some point, little bugs interact.
A big corollary of "zero bug" is "clean, simple architecture." If fixing bugs increases complexity, that's usually a symptom of something deeper....
A) Prolonged discussions about what the exact and precise definition of the word "bug" is, and how whatever was released last night and caused mayhem between clients and the support was not it
B) Bargaining so my pet stuff is released right after the holidays and I don't look like the unproductive schmuck
C) Using this to nitpick at and get rid of employees someone doesn't like
D) Stack ranking the employees by the number of bugs they let slip past them
E) A full-on war between developers and the QA department
F) Fear of making any progress at all because a bug might creep in
...and of course, anything else you might have seen in your favorite Kafka books, in "Brazil", in "1984", in Lem's "Memoirs Found in a Bathtub", you name it. Of course, in the metrics it's going to look as if the company exceeds any expectations in implementing its zero bug tolerance policy! The managers will work hard on the infographics to show you.
"I know this has bugs, but the PM has promised this feature by end of this month"
Penny-wise, pound-foolish. Sounds like they’re saving money by cutting corners and that that is incurring a cost greater than the amount saved.
IMHO in the long run this saves a lot of time and money. Even bugs with zero user impact can signal some deep misunderstanding about technology, and fixing the problem immediately before it gets replicated everywhere else in the codebase is hugely valuable. Several times there have been cases where there was an extremely inconsequential issue that led to us discovering and fixing all sorts of important bugs that we hadn't even known about.
Kudos on having an awesome product.
But it gets hard when the users do not cooperate, and there is nothing to reproduce.
In an open-source app I only got 3 bug reports in the bug tracker in nearly 15 years. And they were not really bugs either: one question and two https problems. I hope it is because I tested every change for months and have thousands of automated tests. Or it is because the users do not find the bug tracker.
I do get a lot of mails. They are all useless. The most common is, "There is an error message 'Invalid password'". Then I reply that this message appears when they enter an invalid password. Then they do not respond. And then I do not know if there is a bug or whether they have entered a wrong password. Then I also test it for a few hours and see that it sends exactly the password that was entered to the server.
Another project, an HTTP client. Bug report: "untrusted https certificate" on someone else's server. I try it on my system, and it works fine. Then I ask for their OpenSSL version, and do not get an answer. Now what can I do about this? I try it on multiple computers, and it works on all of them.
Another open-source project, much more popular. Bug report: it crashes frequently. Because it is much more popular and has competent users, I do not have to do anything about it; the users investigate it themselves. Two months later, the user has extracted the crashing code. They remove as much code as possible, until they obtain a minimal crashing program. It shares zero code with my project. All that remains are calls to an open-source library, and they report it upstream to the developers of the library. I guess there is nothing to do until they fix it there?
On that project I also get emails. They are also useless, because the competent users use the bug tracker. After moving to 64-bit, I get a lot of "it does not start anymore". I guess they use a 32-bit OS.
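For anyone curious what "remove as much code as possible" can look like when automated, here is a rough sketch of a greedy reducer, loosely in the spirit of delta debugging. It is not what that user actually did; the function name `reduce_lines` and the `still_crashes` predicate are hypothetical, and in practice the predicate would build and run the candidate program rather than inspect a list.

```python
def reduce_lines(lines, still_crashes):
    """Greedily drop chunks of lines while the crash still reproduces.

    `still_crashes` is a caller-supplied (hypothetical) predicate that
    returns True when the given list of lines still triggers the crash.
    The result is smaller, but not guaranteed to be truly minimal.
    """
    if not still_crashes(lines):
        raise ValueError("the starting program must reproduce the crash")
    chunk = max(len(lines) // 2, 1)
    while chunk >= 1:
        i = 0
        while i < len(lines):
            # Try the program with lines[i:i+chunk] removed.
            candidate = lines[:i] + lines[i + chunk:]
            if candidate and still_crashes(candidate):
                lines = candidate  # removal kept the crash; keep it removed
            else:
                i += chunk         # this chunk is needed; move past it
        chunk //= 2                # retry with finer-grained removals
    return lines


# Toy predicate standing in for "compile and run, check for the crash".
program = ["import a", "setup()", "x = 1", "lib.call(x)", "print(x)"]
crashes = lambda ls: "x = 1" in ls and "lib.call(x)" in ls
print(reduce_lines(program, crashes))
```

If what survives is only calls into a third-party library, as in the story above, that is a decent hint the report belongs upstream.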
Zero code is pretty much the only way you can guarantee zero bugs.
Essentially the antithesis of "Move fast, break things"
It's not a "Zero bug guarantee"
If you accept that Chaos destroys Value (and it surely does), then it is a no-brainer to adopt workflows that find and kill Chaos.
One value-add pattern that is really helpful for finding Chaos is using software health metrics to find the echoes of Chaos. Much like how we find black holes by looking for gravitational lensing, Chaos can be found by looking at metrics like software response times under Representative Load. Inconsistent response times are an indication of unhealthy contention in the solution (things waiting on other things that are waiting on other things, with something pausing intermittently). Response times that become slower over time are also an indication of poor health.
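As a rough illustration of the two signals described above (inconsistent response times, and response times that get slower over time), here is a minimal sketch. The function name, the use of the coefficient of variation, the first-half vs. second-half comparison, and both thresholds are my own assumptions for illustration, not part of the Value, Filler & Chaos model.

```python
from statistics import mean, pstdev


def response_time_health(samples_ms, cv_threshold=0.5, slowdown_threshold=1.2):
    """Flag inconsistent or steadily slowing response times.

    samples_ms: response times in milliseconds, in chronological order,
    collected under a representative load. Thresholds are illustrative
    assumptions and would need tuning for a real system.
    """
    if len(samples_ms) < 4:
        raise ValueError("need at least 4 samples")
    # Coefficient of variation: spread relative to the average.
    cv = pstdev(samples_ms) / mean(samples_ms)
    # Crude trend check: compare the later half against the earlier half.
    half = len(samples_ms) // 2
    slowdown = mean(samples_ms[half:]) / mean(samples_ms[:half])
    return {
        "cv": cv,
        "slowdown": slowdown,
        "inconsistent": cv > cv_threshold,
        "getting_slower": slowdown > slowdown_threshold,
    }


steady = [100, 102, 98, 101, 99, 100]
spiky = [100, 100, 900, 100, 100, 950]      # intermittent pauses
degrading = [100, 105, 110, 150, 160, 170]  # slower over time

for name, data in [("steady", steady), ("spiky", spiky), ("degrading", degrading)]:
    report = response_time_health(data)
    print(name, report["inconsistent"], report["getting_slower"])
```

With these made-up numbers, only the spiky series is flagged as inconsistent and only the degrading series as getting slower. Real monitoring would more likely look at percentiles over time windows, but the idea is the same.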
Some other useful insights from the Value, Filler & Chaos model are:
• Teams run at 20% value or less. This really has to do with the nature of discovering new, valuable software embodiments. Discovery of new things requires many value attempts, most of which fail, but result in new learning
• Removing unused features is a win because you reduce Filler and sources of Chaos
• Mobile apps taken as a whole run about 1% value (positive revenue), the rest is all Chaos and Filler
• To know if something is Value vs Filler there has to be Traction. Chaos also destroys Traction. The article is a classic case of the team recognizing that Chaos was destroying Traction
It's very possible that this particular team could stand to put a higher priority on fixing bugs before implementing new features. That's ultimately going to be what it is, no matter what they call it. They are free to call it "Zero Bug Tolerance" as long as everybody understands that it's hyperbole and they don't get into endless bikeshedding on what constitutes a bug and if they really should fix it.
It's pretty obvious there will eventually be a bug that's too rare and weird to really troubleshoot, or too niche and complex to bother fixing, or more trouble than it's worth.
I came up with this system of coverage which would be a huge improvement and much tighter testing process, eventually moving "left" up the development pipeline:
I proposed this and a bunch of other ideas, and the general reaction was flat. My boss said he didn't understand any of it and cut me off trying to explain.
I realize now, the goal set by the CTO was just talk, they had no interest in any real process change. And so, nothing changed.
The concepts are sound, granted they could be better explained, I'm working on it, just not being paid to do so.
What you have to do is quickly triage all bugs, and then make a call on when you're going to address each one.
Some you fix immediately, some you postpone to a scheduled release, some you never fix, just document them.
Before a release, you set a bar on what bugs are acceptable for release, but you make a call on them, involving quality, product and engineering teams.
A company has finite resources, you need to invest wisely.
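To make the triage idea concrete, here is a small sketch of the three outcomes (fix immediately, postpone to a scheduled release, never fix but document) plus a release bar. The `Bug` fields, the scoring formula, and every threshold are hypothetical; a real team would make these calls with quality, product and engineering in the room rather than with a formula.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    FIX_NOW = "fix immediately"
    SCHEDULE = "postpone to a scheduled release"
    DOCUMENT = "never fix, just document"


@dataclass
class Bug:
    title: str
    severity: int              # hypothetical scale: 1 (cosmetic) to 5 (data loss)
    users_affected_pct: float  # hypothetical estimate, 0 to 100
    fix_cost_days: float       # hypothetical engineering estimate


def triage(bug):
    """Return a Decision from severity, reach, and cost (illustrative rules)."""
    if bug.severity >= 4:
        return Decision.FIX_NOW
    # Impact per day of engineering effort; guard against a zero estimate.
    score = bug.severity * bug.users_affected_pct / max(bug.fix_cost_days, 0.1)
    if score >= 10:
        return Decision.FIX_NOW
    if score >= 1:
        return Decision.SCHEDULE
    return Decision.DOCUMENT


def release_blockers(open_bugs, max_acceptable_severity=2):
    """Open bugs above the severity bar agreed for this release."""
    return [b for b in open_bugs if b.severity > max_acceptable_severity]


bugs = [
    Bug("Login crashes on submit", severity=5, users_affected_pct=30, fix_cost_days=2),
    Bug("Export is slow for large files", severity=3, users_affected_pct=5, fix_cost_days=3),
    Bug("Scroll position lost in one browser", severity=2, users_affected_pct=0.1, fix_cost_days=10),
]
for bug in bugs:
    print(bug.title, "->", triage(bug).value)
print([b.title for b in release_blockers(bugs)])
```

The point is less the arithmetic than that each bug gets an explicit, recorded decision instead of sitting in a backlog.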
Time to market is still a thing and incompatible with overzealous bug fixing. I would settle on a zero-regression policy. Then at least you keep happy customers happy and potentially reach fewer bugs in the future.
Of course it depends on the application. An automotive ABS system has different requirements than an IRC server.
"In retrospect, maybe the strategy to reach approximate feature parity real fast was not the optimal one."
Or maybe that strategy was, and usually is, the only way to attract paying customers.
How to balance bug-fixes with new feature development is always a trade-off. It's never as simple as "Zero Bug Tolerance." Unless, I suppose, you're writing software for a space ship or deep sea vehicle.