Zero Bug Tolerance

(karlerss.com)

42 points · karlerss · 5y ago · 37 comments

35 comments · 16 top-level

skrebbel · 5y ago · 9 in thread

I really like the idea of zero-bug policies but I struggle with them in practice.

For those who do this (post author or anyone else here), how do you deal with low-impact bugs?

As a concrete example, we're building a chat toolkit. One customer observed that in some versions of Firefox, when combining a particular set of features in our product, the scroll position wouldn't be remembered. This was 100% a bug. It's also an edge case of an edge case that likely only happened for this one customer, and even there, had a relatively small impact on UX for a small subset of their users. It was essentially a browser bug, and fixing it would require a big workaround that made one component of our product significantly more complex (and thus more prone to other bugs).

With a zero bug policy, we'd have to fix that before shipping anything else. But it made no business sense to do so, in much the same way that building a niche feature used by a tiny % of customers tends to make no sense.

But once you let that one fly, there's no zero-bug policy left, right? You can just declare any bug as "not important enough right now" and -poof!- zero bugs! Yay, time to ship features.

For context, I'm talking about a team as small and tight-knit as the author's.

Alex3917 · 5y ago

>But once you let that one fly, there's no zero-bug policy left, right?

If you know what the fix entails and have decided that the fix will make the product worse then that seems different than just not fixing a bug because it's not important enough to look into. If you don't know why the product is broken and what the fix will entail then there could be serious underlying problems that you don't know about and which are getting more expensive to fix with every passing day, but it sounds like that isn't a factor in this case.

To me a zero bug policy doesn't mean that the product will work for every person and every use case, but rather that it's working as intended and there aren't things that are broken for reasons that no one understands.

paulryanrogers · 5y ago

I.e. zero-unknown bugs?

jdlshore · 5y ago

I follow a zero bug policy, and have taught several teams to do so.

The policy is that we fix every bug immediately, or decide that it is so unimportant it should never be fixed.

Bugs that are unimportant are neither fixed nor tracked.

Edit: Example of an unimportant bug that I remember: “On Safari, when you resize the browser window, phantom lines appear temporarily in parts of the web page.” We decided it was unimportant because it only happened on desktop Safari, the impact was minor, and it would have been hard to fix.

staticassertion · 5y ago

Based on the article the goal is not to have 0 bugs, it is to strive for 0 bugs. Bugs are prioritized highly, testing becomes front and center, bug reports are triaged and well documented.

I would perhaps call this "Bug first policy" vs "Feature first policy", or some such thing, versus "0 bugs"

oso2k · 5y ago

For me, it's the Principle of Least Surprise [0]. You may consider it a browser bug, but your bug is not checking for the set of enabled product features and the browser version, then warning the user or (better) somehow protecting the user from that situation. You might not have the resources to fix the browser issue. But you are aware of the situation, and it seems rude not to take the opportunity to "apologize" for the issue when it occurs or when you know it will occur. Now imagine your system is a Safety Critical System [1]. Not warning could harm someone using your system, or allow someone to be harmed by being caught unawares.

[0] https://en.m.wikipedia.org/wiki/Principle_of_least_astonishm...

[1] https://en.m.wikipedia.org/wiki/Safety-critical_system

joshdev · 5y ago

I think having a zero P0 and P1 bug policy makes sense. As others have said some bugs may not make business sense to fix. I do think it is useful to set thresholds for other bug priorities as well. 1,000 P2 bugs may make your product feel unusable.
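
A rough sketch of what per-priority thresholds could look like as a release gate. Only the "zero P0 and P1" rule comes from the comment above; the P2 and P3 ceilings, the `MAX_OPEN` table and both function names are invented for illustration.

```python
from collections import Counter

# Hypothetical limits: zero P0/P1 comes from the comment above,
# the P2 and P3 ceilings are made-up numbers for illustration.
MAX_OPEN = {"P0": 0, "P1": 0, "P2": 25, "P3": 100}


def over_threshold(open_bug_priorities):
    """Return {priority: (open_count, limit)} for every breached limit."""
    counts = Counter(open_bug_priorities)
    return {
        prio: (counts[prio], limit)
        for prio, limit in MAX_OPEN.items()
        if counts[prio] > limit
    }


def can_ship(open_bug_priorities):
    """A release is allowed only when no priority exceeds its limit."""
    return not over_threshold(open_bug_priorities)
```

Calling `over_threshold` with the priorities of all open bugs shows which bucket is blocking a release, which may be more useful in a standup than a bare yes/no.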

murgindrag · 5y ago

I like zero bug policies in my code. If it's a browser bug, it's not a bug in my code.

My experience is that you pay a premium early on, but it pays dividends down the line. At some point, little bugs interact.

A big corollary of "zero bug" is "clean, simple architecture." If fixing bugs increases complexity, that's usually a symptom of something deeper....

Archelaos · 5y ago

Have you considered fixing the browser bug by providing a patch?

skrebbel · 5y ago

I have not. The browser is written in C++, a language that I have not touched for 15 years, and I'm 100% unfamiliar with the codebase. Just getting it to compile on my box would likely be more time consuming than I'd be willing to spend. I don't even think I have a C++ compiler on my computer anymore.

WesolyKubeczek · 5y ago · 3 in thread

I can see how it can work in a very tight-knit, small team. I fail to see how such a policy can work at a bigger company, especially once you have a middle management layer. Once your company grows large enough, you're going to have:

A) Prolonged discussions about what the exact and precise definition of the word "bug" is, and how whatever was released last night and caused mayhem between clients and support was not it

B) Bargaining so my pet stuff is released right after the holidays and I don't look like the unproductive schmuck

C) Using this to nitpick at and get rid of employees someone doesn't like

D) Stack ranking the employees by the number of bugs they let slip past them

E) A full-on war between developers and the QA department

F) Fears to make any progress at all because a bug might creep in

...and of course, anything else you might have seen in your favorite Kafka books, in "Brazil", in "1984", in Lem's "Memoirs Found in a Bathtub", you name it. Of course, in the metrics it's going to look as if the company exceeds any expectations in implementing its zero bug tolerance policy! The managers will work hard on the infographics to show you.

RegnisGnaw · 5y ago

You forgot:

"I know this has bugs, but the PM has promised this feature by end of this month"

markbeare · 5y ago

100%

marcinzm · 5y ago

Suddenly there will be a decrease in reported bugs and an increase in reported feature requests. New policy works, we have fewer bugs in production, bonuses for management!

tantalor · 5y ago · 2 in thread

This is kind of silly. Some bugs are more severe than others. Some bugs cost you money, some do not. How can you prioritize fixing bugs vs. developing features when you have "zero tolerance" for bugs?

rspeele · 5y ago

Easy - just define the lack of a feature as a bug when it's convenient to do so.

gregorygoc · 5y ago

Great ideas must come from great minds.

closeparen · 5y ago · 2 in thread

Our QA has zero working hours overlap with backend engineering, which has one working hour overlap with mobile engineering. QA’s bug reports never include the relevant IDs on the first pass, so if it’s potentially a backend issue where we need logs, we have to comment on the JIRA and wait 24 hours. It’s amazing. I wish we could have zero bug tolerance or drop everything to fix bugs. Local management would even like to. But a globally distributed cost-conscious company is physically incapable of collaborating that fast. All we could do is sit there and twiddle our thumbs while waiting for our peers around the world to wake up and see our messages. So we work on features.

samus · 5y ago

This rather sounds like your management struggles to improve the bug reporting process, or that they don't care. Features can always be developed separately and parked in a branch or Pull Request (plenty of open source projects work that way). But it must be crazy to have ~24h turnaround time. If it's a more complex bug then you are looking at a week completely lost just waiting for the sun to rise on the other side of the world!

chrismorgan · 5y ago

> But a globally distributed cost-conscious company is physically incapable of collaborating that fast.

Penny-wise, pound-foolish. Sounds like they’re saving money by cutting corners and that that is incurring a cost greater than the amount saved.

Alex3917 · 5y ago · 1 in thread

At FWD:Everyone we always ensure there are zero known bugs in production at any given time. So if a bug is reported, it's always fixed within 24 hours, and no feature development is done until we're back to zero known bugs. If a user reports a bug before they go to bed, then more often than not they get an email with a postmortem and an explanation of the fix before they wake up. E.g. here is one that I published: https://www.fwdeveryone.com/t/Ebdvx32aSz2DAqxKBpee7w/feature...

IMHO in the long run this saves a lot of time and money. Even bugs with zero user impact can signal some deep misunderstanding about technology, and fixing the problem immediately before it gets replicated everywhere else in the codebase is hugely valuable. Several times there have been cases where there was an extremely inconsequential issue that led to us discovering and fixing all sorts of important bugs that we hadn't even known about.

winrid · 5y ago

This is possible because you are very focused on one product, not an enterprise company with ten products and three teams with four PMs that also do sales calls :)

Kudos on having an awesome product.

benibela · 5y ago · 1 in thread

I used to have a zero bug policy in my projects

But it gets hard when the users do not cooperate, and there is nothing to reproduce

In an open-source app I only got 3 bug reports in the bug tracker in nearly 15 years. And they were not really bugs either, one question and two https problems. I hope it is because I had tested any change for months and have thousands of automated tests. Or it is because the users do not find the bug tracker.

I do get a lot of mails. They are all useless. Most common is, "There is an error message 'Invalid password'". Then I reply that message comes when they enter an invalid password. Then they do not respond. And then I do not know if there is a bug or whether they have entered a wrong password. Then I also test it for a few hours and see it sends exactly the password to the server that was entered

Another project, an HTTP client. Bug report: "untrusted https certificate" on someone else's server. I try it on my system, and it works fine. Then I ask for their OpenSSL version, and do not get an answer. Now what can I do about this? I try it on multiple computers, and it works on all of them.

Another open-source project, much more popular. Bug report: it crashes frequently. Because it is much more popular and has competent users, I do not have to do anything about it; the users investigate it themselves. Two months later, the user has extracted the crashing code. They remove as much code as possible, until they obtain a minimal crashing program. It shares zero code with my project. All that remains are calls to an open-source library, and they report it upstream to the developers of the library. I guess there is nothing to do until they fix it there?

On that project I also get emails. They are also useless, because the competent users use the bug tracker. After moving to 64-bit, I get a lot of "it does not start anymore". Guess they use a 32-bit OS

Archelaos · 5y ago

One of my open-source projects was quite popular in the 2000s. My policy regarding invalid bug reports was that I did not take them very seriously if they occurred only once for a particular issue. But if I got a second independent email about the same problem, I treated it as a bug in my documentation and fixed the manual.
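
The "second independent report" rule above can be sketched in a few lines. This is only an illustration: the `ReportLog` class, its method names and the idea of keying reports by a free-text issue label are all assumptions, not something described in the comment.

```python
from collections import defaultdict


class ReportLog:
    """Tracks which distinct people reported each (possibly invalid) issue."""

    def __init__(self):
        # issue label -> set of distinct reporter identities
        self._reporters = defaultdict(set)

    def record(self, issue, reporter):
        """Store a report; return True once the issue deserves a docs fix.

        The cutoff of two independent reporters follows the policy in the
        comment above; repeated mails from one person count only once.
        """
        self._reporters[issue].add(reporter)
        return len(self._reporters[issue]) >= 2
```

Using a set per issue is what makes the reports "independent": the same user writing twice never trips the threshold.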

collyw · 5y ago · 1 in thread

Zero bugs sounds like the zero covid fantasy that some authoritarians are pushing for.

Zero code is pretty much the only way you can guarantee zero bugs.

thitcanh · 5y ago

"Zero bug policy" means: Let's build in a way that avoids new bugs and regressions (via testing, for example) and prioritize bug fixing over new features.

Essentially the antithesis of "Move fast, break things"

It's not a "Zero bug guarantee"

mawise · 5y ago

I recognize this as a valuable counterpoint to the "move fast and break things" ethos, but I disagree with the framing. A while ago I learned about a model of "error budgets" from Google [1] which really resonated with me. You want errors to happen sufficiently infrequently that when the user encounters an issue it usually isn't your fault (instead it's something with the user's hardware, or their ISP, etc). Optimizing beyond this point is a waste of time because you can't eliminate errors that are outside the control of your system. It provides a well-defined framework for setting the threshold of "does this matter enough to slow down and fix it".

[1]: https://sre.google/sre-book/embracing-risk/
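
To make the error-budget arithmetic concrete, here is a minimal sketch. It assumes a request-based availability target; the function names and the 99.9% figure in the usage note are illustrative and not taken from the linked chapter.

```python
def error_budget(slo_target, total_requests):
    """Number of requests allowed to fail while still meeting the target.

    slo_target is a fraction such as 0.999 (an assumed example value).
    """
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return 1.0 - failed_requests / budget
```

With a 0.999 target and one million requests, the budget is roughly 1,000 failed requests; having used 250 of them leaves about three quarters of the budget, which under this model would argue for shipping features rather than slowing down.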

LegitGandalf · 5y ago

It is useful to think of software change as being a mix of Value, Filler & Chaos. Value being something your customers need & use, Filler being something crafted, but customers say "meh" to, and Chaos being bugs, poor performance, etc.

If you accept that Chaos destroys Value (and it surely does), then it is a no-brainer to do workflows that find and kill Chaos.

One value-add pattern that is really helpful for finding Chaos is using software health metrics to find the echoes of Chaos. Much like how we find black holes by looking for gravitational lensing, Chaos can be found by looking at metrics like software response times under Representative Load. Inconsistent response times are an indication of unhealthy contention in the solution (things waiting on other things that are waiting on other things, while something pauses intermittently). Becoming slower over time is obviously an indication of poor health as well.
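
One possible way to put a number on "inconsistent response times" is sketched below. The choice of metrics (p99-to-median ratio and coefficient of variation) and the function name are assumptions for illustration; the comment above does not name a specific metric.

```python
import statistics


def latency_spread(samples_ms):
    """Summarize how inconsistent a set of response times is.

    Expects at least two positive samples, in milliseconds.
    """
    ordered = sorted(samples_ms)
    median = statistics.median(ordered)
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = statistics.quantiles(ordered, n=100)[98]
    mean = statistics.fmean(ordered)
    return {
        "median_ms": median,
        "p99_ms": p99,
        # A large tail ratio means some requests wait far longer than typical.
        "tail_ratio": p99 / median,
        # Coefficient of variation: spread relative to the mean.
        "cv": statistics.stdev(ordered) / mean,
    }
```

Comparing `tail_ratio` or `cv` between releases, measured under the same load, would be one way to spot the intermittent pauses described above before anyone files a bug about them.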

Some other useful insights from the Value, Filler & Chaos model are:

• Teams run at 20% value or less. This really has to do with the nature of discovering new, valuable software embodiments. Discovery of new things requires many value attempts, most of which fail, but result in new learning

• Removing unused features is a win because you reduce Filler and sources of Chaos

• Mobile apps taken as a whole run about 1% value (positive revenue), the rest is all Chaos and Filler

• To know if something is Value vs Filler there has to be Traction. Chaos also destroys Traction. The article is a classic case of the team recognizing that Chaos was destroying Traction

ufmace · 5y ago

This is the kind of thing where everything is a judgement call and arbitrary policies applied strictly become absurd and useless.

It's very possible that this particular team could stand to put a higher priority on fixing bugs before implementing new features. That's ultimately going to be what it is, no matter what they call it. They are free to call it "Zero Bug Tolerance" as long as everybody understands that it's hyperbole and they don't get into endless bikeshedding on what constitutes a bug and if they really should fix it.

It's pretty obvious there will eventually be a bug that's too rare and weird to really troubleshoot, or too niche and complex to bother fixing, or more trouble than it's worth.

r0s · 5y ago

In my largish company, the CTO announced a similar "zero regressions in production" goal.

I came up with this system of coverage which would be a huge improvement and much tighter testing process, eventually moving "left" up the development pipeline:

https://eratestcoverage.org/

I proposed this and a bunch of other ideas, and the general reaction was flat. My boss said he didn't understand any of it and cut me off trying to explain.

I realize now, the goal set by the CTO was just talk, they had no interest in any real process change. And so, nothing changed.

The concepts are sound, though granted they could be better explained. I'm working on it, just not being paid to do so.

S_A_P · 5y ago

I am all for this. I think it is something to strive for. I also think that bug fixing can sometimes take several hundred percent of the original development time. This is where initiatives like this lose steam. Complexity of setup/recreation/intermittent bugs means explaining to a project manager and/or development manager that the task is going to miss the sprint. Or that the task is taking x number of hours and this causes said manager to see what the bug actually costs to fix. Then they look at the feature backlog and something has to give. Can’t tell senior leadership that we are missing one of their arbitrary (or even well thought out and pragmatic) deadlines because then the PM worries that they will be perceived as losing control of the project. So priority changes and features are the focus. It’s just the way I have seen things go way too many times, whether I was involved or not.

juancn · 5y ago

It's a waste of resources. Bugs have to be triaged and prioritized. Not all bugs are equal, some are existential threats to the business, others have so little impact that they can be postponed (note that a bug fix has a chance of introducing a new one, sometimes worse than the original).

What you have to do is quickly triage all bugs, and then make a call on when you're going to address each one.

Some you fix immediately, some you postpone to a scheduled release, some you never fix, just document them.

Before a release, you set a bar on what bugs are acceptable for release, but you make a call on them, involving quality, product and engineering teams.

A company has finite resources, you need to invest wisely.
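
The three outcomes described above (fix now, schedule, or document and never fix) could be encoded roughly as follows. The 1-to-5 scales for `impact` and `fix_risk` and the cutoffs are invented for illustration; in practice the call would be made by people, as the comment says.

```python
from enum import Enum


class Decision(Enum):
    FIX_NOW = "fix immediately"
    NEXT_RELEASE = "postpone to a scheduled release"
    DOCUMENT_ONLY = "never fix, just document"


def triage(impact, fix_risk):
    """Map a bug to one of the three outcomes.

    impact and fix_risk run from 1 (low) to 5 (high); both scales and the
    cutoffs below are hypothetical.
    """
    if not (1 <= impact <= 5 and 1 <= fix_risk <= 5):
        raise ValueError("impact and fix_risk must be between 1 and 5")
    if impact >= 4:
        # Threats to the business get fixed regardless of how risky the fix is.
        return Decision.FIX_NOW
    if impact == 1 and fix_risk >= 4:
        # Negligible impact plus a fix likely to introduce worse bugs.
        return Decision.DOCUMENT_ONLY
    return Decision.NEXT_RELEASE
```

Taking `fix_risk` as an input reflects the point that a fix can itself introduce a new bug, sometimes worse than the original.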

guenthert · 5y ago

Zero bugs sounds neither practical nor necessarily desirable. I still remember being quite happy about the first bug report I received -- only then did I know that the software was actually being used and hadn't vanished into someone's vault.

Time to market is still a thing and incompatible with overzealous bug fixing. I would settle on a zero-regression policy. Then at least you keep happy customers happy and potentially reach fewer bugs in the future.

Of course it depends on the application. An automotive ABS system has different requirements than an IRC server.

pwinnski · 5y ago

It's easier to have a "strive for 0 bugs" policy after you've already built a bunch of features and attracted paying customers.

"In retrospect, maybe the strategy to reach approximate feature parity real fast was not the optimal one."

Or maybe that strategy was, and usually is, the only way to attract paying customers.

How to balance bug-fixes with new feature development is always a trade-off. It's never as simple as "Zero Bug Tolerance." Unless, I suppose, you're writing software for a space ship or deep sea vehicle.

jjjeii3 · 5y ago

You don't need to comply with GDPR unless your business is located in the EU or you have a subsidiary in the EU. There is no legal framework that allows the EU to enforce GDPR overseas unless both countries have an agreement. Due to this limitation, some people, including Edward Snowden, have called GDPR a "paper tiger".
