At some point you'll have a weird problem, or a load spike, that shows up at regular intervals. If all of your intervals are 5/15/30 minutes, you will have 2 things running every 15 minutes and 3 things running every 30 minutes, and you won't necessarily know which one causes the issue.
If you use (co)prime numbers, say 5/7/11/13/17/19, as intervals: one, you won't have a thundering herd of tasks all running at the exact same time every few minutes, and two, when someone notices a weird issue that happens every 17 minutes, you will know exactly what the cause is.
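To put rough numbers on that, here is a minimal Python sketch. It assumes every job starts at minute 0 and fires exactly on multiples of its interval (real schedulers drift and jitter, so treat it as an illustration only); the function names are made up for this example.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def jobs_firing_at(minute, intervals):
    """Intervals whose job fires at this minute (all jobs assumed to start at 0)."""
    return [i for i in intervals if minute % i == 0]


def first_collision(intervals):
    """First minute > 0 at which at least two jobs fire together."""
    return min(
        lcm(a, b)
        for idx, a in enumerate(intervals)
        for b in intervals[idx + 1:]
    )


def full_herd_period(intervals):
    """Minutes between moments when every job fires at once."""
    return reduce(lcm, intervals)


round_numbers = [5, 15, 30]
primes = [5, 7, 11, 13, 17, 19]

print(jobs_firing_at(30, round_numbers))  # three suspects for a 30-minute symptom
print(first_collision(primes))            # earliest overlap of any two jobs
print(full_herd_period(primes))           # how rarely all six line up
```

With 5/15/30 the whole set lines up every 30 minutes. With the primes, the first pairwise overlap is at minute 35 and all six only coincide every 1,616,615 minutes, so a symptom with a 17-minute period points at one job.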
Edit: Yeah, I guess GP said "if you control everything". Still, how often do you actually control everything? Or how long can you control everything? Everything (heh) sufficiently complex to worry about this talks with other systems at some point, right?
But...
“Can you check /var/log/messages and see if there’s messages every 30 minutes about ENA going down and then back up?”
Isn't this "sysadmin 101"? Like... the first thing to check on any server exhibiting weird behaviour? :-) A message about a NIC going up and down every 30 minutes would have triggered many here instantly.
Interesting journey nevertheless!
It's a good writeup overall, but it's amazing how this bit applies to challenging scientific problems that have nothing to do with code, so try to read it from that point of view:
>One of the highest-productivity things your team can do is not have any “mysterious” bugs, so any new symptom that appears will instantly stand out. That way, it can be investigated while the code changes that produced it are still fresh in your mind.
>A rare, stochastic, poorly understood machine crash would completely poison our efforts to eradicate mysterious bugs. Any time a machine crashed, we would be tempted to dismiss it with, “Oh, it’s probably that weird rare issue again.” We decided that with this bug in the background, it would be impossible to maintain the discipline of digging into every machine failure and thoroughly characterizing it. Then more and more such issues would creep into our work, slowly degrading the quality of our systems.
>There are many people who say that a “zero bugs” mindset is excessive because, for rare bugs, the cost of fixing them exceeds the cost of living with them. But I find these people are rarely considering the indirect costs of rare bugs – on team velocity, discipline, and culture.
Some industries are so risky that the "zero defects" approach goes back way before there was software involved. That attitude can be practiced on things besides code, and can definitely be applied to advantage when coding.
In things like experimental chemistry, with a growing layer of electronics, computers, and software on top of it, and where one of the main ideas can be to strive for more "9's", each added layer is another wide opportunity for discrepancy.
Bugs propagate even worse in nature.
It’s briefly mentioned in a footnote here, but we have a lot of debugging war stories around the hypervisor protocol, many of which could themselves be blog posts. My personal favorite: we expected a certain hyperproperty related to determinism to hold during a refactor of the component on the other end of the hypervisor, but it was only holding some of the time, depending on the values of some parameters that were getting randomized during our testing. We dug in and figured out that, because we were round-robining across proposers of protocol messages into several pipelines, determinism held iff the number of proposers divided the number of pipelines or vice versa, and totally failed if they were coprime! If they had a smaller common factor greater than 1, there would be “partial determinism.” We very rarely ditch a suggested test property instead of trying to make it work, but that time we were defeated by number theory.
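For anyone wondering why divisibility matters here, a toy model in Python. This is an assumption for illustration, not the actual protocol: message i is taken to come from proposer `i % num_proposers` and to land in pipeline `i % num_pipelines`, and the function names and labels are invented.

```python
from math import gcd


def routing_table(num_proposers, num_pipelines):
    """Map each pipeline to the set of proposers that ever feed it.

    Toy model: message i comes from proposer i % num_proposers and is
    placed in pipeline i % num_pipelines. The pattern repeats after
    lcm(num_proposers, num_pipelines) messages, so one period is enough.
    """
    period = num_proposers * num_pipelines // gcd(num_proposers, num_pipelines)
    table = {pipe: set() for pipe in range(num_pipelines)}
    for i in range(period):
        table[i % num_pipelines].add(i % num_proposers)
    return table


def classify(num_proposers, num_pipelines):
    """Label the mixing pattern using only the gcd of the two counts."""
    g = gcd(num_proposers, num_pipelines)
    if g == min(num_proposers, num_pipelines):
        return "clean"    # one count divides the other
    if g == 1:
        return "mixed"    # coprime: every pipeline sees every proposer
    return "partial"      # shared factor: each pipeline sees a strict subset


for proposers, pipelines in [(2, 4), (4, 2), (3, 4), (4, 6)]:
    print(proposers, pipelines, classify(proposers, pipelines),
          routing_table(proposers, pipelines))
```

In this model each pipeline sees `num_proposers / gcd` distinct proposers. When one count divides the other, either every pipeline has a single proposer or every proposer sticks to a single pipeline; when the counts are coprime, everything is interleaved with everything; a shared factor in between gives the "partial" case.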