The worst bug we faced at Antithesis

(antithesis.com)

56 points · wizerno · 2y ago · 18 comments

15 comments · 7 top-level

justinsaccount · 2y ago

If you are ever building a platform and have control over everything, one thing that can make problems like this easier to find is to not use regular intervals like 5/15/30/60 minutes everywhere.

At some point you'll have a weird problem, or a load spike that shows up at regular intervals. If all of your intervals are 5/15/30 minutes, you will have 2 things running every 15 minutes and 3 things running every 30 minutes, and you won't necessarily know which one causes the issue.

If you use (co)prime numbers, say, 5/7/11/13/17/19 as intervals: One, you won't have a thundering herd of tasks all running at the exact same time every few minutes, and two, when someone notices a weird issue that happens every 17 minutes, you will know exactly what the cause is.
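The interval argument above can be checked with a few lines of arithmetic. This is only a sketch: it assumes every task starts at minute 0 and fires exactly on multiples of its interval, and the function names `tasks_firing_at` and `first_full_collision` are made up for illustration.

```python
from math import lcm  # variadic lcm needs Python 3.9+


def tasks_firing_at(minute, intervals):
    """Return the intervals (in minutes) of the tasks that fire at this minute.

    Assumes every task started at minute 0 and fires on exact multiples.
    """
    return [n for n in intervals if minute % n == 0]


def first_full_collision(intervals):
    """First minute after 0 at which all tasks fire together."""
    return lcm(*intervals)


round_intervals = [5, 15, 30]
prime_intervals = [5, 7, 11, 13, 17, 19]

# With round intervals, a spike on the half hour has three suspects.
print(tasks_firing_at(30, round_intervals))
# With prime intervals, a spike at minute 17 has one.
print(tasks_firing_at(17, prime_intervals))
# How long until every task lines up at once.
print(first_full_collision(round_intervals), first_full_collision(prime_intervals))
```

With the round intervals every task lines up again after 30 minutes; with the six primes it takes 1,616,615 minutes, which is a little over three years.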

cozzyd · 2y ago

wouldn't it have made it harder to find in this case if the DHCP lease time was 34 minutes?

eszed · 2y ago

If it's in your control, use co-prime scheduling. If it's not in your control, um... Hope they didn't use co-primes? Or a multiple of your co-primes? Er, yeah. I see the justification for doing it, but it's not exactly a cure for complexity. It'll work up until everyone else catches up with your weird trick, and then there may be more collisions than there were before.

Edit: Yeah, I guess GP said "if you control everything". Still, how often do you actually control everything? Or how long can you control everything? Everything (heh) sufficiently complex to worry about this talks with other systems at some point, right?

justinsaccount · 2y ago

Why?

rdg42 · 2y ago

Great read!

But...

“Can you check /var/log/messages and see if there’s messages every 30 minutes about ENA going down and then back up?”

Isn't this "sysadmin 101"? Like... the first thing to check on any server exhibiting weird behaviour? :-) A message about a NIC going up & down every 30 minutes would have triggered many here instantly.

Interesting journey nevertheless!

nusl · 2y ago

It’s probable that they did do that, but also that the network issue didn’t appear related, even though it’s suspect on its own.

cbanek · 2y ago

Seems like the other lesson is that every time you add a 9 to your uptime by fixing a bug, the next such issue is going to take longer to find, in either wall time or dev time.

fuzzfactor · 2y ago

This resembles what you face in natural science, where you don't have to be an entomologist for your primary goal to be elucidating the complex variety of creepy irregularities nature throws at you, whether they are expected or not, and whether or not there will ever be a solution or any progress.

It's a good writeup overall, but it's amazing how this bit applies to challenging scientific problems that have nothing to do with code, so try to read it from that point of view:

>One of the highest-productivity things your team can do is not have any “mysterious” bugs, so any new symptom that appears will instantly stand out. That way, it can be investigated while the code changes that produced it are still fresh in your mind.

>A rare, stochastic, poorly understood machine crash would completely poison our efforts to eradicate mysterious bugs. Any time a machine crashed, we would be tempted to dismiss it with, “Oh, it’s probably that weird rare issue again.” We decided that with this bug in the background, it would be impossible to maintain the discipline of digging into every machine failure and thoroughly characterizing it. Then more and more such issues would creep into our work, slowly degrading the quality of our systems.

>There are many people who say that a “zero bugs” mindset is excessive because, for rare bugs, the cost of fixing them exceeds the cost of living with them. But I find these people are rarely considering the indirect costs of rare bugs – on team velocity, discipline, and culture.

Some industries are so risky that the "zero defects" approach goes back way before there was software involved. That attitude can be practiced on things besides code, and can definitely be applied to advantage when coding.

In things like experimental chemistry with a growing layer of electronics, computers, and software on top of it, and where one of the main ideas can be to strive for more "9's", this is another wide opportunity for discrepancy.

Bugs propagate even worse in nature.

ajkjk · 2y ago

So why the 8 minute offset? I think they never said?

cperciva · 2y ago

EC2 bare metal instances take a long time to boot. The machine was probably running for 8 minutes before DHCP started up (and then it got a new response every 30 minutes after that).
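The offset-plus-period pattern described above is easy to model. The sketch below assumes event times are whole minutes since power-on and that the first event is the first DHCP response; the helper names are hypothetical, and the 8 and 30 come from the comments here, not from any measured log.

```python
def event_minutes(offset, period, count):
    """Minutes since power-on at which the periodic event fires."""
    return [offset + k * period for k in range(count)]


def fit_offset_and_period(minutes):
    """Recover (offset, period) from sorted, evenly spaced event times.

    Returns None if there are fewer than two events or the gaps differ.
    """
    gaps = {later - earlier for earlier, later in zip(minutes, minutes[1:])}
    if len(gaps) != 1:
        return None
    period = gaps.pop()
    return minutes[0] % period, period


observed = event_minutes(offset=8, period=30, count=4)
print(observed)
print(fit_offset_and_period(observed))
```

Fitting the offset separately from the period is what makes a slow boot show up as a constant shift instead of looking like noise.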

nusl · 2y ago

Kudos. We have a similar unknown bug at work so we’ll see how it goes as we scale. Folks aren’t currently giving the fix too high of a priority but I suspect it will become a real problem soon enough.

cperciva · 2y ago

If your similar unknown bug is on FreeBSD/EC2, I want to hear about it!

maherbeg · 2y ago

I'm curious what the fix was, presumably just retry?

cperciva · 2y ago

The fix was to teach the ENA driver that "set the MTU to the value it already has" should be a no-op. With that change, the interface didn't bounce.
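The shape of that fix, an early return when the requested value matches the current one, can be shown with a toy model. This is not the ENA driver (which is kernel C code); `ToyInterface`, its methods, and the MTU value 1500 are all invented here purely to illustrate the idea.

```python
class ToyInterface:
    """Toy stand-in for a network interface; every name here is hypothetical."""

    def __init__(self, mtu):
        self.mtu = mtu
        self.bounces = 0  # how many times the link went down and came back up

    def _reinitialize(self):
        # Models the interface going down and back up to apply a new setting.
        self.bounces += 1

    def set_mtu_naive(self, mtu):
        # Behaviour before the fix: always reinitialize, even if nothing changed.
        self.mtu = mtu
        self._reinitialize()

    def set_mtu(self, mtu):
        # Behaviour after the fix: setting the current value is a no-op.
        if mtu == self.mtu:
            return
        self.mtu = mtu
        self._reinitialize()


before = ToyInterface(mtu=1500)
before.set_mtu_naive(1500)  # same value, but the link still bounces

after = ToyInterface(mtu=1500)
after.set_mtu(1500)         # same value, nothing happens

print(before.bounces, after.bounces)
```

A periodic caller that re-applies an unchanged setting then becomes harmless, while a genuine change still triggers the reinitialization.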

intuitionist · 2y ago

(Disclosure: I’m an Antithesis employee.)

It’s briefly mentioned in a footnote here, but we have a lot of debugging war stories around the hypervisor protocol, many of which could themselves be blog posts. My personal favorite: we expected a certain hyperproperty related to determinism to hold during a refactor of the component on the other end of the hypervisor, but it was only holding some of the time, depending on the values of some parameters that were getting randomized during our testing. We dug in and figured out that, because we were round-robining across proposers of protocol messages into several pipelines, determinism held iff the number of proposers divided the number of pipelines or vice versa, and totally failed if they were coprime! If they had a smaller common factor greater than 1, there would be “partial determinism.” We very rarely ditch a suggested test property instead of trying to make it work, but that time we were defeated by number theory.
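The comment does not spell out the mechanism, so the following is only a toy model of why divisibility might matter. It assumes message i comes from proposer i mod P and lands in pipeline i mod K; that assignment rule and the names `feeders` and `pairing` are assumptions made here, not details of the Antithesis system.

```python
from math import gcd


def feeders(proposers, pipelines):
    """Map each pipeline to the set of proposers whose messages reach it.

    Assumed rule: message i is proposed by i % proposers and routed to
    i % pipelines. One pass over proposers * pipelines messages covers
    a full cycle of that pattern.
    """
    by_pipeline = {q: set() for q in range(pipelines)}
    for i in range(proposers * pipelines):
        by_pipeline[i % pipelines].add(i % proposers)
    return by_pipeline


def pairing(proposers, pipelines):
    """Classify the proposer/pipeline pattern by the common factor."""
    g = gcd(proposers, pipelines)
    if g == min(proposers, pipelines):
        return "fixed"    # one count divides the other
    if g == 1:
        return "mixed"    # coprime: every proposer reaches every pipeline
    return "partial"      # shared factor: only some pairs ever meet


for p, k in [(2, 4), (3, 4), (4, 6)]:
    print(p, k, pairing(p, k), feeders(p, k))
```

In this model, when one count divides the other, either each pipeline is fed by a single proposer or each proposer writes to a single pipeline. When the counts are coprime every proposer eventually reaches every pipeline, and a shared factor greater than 1 gives the in-between case.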
