undefined | Better HN

0 pointsanonymousDan11y ago0 comments

Why is it a good thing that it's not an external simulation?

0 comments

2 comments · 2 top-level

itp11y ago

I'll give you some brief thoughts from my experience working at FoundationDB, but if you really want the in-depth answer to what makes our simulation framework an enabling technology for everything we do, you should take a look at the talk my colleague Will Wilson gave at Strange Loop [0].

Everything in our core is written in Flow, our asynchronous programming environment. The architecture of our server is essentially single threaded, with (pseudo-)concurrent actors allowing a single process to satisfy multiple roles at once.

The interfaces that these actors use to communicate -- with the disk, over the network, with the OS in general -- are abstracted and implemented in Flow as well. Each of these interfaces not only has at least one "real" implementation, but at least one, and sometimes more, simulated implementations.

Since we run multiple actors in a single process all the time, running yet more actors, pretending to be different machines, all still in the same process, was an obvious step. These workers communicate with each other via the _same_ network interfaces that real-world workers do in real clusters.

In the end we are able to become our own adversaries, pushing the limits of the system in ways that just don't happen often enough in the real world to test and debug in the wild. Our simulated implementations are allowed to present any behavior that can be observed in the real world, or justified by the spec, or implied by the contract. And they can do so much more frequently.

But most importantly -- as in, without this our system would never have been developed -- we can simulate these pathological behaviors in a completely deterministic fashion. Having the power to run a million tests, simulating multi-machine clusters trying to complete workloads while suffering from the most unbelievable combination of network partitions, dropped packets, failed disks, etc. etc., all while knowing that any error can be replayed, stepped through in a debugger, in a single process on a single machine... without this capability, we would never have had the confidence to build our product or evolve it as aggressively as we have.

tl;dr: We're not making simulations _of_ the system, we're running the system _in_ simulation, and that makes all the difference.

[0]: https://www.youtube.com/watch?v=4fFDFbi3toc

TomVDB11y ago

It makes everything completely determistic.

j / k navigate · click thread line to collapse