Fallacies of distributed systems

(architecturenotes.co)

224 points · googletron · 4y ago · 60 comments

50 comments · 20 top-level

siddontang · 4y ago · 5 in thread

Thanks so much for sharing this article. It reminded me of a fallacy we ran into when we first built TiDB Cloud (a distributed database service): "Transport cost is zero."

We deploy TiDB across three availability zones (AZs) in one AWS region. Each AZ contains a SQL computing server and a columnar storage replica server. Sometimes a computing server sends a request to a replica server in another region and gets the data back; if the volume of data is high, this costs us a lot of money. Unfortunately, we only found out about this when we received the bill from AWS. After that, we started optimizing the size of the transferred data :sob:

googletron OP · 4y ago

Thanks for sharing that experience. What did you end up doing to lower your costs?

siddontang · 4y ago

I have to say, this problem has not been solved well. But we have been doing a few things:

- We previously used gRPC with no compression; we have since tuned the compression level to balance performance against data volume.

- We are improving our SQL optimizer to account for network cost more accurately, and our data placement rules to balance data more sensibly, so that less data crosses AZs.

- As a cloud service, we will also expose the cost transparently to our customers like MongoDB Atlas does. :-)
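
A rough way to see the compression trade-off described in the first bullet. This is only a sketch: it uses Python's standard `zlib` rather than gRPC itself, and `sample_payload` is made-up data, so real ratios and timings will differ with the actual row format.

```python
import time
import zlib


def compression_report(payload: bytes, levels=(1, 6, 9)):
    """Return {level: (compressed_size, seconds)} for each zlib level."""
    report = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(payload, level)
        elapsed = time.perf_counter() - start
        report[level] = (len(compressed), elapsed)
    return report


# Hypothetical payload: repetitive rows, loosely resembling a query result.
sample_payload = b"".join(
    b"user_%06d,active,2022-01-01\n" % i for i in range(50_000)
)

if __name__ == "__main__":
    for level, (size, seconds) in compression_report(sample_payload).items():
        ratio = size / len(sample_payload)
        print(f"level={level} size={size} ratio={ratio:.2f} "
              f"time={seconds * 1000:.1f} ms")
```

Higher levels usually shrink the payload further at the cost of more CPU time, which is the balance being tuned.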

ddorian43 · 4y ago

The only way to fix this is you have to build a new cloud that has lower networking costs and force the existing competitors to lower their prices.

hinkley · 4y ago

Cross region or cross AZ?

siddontang · 4y ago

Cross-AZ.

We didn't pay attention to the cost of crossing AZs before; we just assumed it would be cheap. Of course, we were wrong :sob:
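
A back-of-the-envelope sketch of how per-GB transfer pricing adds up. The price used here is a placeholder assumption, not a quoted AWS rate, and the daily volumes are invented for illustration.

```python
def monthly_transfer_cost(gb_per_day: float, price_per_gb: float,
                          days: int = 30) -> float:
    """Estimate the monthly bill for data crossing an AZ boundary."""
    return gb_per_day * days * price_per_gb


# ASSUMPTION: placeholder price in USD per GB; check the provider's
# current pricing page for the real number.
ASSUMED_PRICE_PER_GB = 0.02

if __name__ == "__main__":
    for gb_per_day in (100, 1_000, 10_000):
        cost = monthly_transfer_cost(gb_per_day, ASSUMED_PRICE_PER_GB)
        print(f"{gb_per_day:>6} GB/day -> {cost:,.2f} USD/month")
```

The cost is linear in volume, so a query pattern that looks harmless in testing can dominate the bill once traffic grows.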

vinay_ys · 4y ago · 4 in thread

Network is much much more reliable than it used to be. In a typical large-scale datacenter today, it is becoming common to deploy Clos networks with multiple ECMP paths between any two servers. Also, the cost of the network relative to the cost of the servers is quite low (less than 8-10% of overall cost). Fat connectivity between datacenters in a metro area is also much, much cheaper than ever before (due to huge increases in fibre capacity in the last 10 years). If anything, networking has become cheaper faster than compute has in the last 10 years (and storage has done the same at an even higher rate). So this affects architectural decisions in very interesting ways. Of course, public clouds make a huge profit on network bandwidth by billing customers based on old mental models.

ignoramous · 4y ago

> Network is much much more reliable than it used to be.

But would anyone think of the bit flips?

> We've now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect.

> We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.

https://web.archive.org/web/20150726045623/http://status.aws...

See also: At scale, rare events aren't rare, https://news.ycombinator.com/item?id=14038044 (2017).
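
A minimal sketch of the kind of protection the quoted postmortem says was missing for internal state messages: attach a checksum when sending and verify it on receipt. The framing (16-byte digest prefix) and the function names are assumptions for illustration, not how S3 does it. MD5 is used only because the quote mentions it; it catches accidental corruption, not deliberate tampering.

```python
import hashlib


def seal(message: bytes) -> bytes:
    """Prefix the message with its 16-byte MD5 digest."""
    return hashlib.md5(message).digest() + message


def unseal(frame: bytes) -> bytes:
    """Return the message, or raise ValueError if the digest does not match."""
    digest, message = frame[:16], frame[16:]
    if hashlib.md5(message).digest() != digest:
        raise ValueError("checksum mismatch: message corrupted in transit")
    return message


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit corruption at the given bit position."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)
```

Without the check, a message with one flipped bit can still parse correctly and carry wrong state, which is the failure the postmortem describes.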

jiggawatts · 4y ago

It blows my mind a tiny bit that ordinary cloud VMs are running on hosts with 100 Gbps NICs.

Even “medium” VMs can get gigabytes per second of throughput.

wizzard0 · 4y ago

> Network is much much more reliable than it used to be

Within DCs, sure. At the same time, a lot more company networks have become spread thin over 4G/5G networks and finicky cable ISPs in recent years with WFH.

This caused quite a strain on internal LOB apps which could reasonably assume zero packet loss and sub-1ms LAN latency beforehand :)

3np · 4y ago

At the same time, IoT, mobile and roaming make for unreliability and constant topology reshuffling. Granted, it's a different environment, and for various reasons properly distributed solutions on these networks aren't widespread so far. Just saying there's another side to it.

pjmlp · 4y ago · 4 in thread

And this is why, before "fixing" the monolith by placing a network in the middle, one should start by learning about modules, packages and libraries in the language of choice.

dagss · 4y ago

YES! It is a shame one feels one has to fix organizational and coordination issues by introducing network barriers incidental to the real problems. I cannot wait for microservices to become unfashionable for cases where they are not solving technical scalability issues.

Software engineering is not a mature field, but microservices feel like a dead-end alley everyone is marching down.

ignoramous · 4y ago

Relevant discussion on Modules, Monoliths, and Microservices, https://news.ycombinator.com/item?id=26247052 (2021).

delusional · 4y ago

Any time you interact with a third party API you're creating a distributed system. The principles of distributed systems are useful for many more things than "microservices".

pjmlp · 4y ago

An API isn't a network call, and a web API is a distributed service by definition, since it makes a network request.

rsaxvc · 4y ago · 4 in thread

> Pictured on the right is the time to access memory in a modern system, on the left the time it takes to do a round trip across the world.

These seem swapped.

maffydub · 4y ago

Even if they're swapped, the maths doesn't work out - the ratio of squares is 1:42 while the ratio of latencies is 1:1500.

adwn · 4y ago

You're off by a factor of 1000: 150 ms / 100 ns = 150e-3 / 100e-9 = 1.5 million. So it's even worse!
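
The arithmetic above can be checked in a couple of lines, using the two figures quoted in the thread (100 ns memory access, 150 ms round trip); the function name is just for illustration.

```python
MEMORY_ACCESS_S = 100e-9  # 100 ns, figure quoted in the thread
ROUND_TRIP_S = 150e-3     # 150 ms, figure quoted in the thread


def latency_ratio(slow_seconds: float, fast_seconds: float) -> float:
    """How many fast operations fit into one slow operation."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    ratio = latency_ratio(ROUND_TRIP_S, MEMORY_ACCESS_S)
    print(f"one round trip ~ {ratio:,.0f} memory accesses")
```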

googletron OP · 4y ago

Definitely true! Good eye. Drawing out 1500 squares is tough.

googletron OP · 4y ago

Thanks for pointing this out. I will fix it.

henrik_w · 4y ago · 3 in thread

These fallacies can also lead to one of my favorite programming quotes:

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” — Leslie Lamport

photochemsyn · 4y ago

You'd think a well-designed distributed system would be able to handle the failure of a component and fall back to a different configuration to handle the problem. That's the fundamental feature of Internet routing systems, i.e. they expect components to occasionally fail. Building a distributed system that can't handle such failures just seems like sloppy engineering.

xwolfi · 4y ago

The joke is that no matter how many redundancies you have, the complexity is so far beyond a human's ability to map all at once that you always miss a single point of failure.

Each failure scenario must be gracefully handled in each subpart, taking into account every possible impact on the whole system.

You'll always get some random crap you never heard of destroying an assumption you made in your own crap.

I remember an instance where a national mobile phone provider in France had a single DB outage, but during a heavy SMS day (think New Year). The failover happened but totally overloaded the new servers, and it set off a chain reaction to other providers, to the point that you couldn't read your email for a day or two because you couldn't receive SMSes. Why is some random DB at a provider I don't even use impacting my Google account? Because the system is distributed.

ketzu · 4y ago

In our distributed systems lecture, my professor started the introductory lesson with some descriptions of 'what is a distributed system.' The one by Lamport was by far the most quoted in exams, as it is the most memorable, because it is funny.

After watching Lamport's introduction to TLA+, I got the impression he's just a really funny guy (while being a genius, too).

Matthias247 · 4y ago · 3 in thread

A few other ones that come up regularly:

- Servers/applications never restart

- Startup order is fully deterministic

- All instances run the same software revision (maybe somewhat covered by 8)

- Hardware is reliable

- All clients are playing nicely along (and I'm not even talking about attacks)

- Logging and metrics are cheap

- There are no software bugs in any layer

dijit · 4y ago

> Startup order is fully deterministic

It still is for the minority of systems.

On single OSs Google runs an init that is deterministic; as do the BSDs and some distributions of Linux.

For distributed systems your ops team controls for this; usually on the service discovery layer. (Don’t publish until ready, don’t start service until you can establish a connection to required endpoints).

Matthias247 · 4y ago

Many people think it is, but the reality is that it depends a lot on the system one is working on, it can be super tricky, and relying on it will often cause pain points later on.

E.g. let's take the example of an init system starting up all processes. Now what happens if one of the processes crashes and gets restarted by a process manager? The order has already changed, and e.g. a process that relied on the restarted one might now work with outdated data. Similar things can also happen on other layers, e.g. one of the services in a dependency chain might disappear and reappear.

Another example is a developer/administrator manually changing the config of a certain service and restarting it for the change to take effect; that could also trigger dependency problems.

Now those are absolutely solvable, either by making sure all services operate gracefully with any startup order or by other mitigations (e.g. "always reboot the full box"). But like everything else in the list, it is still a problem that is observed very often in distributed systems.
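
One way to "operate gracefully with any startup order" is for each service to poll its dependencies with backoff instead of assuming they are already up. A minimal sketch, where `probe` is a hypothetical caller-supplied readiness check (for example, a function that opens a connection and returns True on success):

```python
import time


def wait_until_ready(probe, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call probe() until it returns True, backing off exponentially.

    Returns True once the dependency is ready, False if attempts run out.
    """
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            pass  # treat connection errors as "not ready yet"
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return False
```

The same check has to run again after a dependency restarts, not only at boot, which is the case the crash-and-restart example above is pointing at.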

snug · 4y ago

These are fallacies of distributed systems; most of these are the reasons you want to have a distributed system. You build distributed systems because you know:

- Servers restart, so you replicate and load balance

- Dependencies matter, so you build circuit breakers and health checks

- Infra is heterogeneous, which is why you use containers

- Hardware is unreliable, so again: replicate, load balance, and HA

- Provide APIs for your clients to communicate with your service

- Logging and metrics are expensive, so collect what you need; prom helps with this, though logging is not fully solved

- Do people really think there are no bugs?
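
For readers who haven't met the circuit breaker mentioned above, here is a minimal sketch of the idea: after a few consecutive failures, stop calling the dependency for a while and fail fast instead. The class name, thresholds and `clock` parameter are assumptions for illustration; production libraries add more states and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Wrapping an outbound call as `breaker.call(fetch, url)` keeps a struggling dependency from being hammered by retries while it recovers.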

beckingz · 4y ago · 2 in thread

This is a good summary of many of the most annoying problems of distributed systems.

googletron OP · 4y ago

Thanks. I would say quorum issues with flaky nodes are up there too.

macintux · 4y ago

Failing/misbehaving nodes are the worst. Everything from slow performance to data corruption.

Much better to lose a node immediately than to have it fail gradually.

ramraj07 · 4y ago · 2 in thread

I’d love it if someone created fallacies of big data technologies, the most important being: if you’re using Spark, unless you’re really, really good at Spark, you’re not actually working with big data.

anonymousDan · 4y ago

Can you expand on what you mean?

ramraj07 · 4y ago

My experience has been that Spark doesn’t actually work with real big data without significant babysitting. Its algorithms, be it window functions or joins, can take forever, if they finish at all, on actual data that’s hundreds of terabytes large (even tens of TB). You immediately need to worry about crap like garbage collectors, worker memory, cache and data distribution. The majority of data engineers out there cannot actually deal with these problems, but Spark plus not-actually-big data lets them think they’re good at their jobs when in reality they’re not.

googletron OP · 4y ago · 1 in thread

Short little post about Fallacies of Distributed Systems. These are some thoughts I feel often get downplayed when engineers are designing microservices.

ignoramous · 4y ago

Antidotes exist too. I particularly like to go back to these ones from AWS, from the days when distributed systems weren't as popular:

- Amazon S3 design requirements and principles (2006), https://archive.is/EP6HU#selection-1205.0-1425.66

- Amazon S3: Architecting for resiliency (2009), http://web.archive.org/web/20100719163331/https://qconsf.com...

avinassh · 4y ago · 1 in thread

Related: I love the illustration of the fallacies by Denise Yu. It has cats!

http://deniseyu.io/art/sketchnotes/topic-based/8-fallacies.p...

googletron OP · 4y ago

That’s really awesome!

kirdie · 4y ago · 1 in thread

Hey, there’s a typo in the article. The image of memory vs. round trip says that the round trip is 100 ns and memory access is 150 ms. I guess you meant the other way around :P

googletron OP · 4y ago

Thanks. I believe it has been fixed now.

macintux · 4y ago

“The network is reliable” is also the title of a relatively famous collaboration between Kyle Kingsbury and Peter Bailis which generated a long list of counter-examples.

* https://cacm.acm.org/magazines/2014/9/177925-the-network-is-...

Discussions here:

* https://news.ycombinator.com/item?id=5820245

* https://news.ycombinator.com/item?id=8162720

aliswe · 4y ago

The original article seems to be from 1994.

https://en.m.wikipedia.org/wiki/Fallacies_of_distributed_com...

grymoire1 · 4y ago

This was first mentioned in 1994, BTW.

https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu...

capn_duck · 4y ago

This is like a meme in software development reposts, and I hate it because it just says "things can go wrong with component X of your system". It doesn't say when, it doesn't say what the threshold is. There are no equations, no direction for someone who hasn't dealt with one of these problems before. It's less than useless.

googletron OP · 4y ago

Here is the full illustration https://architecturenotes.co/content/images/size/w2000/2022/...

nraynaud · 4y ago

I really don't think anybody believes the network is reliable. It's mostly that you have no freaking clue how the software layers below you will react to various conditions, and the way an error becomes visible to your app will be completely random.

"oh, today the underlying layer decided to wait 1min30 for a timeout",

"oh, the DNS resolution is broken",

"no route to host???",

"invalid certificate, did someone hijack the DNS query?"

"this library decided to wrap the errors, now I have to learn an entire new vocabulary of failure modes"

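One partial defence against the "new vocabulary of failure modes" problem is to translate low-level exceptions into a small set of categories at the edge of your own code. A sketch using only Python standard-library exception types; the category names are made up for illustration, and a wrapping library may still hide the original exception from you.

```python
import errno
import socket
import ssl


def classify_failure(exc: BaseException) -> str:
    """Map low-level exceptions onto a small, app-level vocabulary."""
    # Check specific subclasses before their OSError parent.
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "bad-certificate"
    if isinstance(exc, socket.gaierror):
        return "dns-failure"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(exc, ConnectionRefusedError):
        return "refused"
    if isinstance(exc, OSError):
        if exc.errno == errno.EHOSTUNREACH:
            return "no-route"
        return "network-error"
    return "unknown"
```

The order of the checks matters because several of these types inherit from `OSError`.
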
samsquire · 4y ago

Don't underestimate how performant a single thread is! Some problems are embarrassingly parallel.

Of course coordinating threads is difficult.

I am designing an LMAX disruptor system that translates a synchronous piece of code into nested LMAX disruptors.

Essentially you can do high IO and high CPU and not block other requests as they're pipelined.

https://github.com/samsquire/ideas4#51-rewrite-synchronous-c...

Don't block the UI thread or the Node.js event loop. We don't have to pick between tokio and rayon. We can have multiple event loops.
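
To illustrate the pipelining idea in the simplest possible form: this is a plain queue-per-stage pipeline with one thread per stage, not an LMAX disruptor and not the design from the linked repository. All names are hypothetical.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down


def stage(func, inbox, outbox):
    """Run func on every item from inbox and forward results to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Push items through one thread per stage; return results in order."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_STOP)
    results = []
    while (out := queues[-1].get()) is not _STOP:
        results.append(out)
    for t in threads:
        t.join()
    return results
```

While one stage waits on IO for item N, the previous stage can already work on item N+1. In standard CPython, CPU-bound stages will not run in parallel across threads, so this mainly helps IO-heavy stages.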

qaq · 4y ago

And a higher-level issue: for some types of workloads, a distributed system might never achieve the same level of performance, and/or will cost orders of magnitude more than a vanilla system, making it impractical.

vira28 · 4y ago

The illustrations look really cool. Wondering what software/tool is being used?
