
0 points · mnm1 · 7y ago · 0 comments

Workers and queues fail. SQS was down for us for almost two weeks while AWS fixed a bug. We had no choice but to wait or rewrite our implementation... again! We'd already had to rewrite it once due to poor visibility and rare, intermittent problems processing data. Debugging such distributed systems is legendary hell. And that's just for simple async processing, so that we can return a response quickly to the user and finish the task in a few seconds. There is simply no comparison between such a complex, failure-prone distributed system and the simplicity, reliability, and ease of use of having support for this built into the language, IMO.


2 comments · 1 top-level

StreamBright · 7y ago · 1 in thread

I am sorry, but I disagree. You are trying to make it sound as if your cloud provider's downtime has something to do with how you manage your workload in your code.

Debugging __any__ distributed system is difficult; this is why monitoring and tracing should be first-class citizens in your deployments. It seems they are not for you.

mnm1 (OP) · 7y ago

Yeah, monitoring told us it was down, and eventually we figured out it was an AWS issue we could do nothing about until they patched it. My main point there is actually that for many use cases, this doesn't have to be a distributed computing problem, and thus the non-distributed version is superior to the distributed version.
