Any guess on what's causing it?
In hindsight, I guess the foresight of some organizations to go multi-cloud was correct after all.
It's not easy though.
I'm curious—at what point did you decide the overhead was worth it? Was it after experiencing an outage, or did you architect for it from day one?
As someone launching a product soon (more on the builder/product side than infra-engineer), I keep wrestling with this. The pragmatist in me says "start simple, prove the concept, then layer in resilience." But then you see events like this week and think "what if this happens during launch?"
How did you handle the operational complexity? Did you need dedicated DevOps folks, or are there patterns/tools that made it manageable for a smaller team?
I would recommend focusing on multi-region within a single CSP instead (both for workloads AND your tooling), which covers the vast majority of incidents and lays some of the architectural foundation for multi-cloud down the road. Develop failover plans for each service in your architecture (e.g., planned/tested runbooks for migrating to Traffic Manager if Azure Front Door goes down).
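To make the "planned/tested runbook" idea a bit more concrete, here is a rough Python sketch of the decision logic such a runbook might encode: probe the primary entry point, and only switch to the secondary after several consecutive failures and a successful check of the secondary. Everything here is hypothetical (the `FailoverController` name, the threshold of 3, the injected `probe` function); the actual traffic switch, such as a DNS or Traffic Manager change, is deliberately left out and would depend on your setup.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverController:
    """Tracks probe results and decides which entry point should receive traffic.

    Hypothetical helper for illustration; it does not call any cloud API.
    """
    primary: str
    secondary: str
    probe: Callable[[str], bool]      # assumed: returns True when the endpoint is healthy
    failure_threshold: int = 3        # assumed: consecutive failures before failing over
    active: str = field(default="", init=False)
    consecutive_failures: int = field(default=0, init=False)
    log: List[str] = field(default_factory=list, init=False)

    def __post_init__(self) -> None:
        # Traffic starts on the primary entry point.
        self.active = self.primary

    def tick(self) -> str:
        """Run one probe cycle and return the endpoint that should be active."""
        if self.active != self.primary:
            # Already failed over; failing back is left as a deliberate manual step.
            return self.active

        if self.probe(self.primary):
            # A healthy probe resets the failure streak.
            self.consecutive_failures = 0
            return self.active

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Only switch if the secondary actually looks healthy.
            if self.probe(self.secondary):
                self.active = self.secondary
                self.log.append(f"failover: {self.primary} -> {self.secondary}")
            else:
                self.log.append("failover skipped: secondary unhealthy")
        return self.active
```

Because the probe is injected, the same logic can be exercised in a game-day drill with a fake probe before it is ever wired to real health checks, which is most of what "tested" means for a runbook like this.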
Also, choose your provider wisely. We experience 3-5x as many service-impacting incidents on Azure as we do on AWS. I'm sure others have different experiences, but I would never personally start a company on Azure. AWS has its own issues, of course, but reliability has not been a major one (relatively speaking) over the past 10 years. Last week's incident with DynamoDB in us-east-1 had zero impact on our AWS workloads in other regions.