I am sorry but I disagree. You are trying to make it sound that your cloud provider downtime has something to do how you manage your workload in your code.
Debugging __any__ distributed system is difficult, this is why monitoring and tracing should be first class citizens in your deployments. It seems they are not for you.