I like that you're in here commenting. Reading through the postmortem reminded me of scaling issues we had at a previous job. We had hundreds of thousands of clients that would get "hyperactive" whenever they had trouble connecting, retry loops FTW.
The system would go down, and it was hard to resurrect because the traffic just kept pounding away. No autoscaling, no cloud. It woulda been handy to just fire up some more servers, let alone have things autoscale on CPU %.