[1] http://serverfault.com/questions/60553/why-is-dns-failover-n...
Initial thought was: f#*k, I ran out of network I/O. EC2 only states "low, mid, high" as performance specs, so proper capacity planning is out. It turned out that, with load average under 0.02 on all machines, I/O wait wasn't to blame. Average response time, per Pingdom, went up from 170ms to 450ms. New Relic isn't happy either. I guess we should all thank Amazon. Again.
Pingdom still claims 100% uptime, but New Relic (which includes an equivalent pinging service) reports downtime from time to time: around 25 timeout alerts in the last couple of hours.
Now that the prices have normalized I think the load is distributing more evenly, but for historical reasons I think that region sees a lot more churn (starting/stopping/deploying/etc.) -- just more grinding on the hardware in that region than in others.
Obviously the right answer is to deploy applications into multiple AWS regions, but that's not appropriate for every service's architecture unless built that way from the start or modified specifically to do so.
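For what it's worth, the routing half of a multi-region setup can be sketched in a few lines of Python. This is only an illustration under assumptions: `first_healthy_region` and the `probe` callback are hypothetical names, not part of any AWS SDK, and a real health check (HTTP request, timeout, retries) would have to be supplied by the caller.

```python
from typing import Callable, Iterable, Optional


def first_healthy_region(
    regions: Iterable[str],
    probe: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region whose probe succeeds, or None if all fail.

    `probe` is a caller-supplied (hypothetical) health check that returns
    True when the region's endpoint answers in time.
    """
    for region in regions:
        try:
            if probe(region):
                return region
        except Exception:
            # Treat any raised error (timeout, refused connection) as unhealthy.
            continue
    return None


if __name__ == "__main__":
    # Simulate us-east-1 being degraded; a real probe would do network I/O.
    degraded = {"us-east-1"}
    print(first_healthy_region(["us-east-1", "us-west-2"],
                               lambda r: r not in degraded))
```

The regions are tried in preference order, so the primary is used whenever it passes its check. None of this helps with the hard part mentioned above: data and state still have to exist in the second region before traffic can be sent there.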
US-EAST-1 is enormous. It's now made up of 10 data centers with tens of thousands of machines (and associated infrastructure). There are thus more moving parts, and more points of failure. While Amazon builds a bunch of redundancy into its systems, "smaller" issues will tend to impact a larger number of users in US-EAST-1.
5:17 AM PDT Between 12:51 AM and 4:52 AM PDT we experienced elevated packet loss affecting instances in the US-EAST-1 region. Some of our APIs also experienced increased error rates and latencies. The issue has been resolved and the service is currently operating normally.