AWS: Network Connectivity issues affecting EC2 in US-EAST-1

(status.aws.amazon.com)

55 points · alanbyrne · 13y ago · 31 comments

26 comments · 10 top-level

ksdsh · 13y ago · 4 in thread

My site is affected. I don't know how to handle this situation because the issue affects the whole region.

dsl · 13y ago

You should have duplicate infrastructure in another region which you fail over to automatically or manually.

william_uk · 13y ago

We are going to set this up for our app, however there's a limit to how useful it is when you are reliant on services such as MongoHQ/PubNub (in our case) that are also affected!

laszlocph · 13y ago

Duplicating across regions is not well supported by AWS, unfortunately. Any idea how to do that without completely rebuilding the infrastructure in another region?

michaelt · 13y ago

How would you recommend performing that failover, when Elastic Load Balancers only cover a single region? If you host your own load balancer it'll have to be hosted somewhere, and DNS failover seems unpopular [1].

[1] http://serverfault.com/questions/60553/why-is-dns-failover-n...
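For anyone wondering what the decision logic behind a manual or automatic failover might look like, here is a minimal sketch. It is not an AWS API and not anyone's production setup: `FailoverController`, the `failure_threshold` value, and the `us-west-2` standby are all hypothetical, and actually shifting traffic (DNS, a self-hosted load balancer, or something else) is left out entirely. It only shows the idea of switching after several consecutive failed health checks rather than on the first dropped packet.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FailoverController:
    """Tracks health-check results per region and picks the one to serve from."""

    regions: List[str]            # priority order, primary first
    failure_threshold: int = 3    # assumed value: consecutive failures before a region is skipped
    failures: Dict[str, int] = field(default_factory=dict)

    def record(self, region: str, ok: bool) -> None:
        # A success resets the counter; a failure increments it.
        self.failures[region] = 0 if ok else self.failures.get(region, 0) + 1

    def active(self) -> Optional[str]:
        # First region in priority order that is still under the threshold.
        for region in self.regions:
            if self.failures.get(region, 0) < self.failure_threshold:
                return region
        return None  # every region looks down


# us-west-2 is a hypothetical standby region chosen for illustration.
controller = FailoverController(["us-east-1", "us-west-2"])
for _ in range(3):
    controller.record("us-east-1", ok=False)
print(controller.active())  # us-west-2
```

Because a successful check resets the counter, the primary becomes active again as soon as it passes one check. A real setup would probably want a separate, slower rule for failing back, to avoid flapping during partial packet loss like the kind described in this thread.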

1SaltwaterC · 13y ago · 3 in thread

Had a bunch of timeout alerts. Some machines of an application server array have the packet drop issue. ELB says that everything is peachy. Folks, we're experiencing yet another "EC2 flavored SNAFU™".

Initial thought was: f#*k, I ran out of network I/O, since EC2 simply states "low, mid, high" as performance specs, therefore some proper planning is out. Turned out that with load avg under 0.02 on all machines, the I/O wait wasn't to blame. Average response time, per Pingdom, went up from 170ms to 450ms. New Relic isn't happy either. I guess we should all thank Amazon. Again.

mrcalzone · 13y ago

I see the same thing. Pingdom reports higher response-time, but no downtime (meaning no alert). Also no alerts from AWS Cloudwatch. I first became aware of the issue when internal api-tests started failing at 9:56am CET. I see users accessing the site, but I don't know how many it's failing for.
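An up/down check stays green in exactly this situation, so an alert on degradation has to look at latency and timeout rate instead. Below is a rough sketch of such a rule. The function name `is_degraded`, the 2x slowdown factor, and the 5% timeout ratio are assumptions picked for illustration, not settings from Pingdom, New Relic, or CloudWatch. The sample values reuse the 170 ms and 450 ms figures quoted earlier in this thread.

```python
from statistics import median
from typing import Optional, Sequence


def is_degraded(
    samples_ms: Sequence[Optional[float]],  # None marks a request that timed out
    baseline_ms: float,
    slowdown_factor: float = 2.0,           # assumed threshold, tune per service
    max_timeout_ratio: float = 0.05,        # assumed threshold, tune per service
) -> bool:
    """Return True when responses are much slower than baseline or time out too often."""
    if not samples_ms:
        return False
    timeouts = sum(1 for s in samples_ms if s is None)
    if timeouts / len(samples_ms) > max_timeout_ratio:
        return True
    completed = [s for s in samples_ms if s is not None]
    # Median rather than mean, so a single slow outlier does not trigger the alert.
    return bool(completed) and median(completed) > slowdown_factor * baseline_ms


print(is_degraded([440, 450, 460], baseline_ms=170))  # True: 450 > 2 * 170
print(is_degraded([160, 170, 180], baseline_ms=170))  # False
```

The site never has to be "down" for this to fire, which is the gap the plain uptime check leaves open.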

1SaltwaterC · 13y ago

No issues in the internal EC2 network. At least, none that I could find. I guess that's the reason why ELB doesn't shift any traffic. The whole issue seems to be on the Internet-facing network. Failing routers, maybe.

Pingdom still claims 100% uptime, but New Relic (which includes an equivalent pinging service) reports downtime from time to time. Around 25 timeout alerts in the last couple of hours.

1SaltwaterC · 13y ago

Down vote? Really? I guess I should be thankful that I didn't have to explain this to a client: http://i.imgur.com/URf0H.png

sudhirj · 13y ago · 3 in thread

Odd... my site is on Heroku and there seems to be no trouble.

bad_user · 13y ago

Our app is also on Heroku and it did experience problems.

damniatx · 13y ago

git push seems usable. :(

manaslutech · 13y ago

I can't git push to Heroku...

api · 13y ago · 3 in thread

US-EAST-1 seems to have more issues than their other data centers... anyone know if this is really true?

rkalla · 13y ago

It does, but not for nefarious reasons -- it was (until fairly recently) the cheapest region in the world for AWS. It was only right before Oregon rolled out that Ireland, US-EAST and US-WEST-2 all became the same price point; for the 4 years prior to that, it was always the cheapest, so that is where most customers rolled out most of their infra.

Now that the prices have normalized I think the load is distributing more evenly, but for historical reasons I think that region sees a lot more churn (starting/stopping/deploying/etc.) -- just more grinding on the hardware at that region than at others.

flyt · 13y ago

rkalla's answer is good, but there's also the issue of the US east coast being a good place to deploy applications that need decent performance to the majority of the English-speaking world. Dropping your servers there puts them within reach of Europe without as much RTT as the US west coast.

Obviously the right answer is to deploy applications into multiple AWS regions, but that's not appropriate for every service's architecture unless built that way from the start or modified specifically to do so.

spartango · 13y ago

In addition to the other two excellent answers, there's another reason for the apparently higher issue count:

US-EAST-1 is enormous. It's now made up of 10 data centers with tens of thousands of machines (and associated infrastructure). There are thus more moving parts, and more points of failure. While Amazon builds a bunch of redundancy into its systems, "smaller" issues will tend to impact a larger number of users in US-EAST-1.

garindra · 13y ago · 2 in thread

This got me wondering for quite a long time: I'm pretty sure almost all EC2 issues happen on the US east region -- why is that? Is it because it's the most used region?

sudhirj · 13y ago

Think so... it's the cheapest, and is the default choice. Sounds like it would have an order of magnitude more usage (and therefore problems) than the other regions.

rplnt · 13y ago

The US-East-1 is also spread across more than 10 datacenters, which doesn't make things easy. Might not be true for other, more expensive, regions.

Xymak1y · 13y ago · 1 in thread

Health Dashboard updated the status to "Resolved":

5:17 AM PDT Between 12:51 AM and 4:52 AM PDT we experienced elevated packet loss affecting instances in the US-EAST-1 region. Some of our APIs also experienced increased error rates and latencies. The issue has been resolved and the service is currently operating normally.

jayzalowitz · 13y ago

Bullshit, I am still down.

plasma · 13y ago

Again with the green tick with an 'i' icon as a status, rather than a yellow/red icon, jeeze.

bgentry · 13y ago

This started a little before 01:00 PDT (08:00 UTC), so we're approaching the 3 hour mark now. FWIW, that's about 1h15m before there was any sort of indication of a problem on status.aws.amazon.com
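The times quoted in this thread are easy to sanity-check. The sketch below assumes PDT is a fixed UTC-7 offset and uses an arbitrary placeholder date, since the thread does not state the calendar day. It computes the length of the window given in the "Resolved" status message (12:51 AM to 4:52 AM PDT) and confirms the 01:00 PDT to 08:00 UTC conversion.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PDT modelled as a fixed UTC-7 offset.
PDT = timezone(timedelta(hours=-7), "PDT")
DAY = (2000, 1, 1)  # arbitrary placeholder date, not the real incident date

# Window reported in the status message.
start = datetime(*DAY, 0, 51, tzinfo=PDT)
end = datetime(*DAY, 4, 52, tzinfo=PDT)
window = end - start
print(window)  # 4:01:00

# Start time as reported in the comment above, converted to UTC.
noticed = datetime(*DAY, 1, 0, tzinfo=PDT)
print(noticed.astimezone(timezone.utc).strftime("%H:%M"))  # 08:00
```

So the officially reported window is about four hours, which fits with a comment written near the three hour mark while the incident was still ongoing.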

Jare · 13y ago

Some of my us-east-1 servers are affected, others are happy so far, so it's not the entire datacenter. However, connectivity on affected servers has gone from just flaky to completely gone.

manaslutech · 13y ago

Looks like this is the reason I can't push to heroku either.
