AWS instance was scheduled for retirement

(forums.aws.amazon.com)

77 points · themonk · 12y ago · 75 comments

If you are hosting on cloud, you must automate everything.

60 comments · 17 top-level

mdellabitta12y ago· 9 in thread

They generally send you an advance email. I just had to migrate our Jenkins server a week or two ago because of this. I received something like 15 days notice on that one.

But obviously if there's a hard failure, they aren't always going to be able to give you the amount of time you'd want. Generally speaking, you should have accounted for this situation ahead of time in your engineering plans. Amazon EC2 doesn't have anything like vmotion, it's just a bunch of KVM virts.

If you're using the GUI, the first time you try a shutdown, it will do a normal request, but then if you go back and try it again while the first request is still pending, you should see the option for doing a hard restart. Try that and give it some time. Sometimes it takes an hour or two to get through. Otherwise, Amazon's tech support can help you.

rschmitty12y ago

> Generally speaking, you should have accounted for this situation ahead of time in your engineering plans.

I believe this comes as a shock to most people the first time they receive this email, it was to us at least. When we signed up with amazon there was no guideline or advice saying "hey in a year or 2 your hardware might fail or need replaced, have a migration plan ready"

Perhaps it was our naivety, but we just thought: hey, it's the cloud, what could go wrong?! Now, of course, we are battle hardened.

m_ram12y ago

FWIW, they have a brief section about hardware failure in the Getting Started Guide [1]. I don't know how long it's been there.

[1] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-... (bottom of the page)

michaelmior12y ago

More accurately: "Your hardware may fail at any time."

cmelbye12y ago

I honestly thought most people knew that about EC2 as one of the core "trade offs" or engineering decision that allows the platform to be what it is, compared to a more traditional VPS provider.

vidarh12y ago

EC2 used to terminate instances with no warning in many situations when it launched. It seems they've concluded most people didn't understand that, and avoid that whenever possible now.

But "cloud" compute services should in general be treated as less reliable per individual unit unless your provider explicitly explains to you why not (such as guaranteeing to use a high-availability distributed filesystem), as you have no direct way of ascertaining the status of the underlying hardware. You need to plan for failure regardless.

pbiggar12y ago

One of the reasons running jenkins on EC2 sucks for developers :( The data is stored on the machine, and there's a big risk of losing all your CI/CD infrastructure.

Have you given any thought to moving to something like https://circleci.com? [disclosure: I work there]

maaku12y ago

That's why you use EBS and snapshots...

mdellabitta12y ago

It wasn't that bad. I stopped and started an instance. Done.

misframer12y ago

EC2 uses Xen.

chris_wot12y ago· 8 in thread

What, you don't get notified?

travem12y ago

It looks like the forum thread has been updated with the following comment from Luke@AWS

> I just looked into your instance a bit further, it does appear this was due to an issue with the underlying host on which your instance resided upon when you encountered the issues today.

> I do want to clarify here as our original reply mentioned a scheduled retirement, this was not the case and no notice was sent out because of this.

It looks like the original email was incorrect.

To add to this, though: if people rely on advance email notifications from AWS, they are putting their availability at risk. Being on the cloud doesn't insulate you from hardware issues; these need to be planned for. AWS does provide some building blocks to address this (auto-scaling and load balancers come to mind).

kalleboo12y ago

I've gotten emails a week in advance and again a day in advance when an instance needed maintenance that would result in a 10 second network reset, so it'd really surprise me if Amazon completely retired an instance with no notification. This person must have missed the email or it got spammed.

themonkOP12y ago

Can you tell me the subject line of the email?

I searched my mail for a couple of words, including the instance ID, and found nothing. There is no such email in my spam folder from the last month either.

sudhirj12y ago

Maybe 'scheduled for retirement' is a euphemism for 'someone tripped over that rack's power cord'.

primitivesuave12y ago

Made my day.

themonkOP12y ago

No, I did not get any notification.

It was an EBS-backed instance, and since I had a snapshot, it did not take much time to recreate a new instance. Besides that, I had another instance behind an ELB to avoid downtime.

It is fair to expect some notification for a scheduled retirement.

latch12y ago

They usually send an email a couple weeks before

the_mitsuhiko12y ago

You do.

regularfry12y ago· 6 in thread

Interesting that they've gone that way rather than attempt any sort of live migration.

devicenull12y ago

Not really. Live migration requires shared storage, which is yet another bottleneck, and yet another point of failure.

regularfry12y ago

This case was about an EBS-backed instance, which is shared storage. Either way, you could avoid centrally shared storage by migrating the backing store first, then the running instance.

In any case, I'm not really trying to criticise here. Just pointing out an engineering trade-off.

travem12y ago

Live migration doesn't have to require this. VMware vSphere for example includes storage vMotion capabilities which remove the need for shared storage.

wmf12y ago

Live migration requires the VM to be alive. It doesn't help when a hardware failure takes out the whole machine, so people would still need to plan for that.

regularfry12y ago

Yes, and that's the direction Amazon want you to be thinking in. They could throw engineering effort at reducing the likelihood of instances going away in this sort of scenario, but they've chosen not to.

oakwhiz12y ago

Live migration is useful, but in my opinion it's kind of a band-aid solution for the problem of availability. Live migration is not necessary if you design your application as a distributed system.

Corrado12y ago· 5 in thread

OK, the key to working with AWS EC2 instances is to remember that they are ephemeral and can disappear at any point in time. If you're treating it like a traditional server that you have in a rack, you're doing it wrong. Just turn it off and start a new one. You are using a configuration manager (Puppet, Chef, etc.), aren't you?

bowlofpetunias12y ago

I've learned a long time ago to treat traditional servers in a rack like they can disappear (or get compromised) at any time for a huge range of reasons. You can never be too paranoid.

manmal12y ago

Yeah, I like how Netflix even goes so far as killing instances randomly: http://venturebeat.com/2013/09/10/netflix-chaos-monkeys/
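A toy sketch of that idea in Python, not Netflix's actual tool: pick one non-protected instance at random and treat it as gone. The function name `pick_victim`, the instance IDs, and the `protected` set are all made up for illustration; a real version would call your provider's terminate API where this one only prints.

```python
import random


def pick_victim(instance_ids, rng, protected=frozenset()):
    """Return one instance ID to terminate, or None if nothing is eligible."""
    # Sort so the result depends only on the RNG state, not on input order.
    candidates = sorted(i for i in instance_ids if i not in protected)
    if not candidates:
        return None
    return rng.choice(candidates)


# Hypothetical fleet; the database is excluded from the experiment.
fleet = ["i-web-1", "i-web-2", "i-db-1"]
victim = pick_victim(fleet, random.Random(), protected={"i-db-1"})
print(f"would terminate: {victim}")
```

If losing the chosen instance would hurt, that is the finding: the exercise is only useful when the fleet is expected to survive it.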

jpitz12y ago

This ^^^ comment cannot be upvoted enough. Please treat all servers, virtual or not, as ephemeral.

toomuchtodo12y ago

Well, sort of. As long as you're storing all your data on EBS volumes, you can treat EC2 instances as machines in a rack. Problem with your instance? Reboot, and you'll be good as new.

Now, if you lose an EBS volume, that's totally different. You are snapshotting your EBS volumes, correct?

jes519912y ago

as someone who used to be a Puppet maintainer, I say: use Chef. or Ansible.

kartikkumar12y ago· 5 in thread

I think I'm missing something. Why isn't Amazon sorting this out behind the scenes so that any failing hardware is seamlessly replaced and the user is none the wiser? Am I expecting too much?

ghshephard12y ago

EC2 instances don't come with vmotion. It's up to the customer to detect a failed/retired node and restart on another EC2 instance.

The first thing you discover when reading through the various options is that you need to treat ALL local storage like /tmp, subject to deletion at will. Keep your persistent storage on EBS/S3.

arturhoo12y ago

And even if you do keep your important stuff on EBS, make sure you take snapshots on a frequent basis. We have received this email a couple of times:

    Your volume experienced a failure due to multiple failures of the 
    underlying hardware components and we were unable to recover it.

    Although EBS volumes are designed for reliability, backed by multiple 
    physical drives, we are still exposed to durability risks caused by 
    concurrent hardware failures of multiple components, before our systems 
    are able to restore the redundancy. We publish our durability expectations 
    on the EBS detail page here (http://aws.amazon.com/ebs).


    Sincerely,
    EBS Support

Fortunately, we had recent snapshots and it was a matter of (manually) spinning up a new instance from those.

Edit: proper quotation
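A minimal sketch of the "recent snapshots" check, assuming you already have a mapping from volume ID to the time of its newest snapshot (how you obtain that mapping is left out). The name `stale_volumes`, the volume IDs, and the one-day threshold are hypothetical; pick a threshold that matches how much data you can afford to lose.

```python
from datetime import datetime, timedelta, timezone


def stale_volumes(last_snapshot, now, max_age):
    """Return volume IDs whose newest snapshot is missing or older than max_age."""
    stale = []
    for volume_id in sorted(last_snapshot):
        taken_at = last_snapshot[volume_id]
        # None means the volume has never been snapshotted.
        if taken_at is None or now - taken_at > max_age:
            stale.append(volume_id)
    return stale


# Hypothetical inventory: newest snapshot time per volume.
now = datetime(2013, 10, 1, 12, 0, tzinfo=timezone.utc)
snapshots = {
    "vol-aaa": now - timedelta(hours=2),
    "vol-bbb": now - timedelta(days=3),
    "vol-ccc": None,
}
print(stale_volumes(snapshots, now, timedelta(days=1)))
```

Run on a schedule, a check like this turns "we had recent snapshots" from luck into something you can alert on.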

RSDnuyIsk9jMSWb12y ago

Windows Azure actually does this. If the host your virtual machine is on for some reason fails or needs to be replaced your entire VM is migrated to another host. The migration process can take a few minutes but all your data is safe.

My point being. On this topic AWS could learn from Microsoft on how to do cloud.

SudoAlex12y ago

The one problem they have is that the majority of their instances include local storage, which would make migration impossible. So the best they can offer is a reboot so the server ends up on another host.

They could potentially do this on their second generation (M3) instances, as well as micro instances if they wanted to. However I'd guess that these instances are just a small percentage of the overall servers used.

cheeseprocedure12y ago

While Xen should make live migrations technically possible, it would probably reduce EC2's provisioning flexibility and introduce undesirable complexity.

Migrations would be restricted to hosts running specific releases of the hypervisor [1], and AWS's SDN systems would need to handle these changes in very-near-realtime.

[1] wiki.xen.org/wiki/Xen_Version_Compatibility

noonespecial12y ago· 4 in thread

Remember kids, an EC2 is not a server. It's a process on someone else's server and all of your data is stored in /tmp. Do plan accordingly.

guiambros12y ago

Even if it were a server, you'd have to protect yourself against the exact same risks: hardware may fail, the datacenter may burn, your data may be destroyed by cosmic rays.

Cloud platforms let you avoid physically dealing with the hardware, and conveniently use ec2-create-snapshot instead of shuttling tapes back and forth, but the paradigm is exactly the same.

If you care about your data and your servers, you have to plan for failure. Cloud or not.

acmecorps12y ago

It's a process on someone else's server? Do you have a link where I can know more on this?

misframer12y ago

I'm not sure about Xen (what AWS uses), but if you look at KVMs as an analogy, instances are literally processes on the host.

ChuckMcM12y ago

Given the relatively low cost of excess PC hardware these days it is extremely helpful to install one with a Xen, HyperV, or some other hypervisor type system and run multiple instances on it. By doing so you will get a much better feel for what is going on when you "start", "stop", "buy" etc an EC2 instance or 'Droplet' or VPS etc.

rpm432112y ago· 4 in thread

This is somewhat unrelated, but what's the general consensus on the security of EC2 for very sensitive computation?

For example, I have a client who has some algorithms and data that are potentially quite valuable. EC2 and other AWS services would be a huge help with their project, but is there any way measures could be taken to ensure that no one - even Amazon employees - can get to their code and data?

Edit: devicenull makes some good points - I guess I had the CIA's $600 million AWS contract in my head when asking my question.

jeffbarr12y ago

There's no need to wonder about these things. Check out the AWS Security Center at http://aws.amazon.com/security/ to get the facts. At that address you will find a very detailed (39 page) Security White Paper.

devicenull12y ago

No. You don't control the execution environment, so if it's really that valuable it can't be trusted.

After all, you cannot stop someone from taking a full snapshot of the VM and grabbing all the information. Encryption is no help here, as the VM ultimately needs to store the key in memory.

If it's really that valuable (lots of companies seem to overestimate how much people would want to steal their data), then it really should never leave hardware under their control.

dfc12y ago

I have never heard anyone complain about a company taking infosec too seriously, let alone lots of companies.

Dude, my bank/email-host/health-insurer is teh suk. They overestimate the value of data confidentiality. I hope this does not become a new trend. I expect the companies that I deal with to play fast and loose with the data they control. Encrypting Data at rest? C'mon bro, if the data is so important why is it just sitting there with nobody using it.

goblin8912y ago

I'd be OK with uploading sensitive data onto S3, as long as it's properly encrypted, but with EC2 I guess you can never tell.

sudhirj12y ago· 2 in thread

I'm working with another team of people who haven't yet tried working with cloud servers, and one of the things they're struggling with the most is that cloud servers need to be thought of as disposable. They can't easily digest the idea that servers can and will go down randomly for no known reason.

I think Amazon needs to put a lot more effort into educating people about the best practices involved here - creating immutable and disposable servers, make it easier (console access) to create availability groups, etc.

dlgeek12y ago

> They can't easily digest the idea that servers can and will go down randomly for no known reason.

Then you should educate them. This isn't something unique to the cloud, physical servers absolutely can do this too. I work with thousands of (physical) servers in the day job, we have all kinds of failures that take out individual hosts on a regular basis.

vidarh12y ago

The problem for a lot of people is that on a small scale physical hosts can appear to be extremely stable.

With a few dozen servers total, I have servers at work that have not had a failure in 8+ years, and we have some hardware that is 12+ years, and until office and data centre moves recently we had hardware that had not been rebooted for 5 years.

We have moved everything to VMs that we take hourly copies of, and can redeploy most of our VMs in minutes because we do know we need to be prepared for hardware failures, and occasionally face them, but they are rare events at our scale.

For people with even smaller setups, with only a handful of servers, they can easily have periods of years without any failures. Then it's easy for people to get complacent.
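The scale effect can be put into rough numbers. This is a back-of-the-envelope sketch, assuming every server fails independently with the same annual probability; the 3% rate is a placeholder I picked for illustration, not a measured figure from this thread or from any provider.

```python
def p_no_failure(servers, years, annual_failure_rate):
    """Probability that none of `servers` machines fails over `years` years,
    assuming independent failures at a constant annual rate."""
    return (1.0 - annual_failure_rate) ** (servers * years)


# Hypothetical 3% annual failure rate per server.
for servers in (5, 36, 1000):
    print(servers, p_no_failure(servers, years=1, annual_failure_rate=0.03))
```

Under that assumption, five servers get through a year untouched about 86% of the time, while a fleet of a thousand essentially never does, which is why the small shop and the large one come away with such different intuitions.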

apetresc12y ago

Not only do they send you an e-mail about this, they even have an API call for it: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitorin...

Anyone who's surprised that this happens has not used EC2 very much. It is this way by design.
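For reference, a hedged sketch of polling for those events from Python. It assumes the third-party boto3 library (my choice, not something the linked page prescribes) and the response shape of the EC2 DescribeInstanceStatus call; the helper names and the sample instance ID are hypothetical, and pagination is ignored to keep it short.

```python
def scheduled_events(response):
    """Flatten a DescribeInstanceStatus-style response into
    (instance_id, event_code, not_before) tuples."""
    found = []
    for status in response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            found.append(
                (status["InstanceId"], event["Code"], event.get("NotBefore"))
            )
    return found


def fetch_scheduled_events(region_name):
    """Query EC2 for scheduled events. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the parsing above works without it

    client = boto3.client("ec2", region_name=region_name)
    # IncludeAllInstances also reports stopped instances; only the first
    # page of results is read in this sketch.
    response = client.describe_instance_status(IncludeAllInstances=True)
    return scheduled_events(response)


# Hand-written sample in the shape the API returns (values are made up).
sample = {
    "InstanceStatuses": [
        {"InstanceId": "i-0123456789abcdef0",
         "Events": [{"Code": "instance-retirement",
                     "Description": "The instance is running on degraded hardware"}]},
        {"InstanceId": "i-0fedcba9876543210", "Events": []},
    ]
}
print(scheduled_events(sample))
```

Checking this from a cron job or monitoring system means a missed or spam-filtered email is no longer the only warning you get.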

tschellenbach12y ago

Usually they send you a nice email about this. Then you have to look up the instance and hope it's a web worker and not our main database :)

RockyMcNuts12y ago

I've gotten one of those emails and thought, OK it's gonna reboot, not a problem for that instance, has no persistent data I care about.

Then it kept running, but there was no way to reboot it from EC2 console or ssh, so that was a bit of a problem, had to get support to do it.

Moral - reboot it yourself at a convenient time.

keithgabryelski12y ago

To work in AWS's system you must have redundant nodes -- such that any single node can be rebooted without affecting the system as a whole.

Notification that your system is on old hardware that has been deprecated is part of the price of doing business in this cloud system.

As others have noted: yes, it is a little tense (is this my production database or my Continuous Integrations machine) -- The email you get just gives you an aws-id token, so you must look it up.

But AWS has enough components that help you build resilient systems that, if you've done your job correctly, you shouldn't care about these messages other than the labor of spinning up a replacement.
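One way to make the "any single node can be rebooted" rule checkable is to list the roles that have fewer than two nodes behind them. A minimal sketch, assuming you keep some inventory of which instances serve which role; the role names, instance IDs, and the function name are invented for the example.

```python
def single_points_of_failure(nodes_by_role):
    """Return the roles served by fewer than two distinct nodes."""
    return sorted(
        role for role, nodes in nodes_by_role.items() if len(set(nodes)) < 2
    )


# Hypothetical inventory: role -> instance IDs currently serving it.
inventory = {
    "web": ["i-web-1", "i-web-2"],
    "database": ["i-db-1"],
    "ci": [],
}
print(single_points_of_failure(inventory))
```

Any role this reports is one where a retirement notice means scheduled downtime instead of a routine replacement.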

dabs_return12y ago

Luke@AWS updated your thread. Makes a lot more sense now as a notice would only be sent if it was a scheduled eviction.

sudhirj12y ago

Reminds me of http://www.goodreads.com/quotes/379100-there-s-no-point-in-a...

gyepi12y ago

War story: I was once called in to scale an application that had been running on AWS for 6 or 7 months and was failing due to excessive traffic. Normally a good problem to have, but this turned into a difficult problem because the application stored critical data on an EBS and those are, of course, not sharable. The only solution was to move to increasingly larger instances until the application could be rewritten. Moral: If you are on the "cloud", make sure your application design fits your infrastructure.

aidos12y ago

Once upon a time there was EC2, without EBS. It was actually a pretty good place to be. There was no ambiguity because everyone who used EC2 was given a lot of warnings about how they'd have to architect their systems to avoid critical failure. I wonder if the introduction of EBS has actually increased data loss because people aren't as paranoid about it.

tbarbugli12y ago

What's the point of this entry? Are we surprised that hardware fails? I am the complete opposite of an EC2 fanboy, but every time they decided to shut down a machine they had the good taste to send us an email.
