But obviously, if there's a hard failure, they aren't always going to be able to give you the amount of time you'd want. Generally speaking, you should have accounted for this situation ahead of time in your engineering plans. Amazon EC2 doesn't have anything like vMotion; it's just a bunch of Xen virts.
If you're using the GUI, the first time you try a shutdown, it will do a normal request, but then if you go back and try it again while the first request is still pending, you should see the option for doing a hard restart. Try that and give it some time. Sometimes it takes an hour or two to get through. Otherwise, Amazon's tech support can help you.
I believe this comes as a shock to most people the first time they receive this email; it was to us at least. When we signed up with Amazon there was no guideline or advice saying "hey, in a year or two your hardware might fail or need replacing, have a migration plan ready".
Perhaps it was our naivety, but we just thought: hey, it's the cloud, what could go wrong?! Now of course we are battle-hardened.
[1] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-... (bottom of the page)
But "cloud" compute services should in general be treated as less reliable per individual unit unless your provider explicitly explains to you why not (such as by guaranteeing the use of a high-availability distributed filesystem), as you have no direct way of ascertaining the status of the underlying hardware. You need to plan for failure regardless.
Have you given any thought to moving to something like https://circleci.com? [disclosure: I work there]
> I just looked into your instance a bit further, it does appear this was due to an issue with the underlying host on which your instance resided upon when you encountered the issues today.
> I do want to clarify here as our original reply mentioned a scheduled retirement, this was not the case and no notice was sent out because of this.
It looks like the original email was incorrect.
To add to this, though: if people rely on advance email notifications from AWS, they are putting their availability at risk. Being on the cloud doesn't insulate you from hardware issues; these need to be planned for. AWS does provide some building blocks to address this (auto-scaling and load balancers come to mind).
I searched my mail for a couple of keywords, including the instance ID, and found nothing. There was no email in the spam folder from the last month either.
It was an EBS-backed instance, and since I had a snapshot, it did not take much time to recreate a new instance. Besides that, I had another instance behind the ELB to avoid downtime.
It is fair to expect some notification for a scheduled retirement.
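For what it's worth, you don't have to wait for the email: scheduled events are also exposed through the API. A rough sketch of polling for them, assuming a boto3-style EC2 client is passed in (the function name `pending_events` is made up for illustration, and the "[Completed]"/"[Canceled]" description prefixes are how I understand EC2 marks finished events, so treat that filter as an assumption):

```python
def pending_events(ec2_client):
    """Collect (instance_id, event_code, not_before) for events still pending.

    ec2_client is assumed to behave like boto3.client("ec2").
    """
    events = []
    kwargs = {"IncludeAllInstances": True}  # include stopped instances too
    while True:
        page = ec2_client.describe_instance_status(**kwargs)
        for status in page.get("InstanceStatuses", []):
            for event in status.get("Events", []):
                description = event.get("Description", "")
                # Assumption: finished events keep showing up for a while,
                # with their description prefixed like this.
                if description.startswith(("[Completed]", "[Canceled]")):
                    continue
                events.append(
                    (status["InstanceId"], event["Code"], event.get("NotBefore"))
                )
        token = page.get("NextToken")
        if not token:
            return events
        kwargs["NextToken"] = token  # fetch the next page of results
```

Run from cron with something like `pending_events(boto3.client("ec2"))` and alert on any `instance-retirement` code. It only covers scheduled events, of course; a sudden host failure gives no warning either way.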
In any case, I'm not really trying to criticise here. Just pointing out an engineering trade-off.
Now, if you lose an EBS volume, that's totally different. You are snapshotting your EBS volumes, correct?
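If you aren't, a minimal sketch of what that could look like, assuming a boto3-style EC2 client (the function name `snapshot_instance_volumes` and the description format are invented for the example, and it ignores pagination, which only matters if an instance has a very large number of volumes):

```python
def snapshot_instance_volumes(ec2_client, instance_id, note="scheduled backup"):
    """Snapshot every EBS volume attached to instance_id; return snapshot IDs.

    ec2_client is assumed to behave like boto3.client("ec2").
    """
    response = ec2_client.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )
    snapshot_ids = []
    for volume in response.get("Volumes", []):
        volume_id = volume["VolumeId"]
        snapshot = ec2_client.create_snapshot(
            VolumeId=volume_id,
            # Free-form text so you can tell later what the snapshot was for.
            Description=f"{note}: {instance_id} {volume_id}",
        )
        snapshot_ids.append(snapshot["SnapshotId"])
    return snapshot_ids
```

Note this only starts the snapshots; they complete asynchronously, and for a busy database you would want to quiesce or freeze writes first to get a consistent image.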
The first thing you discover when reading through the various options is that you need to treat ALL local storage like /tmp, subject to deletion at will. Keep your persistent storage on EBS/S3.
Your volume experienced a failure due to multiple failures of the
underlying hardware components and we were unable to recover it.
Although EBS volumes are designed for reliability, backed by multiple
physical drives, we are still exposed to durability risks caused by
concurrent hardware failures of multiple components, before our systems
are able to restore the redundancy. We publish our durability expectations
on the EBS detail page here (http://aws.amazon.com/ebs).
Sincerely,
EBS Support
Fortunately, we had recent snapshots and it was a matter of (manually) spinning up a new instance from those.
Edit: proper quotation
My point being: on this topic, AWS could learn from Microsoft about how to do cloud.
They could potentially do this on their second generation (M3) instances, as well as micro instances if they wanted to. However I'd guess that these instances are just a small percentage of the overall servers used.
Migrations would be restricted to hosts running specific releases of the hypervisor [1], and AWS's SDN systems would need to handle these changes in very-near-realtime.
[1] wiki.xen.org/wiki/Xen_Version_Compatibility
Cloud platforms let you avoid physically dealing with the hardware, and let you conveniently use ec2-create-snapshot instead of shuttling tapes back and forth, but the paradigm is exactly the same.
If you care about your data and your servers, you have to plan for failure. Cloud or not.
For example, I have a client who has some algorithms and data that are potentially quite valuable. EC2 and other AWS services would be a huge help with their project, but is there any way measures could be taken to ensure that no one - even Amazon employees - can get to their code and data?
Edit: devicenull makes some good points - I guess I had the CIA's $600 million AWS contract in my head when asking my question.
After all, you cannot stop someone from taking a full snapshot of the VM and grabbing all the information. Encryption is no help here, as the VM ultimately needs to store the key in memory.
If it's really that valuable (lots of companies seem to overestimate how much people would want to steal their data), then it really should never leave hardware under their control.
Dude, my bank/email-host/health-insurer is teh suk. They overestimate the value of data confidentiality. I hope this does not become a new trend. I expect the companies that I deal with to play fast and loose with the data they control. Encrypting Data at rest? C'mon bro, if the data is so important why is it just sitting there with nobody using it.
I think Amazon needs to put a lot more effort into educating people about the best practices involved here: creating immutable and disposable servers, making it easier (console access) to create availability groups, etc.
Then you should educate them. This isn't something unique to the cloud; physical servers absolutely can do this too. I work with thousands of (physical) servers in the day job, and we have all kinds of failures that take out individual hosts on a regular basis.
With a few dozen servers total, I have servers at work that have not had a failure in 8+ years, we have some hardware that is 12+ years old, and until recent office and data centre moves we had hardware that had not been rebooted for 5 years.
We have moved everything to VMs that we take hourly copies of, and can redeploy most of our VMs in minutes because we do know we need to be prepared for hardware failures, and occasionally face them, but they are rare events at our scale.
People with even smaller setups, with only a handful of servers, can easily go years without any failures. Then it's easy for people to get complacent.
Anyone who's surprised that this happens has not used EC2 very much. It is this way by design.
Then it kept running, but there was no way to reboot it from the EC2 console or over ssh, so that was a bit of a problem; had to get support to do it.
Moral - reboot it yourself at a convenient time.
Notification that your system is on old hardware that has been deprecated is part of the price of doing business in this cloud system.
As others have noted: yes, it is a little tense (is this my production database or my Continuous Integration machine?). The email you get just gives you an aws-id token, so you must look it up.
But AWS has enough components to help you build resilient systems that, if you've done your job correctly, you shouldn't care about these messages beyond the labor of spinning up a replacement.
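The "look it up" step is easy enough to script. A small sketch, assuming a boto3-style EC2 client and that you label your machines with a `Name` tag (the helper name `describe_by_id` and the returned dict layout are just for illustration):

```python
def describe_by_id(ec2_client, instance_id):
    """Map an instance ID from a retirement email to something recognisable.

    ec2_client is assumed to behave like boto3.client("ec2").
    Returns None if the response contains no matching instance.
    """
    response = ec2_client.describe_instances(InstanceIds=[instance_id])
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # Tags arrive as a list of {"Key": ..., "Value": ...} dicts.
            tags = {tag["Key"]: tag["Value"] for tag in instance.get("Tags", [])}
            return {
                "name": tags.get("Name", "(no Name tag)"),
                "type": instance.get("InstanceType"),
                "state": instance.get("State", {}).get("Name"),
            }
    return None
```

With the real API an unknown instance ID raises an error rather than returning an empty result, so wrap the call if you feed it IDs straight out of an inbox.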