The way it works is that you copy the RAM pages from one machine to another in real time, and when the RAM cache is almost synchronized you slam the IP address over to the new box (then you let Amazon reboot your old box and migrate back post-upgrade if you want to).
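For anyone who hasn't seen this style of migration before, here is a toy simulation of the generic iterative pre-copy idea the description above resembles. It is only a sketch, not terminal.com's implementation: the function name, the `dirty_rate` model (the guest re-dirties a fixed fraction of whatever was just sent) and the `threshold` cutoff are all assumptions made for illustration.

```python
import random


def precopy_migrate(num_pages, dirty_rate, threshold, max_rounds=30, seed=0):
    """Simulate iterative pre-copy migration of RAM pages.

    Returns (rounds, pages_sent_live, pages_sent_paused).
    Hypothetical model: while a round of pages is being copied, the
    still-running guest dirties `dirty_rate` times as many pages.
    """
    if not 0 <= dirty_rate < 1:
        raise ValueError("dirty_rate must be in [0, 1) for the copy to converge")
    rng = random.Random(seed)
    dirty = set(range(num_pages))  # nothing copied yet, so every page is pending
    pages_sent_live = 0
    rounds = 0
    # Keep copying while the guest runs, until the remaining set is small.
    while len(dirty) > threshold and rounds < max_rounds:
        sent = len(dirty)
        pages_sent_live += sent
        # Pages written by the guest during this round must be sent again.
        newly_dirty = int(sent * dirty_rate)
        dirty = set(rng.sample(range(num_pages), newly_dirty))
        rounds += 1
    # Final step: pause the guest, copy what is left, then move the IP over.
    pages_sent_paused = len(dirty)
    return rounds, pages_sent_live, pages_sent_paused


if __name__ == "__main__":
    rounds, live, paused = precopy_migrate(10_000, dirty_rate=0.1, threshold=50)
    print(f"rounds={rounds} live={live} paused={paused}")
```

The point of the loop is that only the last, small batch of pages is copied while the guest is paused, which is why the switch-over can feel instantaneous.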
You can try it out on our public cloud at terminal.com if you'd like to (we auto-migrate all of our customers off of the degrading hardware before it reboots on our public cloud, but you can control that if you're running terminal as your infrastructure).
Are you migrating just a process tree / other contained environment, or the entire machine?
Are you using CRIU or similar? Do open TCP connections survive the transfer?
Custom container implementation, custom networking, custom storage.
It's just really good hardcore kernel engineering.
If you wanna talk more and you're in SF, come to our meetup on the 10th: machinelearningsf.eventbrite.com.
Edit: the whole machine including RAM cache, CPU instructions, IP connections, etc. is carried over. We can also resize your machine in seconds while it's running.
Or if the SDN of your cloud is good enough, even TCP_REPAIR might not be needed!
Process isolation is hard, but we've achieved it. We currently have some tens of thousands of users on our public cloud with zero container breakout, and while no security is perfect, we're constantly trying to improve our offering through White Hat bounties and constant security testing. In this case, I can tell you heuristics with which you can infer security, but I can't blanket label something as secure. I would say I think it's the most secure new virtualization tech, but I would also note that's a matter of personal opinion. Again, zero container breakout is probably the main point.
You can run our virtualization inside of Amazon, in which case you only really have the pain of Xen host + Amazon Xen, but it performs faster on bare-metal (as one might expect).
How is it different from Xen or KVM live migration?
It runs inside of any hypervisor or on bare metal.
Feel free to email me at josh[at]terminal[dot]com if you want to talk more. I can peel back the kimono quite far (we're also in SF if you wanna meet up).
What is sad is that we learn about it from Hacker News and not from AWS, even when we have premium support and our own account manager. :/
Let's see how many of us did our homework after the previous "Xen update", and how much "10%" is now ;-)
Not wanting a repeat of this, we have migrated as many services as we can into autoscaling groups, and automated all resource creation with CloudFormation.
This was inspired by this excellent Netflix blogpost: http://techblog.netflix.com/2014/10/a-state-of-xen-chaos-mon...
Overall, I'm looking at a huge quantity of affected servers. That said, I don't blame AWS. I blame my incompetent architect for designing systems that are incredibly hard to upgrade, and that can't be rebooted safely. Definitely not bitter at that idiocy at all.
a) only 10% of the fleet are running a version of the hypervisor that is affected by the bug
b) based on the turnover rate, they expect 10% to still need rebooting out from under customers by the date the bugs are released.
c) 10% are running a combination of the affected hypervisor and VMs that are reasonably at risk of exploitation. Others may have the faulty hypervisor but are either being used as single tenant (there is no risk of someone breaking out and affecting someone else) or are running VMs that may not be able to break out, depending on the nature of the bugs.
Just speculating, any ideas?
Here's a previous Xen vulnerability based on Intel implementing the SYSRET instruction (originally introduced by AMD, along with SYSCALL; Intel's version of this was SYSENTER and SYSEXIT, with different semantics about kernel stacks and things) in a slightly different way from how AMD implemented it. Both Intel's docs and AMD's docs were accurate for their own processors, but if you only read AMD's docs, you'd implement syscalls in a way that was vulnerable on Intel.
https://blog.xenproject.org/2012/06/13/the-intel-sysret-priv...
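For context, the publicly documented form of that bug (CVE-2012-0217) turns on non-canonical return addresses: on Intel, SYSRET with a non-canonical address faults while still in ring 0, whereas on AMD the fault is raised after the switch to ring 3. A minimal sketch of the canonical-address test a kernel could apply before taking the SYSRET path, assuming 48-bit virtual addresses; the function names are made up for illustration and are not taken from Xen.

```python
def is_canonical(addr, va_bits=48):
    """True if a 64-bit address is canonical for the given virtual address width.

    Canonical means bits 63 down to (va_bits - 1) are all zeros or all ones,
    i.e. the address is a sign extension of its low va_bits bits.
    """
    if not 0 <= addr < (1 << 64):
        raise ValueError("addr must fit in 64 bits")
    top = addr >> (va_bits - 1)                 # bits 63..47 for 48-bit VAs
    all_ones = (1 << (64 - va_bits + 1)) - 1    # 17 one-bits for 48-bit VAs
    return top == 0 or top == all_ones


def can_return_with_sysret(user_rip):
    """Hypothetical guard: only use SYSRET when the return address is canonical.

    A real kernel would fall back to a slower return path (IRET) otherwise.
    """
    return is_canonical(user_rip)
```

Something like `can_return_with_sysret(0x0000800000000000)` would be rejected, since that is the first non-canonical address above the lower half.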
And of course, Amazon has an interest in encouraging its users to build their systems in a cloud-friendly way, i.e., properly designed services on AWS should not suffer from having a handful of VMs get rebooted at any time. So, from that POV, it's just good medicine to encourage the culture they built their service to accommodate.
they probably are operating at a pretty high percentage, so gradually idling 10% of their host hardware would probably put a pinch on availability
If they can withstand the loss of one out of three availability zones, they must have at least 33% spare capacity in each AZ, ready for new instances to start to take up the slack from the lost AZ. Even if they have more availability zones than the three they show to users, I would hope they would have more than 10% spare capacity!
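The 33% figure above can be checked with a bit of arithmetic. This is a back-of-the-envelope sketch that assumes equally sized zones with load spread evenly across them, which is my simplification and not something AWS has stated; the function name is hypothetical.

```python
from fractions import Fraction


def required_spare_fraction(total_zones, zones_lost=1):
    """Fraction of each zone's capacity that must sit idle to absorb lost zones.

    Assumes equal zone capacity C and equal per-zone load L. After the loss,
    the survivors share all the load, so each carries
    total_zones * L / surviving, which must not exceed C.
    """
    surviving = total_zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    max_utilization = Fraction(surviving, total_zones)  # largest safe L / C
    return 1 - max_utilization


if __name__ == "__main__":
    for zones in (3, 4, 5):
        print(zones, "zones:", required_spare_fraction(zones))
```

Under these assumptions the required headroom for a single-zone loss is 1/N of each zone's capacity, so it shrinks as the number of zones grows.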
It sounds like you could spin up new instances until you get one that doesn't require the update.
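If you wanted to automate that check, one option is to look at the scheduled events that EC2 reports per instance. The sketch below only filters a response shaped like the one from boto3's `describe_instance_status` call; it makes no API calls itself. Whether this particular maintenance shows up as a scheduled event for every affected instance is an assumption on my part, and the helper name is made up.

```python
# Event codes that imply the instance or its host will be rebooted.
REBOOT_CODES = {"instance-reboot", "system-reboot", "system-maintenance"}


def instances_with_pending_reboot(status_response):
    """Return IDs of instances that still have a reboot-type event scheduled.

    `status_response` is a dict shaped like the result of
    ec2_client.describe_instance_status(...).
    """
    flagged = []
    for status in status_response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            description = event.get("Description", "")
            # Finished or cancelled events keep appearing with a prefix.
            done = description.startswith(("[Completed]", "[Canceled]"))
            if event.get("Code") in REBOOT_CODES and not done:
                flagged.append(status["InstanceId"])
                break
    return flagged
```

In practice you would pass in the result of `boto3.client("ec2").describe_instance_status(InstanceIds=[...])` for a freshly launched instance and terminate it if its ID comes back flagged.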
These must be some seriously bad mojo to force reboots with little to no notice over a week before they're scheduled to leave embargo.
5 undisclosed Xen vulns. Wheeeeee!
The ticket said it was for emergency physical maintenance. I assumed it was hardware failure, but I guess that definition could extend to hypervisor issues. Now to wait and see if Linodes in other regions are rebooted too.
If they treat it like the last round of Xen vulnerabilities, they will simply place a warning on their dashboard an hour beforehand - not sending out any form of email notice. The first we knew about it was when we started receiving alerts from nagios.
https://community.rackspace.com/general/f/53/t/4978
I wasn't able to find anything on Digital Ocean's public facing websites.
If they are using Xen, they shouldn't know the details of the vuln yet as they aren't on the pre-disclosure list: