Upcoming AWS Security Maintenance

(aws.amazon.com)

98 points · mattybrennan · 11y ago · 57 comments

josh2600 · 11y ago · 9 in thread

If you use Terminal on top of AWS (one deployment option) we can just migrate your workloads without rebooting.

The way it works is that you copy the RAM pages from one machine to another in real time, and when the RAM is almost synchronized you slam the IP address over to the new box (then you let Amazon reboot your old box, and migrate back post-upgrade if you want to).

You can try it out on our public cloud at terminal.com if you'd like to (we auto-migrate all of our customers off of the degrading hardware before it reboots on our public cloud, but you can control that if you're running terminal as your infrastructure).
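As a rough illustration of the copy-until-almost-synchronized loop described above, here is a toy simulation of iterative pre-copy migration. It is only a sketch of the general technique, not Terminal's implementation; the function name, its parameters, and the constant dirty rate are all assumptions made for the example.

```python
def precopy_rounds(total_pages, dirty_rate, copy_rate, stop_threshold, max_rounds=30):
    """Simulate iterative pre-copy and return the pages sent in each round.

    total_pages    -- pages of RAM to migrate (hypothetical workload size)
    dirty_rate     -- pages the workload dirties per second (assumed constant)
    copy_rate      -- pages the link can copy per second
    stop_threshold -- once this few pages remain, pause and do the final copy
    """
    if dirty_rate >= copy_rate:
        # The copy can never catch up with the workload.
        raise ValueError("workload dirties memory faster than it can be copied")
    remaining = total_pages
    rounds = []
    for _ in range(max_rounds):
        rounds.append(remaining)
        if remaining <= stop_threshold:
            # Final stop-and-copy round: the workload is paused for this one,
            # after which the IP address would be moved to the new box.
            break
        # Pages dirtied while this round was being copied must be re-sent.
        remaining = remaining * dirty_rate // copy_rate
    return rounds


if __name__ == "__main__":
    rounds = precopy_rounds(1_000_000, 10_000, 100_000, 1_000)
    print(rounds)
    print("pause length (s):", rounds[-1] / 100_000)
```

Each round only re-sends what was dirtied during the previous one, so the set shrinks geometrically as long as the copy rate beats the dirty rate. Only the last, small round needs the workload paused.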

geofft · 11y ago

... how?? That is seriously nifty.

Are you migrating just a process tree / other contained environment, or the entire machine?

Are you using CRIU or similar? Do open TCP connections survive the transfer?

josh2600 · 11y ago

We wrote a bunch of hacks to the Linux kernel to do it.

Custom container implementation, custom networking, custom storage.

It's just really good hardcore kernel engineering.

If you wanna talk more and you're in SF, come to our meetup on the 10th: machinelearningsf.eventbrite.com.

Edit: the whole machine including RAM cache, CPU instructions, IP connections, etc. is carried over. We can also resize your machine in seconds while it's running.

theonewolf · 11y ago

With TCP_REPAIR, presumably they could...but both ends need to implement the REPAIR option I think, so maybe not in practice yet.

Or if the SDN of your cloud is good enough, even TCP_REPAIR might not be needed!

ngrilly · 11y ago

Are you running your VMs inside Amazon VMs? Or are you running containers instead, to avoid the overhead of having 3 nested OSs (the Xen host > the Amazon Xen guest > your VM)? If you run containers, how do you guarantee isolation of tenants (it is generally considered to be very difficult to achieve)?

josh2600 · 11y ago

We are running a custom container implementation. The goal of our implementation is containers that perform like VMware.

Process isolation is hard, but we've achieved it. We currently have some tens of thousands of users on our public cloud with zero container breakout, and while no security is perfect, we're constantly trying to improve our offering through White Hat bounties and constant security testing. In this case, I can tell you heuristics with which you can infer security, but I can't blanket label something as secure. I would say I think it's the most secure new virtualization tech, but I would also note that's a matter of personal opinion. Again, zero container breakout is probably the main point.

You can run our virtualization inside of Amazon, in which case you only really have the pain of Xen host + Amazon Xen, but it performs faster on bare-metal (as one might expect).

ngrilly · 11y ago

Isn't your ability to "migrate workloads without rebooting" similar to Google Compute Engine transparent maintenance and to the live-update capability that Amazon is progressively deploying (which is explained in the post)?

How is it different from Xen or KVM live migration?

josh2600 · 11y ago

It's much faster and doesn't use VMs.

jedberg · 11y ago

I don't see anything on your web page about running on top of AWS...? It looks like you guys only run your own cloud. Can you point me at some docs or anything about running on AWS?

josh2600 · 11y ago

I don't have docs yet because I haven't written them, but it's running on AWS right now.

It runs inside of any hypervisor or on bare metal.

Feel free to email me at josh[at]terminal[dot]com if you want to talk more. I can peel back the kimono quite far (we're also in SF if you wanna meet up).

zytek · 11y ago · 4 in thread

Been there, done that. AWS re:Boot in September 2014 showed us how good it was to invest in Ansible roles for all parts of our infrastructure. Still, it was a lot of hassle for the Ops team, especially since it happened during DevOps Days Warsaw ;-) AWS also said '10%' then, but for us it was 81 out of ~300 instances.

What is sad is that we learn about it from Hacker News and not from AWS, even though we have premium support and our own account manager. :/

Let's see how many of us did their homework after the previous "xen update", and how much "10%" is now ;-)

dups · 11y ago

Similar experience here. I have a few particularly memorable experiences of dealing with the fallout from the September reboots. Although to be truthful this is partly due to the fact we were moving office and I had people jubilantly packing up around me as I worked to keep things afloat.

Not wanting a repeat of this, we have migrated as many services as we can into autoscaling groups, and automated all resource creation with CloudFormation.

This was inspired by this excellent Netflix blogpost: http://techblog.netflix.com/2014/10/a-state-of-xen-chaos-mon...

soccerdave · 11y ago

I have 19 instances (18 in us-west-2) and none of them are affected. I would guess that lots of people here run in us-east-1 since that's the longest running region and I would bet that a lot of that 10% exists there. So, it may be 10% total in all regions but a higher percentage if you run in us-east-1. Just a guess though.

soccerdave · 11y ago

I guess all the Events weren't showing up yet because now I have 6 / 19 instances going down for a reboot.

pkapkg · 11y ago

44% of my instances in us-west-2, 55% of my instances in us-east-1, and 18% of my instances in eu-west-1 are affected. It seems to be tied pretty tightly to instance types.

Overall, I'm looking at a huge quantity of affected servers. That said, I don't blame AWS. I blame my incompetent architect for designing systems that are incredibly hard to upgrade, and that can't be rebooted safely. Definitely not bitter at that idiocy at all.
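For anyone wanting to tally per-region numbers like the ones above, here is a small sketch. It assumes a list of entries shaped like the `InstanceStatuses` items in an EC2 DescribeInstanceStatus response (for example as returned by boto3's `describe_instance_status`), that the region is the availability zone name minus its trailing letter, and that the event codes of interest are `instance-reboot` and `system-reboot`. The function name and sample data are made up for illustration.

```python
from collections import defaultdict

# Scheduled-event codes treated as "affected" (assumed set for this sketch).
REBOOT_CODES = {"instance-reboot", "system-reboot"}


def affected_by_region(statuses):
    """Return {region: (affected, total)} for DescribeInstanceStatus-style entries."""
    counts = defaultdict(lambda: [0, 0])
    for status in statuses:
        # Assumes standard zone names: "us-west-2a" -> "us-west-2".
        region = status["AvailabilityZone"][:-1]
        counts[region][1] += 1
        events = status.get("Events", [])
        if any(event["Code"] in REBOOT_CODES for event in events):
            counts[region][0] += 1
    return {region: tuple(pair) for region, pair in counts.items()}


if __name__ == "__main__":
    # Hypothetical sample data, not real instances.
    sample = [
        {"InstanceId": "i-aaa", "AvailabilityZone": "us-west-2a",
         "Events": [{"Code": "system-reboot"}]},
        {"InstanceId": "i-bbb", "AvailabilityZone": "us-west-2b", "Events": []},
        {"InstanceId": "i-ccc", "AvailabilityZone": "us-east-1a"},
    ]
    for region, (affected, total) in affected_by_region(sample).items():
        print(f"{region}: {affected}/{total} scheduled for reboot")
```

Grouping the same data by instance type instead of region would be one way to check the hunch that the reboots track instance types.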

alimoeeny · 11y ago · 4 in thread

Does anybody know what this 10% means? I mean:

a) only 10% of the fleet are running a version of the hypervisor that is affected by the bug

b) based on the turnover rate, they expect 10% to still need rebooting out from under customers by the date the bugs are released.

c) 10% are running a combination of the affected hypervisor and VMs that are reasonably at risk of exploitation; others may have the faulty hypervisor but are either being used as single tenant (there is no risk of someone breaking out and affecting someone else) or are running VMs that may not be able to break out, depending on the nature of the bugs.

Just speculating, any ideas?

geofft · 11y ago

In the past, Xen has had vulnerabilities based on differences between Intel and AMD processors, or even between different processors from the same company. It seems likely that the fleet is all running the same version of the hypervisor, but the bug only matters on 10% of their hardware.

Here's a previous Xen vulnerability based on Intel implementing the SYSRET instruction (originally introduced by AMD, along with SYSCALL; Intel's version of this was SYSENTER and SYSEXIT, with different semantics about kernel stacks and things) in a slightly different way from how AMD implemented it. Both Intel's docs and AMD's docs were accurate for their own processors, but if you only read AMD's docs, you'd implement syscalls in a way that was vulnerable on Intel.

https://blog.xenproject.org/2012/06/13/the-intel-sysret-priv...

cortesoft · 11y ago

In this case, as is explained in the post, the reason it is only 10% is that the newer hardware can be upgraded without requiring a reboot.

jerf · 11y ago

They explain this in the second paragraph.

jarfil · 11y ago

d) All of the machines require the patch, but on 90% they have already applied it with no need to reboot.

elmin · 11y ago · 3 in thread

It's a bit odd that they don't stop launching new VMs on the old hardware. That would allow people who wanted to control the transition to just stop and start their VMs.

skywhopper · 11y ago

This was my reaction, too. Thinking it over some, I think it's likely that certain tiers of instance types are more affected than others. And it's also likely that even though AWS seems to have lots of open capacity available, they probably are operating at a pretty high percentage, so gradually idling 10% of their host hardware would probably put a pinch on availability. The average lifetime of instances may also be a factor. I have a few long-running instances going most of the time, but I do tend to start and stop dozens of instances within minutes, depending on what I'm working on. So for those use cases, it hardly matters.

And of course, Amazon has an interest in encouraging its users to build their systems in a cloud-friendly way, i.e., properly designed services on AWS should not suffer from having a handful of VMs get rebooted at any time. So, from that POV, it's just good medicine to encourage the culture they built their service to accommodate.

michaelt · 11y ago

  they probably are operating at a pretty 
  high percentage, so gradually idling 10% 
  of their host hardware would probably 
  put a pinch on availability

If they can withstand the loss of one out of three availability zones, they must have at least 33% spare capacity in each AZ, ready for new instances to start to take up the slack from the lost AZ.

Even if they have more availability zones than the three they show to users, I would hope they would have more than 10% spare capacity!
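The arithmetic behind that 33% figure can be checked with a few lines. This sketch assumes equal-sized zones carrying equal load, with all of a lost zone's load failing over to the survivors; the function name is made up for the example. If each of n zones runs at utilization u, the surviving n - k zones can absorb everything only when n * u <= n - k, so the idle fraction 1 - u must be at least k / n.

```python
from fractions import Fraction


def required_spare_fraction(zones, zones_lost=1):
    """Minimum idle fraction per zone so the survivors can absorb the lost zones.

    Assumes equal-sized zones with equal load (a simplification):
    n * u <= n - k  =>  spare = 1 - u >= k / n.
    """
    if not 0 < zones_lost < zones:
        raise ValueError("zones_lost must be between 1 and zones - 1")
    return Fraction(zones_lost, zones)


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        spare = required_spare_fraction(n)
        print(f"{n} zones: at least {float(spare):.0%} spare per zone")
```

Under these assumptions the requirement stays above 10% for any region with fewer than ten equal zones, which is the point of the comment above.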

tlrobinson · 11y ago

"there is no guarantee your new instance would not also need to be rebooted during the update window."

It sounds like you could spin up new instances until you get one that doesn't require the update.

hendersoon · 11y ago · 2 in thread

Linode forced a reboot for us last night also. They did not disclose why, for some reason, even though I pointedly asked. Downtime was ~20 minutes.

This must be some seriously bad mojo to force reboots with little to no notice over a week before the vulnerabilities are scheduled to leave embargo.

VonGuard · 11y ago

Yup: http://xenbits.xen.org/xsa/

5 undisclosed Xen vulns. Wheeeeee!

mappu · 11y ago

I had a forced reboot on one of my Linodes yesterday, also with ~20 minutes downtime.

The ticket said it was for emergency physical maintenance. I assumed it was hardware failure, but I guess that definition could extend to hypervisor issues. Now to wait and see if Linodes in other regions are rebooted too.

jamescun · 11y ago · 2 in thread

We contacted SoftLayer about this issue. They literally had not heard anything about it and said they would "contact their datacenter team".

If they treat it like the last round of Xen vulnerabilities, they will simply place a warning on their dashboard an hour beforehand - not sending out any form of email notice. The first we knew about it was when we started receiving alerts from nagios.

blacksmith_tb · 11y ago

Sigh. I opened a ticket with SoftLayer regarding it, too. And got pretty much exactly the same response - nothing is scheduled 'at this time' but they will 'let us know' if they need to reboot any of the hosts we're on. Joy.

iancarroll · 11y ago

I just got notified that they'll be sending times for the reboots soon (if needed).

ericcholis · 11y ago · 2 in thread

Rackspace notice regarding the same patch:

https://community.rackspace.com/general/f/53/t/4978

I wasn't able to find anything on DigitalOcean's public-facing websites.

akerl_ · 11y ago

DigitalOcean uses KVM, I thought? Assuming that's true, they're almost certainly not affected.

If they are using Xen, they shouldn't know the details of the vuln yet as they aren't on the pre-disclosure list:

http://www.xenproject.org/security-policy.html

infamouscow · 11y ago

DigitalOcean uses KVM.

WestCoastJustin · 11y ago

Related: Five new undisclosed Xen vulnerabilities (xen.org) https://news.ycombinator.com/item?id=9116937

edibleEnergy · 11y ago

They've updated the announcement: most of the restarts have been cancelled because they are able to upgrade the machines without reboots.

mrsirduke · 11y ago

I think it will be interesting to see how other providers handle this.

teh · 11y ago

Does anyone know what this means for spot instances?

admbk · 11y ago

Wouldn't using kpatch remove the need to reboot instances?

thebouv · 11y ago

Rackspace is doing the same due to the Xen vulns announced.
