Visualizing Meltdown on AWS

(blog.appoptics.com)

197 points · mike_heffner · 8y ago · 69 comments

patrickxb8y ago

It would be nice if AWS could write something official about what they are doing.

I've been noticing major performance changes in our instances and have no idea if it is related to Meltdown or something else.

Google released a blog post specifically on performance: https://blog.google/topics/google-cloud/protecting-our-googl...

It would be nice to have similar transparency from AWS.

bamboozled8y ago

Same, I feel it's irresponsible not to provide more details about this issue.

pathseeker8y ago

Google likely only wrote a blog post about it because they were able to find a way to brag that there was effectively no performance hit. They have never written any blog posts IIRC explaining bad/unpredictable performance on GCE

jacksmith210068y ago

Do they not deserve to brag about it? Heck, they are among those who found these flaws, as well as Broadpwn, Cloudbleed, and Heartbleed, and they deserve credit for the work they have done, imo.

Think the constant ragging on Google that appears on HN is becoming a little too much.

plandis8y ago

It was Google who discovered the vulnerabilities in the first place. How long did they internally wait to give themselves an advantage before disclosing these? I'm honestly asking; I don't know.

__bjoernd8y ago

That's still cool and worth bragging about.

forgot-my-pw8y ago

Amazon will probably earn more from autoscaling due to the slowdown.

rdtsc8y ago

One interesting aspect of these issues and mitigations is that performance really depends on the workload. Just because Google saw little performance impact on their servers doesn't mean your application won't see one. And just because someone said their CPU usage went up 2x doesn't mean it will go up for you.

On an unrelated note, I kind of wish Meltdown had been discovered and exposed separately from Spectre. Intel has managed to weasel its way out of taking responsibility by implying that this is not a bug and that all the other CPUs have similar issues. If they had had to respond to Meltdown only, it would have been a bit harder for their PR and legal departments to deny the security and performance implications.

boulos8y ago

Disclosure: I work on Google Cloud.

Just a nit, we said [1] that not only are our own applications in production doing fine (even against Variant 2) but also we haven’t been inundated with support calls over the last few months while mitigations were silently rolled out at the host kernel and hypervisor layer. So this class of “Hey, my instances are suddenly way slower, I didn’t do anything” isn’t happening on GCE.

That does not mean that if you perform guest OS updates, don’t have PCID enabled, etc. that you won’t see a degradation. We certainly haven’t tried all permutations of all guests, and the kernel patches are still improving. That’s why we’re actively trying to get everyone to rally behind retpoline, KPTI (with PCID enabled), and so on.

[1] https://www.blog.google/topics/google-cloud/protecting-our-g...
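
For anyone wanting to check whether a Linux guest actually reports PCID and KPTI, here is a minimal sketch. It assumes an x86 Linux guest whose /proc/cpuinfo lists `pcid`, `invpcid`, and `pti` on the `flags` line when those features are present or enabled; the function names are made up for illustration, and other platforms may expose this differently.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # The line looks like "flags\t\t: fpu vme de ..."
            return set(line.split(":", 1)[1].split())
    return set()


def mitigation_summary(cpuinfo_text: str) -> dict[str, bool]:
    """Report whether PCID, INVPCID, and page table isolation show up in the flags."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in ("pcid", "invpcid", "pti")}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(mitigation_summary(cpuinfo.read_text()))
```

A guest that shows `pti` but not `pcid` is the case described above as most likely to see a degradation.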

rdtsc8y ago

> but also we haven’t been inundated with support calls over the last few months

Sorry, I didn't mean to imply that it was Google's customers specifically who should measure and be suspicious. I meant in general: say, someone running a service on bare metal outside of any public cloud drawing this conclusion: "Google measured the impact of these bugs and mitigations and didn't see any significant performance regressions, so I probably don't have to worry either".

Also (since you disclosed you work on GCP), I like how Google sponsored Project Zero. Fantastic work, and I am sure it will be a great return on that investment. I can certainly see someone thinking about that when deciding to go with GCP vs. another solution.

mrep8y ago

My team saw a 40% CPU usage increase on all of our EC2 instances and even our RDS instances. We were shocked since the media was downplaying the performance impact.

I tried to start a poll but it seems as though my team was just the unlucky one: https://news.ycombinator.com/item?id=16109036

javitury8y ago

> 40% CPU usage increase

When I read this I thought, how weird. It is usually claimed that the max performance drop is around 30%. Then I remembered that we are talking in ratios, so a 30% drop in performance (0.7) means a 43% increase in CPU usage (1/0.7 ≈ 1.43).
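
The ratio arithmetic can be checked with a short sketch. It assumes the simple model used in the comment, where the same work costs 1/(1 - drop) as much CPU after the slowdown; the function names are illustrative only.

```python
def cpu_increase_from_perf_drop(perf_drop_pct: float) -> float:
    """CPU usage increase (%) needed to do the same work after a throughput drop (%)."""
    remaining = 1.0 - perf_drop_pct / 100.0
    if remaining <= 0:
        raise ValueError("performance drop must be below 100%")
    return (1.0 / remaining - 1.0) * 100.0


def perf_drop_from_cpu_increase(cpu_increase_pct: float) -> float:
    """Inverse direction: the throughput drop (%) implied by a CPU usage increase (%)."""
    return (1.0 - 1.0 / (1.0 + cpu_increase_pct / 100.0)) * 100.0


print(round(cpu_increase_from_perf_drop(30), 1))  # about 42.9
print(round(perf_drop_from_cpu_increase(40), 1))  # about 28.6
```

Read the other way round, the reported 40% CPU increase corresponds to a performance drop of a bit under 29%, which is in line with the commonly quoted worst case.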

schak8y ago

Hi,

How are you trying to measure the performance impact? Are you checking the cloudwatch data or running any specific test?

I am interested to assess the baseline statistics (unpatched so far) and after patching. Any suggestions?
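
One simple way to frame the before/after comparison is to take CPU utilization samples from an unpatched window and a patched window and compare their means. The sketch below assumes the samples have already been exported (for example from CloudWatch) into plain lists; the numbers are invented for illustration and are not data from this thread.

```python
from statistics import mean


def relative_change_pct(baseline: list[float], patched: list[float]) -> float:
    """Percent change in mean CPU utilization from the baseline to the patched window."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero, relative change is undefined")
    return (mean(patched) - base) / base * 100.0


# Hypothetical hourly CPU utilization samples (%), before and after patching.
before = [30.0, 31.0, 29.0, 30.0]
after = [42.0, 41.0, 43.0, 42.0]
print(f"CPU usage changed by {relative_change_pct(before, after):.1f}%")
```

This only says something useful if the workload is comparable across the two windows, so pick periods with similar traffic.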

mrep8y ago

My team's usage is pretty even, so the CloudWatch graph is pretty obvious: https://m.imgur.com/a/khGxU

tomsmeding8y ago

That is the clearest graph ever to show the degradation in performance. How do you get your usage so flat?

plasma8y ago

Are your EC2 instances PV? They are known to suffer perf issues more than the other type (HVM), which you can switch to.

Not sure about RDS

sorenbs8y ago

Do you have premium support with AWS? Maybe try opening a ticket.

hanklazard8y ago

Pardon this likely naive question, but I haven’t seen it addressed yet in all the coverage: what’s the cost in electricity of patching this vulnerability? Does a company like amazon running a massive cloud infrastructure see a non-negligible increase in their cost of doing business?

dkuebric8y ago

It may impact the AWS control plane and amazon.com, but as far as AWS services go, it just means customers will be paying for more instances.

cm21878y ago

But for VMs, do customers pay by CPU usage or by runtime? I am sure the Netflixes of this world optimise their CPU usage, but I also suspect the majority of VMs are mostly idle or have little traffic, as their tasks are either intermittent or they are sized for peak usage.

virtualwhys8y ago

Not on AWS but as far as I can tell from their pricing estimator[1], for 24/7 instances you pay based on time running, not cpu usage.

Can see the appeal of AWS for scaling, redundancy, cross region availability, etc., but for long lived services it seems you pay a high price compared to services like Digital Ocean where the per VM cost is significantly lower.

[1] https://calculator.s3.amazonaws.com/index.html

013a8y ago

Technically, you pay for both time (hours on) and CPU usage (instance tier). It's not like different instance tiers (at least in the same class) use fundamentally more or less powerful processors. They all use the same processors; you just get more or less of them depending on what you pay.

Conceptually it is "pay as you use" by CPU usage, but just rounded into buckets by instance tier.

Of course, there's a lot of underutilization within each bucket, because the granularity isn't per 1% used, but (more or less) per 100% used (aka each core). And also, most applications can't switch instance tiers easily to adapt to demand (though some certainly can).
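
A rough sketch of the bucket effect: because billing is per running instance, a CPU increase costs nothing extra until it pushes a fleet past its utilization target, and then it costs whole instances. The function name, the target utilization, and the example fleet are assumptions for illustration. Integer percentages are used so the ceiling is not thrown off by float rounding.

```python
def instances_needed(current_instances: int, current_util_pct: int,
                     cpu_increase_pct: int, target_util_pct: int) -> int:
    """Instances required to keep per-instance utilization at or below the target."""
    # Total demand in units of (instances * percent * percent), kept as integers.
    demand = current_instances * current_util_pct * (100 + cpu_increase_pct)
    capacity_per_instance = target_util_pct * 100
    return -(-demand // capacity_per_instance)  # integer ceiling division


# Hypothetical fleets of 10 instances with a 70% utilization target and a 40% CPU increase.
print(instances_needed(10, 50, 40, 70))  # 10: the headroom absorbs the increase
print(instances_needed(10, 60, 40, 70))  # 12: two more instances billed around the clock
```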

mike_heffnerOP8y ago

Would love to know if anyone else had data on:

* Impact on M5/C5 instances over similar time period, any difference with the Nitro hypervisor?

* Were Dedicated instances (https://aws.amazon.com/ec2/purchasing-options/dedicated-inst...) patched as well?

* Other examples of software that adapted batching performance automatically with increase in call latency.
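
On the last point, a toy simulation shows how batching can adapt to call latency without any explicit tuning. It assumes a sender that keeps one call in flight and, when that call returns, sends everything that queued up in the meantime. This is a generic pattern, not a description of any particular client library, and all names and numbers are illustrative.

```python
def simulate_batches(arrivals_per_ms: int, call_latency_ms: int,
                     duration_ms: int) -> list[int]:
    """Return the size of each batch sent during the simulated period."""
    batches = []
    queued = 0
    busy_until = 0  # time at which the in-flight call completes
    for now in range(duration_ms):
        queued += arrivals_per_ms
        if now >= busy_until and queued:
            batches.append(queued)  # send everything accumulated so far
            queued = 0
            busy_until = now + call_latency_ms
    return batches


fast = simulate_batches(arrivals_per_ms=2, call_latency_ms=5, duration_ms=50)
slow = simulate_batches(arrivals_per_ms=2, call_latency_ms=10, duration_ms=50)
print(len(fast), fast[-1])  # 10 calls, steady-state batch of 10 messages
print(len(slow), slow[-1])  # 5 calls, steady-state batch of 20 messages
```

In this model, doubling the call latency halves the number of calls and doubles the batch size, so throughput holds while per-call overhead is amortized over more messages.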

cthalupa8y ago

Not able to answer your questions, but a comment on the article -

>During this same time period, we saw additional CPU increases on our PV instances that had been previously upgraded. This seems to imply some level of HVM patching was occurring on these PV instances around the same time that all pure-HVM instances were patched

This is likely due to Vixen: https://lists.xenproject.org/archives/html/xen-devel/2018-01...

>.... Instead of trying to make a KPTI-like approach work for Xen PV, it seems reasonable to run a copy of Xen within an HVM (or PVH) domU .....

>.... all PV instances in EC2 are using this ....

So the initial bump after the reboot would have been the shim hypervisor (Vixen), which mitigates Meltdown for PV guests. The secondary bump, and the bump the native HVM instances saw, would have been the Spectre-related stuff.

Based on https://aws.amazon.com/security/security-bulletins/AWS-2018-... - guessing Intel microcode updates

taf28y ago

We had a lot of m5 and c5 servers randomly die. It was as if someone was running chaos monkey from Netflix in our VPCs...

otterley8y ago

Likewise. Can you reach out to me privately? I'd love to have independent corroboration.

_msw_8y ago

Could you send a list of instance IDs and timeframes where you saw this?

jdangu8y ago

Does anyone have more info on the performance recovery today? We experienced similar performance issues over the last few days with a seemingly complete recovery today (on a cluster of ~2500 HVM T-1s).

bpicolo8y ago

Very curious as to what changed today if performance increased. Some sort of smarter patch? That'd be an amazingly impressive thing to cobble together so quickly.

k__8y ago

This is especially interesting for workloads that were already running at >70%.

Some stuff won't run in the free tiers anymore and people will have to switch to bigger machines :/

yclept8y ago

We saw instances which normally kept a healthy stock of CPU credits quickly burn through them and severely degrade in performance thanks to Meltdown :<
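
The credit burn can be sketched with the burstable-instance model, where one CPU credit is one vCPU running at 100% for one minute. The t2.micro-style figures below (6 credits earned per hour, a 144-credit balance) are assumptions quoted from memory and should be checked against current AWS documentation; the utilization numbers are invented.

```python
from typing import Optional


def hours_until_depleted(balance: float, earn_per_hour: float,
                         util_pct: float, vcpus: int = 1) -> Optional[float]:
    """Hours until the credit balance hits zero, or None if credits are not draining."""
    spend_per_hour = 60.0 * vcpus * util_pct / 100.0  # credits are vCPU-minutes at 100%
    net_drain = spend_per_hour - earn_per_hour
    if net_drain <= 0:
        return None  # at or below baseline, the balance holds or grows
    return balance / net_drain


# Hypothetical instance idling at 9% before patching, 12.6% after a 40% CPU increase.
print(hours_until_depleted(144, 6, 9.0))   # None: below the 10% baseline
print(hours_until_depleted(144, 6, 12.6))  # a little under four days
```

An instance that sat just under its baseline can therefore tip from accumulating credits to draining them, which matches the behaviour described above.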

alacombe8y ago

Trying to foresee the future...

Could we expect Intel to fix the design flaw^Wfeature so that future server appliances (but also desktops) can run without KPTI while still not being affected by Meltdown? If so, what timeline could we expect? Say a year for new CPU designs, plus a year to roll out new machines in datacenters?

MBCook8y ago

What I’ve seen is it takes five years to design a CPU from scratch.

I imagine they’ll try to rush this to get it out there as fast as possible (obviously a lot of people would like to buy CPUs that don’t have this issue, for security/performance reasons), but it’s going to take a while. I think years is definitely the minimum.

Meltdown is easy enough (relatively) but Spectre is kind of a disaster. What do you do? Does the branch predictor have to start tagging every branch guess with some sort of process ID to prevent one process from messing with another’s predictions? Tag the cache lines instead so even though the data is in cache you can’t see it because YOUR process didn’t pull it in yet? What a mess.

voidlogic8y ago

>Does the branch predictor have to start tagging every branch guess with some sort of process ID to prevent one process from messing with another’s predictions?

It's worth pointing out that for their newest designs AMD (and Samsung Exynos) use the full memory address for branch predictions; no doubt Intel's next design will be doing this.
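
To illustrate why using the full address matters, here is a toy branch target buffer. It is a deliberately simplified model and not a description of any real CPU: entries are indexed by the low bits of the branch address, and optionally checked against the full address before a prediction is used. All names and addresses are made up.

```python
from typing import Optional


class ToyBTB:
    """Toy branch target buffer indexed by the low bits of the branch address."""

    def __init__(self, index_bits: int, use_full_tag: bool) -> None:
        self.mask = (1 << index_bits) - 1
        self.use_full_tag = use_full_tag
        self.entries: dict[int, tuple[int, int]] = {}

    def train(self, branch_addr: int, target: int) -> None:
        # Record the observed target, remembering which branch trained the slot.
        self.entries[branch_addr & self.mask] = (branch_addr, target)

    def predict(self, branch_addr: int) -> Optional[int]:
        entry = self.entries.get(branch_addr & self.mask)
        if entry is None:
            return None
        trained_by, target = entry
        if self.use_full_tag and trained_by != branch_addr:
            return None  # a different branch trained this slot, so ignore it
        return target


attacker_branch, victim_branch, gadget = 0x1234, 0x5234, 0xDEAD
for tagged in (False, True):
    btb = ToyBTB(index_bits=12, use_full_tag=tagged)
    btb.train(attacker_branch, gadget)
    print(tagged, btb.predict(victim_branch))
```

Without the full-address check, the two branches alias and the victim's prediction points at the attacker-chosen target; with it, the lookup simply misses.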

MBCook8y ago

Ah, that makes sense. Sounds like a much less complicated fix than my idea.

dboreham8y ago

Just isolate code on different cores.

scurvy8y ago

"Why I like to run my own hardware for $100, Alex"

You can patch various tiers of servers at your own leisure, depending on threat levels and exposure. Measure the impact, capacity plan, etc. Rather than it being forced on you across all tiers because cloud.

amazingman8y ago

You forgot to type 5 or 6 zeros there.

gnosek8y ago

https://www.kimsufi.com/us/en/

Granted, that's the bottom of the barrel (single disk, no IPKVM etc.), but $100 keeps you running for over a year. Better servers are easily available as well, usually a couple of times cheaper than AWS.

Is this a US thing? Based on HN only, I'd never know there's anything between the public cloud and racks of your own hardware that you have to wire up and maintain.

I have a bunch of quad core 32 GB machines with dual 480GB SSDs for less than $100/month each (and that's a rather expensive provider with great support, you'll cut the price almost in half with e.g. SoYouStart).

Yes, AWS is convenient, but it's far from the only thing in the world.

user59944618y ago

HN has a lot of professionals. They can't run a business on a refurbished server without ECC and without RAID and without dual power supplies.

Saying that they should run on Kimsufi is like explaining to a wholesale company that they should use motorbikes instead of trucks, because motorbikes are cheaper.

amazingman8y ago

Yeah, and AWS has a free tier. $0/month is better than $100/month, right?

scurvy8y ago

HN types are enamored with the scale and size that is necessary to run the massive framework for their blogs.

lykr0n8y ago

If you know what you are doing, it is leaps and bounds cheaper to run your own hardware (co-located, or rented from someone else). The only issue is latency on scaling out (hours), but if you are halfway decent with trend lines you can preempt this.

xiaodown8y ago

That's not the only issue; there are also a lot of compliance issues that a hosting company can take care of. There are whole sections of PCI and HIPAA compliance that you can just write off as "not our problem, talk to AWS".

amazingman8y ago

And what if the delta between your upper and lower daily “trend lines” is measured in millions of requests per hour? Per second? We can leave off weekly/seasonal trends for now, and keep it nice and easy for you.

The utter lack of imagination that I see on HN when people are judging others’ technical decisions is kind of hilarious.

scurvy8y ago

It was a Jeopardy! reference. Sorry if it didn't come through.

stephengillie8y ago

It's very easy for a single admin with a single machine to provide 4 nines to a small group with a small load. But this usually scales exponentially.
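
One way to put numbers on this: four nines allows under an hour of downtime per year, and if a service needs every one of its components to be up, the combined availability falls off quickly as components are added. The sketch below assumes independent failures and a strict all-components-required design, which is only one possible reading of the claim; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per (non-leap) year for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def series_availability(per_component: float, components: int) -> float:
    """Availability of a system that needs all of its independent components up."""
    return per_component ** components


print(round(downtime_minutes_per_year(0.9999), 2))            # about 52.56 minutes
print(round(series_availability(0.9999, 100), 4))              # roughly 0.99
print(round(downtime_minutes_per_year(series_availability(0.9999, 100)) / 60, 1))
```

Redundancy changes this picture, which is part of what the replies are arguing about.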

scurvy8y ago

Just because you can't do it doesn't make it impossible. A small team of 4-5 good ops people can scale a network of many thousand nodes, petabytes of storage, and terabits of network throughput.

Tech stuff isn't hard. The biggest problem is the lack of capacity planning and project communication in tech today. Nimble startup is a euphemism for pure anarchy and chaos. No one wants to plan anything any more.

user59944618y ago

The same infra could be managed by a single guy if it were in the cloud.

perfmode8y ago

Over a 15 year time scale, there is no way AWS will remain competitive with GCP.

dgsb8y ago

Why is that? They both belong to the top cloud providers today, and both seem to invest heavily in their R&D. I don't see anything obvious about what could happen in the future.

perfmode8y ago

Google’s network engineering is better.

bufferoverflow8y ago

Is there an option of AWS dedicated instances without these patches? I thought all these new vulnerabilities are only really dangerous in shared environments.

nolok8y ago

> I thought all these new vulnerabilities are only really dangerous in shared environments.

You're not the first I've seen saying that, here and on other sites, and it is absolutely wrong.

Shared environments like clouds were singled out because not only were they impacted the worst security wise, they were also going to suffer the most from the fixes.

But even if you have an ordinary server or computer all to yourself, remote code execution vulnerabilities aren't unheard of; once one of your applications (be it your own, a specific one you use, or one of the bazillion things running on your system as part of the OS) gets broken, you're a free target. The same goes for anything with a proper scripting surface.

If your system isn't protected, and any of the applications you run has a major security hole, everything will be at risk.

k__8y ago

Guess humanity lost 30% of its computing power
