I've been noticing major performance changes in our instances and have no idea if it is related to Meltdown or something else.
Google released a blog post specifically on performance: https://blog.google/topics/google-cloud/protecting-our-googl...
It would be nice to have similar transparency from AWS.
Think the constant ragging on Google that appears on HN is becoming a little too much.
On an unrelated note, I kind of wish Meltdown had been discovered and exposed separately from Spectre. Intel has managed to weasel its way out of taking responsibility by implying that this is not a bug and that all the other CPUs have similar issues. If they had had to respond to Meltdown only, it would have been a bit harder for their PR and legal departments to deny the security and performance implications.
Just a nit, we said [1] that not only are our own applications in production doing fine (even against Variant 2) but also we haven’t been inundated with support calls over the last few months while mitigations were silently rolled out at the host kernel and hypervisor layer. So this class of “Hey, my instances are suddenly way slower, I didn’t do anything” isn’t happening on GCE.
That does not mean that if you perform guest OS updates, don’t have PCID enabled, etc. that you won’t see a degradation. We certainly haven’t tried all permutations of all guests, and the kernel patches are still improving. That’s why we’re actively trying to get everyone to rally behind retpoline, KPTI (with PCID enabled), and so on.
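For anyone who wants to check the PCID part on their own guest, here is a minimal sketch. It assumes an x86 Linux guest where `/proc/cpuinfo` has a `flags` line; the function names are made up for illustration, and a missing flag only tells you the guest doesn't see PCID, not why.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_pcid(cpuinfo_text):
    """True if the 'pcid' flag is visible to this (guest) kernel."""
    return "pcid" in cpu_flags(cpuinfo_text)


def local_has_pcid(path="/proc/cpuinfo"):
    """Check the local machine; assumes a Linux-style /proc/cpuinfo exists."""
    return has_pcid(Path(path).read_text())
```

Calling `local_has_pcid()` on the instance gives a quick yes/no before and after a guest kernel update.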
[1] https://www.blog.google/topics/google-cloud/protecting-our-g...
Sorry, I didn't mean to imply that it was Google's customers specifically who should measure and be suspicious. I meant in general: say, someone running a service on bare metal outside of any public cloud drawing this conclusion: "Google measured the impact of these bugs and mitigations and they didn't see any significant performance regressions, so I probably don't have to worry either".
Also (since you disclosed you work on GCP), I like how Google sponsored Project Zero. Fantastic work and I am sure it will be a great return on that investment. I can certainly see someone thinking about that when deciding to go with GCP vs. other solutions.
I tried to start a poll but it seems as though my team was just the unlucky one: https://news.ycombinator.com/item?id=16109036
When I read this I thought, how weird. The usual claim is that the maximum performance drop is around 30%. Then I remembered that we are talking in ratios, so a 30% drop in performance (0.7) means a 43% increase in CPU usage (1/0.7 ≈ 1.43).
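The ratio arithmetic is easy to get wrong, so here is a small sketch of it. The function name is hypothetical, and it assumes the same amount of work has to get done after the slowdown and that CPU cost scales inversely with throughput.

```python
def cpu_increase_for_drop(drop):
    """Fractional extra CPU needed to do the same work after a fractional
    throughput drop (0.3 means each core now does 70% of what it did)."""
    if not 0 <= drop < 1:
        raise ValueError("drop must be in [0, 1)")
    return 1.0 / (1.0 - drop) - 1.0


# A 30% drop means 1 / 0.7 ≈ 1.4286x the CPU, i.e. roughly 43% more.
print(f"{cpu_increase_for_drop(0.30):.1%}")
```

The same function shows why the numbers diverge quickly: a 50% drop doubles the CPU you need.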
How are you trying to measure the performance impact? Are you checking the CloudWatch data or running a specific test?
I'd like to capture baseline statistics (we're unpatched so far) and compare them with the numbers after patching. Any suggestions?
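One simple way to do the before/after comparison, as a rough sketch: export CPU utilization samples for a comparable window before and after patching (from CloudWatch or whatever you use) and compare the means. The function name and the numbers in the usage note are placeholders, not real measurements, and this assumes the workload was similar in both windows.

```python
from statistics import mean


def relative_cpu_change(baseline, patched):
    """Relative change in mean CPU utilization between two sample sets.

    Both arguments are sequences of utilization percentages collected over
    comparable windows; returns e.g. 0.25 for a 25% increase.
    """
    if not baseline or not patched:
        raise ValueError("need at least one sample in each window")
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    return (mean(patched) - base) / base
```

For example, `relative_cpu_change([40, 42, 38], [50, 52, 48])` compares two made-up windows. Same-hour, same-weekday windows help keep traffic differences from being mistaken for patch overhead.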
Not sure about RDS
* Impact on M5/C5 instances over similar time period, any difference with the Nitro hypervisor?
* Were Dedicated instances (https://aws.amazon.com/ec2/purchasing-options/dedicated-inst...) patched as well?
* Other examples of software that adapted batching performance automatically with increase in call latency.
>During this same time period, we saw additional CPU increases on our PV instances that had been previously upgraded. This seems to imply some level of HVM patching was occurring on these PV instances around the same time that all pure-HVM instances were patched
This is likely due to Vixen: https://lists.xenproject.org/archives/html/xen-devel/2018-01...
>.... Instead of trying to make a KPTI-like approach work for Xen PV, it seems reasonable to run a copy of Xen within an HVM (or PVH) domU .....
>.... all PV instances in EC2 are using this ....
So the initial bump after the reboot would have been the Vixen shim hypervisor, which mitigates Meltdown for PV guests. The secondary bump, and the bump the native HVM instances saw, would have been the Spectre-related stuff.
Based on https://aws.amazon.com/security/security-bulletins/AWS-2018-... - guessing Intel microcode updates
Some stuff won't run in the free tiers anymore and people will have to switch to bigger machines :/
Could we expect Intel to fix the design flaw^Wfeature so that future server appliances (but also desktops) can run without KPTI while still not being affected by Meltdown? If so, what timeline could we expect? Say a year for new CPU designs, plus a year to roll out new machines in datacenters?
I imagine they’ll try to rush this to get it out there as fast as possible (obviously a lot of people would like to buy CPUs that don’t have this issue, for security/performance reasons) but it’s going to take a while. I think years is definitely the minimum.
Meltdown is easy enough (relatively) but Spectre is kind of a disaster. What do you do? Does the branch predictor have to start tagging every branch guess with some sort of process ID to prevent one process from messing with another’s predictions? Tag the cache lines instead so even though the data is in cache you can’t see it because YOUR process didn’t pull it in yet? What a mess.
It's worth pointing out that in their newest designs AMD (and Samsung Exynos) use the full memory address for branch prediction; no doubt Intel's next design will do the same.
You can patch various tiers of servers at your own leisure, depending on threat levels and exposure. Measure the impact, capacity plan, etc. Rather than it being forced on you across all tiers because cloud.
Granted, that's the bottom of the barrel (single disk, no IPKVM etc.), but $100 keeps you running for over a year. Better servers are easily available as well, usually a couple of times cheaper than AWS.
Is this a US thing? Based on HN only, I'd never know there's anything between the public cloud and racks of your own hardware that you have to wire up and maintain.
I have a bunch of quad core 32 GB machines with dual 480GB SSDs for less than $100/month each (and that's a rather expensive provider with great support, you'll cut the price almost in half with e.g. SoYouStart).
Yes, AWS is convenient, but it's far from the only thing in the world.
Tech stuff isn't hard. The biggest problem is the lack of capacity planning and project communication in tech today. Nimble startup is a euphemism for pure anarchy and chaos. No one wants to plan anything any more.
You're not the first I've seen saying that, here and on other sites, and it is absolutely wrong.
Shared environments like clouds were singled out because not only were they impacted the worst security-wise, they were also going to suffer the most from the fixes.
But even if you only have a regular, normal, happy server or computer all to yourself, remote code execution vulnerabilities aren't unheard of; once one of your applications (be it your own, a specific one you use, or one of the bazillion things running on your system as part of the OS) gets broken into, you're a free target. The same goes for anything with a proper scripting surface.
If your system isn't protected, and any of the applications you run has a major security hole, everything will be at risk.