It's hard to get your head around how big a deal this is. This vulnerability is so bad they killed x86 indirect jump instructions. It's so bad compilers --- all of them --- have to know about this bug, and use an incantation that hacks ret like an exploit developer would. It's so bad that to restore the original performance of a predictable indirect jump you might have to change the way you write high-level language code.
It's glorious.
It truly is difficult to predict all the ripple effects from this. I can't think of a single computer bug in the last 30 years that's similar in reach to this Intel Meltdown.
[EDITED following text to replace "Intel bug" with "Spectre bug" based on ars and jcranmer clarification. The Intel Meltdown bug can be fixed with operating system patches (KPTI) instead of a complete recompile.]
Journalists like to overuse the bombastic metaphor "shaken the very foundations," but this Spectre bug actually seems to warrant it. Off the top of my head:
- browsers like Chrome & Firefox have to be compiled with new defensive compilation flags because they run untrusted JavaScript
- cloud providers have to recompile and patch their code to protect themselves from hostile customer VMs
- operating systems like Linux/Windows/macOS have to recompile and patch code to protect users from malware
Imagine the economics of all these mitigations. Also imagine that each of the cloud vendors AWS/Google/Azure/Rackspace had very detailed Excel spreadsheets extrapolating CPU usage for the next few years to plan millions of $$$ in capital expenditures. Because of the severe performance implications of the bugfix (5% to 50% slowdown?), the CPU-utilization assumptions in those spreadsheets are now wrong. They will have to spend more than they planned to meet their workload-throughput goals.
There are dozens of other scenarios that we can't immediately think of.
Wrong bug. Intel meltdown is bad, but not anywhere near as bad as Spectre which affects everything! No AMD immunity here.
Spectre needs a more perfect storm of factors to lead to exploitation. No hardware is immune to it, but not all software is vulnerable, either. You need code execution, you need a vulnerable target, you need to somehow trigger the vulnerable target's code path, and that vulnerable target needs to hold data you want.
Meltdown just needs code execution and you have full read access to all memory.
I mentioned Meltdown because multiple entities (gcc, llvm, Google Cloud, Azure, Linux, Windows, etc.) have already converged on concrete solutions such as new compiler flags and patches, which gives us a glimpse into the costs and severity. The Spectre bug may be "bigger" but it doesn't yet have a consensus mitigation, and in the meantime we really can't tell people to "just keep your laptop unplugged from the internet and don't run any apps" to avoid the Spectre bug. The Spectre hole seems like it will be an open problem for many years, and the new gcc/llvm is an incomplete fix.
Not meaning to be rude, but this itself summarises (and the issue will perhaps shed more light on) how stupid an idea it is to let everybody run untrusted code from other people, let alone third-party stuff like "privacy-intrusion-as-a-service" startups and the like.
From the spectre paper:
"A minor variant of this could be to instead use an out-of-bounds read to a function pointer to gain control of execution in the mis-speculated path. We did not investigate this variant further."
If I were developing processors, I'd be having emergency meetings on trying to craft exploits to figure out where our processors' weaknesses are --- all while being happy that Intel is getting the bad PR for this and I'm not.
Speculative execution does not create side channels in and of itself; the side effects of speculative execution do that. In this case, the side effect is cache state. Just don't change the cache during speculative execution and there's no problem.
Agreed. This is an entirely new class of vulnerabilities, and we're just at the beginning.
For any other non-sandboxed application you pretty much have to trust the code anyway. Privilege escalation is always a bad thing of course, but for single-user desktop machines, getting user shell access as an attacker means that you can do pretty much anything you want.
As far as I can see, the only attack surface on my current machine would be a website running untrusted JS. For all other applications running on my machine, if one of them is actually hostile then I'm already screwed.
Frankly, I'm more annoyed at the ridiculous over-engineering of the Web than at CPU vendors. Because in 2017 you need to enable a Turing-complete language interpreter in your browser in order to display text and pictures on many (most?) websites.
Gopher should've won.
Does the non-shared non-virtualized system have any encryption keys in memory that you want to protect?
Do you use full-disk encryption or ssh to other machines or use a cryptocurrency wallet?
For JavaScript, won't it be sufficient to check all the calls out of it so that they can't pass data that controls an exploitable speculative execution, and also to generate JIT code such that the JS itself can't create exploitable instruction sequences? The API will have to be heavily scrutinized, and the JS will run somewhat slower.
If the rest of the browser code is vulnerable, but the JS code can't control the speculative execution then it should be safe to run any JS.
It ends with the performance advantages of OOO execution being effectively negated by the workarounds to address the security issues it causes.
The following parable is edifying: https://www.cs.utexas.edu/users/EWD/transcriptions/EWD05xx/E...
Yes. We should really start to learn from history: the MULTICS operating system already supported 16 CPU rings back in the early 1970s. MULTICS is the mother of UNIX, its smaller child. MULTICS had so many advanced features that barely got implemented (and were often reinvented) in newer OSes. It's time to read the old docs and ask the old devs who are still alive. (Another such often-overlooked gem is Plan 9, though it's better known thanks to the Go language devs.)
Older Intel CPUs only supported 2 rings. Modern Intel CPUs support only 4 rings. Windows and Linux use ring 0 for kernel mode and ring 3 for user mode. And Intel introduced a ring -1 for VT.
"To assist virtualization, VT and Pacifica insert a new
privilege level beneath Ring 0. Both add nine new machine
code instructions that only work at "Ring -1," intended to
be used by the hypervisor
It's time for modern operating systems to use more rings, and modern CPUs to correctly protect between different rings.

What's glorious is that serious software security people now have to start being literate about what it means to reverse engineer and dump the branch history buffers on different CPUs. Getting dragged through this kind of minutiae is the reason I'm still in this field after 22 years.
And I'm just a bystander here. Imagine what it must have been like for Jann Horn over the last several months!
This subsection describes how we reverse-engineered the internals of the Haswell branch predictor. Some of this is written down from memory, since we didn't keep a detailed record of what we were doing.
... because shit was so crazy while they were working this out that they didn't have the cycles to write everything down!
[Edit] Or, how far down does the rabbit hole go?
Additionally, it is quite fascinating to me to compare the complexity of modern CPUs with, say, a compiler.
I believe the generalized fix is to restore the entire CPU state after a mispredict. You’d either need to add an extra copy of the entire processor state (tens of megabits) for every simultaneous predict you support ($$$) or keep track of how to revert all changes and revert them one at a time ($, slow).
Only the "extra copy of processor state" thing is really viable. You have to have a speculative cache and buffer in reads that only get flushed to the main cache once they're confirmed to be valid, which is enormously complicated. This facility already exists for writes, but now it needs to exist for reads too.
GP is absolutely correct that this is a fundamental assault on processor design as we know it, the speculative execution concept is going back to the drawing board for a major re-think.
Sorry for quoting Wikipedia, but I'm not at school, hah! [1]
'''' TSX provides two software interfaces for designating code regions for transactional execution. Hardware Lock Elision (HLE) is an instruction prefix-based interface designed to be backward compatible with processors without TSX support. Restricted Transactional Memory (RTM) is a new instruction set interface that provides greater flexibility for programmers.

TSX enables optimistic execution of transactional code regions. The hardware monitors multiple threads for conflicting memory accesses, while aborting and rolling back transactions that cannot be successfully completed. Mechanisms are provided for software to detect and handle failed transactions.

In other words, lock elision through transactional execution uses memory transactions as a fast path where possible, while the slow (fallback) path is still a normal lock. ''''
[1] https://en.wikipedia.org/wiki/Transactional_Synchronization_...
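To make the RTM interface concrete, here's a minimal sketch using the real _xbegin/_xend intrinsics from immintrin.h (compile with -mrtm on a TSX-capable CPU). It's only a sketch: a production lock-elision scheme would also read the lock word inside the transaction so it aborts if the lock is already held.

    #include <immintrin.h>
    #include <pthread.h>

    /* Try a hardware transaction first; fall back to a real lock
       if the transaction aborts. */
    static pthread_mutex_t fallback = PTHREAD_MUTEX_INITIALIZER;
    static long counter;

    void increment(void)
    {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            counter++;   /* executed transactionally */
            _xend();     /* commit: all or nothing */
        } else {
            /* Aborted (conflict, capacity, etc.): slow path. */
            pthread_mutex_lock(&fallback);
            counter++;
            pthread_mutex_unlock(&fallback);
        }
    }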
- we didn't have hyperconverged cloud infrastructures running arbitrary entities' code next to each other
The web issue is easier to mitigate if not fix completely since there is already a massive infrastructure for widespread, rapid browser updates, and crippling Javascript to eliminate attack vectors such as high-resolution timers is completely acceptable.
The cloud/vm infrastructure is a massive problem though. It is 100% required that VMs be fully isolated. The entire infrastructure breaks down if they aren't.
What's new and surprising is the power of these side-channel attacks--you can use these, reliably, to exfiltrate arbitrary memory, including across privilege modes in some cases (apparently, some ARM cores are affected by the latter vulnerability, in addition to Intel).
Are you sure?
The right fix is to prevent speculatively executed code from leaking information.
Here that perhaps means associating cache lines with a speculative branch somehow so that they aren't accessible until/unless the speculative branch becomes the real branch. (I have no idea exactly how that would be done or what the performance cost might be... I'd really need to know the details of how speculative execution is implemented in a particular CPU to even be able to guess.)
Ouch! This is independent of other performance hurts, like from the kernel syscall overhead that was the hot topic yesterday. This is pretty crazy.
I'll be intrigued to see how processor manufacturers respond to this. If they were even slightly relaxed about it prior to disclosure, I expect there are going to be some very hurried attempts to engineer solutions pronto. This is the sort of thing where it might even be worth throwing away all of your future roadmap plans and just getting a revision of the current chips out there ASAP, whatever that may do to the rest of your roadmap.
I'd hope maybe, just maybe, this would be enough to put a focus on compilers producing code that ends up using processor-optimized paths chosen at runtime, to avoid "overheads ranging from 10% to 50%".
Though, in this case, that would essentially mean making the entire executable region writable for some window of time, which is clearly too dangerous, so I guess the 0.1% speedups from compiling undefined behavior in new and interesting ways will continue taking priority.
I mean, it's a compiler flag, right? Obviously whoever's going to run a program on an unaffected platform will take the effort to recompile everything with the flag removed.
Just the same way every serious application currently provides different executables for running on systems where SSE2, SSE4.1, or AVX2 is present.
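For compiled-in dispatch, a minimal sketch of how that looks with GCC/Clang's __builtin_cpu_supports builtin (the process_* names are just illustrative):

    #include <stdio.h>

    /* Illustrative stand-ins for ISA-specific implementations. */
    static void process_avx2(void) { puts("AVX2 path"); }
    static void process_sse2(void) { puts("SSE2 baseline path"); }

    /* Choose the best implementation once the actual CPU is known,
       instead of baking the choice in at compile time. */
    void process(void)
    {
        if (__builtin_cpu_supports("avx2"))
            process_avx2();
        else
            process_sse2();
    }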
When this all shakes out, the general story is going to be "upgrade sooner, current-gen Intel chips are x% faster", where x is going to be a larger number than it was a week ago.
It's more-or-less the Apple battery story all over again. Current devices are going to be slower, newer ones are going to be faster. Even if you know the why and the how, you're still in the same place as everybody else (at best, you could upgrade just the chip if your MB is new enough, but you're still buying Intel). Unless there's some clear way of imposing the external cost of this bug on Intel, it's a win-win for them.
Ryzen 1700X has 3.3 times the multicore performance of my i5 3450 from 2012 for almost the same price. With 7nm we will see at least another 50% increase in multicore performance on top of that.
Although actually, we already are - binary distros already don't take into account per-microarchitecture scheduling, nor any ISA extensions above a common baseline (e.g. just SSE2, no autovectorising to AVX2 etc).
This might provide enough impetus to restructure how binary distros work and get the whole distro compiled with some newer CPU flags (march={first corrected architecture}?), but in the short term I assume every package will take the hit.
Great time to learn about source-based distros!
> However, real-world workloads exhibit substantially lower performance impact.
I feel like you could have mentioned this.
I feel bad for all of the engineers currently working on performance sensitive applications in these languages. There's a whole lot of Java, .NET, and JavaScript that's about to get slower[1]. Enterprise-y, abstract class heavy (i.e.: vtable using) C++ will get slower. Rust trait objects get slower. Haskell type classes that don't optimize out get slower.
What a mess.
[1] These mitigations will need to be implemented for interpreters, and JITs will want to switch to emitting "retpoline" code for dynamic dispatch. There's no world in which I don't expect the JVM, V8, and others to switch to these by default soon.
Maybe I'm being naive, but would a simple modulo instruction work? Consider the example code from https://googleprojectzero.blogspot.com/2018/01/reading-privi...:
    unsigned long untrusted_offset_from_caller = ...;
    if (untrusted_offset_from_caller < arr1->length) {
        unsigned char value = arr1->data[untrusted_offset_from_caller];
        ...
    }
If instead we did: unsigned char value = arr1->data[untrusted_offset_from_caller % arr1->length];
Would this produce a data dependency that prevents speculative execution from reading an out-of-bounds memory address? (Ignore for the moment that a sufficiently smart compiler might "optimize" out the modulo here.)

You only need to wipe between syscalls that have side effects. Number-crunching, AVX-heavy subroutines should never have to deal with safety once entered.
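A minimal sketch of the masking idea above, using a branchless clamp instead of a modulo (the same idea as the Linux kernel's array_index_nospec; the helper name here is just illustrative):

    #include <stddef.h>

    /* Clamp the index without a conditional branch, so that even a
       mispredicted bounds check cannot form an out-of-bounds address.
       Compilers usually emit a flag-based, branch-free sequence for
       this; the kernel uses inline asm to guarantee it. */
    static size_t clamp_index(size_t idx, size_t size)
    {
        size_t mask = 0 - (size_t)(idx < size); /* all-ones iff in bounds */
        return idx & mask;
    }

    unsigned char read_element(const unsigned char *data, size_t len,
                               size_t idx)
    {
        if (idx < len)
            return data[clamp_index(idx, len)];
        return 0;
    }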
The obvious drawback is that it effectively disables sharing code in memory; it would still allow sharing code on disk, though. So it would be a middle ground between the current states of dynamic and static linking.
https://www.technovelty.org/c/position-independent-code-and-...
* https://lkml.org/lkml/2018/1/3/797 (https://news.ycombinator.com/item?id=16066968)
What do people more knowledgeable in the field think about this?
All of these attacks assume you are running something you don't trust on your CPU, whether it is another user's program, a non-root executable, or a JavaScript program from a website.
When do we stop hacking processors, kernels, and compilers and revisit our assumptions of what we can and can't do securely?
But my usecase might be a physical computer that isn't networked which does data science with some programs and prints out results.
These patches are focused on Amazon and other cloud providers that are in the business of running separate individuals' applications on the same machine.
In the consumer world, the scope would be browser scripts and user applications that aren't running as superuser. But even then, do you download and run software that you expect might steal information or damage your computer?
These are fundamental security questions. Creating rings and sandboxes are what create the assumptions of privacy and security.
Side channels have always been some of the most insidious exploits. Many are basically unsolvable (timing attacks are always going to leak some information, and compression is basically completely at odds with secure information storage), many more are easily enough overlooked that it would be easy to maliciously include them without raising any eyebrows, and the "fixes" for them almost always murder performance.
I think the only real fully-encompassing solution to this is a redesign in how we use computers. Either a massive step backwards on performance and turning off most automatic "optimizations" until they can be proven through a much more rigorous process (both in compilers, and in hardware), or a significant change in how computers are architected adding more hardware level isolation for processes and systems running on the machine (just daydreaming now, but something like a cluster of isolated micro-CPUs that run one application only).
The "true secure" way is not running any untrusted code, not connecting to any untrusted networks, and not accepting or storing any untrusted data.
At that point you can't run a computer. It's just not an answer to say "don't let bad things in" because bad things are always going to get in.
And with side channel exploits getting more and more common, and with them being worse and worse, running any code on your machine is basically giving that code any and all of your data on the machine...
Telling users (even highly technical users) to "never ever run any untrusted code ever, and if you mess up once and run untrusted code you have completely ruined the trust of the whole machine and need to start over from scratch while assuming all of your data has been compromised" is not only infeasible, it's impossible. If this is the case, we have lost the "security" game.
It's easy to say "throwing OoO execution away" is an overreaction, but if it's necessary to allow multiple programs to run on one machine without them all having what amounts to full access to one another's information, then it might be necessary. Already we know that we can't use compression with encryption, and we can't use any kind of "exit early" with most kinds of encryption. It might just be that OoO execution is fundamentally opposed to secure computing. At its core, it's letting the processor do different things depending on what it can see is coming; it's almost a definition of an oracle! That's always going to be a very dangerous game to play.
It could even just be that CPUs need a "secure computing" mode, or maybe even a secure co-processor that disables all of these optimizations. But at the very least, I think changes are going to be necessary, and a 15% perf reduction might be the least of our worries.
Chrome and FF need "execute JS" to be an explicit per-site permission, similar to the permissions model of native smartphone apps.
Google will never do this because they're an ad company and care more about targeting ads than protecting Chrome users.
However this wouldn't completely solve the issue (see Google Play Store).
You can do a lot by separating these speedup mechanisms across security boundaries. The biggest factor that makes this hard to mitigate is in-process security boundaries. Total isolation between processes is neither necessary nor sufficient.
https://marc.info/?l=openbsd-misc&m=118296441702631&w=2
(from 2007)
RISC-V has already had to fix its memory consistency model, so it is not without problems. But that is a spec bug, not an implementation bug. Whether there is an out-of-order, speculatively executing RISC-V core in the wild which suffers from this is, as far as I know, very unlikely. If there is, no doubt its designers have had a busy time lately.
I'd presume that the slowest RISC-V designs are immune due to not speculating enough, while any high-performance implementation is vulnerable.
>> While makeshift processor-specific countermeasures are possible in some cases, sound solutions will require fixes to processor designs as well as updates to instruction set architectures (ISAs) to give hardware architects and software developers a common understanding as to what computation state CPU implementations are (and are not) permitted to leak.
Early processors had speculative execution? I thought this had been added to Intel/AMD/ARM about 20 years ago?
Trusting in a compiler you merely hope was used to build all the executables on your system isn't enough to be the final solution.
[1] https://www.win.tue.nl/~aeb/linux/hh/thompson/trust.html
Unless the compiler is also patched to either disallow inserted assembly, or to modify the inserted assembly (this being both hard and dangerous), someone who wants to exploit the bug will just add their own inserted assembly code that exploits the bug, and a patched compiler won't help one bit in that case.
* https://lkml.org/lkml/2018/1/4/432
* http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a...
It appears that Skylake and later can actually predict retpolines? Some hardware features called IBRS, IBPB, STIBP (not a lot of details on this are out there) are supposedly coming in a microcode update.
There's a lot of prominence being given to all kinds of damage malicious users might inflict, and ways to prevent or mitigate, but little to the malice itself. Whence does it arise? What emotions drive those users? What unmet needs?
Meanwhile, when these slowing-down patches for Spectre and Meltdown arrive, I intend not to run them, to the extent possible. I intend to keep aside a VM with patches for critical stuff, like banking or others' data entrusted to me. But I don't want my machine slowed down just because someone, sometime, might invest effort in targeting these attacks at it. Given how transparent I want to be with my life, that's a risk I'm willing to take.
Sure, you might not have anything you want to hide in your life, but the drive-by javascript doesn't care about your secrets - it'll hack you anyway. Best-case scenario, you lose access to a bunch of accounts you used to use and need to create new identities from scratch. Worst-case, they clean you out financially, steal your identity, etc.
Also, any insight about performance impact here?
Most programs have indirect jumps somewhere. Higher-level languages with virtual function calls have lots of indirect jumps, because they parameterize functions: to get the "length" of the variable "foo", the function "bar" has to call one of 30 different functions, depending on the type of "foo"; the function to call is read out of a table at some offset from the base address of "foo". Or, another example is switch statements, which can compile down to jump tables.
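A sketch of that "length of foo" dispatch in C (names illustrative); the call target is loaded from memory at runtime, so it becomes an indirect call:

    #include <stddef.h>

    struct obj;
    typedef size_t (*length_fn)(const struct obj *);

    /* Each runtime type installs one of the ~30 implementations. */
    struct obj {
        length_fn length;
    };

    size_t bar(const struct obj *foo)
    {
        /* Indirect call: the target address is read out of 'foo',
           not encoded in the instruction stream. */
        return foo->length(foo);
    }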
What we want, to mitigate Spectre, is to be able to disable speculative execution for indirect jumps. The CPU doesn't provide a clean way to do that directly.
So we just stop using the indirect jump instructions. Instead, we abuse the fact that "ret" is an indirect jump.
"Call" and "ret" are how CPUs support function calls. When you "call" a function, the CPU pushes the return address --- the next instruction address after the "call" --- to the stack. When you return from a function, you pop the return address and jump to it. There's a sort of "jmp %register" hidden in "ret".
You abuse "ret" by replacing indirect jumps with a sequence of call/mov/jump, where the mov does a switcheroo on the saved return address.
The obvious next question to ask here is, "why don't CPUs predict and speculatively execute rets?" And, they do. So the retpoline mitigates this: instead of just "call/pop/jump", it does "call/...pause/jmp.../mov/jmp", where the middle sequence of instructions set off in "..." is jumped over and not executed, but captures the speculative execution that the CPU does --- the CPU expects the "ret" to return to the original "call", and does not know how to predict around the fact that we did the switcheroo on the return address.
How'd I do?
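For reference, here's a sketch of that sequence for an indirect jump whose target sits in %r11. This is the widely published thunk shape (compilers emit it as e.g. __x86_indirect_thunk_r11), not any particular compiler's exact output:

    /* Hypothetical standalone retpoline thunk, as top-level asm. */
    __asm__(
        ".globl retpoline_r11_sketch     \n"
        "retpoline_r11_sketch:           \n"
        "    call 1f                     \n" /* push addr of '2:'         */
        "2:  pause                       \n" /* speculation of the 'ret'  */
        "    lfence                      \n" /* lands here and spins      */
        "    jmp 2b                      \n" /* harmlessly                */
        "1:  mov %r11, (%rsp)            \n" /* the switcheroo: overwrite */
        "    ret                         \n" /* return addr; 'ret' jumps  */
    );                                       /* to the real target        */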
Is the overhead of the retpoline such that it's no longer a benefit to compile switches to jump tables?
Here is a sketch (in C; names are illustrative) of some of the most common programming patterns that end up causing indirect jumps/calls:
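    #include <stdlib.h>

    /* Illustrative sketch of common indirect-jump/call sources. */

    /* 1. Callbacks: qsort reaches the comparator through a pointer. */
    int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    void sort_ints(int *v, size_t n)
    {
        qsort(v, n, sizeof *v, cmp_int); /* indirect call inside qsort */
    }

    /* 2. A dense switch typically becomes a jump table: an indirect
          jump indexed by 'op'. */
    long apply(int op, long x, long y)
    {
        switch (op) {
        case 0: return x + y;
        case 1: return x - y;
        case 2: return x * y;
        case 3: return y ? x / y : 0;
        default: return 0;
        }
    }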
Imagine every virtual function call in a C++ program being mispredicted and taking twice as long.
(Instead of forcing us to recompile the world, maybe Intel should just disable branch prediction in microcode.)
> (Instead of forcing us to recompile the world, maybe Intel should just disable branch prediction in microcode.)
Wouldn't the performance impact be dramatic? In this[1] example there's a 6x slowdown between the cases with and without correct branch prediction.
[1]: https://stackoverflow.com/questions/11227809/why-is-it-faste...
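The hot loop in that example boils down to something like this sketch; with sorted input the branch is predicted almost perfectly, with random input it's wrong about half the time, which is where the ~6x gap comes from:

    #include <stddef.h>

    /* Sketch of the loop benchmarked in [1]. */
    long sum_over_threshold(const int *data, size_t n)
    {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (data[i] >= 128)   /* the data-dependent branch */
                sum += data[i];
        }
        return sum;
    }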
I’m not sure, but that’s what it looks like, so far.
Quite possibly, the worst affected code will be OO code that is dynamically (open) polymorphic.
It's just that some language features such as virtual functions in C++ often require indirect invocation when the compiler can't devirtualize a call, and there is lots of it in the kernel in performance-critical paths (think interrupts, syscalls).
Something like that could allow the CPU to speculate aggressively while preventing information-leak exploits.
The bug here is that the CPU is not aborting the speculation when fetches occur to addresses marked as "access denied". Instead the fetch happens and a line of normally inaccessible memory is put into cache by code that should not be able to get it read into the cache normally.
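Schematically, the access pattern being described looks like this sketch (following the published write-ups; the fault handling is omitted, and on a patched or unaffected CPU it recovers nothing):

    /* 256 pages of probe memory, one page per possible byte value. */
    static unsigned char probe[256 * 4096];

    void meltdown_gadget(const volatile unsigned char *kernel_addr)
    {
        /* This load faults architecturally, but on affected CPUs the
           value is forwarded to dependent instructions before the
           fault is delivered... */
        unsigned char secret = *kernel_addr;

        /* ...so this dependent load pulls one page of 'probe' into
           the cache, indexed by the secret byte. A FLUSH+RELOAD
           timing pass over 'probe' then reveals which page is hot. */
        (void)probe[secret * 4096];
    }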
One hardware fix would be to plug that hole. Speculative reads get blocked when they encounter permission denied errors from the paging system and do not change the cache state. That blocks the Meltdown attack, but not the Spectre attack.
Also, maybe context switching would need to be made faster, because you would need to switch whenever, e.g., JavaScript calls browser interfaces.
It also sets a very bad precedent: I understand people want to mitigate/fix as much as possible, but this is basically giving an implicit message to the hardware designers: "it doesn't matter if our instructions are broken, regardless of how widespread in use they already are --- they'll just fix it in the software."
What other options are there? It's hardware that cannot be patched. Of course they will change chip designs going forward, but what else do you suggest folks do with the billions of chips that exhibit this problem?
> We built multi-tenant cloud computing on top of processors and chipsets that were designed and hyper-optimized for
> single-tenant use. We crossed our fingers that it would be OK and it would all turn out great and we would all profit.
> In 2018, reality has come back to bite us.
This is the root of all the problems.
Right now many function calls don't safely wipe registers or the side-channel cache state exploited by Spectre. There really need to be two kinds of function calls. Maybe a C #pragma?
The compiler has call wiping as a flag; the code has pragmas that override the flag.
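Purely as a syntax sketch (no such pragma exists in any compiler today), the idea might look like:

    /* Hypothetical pragmas: illustrating the proposal above, not a
       real compiler feature. */
    #pragma call_wipe(on)   /* hypothetical: default set by a flag   */
    long syscall_with_side_effects(int fd);  /* call sites get wiped */

    #pragma call_wipe(off)  /* hypothetical override for hot paths   */
    void avx_number_crunch(double *x, long n); /* no wiping once entered */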
Wayback Machine: https://web.archive.org/web/20180104131631/https://reviews.l...
:)
You basically can't make a deeply-pipelined processor fast without speculative execution.