I don't know what that spells for cloud hosting providers - maybe they have to buy a lot more CPUs so every client can have their own, or commission a special "shared" SKU of CPU that doesn't have any speculative execution - but I know for me, if I have untrusted code running on my CPU, I've already lost. At that point I couldn't care less about information leakage between threads.
We're going to wind up undoing the last 20 years of performance gains in the name of 'security', and it scares me.
Don’t forget about JavaScript, a common way for people to run untrusted code on their computers. Not all micro-architectural data sampling attacks are exploitable from JavaScript, but some are.
If anyone can show a proof-of-concept ("this page grabs your password manager extension's data") I'll eat my words. But I feel confident that most of these issues are purely academic and, while interesting, serve more to provide content for PhD theses than represent urgent hazards on the web.
I believe it is heading towards JIT being disabled, or very severely limited.
This actually excites me. When the foundation is shown to be rotten, it's time for a new foundation.
I'm optimistic, though, that the future holds a fork, with some devices insecure-but-fast and others secure-but-slow. Because there's a market for both. I don't care if my gaming hardware is vulnerable to Spectre because ideally there's nothing worth stealing there anyway. Email/messaging hardware can afford to be a -lot- slower than my gaming rig without any appreciable impact on the experience.
Perhaps the future holds motherboards that look like the physical embodiment of Qubes OS, with secure and insecure chips running compartmentalized features based on their security/speed requirements. We already do something like this for performance with the divide between CPUs and GPUs.
A major advantage of not adopting new CPU designs for a while is that you get to keep insecure-but-fast and secure-but-slow behavior in the same CPU by simply tweaking mitigations.
It'd seem both theoretically and practically possible to engineer hardware that could enable and disable certain optimizations and extensions dynamically.
(note that energy consumption may also be a related factor here)
The safe execution of any untrusted Turing complete code is a pipe dream.
At the very least, you need a clean-sheet CPU design, starting from the ISA, with basic logic operations formally validated against instruction-level analysis, to have a fighting chance.
But even such chips do get pwned, as shown by key recovery from credit cards in the wild.
I don't think that's true. It's not Turing completeness that's the real problem here. It's that software usually has access to accurate timing information, whether it's via RDTSC, gettimeofday(), or sharing memory with another thread that does things that take a predictable amount of time. If a program has no notion of current time and cannot measure how long something takes, then a lot of those side channel attacks no longer work. (Note that this precludes using styles of threading that have nondeterministic results, but it doesn't preclude using styles of threading that are deterministic, like Haskell's parMap.)
I do think maybe we should move away from the model of "let's let people run programs composed of arbitrary instructions on their computers, and build all our security around keeping programs from reading and writing things they shouldn't" to a model of "all programs running on this computer were compiled by a trusted compiler, and our security is based on the compiler disallowing certain unsafe constructs". This is sort of analogous to web browsers running JavaScript in a sandbox, or running eBPF in the Linux kernel.
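To make the "trusted compiler that disallows unsafe constructs" idea concrete, here is a minimal sketch in Python. It is not how eBPF or any browser actually verifies code; the whitelist, `verify`, and `run_verified` are hypothetical names invented for illustration. The point is only that the program is rejected before it runs if it contains anything outside a small allowed set.

```python
import ast

# Hypothetical whitelist: arithmetic over numeric literals and named inputs only.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant, ast.Name, ast.Load,
    ast.Add, ast.Sub, ast.Mult, ast.USub,
)


def verify(source: str) -> ast.Expression:
    """Parse an expression and reject it if any construct is off the whitelist."""
    tree = ast.parse(source, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed construct: {type(node).__name__}")
        if isinstance(node, ast.Constant) and not isinstance(node.value, (int, float)):
            raise ValueError("only numeric constants are allowed")
    return tree


def run_verified(source: str, inputs: dict):
    """Evaluate an expression only after it has passed verification."""
    tree = verify(source)
    code = compile(tree, "<verified>", "eval")
    # Emptying builtins is belt-and-braces; the whitelist is the actual check.
    return eval(code, {"__builtins__": {}}, dict(inputs))
```

For example, `run_verified("x * 2 + 1", {"x": 20})` evaluates normally, while `verify("__import__('os').system('id')")` raises `ValueError` because function calls are not on the whitelist. Note that a verifier like this says nothing about timing side channels by itself.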
The safe execution of any code requires an operating environment that never trusts the code with more than the least privilege required to complete a task. It has worked in mainframes that way for decades.
The IT zeitgeist these days makes me sad. Things can be better, but almost everyone is pushing in counterproductive directions, or has given up hope.
However, attempting to do so is a multi-year project, and for modern CPUs certainly a multi-decade one. This is not feasible as long as Moore's Law goes on.
Formal verification of a kernel (seL4) and a microprocessor it runs on has been performed together to prove properties.
Large ALU blocks and vector units (such as multipliers) can be formally verified.
However, efforts at end-to-end formal verification of entire processors won't happen unless there is demand to justify the huge investment of engineering resources and the market would be fine with chips many years behind.
Unless everyone is running dozens of clients on all their hardware, it is cheaper to give everyone their own machine than to invest the engineering time into verifying chips. Intel's economic incentives point the same way: selling more machines for isolation brings in more revenue without having to invest in decade-long verification projects. So it is not a mathematical certainty that it can't happen, just an economic one. After all, speculative execution is just extra state and extra logic, which can be formally verified. After all, programs are just bit patterns and all quarters can be addressed in the formalism of Quantified Boolean Formulas.
But nobody is gonna undo 20 years of performance. You will just be told to buy more machines to isolate workloads.
In the grand scheme of things, high-intensity tasks are only infrequently high-security tasks - those two sets of workloads are mostly disjoint. So the long-term solution is to have "fast cores" and "secure cores".
The fast cores can have all the OoO, speculation, all of that good stuff. That's where you run anything that needs to go fast, or anything running "trusted" code. By and large, nobody cares if an ffmpeg process or HPC node might leak data. Databases? You control the queries that are running on them, right? There are some edge cases like video games where leaking data is moderately harmful (could be useful for exploits if you can reliably leak useful data) yet you still want maximum performance, but at the end of the day leaking data at a couple kB/s usually isn't going to be the end of the world especially if the data is rapidly changing.
If the code is untrusted or user-generated, or the data is sufficiently sensitive, then run it on a "secure" core. The "secure" cores have to be in-order, non-speculative, all that crap. Probably non-SMT as that seems to be a bottomless pit of sidechannels as well. But usually, you aren't churning huge workloads in the "secure" situations. You can still have crypto acceleration instructions built into the cores, AVX, whatever, just not speculative. It's probably better to get them fully out of the "normal" cache hierarchy as well.
There are a couple of obvious problems here, but they are much smaller than trying to fix everything for every use-case. In particular, web browsers are running untrusted code, and every single website is running 15 MB of shitty JavaScript code. It sucks, but it's basically become an inner platform and you can't trust the code that it's bringing in, so that needs to be permanently isolated on its own secure cores. People will have to start paying attention to the performance of their JavaScript and optimizing out the real shitty bits.
Another big one is shared hosting environments - VPS environments are a prime target for trying to leak data from other clients on the same core/cache hierarchy, so those either need to be moved to "secure" cores, or switched to a model of renting out a whole core (or moved to a "hard time slice" where when the slice goes active you get the whole core for X seconds, then the processor stops, flushes everything, then switches clients). But VPS could conceivably be moved to "arrays of little cores" (to the extent that they aren't already) and that won't pose much problem for a lot of typical "micro" use-cases as long as every instance doesn't hit the server at once. Maybe for people that need faster than a dedicated "little" core the next increment becomes leasing the whole core, or even the whole complex of cores on that cache hierarchy.
Web application servers (not necessarily databases) are another one, unfortunately, since you can time web requests and use that to "leak" data down different code paths. If it's a directly user-facing service, probably best to get it onto a secure core.
The big task for humans is going to be identifying what stuff is allowable to run on the "fast" cores, and then get the schedulers set up so they understand that some stuff can only run in certain processor domains. It's not insurmountable, it just is going to take some time to plug away at it. Perhaps distribute whitelists, and allow the end-user to manually override it if they're really sure.
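As a rough sketch of what such a policy could look like, the Python below maps a program name to the set of cores it may run on. Everything here is made up for illustration: the core numbering, the whitelist contents, and the `allowed_cores` helper are assumptions, not an existing scheduler interface.

```python
# Hypothetical core layout: which core IDs are "fast" and which are "secure".
FAST_CORES = frozenset({0, 1, 2, 3})
SECURE_CORES = frozenset({4, 5})

# Hypothetical distributed whitelist of programs allowed on the fast cores.
FAST_WHITELIST = frozenset({"ffmpeg", "hpc-solver"})


def allowed_cores(program: str, user_override: bool = False) -> frozenset:
    """Return the set of core IDs a program may be scheduled on.

    Whitelisted programs (or ones the end user has explicitly overridden)
    may use any core; everything else is confined to the secure cores.
    """
    if program in FAST_WHITELIST or user_override:
        return FAST_CORES | SECURE_CORES
    return SECURE_CORES
```

On Linux, a launcher could pass the resulting set to `os.sched_setaffinity(pid, cores)` to pin a process, assuming the hardware really did expose separate core types under those IDs.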
But yes I've been saying that too, my suspicion is that basically all of OoO and speculation is fundamentally incompatible with not leaking timing data between processes, and that the harder we tilt at this the more attacks we're going to turn up, it's going to turn into an endless game of whack-a-mole and it's going to eat up all the performance gains that we've spent the last 20 years building on the backs of OoO and speculation.
AMD is quite well-placed for this imo since each CCX basically acts like its own NUCA (non-uniform cache architecture) domain and they just happen to share a memory controller. That's pretty much the design you need to make it work right, just with big and little CCXs instead of only big. They just have to come up with their own little cores. Intel is going to be harder because the classic Sandy Bridge architecture (which is largely unchanged today) has all the cores collectively sharing their last-level cache, and I think that's probably a problem in the long term too. I think Skylake-X still works on the principle of cache being attached to each core and them talking to each other to share it.
AMD and Intel engineers, please make your consulting checks out to 'cash'. Thanks! ;)
I can’t seem to shake the notion that this idea of transparent, multi-level caching might have to go away too. That cache shared between cores may have to morph into a layer of chip-local memory that you allocate imperatively. It’s possible that languages like Rust or VMs like the Beam could either adapt to such hardware with fewer problems, or even leverage it. We keep trying to pretend like memory is flat but now we’re up to 3-4 layers of cache and memory banks. How much longer can you torture that abstraction?
The most intense thing my phone does is decrypt my password database, and it does this dozens of times a day.
I see a lot about private keys, etc., but is this just a blind attack? Or do you need more info on the target? How quickly can you attack to get info?
In essence, is this something Joe Public needs to worry about on their $5 VPS, or something nefarious to be used against $CORP's public cloud infrastructure?
That said, it's a pretty amazing time if you're a computer architect, since you now have the transistors to spend on pretty much any crazy scheme you can dream up. So perhaps we'll see 'code safe' computer architectures emerge.
I don’t see why. Cloud providers have been offering dedicated hardware for a long time. If this problem isn‘t reliably fixable then more customers will make use of these options.
Completely agree about SpecEx. That's a misfeature that needs to die.
The challenge is more how to make a fast CPU like that.
For my money we’d end up going toward many-core with loads of simple in-order cores on a die. It’d almost look like a GPU. With 5nm how many in-order ARM or RISC-V cores could you put on a chip? You’d also probably move away from shared caches toward each core having more cache and processes having stronger core affinity. That would be both faster and less likely to allow cache timing attacks. You’d have so many cores a core per process would be feasible with sharing only happening at saturation.
Another direction would be to go back to trying to crank up clock speed with some new approaches. What could we do with today’s manufacturing techniques if we focused on faster transistors more than smaller ones? AFAIK almost nobody has been working on this since the game has been to use more transistors to implement more features and hacks instead.
I read about 10 GHz parts on the lab bench in the 2000s. That’s an eternity ago in terms of semiconductor process. A 10 GHz in-order core would be like a 4X parallel 2.5 GHz core… roughly… but more secure and broadly faster on code that’s hard to parallelize. Get rid of speculation and instead give it low branch latency and a ton of on-board cache.
Plenty of smart people spent lots of money trying it and as it happens the physics doesn't work out. There are countless articles explaining why processor clock speed isn't increasing, e.g.
- https://www.maketecheasier.com/why-cpu-clock-speed-isnt-incr...
- https://software.intel.com/content/www/us/en/develop/blogs/w...
That helps with multitasking and parallel-friendly workloads, but lots of stuff isn't easy to make multithreaded.
As a counter example, how about the 8086?
Yeah, because they didn't realize how terribly the power consumption / heat output would scale; a 10GHz CPU will just melt itself.
For Microsoft, this means that they've literally doubled their software licensing revenue relative to the hardware it is licensed to.
This kind of false incentive worries me a lot, because while I like the technical concepts like infrastructure-as-code enabled by the public cloud, I feel like greed will eventually destroy what they've built and we'll all be back to square one.
Ask your cloud sales representative these questions next time you have coffee with them:
- What incentive do you have to make your logging formats efficient, if you charge by the gigabyte ingested?
- If your customers are forced to "scale out" to compensate for a platform inefficiency, what incentive do you have to fix the underlying issue?
- What incentive do you have to make network flows take direct paths if you charge for cross-zone traffic? Or to put it another way: Why does the load balancer team refuse to implement same-zone-preference as a default?
Etc...
Once you start looking at the cloud like this, you suddenly realise why there are so many user voice feedback posts with thousands of upvotes where the vendor responds with "willnotfix" or just radio silence.
Hell, even Amazon's Graviton CPUs don't have it (though I'm sure that's a product of being ARM derived rather than a design decision).
So turning SMT off is, at the least, wasted potential for those cores, given the way they've been designed.
Let's say you can queue up 100 instructions. This yields the following:
1 port 100% of the time
2 ports 60% of the time
3 ports 30% of the time
4 ports 10% of the time
5 ports 2% of the time
Increasing the buffer to 200 instructions yields the following:
2 ports 80% of the time
3 ports 40% of the time
4 ports 15% of the time
5 ports 4% of the time
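To put numbers on the two lists, here is a small Python sketch. It assumes each row means "at least this many ports are busy in that fraction of cycles" (the original doesn't say so explicitly), and it assumes the 200-instruction case still has one port busy 100% of the time, since that row isn't listed. Under those assumptions, the average number of busy ports per cycle is just the sum of the rows.

```python
# Made-up figures from the example: fraction of cycles with at least k ports busy.
WINDOW_100 = {1: 1.00, 2: 0.60, 3: 0.30, 4: 0.10, 5: 0.02}
# The "1 port" row is not given for the 200-entry window; 100% is assumed here.
WINDOW_200 = {1: 1.00, 2: 0.80, 3: 0.40, 4: 0.15, 5: 0.04}


def average_ports(at_least: dict) -> float:
    """Mean busy ports per cycle: E[N] = sum over k of P(N >= k)."""
    return sum(at_least.values())


def speedup(small: dict, large: dict) -> float:
    """Ratio of average port usage between two window sizes."""
    return average_ports(large) / average_ports(small)


if __name__ == "__main__":
    print(f"100-entry window: {average_ports(WINDOW_100):.2f} ports/cycle")
    print(f"200-entry window: {average_ports(WINDOW_200):.2f} ports/cycle")
    print(f"speedup from doubling the window: {speedup(WINDOW_100, WINDOW_200):.2f}x")
```

With these made-up figures the averages come out to about 2.02 and 2.39 ports per cycle, so doubling the window buys roughly an 18% improvement, and the fifth port adds only 0.02 to 0.04 ports per cycle.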
As in that made-up example, doubling the window you can inspect doesn't double performance. You really want those extra ports because they offer a few percent IPC uptick, but the cost is too high. So you keep increasing the window size until the extra ports become viable. As an aside, AMD Cayman switched from VLIW5 to VLIW4 because the fifth port was mostly unused. A few applications suffered from the slightly lower theoretical performance, but using that space for more VLIW4 units (along with other changes) meant that for most things the overall performance went up.
Now comes the x86 fly in the ointment -- the decoder width gives rapidly diminishing returns (I believe an AMD exec mentioned 4 was the hard limit to keep power consumption under control). This limits the size of the reorder buffer that you can keep queued up. Since you have a maximum instruction window size, you have a hard port limit.
So you add a second thread. Sure, it requires its own entire frontend and register sets, but in exchange you get a ton more opportunities to use those other ports. There are tradeoffs with the complexity and extra units required for SMT, but that's beyond our scope.
As you can see, SMT performance is DIRECTLY related to how inefficiently the main thread can use the resources. In less interdependent code, SMT performance increases are worse because finding uses for those extra ports on the main thread is easier.
Now, let's consider the M1 and one reason why it doesn't have SMT. Going 5, 6, or even 8-wide on the decoders is trivial compared to x86. Apple's M1 (and even the upcoming V1 or N2) has wider decode. This in turn feeds a much larger buffer, which can in turn extract more parallelism from the thread (this seems to take about as many transistors as the extra frontend stuff needed to implement SMT). Because they can keep most of their ports fed with just one thread, there's no need for the complexity of SMT.
IBM POWER does show a different side of SMT though. They go with 8-way SMT. This isn't because they have that many ports. It's so they can hide latency in their supercomputers. It's kind of like MIMT (multiple instruction, multiple thread) in modern GPUs, but even more flexible. The extra threads help to ensure that even when other threads are waiting for data, there's still another thread that can be executing.
[1] https://en.wikipedia.org/wiki/Explicitly_parallel_instructio...
This very much is never going to be feasible for consumer and general purpose computing.
So process-based sandboxing will continue to be the defense here, and process switching will just get a little bit slower as increasingly more caches are flushed (toss the uOp cache into that list now). For basically all consumer usages this will be perfectly fine. On the other hand, things like Cloudflare's Workers are looking a lot more suspect.
If something is in cache, and you also have access to that cache, accessing that thing will be fast and few CPU resources will be used.
So you can tell that something is in cache. And you know you didn't put it there. So some other thread that you're sharing a CPU core with must have put it there.
To exploit those attacks, you're going to intentionally watch the other thread as it, for example, (speculatively) takes a branch, and either puts something in cache or doesn't. Now you know whether the other thread (speculatively) took a branch or not! Just from measuring timings of the cache.
From that, you work back to what the branch condition (that was still only speculatively executed) must have been, and if this branch is based on (speculatively loaded) data, you just leaked one or more bits of the data.
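The mechanism above can be sketched with a toy model in Python. This is a simulation, not an exploit: there is no real cache, no speculation, and no real timing here. The `ToyCache` class, the cost constants, and the probe address are all assumptions chosen to show the shape of the inference, namely that a fast access to a line you didn't load tells you what the other party did.

```python
HIT_COST = 1      # assumed cost units for an access that hits the toy cache
MISS_COST = 100   # assumed cost units for an access that misses

PROBE_ADDR = 0x40  # hypothetical line the victim touches only when its secret bit is 1


class ToyCache:
    """A shared cache modelled as a set of resident line addresses."""

    def __init__(self):
        self.lines = set()

    def flush(self):
        self.lines.clear()

    def access(self, addr: int) -> int:
        """Return the simulated cost of the access and leave the line cached."""
        cost = HIT_COST if addr in self.lines else MISS_COST
        self.lines.add(addr)
        return cost


def victim(cache: ToyCache, secret_bit: int) -> None:
    """Branch on the secret; only one side of the branch touches PROBE_ADDR."""
    if secret_bit:
        cache.access(PROBE_ADDR)


def attacker_recover_bit(cache: ToyCache, secret_bit: int) -> int:
    """Flush, let the victim run, then time a probe access to infer the branch."""
    cache.flush()
    victim(cache, secret_bit)          # the attacker never reads secret_bit itself
    probe_cost = cache.access(PROBE_ADDR)
    return 1 if probe_cost < MISS_COST else 0


def leak(secret_bits):
    cache = ToyCache()
    return [attacker_recover_bit(cache, bit) for bit in secret_bits]
```

Calling `leak([1, 0, 1, 1, 0])` returns the same bit list, recovered purely from whether the probe access was cheap or expensive. In a real attack the victim's access would happen speculatively and the cost would be a noisy timing measurement rather than a clean number.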
Suddenly, things are not speculative anymore. You guessed data that wasn't yours, because speculatively using it had an effect on the cache, and you could measure that effect. Here, they use the micro-op cache (I haven't read the paper, so I don't know the details, but this is broad strokes).
Any mechanism that you can use during speculation, and that you can extract timing information from is potentially a problem. And these are everywhere.
That's why the Spectre problem is so hard to fix now that Pandora's box is open.
Or something in one VM getting at data from a different VM?
>"In the case of the previous Spectre attacks, developers have come up with a relatively easy way to prevent any sort of attack without a major performance penalty" for computing, Moody said. "The difference with this attack is you take a much greater performance penalty than those previous attacks."
>"Patches that disable the micro-op cache or halt speculative execution on legacy hardware would effectively roll back critical performance innovations in most modern Intel and AMD processors, and this just isn't feasible," Ren, the lead student author, said.
Which for old hardware will translate to "flush the micro op cache every time the address space changes".
I would guess that can be done with a microcode update and that the performance hit won't be too massive.
Context switches don't happen that often due to preemption unless your CPU is oversubscribed. Most context switches are due to syscalls, especially the ones used to wait for contended locks. Reducing those takes a lot more optimization work.
And you don't need to.
There are really very few uses for real multi-system vs multi-process shared systems.
Take a look at that whole "cloud" thing.
Everyone I know who has worked in cloud hosting says that most systems are ridiculously overprovisioned, effectively nullifying any economic justification for a shared system.
Why is this at all a thing? Why would you ever leave something out there like that without documenting its existence?
Then when something unexpected like Spectre comes along, the people that have to deal with it can say "Oh yeah, those testing instructions provide another vector of attack that our patch has to account for."
Instead we're in this situation, and I'm pretty sure there's at least a half dozen nations that would have already devoted the resources needed to uncover undocumented instructions like this, meaning ample opportunity to have developed various exploits.
And therefore the M1 seems so much faster than it otherwise would?
In particular, the design uses a memory hub with the memory chips very close to the CPU core. And it has a massive L2 cache on top.
OpenBSD disabled HT by default.
Maybe it’s time to make clocks a privileged op as a mitigation. Or even to make execution time unpredictable for untrusted code, such as JavaScript?
If precise time keeping is unavailable these become harder to do.
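A minimal Python sketch of why coarse clocks make this harder, using made-up numbers: a cache hit of 50 ns, a miss of 250 ns, and a clock that only ticks every 100 µs. The `coarsen` and `measured` helpers are hypothetical, and the model ignores the well-known workarounds (repeating the measurement many times, or building a counter from a spinning thread), which is why this only raises the bar rather than closing the channel.

```python
def coarsen(timestamp_ns: int, resolution_ns: int) -> int:
    """Round a timestamp down to the nearest multiple of the clock resolution."""
    return (timestamp_ns // resolution_ns) * resolution_ns


def measured(start_ns: int, duration_ns: int, resolution_ns: int) -> int:
    """Duration as seen by code that can only read the coarsened clock."""
    end_ns = start_ns + duration_ns
    return coarsen(end_ns, resolution_ns) - coarsen(start_ns, resolution_ns)


if __name__ == "__main__":
    HIT_NS, MISS_NS = 50, 250   # assumed latencies, for illustration only
    START_NS = 1_000_000        # arbitrary start time

    for resolution in (1, 100_000):
        hit = measured(START_NS, HIT_NS, resolution)
        miss = measured(START_NS, MISS_NS, resolution)
        print(f"resolution {resolution} ns: hit={hit} ns, miss={miss} ns")
```

With a 1 ns clock the hit and miss are trivially distinguishable; with the 100 µs clock both measurements collapse to 0 in a single sample.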
https://www.researchgate.net/publication/322000263_Fantastic...
I See Dead µops: Leaking Secrets via Intel/AMD Micro-Op Caches
I don’t think any new mitigations are needed.
"Intel's suggested defense against Spectre, which is called LFENCE, places sensitive code in a waiting area until the security checks are executed, and only then is the sensitive code allowed to execute," Venkat said. "But it turns out the walls of this waiting area have ears, which our attack exploits. We show how an attacker can smuggle secrets through the micro-op cache by using it as a covert channel."
If true, wouldn’t this also imply that an Intel Skylake CPU mitigates against such attempted attacks by one user against another in a shared CPU/ISP/cloud environment, whereas an AMD CPU theoretically would not? If true, this would be a key point that the authors failed to mention in their concluding remarks.
Anyone else read it this way? Or am I missing something?
The morphing of data into code pages with JITs like JS should also be subject to similar restrictions.