New x86 micro-op vulnerability breaks all known Spectre defenses

(sciencedaily.com)

417 points · DoomHotel · 5y ago · 189 comments

101 comments · 24 top-level

akersten5y ago· 33 in thread

I've been saying this from the start: the well of issues is infinitely deep as soon as you decide that multiple tenants running on the same physical hardware inferring something about another is a vulnerability. I assert, but cannot rigorously prove, that it is not possible to design a CPU such that execution of arbitrary instructions has no observable side-effects, especially if the CPU is speculating.

I don't know what that spells for cloud hosting providers - maybe they have to buy a lot more CPUs so every client can have their own, or commission a special "shared" SKU of CPU that doesn't have any speculative execution - but I know for me, if I have untrusted code running on my CPU, I've already lost. At that point I couldn't care less about information leakage between threads.

We're going to wind up undoing the last 20 years of performance gains in the name of 'security', and it scares me.

MarkSweep5y ago

> if I have untrusted code running on my CPU, I've already lost

Don’t forget about JavaScript, a common way for people to run untrusted code on their computers. Not all microarchitectural data sampling attacks are exploitable from JavaScript, but some are.

akersten5y ago

Yeah, JS is the only hairy part. I considered mentioning it, since I knew it was going to come up. But luckily, all I've seen so far are basic demos (like leaky.page) that read data from a carefully-crafted array that the page itself populated. I've yet to be convinced that you could realistically exfiltrate meaningful data at any sort of scale with in-browser JS, especially now that more blatant bugs like Meltdown are fixed.

If anyone can show a proof-of-concept ("this page grabs your password manager extension's data") I'll eat my words. But I feel confident that most of these issues are purely academic and, while interesting, serve more to provide content for PhD theses than represent urgent hazards on the web.

baybal25y ago

Chrome had 7 exploits caught in the wild within 7 weeks in 2020.

I believe this is heading towards JIT being disabled, or at least severely limited.

indigochill5y ago

> We're going to wind up undoing the last 20 years of performance gains in the name of 'security', and it scares me.

This actually excites me. When the foundation is shown to be rotten, it's time for a new foundation.

I'm optimistic, though, that the future holds a fork, with some devices insecure-but-fast and others secure-but-slow. Because there's a market for both. I don't care if my gaming hardware is vulnerable to Spectre because ideally there's nothing worth stealing there anyway. Email/messaging hardware can afford to be a -lot- slower than my gaming rig without any appreciable impact on the experience.

Perhaps the future holds motherboards that look like the physical embodiment of Qubes OS, with secure and insecure chips running compartmentalized features based on their security/speed requirements. We already do something like this for performance with the divide between CPUs and GPUs.

creatonez5y ago

I have a feeling that the wide spectrum of different ways to mitigate CPU bugs will slow down the demand for a new "slow and steady" CPU architecture. Linux already comes with a feature to wipe the L1 cache on every context switch -- simply enabling this option will compete with brand new architectures for a while.

A major advantage of not adopting new CPU designs for a while is that you get to keep insecure-but-fast and secure-but-slow behavior in the same CPU by simply tweaking mitigations.
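
As a rough illustration of what "tweaking mitigations" looks like from user space: Linux kernels that support it report the state of each known CPU issue as one small text file per issue under /sys/devices/system/cpu/vulnerabilities. The sketch below only reads those files; the function name and the fallback behaviour are choices made for this example, and changing the mitigations themselves is done elsewhere (for instance with kernel boot parameters such as mitigations=off).

```python
from pathlib import Path

# Standard sysfs location on Linux kernels that report mitigation state.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def mitigation_status(vuln_dir=VULN_DIR):
    """Return {issue name: kernel status string}; empty if unavailable."""
    status = {}
    if not vuln_dir.is_dir():
        return status  # not Linux, or a kernel without this interface
    for entry in sorted(vuln_dir.iterdir()):
        try:
            status[entry.name] = entry.read_text().strip()
        except OSError:
            status[entry.name] = "unreadable"
    return status


if __name__ == "__main__":
    for name, state in mitigation_status().items():
        print(f"{name}: {state}")
```

On a machine without that directory the function simply returns an empty dict, so it is safe to run anywhere.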

jka5y ago

Good thinking, although I'd be a little wary about making a clear distinction between those two classes of device.

It'd seem both theoretically and practically possible to engineer hardware that could enable and disable certain optimizations and extensions dynamically.

(note that energy consumption may also be a related factor here)

baybal25y ago

You are completely correct.

The safe execution of any untrusted Turing complete code is a pipe dream.

At the very least, you need a clean-sheet CPU design starting from the ISA, with basic logic operations formally validated against instruction-level analysis, to have a fighting chance.

But even such chips do get pwned, as shown by key recovery from credit cards in the wild.

elihu5y ago

> The safe execution of any untrusted Turing complete code is a pipe dream.

I don't think that's true. It's not Turing completeness that's the real problem here. It's that software usually has access to accurate timing information, whether it's via RDTSC, gettimeofday(), or sharing memory with another thread that does things that take a predictable amount of time. If a program has no notion of current time and cannot measure how long something takes, then a lot of those side channel attacks no longer work. (Note that this precludes using styles of threading that have nondeterministic results, but it doesn't preclude using styles of threading that are deterministic, like Haskell's parMap.)
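
The third timing source mentioned above (shared memory plus a thread doing predictable work) can be sketched in a few lines. This is only an illustration, assuming nothing beyond Python's standard threading module; the CounterClock and measure names are made up for the example, and Python's interpreter lock makes this clock far coarser than a native spinning counter would be.

```python
import threading
import time


class CounterClock:
    """A clock built from nothing but a thread that keeps counting."""

    def __init__(self):
        self.ticks = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.ticks += 1  # only this thread writes; readers just sample it

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()


def measure(clock, operation):
    """Return how many ticks elapsed while operation() ran."""
    start = clock.ticks
    operation()
    return clock.ticks - start


with CounterClock() as clock:
    # time.sleep only stands in for "some work"; no clock value is ever read.
    short_ticks = measure(clock, lambda: time.sleep(0.01))
    long_ticks = measure(clock, lambda: time.sleep(0.10))
```

The measuring code never asks the system for the time, yet the longer operation will normally show a larger tick count. That is why removing explicit clocks is not enough on its own: nondeterministic shared-memory threading has to go as well, as the comment notes.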

I do think maybe we should move away from the model of "let's let people run programs comprised of arbitrary instructions on their computers, and build all our security around keeping programs from reading and writing things they shouldn't" to a model of "all programs running on this computer were compiled by a trusted compiler, and our security is based on the compiler disallowing certain unsafe constructs". This is sort of analogous to web browsers running javascript in a sandbox, or running eBPF in the Linux kernel.

mikewarot5y ago

>The safe execution of any untrusted Turing complete code is a pipe dream.

The safe execution of any code requires an operating environment that never trusts the code with more than the least privilege required to complete a task. It has worked in mainframes that way for decades.

The IT zeitgeist these days makes me sad. Things can be better, but almost everyone is pushing in counterproductive directions, or has given up hope.

Salgat5y ago

Will this even be an issue when we have CPUs with hundreds/thousands of cores that can just sandbox processes to their own set of cores/cache with exclusive unshared memory?

MaxBarraclough5y ago

I think this idea could be taken further: just build physical machines with lower capacity (RAM, cores), rather than filling data-centers with top-spec hardware then dividing them up with virtualisation. On the face of it at least, this seems like an idea worth taking seriously. With the right form-factor, I imagine it shouldn't even have much of an impact on space efficiency or power efficiency. Perhaps the CPU companies just aren't interested in making such hardware?

freemint5y ago

You could very much design a CPU, and verify it, such that execution of arbitrary code has no observable side effects.

However, attempting to do so is a multi-year project, and for modern CPUs certainly a multi-decade one. This is not feasible as long as Moore's Law goes on.

Formal verification of a kernel (seL4) and of a microprocessor it runs on have been performed together to prove properties.

Large ALU blocks and vector units (such as multipliers) can be formally verified.
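
To give a feel for what checking an ALU block means, here is a toy sketch that compares a gate-level ripple-carry adder against ordinary integer addition on every possible input. This is brute-force exhaustive checking, not the symbolic techniques (SAT solvers, BDDs, theorem provers) that real hardware verification relies on, and all names here are invented for the example.

```python
from itertools import product


def full_adder(a, b, carry_in):
    """One-bit full adder written only with gate-level boolean operations."""
    total = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return total, carry_out


def ripple_carry_add(x, y, width):
    """Add two unsigned integers by chaining full adders bit by bit."""
    result, carry = 0, 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry


def check_adder(width):
    """Compare the gate-level adder with integer addition on every input pair."""
    for x, y in product(range(1 << width), repeat=2):
        result, carry = ripple_carry_add(x, y, width)
        if result | (carry << width) != x + y:
            return (x, y)  # counterexample
    return None  # no mismatch anywhere in the input space
```

check_adder(8) walks all 65,536 input pairs quickly, but the input space grows as 2^(2 * width), which is one reason wide datapaths need symbolic methods rather than enumeration.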

However efforts of end-to-end formal verification of entire processors won't happen unless there is demand to justify the huge investment of engineering resources and the market would be fine with chips many years behind.

Unless everyone is running dozens of clients on all their hardware, it is cheaper to give everyone their own machine than to invest the engineering time into verifying chips. For Intel, selling more machines for isolation brings in more revenue without having to invest in decade-long verification projects, which means it is not a mathematical certainty that it can't happen, just an economic one. After all speculative execution is just extra state and extra logic which can be formally verified. After all programs are just bit patterns and all quarters can be addressed in the formalism of Quantified Boolean Formulas.

But nobody is gonna undo 20 years of performance. You will just be told to buy more machines to isolate workloads.

rsj_hn5y ago

The largest systems I'm aware of that meet formal verification standards -- like CC EAL 7 -- are small smart card (Gemalto) operating systems. Has it been done on anything bigger?

paulmd5y ago

I've been saying since this initially came up that big.LITTLE is the long-term solution for this.

In the grand scheme of things, high-intensity tasks are only infrequently high-security tasks - those two sets of workloads are mostly disjoint. So the long-term solution is to have "fast cores" and "secure cores".

The fast cores can have all the OoO, speculation, all of that good stuff. That's where you run anything that needs to go fast, or anything running "trusted" code. By and large, nobody cares if an ffmpeg process or HPC node might leak data. Databases? You control the queries that are running on them, right? There are some edge cases like video games where leaking data is moderately harmful (could be useful for exploits if you can reliably leak useful data) yet you still want maximum performance, but at the end of the day leaking data at a couple kB/s usually isn't going to be the end of the world especially if the data is rapidly changing.

If the code is untrusted or user-generated, or the data is sufficiently sensitive, then run it on a "secure" core. The "secure" cores have to be in-order, non-speculative, all that crap. Probably non-SMT as that seems to be a bottomless pit of sidechannels as well. But usually, you aren't churning huge workloads in the "secure" situations. You can still have crypto acceleration instructions built into the cores, AVX, whatever, just not speculative. It's probably better to get them fully out of the "normal" cache hierarchy as well.

There are a couple of obvious problems here, but much smaller than trying to fix everything for every use-case. In particular web browsers are running untrusted code, and every single website is running 15 MB of shitty javascript code. It sucks but it's basically become an inner platform and you can't trust the code that it's bringing in, so that needs to be permanently isolated on its own secure cores. People will have to start paying attention to the performance of their javascript and optimizing out the real shitty bits.

Another big one is shared hosting environments - VPS environments are a prime target for trying to leak data from other clients on the same core/cache hierarchy, so those either need to be moved to "secure" cores, or switched to a model of renting out a whole core (or moved to a "hard time slice" where when the slice goes active you get the whole core for X seconds, then the processor stops, flushes everything, then switches clients). But VPS could conceivably be moved to "arrays of little cores" (to the extent that they aren't already) and that won't pose much problem for a lot of typical "micro" use-cases as long as every instance doesn't hit the server at once. Maybe for people that need faster than a dedicated "little" core the next increment becomes leasing the whole core, or even the whole complex of cores on that cache hierarchy.

Web application servers (not necessarily databases) are another one, unfortunately, since you can time web requests and use that to "leak" data down different code paths. If it's a directly user-facing service, probably best to get it onto a secure core.

The big task for humans is going to be identifying what stuff is allowable to run on the "fast" cores, and then get the schedulers set up so they understand that some stuff can only run in certain processor domains. It's not insurmountable, it just is going to take some time to plug away at it. Perhaps distribute whitelists, and allow the end-user to manually override it if they're really sure.
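
A minimal sketch of the placement policy described above, assuming a made-up core layout and a made-up whitelist. Every identifier here (FAST_CORES, SECURE_CORES, FAST_WHITELIST, Task, allowed_cores) is hypothetical; the point is only that the rule is a small pure function a scheduler could consult.

```python
from dataclasses import dataclass

# Hypothetical core layout: ids 0-3 are "fast" cores, 4-7 are "secure" cores.
FAST_CORES = frozenset({0, 1, 2, 3})
SECURE_CORES = frozenset({4, 5, 6, 7})

# Hypothetical distributed whitelist of programs allowed on fast cores.
FAST_WHITELIST = frozenset({"ffmpeg", "hpc-solver"})


@dataclass(frozen=True)
class Task:
    name: str
    handles_untrusted_input: bool = False
    user_override_fast: bool = False  # the end user insists they are sure


def allowed_cores(task, whitelist=FAST_WHITELIST):
    """Pick the set of cores a scheduler may place this task on."""
    if task.user_override_fast:
        return FAST_CORES | SECURE_CORES
    if task.name in whitelist and not task.handles_untrusted_input:
        return FAST_CORES | SECURE_CORES
    return SECURE_CORES  # default: anything unknown stays on secure cores
```

On Linux the resulting set could be handed to os.sched_setaffinity(pid, cores) to pin a process, though a real implementation would more likely live in the kernel scheduler than in user space.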

But yes I've been saying that too, my suspicion is that basically all of OoO and speculation is fundamentally incompatible with not leaking timing data between processes, and that the harder we tilt at this the more attacks we're going to turn up, it's going to turn into an endless game of whack-a-mole and it's going to eat up all the performance gains that we've spent the last 20 years building on the backs of OoO and speculation.

AMD is quite well-placed for this imo since each CCX basically acts like its own NUCA (non-uniform cache architecture) domain and they just happen to share a memory controller. That's pretty much the design you need to make it work right, just with big and little CCXs instead of only big. They just have to come up with their own little cores. Intel is going to be harder because the classic Sandy Bridge architecture (which is largely unchanged today) has all the cores collectively sharing their last-level cache, and I think that's probably a problem in the long term too. I think Skylake-X still works on the principle of cache being attached to each core and them talking to each other to share it.

AMD and Intel engineers, please make your consulting checks out to 'cash'. Thanks! ;)

hinkley5y ago

> It's probably better to get them fully out of the "normal" cache hierarchy as well.

I can’t seem to shake the notion that this idea of transparent, multi-level caching might have to go away too. That cache shared between cores may have to morph into a layer of chip-local memory that you allocate imperatively. It’s possible that languages like Rust or VMs like the Beam could either adapt to such hardware with fewer problems, or even leverage it. We keep trying to pretend like memory is flat but now we’re up to 3-4 layers of cache and memory banks. How much longer can you torture that abstraction?

willis9365y ago

>In the grand scheme of things, high-intensity tasks are only infrequently high-security tasks - those two sets of workloads are mostly disjoint.

The most intense thing my phone does is decrypt my password database, and it does this dozens of times a day.

tehbeard5y ago

Honest question, what's the "Explain like I'm a Freshman" for what they can work out/leak from all the spectre stuff?

I see a lot about private keys etc., but is it just a blind attack? Or do you need more info on the target? How quickly can you attack to get info?

In essence, is this something Joe Public needs to worry about for their $5 VPS, or something nefarious that would be used against $CORP's public cloud infrastructure?

ChuckMcM5y ago

I think we will eventually see a return to company 'data centers' away from IaaS plays.

That said, it's a pretty amazing time if you're a computer architect since you now have the transistors to spend on pretty much any crazy scheme you can dream up. So perhaps we'll see 'code safe' computer architectures emerge.

fauigerzigerk5y ago

> I think we will eventually see a return to company 'data centers' away from IaaS plays

I don’t see why. Cloud providers have been offering dedicated hardware for a long time. If this problem isn’t reliably fixable then more customers will make use of these options.

dreamcompiler5y ago

It's not impossible but it does require some relatively unfamiliar architectural approaches coupled with a lot more use of formal methods.

Completely agree about SpecEx. That's a misfeature that needs to die.

Szpadel5y ago

I think that cloud providers get dedicated SKUs. I can imagine if you give each VM dedicated cores and you can partition L3 cache per user, you could mitigate most of those issues.

tinus_hn5y ago

It’s perfectly possible, you just can’t take shortcuts, and you’ll lose quite a bit of performance. And of course if you really ‘share’ a resource the users are going to know a bit about each other. If each user's share grows and shrinks as others use more or less, they are going to know that. But if you don’t want that you could just reserve a fixed part of the resource and they wouldn’t know.

devit5y ago

It's pretty trivial to make such a CPU: just execute instructions in order and with no cache.

The challenge is more how to make a fast CPU like that.

api5y ago

It’s also possible that we strip off a ton of complexity and then find new performance directions that are better.

For my money we’d end up going toward many-core with loads of simple in-order cores on a die. It’d almost look like a GPU. With 5nm how many in-order ARM or RISC-V cores could you put on a chip? You’d also probably move away from shared caches toward each core having more cache and processes having stronger core affinity. That would be both faster and less likely to allow cache timing attacks. You’d have so many cores a core per process would be feasible with sharing only happening at saturation.

Another direction would be to go back to trying to crank up clock speed with some new approaches. What could we do with today’s manufacturing techniques if we focused on faster transistors more than smaller ones? AFAIK almost nobody has been working on this since the game has been to use more transistors to implement more features and hacks instead.

I read about 10ghz parts on the lab bench in the 2000s. That’s eternity ago in terms of semiconductor process. A 10ghz in-order core would be like a 4X parallel 2.5ghz core… roughly… but more secure and broadly faster on code that’s hard to parallelize. Get rid of speculation and instead give it low branch latency and a ton of on board cache.

ThrowawayR25y ago

> "What could we do with today’s manufacturing techniques if we focused on faster transistors more than smaller ones? AFAIK almost nobody has been working on this since the game has been to use more transistors to implement more features and hacks instead."

Plenty of smart people spent lots of money trying it and as it happens the physics doesn't work out. There are countless articles explaining why processor clock speed isn't increasing, e.g.

- https://www.maketecheasier.com/why-cpu-clock-speed-isnt-incr...

- https://software.intel.com/content/www/us/en/develop/blogs/w...

yjftsjthsd-h5y ago

> For my money we’d end up going toward many-core with loads of simple in-order cores on a die.

That helps with multitasking and parallel-friendly workloads, but lots of stuff isn't easy to make multithreaded.

varispeed5y ago

The thing is, most projects could successfully run on a single dedicated server plus one or two spares. There is absolutely no need for slow virtual nodes. I always thought of those cloud solutions as a clever scam.

yuliyp5y ago

If you see them as a scam you're welcome to not use them. There are still colo facilities out there, and if you don't want to use that there are ISPs that will let you connect to the Internet.

fulafel5y ago

> is not possible to design a CPU such that execution of arbitrary instructions has no observable side-effects, especially if the CPU is speculating

As a counter example, how about the 8086?

ForOldHack5y ago

I agree completely.

Bancakes5y ago

20 years ago, computer magazines wrote about single-core 10GHz CPUs. Billions of transistors. What we have can barely be described as performance gains more than "add SIMD and more cores, and performance hacks".

yjftsjthsd-h5y ago

> 20 years ago, computer magazines wrote about single-core 10GHz CPUs

Yeah, because they didn't realize how terribly the power consumption / heat output would scale; a 10GHz CPU will just melt itself.
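
A back-of-the-envelope sketch of that scaling, using the common textbook approximation that dynamic CMOS power goes as P ~ C * V^2 * f. The assumption that supply voltage has to rise roughly in proportion to clock frequency is a deliberate simplification for illustration, and the 2.5 GHz baseline is just the figure used elsewhere in this thread.

```python
def relative_dynamic_power(freq_ratio, voltage_tracks_frequency=True):
    """Power multiplier under P ~ C * V^2 * f, relative to the baseline clock."""
    # Simplifying assumption: voltage scales linearly with frequency.
    voltage_ratio = freq_ratio if voltage_tracks_frequency else 1.0
    return voltage_ratio ** 2 * freq_ratio


# Going from 2.5 GHz to 10 GHz is a 4x clock increase.
cubic_case = relative_dynamic_power(10 / 2.5)
linear_case = relative_dynamic_power(10 / 2.5, voltage_tracks_frequency=False)
```

Under these assumptions a 4x clock costs about 64x the dynamic power, versus 4x if voltage could somehow stay fixed. Leakage and thermal limits are ignored here, so treat this as intuition rather than a model of any real part.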

mhh__5y ago

As opposed to the billion transistors we have now?

totallyabstract5y ago· 10 in thread

There are separate micro-op caches per core; however, they are typically shared among hyperthreads. I wonder if this could be another good reason for cloud vendors to move away from 1vCPU = 1 hyperthread to 1vCPU = 1 core for x86 when sharing machines (not that there weren't enough good reasons already).

jiggawatts5y ago

One sneaky thing I've noticed them doing is slowly switching their licensing over to 1 vCPU = 1 CPU, even though you're now only getting one hyperthread instead of one core.

For Microsoft, this means that they've literally doubled their software licensing revenue relative to the hardware it is licensed to.

This kind of false incentive worries me a lot, because while I like the technical concepts like infrastructure-as-code enabled by the public cloud, I feel like greed will eventually destroy what they've built and we'll all be back to square one.

Ask your cloud sales representative these questions next time you have coffee with them:

- What incentive do you have to make your logging formats efficient, if you charge by the gigabyte ingested?

- If your customers are forced to "scale out" to compensate for a platform inefficiency, what incentive do you have to fix the underlying issue?

- What incentive do you have to make network flows take direct paths if you charge for cross-zone traffic? Or to put it another way: Why does the load balancer team refuse to implement same-zone-preference as a default?

Etc...

Once you start looking at the cloud like this, you suddenly realise why there are so many user voice feedback posts with thousands of upvotes where the vendor responds with "willnotfix" or just radio silence.

the84725y ago

Cloud vendors probably use a hypervisor that schedules the VM time slices in a way that hyperthread siblings are only ever cooccupied by the same guest.
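
The invariant described above can be written down as a small check. This is a sketch with invented names and an invented data layout (a map from hardware thread id to guest, plus a list of sibling groups), not how any particular hypervisor represents its state.

```python
def sibling_conflicts(placement, siblings):
    """Return cores whose hyperthreads are assigned to different guests.

    placement: {hardware thread id: guest name, or None when idle}
    siblings:  list of tuples of hardware thread ids sharing one core
    """
    conflicts = []
    for core in siblings:
        guests = {placement.get(thread) for thread in core} - {None}
        if len(guests) > 1:
            conflicts.append(core)
    return conflicts


# Two cores with two hyperthreads each.
topology = [(0, 1), (2, 3)]
safe = sibling_conflicts({0: "guest_a", 1: "guest_a", 2: "guest_b", 3: None}, topology)
unsafe = sibling_conflicts({0: "guest_a", 1: "guest_a", 2: "guest_b", 3: "guest_c"}, topology)
```

A scheduler that enforces this has to leave a sibling idle whenever the only runnable work belongs to another guest, which is where the throughput cost of the approach comes from.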

ljhsiung5y ago

Even putting security aspects aside, in general I've been seeing research pop up over the years criticizing SMT's performance claims of ~30%.

Hell, even Amazon's Graviton CPUs don't have it (though I'm sure that's a product of being ARM derived rather than a design decision).

tux35y ago

ARM vendors must be feeling pretty good about themselves yeah, but if you take AMD's cores... SMT might not be a huge win in every benchmark, but you just can't keep that wide backend fed from a single hyperthread (at least I can't!).

So turning SMT off is at the least wasted potential for those cores, the way they've been designed

hajile5y ago

The performance claims are true for all the worst reasons.

Let's say you can queue up 100 instructions. This yields the following

    1 port 100% of the time
    2 ports 60% of the time
    3 ports 30% of the time
    4 ports 10% of the time
    5 ports 2% of the time

Increasing the buffer to 200 instructions yields the following

    2 ports 80% of the time
    3 ports 40% of the time
    4 ports 15% of the time
    5 ports 4% of the time

As in that made-up example, doubling the window you can inspect doesn't double performance. You really want those extra ports because they offer a few percent IPC uptick, but the cost is too high. So you keep increasing the window size until the extra ports become viable. As an aside, AMD Cayman switched from VLIW5 to VLIW4 because the fifth port was mostly unused. A few applications suffered from the slightly lower theoretical performance, but using that space for more VLIW4 units (along with other changes) meant that for most things the overall performance went up.
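
To put a number on the made-up example above, one reading is that each row gives the fraction of time that at least that many ports are busy (the rows sum to more than 100%, so they cannot be exclusive). Under that reading the average number of busy ports is just the sum of the rows. The "1 port" row for the 200-instruction case is not given in the comment and is assumed here to stay at 100%.

```python
# Figures from the made-up example; the first entry of the second list
# (1 port, 100%) is an assumption because the comment does not list it.
at_least_n_ports_100 = [1.00, 0.60, 0.30, 0.10, 0.02]
at_least_n_ports_200 = [1.00, 0.80, 0.40, 0.15, 0.04]


def average_ports_busy(at_least):
    """E[ports busy] = sum over n of P(at least n ports busy)."""
    return sum(at_least)


avg_100 = average_ports_busy(at_least_n_ports_100)  # about 2.02
avg_200 = average_ports_busy(at_least_n_ports_200)  # about 2.39
speedup = avg_200 / avg_100                         # about 1.18
```

So in this illustration, doubling the instruction window buys roughly 18% more work per cycle, which is the diminishing return the comment is pointing at.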

Now comes the x86 fly in the ointment -- the decoder width gives rapidly diminishing returns (I believe an AMD exec mentioned 4 was the hard limit to keep power consumption under control). This limits the size of the reorder buffer that you can keep queued up. Since you have a maximum instruction window size, you have a hard port limit.

So you add a second thread. Sure, it requires its own entire frontend and register sets, but in exchange you get a ton more opportunities to use those other ports. There are tradeoffs with the complexity and extra units required for SMT, but that's beyond our scope.

As you can see, SMT performance is DIRECTLY related to how inefficiently the main thread can use the resources. In less interdependent code, SMT performance increases are worse because finding uses for those extra ports on the main thread is easier.

Now, let's consider the M1 and one reason why it doesn't have SMT. Going 5, 6, or even 8-wide on the decoders is trivial compared to x86. Apple's M1 (and even the upcoming V1 or N2) have wider decode. This in turn feeds a much larger buffer which can in turn extract more parallelism from the thread (this seems to be taking about as many transistors as the extra frontend stuff to implement SMT). Because they can keep most of their ports fed with just one thread, there's no need for the complexity of SMT.

IBM POWER does show a different side of SMT though. They go with 8-way SMT. This isn't because they have that many ports. It's so they can hide latency in their supercomputers. It's kind of like MIMT (multiple instruction, multiple thread) in modern GPUs, but even more flexible. It helps to ensure that even when other threads are waiting for data, there's still another thread that can be executing.

jcelerier5y ago

When doing audio processing I'm getting ~20/25% more oomph with HT enabled

tyingq5y ago

Or to roll out more ARM, where there isn't currently any hyperthreading.

jamieiles5y ago

ThunderX2 and ThunderX3 have 4-way SMT for general purpose, but yes, more ARM is good :-)

secondcoming5y ago

Why would this be an issue for machines on the cloud? If someone can upload binaries to your machine you have bigger problems, no?

derekp75y ago

Because the cloud is designed around people uploading binaries to your machine -- it is a basic principle of how services are allocated. When you go to AWS and spin up an EC2 instance, you don't get a machine to yourself. You get a VM running with many other people's VMs on some arbitrary server in one of their data centers.

smasher1645y ago· 5 in thread

Maybe EPIC [1] architectures need a revival. Rely on compilers to take advantage of explicit instruction-level parallelism, and keep the CPU dumb.

[1] https://en.wikipedia.org/wiki/Explicitly_parallel_instructio...

bonzini5y ago

That failed for good reasons. Itanium processors ended up using out of order execution and speculation just like everyone else, because the compilers just don't have enough information compared to an out of order execution engine.

Paianni5y ago

From Poulson onwards anyway. https://www.realworldtech.com/poulson/

pabs35y ago

Reminds me of the Mill ISA:

https://millcomputing.com/

zokula5y ago

> Rely on compilers to take advantage of explicit instruction-level parallelism, and keep the CPU dumb.

This very much is never going to be feasible for consumer and general purpose computing.

smasher1645y ago

The ML folks are pulling themselves out of that rut now. There’s lots of interesting work going on for the next generation of compilers.

dataflow5y ago· 4 in thread

Question: How relevant are these for the average person? I know these matter for things like shared hosting, but I've yet to hear of an actual exploit in the wild that ordinary people have been attacked by, even with Spectre defenses turned off. Should normal people be worried about this?

jimmaswell5y ago

I personally disable spectre/meltdown mitigations for performance. I don't think any of it is very important for my use cases and I don't leave sketchy websites open for hours on end to give them a chance to make use of the exploits.

1e-95y ago

Yes. It could undermine your browser if you allow a malicious site to run JavaScript.

kllrnohj5y ago

Not really. So far these mostly haven't been able to cross process boundaries, and most browsers have tripled-down on process-sandboxing by this point (iframe sandboxing was the last major push here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/if... )

So process-based sandboxing will continue to be the defense here, and process switching will just get a little bit slower as increasingly more caches are flushed (toss the uOp cache into that list now). For basically all consumer usages this will be perfectly fine. On the other hand, things like Cloudflare's Workers are looking a lot more suspect.

dataflow5y ago

Spectre could too, but again, my point was that I didn't hear of actual attacks on people in the wild, at least not on any scale that seemed to make the news. Is there a reason to believe this will be different?

PopePompus5y ago· 4 in thread

I don't understand this at all; I didn't think the micro-op cache was visible to code written for the x86 ISA at all. Can anyone explain to an idiot (me) how something in micro-op cache can become visible to the outside world?

tux35y ago

I'm simplifying a bit (edit: quite a bit =]), but the way these attacks work is generally by exploiting the difference in timing between something being in cache, and something not being in cache. Or some resource being contended vs not contended.

If something is in cache, and you also have access to that cache, accessing that thing will be fast and few CPU resources will be used.

So you can tell that something is in cache. And you know you didn't put it there. So some other thread that you're sharing a CPU core with must have put it there.

To exploit those attacks, you're going to intentionally watch the other thread as it, for example, (speculatively) takes a branch, and either puts something in cache or doesn't. Now you know whether the other thread (speculatively) took a branch or not! Just from measuring timings of the cache.

From that, you work back to what the branch condition (that was still only speculatively executed) must have been, and if this branch is based on (speculatively loaded) data, you just leaked one or more bits of the data.

Suddenly, things are not speculative anymore. You guessed data that wasn't yours, because speculatively using it had an effect on the cache, and you could measure that effect. Here, they use the micro-op cache (I haven't read the paper, so I don't know the details, but this is broad strokes).

Any mechanism that you can use during speculation, and that you can extract timing information from is potentially a problem. And these are everywhere.

That's why the Spectre problem is so hard to fix now that Pandora's box is open.
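
The broad strokes above can be shown with a toy simulation. Nothing here touches real hardware: ToyCache, the latency numbers, and the line names are all invented, and speculation is not modelled at all. It only illustrates the core inference step, that timing a probe of a shared cache reveals which path the other thread took.

```python
HIT_CYCLES, MISS_CYCLES = 4, 200  # illustrative latencies, not measured


class ToyCache:
    """A shared cache that only remembers which lines are present."""

    def __init__(self):
        self.lines = set()

    def flush(self):
        self.lines.clear()

    def access(self, line):
        """Touch a line and return the simulated latency of that access."""
        latency = HIT_CYCLES if line in self.lines else MISS_CYCLES
        self.lines.add(line)
        return latency


def victim(cache, secret_bit):
    """The branch outcome depends on the secret and leaves a cache footprint."""
    if secret_bit:
        cache.access("line_A")
    else:
        cache.access("line_B")


def attacker_recover_bit(cache, run_victim):
    cache.flush()                     # start from a known state
    run_victim()                      # the victim runs on the shared core
    latency = cache.access("line_A")  # time a probe of one candidate line
    return 1 if latency < (HIT_CYCLES + MISS_CYCLES) // 2 else 0


secret = [1, 0, 1, 1, 0, 0, 1, 0]
shared = ToyCache()
leaked = [attacker_recover_bit(shared, lambda b=b: victim(shared, b)) for b in secret]
```

The attacker never reads the secret directly; it only learns whether its own probe was fast or slow, and that is enough to reconstruct every bit in this idealized, noise-free setting.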

fnord775y ago

so something say, sandboxed (like in a browser running webassembly) could get at non-sandboxed data?

Or something in one VM getting at data from a different VM?

mikewarot5y ago

The fix is simple, don't allow access to clocks or timing information in anything other than the microkernel that runs the OS.

nabla95y ago

There is no direct access. These exploits use side-channel attacks. They feed the CPU code whose execution reveals information indirectly.

tester7565y ago· 3 in thread

>"Intel's suggested defense against Spectre, which is called LFENCE, places sensitive code in a waiting area until the security checks are executed, and only then is the sensitive code allowed to execute," Venkat said. "But it turns out the walls of this waiting area have ears, which our attack exploits. We show how an attacker can smuggle secrets through the micro-op cache by using it as a covert channel."

>"In the case of the previous Spectre attacks, developers have come up with a relatively easy way to prevent any sort of attack without a major performance penalty" for computing, Moody said. "The difference with this attack is you take a much greater performance penalty than those previous attacks."

>"Patches that disable the micro-op cache or halt speculative execution on legacy hardware would effectively roll back critical performance innovations in most modern Intel and AMD processors, and this just isn't feasible," Ren, the lead student author, said.

Randor5y ago

The best part of the new "defense against Spectre" is that the LFENCE instruction has been around for ~20 years. It's not even a defense against all variants.

mhh__5y ago

So what? The defense relies on it serializing the instruction stream which is not necessarily true based on the semantics of the instruction (until it was retroactively documented to do so)

the84725y ago

lfence behavior varies. On AMD CPUs you need to set an MSR to make it serialize instruction dispatch.

spacemanmatt5y ago· 3 in thread

Is ARM so much better? I can migrate my AWS hosts.

tyingq5y ago

Separate micro-op cache per core, and no hyperthreading, so ARM would seem better equipped to defend against this.

spacemanmatt5y ago

Good to know. This new era of aggressive hardware flaw exploitation has me motivated to leverage my mobility and flexibility to evade. I don't think I have a better strategy.

grishka5y ago

So ARM CPUs do have microcode after all?

1 more reply

londons_explore5y ago· 2 in thread

The solution will be "do not share the micro-op cache between different address spaces".

Which for old hardware will translate to "flush the micro op cache every time the address space changes".

I would guess that can be done with a microcode update and that the performance hit won't be too massive.
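
A rough way to see why flushing on an address-space change would close the channel is a prime-and-probe toy model. Everything below is an assumption made for illustration: `UopCache`, its 8-set layout and the ownership tracking are invented and do not describe any real micro-op cache.

```python
# Toy model only: UopCache and NUM_SETS are hypothetical, chosen to keep the
# example small. Real micro-op caches are organized very differently.

NUM_SETS = 8


class UopCache:
    def __init__(self, flush_on_switch):
        self.flush_on_switch = flush_on_switch
        self.owner = [None] * NUM_SETS   # which address space filled each set
        self.current = None

    def switch_to(self, address_space):
        if self.flush_on_switch and address_space != self.current:
            self.owner = [None] * NUM_SETS  # drop everything on a switch
        self.current = address_space

    def fetch(self, set_index):
        """Return True on a hit, then leave the set filled by the current space."""
        hit = self.owner[set_index] == self.current
        self.owner[set_index] = self.current
        return hit


def attacker_observation(flush_on_switch, victim_sets):
    cache = UopCache(flush_on_switch)
    cache.switch_to("attacker")
    for s in range(NUM_SETS):        # prime: fill every set
        cache.fetch(s)
    cache.switch_to("victim")
    for s in victim_sets:            # victim's secret-dependent code footprint
        cache.fetch(s)
    cache.switch_to("attacker")
    return [cache.fetch(s) for s in range(NUM_SETS)]  # probe: hit/miss per set
```

With `flush_on_switch=False` the attacker's hit/miss pattern depends on which sets the victim touched. With `flush_on_switch=True` every probe misses no matter what the victim did, so the pattern carries no information. The price is that each side refills the cache after every switch, which is the performance hit in question.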

tachyonbeam5y ago

The micro-op cache is very small, on the order of ~1.5K uops AFAIK. It can also be repopulated quite fast. So yes, the performance hit should be quite small. You should presumably also be able to reduce the performance hit if you reduce the frequency of context switches, which should get easier the more cores you have, if I'm not mistaken. That is, the OS can have its own dedicated core, and some programs can be more or less pinned to other cores where they are rarely interrupted.

the84725y ago

> You should presumably also be able to reduce the performance hit if you reduce the frequency of context switches, which should get easier the more cores you have, if I'm not mistaken.

Context switches don't happen that often due to preemption unless your CPU is oversubscribed. Most context switches are due to syscalls, especially the ones used to wait for contended locks. Reducing those takes a lot more optimization work.

1 more reply

baybal25y ago· 2 in thread

You cannot realistically make a CPU invulnerable to performance analysis.

And you don't need to.

There are really very few uses for real multi-system vs. multi-process shared systems.

Take a look at that whole "cloud" thing.

Everyone I know who has worked in cloud hosting says that most systems are ridiculously overprovisioned, effectively nullifying any economic justification for a shared system.

londons_explore5y ago

One day, when margins shrink for cloud compute, we'll see less and less overprovisioning...

lanstin5y ago

I usually end up over provisioning because I need something that is billed along with CPU; for example I have super good C or Go code to run proxies on, they use like 2% of the CPU when they max out the network connection. I add more so the bandwidth goes up.

1 more reply

ineedasername5y ago· 2 in thread

undocumented features in Intel and AMD processors

Why is this at all a thing? Why would you ever leave something out there like that without documenting its existence?

gravypod5y ago

I'm assuming these are instructions for self tests or verification. If so, removing the instructions after they are manufactured wouldn't be easy. You can do it in microcode at the cost of making all execution slightly slower (if instruction not in [a, b, c, d]) or by physically altering the die to remove those instructions. Either way, it doesn't sound fun. It's probably easier to leave them in.

ineedasername5y ago

There's no reason to remove them, that's not what I'm asking. By all means leave them in, but why leave them undocumented? Explain their existence, their parameters & capabilities. If not intended for use, explain that too.

Then when something unexpected like Spectre comes along, the people that have to deal with it can say "Oh yeah, those testing instructions provide another vector of attack that our patch has to account for."

Instead we're in this situation, and I'm pretty sure there's at least a half dozen nations that would have already devoted the resources needed to uncover undocumented instructions like this, meaning ample opportunity to have developed various exploits.

vmception5y ago· 2 in thread

Out of curiosity, is Apple's M1 processor seemingly faster because it is actually closer to a normal CPU progression, while all the other common CPUs (x86) took retroactive performance hits from patching Spectre?

And does M1 therefore seem so much faster than it otherwise would?

creshal5y ago

ARM CPUs, including Apple's, have been found vulnerable to some Spectre variants. As far as I understand, M1 already contains hardware mitigations for all known ones, but so do the latest Intel/AMD chips. (Or rather, they contain fixes for all but these latest ones.)

jlouis5y ago

I'm more inclined to believe M1 is fast because it's a modern design where every part of the architecture is under Apple's control.

In particular, the design uses a memory hub with the memory chips very close to the CPU core. And it has a massive L2 cache on top.

Woodi5y ago· 2 in thread

The simplest way around all of this is to go back to one-core MULTI-SOCKET systems for "civilian" computers like x86.

mort965y ago

You're gonna put 10-16 sockets on one motherboard?

daneel_w5y ago

Blade servers seem like an efficient solution in terms of area and volume used, and there's probably a lot more to explore in the concept.

ForOldHack5y ago· 1 in thread

This had to come. The only fix will be to add a BIOS setting for speculative access or no speculative access. Gamers all turn it on, on a patched machine that runs nothing but their game. Everyone else, doing things like browsing the web, turns it off. Look for an encoded binary JavaScript exploit that will own any speculative-access system. It's coming too, just like this paper would eventually come.

bruce3434345y ago

This is not feasible. Not everyone has multiple computers. And not everyone wants to use different computers for different things, or take the effort to muck around in the (often mazelike) BIOS settings.

Causality15y ago· 1 in thread

I expect this to be just like Spectre. The media seizes on it as a tool to use fear to drive engagement, vendors partially cripple their hardware to guard against it, and literally nobody ever bothers trying to actually use it against innocent people.

mhh__5y ago

Just like y2k!

anthk5y ago· 1 in thread

https://www.mail-archive.com/source-changes@openbsd.org/msg9...

OpenBSD disabled HT by default.

dTal5y ago

That's less a case of "OpenBSD is prescient" and more "OpenBSD disables everything by default". Even a stopped clock...

juancn5y ago· 1 in thread

This may sound stupid, but the commonality in all these side channel attacks is that high precision time keeping is a non privileged operation.

Maybe it’s time to make clocks a privileged op as a mitigation. Even making execution time non predictable on untrusted code, such as JavaScript?

If precise time keeping is unavailable these become harder to do.
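
As a sketch of what "harder" means here, assume untrusted code only sees a clock rounded down to a coarse grain. `GRAIN_NS` and the two latencies below are illustrative numbers picked for the example, not measurements of any real system.

```python
# Illustrative numbers only; GRAIN_NS and both durations are assumptions.

GRAIN_NS = 100_000  # hypothetical clock granularity exposed to untrusted code


def coarse_now(true_time_ns):
    """Round a precise timestamp down to the nearest multiple of GRAIN_NS."""
    return (true_time_ns // GRAIN_NS) * GRAIN_NS


def measured(start_ns, duration_ns):
    """Duration of an operation as seen through the coarse clock."""
    return coarse_now(start_ns + duration_ns) - coarse_now(start_ns)


# A fast (cached) and a slow (uncached) access, both far shorter than one grain.
cache_hit = measured(start_ns=1_000_000, duration_ns=40)
cache_miss = measured(start_ns=1_000_000, duration_ns=200)
```

Both measurements come out as zero, so a single sample no longer separates hit from miss. It is only harder, though: an operation that happens to straddle a tick boundary still shows up, so repeating the measurement many times can recover the difference statistically.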

lukastr05y ago

It is surprisingly difficult to make timekeeping unavailable. There are many methods, besides the official timer APIs, to get timing - as outlined in the paper "Fantastic Timers and where to find them" from TU Graz:

https://www.researchgate.net/publication/322000263_Fantastic...
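
One home-made timer of this general kind, as I understand it, is a counting thread: a second thread does nothing but increment a shared counter, and reading the counter stands in for reading a clock. The sketch below only shows the idea. `CountingClock` is a hypothetical name, and in CPython the interpreter lock makes it far too coarse to be a usable timing primitive.

```python
import threading
import time


class CountingClock:
    """A home-made clock: a background thread that does nothing but count."""

    def __init__(self):
        self.ticks = 0
        self._running = True
        self._thread = threading.Thread(target=self._spin, daemon=True)
        self._thread.start()

    def _spin(self):
        # Increment until asked to stop; the counter value is the "time".
        while self._running:
            self.ticks += 1

    def now(self):
        return self.ticks

    def stop(self):
        self._running = False
        self._thread.join()


clock = CountingClock()
start = clock.now()
time.sleep(0.05)          # stand-in for the operation being timed
elapsed_ticks = clock.now() - start
clock.stop()
```

No timer API is consulted for the measurement itself, which is why taking the official clocks away does not remove the ability to measure.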

druud625y ago· 1 in thread

The CPU needs to make the overheard signals look just like random noise. A cheap XOR-stream (compare 2FA like Google Authenticator, or the remote in your car keys) should cover that.

vletal5y ago

Well, some of these attacks exploit the actual values present in the memory, not their stored representations. Therefore it would not matter how you encode them on the way, right?

CalChris5y ago

The paper:

I See Dead µops: Leaking Secrets via Intel/AMD Micro-Op Caches

http://www.cs.virginia.edu/venkat/papers/isca2021a.pdf

amluto5y ago

My response: https://lore.kernel.org/lkml/CALCETrXRvhqw0fibE6qom3sDJ+nOa_...

I don’t think any new mitigations are needed.

floatingatoll5y ago

I’d like to highlight this excellent post about x86 micro-ops “fusion” from three years ago, as it’s the reason I have any idea at all what micro-ops are:

https://news.ycombinator.com/item?id=16304415

iam-TJ5y ago

The U of V Engineering Faculty release is at

https://engineering.virginia.edu/news/2021/04/defenseless

1cvmask5y ago

This quote from the article explains the danger quite well:

"Intel's suggested defense against Spectre, which is called LFENCE, places sensitive code in a waiting area until the security checks are executed, and only then is the sensitive code allowed to execute," Venkat said. "But it turns out the walls of this waiting area have ears, which our attack exploits. We show how an attacker can smuggle secrets through the micro-op cache by using it as a covert channel."

SG20005y ago

A close reading of the paper “I see dead uOps” would seem to indicate that Intel’s static thread partitioning of their micro-op cache would confer some inherent protection against uOp cache information leakage between threads - as compared to AMD’s dynamic thread partitioning scheme which could theoretically allow threads to spy on each other using the described techniques.

If true, wouldn’t this also imply that an Intel Skylake CPU mitigates against such attempted attacks by one user against another in a shared CPU/ISP/cloud environment, whereas an AMD CPU theoretically would not? If true, this would be a key point that the authors failed to mention in their concluding remarks.

Anyone else read it this way? Or am I missing something?

failwhaleshark5y ago

The act of loading code into memory, be it a hypervisor or a guest OS, should've been gated by sanitization and validation callbacks. Building all of these macro- and micro-op runtime defenses and mitigations in the processor and slowing down the OSes for every possible runtime edge case are a waste of speed that can be avoided by establishing trust of code pages.

The morphing of data into code pages with JITs like JS should also be subject to similar restrictions.


akersten5y ago· 33 in thread

We're going to wind up undoing the last 20 years of performance gains in the name of 'security', and it scares me.

MarkSweep5y ago

> if I have untrusted code running on my CPU, I've already lost

Don’t forget about JavaScript, a common way for people to run untrusted code on their computers. Not all of micro-architectural data sample are exploitable in JavaScript, but some are.

akersten5y ago

5 more replies

baybal25y ago

Chrome had 7 exploits caught in the wild within 7 weeks in 2020.

I believe it is going towards JIT being disabled, or most severely limited.

2 more replies

indigochill5y ago

> We're going to wind up undoing the last 20 years of performance gains in the name of 'security', and it scares me.

This actually excites me. When the foundation is shown to be rotten, it's time for a new foundation.

creatonez5y ago

A major advantage of not adopting new CPU designs for a while is that you get to keep insecure-but-fast and secure-but-slow behavior in the same CPU by simply tweaking mitigations.

jka5y ago

Good thinking, although I'd be a little wary about making a clear distinction between those two classes of device.

It'd seem both theoretically and practically possible to engineer hardware that could enable and disable certain optimizations and extensions dynamically.

(note that energy consumption may also be a related factor here)

baybal25y ago

You are completely correct.

The safe execution of any untrusted Turing complete code is a pipe dream.

You, at least, need a clean sheet CPU design starting from ISA, and basic logic operations formally validated against instruction level analysis to have a fighting chance.

But even such chips do get pwned, as shown by key recovery from credit cards in the wild.

elihu5y ago

> The safe execution of any untrusted Turing complete code is a pipe dream.

1 more reply

mikewarot5y ago

>The safe execution of any untrusted Turing complete code is a pipe dream.

The IT zeitgeist these days makes me sad. Things can be better, but almost everyone is pushing in counterproductive directions, or has given up hope.

3 more replies

Salgat5y ago

Will this even be an issue when we have CPUs with hundreds/thousands of cores that can just sandbox processes to their own set of cores/cache with exclusive unshared memory?

MaxBarraclough5y ago

2 more replies

freemint5y ago

You could very much design a CPU, and verify it, such that execution of arbitrary code has no observable side effects.

However, attempting to do so is a multi-year project, and for modern CPUs certainly a multi-decade one. This is not feasible as long as Moore's Law goes on.

Formal verification of a kernel (seL4) and a microprocessor it runs on has been performed together to prove properties.

Large ALU blocks and vector units (such as multipliers) can be formally verified.

But nobody is gonna undo 20 years of performance. You will just be told to buy more machines to isolate workloads.

rsj_hn5y ago

The largest systems I'm aware of that meet formal verification -- like CC EAL 7 -- are small smart card (Gemalto) operating systems. Has it been done on anything bigger?

paulmd5y ago

I've been saying since this initially came up that big.LITTLE is the long-term solution for this.

AMD and Intel engineers, please make your consulting checks out to 'cash'. Thanks! ;)

hinkley5y ago

> It's probably better to get them fully out of the "normal" cache hierarchy as well.

willis9365y ago

>In the grand scheme of things, high-intensity tasks are only infrequently high-security tasks - those two sets of workloads are mostly disjoint.

The most intense thing my phone does is decrypt my password database, and it does this dozens of times a day.

3 more replies

tehbeard5y ago

Honest question, what's the "Explain like I'm a Freshman" for what they can work out/leak from all the spectre stuff?

I see a lot about private keys etc., but is it just a blind attack? Or do you need more info on the target? How quickly can you attack to get info?

In essence, is this something Joe Public needs to worry about for their $5 VPS, or something nefarious that gets used against $CORP's public cloud infrastructure?

ChuckMcM5y ago

I think we will eventually see a return to company 'data centers' away from IaaS plays.

fauigerzigerk5y ago

> I think we will eventually see a return to company 'data centers' away from IaaS plays

I don’t see why. Cloud providers have been offering dedicated hardware for a long time. If this problem isn‘t reliably fixable then more customers will make use of these options.

dreamcompiler5y ago

It's not impossible but it does require some relatively unfamiliar architectural approaches coupled with a lot more use of formal methods.

Completely agree about SpecEx. That's a misfeature that needs to die.

Szpadel5y ago

I think that cloud providers get dedicated SKUs. I can imagine if you give each VM dedicated cores and you can partition L3 cache per user, you could mitigate most of those issues.

tinus_hn5y ago

devit5y ago

It's pretty trivial to make such a CPU: just execute instructions in order and with no cache.

The challenge is more how to make a fast CPU like that.

api5y ago

It’s also possible that we strip off a ton of complexity and then find new performance directions that are better.

ThrowawayR25y ago

Plenty of smart people spent lots of money trying it and as it happens the physics doesn't work out. There are countless articles explaining why processor clock speed isn't increasing, e.g.

- https://www.maketecheasier.com/why-cpu-clock-speed-isnt-incr...

- https://software.intel.com/content/www/us/en/develop/blogs/w...

yjftsjthsd-h5y ago

> For my money we’d end up going toward many-core with loads of simple in-order cores on a die.

That helps with multitasking and parallel-friendly workloads, but lots of stuff isn't easy to make multithreaded.

varispeed5y ago

yuliyp5y ago

If you see them as a scam you're welcome to not use them. There are still colo facilities out there, and if you don't want to use that there are ISPs that will let you connect to the Internet.

fulafel5y ago

> is not possible to design a CPU such that execution of arbitrary instructions has no observable side-effects, especially if the CPU is speculating

As a counter example, how about the 8086?

ForOldHack5y ago

I agree completely.

Bancakes5y ago

yjftsjthsd-h5y ago

> 20 years ago, computer magazines wrote about single-core 10GHz CPUs

Yeah, because they didn't realize how terribly the power consumption / heat output would scale; a 10GHz CPU will just melt itself.

2 more replies

mhh__5y ago

As opposed to the billion transistors we have now?

1 more reply

totallyabstract5y ago· 10 in thread

jiggawatts5y ago

One sneaky thing I've noticed them doing is slowly switching their licensing over to 1 vCPU = 1 CPU, even though you're now only getting one hyperthread instead of one core.

For Microsoft, this means that they've literally doubled their software licensing revenue relative to the hardware it is licensed to.

Ask your cloud sales representative these questions next time you have coffee with them:

- What incentive do you have to make your logging formats efficient, if you charge by the gigabyte ingested?

- If your customers are forced to "scale out" to compensate for a platform inefficiency, what incentive do you have to fix the underlying issue?

Etc...

the84725y ago

Cloud vendors probably use a hypervisor that schedules the VM time slices in a way that hyperthread siblings are only ever cooccupied by the same guest.

ljhsiung5y ago

Even putting security aspects aside, in general I've been seeing research pop up over the years criticizing SMT's performance claims of ~30%.

Hell, even Amazon's Graviton CPUs don't have it (though I'm sure that's a product of being ARM derived rather than a design decision).

tux35y ago

So turning SMT off is at the least wasted potential for those cores, the way they've been designed

1 more reply

hajile5y ago

The performance claims are true for all the worst reasons.

Let's say you can queue up 100 instructions. This yields the following

    1 port 100% of the time
    2 ports 60% of the time
    3 ports 30% of the time
    4 ports 10% of the time
    5 ports 2% of the time

Increasing the buffer to 200 instructions yields the following

    2 ports 80% of the time
    3 ports 40% of the time
    4 ports 15% of the time
    5 ports 4% of the time

2 more replies

jcelerier5y ago

When doing audio processing I'm getting ~20/25% more oomph with HT enabled

tyingq5y ago

Or to roll out more ARM, where there isn't currently any hyperthreading.

jamieiles5y ago

ThunderX2 and X3 have 4-way SMT for general purpose, but yes, more ARM is good :-)

1 more reply

secondcoming5y ago

Why would this be an issue for machines on the cloud? If someone can upload binaries to your machine you have bigger problems, no?

derekp75y ago

1 more reply

smasher1645y ago· 5 in thread

Maybe EPIC [1] architectures need a revival. Rely on compilers to take advantage of explicit instruction-level parallelism, and keep the CPU dumb.

[1] https://en.wikipedia.org/wiki/Explicitly_parallel_instructio...

bonzini5y ago

Paianni5y ago

From Poulson onwards anyway. https://www.realworldtech.com/poulson/

pabs35y ago

Reminds me of the Mill ISA:

https://millcomputing.com/

zokula5y ago

> Rely on compilers to take advantage of explicit instruction-level parallelism, and keep the CPU dumb.

This very much is never going to be feasible for consumer and general purpose computing.

smasher1645y ago

The ML folks are pulling themselves out of that rut now. There’s lots of interesting work going on for the next generation of compilers.

dataflow5y ago· 4 in thread

jimmaswell5y ago

1e-95y ago

Yes. It could undermine your browser if you allow a malicious site to run JavaScript.

kllrnohj5y ago

2 more replies

dataflow5y ago

3 more replies

PopePompus5y ago· 4 in thread

tux35y ago

If something is in cache, and you also have access to that cache, accessing that thing will be fast and few CPU resources will be used.

So you can tell that something is in cache. And you know you didn't put it there. So some other thread that you're sharing a CPU core with must have put it there.
