SMT/hyperthreading is complicated. If you have a workload dominated by non-local DRAM fetches, it's a huge win because when the CPU pipeline is stalled on one thread it can still issue instructions from the other.
If you have a workload dominated by L1 cache bandwidth, the opposite is true because the threads compete for the same resource.
On balance, on typical workloads, it's a win. But there are real-world problems for which turning it off is a legitimate performance choice.
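To make the "stalled on DRAM" case concrete, here is a minimal sketch (my own illustration, in Python) of a dependent pointer-chase: each load's address comes out of the previous load, so the core cannot overlap the fetches and just waits — exactly the situation where a second hardware thread gets the pipeline for free. Python won't expose the hardware effect, but the access pattern is the same one a compiled benchmark would use.

```python
import random

def build_cycle(n, seed=0):
    """Build a random single-cycle permutation: next_[i] is the successor
    of slot i. Chasing it touches memory in random order, so in native
    code every step is a dependent, cache-hostile load."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    next_ = [0] * n
    for i in range(n):
        next_[order[i]] = order[(i + 1) % n]
    return next_, order[0]

def chase(next_, start, steps):
    """Follow the chain: each iteration depends on the previous load,
    which is what keeps one hardware thread stalled on DRAM."""
    i = start
    for _ in range(steps):
        i = next_[i]
    return i

n = 1 << 16
next_, start = build_cycle(n)
# A single-cycle permutation returns to its start after exactly n steps.
assert chase(next_, start, n) == start
```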
How often is that a polite way of saying "software that is inefficient"?
Also, to be fair: sometimes a DRAM fetch is just inherent in the problem. Big RAM-resident databases are all about DRAM latency because while sure, it's a lot slower than L1 cache, it's still faster than flash. I mean, memcached is a giant monument in praise of the pipeline stall, and it's hugely successful.
Indeed. It is arguably rational for Intel to take on the burden in a centralised place rather than expecting every two-bit software shop to do it.
But then the existence of this kind of security issue shows that the added complexity is not always worthwhile. We might be forced to accept that computers which actually behave well are a little bit slower than we thought. But in return they will be simpler and more amenable to software optimisation.
Trees or hashmaps which use non-local DRAM fetches can be more efficient than a brute-force linear search through a contiguous array, given a sufficiently large number of elements.
At the same time, contiguous arrays can be significantly more efficient than linked lists which use non-local DRAM fetches.
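The first half of that trade-off is easy to show even from Python: a linear membership scan over a list is O(n), while a hash probe is a handful of (potentially non-local) fetches, so past some n the hash wins regardless of cache friendliness. A quick hedged sketch, using the standard `timeit` module (the exact crossover point depends on the machine and runtime):

```python
import timeit

n = 100_000
data = list(range(n))   # array-like: great locality, but O(n) search
lookup = set(data)      # hash table: a few scattered fetches per probe
needle = n - 1          # worst case for the linear scan

# The linear search touches every element; the hash probe touches a few.
linear = timeit.timeit(lambda: needle in data, number=100)
hashed = timeit.timeit(lambda: needle in lookup, number=100)
print(f"linear: {linear:.4f}s  hash: {hashed:.4f}s")

# Both structures must agree on membership, of course.
assert (needle in data) and (needle in lookup)
```

At this size the hash probe is faster by several orders of magnitude; the array's locality only wins back the race at small n, or when you traverse all elements in order.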
One could also say "software written with strong OOP patterns" because those are almost always written to benefit the developer later, rather than the CPU and RAM at runtime.
To take an extreme example, traversing graphs is notorious. Cray and Sun, IIRC, have some fascinating processors with many, many hardware threads, because all the programs do is wait on DRAM — but luckily there are lots of searches that can be done in parallel.
Conversely: finding a task that is L1-cache-bound but does not frequently have to stall for memory is much harder. The only ones off the top of my head are streaming tasks like software video decode.
One task that is L1 cache bound and does not frequently stall for memory (if you code it up well) is matrix multiply.
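"If you code it up well" here means blocking/tiling: operate on small sub-blocks so the working set stays in L1 and each loaded value is reused many times before eviction. A toy sketch of the loop structure (pure Python, so it only illustrates the tiling, not the actual cache behavior — the payoff appears in optimized native code):

```python
def matmul_tiled(A, B, tile=32):
    """Blocked matrix multiply: C = A @ B computed tile-by-tile so that
    each tile of A, B, and C fits in L1 and gets reused ~tile times."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # Work entirely inside one tile x tile block.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        a = A[i][k]
                        row_b, row_c = B[k], C[i]
                        for j in range(jj, min(jj + tile, p)):
                            row_c[j] += a * row_b[j]
    return C

# Sanity check against the textbook definition on a small case.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert matmul_tiled(A, B, tile=2) == [[19.0, 22.0], [43.0, 50.0]]
```

The same tiling idea, with vector registers and a tuned tile size, is why a good native matmul kernel saturates the FPUs instead of stalling on memory.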