SMT/hyperthreading is complicated. If you have a workload dominated by non-local DRAM fetches, it's a huge win because when the CPU pipeline is stalled on one thread it can still issue instructions from the other.
If you have a workload dominated by L1 cache bandwidth, the opposite is true because the threads compete for the same resource.
On balance, on typical workloads, it's a win. But there are real-world problems for which turning it off is a legitimate performance choice.
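To make the "stalled on DRAM" case concrete, here is a minimal sketch (my own illustration, in Python) of a dependent pointer-chase: each load's address comes out of the previous load, so the core cannot overlap the fetches and just waits — exactly the situation where a second hardware thread gets the pipeline for free. Python won't expose the hardware effect, but the access pattern is the same one a compiled benchmark would use.

```python
import random

def build_cycle(n, seed=0):
    """Build a random single-cycle permutation: next_[i] is the successor
    of slot i. Chasing it touches memory in random order, so in native
    code every step is a dependent, cache-hostile load."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    next_ = [0] * n
    for i in range(n):
        next_[order[i]] = order[(i + 1) % n]
    return next_, order[0]

def chase(next_, start, steps):
    """Follow the chain: each iteration depends on the previous load,
    which is what keeps one hardware thread stalled on DRAM."""
    i = start
    for _ in range(steps):
        i = next_[i]
    return i

n = 1 << 16
next_, start = build_cycle(n)
# A single-cycle permutation returns to its start after exactly n steps.
assert chase(next_, start, n) == start
```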
How often is that a polite way of saying "software that is inefficient"?
Also, to be fair: sometimes a DRAM fetch is just inherent in the problem. Big RAM-resident databases are all about DRAM latency because while sure, it's a lot slower than L1 cache, it's still faster than flash. I mean, memcached is a giant monument in praise of the pipeline stall, and it's hugely successful.
Indeed. It is arguably rational for Intel to take on the burden in a centralised place rather than expecting every two-bit software shop to do it.
But then the existence of this kind of security issue shows that the added complexity is not always worthwhile. We might be forced to accept that computers which actually behave well are a little bit slower than we thought. But in return they will be simpler and more amenable to software optimisation.
Trees or hashmaps which use non-local DRAM fetches can be more efficient than a brute-force linear search through a contiguous array, given a sufficiently large number of elements.
At the same time, contiguous arrays can be significantly more efficient than linked lists which use non-local DRAM fetches.
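The first half of that trade-off is easy to show even from Python: a linear membership scan over a list is O(n), while a hash probe is a handful of (potentially non-local) fetches, so past some n the hash wins regardless of cache friendliness. A quick hedged sketch, using the standard `timeit` module (the exact crossover point depends on the machine and runtime):

```python
import timeit

n = 100_000
data = list(range(n))   # array-like: great locality, but O(n) search
lookup = set(data)      # hash table: a few scattered fetches per probe
needle = n - 1          # worst case for the linear scan

# The linear search touches every element; the hash probe touches a few.
linear = timeit.timeit(lambda: needle in data, number=100)
hashed = timeit.timeit(lambda: needle in lookup, number=100)
print(f"linear: {linear:.4f}s  hash: {hashed:.4f}s")

# Both structures must agree on membership, of course.
assert (needle in data) and (needle in lookup)
```

At this size the hash probe is faster by several orders of magnitude; the array's locality only wins back the race at small n, or when you traverse all elements in order.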
One could also say "software written with strong OOP patterns" because those are almost always written to benefit the developer later, rather than the CPU and RAM at runtime.
To take an extreme example, traversing graphs is notorious. Cray and Sun, IIRC, have some fascinating processors with many, many hardware threads, because all the programs do is wait on DRAM — but luckily there are lots of searches that can be done in parallel.
Conversely: finding a task that is L1-cache-bound but does not frequently have to stall for memory is much harder. The only ones off the top of my head are streaming tasks like software video decode.
One task that is L1 cache bound and does not frequently stall for memory (if you code it up well) is matrix multiply.
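"If you code it up well" here means blocking/tiling: operate on small sub-blocks so the working set stays in L1 and each loaded value is reused many times before eviction. A toy sketch of the loop structure (pure Python, so it only illustrates the tiling, not the actual cache behavior — the payoff appears in optimized native code):

```python
def matmul_tiled(A, B, tile=32):
    """Blocked matrix multiply: C = A @ B computed tile-by-tile so that
    each tile of A, B, and C fits in L1 and gets reused ~tile times."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # Work entirely inside one tile x tile block.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        a = A[i][k]
                        row_b, row_c = B[k], C[i]
                        for j in range(jj, min(jj + tile, p)):
                            row_c[j] += a * row_b[j]
    return C

# Sanity check against the textbook definition on a small case.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert matmul_tiled(A, B, tile=2) == [[19.0, 22.0], [43.0, 50.0]]
```

The same tiling idea, with vector registers and a tuned tile size, is why a good native matmul kernel saturates the FPUs instead of stalling on memory.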