Think of it this way: the original Niagara (T1) was an in-order CPU. That is, instructions were executed in the order they occur in the program code. This is simple and power efficient but doesn't produce very good single-thread performance, since the processor stalls if an instruction takes longer than expected. Say a load instruction misses L1 cache and has to fetch the data from L2/L3/Lwhatever/memory. Now, one way to drive up the utilization of the CPU core is to add hardware threads. And the simplest way to do that? Well, just run an instruction from another available thread every cycle (that is, if a thread is blocked, e.g. waiting for memory, skip it). So now you have a CPU that is still pretty small, simple and power efficient, but can still exploit memory-level parallelism (i.e. have multiple outstanding memory ops in flight).
Now, the other approach is a CPU with out-of-order (OoO) execution. Meaning that the CPU contains a scheduler that handles a queue of instructions, and any instruction that has all its dependencies satisfied can be submitted for execution. And then later on a bunch of magic happens so that, externally to the CPU, it still looks like everything was executed in order as the program code specified. This is pretty good for getting good single-thread performance, and can exploit some amount of MLP as well: e.g. if a bunch of instructions are waiting for a memory operation to complete, some other instructions can still proceed (perhaps executing a memory op themselves). So in this model the amount of MLP is limited by the inherent serial dependencies in the code, and by the length of the instruction queues that the scheduler maintains. The downside is that the OoO logic takes up quite a bit of chip area (making it more expensive), and also tends to be one of the more power-hungry parts of the chip. But if you want good single-thread performance, that's the price you have to pay.

Anyway, now that you have this OoO CPU, what about adding hardware threads? Well, since you already have all this scheduling logic, it turns out to be relatively easy: just "tag" each instruction with a thread ID, and let the scheduler sort it all out. This is what is called Simultaneous Multi-Threading (SMT). So in a way it's a pretty different way of doing threading compared to the Niagara-style in-order processor. Also, since you already have all this OoO logic that is able to exploit some MLP within each thread, you don't need as many threads as the Niagara-style CPU to saturate the memory subsystem. This SMT style of threading is what you see in contemporary Intel x86 processors (they call it hyperthreading (HT)), IBM POWER, and now also AMD Zen cores.
As for benchmarks, I'm too lazy to search, but I'm sure you can find e.g. some SPEC CPU results for Niagara.