A single GPU warp is both beefier and wimpier than an SMT thread: warps execute in-order and are barely superscalar, whereas a CPU core is a wide-issue, big-window, out-of-order brainiac. On the other hand, the SM has wider SIMD execution resources, and there's enough issue throughput to keep several warps in flight without them blocking each other.
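The latency-hiding point can be made concrete with a back-of-the-envelope calculation. All numbers below are illustrative assumptions, not the specs of any real SM:

```python
import math

# Rough latency-hiding arithmetic (assumed numbers, not real hardware specs).
MEM_LATENCY_CYCLES = 400     # assumed global-memory latency
ISSUE_INTERVAL_CYCLES = 4    # assumed cycles between issues for one in-order warp
INDEPENDENT_INSTRS = 8       # assumed independent instructions a warp has before it stalls

# Cycles of useful work one warp can do before it must wait on memory:
busy_cycles = INDEPENDENT_INSTRS * ISSUE_INTERVAL_CYCLES  # 32 cycles

# Warps needed so that while one warp waits on memory, the others keep
# the SM's execution units fed:
warps_needed = math.ceil(MEM_LATENCY_CYCLES / busy_cycles)
print(warps_needed)  # 13 warps under these assumptions
```

The exact numbers don't matter; the shape of the argument does: because each warp is in-order and stalls cheaply, the SM hides memory latency by oversubscribing warps rather than by speculating within one thread the way an OoO CPU core does.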
A major difference is how the execution resources are tuned to the expected workloads. CPUs run application code that likes big, low-latency caches and high single-thread performance on branchy integer code, so it doesn't pay to spend die area on maximizing AVX-512 FP instructions per cycle or on ever more memory bandwidth.