
phonon · 1y ago

Sorry. Didn't realize you meant Skylake-SP.

I am not sure what your point is. There are some nice dual-socket Epyc examples floating around as well that claim 6-8 tokens/s. (I think some of those are actually distilled versions with very small context sizes... I don't see any as thoroughly documented/benchmarked as the above.) This is a dual-socket Sapphire Rapids example with similar-sized CPUs and a consumer graphics card that gives about 16 tokens/second. Sapphire Rapids CPUs and motherboards are a bit more expensive, and a 4090 was $1500 until recently. So for a few thousand more you can double the speed. The prompt processing speed is way faster as well. (Something like 10x faster than the Epyc versions.)

In any case, these are all vastly cheaper approaches than trying to get enough H100s to fit the full R1 model in VRAM! A single 80 GB H100 is more than $20k, and you would need many of them plus a server just to run R1.


menaerus · 1y ago

I'm not disputing their idea, which is sound. What I'm disputing is that the cost needed to achieve the claimed performance is "for a few thousand more", as you stubbornly continue to claim.

The math is clear: single-socket ktransformers performance is 8.73 tok/s and it costs ~$12k to build such a rig. You get the same performance from a $6k dual-EPYC system, and that is the full-blown version of R1, not a distilled one as you say.

Your claim about 16 tok/s is also misleading. It's a figure for 6 experts, while we are comparing R1 with 8 experts against llama.cpp with 8 experts. With 8 experts on a dual-socket system, the ktransformers benchmarks show 12.2 - 13.4 tok/s, not 16 tok/s.

So ktransformers achieves roughly 50% more in a dual-socket configuration than in single-socket, and roughly 50% more than a dual-EPYC system. This is not double as you say. And finally, the cost of such a dual-socket system is ~$20k, so it isn't the "best cost effective" solution, since it is 3.5x more expensive for 50% better output.
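The comparison above reduces to dollars per token per second. A minimal sketch using only the figures quoted in this comment; the system labels and the helper names `dollars_per_tok_s` and `ratios_vs` are illustrative, and taking the midpoint of the quoted 12.2 - 13.4 tok/s range is an assumption:

```python
# Figures as quoted in the comment above; prices are rough, not measured.
SYSTEMS = {
    "dual-EPYC": (6_000, 8.73),
    "single-socket ktransformers": (12_000, 8.73),
    # Assumption: midpoint of the quoted 12.2-13.4 tok/s range.
    "dual-socket ktransformers": (20_000, (12.2 + 13.4) / 2),
}


def dollars_per_tok_s(cost_usd: float, tok_s: float) -> float:
    """Hypothetical helper: build cost divided by generation speed."""
    return cost_usd / tok_s


def ratios_vs(baseline: str, other: str) -> tuple[float, float]:
    """Return (cost ratio, throughput ratio) of `other` relative to `baseline`."""
    base_cost, base_speed = SYSTEMS[baseline]
    cost, speed = SYSTEMS[other]
    return cost / base_cost, speed / base_speed


if __name__ == "__main__":
    for name, (cost, speed) in SYSTEMS.items():
        print(f"{name}: ${dollars_per_tok_s(cost, speed):,.0f} per tok/s")
    cost_x, speed_x = ratios_vs("dual-EPYC", "dual-socket ktransformers")
    print(f"dual-socket vs dual-EPYC: {cost_x:.2f}x cost, {speed_x:.2f}x speed")
```

With these inputs the cost ratio comes out a little over 3x and the throughput ratio close to 1.5x. Swap in your own prices to see how sensitive the conclusion is.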

And tbh llama.cpp is not quite optimized for pure CPU inference workloads. It has this strange "compute graph" framework whose purpose I don't understand. It appears completely unnecessary to me. I also profiled a couple of small-, mid- and large-sized models, and the interesting thing was that the majority of them turned out to be bottlenecked by CPU compute on a system with 44 physical cores and 192G of RAM. I think it could do a much better job there.

phonon (OP) · 1y ago

Are we doing this?

Cheapest 32 core latest EPYC (9335) x 2 = $3,079.00 x 2

Intel 32 Core CPU used above x 2 = $3,157 x 2 (I would choose the Intel Xeon Gold 6530, which is going for around $2k now, with higher clock speeds and 100 MB more cache)

AMD Epyc Dual Socket Motherboard Supermicro H13DSH = $1899

Intel Supermicro X13DEG-QT = $1,800

Memory, PSU, Case = Same

4090 GPU = $1599 - $3,000 (temporary?)

Besides the GPU cost, the rest is about the same price. You only get a deep discount with AMD setups if you use EPYCs a few years old with cheaper (and slower) DDR4.
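Summing the list prices above makes the comparison concrete. A rough sketch using only the prices quoted in this comment; the dictionary layout and the helper `platform_total` are illustrative, and memory, PSU, and case are left out because they are assumed identical for both builds:

```python
# List prices as quoted above (USD). Memory, PSU, and case are omitted
# because the comment treats them as the same for both builds.
PARTS = {
    "AMD": {"cpu_each": 3_079, "motherboard": 1_899},    # EPYC 9335, Supermicro H13DSH
    "Intel": {"cpu_each": 3_157, "motherboard": 1_800},  # Supermicro X13DEG-QT
}
GPU_RANGE = (1_599, 3_000)  # 4090: low and high quoted prices


def platform_total(vendor: str) -> int:
    """Hypothetical helper: two CPUs plus the dual-socket motherboard."""
    parts = PARTS[vendor]
    return 2 * parts["cpu_each"] + parts["motherboard"]


if __name__ == "__main__":
    for vendor in PARTS:
        print(f"{vendor}: ${platform_total(vendor):,} before GPU")
    low, high = (platform_total("Intel") + gpu for gpu in GPU_RANGE)
    print(f"Intel + 4090: ${low:,} to ${high:,}")
```

With these prices the two dual-socket platforms land within about $60 of each other before the GPU is added.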

And again, if you go single CPU, you save over $4,000, but lose around 30% in token generation.

The "$6,000" AMD examples I've seen are pretty vague on exactly what parts were used and exactly what R1 settings, including context length, they were run at, making true apples-to-apples comparisons difficult. Plus the Sapphire Rapids + GPU example is about 10x faster in prompt processing. (53 seconds to 6 seconds is no joke!)

menaerus · 1y ago

> Are we doing this?

Yes, you're blatantly misrepresenting information and moving goalposts. By now it has become clear that you're doing this because you're obviously affiliated with the ktransformers project.

$6k for 8 tok/s or $20k for 12 tok/s. People are not stupid. I rest my case here.
