undefined | Better HN

0 points3abiton1y ago0 comments

This is still amazing work, imagine running chungus models on a single 3090.

0 comments

The bottleneck on a consumer-grade GPU like a 3090 isn't the processing power, it's the lack of RAM. The PCI-Express bus ends up being your bottleneck from having to swap in parts of the model.

Even with PCIe 5.0 and 16 lanes, you only get 64 GB/s of bandwidth. If you're trying to run a model too big for your GPU, then for every token, it has to reload the entire model. With a 70B parameter model, 8 bit quantization, you're looking at just under 1 token/sec just from having to transfer parts of the model in constantly. Making the actual computation faster won't make it any faster.

dragonwriter1y ago

OTOH, doesn't it also mean that (given appropriate software framework support) iGPUs with less processing capacity and slower-but-more RAM available (because system RAM is comparatively cheap and plentiful compared to VRAM) without swapping anything are more competitive against consumer dGPUs with fast-but-small RAM for both inference and training with larger models?

Sohcahtoa821y ago

System memory isn't that fast, either. Even with DDR5-8400, the fastest memory you can get right now, you're only looking at a memory transfer speed of 67.2 GB/s, barely faster than the PCI-E bus. So even if you could store that entire 70B model in RAM, you're still getting just under 1 token/sec, and that's assuming your CPU doesn't become a bottleneck.

Your best bet would likely be a laptop that has integrated system RAM with VRAM, but I don't think any of those offer enough RAM to store an entire 70B model. A 7B parameter model would work fine, but you could do those on a consumer-grade GPU anyways.

DeveloperErrata1y ago

Macbook Pros with M3 & integrated RAM & VRAM can do 70B models :)

j / k navigate · click thread line to collapse

0 comments

Sohcahtoa821y ago

The bottleneck on a consumer-grade GPU like a 3090 isn't the processing power, it's the lack of RAM. The PCI-Express bus ends up being your bottleneck from having to swap in parts of the model.

dragonwriter1y ago

Sohcahtoa821y ago

DeveloperErrata1y ago

Macbook Pros with M3 & integrated RAM & VRAM can do 70B models :)

j / k navigate · click thread line to collapse