GPU type and utilization mean that the costs likely rise only logarithmically or sub-linearly. If you commit to buying enough inference for long enough, someone can buy a rack of the newest custom inference chips and run them at 100% utilization for you, which may be a lot cheaper per request than running them on a CPU somewhere.
I disagree, tbh. I accept that new silicon will have better power usage and will probably be more efficient in terms of flops/joule, but there would need to be a major technical breakthrough to get a logarithmic relationship between N requests and inference cost. N requests at P flops still means I need C x P flops for C x N requests. A not-so-steep linear relationship is still linear.
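To make the scaling argument concrete, here's a toy comparison (illustrative numbers only, not real benchmarks; `FLOPS_PER_REQUEST` is an assumed per-request cost) of the linear cost model described above versus the hypothetical logarithmic one:

```python
import math

FLOPS_PER_REQUEST = 1e12  # assumed cost of one inference request (illustrative)

def linear_cost(n_requests):
    # C x N requests -> C x P flops: doubling requests doubles total flops
    return n_requests * FLOPS_PER_REQUEST

def log_cost(n_requests):
    # what a logarithmic relationship would imply (no known mechanism for this)
    return math.log2(n_requests + 1) * FLOPS_PER_REQUEST

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} requests: linear {linear_cost(n):.2e} flops, "
          f"logarithmic {log_cost(n):.2e} flops")
```

Cheaper hardware shrinks the constant in front of the linear term, but 100x the requests is still 100x the flops unless the relationship itself changes.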