GPU type and utilization mean that the costs likely rise only logarithmically or sub-linearly. If you commit to buying enough inference for long enough, someone can buy a rack of the newest custom inference chips and run them at 100% utilization for you, which may be a lot cheaper per request than running them on a CPU somewhere.
I disagree, tbh. I accept that new silicon will have better power usage and will probably be more efficient in terms of flops/joule, but there would need to be a major technical breakthrough to get a logarithmic relationship between N requests and inference cost. N requests at P flops still means I need C x P flops for C x N requests. A not-so-steep linear relationship is still linear.
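To make the scaling argument concrete, here's a toy comparison (illustrative numbers only, not real benchmarks; `FLOPS_PER_REQUEST` is an assumed per-request cost) of the linear cost model described above versus the hypothetical logarithmic one:

```python
import math

FLOPS_PER_REQUEST = 1e12  # assumed cost of one inference request (illustrative)

def linear_cost(n_requests):
    # C x N requests -> C x P flops: doubling requests doubles total flops
    return n_requests * FLOPS_PER_REQUEST

def log_cost(n_requests):
    # what a logarithmic relationship would imply (no known mechanism for this)
    return math.log2(n_requests + 1) * FLOPS_PER_REQUEST

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} requests: linear {linear_cost(n):.2e} flops, "
          f"logarithmic {log_cost(n):.2e} flops")
```

Cheaper hardware shrinks the constant in front of the linear term, but 100x the requests is still 100x the flops unless the relationship itself changes.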