Sure, let's say that 8B llama3.1 gets performance comparable to its 70B llama2 predecessor. Not quite true, but let's say that hypothetically it is. That still leaves us with 70B llama3.1.
How much VRAM and inference compute is required to run 3.1-70B vs 2-70B?
The argument is that inference cost is dropping significantly each year, but how exactly, if those two models require roughly the same amount of VRAM and compute?
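To put rough numbers on that, here is a back-of-envelope sketch. The assumptions are mine, not measurements: both models are treated as a nominal 70e9 parameters, the weights dominate the footprint (KV cache and activations are ignored), and a decode step costs about 2 FLOPs per parameter per token. The function names are made up for illustration.

```python
def weight_vram_gb(n_params: float, bytes_per_param: float) -> float:
    """VRAM needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def decode_flops_per_token(n_params: float) -> float:
    """Rule-of-thumb compute per generated token: ~2 FLOPs per parameter."""
    return 2 * n_params


# Assumption: both llama2-70B and llama3.1-70B are taken as a nominal 70e9 params.
N_PARAMS_70B = 70e9

for precision, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = weight_vram_gb(N_PARAMS_70B, bytes_per_param)
    print(f"{precision}: ~{gb:.0f} GB for weights alone")

print(f"~{decode_flops_per_token(N_PARAMS_70B) / 1e9:.0f} GFLOPs per generated token")
```

Neither function takes the model generation as an input, only the parameter count and the precision, which is the point: under these assumptions 2-70B and 3.1-70B land on the same numbers.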
One way to drive the cost down is to innovate in inference algorithms so that the HW requirements are relaxed.
One such inference optimization is flash-decode, the inference counterpart of flash-attention, from the same authors. However, that optimization only improves inference runtime, by reducing the number of memory accesses needed to compute self-attention. The total VRAM you need just to load the model stays the same, so although you might get a bit more out of the same HW, the baseline HW requirement does not change. Flash-decode has also had nowhere near the impact of flash-attention. The latter made training iterations much faster, while the former has had fairly limited impact, mostly because inference runs at a much smaller scale than training, so the improvements do not always translate into large gains.
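For the curious, here is a toy sketch of the split-KV idea behind flash-decode as I understand it: attend over chunks of the keys/values independently, then merge the partial results with a max/sum rescale. This is plain Python for a single query vector, with the usual 1/sqrt(d) scaling omitted, and all names are hypothetical; it is not the actual kernel. What it illustrates is that only the order of the attention computation changes. The model weights, and so the VRAM needed to load them, are untouched.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend_full(q, keys, values):
    """Reference: softmax over all scores, then a weighted sum of the values."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def attend_chunk(q, keys, values):
    """Partial result for one KV chunk: (local max, sum of exps, unnormalised output)."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return m, sum(weights), out


def attend_split(q, keys, values, chunk_size):
    """Process KV chunks independently, then merge by rescaling to a shared max."""
    parts = [
        attend_chunk(q, keys[i:i + chunk_size], values[i:i + chunk_size])
        for i in range(0, len(keys), chunk_size)
    ]
    m_all = max(m for m, _, _ in parts)
    dim = len(values[0])
    total = 0.0
    out = [0.0] * dim
    for m, s, o in parts:
        scale = math.exp(m - m_all)  # bring this chunk onto the shared max
        total += s * scale
        for i in range(dim):
            out[i] += o[i] * scale
    return [x / total for x in out]


random.seed(0)
d, n = 8, 64  # toy sizes, chosen arbitrarily
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

full = attend_full(q, keys, values)
split = attend_split(q, keys, values, chunk_size=16)
print("max abs difference:", max(abs(a - b) for a, b in zip(full, split)))
```

The two paths agree up to floating-point error, so the chunks can be processed in parallel without changing the output or the memory needed to hold the model.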
> Not to mention the cost/flop and cost/gb for GPUs has dropped.
For training. Not for inference. GPU prices have remained about the same, give or take.