Well, one other massive disclaimer is that the author is backed by OpenAI's Startup Fund, which they failed to disclose in the post.
So of course they would speculate that. This post is essentially a paid marketing piece for OpenAI, the lead investor in Anysphere (creators of Cursor).
Well there is your problem.
Llama 70B quantized to 4 bits fits in 40GB. And it gets similar throughput split between dual consumer GPUs, which likely means much better throughput on a single 40GB A100 (or a cheaper 48GB pro GPU).
https://github.com/turboderp/exllama#dual-gpu-results
And this is without any consideration of batching (which I am not familiar with TBH).
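For what it's worth, the 40GB figure roughly checks out on the back of an envelope. A quick sketch (the 2GB overhead for activations/KV cache is my guess, not a measured number):

```python
def quantized_model_gb(n_params_billion: float, bits_per_weight: float,
                       overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: weights stored at the quantized bit width,
    plus an assumed fixed overhead for activations / KV cache."""
    weight_gb = n_params_billion * bits_per_weight / 8  # billions of params * bytes per weight
    return weight_gb + overhead_gb

# Llama 70B at 4 bits: ~35 GB of weights, so it squeezes into a 40 GB card
print(quantized_model_gb(70, 4))  # 37.0
```

The same arithmetic shows why bf16 (2 bytes per weight, ~140GB) needs multiple 80GB cards.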
Also, I'm not sure which model was tested, but Llama 70B Chat should perform better than the base model if the prompting syntax is right; the chat format was only recently reverse engineered from Meta's demo implementation.
There are other "perks" to running Llama yourself too, like manually adjusting various generation parameters, constraining output with a custom grammar during generation, and extended context.
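To illustrate what "adjusting generation parameters" buys you: here is a toy, pure-Python sketch of two of the usual knobs, temperature and top-p (nucleus) sampling. The function name and structure are mine; local runners expose the same knobs natively:

```python
import math
import random

def sample_token(logits, temperature=0.8, top_p=0.95, rng=random):
    """Toy sampler: temperature rescales the logits, then top-p keeps only
    the smallest set of tokens whose cumulative probability reaches top_p."""
    scaled = [l / temperature for l in logits]
    # numerically stable softmax
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # sort token indices by probability and keep the nucleus
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # sample within the kept set, renormalized
    kept_mass = sum(probs[i] for i in kept)
    r = rng.random() * kept_mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Low temperature plus a small top-p collapses this to near-greedy decoding; cranking temperature up flattens the distribution. Hosted APIs expose some of this, but local inference lets you touch everything, including grammar-constrained decoding, which APIs at the time didn't offer.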
I believe GPTQ is not much faster than bf16, from skimming the AWQ paper - https://arxiv.org/pdf/2306.00978.pdf
It's 3x faster at a batch size of 1, but that's still over 10x more expensive than gpt-3.5.
For larger batch sizes, bf16 costs dip below those of 3-bit quantization.
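As a sanity check on these cost comparisons, here's the basic per-token arithmetic. The GPU price and throughput numbers below are placeholders I made up for illustration, not benchmarks:

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            tokens_per_second: float) -> float:
    """Self-hosting cost per 1M generated tokens, from an hourly GPU rate
    and a measured (or assumed) aggregate throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1e6

# Illustrative only: a ~$2/hr 40GB GPU at 30 tok/s (batch size 1)
solo = cost_per_million_tokens(2.0, 30)
# Batching multiplies aggregate throughput for roughly the same hourly cost,
# e.g. 8 concurrent streams at ~25 tok/s each = 200 tok/s aggregate
batched = cost_per_million_tokens(2.0, 200)
print(round(solo, 2), round(batched, 2))
```

This is why batch size dominates the comparison: the hourly GPU cost is fixed, so aggregate tokens/second is the whole game.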
And as said below, whatever throughput you lose is going to be massively offset by the ability to use smaller single GPUs.
And if you're running it on your own machine, then the cost of using Llama is just your electricity bill - you can theoretically run it on 2x 3090s, which are now quite cheap to buy, or on a CPU with enough RAM (but that will be very, very slow).
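The electricity bill is easy to estimate, too. A sketch with placeholder numbers (the 800W draw for a dual-3090 box under load and the $0.15/kWh rate are assumptions, not measurements):

```python
def electricity_cost_per_hour(watts: float, dollars_per_kwh: float = 0.15) -> float:
    """Running cost of a local rig: power draw converted to kWh times rate."""
    return watts / 1000 * dollars_per_kwh

# Hypothetical dual-3090 rig at ~800 W under sustained load
print(round(electricity_cost_per_hour(800), 2))  # 0.12
```

So roughly a dime or two per hour of sustained generation at typical residential rates - the hardware purchase, not the power, is the real cost.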