So yeah, I think models on local hardware will be quite common soon among the tech savvy (such as people creating software).
I do hope you're right that it will get cheaper over time (it should), but right now 32GB of VRAM is not affordable for a lot of people. You're talking ~$4500 just for the GPU, or $800-ish used if you can find one.
It's a tad less efficient and a bit more of a hassle, but still a good experience for only a fraction of the price.
I imagine having multiple providers competing will drive down the price of hosted versions of open-weight models drastically.
And we've barely started to scratch the surface on helping open-weight models "be the best they can be", with cloud-burst parallel sampling and prompt mutation. Looking for the best probabilistic results for a prompt, and looking for the best prompt variants for a task. Adaptively scaling compute at generation time, not just at training time.
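To make the parallel sampling and prompt mutation idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `fake_generate` and `fake_score` are hypothetical stand-ins for a sampled model call and a verifier, not a real API, and threads stand in for whatever burst capacity you would actually rent.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def fake_generate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one sampled model completion."""
    rng = random.Random(f"{prompt}|{seed}")
    words = ["ok", "good", "better", "best"]
    return " ".join(rng.choice(words) for _ in range(5))


def fake_score(completion: str) -> float:
    """Hypothetical verifier: counts how often 'best' appears."""
    return float(completion.split().count("best"))


def best_of_n(prompt, n, generate=fake_generate, score=fake_score):
    """Sample n completions in parallel and keep the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        completions = list(pool.map(lambda seed: generate(prompt, seed), range(n)))
    best = max(completions, key=score)
    return best, score(best)


def best_prompt(variants, n, **kwargs):
    """Run best_of_n on each prompt variant and return the winning variant."""
    results = {p: best_of_n(p, n, **kwargs) for p in variants}
    winner = max(results, key=lambda p: results[p][1])
    return winner, results[winner]


if __name__ == "__main__":
    variants = ["Fix the bug.", "Fix the bug. Think step by step."]
    print(best_prompt(variants, n=8))
```

Raising `n` is the "scale compute at generation time" knob. In practice the scorer (tests, a judge model, a type checker) is the hard part, and this sketch does not try to solve it.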
And speculatively, if agentic coding is naturally a multiplicity, what UX might enable human devs to dance with that quantum superposition? Rather than quickly collapsing to one monkey and its keyboard.
https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-...
Certainly the transistors/chip or transistors/$ or flops/$ have not been progressing at the same exponential rate as during 1970-2010. There is still progress, but it's rather slower.
As you point out it's really cost per transistor or cost per flop that we mostly care about. I'm finding it hard to find a succinct and clear plot, but I believe one is provided by Our World In Data on "GPU computational performance per dollar" [1] which, to my eyes, clearly shows exponential growth in computational power per dollar.
The picture for storage is a little more muddied but if you squint just right you might still be able to recover an uninterrupted exponential growth [2].
In my view, it's pretty clear that advances in AI have progressed so quickly because GPUs have been keeping up with the exponential growth of computational power (per unit cost).
Exponential growth in this area is usually characterized by "S-curves", where one technology gets saturated but the exponential increase in power or decrease in cost is picked up by another, adjacent technology that allows the growth to continue. For compute it's CPUs to GPUs. For storage it's platter drives that are now being overtaken by SSDs.
The more general phenomenon is called Wright's law, or experience curve effects [3].
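For anyone who wants the shape of Wright's law in code, here is a rough sketch of the usual formulation: unit cost falls by a fixed fraction every time cumulative production doubles. The 20% learning rate below is purely an illustrative assumption, not a measured figure for GPUs or storage.

```python
import math


def wright_unit_cost(first_unit_cost, cumulative_units, learning_rate):
    """Cost of the n-th unit if cost drops by `learning_rate` per doubling.

    learning_rate=0.2 means each doubling of cumulative output
    cuts unit cost by 20% (an assumed, illustrative value).
    """
    exponent = math.log2(1.0 - learning_rate)  # negative for 0 < rate < 1
    return first_unit_cost * cumulative_units ** exponent


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 1024):
        print(n, round(wright_unit_cost(100.0, n, 0.2), 2))
```

Note that the independent variable is cumulative units produced, not calendar time. You only get a Moore-style exponential in time if production itself keeps growing exponentially.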
[0] https://en.wikipedia.org/wiki/Moore%27s_law
[1] https://ourworldindata.org/grapher/gpu-price-performance?ySc...
[2] https://ourworldindata.org/grapher/historical-cost-of-comput...
Gotta remember inflation here.
$1K in 1995 was roughly equivalent to $2K now and wouldn't have been a particularly "good" machine then.
In 1982 the Commodore 64 started at about $600, also roughly $2K today.
If you outgrew that, beefier machines back then were A LOT. It was easy to find $2k+ towers and (especially) laptops even into the 2000s, and a lot of those would be $5K+ equivalent today.
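The arithmetic behind those comparisons, as a tiny sketch. The multipliers here are just the ones implied by the round numbers in this thread (not official CPI data), and the function names are made up for illustration.

```python
def implied_multiplier(then_price, now_price):
    """Inflation multiplier implied by a then/now price pair."""
    return now_price / then_price


def to_todays_dollars(then_price, multiplier):
    """Scale a historical price by an assumed inflation multiplier."""
    return then_price * multiplier


if __name__ == "__main__":
    # Round figures quoted in the thread, not official statistics.
    print(implied_multiplier(1000, 2000))  # $1K in 1995 vs $2K now
    print(implied_multiplier(600, 2000))   # C64 in 1982 vs $2K now
    print(to_todays_dollars(2000, 2.5))    # a $2K tower at an assumed 2.5x
```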
Especially because the world is likely to persist, at least for a while, in a state where demand for computing hardware drastically exceeds supply, resulting in high hardware prices. So why wouldn't you want to max out utilisation and amortize costs, at least for typical (non-sensitive) use cases?
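A back-of-the-envelope sketch of the utilisation argument: the same card costs far more per hour of actual use when it sits idle most of the day. The 3-year lifetime and the utilisation levels are assumptions, the $4500 is the GPU price quoted upthread, and power is ignored unless you pass it in.

```python
def cost_per_used_hour(hardware_price, lifetime_years, utilization,
                       power_cost_per_hour=0.0):
    """Amortized hardware cost per hour the device is actually busy."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    hours_used = lifetime_years * 365 * 24 * utilization
    return hardware_price / hours_used + power_cost_per_hour


if __name__ == "__main__":
    # Assumed: $4500 GPU written off over 3 years.
    for u in (0.05, 0.25, 0.90):
        print(f"{u:.0%} utilisation: ${cost_per_used_hour(4500, 3, u):.2f}/hour")
```

A hobbyist at a few percent utilisation pays many times more per busy hour than a provider keeping the same card loaded, which is the amortization point above.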
I started with computers around 2009 and later bought an oldish computer (a Pentium 4 PC) for the equivalent of 50 USD. Code::Blocks and Python's IDLE were free at the time (C and Python were the first languages I learned). The barrier to programming has always been low, as the only things you needed were books (the internet made things easier) and access to a PC (I had friends with laptops and my school lab).
It's definitely worth investing in self-hosting the agent infrastructure around the model, though: all the documents, the knowledge base, all the connectors, and the agent itself can run on your own hardware.
Possibly it's the same price range, allowing for inflation.